Reliability
Reliability is a measure of consistency. Assessment of reliability focuses on both the characteristics of specific items and the characteristics of an instrument (a scale or other construct).
Measures based on Raters and Evaluators
| Inter-rater reliability describes how consistently different raters score the same responses. It can be estimated with split-half or other non-parametric procedures. | |
| Rating rubrics provide reference criteria for systematic comparison of open-ended responses (essays, written responses, visual or auditory performances, etc.). Rubrics can increase reliability in the evaluation of performance. That reliability can be evaluated and compared among teams, or with reference to specific items or evaluators. |
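One common way to quantify agreement between two raters applying the same rubric is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The text above does not name a specific statistic, so kappa is used here only as an illustration; the function name `cohen_kappa` and the rubric scores are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items with categorical scores."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal score frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1.0 - expected)


# Hypothetical rubric scores (1-4) given by two raters to the same eight essays.
rater_a = [3, 4, 2, 3, 1, 4, 2, 3]
rater_b = [3, 4, 2, 2, 1, 4, 3, 3]
kappa = cohen_kappa(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

Values near 1 suggest strong agreement beyond chance, while values near 0 suggest agreement no better than chance. For ordinal rubrics or more than two raters, other statistics (weighted kappa, intraclass correlation) may be more appropriate.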
At the item level using Item Response Theory (IRT): IRT is a set of statistical procedures that produce both measurements of individuals' performance and evaluations of item characteristics. These procedures are primarily used in standardized testing of knowledge and concepts (and are not very helpful for personality and behavior instruments). IRT assumes that the set of items being examined is coherent, as evaluated in CTT (see below).
| Item discrimination provides an evaluation of how effectively an item separates those who meet a criterion from those who do not. | |
| Item difficulty provides an evaluation of how hard an item is. | |
| The pseudo-guessing estimate provides an evaluation of the probability that a respondent can guess a correct answer without knowing the intended concept. |
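The three item characteristics listed above correspond to the parameters of the three-parameter logistic (3PL) model, a widely used IRT model. The text does not name a specific model, so the sketch below is an illustration under that assumption; the function name `three_pl` and the parameter values are hypothetical.

```python
import math


def three_pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    theta: respondent ability
    a: item discrimination
    b: item difficulty
    c: pseudo-guessing parameter (lower asymptote)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item: moderate discrimination, average difficulty, 20% guessing floor.
a, b, c = 1.2, 0.0, 0.2
for theta in (-2.0, 0.0, 2.0):
    print(f"ability {theta:+.1f}: P(correct) = {three_pl(theta, a, b, c):.3f}")
```

In this model, a respondent whose ability equals the item difficulty has a probability of success halfway between the guessing floor `c` and 1. Larger values of `a` make the curve steeper around `b`, which is what it means for an item to discriminate well.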
At the scale or construct level using classical test theory (CTT) item analysis:
| Factor analysis using principal component extraction from a pool of items - used to identify items that may have scale coherence | |
| Correlation of an item to a scale or other construct. | |
| Cronbach's alpha (α) evaluates scale consistency. Conceptually, this statistic is the mean of all possible split-half reliability coefficients. | |
| Split-half reliability is a process to evaluate item consistency. (Split-half reliability could be calculated by hand and was frequently used before computer software made more powerful alternatives available.) |
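The two consistency statistics above can be sketched in a few lines. This is a minimal illustration, assuming complete data with one row per respondent and one column per item, an odd/even item split, and the Spearman-Brown correction for the split-half estimate (the text does not specify a split or a correction). The function names and the score matrix are hypothetical.

```python
from statistics import mean, variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix with one row per respondent, one column per item."""
    k = len(scores[0])
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1.0 - sum(item_variances) / total_variance)


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def split_half(scores):
    """Odd/even split-half reliability with the Spearman-Brown correction."""
    half_1 = [sum(row[0::2]) for row in scores]  # items 1, 3, 5, ...
    half_2 = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, ...
    r = pearson(half_1, half_2)
    return 2.0 * r / (1.0 + r)


# Hypothetical responses: six respondents, four items on a 1-5 scale.
scores = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 4, 3, 4],
]
alpha = cronbach_alpha(scores)
sb = split_half(scores)
print(f"Cronbach's alpha: {alpha:.3f}")
print(f"Split-half (Spearman-Brown): {sb:.3f}")
```

A single split-half estimate depends on which items land in each half, which is one reason alpha is usually preferred once software is available.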
At the instrument level:
| Issues of consistency in test-retest reliability focus on tools producing similar scores when administered at different times. | |
| Issues of agreement among alternative forms, or parallel-forms reliability. Many of the issues arising in constructing multiple forms can be addressed by applying IRT procedures. | |
| Issues of agreement among human evaluators/raters or rubrics -- inter-rater or inter-observer reliability. |
When considering reliability, some statistics may help users and developers assess a tool's effectiveness. A developer may want to share these statistics with users. Other measures of specific item and tool reliability may be of interest primarily to developers, and therefore not shared with users. Low reliability on a scale may indicate problems with construct validity.
The following pages have more information:
| Statistical processes for tool assessment by a developer | |
| Statistical summary of a tool assessment for a user |

