Assessing / evaluating an Item
Once an item has been developed and deployed in tests, one needs to consider
whether the item and the test continue to serve their purpose.
Assessment and review of items and tests should be a continual process.
These examinations should occur after each use of an instrument.
Each examination should consider the following criteria.
 |
Questions of validity and reliability
 |
Has the skill/knowledge base changed? Experts or others knowledgeable in
the field should review items for currency and accuracy with respect to
the subject being examined. |
 |
Are the targeted skills/knowledge being adequately tested?
 |
Are there voids or over-examined areas? |
 |
Is each area reliably examined? This is a balancing issue.
If one uses only one or a few items per area, then the evaluation is
subject to a high level of measurement error. However, using many items,
while more reliable, may make a test unmanageable. |
|
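The trade-off between the number of items and reliability can be illustrated with the Spearman-Brown formula. The source does not name this formula; it is used here as one common way to estimate the effect of test length, and it assumes that the added items are comparable to the existing ones. The starting reliability of 0.30 is an illustrative assumption, not a value from the text.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by length_factor.

    Assumes the added items are parallel (comparable) to the existing ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical reliability of a single item measuring one area.
single_item_reliability = 0.30

for n_items in (1, 2, 5, 10, 20):
    predicted = spearman_brown(single_item_reliability, n_items)
    print(f"{n_items:2d} items -> predicted reliability {predicted:.2f}")
```

Under these assumptions the gain from each added item shrinks as the test grows, which is why adding items eventually costs more in manageability than it returns in reliability.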
 |
Are the item's difficulty and discrimination acceptable? Changes in these
measures may indicate either problems with an item or compromises in the
security of a test. After some time, items may become widely known and thus
lose their value; such exposure should show up as lower difficulty (the item
becomes easier) and lower discrimination. |
 |
Are test results predictive of other independent measures of skill and
knowledge? In other words, are the test and its items satisfying their design? |
|
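The difficulty and discrimination checks described above can be sketched as follows. This is a minimal illustration rather than a prescribed method: it assumes items are scored 1 (correct) or 0 (incorrect), takes difficulty as the proportion of test takers answering correctly, and takes discrimination as the correlation between the item score and the total score on the remaining items. The function names and the sample responses are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns NaN when either variable has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def item_statistics(responses):
    """Return a (difficulty, discrimination) pair for each item.

    responses: one row per test taker, one 0/1 score per item.
    difficulty: proportion answering correctly (higher means easier).
    discrimination: correlation of the item with the score on all other items.
    """
    n_items = len(responses[0])
    stats = []
    for j in range(n_items):
        item_scores = [row[j] for row in responses]
        rest_scores = [sum(row) - row[j] for row in responses]
        difficulty = sum(item_scores) / len(item_scores)
        discrimination = pearson(item_scores, rest_scores)
        stats.append((difficulty, discrimination))
    return stats


# Hypothetical responses: six test takers, three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

for number, (p, d) in enumerate(item_statistics(responses), start=1):
    print(f"item {number}: difficulty={p:.2f}, discrimination={d:.2f}")
```

Running the same calculation after each administration and comparing the values over time is one way to spot an item that has become easier while discriminating less, the pattern the text associates with an item becoming widely known.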
 |
Questions of structure and purpose
 |
Is there a need to adjust the overall test's structural results to
standardize it, or to make it more or less difficult? If so, items
may be changed to accomplish these goals. |
 |
Is there a need to adjust the overall test for political or social
reasons? A test that validly and reliably tests subjects and produces
an average of 50% does not tell us more than a test that results in an
average of 80%. However, in some settings it may be appropriate
to adjust a test to help with stakeholder (public) or student perceptions
and attributions. These changes do not mean that one is "dumbing down"
the test, provided one maintains the discrimination value of items. However,
when the range of subjects' knowledge is broad, tests may have lower averages
in order to evaluate that wide range of knowledge and not "top out." |
 |
Is the purpose of the test to evaluate relative strength among test
takers, or is it to judge a pass/fail level of knowledge? Item choice
may be guided by a design that targets a normal distribution, a uniform
distribution, or a bi-modal distribution of scores. |
|