Tuesday, June 02, 2009

Sotomayor, Ricci, and Tests


Differential Item Functioning (DIF) is the main method for getting bias out of tests. The basic approach is as follows:

For every score level on a test (such as those scoring 68 out of 100), calculate the odds that a reference group will get a given item correct, and the odds that the focal group will get it correct (say the reference group is women and the focal group men). Divide the reference group's odds of success on the item by the focal group's. If that ratio differs from 1 by more than some critical value determined by a statistical test, the item is considered biased. You can see that if you try to equalize test scores, say by adding problems in Spanish to help Mexican immigrants, that would show up in a DIF test, because among those with otherwise similar scores, the Mexicans would do relatively better than those who don't speak Spanish. This is why culture-specific argot of all kinds is kept out, including 'classist' words like penthouse and polo (thus avoiding the famous 'oarsman-regatta' analogy question that was eliminated 40 years ago).
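To make the arithmetic concrete, here is a minimal Python sketch of that calculation for a single item. It pools the per-score-level odds ratios with the Mantel-Haenszel estimator, which is one common way of doing this and an assumption on my part rather than the only method. The function names, the "ref"/"focal" labels, and all the counts are made up for illustration, and the significance test that sets the critical value is left out.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Pooled odds ratio for one item, stratified by total test score.

    records: iterable of (total_score, group, correct), where group is
    "ref" or "focal" (hypothetical labels) and correct is a bool.
    """
    # counts[score] = [ref_right, ref_wrong, focal_right, focal_wrong]
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for score, group, correct in records:
        idx = (0 if group == "ref" else 2) + (0 if correct else 1)
        counts[score][idx] += 1

    num = den = 0.0
    for a, b, c, d in counts.values():
        n = a + b + c + d
        num += a * d / n  # reference right, focal wrong
        den += b * c / n  # reference wrong, focal right
    return num / den  # raises ZeroDivisionError if den is zero


def make_records(score, ref_right, ref_wrong, focal_right, focal_wrong):
    """Expand made-up counts for one score level into individual records."""
    return (
        [(score, "ref", True)] * ref_right
        + [(score, "ref", False)] * ref_wrong
        + [(score, "focal", True)] * focal_right
        + [(score, "focal", False)] * focal_wrong
    )


# Illustrative data only: two score levels where the reference group's
# odds on this item are three times the focal group's.
biased_item = make_records(68, 30, 10, 20, 20) + make_records(50, 10, 10, 5, 15)
# Same success odds in both groups at the one score level shown.
fair_item = make_records(68, 30, 10, 15, 5)

print(mantel_haenszel_odds_ratio(biased_item))
print(mantel_haenszel_odds_ratio(fair_item))
```

A ratio near 1 means that, among people with the same total score, the two groups do about equally well on the item. The first made-up item is built so that the reference group's odds are three times the focal group's at both score levels, which is the kind of result that would get an item flagged.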

Fundamentally, performance on any good test question should be positively correlated with performance on the other questions for any individual person. That is, getting question #4 wrong should statistically imply a higher chance of getting any other question wrong, and vice versa. The magnitude of this positive correlation will vary from question to question, because tests have questions of varying difficulty so that one can discriminate among both the really good and the really bad test takers. That is, there will be some 'gimmes' that most people find really easy but that allow you to distinguish between the first and tenth percentile, and there will be very difficult questions that allow one to distinguish between those in the 99th and 90th percentile. Nonetheless, getting any question wrong should be positively correlated with how often one gets other questions wrong.
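One way to check this property is to correlate each question with the score on the rest of the test. The sketch below is only an illustration: the 0/1 response matrices are invented, the function names are mine, and plain Pearson correlation on right/wrong answers is one reasonable choice among several.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def item_rest_correlations(responses):
    """Correlate each question with the total on all the other questions.

    responses: one row per test taker, one 0/1 entry per question.
    """
    n_items = len(responses[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]  # exclude item j
        result.append(pearson(item, rest))
    return result


# Invented data: five test takers, questions ordered from easy to hard.
well_behaved = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# Same people, plus a fifth question that only the weakest two get right.
with_bad_item = [row + [bad] for row, bad in zip(well_behaved, [0, 0, 0, 1, 1])]

print(item_rest_correlations(well_behaved))
print(item_rest_correlations(with_bad_item))
```

In the first table every question correlates positively with the rest of the test, easy and hard questions alike. In the second, the added question comes out negatively correlated with the rest, which is the signature of a question that is measuring something other than what the test as a whole measures.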

DIF is a seemingly straightforward way to address bias, yet in legal cases DIF analysis is an insufficient argument to defend a test's unbiasedness. One reason given is that the whole test could be biased in a very convoluted way. Although the black and white average scores may be 30 and 40, respectively, each group has a distribution of scores around those means, and for individuals with scores of 45 or 20, you can't tell whether they are black or white from their answers if the questions passed a DIF test. The 'whole test is biased' allegation seems reasonable at 30,000 feet, but given the DIF tests applied to individual questions within groups, this is pure paranoia and opportunism. I can't imagine what the nature of the bias would be if it appeared in that kind of form, conscious or unconscious.

Another reason tests are dismissed is that they are not shown to be sufficiently job related, and indeed, for most positions no one has data sufficient to prove they are. Yet aptitude is very useful for handling unforeseen contingencies, and a recent paper showed that cognitive tests were useful in predicting successful outcomes for truck drivers. There are very few jobs that do not benefit from higher cognitive skills.

In 1971, Griggs v. Duke Power found that employers have the burden of producing and proving the business necessity of a test. Title VII of the Civil Rights Act prohibits employment tests that are not a "reasonable measure of job performance," regardless of the absence of actual intent to discriminate, so you can't use an IQ test unless you have done an empirical study showing IQ is related to actual job performance for the specific position the test is used for.

Most large organizations subsequently got rid of g-loaded tests that tended to have, and have to this day, disparate impact in black vs. white groupings. The Carter Administration abolished standardized tests for government workers in 1981. Thus it is common to administer tests without a paper trail, so that at Google or Microsoft you get a constantly changing verbal test called an 'interview', which does substantively the same thing as giving the applicant the analytical and mathematical GRE test with some programming questions thrown in.

From the Ricci case hearing in front of Sotomayor:

JUDGE SOTOMAYOR: Counsel ... we're not suggesting that unqualified people be hired. The city's not suggesting that. All right? But there is a difference between where you score on the test and how many openings you have. And to the extent that there's an adverse impact on one group over the other, so that the first seven who are going to be hired only because of the vagrancies [sic] of the vacancies at that moment, not because you're unqualified--the pass rate is the pass rate--all right? But if your test is always going to put a certain group at the bottom of the pass rate so they're never ever going to be promoted, and there is a fair test that could be devised that measures knowledge in a more substantive way, then why shouldn't the city have an opportunity to try and look and see if it can develop that?

KAREN LEE TORRE: Because they already developed it, your honor.

JUDGE SOTOMAYOR: It assumes the answer. It assumes the answer which is that, um, the test is valid because we say it's valid.

Sotomayor appears to be making two assumptions. First, she thinks a meaningful test (i.e., predictive of some future measure of performance) without disparate impact exists or can be developed. Second, she seems to think the job in question needs only a low level of the skills that correlate with test taking, so that anything above a much lower pass criterion is irrelevant to job performance.

I disagree, but Sotomayor's view is the dominant view among educated people of all groups, so I hope there's a good airing of the arguments on both sides.

3 comments:

J said...

..It assumes the answer...

That is the problem. If you apply a testing process where general competence is a factor to a heterogeneous set of individuals, the test will classify them according to competence. If that is unacceptable, the competence factor should be taken out of the testing process and replaced by another factor, some relevant but "neutral" parameter. In the case of fire fighters, some parameters could be measured: allergic reactions to selected hazardous materials, color vision, tone hearing, underlying disposition to risk life to help colleagues in situations of danger, instinctive reaction to fire, heat resistance, and so on.

J said...

Also the Army's approach could be implemented: the job and the tools must be re-designed so that competence is less vital. In the case of the Army, weapons were made idiot-friendly, written instruction totally eliminated, tasks to perform broken down to steps so simple that morons could do them and even excel at them. The fire fighter's job historically developed to fit a European population. An effort could be made to redesign it for other populations. One of the questions in the firefighter test required optimizing combinations of fixed-length fire hoses. The whole system of men entering a burning house with water hoses is obsolete and therefore the question involving water hoses was irrelevant and tendentious. The Army was forced to simplify and redesign the soldier function because it was not recruiting enough competent candidates. In general, the competence level of American society is decreasing, competent candidates are becoming scarcer, and tasks like fire fighting should be redesigned to make them doable by the population available. I know that universities are redesigning their courses to allow the available population (which is noticeably less competent than in my times) to perform as students and lecturers in the academy.

Anonymous said...

we need a test for laziness and ambition, not cognitive ability.

this would weed out most of the largess in government bureaucracy, not to mention probably every industry in existence.

the guy who allegedly has the highest IQ in the usa worked as a bouncer for 20+ years on $6K/yr.
no advanced degrees, because he was "smarter than his instructors" (who cares?! pay the fees, get the creds and get a job!), instead he has produced a "theory of reality".

could the IQ test predict this output?

could his abusive childhood and impoverished homelife have predicted taking a job as a bouncer, and living below the poverty line?

who knows?

cognitive abilities and psychological trauma are easy to assess in standardised fashion, whether or not the assessments are even accurate. but potential productiveness? is it not a judgment call? show me a standardised test for productivity.

perhaps a probationary period is a form of test. either produce quota or be terminated.