Differential Item Functioning (DIF) analysis is the main method for detecting bias in tests. The basic approach is as follows:
For every score level on a test (e.g., everyone scoring 68 out of 100), calculate the odds that the reference group gets the item correct and the odds that the focal group gets it correct (say the reference group is women, the focal group men). Divide the reference group's odds of success on a problem by the focal group's. If that ratio exceeds some critical value determined by a statistical test, the item is considered biased. You can see that if you tried to equalize test scores, say by adding problems in Spanish to help Mexican immigrants, it would show up in a DIF test, because among those with otherwise similar total scores, the Mexicans would do relatively better on those items than those who don't speak Spanish. This is why culture-specific argot of all kinds is excluded, including 'classist' words like penthouse and polo (thus avoiding the famous 'oarsman-regatta' analogy question that was eliminated 40 years ago).
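The per-score-level comparison described above is essentially the Mantel-Haenszel procedure. Here is a minimal sketch in Python; the function name and the toy counts are mine, invented purely for illustration.

```python
# Sketch of a Mantel-Haenszel DIF check for a single test item.
# Examinees are stratified by total score; each stratum is a 2x2 table:
#   (reference correct, reference wrong, focal correct, focal wrong)

def mantel_haenszel_odds_ratio(strata):
    """Pooled odds ratio across score strata.

    A value near 1.0 means no DIF: at any given ability level, both
    groups have similar odds of answering the item correctly. Values
    far from 1.0 (after a significance test) flag the item for review.
    """
    num = 0.0  # sum over strata of a_k * d_k / n_k
    den = 0.0  # sum over strata of b_k * c_k / n_k
    for ref_right, ref_wrong, foc_right, foc_wrong in strata:
        n = ref_right + ref_wrong + foc_right + foc_wrong
        if n == 0:
            continue
        num += ref_right * foc_wrong / n
        den += ref_wrong * foc_right / n
    return num / den

# Hypothetical item: at each score level the two groups succeed at
# nearly the same rate, so the pooled odds ratio should be near 1.
strata = [
    (30, 20, 28, 22),  # low-scoring stratum
    (45, 15, 44, 16),  # middle stratum
    (55, 5, 54, 6),    # high-scoring stratum
]
print(round(mantel_haenszel_odds_ratio(strata), 2))  # → 1.15
```

In practice a chi-squared statistic is computed alongside the pooled odds ratio to decide whether its deviation from 1.0 is significant, but the core idea is just this stratified comparison.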
Fundamentally, any good test question's accuracy should be positively correlated with the other questions' for any individual person. That is, getting question #4 wrong should statistically imply a higher chance of getting any other question wrong, and vice versa. The magnitude of this positive correlation will vary from question to question because tests include questions of varying difficulty, so that one can discriminate among both the really good and the really bad test takers. That is, there will be some 'gimmes' that most people find really easy but that allow you to distinguish between the 1st and 10th percentiles, and there will be very difficult questions that allow one to distinguish between those in the 90th and 99th percentiles. Nonetheless, getting any one question wrong should be positively correlated with how often one gets the other questions wrong.
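This property can be checked with a corrected item-total correlation: each item's score against the total of the remaining items. A rough sketch, using a tiny response matrix I made up for illustration (rows are examinees, columns are items, 1 = correct):

```python
# Corrected item-total correlations on a toy response matrix.
# On a coherent test, every item should correlate positively with
# the total of the other items.

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

responses = [
    [1, 1, 1, 1],  # strong examinee
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],  # weak examinee
]

for item in range(4):
    scores = [row[item] for row in responses]
    rest = [sum(row) - row[item] for row in responses]  # exclude the item itself
    print(f"item {item + 1}: r = {pearson(scores, rest):.2f}")
```

Items 1 and 4 here act as the 'gimme' and the hard question, respectively: both still correlate positively with the rest of the test, just with different difficulty levels.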
DIF analysis is a seemingly straightforward way to address bias, yet in legal cases it is an insufficient argument to defend a test's unbiasedness. One reason given is that the whole test could be biased in some convoluted way. Although the black and white average scores may be 30 and 40, respectively, each group has a distribution of scores around those means, and for individuals scoring 45 or 20, you can't tell whether they are black or white if the items passed a DIF test. The 'whole test is biased' allegation seems reasonable at 30,000 feet, but given that DIF tests are applied to individual questions within score groups, this is pure paranoia and opportunism. I can't imagine what the nature of the bias would be if it appeared in that form, conscious or unconscious.
Another reason tests are dismissed is that they are not shown to be sufficiently job related, and indeed, for most positions no one has data sufficient to prove they are. Yet aptitude is very useful for solving unforeseen contingencies, and a recent paper showed that cognitive tests were useful in predicting successful outcomes even for truck drivers. There are very few jobs that do not benefit from higher cognitive skills.
In 1971, Griggs v. Duke Power found that employers bear the burden of producing and proving the business necessity of a test. Title VII of the Civil Rights Act prohibits employment tests that are not a "reasonable measure of job performance," regardless of the absence of actual intent to discriminate, so you can't use an IQ test unless you have done an empirical study showing IQ is related to actual job performance for the specific position the test is used for.
Most large organizations subsequently got rid of g-loaded tests, which tended to have, and have to this day, disparate impact between black and white groupings. The Carter Administration abolished standardized tests for government workers in 1981. Thus it is common to administer tests without a paper trail, so that at Google or Microsoft you get a constantly changing verbal test called an 'interview', which does substantively the same thing as giving the applicant the analytical and mathematical sections of the GRE with some programming questions thrown in.
From the Ricci case hearing in front of Sotomayor:
JUDGE SOTOMAYOR: Counsel ... we're not suggesting that unqualified people be hired. The city's not suggesting that. All right? But there is a difference between where you score on the test and how many openings you have. And to the extent that there's an adverse impact on one group over the other, so that the first seven who are going to be hired only because of the vagrancies [sic] of the vacancies at that moment, not because you're unqualified--the pass rate is the pass rate--all right? But if your test is always going to put a certain group at the bottom of the pass rate so they're never ever going to be promoted, and there is a fair test that could be devised that measures knowledge in a more substantive way, then why shouldn't the city have an opportunity to try and look and see if it can develop that?
KAREN LEE TORRE: Because they already developed it, your honor.
JUDGE SOTOMAYOR: It assumes the answer. It assumes the answer which is that, um, the test is valid because we say it's valid.
Sotomayor appears to be making two assumptions. First, she thinks a meaningful test (i.e., one predictive of some future measure of job performance) without disparate impact exists or can be developed. Second, she seems to think the job in question requires only a low level of the skills that correlate with test taking, so that anything above a much lower pass criterion is irrelevant to job performance.
I disagree, but Sotomayor's view is the dominant view among educated people of all groups, so I hope there's a good airing of the arguments on both sides.