MORE ON THE DISTRICT COURT RULING IN BAZILE V. CITY OF HOUSTON (2012)
by Art Gutman Ph.D., Professor, Florida Institute of Technology
In an alert posted on 2/22/12, I summarized the major rulings by Judge Lee H. Rosenthal in Bazile v. City of Houston, decided on 2/6/12 by the District Court for the Southern District of Texas (see 2012 U.S. Dist. LEXIS 14712). In that alert (Part 1), I noted that this was a “battle of experts” and promised that in this alert (Part 2), I would summarize the major points made by the experts. In order of introduction by Judge Rosenthal, those experts are Drs. Jim Sharf, Mort McPhail, Kyle Brink, Kathleen Lundquist, Winfred Arthur, and David Morris. Testimony was also provided by William Barry, an Assistant Chief in the Houston Fire Department (HFD) affiliated with the human resources department. Please note that because of space considerations, the following summarizes major points and, with the exception of Mr. Barry, does not exhaust the contributions of these experts.
Dr. Sharf represented the Houston Professional Fire Fighters Association (HPFFA). He offered that the Uniform Guidelines on Employee Selection Procedures (UGESP) endorse methods of establishing validity (content, criterion, and construct-related validation) that are outdated and, at best, are mere starting points. He noted that the UGESP do not preclude “other professionally acceptable techniques with respect to validation of selection procedures,” and that newer methods of establishing validity are endorsed by the APA Standards and the SIOP Principles. His main conclusion was that “validity generalization shows that the captain and senior-captain exams are valid, and that validity generalization better analyzes the validity of a test than the methods identified in the Guidelines.” He based his opinion on scholarly publications, most notably a 1981 article by Schmidt, Hunter, and Pearlman in the Journal of Applied Psychology, which favors “unobservable cognitive skills” over the “observable behaviors” identified in traditional job analysis methods. He argued further that cognitive skills “better predict performance after promotion than observable behaviors,” and that the HFD conducted a valid job analysis for development of an “objective, job-related Captain’s and Senior-Captain’s exam.”
Dr. McPhail conducted a criterion-related validation study of the 2006 captain exam. He compared candidates promoted based on the exam to others (engineer/operators) who “rode up” to captain after the exam. For the criteria, McPhail developed a Performance Dimension Rating Form (PDRF) based on performance dimensions established by subject matter experts (SMEs). The PDRF contained five rating categories on 12-point scales with behavioral anchors. The raw data showed means (standard deviations) of 79.35 (10.13) for 438 firefighters who took the exam, compared to 90.10 (3.83) for those actually promoted and 73.35 (3.83) for those who rode up. Correlation coefficients ranged from .37 to .51. However, McPhail found barbell distributions in which most performance ratings for captains were at the upper end of the distribution and most performance ratings for engineer/operators were at the lower end of the distribution. He concluded that “among a much less restricted sample, test scores provided incremental prediction of performance . . . even after accounting for the relationship of promotion status with the performance ratings.” He described these results as “equivocal.”
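For readers less familiar with the range-restriction issue underlying McPhail’s analysis, a small simulation can illustrate it. The numbers below are hypothetical (not the Bazile data): when on-the-job performance is only observed for a top-scoring, range-restricted group (e.g., those promoted on the exam), the observed test–performance correlation is attenuated relative to what it would be in the full candidate pool.

```python
# Hypothetical illustration of range restriction (simulated data, not the
# actual Bazile/HFD figures).
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed directly from definitions."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)

# Simulate 438 candidates with a true exam-performance validity of about .50.
scores = [random.gauss(80, 10) for _ in range(438)]
perf = [0.5 * (s - 80) / 10 + random.gauss(0, (1 - 0.5 ** 2) ** 0.5)
        for s in scores]

# Validity coefficient in the full (unrestricted) candidate pool.
r_full = pearson_r(scores, perf)

# Restrict the sample to the top ~20% of scorers, as happens when only
# promotees are later rated on the job.
cutoff = sorted(scores, reverse=True)[int(0.2 * len(scores))]
pairs = [(s, p) for s, p in zip(scores, perf) if s >= cutoff]
r_restricted = pearson_r([s for s, _ in pairs], [p for _, p in pairs])

print(round(r_full, 2), round(r_restricted, 2))
```

In runs like this, the correlation in the restricted group comes out well below the full-sample value even though the test’s true validity is unchanged, which is why a “much less restricted sample” (McPhail’s comparison group of ride-up engineer/operators) can reveal predictive value that the promotee-only data obscure.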
Dr. Brink, who represented the defendants, was critical of the HFD job analysis and the reliability of the captain and senior-captain exams. His main points were that the exams lacked content validity and that there were better alternative tests. He reported that the HFD failed to retain critical information for documenting validity, calling it “8,000 mishmash pages.” He opined that 63% of the captain exam and 86% of the senior-captain exam did not “reflect knowledge or skills necessary for the first day of work,” calling this a violation of the SIOP Principles and noting that these Principles (and the UGESP) require that a “selection procedure should be based on an analysis of work that defines the balance between the work behaviors, activities, and/or [KSAOs] the applicant is expected to have before placement on the job.” He concluded that the job analysis was, therefore, “completely irrelevant.” Dr. Brink discussed other issues as well, including time limits for testing, arbitrary cutoff scores, questionable source materials for studying for the exam, the fact that the job analyses were conducted after the source materials were distributed, and shortcomings in the item analysis.
Dr. Lundquist, who also represented the defendants, provided both an affidavit and testimony related to the validity (or lack thereof) of multiple choice exams, and the superiority of assessment center methodology and situational judgment tests. Lundquist stated that “emphasis for any promotional process should be on assessing the critical knowledge, skills, abilities, and other personal characteristics (KSAOs) identified through a job analysis as being required to perform the essential duties of the job.” She noted that multiple choice tests could validly assess “the technical knowledge” for captain and senior-captain positions, but that they “inadequately capture the range of KSAOs required for successful performance in a position such as Senior-captain.” Her strongest criticism was that multiple choice tests fail to test “supervisory and leadership skills and abilities.” She did not advocate replacing the multiple choice tests, but rather supplementing them with assessment centers and situational judgment tests which, in her opinion, do capture leadership and supervisory skills, and which minimize adverse impact. Interestingly, she cited an article by Dr. Arthur reporting that assessment centers can validly measure “organization and planning and problem solving, . . . [and] influencing others,” and that his study found “true validities” for problem-solving, influencing others, and organizing and planning.
Dr. Morris based his testimony on his own work conducting job analyses, and opined that the captain and senior-captain exams were not valid and, like Lundquist, offered alternatives to multiple choice tests. Based on his work with SMEs and source material, he provided a detailed listing of the KSAOs, including (1) knowledge about equipment, structures, fires and firefighting and rescue tactics; (2) HFD standard operating procedures (“SOPs”) and administrative processes; and (3) supervising. He opined that the important abilities include “(1) leadership abilities, including directing subordinates, resolving conflict, and motivating subordinates; (2) decision-making and strategic abilities such as prioritizing and developing contingency plans; (3) communication abilities, ranging from communicating with superiors to recognizing grammatical errors; (4) critical-thinking abilities, such as comprehending ‘complex rules, regulations, and procedures’ and recognizing ‘critical aspects’ of a problem; (5) administrative abilities, such as recording and documenting information; and (6) tactical abilities related to firefighting.” Morris analyzed only the senior-captain position and testified that the senior-captain position has a stronger “supervisory element” and a greater “span of control” than the captain position. Like Lundquist, Morris opined that the assessment center better captures leadership and supervisory skills, and that it should supplement the job knowledge tests. Unlike Lundquist, Morris did not endorse situational judgment tests.
Dr. Arthur testified on behalf of the HPFFA about the validity of many of the proposed changes to the captain and senior-captain exams. He agreed that job analysis should serve as the basis for test development, and opined that the validity of the captain exam was limited by the fact that the job analysis for this position was incomplete. He opined that cognitively loaded exams lead to subgroup differences, but that these differences are not uniquely tied to the use of multiple choice exams. He opined further that assessment centers reduce adverse impact “because of the things that assessment centers measure.” He criticized the use of banding methods, the use of knowledge tests on a pass/fail basis (which the City wanted to do), and the use of methods that deemphasize cognitive loading. He also opined that “the way in which a question is presented and the type of response the question demands can affect whether the exam produces subgroup differences,” and cited an article he authored showing that subgroup differences can be minimized by having candidates generate rather than select answers.
Mr. Barry, the Assistant Chief, testified that “the captains who scored high on the multiple-choice tests were not always as effective as captains who scored lower on these tests” and that he “never found a direct correlation between the scores on the test and their performance.” He opined that “taking tests and working in the four different ranks in the emergency operations and observing people who have been promoted through the system, the ability to memorize the correct 100 facts that are going to be asked on a test does not correlate to how these people perform in either complex personnel situations or emergency situations.” He stated further that “he has also seen firefighters who did well on the exams perform well after promotion”, though he thought the other direction was more common. He admitted, however, that those who “put the most time, energy, effort, and sacrifice into test preparation scored higher.”
In the end, Judge Rosenthal rejected the validity generalization argument, rejected the argument that the multiple choice tests should be used on a pass/fail basis only, and accepted the incorporation of assessment centers and situational judgment tests, but rejected according these methods greater weight than the multiple choice tests.
My goal in this alert was to capture the “essence” of what each expert offered. I recognize there is a lot to chew on here and that there is so much more I could have said about what each of the experts offered. For example, there was detailed discussion of statistical methods used to assess adverse impact (among other things). If I missed critical points, or misrepresented any of the experts, please let me know.