ChatGPT is biased against resumes with credentials that imply a disability — but it can improve

While seeking research internships last year, University of Washington graduate student Kate Glazko noticed recruiters posting online that they’d used OpenAI’s ChatGPT and other artificial intelligence tools to summarize resumes and rank candidates. Automated screening has been commonplace in hiring for decades. Yet Glazko, a doctoral student in the UW’s Paul G. Allen School of Computer Science & Engineering, studies how generative AI can replicate and amplify real-world biases — such as those against disabled people. How might such a system, she wondered, rank resumes that implied someone had a disability?

In a new study, UW researchers found that ChatGPT consistently ranked resumes with disability-related honors and credentials — such as the “Tom Wilson Disability Leadership Award” — lower than the same resumes without those honors and credentials. When asked to explain the rankings, the system spat out biased perceptions of disabled people. For instance, it claimed a resume with an autism leadership award had “less emphasis on leadership roles” — implying the stereotype that autistic people aren’t good leaders.

But when researchers customized the tool with written instructions directing it not to be ableist, the tool reduced this bias for all but one of the disabilities tested. Five of the six implied disabilities — deafness, blindness, cerebral palsy, autism and the general term “disability” — improved, but only three ranked higher than resumes that didn’t mention disability.

The team presented its findings June 5 at the 2024 ACM Conference on Fairness, Accountability, and Transparency in Rio de Janeiro.

“Ranking resumes with AI is starting to proliferate, yet there’s not much research behind whether it’s safe and effective,” said Glazko, the study’s lead author. “For a disabled job seeker, there’s always this question when you submit a resume of whether you should include disability credentials. I think disabled people consider that even when humans are the reviewers.”

Researchers used the publicly available curriculum vitae (CV) of one of the study’s authors, which ran about 10 pages. The team then created six enhanced CVs, each implying a different disability by including four disability-related credentials: a scholarship; an award; a diversity, equity and inclusion (DEI) panel seat; and membership in a student organization.

Researchers then used ChatGPT’s GPT-4 model to rank these enhanced CVs against the original version for a real “student researcher” job listing at a large, U.S.-based software company. They ran each comparison 10 times. Across the 60 trials, the system ranked the enhanced CVs, which differed from the original only in the added disability-related credentials, first only one quarter of the time.
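The trial design described above (six enhanced CVs, each compared against the original 10 times) can be sketched as a simple tally. This is an illustrative sketch, not the researchers' code: the list of conditions is inferred from the disabilities named in this article, and `enhanced_ranked_first` is a hypothetical stand-in for a single model comparison.

```python
from typing import Callable, Dict, List

# The six implied disabilities, as inferred from those named in the article.
CONDITIONS: List[str] = [
    "disability", "deafness", "blindness",
    "cerebral palsy", "autism", "depression",
]
RUNS_PER_CONDITION = 10  # each comparison was run 10 times


def run_trials(enhanced_ranked_first: Callable[[str, int], bool]) -> Dict[str, int]:
    """Count, per condition, how often the enhanced CV is ranked first.

    `enhanced_ranked_first(condition, run)` is a hypothetical callable that
    stands in for one model comparison; it returns True when the enhanced
    CV is ranked ahead of the original CV.
    """
    wins = {condition: 0 for condition in CONDITIONS}
    for condition in CONDITIONS:
        for run in range(RUNS_PER_CONDITION):
            if enhanced_ranked_first(condition, run):
                wins[condition] += 1
    return wins


def always_control(condition: str, run: int) -> bool:
    """Placeholder ranker (not real data): the original CV always wins."""
    return False


wins = run_trials(always_control)
total_trials = len(CONDITIONS) * RUNS_PER_CONDITION
print(f"{total_trials} trials, enhanced CV first in {sum(wins.values())}")
```

Swapping the placeholder for a function that actually queries a model would give per-disability counts out of 10, which is the form in which the article reports the autism and depression results.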

“In a fair world, the enhanced resume should be ranked first every time,” said senior author Jennifer Mankoff, a UW professor in the Allen School. “I can’t think of a job where somebody who’s been recognized for their leadership skills, for example, shouldn’t be ranked ahead of someone with the same background who hasn’t.”

When researchers asked GPT-4 to explain the rankings, its responses exhibited explicit and implicit ableism. For instance, it noted that a candidate with depression had “additional focus on DEI and personal challenges,” which “detract from the core technical and research-oriented aspects of the role.”

“Some of GPT’s descriptions would color a person’s entire resume based on their disability and claimed that involvement with DEI or disability is potentially taking away from other parts of the resume,” Glazko said. “For instance, it hallucinated the concept of ‘challenges’ into the depression resume comparison, even though ‘challenges’ weren’t mentioned at all. So you could see some stereotypes emerge.”

Given this, researchers were interested in whether the system could be made less biased. They turned to the GPTs Editor tool, which allowed them to customize GPT-4 with written instructions (no code required). They instructed this chatbot to not exhibit ableist biases and instead work with disability justice and DEI principles.

They ran the experiment again, this time using the customized chatbot. Overall, this system ranked the enhanced CVs higher than the control CV 37 times out of 60. However, for some disabilities, the improvements were minimal or absent: The autism CV ranked first only three out of 10 times, and the depression CV only twice (unchanged from the original GPT-4 results).
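To put the reported figures side by side, the short calculation below converts them to rates. It is a back-of-the-envelope sketch: it assumes "one quarter of the time" means exactly 15 of the 60 baseline trials, which the article does not state precisely. The other counts are taken directly from the text.

```python
TOTAL_TRIALS = 6 * 10  # six enhanced CVs, ten runs each

# Assumption: "one quarter of the time" is read as exactly 15 of 60 trials.
baseline_first = TOTAL_TRIALS // 4
customized_first = 37  # reported for the customized chatbot

baseline_rate = baseline_first / TOTAL_TRIALS
customized_rate = customized_first / TOTAL_TRIALS

# Per-disability counts reported for the customized chatbot (out of 10 runs).
customized_by_condition = {"autism": 3, "depression": 2}

print(f"baseline:   {baseline_first}/{TOTAL_TRIALS} ({baseline_rate:.1%})")
print(f"customized: {customized_first}/{TOTAL_TRIALS} ({customized_rate:.1%})")
for condition, firsts in customized_by_condition.items():
    print(f"  {condition}: {firsts}/10 ({firsts / 10:.0%})")
```

Under that assumption the overall rate rises well above the baseline, while the autism and depression rates stay far below the overall customized rate, which is the unevenness the researchers point to.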

“People need to be aware of the system’s biases when using AI for these real-world tasks,” Glazko said. “Otherwise, a recruiter using ChatGPT can’t make these corrections, or be aware that, even with instructions, bias can persist.”

Researchers note that some organizations, such as ourability.com and inclusively.com, are working to improve outcomes for disabled job seekers, who face biases whether or not AI is used for hiring. They also emphasize that more research is needed to document and remedy AI biases. Those include testing other systems, such as Google’s Gemini and Meta’s Llama; including other disabilities; studying the intersections of the system’s bias against disabilities with other attributes such as gender and race; exploring whether further customization could reduce biases more consistently across disabilities; and seeing whether the base version of GPT-4 can be made less biased.

“It is so important that we study and document these biases,” Mankoff said. “We’ve learned a lot from and will hopefully contribute back to a larger conversation — not only regarding disability, but also other minoritized identities — around making sure technology is implemented and deployed in ways that are equitable and fair.”