GPT-4’s Potential to Perpetuate Racial, Gender Biases in Clinical Decision Making. (Representational Image: Unsplash)

MedBound Blog

Study Assesses GPT-4’s Potential to Perpetuate Racial, Gender Biases in Clinical Decision Making

Researchers analyzed GPT-4’s performance in four clinical decision support scenarios: generating clinical vignettes, diagnostic reasoning, clinical plan generation, and subjective patient assessments

MBT Desk

Published: 28th Dec, 2023 at 10:30 AM

Large language models (LLMs) like ChatGPT and GPT-4 have the potential to assist in clinical practice by automating administrative tasks, drafting clinical notes, communicating with patients, and even supporting clinical decision making. However, preliminary studies suggest the models can encode and perpetuate social biases that could adversely affect historically marginalized groups. A new study by investigators from Brigham and Women’s Hospital, a founding member of the Mass General Brigham healthcare system, evaluated the tendency of GPT-4 to encode and exhibit racial and gender biases in four clinical decision support roles. Their results are published in The Lancet Digital Health.

“While most of the focus is on using LLMs for documentation or administrative tasks, there is also excitement about the potential to use LLMs to support clinical decision making,” said corresponding author Emily Alsentzer, PhD, a postdoctoral researcher in the Division of General Internal Medicine at Brigham and Women’s Hospital. “We wanted to systematically assess whether GPT-4 encodes racial and gender biases that impact its ability to support clinical decision making.”

Alsentzer and colleagues tested four applications of GPT-4 using the Azure OpenAI platform. First, they prompted GPT-4 to generate patient vignettes that can be used in medical education. Next, they tested two applications, GPT-4’s ability to correctly develop a differential diagnosis and its ability to develop a treatment plan, using 19 different patient cases from NEJM Healer, a medical education tool that presents challenging clinical cases to medical trainees. Finally, they assessed how GPT-4 makes inferences about a patient’s clinical presentation using eight case vignettes that were originally generated to measure implicit bias. For each application, the authors assessed whether GPT-4’s outputs were biased by race or gender.

For the medical education task, the researchers constructed ten prompts that required GPT-4 to generate a patient presentation for a supplied diagnosis. They ran each prompt 100 times and found that GPT-4 exaggerated known differences in disease prevalence by demographic group.

"One striking example is when GPT-4 is prompted to generate a vignette for a patient with sarcoidosis: GPT-4 describes a Black woman 81% of the time," Alsentzer explains. "While sarcoidosis is more prevalent in Black patients and in women, it’s not 81% of all patients."

Next, when GPT-4 was prompted to develop a list of 10 possible diagnoses for the NEJM Healer cases, changing the gender or race/ethnicity of the patient significantly affected its ability to prioritize the correct top diagnosis in 37% of cases.

"In some cases, GPT-4’s decision making reflects known gender and racial biases in the literature," Alsentzer said. "In the case of pulmonary embolism, the model ranked panic attack/anxiety as a more likely diagnosis for women than men. It also ranked sexually transmitted diseases, such as acute HIV and syphilis, as more likely for patients from racial minority backgrounds compared to white patients."

When asked to evaluate subjective patient traits such as honesty, understanding, and pain tolerance, GPT-4 produced significantly different responses by race, ethnicity, and gender for 23% of the questions. For example, GPT-4 was significantly more likely to rate Black male patients as abusing the opioid Percocet than Asian, Black, Hispanic, and white female patients when the answers should have been identical for all the simulated patient cases.

Limitations of the current study include testing GPT-4's responses using a limited number of simulated prompts and analyzing model performance using only a few traditional categories of demographic identities. Future work should investigate biases using clinical notes from the electronic health record.

“While LLM-based tools are currently being deployed with a clinician in the loop to verify the model’s outputs, it is very challenging for clinicians to detect systemic biases when viewing individual patient cases,” Alsentzer said. “It is critical that we perform bias evaluations for each intended use of LLMs, just as we do for other machine learning models in the medical domain. Our work can help start a conversation about GPT-4’s potential to propagate bias in clinical decision support applications.” (Newswise/FK)
