
Benchmarking LLMs for global health

by Delarno


Large language models (LLMs) have shown potential for medical and health question answering across a variety of health-related tests spanning different formats and sources. Indeed, we have been at the forefront of efforts to expand the utility of LLMs for health and medical applications, as demonstrated in our recent work on Med-Gemini, Med-PaLM, AMIE, and Multimodal Medical AI, and in our release of novel evaluation tools and methods to assess model performance across various contexts. In low-resource settings in particular, LLMs could serve as valuable decision-support tools, improving clinical diagnostic accuracy and accessibility and supporting multilingual clinical decision-making and health training, especially at the community level. Yet despite their success on existing medical benchmarks, it remains uncertain how well these models generalize to tasks involving distribution shifts in disease types, region-specific medical knowledge, and contextual variation across symptoms, location, language and linguistic diversity, and local cultural context.

Tropical and infectious diseases (TRINDs) are one example of such an out-of-distribution disease subgroup. TRINDs are highly prevalent in the poorest regions of the world, affecting 1.7 billion people globally, with disproportionate impacts on women and children. Challenges in preventing and treating these diseases include limitations in surveillance, early detection, accurate initial diagnosis, management, and vaccines. LLMs for health-related question answering could potentially enable early screening and surveillance based on a person's symptoms, location, and risk factors. However, few studies have examined LLM performance on TRINDs, and few datasets exist for rigorous LLM evaluation.

To address this gap, we have developed synthetic personas (i.e., data representing profiles, scenarios, and similar cases that can be used to evaluate and optimize models) and benchmark methodologies for out-of-distribution disease subgroups. We have created a TRINDs dataset of more than 11,000 manually and LLM-generated personas representing a broad array of tropical and infectious diseases, with demographic, contextual, location, language, clinical, and consumer augmentations. Part of this work was recently presented at the NeurIPS 2024 workshops on Generative AI for Health and on Advances in Medical Foundation Models.
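To make the idea of persona augmentation more concrete, here is a minimal Python sketch built on stated assumptions. The `Persona` schema, the `augment` and `to_prompt` helpers, and the example values are all hypothetical; they are not the TRINDs dataset's actual format or generation pipeline. The sketch shows how a single seed persona with a known disease label might be expanded into variants along demographic, location, and clinical-versus-consumer axes, while the rendered question withholds the label so a model's answer can be scored against it. Language augmentation is left out because it would require translation (for example by human translators or an LLM), which this sketch does not attempt.

```python
from dataclasses import dataclass, replace
from itertools import product
from typing import Optional

@dataclass(frozen=True)
class Persona:
    # Hypothetical schema for illustration only; not the dataset's real format.
    disease: str                      # ground-truth label, never shown in the prompt
    symptoms: tuple[str, ...]
    age: Optional[int] = None         # demographic augmentation
    sex: Optional[str] = None         # demographic augmentation
    location: Optional[str] = None    # location augmentation
    register: str = "clinical"        # "clinical" or "consumer" phrasing

def augment(seed: Persona, demographics, locations, registers) -> list[Persona]:
    """Hypothetical helper: one variant per combination of augmentation values."""
    return [
        replace(seed, age=age, sex=sex, location=loc, register=reg)
        for (age, sex), loc, reg in product(demographics, locations, registers)
    ]

def to_prompt(p: Persona) -> str:
    """Render a persona as a question; the disease label is deliberately omitted."""
    who = f"a {p.age}-year-old {p.sex}" if p.age is not None and p.sex else "a patient"
    where = f" in {p.location}" if p.location else ""
    symptoms = ", ".join(p.symptoms)
    if p.register == "consumer":
        return f"I'm {who}{where} and I have {symptoms}. What could be wrong?"
    return f"Patient: {who}{where}, presenting with {symptoms}. What is the most likely diagnosis?"

# Example: expand one seed persona along three axes (2 x 2 x 2 combinations).
seed = Persona(disease="dengue", symptoms=("high fever", "severe headache", "pain behind the eyes"))
variants = augment(
    seed,
    demographics=[(34, "woman"), (8, "boy")],
    locations=["Brazil", "Vietnam"],
    registers=["clinical", "consumer"],
)
for v in variants[:2]:
    print(to_prompt(v), "->", v.disease)
```

Keeping the disease label on the persona rather than in the prompt lets every variant be scored against the same ground truth, and the number of variants grows multiplicatively with each augmentation axis (8 here from three axes with two values each).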


