Health-LLM: Large Language Models for Health Prediction via Wearable Sensor Data

1Massachusetts Institute of Technology 2Google Research

Health-LLM is a framework for evaluating LLM performance on a diverse set of health prediction tasks, training and prompting the models with multi-modal health data.

Abstract

Large language models (LLMs) are capable of many natural language tasks, yet they are far from perfect. In health applications, grounding and interpreting domain-specific and non-linguistic data is crucial. This paper investigates the capacity of LLMs to make inferences about health based on contextual information (e.g., user demographics, health knowledge) and physiological data (e.g., resting heart rate, sleep minutes). We present a comprehensive evaluation of 12 state-of-the-art LLMs with prompting and fine-tuning techniques on four public health datasets (PMData, LifeSnaps, GLOBEM, and AW_FB). Our experiments cover 10 consumer health prediction tasks in mental health, activity, metabolic, and sleep assessment. Our fine-tuned model, HealthAlpaca, exhibits comparable performance to much larger models (GPT-3.5, GPT-4, and Gemini-Pro), achieving the best performance in 8 out of 10 tasks. Ablation studies highlight the effectiveness of context enhancement strategies: context enhancement can yield up to a 23.8% improvement in performance. Constructing contextually rich prompts (combining user context, health knowledge, and temporal information) exhibits synergistic improvement, and the inclusion of health knowledge context in prompts in particular significantly enhances overall performance.
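To make the idea of context enhancement concrete, here is a minimal Python sketch of how a prompt could be assembled from optional context blocks. It is an illustration only, not the prompt template used in the paper: the function name `build_prompt`, the section labels, the placeholder health knowledge text, and the sample readings are all hypothetical.

```python
def build_prompt(question, readings, user_context=None,
                 health_knowledge=None, temporal=False):
    """Assemble a plain-text prompt from optional context blocks.

    Hypothetical helper for illustration; the labels and layout are
    assumptions, not the paper's actual prompt format.
    """
    sections = []
    if user_context:
        # User context: demographic details about the wearer.
        details = ", ".join(f"{k}: {v}" for k, v in user_context.items())
        sections.append(f"User context: {details}.")
    if health_knowledge:
        # Health knowledge: background text about the target measure.
        sections.append(f"Health knowledge: {health_knowledge}")
    if temporal:
        # Temporal context: list every day's readings, oldest first.
        for day, values in enumerate(readings, start=1):
            row = ", ".join(f"{k}={v}" for k, v in values.items())
            sections.append(f"Day {day}: {row}")
    else:
        # Basic setting: only the most recent day's readings.
        row = ", ".join(f"{k}={v}" for k, v in readings[-1].items())
        sections.append(f"Latest readings: {row}")
    sections.append(f"Question: {question}")
    return "\n".join(sections)


# Made-up sample values, used only to exercise the function.
readings = [
    {"resting_heart_rate": 62, "sleep_minutes": 410},
    {"resting_heart_rate": 65, "sleep_minutes": 355},
]
question = "Predict the readiness score from 0 to 10."

basic = build_prompt(question, readings)
rich = build_prompt(
    question,
    readings,
    user_context={"age": 30, "sex": "female"},
    health_knowledge="(short description of the target measure goes here)",
    temporal=True,
)
print(rich)
```

Keeping each context type behind its own argument makes it straightforward to run an ablation in the spirit of the one described above, by toggling one block at a time and comparing against the basic prompt.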

Main Result

(a) Average performance improvement over basic (bs) across contexts. (b) Best performance improvement across LLMs. (c) Best performance improvement across datasets. Note that a few models (Llama 2, Gemini-Pro, BioMedGPT, and BioMistral) were excluded from this experiment due to the prioritization of models based on integration timelines.

Case Study

Case study on Readiness Score Prediction (READ) from the PMData dataset. Here, we display the responses from 1) our fine-tuned model, HealthAlpaca, 2) GPT-3.5, 3) GPT-4, and 4) Gemini-Pro. Green bold text highlights valid reasoning, and red bold text highlights reasoning that is false or irrelevant to the input.

BibTeX

@article{kim2024health,
  title={{Health-LLM}: Large language models for health prediction via wearable sensor data},
  author={Kim, Yubin and Xu, Xuhai and McDuff, Daniel and Breazeal, Cynthia and Park, Hae Won},
  journal={arXiv preprint arXiv:2401.06866},
  year={2024}
}