This project is a fork of the LM Evaluation Harness, adapted for evaluating the medical language models used in the accompanying paper. It provides a unified framework for testing generative language models on a range of medical and general-purpose evaluation tasks.
- Comprehensive support for evaluating medical language models.
- Benchmarks for six widely recognized medical tasks:
  - MedMCQA
  - MedQA-USMLE
  - PubMedQA
  - USMLE Step 1
  - USMLE Step 2
  - USMLE Step 3
- Easy-to-use CLI for evaluation and few-shot learning setups.
To install the `lm-eval` framework from this repository, follow these steps:

```bash
git clone https://github.com/emrecanacikgoz/lm-evaluation-harness/
cd lm-evaluation-harness
pip install -e .
```
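Optionally, you can check that the installation succeeded before launching a full evaluation. The snippet below is a minimal sanity check and assumes the standard upstream layout, with `main.py` at the repository root.

```bash
# Optional sanity check: confirm the package imports and the CLI responds.
python -c "import lm_eval"
python main.py --help
```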
To evaluate a model on the supported medical tasks, run the following command:
```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=emrecanacikgoz/hippollama \
    --tasks medmcqa,medqa_usmle,pubmedqa,usmle_step1,usmle_step2,usmle_step3 \
    --device cuda:0
```
Replace `emrecanacikgoz/hippollama` with your model's Hugging Face path if it differs.
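Before a full run, it can help to do a quick smoke test on a small subset of examples. The command below is a sketch that assumes this fork keeps the upstream harness's `--limit` (evaluate only the first N examples per task) and `--output_path` (write aggregated results to a JSON file) options.

```bash
# Smoke test: score only the first 10 examples of each task and save the
# results to results.json (flags assumed unchanged from the upstream harness).
python main.py \
    --model hf-causal \
    --model_args pretrained=emrecanacikgoz/hippollama \
    --tasks medmcqa,pubmedqa \
    --device cuda:0 \
    --limit 10 \
    --output_path results.json
```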
For few-shot setups, specify the number of in-context examples (e.g., 5) as follows:

```bash
python write_out.py \
    --model hf-causal \
    --model_args pretrained=emrecanacikgoz/hippollama \
    --tasks medmcqa,medqa_usmle,pubmedqa,usmle_step1,usmle_step2,usmle_step3 \
    --num_fewshot 5 \
    --output_base_path /path/to/output/folder
```
This will generate one text file per task in the specified output folder.
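To actually score the model in a few-shot setting (rather than only writing the prompts to disk), the upstream harness exposes the same `--num_fewshot` flag on `main.py`; the command below is a sketch under the assumption that this fork keeps that behavior.

```bash
# Sketch of a 5-shot evaluation run; --num_fewshot is assumed to behave as in
# the upstream LM Evaluation Harness.
python main.py \
    --model hf-causal \
    --model_args pretrained=emrecanacikgoz/hippollama \
    --tasks medmcqa,medqa_usmle,pubmedqa,usmle_step1,usmle_step2,usmle_step3 \
    --num_fewshot 5 \
    --device cuda:0
```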
This repository is adapted from the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) by EleutherAI. We extend our gratitude to the original authors for their contributions to open-source AI research.
We welcome contributions to improve and expand the functionality of this evaluation harness. Please open an issue or submit a pull request if you have suggestions or enhancements.
This project follows the same licensing terms as the original LM Evaluation Harness. Please refer to the original repository for details.