ICLR 2025
There have been many benchmarks for evaluating long-context language models (LCLMs), but developers often rely on synthetic tasks like needle-in-a-haystack (NIAH) or arbitrary subsets of tasks.
We present HELMET (How to Evaluate Long-context Models Effectively and Thoroughly), a comprehensive benchmark encompassing seven diverse, application-centric categories. We also address many issues in previous benchmarks by adding controllable lengths up to 128K tokens, model-based evaluation for reliable metrics, and few-shot prompting for robustly evaluating base models.
Through a comprehensive study of 59 LCLMs, we find that (1) synthetic tasks like NIAH are not good predictors of downstream performance; (2) the diverse categories in HELMET exhibit distinct trends and low correlation with each other; and (3) while most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when the task requires full-context reasoning or following complex instructions, and the gap widens as input length increases.
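To illustrate the kind of analysis behind finding (1), the minimal sketch below ranks models by their NIAH score and by a downstream HELMET category score, then computes a Spearman rank correlation. The model names and scores are hypothetical placeholders, not numbers from the paper; see the paper and spreadsheet for the actual results.

# Minimal sketch of a rank-correlation analysis between a synthetic task
# (NIAH) and a downstream HELMET category. All scores below are
# hypothetical placeholders, NOT results from the paper.
from scipy.stats import spearmanr

# Hypothetical per-model scores (model name -> accuracy in %)
niah_scores = {"model_a": 100.0, "model_b": 99.5, "model_c": 98.0, "model_d": 100.0}
rag_scores  = {"model_a": 62.3,  "model_b": 48.1, "model_c": 55.7, "model_d": 40.2}

models = sorted(niah_scores)
rho, pval = spearmanr(
    [niah_scores[m] for m in models],
    [rag_scores[m] for m in models],
)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
# A low rho indicates that NIAH rankings do not predict rankings
# on the downstream category.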
Table: Average performance across the seven categories of HELMET at a 128K input length. The context window is the training or claimed context length of the model. Instruction-tuned models are highlighted in light blue, and base models are highlighted in yellow. Frontier models struggle significantly more on complex tasks, such as generation with citations and passage re-ranking. For details on the models and more results, please refer to the paper. You may also find the results for each dataset on this spreadsheet.
Click on a category to explore the data. Note that we show only one demonstration per example in this viewer, while the actual data contains two in-context learning examples. We also truncate most of the text for brevity.
@inproceedings{yen2025helmet,
  title={HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly},
  author={Howard Yen and Tianyu Gao and Minmin Hou and Ke Ding and Daniel Fleischer and Peter Izsak and Moshe Wasserblat and Danqi Chen},
  year={2025},
  booktitle={International Conference on Learning Representations (ICLR)},
}
For any questions or feedback, please reach out to hyen [@] cs.princeton.edu