MegaMath: An Open Math Pre-training Dataset with 370B Tokens. [COLM 2025]
About MegaMath
MegaMath is a large-scale pre-training dataset for math.
It is curated via the following three efforts:
Revisiting web data: We re-extracted mathematical documents from Common Crawl using math-oriented HTML optimizations, fastText-based filtering (see the sketch after this list), and deduplication, all to acquire higher-quality math data from the Internet.
Recalling math-related code data: We identified high-quality math-related code from a large code training corpus, Stack-V2, further enhancing data diversity.
Exploring synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from the web and code data.
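As an illustration of the fastText-based filtering step, below is a minimal sketch of scoring documents with a binary math classifier. The model file, label strings, and threshold are hypothetical stand-ins, not the actual MegaMath classifier or its settings; the real pipeline is documented in web_pipeline.

```python
# A minimal sketch of fastText-based document filtering, assuming a binary
# classifier trained to separate math documents from general web text.
# "math_classifier.bin", the "__label__math" label, and the 0.8 threshold
# are hypothetical, not MegaMath's actual values.
import fasttext

model = fasttext.load_model("math_classifier.bin")  # hypothetical model file

def is_math_document(text: str, threshold: float = 0.8) -> bool:
    # fastText's predict() rejects newlines, so collapse the document to one line.
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == "__label__math" and probs[0] >= threshold

docs = [
    "We prove that the sum of the first n odd numbers equals n^2.",
    "Top ten travel destinations for your summer vacation.",
]
kept = [d for d in docs if is_math_document(d)]
```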
How to Use
MegaMath includes several data variants, each tailored to different training demands.
If you are training your LLM from scratch, we recommend using the full set of our web data.
We also provide MegaMath-Code, which can strengthen your LLM's ability to solve math-related tasks via Python code. Moreover, MegaMath contains over 80B tokens of synthetic data that can further boost performance on math-related tasks.
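For instance, a minimal sketch of streaming the data with the HuggingFace datasets library might look like the following; the dataset ID and the "text" field name are assumptions, so check the dataset card for the exact variant and config names.

```python
# A minimal sketch of streaming a MegaMath variant with HuggingFace datasets.
# The dataset ID "LLM360/MegaMath" and the "text" field are assumptions;
# consult the dataset card for the exact variant names.
from datasets import load_dataset

ds = load_dataset("LLM360/MegaMath", split="train", streaming=True)
sample = next(iter(ds))            # fetch one record without downloading everything
print(sample.get("text", sample))  # inspect the schema before training
```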
Please refer to the web_pipeline for more details. We are actively working on the code pipeline and will update the README soon.
Citation
If you use our dataset or find our work useful, please cite:
@article{zhou2025megamath,
  title   = {MegaMath: Pushing the Limits of Open Math Corpora},
  author  = {Zhou, Fan and Wang, Zengzhi and Ranjan, Nikhil and Cheng, Zhoujun and Tang, Liping and He, Guowei and Liu, Zhengzhong and Xing, Eric P.},
  journal = {arXiv preprint arXiv:2504.02807},
  year    = {2025},
  note    = {Preprint}
}