MegaMath: An Open Math Pre-training Dataset with 370B Tokens. [COLM 2025]
About MegaMath
MegaMath is a large-scale pre-training dataset for math.
It is curated via the following three efforts:
Revisiting web data: We re-extracted mathematical documents from Common Crawl using math-oriented HTML optimizations, fastText-based filtering (see the sketch after this list), and deduplication, all to acquire higher-quality math data from the Internet.
Recalling math-related code data: We identified high-quality math-related code from a large code training corpus, Stack-V2, further enhancing data diversity.
Exploring synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from the web and code data.
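As an illustration of the fastText-based filtering step, below is a minimal sketch of scoring documents with a binary math classifier. The model file, label strings, and threshold are hypothetical stand-ins, not the actual MegaMath classifier or its settings; the real pipeline is documented in web_pipeline.

```python
# A minimal sketch of fastText-based document filtering, assuming a binary
# classifier trained to separate math documents from general web text.
# "math_classifier.bin", the "__label__math" label, and the 0.8 threshold
# are hypothetical, not MegaMath's actual values.
import fasttext

model = fasttext.load_model("math_classifier.bin")  # hypothetical model file

def is_math_document(text: str, threshold: float = 0.8) -> bool:
    # fastText's predict() rejects newlines, so collapse the document to one line.
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == "__label__math" and probs[0] >= threshold

docs = [
    "We prove that the sum of the first n odd numbers equals n^2.",
    "Top ten travel destinations for your summer vacation.",
]
kept = [d for d in docs if is_math_document(d)]
```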
How to Use
MegaMath includes several data variants, each tailored to different training demands.
If you are training your LLM from scratch, we recommend using the full set of our web data.
We also provide MegaMath-Code, which can strengthen your LLM's ability to solve math-related tasks via Python code. Moreover, MegaMath contains over 80B tokens of synthetic data that can further boost performance on math-related tasks.
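For instance, a minimal sketch of streaming the data with the HuggingFace datasets library might look like the following; the dataset ID and the "text" field name are assumptions, so check the dataset card for the exact variant and config names.

```python
# A minimal sketch of streaming a MegaMath variant with HuggingFace datasets.
# The dataset ID "LLM360/MegaMath" and the "text" field are assumptions;
# consult the dataset card for the exact variant names.
from datasets import load_dataset

ds = load_dataset("LLM360/MegaMath", split="train", streaming=True)
sample = next(iter(ds))            # fetch one record without downloading everything
print(sample.get("text", sample))  # inspect the schema before training
```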
Please refer to the web_pipeline for more details. We are actively working on the code pipeline and will update the README soon.
Citation
If you use our dataset or find our work useful, please cite:
@article{zhou2025megamath,
  title   = {MegaMath: Pushing the Limits of Open Math Corpora},
  author  = {Zhou, Fan and Wang, Zengzhi and Ranjan, Nikhil and Cheng, Zhoujun and Tang, Liping and He, Guowei and Liu, Zhengzhong and Xing, Eric P.},
  journal = {arXiv preprint arXiv:2504.02807},
  year    = {2025},
  note    = {Preprint}
}