GitHub - jhclark/bigfatlm: Hadoop MapReduce training of modified Kneser-Ney smoothed language models
BigFatLM V0.1
By Jonathan Clark
Carnegie Mellon University -- School of Computer Science -- Language Technologies Institute
Released: March 31, 2011
License: LGPL 3.0

"Embiggen your language models"

BACKGROUND

BigFatLM provides Hadoop MapReduce training of state-of-the-art large-scale language models. This allows building much larger models on commodity hardware. Other popular packages either have memory usage on the order of the number of N-gram types for the highest N (e.g. SRILM) or offer only approximate smoothing methods for training, with no theoretical guarantees as the number of nodes scales up (e.g. IRSTLM). BigFatLM keeps per-node memory usage on the order of the vocabulary size and scales well as the data size increases by making use of more nodes in a Hadoop cluster.

Modified Kneser-Ney with order-wise interpolation has become an accepted standard of language model smoothing in the machine translation community (and likely has comparably good performance on other tasks such as speech recognition). System builders have also found that interpolating multiple models built on different corpora can often provide benefits in terms of perplexity and end-task performance. BigFatLM supports using both of these techniques with distributed training.**

(** MODEL INTERPOLATION IS STILL UNDER TESTING AND WILL NOT BE DOCUMENTED HERE UNTIL TESTED **)

For the sake of comparison, BigFatLM's results have been diffed with the output of SRILM's

    ngram-count -kndiscount -gt3min 1 -gt4min 1 -gt5min 1 -order 5 -interpolate -kndiscount -unk

and found to be within floating point error.

REQUIREMENTS

* Java 6+
* A Hadoop cluster running Hadoop 20.1+ or, for much slower results, a local install of Hadoop.
* If you wish to build BigFatLM from source, you'll need the Apache Ant build system.
* If you wish to filter the language model, you should also install KenLM's filter tool. See https://kheafield.com/code/mt/filter.html.
* If you wish to run the regression tests, you'll need Python 2.6+.

BUILDING

Just type "ant".

USAGE

Assume we have:
* A pre-processed, tokenized corpus on HDFS (the Hadoop distributed filesystem) at /home/user/corpus

And we want to build:
* An order 3 LM
* Using order-wise interpolated modified Kneser-Ney smoothing (the default)

And we want to put the resulting files:
* For the small local files, under ./corpus-3g-lm
* For the large HDFS files, under $HDFS_USER_HOME/corpus-3g-lm
* For the final local merged ARPA file, at /home/user/corpus-3g.arpa

# The most basic usage, to get an uncompressed ARPA file:
$ ./build-lm.sh 3 /home/user/corpus corpus-3g-lm /home/user/corpus-3g.arpa

# To gzip the resulting ARPA file using bash process substitution:
$ ./build-lm.sh 3 /home/user/corpus corpus-3g-lm >(gzip > /home/user/corpus-3g.arpa.gz)

REFERENCES

1. Chen S, Goodman J. An Empirical Study of Smoothing Techniques for Language Modeling. 1998.
2. Brants T, Popat AC, Och FJ. Large Language Models in Machine Translation. Computational Linguistics. 2007;1(June):858-867.
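COMPARING ARPA FILES (SKETCH)

The BACKGROUND section says BigFatLM's output was diffed against SRILM's and found to be within floating point error. A check of that kind might look like the minimal Python 3 sketch below. It is not part of BigFatLM, and the function names and the tolerance are assumptions made for illustration. It relies only on the standard ARPA layout: "\N-grams:" section headers followed by lines of the form "log10prob words [log10backoff]", with a missing backoff weight treated as 0.0.

```python
import math


def read_arpa(lines):
    """Parse ARPA-format lines into {ngram_tuple: (logprob, backoff)}."""
    entries = {}
    order = 0  # 0 until the first "\N-grams:" section header is seen
    for raw in lines:
        line = raw.strip()
        if not line or line in ("\\data\\", "\\end\\") or line.startswith("ngram "):
            continue
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            continue
        if order == 0:
            continue  # ignore anything before the first n-gram section
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The backoff weight is optional; an absent weight means 0.0 in log10.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        entries[words] = (logprob, backoff)
    return entries


def arpa_mismatches(a, b, tol=1e-4):
    """Return n-grams missing from one model or differing by more than tol."""
    bad = []
    for ngram in sorted(set(a) | set(b)):
        if ngram not in a or ngram not in b:
            bad.append(ngram)
            continue
        pairs = zip(a[ngram], b[ngram])
        if any(not math.isclose(x, y, rel_tol=0.0, abs_tol=tol) for x, y in pairs):
            bad.append(ngram)
    return bad
```

To use it, pass an open file handle for each ARPA file to read_arpa and hand the two resulting dictionaries to arpa_mismatches. An empty list means the two models agree within the chosen tolerance. The whole model is held in memory, so this only suits models small enough to fit on one machine.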