Releases: allenai/olmocr
v0.2.1
What's new
Commits
476e20c Bump version to v0.2.1 for release
2545408 Minor release cleaning up a few pipeline things
54719b6 Fixed
c63e97f Default max model len cleanup
4acc85e bolds in tables
c13f5aa Readme
44cb957 Readmes
783cacd Merge branch 'main' of https://github.com/allenai/olmocr
ce32ceb Hopefully a cleaner pipeline
v0.2.0
What's new
Commits
b4c5913 Bump version to v0.2.0 for release
35a5329 New version
0a6b2fe Lints - bringing back files
56296d6 Bringing back a few files
6e82724 Lint fixes
5ec4967 New default model
a4752b5 Merge remote-tracking branch 'origin/main' into jakep/new_trainer
9ef3fd7 Adjusting temp by attempt
60c3944 More configs
f44d03f Don't break on errors
8eb3786 Fixing compressor again
da6bc45 Fix for compressor
eb200e7 Fixing some default configs on quantizer
0aa7479 More calibration samples by default
6e48012 Trying out an idea for dataset augmentation
df960cb 2 epoch config just to try
a326c96 Default to 1288
b88c71e Rounding to better image size, full soups
75bfa6a Adding full soup configs
0f733ff Fixes to compare vllm script
16145a4 Need accelerate
4785759 Adding some souping support to prepare checkpoint
2b63855 Compare has better downloader
0b40bd3 Better docker ignore
d21a164 Fixing async stuff
3ca305d Adding some souping configs
c0bf310 Fixing import
31c834d Constants
5ea4e8a Compare vllm script
939a76a Adding a compare vllm checkpoint script
2460895 Working on comparing to vllm
e6c9823 Adding more pipeline retry stats, compress code fixed
4dbbf91 Compression script
feb2dab Adjust config
022f437 w8a8-int8 version
5a4a836 Calibration
9115a02 Fixes
4b0960b Test
ee69faa Dataset
bd92f08 Errors propagated
fcd373d Calibration stuff
2218bf8 Merge branch 'jakep/new_trainer_vllm092' into jakep/new_trainer
b5f480d Working on calibration set for compressor, seems like qwen2.5 is not working
3f9fc8b Better compressor hopefully
287c827 Starting to cleanup and merge yaml front matter stuff in
1092213 Merge branch 'jakep/new_traininer_nojson_newprompt' into jakep/new_trainer
679063a Adding some more logging to compressor
43ae28d Prepare checkpoint works for older models too
f306a52 Compress fix
01360ba Compressor script
1ede76d Cleaning up compress and prepare checkpoint scripts
a5a0cd7 Trying a few more configs
384a1b1 Qwen 2 config too
24a3fb8 128batch config, wsd config
0c773c4 Let's do a 1280 no anchor yaml
da5f8f2 wsd config
336b000 Adding wsd as an option
69581cc More config fixes
ca8e503 Ugh, lost some training runs because files got saved to the wrong place
02f0706 Reverting back to json pipeline as it seems better by default
8ae9104 Calling it with a new name
3976cee Adding 8192 cap on day2 config
ca2609c No doc anchoring version
560a585 Configs with proper names
53cc1a0 Fixed json configuration
2c54c6d Allow unicode in json
b1ab996 Day 2 json config
a1c2ee8 More workers by default
d26ae4b Easier way to test configs
a7e2f71 Start a preemptible one at least once
6d6476b One idea for resume fix
2a20607 Get rid of fused
59f11c7 Better names
210d170 Adding a standard JSON output option
6f2a426 Fresh prompt configs
5e8017b Oops
4a6ef91 Matching old trainer config
5e2f703 Trying some config changes
94d7900 Default configs are better
56e51ea Improving regex even more
98df1d5 Adding max length option
abdc907 Pipeline fix
e691ea1 Better regex for structured decoding, adding some new prompts to train with
a651cf0 Adding guided regex decoder
748e2ae With yaml formatted responses, make sure response finishes with code stop
9bf8e9e Preparing pipeline for new format
c6c1fbd Better prepare checkpoint script
8dcfdd0 Checkpoint prep tool
c029ccd Added a few more configs to try
79a7818 New trainer launch script for beaker
dcf026a Better script
9f0f912 Ugh
1d007d1 Perhaps fixing default config
e7020c7 More configs
7cf9879 Image 1600 configuration
d2ef9d7 Four basic training configs for new version
a3ad61b Small config updates
ee8bd9b Better resume logic I hope
208fabc Validating on process pool
4f46f10 At least get resuming from checkpoints to work perhaps
2375079 Torch compile off, gives warnings and no speed boost, padding to do multi batch is not working either
c11120a Trying to do batch size > 1
5c2d69a Some cleanup stuff
e86511e Weka fix
656dbef Frontier configs
e2f2d36 More typos
ea72ea2 Ugh stupid fix
55a737c script
ba49fd5 frontier train script let's see what happens
bde6f29 Bf16 only
44dd966 Wandb fixes
f8071c7 Loss config
a399741 Naming config entries better
8e5e18f Checking that anchor text works for each pdf page when initializing dataloader
dc7fff5 Collator fix
12b5cc3 Lowering size of default data load for testing
c36b5df Cleanup collator
887190e Cleanup
330f465 Small fixes
214c44d Reporting to wandb, better eval dataset loading
600d967 Config changes
850b598 Sdpa
b96454b Merge branch 'main' into jakep/new_trainer
58e4fad torchvision requirement
1451dd1 weka
680377c Example config
dee3730 Gantry stuff
0d7836b Basic attempt to run trainer script
d7e5037 New trainer launch script cleanups
91e7b5c Claude generated train script
0ebc35c Basic train config loader for datasets
b93c262 Prepping new config stuff
e9828cd Lints, adding more perf tracking to pipeline
9ab742b Outputting finished output tok/sec as well
cc0c62a Adding more workers by default to improve bench perf
43c94fe Benchmark update
b1e064f Run benchmark script will also start a job to convert 10k docs from olmocr-mix to check performance
3d72f34 Fixing prepare_olmocrmix
c93ac4a Cleaned up loader
6033881 Cleaning up dataloader
cfe9aa1 Ok, dataloader from start to finish is running, now to write a trainer
105d590 Dataloader progress
9f50bda More refactoring
6a360fa Cleanup
d17bef8 Working on a more pipeliney thing
d0df380 Cleaning data loader
5bbc1ff Parsing and validating front matter
aedc295 Image params to loader
9a390e3 Validating that we get single pages
0689676 Rendering the pdfs in the dataloader
352287c Starting on dataloader
0e17b50 Ok, looks like we have a nice extractor script for the dataset
f19f7c1 Almost done extracting
f0d8ff7 First attempt at new trainer code
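Several commits in this release (a651cf0, e691ea1, 748e2ae) add guided regex decoding so the model's output is forced to match the expected response format. The implementation isn't shown in this log, but vLLM exposes this capability through GuidedDecodingParams; the sketch below is a minimal illustration under that assumption, with a hypothetical regex and an illustrative checkpoint name standing in for olmocr's actual pattern and model.

```python
# Minimal sketch of guided regex decoding with vLLM. The regex and model
# name are illustrative, NOT olmocr's actual pattern or checkpoint.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Hypothetical front-matter shape: a YAML-style header, then free text.
FRONT_MATTER_REGEX = r"---\nprimary_language: [a-z]{2}\nis_table: (true|false)\n---\n[\s\S]*"

llm = LLM(model="allenai/olmOCR-7B-0225-preview")

params = SamplingParams(
    max_tokens=4096,
    # Constrain token sampling so every completion must match the regex.
    guided_decoding=GuidedDecodingParams(regex=FRONT_MATTER_REGEX),
)

outputs = llm.generate(["<page prompt goes here>"], params)
print(outputs[0].outputs[0].text)
```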
v0.1.76
What's new
Commits
24a2f9b Bump version to v0.1.76 for release
cd93ca5 Version bump
ecce181 Merge pull request #256 from allenai/jakep/dockerfix
0c6d199 Update README.md
ec5c5b6 Updating pareto plots
6c51829 Some helper scripts
626952a Adding news
9d26079 README updates
69524cb Updating bench readme
v0.1.75
v0.1.74
v0.1.73
v0.1.72
What's new
Commits
5e5c31b Bump version to v0.1.72 for release
715b841 0.1.72
b03feb3 Fixed
b588ae2 Removing sglang tests, switch to vllm
6e3fba3 Lints
e489b28 Lints
6fcd26d Updating readme
8c62072 Merge remote-tracking branch 'origin/main' into jakep/vllm_perf
3eda2c0 updated vllm to 0.9.1
a83a0da Cleanup of vllm perf branch with @amanr
316d0af added dtype functionality
c8a5361 fixing packages of 22.04
c5d075c fixed apt_pkg module
08fd82f made changes wrt ubuntu 22.04
6507a65 updated ubuntu to 22.04 for glibc 2.32
25dfe0b Weird glibc error
9539eab AWS creds fix
e0fda1a Passing aws creds to benchmark so we can run custom models stored in s3
ecf0d48 Don't allow uncommitted changes
134bba9 Run benchmark adjustments
7009a7a Trying out FP8 compression
aad8428 Reverting custom pipeline image
5c52e01 Include cuda 12.8
5c524b5 Cleaning up stats reporting
916f0cb Trying with flash infer installed
2ccef7d Ugh, this code is bad
2f1957b Performance fixes with vllm backend
d717033 Fixing parse for waiting
d1baa51 Python alternatives
581915f Fixes for docker image
153f1e5 Final uv fixes
97da87a Hopefully a much better dockerfile
04dd71c Trying to get onto vllm latest
106070d Moving pipeline to vllm
2235b82 Beaker tests
967c83d Better way to setup beaker
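Commit 7009a7a experiments with FP8 compression of the model. The log doesn't name the toolkit, but the scheme vocabulary that appears later in v0.2.0 (w8a8-int8, calibration samples) matches the llmcompressor package; assuming that library, a one-shot FP8 quantization looks roughly like the sketch below. This is an assumption for illustration, not olmocr's actual compress script.

```python
# Sketch only: assumes the llmcompressor package; olmocr's real compression
# script is not shown in these release notes.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "allenai/olmOCR-7B-0225-preview"  # illustrative checkpoint

# FP8_DYNAMIC quantizes weights and activations with no calibration data;
# the W8A8-INT8 scheme from the v0.2.0 commits would pass a calibration
# dataset and sample count instead.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(model=MODEL_ID, recipe=recipe, output_dir="olmocr-fp8")
```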
v0.1.71
What's new
Commits
23f4a0e Bump version to v0.1.71 for release
8b4f6cd Upping version
24b6822 Pushing beaker images now too
208c29d Not including fallbacks in olmocr_pipeline bench runner so we can measure direct model performance better
5faf570 Format fixes
587b73f Try with more aggressive anchor changing
8f5d5bd Revert "Trying to add repetition penalty"
90f754e Trying to add repetition penalty
9dcdef6 Going to try with up to 5k tokens
8d92620 Merge remote-tracking branch 'origin/main' into retry_improvements
2cb14cc Allowing more tokens
022be37 Some better info strings in benchmark runner
22ee068 Merge remote-tracking branch 'origin/main' into retry_improvements
fbcd82a Cleanup attempt lookup code a bit
f8fd234 Idea to improve retry performance
61d427e Repo cleanup
7a50ee1 merge
241e5bf Merge branch 'main' of github.com:allenai/olmocr
470394d pareto plot
v0.1.70
What's new
Commits
e10a53c Bump version to v0.1.70 for release
76270f5 Upping to v70 to test new docker builds
a6d6c34 Refactored docker workflows
78ea21a Merge pull request #216 from allenai/amanr/docker
bea1873 Update README.md
7996a7d Update README.md
bdf0879 Merge pull request #202 from allenai/amanr/docker
74f4786 README updated with pip install and --markdown
v0.1.69
What's new
Commits
57238cf Bump version to v0.1.69 for release
71275cc Bumping version, adding more docs, more to come
7b640ae Merge branch 'main' of https://github.com/allenai/olmocr
8d8e323 Adding markdown flag to directly generate markdown outputs
1043491 Oops, removing submodule olmOCR bench repo, best if you just clone from hugging face
2c1c8a6 Updating readme more
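Commit 8d8e323 adds a --markdown flag for generating markdown output directly. Assuming the pipeline entry point documented in the project README, usage looks something like this (the workspace and PDF paths are illustrative):

```bash
# Convert a PDF and emit markdown alongside the usual JSONL results.
python -m olmocr.pipeline ./localworkspace --markdown --pdfs path/to/document.pdf
```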