A fast, multithreaded converter from jsonl to RWKV binidx files, written in Rust.
Installation
$ cargo install json2bin
Usage
$ json2bin -h
Json converter to RWKV binidx file format
Usage: json2bin [OPTIONS] --input <INPUT>
Options:
-i, --input <INPUT> Jsonlines file to read
-o, --output-dir <OUTPUT_DIR> Output directory for binidx files [default: -]
-t, --thread <THREAD> Number of threads [default: 8]
-v, --verbose Verbosity
-c, --context-length <CONTEXT_LENGTH> Context Length [default: 4096]
-h, --help Print help
-V, --version Print version
The following command converts the jsonl file src/sample.jsonl into the files src/sample.bin and src/sample.idx:
$ json2bin -i src/sample.jsonl
The output directory can be set with the argument "--output-dir <OUTPUT_DIR>" or "-o <OUTPUT_DIR>":
$ json2bin -i src/sample.jsonl -o output
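The naming rule in the examples above can be sketched in a few lines of Rust. This is only an illustration, not code from json2bin: the helper `binidx_paths` is hypothetical, and the behaviour with an output directory (placing `sample.bin` and `sample.idx` directly inside it) is an assumption that the help text does not confirm.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of json2bin): derive the `.bin` and `.idx`
/// paths for a given input file.
///
/// Assumption: without an output directory the outputs sit next to the input,
/// as in the `src/sample.jsonl` example; with one, the input's file name is
/// moved into that directory.
fn binidx_paths(input: &Path, output_dir: Option<&Path>) -> (PathBuf, PathBuf) {
    // Base path: the input itself, or its file name joined onto the output directory.
    let base = match (output_dir, input.file_name()) {
        (Some(dir), Some(name)) => dir.join(name),
        _ => input.to_path_buf(),
    };
    // Swap the `.jsonl` extension for each of the two binidx extensions.
    (base.with_extension("bin"), base.with_extension("idx"))
}
```

Under these assumptions, `binidx_paths(Path::new("src/sample.jsonl"), None)` yields `src/sample.bin` and `src/sample.idx`, and passing `Some(Path::new("output"))` yields `output/sample.bin` and `output/sample.idx`.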
The default number of threads is 8. It can be changed with the argument "--thread" or "-t":
$ json2bin -i src/sample.jsonl -t 4
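To give a rough idea of what a thread count means for line-oriented input, here is a minimal sketch that splits lines into one chunk per worker using only the standard library. It is not json2bin's actual implementation: `process_line` and `process_parallel` are hypothetical names, and counting whitespace-separated words stands in for the real per-line work (parsing JSON and tokenizing).

```rust
use std::thread;

/// Hypothetical stand-in for per-line work; a real converter would parse the
/// JSON line and tokenize its text here.
fn process_line(line: &str) -> usize {
    line.split_whitespace().count()
}

/// Split `lines` into roughly equal chunks, one per worker thread, and return
/// the per-line results in the original order.
fn process_parallel(lines: &[&str], threads: usize) -> Vec<usize> {
    if lines.is_empty() {
        return Vec::new();
    }
    let threads = threads.max(1);
    // Ceiling division so that every line lands in some chunk.
    let chunk_size = (lines.len() + threads - 1) / threads;
    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = lines
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().map(|l| process_line(l)).collect::<Vec<usize>>()
                })
            })
            .collect();
        // Join in spawn order, which preserves the input order of the lines.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Scoped threads let the workers borrow the input slice directly, so no line is copied before processing. Joining the handles in spawn order keeps the results aligned with the input lines.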
Performance comparison
We converted the 19GB English Wikipedia (20231101.en) from jsonl format to binidx format on an Apple M2 machine.
The Rust json2bin ran with 7 threads and was 70 times faster than the Python json2binidx: