Releases: huggingface/datasets
4.0.0
b0de7a8
New Features
- Add `IterableDataset.push_to_hub()` by @lhoestq in #7595

  ```python
  # Build streaming data pipelines in a few lines of code!
  from datasets import load_dataset

  ds = load_dataset(..., streaming=True)
  ds = ds.map(...).filter(...)
  ds.push_to_hub(...)
  ```
- Add `num_proc=` to `push_to_hub()` (Dataset and IterableDataset) by @lhoestq in #7606

  ```python
  # Faster push to Hub! Available for both Dataset and IterableDataset
  ds.push_to_hub(..., num_proc=8)
  ```
- New `Column` object
  - Implementation of iteration over values of a column in an IterableDataset object by @TopCoder2K in #7564
  - Lazy column by @lhoestq in #7614

  ```python
  # Syntax:
  ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

  # Iterate on a column:
  for text in ds["text"]:
      ...

  # Load one cell without bringing the full column in memory
  first_text = ds["text"][0]  # equivalent to ds[0]["text"]
  ```
- Torchcodec decoding by @TyTodd in #7616
  - Enables streaming only the ranges you need!

  ```python
  # Don't download full audios/videos when it's not necessary.
  # With torchcodec, only the required ranges/frames are streamed:
  from datasets import load_dataset

  ds = load_dataset(..., streaming=True)
  for example in ds:
      video = example["video"]
      frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
  ```

  - Requires `torch>=2.7.0` and FFmpeg >= 4
  - Not available for Windows yet, but it is coming soon; in the meantime please use `datasets<4.0`
  - Load audio data with `AudioDecoder`:

  ```python
  audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
  samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
  samples.data         # tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00, ...,  2.3447e-06, -1.9127e-04, -5.3330e-05]])
  samples.sample_rate  # 16000

  # old syntax is still supported
  array, sr = audio["array"], audio["sampling_rate"]
  ```

  - Load video data with `VideoDecoder`:

  ```python
  video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
  first_frame = video.get_frame_at(0)
  first_frame.data.shape   # (3, 240, 320)
  first_frame.pts_seconds  # 0.0
  frames = video.get_frames_in_range(0, 6, 1)
  frames.data.shape        # torch.Size([5, 3, 240, 320])
  ```
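The lazy `Column` access introduced in this release can be sketched in plain Python. `LazyColumn` below is a hypothetical stand-in for `datasets.Column`/`IterableColumn`, assuming rows stored as a list of dicts; the real objects are backed by Arrow data, not Python lists.

```python
# Minimal sketch of a lazy column over rows stored as a list of dicts.
# LazyColumn is a hypothetical stand-in, not the real datasets.Column API.
class LazyColumn:
    def __init__(self, rows, name):
        self._rows = rows  # not copied: the column stays a view on the rows
        self._name = name

    def __iter__(self):
        # stream values one by one instead of materializing the column
        for row in self._rows:
            yield row[self._name]

    def __getitem__(self, i):
        # load a single cell without touching the rest of the column
        return self._rows[i][self._name]

rows = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
col = LazyColumn(rows, "text")
col[0]     # "a", equivalent to rows[0]["text"]
list(col)  # ["a", "b", "c"]
```

The point of the design is that indexing or iterating a column never copies the other columns or the rest of the rows.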
Breaking changes
- Remove scripts altogether by @lhoestq in #7592
  - `trust_remote_code` is no longer supported
- Torchcodec decoding by @TyTodd in #7616
  - torchcodec replaces soundfile for audio decoding
  - torchcodec replaces decord for video decoding
- Replace Sequence by List by @lhoestq in #7634
  - Introduction of the `List` type

  ```python
  from datasets import Features, List, Value

  features = Features({
      "texts": List(Value("string")),
      "four_paragraphs": List(Value("string"), length=4),
  })
  ```

  - `Sequence` was a legacy type from TensorFlow Datasets which converted lists of dicts to dicts of lists. It is no longer a type but becomes a utility that returns a `List` or a `dict` depending on the subfeature:

  ```python
  from datasets import Sequence

  Sequence(Value("string"))             # List(Value("string"))
  Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
  ```
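The `Sequence`-to-`List` conversion described above can be sketched in plain Python. `List` here is a simplified stand-in for `datasets.List` (the real feature classes carry more state), and `sequence` mimics the utility's dispatch on the subfeature:

```python
# Simplified stand-in for datasets.List, for illustration only.
class List:
    def __init__(self, feature, length=-1):
        self.feature = feature
        self.length = length

def sequence(feature, length=-1):
    """Sketch of the Sequence utility: a dict subfeature becomes a dict of
    Lists; any other subfeature becomes a single List."""
    if isinstance(feature, dict):
        return {key: List(sub, length) for key, sub in feature.items()}
    return List(feature, length)

sequence("string")             # a List wrapping "string"
sequence({"texts": "string"})  # a dict: {"texts": List wrapping "string"}
```

This mirrors the legacy behavior: dicts of lists stay dicts, everything else becomes a plain list feature.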
Other improvements and bug fixes
- Refactor `Dataset.map` to reuse cache files mapped with a different `num_proc` by @ringohoffman in #7434
- fix string_to_dict test by @lhoestq in #7571
- Preserve formatting in concatenated IterableDataset by @francescorubbo in #7522
- Fix typos in PDF and Video documentation by @AndreaFrancis in #7579
- fix: Add embed_storage in Pdf feature by @AndreaFrancis in #7582
- load_dataset splits typing by @lhoestq in #7587
- Fixed typos by @TopCoder2K in #7572
- Fix regex library warnings by @emmanuel-ferdman in #7576
- [MINOR:TYPO] Update save_to_disk docstring by @cakiki in #7575
- Add missing property on `RepeatExamplesIterable` by @SilvanCodes in #7581
- Avoid multiple default config names by @albertvillanova in #7585
- Fix broken link to albumentations by @ternaus in #7593
- fix string_to_dict usage for windows by @lhoestq in #7598
- No TF in win tests by @lhoestq in #7603
- Docs and more methods for IterableDataset: push_to_hub, to_parquet... by @lhoestq in #7604
- Tests typing and fixes for push_to_hub by @lhoestq in #7608
- fix parallel push_to_hub in dataset_dict by @lhoestq in #7613
- remove unused code by @lhoestq in #7615
- Update `_dill.py` to use `co_linetable` for Python 3.10+ in place of `co_lnotab` by @qgallouedec in #7609
- Fixes in docs by @lhoestq in #7620
- Add albumentations to use dataset by @ternaus in #7596
- minor docs data aug by @lhoestq in #7621
- fix: raise error in FolderBasedBuilder when data_dir and data_files are missing by @ArjunJagdale in #7623
- fix save_infos by @lhoestq in #7639
- better features repr by @lhoestq in #7640
- update docs and docstrings by @lhoestq in #7641
- fix length for ci by @lhoestq in #7642
- Backward compat sequence instance by @lhoestq in #7643
- fix sequence ci by @lhoestq in #7644
- Custom metadata filenames by @lhoestq in #7663
- Update the beans dataset link in Preprocess by @HJassar in #7659
- Backward compat list feature by @lhoestq in #7666
- Fix infer list of images by @lhoestq in #7667
- Fix audio bytes by @lhoestq in #7670
- Fix double sequence by @lhoestq in #7672
New Contributors
- @TopCoder2K made their first contribution in #7564
- @francescorubbo made their first contribution in #7522
- @emmanuel-ferdman made their first contribution in #7576
- @SilvanCodes made their first contribution in #7581
- @ternaus made their first contribution in #7593
- @ArjunJagdale made their first contribution in #7623
- @TyTodd made their first contribution in #7616
- @HJassar made their first contribution in #7659
Full Changelog: 3.6.0...4.0.0
3.6.0
458f45a
Dataset Features
- Enable xet in push to hub by @lhoestq in #7552
- Faster downloads/uploads with Xet storage
- more info: #7526
Other improvements and bug fixes
- Add try_original_type to DatasetDict.map by @yoshitomo-matsubara in #7544
- Avoid global umask for setting file mode. by @ryan-clancy in #7547
- Rebatch arrow iterables before formatted iterable by @lhoestq in #7553
- Document the HF_DATASETS_CACHE environment variable in the datasets cache documentation by @Harry-Yang0518 in #7532
- fix regression by @lhoestq in #7558
- fix: Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames (#7517) by @giraffacarp in #7521
- Remove `aiohttp` from direct dependencies by @akx in #7294
New Contributors
- @ryan-clancy made their first contribution in #7547
- @Harry-Yang0518 made their first contribution in #7532
- @giraffacarp made their first contribution in #7521
- @akx made their first contribution in #7294
Full Changelog: 3.5.1...3.6.0
3.5.1
2e94045
Bug fixes
- support pyarrow 20 by @lhoestq in #7540
  - Fix pyarrow error `TypeError: ArrayExtensionArray.to_pylist() got an unexpected keyword argument 'maps_as_pydicts'`
- Write pdf in map by @lhoestq in #7487
Other improvements
- update fsspec 2025.3.0 by @peteski22 in #7478
- Support underscore int read instruction by @lhoestq in #7488
- Support skip_trying_type by @yoshitomo-matsubara in #7483
- pdf docs fixes by @lhoestq in #7519
- Remove conditions for Python < 3.9 by @cyyever in #7474
- mention av in video docs by @lhoestq in #7523
- correct use with polars example by @SiQube in #7524
- chore: fix typos by @afuetterer in #7436
New Contributors
- @peteski22 made their first contribution in #7478
- @yoshitomo-matsubara made their first contribution in #7483
- @SiQube made their first contribution in #7524
- @afuetterer made their first contribution in #7436
Full Changelog: 3.5.0...3.5.1
3.5.0
0b5998a
Datasets Features
- Introduce PDF support (#7318) by @yabramuvdi in #7325

  ```python
  >>> from datasets import load_dataset, Pdf
  >>> repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
  >>> dataset = load_dataset(repo, split="train")
  >>> dataset[0]["pdf"]
  <pdfplumber.pdf.PDF at 0x1075bc320>
  >>> dataset[0]["pdf"].pages[0].extract_text()
  ...
  ```
What's Changed
- Fix local pdf loading by @lhoestq in #7466
- Minor fix for metadata files in extension counter by @lhoestq in #7464
- Priotitize json by @lhoestq in #7476
New Contributors
- @yabramuvdi made their first contribution in #7325
Full Changelog: 3.4.1...3.5.0
3.4.1
f742152
3.4.0
14fb15a
Dataset Features
- Faster folder based builder + parquet support + allow repeated media + use torchvision by @lhoestq in #7424
  - /!\ Breaking change: we replaced `decord` with `torchvision` to read videos, since `decord` is no longer maintained and isn't available for recent Python versions; see the video dataset loading documentation for more details. The `Video` type is still marked as experimental in this version.

  ```python
  from datasets import load_dataset, Video

  dataset = load_dataset("path/to/video/folder", split="train")
  dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
  ```

  - faster streaming for image/audio/video folder from Hugging Face
  - support for `metadata.parquet` in addition to `metadata.csv` or `metadata.jsonl` for the metadata of the image/audio/video files
- Add IterableDataset.decode with multithreading by @lhoestq in #7450
  - even faster streaming for image/audio/video folder from Hugging Face if you enable multithreading to decode image/audio/video data:

  ```python
  dataset = dataset.decode(num_threads=num_threads)
  ```
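The multithreaded decoding above can be sketched with a thread pool. `decode_cell` is a hypothetical placeholder for the actual image/audio/video decoder; threads help in practice because the underlying C decoding libraries release the GIL:

```python
# Hedged sketch of multithreaded media decoding, in the spirit of
# IterableDataset.decode(num_threads=...). Not the library's actual code.
from concurrent.futures import ThreadPoolExecutor

def decode_batch(raw_cells, decode_cell, num_threads=4):
    # Decode up to num_threads cells concurrently, preserving input order.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(decode_cell, raw_cells))

decode_batch([b"a", b"b"], bytes.upper, num_threads=2)  # [b"A", b"B"]
```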
General improvements and bug fixes
- fix: None default with bool type on load creates typing error by @stephantul in #7426
- Use pyupgrade --py39-plus by @cyyever in #7428
- Refactor `string_to_dict` to return `None` if there is no match instead of raising `ValueError` by @ringohoffman in #7435
- Fix small bugs with async map by @lhoestq in #7445
- Fix resuming after `ds.set_epoch(new_epoch)` by @lhoestq in #7451
- minor docs changes by @lhoestq in #7452
New Contributors
- @stephantul made their first contribution in #7426
- @cyyever made their first contribution in #7428
- @jp1924 made their first contribution in #7368
Full Changelog: 3.3.2...3.4.0
3.3.2
b37230c
Bug fixes
- Attempt to fix multiprocessing hang by closing and joining the pool before termination by @dakinggg in #7411
- Gracefully cancel async tasks by @lhoestq in #7414
Other general improvements
- Update use_with_pandas.mdx: to_pandas() correction in last section by @ibarrien in #7407
- Fix a typo in arrow_dataset.py by @jingedawang in #7402
New Contributors
- @dakinggg made their first contribution in #7411
- @ibarrien made their first contribution in #7407
- @jingedawang made their first contribution in #7402
Full Changelog: 3.3.1...3.3.2
3.3.1
4ead6ec
3.3.0
e9dae36
Dataset Features
- Support async functions in map() by @lhoestq in #7384
  - Especially useful to download content like images or call inference APIs

  ```python
  prompt = "Answer the following question: {question}. You should think step by step."

  async def ask_llm(example):
      return await query_model(prompt.format(question=example["question"]))

  ds = ds.map(ask_llm)
  ```
- Add repeat method to datasets by @alex-hh in #7198

  ```python
  ds = ds.repeat(10)
  ```
- Support faster processing using pandas or polars functions in `IterableDataset.map()` by @lhoestq in #7370
  - Add support for "pandas" and "polars" formats in IterableDatasets
  - This enables optimized data processing using pandas or polars functions with zero-copy, e.g.

  ```python
  import polars as pl
  from datasets import load_dataset

  ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
  ds = ds.with_format("polars")
  expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
  ds = ds.map(lambda df: df.with_columns(expr), batched=True)
  ```
- Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in #7207
  - IterableDatasets with "numpy" format are now much faster
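How async functions in `map()` can overlap their awaits may be sketched with asyncio. This is illustrative only, with a made-up `max_concurrency` knob; the real scheduler in `datasets` is more involved:

```python
# Hedged sketch: run an async per-example function with bounded concurrency,
# preserving input order. Not the library's actual implementation.
import asyncio

async def map_async(fn, examples, max_concurrency=8):
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(example):
        async with sem:
            return await fn(example)

    # gather preserves the order of its arguments
    return await asyncio.gather(*(run_one(ex) for ex in examples))

async def double(example):
    await asyncio.sleep(0)  # stand-in for an API call
    return {"value": example["value"] * 2}

results = asyncio.run(map_async(double, [{"value": 1}, {"value": 2}]))
# [{'value': 2}, {'value': 4}]
```

The benefit over a synchronous `map` is that slow I/O (downloads, inference calls) for many examples proceeds concurrently instead of serially.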
What's Changed
- don't import soundfile in tests by @lhoestq in #7340
- minor video docs on how to install by @lhoestq in #7341
- Fix typo in arrow_dataset by @AndreaFrancis in #7328
- remove filecheck to enable symlinks by @fschlatt in #7133
- Webdataset special columns in last position by @lhoestq in #7349
- Bump hfh to 0.24 to fix ci by @lhoestq in #7350
- fsspec 2024.12.0 by @lhoestq in #7352
- changes to MappedExamplesIterable to resolve #7345 by @vttrifonov in #7353
- Catch OSError for arrow by @lhoestq in #7348
- Remove .h5 from imagefolder extensions by @lhoestq in #7374
- Add Pandas, PyArrow and Polars docs by @lhoestq in #7382
- Optimized sequence encoding for scalars by @lukasgd in #7393
- Update docs by @lhoestq in #7395
- Update README.md by @lhoestq in #7396
- Release: 3.3.0 by @lhoestq in #7398
New Contributors
- @AndreaFrancis made their first contribution in #7328
- @vttrifonov made their first contribution in #7353
- @lukasgd made their first contribution in #7393
Full Changelog: 3.2.0...3.3.0
3.2.0
fba4758
Dataset Features
- Faster parquet streaming + filters with predicate pushdown by @lhoestq in #7309
  - Up to +100% streaming speed
  - Fast filtering via predicate pushdown (skip files/row groups based on the predicate instead of downloading the full data), e.g.

  ```python
  from datasets import load_dataset

  filters = [("date", ">=", "2023")]
  ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
  ```
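The row-group skipping behind predicate pushdown can be sketched in plain Python: a row group is skipped when its min/max statistics cannot satisfy the filters. The stats layout and column values below are made up for illustration:

```python
# Hedged sketch of predicate pushdown over parquet row-group statistics.
def matching_row_groups(row_group_stats, filters):
    """row_group_stats: per row group, a dict {column: (min, max)}.
    filters: [(column, op, value), ...], all ANDed, as in load_dataset."""
    def may_match(stats, column, op, value):
        lo, hi = stats[column]
        if op == ">=":
            return hi >= value
        if op == "<=":
            return lo <= value
        if op == "==":
            return lo <= value <= hi
        return True  # unknown op: keep the group to stay correct

    return [
        i for i, stats in enumerate(row_group_stats)
        if all(may_match(stats, c, op, v) for c, op, v in filters)
    ]

stats = [
    {"date": ("2021", "2022")},  # max < "2023": skipped entirely
    {"date": ("2022", "2024")},  # may contain matching rows: downloaded
]
matching_row_groups(stats, [("date", ">=", "2023")])  # [1]
```

Note the asymmetry: statistics can only prove a group contains *no* matches, so kept groups still need row-level filtering after download.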
Other improvements and bug fixes
- fix conda release worlflow by @lhoestq in #7272
- Add link to video dataset by @NielsRogge in #7277
- Raise error for incorrect JSON serialization by @varadhbhatnagar in #7273
- support for custom feature encoding/decoding by @alex-hh in #7284
- update load_dataset doctring by @lhoestq in #7301
- Let server decide default repo visibility by @Wauplin in #7302
- fix: update elasticsearch version by @ruidazeng in #7300
- Fix typing in iterable_dataset.py by @lhoestq in #7304
- Updated inconsistent output in documentation examples for `ClassLabel` by @sergiopaniego in #7293
- More docs to from_dict to mention that the result lives in RAM by @lhoestq in #7316
- Release: 3.2.0 by @lhoestq in #7317
New Contributors
- @ruidazeng made their first contribution in #7300
- @sergiopaniego made their first contribution in #7293
Full Changelog: 3.1.0...3.2.0