v0.4.5

What's Changed

Support datasets 4 by @lhoestq in #689

Full Changelog: v0.4.4...v0.4.5

v0.4.4

Bug fixes

support jiwer 4.0 by @lhoestq in #685
Fix Perplexity Score For Tokenizers without bos_token_id by @kylehowells in #682
Fix size attribute error for precision/recall/f1 by @Maxwell-Jia in #656

Other changes

Add required hf_token secret to build main documentation by @albertvillanova in #635
Pin numpy<2 as required by tensorflow to fix doc building by @albertvillanova in #631
Support nltk>=3.9 to fix vulnerability by @albertvillanova in #629
add tip in docs and readme referring to lighteval by @MoritzLaurer in #618

New Contributors

@MoritzLaurer made their first contribution in #618
@Maxwell-Jia made their first contribution in #656
@kylehowells made their first contribution in #682

Full Changelog: v0.4.3...v0.4.4

0.4.3

This release adds support for datasets>=3.0 by removing calls to deprecated code

What's Changed

Fix CI with temporary pin nltk<3.9 by @albertvillanova in #623
Replace deprecated use_auth_token with token by @albertvillanova in #621
remove ignore_url_params by @lhoestq in #624

Full Changelog: v0.4.2...v0.4.3

v0.4.2

What's Changed

Update the documentation and citation of mauve by @krishnap25 in #416
Remove unused dependency by @daskol in #507
Add confusion matrix by @osanseviero in #528
Update python to 3.8 by @qubvel in #571
Fix FileFreeLock by @lhoestq in #578
Fix example doc in load function by @alexrs in #575
Speeding up mean_iou metric computation by @qubvel in #569

New Contributors

@rtrompier made their first contribution in #510
@daskol made their first contribution in #507
@qubvel made their first contribution in #571
@alexrs made their first contribution in #575

Full Changelog: v0.4.1...v0.4.2

v0.4.1

What's Changed

Add code example to docstrings by @stevhliu in #374
[Minor fix] Typo by @cakiki in #403
[Docs] fixed a typo in bertscore readme by @hazrulakmal in #386
Add max_length kwarg to docstring of Perplexity measurement by @kdutia in #411
Fix minor typo in a_quick_tour.mdx by @tupini07 in #417
Fix Docs base_evaluator.mdx by @jorahn in #418
Update Gradio description to clarify text-based input by @BramVanroy in #427
fix add method by @hazrulakmal in #424
Fix broken link in docs/a_quick_tour.mdx by @tupini07 in #419
resolve #379 audio classification evaluator + docs by @Plutone11011 in #405
fixed kwargs not being passed in combine by @Plutone11011 in #425
add r^2 metric by @TKaanKoc in #407
Update spaces gradio version to 3.19.1 by @BramVanroy in #426
replace evaluate DownloadConfig with datasets by @lvwerra in #447
Render Text2TextGenerationEvaluators' docstring examples by @mariosasko in #463
Trigger CI on ci-* branches by @Wauplin in #467
Update comet by @ricardorei in #443
Fix datasets import in Meteor metric by @mariosasko in #490
fix scikit-learn package name suggestion by @bzz in #498
Release: 0.4.1 by @lhoestq in #505

New Contributors

@cakiki made their first contribution in #403
@hazrulakmal made their first contribution in #386
@kdutia made their first contribution in #411
@tupini07 made their first contribution in #417
@jorahn made their first contribution in #418
@Plutone11011 made their first contribution in #405
@TKaanKoc made their first contribution in #407
@mariosasko made their first contribution in #463
@Wauplin made their first contribution in #467
@ricardorei made their first contribution in #443
@bzz made their first contribution in #498
@lhoestq made their first contribution in #505

Full Changelog: v0.4.0...v0.4.1

v0.4.0

What's Changed

add trainer integration docs by @lvwerra in #325
Stop using model-defined truncation in perplexity calculation by @mathemakitten in #333
Don't use eval for Evaluator instances in the doc by @fxmarty in #341
fix caching by @lvwerra in #336
Fix #327 set default row of gradio webui to 1 and drop empty/blank row by @Raibows in #335
Update pr docs actions by @mishig25 in #344
Fix scikit-learn install in spaces by @lvwerra in #345
added MASE, sMAPE and MAPE metrics by @kashif in #330
fix sklearn dependency in mape, mase and smape by @lvwerra in #346
Update link text by @stevhliu in #360
Corrected range of MAE by @clefourrier in #359
Revert "Update pr docs actions" by @mishig25 in #363
Evaluation suite by @mathemakitten in #337
Matthews correlation coefficient by @sanderland in #362
fix tf version by @lvwerra in #372
Add TextGeneration Evaluator by @NimaBoscarino in #350
Fix typo in rouge types by @davebulaval in #364
Add Evaluate usage for scikit-learn by @awinml in #368
Adding metric visualization by @sashavor in #342
Add NIST metric by @BramVanroy in #250
add GitHub Actions CI by @lvwerra in #375
Add Evaluate Usage for Keras and Tensorflow by @arjunpatel7 in #370
fix version by @lvwerra in #380
CharacTER: MT metric by @BramVanroy in #286
CharCut: another character-based MT evaluation metric by @BramVanroy in #290
asr model evaluator addition + doc by @bayartsogt-ya in #378
Docs for EvaluationSuite by @mathemakitten in #340
Update the documentation of Mauve by @krishnap25 in #377
fix-ci-badge by @lvwerra in #385

New Contributors

@Raibows made their first contribution in #335
@kashif made their first contribution in #330
@clefourrier made their first contribution in #359
@davebulaval made their first contribution in #364
@awinml made their first contribution in #368
@arjunpatel7 made their first contribution in #370
@bayartsogt-ya made their first contribution in #378
@krishnap25 made their first contribution in #377

Full Changelog: v0.3.0...v0.4.0

v0.3.0

What's Changed

add multilabel f1 eval usage by @fcakyon in #221
Force get_supported_tasks() to return a list instead of dict keys by @mathemakitten in #227
Unpin rouge_score by @albertvillanova in #220
Remove import statement in Measurement Card by @meg-huggingface in #231
make rouge support multi-ref by @lvwerra in #229
Fix enforce string by @lvwerra in #230
Fix examples in perplexity measurement docs by @mathemakitten in #238
Add Wilcoxon's signed rank test by @douwekiela in #237
Add support for two input columns for TextClassificationEvaluator by @fxmarty in #205
fix bug in TEMPLATE_REQUIRE: add comma by @BramVanroy in #248
Minor quicktour doc suggestions by @stevhliu in #236
Clarify error message for ChrF no. references by @BramVanroy in #247
only track unique missing dependencies by @BramVanroy in #246
Update evaluate in spaces by @lvwerra in #228
add commit_hash to args by @lvwerra in #253
Change perplexity to be calculated with base e by @mathemakitten in #242
Rebase for previous PR by @mathemakitten in #254
Fix docstrings with new perplexities with base e by @mathemakitten in #255
add a tokenizer option to rouge by @lvwerra in #258
Adding list_duplicates=True to example. by @meg-huggingface in #263
Minor change in describing what this does. by @meg-huggingface in #267
Mapping example output to returned output. by @meg-huggingface in #268
Changes "duplicates_list" to "duplicates_dict" (since it's dict) by @meg-huggingface in #265
Changes "duplicates_list" to "duplicates_dict" in the example. by @meg-huggingface in #264
Add slow flag to two column parity test by @lvwerra in #273
Remove handle_impossible_answer from the default PIPELINE_KWARGS in the question answering evaluator by @fxmarty in #272
Toxicity Measurement by @sashavor in #262
Automatically choose dataset split if none provided by @mathemakitten in #232
Fix YAML in Toxicity by @lvwerra in #278
Added metric Brier Score by @kadirnar in #275
Check for mismatch in device setup in evaluator by @mathemakitten in #287
Fix transfomers import in the evaluator by @mathemakitten in #291
Add support for name field when loading data by @mathemakitten in #283
Adding regard measurement by @sashavor in #271
Raise exception instead of assert in BertScore by @BramVanroy in #292
fix regard yaml by @lvwerra in #295
Add CONTRIBUTING.md by @mathemakitten in #293
Refactor kwargs and configs by @lvwerra in #188
Revert "Refactor kwargs and configs" by @lvwerra in #299
Add missing split and subset kwarg into other evaluators by @mathemakitten in #301
Adding HONEST score by @sashavor in #279
fix wrong sorting in check by @sanderland in #305
Fix HONEST yaml by @lvwerra in #303
Refactor current_features to selected_feature_format by @mathemakitten in #306
replace datasets list with local list of tasks by @lvwerra in #309
Adding torch to the requirements by @sashavor in #311
Honest space fix by @sashavor in #312
Use HTML relative paths for tiles by @lewtun in #318
Test for valid YAML files by @mathemakitten in #308
add versioning the HubEvaluationModuleFactory by @lvwerra in #314
Add text2text evaluator by @lvwerra in #261
try main if tag does not work by @lvwerra in #322

New Contributors

@fcakyon made their first contribution in #221
@meg-huggingface made their first contribution in #231
@stevhliu made their first contribution in #236
@kadirnar made their first contribution in #275
@sanderland made their first contribution in #305

Full Changelog: v0.2.2...v0.3.0

v0.2.2

What's Changed

Update CLI docs by @lvwerra in #218
Add a fingerprint for each EvaluationModule by @mathemakitten in #206
Fix loading error by @lvwerra in #222

Full Changelog: v0.2.1...v0.2.2

v0.2.1

What's Changed

Add measurements to quality and style checks by @lvwerra in #203
Add comparisons and measurements to code quality tests by @lvwerra in #204
Remove mention to datasets from docs by @albertvillanova in #207
Adding label distribution measurement by @sashavor in #202
Fix spaces tagging by @lvwerra in #217
set datasets to >=2.0.0 by @lvwerra in #216

Full Changelog: v0.2.0...v0.2.1

v0.2.0

What's New

`evaluator`

The evaluator has been extended to three new tasks:

"image-classification"
"token-classification"
"question-answering"
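
As a rough illustration of one of the new tasks, the sketch below runs the question-answering evaluator on a small slice of SQuAD. It assumes a recent `evaluate` release with `datasets` and `transformers` installed; the helper name `evaluate_qa_model`, the model checkpoint, and the slice size are illustrative choices and do not come from these release notes.

```python
def evaluate_qa_model(model_name="distilbert-base-uncased-distilled-squad", n_examples=100):
    """Score a question-answering model on a slice of SQuAD (hypothetical helper)."""
    # Imported lazily so that defining this helper needs no heavy dependencies.
    from datasets import load_dataset
    from evaluate import evaluator

    # "question-answering" is one of the three task names listed above.
    task_evaluator = evaluator("question-answering")

    # Assumption: the SQuAD validation split is a suitable test set for the model.
    data = load_dataset("squad", split=f"validation[:{n_examples}]")

    # The evaluator builds the pipeline, runs inference, and computes the metric.
    return task_evaluator.compute(
        model_or_pipeline=model_name,
        data=data,
        metric="squad",
    )
```

Calling `evaluate_qa_model()` downloads the model and dataset, so it needs network access and may take a while on CPU.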

`combine`

With `combine`, one can bundle several metrics into a single object that can be evaluated in one call and can also be used together with the `evaluator`.
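
A minimal sketch of bundling metrics with `combine` might look like the following. The helper name `classification_report` and the choice of four classification metrics are assumptions made for illustration; loading the metrics needs network access and `scikit-learn`.

```python
def classification_report(predictions, references):
    """Compute several classification metrics in one call (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require the library.
    import evaluate

    # Bundle four metrics into a single combined evaluation object.
    clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

    # A single compute() call evaluates every bundled metric.
    return clf_metrics.compute(predictions=predictions, references=references)
```

For example, `classification_report([0, 1, 0], [0, 1, 1])` returns a dictionary with one entry per bundled metric.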

What's Changed

Fix typo in WER docs by @pn11 in #147
Fix rouge outputs by @lvwerra in #158
add tutorial for custom pipeline by @lvwerra in #154
refactor evaluator tests by @lvwerra in #155
rename input_texts to predictions in perplexity by @lvwerra in #157
Add link to GitHub author by @lewtun in #166
Add combine to compose multiple evaluations by @lvwerra in #150
test string casting only on first element by @lvwerra in #159
remove unused fixtures from unittests by @lvwerra in #170
Add a test to check that Evaluator evaluations match transformers examples by @fxmarty in #163
Add smaller model for TextClassificationEvaluator test by @fxmarty in #172
Add tags to spaces by @lvwerra in #162
Rename evaluation modules by @lvwerra in #160
Update push_evaluations_to_hub.py by @lvwerra in #174
update evaluate dependency for spaces by @lvwerra in #175
Add ImageClassificationEvaluator by @fxmarty in #173
attempting to let meteor handle multiple references per prediction by @sashavor in #164
fixed duplicate calculation of spearmanr function in metrics wrapper. by @benlipkin in #176
forbid hyphens in template for module names by @lvwerra in #177
switch from Github to Hub module factory for canonical modules by @lvwerra in #180
Fix bertscore idf by @lvwerra in #183
refactor evaluator base and task classes by @lvwerra in #185
Avoid importing tensorflow when importing evaluate by @NouamaneTazi in #135
Add QuestionAnsweringEvaluator by @fxmarty in #179
Evaluator perf by @ola13 in #178
Fix QuestionAnsweringEvaluator for squad v2, fix examples by @fxmarty in #190
Rename perf metric evaluator by @lvwerra in #191
Fix typos in QA Evaluator by @lewtun in #192
Evaluator device placement by @lvwerra in #193
Change test command in installation.mdx to use exact_match by @mathemakitten in #194
Add TokenClassificationEvaluator by @fxmarty in #167
Pin rouge_score by @albertvillanova in #197
add poseval by @lvwerra in #195
Combine docs by @lvwerra in #201
Evaluator column loading by @lvwerra in #200
Evaluator documentation by @lvwerra in #199

New Contributors

@pn11 made their first contribution in #147
@fxmarty made their first contribution in #163
@benlipkin made their first contribution in #176
@NouamaneTazi made their first contribution in #135
@mathemakitten made their first contribution in #194

Full Changelog: v0.1.2...v0.2.0

Releases: huggingface/evaluate
