v2.1.0 -- Supporting more ML tasks and data formats

@huiwengoh

v2.1.0 begins extending this library beyond standard classification tasks, taking initial steps toward the first tool that can detect label errors in data from any Supervised Learning task (leveraging any model trained for that task). This release is non-breaking when upgrading from v2.0.0.

Highlights of what’s new in 2.1.0:

Major new functionalities:

CROWDLAB algorithms for analysis of data labeled by multiple annotators — @huiwengoh, @ulya-tkch, @jwmueller
- Accurately infer the best consensus label for each example
- Estimate the quality of each consensus label (how likely it is to be correct)
- Estimate the overall quality of each annotator (how trustworthy their suggested labels are)
Out of Distribution Detection based on either:
- feature values/embeddings — @ulya-tkch, @jwmueller, @JohnsonKuan
- predicted class probabilities — @ulya-tkch
Label error detection for Token Classification tasks (NLP / text data) — @ericwang1997, @elisno
CleanLearning can now:
- Run on non-array data types, including pandas DataFrames, PyTorch/TensorFlow Dataset objects, and many other data formats. — @jwmueller
- Allow the base model’s fit() to utilize validation data in each fold during cross-validation (e.g. for early stopping or hyperparameter optimization). — @huiwengoh
- Train with custom sample weights for datapoints. — @rushic24, @jwmueller
- Utilize any Keras model (supporting both the Sequential and Functional APIs) via cleanlab’s KerasWrapperModel, which makes these models compatible with sklearn and tensorflow Datasets. — @huiwengoh, @jwmueller

Major improvements (in addition to too many bugfixes to name):

Reduced dependencies: scipy is no longer needed — @anishathalye
Clearer error/warning messages throughout package when data/inputs are strangely formatted — @cgnorthcutt, @jwmueller, @huiwengoh
FAQ section in tutorials with advice for commonly encountered issues — @huiwengoh, @ulya-tkch, @jwmueller, @cgnorthcutt
Many additional tutorial and example notebooks at docs.cleanlab.ai and https://github.com/cleanlab/examples — @ulya-tkch, @huiwengoh, @jwmueller, @ericwang1997
Static type annotations to ensure robust code — @anishathalye, @elisno

Examples of new workflows available in 2.1:

Out of Distribution and Outlier Detection

Detect out of distribution examples in a dataset based on its numeric feature embeddings

from cleanlab.outlier import OutOfDistribution
ood = OutOfDistribution()
# To get outlier scores for train_data using feature matrix train_feature_embeddings
ood_train_feature_scores = ood.fit_score(features=train_feature_embeddings)
# To get outlier scores for additional test_data using feature matrix test_feature_embeddings
ood_test_feature_scores = ood.score(features=test_feature_embeddings)
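
For intuition about feature-based scoring, here is a minimal pure-Python sketch of a nearest-neighbor distance score, the general idea suggested by the "KNN distance OOD scoring function" entry in the Change Log. It is not cleanlab's implementation: the helper knn_distance_scores and the toy embeddings are hypothetical, and the sketch returns raw average distances (larger means farther from the training data), whereas the scale and orientation of the scores returned by OutOfDistribution should be checked in the cleanlab documentation.

```python
import math

def knn_distance_scores(reference, queries, k=2):
    """Average Euclidean distance from each query point to its k nearest reference points.

    Hypothetical helper for illustration only; larger values suggest a point
    lies farther from the reference (training) data.
    """
    scores = []
    for q in queries:
        distances = sorted(math.dist(q, r) for r in reference)
        scores.append(sum(distances[:k]) / k)
    return scores

# Toy 2-D embeddings (made up for this sketch).
train_feature_embeddings = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
test_feature_embeddings = [[0.5, 0.5], [10.0, 10.0]]

test_distances = knn_distance_scores(train_feature_embeddings, test_feature_embeddings, k=2)

# Index of the test point that sits farthest from the training data.
most_atypical = max(range(len(test_distances)), key=test_distances.__getitem__)
```

In this toy data the second test point is far from every training point, so it receives the largest distance.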

Detect out of distribution examples in a dataset based on predicted class probabilities from a trained classifier

from cleanlab.outlier import OutOfDistribution
ood = OutOfDistribution()
# To get outlier scores for train_data using predicted class probabilities (from a trained classifier) and given class labels
ood_train_predictions_scores = ood.fit_score(pred_probs=train_pred_probs, labels=labels)
# To get outlier scores for additional test_data using predicted class probabilities
ood_test_predictions_scores = ood.score(pred_probs=test_pred_probs)
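
As a rough illustration of how predicted class probabilities can signal atypical examples, the sketch below computes the normalized entropy of each predicted distribution. This is one common uncertainty measure and is only an assumption about the general approach; it does not reproduce the scoring that OutOfDistribution performs (for instance, it ignores the given labels). The helper normalized_entropy and the toy probabilities are hypothetical.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of one predicted distribution, divided by log(K) so it lies in [0, 1].

    Hypothetical helper for illustration only; values near 1 mean the
    classifier is close to maximally uncertain about this example.
    """
    k = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(k)

# Toy predicted probabilities for two examples over K = 3 classes (made up).
test_pred_probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # near-uniform prediction
]

uncertainty = [normalized_entropy(p) for p in test_pred_probs]
```

The near-uniform prediction gets the higher uncertainty, which is the kind of example a probability-based detector would tend to treat as less typical.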

Multi-annotator -- support data with multiple labels

For data labeled by multiple annotators (stored as a matrix multiannotator_labels whose rows correspond to examples and whose columns hold each annotator’s chosen labels), cleanlab v2.1 can find improved consensus labels, score their quality, and assess annotators, all by leveraging predicted class probabilities pred_probs from any trained classifier:

from cleanlab.multiannotator import get_label_quality_multiannotator
get_label_quality_multiannotator(multiannotator_labels, pred_probs)
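
To make the layout of multiannotator_labels concrete, here is a small sketch that builds such a matrix and computes a plain majority-vote consensus. Majority vote is only a simple baseline shown for comparison: it ignores pred_probs and is not the CROWDLAB method. The helper majority_vote is hypothetical, and None is used as a stand-in for "this annotator did not label this example"; check the cleanlab documentation for the missing-value encoding it expects.

```python
from collections import Counter

# Rows are examples, columns are annotators.
# None marks an example that an annotator did not label (stand-in encoding for this sketch).
multiannotator_labels = [
    [0, 0, None],
    [1, None, 1],
    [0, 1, 1],
    [None, 2, 2],
]

def majority_vote(row):
    """Return (most common label, fraction of annotators who chose it) for one example.

    Hypothetical baseline helper; assumes every example has at least one label.
    """
    votes = [label for label in row if label is not None]
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

consensus = [majority_vote(row) for row in multiannotator_labels]
```

The agreement fraction is a crude quality signal: the third example, where annotators disagree, scores lower than the others. Combining such votes with classifier predictions is what lets a method improve on this baseline.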

Support Token Classification tasks

Cleanlab v2.1 can now find label issues in token classification (text) data, where each word in a sentence is labeled with one of K classes (e.g. entity recognition). This relies on three inputs:

- tokens: List of tokenized sentences whose ith element is a list of strings corresponding to tokens of the ith sentence in the dataset. Example: [..., ["I", "love", "cleanlab"], ...]
- labels: List whose ith element is a list of integers corresponding to class labels of each token in the ith sentence. Example: [..., [0, 0, 1], ...]
- pred_probs: List whose ith element is a np.ndarray of shape (N_i, K) corresponding to predicted class probabilities for each token in the ith sentence (assuming this sentence contains N_i tokens and the dataset has K possible classes). These should be out-of-sample pred_probs obtained from a token classification model via cross-validation. Example: [..., np.array([[0.8,0.2], [0.9,0.1], [0.3,0.7]]), ...]

Using these, you can easily find and display mislabeled tokens in your data

from cleanlab.token_classification.filter import find_label_issues
from cleanlab.token_classification.summary import display_issues
issues = find_label_issues(labels, pred_probs)
display_issues(issues, tokens, pred_probs=pred_probs, given_labels=labels,
               class_names=optional_list_of_ordered_class_names)
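
The sketch below shows how the three inputs line up and applies a deliberately naive check: flag any token whose given label receives a predicted probability below a threshold. This is not the algorithm behind find_label_issues. The helpers check_inputs and low_confidence_tokens, the 0.5 threshold, and the toy data are all hypothetical, and plain nested lists stand in for the np.ndarray entries of pred_probs so the example needs no dependencies.

```python
# Toy dataset with two sentences and K = 2 classes (made up for this sketch).
tokens = [["I", "love", "cleanlab"], ["Paris", "is", "nice"]]
labels = [[0, 0, 1], [0, 0, 0]]
pred_probs = [
    [[0.8, 0.2], [0.9, 0.1], [0.3, 0.7]],
    [[0.1, 0.9], [0.95, 0.05], [0.7, 0.3]],
]

def check_inputs(tokens, labels, pred_probs):
    """Verify the three inputs agree on sentence count, token counts, and number of classes K."""
    assert len(tokens) == len(labels) == len(pred_probs)
    num_classes = len(pred_probs[0][0])
    for sent_tokens, sent_labels, sent_probs in zip(tokens, labels, pred_probs):
        assert len(sent_tokens) == len(sent_labels) == len(sent_probs)
        assert all(len(row) == num_classes for row in sent_probs)
        assert all(0 <= label < num_classes for label in sent_labels)
    return num_classes

def low_confidence_tokens(labels, pred_probs, threshold=0.5):
    """Return (sentence index, token index) pairs whose given label has probability below threshold."""
    return [
        (i, j)
        for i, (sent_labels, sent_probs) in enumerate(zip(labels, pred_probs))
        for j, label in enumerate(sent_labels)
        if sent_probs[j][label] < threshold
    ]

num_classes = check_inputs(tokens, labels, pred_probs)
suspects = low_confidence_tokens(labels, pred_probs)
```

Here only "Paris" is flagged: it is labeled class 0, but the model assigns that class a probability of 0.1.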

Support pd.DataFrames, Keras/PyTorch/TF Datasets, Keras models, etc.

CleanLearning can now operate directly on non-array dataset formats like tensorflow/pytorch Datasets and use arbitrary Keras models:

import numpy as np
import tensorflow as tf
from cleanlab.classification import CleanLearning
from cleanlab.experimental.keras import KerasWrapperModel
dataset = tf.data.Dataset.from_tensor_slices((features_np_array, labels_np_array))  # example tensorflow dataset created from numpy arrays 
dataset = dataset.shuffle(buffer_size=len(features_np_array)).batch(32)
def make_model(num_features, num_classes):
    inputs = tf.keras.Input(shape=(num_features,))
    outputs = tf.keras.layers.Dense(num_classes)(inputs)
    return tf.keras.Model(inputs=inputs, outputs=outputs, name="my_keras_model")
model = KerasWrapperModel(make_model, model_kwargs={"num_features": features_np_array.shape[1], "num_classes": len(np.unique(labels_np_array))})
cl = CleanLearning(model)
cl.fit(dataset, labels_np_array)  # variant of model.fit() that is more robust to noisy labels
robust_predictions = cl.predict(dataset)  # equivalent to model.predict() after training on cleaner data

Change Log

Fix edgecase divide-by-0 in entropy-score by @jwmueller in #241
Fix some typos. by @Yulv-git in #242
Updated project urls in setup.py by @calebchiam in #249
FeatureReq #33: Added custom sample_weight by @rushic24 in #248
Allow users to pass custom weights for ensemble label quality scoring by @JohnsonKuan in #255
Fix line index of CleanLearning(), some text of links, etc. by @Yulv-git in #260
Copy the docs build artifacts to the "stable" folder by @weijinglok in #231
Add Negative Log Loss Weighting Scheme for Ensemble Label Quality Score by @JohnsonKuan in #267
Developed class that allow the use of cleanlab with tensorflow and huggingface models by @MattiaSangermano in #247
Add KNN distance OOD scoring function and unit tests by @JohnsonKuan in #268
Dataset documentation clarifications by @jwmueller in #270
Add issue templates by @anishathalye in #278
Fix bug. get thresholds broken for multi_label by @cgnorthcutt in #264
Clarify labels format by @cgnorthcutt in #282
Drop dependency on SciPy by @anishathalye in #286
Make CleanLearning work with pandas and other non-numpy feature objects X by @jwmueller in #285
Allow CleanLearning to use validation data in each fold by @huiwengoh in #295
Created FAQ Page in the Cleanlab documentation by @ulya-tkch in #294
Proper validation of labels values/format across package by @jwmueller in #301
Add static type checking by @anishathalye in #306
error for missing classes, consistency on determining num_classes by @jwmueller in #308
Added support to build KNN graph for OOD detection with only training data by @ulya-tkch in #305
Standardize naming on K, num_classes and N, num_examples by @huiwengoh in #312
Added outlier detection tutorial into docs by @ulya-tkch in #310
Updating tutorials hyperlink to 2.0.0 release by @aravindputrevu in #318
Allow KNN object to be returned by get_outlier_scores, Improved OOD tutorial by @jwmueller in #319
Some FAQ tips on how to improve CleanLearning by @jwmueller in #324
Updated tutorials to include quickstart by @ulya-tkch in #323
Add y argument as alternative to labels in CleanLearning.fit() by @elisno in #322
validation.py: Annotate function args and return values by @elisno in #317
Fixed package version issues for audio tutorial by @ulya-tkch in #325
Add compatibility for tensorflow and pytorch Dataset objects by @jwmueller in #311
Re-order find_label_issues args for better clarity by @jwmueller in #329
Comment on missing/rare classes in FAQ by @jwmueller in #332
update sphinx to v5 by @jwmueller in #327
Allow missing classes in get_label_quality_scores by @huiwengoh in #334
Allow missing classes in assert_valid_class_labels by @huiwengoh in #335
Changed all docstring instances of np.array to np.ndarray by @ulya-tkch in #336
Update Contributing.md with Projects link and getting started instructions by @jwmueller in #349
Switch docs links from latest release to stable by @elisno in #379
Extending cleanlab to find label errors in token classification datasets by @ericwang1997 in #347
Cleanlab functionality for multiannotator data by @huiwengoh in #333
Cleanup token classification code by @elisno in #390
Fix typing for find_label_issues by @elisno in #391
Match token/s in color_sentence by @elisno in #397
Escape special regex characters by @elisno in #404
Add FAQ question on how to get predicted labels by @jwmueller in #402
Implementing get_ood_scores function by @ulya-tkch in #338
Add termcolor dependency by @huiwengoh in #415
Add token classification tutorial notebook to docs.cleanlab.ai by @elisno in #411
Update examples links by @huiwengoh in #421
Polish multiannotator docs by @huiwengoh in #422
Text tutorial improvements by @jwmueller in #429
suppress tensorflow warning logs in tutorials if not properly installed by @jwmueller in #432
Add autodoc-typehints extension for sphinx by @elisno in #412
Strip input prompts when copying code snippets by @elisno in #439
Extend KerasWrapper to Functional API by @huiwengoh in #434
Deploy documentation for token classification module by @elisno in #438
Updated labels to allow array_like by @ulya-tkch in #426
Add keras wrapper to docs by @jwmueller in #443
Format all return docstrings and add typing by @jwmueller in #437
make num_label_issues = cj calibrated offdiag sum by @cgnorthcutt in #445
fix bug in hard-coded test. generalize the test by @cgnorthcutt in #448
Change output of display_issues by @elisno in #450
More improvements to token classification code and documentation by @jwmueller in #452
Fix details disclosure elements in docs by @anishathalye in #456
Add missing backticks and language annotation by @anishathalye in #461
Error handling for rare classes in multiannotator data by @huiwengoh in #455
Fix docs build in CI by @anishathalye in #462
Added support for returning ranked issue idxs by @ulya-tkch in #459
update readme for v2.1 by @jwmueller in #457
Clearer code examples on docs main page by @cgnorthcutt in #430

New Contributors

@Yulv-git made their first contribution in #242
@rushic24 made their first contribution in #248
@MattiaSangermano made their first contribution in #247
@ulya-tkch made their first contribution in #293
@huiwengoh made their first contribution in #295
@aravindputrevu made their first contribution in #318
@elisno made their first contribution in #322
@ericwang1997 made their first contribution in #340

Full Changelog: v2.0.0...v2.1.0
