This library provides an Apache Spark backend for joblib to distribute tasks on a Spark cluster.
Installation
joblibspark requires Python 3.6+, joblib>=0.14 and pyspark>=2.4 to run.
To install joblibspark, run:
pip install joblibspark
The installation does not install PySpark because for most users, PySpark is already installed.
If you do not have PySpark installed, you can install pyspark together with joblibspark:
pip install "pyspark>=3.0.0" joblibspark
If you want to use joblibspark with scikit-learn, please install scikit-learn>=0.21.
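For example, to distribute a scikit-learn cross-validation over a Spark cluster, register the Spark backend and run the work inside a parallel_backend('spark') block. A minimal sketch (the iris/SVC workload here is illustrative):

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score
from joblib import parallel_backend
from joblibspark import register_spark

register_spark()  # register the Spark backend with joblib

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
with parallel_backend('spark', n_jobs=3):
    # Each cross-validation fit is dispatched as a Spark task.
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)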
Note that joblibspark does not generally support running model inference and feature engineering in parallel.
For example:
from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=10)
with parallel_backend('spark', n_jobs=3):
    # This won't run in parallel on Spark; it will still run locally.
    h.transform(...)
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
with parallel_backend('spark', n_jobs=3):
    # This won't run in parallel on Spark; it will still run locally.
    regr.predict(diabetes_X_test)
Note: sklearn.ensemble.RandomForestClassifier has an n_jobs parameter, which means the algorithm supports parallel model training and inference. However, its inference implementation binds to joblib's built-in backends, so the Spark backend does not work in this case.
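To illustrate, a minimal sketch (X_train, y_train, and X_test are placeholder data):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, n_jobs=4)
clf.fit(X_train, y_train)
with parallel_backend('spark', n_jobs=3):
    # predict() pins its internal parallelism to joblib's built-in
    # backends, so this still runs locally, not on Spark.
    clf.predict(X_test)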