Fast on Machines
Dask is lightweight and runs your raw code on your machines without getting in the way. No virtualization or compilers.
As the Python stack matures, your code matures with it. Today Dask is 50% faster than Spark on standard benchmarks.
Dask DataFrames use pandas under the hood, so your current code likely just works. It’s faster than Spark and easier too.
import dask.dataframe as dd
df = dd.read_parquet("s3://data/uber/")
# How much did NYC pay Uber?
df.base_passenger_fare.sum().compute()
# And how much did drivers make?
df.driver_pay.sum().compute()
Parallelize your Python code, no matter how complex. Dask is flexible and supports arbitrary dependencies and fine-grained task scheduling.
from dask.distributed import Client
client = Client()
# Define your own code
def f(x):
    return x + 1
# Run your code in parallel
futures = client.map(f, range(100))
results = client.gather(futures)
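Futures can feed into new tasks too, which is what fine-grained scheduling buys you. A minimal sketch, continuing the snippet above: Dask runs add only once both of its inputs are ready.

def inc(x):
    return x + 1

def add(x, y):
    return x + y

# Futures are valid inputs to new tasks; Dask tracks the dependencies
a = client.submit(inc, 1)
b = client.submit(inc, 2)
total = client.submit(add, a, b)  # runs after a and b finish
total.result()  # 5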
Use Dask and NumPy/Xarray to churn through terabytes of multi-dimensional array data in formats like HDF, NetCDF, TIFF, or Zarr.
import xarray as xr
# Open image/array files natively
ds = xr.open_mfdataset("data/*.nc")
# Process across dimensions
ds.mean(dim=["lat", "lon"]).compute()
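The same chunked model works without Xarray through dask.array. A minimal sketch on synthetic data, so the shapes here are arbitrary:

import dask.array as da

# An 80 GB array split into manageable chunks
x = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))

# Reductions run chunk by chunk, in parallel
x.mean(axis=0).compute()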
Use Dask with common machine learning libraries to train or predict on large datasets, increasing model accuracy by using all of your data.
import xgboost as xgb
import dask.dataframe as dd
from dask.distributed import Client

client = Client()
df = dd.read_parquet("s3://my-data/")

# "features" and "target" are placeholder column selections
dtrain = xgb.dask.DaskDMatrix(client, df[features], df[target])
model = xgb.dask.train(
    client,
    {"tree_method": "hist", ...},
    dtrain,
    ...
)
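Prediction follows the same pattern. A sketch continuing the training example above (xgb.dask.train returns a dict that carries the trained booster, and features is the same placeholder):

# Distributed prediction, returned lazily as a Dask collection
predictions = xgb.dask.predict(client, model, df[features])
predictions.compute()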
# pandas: the familiar code, on one machine
import pandas as pd

df = pd.read_parquet("s3://mybucket/myfile.parquet")
df = df[df.value >= 0]
df.groupby("account")["value"].sum()

# Dask: the same code, in parallel across many files
import dask.dataframe as dd

df = dd.read_parquet("s3://mybucket/myfile.*.parquet")
df = df[df.value >= 0]
df.groupby("account")["value"].sum().compute()
Computers are cheap. Humans are expensive.
Fortunately, humans already know how to use Dask.
It’s just Python. It’s just pandas. It’s just NumPy.
Dask’s dashboard guides you towards efficiency, quickly teaching you to become a distributed computing expert.
Fast humans + Fast machines = Cheap Computing
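The dashboard ships with every client. A one-liner to find it; the port shown is the default:

from dask.distributed import Client

client = Client()
print(client.dashboard_link)  # typically http://127.0.0.1:8787/status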
Run Dask on your laptop (it’s trivial) or deploy it on any resource manager, like Kubernetes, HPC job schedulers, cloud SaaS services, or even legacy Hadoop/Spark clusters.
from dask.distributed import LocalCluster

cluster = LocalCluster(
    processes=False,
)
client = cluster.get_client()
# Use Dask locally
import dask.dataframe as dd
df = dd.read_parquet("/path/to/data.parquet")
df.value.mean().compute()

Run Dask in the cloud with open source Kubernetes, or with an easy SaaS solution. Coiled is free for individuals with modest use and easy for anyone with a cloud account.
from coiled import Cluster
cluster = Cluster(
    n_workers=100, region="us-east-2",
)
client = cluster.get_client()
# Use Dask on the cloud
import dask.dataframe as dd
df = dd.read_parquet("s3://data.*.parquet")
df.value.mean().compute()
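For the HPC schedulers mentioned above, the dask-jobqueue package follows the same pattern. A sketch for SLURM; the cores and memory values are placeholders for your site’s limits:

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=24, memory="100GB")
cluster.scale(jobs=10)  # ask SLURM for ten worker jobs
client = cluster.get_client()

# Use Dask on the HPC cluster as before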
$ conda install dask
$ pip install "dask[complete]"
Copyright © 2022 Dask core developers. New-BSD Licensed.