You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
DataFusion for Ray is a distributed execution framework that enables DataFusion DataFrame and SQL queries to run on a
Ray cluster. This integration allows users to leverage Ray's dynamic scheduling capabilities while executing
queries in a distributed fashion.
Execution Modes
DataFusion for Ray supports two execution modes:
Streaming Execution
This mode mimics the default execution strategy of DataFusion. Each operator in the query plan starts executing
as soon as its inputs are available, leading to a more pipelined execution model.
Batch Execution
Note: Batch Execution is not implemented yet. Tracking issue: #69
In this mode, execution follows a staged model similar to Apache Spark. Each query stage runs to completion, producing
intermediate shuffle files that are persisted and used as input for the next stage.
Getting Started
See the contributor guide for instructions on building DataFusion for Ray.
Once installed, you can run queries using DataFusion's familiar API while leveraging the distributed execution
capabilities of Ray.
# from example in ./examples/http_csv.pyimportrayfromdatafusion_rayimportDFRayContext, df_ray_runtime_envray.init(runtime_env=df_ray_runtime_env)
ctx=DFRayContext()
ctx.register_csv(
"aggregate_test_100",
"https://github.com/apache/arrow-testing/raw/master/data/csv/aggregate_test_100.csv",
)
df=ctx.sql("SELECT c1,c2,c3 FROM aggregate_test_100 LIMIT 5")
df.show()
Contributing
Contributions are welcome! Please open an issue or submit a pull request if you would like to contribute. See the
contributor guide for more information.