This document helps you understand the architecture and execution of workflows in Dataform.
You can use Dataform to develop, test, and version control workflows that you can run in BigQuery to transform data for analytics purposes. You can develop workflows with Dataform core, using SQLX files and optionally JavaScript files, or with JavaScript.
A workflow can consist of the following objects (several of them are sketched in the example after this list):
- Data source declarations: Declarations of BigQuery data sources that let you reference these data sources in Dataform table definitions and SQL operations.
- Tables: Tables that you create in Dataform based on the declared data sources or other tables in your workflow. Dataform supports the following table types: table, incremental table, view, and materialized view.
- Assertions: Data quality test queries that you can use to validate table data. Dataform runs assertions every time it updates your workflow and alerts you if any assertions fail.
- Custom SQL operations: SQL statements that Dataform runs in BigQuery as they are, without modification.
- Includes: JavaScript files with definitions of variables and functions that you can reuse across your workflow.
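For example, here is a minimal sketch of how a few of these objects might look in a workspace. The file names, dataset names, and column names are illustrative, not part of any real project.

A data source declaration, for example in a file such as `definitions/logins.sqlx`:

```sqlx
config {
  type: "declaration",
  database: "my-project",  // hypothetical Google Cloud project ID
  schema: "raw_data",      // hypothetical dataset that already exists in BigQuery
  name: "logins"           // hypothetical source table
}
```

A table that selects from the declared source and validates its output with built-in assertions, for example in `definitions/daily_logins.sqlx`:

```sqlx
config {
  type: "table",  // other supported types include "incremental" and "view"
  assertions: {
    uniqueKey: ["login_date"],  // alerts you if any login_date value is duplicated
    nonNull: ["login_count"]
  }
}

SELECT
  DATE(login_time) AS login_date,
  COUNT(*) AS login_count
FROM ${ref("logins")}  -- ref() creates the dependency on the declared data source
GROUP BY login_date
```

An include with reusable values, for example in `includes/constants.js`, which other files can then reference with expressions such as `${constants.START_DATE}`:

```javascript
// Values exported here can be reused across SQLX files in the repository.
module.exports = {
  START_DATE: "2024-01-01"  // hypothetical constant
};
```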
Visualization of a workflow
You can view your workflow visualized in the form of a Directed Acyclic Graph (DAG). The DAG displays all objects of the workflow defined in your workspace, and the relationships between them. You can zoom in and out and use drag-and-drop navigation in the DAG. If there are compilation errors in your workflow, Dataform displays an error message instead of the DAG.
To view the DAG of your workflow, in your workspace, click Compiled graph.
Execution of a workflow
In your development workspace, you can manually trigger a run of your entire workflow, a selection of actions, or a selection of tags.
You can schedule runs with Dataform release configurations and workflow configurations. First, create a release configuration that creates compilation results of your repository. Then, create a workflow configuration: select the release configuration, select the workflow actions that you want to run, and set the run schedule.
Alternatively, you can schedule runs either with Cloud Composer, or with Workflows and Cloud Scheduler.
During execution, Dataform runs SQL queries in BigQuery, following the order of object dependencies in your workflow. After execution, you can use your defined tables and views for all your analytics purposes in BigQuery.
Execution configuration options
To run a specific group of your workflow actions, you can add Dataform execution tags to the files you select. You can then run only the files with a selected tag when you manually trigger a run.
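For example, a table definition can carry a tag in its config block; the tag name here is illustrative:

```sqlx
config {
  type: "table",
  tags: ["daily"]  // select the "daily" tag to run only the actions tagged with it
}

SELECT * FROM ${ref("logins")}
```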
By default, Dataform runs your workflow with the execution settings defined in the dataform.json file.
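A dataform.json file defines defaults such as the project, dataset, and location for the workflow. A minimal sketch with placeholder values:

```json
{
  "warehouse": "bigquery",
  "defaultDatabase": "my-project",
  "defaultSchema": "dataform",
  "defaultLocation": "US",
  "assertionSchema": "dataform_assertions"
}
```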
You can override these execution settings with compilation overrides.
With workspace compilation overrides, you can turn workspaces into isolated execution environments. This means that when you manually trigger execution in a workspace, Dataform runs the output in an isolated location in BigQuery.
To create and run a single compilation result with compilation overrides, you can pass requests through the Dataform API.
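As a rough, unverified sketch, a request body for the Dataform API's compilationResults.create method could carry overrides along these lines; the field names and values are assumptions, so check the API reference for the exact shape:

```json
{
  "gitCommitish": "main",
  "codeCompilationConfig": {
    "schemaSuffix": "_dev",
    "tablePrefix": "test_"
  }
}
```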
With release configurations, you can configure compilation overrides for your whole repository, as well as the frequency of creating compilation results with the applied settings.
To learn more about ways to configure compilation and code lifecycle in Dataform, see Introduction to code lifecycle in Dataform.
What's next
- To learn how to declare a data source, see Declare a data source.
- To learn how to declare dependencies to define relationships between objects in your workflow, see Set dependencies.
- To learn how to define custom SQL operations, see Create operations.
- To learn how to reuse variables and functions across your workflow with includes, see Reuse code across a single repository with includes.