Now that we have data to work with, it's time to expand our pipeline with data transformations! In this section, we discuss the different types of assets that exist in pipelines and transform the source assets from section 3 into staging models.
In a pipeline, we apply data transformations to assets, which results in new assets. These resulting assets are called models. For example, we can take two source assets, join them together, and save the result as a model.
Typically, we use the following types of assets to refine data increasingly:
- Source assets: the raw data as we ingest it into your data warehouse
- Staging models: one-to-one relations to source assets, but with basic data cleaning applied (e.g., data types, renaming, reordering, parsing)
- Intermediate models: meaningful combinations of staging models that are useful for data engineering but unsuited for the needs of business users
- Mart models: the final layer that combines intermediate and staging models in a way that can be presented to business users (e.g., through dashboards)
Let's dive straight in and create models that transform our source data. We'll create a
stg_orders model that cleans up the
- Navigate to the
Listview in the Asset Editor
- Press on
📁and create a
modelsdirectory with a
- Press on
models/stagingas its path
- Name the model:
- Insert the following query
- Preview the model by clicking on the
▶️ (Preview query)button at the top left of the SQL editor or by using the hotkey CTRL/CMD + RETURN while having your cursor in the SQL editor. Verify that you can see the renamed columns and the parsed
In the background, Y42 has automatically created two files:
stg_orders.yml. If you navigate to the
Code view, you can see the contents of these files. The
.sql file contains our query, and the
.yml file includes a reference to the query.
Commit and push these changes with an appropriate commit title (e.g.,
model: add stg_orders model)
- Click on the table in the left file selector
- Open the bottom drawer
- Navigate to
- You should see a pre-defined build command. If not, enter
y42 build -s stg_orders
- Click on
- Observe the build job and wait for the job to be
- Click on
Sync columns, which will add the following columns:_10order_id_10customer_id_10fulfillment_method_10line_items_10service_station_10order_status_10order_received
- Commit and push the table metadata with an appropriate commit title (e.g.,
meta: add stg_orders schema)
Because we applied a transformation to the
order_status column, let's write an extra test for that column.
- Click on
Column Testscolumn for the
Accepted valuesas a test type
- Enter the values from the query above:
- Commit and push these changes with an appropriate commit title (e.g.,
test: add accepted_values test for order_status column)
From here, you can also transform the other source tables into staging models and add all of the appropriate metadata.
Outstanding! We've added the
stg_orders model that contains the transformation logic for our data pipeline. Next, we'll create an orchestration to schedule the pipeline.