4. Transform data

Now that we have data to work with, it's time to expand our pipeline with data transformations! In this section, we discuss the different types of assets that exist in pipelines and transform the source assets from section 3 into staging models.

Data transformations

In a pipeline, we apply data transformations to assets, which results in new assets. These resulting assets are called models. For example, we can take two source assets, join them together, and save the result as a model.

Typically, we use the following types of assets to progressively refine data:

  • Source assets: the raw data as it is ingested into your data warehouse
  • Staging models: one-to-one relations to source assets, but with basic data cleaning applied (e.g., data types, renaming, reordering, parsing)
  • Intermediate models: meaningful combinations of staging models that are useful for data engineering but unsuited for the needs of business users
  • Mart models: the final layer that combines intermediate and staging models in a way that can be presented to business users (e.g., through dashboards)

A pipeline based on the source assets from the previous section.

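To make the layering concrete, here is a sketch of what a hypothetical mart model might look like: it joins two staging models into a customer-level summary. The model names (stg_customers) and columns (customer_name) are illustrative and not part of this tutorial.

```sql
-- Hypothetical mart model: one row per customer with their order count.
-- stg_customers and its customer_name column are illustrative examples.
WITH

orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
)

SELECT
    customers.customer_id,
    customers.customer_name,
    COUNT(orders.order_id) AS order_count
FROM customers
LEFT JOIN orders
    ON customers.customer_id = orders.customer_id
GROUP BY customers.customer_id, customers.customer_name
```

Note that mart models reference other models with `ref()`, while staging models reference raw data with `source()`, as you will see below.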
Create a new model

Let's dive straight in and create models that transform our source data. We'll create a stg_orders model that cleans up the raw_orders table.

Add a staging model

  1. Navigate to the List view in the Asset Editor
  2. Click on 📁 and create a models directory with a staging subdirectory
  3. Click on +
  4. Select Model and select models/staging as its path
  5. Name the model: stg_orders
  6. Insert the following query
models/staging/stg_orders.sql

```sql
WITH

source AS (
    SELECT * FROM {{ source('raw_space_fuel_shop', 'public_orders') }}
),

renamed AS (
    SELECT
        ID AS order_id,
        CUSTOMER_ID AS customer_id,
        FULFILLMENT_METHOD AS fulfillment_method,
        LINE_ITEMS AS line_items,
        SERVICE_STATION AS service_station,
        STATUS AS order_status,
        UPDATED_AT AS order_received
    FROM source
)

SELECT
    order_id,
    customer_id,
    fulfillment_method,
    line_items,
    service_station,
    CASE
        WHEN order_status = '0' THEN 'CANCELLED'
        WHEN order_status = '1' THEN 'DECLINED'
        WHEN order_status = '2' THEN 'ACCEPTED'
    END AS order_status,
    order_received

FROM renamed
```

  7. Preview the model by clicking the ▶️ (Preview query) button at the top left of the SQL editor, or by pressing the hotkey CMD / CTRL + RETURN with your cursor in the SQL editor. Verify that you can see the renamed columns and the parsed order_status column.

Commit & push the changes

In the background, Y42 has automatically created two files: stg_orders.sql and stg_orders.yml. If you navigate to the Code view, you can see the contents of these files. The .sql file contains our query, and the .yml file includes a reference to the query.
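The exact contents of the generated .yml file depend on your Y42 version, but a minimal dbt-style model entry looks roughly like the following sketch (Y42 generates and manages this file for you):

```yaml
# Sketch of an auto-generated model entry; do not take the exact
# layout as authoritative for your Y42 version.
version: 2

models:
  - name: stg_orders
```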

Commit and push these changes with an appropriate commit title (e.g., model: add stg_orders model).

Build the model

  1. Click on the table in the left file selector
  2. Open the bottom drawer
  3. Navigate to Build
  4. You should see a pre-defined build command. If not, enter y42 build -s stg_orders
  5. Click on Build now
  6. Observe the build job and wait for the job to be Ready

Add metadata to the model and preview data

  1. Click on Sync columns, which will add the following columns:

     • order_id
     • customer_id
     • fulfillment_method
     • line_items
     • service_station
     • order_status
     • order_received

  2. Commit and push the table metadata with an appropriate commit title (e.g., meta: add stg_orders schema)

Add test for order_status column

Because we applied a transformation to the order_status column, let's write an extra test for that column.

  1. Click on + in the Column Tests column for the order_status row.
  2. Select Accepted values as a test type
  3. Enter the values from the query above: CANCELLED, DECLINED, and ACCEPTED
  4. Commit and push these changes with an appropriate commit title (e.g., test: add accepted_values test for order_status column)
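Behind the scenes, an accepted-values test configured in the UI is stored as dbt-style YAML metadata in the model's .yml file. A sketch of what the generated entry might look like (the exact structure Y42 writes may differ):

```yaml
# Illustrative dbt-style accepted_values test on order_status;
# Y42 generates the actual entry when you configure the test in the UI.
models:
  - name: stg_orders
    columns:
      - name: order_status
        tests:
          - accepted_values:
              values: ['CANCELLED', 'DECLINED', 'ACCEPTED']
```

The test fails the build if order_status ever contains a value outside this list, which would indicate a new raw status code that the CASE expression above does not yet handle.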

From here, you can also transform the other source tables into staging models and add all of the appropriate metadata.

Up next

Outstanding! We've added the stg_orders model that contains the transformation logic for our data pipeline. Next, we'll create an orchestration to schedule the pipeline.