Before we create a data pipeline, we'll need data to work with. This section explores data assets, the asset lifecycle, and how to set up our first source assets.
Y42 follows the ELT paradigm for processing data: we extract data from a source, then load it, and finally transform it. Extracting data from its source and loading it into a data warehouse is what we call ingestion.
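The ELT order of operations can be sketched with a toy, in-memory example (plain Python, not Y42 code; the table names and records are made up for illustration):

```python
# Toy illustration of ELT: extract raw records, load them unchanged,
# and only then transform inside the "warehouse" (here, a plain dict).

def extract():
    # Pull raw records from a source system (values arrive as strings).
    return [{"id": 1, "fuel": "plasma", "liters": "30"},
            {"id": 2, "fuel": "ion", "liters": "12"}]

def load(records, warehouse):
    # Load the raw data as-is; no cleaning happens before this step.
    warehouse["raw_orders"] = records
    return warehouse

def transform(warehouse):
    # Transform inside the warehouse, e.g. cast types for a staging table.
    rows = warehouse["raw_orders"]
    warehouse["stg_orders"] = [{**r, "liters": int(r["liters"])} for r in rows]
    return warehouse

warehouse = transform(load(extract(), {}))
print(warehouse["stg_orders"][0]["liters"])  # 30
```

The point is the ordering: the raw data lands in the warehouse untouched, and all cleaning happens afterwards, inside it.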
Within Y42, there are over 500 connectors we can use for data ingestion, allowing us to pick and choose the most useful ones. All data ingestion starts with creating a source asset.
Assets are the building blocks of data pipelines. They are definitions in code that when executed materialize tables in our data warehouse. We call this process a data build. For example, building a source asset results in a source table, and building a staging model results in a staging table (more on those in section 4).
It helps to understand the four lifecycle stages that every asset goes through.
Definition: This is the initial phase where we define the asset either directly as code or through the Y42 interface.
Version Control and CI/CD: After defining the asset, we save and track changes by committing and pushing the changes to Git. This ensures that the asset is properly stored and that we can fix any breaking changes by reverting to a previous version.
Building: When the changes have been successfully saved to the remote Git repository, we can build the asset with the `y42 build` command. At this point, Y42 creates the table in our data warehouse and runs any data tests we have defined.
Publication: Once tested and verified, we can publish a specific version of an asset on different branches, such as `develop`. This is when the version of an asset becomes the production version that most people in our team will see. We can set read, write, and discover permissions on the organization, space, and asset-type level to restrict access to published assets as necessary.
Now that we have some background knowledge on assets, let's start with defining source assets. The first source we'll ingest is the Space Fuel Shop dataset. This dataset is hosted in a Postgres database and details orders from intergalactic gas stations. We'll use Airbyte's Postgres connector to ingest the data.
As you may have gathered from the asset lifecycle, Y42 follows the GitOps paradigm. We'll dive further into this in section 6, but let's implement a best practice immediately. We should create a new branch whenever we want to make changes to our pipelines — such as adding a data source. This gives us a safe environment to work in without disrupting live pipelines.
To create a new branch, click on `main [default]` at the top center of the screen and type in a name for the new branch (e.g., `getting-started-guide`). Now hit enter; that's all there is to it.
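Under the hood, this creates an ordinary Git branch. A minimal shell sketch of the equivalent Git commands (run here in a throwaway repository purely for illustration):

```shell
# Demo in a throwaway repo; in a real project you'd run this inside your clone.
cd "$(mktemp -d)"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"

# Create and switch to the new branch in one step.
git switch -c getting-started-guide

git branch --show-current  # prints the active branch name
```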
To ingest the Space Fuel Shop dataset, we'll use Airbyte's Postgres connector:
- In the left panel, select the
- Press on `+` in the file navigator
- Select `Postgres (Airbyte)` as the Ingest type
- Name the source: `raw_space_fuel_shop_data`
Now, we'll configure the source asset to point to the right Postgres database and pull in all of the data.
Source assets (generally) need a secret to authorize their connection to the data source.
In the `Authorization` panel, click on the search box `Search a secret`. When no secrets exist yet, you'll be prompted to add a new secret. If secrets do exist, you can select one of them or create a new one.
Name the secret, replacing `YOUR_NAME` with your name.
Insert the following type-specific configuration variables:
| Variable | Value |
| --- | --- |
| `Host` | |
| `SSH Tunnel Method` | |
| `JDBC URL Parameters (Advanced)` | (leave empty) |
When you have entered all authorization settings, click `Next` in the `Settings` panel, and wait for the schema to load in the `Schema selection` panel.
- Open the `Schema selection` panel
- Wait for the schema to load. For larger datasets this may take a few minutes.
- Select all three tables:
- You can open the dropdown of a single table to see the individual columns and types. For now, just import all columns.
Our first source is now fully configured. Before we can build the tables and ingest the data, we'll need to commit and push our changes.
- Click on the `Commit & Push` button at the top.
- Enter a name for your commit: `ingest: add Space Fuel Shop data source`.
- Wait for Y42's pre-configured checks to run.
Y42 will automatically push the changes to the Git remote.
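The `Commit & Push` button corresponds to standard Git operations. A sketch using a throwaway local "remote" (the file name and contents are illustrative, not Y42's actual asset format):

```shell
# Set up a throwaway bare repo to stand in for the Git remote.
cd "$(mktemp -d)"
git init -q --bare remote.git
git clone -q remote.git work
cd work

# Stage and commit the new source definition, then push it.
echo "type: source" > raw_space_fuel_shop_data.yml   # illustrative file
git add raw_space_fuel_shop_data.yml
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "ingest: add Space Fuel Shop data source"
git push -q origin HEAD

git log --oneline -1  # shows the commit we just pushed
```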
With our source successfully set up, we can build the three tables and load them into the data warehouse. This is where the actual ingestion occurs: we take the raw data from the Postgres source and move it to a place where Y42 can work with it.
- Open the drawer at the bottom of your screen
- Select the
- Insert the build command to build and import all three tables: `y42 build -s source:raw_space_fuel_shop_data.*`
- Click on `Visualize DAG` to see the assets this command will build
- Click on `Build now`. You'll see a new row added to the build window. As the blue quarter bar indicates, the build is in queueing mode.
- Click on the build row to see more details, where you'll see the pending jobs. You can also navigate to `Logs` to view build logs. If you click on an individual build job, you'll see that job's details.
- After a couple of minutes, the build should be complete.
We'll add metadata to the assets to get the most out of our data. Y42 has extensive metadata features, which are especially helpful when working with many data assets and many collaborators. For now, we'll focus on just a few metadata features: column types, column tests, and column descriptions.
- Open `raw_orders` in the left file selector
Seven columns should be added. If they do not have the correct `Type`, update the type by double-clicking on the `Type` field and entering the appropriate one:
| Column | Type |
| --- | --- |
| `ID` | |
Add the following column tests:
| Column | Test | Values |
| --- | --- | --- |
| `ID` | | |
Add the following descriptions:
| Column | Description |
| --- | --- |
| `UPDATED_AT` | Time at which the order was published to the database in CET. |
| | 0 for canceled, 1 for declined, 2 for accepted |
Commit and push the changes with an appropriate commit title (e.g., `meta: add raw_orders metadata`).
You can add more extensive tests and metadata, of course! As a general practice, it's a good idea to have at least a description for every column and the correct data type selected.
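Because Y42 assets are defined as code, this metadata also lives in the repository. A hypothetical sketch in dbt-style schema YAML (the key names, test names, and the `ID` description are illustrative, not Y42's exact format; only `raw_orders`, `ID`, and `UPDATED_AT` come from this guide):

```yaml
# Hypothetical sketch of column metadata as code (dbt-style schema YAML).
version: 2
sources:
  - name: raw_space_fuel_shop_data
    tables:
      - name: raw_orders
        columns:
          - name: ID
            data_type: integer            # column type
            description: Unique order identifier (illustrative).
            tests:
              - not_null                  # column test
          - name: UPDATED_AT
            data_type: timestamp
            description: >
              Time at which the order was published to the database in CET.
```

Keeping descriptions, types, and tests next to the column definitions means they are versioned and reviewed together with the rest of the pipeline.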
We can now take a sneak peek at the data to see how it's shaping up:
- Open the bottom drawer
- Navigate to `Data`, where you will see a preview of the data and the row count
- Wait for the data to load and verify that it shows up
Now that we've created source tables in the data warehouse, we can start building a pipeline that combines the data and transforms it into new models!