Data processing and transformation within ETL processes take place in a wide range of tools, for example Google BigQuery, Keboola or Data Studio. To keep the entire data flow clear, easy to control and easy to navigate when editing or managing it, we need to visualize the whole process, so we started developing a data pipeline visualization in Google Apps Script. It is a clear diagram of all stages of our ETL process and also serves as a signpost with links to documentation or other settings. The visualization also helps connect people from different areas who are not directly involved in ETL. Whoever works with the visualization can see the path that led to the creation of their source table, and the data assistant can see what is happening where and when.
Main advantages of visualization
The visualization consists of cells and connecting arrows that define their interrelationships. Each cell provides sufficient space to fill in all the relevant information.
1. Variability
Each cell has space for basic information about the process level. The first row names the type of ETL step (for example, default database, Keboola extractor, BigQuery table or Data Studio report).
2. Clarity
The second row of the cell holds the name of the step: source account name, table name or orchestration name. The third row gives a more detailed description of the cell (for example, the dataset name), which the fourth row can extend with a reference to the table or operation documentation.
3. Better planning
The fifth row is reserved for information about scheduled operations: when and how often orchestrations or scheduled queries run in the ETL. Additional notes can be added in the next row, although we rarely use it because the main points are already covered above. The very last row of the cell contains entries that make it easier to navigate the visualization.
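The row layout described above can be sketched as a small data structure. This is only an illustrative model, not the code of the actual visualization: the `PipelineCell` interface, its field names, the `cellToRows` helper and the example values are all assumptions made for this sketch.

```typescript
// Hypothetical model of one diagram cell. Field names are illustrative
// and are not taken from the original Apps Script project.
interface PipelineCell {
  stepType: string;      // row 1: kind of ETL step, e.g. "BigQuery table"
  stepName: string;      // row 2: source account, table or orchestration name
  description?: string;  // row 3: more detail, e.g. the dataset name
  docsLink?: string;     // row 4: reference to table or operation documentation
  schedule?: string;     // row 5: when and how often the step runs
  notes?: string;        // row 6: optional additional notes
  navigation?: string;   // row 7: navigation aid within the diagram
}

// Flatten a cell into its seven rows, top to bottom. Missing optional
// values become empty strings so every cell keeps the same height.
function cellToRows(cell: PipelineCell): string[] {
  return [
    cell.stepType,
    cell.stepName,
    cell.description ?? "",
    cell.docsLink ?? "",
    cell.schedule ?? "",
    cell.notes ?? "",
    cell.navigation ?? "",
  ];
}

// Example values are made up for illustration.
const exampleCell: PipelineCell = {
  stepType: "BigQuery table",
  stepName: "sessions_daily",
  description: "dataset: analytics",
  schedule: "daily, 06:00",
};
const exampleRows = cellToRows(exampleCell);
```

Keeping every cell at a fixed number of rows, even when some are empty, is one way to keep cells aligned in a grid-based diagram.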
4. Colour coding
The colour coding is tailored to the client's needs so that they have an overview of the processes. In our case, however, the colours define individual data flows from the same source or stages of the data flow (e.g. Keboola, BigQuery, marketing channels).
You can keep the same colour scheme for a given data flow until it connects to another data source (for example, GA and marketing). At that point, we switch to a neutral colour because the cell now represents a table or level that comes from multiple sources.
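The colour rule just described (one colour per data flow, a neutral colour once flows merge) could look roughly like the sketch below. The palette, the hex values and the `colourForCell` function are hypothetical names chosen for this example and do not come from the actual implementation.

```typescript
// Hypothetical palette: one colour per data flow. Values are illustrative.
const FLOW_COLOURS: Record<string, string> = {
  GA: "#f4b183",
  marketing: "#9dc3e6",
};

// Colour used once a cell is fed by more than one source.
const NEUTRAL_COLOUR = "#d9d9d9";

// Pick the fill colour for a cell from the list of sources that feed it.
// A single distinct source keeps that flow's colour; anything else
// (several sources, no sources, or an unknown source) falls back to neutral.
function colourForCell(sources: string[]): string {
  const [only, ...rest] = Array.from(new Set(sources));
  if (only !== undefined && rest.length === 0) {
    return FLOW_COLOURS[only] ?? NEUTRAL_COLOUR;
  }
  return NEUTRAL_COLOUR;
}
```

For example, `colourForCell(["GA"])` keeps the GA colour, while `colourForCell(["GA", "marketing"])` returns the neutral colour because the cell combines two flows.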
Navigation in visualization
Arrows
Clear signage is essential for us – that’s why the arrows indicate the relationships within the data flow at a glance. They are especially useful when connecting multiple data sources, where you can easily get an overview of all the sources involved.
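One way to think about the arrows is as a list of directed edges between cells. The sketch below, with a hypothetical `Arrow` type, a hypothetical `upstreamSources` function and made-up cell names, follows the arrows backwards from a cell to find every original source that feeds it. It illustrates the idea only and is not the code behind the diagram.

```typescript
// An arrow points from an upstream cell to the cell that consumes it.
// The type and the cell names below are assumptions for this sketch.
interface Arrow {
  from: string;
  to: string;
}

// Walk the arrows backwards from `target` and collect every cell that has
// no incoming arrow of its own, i.e. the original sources of the flow.
function upstreamSources(target: string, arrows: Arrow[]): string[] {
  const visited = new Set<string>();
  const roots = new Set<string>();
  const stack: string[] = [target];

  while (stack.length > 0) {
    const current = stack.pop() as string;
    if (visited.has(current)) continue; // guards against cycles and repeats
    visited.add(current);

    const incoming = arrows.filter((a) => a.to === current);
    if (incoming.length === 0 && current !== target) {
      roots.add(current);
    }
    for (const a of incoming) {
      stack.push(a.from);
    }
  }
  return Array.from(roots).sort();
}

// Made-up example: two sources merge into one report.
const exampleArrows: Arrow[] = [
  { from: "ga_export", to: "sessions" },
  { from: "ads_extractor", to: "ad_costs" },
  { from: "sessions", to: "marketing_report" },
  { from: "ad_costs", to: "marketing_report" },
];
const reportSources = upstreamSources("marketing_report", exampleArrows);
```

In this example `reportSources` contains `ads_extractor` and `ga_export`, the two sources involved in the report.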
Signpost
The visualization also acts as a handy signpost, as it can contain links to all available documentation, settings, source accounts or final ETL outputs. The signpost makes it easier to check how well the process is optimized and helps us avoid duplicate process setups.
The importance of process visualization
The advantages are countless, and we have written enough about them. In addition, the visualization is a practical overview for everyone who is not directly involved, giving them an outside view of the data processes. Of course, every solution has its limits. In the case of Apps Script, the limits are the number of cells and the creation of arrows and, for some users, poorer clarity and occasionally overlapping cells and processes. But we can live with that, and we are working on fixing these cosmetic flaws as well.