In my previous post, we explored the four transformation layers that bring structure and easier maintainability to your logic. Now, we will dive into the practical side:
How does this translate into actual database structures and schemas? How do you make these architectural decisions in the real world?
The truth is, there's no one-size-fits-all answer.
A small 100-person start-up with a single full-stack data guru MUST have a vastly different setup than a large 5,000-person enterprise with multiple data teams.
However, two fundamental considerations apply to everyone:
Views vs Tables and Schemas vs Databases
Views vs. Tables:
It’s a trade-off between Compute and Storage.
The core question here is whether to materialize the results of a transformation (creating a TABLE) or simply store the transformation logic (creating a VIEW).
I will write TABLE, VIEW, SCHEMA and DATABASE in uppercase when referring to the database entities, to avoid confusion.
Here is a quick recap of the differences between both:
Tables
You execute a query and store the resulting data in a TABLE.
This leads to faster access to the transformed data, as it's readily available.
You pay for the storage space the TABLE occupies.
Views
A VIEW is essentially a stored SQL query.
It doesn't store the data itself, only the definition of how to retrieve it.
Saves storage space but incurs a compute cost every time the view is queried*
*Modern Cloud Data Warehouses are quite good at using cached results for similar queries.
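To make the recap concrete, here is a minimal sketch in generic warehouse SQL (the prep_orders source and its columns are made up for illustration):

```sql
-- TABLE: the query runs once and the result is physically stored
CREATE OR REPLACE TABLE core_orders AS
SELECT order_id, customer_id, order_total
FROM prep_orders
WHERE status = 'completed';

-- VIEW: only the query definition is stored;
-- the SELECT re-runs every time someone queries the VIEW
CREATE OR REPLACE VIEW core_orders_v AS
SELECT order_id, customer_id, order_total
FROM prep_orders
WHERE status = 'completed';
```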
Consider these factors:
Input Data Size: If your raw data includes very large fields (like massive JSON payloads stored in a string field), extracting only the necessary information into properly parsed columns can significantly improve performance and storage. Constantly dragging the huge JSON blobs through VIEWS can be costly (see the sketch after this list).
Frequency of Use: If the transformed data is accessed frequently, a materialized TABLE is likely more efficient, because the transformations only run once. If it's only used occasionally, a VIEW might be preferable.
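For example, parsing the blob once into typed columns when building the Clean layer (a sketch in Snowflake-style syntax; raw_events and its fields are made up):

```sql
-- Parse the JSON string once, keep only what downstream models need
CREATE OR REPLACE TABLE clean_events AS
SELECT
    event_id,
    PARSE_JSON(payload):user_id::STRING       AS user_id,
    PARSE_JSON(payload):event_type::STRING    AS event_type,
    PARSE_JSON(payload):created_at::TIMESTAMP AS created_at
FROM raw_events;
-- Downstream models now read three slim typed columns instead of the full blob
```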
So, how do you choose?
For most use cases, cost will be the major criterion; for others, it will be performance and execution time.
Optimizing Cloud Cost
The most crucial step is to analyze your data warehouse spending. Are you spending more on Compute or Storage? If you use common platforms like Snowflake, BigQuery or Databricks, the first thing to do is to look up this info in their Admin Console.
Other platforms may have flatter, "on-prem-like" pricing, which leaves less room for cost tuning.
Also keep in mind that the platforms have different pricing mechanisms: some charge for query execution time, others for scanned data volume. Understanding your platform's cost model and mechanics is key.
Identifying where the bigger leverage lies is crucial for making the right decisions about VIEW vs. TABLE materializations.
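On Snowflake, for instance, a rough Compute vs. Storage picture can be pulled from the ACCOUNT_USAGE views (a sketch; adjust the time window and apply your contract's credit and storage rates):

```sql
-- Compute: credits consumed per warehouse over the last 30 days
SELECT
    warehouse_name,
    SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits DESC;

-- Storage: average terabytes stored per day in the same window
SELECT
    AVG(storage_bytes + stage_bytes + failsafe_bytes) / POWER(1024, 4) AS avg_tb
FROM snowflake.account_usage.storage_usage
WHERE usage_date >= DATEADD(day, -30, CURRENT_DATE());
```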
If you need to optimize storage, keeping the Clean and the smaller Prep models as VIEWS makes a ton of sense.
In my experience, however, Compute drives 70-90% of Cloud Warehouse cost. Tweaking storage isn't very meaningful then. It's still worth testing whether VIEWS can save you some Compute, but the other way around is more common.
Optimizing Execution Time
Another factor to consider is that TABLES need to be refreshed to include new data, whereas VIEWS are always up to date, since they query the latest data from their sources.
Let’s assume you have 5 models chained in sequential order.
If you use only VIEWS, you don't need to update your pipeline:
As soon as new raw data lands, you can query model #5 to see the latest results. In theory you don't even need to run your pipeline (0 sec execution time) to refresh your models. Sounds great, right?
But since you "daisy-chain" 5 queries on top of each other, you can expect very slow results and, subsequently, stakeholders complaining about unusable dashboards. Whenever you change a filter on model #5, all upstream transformations run again.
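To picture the daisy-chain (a sketch in generic SQL; model names and the region column are made up):

```sql
-- Five stacked VIEWS: nothing is precomputed
CREATE OR REPLACE VIEW model_1 AS SELECT * FROM raw_data;
CREATE OR REPLACE VIEW model_2 AS SELECT * FROM model_1;
CREATE OR REPLACE VIEW model_3 AS SELECT * FROM model_2;
CREATE OR REPLACE VIEW model_4 AS SELECT * FROM model_3;
CREATE OR REPLACE VIEW model_5 AS SELECT * FROM model_4;

-- Every dashboard query against model_5 re-runs the whole chain
SELECT * FROM model_5 WHERE region = 'EMEA';
```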
Also, the compute saved on refreshes might simply pile up on the consumption side instead. Most likely, the number of consumption queries is much higher than the number of queries needed to build the Core and Mart models (5 in this example).
Core and Mart assets should therefore be materialized as TABLES.
A feasible solution would be to keep 3 VIEWS (which are always up to date) for the Clean and Prep steps, and 2 TABLES (which need to be refreshed to include new data) for Core and Mart at the end, ensuring snappy dashboards.
This can be a good compromise: a materialized Single Source of Truth with good dashboard performance, and no additional storage cost for the upstream preparation logic. Your pipeline might also run a bit faster, because only 2 out of 5 models (in this example) are physically written to the database.
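Sketched in plain warehouse SQL (all model and column names are illustrative):

```sql
-- 3 VIEWS for Clean and Prep: always up to date, no storage cost
CREATE OR REPLACE VIEW clean_orders    AS SELECT * FROM raw_orders;
CREATE OR REPLACE VIEW clean_customers AS SELECT * FROM raw_customers;
CREATE OR REPLACE VIEW prep_orders     AS
SELECT * FROM clean_orders WHERE order_id IS NOT NULL;

-- 2 TABLES for Core and Mart: refreshed by the pipeline, fast to query
CREATE OR REPLACE TABLE core_orders AS
SELECT o.*, c.customer_name
FROM prep_orders o
JOIN clean_customers c ON o.customer_id = c.customer_id;

CREATE OR REPLACE TABLE mart_revenue AS
SELECT customer_id, SUM(order_total) AS revenue
FROM core_orders
GROUP BY customer_id;
```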
But often the heavy lifting actually happens in the early steps of the data pipeline, and Prep models may be referenced more than once. Therefore, I strongly recommend testing whether this setup actually improves pipeline execution time, and comparing the accumulated cost of a pipeline run.
Then analyze whether the change is worth it, considering your overall situation.
Saving 60 seconds of execution time on a pipeline that refreshes once a night probably isn’t worth it.
Also, let’s briefly talk about table management:
Perhaps the ease of use of real-time VIEWS sounds appealing to you.
You essentially don't need to touch them as long as the logic doesn't change. But don't be afraid of the overhead of managing database TABLES in modern times: transformation frameworks such as dbt, Dataform or SQLMesh make table management incredibly easy. You don't need to manually manage table creation, updates, or deletions. This reduces the operational overhead of using TABLES over VIEWS.
The difference in the setup can be as simple as a single parameter.
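In dbt, for instance, it's a one-line config in the model file (model and column names are illustrative):

```sql
-- models/core/core_orders.sql
-- One parameter decides the materialization; flip 'table' to 'view'
-- and dbt will create a VIEW instead on the next run.
{{ config(materialized='table') }}

SELECT
    order_id,
    customer_id,
    order_total
FROM {{ ref('prep_orders') }}
```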
This means that the pipeline’s execution time becomes the only criterion worth paying attention to.
Schemas vs. Databases:
The guiding principle here is "as simple as possible, as complex as needed."
Simple Setups
When starting out, a single DATABASE with multiple SCHEMAS is often sufficient. Use prefixes or distinct SCHEMA names to separate your
raw data (e.g., raw_hubspot, raw_stripe, …)
clean and prepared data (e.g., clean_, prep_)
final core and mart layers (e.g., core_, analytics_)
development and testing environments
Even when co-developing, you can easily assign one personal SCHEMA per developer.
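A minimal sketch of such a setup (schema names follow the prefix convention above and are illustrative):

```sql
-- One DATABASE, separation via SCHEMA prefixes
CREATE SCHEMA IF NOT EXISTS raw_hubspot;
CREATE SCHEMA IF NOT EXISTS raw_stripe;
CREATE SCHEMA IF NOT EXISTS clean;
CREATE SCHEMA IF NOT EXISTS prep;
CREATE SCHEMA IF NOT EXISTS core;
CREATE SCHEMA IF NOT EXISTS analytics;
CREATE SCHEMA IF NOT EXISTS dev_alice;  -- one personal SCHEMA per developer
```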
The Importance of Separation
You may have noticed that even while advocating that strict SCHEMA separation isn't crucial, I immediately split Clean & Prep and Core & Mart. It's crucial to keep your early layers in separate SCHEMAS with restricted access. This ensures data integrity, prevents accidental modifications by analysts, and focuses them on the analytics-ready datasets. Only the final Core tables and Marts should be exposed to analysts, data scientists, and reporting tools.
I love Ralph Kimball's restaurant kitchen analogy: only the chefs and professional staff should be working in the preparation area.
Depending on your setup and your users' technical literacy, you might even hide the Core tables and share only the actual Marts.
Having these layers represented in different physical SCHEMAS of your DATABASE allows you to enforce these rules with simple permission management.
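With Snowflake-style role-based access control, for example, enforcing this takes only a handful of grants (role and schema names are assumptions):

```sql
-- Analysts only get the analytics-ready layers
GRANT USAGE ON SCHEMA core      TO ROLE analyst;
GRANT USAGE ON SCHEMA analytics TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO ROLE analyst;

-- No grants on the raw_*, clean and prep SCHEMAS: they stay invisible
```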
Complex Setups:
As your data platform grows, you might consider splitting it into dedicated DATABASES for development and production. You might even have a dedicated DATABASE for raw data ingestion once you reach a critical mass: avoiding hundreds of raw_ SCHEMAS in a single DATABASE is desirable.
However, managing multiple databases or cloud projects increases overhead (managing different service accounts, configurations, etc.) and should only be done when the organizational benefits or cost savings outweigh the added complexity. Don't over-engineer! Also, you shouldn’t blindly copy the practices of Tech Giants as the attempt could overload your team.
Often, the big players have dedicated teams focusing just on Governance, DevOps, and Infrastructure as Code. Trying to sprint before walking can make you stumble very easily.
As a rule of thumb, in the majority of setups you shouldn't have more DATABASES to manage than team members.
Conclusion
Materializing your data transformation layers is a balancing act. There's no single "right" way. You should understand and test the trade-offs between VIEWS and TABLES, and strategically use SCHEMAS and DATABASES to create a data platform that is efficient, maintainable, and scalable.
If your storage costs aren’t significant and you’re using a modern transformation framework such as dbt, I don’t see any major benefits of using VIEWS over TABLES. But feel free to experiment and report back in the comments! :)
Remember to analyze your costs, adapt to your organization's needs, and always prioritize clarity and organization.
I would be curious about your Compute vs. Storage ratio and how you use VIEWS.
Share your experiences in the comments :)