Spreadsheets are ubiquitous and incredibly powerful tools.
Their flexibility allows users to mix and match different data types, even within the same tab or column. This flexibility can be both a blessing and a curse. Eventually, you might need to read these spreadsheets into a data warehouse for analysis or feed data exports into them. To do this effectively, spreadsheets need to be machine-readable. In this post, I'll cover some examples and best practices to keep in mind when designing or using spreadsheets, as well as tips to share with your business stakeholders to create these sheets with the right purpose and rules from the start.
Common Pitfalls to Avoid
Merged and Hidden Cells
Merged cells can be confusing for both humans and computers. While they might make the spreadsheet look cleaner to the human eye, they can cause issues for machines trying to read the data. Ideally, you want each header to be clearly associated with its corresponding row. If a header applies to multiple rows, consider using separate headers for each row.
Multi-Line Headers
Multi-line headers, where you have to read multiple cells to understand the header, can be problematic. For example, having a merged header with "Revenue" and then "Plan" and "Actual" in the cells below. It's better to have both pieces of information in a single cell to make it easier to understand.
Missing Headers
Sometimes, it might be clear to you what a value represents (like the Month in the Screenshot above), but it's always best to have explicit headers. This makes it easier to detect what's in each cell and ensures that the data is machine-readable. All values in the spreadsheet must be clearly identifiable by a header-cell.
Multiple Tables
Having many tables below each other or side by side can be challenging.
A few whitespaces are not enough to clearly separate one table from another.
It’s better to separate them into individual tabs
Colored Cells
While color logic can help guide the eye, please try to avoid it.
I have seen huge tables with rows as “green” for “done
” and “red” for “issue
”, blue for “in progress
”, …
I probably shouldn’t tell you that you can even filter spreadsheets by color :D
But to help potential color blind coworkers and future automation projects, rather use a clearly defined column called “Status” to define this information.
Best Practices for Spreadsheet Design
Raw Data Tabs
I prefer having raw data tabs where the data is stored in its most basic form. You can then have separate tabs for consuming the data, where you structure it differently using references and pivot tables. This approach makes it easier to manage and analyze the data.
Long Tables
Long tables that grow vertically are generally better than wide tables.
While widescreens are common, they don't necessarily make your life easier when it comes to data management. Having long tables with as few columns as possible is more manageable.
For example, instead of having separate tables for revenue and cost, you can structure them into a single table with columns for
Month
Revenue (Plan)
Revenue (Actual)
Cost (Plan)
Cost (Actual)
As you can see below, the data know looks more structured as every Column has exactly 1 header.
Metric Definitions
You can go a step further and make the table a bit leaner.
This makes it easier to define and calculate metrics, too.
For example, you can have a column for Type (“Plan” or “Actual”) and list Revenue and Cost only once. If you then want to create a calculated field such as ROI (Revenue / Cost), you only need to define it once.
This approach makes it easier to organize and manage calculated fields.
Pivot Tables
A good rule of thumb is that if something works in a pivot table, it will also work in a database. Pivot tables are powerful tools that can help you organize and analyze your data effectively.
Avoid Overcomplication
While it's possible to go a step further and have separate columns for Metric Types and Metric Values, this can make the spreadsheet overly complex.
The goal is to have spreadsheets that help you, not ones that are cumbersome to manage.
Conclusion
In summary, avoiding common pitfalls like merged cells, multi-line headers, missing headers, and multiple tables can make your spreadsheets more machine-readable and easier to manage. Using raw data tabs, long tables, and pivot tables can add significant value to your data management process. Remember, the rule of thumb is that if it works in a pivot table, it will also work in a database. Pivot tables are powerful tools that can enhance your spreadsheets, even without any data warehouse implications.
By following these best practices, you can create spreadsheets that are not only powerful but also easy to manage and analyze.
I hope that these tips can boost the data literacy in your organization :)