Like many of us, you might have been around when the term Big Data was coined. It was 2006 when Apache Hadoop was first released. A lot has happened since then. In this article, I’ll share what I think are five exciting data stack trends for 2022.
1. It’s time to standardize our metrics
On October 15, 2021, Drew Banin shook up the data world with his PR titled “dbt should know about metrics”. With that announcement, dbt became, in my opinion, the front runner in the race to become the data OS. A metric layer will be a major factor in making data accessible to more types of users, and it should make it easier for companies to use data for business intelligence by simplifying how queries and dashboards are built.
A metric layer is an approach to defining core business metrics with a semantic model on top of the data warehouse. This feature will allow dbt to support metric definitions as a new node type. Like exposures, metrics will be defined in YAML files and will be reproducible across applications. By defining metrics in dbt projects, analytics engineers can encode business logic in tested, version-controlled code. What is also exciting is that these metrics can be used in downstream tools, which will create a uniform set of metrics across all reporting stacks.
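To make the idea concrete, here is a minimal sketch of what "define a metric once, reuse it everywhere" could look like. It is written in plain Python rather than dbt YAML, and every name in it (the `Metric` class, its fields, `compile_metric`, the `analytics.orders` table) is my own assumption for illustration, not dbt's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """A hypothetical metric definition; field names are illustrative, not dbt's schema."""
    name: str
    table: str          # warehouse table or model the metric is computed from
    aggregation: str    # e.g. "sum", "count", "avg"
    expression: str     # column or SQL expression being aggregated
    timestamp: str      # column used to bucket the metric over time
    dimensions: tuple = ()  # dimensions the metric may be sliced by


ALLOWED_AGGREGATIONS = {"sum", "count", "avg", "min", "max"}


def compile_metric(metric: Metric, grain: str, dimensions: tuple = ()) -> str:
    """Render one metric definition into a SQL query for a given time grain."""
    if metric.aggregation not in ALLOWED_AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {metric.aggregation}")
    unknown = set(dimensions) - set(metric.dimensions)
    if unknown:
        raise ValueError(f"dimensions not declared on metric: {sorted(unknown)}")

    # The time bucket always comes first, followed by the requested dimensions.
    group_cols = [f"date_trunc('{grain}', {metric.timestamp}) AS period", *dimensions]
    select_list = ",\n    ".join(
        group_cols + [f"{metric.aggregation}({metric.expression}) AS {metric.name}"]
    )
    group_by = ", ".join(str(i) for i in range(1, len(group_cols) + 1))
    return f"SELECT\n    {select_list}\nFROM {metric.table}\nGROUP BY {group_by}"


# One definition, kept in version control, that any downstream tool could compile.
revenue = Metric(
    name="revenue",
    table="analytics.orders",
    aggregation="sum",
    expression="amount",
    timestamp="ordered_at",
    dimensions=("country", "plan"),
)

print(compile_metric(revenue, grain="month", dimensions=("country",)))
```

The point of the sketch is the shape, not the details: the business logic lives in one declared object, and a dashboard, a notebook, or a reverse ETL job would all ask the same definition for their SQL instead of each re-implementing "revenue".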
With a properly defined metrics layer, data teams can create repeatable and embeddable work that any tool or department can use in whatever way it prefers. This represents a decisive inflection point for modern data infrastructure, which we're excited to be a part of. Over the next year, we believe the metrics store will become the dominant topic of conversation around the modern data stack. We strongly believe that the metrics layer should be added to the stack for future scalability and efficiency. As more teams decide to include the metrics layer in their data stack, closed-source alternatives like Supergrain and Transform should also gain significant traction.
2. The new “data workspace” will emerge
The way we read, share and consume information has changed drastically over the years, and it has the potential to continue to do so in the future. As a result, an "all-in-one" data workspace tool might be a solution that addresses the needs of data teams today as well as how stakeholders consume data tomorrow.
The BI dashboard has been the frontend of the data stack for many years. A variety of tools are available today to share data knowledge with stakeholders outside the data team. Some tools use a notebook-style approach, some are much more technical data catalogues, some help data teams build data apps, and some help data teams create a knowledge hub for all data knowledge by combining a few tools into one. Since the BI market is so large, it seems likely that all of these approaches will be able to coexist as different views of the "data workspace".
The "data workspace" is in many ways the evolution of the Data Catalogue, the BI dashboard, and the notebook. It has a better focus on communication, it has more transparency and most importantly, it helps data teams work with stakeholders and business users to deliver data knowledge. To me, this is a natural step for anyone working in analytics or BI.
3. The reverse ETL race heats up
Stakeholders outside of the data team are becoming more data literate and, in doing so, are starting to require a different set of tools to work with data. This is partially why reverse ETL became one of the fastest-growing data categories in 2021. One of our primary predictions is that open source reverse ETL will reach the same adoption as both Hightouch and Census in 2022. This may seem like a bold claim, but it is one that we feel is backed by substantial evidence from what has already taken shape in the ETL space.
Open source tools are gaining traction in the industry and are being used to support the round trip functions of data transformations. The long tail of extractions is easier to support in an open-source format, which opens the door for a reverse ETL tool to grab significant market share. Its growth will be determined by a few factors: first, how fast organizations become data-driven; second, how well data teams are integrated within various business areas; and finally, how well vendors of Reverse ETL tools adapt to open source entering their market.
In addition to the increased competition, we believe that we could see a major acquisition in the reverse ETL space by a larger company like Twilio or Fivetran. The synergies between reverse ETL and ETL are beneficial to both parties, and we think a deal of this kind is inevitable.
4. Predictive models are coming
I’m extremely excited about how predictions are going to start improving the accuracy of metrics in the modern data stack. Continual is one company that is making it easy to maintain predictions (from customer churn to inventory forecasts) directly in your cloud data warehouse. The easier it becomes to implement predictive analytics and machine learning, the more popular they will become within the modern data stack.
The nice thing about Continual is that it’s built for modern data teams, on top of tools like dbt that they are already using. This can help data engineers leverage machine learning to drive revenue, streamline operations, and power innovative products and services without complex engineering.
Predictive models have already proven extremely useful to our data platforms, but they will be even more valuable in the future. Originally, they were created to assist businesses in making specific decisions, such as which leads are most likely to convert. Eventually, though, models will be created to direct teams and influence their behaviour by pointing them in the right direction or providing leads to act upon. The potential for improvement is vast, and we can’t wait to see what happens next. Predictive modelling is just the beginning.
5. Data operations take shape
Ironically, data teams frequently don’t have the information we need to make our own decisions and take action in a data-driven way. To make decisions about the data we provide, we need data about that data, also called metadata. For example, which tables are being relied upon the most by end-users? What is the business definition of this metric? Are any ETL pipelines delayed?
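Two of these questions can be answered with very little code once the metadata is collected. The sketch below is only an illustration under assumptions of my own: the shape of the query log, the pipeline metadata, and the function names are all hypothetical, and a real setup would pull this information from the warehouse's query history and the orchestrator.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical query-log records: (user, table queried).
query_log = [
    ("ana", "orders"), ("ben", "orders"), ("ana", "customers"),
    ("cy", "orders"), ("ben", "sessions"), ("cy", "customers"),
]

# Hypothetical pipeline metadata: last successful load and expected cadence.
pipelines = {
    "orders": {
        "last_loaded": datetime(2022, 1, 10, 6, 0),
        "expected_every": timedelta(hours=24),
    },
    "customers": {
        "last_loaded": datetime(2022, 1, 8, 6, 0),
        "expected_every": timedelta(hours=24),
    },
    "sessions": {
        "last_loaded": datetime(2022, 1, 10, 11, 0),
        "expected_every": timedelta(hours=1),
    },
}


def most_used_tables(log, top_n=3):
    """Rank tables by how many queries touched them."""
    return Counter(table for _user, table in log).most_common(top_n)


def delayed_pipelines(meta, now):
    """Return the names of pipelines whose last load is older than their expected cadence."""
    return sorted(
        name
        for name, info in meta.items()
        if now - info["last_loaded"] > info["expected_every"]
    )


now = datetime(2022, 1, 10, 12, 0)
print(most_used_tables(query_log))
print(delayed_pipelines(pipelines, now))
```

Nothing here is sophisticated, and that is the point: the hard part is usually collecting the metadata in one place, not analysing it.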
Answers to these sorts of questions are increasingly important as data is becoming a product used beyond simple reporting to power a wide surface area of applications. The operations around data are almost as important as the data itself. But most teams today are building products without process, documentation, monitoring or analytics.
Without documentation, it’s as if the data team is a product team that also handles support tickets. Without analytics, the usage and impact of your product are unknown, as are the results of your efforts. Without processes, teams will answer the same questions over and over again. Without monitoring, it will be tough to evaluate if the information being used is correct and up to date.
The first problem that better data operations can help solve is that of inbound requests. If you’ve worked in a data org for even one day, you know how relentless and disruptive these requests can be, whether they’re questions about the definition of a field, how to access a certain piece of data, a request to create an extract, or anything else. Some of these requests are important and fundamental to a data team’s purpose: to empower the rest of the organization. Many of them are unnecessary.
The second problem that better data operations can help solve is that of incident management. Whether it's a metric that's skewed or a table that's late, data can have bugs just like software. Unlike software engineers, though, data teams frequently don’t have the tools to identify incidents, diagnose the root cause, and analyze downstream impact. Tools like Metaplane.dev aim to make this possible.
We believe that improving data operations through metadata management tools, observability tools and tools that improve a data team's processes will become a top priority in 2022 as data teams continue to work to become more proactive.
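As a rough illustration of what catching "a metric that's skewed" can mean, here is a minimal sketch that compares the latest value of a metric with its recent history. The function name, the threshold of three standard deviations, and the sample numbers are all my own assumptions; this is not how Metaplane or any other observability tool actually works, and real tools account for things like seasonality and trend.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it sits more than `threshold` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is worth a look.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical daily order counts for the past week.
daily_orders = [1020, 980, 1005, 995, 1010, 990, 1000]

print(is_anomalous(daily_orders, 1008))  # an ordinary day
print(is_anomalous(daily_orders, 450))   # a sudden drop, e.g. a broken upstream load
```

A check like this would typically run right after each load, so that the data team hears about a broken metric from its own monitoring rather than from a stakeholder.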
In summary, we believe the future of the modern data stack will continue to make it easier than ever to extract even more value from your data. It’s an exciting time to be a data geek, and I’m happy to be riding this wave.