Scaling Data Quality: Harnessing Automated Testing in dbt Projects

In this webinar, dbt experts Grace Peace (TLM, Analytics Engineering at Cityblock Health), Gleb Mezhanskiy (CEO at Datafold), and Lindsay Murphy (Head of Data at Secoda) discuss best practices for effectively scaling your data testing initiatives as your dbt models grow.
Last updated: May 2, 2024

Lindsay Murphy: All right, we are live. So super excited to have everyone here today and thank you so much to our audience for joining. So for our talk today, we're going to be focused on scaling data quality and talking about how we can harness automated testing in dbt projects. My name is Lindsay Murphy. I'm the Head of Data at Secoda and I'm joined by two amazing community members here, Grace and Gleb, and I will hand it over to each of them to give them a chance to introduce themselves and then we're going to just kick things off and dive right into the discussion. So Grace, if you want to kick us off and introduce yourself, tell us a little bit about your role, and how data quality affects you. 

Grace Peace: Sure. I forgot to do one thing in preparation for this, which was to refresh myself on my intro. My name is Grace Peace. I've been at Cityblock for a couple of years. My background, I started my career in data in healthcare and have since bounced around a little in banking and even Major League Baseball, and I found myself back in healthcare. It's an extremely interesting challenge. We have a team of about 10 analytics engineers who support the analytics environment and data here at Cityblock, and a huge data team even outside of that, and we're continually doing new things, facing new challenges, making new mistakes. So, I'm glad I'm here.

Lindsay Murphy: Awesome. And Gleb, over to you.

Gleb Mezhanskiy: Hi everyone, I'm Gleb Mezhanskiy. I am CEO and cofounder of Datafold. We help data teams using dbt automate testing of their code. Before Datafold, I was a data engineer pretty much my entire career at companies including Autodesk and Lyft. And the inspiration for starting Datafold was me blowing up Lyft's entire data platform by committing a three-line hotfix into one of our core models when I was the on-call engineer at 3 a.m. So, if you want to learn more about that, you can find me on LinkedIn.

Lindsay Murphy: I've heard part of that story before, it's a good one. I skipped myself! I'm Lindsay Murphy. As I've said, I'm the Head of Data at Secoda. Secoda is focused on helping data teams and data consumers discover their data with AI-powered features. So definitely check us out. I joined the team recently, about two months ago. I've launched a data team at a startup from scratch, and I've also led a team of analytics engineers at my last company. So I am also very familiar with data quality challenges myself, and especially scaling them. Super excited to talk a bit about this. I'm also an instructor with CoRise, and I will be teaching a course in a few weeks. It's called Advanced dbt. So definitely check that out as well if you're looking to increase your skills there.

Alright, let's dive into the topic. The first thing we want to talk about is just defining data quality. This is a pretty big topic, I think it covers a lot of different areas, and it means a lot of different things to different people. So why don't we just set the stage and talk about what it means to each of us, and what we want to focus on today. So maybe Gleb, I'll start with you since this is the focus of your company and what you spend your day on. What does data quality mean to you?

Gleb Mezhanskiy: I'm really curious to hear Grace's answer for this, but I can take a first step. I think that data quality is really hard to define. Unlike software quality, where there's typically a very clear expectation from the user of what the software should be doing, and sometimes quality can be expressed in unit tests, so the function should be doing this or that, in data it's much more subtle because the inputs are constantly changing. The data is always changing with the business. And in general, we're dealing with, I would say, much more complexity in the data domain than in software. And so I found that it's really hard to map software quality to data quality. The closest I've come to a definition of data quality is probably maintaining the high trust of the data user. Sometimes the data user is an executive looking at a dashboard. Sometimes it is a sales team looking at data that was imported into their system, like Salesforce, from the warehouse. And sometimes it is the end user of the product looking at some analytics in the app. So it can come in multiple forms.

I think that trust is probably the most important metric to optimize for. And then to maintain trust, you have multiple other things you want to optimize. When it comes to data quality, one of the least obvious ones is consistency, because no one knows what the actual baseline truth is. What is really important is that whatever we are providing to our end users stays consistent over time, because even if we have a slight bias in the actual metric we're providing, as long as that stays unchanged, the business feels like they can rely on it. But if we're changing definitions and the business doesn't know about it, even if those changes are for the better, they can actually be quite detrimental to trust and therefore to data-driven decision making. So that's my attempt, but I'm really curious to hear Grace, because she's actually maintaining, well, optimizing, data quality in the day to day of her job.

Grace Peace: I agree very much with your definition. I want to add one other demographic to those who need to trust the data, and that's the analysts, who work with the data that we transform and consume from our payers and from other third-party sources. The analysts are on the hook for what they present to the business and are therefore definitely our toughest customers, and one of our biggest goals is to know about a data problem before the analysts do. We definitely have to know about it before the executives do, but also before the analysts do, because they'll find it. And then they have to wait who knows how long for us to fix it if we don't fix it proactively.

I've always considered my definition of data quality to be visibility into the state of the data. It's not fixing data. It's letting everyone know the state of the data down to the tiniest detail as much as possible. And we struggled a little bit with where to draw the line. We don't really want to over-engineer solutions or overfeed information on quality to our users, but trust is the end goal. The data may not be right, but you know which pieces of it you can trust.

Lindsay Murphy: Yeah, I think that's a big one. It's apparent there are a couple of themes, and these are things that I've been chatting about in other webinars as well, but it sounds like a lot of it goes back to definitions and descriptions of things and just getting alignment at the business level. Many, many times as a data team you get questions like, hey, these numbers don't match, why don't these numbers match, and which numbers do I trust? And I feel like 80% of the time, it comes down to a definitional difference. It's maybe not even that the data is bad or there's really anything underlying happening. It's just that the difference between the two things is a definitional one. I'm sure, Grace, that probably applies a lot to the analysts as well, if you're creating data assets for them and they're not fully aware of how something is measured or how it's pulled. That feels like data quality, and it's part of it, but to me it feels a little bit more like definitional and business alignment.

Grace Peace: Yes. 

Lindsay Murphy: One of the other things that I heard was finding the issues before they make it to the stakeholders. One of the things that I've noticed in the teams that I've managed is that we did things very reactively with data quality. We kind of waited until something was broken and then we had an alert or a test or something that would fail. So I guess when we talk about data quality, I think a lot of times people think about data tests failing. But maybe, Gleb or Grace, talk a little bit about your thoughts on how you find issues before they happen, and how we might think about data quality in a different sense.

Grace Peace: I can tell you the way that we approach it is we test as far upstream as possible. And then we only test again if we transform: join, union, filter, things like that. And we have implemented gates at different stages; some are still being implemented, others we already have in place. We do tests on raw files before we even put them anywhere for any code to consume them, and they get moved to one of two places: quarantine, or the next step. And that's pretty basic. I think most companies do this, but we also provide reporting on that, depending on what's wrong with it. We try to route the information to the vendor that provided us the data, and we also let others know: heads up, today there's an issue with such-and-such file, this is what we're doing about it, this is the ETA. Which, again, seems kind of basic, but it's leaps and bounds ahead of where we were a year ago, where the jobs just ran. And sometimes they would run okay because the data looked fine to the jobs, but maybe the birth dates indicated that all of our patients were 200 years old, and we didn't see it, but of course the analysts did, or worse, the rest of the business. So we're putting more tests in further upstream.

And in some cases, we still let the data through even if there's a problem. Maybe one part of the business has decided the data shouldn't go through if there's an issue, but the rest of the business wants to see the rest of the data that came in. So we're making very deliberate decisions about whether we're still going to let the data through, what we're going to tell people, and who cares about it: just so you know, 50% of the values in this column are wrong in this file. It's still in there, but don't count on it. We're working on it. So, far upstream.
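In dbt terms, the closest analogue to this kind of upstream gate is testing sources before anything downstream builds. Below is a minimal sketch, assuming a hypothetical payer source and the dbt_utils package; the source, column, and threshold values are invented, not Cityblock's actual setup.

```yaml
# models/staging/payer/_payer_sources.yml (hypothetical names)
version: 2

sources:
  - name: payer_raw
    schema: raw_payer_files
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 24, period: hour}
      error_after: {count: 72, period: hour}
    tables:
      - name: member_eligibility
        columns:
          - name: member_id
            tests:
              - not_null:
                  config:
                    severity: error   # bad enough to stop downstream builds
          - name: birth_date
            tests:
              - dbt_utils.accepted_range:   # assumes the dbt_utils package is installed
                  min_value: "cast('1900-01-01' as date)"
                  max_value: "current_date"
                  config:
                    severity: warn    # surface impossible ages without blocking the load
```

Tests set to warn surface issues like the 200-year-old birth dates without blocking the load, while error-level tests act as the gate.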

Gleb Mezhanskiy: Yeah, it's maybe also helpful to think about data quality in two dimensions. One of the dimensions is essentially the flow of data. So for example, you start with raw data like events or production data exports, and then you go into warehouse transformations, for example with dbt. And then you have a consumption layer where you sync data to your business apps, you consume it with your dashboards, you train machine learning models. And ideally, you catch issues as early as possible; as Grace described, you try to catch as many errors as possible in the upstream layers.

And then the second dimension is essentially the workflow of the data team. We write code, and we typically have a lot of complexity in the business logic that gets expressed in dbt and SQL; sometimes it ends up further downstream, but that's an anti-pattern. And that is essentially a data engineer's workflow: you start with developing code, then you check it into a repository like GitHub, and then it gets into production. So that's the second dimension. And just like Grace said, you want to detect things earlier in your pipeline in production, but you also want to try to detect issues as early as possible in the developer workflow. What that means is, ideally, you don't let any mistakes that get introduced in the form of SQL bugs or unintended changes get into production. You try to catch them as early as possible in the analytics engineer's or analyst's workflow as they develop the code. And to do that, you have to apply the testing as early as possible in the workflow. For example, if I'm developing a new model or refactoring a dbt model, how can I make sure that my changes are not impacting the business in any kind of negative way, be it a mistake, a change that goes unaccounted for, or a change that leads to some dashboard being broken? So I think that thinking about that shift left in terms of the workflow is also quite important.

Lindsay Murphy: Yeah, for sure. It's kind of that shift from like the reactive to sort of like a proactive approach. It's like trying to prevent making mistakes rather than waiting for a mistake you've made to alert you and then you have to fix it. That's great. That kind of leads us into our next topic, which is perfect, which is types of data tests and use cases. So maybe we want to get into the nitty gritty here a little bit and just talk a little bit more about what are some of the different types of data tests? What are the use cases that you would apply those for? And maybe, Grace, if you want to start us off in terms of how your team thinks about this?

Grace Peace: We try to use pretty much the full built-in dbt test suite, and we also have several custom macros that we've written. And then there's also testing that we do as part of development and pull request and review processes. At the very earliest stages with the raw files, we do basic tests, like determining whether or not a file that comes in matches a name format that we're expecting. If not, we send a general alert that we're looking into this mystery file. Then we do tests on the file itself for schema and data types, etc. I think what most people are interested in are the tests we do later. So in dbt, we actually have pre-commit checks. We use pre-commit checks on our PRs to require at least one unique test on every new model. There are other test requirements depending on what type of data we're working with. We use tags on tests extremely liberally to figure out how to route issues, failures, or warns from tests, and it also helps us aggregate or filter on dashboards as far as test results or things that are going on.
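A hypothetical schema.yml sketch of what tag-heavy column tests like these can look like in a dbt project; the model, column, values, and tag names are invented, not the conventions Grace's team actually uses.

```yaml
# models/marts/claims/_claims_models.yml (hypothetical names)
version: 2

models:
  - name: fct_member_claims
    columns:
      - name: claim_id
        tests:
          - unique:
              config:
                tags: ['claims', 'route_to_oncall']   # tags drive alert routing and dashboard filters
          - not_null:
              config:
                tags: ['claims', 'route_to_oncall']
      - name: payer_code
        tests:
          - accepted_values:
              values: ['payer_a', 'payer_b', 'payer_c']
              config:
                tags: ['claims', 'route_to_vendor_team']
                severity: warn   # report it and route it, but don't block the build
```

Tagged tests can then be run or triaged selectively, for example `dbt test --select tag:route_to_oncall`, and the tags show up in the test metadata that downstream reporting parses.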

I would say the canned tests that come with dbt, the not null, the unique, the referential, are pretty much what we use. That's the meat of the testing that we do in dbt, and then we analyze it all downstream later. We capture the manifest and run results. It's not all built in for us because we're still on Core, not Cloud, so we built something that parses that out for us. We also add a development layer with Datafold. We rely very heavily on the data diff feature from Datafold. It's on our PRs, it's triggered automatically, and it's a required check on all of our PRs. It is extremely common for a reviewer to look at the data diffs, see that a downstream model has 50% fewer records than it did before the code change, and ask about it if it's not already explained in the PR description. Some people are learning that when they create a PR, they need to just go ahead and explain that upfront: I'm expecting this to go way down because of blah blah blah. And then further on, in change approval board reviews for large changes, I'm often the one they don't want at the meeting because I'm always asking about the data diff: I don't see anything in your PR about this data diff, I don't see anything in your ticket saying you're expecting records to go down, so can you explain this? So those are the types of tests that we use.

Lindsay Murphy: That's very comprehensive. And Gleb, in your view, what are some categories or ways that you think about data tests and the use cases they unlock?

Gleb Mezhanskiy: Yeah, first of all, I should comment that I think the rigor of testing and the depth of the change management process that Grace's team introduced is actually very advanced. She mentioned the change review boards, and that just speaks to how seriously the organization approaches the change management process, which I think is very, very interesting. I actually tried to put together a taxonomy of tests within the analytics engineering workflow, and what I arrived at is that there are probably three main types of tests that you can run. And I assume that we're talking about testing within the developer workflow, before the code change goes into production.

So basically, there are assertion tests, where you say, well, my data should look like this for a particular model or for a particular column. And those tests can be done on the live data, meaning that you basically query the actual data in production and you validate that against the expectation and that's what dbt ships with. That's the dbt test framework. And then there are mock tests where you essentially do the same but instead of using the data in production, you use a predefined input data set and output data set which is effectively similar to unit testing in software but you do this for SQL. And that is currently not available in dbt out of the box but there are a couple of community packages that can be used for that. 

And the cool thing, or I think the important distinction, is that assertion tests are really easy to write, because you just say, well, this dimension should always take one of these five values. That is easy to express. The disadvantage is that these tests can be quite noisy because they depend on the input data, which can also change, and they can sometimes take a long time to run because you are actually querying production data. The mock tests are probably the purest way of testing, but their disadvantage is that you have to curate input and output datasets, just like in unit tests. They can be extremely powerful when you have to test some really business-critical logic, and they are great because you separate the data from the actual code: regardless of how the data looks, you always get the same output. But the disadvantage is that you have to curate them and invest much more time in them.
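To make the mock-test idea concrete, here is a minimal, package-free sketch written as a dbt singular test. The business rule, table, and expected values are all hypothetical; the community unit-testing packages Gleb mentions do essentially this while injecting the mocked inputs into the real model instead of restating its logic.

```sql
-- tests/unit_completed_order_revenue.sql (hypothetical logic and names)
-- A dbt singular test fails if the query returns any rows, so comparing the
-- output of the logic under test against an expected output with EXCEPT in
-- both directions behaves like a unit test.
-- Note: BigQuery spells the set operator EXCEPT DISTINCT.

with mock_orders as (
    select 1 as order_id, 'completed' as status, 100 as amount
    union all
    select 2 as order_id, 'cancelled' as status, 40 as amount
),

actual as (
    -- the business rule under test: only completed orders count toward revenue
    select sum(amount) as total_revenue
    from mock_orders
    where status = 'completed'
),

expected as (
    select 100 as total_revenue
)

-- any row surviving either direction of the comparison fails the test
(
    select * from actual
    except
    select * from expected
)
union all
(
    select * from expected
    except
    select * from actual
)
```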

And I think in general, the assertion tests are a great way to test the most important assumptions about the business that your team has. The challenge is that they're not really scalable, because in a world where you have hundreds or thousands of models in dbt, each with, let's say on average, 20 or 30 columns, you can't really cover a meaningful enough percentage of that data with assertion tests, just because you have to curate and write them.

And then the third type of test, which Grace mentioned, is data diff, which essentially comes without any expectations of how the data should look. It reverses the problem and helps you understand how the code change the developer introduces actually affects the data that is produced by that code. It tells you how your change is going to impact the data in the model you're changing, and then in downstream models, potentially even your BI tools. And that is interesting because it pretty much gives you a preview of the change and allows you to understand the full effect of the code modification without having to define any tests in advance. So if I were to think about what a comprehensive strategy for a data team looks like, I would probably suggest using assertion tests like dbt tests, and mock tests, for the most important parts of the business logic. Like, when you know that you have four product types and you want to make sure that there is no fifth one introduced by mistake, because that could throw off dashboards or ML models. That should always be done. But then data diff is the type of testing that can provide coverage for the long tail of issues that you don't cover in assertion tests, because it just catches every single modification, and having that as part of the pull request review is quite powerful because it helps catch the long tail of potential big and small mistakes that can make it into production.

So overall, I would say there are probably three types of tests, dbt tests, mock tests, and data diff, that are very powerful in the tool belt for analytics engineers coming up with a comprehensive strategy for testing as part of the developer workflow. And then in production, as Grace mentioned, you can also run some of these tests to see whether your assumptions hold as the data gets rebuilt every day.

Grace Peace: Yeah. 

Lindsay Murphy: Yeah. Oh, sorry, Grace. Go ahead.

Grace Peace: I was going to say, it's been huge since we deployed data diff not to have to set up our own comparisons, you know, select count from this, count from that, select these distinct values, and except clauses. We don't have to set any of that up in SQL anymore. It's just done as part of the PR and it's great. Now, I also want to call out that they recently came out with an inline plugin for Visual Studio Code, so we can do data diffs before we even commit or do a PR and just see, as we code, what's going on with our changes, and that's huge as well.
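For readers who haven't written these by hand, the comparisons Grace is describing look roughly like the following; the schema and table names are hypothetical, a production schema compared against a dev build of the same model.

```sql
-- 1. Row counts on each side
select 'prod' as side, count(*) as row_count from analytics_prod.fct_claims
union all
select 'dev'  as side, count(*) as row_count from analytics_dev.fct_claims;

-- 2. Values of a dimension that appear on one side but not the other
select distinct claim_status from analytics_prod.fct_claims
except
select distinct claim_status from analytics_dev.fct_claims;

-- 3. Primary keys present in prod but missing from the dev build
select claim_id from analytics_prod.fct_claims
except
select claim_id from analytics_dev.fct_claims;
```

A diff tool automates this style of comparison across columns and primary keys for the changed model and its downstream dependents, and surfaces the summary on the PR, which is what Grace's reviewers are reading.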

Lindsay Murphy: Yeah, that's definitely a huge time savings. I have also spent a lot of time comparing SQL statements to try and figure out what's broken or what's wrong. I have not been a data diff user yet, but it's on my list to try out. So yeah, I can imagine that saves a ton of time, because that's very tedious. Awesome. So I'm picking up on some best practices. I'm hearing a couple of things, so we can jump to the next topic, which I think is going to be a pretty meaty one. Grace, it sounds like you've been pretty pivotal in moving your team forward, and the company in general, in terms of data quality. So what has been your experience scaling data quality with your team and across the company, and what are some of the best practices that you try to follow that you think are helping?

Grace Peace: Well, I can tell you, as one of the leaders of this team, I got really tired of being embarrassed, so I begged to create a data quality team and was allowed to do that. So rather than telling you best practices, I can tell you the things I wish I had done differently, because I made a lot of mistakes standing it up, and some things that we can continue to do as we add new things. And forgive me, I'm going to be looking at my notes because I want to make sure I hit all the good notes on that. One of the biggest things, if you're starting from scratch in revamping or setting up a dedicated data quality initiative, is to put the right people on it. Not everybody is right for this; the smartest people aren't necessarily the best people for it, because smart means so many different things. You're going to want people who are really good at anticipating gaps and people who jam on projects that have no blueprint, because no matter what, you're going to be figuring it out from scratch for your organization.

You want somebody who wants to listen to problems, wants to listen to people complain, because that's the exact problem that you're trying to solve. And you need to dedicate a product manager or project manager, because if you put the type of people I just described on this, they're not going to be good at that piece, and you don't want them to have to worry about that stuff. You want to make really good friends with DevOps, find out their favorite liquor or favorite beer or whatever, because they're going to be supporting you way more than you anticipate. And you need to set up the downstream accountability for all of what you're building, because you can build all of this visibility into your data quality, into your raw file quality, into your model quality, and it will mean absolutely nothing if nobody's watching it or doing anything about it, and you don't have someone to send those issues to, to address them, do coaching, head it off, trend it to figure out what we can do about it, those kinds of things.

And my last few things are to enlist stakeholder participation early. We didn't do that. We just came up with our blue-sky plan on how we wanted to set it up and then asked people later, and it was like, it's cool, but I really wish that we could see this or do this or whatever. We're doing a lot of that now, but I wish we had started out of the gate. It would have saved us a lot of time. And then build in time to experiment, because we didn't know what any of this would look like. We didn't know how a lot of this stuff would work and what it was capable of. And sometimes, we just had to leave really neat stuff on the floor and come back for it later because we didn't build in time for it. So I don't know if that counts as best practices for scaling, but it is best practices for implementing.

Lindsay Murphy: Yeah.

Grace Peace: We're scaling now, so.

Lindsay Murphy: You mentioned that you were able to get buy-in for the team. What did that look like? Did you have to build a business case, or had there been enough issues that it was an easy discussion?

Grace Peace: I chose my timing well. 

Lindsay Murphy: Yeah.

Grace Peace: I would plant little seeds whenever somebody way up there was really upset about something being wrong, or we were having a particularly hard time with a specific data source or payer. I would say, it would be awesome if we had time to spend on this, or whatever. And I would just plant those seeds periodically, and then something blew up and I said, can we please just think about doing this? And it turned into a conversation.

Lindsay Murphy: Yeah. Sometimes, unfortunately, we have to wait until something goes wrong and it's affected enough people, but yeah. And I'm interested, too, in how the upstream accountability of the sources affects your team. On a lot of analyst teams that I've been on, maybe the engineering team will change the schema and break all of our models, or, we had an instance of an engineering team just changing something in Jira, which changed how our Jira models were set up and ended up breaking things. So sometimes, those things feel out of the control of the data team. So I'm curious how you manage some of those challenges and how they get worse as the team gets bigger.

Grace Peace: Visibility is key. Putting visibility on it, putting measurements in place so that everyone can see: this is what broke, this is where it broke, this is why. So we have that for some other in-house engineering teams. They are also innovating and doing things to make our system of record better, which sometimes breaks our downstream jobs that consume that data. We also have payers who really don't care whether or not we have a hard time with their files. Sometimes it's, you get what you get: we might decide to resend a file because we made a mistake, and we're not going to tell you when we're resending it, we're just going to drop a new file and you have to figure out why. Or, we might add a column because some other company asked for it, and we only run one format of files and give it to everybody, or whatever. And it was breaking our jobs and we didn't have any visibility around it.

And now, we can say, almost in the moment: this is what happened. X person, please contact this payer, or X person on this engineering team, please fix this, revert it, whatever. And it's very visible. And everybody in our company is extremely helpful, extremely friendly, and extremely collaborative. When things get broken from in-house data sources, it's only because somebody was trying to solve a big problem for somebody, and they react really quickly. We're lucky like that. There's not a lot of ego or any fighting or anything like that. So more than anything, it's just helped everybody collaborate better. And then as far as vendors, it's helped put some muscle behind the vendor implementation teams, because the business now sees the direct impact of a payer not cooperating on agreed-upon file types. And it makes that conversation go more smoothly when we can point out exactly what the impact of this is.

Lindsay Murphy: Yeah, that's great. It's that transparency drives accountability, but also ownership and understanding of the problem, which was super important. Gleb, any thoughts to add there in terms of your thoughts on best practices?

Gleb Mezhanskiy: Yeah, I've gotten to work with quite a few teams over the past few years, and also from my prior experience, definitely a huge plus one to everything that Grace said about putting together a task force, and I think she has a lot of wisdom in terms of how to structure that team and collaborate with stakeholders. From a more technical and process standpoint, some of the interesting things that I've seen successful companies do: for example, it's helpful to frame the question of data quality as a tradeoff between speed and quality. Because at first, oftentimes, whatever you introduce as the process that helps you test data slows down the speed of your team, and you think of it as a tradeoff. But what I've also seen is that the most successful teams are able to get onto a completely different curve where, for the same level of speed and team velocity, they actually get better quality. Think of that as the goal for the data quality process and data quality task force, because with the right tools and the right process, you don't have to sacrifice velocity for better quality.

And then the other best practice, I would say, is that automation is really key. It's really hard to mandate any kind of process unless you have automatic gate checks. For example, Grace mentioned pre-commit hooks. I think that is a really powerful one, as is anything that you run as part of the continuous integration process for every pull or merge request to your code base, be it dbt tests or data diff or a SQL linter. That automation is really, really key to a successful data quality practice because it levels the playing field for everyone. Whether you're a staff analytics engineer or a junior analyst, everyone gets the same checks automatically run on their work, and that prevents a lot of human error but also eases the burden of testing things for both the person doing the work and the person reviewing it.

And automation can get really advanced. Once you get to a lot of these pre-commit hooks and data diff and all of this stuff, there's a lot going on, so I think the really key part is to start by automating something. It can be just making sure that you build your dbt project, maybe just do a dbt compile. That's it. That's part of your CI. And once you have that, maybe add some tests. And once you have that, maybe add data diff, instead of trying to automate everything at once. Because once you have this automation in place, and people on your team and stakeholders get accustomed to the fact that there are certain gate checks happening for every change, then it's really easy to layer things on. And it's also easier to add restrictions once you have the automation in place. If you start with too many restrictions, for example, every pull request should add five new tests, it becomes a very high barrier for people to really buy into the new process. But if you start very subtly and automate maybe one piece of your entire data quality process, and then layer more things on top, it's easier to get the entire team and organization bought into the whole practice. And I think as data professionals, we didn't really invent this. There's a lot of wisdom in DevOps and SRE practice from software in terms of the main principles we can borrow.
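A minimal sketch of that "automate one thing first, then layer" idea, assuming GitHub Actions and dbt Core; the adapter, profile location, and secret names are placeholders for whatever a project actually uses.

```yaml
# .github/workflows/dbt_ci.yml (hypothetical setup)
name: dbt CI

on:
  pull_request:

jobs:
  compile:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-snowflake        # swap in your warehouse adapter
      - run: dbt deps
      # Step 1: the only gate at first, prove the project parses and compiles
      - run: dbt compile --profiles-dir ./ci
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
      # Step 2 (layer on later): build and test only the models touched by the PR
      # - run: dbt build --select state:modified+ --defer --state ./prod-artifacts --profiles-dir ./ci
      # Step 3 (layer on later): data diff, SQL linting, etc. as additional checks
```

Once the compile gate has been in place for a while, the commented-out steps can be enabled one at a time, which mirrors the layering Gleb describes.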

Lindsay Murphy: Yeah, for sure. I think there are lots of areas we can go back to and bring into data, to an extent, like what you said about unit testing. Unit testing for data is so different from unit testing in software. Some of the things I can add here: I was on a much smaller team than in your experience, but when I came into a team that was already built, we had some very basic data quality initiatives. It was kind of like, for every model, these are the tests that you should add, but they were almost so simple that they didn't scale properly. Something we realized over time was that we were probably adding too many assertion tests, and in some cases those tests might be redundant. Testing at least the primary key on every model makes sense, but if you haven't done any transformations between a staging and a core model, maybe you don't need to run the same tests on both models.

Those are things that we were learning over time as we looked at our test count kind of going up and seeing it grow several times over. I think it's really that concept of balancing coverage with quality and making sure that you're not overdoing it, if you will, on some of the assertion tests. And then, yeah, just to echo your point, Grace, on leveraging metadata as much as possible to organize your tests and have that be enforced with pre-commit hooks. I think it's definitely something that teams should be looking at as they're moving out of the infancy phase of their project into larger projects.
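One way to enforce that kind of test-coverage policy is with a pre-commit config. The sketch below points at the community dbt-checkpoint project (formerly pre-commit-dbt); the hook IDs, argument, and revision shown are from memory and should be verified against that project's docs before use.

```yaml
# .pre-commit-config.yaml (hook IDs and revision are assumptions, verify before use)
repos:
  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint
    rev: v1.1.0   # placeholder, pin to a real release
    hooks:
      - id: check-model-has-tests
        args: ["--test-cnt", "1", "--"]   # require at least one test on each changed model
      - id: check-model-has-description
```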

Awesome. All right, so we can kick it off to the next topic, I think. Now we're going to look into the future, maybe. So I'm curious to get your thoughts. Gleb, maybe we could start with you. What do you think happens over the next 6 to 12 months? There's so much going on in the data industry with LLMs and all the other things that are coming up, like data contracts and data mesh. What are your thoughts on where data quality will be in the next 6 to 12 months, and what do you think data teams should be focused on?

Gleb Mezhanskiy: Yeah, I think there are a few major trends. One is probably the decentralization of the data platform, which is what I think the data mesh concept is about, and I think what's also interesting is seeing how major platforms like dbt embrace it with support for multiple projects, support for cross-project references, and then contracts and versions. It basically speaks to the fact that the conversation about whether the data team should be centralized or embedded in a matrix structure was settled over the last few years, and I think now it's also clear that the actual code base and the data assets themselves will have decentralized ownership, the infrastructure should support that, and whatever policies we use should probably embrace that as well.

The other trend, which is kind of related to the AI and LLM advances: I think it became much easier to create things with modern infrastructure, to write queries that run fast, to package SQL into models with dbt. It's easy to do dashboarding with modern BI tools. But because we're dealing with such enormous complexity, which was made possible by these advances in the foundational platforms, the new challenge is how we help individual engineers, analytics engineers, and analysts manage that complexity. How do you work effectively when you have thousands of dbt models and thousands of dashboards downstream of that? I think the next big advancement in tooling will probably be around quality-of-life improvements for practitioners, part of which is around helping people write code more easily with LLMs.

But I think the really big productivity improvement will come from having a lot of the metadata supplied into the workflow, helping developers understand their work and its dependencies better as they develop. For example, Grace mentioned the VS Code extension that we built at Datafold that allows developers to see the effects of their code change on the data right in their workflow. This is one example where having the context of the impact of your work as you code is very helpful for both productivity and quality. And I think this will only become a more important topic for data teams over the next 12 months.

Lindsay Murphy: And then yeah, any thoughts on that, Grace, from your perspective?

Grace Peace: I want to echo the use of metadata. There are so many ways I want to use metadata in the pipeline that we just haven't been able to yet. For example, we have these base models in dbt, and we use pre-checks on the sources before we even kick off the base models. The base models are basically either snapshots or views into the actual sources, but we run the sources before we do anything else. We run tests on the sources.

And I would like to leverage those test results to make the initial models, the intermediate models, dynamic, so that they filter out any records that have unacceptable values, because right now in dbt, a test either fails, warns, or passes. You can't do row-by-row processing of a model, which makes a lot of sense for most applications. But there are times when we'd really like to be able to set thresholds: okay, let's load at least some of the data, and we report back to people that we left 40% of it out and here's why, or whatever. Right now we can't do that, and we knew that leveraging those test results was going to be a really expensive process. So that's where I'm thinking it will go, that's where I'm hoping it will go: the ability to use test results in real time in order to impact the rest of the data stream. It's just something that I'm thinking about a lot, and it's where I'm thinking/hoping data quality will go.
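dbt doesn't do the row-level filtering Grace is describing, but its built-in test configs do get partway toward thresholds. A sketch with hypothetical model and column names:

```yaml
# Threshold-style knobs dbt tests already have: severity, warn_if / error_if,
# and store_failures. This tolerates or flags a share of bad rows and keeps
# them queryable, but it does not filter them out of the model itself.
version: 2

models:
  - name: stg_member_eligibility
    columns:
      - name: birth_date
        tests:
          - not_null:
              config:
                warn_if: ">0"          # any bad rows produce a warning
                error_if: ">1000"      # only hard-fail past 1,000 bad rows
                store_failures: true   # persist failing rows to an audit table
```

store_failures writes the failing rows to an audit table in the warehouse, which gives you something to report back on ("here's the 40% we're worried about"), even though the rows still flow through the model.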

Lindsay Murphy: Yeah, I love that. That's a very interesting idea. I hadn't considered that before, but now I'm like, yeah, I'd like to see that, too. That's awesome. And I feel similarly about the metadata. One of the things that I was seeing a lot in my last team, and I think I'm hearing a lot in the community, is this feeling of data sprawl. It's so easy to create models and create new assets, but it's hard sometimes to have observability over all of that, and so then you've got assets floating around that you don't even know exist, and you're still responsible for the quality of those. So I'm interested to see more of: how do we alert teams when a model hasn't been queried in six months, or something like that, or even six weeks? Should we be pruning those things more? Or do we have pipelines that are used extensively that we're not really optimizing or putting effort into? That might be more on the observability piece, but I think it does come back to data quality, because it's this feeling of not really knowing what's important, where everything is, and how big your data assets, your tech stack, actually are. So yeah, excited to see how that develops.

Grace Peace: I think that's actually what we're just starting. We're dipping our toe into looking at query metadata and query data so that we can surface models that aren't getting used and which models are getting used the most. I didn't mention this, but we're actually rebuilding our data platform right now. We're kind of building a new house, and we're going to take the furniture we want with us after we install the right door locks in the house. And one of the main reasons we're doing this is because we have a lot of model sprawl, and we've been doing analysis on the existing models because we have duplicates. We have 18 different versions, sometimes, of one base model. So we're going to figure out, okay, which one does everybody use the most, so that we can figure out how to consolidate things. Yeah.

Lindsay Murphy: For sure. I think it's that piece of analytics on the analytics, which is where the metadata comes in, and I think it's a very under-focused area for a lot of teams. And then, I think the ease of the modern data stack is good to an extent, but if you don't have good guidelines or good monitoring, you can end up with a runaway problem. So yeah, definitely interested to see. And I think I would like to see more built-in features in dbt as well. Unit testing built into dbt would be nice, so you don't have to build your own solution or go get a package, those types of things.

Awesome. All right. Well, we can kick off the Q&A. I don't see any questions in the chat. If anyone has any questions, feel free to throw them in there. I have one that I'm going to throw to the group here. What do you think is the biggest, I don't want to say mistake, but maybe blind spot or most common mistake that data teams make with respect to data quality? Maybe it's something they do early on as they're scaling? Either Grace or Gleb can jump in on that.

Grace Peace: I touched on this a little bit ago, but I would say it's not making a plan for when data tests fail. When I talked about where we were a year ago, we had a ton of tests that we were running in dbt, but we weren't doing anything with them. The only time we ever knew there was an issue was if a data test was set to a fail threshold and a job broke. There may have been other tests that were set to warn, and the only time you'd know that, oh, 40% of these failed that warn, was if we went and dug through the Airflow execution logs for that day to see what they said. We weren't harnessing any of it. We weren't doing anything about any of it. People were just throwing tests in willy-nilly. And some people thought this model should not generate if this condition is not met, and other people were like, why didn't this model generate? I don't really give a damn about that field, but I need to see this data. So, have a plan for the tests that you are putting in place. Otherwise, why are you doing it?

Lindsay Murphy: Yeah, I like that. We had a similar situation at my last company, where we implemented a testing system and the test failures would go into Slack, and we'd come in as a team every morning and it would just be, who's picking this up? There wasn't really a process, and it would just fall to whoever, so sometimes things fell through the cracks. So yeah, it's one thing to go and implement everything, but you still have to have the human processes in place. Gleb, do you have anything to add there?

Gleb Mezhanskiy: Yeah, I would say the broken windows analogy is probably appropriate here. If tests fail and it's okay to merge a PR with failing tests, or it's okay to just carry on with your day with failing tests, that snowballs so quickly, and the value of tests pretty much plummets to zero. So having good test hygiene is very important, which is also what makes assertion tests not very scalable, because at some point some tests will inevitably be failing, and so optimizing for the number of tests created is probably the wrong goal, because then you'll get a lot of failures that are not all really worth the team's time to investigate.

I'd say the other, maybe conceptual, mistake that I've made in my career, and that I've seen teams make as well, is over-indexing on monitoring versus the actual developer process when you think about data quality. Monitoring is important for understanding, okay, what failed in production, what are the anomalies. But we also have to understand that a lot of these failures, unless they really are errors related to changes in the data that are completely outside of our control, are basically bugs introduced by people when they code.

And so investing in data quality from the preventative standpoint, in helping developers not introduce regressions as they develop, has an outsized impact on overall data health and quality. Not only does it minimize the breakages you have in production, it also has a lot of second-order effects, because every time something is broken in production, that is very disruptive for the team. Someone has to drop whatever they're doing and go investigate, at a time when they're not in a state of focus on that problem. They have to shift their attention and then go and investigate. Sometimes it's actually very hard to investigate, because you don't know whether this is a problem with the business, maybe our orders really did go down, or whether it's actually the logic or the infrastructure. Whereas if you embed data quality checks in the developer process, then you focus everyone's attention on these problems right when they should be focused on them. For example, the conversation around broken tests or diffs or impact happens within the code review. That is exactly when the developer and the reviewer are focused on the change, and it's the best time for them to have it. And so making sure that the team thinks about data quality proactively is probably the biggest piece of advice I have to offer.

Lindsay Murphy: Yeah, even just thinking that through, I think it gives teams the comfort to move a little bit faster, too, because what I was finding a lot of the time in my last team was that we would have really complex models that someone had authored, and then someone else is terrified to touch it because they don't understand how it works. So maybe the model was a little too complex. But I feel like with that data diff capability, you have some backup built in before you push something into production; there's testing going on around the change in the data rather than just assertion tests.

Grace Peace: I think also, Gleb reminded me, creating testing standards is really key. Coding standards are great, but testing standards also give people a really good place to start, and your SREs will thank you for it, because they're not wondering, why are all these tests on this model but not on that model, when they're very similar? Or, why did they do a not-null test on this one and no threshold test on that one, when it's the same type of thing? So testing standards give your developers confidence that they're putting the right tests in place, not too many and maybe not too few. And they also help ensure consistency.

Lindsay Murphy: Yeah, absolutely. I think that's a good one, too. And then revisit them all the time, because you can write those once, but then your situation changes as your team grows and your data changes. So it's probably something you need to revisit every so often, too.

Grace Peace: Just like coding standards. It's a living document. 

Lindsay Murphy: Yeah. Exactly. Yeah. Awesome. Well, I don't see any more questions in the chat, so maybe we can wrap things up there. Before we do, is there anything, Grace or Gleb, that either of you want to plug or share with the audience before we head off? It could be something you're doing, something you're excited about, something about your company, anything like that. A fun fact, if you like.

Gleb Mezhanskiy: A couple of things. For those who don't know, data diff is actually open source, and if you'd like to get it, just go to github.com/datafold/data-diff. We also run labs that help folks install it on their dbt projects. And I recently wrote a guide on dbt testing where I tried to make sense of all the different types of tests that you can run in dbt and where to use them, like unit tests, and also investigated different packages. So if you'd like to get a copy of that, just Slack me in the dbt Community.

Lindsay Murphy: Grace, anything for you?

Grace Peace: I don't have any solutions or code that I can share, but I do want to share that I'm really excited about a couple of things that we're doing in the future with dbt. We're going to be exploring contracts very soon, which we think will alleviate a lot of the data quality work. Going back to what we were talking about with the upstream data that we're consuming, especially from other in-house engineering teams, we're putting their export models in their own repo, and we're putting contracts on those, and they're as excited about it as we are, which is going to eliminate a lot of that issue where their code changes break our jobs. We're really excited to explore that, and we're also moving to Cloud very soon, we hope. We're in the mid-to-late exploration stages and trying to sell it to leadership and our data org, so we're having to start over on making the case for that, but we're optimistic. We're looking forward to a lot more of the built-in features that dbt Cloud has for data quality. We've learned a lot by growing our own, but we're ready to move to some more sophisticated, tried and true methods that someone else supports.

Lindsay Murphy: Yeah. That's awesome. I will be excited to hear about it. We'll have to keep in touch to hear how it turns out with your engineering teams.

Gleb Mezhanskiy: Yeah, that's very exciting. 

Grace Peace: Mm-hm. Yeah.

Gleb Mezhanskiy: Yeah, I have another plug. Sorry, I just remembered. 

Lindsay Murphy: Sure. Yeah, absolutely. 

Gleb Mezhanskiy: We are hosting a meetup on running data migrations in two weeks, on July 27th. So if you are migrating to a new warehouse or migrating to dbt, come and join the conversation. I spent two years of my life migrating Lyft off Redshift, and that was a very painful experience. So I'm happy to tell horror stories.

Lindsay Murphy: Nice, nice. Awesome, and I'll wrap it up with a quick plug as well. As I mentioned at the beginning of the session, I'm launching a course with CoRise in a few weeks. So if you'd like to learn more about actually implementing any of the stuff we talked about today, and get advice on how to get into the weeds, that course will cover it, and Gleb's actually going to be a guest speaker with us. So if you want to hear a little bit more from him and about how to get into the weeds of scaling your data tests, feel free to look it up at CoRise. Would love to have you in the course.

Grace Peace: And I'll second that plug. A couple members of my team actually found this course on their own and came and brought it to me wanting to attend so we're trying to get that approved. We're excited. 

Lindsay Murphy: Yeah, awesome. Super excited to get the ball rolling on that. Awesome. Well, thank you very, very much, Grace and Gleb. This was an awesome session. So great to learn from you both and hopefully, everyone has a wonderful day and thanks to our audience for joining us.

Gleb Mezhanskiy: Thanks so much, Lindsay, for having us.

Grace Peace: Thank you everyone. I enjoyed it.

Lindsay Murphy: All right. Bye.

Gleb Mezhanskiy: Bye-bye.
