Implementing Data Contracts with dbt and across the MDS

In this session, Chad Sanderson (Chief Operator of Data Quality Camp) and Lindsay Murphy (CoRise instructor and Head of Data at Secoda) discuss the benefits of data contracts and how to implement them with dbt. This talk is for advanced data practitioners looking for ways to manage their dbt projects at scale, as well as data leaders looking for actionable strategies to implement data contracts.

Lindsay: Hello, everyone. Thanks so much for joining us today. I'm super excited to be joined by Chad Sanderson. Today, we're going to be talking about implementing data contracts. So we're going to go deep on the topic, talk all about different things to do with data contracts. I'm super excited to be here with you today, Chad. If anyone hasn't joined one of our live streams before, I'm Lindsay Murphy. I'm the head of data at Secoda. In the near future, I will be an instructor with CoRise as well, where Chad will be joining me for an advanced dbt course as a guest speaker. So what we're going to talk about today is also something that we'll be teaching in the course. So if you're interested, I can throw some more information in the comments. And from there, Chad, I'll turn it over to you if you want to do a quick introduction, and we can talk a little bit about how you got into data contracts.

Chad: Thank you, Lindsay. I'm Chad. I spent the last three years managing the data platform team at a company based in Seattle called Convoy. Convoy is a B2B freight technology marketplace with two sides: it serves shippers, the big companies that are trying to move freight all around the country, and carriers, the truckers who own trucking businesses and actually move that freight. Data was essential to Convoy. Our big value proposition as a company was based on how intelligent we could be with the data we were getting from routes and batches and contracts, and things like that. So it was really essential that the machine learning models that ultimately sat very far downstream of many different data sources were protected and didn't change unexpectedly.

Lindsay: Awesome. And so I can imagine in that world, you were dealing a lot with data quality?

Chad: Yeah. After we had built our initial set of pipelines and infrastructure, data quality was probably our biggest issue. I would say there are really two impacts of bad data quality on us. One was the monetary impact, which I can always talk about more, but the other was the impact on the team. When you don't trust the data, there's no clear ownership of the data, and something can fail at any time, it creates this really weird situation in a company where data consumers are constantly on edge. They're not really sure when something is going to fall off a cliff, they don't know if they can trust the data or not, and because their salaries and bonuses are tied to their ability to make money for the company, if the data as an asset class can't be trusted to do its job, then they're very hesitant and unsure of themselves, and that causes a lot of interpersonal conflict.

Lindsay: I've been in those situations before, myself. So tell me a little bit about this: if folks aren't aware, you also run a community called Data Quality Camp. How did your experience at your company take you into creating that community, and then how did that lead you into data contracts?

Chad: I do have a community, it's called Data Quality Camp. You can get there by just typing dataquality.camp/slack into your browser; quick plug. I think a big part of my frustration when I was at Convoy was that there was a lot of information about how to build pipelines and how to set up a data lake. There were books and articles going all the way back to the 1980s and 1990s about how to best construct a data warehouse and data marts, and all these other things. But I didn't see anything about the best system to ensure quality that wasn't 30 or 40 years old, right? All of the recommendations were: well, you need to have a governance council, and you need to have representation in every deployment meeting, and there needs to be a data engineer in the room, and you need to follow this 10-step methodology. That was really, really hard to do, and I felt like there had to be more modern ways of managing quality in a federated working environment for software engineers. And there were; that actually was the case. It's just that a lot of that knowledge was held by people at big tech companies like Google, and Facebook, and Netflix, and I just wanted people to talk about it more openly. That's what the community is for.

So folks who come in can ask questions and say, “Hey, I'm working on this challenge of managing data quality at scale. How do people deal with it?” And they'll get 10 or 15 professionals who have worked on that problem before providing answers. So in some ways, it's like a Stack Overflow, but it's used less for very explicit software engineering problems and more for the problems that a data team might face once data starts really adding value to the business.

Lindsay: That makes sense. So a bit of a different approach, I think. What I've seen, at least in my experience, is that for a lot of data teams, data quality is a very reactive thing. You wait for something to go bad, then you get an alert and you fix it. So this is flipping the paradigm on its head a little bit to get ahead of it?

Chad: Yeah. That's exactly it. I think the reactive side is still necessary for a lot of things. You're not going to be able to prevent every single poor decision made to some code base that impacts a downstream consumer, so you need some set of reactive mechanisms in place to catch stuff. But you also need the preventative mechanisms, and there are a lot of preventative mechanisms, of which I think contracts are foundational. Even something like: data producers should probably know who they're going to impact with their changes, data consumers should have the ability to advocate for themselves on PRs that are going to break them, and you should be able to divert low-quality data in flight to a staging table or something like that. All of that falls into this category of preventative data quality, and there's just not that much written about it; it's not talked about that much in the public discourse right now.

Lindsay: Yeah, for sure. Awesome. I guess with that, maybe we can dive into the first topic, which is really around understanding data contracts a little more deeply. We've talked a little bit so far about data quality. Maybe if we take a step back, data quality might be the thousand-foot view, and as we start to narrow in on data contracts as a solution, how would you frame it? What is your standard definition of a data contract for folks who may be less familiar?

Chad: A data contract is composed of two parts: a spec and an enforcement mechanism. If you just have the spec and no enforcement, it's documentation, and if you just have the enforcement and no spec, it's a test. The contract is the combination of those two things into a single system. What that means is you're essentially describing: here is the data that I, as a consumer, actually need, here are the constraints that I need to put on the data or the metadata, so to speak, and here is the mechanism that some technical system can use to automatically take some type of preventative action. And depending on the technical system through which you are enforcing a contract, that action might look different. If I'm trying to prevent data from ever leaving a database if it doesn't meet some conformed schema definition, that's going to require one set of technologies. If I want to get rid of that data mid-flight as it moves from one Kafka topic to, I don't know, Snowflake, then it's going to be something different. And if I want to stop backwards-incompatible changes from being made to a dbt model, that would require a different set of technologies.
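To make that split concrete, here is a minimal sketch of what the spec half might look like as a YAML file. The field names and layout are purely illustrative, not a standard; the enforcement half is whatever system reads this file and acts on it.

```yaml
# Illustrative contract spec only; field names are hypothetical, not a standard.
# The "enforcement" half is whatever system reads this file and acts on it.
name: orders_contract
producer: checkout-service-team        # who agrees to uphold the contract
consumers:
  - pricing-ml                         # who depends on the data, and why it matters
schema:
  columns:
    - name: order_id
      type: string
      constraints: [not_null, unique]
    - name: order_total_usd
      type: "decimal(12,2)"
      constraints: [not_null]
sla:
  freshness: 1h                        # e.g. data must land within an hour
on_violation: fail_ci                  # or: notify_producer, quarantine, etc.
```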

Lindsay: I really like that way of describing it. I don't think I've heard that description yet, but just how you broke it into the two steps or the two pieces of spec and then an enforcement, that makes it much easier to understand conceptually, I think. I think it's one of these words that we're hearing a lot about now about data contracts, but I think for most people, it's like, what does that mean? Like, how do I actually go about implementing that? I know you and I had chatted a little bit previously and you mentioned thinking about contracts as a system. How do you consider that as... if it's an end-to-end pipeline or something that we're trying to enforce quality through, how does the data contract work as a system?

Chad: That's a really good question. I think it's worth quickly mentioning when you might want to use contracts, and once that's understood, I think it makes a lot more sense what a contract-oriented system looks like. So in my opinion (and it's actually not just my opinion), data has some significant differences from software engineering. When you are building a software product, you are typically working from some type of user-facing outcome, some user journey in your head that you're trying to facilitate. There's something operational that needs to happen. You architect your system, create a design spec, show it to people, then you go out and actually build the system, deploy it to a developer environment, and ensure that it works operationally the way you intend. From there, you push it into production, you see how real traffic flows into it, and if there are any bugs, you fix those. And if not, it just exists in perpetuity until you want to change it meaningfully in the future.

But data doesn't really work like that. Data is much more of an act of discovery. People are typically building these data products off of assets that already exist. So you've got some Postgres database or some MySQL database that's producing data, you're piping it into S3, and then people are going in and trying to figure out if there's anything meaningful there. They're constructing pipelines based on that, maybe using Airflow, maybe using dbt, and then from there, they're turning it into a production thing. And they're trying to test whether it's even useful; that might be a dashboard, maybe it's a machine learning model. If it's not useful, it gets deprecated, and if it is useful, it continues on. So the big difference is that data product management is all about iterating on the data that already exists, and that means not every data asset is going to be useful from day one, and in many cases, it's not going to be useful at all. Data contracts, with this enforcement mechanism, are designed to slow people down. That's the whole point. It's like guardrails. It still lets you deploy stuff quickly, but there is some level of governance, and all governance adds a bit of friction into the process.

So if you, all of a sudden, start adding contracts everywhere they could possibly make sense, you are going to slow down all of the data producers, whether they're data engineers building dbt models or, if you're doing this upstream, software engineers trying to ship features, and that's usually not received very well. In fact, it's typically received very, very negatively, considering that the modern software organization is built entirely around speed and flexibility. So what you want to do instead is find the use cases for contracts that actually make sense. Those are what I call the production-grade use cases. Maybe after I've gone through that iteration process, I have found a data set that adds an enormous amount of value to the business. Let's say I have a pricing model, and that pricing model is responsible for bringing $100 million a year to my company. Okay, well now every incremental lapse in data quality actually has an incremental business impact. So I can definitely justify why the upstream teams need to take ownership over that, and I need to have contracts in place throughout that entire pipeline, starting from the asset and working backwards to all the source systems. So if it's coming from, say, a Postgres database, then into a data lake, then maybe over Kafka or something like that, then into Snowflake, then through some dbt modeling, and finally into the dashboard, at every single one of those transformation steps, I need a contract, and I need the owner to agree with the consumer: “Hey, I'm going to be thoughtful about this change from now on.”

And in that way, you now have this protected asset that's treated in the same way as any other production application is. You've got great end-to-end integration testing, you've got pretty clear visibility, and you have the API, which, in effect, is the contract.

Lindsay: That makes a lot of sense. So there's a bit of a tradeoff there in the decision-making process. It's not that data contracts belong everywhere; they belong where they're justifiable and there's an actual benefit to them, because obviously, it sounds like there's a cost associated as well, which is the overhead of implementing everything.

Chad: Exactly.

Lindsay: Makes sense. You mentioned a couple common use cases. Have you come across in your work, other real-world examples of how people are using data contracts or ones that you might have used at Convoy?

Chad: Data contracts are generally very useful for data products. I know there are a lot of crazy, wacky definitions of data products out there. My personal definition is that it's just any data asset that requires production-grade quality: it has a customer, it's code, it's solving a business problem, and it's very useful. So if you've got a machine learning model, that would definitely fall under something that needs to be under contract, as would any data asset that's being consumed by a customer. We were doing this at Convoy as well. We had our shipper partners like Walmart and Target, and they would get a dashboard, and the dashboard would contain metrics about all their shipments and whether they were on time or not. That was being pulled from Snowflake because there were a lot of aggregations and a lot of analytics being performed on top of those shipments. And that needed a contract. We didn't want a customer's shipment ETA metric to suddenly nose-dive by 50%. That would be really bad. Any data that we were sharing with a customer needed to be under contract, because we didn't want them to build systems on top of our data that failed and then broke all their stuff. So that's one category of stuff: customer-facing data.

And then the other category, I would say, is internal, very high-value data use cases. I would connect it back to that earlier statement about incremental data quality: if you can say, “if my data becomes X percent more accurate, we make this many more dollars,” that's really where contracts become essential. A great example of this is anything related to accounting, especially if you have usage-based pricing and that pricing might be impacted significantly by some engineer changing a feature. The worst-case scenario is going back to a customer and saying, “Whoops, we overcharged you by 10 times the amount.” Or, probably even worse, “We undercharged you by 10 times the amount because of some feature change, and you had automatic bank payments set up,” or something like that. So that's the main use case: where you are treating data like a product and it's adding a bunch of business value, it needs a contract.

Lindsay: Yeah. I like that because it feels like a simpler solution to a varied set of problems that fall into the same category. But generally, they're often handled by different teams, and maybe people think about solving them in different ways. I've been in the boat with external reporting going to clients where we were dealing with quality issues and didn't really know how to solve them. So it was like, do we have humans check the reports before they go out? That was the worst of it, the most manual option, versus alerting or something like that, which again is only half of the data contract. Very interesting. Awesome.

So maybe we can switch gears to talk about how people actually go about implementing data contracts. Before we dive into that, if we were to list out some prerequisites in terms of tooling or other capabilities that you think teams need to be able to do this, what would that look like?

Chad: Great question. There are really only a couple of prerequisites, and this is technology agnostic; then we can talk about specific implementations, dbt being one of them. The high-level prerequisites are: I always recommend starting with schema definition and some schema registry as the foundational principle of contracts. That's just because schema and data types are some of the easiest things you can enforce in a CI/CD pipeline, where it makes the most sense. It's truly preventative, and it gives you a foundation where you can start to add on the other checks over time, some of the more complex semantic checks on values and relationships between objects and things like that. It gets so much harder to do that if you don't have the schema and data type foundation. So, some sort of registry, some place where you can say: look, this is the state of the schema, these are the data types, we decided that this contract was going to exist. And there's some mechanism where you can check those things in CI/CD.

The other thing I would say is you need the ability to error out somehow. There's some action that needs to be taken, and there are a bunch of things you can do. You can fire an exception, you can error out, or you can just inform the producer with a little notification that says, “Hey, you're about to make a change, it's going to violate the data contract, you shouldn't do that.” As long as you have some mechanism for doing that, basically those two parts together, I think you're going to be golden. And the nice thing about dbt is that it abstracts most of the implementation details of those underlying things away from you, and all you have to worry about is creating the contract itself.

Lindsay: That makes a lot of sense. I like that: setting the foundation at the schema, because that really is where everything begins. Before you start building any other assumptions on top of it, you need to make sure nothing is changing there, which tends to be where a lot of data quality issues come from: an engineering team somewhere doesn't know that the data team is using a table in a certain way, and they go and change something and it breaks an entire pipeline.

If you were to think of a basic implementation of a data contract, whether that be in dbt or across different tools, what does it look like in your experience? What are some of the tools or formats that data contracts can be written in?

Chad: I always recommend implementing the spec of the data contract as a YAML file. There are a few reasons for that. A, it's very, very flexible. B, further along the maturity curve, the spec really just operates as an abstraction layer between some consumer, who is ideally working with config, and the technology that goes out and figures out where all the constraints defined in the contract actually need to be materially enforced in the pipeline, and YAML is really good for that too. It's just a very flexible format.

And in the spec, again, I think starting with pure schema is great: schema and data types. What some organizations do is set certain requirements, like: if you're going to create a new contract file, it has to include a description. In that way, they require really good descriptions and documentation for anything that's a contract, which boosts contracts to a higher level of quality that people can depend on and trust. What we did at Convoy is we would create these YAML files and drop them into a central Git repository. From there, we had a process that would automatically run: it would read the schema, generate a proto file, drop that into a registry, and then spin up a monitor on the data asset where the contract needed to be enforced. Then, if that data set was ever changed, so let's say someone went in and was changing a Postgres database, the monitor would run, see that something under contract was changing, and reach out to the registry. We would compare the schema in their dev environment to the schema stored in the registry, and if the change was backwards incompatible, then we could take some action. In a lot of cases, we did kind of what you were referring to before: we just wanted to make people aware of the dependents of this stuff, so that the producer had the context to go and make a good decision. And in some cases, we would just fail CI straight away.
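As a rough sketch of the kind of CI-time check described here (not Convoy's actual implementation), the comparison between the dev environment's schema and the registry's copy might look something like the following. The two loader functions are stand-ins for real registry and database lookups.

```python
# Rough sketch of a CI-time contract check, in the spirit of the flow described
# above: compare the schema in the developer environment against the schema
# stored in the contract registry, and fail CI (or just warn) on
# backwards-incompatible changes. The loader functions are stand-ins for real
# registry and database lookups.
import sys


def load_registry_schema(contract_name: str) -> dict[str, str]:
    """Stand-in for reading the agreed-upon schema out of a contract registry."""
    return {"order_id": "string", "order_total_usd": "decimal(12,2)"}


def load_dev_schema(table_name: str) -> dict[str, str]:
    """Stand-in for introspecting the table's schema in the dev environment."""
    return {"order_id": "string", "order_total_usd": "float"}


def find_breaking_changes(expected: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Describe columns that were dropped or changed type relative to the contract."""
    problems = []
    for column, expected_type in expected.items():
        if column not in actual:
            problems.append(f"column '{column}' is under contract but was dropped")
        elif actual[column] != expected_type:
            problems.append(
                f"column '{column}' changed type: {expected_type} -> {actual[column]}"
            )
    return problems


if __name__ == "__main__":
    problems = find_breaking_changes(
        load_registry_schema("orders_contract"), load_dev_schema("orders")
    )
    if problems:
        # Policy decision: post this back to the PR so the producer sees the
        # downstream impact, or fail the pipeline outright.
        print("Data contract violations:\n  - " + "\n  - ".join(problems))
        sys.exit(1)
```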

Lindsay: That makes sense. So in that sense, it's a very custom solution that you built at Convoy. How are you seeing this come into the modern data stack? We know dbt recently released some features that help enforce data contracts. Do you think we're going to continue to see this added to different tools? Or do you think there's going to be a vendor that specializes in data contracts? Or do you think it'll go more the open-source route, I guess?

Chad: I think dbt's implementation is really great. And one of the reasons I think it's so great is because it's very, very simple. The data contract in dbt is implemented the same way you would implement any other test. It's a further extension of that testing framework that requires its own unit of logic, and it goes right into the YAML file of the dbt model. I would say the big difference is that the test framework in dbt is basically a constraint on the data itself, whereas the contract framework in dbt is more of a test on the metadata. It's a very simple value you add to any dbt model's YAML saying the contract is enforced, and then any time the schema or the data types change, it creates an error in the dbt build. Previously, all of that had to be assumed and inferred by looking at the code, but now it's very explicit. I think that's a really awesome starting place because it's easy and straightforward, and that level of simplicity is what leads to the adoption of these ideas.

Ultimately, I think we'll start seeing data contracts pop up in many different places. I think they will begin popping up in ETL tools. I think data catalogs are going to start introducing the concept of data contracts; in fact, I'm about 99% sure that's going to happen. Data contracts are like data lineage, in my opinion: I don't actually think they're a standalone system. I don't think you could build a platform that only does data contracts, but I do think it's a piece of a larger puzzle, just like lineage. Lineage doesn't really make a lot of sense to me as a standalone thing, but if you can operationalize lineage to do something very useful, then it becomes super valuable, and data contracts are the same way. If you can operationalize them to do something very useful, if you combine them with lineage and data cataloging and monitoring, now you have this really awesome end-to-end test framework for data that starts at the producer level, and that, I think, becomes a lot more interesting.
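For reference, the dbt mechanism described above looks roughly like this in a model's YAML file (model contracts landed around dbt Core 1.5); the model and column names below are made up for illustration.

```yaml
# e.g. models/marts/orders.yml (model and column names are illustrative)
models:
  - name: fct_orders
    config:
      contract:
        enforced: true        # dbt fails the build if the model's output
                              # doesn't match the declared columns and types
    columns:
      - name: order_id
        data_type: int
        constraints:
          - type: not_null
      - name: order_total_usd
        data_type: numeric
```

With the contract enforced, a change that drops order_id or alters its data type surfaces as a build error rather than a silent break for downstream consumers.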

Lindsay: I definitely agree with you. I think that this area of metadata is the next phase of things where it's data about the data, and then data contracts fits in as a piece of that puzzle, but there's also governance, and you mentioned lineage, and just understanding how the whole data stack fits together is super important for sure.

From there, why don't we move into best practices? I feel like best practices are still so new in this area, and probably a bit challenging to pin down, because it's all evolving so quickly. But from your perspective, and I know you've mentioned a few things as we've been talking through this, what are some of the things you would recommend to teams thinking about implementing contracts or curious about where to start?

Chad: One thing I would recommend, and this is definitely something people get wrong in my opinion: contracts are not producer-defined. Data contracts are consumer-defined and producer-owned. So it has a very similar relationship to the one a requirements document creates between a product manager and a software engineer. A product manager creates a set of requirements and reviews them with the software engineer. Ultimately, the software engineer owns the code, but you could argue that the product manager owns the actual feature itself. It's a very similar arrangement for data contracts. The reason you don't want to go the inverse route, where the producer decides what the contract should be and makes it available to the downstream consumer, is that the producer doesn't know how the data is going to be used. They have no idea what the constraints should be, what the SLA should be, what should be under contract and what shouldn't, or what has value. They lack all of that context, so the only real options for them are, number one, to put everything under contract (and most data producers don't want to do that because, like I mentioned before, it slows them down), or to put nothing under contract, and that's just not the space you want to be in. Or they put some things under contract and change the contract whenever they want, and ultimately that doesn't solve the problem for the consumer. So the consumer has to be the one going to the producer and saying, “Hey, here are the tier zero or tier one data assets that need to be under contract, here are the constraints, and here's why those constraints are very useful for me.” That's one thing.

The other thing I would say is that different companies are at different levels of maturity in how ready they are for contracts. There are some companies that have built incredibly high-value data systems, and the whole business essentially knows they need a contract in place. Like, oh, I've got a fraud detection model and it's saving me a billion dollars a year; I think pretty much everybody gets that you need contracts in place there. It's just a necessity. But if your company's not quite at that level... I've talked to a bunch of folks where the data producers have no clue how the data is even being used. In a lot of cases, they don't know the data is being used at all; they're just maintaining some database for an application, and then they find out two and a half years later that someone built some very useful dashboard that the CEO looks at, and then it broke and the CEO got mad. So you can use the contract as a way to generate awareness first and foremost, before you move to breaking things. I think breaking pipelines is pretty hardcore; software engineers usually don't like that if they don't understand why they're being blocked. But if you're just informing people of what they're about to break and who has a dependency on this thing, that actually goes way better, and it starts to bring people together to collaborate more about the data.

When we started doing that at Convoy, we started seeing so many more conversations from producers to consumers about things like, “Hey, I'm making a change. Is this going to impact you?” Or, “Hey, I'm going to be shipping some new data. Are you going to use this?” One of the other things that's very common today is that because data producers get scared, instead of deprecating stuff that exists in their databases, they just constantly add new columns, and that's not very helpful for a data consumer either, because you might be using old, outdated information and never know it. Whereas if these open lines of communication actually exist, then you can start treating your database as if it were an API, which is much nicer. So those are some things I'd recommend.

Lindsay: I love that, because it creates this space for data teams to be a steward between typical data producers and data consumers, who are generally so far apart. Engineers are not really talking to the business users at a company, or even the end users of a data product, typically. So it forces these conversations that I don't think would be happening otherwise, which is a really important direction for data to be progressing in, for sure.

And so I guess, if I were to get you to predict the future, we talked a little bit about this, but what do you think is going to be the biggest change for data contracts over the next six to 12 months or longer?

Chad: I think there are going to be a few really interesting things happening. One is, I think the industry is going to start to rally around what the data contract spec is supposed to look like. There was a really great spec put out by the team over at PayPal; JGP did that. I have some opinions on what a spec should look like, but I think we're going to start rallying around that and coming to a set standard. That's one thing. The other thing is that we're going to see a lot more implementations of data contracts, and those implementations are going to be pretty varied. There was the PayPal example, and there was another example I was looking at: an article on doing data contracts for third-party events from tools like Amplitude, and Mixpanel, and Google Analytics. That was a really, really cool implementation. And there are a lot of big companies that I don't know if I can name, but they're in the process of rolling out data contracts, or they've already rolled them out, and they're considering writing articles about it. So I think we'll start seeing a lot more varied implementations, which is pretty cool.

The other thing I think is coming, and this is already happening, is that people are moving forward with data mesh, which is another really cool, more organizational paradigm for how you structure your business to get the most value from data. The core of that organizational paradigm is this idea of a data domain or a data quantum. That basically says: hey, we've got this microservice for data, and that includes the actual microservice, the thing doing all of the work on the application end, and it includes the database, but it also may include some data pipelines, and it may include some tables in Snowflake or Databricks or something like that. That is the data microservice or the data quantum, and then the data product team vends that data out to the rest of the organization. It's definitely growing in popularity as a way to solve a lot of these issues around data management, and what people are realizing is that organizationally, it's really cool, but you need some technology to make it happen. There needs to be some technical object that allows people to share data very easily, and where the industry seems to be going is using data contracts for that. I wouldn't say it's a standard yet, but contracts are definitely starting to be leveraged in these data mesh-oriented implementations.

And then the final prediction I would have, and I always make this prediction and hope I'm right (maybe I'll just be disappointed yet again), is that artificial intelligence and its growing popularity is going to start to necessitate more of a focus on the foundations. Andrew Ng said we're moving away from model-driven AI and towards data-driven AI. It's actually the quality of the data that needs to be enforced first and foremost, and that means great ownership and great documentation, and tests, and monitors, and all that stuff. I think what people will eventually realize is: look, if we're going to take these massive dependencies on AI and make all these huge decisions, potentially $100-million decisions, based on AI, we really need to trust the data that's feeding into the system, and the only way you can do that is if you have clear ownership and essentially APIs at every level of those production pipelines. So those are some of my predictions.

Lindsay: That makes a lot of sense. The last one there, I hadn't really considered that one, but definitely, I think if we're moving away from doing all the data modeling by humans and having computers do a bit more data modeling, then you still do need a bit of enforcement across the system. So that's definitely a really interesting application.

One of the big challenges we're seeing a lot with dbt specifically, or I guess with data teams generally, is tech stack sprawl. It's so easy to create models or add dashboards or add new things to your system. How do you think about whether data contracts can help with that problem of scale, where we don't even really know how big our tech stack is anymore, where its boundaries are, and whether the quality of what we're responsible for is being kept in check? Do you think data contracts can help with that problem? Or is that more of an organizational problem for the data team?

Chad: I think any cultural problem is actually a secret technical problem whose technical solution hasn't been identified yet. So it comes back a bit to what we talked about before. I think data teams need to have two very different environments for their users to operate in. Because of the nature of data work, one environment needs to be more exploratory and allow for experimentation, incrementally testing things and adding things on to see if they work and whether they're useful. But in that environment, there also needs to be an expectation that these pipelines are not going to exist forever; they all have a limited shelf life. Like: hey, as long as you're doing experimentation, that's great, but you have nine weeks until we deprecate this thing. If you want to extend it manually, you can do that, but just know that the countdown timer is going to begin again. And if you've actually found something that's very useful and valuable, then there should be an expectation that a contract needs to be put around it. When a contract is put on some valuable data asset or pipeline, it transitions it to this higher, more trustworthy state, and then that can be forked back into your prototyping environment and people can begin using the trustworthy data again.

If there's not a system like that, and I'm certainly open to other things that could work, but at Convoy, until we had that type of system, we were trying to manage all of that manually. You look around at the tables and ask, “Has anybody used this table in the last six months, and does that mean I can deprecate it? And wow, it's really expensive. Is it actually doing something useful or not?” It's just very hard to tell, because all of the production-grade use cases and all of the prototype and experimental use cases are collapsed onto the exact same infrastructure, and so it's very challenging to tease all of that apart.

Lindsay: Yeah, for sure. It's like you need to have a sandbox for people to play in, and then something else that's a governed system that only allows a certain number of things to enter. You hear about dbt projects with over 5,000 models, and I just can't imagine how all of those models are needed, depending on how big the company is, I suppose. To that end, there's just a bit of order that probably needs to come back into it. So I like the idea of proactive governance there, so that things have a limited shelf life and then get moved into a production system or something.

Chad: Yeah, exactly. And this is a really fun conversation because it's touching on things that we don't get the chance to talk about very much in the data world, like incentives, especially cultural incentives. This is a big conversation I was having with a friend of mine who was implementing a large-scale monitoring system at his company, and the problem they were running into was alert fatigue, or test fatigue, meaning they had hundreds, if not thousands, of these alerts firing all the time. Once you've got that many alerts going off, it's very challenging to parse out which ones are useful, so they all become equally valuable, which is to say not at all. The reason people get to that point is that, until you reach that scale, there is no negative incentive against just creating constant incremental monitors for the things you care about. Every single thing that pops into my mind that I think is going to be useful, I create a test for, but I'm not thinking about the broader ecosystem of everybody doing that. And ultimately, it yields this feeling of: oh, well, there's so much stuff that I can't even be bothered to check it anymore.

And this might seem a bit backwards, but I think there actually needs to be some friction in the process of creating these types of checks and creating the governance. You should only want to do it, and you should only be able to do it, if it actually is valuable and useful, and that means the process should be a little bit harder and take a little bit more work than just pressing a button and saying, hey, I want to create a new validation rule that selects star on some massive table.

Lindsay: I definitely agree with you there. I think it requires a little bit more foresight too. A lot of teams just build conventions early on and then don't really go back to the drawing board of: how do we want to enforce these things, and is this still serving the system we have in its current state versus where we were a year ago? I've been in that world where we just had very basic test conventions, and we would add tests to all of our new models, and the test bloat just starts to pick up over time. I think it's a trade-off of compute resources and time versus the amount of coverage you actually need, and the level of noise from the tests. So definitely some thinking there for data teams to be doing, I think.

We actually have quite a few questions in the chat, so maybe we can switch gears to answering some of those now. Let's go to the first one. We have one from Eric. He asks: how do you separate data quality issues from data entry errors? For example, in transportation, it was common that the time for an activity would be mis-keyed.

Chad: That's a good question. We had this issue a lot too, as I'm sure others do in businesses with a large operational component. Human error is just common, right? People get tired. There's someone in the Philippines who's been working 10 hours straight, and they might mistype something, or it might just come down to laziness. I've seen examples where you have a salesperson doing some data entry in Salesforce, and they're required to add a tax identification number, and they don't know what it is and can't really be bothered to go look it up because they need to work on other things. So they just type in something like 55555 and hit okay. That's obviously not correct, but it still causes issues for folks downstream. So I think the data contract certainly can be used as a circuit breaker to interject in the actual data entry workflow as well, but it's going to depend a lot on the type of systems you use to collect the data.

So as an example, with Salesforce you can actually implement data contracts. They have a product, I believe it's called Salesforce DevOps, that exposes a CI/CD pipeline, and you can essentially have a mechanism that does a check in the Salesforce CI and says: hey, there's a new record being added into Salesforce, we expect it to conform to these standards, it doesn't, so we're going to prevent it from being merged; and then we can send an alert in Slack, or over email, or whatever, to the salesperson who entered it and say, “Hey, you didn't enter this the right way, so we didn't actually allow you to push it into the system.” But there are some systems where you can't do that. In fact, I would say with most third-party tools, you probably can't. So you'll need a bit more of a reactive approach in flight, so to speak.

What I've seen other people do, and this is not something we were doing at Convoy, is essentially try to intercept bad data before it gets into their production pipeline, so to speak. So there were checks running on the data in flight, or they would push everything to more of a staging table in Snowflake and run checks in batch there. If those checks passed, the data would flow on into the pipeline, and if not, it would trigger some alert, message someone, create a Jira ticket, and say, “Hey, consumers, we haven't loaded the data today because there's an error and we don't want to break all your dashboards.” So it's not quite a contract; it's more of an expectation. But it does still fall into that preventative data quality framework, I think.
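As a rough, tool-agnostic illustration of that staging-and-batch-check pattern (not Convoy's setup or any specific product), the gatekeeping logic might look something like this; the row dicts stand in for records sitting in a warehouse staging table.

```python
# Illustrative only: run batch checks on newly landed rows, promote the batch
# when every check passes, and otherwise hold it in staging and raise an alert
# instead of letting bad data flow downstream.
from datetime import datetime, timezone


def run_checks(rows: list[dict]) -> list[str]:
    """Return failure descriptions; an empty list means the batch is clean."""
    failures = []
    for i, row in enumerate(rows):
        if not row.get("shipment_id"):
            failures.append(f"row {i}: missing shipment_id")
        distance = row.get("distance_miles")
        if not isinstance(distance, (int, float)) or distance < 0:
            failures.append(f"row {i}: distance_miles is missing or negative")
    return failures


def promote_or_hold(rows: list[dict]) -> None:
    failures = run_checks(rows)
    if failures:
        # In a real pipeline: page someone, open a ticket, and leave the batch
        # parked in staging rather than loading it into production tables.
        timestamp = datetime.now(timezone.utc).isoformat()
        print(f"{timestamp} batch held in staging:")
        print("\n".join(f"  - {f}" for f in failures))
    else:
        print("all checks passed; batch promoted to production tables")


if __name__ == "__main__":
    promote_or_hold([
        {"shipment_id": "S-123", "distance_miles": 412.5},
        {"shipment_id": None, "distance_miles": -3},
    ])
```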

Lindsay: I think even at the most basic level, just ensuring that the system you're collecting data with only accepts data in the format you want, or that required fields are always included before something can be submitted if it is a human process, that stuff often gets overlooked, but it's pretty critically important too.

Let's jump to another one. We have one from Corey here. How do you convince data producers that contracts are worth owning? The upside seems to be for consumers.

Chad: Yeah, that's totally true. I think data engineers and consumers are in this very weird place right now where we all recognize that there are huge data quality problems in basically every single organization. It's just a nightmare. Things are changing all the time, stuff is breaking, and we don't know what's trustworthy, but that doesn't really get pushed back onto the producer in any way. They're just going through their life making changes, and they're more or less totally abstracted from all of the damage they're creating. I feel like I saw a meme one time of someone skipping through a field, and behind them, everything is burning. It's kind of like that. But I don't think data producers are bad people. In fact, I would say they're really biased towards managing risk and not causing damage for other people in their organizations; it's a really good quality that I think most application engineers have. So a great way of getting them more involved in the problems they're causing is by essentially calling out what's going to happen if they make a change. And you can do that with a contract, right?

The scenario you want to be in is: if a data producer drops a column or changes a data type... Like at Convoy, there was actually a message that would pop up and say, “Hey, you are about to do something bad.” And I've seen this continue at other companies. Sometimes that message can be quite comprehensive; it can tag the consumers who are going to be impacted and show context around what data asset is going to break and what that actually means for the business. So there's a message in GitHub that the engineer sees, and it says: you are about to break a dashboard that the CEO looks at every single day, this is what that dashboard is used for, and if this thing breaks, we are going to ping the CEO and let them know what just happened. That very quickly, completely inverts the incentive. And like I said, it's not because these data producers are bad, it's just that they have no idea what impact their changes are going to have. So the more you can make them aware and bring visibility into that process, the more likely they are to use the contract as a mechanism of self-protection.

And this was the case at Convoy. Essentially, the producer had a choice. Once we started populating a lot of those things, the choice was, either you could manually go and have a conversation with each one of these people that owns a valuable data asset, or you could just take a contract and you're completely protected. You don't have to worry about that anymore, but you have to make the problem visible before someone's willing to do it.

Lindsay: For sure. I think it's that the communication piece is actually codified. Rather than just a handshake between teams, and hoping somebody remembers that if you drop this important field it's going to break all of our production dashboards, it makes that communication a lot clearer. And in a way, yes, they are owning that piece of the contract, but at least they know what their impact is in the company. And if you are using data contracts, as you said, properly, for some of the most important business-impactful assets, then hopefully the impact of what they're doing is elevated enough that it gets their buy-in.

Chad: Yes. Exactly.

Lindsay: Currently, does dbt do inferred or partial contracts?

Chad: I'm not exactly sure what that means; I'm not sure what partial contracts means. Is partial the right place to start, or do we start with... I see what you're saying. There's nothing wrong with starting with partial contracts. I think putting full schemas under contract is a good way to promote behavioral change within an organization, and that's actually more useful than the specific things it's protecting, at least to start off with. The reason is that if you're trying to make some change to a dbt model and all of a sudden you get an error, and you see, “Oh wait, there's a contract in place here,” what that's going to do is help you realize: wait, there's someone who really cares about this data, and it's useful, so I should probably figure out what that is. So the more places you can implement something like that, the better it is to start. So I would actually probably go in the inverse direction. I would start off with the full schema under contract, and then once we really understand who exactly these consumers are and what exactly they need under contract, like maybe they really need this ID field under contract, then you can start to pare it back a little bit. In the analytical database environment, I think that's a much easier conversation to have. I've personally found that it is much easier to have that conversation with people like data engineers and analytics engineers than with software engineers and data producers who are very isolated from the process.

Lindsay: Yeah, that makes a lot of sense. And then we have one more here. “I love this idea that data product managers should essentially” – I guess this isn't a question, that's just someone who loves the idea – “be drafting data contracts to present to producers.” What role do you think data product managers would play in contract definition? Do you think that's something that role would own?

Chad: Yeah. I think Eric hit the nail on the head there. In the ideal world, it would be a data product manager drafting these contracts. They're saying: hey, upstream producers, we have a data product, it's very useful, here's why and how it's useful, and here are all the considerations I need you to make; and they're being very thoughtful about that. I think that would be awesome. In companies that don't have data product managers today, that role could still easily be fulfilled by whoever is building out the data set, whether that's an analytics engineer, data scientist, or data engineer. In fact, a lot of the first movers on data contracts have been data engineers at companies I've worked with. And there are a bunch of different use cases, things that are very scary for them. For example, there's one company going through a big monolith decomposition right now, and they have a bunch of machine learning models, and the data engineering team wants to get data contracts up on their intermediary tables so they can protect all of their consumers. So it's totally fine if it starts from a data-engineering-first perspective, and then as the company becomes more mature in how they think about data domains, and data mesh, and data product management and things like that, they can begin adding those other pieces in.

Lindsay: Yeah, that makes a lot of sense. We'll just take one more here. Thank you so much for all your questions. Where do data contracts go wrong, and can a well-intentioned consumer draft a poor contract?

Chad: I totally think they can. There are a lot of ways contracts can go wrong, and many different things to be thoughtful about here. One way they can go wrong ties back to something I said earlier: the less valuable your contract is to some business asset, the more it's going to start raising eyebrows, like, okay, why is some consumer making my life worse and making me go slower for this thing that's not even valuable? So imagine you're a product manager, and this is not to pick on product managers, because generally I think they're great. Imagine you're a product manager and you just asked for a dashboard that looks at the number of people using the filters in your application, and you say, “Look, I absolutely need a contract on this, I need to make sure this thing doesn't change, because I need to know exactly how many people are using my filters.” The upstream team says, “Okay, we'll give that to you.” And then a month later, you've moved on to a new feature and you never look at it again. That's probably not good, because now the upstream team is maintaining this API for a consumer that's not even using the data anymore. And that's a really quick way to violate trust.

The other thing I would say is there are certain data assets where you just need to be very thoughtful about how you implement the contract. For example, in a lot of businesses, and this was also true at Convoy, sometimes we would have very wide tables with dozens of columns. At other businesses, it can get up to hundreds, if not thousands, of columns; that's not an exaggeration. So if you say, hey, I want a contract on this table with a thousand columns, that's probably going to prevent anyone from ever making any change to that table, and that's not very useful either.

The last thing I would mention, just something to watch out for, is that it's the responsibility of the consumer to very clearly communicate what constraints they need in the contract and how the data is being used. Otherwise, a producer is not going to be able to fulfill the contract on their behalf. If you just give very basic documentation, like, oh, I need this under contract, I need that not to change, then it's totally fair game for the producer to do the bare minimum enforcement, because you haven't actually told them what the data should look like. As an example, say I have something under contract, a distance column, and Convoy is a shipping company, and I say, “This column is just called distance and I need it under contract.” Okay, that's cool. Maybe that protects the schema and protects the data type from changing, but what if the data producer decides to change that column from miles to kilometers? That would completely break me, but because I didn't clearly explain that I always expect this column to be in miles, there's no expectation that the producer actually needs to preserve that.
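One hypothetical way to capture that kind of semantic expectation in the contract spec itself, extending the illustrative YAML format from earlier (the unit field is made up, not a standard), would be:

```yaml
# Hypothetical extension of the earlier illustrative spec: encode the semantic
# expectation, not just the schema, so the producer knows what "distance" means.
schema:
  columns:
    - name: distance
      type: float
      unit: miles          # switching to kilometers would violate the contract
      constraints: [not_null, non_negative]
```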

The final thing I would say here is that, in my opinion, the data contract is actually bidirectional. What we've been talking about primarily, and what the conversation is almost always about right now, is the responsibility of the producer to the consumer: there's some data being generated for a valuable thing, and you need to ensure consistency. But I think it goes the opposite way as well; there is a responsibility of the consumer to the producer to actually use the data in the way they said they would. And there are some really interesting ways you can get at that, which again fall a bit further to the right on the maturity curve. It's like: okay, I can see how many times you're actually refreshing this table and how often it's being queried, and if it's under the amount you said you would guarantee, then the producer doesn't actually need to continue maintaining this contract anymore, because the consumer isn't holding up their side of it. So that's another thing I would think about.

Lindsay: Yeah, for sure. I like that it's a two-way street: you need accountability on both sides, and you need to make sure that if you're going to ask for a contract, you're actually going to use it, because it is definitely a balancing act to put these things in where you need them and not just apply them liberally.

Awesome. Well, thank you so much, Chad, for all of your amazing responses and a great discussion today. And thank you to everyone who submitted questions; it was great to see lots of engagement and really great questions from the audience. Before we wrap things up, I just want to plug a few quick things. The first is that Chad and I are going to be joining each other again in a course in a couple of weeks. There's a course launching called Advanced dbt with CoRise, so if you're interested in learning even more from Chad about data contracts, we'll go into that in a guest lecture in week four of the course. And then Chad, I'll plug your data quality community for you, but is there anything else you want to share with the audience before we pop off? I know you're working on a book.

Chad: I am working on a book, an O'Reilly book called Data Contracts. Hopefully, it will clear up a lot of misconceptions about what data contracts are and how they're used. If you're interested in getting a copy, I'll be pushing out incremental updates and chapters and sharing them with folks through LinkedIn and also through my Substack. I have a blog that I try to write every week or every two weeks, but it gets sketchy because I'm busy; that's at dataproducts.substack.com. And then, like I mentioned before, I also have a Slack community, which I really recommend folks join, at dataquality.camp/slack, and we talk about data contracts a lot there. I'd love for you to join if you're interested.

Lindsay: Awesome. Well, thank you again so much, Chad. This was a really awesome discussion, and thanks to everyone for attending. We'll close things off there. I hope everyone has a good afternoon. Thanks.
