A Complete Guide To Data Engineering
February 21, 2021
Data engineering is the process of moving data from its raw form, such as sensor data, into a structured format that can be used to produce desired insights. It has only recently been recognized as a distinct discipline. The field is similar to the more established discipline of data science in that it involves data manipulation and analysis. The difference is that data engineering takes a more hands-on approach to the data. Data engineers move the data around and organize it so that other people can use it. This article is meant to serve as a brief introduction to data engineering, metadata management, and data catalogs. Its purpose is to give an overview of the different areas of data engineering as well as the common tools and processes in the role. This article uses a top-level definition; many organizations define the role differently.
What is Data Engineering?
We think about the data engineer as the engineer for the data team. This means that the customers for the data engineer are the data science team within the organization. The role of the data engineering team is to take data from its raw and unusable state and transform it into clean data that the data science team can use. This means that data engineers work in the background and assist the data scientists when they need to answer a specific question. Data engineers usually work for tech companies or high-end consulting firms because those companies have to deal with a large amount of data. The more data you have, the more time you have to spend processing and analyzing it. In fact, the breakdown of time spent on data preparation versus data analytics is woefully lopsided; less than 20% of time is spent analyzing data, while 82% of the time is spent collectively on searching for, preparing, and governing the appropriate data. Data engineers play the role of keeping the data clean and organizing it so that other people, data scientists included, can get value out of it.
Data Engineering vs Data Science
Data engineering and data science are both disciplines that make use of advanced algorithms and structured data analysis. Yet, the two disciplines differ in the techniques used and how they're applied. Data engineering focuses more on organizing, cleaning, and manipulating the data as it makes its way through the pipeline. Organizing requires data engineers to put together the structure in the warehouse so that data can be accessed when it is queried. Cleaning requires data engineers to remove duplicates, monitor ingestion, and make sure that the data is presented in the right format. Data science, on the other hand, is used in more isolated and exploratory phases.
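To make the cleaning step concrete, here is a minimal sketch of removing duplicates and normalizing formats in plain Python. The field names (sensor_id, timestamp, value) and the input date format are assumptions made for illustration; a real pipeline would usually do this inside an ETL tool or the warehouse itself.

```python
from datetime import datetime


def clean_records(records):
    """Drop duplicate rows and normalize each field to a consistent format."""
    seen = set()
    cleaned = []
    for row in records:
        # Normalize: trim whitespace and convert the timestamp to ISO 8601.
        sensor_id = row["sensor_id"].strip()
        parsed = datetime.strptime(row["timestamp"].strip(), "%d/%m/%Y %H:%M")
        timestamp = parsed.isoformat()

        # Deduplicate on the normalized (sensor, timestamp) pair.
        key = (sensor_id, timestamp)
        if key in seen:
            continue
        seen.add(key)

        cleaned.append(
            {"sensor_id": sensor_id, "timestamp": timestamp, "value": float(row["value"])}
        )
    return cleaned


# Hypothetical raw sensor readings; the first two differ only by whitespace.
raw = [
    {"sensor_id": " a1", "timestamp": "21/02/2021 09:30", "value": "3.5"},
    {"sensor_id": "a1", "timestamp": "21/02/2021 09:30", "value": "3.5"},
    {"sensor_id": "b2", "timestamp": "21/02/2021 09:31", "value": "4"},
]
cleaned = clean_records(raw)
```

Normalizing before deduplicating matters here: the first two readings only look different because of stray whitespace, so comparing the raw rows would keep both.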
Common tools used by data teams
Data teams use a variety of tools to move, manipulate, and store data. These tools include a production database, a data warehouse, an ETL tool (or ELT if you’re fancy), some modeling layer, and a visualization layer. These core technologies have gained adoption over the past few years, which has led to a more standardized data stack. Although these tools solve a major problem for data teams, there are still gaps in the traditional data stack, especially at smaller companies. The one prominent gap that we’ve come across in our time talking with data teams is the lack of complete observability and discoverability of your data. Many data engineers and data scientists are left in the dark when it comes to knowing about their data (where it lives, what’s broken, what’s related to what).
Why is data discovery the missing link?
Data discovery is the missing link between a data-driven culture and a culture that is dependent on the data team to answer any critical questions. This is because understanding where to find data, which data you can trust, and what different tables or columns mean is easy only when you’re already familiar with the data. But as a company scales, the tribal knowledge does not, and many are left in the dark about what specific tables or visualizations mean. Most teams we’ve spoken to have been using a mix of tribal knowledge and Confluence documents to record important information. There are a few problems with this solution:
- It becomes outdated because it is completely manual.
- It is difficult to discover (most teams pin Confluence docs to Slack channels that get forgotten).
- It doesn’t adapt to the different data collected (it’s Confluence docs at the end of the day).
We believe that the best way to discover your data is with a tool that is distributed to everyone, documents data automatically (through extensive integrations), and can interpret different kinds of data. In our opinion, the key to having this kind of understanding in a tool is building a data catalog on a graph database which includes data, visualizations, and people as nodes in the graph. This graph database should collect the important metadata from all sources and display it in a way that allows anyone to discover data. Because of the distributed format of such tools and the growing demand for self-service business intelligence, this data catalog should also have some features that promote data management and data governance.
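As a rough illustration of the graph idea, the sketch below models tables, visualizations, and people as nodes and their relationships as edges, using an in-memory structure rather than a real graph database. The class name CatalogGraph, the relation names, and the sample nodes are all hypothetical.

```python
from collections import defaultdict


class CatalogGraph:
    """A toy in-memory catalog: nodes are data assets or people, edges are relations."""

    def __init__(self):
        self.nodes = {}                # node id -> {"kind": ..., plus any metadata}
        self.edges = defaultdict(set)  # node id -> set of (relation, other node id)

    def add_node(self, node_id, kind, **metadata):
        self.nodes[node_id] = {"kind": kind, **metadata}

    def add_edge(self, source, relation, target):
        self.edges[source].add((relation, target))
        # Store the reverse direction too, so discovery works from either end.
        self.edges[target].add((f"inverse:{relation}", source))

    def neighbors(self, node_id, kind=None):
        """Return ids of connected nodes, optionally filtered by node kind."""
        found = [
            other
            for _relation, other in self.edges[node_id]
            if kind is None or self.nodes[other]["kind"] == kind
        ]
        return sorted(found)


catalog = CatalogGraph()
catalog.add_node("orders", "table", description="One row per order")
catalog.add_node("revenue_dashboard", "visualization")
catalog.add_node("dana", "person")
catalog.add_edge("dana", "owns", "orders")
catalog.add_edge("revenue_dashboard", "reads_from", "orders")

everything_near_orders = catalog.neighbors("orders")
people_near_orders = catalog.neighbors("orders", kind="person")
```

Because people are nodes too, "who should I ask about this table?" becomes the same kind of lookup as "which dashboards use this table?".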
What Is Metadata?
Metadata is data about data. The two terms are sometimes used interchangeably, but metadata is not a synonym for data; it is data that describes other data. The word metadata is derived from the root words meta (meaning "about") and data (meaning "information"). For a file, metadata is often stored in the file's header and is not usually visible to the user. If you use a photo as an example, metadata consists of information related to:
- The location of the photo (where it was taken)
- The time that the photo was taken
- A detailed technical description of the camera and its settings when the photo was taken
- Who is in the photo
In the data world, metadata tells the same story to data engineers. They can understand:
- Where does the data come from?
- Who created this table?
- When was it last updated?
- What do certain things in the table mean?
- How do I use this table?
- What are similar tables?
These kinds of questions can help build up a knowledge repository for your data. There is no more asking around for what different tables or rows mean; it’s all in the data catalog. While data catalogs can document data, the fundamental challenge of allowing users to “discover” real-time insights about data has remained unsolved. This is because traditional data catalogs sit in the data organization and do not scale with the data stack. Additionally, they are not distributed and don’t automate the process of documenting data. Below are some of the areas we see changing for data catalogs.
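One way to picture such a knowledge repository is a small record per table that holds an answer to each of the questions above. The sketch below is only an illustration; the field names and the unanswered helper are assumptions, not the schema of any particular catalog.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import Optional


@dataclass
class TableMetadata:
    """One catalog entry; each field answers one of the questions above."""

    name: str
    source: Optional[str] = None                             # where the data comes from
    created_by: Optional[str] = None                         # who created the table
    last_updated: Optional[date] = None                      # when it was last updated
    column_descriptions: dict = field(default_factory=dict)  # what things in the table mean
    usage_notes: Optional[str] = None                        # how to use the table
    similar_tables: list = field(default_factory=list)       # related tables


def unanswered(meta):
    """Return the names of fields that are still empty for this table."""
    return [f.name for f in fields(meta) if not getattr(meta, f.name)]


orders = TableMetadata(
    name="orders",
    source="production database",
    created_by="dana",
    last_updated=date(2021, 2, 21),
    column_descriptions={"order_id": "Unique id of the order"},
)
gaps = unanswered(orders)
```

A check like unanswered makes documentation gaps visible, which is the first step toward filling them automatically instead of by asking around.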
Understanding data requires collaboration
While modern teams grow and demand more data, more self-service, and more insights, the data catalog has remained siloed. Data teams should be able to easily search and understand data using their data catalog without a dedicated support team. By building an understanding of how teams interact with data, a distributed data tool could highlight which teams are using which data and create knowledge that is distributed instead of centralized. Since traditional data catalogs are not distributed, it’s nearly impossible to use them as a central source of truth about your data. This problem will only grow as the data and the company grow. More users want access to the data, making simple analytics complex.
Data automation from day one
Most data catalogs or data documentation solutions (such as Confluence) rely on the data team to document and update the information. Without strict rules about data documentation, this information can become outdated. Even with good data documentation rules, documenting data takes a lot of effort and time for the data teams. Additionally, data teams are still pinged on Slack about the same question repeatedly, which can become frustrating. The majority of this process should be automated and self-documenting. When someone on the data team answers a question about data, it should be recorded in a place that is searchable, similar to Stack Overflow. This way, repeated simple questions can become a thing of the past, and data teams can focus on answering nuanced questions only once.
Data catalog that understands a variety of data
As machine-generated data increases and companies invest in ML initiatives, unstructured data is becoming more and more common, accounting for over 90 percent of all new data produced. Unstructured data can be transformed in many different ways. Understanding it requires data catalogs to track lineage and understand data sets. This requires data catalogs to answer second-order questions including:
- Why was this data collected this way?
- What is the hypothesis behind the data?
- When was this last used and updated?
These kinds of questions require data catalogs to infer information or collect it while unstructured data is being transformed into structured data.
Data Discovery built for modern data teams
Finding a solution to these problems is not simple. There is some work on catalogs done by large tech companies that is inspiring to reference. The new data tools to document and understand data will enable data discovery for everyone in the organization. They will do this through a decentralized and automated format. This format allows all employees to understand what is going on with their company data. Below are the core features of a data discovery tool.
Data Discovery to Help You Find Data
Data discovery tools will allow anyone to find and understand the data they need. You can search for tables, spreadsheet files, or raw data. Sometimes you don't know exactly what you need or the correct term for a key metric. To solve this, data discovery tools could use a fuzzy search to surface related information that might answer your question. This will all be available through a familiar, search-based interface.
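A fuzzy search of this kind can be sketched with Python's standard difflib module, which ranks candidates by string similarity. The table names and the find_assets helper are made up for illustration, and a real tool would likely also search descriptions and column names rather than asset names alone.

```python
from difflib import get_close_matches

# Hypothetical names of assets registered in the catalog.
ASSET_NAMES = [
    "monthly_revenue",
    "customer_orders",
    "user_signups",
    "revenue_by_region",
]


def find_assets(query, names=ASSET_NAMES, limit=3):
    """Return up to `limit` asset names that approximately match the query."""
    # Normalize the query to the naming style used for the assets.
    normalized = query.strip().lower().replace(" ", "_")
    # cutoff is the minimum similarity (0 to 1) for a name to count as a match.
    return get_close_matches(normalized, names, n=limit, cutoff=0.5)


# A misspelled query still finds the intended table first.
results = find_assets("Montly Revenue")
```

The cutoff value trades recall against noise: lowering it returns more loosely related names, while raising it only tolerates small typos.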
Data Discovery to Help You Understand Data
Finding the right data isn’t enough. Data needs to have as much context as possible so that teams can understand the granular insights in your data stack. On the first layer, this means metadata analysis of the information in the database. On a second layer, this means understanding the relationships between the data and the lineage of data sets. On a third layer, understanding the data means understanding the granular, column-level data. Field-level lineage can help data teams have full insight into how their data is used.
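Field-level lineage can be pictured as a mapping from each column to the columns it is derived from. The sketch below walks that mapping to list everything upstream of a given field; the column names and the LINEAGE structure are invented for the example and do not reflect any specific tool's format.

```python
# Hypothetical lineage: each column maps to the columns it is computed from.
LINEAGE = {
    "dashboard.revenue": ["warehouse.orders.amount", "warehouse.orders.currency"],
    "warehouse.orders.amount": ["raw.orders.amount_cents"],
    "warehouse.orders.currency": ["raw.orders.currency"],
}


def upstream_columns(column, lineage):
    """Return every column that the given column depends on, directly or indirectly."""
    seen = set()
    stack = [column]
    while stack:
        current = stack.pop()
        for parent in lineage.get(current, []):
            if parent not in seen:  # guard against revisiting shared or cyclic parents
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)


sources = upstream_columns("dashboard.revenue", LINEAGE)
```

Running the same traversal on an inverted mapping would answer the opposite question: which dashboards and tables are affected if a raw column changes.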
Data Discovery to Help You Share Data
Data is a social asset. People collaborate on tables and visualizations to make decisions. Today, these conversations and decisions are made through Slack or in a meeting but are rarely recorded. Teams that build up context around data assets will start to notice the benefits of a team driven by similar metrics. Old decisions can get referenced and all organizational tribal knowledge can get documented. This is especially important in a remote-first environment, which favours asynchronous conversations. A data discovery tool that can make data a social resource can help teams elevate their understanding of their data.
Data discovery is going to change the way data-driven teams adopt data in the coming decade. As more unstructured data is produced at an accelerating pace, understanding where it’s coming from and how to use it will be imperative to success. You can try out Secoda if you're interested in a tool that automates data discovery.
Only by understanding your data, the state of your data, and how it’s being used – at all stages of its lifecycle, across domains – can we even begin to trust it.