Data engineers build, maintain, and improve a company's data infrastructure. They are the people who design and implement scalable data practices and then keep those practices running. More and more people who aren't in technical roles are looking to data to make decisions: the marketing team wants historical data to inform its advertising decisions, the product team wants to understand usage so it can bring forward improvements, and the list goes on.
This means that the data team is no longer a separate function at a company. Instead, it is the one tying the pieces together so that everyone can understand the data and make decisions from it. In such an integral role, what are some of the best data engineering practices to follow?
1. Use tools that help you do your job, until they get in the way of doing your job.
We all have tooling that we rely on, whether it's an IDE, database software, or a package management system. These are things we use in the day-to-day of our jobs, and as we get to know them they get in the way less and less. However, it's important to remember that a tool is only as good as its user. If you don't know how to use a tool properly, you're not getting the most out of it; and if it still slows you down once you do, it's time to move on.
The first thing you need to do when choosing a new tool is to understand what it does. Look at what other engineers working on similar problems are using; if you're working with unstructured data sets, for example, look at what companies that handle such data at scale, like Google and Facebook, are using. Do your homework on tools so that they become extensions of your own capabilities instead of hurdles between you and progress in your chosen field.
2. Focus on repeatability.
Repeatability is essential for a successful data engineering project. The first step to ensuring repeatability is to create tests that can be run as part of the development pipeline. This includes unit tests, integration tests and end-to-end tests.
Unit tests are written at the level of individual modules, such as functions and classes. They test small parts of the code in isolation, which makes them easier to write and debug and lets developers focus on solving smaller problems one by one. Integration tests combine multiple modules so they can be tested together in a more realistic setting than unit tests provide. End-to-end or acceptance tests exercise the entire deployed application from the outside, just as a user would.
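To make the difference between the first two levels concrete, here is a minimal sketch. The functions `parse_amount` and `total_revenue` are hypothetical names invented for this example, and the tests are written as plain functions with bare assertions so that a runner such as pytest could collect them.

```python
from typing import Iterable


def parse_amount(raw: str) -> float:
    """Convert a string such as ' 1,200.50 ' into a float."""
    return float(raw.strip().replace(",", ""))


def total_revenue(rows: Iterable[dict]) -> float:
    """Sum the parsed 'amount' field across all rows."""
    return sum(parse_amount(row["amount"]) for row in rows)


# Unit test: exercises one function in isolation.
def test_parse_amount_strips_whitespace_and_commas():
    assert parse_amount(" 1,200.50 ") == 1200.5


# Integration test: exercises parsing and aggregation together.
def test_total_revenue_sums_parsed_amounts():
    rows = [{"amount": "10"}, {"amount": "2,000"}]
    assert total_revenue(rows) == 2010.0
```

An end-to-end test would go one step further and run the whole deployed pipeline against known input, which is hard to show in a few lines because it depends on your environment.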
3. Focus on modularity.
You should aim to build a data processing flow in small, modular steps. Each step along the way is built to solve one specific problem, like reading a file or computing some statistic. This makes your code more readable and easier to test (see the previous section), and also lets you adapt each part independently as your project evolves. A straightforward example might be keeping the step that reads raw data from files separate from the step that writes clean JSON objects to disk: that way, you can add new sources of data without having to update the code that comes after the read step.
Modules should be reusable: building modules with a set of inputs and outputs that make sense in multiple contexts will help keep your pipeline clean and easy for others to understand. Even if you don't expect to reuse a module, it's still worth keeping it generic enough that someone else could extend it later if they wanted.
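As a rough illustration of the read-then-write-JSON example, the sketch below splits a tiny pipeline into three steps. All function names are hypothetical, and the CSV input format is an assumption made only for this example. It also assumes every row has the same number of fields as the header.

```python
import csv
import io
import json
from typing import Iterable, Iterator


def read_rows(source) -> Iterator[dict]:
    """Step 1: read raw CSV rows from any file-like object."""
    yield from csv.DictReader(source)


def clean_row(row: dict) -> dict:
    """Step 2: normalise one row (trim whitespace, lower-case the keys)."""
    return {key.strip().lower(): value.strip() for key, value in row.items()}


def write_json_lines(rows: Iterable[dict], sink) -> int:
    """Step 3: write one JSON object per line and return the row count."""
    count = 0
    for row in rows:
        sink.write(json.dumps(row) + "\n")
        count += 1
    return count


def run_pipeline(source, sink) -> int:
    """Compose the steps; each one can be tested or swapped on its own."""
    return write_json_lines((clean_row(r) for r in read_rows(source)), sink)


# In-memory buffers stand in for real files so the example is self-contained.
raw = io.StringIO("Name, City \n Ada , London \n")
out = io.StringIO()
written = run_pipeline(raw, out)
```

Because each step only depends on its inputs and outputs, supporting a new source would mean writing another reader that yields dictionaries, while `clean_row` and `write_json_lines` stay as they are.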
4. Focus on reliability.
Data engineers, it's time to stop pretending—we all know that things are going to go wrong. Your job is to make sure those issues don't disrupt everyone else.
- Build monitoring and alerting into your data pipelines. You can't fix something if you don't know it's broken, and things will break. Whether it's data validation that detects bad records or alerting for long-running jobs, make sure you have a way to catch failures as soon as they happen and take action to resolve them quickly.
- Adopt tools like ELK (Elasticsearch, Logstash, Kibana). Leverage these powerful tools for monitoring the health of your systems and troubleshooting problems when they arise.
- Adopt the right set of open source tools tailored toward working with big data in a distributed environment: HDFS (Hadoop Distributed File System), Apache Spark, and HBase are great examples.
- Even when using open source frameworks, understand what's going on under the hood so that you can troubleshoot potential issues in production.
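The two checks mentioned in the first bullet, validating records and flagging long-running jobs, can be sketched with nothing but the standard library. The names `split_valid` and `alert_if_slow`, the required fields, and the use of a logger as the alert channel are all assumptions for illustration; a real setup would more likely forward these events to whatever monitoring stack you use.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("pipeline")


def split_valid(records, required=("id", "amount")):
    """Separate records that have every required field from those that do not."""
    valid, invalid = [], []
    for record in records:
        ok = all(record.get(field) not in (None, "") for field in required)
        (valid if ok else invalid).append(record)
    if invalid:
        logger.warning("%d of %d records failed validation", len(invalid), len(records))
    return valid, invalid


@contextmanager
def alert_if_slow(job_name, max_seconds, alert=logger.error):
    """Time the wrapped block and call `alert` if it ran longer than max_seconds."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        if elapsed > max_seconds:
            alert("job %s took %.2fs (limit %.2fs)" % (job_name, elapsed, max_seconds))
```

Note that this simple timer only reports after the job has finished. Catching a job that is still hanging would need an external watchdog or scheduler-level timeout, which is beyond this sketch.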
5. Build for failure.
It is imperative that you assume failure—and plan accordingly.
We must not think of the system as perfect, but rather as constantly in flux. The more components your system has, the more likely it is to fail; and if you’re doing big data right, it will have a lot of components. Systems are not autonomous beings; they require constant care and feeding by people who need to sleep every once in a while.
So how do we build for this? By asking ourselves the following questions:
- What is the biggest single point of failure in my system?
- What are the consequences of this node failing? Can I make it so that if the application goes down, no one notices? Or can I make it so that everything keeps functioning at reduced capacity until someone can fix it? Is there an elegant way to make my applications self-healing or self-fixing?
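One common way to approach the "reduced capacity" and "self-healing" questions is to retry a failing call with exponential backoff and, if it still fails, fall back to a degraded result such as cached data. The helper below is a minimal sketch of that idea; `call_with_retry` and its parameters are hypothetical, and catching every `Exception` is a simplification you would likely narrow in practice.

```python
import time


def call_with_retry(operation, attempts=3, base_delay=0.5, fallback=None, sleep=time.sleep):
    """Run `operation`, retrying on failure with exponential backoff.

    If every attempt fails and a `fallback` callable was given, return its
    result (degraded mode) instead of raising the last error.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                if fallback is not None:
                    return fallback()
                raise
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

For example, `call_with_retry(fetch_latest, fallback=load_cached)` would keep serving cached data while an upstream source is down, where `fetch_latest` and `load_cached` stand in for your own functions.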
Data engineering is a challenging field, and taking some time to think about how to organize a project can pay big dividends. Data engineering does not have the wide range of well-established best practices that, for example, software engineering enjoys. This makes it all the more important to devote time up front to choosing standards that are likely to pay off, and then sticking to them.