There's been no shortage of data roles popping up across the industry over the last five years. The reason is obvious: managing and maintaining data is top of mind for any forward-thinking company, regardless of industry. So what exactly is a data engineering manager, and what should you expect when hiring one?
What is a data engineer?
A data engineer is responsible for building, maintaining, and improving data infrastructure. They often work closely with data scientists, data analysts, and the business intelligence team. Data engineers who are part of a smaller team sometimes find that their work overlaps with other data functions, since they're wearing multiple data hats for their company.
What does a data engineer do?
According to Indeed.com, a typical data engineer job description will include:
- Assembling large, complex sets of data that meet non-functional and functional business requirements
- Identifying, designing and implementing internal process improvements including re-designing infrastructure for greater scalability, optimizing data delivery, and automating manual processes
- Building required infrastructure for optimal extraction, transformation and loading of data from various data sources using AWS and SQL technologies
- Building analytical tools to utilize the data pipeline, providing actionable insight into key business performance metrics including operational efficiency and customer acquisition
- Working with stakeholders including the executive, product, data and design teams to support their data infrastructure needs while assisting with data-related technical issues
What are the responsibilities of a data engineering manager?
Data engineering managers are responsible for more than just managing people. Data engineering is a fast-growing field, and there's an ever-increasing need for technical talent to understand and leverage new tools, methods, and technologies. Here are five responsibilities of a data engineering manager at this stage in the game:
- Selecting tools and data architecture
- Understanding the limitations of engineers
- Creating a modern data stack
- Communicating with business users
- Hiring junior data engineers
1. Selecting the right tools
As a data engineering manager, you'll need to be able to select the right tool for the job at hand. The right tool is situational: it depends on many factors, including the problem itself (what kind of data are we processing?), your team's skillset (do they know Python or Java better?), your company's budget (can we afford a commercial database license?) and even what other technologies are in use in your organization (do we already have a NoSQL database?).
What’s more, the set of technologies to choose from is always evolving. New tools come out all the time, often with different functionality than existing tools. So it’s important that you stay up-to-date on what technologies are available and their latest features. For example, Apache Spark was relatively unknown when it first appeared, but it has since become one of the most widely used engines for large-scale batch and stream processing.
Of course, working closely with their team to roll out the right data tools and ensure there is adoption amongst all users is a big part of the role of a data engineering manager. So, on top of understanding what's already in your tool belt, keeping an eye on the latest in data tooling, and maintaining the overall data system, data engineering managers must also oversee onboarding onto new tools.
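One way to make the selection factors from this section explicit is a weighted scoring matrix. The sketch below is only an illustration: the criteria names, the weights, and the two candidate tools (`tool_a`, `tool_b`) are hypothetical placeholders, and the scores are numbers a team would assign itself.

```python
from dataclasses import dataclass

# Hypothetical criteria and weights (they sum to 1.0); adjust to your priorities.
WEIGHTS = {
    "fits_data_problem": 0.4,    # what kind of data are we processing?
    "team_skill_match": 0.3,     # does the team already know the language?
    "within_budget": 0.2,        # can we afford the license?
    "fits_existing_stack": 0.1,  # does it work with what we already run?
}


@dataclass
class Candidate:
    name: str
    scores: dict  # criterion -> score from 0 (poor) to 5 (excellent)


def weighted_score(candidate, weights=WEIGHTS):
    """Return the weighted sum of a candidate's scores across all criteria."""
    return sum(weight * candidate.scores.get(criterion, 0)
               for criterion, weight in weights.items())


def rank(candidates, weights=WEIGHTS):
    """Sort candidates from highest to lowest weighted score."""
    return sorted(candidates, key=lambda c: weighted_score(c, weights), reverse=True)


# Hypothetical candidates with made-up scores, for illustration only.
candidates = [
    Candidate("tool_a", {"fits_data_problem": 5, "team_skill_match": 2,
                         "within_budget": 3, "fits_existing_stack": 4}),
    Candidate("tool_b", {"fits_data_problem": 4, "team_skill_match": 5,
                         "within_budget": 4, "fits_existing_stack": 3}),
]

for candidate in rank(candidates):
    print(f"{candidate.name}: {weighted_score(candidate):.2f}")
```

The numbers matter less than the conversation they force: agreeing on weights with the team makes the trade-offs visible before a tool is chosen.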
2. Understanding the limitations of your engineers and how they approach problems
As you understand more about the industry you work in, you’ll have a better idea of how to structure your team. For example, if you have a solution that requires a machine learning algorithm, and none of your engineers are familiar with this type of work, it may make sense to bring on another data engineer who has experience in the field.
If your team is not familiar with something that’s required for the project (maybe they don’t have business analysis or statistics experience), you need to be aware of it so that you can help them or hire someone else. You also need to be aware of what kinds of tasks your engineers can do well, and what they might struggle with. Data engineers are great at writing code and putting databases together, but they aren’t always good at business analysis or understanding how businesses work outside the world of technology.
3. Creating a modern data stack
Data engineering managers are responsible for managing the many complex components of a modern data stack. A modern data stack is an integrated set of tools that helps facilitate the handling, cleaning, processing, and storing of data. Ideally, it is designed to work with (and not against) the needs of your data science and analytics team. Modern data stacks include:
- Data warehouses
- Data lakes
- Pipelines for ingesting and processing data
- Catalogs like Secoda for organizing your data so you can easily find what you're looking for when you need it
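To make the "handling, cleaning, processing, and storing" flow concrete, here is a minimal sketch of a pipeline that ingests raw records, cleans them, and loads them into a table that analysts can query. It makes several assumptions: the records and the `payments` table are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import sqlite3

# Hypothetical raw records, standing in for data landed from a source system.
RAW_EVENTS = [
    {"user_id": "1", "plan": " Pro ", "amount": "49.00"},
    {"user_id": "2", "plan": "free", "amount": ""},
    {"user_id": "", "plan": "pro", "amount": "49.00"},  # no user_id: dropped
]


def clean(record):
    """Normalize one raw record; return None if it has no user_id."""
    if not record["user_id"]:
        return None
    return (
        int(record["user_id"]),
        record["plan"].strip().lower(),
        float(record["amount"] or 0),  # treat a blank amount as zero
    )


def run_pipeline(raw_records, conn):
    """Ingest raw records, clean them, and store them in a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(user_id INTEGER, plan TEXT, amount REAL)"
    )
    rows = [row for row in map(clean, raw_records) if row is not None]
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory SQLite stands in for a warehouse
loaded = run_pipeline(RAW_EVENTS, conn)

# Downstream consumers (analysts, BI tools) query the cleaned table.
revenue_by_plan = dict(
    conn.execute("SELECT plan, SUM(amount) FROM payments GROUP BY plan")
)
print(loaded, revenue_by_plan)
```

In a real stack each of these steps is usually a separate tool, which is why the manager's job includes making sure the pieces fit together.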
4. Communicating with business users
As a data engineering manager, it's your responsibility to educate and communicate with business users about the benefits of your team's work. It's also important to teach them how you do what you do, so that they understand what is feasible and what limitations their requests run into on a technical level.
To achieve this, it helps to have answers at the ready for any questions they might have. For example: “Why did my request take so long?” or “What is MapReduce?” Structured explanations of why tasks take time and how they're accomplished will help them understand the work more deeply so that they can prioritize in more informed ways. This can also prevent unnecessary micro-managing by upper management who may not fully grasp the reasons behind progress moving slowly at times.
For similar reasons, it's important to avoid jargon when communicating with business users. Dropping terms like “CLUSTER BY” and “ETL” into emails or meetings will only frustrate your audience—so leave them out, along with other phrases that may sound techy but offer little insight into actual processes.
5. Hiring junior data engineers and helping them grow into senior roles
Whether you’re hiring junior data engineers or senior ones, you’ll be responsible for the interviewing process. Interviewing skills are not always part of a data scientist or engineer’s training, so it can be hard to know what to look for in an interview. You need to be ready to mentor and grow with the people you hire, and to keep bias out of the interview process, whether it relates to sexual orientation, disability, or veteran status.
For junior engineers, we look for candidates who have a demonstrated ability to learn quickly, a passion for learning new technologies and techniques, and an ability to communicate their ideas clearly and concisely. We also look for people with good problem-solving abilities and the ability to take initiative when working with ambiguous requirements. When interviewing your candidates, you should get a sense of their previous experience as well as how they approach problems.
The best way to help your junior engineers grow into senior roles is through individualized mentorship from more experienced team members and by giving them responsibilities of increasing complexity over time.
Managing communication between teams
In an ideal workplace, communication between colleagues is open, friendly, informative and professional. But as teams have moved towards remote work, communication between teams has become much harder. Although Slack and Microsoft Teams exist, there are times when information exchanged between colleagues is not recorded in these tools, making it tough to use them as the central source of truth. Unfortunately, the majority of workplaces struggle with communication. Misunderstandings, miscommunication and missed deadlines all add up to quite a bit of stress for everyone involved.
One particular area of the organization where communication is not as efficient as it could be is between the data and product teams. The data team relies on the product team to update them on changes to the data model that might impact downstream data resources. Unfortunately, changes are made to the data model so quickly that it can feel almost impossible to update all the appropriate stakeholders every time there is a minor change. This communication problem only becomes harder to manage as companies scale. As companies become larger, the models tend to change at a faster rate. Managing the data model becomes much less about a central data team that manages every data asset and more about every team being responsible for exposing their data to a centralized platform that is easy to search. Without the right tools and processes in place, this can become a bottleneck for teams trying to grow quickly.
Some folks in the data space claim that the best way to solve this problem is to become close friends with software engineers. Once you buy them coffee in the morning, compliment them and tell them that their work is invaluable, they will start to give you a heads up on new features that might impact the data team. This is not easy to do at scale.
Some people might diagnose this as a leadership problem. Because the software development team isn’t monitored on their contribution to the data model or whether data breaks, it makes it tough to have them involve the data team in every decision. Additionally, there are data quality issues that may be created when they change the data model, which never affects their work. This type of thinking creates data debt. Unfortunately, talking to the team doesn’t seem to solve the problem because they have been directed to focus on their product.
There are many instances where leadership may not know that this is a problem. Without a direct cost and cause associated with data breaking, it’s tough for leadership to identify the root cause of the problem. For many teams, the problem has become one that they just have to live with: instead of being proactive about changes, they react to data breaking only after someone on the leadership team tells them that something is incorrect with the data.
How can data engineering managers solve this problem?
Although this problem is difficult to manage on growing teams, it’s not impossible. This problem becomes harder to solve as the software engineering, product and data teams become more siloed in the organization. One way to solve this problem is to assign a data product manager, who is responsible for facilitating the communication between the two departments. But this alone is not enough. Below, we’ve outlined three areas that teams can focus on to reduce the number of times that this kind of problem occurs.
- People: The first step teams should take is to find lines of communication that are frequent and detailed. One way some teams solve this is through the Data Product Manager (DPM) role, whose job is to manage the data requests backlog and to communicate with the software and product teams to understand what may affect the data team in the future. By working with other product managers and by communicating to others across the organization how their work may impact the data and, ultimately, the decisions that the company makes, the data product manager can advocate for the data team.
- Process: Most data teams already use Jira or a similar tool to manage data requests. We recommend creating a process in which every formal PRD, no matter the size, includes a data and reporting component in its scope. This way, data teams are informed about the changes happening upstream and what the impact of the new feature is on the business. Before any feature is signed off, this PRD should be reviewed by the software and data product managers to make sure that nothing was missed from the scope of the feature. Teams should also think about ways to share data knowledge beyond the data team, such as a data dictionary and published data analyses. Ideally, all data knowledge is centralized in one place that everyone can access.
- Tech: Lastly, teams should make sure that there are webhooks that automatically flag data model changes whenever a new PRD proposes them. Our team is working on a new feature that allows software developers to use a .yml template to automatically document changes to the data model, including new schemas. The idea is that instead of finding out about changes to the data warehouse reactively, data teams could implement something like this to become proactive about changes to the schema. If your team wants to test this feature out, feel free to reach out to us.
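As a rough illustration of the "Tech" idea, the sketch below compares two snapshots of a schema and lists what changed, which is the kind of message an automated hook could send to the data team. This is not the feature described above. The snapshot format (`{table: {column: type}}`), the `diff_schema` helper, and the example tables are all assumptions made for the example, and the notification step itself is left out.

```python
def diff_schema(old, new):
    """Compare two {table: {column: type}} snapshots and list the changes."""
    changes = []
    for table in sorted(set(old) | set(new)):
        old_cols, new_cols = old.get(table), new.get(table)
        if old_cols is None:
            changes.append(f"table added: {table}")
            continue
        if new_cols is None:
            changes.append(f"table removed: {table}")
            continue
        for col in sorted(set(old_cols) | set(new_cols)):
            if col not in old_cols:
                changes.append(f"column added: {table}.{col} ({new_cols[col]})")
            elif col not in new_cols:
                changes.append(f"column removed: {table}.{col}")
            elif old_cols[col] != new_cols[col]:
                changes.append(
                    f"type changed: {table}.{col} {old_cols[col]} -> {new_cols[col]}"
                )
    return changes


# Hypothetical snapshots taken before and after a product release.
before = {"orders": {"id": "int", "total": "float", "coupon": "text"}}
after = {"orders": {"id": "int", "total": "decimal"}, "refunds": {"id": "int"}}

changes = diff_schema(before, after)
for change in changes:
    print(change)
```

A check like this could run in continuous integration whenever a migration is proposed, so that the data team hears about a dropped column before a dashboard breaks rather than after.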
There’s a problem to be solved in this space, as most teams have been struggling to find a good way to communicate with product teams. Centralizing all the definitions and getting ahead of changes in the data model is the best way to start chipping away at this problem. Even if it feels like it’s a huge hill to climb, we promise it’s well worth climbing!