Data engineering is among the most popular and in-demand jobs in the big data domain worldwide. Data engineers build, monitor, and refine complex data models to help organizations improve their business outcomes by harnessing the power of data.
In this post, we’ll highlight the 20 most commonly used data engineering tools at mid-sized tech companies based on research from over 150 interviews with data engineers. We’ll also briefly dive into some trends we noticed during our conversations, further exploring how different data engineering teams are thinking about their roles in the future. By the end of this post, we should’ve shed some light on the following questions:
What data tools are analytics teams really using?
How many teams are using Snowflake over BigQuery or Redshift?
What tools can data engineers not stop talking about?
Which tools are here to stay?
1. Amazon Redshift
Amazon Redshift is a fully managed cloud data warehouse built by Amazon. About 60% of the teams we interviewed use it. Amazon’s easy-to-use cloud warehouse is an industry staple that powers thousands of businesses. It lets anyone set up a data warehouse easily and scales as you grow.
2. BigQuery
Like Amazon Redshift, BigQuery is a fully managed cloud data warehouse. It is commonly used at companies that are familiar with the Google Cloud Platform. Analysts and engineers can start using it while their data is small and scale with the tool as their data grows. It also has powerful built-in machine learning capabilities.
3. Tableau
Tableau is the second most commonly used BI tool in our survey. One of the oldest data visualization solutions, its main function is to gather and extract data that is stored in various places. Tableau uses a drag-and-drop interface to make data usable across different departments. Data engineers work with this data to create dashboards.
4. Looker
Looker is BI software that helps employees visualize data. It is popular and commonly adopted across engineering teams. Unlike traditional BI tools, Looker includes LookML, a language for describing dimensions, aggregates, calculations, and data relationships in a SQL database. Spectacles, a recently launched tool for managing a team’s LookML layer, allows teams to deploy their LookML with confidence. By updating and maintaining this layer, data engineers can make it easier for non-technical employees to use company data.
5. Apache Spark
Apache Spark is an open-source unified analytics engine for large-scale data processing. It can quickly perform processing tasks on very large data sets and can distribute those tasks across multiple computers, either on its own or in tandem with other distributed computing tools. These two qualities are key to big data and machine learning, which require marshalling massive computing power to crunch through large data stores.
6. Apache Airflow
Apache Airflow is an open-source workflow management platform. It started at Airbnb in October 2014 as a solution for managing the company’s increasingly complex workflows. Airflow allowed Airbnb to programmatically author and schedule workflows and monitor them through the built-in Airflow user interface. It is the most commonly used workflow management solution and was used by around 25% of the data teams we interviewed.
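To make "programmatically authoring a workflow" concrete, here is a minimal sketch in plain Python. It is not Airflow's API: the `run_workflow` helper and the three task names are hypothetical, and the ordering relies on the standard library's `graphlib` module (Python 3.9+). It only illustrates the underlying idea of declaring tasks and their upstream dependencies in code and letting a scheduler work out the run order.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_workflow(
    tasks: Dict[str, Callable[[], None]],
    upstream: Dict[str, Set[str]],
) -> List[str]:
    """Run each task after all of its upstream tasks; return the run order."""
    # TopologicalSorter takes a mapping of node -> set of predecessors.
    order = list(TopologicalSorter(upstream).static_order())
    for name in order:
        tasks[name]()
    return order


# Hypothetical three-step pipeline; each task just records that it ran.
log: List[str] = []
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
upstream = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

run_order = run_workflow(tasks, upstream)
print(run_order)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this dependency ordering.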
7. Apache Hive
Apache Hive is a data warehouse software project built on top of Apache Hadoop for data query and analysis. Hive provides a SQL-like interface for querying data stored in various databases and file systems that integrate with Hadoop. Teams deploy Hive for three main functions: data summarization, data analysis, and data query. Hive’s query language is HiveQL, and Hive translates these SQL-like queries into MapReduce jobs that run on Hadoop.
8. Segment
Segment makes it simple to collect and use data from the users of your digital properties. With Segment, you can collect, transform, send, and archive your customer data. The tool simplifies the process of collecting data and connecting it to new tools, which allows teams to spend less time processing and collecting data.
9. Snowflake
Snowflake’s unique shared data architecture delivers the performance, scale, elasticity, and concurrency today’s organizations require. Many teams we spoke to were interested in Snowflake and its capabilities for storing and computing data, so we expect more teams to switch over to Snowflake in the coming years. In Snowflake, data workloads scale independently from one another, making it an ideal platform for data warehousing, data lakes, data engineering, data science, and developing data applications.
10. dbt
dbt is a command-line tool that allows data engineers and analysts to transform data in their warehouse using SQL. dbt is the transformation layer of the stack and doesn’t offer extraction or load operations. It allows companies to easily write transformations and orchestrate them more efficiently. The product is built by Fishtown Analytics and has rave reviews from data engineers.
11. Redash
Redash is designed to enable anyone, regardless of the level of technical sophistication, to harness the power of data big and small. SQL users leverage Redash to explore, query, visualize, and share data from any data source. Their work in turn enables anybody in their organization to use the data without much of a learning curve.
12. Fivetran
Fivetran is a comprehensive ETL tool. Fivetran allows efficient collection of customer data from related applications, websites, and servers. The data collected is moved from its original state to the data warehouse and then transferred to other tools for analytics, marketing, and warehousing purposes.
13. great_expectations
great_expectations is a Python-based open-source library for monitoring, validating, and understanding your data. It focuses on helping data engineers maintain data quality and improve communication between teams. Software teams have used automated testing for some time to test and monitor their code. great_expectations brings the same processes to data engineering teams.
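The idea of testing data the way software teams test code can be sketched in plain Python. The example below does not use the great_expectations API. The helper names (`expect_not_null`, `expect_between`), the result shape, and the `ORDERS` sample are all hypothetical, chosen only to show what a declarative data check and its result might look like.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def expect_not_null(rows: List[Row], column: str) -> Dict[str, Any]:
    """Check that no row has a missing value in `column`."""
    failed = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {
        "expectation": f"{column} is not null",
        "success": not failed,
        "failed_rows": failed,
    }


def expect_between(rows: List[Row], column: str, low: float, high: float) -> Dict[str, Any]:
    """Check that every value in `column` lies in the inclusive range [low, high]."""
    failed = [
        i
        for i, row in enumerate(rows)
        if row.get(column) is None or not (low <= row[column] <= high)
    ]
    return {
        "expectation": f"{column} between {low} and {high}",
        "success": not failed,
        "failed_rows": failed,
    }


# Hypothetical sample data with one negative amount and one missing id.
ORDERS: List[Row] = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": -3.0},
    {"order_id": None, "amount": 12.5},
]

results = [
    expect_not_null(ORDERS, "order_id"),
    expect_between(ORDERS, "amount", 0, 10_000),
]
for result in results:
    status = "PASS" if result["success"] else "FAIL"
    print(status, result["expectation"], result["failed_rows"])
```

Running checks like these on every pipeline run gives data teams the same early warning that unit tests give software teams.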
14. Apache Kafka
Kafka is primarily used to build real-time streaming data pipelines and applications that adapt to those data streams. Streaming data is data that is continuously generated by thousands of data sources, which typically send their records simultaneously. Kafka was originally built at LinkedIn, where it played a part in analyzing the connections between millions of professional users in order to build networks between people.
15. Power BI
Power BI is a business analytics service by Microsoft. It aims to provide interactive visualizations and business intelligence capabilities with an interface simple enough for end users to create their own reports and dashboards. The data models created from Power BI can be used in several ways for organizations, including telling stories through charts and data visualizations and examining "what if" scenarios within the data.
16. Stitch
Stitch is a cloud-first, open-source platform for rapidly moving data. A simple, powerful ETL service, Stitch connects to all your data sources, from databases like MySQL and MongoDB to SaaS applications like Salesforce and Zendesk, and replicates that data to a destination of your choosing.
17. Periscope (Acquired by Sisense)
Periscope is a business intelligence and analytics tool. The software lets you integrate data from multiple sources and create visuals to share with your team. Periscope is very similar to Redash in that it lets you use SQL to visualize your data.
18. Mode
Mode Analytics is a web-based analytics platform. Mode gives employees an easy-to-use workspace with some external sharing capabilities. The team at Mode focuses on producing reports, dashboards, and visualizations. Additionally, Mode’s SQL layer acts as a semantic layer that helps non-technical users work with the platform.
19. Prefect
Prefect is an open-source tool for ensuring that data pipelines operate as expected. The company offers two products: Prefect Core, a data workflow engine, and Prefect Cloud, a workflow orchestration platform.
20. Presto
Presto is an open-source, distributed SQL query engine. Presto can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning quickly.
What are data teams most excited to use?
Data engineers almost unanimously agree that the most exciting tool they want to learn or use is dbt. The team at Fishtown Analytics has done an amazing job of creating a community around analytics engineering. The tool is a command-line tool that allows data engineers to transform the data in their warehouse using SQL. The company recently raised a significant funding round on the strength of how much it simplifies workflows for data engineers.
Secondly, many people we talked to wanted to try, or were in the process of moving towards, Snowflake. The current users really enjoy the functionality of the tool and would recommend it to anyone looking for a data warehouse.
What’s next in Data Engineering?
It’s hard to speculate what’s next in data engineering. Based on our research, the primary focus of data engineering teams after they have solved the data warehousing, ETL, and quality problems is to start defining and analyzing the data. This follows the framework of the “Analytics Hierarchy of Needs” created by Ryan Foley.
Solving the communication problem
After the collection and cleaning steps have been accomplished, teams seem to start running into problems with defining and analyzing data. These problems do not arise from a lack of knowledge or skills to get to the final answer. Instead, they are created by siloed data that isn’t collaborative. We call this the data debt problem. It creates a communication problem for teams trying to get on the same page about what to measure and how to define key metrics. For example, three teams might each define the same metric differently:
- A data team defines the “number of rides per week” as the total number of rides that were completed between Jan. 1, 2020, 12:00 AM → Jan. 7, 2020, 11:59 PM.
- The marketing team defines the “number of rides per week” as the total number of rides that were started between Jan. 1, 2020, 12:00 AM → Jan. 7, 2020, 11:59 PM.
- The sales team defines the “number of rides per week” as the total number of riders that paid for a ride between Jan. 1, 2020, 7:00 AM → Jan. 8, 2020, 6:59 AM.
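The three definitions above can be run against the same data to see how far apart they land. The sketch below uses a handful of made-up rides. The `Ride` type, the sample records, and the function names are hypothetical, and we assume the sales definition counts distinct riders.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional


class Ride(NamedTuple):
    rider_id: str
    started_at: datetime
    completed_at: Optional[datetime]  # None if the ride was cancelled
    paid_at: Optional[datetime]       # None if no payment was made


def in_window(ts: Optional[datetime], start: datetime, end: datetime) -> bool:
    """True if the event happened and falls inside the inclusive window."""
    return ts is not None and start <= ts <= end


WEEK_START, WEEK_END = datetime(2020, 1, 1, 0, 0), datetime(2020, 1, 7, 23, 59)
SALES_START, SALES_END = datetime(2020, 1, 1, 7, 0), datetime(2020, 1, 8, 6, 59)


def data_team_count(rides: List[Ride]) -> int:
    """Rides completed within the calendar week."""
    return sum(in_window(r.completed_at, WEEK_START, WEEK_END) for r in rides)


def marketing_count(rides: List[Ride]) -> int:
    """Rides started within the calendar week."""
    return sum(in_window(r.started_at, WEEK_START, WEEK_END) for r in rides)


def sales_count(rides: List[Ride]) -> int:
    """Distinct riders who paid within the shifted sales week (assumption)."""
    return len({r.rider_id for r in rides if in_window(r.paid_at, SALES_START, SALES_END)})


# Made-up sample data for illustration only.
RIDES = [
    Ride("a", datetime(2020, 1, 1, 0, 30), datetime(2020, 1, 1, 1, 0), datetime(2020, 1, 1, 1, 5)),
    Ride("a", datetime(2020, 1, 3, 9, 0), datetime(2020, 1, 3, 9, 20), datetime(2020, 1, 3, 9, 21)),
    Ride("b", datetime(2020, 1, 7, 23, 50), datetime(2020, 1, 8, 0, 10), datetime(2020, 1, 8, 0, 12)),
    Ride("c", datetime(2020, 1, 5, 12, 0), None, None),  # cancelled ride
    Ride("a", datetime(2020, 1, 6, 18, 0), datetime(2020, 1, 6, 18, 30), datetime(2020, 1, 6, 18, 31)),
]

print("data team:", data_team_count(RIDES))   # 3
print("marketing:", marketing_count(RIDES))   # 5
print("sales:", sales_count(RIDES))           # 2
```

Five rides produce three different answers to "number of rides per week", which is exactly the kind of mismatch that leads to long alignment meetings.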
Conversations about what data means can take weeks and multiple meetings to solve. Because of this, we believe that the next frontier for data engineering tools will focus on making the communication problem less painful. By having a place where teams can view the same information while also understanding the nuances in different definitions, the define/track and analysis steps of Foley’s hierarchy are streamlined, if not solved. Our team is building a centralized way for teams to discover data collaboratively with Secoda. We believe this will be an important piece of the puzzle as more teams embrace the modern data stack across departments.
Solve today's data communication problems with a new model of working and communicating.
At a time when we're trying to solve communication problems in our data-driven modern world, it's easy to forget that the solution is not just about communicating better.
It's easy to blame and criticize others for our communication failures. It's tempting to point out how other people "don't get it" or how they simply don't work well with others. But at the end of the day, we're all just humans trying to make sense of our shared world, and as such there will always be those who don't seem to be able to understand us.
We can do better than this. There are other ways to approach the problem of data-driven communication: build a system where everyone works on understanding each other better; build a platform that allows everyone on your team, at your organization, and even around the globe to see and understand each other’s work; use a common platform so that everyone can see what everyone else is working on; create a productized learning experience for anyone who wants to learn more about data communication from an expert.
Save time by building data communication into your existing processes.
Saving time is paramount for any business. In a fast-paced global economy, it’s more important than ever to be able to react quickly. Using tools that are already familiar to your teams will shorten the time it takes them to get up and running with data communication and collaboration.
Get everyone on the same page, using the same data.
To solve this problem, you need to ensure that everyone on your team is using the same data and the same information. With any team, you're going to have different people with different roles. Some are responsible for data collection, some are responsible for data entry, and others are responsible for analyzing and sharing the insights they've gleaned from the data. Each of these roles requires a slightly different perspective on the information; nonetheless, everyone needs a system that works consistently and predictably across all of your company's departments.
Tools that allow employees outside the data team to make confident decisions must continue to adapt to the needs of modern businesses. If you’re interested in talking about the future of the data stack and how we can help your team, you can reach us at firstname.lastname@example.org