Making Data Discovery Simple
February 4, 2021
Data discovery from day one
Even with great data practices, many organizations still struggle to get value out of data. Up to 73% of all enterprise data goes unused. One of the big contributors to this problem is that organizations create data silos from day one by not documenting and centralizing their data in a place where everyone can access the information. When people in a variety of roles who understand the meaning of the data collaborate, a larger percentage of relevant data will be put to use.
For most companies, a data "transformation" begins with hiring a data scientist or a data engineer who is given the task of making the data more available across the company. To begin this transformation, this employee will usually evaluate data warehousing tools, business intelligence tools, and any additional ETL tools they might require to transform the raw data into usable insights. For growing companies, the data stack is starting to become universal.
Before any value is created from data science, this employee has to evaluate the data stack and prepare the data to become useful. This employee spends months cleaning, renaming, and preparing the data for storage in the data warehouse and, in doing so, becomes the broker of data in the organization. Any question about business metrics or performance starts coming to the data department because most employees don't know how to access the data, what data to trust, or what certain data means. This tribal knowledge problem is extremely painful and only gets worse as an organization scales. Because data grows faster than headcount, it is almost impossible to keep up with the information produced at a fast-growing company. New data analysts and scientists spend countless hours trying to find the right stakeholder to ask about a particular table while getting bombarded with requests from managers. Tracking down and understanding siloed information slows down decision making and creates a culture that isn’t able to take full advantage of self-service tools.
Over the past year and a half, we experienced this problem. I was one of the employees asking questions and Andrew was one of the employees trying to keep up with data requests while working his regular role. The communication around data between technical and non-technical employees is extremely frustrating for both sides. On the one side, the employee waits a long time to hear back from the knowledge broker and on the other side, the knowledge broker is constantly overwhelmed by the requests from non-technical employees.
After dealing with this problem for what felt like way too long, we decided to start exploring potential solutions. We found a common trend, which was that at many companies, the largest barrier around data was the lack of data context. In one instance, we learned that a data scientist spent 4 weeks tracking down the meaning behind a duplicate row. Although this sounds like a completely inefficient use of time, it's quite common. Most data scientists have experienced the pain of missing data context.
Most of the time, this information is just tribal knowledge that isn't shared across the organization. The process most teams follow today is asking a colleague what certain things mean. We believe that organizations that want to become data-driven should start to think about this problem from day one. By using a data documentation and discovery tool, growing teams can prepare themselves for the challenges that come with scaling data across the company.
Good data discovery means great data documentation
There are a few reasons why a tool like this hasn't received mass adoption yet, and why we believe that now is a great time to start building a tool that will help document data. The primary reason is that no one likes to document their work, especially not data scientists. Some teams have even brought on a documentation specialist to solve this problem. The value of documentation is indirect and usually isn't noticed until months down the line. So why should a data scientist spend time documenting work when someone will just figure it out later?
The reason to invest in data discovery today is to stay ahead of data debt. We know that it's extremely tough to get people to change their behaviour. That's why we believe the solution could come from a tool that automatically documents the data and proactively alerts teams about the resources contributing most to data debt.
This means that the metadata that helps answer the important questions is documented automatically while people keep using the data the same way they did before. A great data documentation tool would help employees answer "who created this data, what are the columns of this data, when was the data created, where is this data, how can I access this data, and why was this data created" without having to ping a colleague on Slack.
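To make those six questions concrete, here is a minimal sketch of a metadata record that stores one answer per question. The class name `TableDoc`, its field names, and the `describe` helper are hypothetical and chosen only for illustration; they are not Secoda's actual data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class TableDoc:
    """Hypothetical metadata record: one field per question in the text."""
    name: str
    owner: str           # who created this data
    columns: List[str]   # what are the columns of this data
    created_on: date     # when was the data created
    location: str        # where is this data
    access: str          # how can I access this data
    purpose: str         # why was this data created


def describe(doc: TableDoc) -> str:
    """Render the record as the answer a colleague would otherwise type out."""
    return "\n".join([
        f"Table: {doc.name}",
        f"Who: {doc.owner}",
        f"What: {', '.join(doc.columns)}",
        f"When: {doc.created_on.isoformat()}",
        f"Where: {doc.location}",
        f"How: {doc.access}",
        f"Why: {doc.purpose}",
    ])


# Illustrative values only; none of these come from a real warehouse.
example = TableDoc(
    name="orders",
    owner="data-engineering",
    columns=["order_id", "customer_id", "total"],
    created_on=date(2020, 6, 1),
    location="warehouse.analytics.orders",
    access="read-only role 'analyst'",
    purpose="Revenue reporting",
)
```

Calling `describe(example)` returns a seven-line summary that could be shown next to the table wherever people look it up.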
Data can be a social resource
Most people can agree that siloed data is useless. Similarly, most people can agree that data without context is useless. The third point, which we believe is extremely important, is that good data is used as a social resource. Companies traditionally collaborate on documents or look at code created by other employees to learn about what's going on across the organization. Data is one of the few resources in a company that is not used this way today. This means that there isn't an ongoing thread about important data resources, we don't know what the most commonly used resources are by department or role, and it usually takes a long time to find who owns what data.
We believe that data can be a social resource, meaning that employees can collaborate around tables and visualizations to speed up decision making. Today, this collaboration happens asynchronously through Slack or in a Confluence document. These undocumented conversations lead to isolated decisions that are difficult to reference in the future. With a powerful social data discovery tool, teams could open up discussions on visualizations and get to decisions faster. Additionally, social interactions with data could help employees identify the most relevant resources faster, using signals such as most liked resources, favourited resources, and most active discussions. We believe that the data democratization problem is a people problem and requires both data-savvy and non-data-savvy employees to collaborate towards a solution.
Data that works where you work
At launch, Secoda will be able to integrate with the following resources with the click of a button:
- Amazon Athena
- AWS Glue
- Amazon Redshift
- Apache Cassandra
- Apache Druid
- Apache Hive
- Delta Lake
- Google BigQuery
- IBM DB2
- Microsoft SQL Server
- Mode Analytics
- Apache Airflow
These integrations should cover a majority of the use cases for Secoda, although we are aware of some other important integrations that we have on the roadmap.
The last piece of the puzzle is where this kind of tool should run. While we are confident that a dashboard is an important component of the tool, we are equally confident that a bot that integrates into the workplace messaging tool matters just as much. This bot would assist employees when they are looking for a specific data resource or when they have a question about a particular data resource. Rather than leaving their messaging tool, they can simply use Secoda to find what they need when they need it.
Collecting and analyzing data is essential to ensuring your company is making the right decisions and heading in the right direction.
Secoda is the place to centralize company data. We’ve built Secoda as a single place for all incoming data and metadata, a single source of truth. The goal of Secoda is to help employees find and understand the right information as quickly as possible. Secoda doesn’t only add context to data. It also tracks the relationships between people and data to help you visualize the interactions between the different collaborators in the enterprise. This allows Secoda to identify who owns data, who is affected by changes, what tables are most commonly used together, and the most commonly used resources by team or individual.
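As a rough illustration of how relationships between people and data could be derived, the sketch below counts which tables appear together in queries and which tables each user touches most. The query log format, the user names, and both function names are assumptions made for this example; this is not a description of how Secoda is implemented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical query log: (user, set of tables referenced by one query).
QUERY_LOG = [
    ("analyst_a", {"orders", "customers"}),
    ("analyst_b", {"orders", "customers", "refunds"}),
    ("analyst_b", {"orders"}),
]


def tables_used_together(log):
    """Count how often each pair of tables appears in the same query."""
    pairs = Counter()
    for _user, tables in log:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(tables), 2):
            pairs[pair] += 1
    return pairs


def usage_by_user(log):
    """Count how many queries from each user touched each table."""
    usage = {}
    for user, tables in log:
        usage.setdefault(user, Counter()).update(tables)
    return usage


together = tables_used_together(QUERY_LOG)
per_user = usage_by_user(QUERY_LOG)
```

With this sample log, `together.most_common(1)` puts the `customers` and `orders` pair first, and `per_user["analyst_b"]` shows `orders` as that user's most queried table. The same counts could be grouped by team instead of by individual.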
If you're interested in giving Secoda a shot, please sign up to our beta at https://www.secoda.co
Etai & Andrew