Data Discovery Tools: Should You Build or Buy?
Every data team has data discovery on their mind, even more so as they begin to scale. Whether they already have a data discovery solution or are considering taking the first step toward implementing one, the growing pains of a scaling data organization mean looking for ways to empower both you and your team to search, find, and derive conclusions from data with minimal hand-holding.
This is where data discovery comes in, and why finding the right data discovery service or solution matters. After all, implementing one is a big transition, and moving from one tool to another because your first solution fell short is an even bigger one. Based on our experience with the dozens of data organizations we work with, we’ve put together a guide to finding the best data discovery solution for your organization. We’ll be covering:
- What is data discovery?
- What does a data discovery tool do?
- Why do you need a data discovery tool?
- Enterprise data discovery solutions
- Things to consider when finding or building a discovery tool
- When and how to build your own data discovery solution
- The best enterprise data discovery tools
What is Data Discovery?
Data discovery is the process of applying techniques such as data mining and interactive visualization to a company's data, with the goal of finding and understanding patterns in that data.
In the broader sense, it’s the combination of process, tooling, and organization that data creators and gatekeepers (typically the data analysts or engineers in an organization) use to make data accessible to those who need it. Sometimes they’re making it accessible to people within the data organization (e.g. their fellow analysts and engineers); other times they’re ensuring that data is reliably accessible to people outside of it (e.g. stakeholders in sales, marketing, or engineering).
What does a Data Discovery Tool do?
Data discovery tools help both data stewards and business users (non-technical users) access and analyze complex data sets within their organization. The tools provide visualizations and other pre-built analyses that allow business users to answer specific questions about the data. The key components of a data discovery tool include:
- Data preparation. This is the process of getting your raw data ready for analysis and discovery. It typically relies on ETL tools, but taking data from initial ingestion to analyzed and presentable to external stakeholders is a multi-layer, multi-tool job.
- Data exploration. While this term is sometimes used interchangeably with “data discovery”, it encompasses the broader practice of understanding and indexing your raw data. This typically happens before a specific question or query is asked.
- Data visualization. After a query or question has been asked of the raw data, and the components that are necessary to answer these questions are identified, data discovery tools will help build the narrative of the response in a way that everyone, technical or not, will be receptive to.
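To make the exploration and indexing step more concrete, here is a minimal sketch of a searchable metadata index. It is an illustration only, not how any particular product works: the `TableEntry` fields, the `search_catalog` helper, and the two sample tables are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    """Metadata for one table. Every field here is an illustrative assumption."""
    name: str
    description: str
    owner: str
    columns: list[str] = field(default_factory=list)


def search_catalog(catalog: list[TableEntry], term: str) -> list[str]:
    """Return names of tables whose name, description, or columns mention the term."""
    needle = term.lower()
    matches = []
    for entry in catalog:
        # Case-insensitive substring match over the searchable metadata.
        haystack = " ".join([entry.name, entry.description, *entry.columns]).lower()
        if needle in haystack:
            matches.append(entry.name)
    return matches


# Hypothetical sample entries, made up for this example.
catalog = [
    TableEntry("orders", "One row per customer order", "data-eng",
               ["order_id", "customer_id", "total"]),
    TableEntry("web_sessions", "Sessionized site traffic", "analytics",
               ["session_id", "landing_page"]),
]
```

A business user could then call `search_catalog(catalog, "customer")` to find candidate tables without asking the data team where customer data lives.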
Why your team needs a Data Discovery Tool
There are many reasons why organizations need to prioritize data discovery sooner rather than later. For most, there’s a tipping point, usually within the data organization, that triggers the search for a tool:
- Your data analysts and engineers want to stop dedicating manual labor and mental headspace towards repeatable questions, queries, and reports.
- Your business teams want to access insights from data without waiting on the data team to release this information to them.
- Your stakeholders, business teams, and data analysts want to use the data beyond specific queries and questions, i.e. to be empowered to interact with data in an exploratory manner.
- Your data team is scaling quickly. Onboarding new members to the data organization is a time-consuming task that, again, becomes repeatable once automated.
Why should data teams approach data tooling differently?
Many teams choose to purchase data tools because they want to start using them as quickly as possible, without weighing the tradeoffs between speed and flexibility. Similarly, many teams choose to build their solutions on an open-source tool because they believe their use case is so unique that nothing on the market can fulfil their requirements. No matter which approach your team chooses, there will always be benefits and drawbacks to choosing one over the other (speed, customization, support, reliability).
Enterprise Data Discovery
When evaluating which data discovery solution you’d like to use, you should ask yourself the following questions:
- What job do these tools need to accomplish?
- How will it be managed and iterated on over time?
- Who is affected by this solution internally? If a large majority of your team is using this solution consistently, they will need the solution to be reliable and consistent.
- Can this tool scale with our needs?
- How much work will it take to set up?
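One way to turn the questions above into a comparable number is a simple weighted scoring matrix. The sketch below is only an illustration: the criteria names, the weights, and the ratings are placeholders we made up, and your team should substitute its own.

```python
# Hypothetical criteria and weights mirroring the questions above.
# The weights are assumptions and are chosen to sum to 1.
WEIGHTS = {
    "fits_the_job": 0.30,
    "maintainability": 0.20,
    "reliability": 0.20,
    "scalability": 0.15,
    "setup_effort": 0.15,  # higher rating means less setup work
}


def weighted_score(ratings: dict[str, int]) -> float:
    """Combine 1-5 ratings into a single weighted score."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(WEIGHTS[criterion] * ratings[criterion] for criterion in WEIGHTS)


# Placeholder ratings for two options; not measurements of any real tool.
build = {"fits_the_job": 5, "maintainability": 2, "reliability": 3,
         "scalability": 3, "setup_effort": 1}
buy = {"fits_the_job": 4, "maintainability": 4, "reliability": 4,
       "scalability": 4, "setup_effort": 5}

for label, ratings in (("build", build), ("buy", buy)):
    print(label, round(weighted_score(ratings), 2))
```

The score itself matters less than the conversation it forces: agreeing on the weights makes the team state which of the questions above it cares about most.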
Data Discovery Tooling: Build or Buy?
Once it’s time to make a decision, we believe that it’s important to choose a solution based on the end goals your team has with the product as well as the dependency that other stakeholders will have on the product.
Suppose you’re looking to implement data discovery and start using the tool to align teams on what certain terms mean, how to access data and what data to trust. In that case, it could make more sense to buy from a vendor who already has built and manages a product that can achieve these functions.
However, suppose you are more interested in deep data governance and your data infrastructure requires unique features that traditional vendors do not cover. In this case, you should consider building a tool for your specific use case.
As the data stack becomes more fragmented with tools like Reverse ETL, data quality, data observability, data catalogues and headless BI tools, teams will have to pick which of these tools they want to maintain internally vs. buy from a vendor. We believe that in the future, teams who make the right decisions about which products to manage vs. purchase will be able to leverage their data teams' core competency and provide much more value to the business.
In the case of data discovery, data teams should ask themselves if they are well equipped to build a user-friendly data discovery and governance tool, which requires a mix of user experience, product management and data engineering abilities.
Building a Data Discovery solution
At first glance, building off an open-source tool appears to be a good option because it allows you to create a tool that is the perfect fit for your specific business model. But teams that choose to build on open-source products can run into unique challenges that end up requiring even more data engineering effort, and the end result may be just as expensive as buying a solution. Teams should also consider that they are highly unlikely to get the resources to build their vision of the perfect tool internally. One reason is that investing in a tool that is not part of your core differentiation might not be a great use of company resources.
This is especially true at the beginning, and with a tool that is used by a variety of stakeholders. When data teams decide to manage open source tools that are used by a variety of stakeholders, they risk having to meet the demands of those stakeholders for future iterations of the product. The management of an internal open-source tool can very easily start to consume a data team that is not prepared to manage a product that is built for the entire organization. That being said, there are a lot of open-source tools that can work well for data teams. Tools that are used internally by the data team and perform a very isolated role are a great use case for open source tooling.
Here are some of the pros and cons of building data discovery tools from an open-source library or from scratch:
Pros:
- Your team has complete control of the product and where you want to take it in the future.
- Your team benefits from contributions made by other members of the open-source community.
- You can fit the tool to your exact use case.
Cons:
- It takes longer to see value.
- Building the initial proof-of-concept version is relatively easy, but building deep features and making sure they are accurate gets increasingly complex and challenging.
- Any additional functionality and integrations need to be custom-built by your team.
Over time, the volume, complexity and scope of the tools might change as the needs of your business and your technical requirements change. When you’re planning your product, you need to think about how the tool may change as things become more complex, and be prepared to build support for that future iteration of the product.
Buying a Third-party Data Discovery Tool
The primary reason to purchase instead of building software is to save time, money and resources. Teams should also ask what building the tool internally would add to the organization's core competency. If the answer is "not much", buying lets the team spend its time configuring the data discovery tool to its exact data stack and specific needs rather than building it.
By buying from a vendor, you are guaranteed to see continuous changes and developments to the tool, regardless of your company's resources. If it takes time for your team to develop new features and for the open-source community to innovate on the product, it might make more sense to purchase a solution from a vendor instead of building your own tool.
As with building, there are tradeoffs to purchasing your tool. Below are some of them.
Pros:
- You can get started with the product right away, without a large or complex setup period.
- You can trust the results that you see in the product because they are vetted by multiple engineers who work on it as part of the vendor's team.
- Most integrations are built by the vendor and work out of the box.
- The product continuously improves to match the latest trends.
Cons:
- The vendor might not customize the product to match your team's exact needs.
- Setting up processes, documentation and tools might still require resources.
- Reliability and scalability are dependent on the vendor.
Once you have a good understanding of the cost associated with buying, you should try to understand what it would take to build your own tool or manage one internally.
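A rough way to compare the two paths is to estimate each one's cost over the same time horizon. The sketch below is a simplified model under stated assumptions: the function names, the parameters (such as `maintenance_fraction`), and every number in the example calls are hypothetical placeholders, not real salary or vendor pricing data.

```python
def build_cost(engineers: float, months_to_build: float,
               monthly_cost_per_engineer: float,
               maintenance_fraction: float, horizon_months: int) -> float:
    """Upfront build cost plus ongoing maintenance for the rest of the horizon."""
    upfront = engineers * months_to_build * monthly_cost_per_engineer
    # After launch, assume only a fraction of each engineer's time goes to upkeep.
    remaining_months = max(horizon_months - months_to_build, 0)
    maintenance = (engineers * maintenance_fraction
                   * monthly_cost_per_engineer * remaining_months)
    return upfront + maintenance


def buy_cost(monthly_license: float, setup_cost: float,
             horizon_months: int) -> float:
    """One-time setup cost plus the license fee over the whole horizon."""
    return setup_cost + monthly_license * horizon_months


# Placeholder inputs for a 24-month horizon; replace with your own estimates.
print("build:", build_cost(2, 6, 10_000, 0.25, 24))
print("buy:", buy_cost(3_000, 5_000, 24))
```

The model leaves out things that are harder to price, such as time to value and the opportunity cost of pulling engineers off core work, so treat the output as a starting point for the discussion rather than an answer.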
Best Data Discovery Tool for your team
- Secoda. Our platform makes data discovery possible directly in your workflow. The context you add to tables, documentation, and entire databases is all interconnected, and you can run queries directly in documentation alongside visualization capabilities. Secoda also integrates directly with all of the data discovery tools listed below, making any transition a seamless one. See it in action here.
- Tableau. This is currently one of the most popular data discovery tools on the market. As part of the Salesforce suite, Tableau is an obvious option for those who already use Salesforce or plan to in the future. Tableau makes ad hoc analysis easy, as well as collaboration.
- Looker. Another popular data discovery tool, Looker is part of Google Cloud. Similar to the utility that Tableau provides for Salesforce users, those who already use other Google products will find it easy to work with Looker.
- Qlik Sense. Qlik Sense makes both interacting with data directly in the platform and visualizing it seamless. Those working with Qlik Sense are able to collaborate and share live dashboards that draw directly from their data.
Is your Data Discovery Solution Working?
Whichever path your team chooses, we believe making a decision is far better than taking too long to evaluate alternatives. This is because the amount of data accumulated and the number of requests a data team receives only grow with time. The earlier teams adopt data governance and data discovery tools, the sooner they will be able to trust their data. Of course, there’s a lot at stake when making decisions about data and the impact it can have on your business.
If you are currently struggling to decide whether to build or buy a data discovery tool and would like a second opinion (we promise it won’t be biased), our team can help you define your goals and technical requirements so you avoid any serious roadblocks along the way.