There is more data produced today than ever before. Companies at every stage are creating data and looking for ways to operationalize it. Because there is so much data and so few people on the data team, tribal knowledge is a problem most data teams face. Data analysts and scientists spend countless hours trying to transfer the information in their heads to the rest of the company, but consistently fall short.
Some even compare it to a nightmare scenario out of a movie, where every new day looks just like the one before. No matter what tools or processes data teams put in place, they are never enough to enable self-service. A few solutions introduced over the last year promise to solve this age-old problem, but none of them seem to treat the root cause of the issue.
We need to treat this as a knowledge problem and look for solutions that provide a holistic picture of the data knowledge generated across the different functions of the data team. Before jumping into those solutions, it's important to understand the different pieces of knowledge a data team generates and how that information is produced.
In reality, data knowledge is extremely fragmented because of the number of new tools that have entered the data space. Data teams use dbt and Snowflake for data cataloging, Google Sheets or Segment for events, Confluence or Looker for general knowledge, Slack and Google Forms to manage data requests, Jira for project management, GitHub for queries, email for more data requests, Google Docs for data definitions, stickies on a wall for sanity, whiteboards for collaboration, and meeting after meeting to try to get on the same page. The list goes on and on.
The data team generates a few distinct areas of knowledge from its data. The first category is metadata and metadata documentation: the information about the data itself that helps teams understand what they have. Many large companies have built their own data catalogs to manage this knowledge and have open-sourced their solutions. Some data teams have quickly adopted these tools but still find that the number of requests from employees doesn't decrease once a data catalog is introduced. It's a useful piece of the puzzle, but it misses important context for the many employees who are not SQL literate.
Small companies should remember that solutions built for large companies rarely work well for smaller teams. Data catalogs require a lot of manual setup and assume that most employees in the organization can write SQL. At larger tech companies, many employees are expected to learn SQL if they want to advance in their roles, so they are incentivized to use data catalogs in their day-to-day work. Smaller teams have less need for a complex data catalog and more need for a central place to share information about data with the rest of the company.
The second piece of data knowledge is the business definitions associated with the data. This knowledge might describe how a company calculates MRR or active users. A data dictionary is a collection of these key terms and metrics; small teams usually document this knowledge in Google Sheets. Although this seems like a simple exercise, it's very difficult to align different departments on the same definitions. A ride-sharing company we wrote about in a previous article shared an example of these difficulties: conversations about what a particular definition means can take weeks and multiple meetings to resolve. Because of this inefficiency, a data dictionary is possibly one of the most valuable artifacts a data team can deliver to the business.
Without a data dictionary, decision-makers may disagree about what the data show and what actions to take. Reports among teams might show different numbers for the same metric from the same data source due to inconsistent business logic. Teams may even argue about the correct definition and defend their turf, perhaps because their definition makes their numbers look better. This is not good for business.
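As a small illustration of how this happens, the following Python sketch (with made-up event data and hypothetical team names) shows two plausible definitions of "active users" producing different numbers from the same data source:

```python
# Hypothetical event log for one month: (user_id, event_type).
events = [
    ("a", "login"),
    ("a", "purchase"),
    ("b", "login"),
    ("c", "purchase"),
    ("d", "login"),
]

# Definition 1 (say, the growth team): anyone who logged in is active.
active_by_login = {user for user, event_type in events if event_type == "login"}

# Definition 2 (say, the finance team): only users who purchased are active.
active_by_purchase = {user for user, event_type in events if event_type == "purchase"}

print(len(active_by_login))     # 3 active users by the login definition
print(len(active_by_purchase))  # 2 active users by the purchase definition
```

Both teams are querying the same events, yet their dashboards disagree, and neither is "wrong" - the business logic simply differs. A data dictionary makes that difference explicit instead of leaving it to be discovered in a meeting.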
Today, some data teams document this information in Google Sheets or Confluence documents; more often, they don't document it at all. Recently, some larger companies have written interesting articles about their metrics layer, which aims to solve this problem. Even though we're optimistic that the metrics layer could help teams speak the same language, we believe there are other pieces of the puzzle that also need to be considered.
The majority of analysis that data teams do fits into one of two buckets: reactive or proactive. Reactive analysis is usually done in a dashboard tool and is often requested by a business user. For most analysts, this is rarely the fun part of the job; it involves creating, fixing, or explaining dashboards. Today, most data teams have a Slack channel or Google Form to take in data requests and a Jira board to prioritize those requests on the roadmap. The problem with this approach is that questions answered in the past are difficult to find, which makes it very likely the data team will receive the same question twice. In our short time investigating this problem, we found that 50% of data requests had already been answered in the past. Most data teams don't have a central place to store their analysis besides the BI tool, so it's usually difficult for employees to find old analysis and to understand which pieces of it are reliable.
The second type of analysis is much more proactive. It is less structured and lets analysts explore topics that require deeper investigation than a simple dashboard. Surprisingly, there's no great system for keeping track of these analyses. Most likely the work starts in the BI tool, but the final product - the document that contains the analysis and its recommendations - is scattered across analysts' computers, emails, and Slack posts, and is built on top of ungoverned queries and notebooks. This makes it difficult for data teams to reference old conclusions and insights. Centralizing the analysis is an important part of making data discovery work; without a holistic picture of the data, data discovery only serves a narrow set of use cases.
Keep all your data knowledge in one place
We believe all this knowledge should be accessible in one place. Rather than switching tabs and going down rabbit holes, data teams should have a central place to go when generating knowledge that will be used by the rest of the team. That's why we're excited to announce the new Secoda. Instead of just a data discovery tool, Secoda is becoming the collaborative workspace for data knowledge. In this new tool, data teams will be able to do the following:
- Metadata management: Teams can integrate their data warehouse, transformation tool, and BI tools into Secoda to get an automatically generated data catalog. This catalog shows how often a resource is used and who uses it, and lets teams add context about how it can be used. Additionally, Secoda auto-generates lineage and has out-of-the-box governance features. The best part is that it takes a simple 5-minute setup to get the catalog running, and Secoda has already created over 15 no-code integrations. Instead of using Confluence, data teams get notifications when schemas or documentation change and can create rich documentation around the metadata.
- Data definitions: Teams can also create a data dictionary in Secoda to manage the definitions of their KPIs. Definitions can be written in SQL, in text, or as a README document, and can be referenced throughout the rest of the documentation. Data teams can also choose to get updates when tables related to their definitions change, so they stay in sync. Instead of using Google Sheets, teams have a central place to find all the information about their definitions.
- Analysis documents: Teams can now create analysis documents in Secoda that execute queries and render charts inline. This feature is a step toward capturing the ad hoc analysis that traditionally gets lost. In an analysis document, team members can tag data resources from their catalog or definitions from their data dictionary. Teams can also share these analyses with teammates, giving everyone a single view of past conclusions delivered with the data.
- Data requests: Lastly, data teams can create a data request process in Secoda to manage incoming questions that can't be answered with the above information. If a question has been answered in the past, an employee can search for it in Secoda. If they can't find the answer, they can submit a data request directly in Secoda, and the data team can respond with an analysis document. This way, no question has to be answered twice, and every answer is recorded in Secoda.
Secoda merges these features into a collaborative environment that lets data teams easily curate information for the rest of the organization. Anyone looking for information can search for metadata, analysis, or definitions right from Slack, and Secoda will retrieve it without requiring them to open the tool itself. The vision for this tool is to capture every piece of data knowledge, and we're excited to announce these new features that we hope will improve its functionality.