Data discovery is an essential aspect of data analytics, and it is undergoing a significant transformation thanks to the emergence of LLMs (Large Language Models) and chat-based interfaces. Traditional data catalogs primarily served as static inventories of information, which often led to challenges in data retrieval and accessibility. With the advent of LLMs and chat-based interfaces, however, data accessibility and enablement take a giant leap forward, empowering even non-technical business users to make sense of complex data with simple searches. I'm extremely excited about what this means for the category of tools and how much closer this gets Secoda to our day 1 vision of "searchable company data".
Today we launched Secoda AI, the first data discovery solution powered by LLMs. With a chat-based interface, think of it as ChatGPT for your data stack. It lets anyone at a company, regardless of technical ability, answer any data question at the speed of thought.
Data teams in particular benefit: analysts and engineers can get contextual answers to questions like:
- “Help me find the best table to calculate MRR”
- and from that answer, “Write a query that calculates MRR since last quarter”
- “Can I drop customers_id without impacting other data?”
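To make the MRR example concrete, here is a minimal sketch of the kind of query such an assistant might produce. The `subscriptions` table, its columns, and the sample rows are all hypothetical, and SQLite stands in for a real warehouse purely for illustration.

```python
import sqlite3

# Hypothetical schema for illustration; your warehouse tables will differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subscriptions (
    customer_id    INTEGER,
    monthly_amount REAL,   -- recurring revenue per month
    status         TEXT    -- 'active' or 'churned'
);
INSERT INTO subscriptions VALUES
    (1, 100.0, 'active'),
    (2,  50.0, 'active'),
    (3, 200.0, 'churned');
""")

# The kind of SQL an assistant might generate for "calculate MRR":
# sum the recurring amounts of all currently active subscriptions.
mrr = conn.execute(
    "SELECT SUM(monthly_amount) FROM subscriptions WHERE status = 'active'"
).fetchone()[0]
print(mrr)  # → 150.0
```

The point is not the SQL itself, which is simple, but that the assistant knows which table and columns to reach for without the user knowing the schema.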
With Secoda AI, we use our integrations and metadata to power a search model that is fed into ChatGPT with specific LLM prompts, acting as a data assistant for anyone who has questions about data. This could include (but is not limited to):
- Explain what things mean across your databases, dashboards, metrics, queries, and more
- Write queries against specific tables and explain existing queries
- Describe the relationships between tables for natural-language lineage analysis
- Write dbt, Airflow, SQL, or LookML code with context about your data stack
- Surface additional context about your data (e.g. what contains PII, who owns what)
- Summarize docs, questions, and dictionary terms across your assets
- Help you find the right asset through any search method:
    - "Help me find the best table to calculate MRR"
    - "Help me find the best dashboard to view customer churn"
- Write your data documentation for you
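As a rough illustration of how metadata can ground an assistant like this (Secoda's actual implementation is not shown here; the catalog structure and function below are hypothetical), the core idea is to render catalog metadata into the prompt that the model answers from:

```python
# Hypothetical sketch: assemble warehouse metadata into the system prompt
# that grounds an LLM data assistant. Names and structure are illustrative.
catalog = [
    {"table": "orders",
     "columns": ["order_id", "customer_id", "amount"],
     "description": "One row per completed order."},
    {"table": "customers",
     "columns": ["customer_id", "plan", "created_at"],
     "description": "One row per customer account."},
]

def build_system_prompt(catalog):
    """Render catalog metadata as grounding context for the model."""
    lines = ["You are a data assistant. Answer using only these tables:"]
    for t in catalog:
        cols = ", ".join(t["columns"])
        lines.append(f"- {t['table']} ({cols}): {t['description']}")
    return "\n".join(lines)

prompt = build_system_prompt(catalog)
print(prompt)
```

Because the model only sees tables described in the metadata, its answers stay anchored to assets that actually exist in your stack rather than invented ones.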
Needless to say, it's very powerful. The unique thing is that it's all based on your metadata and inputs (LookML, dbt YAML, Snowflake tags, etc.). This model is not just a text-to-SQL editor; it is much more. We believe this has the potential to lift the data discovery category into the next generation of AI for a company's data stack.
This should also allow business teams to gain a better understanding of their data, which in turn enables them to make informed decisions more quickly. By removing many of the technical barriers that previously hindered their ability to access and analyze data, teams can now take advantage of the wealth of information at their disposal all through search. This not only improves their decision-making process but also helps business users identify new opportunities and areas for data use within their organization.
Secoda's Pursuit of Searchable Company Data for Business Users
Secoda was established with a singular, ambitious goal in mind: to make the world data-driven. The name "Secoda" stems from an abbreviation of "Searchable Company Data". Businesses face real challenges in organizing and accessing company data scattered across departments, tools, and platforms, and tackling those challenges has allowed us to gain momentum as the single source of truth for data in the Modern Data Stack today. Secoda was always designed to break down the barriers that siloed information and made data discovery a laborious task.
Thanks to the transformative capabilities of LLMs and chat-based interfaces, our dream of creating a searchable company data platform has now become a reality.
These technologies can revolutionize data discovery for business users by making data more accessible and easier to understand and use. With LLMs, Secoda is now able to become the first LLM-driven data discovery tool.
How can LLMs make data workflows better?
Large language models (LLMs) are an extremely powerful advancement that has led to AI features appearing in even the most obscure, non-AI products. Used in the right context and with the right prompt engineering, we've seen that they can be a remarkably effective way to transform metadata discovery for data and business teams, which enables:
- Efficient metadata extraction: LLMs can quickly extract metadata from large volumes of unstructured documents, data or descriptions. They can also identify and extract complex metadata fields, such as information about named entities or events and relationships between documentation based on their content.
- Improved search accuracy: LLMs can use their advanced language processing capabilities to ensure accurate search results. This can also allow users to talk to their data discovery tool the way they work. Users don’t need to know the specific table or dashboard they are looking for to find what they need, they can just ask the model for the best table for their specific task.
- Consistent metadata tagging: LLMs can help ensure consistent metadata tagging across a large number of assets. This can improve the quality of metadata and make it easier to find and analyze the relevant assets.
- Identification of patterns and relationships: LLMs can analyze metadata to identify patterns and relationships that may not be immediately apparent. For example, they can identify relationships between different users, dashboards, or tables based on usage.
- Improved documentation workflow: LLMs write extremely well and can even pass medical exams. Copywriters are already starting to see the impact on their work. An LLM trained on specific datasets can help teams improve or add documentation at an incredible pace.
- Improved data workflow: LLMs can also write code. When plugged into a tool like GitHub Copilot, LLMs can automate coding tasks. When pointed to the right training data, they can be an extremely useful way to speed up coding workflows and make them more accessible to people across the team.
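The "patterns and relationships" point above can be sketched in miniature without an LLM at all: even simple parsing of query logs reveals which tables are used together, and an LLM can extract the same signal from far messier input. The query log and regex below are hypothetical and only handle simple `FROM ... JOIN` clauses:

```python
import re
from collections import Counter

# Hypothetical query log; in practice this would come from your warehouse.
query_log = [
    "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.customer_id",
    "SELECT * FROM orders JOIN payments ON orders.order_id = payments.order_id",
    "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.customer_id",
]

def joined_pairs(sql):
    """Extract (table, table) pairs from simple FROM ... JOIN clauses."""
    tables = re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, flags=re.IGNORECASE)
    first = tables[0]
    # Sort each pair so (a, b) and (b, a) count as the same relationship.
    return [tuple(sorted((first, t))) for t in tables[1:]]

pair_counts = Counter(p for q in query_log for p in joined_pairs(q))
print(pair_counts.most_common(1))  # → [(('customers', 'orders'), 2)]
```

Usage-based relationships like these are exactly the kind of context that can be fed back into the model's metadata.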
This feels like a real inflection point for data. I strongly believe that with the right inputs, LLMs can greatly enhance data discovery by providing a more complete picture of the data context. As other data assistants have come out over the last month, I've become more intrigued by the future of interfaces with data.
The context around data is a crucial aspect that can drive accuracy and trust in the results these models output. With LLMs, data and business teams should gain a deeper understanding of the data they are working with, all through search.
Integrating LLMs and Chat-Based Interfaces with Secoda
By integrating LLMs and chat-based interfaces with Secoda, the platform can further enhance data accessibility, enablement, and workflows for our users.
- Data Accessibility: LLMs can understand and interpret complex questions in plain English, allowing users to ask their questions to an AI before jumping to the data team. This should eliminate the need for non-technical users to learn SQL or possess advanced knowledge of database schemas and content. This improved data accessibility aligns with Secoda's goal of providing a single place to search company data.
- Automated Documentation: LLMs can write data documentation for users, saving data teams hundreds of hours. Unlike traditional LLMs, Secoda's LLM is powered by your metadata, meaning that all the context your data team has about your data is available to the LLM. This can dramatically improve data tagging and documentation efforts.
- Data Collaboration: Combining LLMs with chat-based interfaces promotes data collaboration among team members. Users can share their queries and insights directly within the chat platform, fostering a data-driven culture within the organization. Users can also ask about what teammates are working on, using and own, creating a layer that can point someone in the right direction before asking the data team a question.
- Data Lineage through Chat: With an LLM that's powered by your metadata, you can simply ask whether you can drop a column or what a table is related to. All your context is at your fingertips and based on your metadata, saving data teams hours spent tracking down whether a column change will break something downstream.
- Integration with the Modern Data Stack: Tools like Looker, dbt, Snowflake, BigQuery, Hightouch, and Fivetran all form the backbone of the Modern Data Stack and have unique metadata that is locked away. By seamlessly integrating all these tools into Secoda’s LLMs and chat-based interfaces, Secoda can help users harness the power of their entire data ecosystem without needing to switch between multiple applications.
For example, dbt, a data transformation platform, can be integrated with Secoda to allow users to ask questions about their dbt models, jobs, runs and transformations. Secoda can even write your dbt code based on your data.
The new paradigm for data discovery is search, and we're excited to be at the forefront of this massive change, making the data discovery experience exceptional for business and technical users alike.
Here are some use cases to get started. This is by no means an exhaustive list:
Writing a query / documentation
Locating Institutional Knowledge
Challenges of LLM implementation in practice
Despite the significant benefits of LLMs for data discovery and analysis, there are also some challenges associated with their implementation in practice. Some of these challenges include:
- Data Quality: The effectiveness of LLMs relies heavily on the quality of the data being analyzed. If the data is inaccurate, incomplete, or inconsistent, it may lead to inaccurate results and hamper the effectiveness of the LLMs. Data tooling is no different. One practice we think this encourages is documenting data thoughtfully, so you provide as much coverage as possible for the LLM to index. In Secoda, one of the considerations we've always had is to make sure that the data and metadata being viewed is always trustworthy.
- Cost and Scalability: Implementing LLMs can be expensive, and it may require significant computational power and infrastructure to process and store large amounts of data. Additionally, as the amount of data being analyzed increases, so does the complexity of the LLM models, making it challenging to scale effectively.
- Privacy and Security: LLMs require large amounts of data to be trained effectively, which can raise privacy and security concerns for businesses. It's important to ensure that sensitive information is protected and that only authorized personnel have access to the data.
- Bias and Fairness: LLMs are susceptible to bias, and it's crucial to ensure that they are fair and impartial in their analysis. Developers must take steps to identify and address any biases in the training data and ensure that LLMs are transparent in their decision-making process.
- User Adoption: To be effective, LLMs must be user-friendly and intuitive. Business users must be able to easily interact with the LLMs, and the results must be presented in a clear and understandable way. Ensuring user adoption can be a significant challenge, particularly for businesses that are new to the technology.
Future opportunities of LLM for data discovery
As LLM technology continues to evolve, there are numerous opportunities for data discovery and analysis that will become possible in the future. Some of these include:
- Integration with other pieces of the data stack: LLMs can be combined with other metadata to provide a more comprehensive understanding of data. This integration could lead to even more accurate insights and recommendations for business and technical teams.
- Improved accuracy and explainability: As LLMs become more sophisticated, they will likely become even more accurate in their analyses. Additionally, there will be a greater focus on making LLMs more transparent and explainable, allowing users to better understand how the model arrived at its conclusions.
- Automated metadata preparation: LLMs could be used to automate the data preparation process, such as identifying and correcting errors in data, normalizing data across different sources, and identifying relationships between data sets.
- Customization for specific industries: LLMs can be trained on specific data sets, allowing for customization to different industries and use cases. This could lead to more accurate and relevant insights and recommendations for businesses in various sectors.
Overall, the future of LLM data discovery is bright, and there are numerous opportunities for businesses to leverage this technology to gain insights and make data-driven decisions. As LLMs continue to evolve and improve, we can expect to see even greater advancements in data accessibility, accuracy, and usability.
LLMs and chat-based interfaces are revolutionizing data discovery for business users, making data more accessible and actionable than ever before. As these technologies continue to advance, we can expect even greater enhancements in data accessibility, real-time insights, and data collaboration. By integrating LLMs and chat-based interfaces with tools like Looker, dbt, Snowflake, BigQuery, Secoda, Hightouch, and Fivetran, businesses can fully harness the power of the Modern Data Stack and unlock the true potential of their data.
For the current Secoda customers, this functionality is now in your workspace. We extend our heartfelt gratitude to you for your constant support, which has helped us reach this milestone. Let's keep the data-driven journey going together.
For those who are new to Secoda, explore the platform here.