Updated
June 10, 2025

Why there isn't an LLM for data (...yet)

Despite major advances in AI, a universal LLM for data still hasn’t arrived. In this MDS Fest 3.0 recap, Secoda CEO Etai Mizrahi breaks down the unique challenges—and what it’ll take to finally get there.

Etai Mizrahi
Co-founder

🎥 Recap from Etai Mizrahi's talk at MDS Fest 3.0

The data industry has seen remarkable technological shifts over the past three years. 

First came the expansion of specialized tools, then consolidation through acquisitions, and now we're entering the third wave: AI integration. But despite the clear demand for AI-powered data tools, no universal "LLM for data" has emerged to transform analytics the way tools like Cursor have revolutionized coding or Harvey has transformed legal work.

"A tool could be something like what we're seeing Cursor do for coding. A tool could be something like Harvey for legal. The question is really, where is that for data and analytics, and why haven't we seen it just yet?" - Etai Mizrahi, CEO, Secoda

At MDS Fest 3.0, Secoda CEO and co-founder Etai Mizrahi explored this shift, breaking down exactly why building effective AI for data is uniquely challenging—and what it takes to get it right.

What a universal AI tool for data should accomplish

Before getting into the challenges, it's worth establishing what success would look like for a data team using an AI tool. 

An ideal AI tool for data should handle five core capabilities:

  1. Context-aware code generation: Writing SQL and Python that leverages actual schemas, understands table relationships, and produces better CTEs, joins, and aggregations based on your specific data environment.
  2. Intelligent schema interpretation: Understanding lineage, parsing complex joins across tables, and providing insights into data relationships without requiring manual documentation of every connection.
  3. Universal dashboard creation: Building visualizations across different BI tools and source systems, translating natural language queries into meaningful charts and insights.
  4. Cross-stack integration: Working seamlessly whether your data lives in Snowflake, your transformations are in dbt, or your dashboards are in Looker—without requiring separate configurations for each tool.
  5. Intent-driven modeling: Learning from user behavior and organizational context to improve recommendations over time, similar to how Cursor ingests your codebase to understand your specific coding patterns.
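The first capability above can be sketched in a few lines: inline the live schema and known join keys into the prompt so generated SQL is grounded in real columns. The table names and prompt shape below are illustrative assumptions, not any particular product's implementation.

```python
# Sketch: assembling schema context for code generation.
# The schema, relationships, and prompt format are illustrative assumptions.

SCHEMA = {
    "orders": ["order_id", "customer_id", "order_date", "amount"],
    "customers": ["customer_id", "name", "region"],
}

RELATIONSHIPS = [("orders.customer_id", "customers.customer_id")]

def build_prompt(question: str) -> str:
    """Inline the live schema so the model writes SQL against real columns."""
    lines = ["You may only reference these tables and columns:"]
    for table, cols in SCHEMA.items():
        lines.append(f"  {table}({', '.join(cols)})")
    lines.append("Known join keys:")
    for left, right in RELATIONSHIPS:
        lines.append(f"  {left} = {right}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

prompt = build_prompt("Total order amount per region?")
print(prompt)
```

The point is that the constraint ("only these tables and columns") travels with every request, which is what separates context-aware generation from generic SQL autocomplete.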

The vision is compelling, but the execution has proven elusive. Here's why.

The two fundamental challenges blocking progress

"It really comes down to two fundamental challenges. The first challenge is the data fragmentation challenge, but I think the bigger underlying problem is the lack of interoperability around how we communicate with those different tools." - Etai Mizrahi, CEO, Secoda

Challenge 1: Data fragmentation and lack of interoperability

The most obvious problem is that data context is scattered across disconnected tools. 

Your Snowflake warehouse stores the actual data, your dbt YAML files contain transformation logic, and your Looker LookML defines business metrics. Each represents a critical piece of context, but they don't communicate effectively with each other.

The deeper issue is semantic inconsistency across tools. What Power BI calls a "chart," Looker might refer to as an "embedded data source." These aren't just naming differences—they represent fundamentally different approaches to organizing and presenting data that an AI system must learn to navigate.
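One way systems cope with this is a canonical vocabulary layer that maps each tool's term to a shared concept before reasoning over it. A minimal sketch, with the term mappings as assumptions:

```python
# Sketch: normalizing tool-specific vocabulary into one canonical concept.
# The mappings below are illustrative assumptions, not a complete taxonomy.

CANONICAL_TERMS = {
    ("power_bi", "chart"): "visualization",
    ("looker", "embedded data source"): "visualization",
    ("tableau", "worksheet"): "visualization",
}

def normalize(tool: str, term: str) -> str:
    """Map a tool-specific term to its canonical concept; pass through unknowns."""
    return CANONICAL_TERMS.get((tool, term.lower()), term)

print(normalize("power_bi", "chart"))               # visualization
print(normalize("looker", "Embedded Data Source"))  # visualization
```

A lookup table only handles the naming layer, of course; the harder part the talk points to is that the underlying objects behave differently, which no rename can paper over.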

This problem compounds when working with newer tools that lack extensive public documentation. A platform like Sigma might not have the domain context available for LLM training that more established tools possess, creating knowledge gaps that are difficult to fill.

Schema evolution adds another layer of complexity. Unlike static documents, data environments are fluid. What's true on day one might be completely different by day ninety as schemas evolve, lineage changes, and pipelines are rebuilt. Any AI system must account for this constant state of change.

Challenge 2: Contextual ambiguity and organizational knowledge

"Metrics like MRR may vary across organizations. We could, in theory, try to define them with a central definition, but there's likely some nuance, and there needs to be some sort of definitional logic that requires us to understand that institutional and legacy knowledge." - Etai Mizrahi, CEO, Secoda

The second major challenge involves the institutional knowledge that makes data meaningful. MRR (Monthly Recurring Revenue) might seem like a standard metric, but its calculation can vary significantly between organizations or even departments. Some include one-time fees, others exclude them. Some count annual contracts monthly, others don't. See where this is going?
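A toy example makes the divergence concrete. Both functions below are illustrative; real MRR logic is organization-specific:

```python
# Sketch: the "same" metric computed two ways.
# Subscription data and both definitions are illustrative assumptions.

subscriptions = [
    {"plan": "monthly", "monthly_fee": 100, "one_time_fee": 50},
    {"plan": "annual",  "monthly_fee": 80,  "one_time_fee": 0},
]

def mrr_strict(subs):
    """MRR counting only recurring fees."""
    return sum(s["monthly_fee"] for s in subs)

def mrr_with_one_time(subs):
    """MRR folding one-time fees into the current month."""
    return sum(s["monthly_fee"] + s["one_time_fee"] for s in subs)

print(mrr_strict(subscriptions))         # 180
print(mrr_with_one_time(subscriptions))  # 230
```

Same data, same metric name, two defensible answers. An AI that doesn't know which definition the organization trusts will confidently report the wrong one.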

This contextual ambiguity extends beyond metrics to fundamental questions about data interpretation. Understanding which tables are authoritative, which joins are appropriate, and which metrics are trusted requires deep organizational knowledge that exists in people's heads, not in documentation.

Technical limitations that compound the problem

"Obviously, with something like data and analytics we want consistency to be very right even if that means speed is a little bit slower. There's obviously an issue around staleness of architecture. We might have metrics or lineage that is outdated." - Etai Mizrahi, CEO, Secoda

These fundamental challenges create several technical constraints that current LLM architectures struggle to handle:

  • Limited token windows: There's simply not enough space in a single prompt to include all the context needed for complex data questions. Trying to maximize token usage often leads to inconsistent results.
  • Hallucination risk: In data work, accuracy is non-negotiable. A plausible-sounding but incorrect SQL query can lead to wrong business decisions. This creates tension between speed and reliability that many current AI tools simply haven't resolved yet.
  • Stale architecture context: Even when you include the right metadata, it might be outdated. Pipeline changes, table deprecations, and metric redefinitions happen faster than documentation updates.
  • Long-horizon workflows: Analytics work is often exploratory, requiring multiple iterations and context retention across extended conversations. This demands larger context windows and more sophisticated memory management than most current systems provide.
  • Security and access control complexity: Every table, column, and dashboard may have different permission requirements. An effective AI system must understand not just what data exists, but who can access it, creating a complex web of authorization logic.
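The hallucination risk in particular invites a cheap guard: reject generated SQL that references tables the warehouse doesn't have. A real system would parse the SQL properly; the regex below is an illustrative simplification.

```python
import re

# Sketch: a lightweight guard against hallucinated table names.
# KNOWN_TABLES and the regex-based extraction are illustrative assumptions.

KNOWN_TABLES = {"orders", "customers"}

def referenced_tables(sql: str) -> set:
    """Pull table names that follow FROM or JOIN keywords."""
    return {m.lower() for m in re.findall(r"(?:from|join)\s+(\w+)", sql, re.I)}

def validate(sql: str) -> bool:
    """True only if every referenced table actually exists."""
    return not (referenced_tables(sql) - KNOWN_TABLES)

good = "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.customer_id"
print(validate(good))                          # True
print(validate("SELECT * FROM revenue_v2"))    # False
```

Note that this only catches one failure mode; stale lineage and wrong-but-valid joins pass straight through, which is why validation alone doesn't resolve the speed/reliability tension.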

What's working today: Narrow solutions and specialized tools

"There are some things that are working. There are some things we're seeing that indicate that this is the right direction for data to be going in. The first is we're seeing some narrow copilots that work really well for certain use cases." - Etai Mizrahi

While no universal solution has emerged, several approaches are showing promise in specific contexts:

  • Narrow copilots: Tools like dbt Copilot work well within their specific environments because they have access to rich, contextualized metadata within a single system.
  • RAG-based agents: Systems that fetch comprehensive schema context before answering questions can provide more accurate responses, though they're typically limited to specific data warehouses or tool combinations.
  • Static data analysis: ChatGPT and Claude work reasonably well with uploaded CSVs or structured datasets, but this represents single-shot analysis rather than integration across a full data stack.
  • SQL-to-visualization tools: LLMs can generate charts from SQL queries, but again, this typically works within constrained environments rather than across complex data architectures.

These solutions work because they operate within defined boundaries with clear context. The challenge is scaling this approach across the full complexity of modern data stacks.

The path forward: What it will take to build universal AI for data

"What do we need to actually get to the next rung of great tooling for data and AI? I think it comes down to four things. The first is a unified knowledge graph." - Etai Mizrahi

Based on current progress and technical requirements, four key components seem necessary for a breakthrough:

1. Unified knowledge graph architecture

Moving beyond traditional data catalogs to create queryable knowledge graphs that include tables, dashboards, metrics, owners, data quality scores, and governance rules. This knowledge graph must update automatically as lineage changes, track usage patterns, and reflect schema evolution in real-time.
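A minimal sketch of what "queryable" means here, with node and edge names as illustrative assumptions:

```python
# Sketch: a tiny knowledge graph linking tables, metrics, and dashboards.
# Node types, owners, and edge labels are illustrative assumptions.

graph = {
    "nodes": {
        "orders":        {"type": "table", "owner": "data-eng", "quality": 0.95},
        "mrr":           {"type": "metric", "owner": "finance"},
        "rev_dashboard": {"type": "dashboard", "owner": "analytics"},
    },
    "edges": [
        ("mrr", "derived_from", "orders"),
        ("rev_dashboard", "displays", "mrr"),
    ],
}

def upstream(node: str) -> list:
    """Walk lineage edges to find everything a node depends on."""
    deps = [dst for src, rel, dst in graph["edges"]
            if src == node and rel in ("derived_from", "displays")]
    result = list(deps)
    for d in deps:
        result += upstream(d)
    return result

print(upstream("rev_dashboard"))  # ['mrr', 'orders']
```

The hard part isn't the traversal; it's keeping these edges fresh as pipelines are rebuilt, which is why automatic updates are part of the requirement rather than a nice-to-have.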

2. Multimodal reasoning engine

AI systems need to combine text, structured metadata, and visual context to understand data relationships. This means parsing not just table schemas but also dashboard configurations, transformation logic, and business context to validate outputs across different formats.

3. Centralized security layer

Rather than trying to replicate RBAC logic for each tool, successful AI systems will need to integrate with centralized security models that can propagate appropriate access controls to AI interactions.
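In a sketch, that might look like filtering the AI's retrievable context through a central policy store before any prompt is built. The roles and policies below are hypothetical:

```python
# Sketch: gating AI context retrieval through one central policy store,
# rather than re-implementing each tool's RBAC. Policies are hypothetical.

POLICIES = {
    "orders":   {"analyst", "admin"},
    "salaries": {"admin"},
}

def allowed_tables(user_roles: set) -> set:
    """Tables this user may see; the AI only pulls context from these."""
    return {table for table, roles in POLICIES.items() if roles & user_roles}

print(allowed_tables({"analyst"}))  # {'orders'}
```

The design choice here is that access control happens before retrieval, so restricted tables never enter the model's context at all, rather than trusting the model to withhold what it has seen.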

4. Human-in-the-loop feedback systems

The most promising approaches involve data teams actively training AI systems through validation and correction. This creates organizational memory where correct reasoning paths are reinforced and incorrect ones are avoided.

A multi-agent approach: Learning from how data teams actually work

"We've been modeling workflows by giving different agents a specific task. For example, an agent could be really good at lineage parsing. An agent could be really good at writing queries or searching for the information. And all of those agents work in tandem." - Etai Mizrahi, CEO, Secoda

One of the most interesting developments is the emergence of multi-agent systems that mirror actual data team workflows. Instead of trying to solve everything with a single large model, these systems assign specialized agents to specific tasks:

  • Lineage parsing agents that understand data relationships
  • Query synthesis agents that combine information from multiple sources
  • Search agents that find relevant context across different tools
  • Validation agents that check outputs for consistency and accuracy

These agents communicate with each other, plan multi-step workflows, and validate their work before presenting results. When they lack context, they can request additional information from other agents, creating an iterative problem-solving approach that resembles how experienced data teams tackle complex questions.
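A stripped-down version of that orchestration, with each agent stubbed out; a real system would back each with its own model and tool access:

```python
# Sketch: specialized agents running in sequence over a shared context.
# Agent behaviors are stubs; the pipeline shape is the illustrative point.

def lineage_agent(question, context):
    """Resolve which assets the question touches."""
    context["lineage"] = "orders -> mrr -> rev_dashboard"
    return context

def query_agent(question, context):
    """Draft SQL using whatever context earlier agents gathered."""
    context["sql"] = "SELECT SUM(amount) FROM orders"
    return context

def validation_agent(question, context):
    """Check the draft before anything is shown to the user."""
    context["validated"] = "sql" in context and "lineage" in context
    return context

PIPELINE = [lineage_agent, query_agent, validation_agent]

def answer(question):
    context = {}
    for agent in PIPELINE:
        context = agent(question, context)
    return context

result = answer("What's total revenue?")
print(result["validated"])  # True
```

Even in this toy form, the structure shows why the approach scales: each agent can fail, retry, or ask for more context independently, instead of one monolithic prompt doing everything in a single shot.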

This approach has shown promise because it can handle the multi-step exploration that characterizes real analytics work, while maintaining accuracy through validation and cross-checking between specialized components.

The road ahead: Benchmarking, integration, and organizational change

"I really do believe this is a game-changing inflection point for how we think about data stacks. How do we think about a data coordination platform? Something that really allows embedded agents, memories, enforced rules, and different data sources to come into one place." - Etai Mizrahi, CEO, Secoda

Looking forward, several areas need development to realize the vision of universal AI for data:

  • Specialized benchmarking: Unlike general LLMs, data AI tools need benchmarks that measure accuracy, speed, formatting, and reasoning quality specifically for analytics use cases. Standard NLP benchmarks don't capture the nuanced requirements of data work.
  • Model Context Protocol (MCP) integration: As MCP adoption grows, data AI tools need to plug into broader AI ecosystems, potentially allowing tools like Cursor to access data context directly within development workflows.
  • Data team role evolution: The emergence of reliable AI for data will likely shift data team responsibilities toward model preparation, data quality assurance, and AI validation—requiring new skills and workflows.
  • Data coordination vs. data cataloging: The future may require thinking beyond traditional data catalogs toward "data coordination platforms" that support embedded agents, organizational memory, and rule enforcement across multiple systems.

Building trust through incremental rollouts

"The best way to launch something like this is to really think about your modeled data that has context around it. These models deteriorate in quality as soon as they are working with extremely messy data and extremely undocumented environments." - Etai Mizrahi, CEO, Secoda

Perhaps the most practical insight from current implementations is the importance of building trust through controlled deployments. Rather than launching AI tools across entire data warehouses, successful teams start with 25-30 well-modeled, well-documented tables. This approach allows for:

  • Quality validation in a controlled environment
  • Gap identification and iterative improvement
  • Team confidence building before broader rollout
  • Context refinement based on actual usage patterns

As trust builds and the system proves reliable in constrained scenarios, scope can expand to include more complex data sources and use cases.
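In practice, that controlled scope can be as simple as an allowlist the assistant consults before touching a table, widened only once the current scope proves reliable. Table names below are illustrative:

```python
# Sketch: an incremental-rollout allowlist for an AI data assistant.
# Table names are illustrative; real scopes start at ~25-30 modeled tables.

ROLLOUT_SCOPE = {"orders", "customers", "subscriptions"}

def in_scope(table: str) -> bool:
    """Should the assistant be allowed to reason about this table yet?"""
    return table in ROLLOUT_SCOPE

def expand_scope(new_tables: set) -> None:
    """Widen the allowlist once the current scope has earned trust."""
    ROLLOUT_SCOPE.update(new_tables)

print(in_scope("orders"))       # True
print(in_scope("legacy_logs"))  # False
```

The mechanism is trivial by design: the discipline is in what you admit to the set and when, not in the code that enforces it.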

The bottom line

The question isn't whether AI will transform data work—it's how quickly we can solve the unique challenges that make data different from other domains where AI has already succeeded.

Data comes with complex relationships, evolving schemas, organizational context, and strict security requirements. The solutions that emerge will need to account for all of these factors while remaining practical for everyday use.

The most promising approaches combine specialized AI agents, comprehensive metadata management, and iterative human feedback to create systems that understand not just data, but how organizations actually use data. While we're not there yet, the foundation is being built for AI tools that can finally deliver on the promise of universal, reliable, context-aware data assistance.

For data teams exploring AI today, the key is to start with strong metadata, clear documentation, and well-defined scope—then expand gradually as trust and capability grow. The universal LLM for data may not exist yet, but the path to building it is becoming clearer.
