Mastering scale and complexity: best practices to scale your data infrastructure

Last updated: April 11, 2024
As the modern data stack evolves, businesses are increasingly grappling with the challenges posed by the sheer volume and complexity of data.

At MDS Fest, the scale and complexity track aims to address real issues that data teams are facing now, and offer insights into scalable solutions that enhance efficiency and ensure a positive return on investment.

🎫 Interested in learning more? Visit mdsfest.com for your free tickets.

What are the growing pains of data management?

Infrastructure that once empowered organizations now presents significant challenges. The proliferation of data has led to overly complex DAGs, a tangled web of dependencies, and the daunting task of maintaining increasingly resource-intensive infrastructure. A huge concern for many data teams is the escalating cost of managing this infrastructure, which often threatens to outweigh the benefits it delivers.

What solutions can data teams use to scale?

Recognizing these challenges, we’ve curated a series of talks focused on cutting-edge strategies and technologies designed to streamline the modern data stack. Industry leaders and innovators will share their experiences in deploying scalable solutions that not only tackle the complexity and cost issues but also drive operational efficiencies. These sessions are designed to equip attendees with the knowledge to navigate the intricacies of data management effectively.

Not sure which sessions to attend? We’ve got a rundown of all of the speaker sessions in the scale and complexity track to make things a little bit easier for you.

To register for these talks and more, visit mdsfest.com.

Navigating the data ecosystem

A talk by: Dylan Anderson, Lead Data Strategy Consultant, Redkite

When you think about the Modern Data Stack, what comes to mind? ML? AI? Dashboards? Cloud Platforms? Countless SaaS tools solving for these things? Enterprise Architecture to bring it all together? Data Governance because the data quality is poor? All of the Above?

In reality, these are all part of the Modern Data Stack, and each has its part to play. But how can we keep on top of it all? The meaning of Modern Data is growing faster than a single person can account for, and few people have the experience or knowledge to speak to all the different domains and areas. Enter this talk (and a lot of other material I'm creating related to this talk's subject). This is a talk intended for all data nerds, leaders, and practitioners.

A new era in B2B data exchange

A talk by: Pardis Noorzad, CEO, General Folders

Businesses collaborate through data exchange. However, data exchange pipelines are time consuming to build, prone to leaks, difficult to monitor, and costly to audit. In this talk, we present an overview of the why and the how of cross-company data exchange. We then discuss solutions that better match the efficiency and security standards of today.

Empowering dbt developers: self-serve dbt cloud scheduling from your dbt repo

A talk by: Pádraic Slattery, Ph.D., Analytics Engineer, Xebia

I work as an Analytics Engineer for a data consultancy, and as part of this work I frequently help clients orchestrate dbt Cloud. I’ve seen a lot of the pain points that come up when doing this, as well as a lot of different approaches to overcoming them, and I’d like to share an open-source package that resulted from these experiences. Typical approaches to version controlling dbt Cloud jobs involve either Terraform or in-house scripts that call the dbt Cloud API. Neither approach encourages self-serve scheduling by the people who actually develop dbt models. Recently I helped a client schedule dbt Cloud jobs using YML files that live in the same repo as their dbt project, a setup that offers a number of advantages.
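
To make the idea concrete, a job definition living next to the dbt project might look something like the fragment below. The schema here is purely illustrative (file name, keys, and values are invented for this sketch, not the actual format of the package the talk describes):

```yaml
# jobs/daily_run.yml -- illustrative shape only, not the package's real schema
daily_run:
  environment: production
  schedule:
    cron: "0 6 * * *"   # run every day at 06:00 UTC
  steps:
    - dbt build --select tag:daily
```

Because the file is version-controlled alongside the models, a model developer can change a schedule in the same pull request that changes the models it runs.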

How to monitor and reduce AWS Spend

A talk by: Jack Sweeney, Solutions Engineer, Omni

High cloud bills are never a welcome surprise. All the data you need to monitor your spend already exists, but getting it into one usable, operational place never seems to be a high enough priority to build. In this session, we’ll walk through how to go from raw AWS cost data to a monitoring dashboard you can use to get ahead of those costs before your next bill arrives. We’ll cover a step-by-step overview, key tricks, and potential pitfalls to help you along the way, and wrap up with some KPIs for your new dashboard and how to fit it into your workflow so you can prevent unnecessary costs.
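
The first step in any such dashboard is rolling raw line items up into per-service totals and flagging anomalies. A minimal sketch of that shape, using simplified stand-in field names (not the exact AWS Cost and Usage Report column names):

```python
from collections import defaultdict

# Illustrative sample rows in the shape of AWS cost line items.
# Field names are simplified stand-ins, not real CUR column names.
line_items = [
    {"service": "AmazonEC2", "usage_date": "2024-04-01", "unblended_cost": 120.50},
    {"service": "AmazonEC2", "usage_date": "2024-04-02", "unblended_cost": 131.75},
    {"service": "AmazonS3",  "usage_date": "2024-04-01", "unblended_cost": 14.20},
    {"service": "AmazonRDS", "usage_date": "2024-04-01", "unblended_cost": 88.00},
]

def cost_by_service(items):
    """Roll raw line items up to a per-service total, as a dashboard would."""
    totals = defaultdict(float)
    for item in items:
        totals[item["service"]] += item["unblended_cost"]
    return dict(totals)

def over_budget(totals, budget):
    """Flag services whose spend exceeds a threshold -- a simple proactive alert."""
    return sorted(s for s, cost in totals.items() if cost > budget)

totals = cost_by_service(line_items)
print(totals["AmazonEC2"])        # 252.25
print(over_budget(totals, 100.0)) # ['AmazonEC2']
```

In practice the same aggregation would run in your warehouse over exported billing data, with the threshold check feeding an alert rather than a print statement.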

Setting up Snowflake for scale and cost efficiency - learnings from 5 years implementing the MDS

A talk by: Bijan Soltani, MD Technology, Gemma Analytics

Ensure an easily scalable and maintainable Snowflake environment using an Infrastructure as Code approach. Keep costs under control but stay flexible and agile. At Gemma, we regularly set up Snowflake from scratch for our clients. This talk presents the best practices we developed to set up Snowflake in less than an hour, minimize manual maintenance and administration, and easily optimize for cost efficiency later on.
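
As a flavor of the Infrastructure as Code approach, a warehouse that suspends quickly is one of the biggest levers for Snowflake cost control. The sketch below uses attribute names from the community Snowflake Terraform provider, but treat it as an assumed illustration rather than a template from the talk:

```hcl
# Illustrative sketch: a right-sized warehouse that auto-suspends when idle.
resource "snowflake_warehouse" "transforming" {
  name           = "TRANSFORMING_XS"
  warehouse_size = "XSMALL"
  auto_suspend   = 60    # suspend after 60 seconds of idle time
  auto_resume    = true
}
```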

Timing (is) everything: embracing event-centric data modeling

A talk by: Bryce Codell, Data Engineer, Ashby

I'm a data team of 1, and to help scale my output, I adopted a data modeling paradigm centered around defining business events as standardized building blocks to power all downstream analysis and reporting. I'll talk about my understanding of the methodology, the tradeoffs, where else it can be found in the data world, and resources available to help with implementing and operating it.
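
The core of the paradigm is that every business event shares one standardized shape, so all downstream reporting can be built from the same building blocks. A minimal sketch, with hypothetical event names and a toy funnel metric:

```python
from datetime import datetime

# Hypothetical standardized event records: one shared shape for every event.
events = [
    {"entity_id": "u1", "event_type": "signed_up", "occurred_at": datetime(2024, 4, 1)},
    {"entity_id": "u1", "event_type": "activated", "occurred_at": datetime(2024, 4, 3)},
    {"entity_id": "u2", "event_type": "signed_up", "occurred_at": datetime(2024, 4, 2)},
]

def funnel_counts(events, steps):
    """Count distinct entities reaching each step -- one query shape, many reports."""
    return {step: len({e["entity_id"] for e in events if e["event_type"] == step})
            for step in steps}

print(funnel_counts(events, ["signed_up", "activated"]))
# {'signed_up': 2, 'activated': 1}
```

The trade-off the talk alludes to: every new report reuses this one shape instead of its own bespoke model, at the cost of defining events carefully up front.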

Hidden event data and where to find it

A talk by: Timo Dechau, Founder, deepskydata ApS

We’ll explore how to find and unlock event data hidden in your existing source tables. Once we’ve surfaced some, I’ll show you how to use this data to make your product, growth, and sales teams happy.
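
One common place event data hides is in the timestamp columns of operational tables: each non-null timestamp implies an event. A small sketch of unpivoting such a table (the table and column names are illustrative assumptions):

```python
# A typical operational table often hides several events in its timestamp columns.
orders = [
    {"order_id": 1, "created_at": "2024-04-01", "shipped_at": "2024-04-03", "delivered_at": None},
]

# Map each timestamp column to the event it implies.
EVENT_COLUMNS = {
    "created_at": "order_created",
    "shipped_at": "order_shipped",
    "delivered_at": "order_delivered",
}

def extract_events(rows, event_columns):
    """Unpivot timestamp columns into one event row per non-null timestamp."""
    events = []
    for row in rows:
        for column, event_type in event_columns.items():
            if row.get(column) is not None:
                events.append({"order_id": row["order_id"],
                               "event_type": event_type,
                               "occurred_at": row[column]})
    return events

print([e["event_type"] for e in extract_events(orders, EVENT_COLUMNS)])
# ['order_created', 'order_shipped']
```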

Mastering real-time processing while effectively scaling data quality with Lambda or Kappa Architecture

A talk by: Vipul Bharat Marlecha and Jai Balani, Senior Data Engineers, Netflix

In a world that creates 2.5 quintillion bytes of data every day, auditing data at scale becomes a challenge of unprecedented magnitude. ‘Mastering Real-time Processing while effectively scaling Data Quality with Lambda or Kappa Architecture’ provides a deep-dive into powerful methodologies, revealing design patterns that turn this challenge into an opportunity for businesses. Join us as we navigate the complexities of data audits and discover how leveraging these techniques can drive efficiency, reduce latency, and deliver actionable insights from your data - at any scale.
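
For readers new to the pattern: in a Lambda architecture, a batch layer periodically recomputes views from full history while a speed layer holds increments that arrived since the last batch, and serving merges both. A toy sketch of that merge (not Netflix's implementation):

```python
# Toy Lambda-architecture serving layer: batch totals plus real-time increments.
batch_view = {"page_views": 1000}  # recomputed periodically from all history
speed_layer = [("page_views", 5), ("page_views", 3), ("signups", 1)]  # recent stream

def serve(metric):
    """Answer a query by combining the batch view with real-time increments."""
    realtime = sum(v for k, v in speed_layer if k == metric)
    return batch_view.get(metric, 0) + realtime

print(serve("page_views"))  # 1008
print(serve("signups"))     # 1
```

A Kappa architecture removes the batch layer entirely and recomputes by replaying the stream, which is one of the trade-offs the talk weighs.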

Scaling Support of Dagster OSS with LLMs and RAG

A talk by: Pedram Navid, Head of Data Engineering, Dagster Labs

Dagster is an open-source data orchestration framework that has seen rapid adoption over the past few years, and with that adoption have come increased support questions. To help scale as we grow and unblock users' questions, we've created an LLM Support Bot that works. In this talk, we'll discuss the challenges of supporting open-source software, best practices for dealing with support questions, and how we prototyped and built a support bot to help deal with our users' questions.
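
The heart of such a bot is retrieval-augmented generation: rank documentation by similarity to the question, then pass the top matches to an LLM as context. A toy sketch of the retrieval step using word-count vectors (real systems use learned embeddings; the doc snippets here are invented):

```python
import math
from collections import Counter

# Hypothetical doc snippets standing in for a real documentation corpus.
docs = {
    "configuring-schedules": "how to configure a schedule for a dagster job",
    "asset-basics": "defining software defined assets in dagster",
}

def vectorize(text):
    """Bag-of-words vector: a stand-in for a learned embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, k=1):
    """Return the k docs most similar to the question, to use as LLM context."""
    q = vectorize(question)
    ranked = sorted(docs, key=lambda d: cosine(q, vectorize(docs[d])), reverse=True)
    return ranked[:k]

print(retrieve("how do I schedule a job"))  # ['configuring-schedules']
```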

Evolving data pipelines at scale

A talk by: Marisa Smith, PhD, Developer Advocate, Tobiko Data

Creating new data pipelines has never been easier, thanks to tools like dbt, which have commoditized pipeline creation, making it more accessible than ever before. However, the landscape shifts when it comes to changing and evolving existing data pipelines; serious problems can emerge and block your team from succeeding. Some of these problems include: costly data outages, inaccurate reporting or metrics affecting business outcomes, and delayed deliverables due to complex communication chains between engineering and stakeholders. Because of these problems, engineers and data practitioners often resort to building new pipelines and datasets instead of changing existing ones, escalating technical debt, increasing operational costs, and adding complexity.

In this session, we will talk about the challenges data practitioners face today and the core concepts of SQLMesh, including virtual data environments, automatic detection of breaking changes, continuous testing, and accurate previews of changes, ending with a demo.
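
To give a feel for what detecting breaking changes means, here is a conceptual sketch (not SQLMesh's actual implementation) that classifies a model change by comparing its output columns:

```python
# Conceptual sketch: a removed column may break downstream models that select it,
# while a purely additive change is non-breaking.
def classify_change(old_columns, new_columns):
    removed = set(old_columns) - set(new_columns)
    added = set(new_columns) - set(old_columns)
    if removed:
        return "breaking", sorted(removed)
    if added:
        return "non-breaking", sorted(added)
    return "no-change", []

print(classify_change(["id", "amount"], ["id", "amount", "currency"]))
# ('non-breaking', ['currency'])
print(classify_change(["id", "amount"], ["id"]))
# ('breaking', ['amount'])
```

Real tools go further, comparing full query semantics rather than column lists, but the classification drives the same decision: which downstream models must be rebuilt.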

Join us at MDS Fest 2.0

Join us at MDS Fest from April 8 - 12 to explore the cutting-edge of data-driven strategies, and take home actionable insights that could redefine the future of your organization. Get your tickets at mdsfest.com today.
