Engineering

How Monte Carlo Prevents Data Downtime, Featuring Founder & CEO Barr Moses

By Admin • Aug 22, 2023

Listen to Crafted, Artium's podcast about great products and the people who make them.

Listen and subscribe to Crafted: Apple Podcasts | Spotify | All Podcast Apps

“This problem was so painful and so meaningful to people that I just couldn't believe a world where a solution to this didn't exist.” When Barr Moses identified the very costly problem of what she coined “data downtime”, she knew she needed to solve it. Barr is the Founder and CEO of Monte Carlo, a data observability platform that’s on a mission to eliminate data downtime, a problem that can cost companies millions of dollars each time there’s an outage and the numbers — and the systems that rely on them — go haywire. And with the growth of AI, data problems are even more important to prevent. 

On this episode, Barr explains how she used the scientific method to home in on the problem to solve and the company to found — she actually launched three companies simultaneously before seeing the most traction with Monte Carlo, and going all in. We also learn about Monte Carlo’s customer-led approach that helped them create an end-to-end solution that leaves no data stone unturned.

Full transcript below — but we recommend you listen for the best experience. 

Barr Moses: Data was wrong all the time, and I was really frustrated by that. I would wake up to this barrage of emails from unhappy customers, unhappy executives, unhappy internal stakeholders saying, "WTF, why is the data wrong?"

Dan Blumberg: That's Barr Moses, the founder and CEO of Monte Carlo, a data observability platform that's valued at more than a billion dollars. Monte Carlo is on a mission to eliminate what it calls data downtime: that wake-up-in-a-cold-sweat experience for data teams when something's broken and the numbers are all wrong. It's a huge problem that can cost companies millions. And after discovering there were no solutions, Barr decided to build one.

Barr Moses: This problem was so painful and so meaningful to people that I just couldn't believe a world where a solution to this didn't exist.

Dan Blumberg: Not surprisingly, Barr is very data-driven. On this episode, she shares how she used the scientific method when launching Monte Carlo, validating her idea through hundreds of conversations and launching two other companies at the same time to see which one had the most traction.

Barr Moses: I think the thing that helped me was being very hypothesis-driven and asking myself, what do I need to believe in order to believe that this can be a meaningful company where I can actually make an impact in the world?

Dan Blumberg: Welcome to Crafted, a show about great products and the people who make them. I'm your host, Dan Blumberg. I'm a product and engagement leader at Artium, where my colleagues and I help companies build incredible products, recruit high-performing teams, and help you achieve the culture of craft you need to build great software long after we're gone.

Barr, you really grew up in data. Your dad ran a lab that you spent a lot of time in. I'd love if you could start by just talking about those roots in science and the path that led you to where you are today.

Barr Moses: Yeah, for sure. My dad is a physics professor and my mom is actually a meditation and dance teacher, so I had the pleasure of living in both worlds. I remember growing up with my dad and playing a lot of guesstimate games, if you will. So we would sit in a movie theater and try to guesstimate how many people were in the audience. As an eight-year-old, that was a lot of fun. I studied math and statistics, and my favorite course was actually mathemagics: using mathematical algorithms to construct magic tricks.

So I think growing up, the joy of data and numbers and math and everything was something that was really present in my life. Growing up in Israel, I was drafted into the Israeli military. I was in the Air Force working in an intelligence unit, again using data, but for very different reasons. And obviously, that introduced the concept of diligence and what it means to deliver very high-quality data where the stakes are incredibly high and the data is mission-critical.

Later on in my career, I actually wanted to follow in the footsteps of my dad. So I thought that I was going into academia, and then realized that it's really not for me, did a strong 180, and went to consulting, actually. Consulting was like a business school for me because I got to work with Fortune 500 companies, and many other organizations, on their data transformations. Prior to starting Monte Carlo, I was at a company called Gainsight, which created the Customer Success category. The reason I joined was, one, I was excited to help create the category, but two, to bring some of my quantitative background to customer success and actually think about how can you use data to predict churn? How can you use data to drive increases in revenue? How can you actually use data, in reality, for businesses?

Dan Blumberg: You wanted to start a company and so you ended up starting three companies just to figure out which one worked. What made you land on Monte Carlo?

Barr Moses: I was working at Gainsight, and a lot of what I was focused on and excited about was how do you bring data to back up customer success and customer relationships. This was circa 2015, 2016, so organizations were just... We called ourselves data-driven, but I'm not sure we actually used data. Just counting how many customers we had was hard enough. I think these were the years when we actually started to use data in board decisions, or for example, in daily reports that our executive team was looking at. In some cases, also in reports that we had in our product that our customers were using. As the person responsible for that data, I found the data was wrong all the time. I was really frustrated by that. I had one job: get the data right. The data was wrong, and I would wake up to this barrage of emails from unhappy customers, unhappy executives, unhappy internal stakeholders saying, "WTF, why is the data wrong?"

Not only was I the last person to learn about data being wrong, I was also then swept into a multi-day or multi-week process to try to figure out why it was wrong and whose fault it was, etcetera. Being really frustrated by that, I turned to my engineering counterparts, and I was like, "Oh, they have all these fancy off-the-shelf solutions to make sure that their products are reliable, and yet we're here flying blind, hoping that someone will tell us the data is wrong."

We actually ended up building our own solution, hacking it together, which allowed us to be the first to know when data breaks, and also to be able to communicate why, and what the impact of that is. It worked so well that we actually implemented it for some of our customers. I remember leaving that thinking, "Huh, how does this product or solution not exist? If there's anyone out there in the world who's actually trying to use data, they're probably being hit with similar problems. How could it be that there's nothing like this?"

When I left Gainsight, I thought back on that experience and asked myself, can that actually be a real viable product and a real viable company? But I'm actually not a big risk-taker. And so, being very risk averse, I wanted to make sure that if I'm leaving my cushy job and getting others to leave their cushy jobs, there better be something behind what I'm doing. So I actually decided to start three companies in parallel, with the idea of trying to see if there's a real problem here and now, and a problem that's big enough and material enough to build a real company around.

Dan Blumberg: While she developed the ideas for her companies, Barr needed to flesh out what it would mean to solve data downtime. So she started gathering intel.

Barr Moses: Literally getting on a call with hundreds of people and asking them, "Do you have this pain point? What does that look like? Could we work together on a solution for this? What does a solution look like? Where would you get started? What are the main risks?" And throughout the process I actually gained confidence that this problem, which we later coined data downtime, was so painful and so meaningful to people that I just couldn't believe a world where a solution to this didn't exist.

Dan Blumberg: How else did you identify that there really was... You say you're risk averse, you wanted to be sure that this was the right bet to make. It was through talking to lots and lots of people. Were there also experiments and things that you ran in the early days?

Barr Moses: Oh, yeah. It was as hypothesis-driven as you could be. It's a very surreal experience. You're in your room alone and you're like, "Let's start a company." Also, by the way, 99.9% of startups in the world fail. So you're doing something that's by definition going to fail. You wake up every morning ready, eager, and excited to do that. So living in that world is a little weird. I think the thing that helped me was being very hypothesis-driven and asking myself, "What do I need to believe in order to believe that this can be a meaningful company with very happy customers where I can actually make an impact in the world?" And actually laying out the hypotheses. So for example, with data downtime and data reliability, we had to believe that data is going to be more important than it is. Remember, this is four years ago. It actually wasn't obvious that data was going to be this big.

I think today you could argue that there's no space that's more interesting than data. But back then, this was before the rise of Snowflake and Databricks, before the big acquisitions of Looker and Tableau, before the rise of roles like data engineering and machine learning engineering. None of that existed. So we were making a pretty sizable bet here that we think data will become more important. That's one thing that I needed to get conviction on. The second thing I needed to get conviction on is, is this problem meaningful enough right now? I cold-called hundreds of people to ask them. People who don't owe me anything, they don't love me, they don't care about my success-

Dan Blumberg: They'll give it to you straight?

Barr Moses: Exactly. And I asked them, "What are the top problems that are keeping you up at night?" There was this very visceral problem of, "I wake up in the morning and I don't know if the data that I'm responsible for is going to be right, or if some of my stakeholders are going to be angry at me." It was very emotional for them and very present, here and now. People had real, tangible examples of how that was impacting not only their jobs, but also their businesses.

Companies lose material money related to these data issues. It's not uncommon for companies to lose tens of thousands, hundreds of thousands, or millions of dollars because of one data issue. It became really clear to me that this is a real problem, here and now, that we can help solve. The third question, or the third hypothesis, that I needed to build conviction on was the how, meaning that there is actually a way to solve this problem.

And that is something that was very controversial, because many people in the data space told us every company's data and every company's data structure is totally different, so you're never going to find patterns and you're never going to find a way to identify data downtime issues. Our hypothesis was that they were all wrong. In the same way that in software engineering every application is different and people build very different products, there is still a coherent set of metrics, a certain framework that people look at to define uptime and reliability of software. And we believed that we could develop a similar framework in data as well. It doesn't have to be perfect. It's not going to solve all the data problems under the sun, but it'll give people the tools and the confidence in their data, and that's what we were after. So going through those three hypotheses is really what gave us the conviction to get going.

Dan Blumberg: I love it. I love the scientific method. So how does data get messed up, and how does Monte Carlo work to prevent it from being wrong?

Barr Moses: I would say data has always been wrong, just because people make mistakes, data breaks, pipelines break. Systems are brittle and error-prone, and so by definition, data will be wrong. That has always existed. However, two things have changed in the last decade, I would say, that made this problem bigger and more important. One is, data is actually used. Five, 10 years ago, it was not really used. The second thing is the way in which we manage, process, transform, and analyze data has become a lot more complex. In the past you had one Oracle database, you dumped everything in there, and you called it a day. Today you have five different data teams, six different engineering teams who are all working with the data. You have multiple data lakes and data warehouses. You have several ETL pipeline frameworks. You have multiple BI solutions. And data can go wrong anywhere along that chain.

So if you're really thinking about the reliability of your data, you have to include not only where data breaks, but also who's responsible for that data at that point. The problem is just becoming a lot more complicated. How does data actually break? In speaking with hundreds of folks, we basically collected all the different reasons for why data goes wrong, and we classified those into five key pillars. Those are freshness, volume, schema, quality, and lineage.

Very briefly, the first one is freshness. Data going stale, or being wrong because it's just not up to date, is probably one of the top reasons why data is wrong. This could be, for example, if a company is dependent on third-party sources for its data, and that data just isn't arriving on time. The second is the volume of data. It's not uncommon for a job to complete and everything to look fine, but only half of the data was actually sent over.

The third is schema changes. So we talked about different teams working together on different data pipelines. It's not uncommon for some team upstream to make a change in the data without actually understanding the downstream implications of that. Often someone will make a change upstream and that will totally break things downstream. The fourth is quality. So this is at the field level. Maybe the most common example is if folks are using credit card numbers and suddenly there's a letter in the credit card field.

The fifth pillar is lineage. When I speak about lineage, I'm talking about both column-level and table-level lineage and overlaying data health information on top of that. The power of lineage is that it helps us answer the question of if something broke in the data, who's impacted by that? Meaning who are the downstream consumers that are impacted by that data issue? And then vice versa, upstream, what is the root cause?
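
To make a couple of those pillars concrete, here is a minimal sketch of what automated freshness and volume checks might look like against a warehouse table. The table name, column names, thresholds, and the run_query helper are hypothetical placeholders used for illustration; this is not Monte Carlo's implementation.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical expectations for the table being monitored.
FRESHNESS_SLA = timedelta(hours=6)   # data should be no older than 6 hours
MIN_EXPECTED_ROWS = 100_000          # yesterday's load should have at least this many rows


def run_query(sql: str) -> dict:
    """Stand-in for a real warehouse client (Snowflake, BigQuery, etc.).
    Returns canned values here so the sketch runs end to end."""
    if "MAX(updated_at)" in sql:
        return {"last_update": datetime.now(timezone.utc) - timedelta(hours=2)}
    return {"n": 250_000}


def check_freshness(table: str) -> bool:
    # Freshness: has the table been updated recently enough?
    row = run_query(f"SELECT MAX(updated_at) AS last_update FROM {table}")
    age = datetime.now(timezone.utc) - row["last_update"]
    return age <= FRESHNESS_SLA


def check_volume(table: str) -> bool:
    # Volume: did the expected amount of data actually arrive?
    row = run_query(
        f"SELECT COUNT(*) AS n FROM {table} "
        "WHERE loaded_at >= CURRENT_DATE - INTERVAL '1 day'"
    )
    return row["n"] >= MIN_EXPECTED_ROWS


for name, ok in [("freshness", check_freshness("analytics.orders")),
                 ("volume", check_volume("analytics.orders"))]:
    print(f"{name}: {'OK' if ok else 'ALERT'}")
```

Schema, quality, and lineage checks follow the same pattern: compare what the pipeline actually produced against an explicit expectation, and alert when they diverge.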

Dan Blumberg: Monte Carlo built its product around those five pillars and focused the solution on three key categories.

Barr Moses: The first is detection, meaning building alerts and notifications to the right person at the right time based on data issues. It's not only detecting the problem, but also detecting who is the right person that should know about this, and what information they need at the same time in order to determine whether this is a real problem or not. The second is around resolution. Data teams oftentimes spend weeks or months trying to identify the root cause and resolve the problem. Oftentimes that has to do with looking at, for example, query logs or changes in your dbt models. All of those things can give you clues as to the root cause of a particular data issue, and reduce the time to resolution from weeks to hours or minutes. And then finally, prevention. Really strong data teams also actually reduce the number of incidents overall, because they're building pipelines more thoughtfully, in a way that creates less chaos and fewer downstream implications.
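
As a rough illustration of the detection idea, the sketch below routes a failed check to the team that owns the affected table. The ownership mapping, channel names, and notify helper are hypothetical; in practice that context would come from lineage and catalog metadata rather than a hard-coded dictionary.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    table: str
    check: str    # e.g. "freshness", "volume", "schema"
    detail: str


# Hypothetical ownership mapping; real systems derive this from metadata.
TABLE_OWNERS = {
    "analytics.orders": "#data-eng-orders",
    "analytics.revenue_daily": "#analytics-finance",
}


def notify(channel: str, message: str) -> None:
    # Placeholder for a Slack, PagerDuty, or email integration.
    print(f"[{channel}] {message}")


def route_incident(incident: Incident) -> None:
    # Detection is only useful if the alert reaches someone who can act on it.
    owner = TABLE_OWNERS.get(incident.table, "#data-platform")
    notify(owner, f"{incident.check} alert on {incident.table}: {incident.detail}")


route_incident(Incident("analytics.orders", "freshness",
                        "no new rows in 9 hours (SLA is 6 hours)"))
```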

Dan Blumberg: The critical part of Monte Carlo's approach is looking at the stack as one big, living, breathing system.

Barr Moses: Oftentimes, traditional data quality solutions really focus on one part of the stack. But data can break anywhere. So I think a very important part of an observability solution is that it connects to that stack end to end. We have integrations to each of those layers, and we're able to bring all of that data into one place to have a unified view of your data health, so you can understand at any moment if something is breaking, and where. And then I would say data teams are thinking more and more about the concept of data products. That is very new, and I think it's a great best practice that we're taking from engineering. That means looking at data quality at the data product level. So for example, something that we just released is a data product dashboard, which allows you to bring together different data assets that are all contributing to a particular data product and understand the health of those data assets at that level. I think that's pretty revolutionary in how we think about managing data.

Dan Blumberg: Now, Monte Carlo's clients are re-imagining how they do data.

Barr Moses: One of our customers is the airline JetBlue. Maybe you flew recently and were wondering where your suitcase is, and whether it will ever make it on time, and whether you'll be able to make your connecting flight, or whatnot. JetBlue's data team actually manages all that data, both to support their passengers and to run their operations. Before working with Monte Carlo and having their own data observability solution, they had a team with multiple people staring at the data at all times to make sure that the data is accurate. For me personally, that's a terrifying thought; being in their shoes, I'd have had to do the same.

Now, with data observability, they've introduced contracts between teams, with expectations and SLAs, service level agreements, for when a particular data asset should be up to date and what the accepted level of freshness and timeliness is. They're making sure that every single alert, every single data incident, is being addressed, so they're running it with a lot of diligence. It's pretty cool to see that kind of journey for a data team.

Dan Blumberg: Can you share a bit more on Monte Carlo's journey of how you've iterated the product?

Barr Moses: For sure. When we started the company, it was just me and my co-founder, Lior. His background is in engineering and security. When we started, I was pretty clear on what the problem was, who the person was we were going to help, and what the solution would look like. I was like, "Okay, Lior, why don't you go code the solution and let me know when you're done, and we'll go work with customers and everything will be great." And he was like, "Hell no, are you kidding me? I'm not going to build anything without a customer. That's just not how we do that."

Dan Blumberg: Good for him.

Barr Moses: Yeah, totally. And actually, that has carried forward with us as a company today. So we don't build anything in isolation; we don't go into the dungeon, build something, and come back and say, "Ta-da, here's a solution." We build everything hand in hand with customers. But at the start of this journey, you're just two people with an idea. You have to get customers excited enough to work with you. They're really taking a chance on you.

I think for us, we were able to find folks for whom this problem was so visceral that they were willing to give us access to their data and let us build with them. Actually, the first thing that we built was schema changes. I remember being so skeptical. I was like, "Folks can get their schema changes from their data warehouse pretty easily. I'm not sure why they'd want that from us." But I was so wrong. They were like, "I just need that daily report of schema changes. If you could just send it to me in an email. I don't even care what the format is, just email it over." And the feedback from them was, "This is amazing. It's saving my life."

That was so powerful. So I was completely wrong. I give credit to the team and to our customers, but I think every single step of the way we learned from that. So we would build something, I think the first prototype we built in maybe six weeks, and put it in the hands of our customers really quickly, and just try to see if they would use it or not. Another example of something that I was wrong about: I always wished I would have a data health dashboard.

In the very early days, one of the first things that we did was create this overview of data health, and it turned out I was totally wrong, because it's really hard to have a strong data health dashboard without all of the pieces that lead up to it. If you don't have a strong handle on freshness, volume, schema, and quality, that dashboard is useless. Now, I think a data health dashboard can be really helpful and actionable for teams, so we are seeing people use it today, but I would say that kind of dashboard was totally useless four years ago, when I thought it was really cool.

So I think it has a lot to do with timing and the readiness of your customers as well. At the end of the day, it's really our customers who know the answers to everything.

Dan Blumberg: I think by law we have to talk about generative AI at some point in this conversation, so maybe let's do that now. We're consulting with a lot of companies that are trying to figure out their generative AI strategy, and they come to us. But I'm curious, when they come to you, what are you telling them about how they should approach it, and what are the underlying issues they might need to address first before they get to the really sexy generative AI stuff that's been making so much news the past six months or so?

Barr Moses: I think you're right, a lot of data leaders are under pressure right now to deliver on generative AI. I think we're seeing different reactions to that. I would say there are teams who are definitely already experimenting and up and running, and others who, I think, are trying to understand what use case to solve. That latter problem is a lot harder: data teams have to think about how they actually deliver value to the business. In general, the way that I think about generative AI is that whether we make really big advancements in the next 18 months or in the next 18 years, it is making data and data observability even more important, because the number one problem that folks have is... I don't know when the last time was that you asked ChatGPT a question and it was hallucinating or giving you the wrong answer, and you're like, "Ah, this is useless."

So in those instances, that's when customers and folks really think about data reliability and data observability, and how do you make sure that the data you're surfacing to your customers is accurate. We're really excited about this trend. I think it's great, and I think there are a ton of open questions about how people actually turn that into real value. From an industry perspective, I'm seeing two areas where I think we will see the most disruption. The first is in the BI layer. There are tons of startups in this space already, which basically, again, make it easier for folks to ask questions about data in natural language and get answers.

The second area that I think will get disrupted is data engineering productivity. If you look at what Copilot is doing for engineering, I think there's an opportunity to do that for data engineers. So for example, at Monte Carlo, we have something that helps data engineers write SQL queries or fix their SQL queries, which is working quite nicely. So I think that second area is interesting. If I have any advice, first of all, my advice is to not listen to advice. I think that's advice that I got early on-

Dan Blumberg: Run an experiment, I think that's your advice, right?

Barr Moses: Yeah, exactly. I think run an experiment, and then also start with the why, and have a really clear understanding of what's the value that you're going to add, or what's the particular use case that you're going to solve. There are a lot of folks who are tinkering with things, but the ones I'm seeing success with are actually defining, how am I showing value with this, and who's the customer whose problem I'm solving?

Dan Blumberg: Yeah. If you look into the nearish future, is there something that today sounds like science fiction that you think is going to be totally commonplace?

Barr Moses: I think there's a question of whether the entire modern data stack will be disrupted. And I think that's a very interesting question. Meaning, would we transform, aggregate, and process data in a totally different way? I do think that could happen. I don't think it's imminent; maybe that's 10 years from now. Even today, answering pretty basic questions about data is hard, with generative AI or without, for very many reasons. Maybe the data is inaccurate, or maybe you don't have all the data, or maybe you're not sure how to translate it into actionable insights. There are lots of different reasons why it's hard, but I do think that if we do our job well with generative AI, then it'll be a lot easier for us to deliver on the promise of data. I do think that will happen. I'm optimistic about that.

Dan Blumberg: What's next for Monte Carlo?

Barr Moses: Great question. For us, our goals have always been, from day one, working with as many customers as possible and making them extremely happy. Those are our goals for eternity. That will never change. The other thing that we're really focused on at Monte Carlo is creating the category. Data observability didn't exist three to five years ago, but it exists today, which is really exciting. So I'm really looking forward to data observability becoming more and more important. I think it's still in its early days. When I look at observability in engineering, there are huge companies like Datadog, for example, one of the most iconic companies of the decade in engineering. We are really excited for what that means about the potential for what Monte Carlo can be and what data observability will be.

Dan Blumberg: Amazing. Thank you so much for your time. This has been fascinating.

Barr Moses: Yeah, thanks so much for having me, Dan. This was fun.

Dan Blumberg: That's Barr Moses, and this is Crafted from Artium. If you've got an idea for software that solves big problems, let's talk. At Artium, we can help you build great software, recruit high-performing teams, and achieve the culture of craft you need to build great software long after we're gone. You can learn more about us at thisisartium.com and start a conversation by emailing hello@thisisartium.com. If you like today's episode, please subscribe and spread the word, because the data is clear: a regular dose of Crafted does the body good.

Barr Moses: This is amazing, it's saving my life.