Postgres is Eating the World

Introduction

In this episode, Josh Lee and Robert Hodges sit down with Celeste Horgan, OSS Advocate and OSPO at Snowflake, to dissect Andy Pavlo’s latest database retrospective blog post. The conversation explores why Postgres has become the undisputed standard for transactional databases and whether it’s following the same path as Linux.

They dig into the latest Postgres acquisitions, the community’s fierce independence, and why MySQL lost the battle for developer mindshare. Celeste shares her take on the tension between corporate money and community control, plus concerns about maintainers being pulled into closed-source startups.

The second half shifts to Apache Iceberg and why it’s emerging as the API layer for data sharing. Celeste explains why Iceberg isn’t something you just grab off the shelf. It’s a developer experience tool you reach for when you already have a problem. The trio wraps up exploring “data gravity,” the bottled water analogy for open source business models, and why hiring for standards beats hiring for specific vendors.

Episode Highlights

  • [7:31] Postgres’s Fierce Independence from Foundations
  • [12:22] Why MySQL Lost to Postgres
  • [18:52] Brain Drain: Maintainers Moving to Closed Source?
  • [21:40] Multigres: Sharded Postgres for Kubernetes
  • [28:28] Data Has Gravity: Why Services Cluster Around Databases
  • [33:00] Apache Iceberg as an Emerging Standard
  • [36:56] Apache Iceberg Is Developer Experience, Not a Product
  • [43:10] Hire for Standards, Not Vendors

Episode Transcript

[00:00:09.04] – Josh Lee

Hello, hello, hello, and welcome, everybody, to episode 5 of Unevenly Distributed. I am joined by my co-host, Robert Hodges, and today we are joined by Celeste Horgan of Snowflake. I am Josh Lee, and we’re going to be talking about a couple of things today. I think it’s going to be really interesting to start off with Celeste. Would you like to give us a short introduction?

[00:00:34.19] – Celeste Horgan

Hello, my name is Celeste. I recently joined Snowflake on their open source developer advocacy team with a specific focus on the open source programs office. So I’m at Snowflake to help grow both its presence in open source communities that it cares about, so things like Postgres as a part of our recent acquisition of Crunchy Data, but also to grow the internal practice of being open source maintainers, of releasing open source projects, and being excellent participants in the open source space. So really excited to be at Snowflake. Really excited to be on this podcast. Yeah, that’s me.

[00:01:14.02] – Josh Lee

Okay. Awesome. And Robert and I, of course, both work at Altinity. We are providers of hosting and support for ClickHouse, which is another database technology. And so we thought it would be fun to take a look at this blog post from Andy Pavlo about a little database retrospective. And we’ll go ahead and link that in the notes so that everyone can follow along with us. But we’re going to be reading along through this blog post.

[00:01:43.25] – Celeste Horgan

And I guess just to interrupt Josh, but also to introduce. So Andy Pavlo is a professor at Carnegie Mellon University. In his own words, he’s a professor of, quote unquote, databaseology. He spent a lot of time in the startup space in the US doing database-related startups, and he releases a rundown or a year-in-review post of the world of databases on a yearly basis. It’s a great read. He’s got a lot of insight and a lot of knowledge of both the industry, but also the technical trends. And so when Josh asked me to be on the podcast, I was like, I would like to talk about this blog post, please.

[00:02:22.01] – Josh Lee

Nice. Yeah, yeah, yeah.

[00:02:24.17] – Robert Hodges

Isn’t the three-word summary of Andy Pavlo, he’s a God?

[00:02:31.01] – Celeste Horgan

I mean, he… It’s something along those lines.

[00:02:32.26] – Robert Hodges

Yeah. Make it extended to four words. He’s a database God.

[00:02:36.26] – Celeste Horgan

Yeah. Yeah. For whatever that means, and whether or not that’s bragging rights is questionable, quite frankly.

[00:02:44.05] – Robert Hodges

We’ll be investigating that topic.

[00:02:51.28] – Josh Lee

Okay. So do we want to start with his thoughts on Postgres? He spends a lot of time in the article talking about Postgres. Yeah.

[00:03:00.09] – Celeste Horgan

Do you maybe want to summarize the high-level thoughts for the crowd?

[00:03:06.05] – Josh Lee

Yeah, definitely. Let’s definitely do that. Databricks paid a billion dollars for Neon, which is, of course, a Postgres startup. Snowflake, your employer, paid 250 million for Crunchy Data. So we have all of these Postgres acquisitions. There’s some other exciting things happening in Postgres that Andy touches on, especially around horizontal scalability and sharding. Compute-storage separation seems to be a topic that is coming up across various vendor implementations. And then the commercial landscape is shifting, but let’s talk about those first few things first, maybe. Because that’s a lot.

[00:03:51.06] – Celeste Horgan

Yeah, no pressure. And I guess, let me caveat this section, but also, frankly, this entire podcast, with the classic: I am not a spokesperson for Snowflake. All spokesperson-related things for Snowflake should refer to the official press releases on these things. So what I’m going to offer you is a personal opinion of a person who works at Snowflake and who was not working at Snowflake at the time that this acquisition happened. But I will tell you, when I joined, I joined in about September, and I joined at the same time as another developer relations person in the UK, Chris Jenkins. And so naturally, we were talking one day about the Crunchy Data acquisition, and how it seemed that to some members of the Postgres community, that acquisition didn’t seem to make sense from Snowflake’s perspective. But at least from the perspective of both of us who are fairly engaged with developer communities as a whole, it seemed to make a lot of sense because Snowflake is at its core an OLAP data store. It’s intended for analytical use cases. OLTP capability is the thing that always comes up in customer calls. So that’s another obvious thing.

[00:05:11.26] – Celeste Horgan

Postgres is the most popular OLTP database. It was a pretty straightforward… From both of our perspectives of people who had just joined the company, it was very straightforward. It was plugging a gap in developer capabilities. But it’s a part of a much larger trend in the community, isn’t it?

[00:05:34.03] – Josh Lee

Yes. Yeah, it definitely is. There’s this consolidation happening. Yeah, so you mentioned the community. I guess that’s a good time to talk about the shifting in the commercial landscape as well. So we have Microsoft introducing HorizonDB. We have Supabase hiring the creator of Vitess. Yeah, everybody’s got their OLTP hosted offering now, it seems like. All the big clouds.

[00:06:12.12] – Robert Hodges

Plus there’s a whole bunch of new creative things like Supabase which are emerging out of Postgres, right? To me, this feels like a continuation of this wave of new database products based on Postgres, and it goes way back. I mean, the beginning of the Postgres project at Berkeley, it forked off and became the basis for a commercial product, which I believe was Illustra, which I think came out of the Postgres project. It was certainly a Mike Stonebraker project. But very early on people were looking at Postgres and turning it into commercial products. I think you see this drumbeat of products that have been generated out of Postgres because one, it’s a good database. It wasn’t very good in the ’90s because performance is hard, for example. But people just kept working on it. The other thing is it has very permissive licensing. So if you’re reaching for code, it’s the database to go to. I’m just curious, Celeste, how you think, for example, issues like licensing have played into this enduring attraction to Postgres.

[00:07:31.01] – Celeste Horgan

Yeah, I mean, I think an interesting thing, I’m a bit more involved in the Postgres project these days than I have been in past years, I suppose. I think the interesting thing about Postgres as a project, to me, as somebody who thinks about open source really broadly, is how independent, fiercely independent that community is. So unlike a lot of open source projects which have coalesced around open source foundations like the Eclipse Foundation, like the Apache Software Foundation, like the Linux Foundation and its offshoots, Postgres has, again, remained very fiercely independent. They have spun up their own community organizations, their own nonprofits. They work closely with the SPI, but I believe that relationship actually just manages the finance and the IP of it, and it ends there. And a part of that is the Postgres license is their own license. It is a license that they were heavily involved in the drafting of. So I think that the licensing is a part of it, and the fact that they are so fiercely… And that’s actually a part of what’s interesting around this acquisition and startup-heavy space in the Postgres world, is if you really look at trying to interact with that community as a large organization, for example, Snowflake or Crunchy Data, that looks to donate money into the community as a means of participating.

[00:08:55.26] – Celeste Horgan

There are very limited ways of doing that that the Postgres project actually sanctions, and there’s very little money that they’ll accept in a way that is sanctioned by their community bylaws. So I think the licensing is a part of it, but I think actually maybe a bigger part of it is how much the extension framework in Postgres is a very first-class citizen. And it always has been. Postgres was always designed to be a very extensible system. And so I think it’s very, very easy to write an extension that you can then license in a bit more of a source-available or business source license way and build a business around it while core Postgres remains open.

[00:09:40.08] – Robert Hodges

I think you just described the business model of databases like Timescale.

[00:09:44.28] – Celeste Horgan

Yeah, 100%.

[00:09:46.13] – Robert Hodges

Which is put an extension in. You’ve got vanilla Postgres. So, hey, if you like triggers, they’re there. All the cool Postgres features. And then you’ve got this plugin which allows you to do time series data. And yeah, there seems to be infinite appetite for specialized versions of databases, really. Right. Right. And what do you think about… Do you think there’s a comparison between Postgres and Linux, for example? Because Linux has become pervasive. It’s a standard. If you’re doing backend server computing, it is a standard. Of course, it’s a standard on phones as well. But when you have these standards, you get these strong network effects. And in the end, it’s the Highlander principle. There can only be one. And it feels like Postgres has taken that position in the database community for transactional databases. It’s a bit like Linux. Do you see what I’m getting at there?

[00:10:49.04] – Celeste Horgan

I see what you’re getting at there. I think that’s a pretty reasonable comparison, in my opinion. I do think, and this is not shade on either the Linux community or the Postgres community, I do think that there’s an aspect of you use the tool that’s lying around. And I say this particularly thinking about Linux and how deeply intertwined Linux has become with the containers and Kubernetes space, where, again, it’s the de facto thing that you deploy into a container if you need an operating system of some sort. But also, when you think about it, it was the only operating system that people could use freely and without incredible licensing fees, because at the time, Microsoft was charging, again, astronomical fees for a Windows license. And I think Apple was still charging for OS X at that time. And also OS X has never been broadly deployable on anything but Apple hardware. And the entire idea of a container is to deploy something repeatably in a programmatic fashion. So you cannot be bound by licenses or all of a sudden the entire model breaks. And I think a similar thing… Maybe it’s a bit different with Postgres because I think there’s more options for open source databases.

[00:12:11.05] – Robert Hodges

Yeah, I think, though, but don’t you feel like… I wanted to raise that point because I think that if we’d been talking 10 years ago, we would have said, what are your options for an off-the-shelf, just grab it and go transactional database? It would be Postgres, but it would also be MySQL. And MySQL has really faded. And I think it took a long, long time. But I think the nail in the coffin is that Oracle has finally gotten serious about turning it into a cloud database, like with HeatWave. And they’re just not interested in the open source anymore. They don’t see. I think they don’t see it as a threat.

[00:12:48.22] – Celeste Horgan

I mean, I think the other half of it, too, is that Postgres put at least a little bit of development into things like PgBouncer and things like just multiple concurrency that enables being deployed in a cloud environment. And I don’t see… I’m not as in tune with the MySQL community and what it’s working on. But again, I feel like Postgres at least has the beginnings of being able to do things like horizontal scaling or things like deploying it into a cluster format. And I feel like that was actually the nail in the coffin. I think it was more of a cloud computing paradigm. Yeah. Oh, go ahead.

[00:13:30.20] – Josh Lee

These things have to have enough basic functionality. To become a de facto standard, the licensing needs to be there, but it needs to also have enough basic functionality. The licensing tells you that it won’t go away, it won’t be taken away from you. But it needs to have enough basic functionality that these flavors have a common enough base to be building on. Andy mentioned in his article, two features that recently came to Postgres that are old news in the database world. I think one of them is skip indexes, and I forget what the other one was. I don’t have it at that part of the article.

[00:14:04.13] – Celeste Horgan

But these things- self joins?

[00:14:08.21] – Robert Hodges

Yeah, yeah, yeah.

[00:14:10.03] – Celeste Horgan

Yeah, yeah.

[00:14:11.07] – Robert Hodges

Yeah, yeah. Yeah, dropping the dependence on the page cache, which makes me sad because we work on ClickHouse. ClickHouse uses it.

[00:14:20.18] – Josh Lee

I did have that actually when I was reading that sentence. I was like, Oh, but we love the page cache.

[00:14:27.10] – Robert Hodges

Andy is not going to be happy. Getting back to this MySQL issue, I think we can put it to bed, but I worked with VMware for four years. During that period, it was from 2014 to 2018, and VMware shifted big time to Postgres during that time. Not that they really ever had a deep footprint with MySQL to begin with, but the thing that drove it was licensing for them because they were shipping it. They were shipping it in appliances. To your point about containers, that if you’re talking about shipping things, making binaries that you’re sending places, the Postgres licensing was what did it. I think the people that ran VMware didn’t care less what Postgres did. But there was a beneficial effect that VMware hired Postgres developers, and they contributed to the community. There were people at VMware who essentially did nothing but work on Postgres, and they helped sustain that community.

[00:15:34.25] – Celeste Horgan

Yeah, and that’s the interesting… I actually brought this up in a talk at PGConf.EU earlier this year, because the character of the Postgres community, again, is very fiercely independent in a lot of ways. There is a sense that the work that they’re doing is so community-driven and that encroachment by companies is treated with a bit of trepidation. And this is a personal read. I don’t think the Postgres community would necessarily agree with that statement. But I was also coming from my background in open source, which is that I participated in the Kubernetes project. I worked directly for the Linux Foundation, which is a very, very different sphere of how open source plays out. So to me, the Postgres project is so fiercely community-driven and so protective of that. But at the same time, there’s a bunch of acquisitions happening for a lot of money, tangential to the space. All of the major cloud providers now have a Postgres-shaped thing that they deploy for their cloud customers. I would actually say most of the major cloud provider versions are closed source. So AlloyDB is closed source. I believe HorizonDB is closed source. And AWS has various Postgres-shaped flavors of things that, again, are deployed directly to AWS and claim some level of compatibility with core Postgres, but don’t.

[00:17:07.18] – Celeste Horgan

It’s not really Postgres, like Aurora in specific. And so I did bring this up just because I feel like that community needs to think about it a little bit more, which is there’s a lot of money in this right now. There’s a lot of money, and there’s a lot of companies, including my own, that really, really want a stake in this community. And I think that’s at least partially driven by things like AI. What is AI, but a means of consuming and translating data? It makes perfect sense that there’s a bit of a gold rush in this space right now, because Postgres is the database that everybody uses at some point for something. But I think I worry for the community that there’s a rug pull that’s about to happen without them necessarily noticing it, and that there’s a lot of money flowing into companies in this space that maybe don’t have a relationship with the open source community.

[00:18:07.17] – Robert Hodges

Or do you think there’s an issue with people like Heikki Linnakangas, who’s off at… I think he’s at Supabase now. I can’t remember which one he’s at, but these are people who’ve been long-standing members of the community who are then getting pulled out to go help build these startups. Do you think there’s a… Because I think there is a bit of a, there is definitely a bit of a conflict because these, as you say, these are closed source implementations, and it’s pulling community members out. So the question for me is, well, is that going to strip the community of the people who have maintained that independence and kept this consistency over decades?

[00:18:52.20] – Celeste Horgan

Yeah, and I don’t necessarily know what the answer is in the world of, I guess, modern open source in general, because I think you can argue the same thing is happening in a lot of different communities. Postgres, again, is a bit unique, one, because it’s such a long-lived community in comparison to, again, something like Kubernetes. It’s been going for decades at this point. It doesn’t seem to have stopped the community thus far. Knock on wood. But if you do go to a PGConf or a PGDay, you will see Microsoft has booths there. I get sent there on behalf of my company, as do other people from Crunchy Data. And just to be clear, I came in as Snowflake, not Crunchy Data. That is a distinction that some people think about. So you do definitely see people coming back into the community. And as a part of my role at Snowflake, I get to do things like work on PGDays and attend stuff. But does the feature development then live in a closed source bubble, and do some of those innovations around, for example, horizontal scaling, for example, Neon, do those ever make it back out to the broader community, right?

[00:20:23.14] – Robert Hodges

Yeah, I think, and the one that I’ve looked at a little bit is Supabase, which I think is a really interesting… That’s an incredibly interesting product because what they have done is made it just drop-dead simple for devs to pick up Postgres. Yeah, it runs in their cloud, so you can’t run it on your desktop. I see a lot of the innovation that they’re doing is around usability of Postgres, but the core, they’re basically building on core Postgres. My guess, I have no insight into the internals and how they’re approaching it. My guess is that they’ll make changes to enable their stuff to work, but I think their motivation would be to push those back into upstream. I think one of the things that will hold people together or hold the community together is that if you want to have migration, drop-in migration, it’s in everybody’s interest to push their changes back to upstream.

[00:21:17.20] – Celeste Horgan

It is. However, and Josh, this brings up a point from the article, just to loop it back, and I’ll introduce it for you, which is Supabase recently announced that they hired one of the creators of Vitess, which is sharded Postgres, done specifically for Kubernetes deployments. I believe it’s a CNCF project. But they hired him specifically to lead Multigres, which is intended to be sharded Postgres, specifically deployed as a Kubernetes cluster. So that’s a separate project. And I would assume that Sugu, having donated Vitess to the CNCF once, would potentially do the same with Multigres, as opposed to having that live under the Postgres umbrella.

[00:22:04.22] – Robert Hodges

Right. Or it would just be a… I wonder if it could also… Do you think it could end up as a proprietary fork, like what Aurora has done?

[00:22:15.18] – Celeste Horgan

That’s a great question. So I can tell you, I just pulled up the website in another tab right now, and Josh, poor long-suffering Josh, will probably provide the link at some point to the people following along. The copyright footer on the website for Multigres is copyright 2025 Supabase, licensed under Apache 2. Apache 2 is probably the most business-friendly and permissive of the open source licenses available. It’s the one we always use at the Linux Foundation. It’s the one that I recommend people use at Snowflake unless they have some compelling reason to use a different license. And then I guess we can… But just looking a little bit through this repository, I don’t… I get the sense that they’re intending to keep this open source.

[00:23:19.04] – Robert Hodges

That would be consistent with Sugu’s background, as you say. Yeah. Yeah, and I know him from when he was at Vitess, and yeah, he’s pretty, I think he’s a believer. Yeah. As are a lot of us. I mean, that’s our whole shtick is there’s a bunch of stuff that we did at our company because we happen to believe in open source, and we tend to make decisions which don’t have too much to do with economics, because there’s a lot of things which it doesn’t matter either way. So we just follow our ideology, if you will, about keeping it open.

[00:23:58.28] – Celeste Horgan

Yeah. I mean, I’m also a pretty big believer, all said and done. I’m pretty much as close to a true believer as you get in some ways. But I do think it comes into some of the conversation we were having earlier around some of these maintainers getting pulled away from the project to work on things that are closed source. But there’s a lot of difficulties, and I think you can probably speak to this, in running open source startups and having a whole viable business model around that. And so there’s this part of me that goes like, well, I don’t know. It’s not great for the product or for the project if some of these maintainers get pulled away to develop open source things, or closed source things, sorry. But at the same time, running an open source-focused startup is not easy from an economic perspective. So I can understand why some of these things end up closed source to begin with.

[00:25:02.28] – Robert Hodges

I mean, the fundamental problem is it’s just hard to get people to pay for stuff that’s free. So the classic analogy is selling bottled water, which actually turns out to be just to the point of, Hey, you can make money on open source. Look at the size of the bottled water industry in the United States. It’s huge. It’s tens of billions of dollars. But at the same time, most people, where do they get their water? They get it out of the tap. I mean, 96% of the water that people drink comes out of the tap.

[00:25:35.24] – Celeste Horgan

But there’s a use case for bottled water. Oh, absolutely. When you’re at an airport, for example, or when you’re on the airplane, and there’s a real need for strict control of exactly what is in the bottle for other security purposes. And that’s where I think that, A, there’s always space for open source startups because somebody will need the packaged version of it somewhere. And I think there’s quite a few startups in the space of open source security and providing secure images of open source software. I think that’s a fantastic use case.

[00:26:13.23] – Robert Hodges

Yeah, it seems like from what we’re seeing in things like Crunchy Data with Crunchy Bridge, Supabase, these are all products that run in the cloud. The thing that is really the foundation of those startups is that they’re just making it drop-dead simple to use Postgres in a way that I think is completely different. If you go back 20 years, the thing that used to drive me nuts about Postgres is I couldn’t find a build. Whereas with MySQL, what they did very early on was, this was pre-cloud, what they did was they got MySQL plugged into Linux. In virtually every Linux distribution, if you sudo to root and then type mysql, there’s a really good chance you’re talking to a MySQL database. That’s, I think, a big reason, one of three big reasons why MySQL ended up just being embedded in all these properties. Like, name a… Yahoo, for example, one of the big early adopters, Facebook, a huge proportion of their business data processing was on MySQL. I think that’s still the case for their main web properties. Anyway, yeah, the cloud has actually been very good to Postgres because it’s created this opportunity to add value on this base software.

[00:27:42.23] – Robert Hodges

You’ve highlighted Kubernetes, and I think you’re probably seeing… I mean, when people are reaching for a database in Kubernetes, what is it that’s making it easy for them to use Postgres in that environment?

[00:27:59.04] – Celeste Horgan

You know, so I’m going to use a quote from Liz Fong-Jones of much fame in general, but currently at Honeycomb, which is an observability company. She and I were talking on Bluesky a little while back, and she said something in response… I don’t even remember what the thread was about, but this quote stuck out, which is data has gravity. The thing is, as soon as you deploy a database, other services tend to cluster around it because latency becomes an issue, because ingress/egress becomes an issue, because managing the size becomes an issue. As soon as you deploy a database or a data store somewhere, everything starts to clump around that. So when we talk about what makes it easy for cloud deployments to use Postgres? I think there’s two halves to that question. I think it’s one, what is easily, readily available from the cloud providers, which is why they all offer Postgres, but they all offer a customized version of it as well. I think it’s number two, if they’ve got data in a cloud anyways, they’re going to want to attach a database like Postgres for read activities in general, which means it’s going to just be co-located on that cloud.

[00:29:23.22] – Celeste Horgan

And three, I think it comes back to the network effects of becoming a standard. Postgres regularly ranks as the preferred database of 50% or more of all developers. And I think that it’s one of those things where unless you have a compelling reason… And that’s not to say that things like ClickHouse or even stuff like that are not beneficial. In a lot of cases, if you don’t really know what database you need, Postgres is probably going to do it just fine. Yeah, I think Postgres has completely won that.

[00:30:04.08] – Robert Hodges

And it makes me sad because I love MySQL. I worked with it for over a decade. But yeah, it’s just the default. It’s just the go-to. You got a go-to. When you’re doing a startup, you got a lot of stuff to think about, but just grab Postgres. It worked for a lot of other people. It’s going to work for you. And it might be the thing you already know anyway. So that’s the one to grab. What do you think about…

[00:30:26.27] – Josh Lee

Oh, go ahead. I think I worry that’s what we’re losing, though, right? Maybe with this fracturing of the ecosystem, the idea that you don’t have to think about it, that you don’t really have a choice to make until you need a specific feature, is going away now that it’s the de facto standard. Maybe Postgres was Linux, Postgres dialect is POSIX. But now we have all these various implementations and flavors, just like we do with Linux. And the more fractured it becomes, the less compatible and portable everything becomes, even if it’s happening in open source.

[00:31:03.10] – Robert Hodges

Yeah, I think one of the things that actually, speaking of standards, one of the things that Postgres did, which got them a lot of… There’s two ways to approach databases. One is to… SQL databases. One is to adhere to the SQL standard. The other one is to be loosey-goosey and make developers’ lives easier. Postgres took the standards approach. It was not a good choice for a very, very long time.

[00:31:26.25] – Celeste Horgan

Until it was.

[00:31:28.02] – Robert Hodges

Until it was. And then at some point, if it’s being used by a bunch of people, and it’s highly performant, and it’s got all the features that people need, that’s how you make a standard. I think that has played to their benefit. Whereas MySQL, if you use, or one of the things I love about MySQL is it has a very easy-to-use dialect, and it’s just scripty. There’s a family of databases that do that. Sybase was another one that was like this that turned into Microsoft SQL Server. Postgres has really standardized on SQL, and I think that’s one of the things that’s definitely driving use, because then it means there’s conformance across all these. As people spin off and do these things, they all have strong SQL conformance. They have window functions, for example. That’s a really key feature for a lot of database applications.

[00:32:25.10] – Celeste Horgan

Yeah. I mean, and this is, again, me taking the steering wheel from Josh on this, so I apologize, Josh. That’s why we’re here. But one of the questions that we had in our show notes that we wanted to get to was around Iceberg, because my thoughts and feel… You’re involved in the Iceberg space. We’re involved in the Iceberg space. And one of my thoughts and feelings on Iceberg is that it’s also becoming a standard way of interacting with data layers, especially as data becomes less about the database and more about S3 buckets and handling streaming data and trying to consolidate how somebody accesses all of those. And I’m not really sure if I have a question here. I just wanted to pivot the conversation. I have a question.

[00:33:16.02] – Josh Lee

No, I love the segue. Thank you. I was about to say we need to talk about this other standard for a while. It’s a very interesting topic, but we do have more to discuss. So Iceberg is a standard that we all love. I feel like it’s a little bit inverted from how Postgres felt. Postgres was a standard implementation that was open, that became the de facto standard. It was standard conforming the whole time, but the implementation was really what drove the adoption, I think, that core implementation. Iceberg, there is no standard implementation. There are many flavors of Iceberg, and that’s our starting point, which to me feels somewhat unique.

[00:33:55.05] – Robert Hodges

Well, Josh, I’m not sure that that’s quite how it worked out because what Iceberg really… It evolved from an application problem at Netflix, which was that they needed to manage very large amounts of data, and Hive, which was the table standard that emerged out of Hadoop, wasn’t doing it. And in fact, if you want to get stuff to work in Iceberg, you’ve got to work in Java, because that happens to be what Netflix did. And I know that when we’re looking at, hey, how does the Iceberg standard work for a specific thing, I bring a client up in a debugger and I go look through to see what the Java does.

[00:34:37.17] – Josh Lee

So the implementation is still the spec.

[00:34:39.07] – Robert Hodges

The implementation is a spec. So I think that was… There is a spec, and I’ve heard other people say that it’s a really good one compared to, I guess, which means that the things they’re comparing it to must be really gnarly. But it’s been a… It’s an interesting evolution because it emerged to solve a specific problem. And it feels to me, I don’t know what you think about this, that Iceberg, to me, is more like a core technology. It’s not really a product in and of itself, not in the sense that Postgres is, where you can just grab it and build an application. With Iceberg, there’s a lot more work to actually turn it into something useful.

[00:35:19.16] – Celeste Horgan

Yeah. So the question, though, I have in shownotes, and Robert just answered. Oh, my gosh, I’m going to call you Robert all day. Robert just answered it, which is why is there no big marquee Iceberg startup, as opposed to, for example, I think, with Apache Kafka, Confluent is the name that always comes up. And I think I agree with you. And I didn’t know much about Iceberg before I started working at Snowflake, so I had to on-ramp really, really quickly. But I think it’s finally started to click for me at some point that Iceberg is not a thing you implement unless you already have a problem. Does that make sense? It’s a tool to solve a problem. A database is the thing you implement because you’re like, I need a database. I need to record stuff. Iceberg is the thing that you start implementing, or you start using, because you’re trying to talk to multiple different data sources, not all of which are in an easily readable format, for example, files in an S3 bucket. But you need to be able to talk to them in a consistent fashion. It’s a developer experience tool, is the thing, right?

[00:36:37.29] – Celeste Horgan

It’s designed to make this developer’s experience of interacting with disparate data sources somewhat more consistent. And it leans on the idea of tables and SQL. And transactions. Yeah, and transactions to help with that understanding. But when you really dig into it, it’s not actually… It’s shuffling metadata files around. Yeah.

[00:36:59.17] – Robert Hodges

And in fact, if you’re solving the problem that it was originally designed for, which is batch query, or as it turns out, this is also well suited for AI where you’re doing training, it works pretty well for that. If you try to use it as a real-time store, not so much.

[00:37:16.10] – Celeste Horgan

And we have some… It’s not storing anything. It relies on other things for its concepts of storage, right? It’s there to index.

[00:37:26.05] – Robert Hodges

Yeah, it provides the metadata for this, but it also is designed, I think, to your point, to solve specific problems. If you have that problem, which is, say, batch query on very large data sets, it’s great for that. But if you want to stuff data into it really quickly, it doesn’t solve that problem yet. Although I think that people like Ryan Blue are very, very aware that that’s where it needs to go, that you need to be able to do little updates and you need to be able to do them really fast. Where I see the products for this is things like Amazon S3 Tables, table buckets. And talk about developer experience, it’s just like, hey, make this thing just pop up and work. And Cloudflare has this same idea. I think other people, to be frank, we have that at our company. It’s like we’re going to use this as storage, and we’re just going to make it so you don’t really have to think about how it’s being managed.

[00:38:23.11] – Celeste Horgan

Yeah. And I mean, we had a bit of a pre-discussion around PG Lake, which was recently open-sourced by Snowflake. And I think that very much exists in a similar vein of the Snowflake customer is fundamentally… The thing that makes Snowflake useful is if you have as much data as possible with us and you are able to query against all that data. So from Snowflake’s perspective, doing something like open sourcing PG Lake just allows people to get more data into Snowflake. So it’s not necessarily… Sorry.

[00:38:53.08] – Robert Hodges

Oh, no, please go ahead.

[00:38:54.04] – Celeste Horgan

I thought it- I was going to say it’s not really… I don’t think it’s necessarily about building a particular level of Postgres compatibility. I think it’s very much about the story of, again, Snowflake is most useful when you give us as much data as humanly possible, and this is another means of doing it.

[00:39:10.25] – Robert Hodges

Yeah. What do you think about the idea of Iceberg just being a way of sharing data between these different databases? Is that something that resonates for you?

[00:39:22.25] – Celeste Horgan

The only way that I personally understand Iceberg, as somebody who’s fairly new to this technology, so, full disclosure here. Again, I understand Iceberg as a developer experience tool, primarily. But it’s not necessarily the most effective way of, say, traversing an S3 bucket. I don’t know that it’s necessarily the most effective way of being able to query different data sources with a consistent format. What I do think is that it’s open source, obviously, which means it has a certain level of virality built into it. And that’s the other thing. When any company open sources anything or when any company, a large company, supports an open source project, it’s because aligning themselves with something that will become viral and therefore become standard is in the company’s best interest. And I think Iceberg has definitely caught on with the community. It’s like, if you go to PyCon, I want to say it’s like every second or third talk featured Iceberg, or somebody was wanting Iceberg, or they were implementing Iceberg. And I think, again, that’s a developer experience, primarily.

[00:40:35.05] – Robert Hodges

It’s a standard. And it’s a standard, right? That people who want that experience can just adopt it and there’s libraries for it, and hey, if you’re Python, in fact, actually after Java, I highlighted Java, but the second best implementation is Python. And it’s pretty good. There’s not much stuff missing there. Yeah, so I think that… Because I think, I mean, one of the things that intrigues me is that if Postgres can put data into Iceberg, then analytic databases like ClickHouse, like StarRocks, any you care to mention, they can read that data. It used to be in the old days, how we would move that data is we’d have silos and we’d set up replication. Replication, you can make your product as perfect as possible. It’s still a pain in the butt to do that movement, particularly real-time movement across silos. I feel like there’s a really bright future for Iceberg to solve that problem. We just dump it all into the center and all look at it together, and then everybody.

[00:41:42.07] – Celeste Horgan

I think the thing that I would say is, similar to the data has gravity comment, I think that silos will never go away. I just think that that’s the nature of storing data. I think that that’s the… If we haven’t solved how to talk to multiple databases all at once forty years in, I don’t know that we’re going to solve it today. And there will always be different parts of any business that have different needs and therefore have different data storage and retrieval needs. And that’s fundamentally what causes silos. But I think that Iceberg has a really interesting opportunity, again, to become the API layer for all of those things. And I think it’s getting enough community support that it’s moving in that direction. So I’m quite bullish on it, personally. Yeah.

[00:42:36.25] – Robert Hodges

It feels a bit like Snowflake… Excuse me, not Snowflake. Iceberg is going to turn into an API layer the same way that Kafka has effectively become a standard for streaming. And you can look at other things. I think those are two big ways of moving data around now and sharing data across enterprises that just feel like they’ve won.

[00:42:57.22] – Celeste Horgan

Yeah. And I mean, there’s another aspect that I always tend to think about as well, which is hiring teams of people to do this stuff. And I think this is definitely influenced by the time that I spent with the Kubernetes community, where you can hire AWS engineers, you can hire Azure engineers, but it’s far more effective to write a job posting for a Kubernetes engineer.

[00:43:24.28] – Robert Hodges

It’s a standard.

[00:43:26.25] – Celeste Horgan

It’s a standard. And the standard has a life outside of the code base. And fundamentally, if you’re building a data pipeline, you need to hire data engineers. And you could say, yes, we’re going to hire a StarRocks engineer. Yes, we’re going to hire a ClickHouse engineer. And somebody might need specific experience in particular data stores. But if they know how to deal with Iceberg, that makes it a lot easier. You can teach them the rest, right?

[00:43:59.10] – Robert Hodges

Right. So what you’re describing is a vision of these interfaces that are used to build enterprise systems: there’s Kafka, there’s Iceberg taking its place in that pantheon, there’s Kubernetes. These are things that you can just expect people to know.

[00:44:18.10] – Celeste Horgan

A little bit.

[00:44:19.28] – Robert Hodges

I think you’re absolutely right on this. I think that’s a key reason why Snowflake, your current employer, has become huge. It’s just like you pick Snowflake, you know it’s going to solve the problem. And same with Postgres.

[00:44:35.10] – Celeste Horgan

But I think that even Snowflake recognizes the value of Iceberg, which is why they’ve invested pretty heavily in it. Yeah, absolutely. Because I think even Snowflake, again, and I’m not a spokesperson for the company. But I think that Snowflake is a pretty smart company, and I think that they recognize that the open standard is the one that’s going to win.

[00:44:56.01] – Robert Hodges

I think it’s going to be really interesting to see where it goes. I’m really glad to see how big the community is around Iceberg, and how many people have made commitments to it, which means they also have skin in the game to contribute to it and make it better. And I think that’s a huge… I think when you look at competitors to Snowflake, I keep saying Snowflake, competitors to Iceberg, like DuckLake, for example, I just don’t know if they’re going to take off, because they don’t have a big enough community around them. It’s not a question of the quality of the technology, it’s just how many people are bought into it.

[00:45:35.10] – Celeste Horgan

And it feels like- And I mean, this loops back around to the start of this conversation, which is, yes, I agree with that statement, but at the same time, again, there’s always an appetite, especially for specialized data stores, because people are storing all sorts of data and doing all sorts of strange stuff with it, right?

[00:45:54.23] – Robert Hodges

Yeah. Well, I think that’s what the DuckDB folks would say. That’s why people realize that for the use cases we’re targeting, you’re just better off using this than Iceberg, for example.

[00:46:07.28] – Celeste Horgan

Robert, you had to bring up DuckDB. That’s one of the things I wanted to talk about. But we’ve been going for a while, and we actually only made it halfway through our notes.

[00:46:16.19] – Robert Hodges

Oh, my God.

[00:46:17.02] – Josh Lee

I think we have to have you back on. I think that’s the one.

[00:46:19.22] – Celeste Horgan

I would be happy to do that.

[00:46:21.20] – Robert Hodges

This is like software developers.

[00:46:24.20] – Celeste Horgan

This is my favorite conversation to have. So when Josh… I’ll let you wrap up after this, Josh. Please, please. When Josh introduced this to me, he was like, Oh, yeah, do you want to talk about Snowflake also? And I was like, No, I want to talk about what’s happening out there and potentially spit some opinions. This is so much fun. This is the funnest conversation. Yeah.

[00:46:50.06] – Robert Hodges

One of the things that would be really cool to riff on is that point that you keep making of, hey, there are standards, and then there are specific cases where you want something different. I think just being able to explore a little bit more deeply, what’s the dynamic there? You have the big standard, but you also have… I think there’s a lot to explore there. That’s what keeps the technology fresh and also leads to new standards.

[00:47:16.13] – Celeste Horgan

Yeah. I mean, you know where to find me, Josh. I would love to be back on. I’ll let you wrap it up.

[00:47:22.26] – Josh Lee

Yeah. No, you know you’re an open source nerd when standards excite you, but I think we all here identify that way.

[00:47:30.23] – Robert Hodges

Yeah.

[00:47:32.18] – Josh Lee

No, it’s been great having you, Celeste. I would just quickly plug. I think we’re all going to be at FOSDEM coming up, correct?

[00:47:41.05] – Celeste Horgan

We are already at FOSDEM. Snowflake is actually hosting an Iceberg meetup at FOSDEM, so look out for that. I don’t know if it’s officially on the fringe events, but I know that it’s happening. So Snowflake has its developer event, BUILD in London. Go to BUILD. It’s in London. I know that we’re having a meetup the day before on the second in London, specifically to talk about open source stuff. So if you’re in London, I’ll also post a link to that when this gets posted on LinkedIn. So, yeah. Yeah.

[00:48:14.04] – Josh Lee

We’ll be hosting a meetup in Brussels, so I can’t make it to your meetup.

[00:48:18.06] – Robert Hodges

We’ll see you. We’ll see you at FOSDEM. That’s great news that there’s an Iceberg meetup. I think FOSDEM is turning into a bit of a database place with all these fringe events. I hope it’s on a date we can attend.

[00:48:30.06] – Celeste Horgan

I know. It would be lovely to see you all at FOSDEM. Cool.

[00:48:35.06] – Josh Lee

All right. Celeste, thank you so much.

[00:48:37.08] – Celeste Horgan

No problem.

[00:48:38.10] – Robert Hodges

Thanks, Celeste.

Listen to the full conversation on the Unevenly Distributed podcast, available on Spotify, Apple Podcasts, and YouTube. Connect with Celeste on LinkedIn at /in/celeste-horgan/. To learn more about BUILD London, visit Snowflake. For more insights on ClickHouse and real-time data architecture, visit our blog.