The Primitives of a Modern Data Platform

Introduction

In this episode, Josh and Robert are joined by Hazel Weakly, Fellow of the Nivenly Foundation, Director of the Haskell Foundation, and infrastructure engineer known for her work on the Hachyderm Mastodon instance. Hazel brings a rare blend of deep technical knowledge and systems thinking to questions that most engineering teams sidestep: Who are your users, really? What does governance look like when you build it into the platform rather than bolting it on afterward? And what will it take to make next-generation data platforms work in regulated environments?

The conversation covers the organizational failure modes that sink large data platform projects, how modern analytic databases like ClickHouse® change the economics of compliance, the tension between enabling innovation and enforcing control, and a deep dive into authorization, authentication, and attestation, including Google’s Zanzibar paper and the open-source SpiceDB project it inspired.

The episode also touches on AI agents, shadow IT, and what Hazel would standardize with a magic wand if she could.

Episode Highlights

  • [00:01:32] Why data platforms are utility layers, not end products
  • [00:03:48] The war story: a $10–$30M, 50-person, 18-month project that got canned because the real problem was social, not technical
  • [00:07:27] Success story: guerrilla open source at VMware to find real users
  • [00:10:11] Why complex data platforms resist the checkbox cultures of regulated environments
  • [00:13:49] Using new platforms to lower the automation threshold for compliance reporting
  • [00:17:06] How modern data platforms enable correlation at scale
  • [00:21:00] ClickHouse’s ability to infer structure from raw files and why that matters
  • [00:29:23] Enabling versus regulating: the two camps in every data platform conversation, and how to serve both
  • [00:34:37] Why decades of internet common sense don’t transfer to AI
  • [00:41:36] Authentication, authorization, and attestation: why OAuth/OIDC only solves one
  • [00:49:19] Google’s Zanzibar paper, SpiceDB, and building policy engines on Lakehouse architecture
  • [00:57:06] Magic wand picks: transaction tokens everywhere, and heterogeneous access control with Iceberg metadata

Episode Transcript

Josh Lee  [00:00:10]

Okay. Hello, everybody, and welcome to our sixth episode of Unevenly Distributed. I am Josh Lee, open source advocate at Altinity. I am joined by my co-host, Robert Hodges.

Robert Hodges  [00:00:23]

Hi, everybody.

Josh Lee  [00:00:26]

CEO of Altinity. My boss. Today, we are joined by Hazel Weakly, Fellow at the Nivenly Foundation. Hazel, I always enjoy running into you at conferences and getting to hear your latest thoughts on things. It always leaves my head spinning a little bit, but full of awesome ideas. I’m really excited to have you on. Hazel, would you like to say hi and introduce yourself?

Hazel Weakly  [00:00:57]

So my name is Hazel Weakly. I do have thoughts, lots of thoughts. They never stop. Not only am I a Fellow of the Nivenly Foundation, I’m also on the board of the Haskell Foundation. I spend my time trying to help open source communities figure out how to grow sustainably. Part of that is also figuring out really complicated questions about technology and how to make them explainable to everyone.

Josh Lee  [00:01:23]

And we’re going to try to do a little bit of that today.

Hazel Weakly  [00:01:28]

We need to get there, though. We’re not there the whole time. That’s what we’re doing.

Josh Lee  [00:01:32]

It’s morning for us, evening for you. I’m on coffee. Robert and I work at a data platform company. We have a tool to sell. I think a lot of us in the space, when we have a tool to sell, get caught up in the tool being the end goal. The tool itself becomes the thing we talk about and get excited about. But there’s not actually any value there. The actual value is in how you use the tool and what you can achieve with it. I think that’s what we wanted to talk about today. You and I have talked about the data lakehouse as a utility layer rather than the final product. What does that look like to you?

Hazel Weakly  [00:02:30]

For me, stepping back slightly, it’s also important to know that it’s not just the vendors and not just the companies that built things that lose sight of them being part of a whole. Everyone does it. Every single platform team, internal team, utility team, it doesn’t matter how integrated they’re supposed to be. That’s their whole world. Of course, for them, they’re the whole story.

For me, the challenge is: how do I teach people to consider themselves as their whole world and lean into that? To get really good at what they do, but then also simultaneously go through the ego death that’s required to actually build something that’s greater than just you, that interconnects with everything else, and that actually ties together.

With data platforms in particular, the thing that really gets me is: do you know who your users are? Have you actually talked to them? Do you know what their needs are? And I don’t mean in the “oh, I know what their needs are because we thought about it really hard” way. Have you sat there watching them break their keyboard at 3:00 in the morning because your product doesn’t work for them? Have you actually watched it fail? Because everybody’s product fails at some point. Everything breaks down at the margins. Nothing’s perfect, and it can’t be, but that’s where the opportunity is to figure out how to make your platform usable.

Robert Hodges  [00:04:42]

Do you have any war stories about where people totally got it wrong? Sometimes what happens when engineers start building without really talking to their users, particularly around data, it’s embarrassing, but it can also be comical.

Hazel Weakly  [00:05:02]

It’s not even just engineers. At a previous company, we were trying to figure out how to do next-generation data center design, involving very, very large problems. The answer was basically: greenfield, replace everything. So I started asking: what do people need? Who are we talking to?

It turned out that internally, there was already a team working on the problem we were trying to solve. Then there was something else we wanted to replace, and there was already a team working on that. Everywhere I looked, there was either a team or part of an existing initiative already working on it, at least one, sometimes two or more parallel workstreams. We ended up with three to four different workstreams, almost 50 people, and we spent over a year and a half on it. The budget was easily between $10 and $30 million.

And then the whole thing eventually got canned, because it turned out the real problem was something else entirely. Some executive needed a proof of concept, or there was an organizational structure question to resolve. No one in the organization was actually solving the problem the larger company needed solved. We were sitting there trying to solve the technical problem when the real problem was a social one. Had we known any of that, we wouldn’t have needed 50 people and over 10,000 hours building a thing that ended up getting deleted. It goes pretty high up, actually.

Robert Hodges  [00:07:27]

I think the organization is a fascinating problem. I have a success story from VMware, where I worked for over four years, one that didn’t get canned. At some point they decided they wanted to redo the UIs for all their products. The traditional approach, based on what you just described, would have been to go around talking to every single team who has a UI and try to figure out what they’re doing.

The team responsible didn’t take that approach. What they did was develop a new widget set that you could use in web applications, and they open sourced it. Then they pointed people to it. By going out on public GitHub, it gave them a way of communicating with the users who actually had the problem. It succeeded brilliantly. Our own cloud product uses that widget set, built by these folks just doing this guerrilla open source approach to getting a solution out there, getting real practitioners engaged, and making it better.

Have you seen other ways of short-circuiting that process? Sometimes getting to the problem is, as you say, the hardest part.

Hazel Weakly  [00:08:53]

I’ve seen ways of short-circuiting it. The fundamentals are typically the same: you have to figure out what the pain actually is. You have to build a forum where people can openly communicate, and a way for collaboration to happen. And then you have to make all of this happen in a way that doesn’t fight the friction and inertia internally so hard that it just gets killed, either out of spite, cost savings, or an inability to even grasp the problem technically.

But there’s no one way or easy explanation for it. What I try to use conceptually is the idea of leverage points in a system. Donella Meadows wrote a really fascinating piece, “Leverage Points: Places to Intervene in a System,” about how to intervene in a complex system. You have things like constants and parameters as the least impactful levers. Then you have changing the rules in the system, changing the incentives, changing how things feed into one another, changing the data path in an organization. But the most powerful lever is actually changing what people think is possible, changing how they perceive the problem entirely.

If you change someone’s entire conceptual model, then everything else follows. What I try to do is find the conceptual model that is ready to be changed, ready to be challenged.

With data platforms, one of the things that’s been really interesting is in regulated environments, the kind I tend to work on, there’s this type-one safety approach where everything has to go in a little checkbox, into a little form. Every single feature or capability needs its own separate individual product, run by its own tight little team, sitting on a shelf somewhere organized perfectly. That’s how they think about it.

But if you look at modern data platforms, they’re not that. They’re not even close. They’re this streaming, analytic, batch, joined, mismatched thing with object storage and serverless compute and hyper-elasticity, all this weird, actually incredibly powerful stuff. They’re building something out of reusable primitives rather than making neat little boxes to put things in. They’re useful for complex problems because they can be built to mirror that complexity.

A complex problem can only be solved by complex capabilities. And complex capabilities resist categorization. They resist atomization. They resist factorization. But neat categorization is exactly the world that regulation, compliance, and security thrive on, the whole “seeing like a state” worldview. We need a fundamentally new way to track information deeply, categorize things, provide the evidence, and do all the things compliance and risk require, while also being legible to the business.

Robert Hodges  [00:13:49]

Is there a way to introduce new technology by solving old problems? You mentioned compliance, and regulated environments often have to generate reports, like to the SEC. Bad things happen when those reports are late, and they’re not easy to do. One way to sneak things in is to take things that are painful drudgery and make them easier. Do you see opportunities to bring new technology in that way, or do you seek out new problems?

Hazel Weakly  [00:14:29]

I do both. When you seek out a new problem, there’s always a lot of energy around it. People want to migrate to a new thing, an executive wants to add something to their portfolio, someone wants to get promoted. That mostly only happens with the shiny stuff. So yes, I do want to seek the new problem.

But I also want to see if I can solve an old problem better with the new thing. If I can get people to reframe what they think is possible and introduce the idea that they can solve a problem that couldn’t be solved before, and then, “oh, by the way, it also solves these other old problems.”

Take compliance reporting, for example. A lot of companies live and die by their 15 or so critical automated compliance reports, and they also have 2,000 random Excel spreadsheets to check all the other stuff. Piles and piles of checklists, people being paid to copy and paste from one thing to another. The 10 or 15 automated reports were painful enough that they got automated. Everything else never hit that threshold.

But with a new solution, you can lower that threshold just by handling a whole new level of complex data correlation. Usually the reason the old tools need manual evidence passing is that you can’t correlate data. That’s typically why it’s hard to automate compliance reports. And correlation is one of the things that next-generation data platforms do almost categorically better than anything that came before.

So if I look at these compliance reports and ask why we can’t automate all the others, it’s because doing so requires getting tens of thousands of people to copy and paste a whole bunch of evidence points across a million different things so that data is right where it needs to be. Because we can’t join across 20 different products. That’s the real problem.

Hazel Weakly  [00:17:06]

But if we can join across 20 different products, if we can bring all that data together in a Lakehouse-style architecture, then all of a sudden we can automate those reports. Not just the 15 critical ones, but maybe 200 of them. And then suddenly compliance isn’t just an afterthought, it’s baked in.

What I want is a platform that does both: I can query the data deeply in a very relational, correlated way and get the compliance evidence I need, but I can also get the streaming, analytical, aggregate view. Both. That’s the thing I can’t get with any other solution today.
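The cross-product correlation Hazel describes can be sketched in miniature: once records from separate systems land in one queryable place, a compliance report row becomes a join instead of a copy-paste exercise. A toy Python sketch, with every field name invented for illustration:

```python
# Hypothetical sketch: join deployment records to change approvals that
# previously lived in separate products. All field names are invented.

from datetime import date

deployments = [
    {"app": "billing", "change_id": "CHG-101", "deployed": date(2024, 3, 2)},
    {"app": "reports", "change_id": "CHG-102", "deployed": date(2024, 3, 5)},
]
approvals = [
    {"change_id": "CHG-101", "approved_by": ["alice", "bob"]},
]

def evidence_rows(deployments, approvals):
    """Join deployments to their approvals; flag anything unapproved."""
    approved = {a["change_id"]: a["approved_by"] for a in approvals}
    for d in deployments:
        yield {
            **d,
            "approvers": approved.get(d["change_id"], []),
            "compliant": d["change_id"] in approved,
        }

rows = list(evidence_rows(deployments, approvals))
```

With the data co-located, the unapproved deployment surfaces automatically; with the data split across products, someone would be pasting both sides into a spreadsheet.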

Robert Hodges  [00:22:11]

And there’s something I ran into recently that totally dovetails with that point. Modern analytic databases, ClickHouse in particular, have this amazing ability to look at a file, use whatever metadata is there, and sometimes just infer what the file even is. Is this Parquet? Is this CSV? Is it JSON? And then show you the structure, either because it’s explicitly there in metadata (as with Parquet), or because it can read something like JSON and infer the structure from looking at the first few records.

This is astonishingly good. You can put stuff out there and with these modern tools, you can actually make sense of it. Back in the days of Hadoop, there was nothing like this. You had to eyeball the files yourself and then write the Java that would read them. It’s a huge step forward.
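The inference Robert describes can be illustrated in miniature. This is not how ClickHouse implements it, just the core idea: sample the first few records and guess column types from the values. A toy Python sketch:

```python
# Toy schema inference: peek at the first few JSON lines and map each
# key to the Python type name of its value. Real engines handle nulls,
# nested structures, and type widening; this is only the core idea.
import json

def infer_schema(lines, sample=3):
    """Infer {column: type-name} from the first few JSON lines."""
    schema = {}
    for line in lines[:sample]:
        for key, value in json.loads(line).items():
            schema.setdefault(key, type(value).__name__)
    return schema

raw = ['{"id": 1, "user": "hazel", "ok": true}',
       '{"id": 2, "user": "josh", "ok": false}']
print(infer_schema(raw))  # {'id': 'int', 'user': 'str', 'ok': 'bool'}
```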

Hazel Weakly  [00:23:16]

And even with Hadoop, even if you read the files and analyzed everything, you had to build your own job to process everything. Everything was its own job. But as a platform owner, there was no way to restrict the ability to create a job such that two different types of fields don’t end up together in a way that accidentally creates a personal identifier. That requires being able to tag every single thing.

Some things that are individually low-sensitivity can become high-sensitivity when combined. Even in Hadoop, even if you did all the work, you couldn’t get that capability. So you’d end up as a platform team building blueprints and starter kits and templates and making sure that everybody registered every single type of job. Then all the jobs had to go through a review to make sure that the data you got, now and in the future, could never result in a data leak or a correlation leak.

But now I can just write policies that say: this field shouldn’t be joined with these other fields unless it’s been joined into a report that meets certain criteria. If it’s compliance level one, bump it up to two. I can write those types of policies now.

That means my templates, my standard parts, my review process basically become: did you use the platform? Cool. You’re good to go. The standard parts are no longer oriented around proving compliance, but around making sure developers are actually effective at what they need to do. Which is, ironically, supposed to be the highest priority. But the more regulated something is, the lower priority developer experience tends to be.
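The join-sensitivity policy Hazel describes, where individually harmless fields become sensitive together and a join bumps the compliance level, might look something like this. The field names and levels are assumptions for illustration:

```python
# Hypothetical join-sensitivity policy: certain field combinations form
# quasi-identifiers, and joining them raises the compliance level of
# anything derived from the result. All names here are invented.

SENSITIVE_COMBOS = {
    frozenset({"zip_code", "birth_date"}),
    frozenset({"ip_address", "login_time"}),
}

def check_join(fields, base_level=1):
    """Return (allowed, resulting_compliance_level) for a set of fields."""
    for combo in SENSITIVE_COMBOS:
        if combo <= set(fields):
            # The combination can identify a person: allow the join,
            # but bump the compliance level of the output.
            return True, base_level + 1
    return True, base_level

allowed, level = check_join(["zip_code", "birth_date", "purchase"])
```

The point of the sketch is that the rule lives in the platform, not in a manual review of every job that might someday touch these fields.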

Hazel Weakly  [00:26:07]

So if I build an internal platform for developers at a regulated company, priority zero is: did we preserve compliance? Did we preserve the risk controls? Did we preserve the security posture? Did we collect the evidence? And then 20 items down the list: do developers like it? It’s priority 17 or something.

If I can bake all those other priorities into the actual infrastructure, into the capabilities and the protocols, then all of a sudden I can make the actual users the number one concern again, where they should be. That’s the thing I find most fascinating about the next generation of data.

Josh Lee  [00:26:58]

I love that. The question becomes simply: did you use the platform? Check. You mentioned users being developers, and I’m going to probably regret opening this can of worms. These days, users are also commonly coding assistants and agents. It seems like that’s another layer. If we build barriers and controls that prevent people from using those tools, they’ll just come up with workarounds and end up with shadow IT. Whereas if you can build governance in a way that empowers them to use these tools and ensures they won’t run amok — yeah, I like it.

Hazel Weakly  [00:27:41]

Honestly, that type of scenario happened long before AI. And it’s not exclusive to non-developers, because every developer has been tired and rushed, someone said “just do it, you have a deadline,” and so they wrote something fast and threw it in without really checking it. Everyone’s done it. Everyone’s going to keep doing it because that’s what happens when you get pushed and squeezed and have to just get through something. You stop being careful. Careful is for when you have lots of free time and no stress. And I don’t think any developer in the last five years would describe themselves as having lots of free time and no stress.

So at this point, it’s not even about developers vs. non-developers, or AI vs. non-AI. I want to make it so we can focus on helping the users, making that the actual priority, rather than making data movement or evidence collection the priority.

Hazel Weakly  [00:29:03]

Because that’s how you go: “I hear what you’re saying. Let me help you.” Instead of: “I hear what you’re saying, but you need to fill out the form, check the state, go over here, and hit the box.” No one wants that conversation. Not even the people who have to deliver it.

Robert Hodges  [00:29:23]

It’s the difference between enabling versus regulating. Some people see new technology and say, “Oh my goodness, I want to make this possible to use.” But others see it and say, “This could get out of hand, so I’m going to lock it down as much as possible, even though perhaps I don’t fully understand how people are even going to use it.” Do you see that tension in building data lakes? How do you address those two camps? You want to make people productive, but you don’t want them to go off the rails.

Hazel Weakly  [00:30:03]

I definitely see that tension. It’s tricky because there’s so much legitimate need for both perspectives, and very frequently, people take a given stance when their situation doesn’t actually call for it. If you’re building systems that move money between accounts, or life-saving medical devices, yes, you’re going to lock them down tight. On the other hand, if you’re building a little internal fun thing that doesn’t touch anything critical, you don’t necessarily need to lock it down.

But you’ll find the “I have to lock this down” attitude in both places. And you’ll find the “Let’s experiment with this” attitude in both places too. People bring the attitude from the toy project into the nuclear power plant and vice versa. You need both perspectives, and you’re going to find people applying them when they shouldn’t and not applying them when they should.

Hazel Weakly  [00:31:37]

So if I’m thinking about a data platform and how to address both groups, usually I go for the easy approach first: if someone has the right perspective for the right need, great. Let’s build for both. Let’s build for the boundaries. Let’s build for where it breaks down.

I have a metric I use internally: time to typo. If I see a typo somewhere, what is my time from noticing it to getting it fixed? All I need to do is remove one character. How long does that take? I want to minimize that. Or if someone just wants to put a Hello World page up, and they don’t really care what it does, they just want to put something somewhere. How easy is that? If a data platform is a key-value store, can I store ‘Hello World’? Can I get it back? How hard is that?

Then on the other side, ironically, it’s everything but the data. Can I make sure that no one can do anything with the data except exactly what I specified? Can I make sure that only at 2:00 PM on a Sunday, this one account in this one region can do this one thing? Can I write those types of policies? Can I think about that in a useful way?

Once the people who care about control have the confidence that they can build the controls and analyze everything, they’re ready to add data. Once the people who just want to play with it know they can put something somewhere and get it back, they’re ready to lock it down. I focus on finding what gives each group psychological satisfaction first, and then bringing both toward the middle, because the platform should serve both needs.

Robert Hodges  [00:34:37]

Do you feel like AI is actually shifting the balance toward more focus on compliance and regulation? If you or I look at a dataset and realize we’re getting into PII data, something holds us back: the fear of losing our job if we screw up. AI agents don’t have that constraint in the same way. How do you see people managing that risk in regulated systems?

Hazel Weakly  [00:35:19]

I’ll answer in two parts. Have I seen people caring more about compliance or controlling AI because it doesn’t access things the way humans do? Honestly, typically no. And the reason is that the way AI has been marketed and sold has dropped all sorts of normal common-sense filters in people’s brains.

Those filters, the “I shouldn’t do this” instincts, have been carefully built up over decades. AI is such a different way of engaging with technology that we haven’t built up any of those common-sense safeguards yet. It’s like the early internet: “stranger danger” and “don’t talk to people you don’t know” were ingrained from childhood. Then the internet came along and it was totally normal to meet strangers and make friends. None of that internet common sense was intuitive. You had to learn it. We had to develop the social conditioning over time.

AI comes in and people simultaneously encounter a technology that’s been marketed as production-ready and capable of doing anything, while having zero of the mental guardrails built up for it. It’s going to take decades for that social conditioning to develop. And there’s currently zero economic pressure to build any of it. That’s the other part of the problem.

Hazel Weakly  [00:38:13]

There’s another dimension to it for me. Computers in isolation are very predictable: if something works, I can follow the instructions step by step and it works every time. With AI, all of a sudden I do the same thing twice and it doesn’t work the second time, or it breaks every third time.

Humans aren’t really built to deal with that, because it turns out that random but semi-predictable rewards are the largest driver of addiction. So we’ve basically built the ultimate addiction machine with AI, and then we didn’t regulate it, and we didn’t put any social protections in place. So it’s no surprise that even in compliance environments with heavy enforcement of controls, it’s still really hard. The biggest challenge I’ve seen with AI in regulated environments is the conversation: “No, you can’t actually do that,” or “We need to design this in a different way,” or “I know everything on the internet says just pass the flag and you’re done, but we can’t actually do that.”

Robert Hodges  [00:41:36]

I’m really interested in authorization, specifically in regulated environments. They need consistent, single sign-on access to all these data sources. There are existing technologies, OAuth, OIDC, these protocols. Are they sufficient to control access, or do we need more? And if so, what is it that we need?

Hazel Weakly  [00:42:16]

This is a whole can of worms, and a lot of what I do for work is designing and thinking about exactly these things. To simplify: I’d break it into authorization, authentication, and attestation. Let me throw that third one in.

Authentication is: who are you? Authorization is: are you allowed to do this? And attestation is: how have you proved this? What’s the evidence trail? How does it work?

OAuth, OIDC, and single sign-on only deal with authentication. They answer “who are you?” and give you things like session validity and an “audience”, basically a bunch of metadata around the identity that can carry roles or permission sets. That then gets passed into the authorization layer.

For a data platform, taking a key-value store as the simplest example, you might have many different types of permissions for what someone is allowed to do. When building that platform, you have a really interesting problem: how do you define the policies? There are two things to think about: the level of the policies around the data, and the policies around how the application structure itself is built. A policy is really just going to give you a yes or no for some action.

What I need to do with any data application is not just figure out what the policies are. I need to figure out what types of policies can be built around something, whether or not they even make sense for what we need, and then how to combine that with an application foundation that’s probably too limited out of the box for a regulated environment.

So much of actual compliance is around control policies and process policies: did multiple people review the code before it went to production? That code review might be tied to a Terraform change. That Terraform change might create a key-value store. Did I have permission to create a key-value store? Yes, if this happened during a change request in this time window, approved by these people in this organization with these properties. None of that is relevant to people building a key-value store. They just need a yes or no, and all of that extra context tying together business concepts only makes sense at the business level.
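The separation Hazel describes, where the application only ever sees a yes or no while the business context lives in the policy layer, might look like this in miniature. Every field name here is an assumption for illustration:

```python
# Hypothetical sketch: "may I create a key-value store?" is answered by
# checking business-level context (an approved change request, a time
# window, enough approvers) that the application itself never sees.
from datetime import datetime

def may_create_store(request, change_request):
    """Yes/no: was this action covered by an approved change window?"""
    ts = request["timestamp"]
    return (change_request["approved"]
            and len(change_request["approvers"]) >= 2
            and change_request["window_start"] <= ts <= change_request["window_end"])

cr = {"approved": True,
      "approvers": ["alice", "bob"],
      "window_start": datetime(2024, 3, 1, 9),
      "window_end": datetime(2024, 3, 1, 17)}
ok = may_create_store({"timestamp": datetime(2024, 3, 1, 10)}, cr)  # True
```

The caller gets back a single boolean; the who-approved-what-and-when only matters inside the policy engine.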

Hazel Weakly  [00:47:39]

And it has to sit inside a policy control engine. That policy control engine ironically needs an enormous amount of data. It needs all these types of data, all these relationships. So it itself needs essentially its own Lakehouse, its own next-generation data architecture, in order to effectively pull all that deeply correlated data from all different parts of the company, from all different systems, and put it together to give you a yes or no that makes sense.

Robert Hodges  [00:48:46]

But how do you do that in practice? This is very complex software. When people build a policy, they often bake it into the application because you can’t represent things like “are you a US citizen?” in standard SQL role-based access control. It just isn’t available. This is an application-layer thing. How do you roll this stuff up into something that can be used centrally? Is it possible?

Hazel Weakly  [00:49:19]

This gets into what some people call the dual data problem. Google published the Zanzibar paper that showed how they approach and solve this. What they did was build essentially a Lakehouse (before the term existed), combined with many different tiers of caching, sitting on top of Spanner: a massively distributed, strongly consistent relational data model. Then they built a relational query language on top of it.

That query language can know the entire org structure. It can know all the access that people need to have. For example: should I be able to view every photo in this folder, given that the photo has permissions inherited from the folder, the folder’s parent folder, whether or not the photo is shared from someone, whether or not it’s part of a collection or album? All these different bits of information get correlated and queried together. The classic problem was: I open a folder with 10,000 pictures. How do I make that action take only 200 milliseconds? Zanzibar is how Google solved that.

People took that concept and ran with it. A couple of open-source alternatives emerged, one of which is SpiceDB. SpiceDB sits on top of Postgres or a few other databases. It provides a way of building these indices and putting data in the right places. You upload the data you need into it, query it relationally, and get sophisticated access control. Then the question becomes: how granular do you make it?

For example, you might need to write a policy like: no software can be deployed to production if it has CVEs above a certain criticality level that haven’t been explicitly allowed, within this time window. To do that, you take the entire transitive dependency list of every application in the entire company, potentially tens of millions of entries. You put them into this database, and then every time you query any individual package during deployment, you can get a yes or no. The logistics of how to build this are an open-ended question. But tying together the evidence and figuring out how granular to make something is ultimately a business decision.
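The Zanzibar model Hazel summarizes boils down to relation tuples and graph traversal: access is stored as “object, relation, subject” facts, and a check walks the graph. A deliberately tiny Python sketch of the photo-in-folder example (the tuple format is only loosely modeled on the paper; real systems layer Spanner and heavy caching underneath):

```python
# Toy Zanzibar-style check: relation tuples form a graph, and a
# permission check walks it, inheriting access through parent objects.

TUPLES = {
    ("folder:vacation", "viewer", "user:hazel"),
    ("photo:beach.jpg", "parent", "folder:vacation"),
}

def check(obj, relation, user, depth=10):
    """Can `user` hold `relation` on `obj`, directly or via a parent?"""
    if depth == 0:
        return False  # guard against cycles in the relation graph
    if (obj, relation, user) in TUPLES:
        return True
    # Inherit permissions from any parent object (folder -> photo).
    for (o, r, parent) in TUPLES:
        if o == obj and r == "parent":
            if check(parent, relation, user, depth - 1):
                return True
    return False

check("photo:beach.jpg", "viewer", "user:hazel")  # True, via the folder
```

The hard part Zanzibar actually solves is making that traversal fast and consistent across billions of tuples, which is where the Spanner layer and the caching tiers come in.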

Hazel Weakly  [00:53:18]

The next generation of something like Zanzibar or SpiceDB would be built on top of a highly modern Lakehouse-type architecture. Things like Apache Iceberg, with its next-generation formats and capabilities, are exactly what you need when you have deeply relational, nested, weird correlations of 50 different things, combined with dynamic run-time data. Like that ‘during a time window’ concept: if I need to know that something can only be deployed during business hours, and “business hours” means the local hours of that particular location, not the location of the controlling server. I need to pass that in dynamically at run time to the evaluation engine. I need to take some run-time data from the request, look up a whole bunch of things from the database, maybe query some services as well, pull all that information together, and come back with a response in a very tight window. That’s the rough shape of how this works.
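The run-time evaluation Hazel sketches, “business hours” meaning the deploy target’s local hours rather than the policy server’s clock, might look like this in miniature. Zone names and hours are assumptions:

```python
# Hypothetical sketch: evaluate a "business hours only" deploy policy in
# the target site's local time zone, passed in at evaluation time.
from datetime import datetime
from zoneinfo import ZoneInfo

def deploy_allowed(utc_now, site_tz, start_hour=9, end_hour=17):
    """Evaluate the policy in the target site's local time."""
    local = utc_now.astimezone(ZoneInfo(site_tz))
    return start_hour <= local.hour < end_hour

now = datetime(2024, 3, 4, 1, 30, tzinfo=ZoneInfo("UTC"))
deploy_allowed(now, "Asia/Tokyo")        # 10:30 local -> True
deploy_allowed(now, "America/New_York")  # 20:30 the day before -> False
```

The same UTC instant yields opposite answers depending on a piece of request context, which is exactly why the engine has to combine stored relationships with dynamic run-time data.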

Robert Hodges  [00:54:48]

Thank you so much for the reference to Zanzibar. I searched “Zanzibar Google Security” and the paper came right up. That looks really, really interesting.

Hazel Weakly  [00:54:59]

The people behind SpiceDB have an annotated version of the paper on their website with a lot of really fascinating notes. Also, Lea Kissner, I believe that’s how you say the name, was one of the designers of Zanzibar. If you look up their writing and talks, they go into a lot of the underlying design decisions. One thing they discuss is the need for invertibility: not just the ability to write a query like “what are all the photos I’m allowed to access?” but also the inverse, starting from a given photo and asking “who’s allowed to access this?” Developers needed that inverse traversal for debugging and troubleshooting.

Robert Hodges  [00:56:06]

It feels like we’re entering a new era of security, particularly with AI agents. Previously there was a form of security through obscurity: it was just hard to get to the data. But by making it easy, and then having agents that can ask all sorts of different questions, there’s a whole new level of access occurring. It feels like databases and data lakes are going to have to adapt to that.

Hazel Weakly  [00:56:33]

And it used to be easier to make things secure by making the problem simpler. If you’re only allowed to access data in five different ways, I can check each one individually and make sure you can’t break it. But if you’re allowed to access data in any way, that’s really hard.

Robert Hodges  [00:56:54]

It’s definitely not going to be boring.

Hazel Weakly  [00:56:59]

I’m going to be gainfully employed for a long time. Let’s put it that way.

Josh Lee  [00:57:06]

If it were easy, they wouldn’t pay us so much. Hazel, just to wrap things up. This has been a great conversation. If we gave you a magic wand and you could standardize one thing with one universal protocol, what would you start with?

Hazel Weakly  [00:57:27]

I actually have two answers. The first is something that already exists: I would love transaction tokens to be standardized and used everywhere. They’re a way to prove something, but throughout the entire lifecycle of a request. At every single hop, you can prove that hop, pass the evidence through, and attach policies to it. You can double-check everything.

The second thing doesn’t exist yet, but I really want a way to take the metadata and capabilities of modern formats like Apache Iceberg and attach really strong policies and access controls to them. Then use that to automatically segment things into different storage locations, or different encryption keys, or different access patterns, essentially heterogeneous data access in the same place. If I could do that, it would dramatically simplify a lot of these problems. I want to access a lot of data for different types of purposes, but I need to make sure the data is always being used in a compliant way. That’s still a huge unsolved problem in next-generation data platforms. If someone ever figures out how to do it, that would be the key.

Josh Lee  [00:58:58]

Okay. If you’re listening and you want to start a startup, there you go. Hazel, you’re speaking at KubeCon, I think I saw?

Hazel Weakly  [00:59:09]

I’m speaking at KubeCon and I have some other things coming up as well. Very exciting. Folks can see me in person.

Robert Hodges  [01:00:10]

Great. Thank you so much for the conversation.

Hazel Weakly  [01:00:16]

Yeah. Thank you for having me on.

Josh Lee  [01:00:20]

This was extremely enlightening. I look forward to having you back.

Listen to the full conversation on the Unevenly Distributed podcast, available on Spotify, Apple Podcasts, and YouTube. Connect with Hazel on LinkedIn at /in/hazelweakly/. For more insights on ClickHouse and real-time data architecture, visit our blog.