Using dlt to move data from DuckDB to ClickHouse®

Recorded: March 16 @ 08:00 am PDT
Presenters: Joshua Lee & Elvis Kahoro

In this webinar, Josh Lee (Open Source Advocate at Altinity) and Elvis Kahoro (DLT Hub) demonstrate how dlt, an open-source Python data movement library, can take a pipeline from a local DuckDB prototype all the way to a production ClickHouse instance running on Kubernetes via Altinity.Cloud®. Elvis explains dlt’s core philosophy of democratizing data engineering: making data pipelines as code-first, portable, and AI-agent-friendly as possible, so teams spend time on business problems rather than boilerplate infrastructure work.

The session covers dlt’s key features: a declarative pipeline API that works natively with Python lists, dictionaries, and DataFrames; an incremental loading system that tracks state on the destination; roughly 10,000 API source contexts that help both humans and LLMs write correct pipeline code; and a scaffolding CLI (dlt init) that generates a ready-to-run project with credentials, schema, and agent skills in a single command. Elvis demos the full workflow: initializing a GitHub API pipeline in a fresh project, using Claude as the debugging agent via a dlt MCP server, switching the destination from DuckDB to Iceberg on Altinity.Cloud® with minimal configuration changes, and exploring loaded data with marimo notebooks and experimental data quality checks. The webinar closes with a look at the dlt runtime launch command for cloud execution and the DLT Hub observability dashboard for monitoring pipeline runs.

Key Moments (Timestamps)

Key moments generated with AI assistance.

  • 00:12 – Welcome and speaker introductions
  • 00:37 – Overview of Altinity and ClickHouse: Project Antalya, Altinity.Cloud, BYOC, BYOK
  • 03:06 – The Altinity Kubernetes Operator for ClickHouse: 6 years, 100 million downloads
  • 03:36 – Introduction to dlt: what it is and who uses it
  • 06:48 – Why teams switch from managed platforms to dlt: cost, code ownership, compliance
  • 09:20 – Why teams switch from custom Python scripts to dlt: schema, incremental loads, best practices
  • 10:25 – What a dlt pipeline looks like: sources, resources, destinations
  • 12:36 – File system and S3 sources with incremental loading
  • 13:52 – REST API support and the declarative type dictionary
  • 14:17 – Live demo: dlt init, virtual environment setup, and GitHub API pipeline
  • 17:00 – dlt AI: downloading Claude skills and an MCP server for data validation
  • 20:31 – Code walkthrough: GitHub source, async generators, dlt decorators, and resources
  • 23:31 – Switching destination to Iceberg on Altinity.Cloud
  • 24:48 – Data exploration with marimo and pipeline introspection
  • 26:38 – Data quality checks and fast local feedback loops
  • 27:43 – dlt runtime launch: running pipelines in the cloud against Altinity
  • 29:45 – Chaining pipelines: bronze, silver, gold medallion architecture
  • 32:15 – DLT Hub observability dashboard
  • 34:08 – Summary, resources, and Q&A

Webinar Transcript

[00:12] – Welcome and Speaker Introductions

Josh: Hello everybody and welcome to our first March webinar. Today we’re going to be talking about how we can use dlt, which is an open-source tool, to move data from DuckDB, another open-source tool, to ClickHouse, everyone’s favorite open-source database. In a way, we’re going to be taking things from a pocket calculator on a laptop all the way up to a production environment running on Kubernetes on Altinity.Cloud.

I am Josh Lee. I am an open-source advocate at Altinity. We do hosting and support for ClickHouse, but we are not affiliated with ClickHouse, Inc. And I am joined today by Elvis. Hey Elvis.

Elvis: Hey everyone. My name is Elvis. I work at DLT Hub. We make dlt, an open-source Python library whose name stands for Data Load Tool, and we’re just trying to make it as easy as possible for people to move data.


[00:37] – Overview of Altinity and ClickHouse

Josh: Quickly, for those coming from outside of the Altinity sphere of influence who have not heard of ClickHouse: it’s awesome, we love it, and it’s a very, very fast database. Here is a diagram we use in a lot of our talks that gives you an overview of the ClickHouse architecture.

Worth pointing out here: Project Antalya is Altinity’s project to make Iceberg and Parquet-based data lakes the primary storage for ClickHouse. Not just a destination or source you’d move data to or from, but really the primary data layer. This can dramatically reduce costs, and surprisingly can even speed things up if it’s architected correctly and if the shape of the data is right. Even in cases where it isn’t speeding things up, it’s not as slow as you’d expect. We have other content and webinars on this if you’d like to check them out.

Also worth pointing out: there are a lot of pieces to running ClickHouse, and one of the things we do at Altinity is manage all of those pieces for you. We have Altinity.Cloud, and we offer enterprise support. The Altinity Cloud Manager makes it really easy to just click a button and get a production-ready or development ClickHouse instance on a completely isolated Kubernetes cluster, typically EKS but with other options. You get backups, you get consensus, you can have swarm clusters, which are stateless nodes introduced as part of Project Antalya. And because ClickHouse is open source, we support customers running it anywhere they can run Linux binaries. That means our Bring Your Own Cloud and Bring Your Own Kubernetes offerings, where you can run ClickHouse on your own cloud account or on your own Kubernetes cluster anywhere.


[03:06] – The Altinity Kubernetes Operator for ClickHouse

Josh: I’ll mention my favorite open-source piece besides ClickHouse itself: the Altinity Kubernetes Operator for ClickHouse. This operator has been around for six years. It’s used in thousands of production installations, including all the installations we manage on Altinity.Cloud. It’s been downloaded over 100 million times, and it is, as far as I can see, the best and easiest way to run production ClickHouse workloads in Kubernetes.


[03:36] – Introduction to dlt

Elvis: Like I was saying, dlt is an open-source library for data movement and our goal is to democratize data engineering. You can think of us almost as a platform engineering team for the data community, but one that ships a product as open source. Right now we have around 8,000 companies using dlt in production, and we have roughly 600 shared customers with Snowflake. We’re just trying to make it easy for folks to pull data from something like Salesforce, or to do CDC with Postgres, and then be able to point it at any data warehouse. Whenever you have a dlt pipeline, we abstract away the logic of actually loading data into the destination, whether that’s Redshift, DuckDB, Snowflake, or others. The goal is that with a single line of code change, you can switch the destination, add some tokens, and it just works.

Here’s a more formal overview of different use cases. You might be working in RevOps and moving data from your CRM into Snowflake. You might be doing CDC from Postgres to Snowflake. You might also be doing the reverse: pulling data from your data warehouse, taking a subset, and using it for exploration or to build views and filters. We also make it really easy to work with your file system. You can point dlt at a folder on your computer and say “move all the Parquet files or all the CSVs into my data warehouse.” You can also point it at an S3 bucket. And we have support for Iceberg. In the second demo later today, we actually show loading some data into Iceberg.

We also have support for REST APIs. We give people a declarative type dictionary to fill out with the information for the REST API, and what we call scaffolds: context bundles for a bunch of different API sources. Whenever you start using dlt, you can use the CLI to tell us your source and your destination, and we can download context for that particular source. We’ve built context for around 10,000 sources so far. The goal is to keep growing that number and to make it really easy for people to pull data from wherever they want, using AI to help.


[06:48] – Why Teams Switch from Managed Platforms to dlt

Elvis: One of the main reasons people switch to dlt is that managed data platforms are very expensive, and not just in compute cost. There’s also the engineering time spent working around the limitations of whatever managed platform you’re using. The nice thing about dlt is that since everything is code, all of your transformation logic is completely explicit. What tends to happen with managed platforms is that one person who knows a particular API’s quirks ends up being the only person who understands all the edge cases. When that person leaves, that knowledge goes with them. If you switch to a code-first tool, you can move all of those nuances, workarounds, and edge cases into the code itself, and you get real observability into your pipelines.

The second important reason, especially for companies working in fintech or health, is data contracts and regulatory requirements. Being legally bound not to let data leave your cloud is a real constraint. Being able to pip install this Python package and run it on your own server with your own compute is really helpful for large enterprises that have FedRAMP requirements and similar regulations.


[09:20] – Why Teams Switch from Custom Python Scripts to dlt

Elvis: The other place people come to us from is custom Python scripts. Typically what happens is: some internal data platform is missing a connector for a source, or an ML engineer needs to add new features from a new source. So they vibe-code a quick Python script just to be productive and solve a problem. Then they realize: oh, now I have to normalize data, think about how the schema is going to evolve, and handle incremental loading. These are all things that get in the way of doing your actual job. This is where we come back to being a platform team and say: how can dlt embed best practices into the SDK so people get all of these things out of the box?

We also think carefully about the API boundaries in the SDK so it’s easy for people to use, and about how this works for agents, not just agents in general but the different LLM providers, each of which has different nuances. We think holistically about how to design this to work well with Claude, with Cursor, with OpenAI.


[10:25] – What a dlt Pipeline Looks Like

Elvis: A dlt pipeline typically looks like this. You import dlt, pull out the pipeline method, give it a destination, and give it a dataset name. The dataset name is the name of the database that data gets loaded into. When you run the pipeline, you just pass in the data source. A data source has resources under it, and you can think of resources as tables. In the case of GitHub, my source is GitHub, but my underlying resources might be pull requests, issues, repositories, contributors, and so on.

We also make it easy to load regular Python primitives. We have good support for DataFrames. The most common dlt source types are actually Python lists and Python dictionaries. People want to work with native types they already know, and load that data wherever they want. This is especially true of the ML crowd.
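The pipeline shape described above can be sketched as follows. This is an illustration rather than code from the demo: the pipeline name, dataset name, table name, and sample rows are made up, and it assumes dlt is installed together with the driver for the chosen destination (DuckDB here).

```python
# Sample rows: plain Python dictionaries, one per record (made-up data).
issues = [
    {"id": 1, "title": "Fix pagination", "state": "open"},
    {"id": 2, "title": "Add retries", "state": "closed"},
]


def load_issues(destination: str = "duckdb"):
    """Load the sample rows into `destination` and return dlt's load info."""
    # Imported here so the sample data can be inspected without dlt installed.
    import dlt

    pipeline = dlt.pipeline(
        pipeline_name="github_demo",  # hypothetical pipeline name
        destination=destination,      # "duckdb" locally, "clickhouse" later
        dataset_name="github_data",   # hypothetical dataset name
    )
    # The list is loaded into a table called "issues" in the dataset.
    return pipeline.run(issues, table_name="issues")
```

Calling load_issues() writes to a local DuckDB file. Calling load_issues("clickhouse") would target ClickHouse instead, provided the credentials have been configured.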

Here’s what the file system source looks like. You give it a bucket URL and specify you want only Parquet or only CSV. There’s also an incremental parameter where you specify the state, in this case a column called modification_date. Whenever dlt runs, it keeps metadata on the destination and uses that metadata to track the last file loaded.
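The idea behind that incremental parameter can be sketched in plain Python. This is not dlt's implementation: the state dictionary stands in for the metadata dlt keeps on the destination, and the file names and dates are invented.

```python
from datetime import datetime


def select_new_files(files, state, cursor="modification_date"):
    """Return files newer than the stored cursor value and advance the cursor."""
    last_seen = state.get("last_value")
    new_files = [f for f in files if last_seen is None or f[cursor] > last_seen]
    if new_files:
        # Remember the newest value so the next run can skip everything older.
        state["last_value"] = max(f[cursor] for f in new_files)
    return new_files


# Stand-in for the load metadata stored on the destination.
state = {}
files = [
    {"name": "a.parquet", "modification_date": datetime(2024, 1, 1)},
    {"name": "b.parquet", "modification_date": datetime(2024, 1, 2)},
]

first_run = select_new_files(files, state)   # both files are new
files.append({"name": "c.parquet", "modification_date": datetime(2024, 1, 3)})
second_run = select_new_files(files, state)  # only c.parquet is new
```

The second run picks up only the file added after the first run, which is the behavior described for the file system source.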

The REST API source is declarative. Giving an LLM a type dictionary with explicit variable names makes it easy for the LLM to generate correct REST API calls. The second challenge is understanding the API’s specific quirks. That’s where the dlt context project comes in: you tell the LLM things like “GitHub uses a non-standard attribute name here instead of the standard modification date.” It’s the combination of a clean SDK and extra context for each specific API that makes LLM workflows reliable.
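A declarative configuration of this kind might look like the sketch below. The key names follow dlt's REST API source as best understood here and should be checked against the dlt documentation. The repository path, the initial date, and the choice of updated_at as the cursor field are illustrative assumptions, not values shown in the webinar.

```python
# Declarative description of a REST API: the base URL to call and the
# endpoints that become tables. Values are illustrative.
github_config = {
    "client": {
        "base_url": "https://api.github.com/repos/dlt-hub/dlt/",
    },
    "resources": [
        {
            "name": "issues",
            "endpoint": {
                "path": "issues",
                "params": {
                    "state": "open",
                    # Incremental cursor: request only issues updated
                    # since the value recorded by the previous run.
                    "since": {
                        "type": "incremental",
                        "cursor_path": "updated_at",
                        "initial_value": "2024-01-01T00:00:00Z",
                    },
                },
            },
        },
    ],
}


def build_github_source():
    """Turn the dictionary into a dlt source (requires dlt to be installed)."""
    from dlt.sources.rest_api import rest_api_source

    return rest_api_source(github_config)
```

Because the whole description is a dictionary with explicit keys, a person or an LLM can fill it in without writing any request or pagination code by hand.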


[14:17] – Live Demo: dlt init, Virtual Environment, and GitHub API Pipeline

Elvis: Let me show the demo. We’re essentially starting from scratch with a new project. Typically we start by initializing a virtual environment, installing uv, and then installing dlt.

Then we use the dlt init command, where you pass in a source and a destination. For this example, we’re using the GitHub API and loading into DuckDB. In the second project, I’ll show how we get data into Altinity. The difference would be that instead of duckdb, the destination would be clickhouse.

The CLI will ask if it’s okay to download context for the GitHub API. You say yes, and it sets things up: the GitHub API pipeline, credentials for ClickHouse, and a secrets config file where you put in your credentials. There’s also a requirements.txt. After installing requirements, we’ll have the ClickHouse drivers plus some supporting packages.

Looking at the diff from a commit: we added a config file, a .gitignore, and the actual GitHub API pipeline. I can see there’s a GitHub source with a REST API client, the base URL, paginator, and the required headers. There are different functions that yield data about repos or issues. In the case of issues, there’s already an incremental feature attribute, so we only load issues newer than the last run.


[17:00] – dlt AI: Claude Skills and an MCP Server

Elvis: There’s also a sub-command called dlt ai. If I click dlt ai, I can see that dlt offers an MCP server as well as toolkits. These are similar to Claude plugins.

If I run dlt init with the agent CLI parameter set to claude, dlt will download Claude skills that teach Claude how to use dlt. We get a skill for the dlt workspace and an MCP JSON configuration. The reason we have an MCP server is that once an agent has created a pipeline and loaded data, we want to give the agent tools to run queries against the loaded data so it can validate the integrity. The pipeline might work, but it might not be pulling the right data. So the MCP server is there for that validation step. It runs locally because we’re typically loading into DuckDB first to verify the pipeline with around 100 sample rows. Then, once confirmed, we switch to ClickHouse and point it at our Altinity instance.

There are also specialized toolkit options. We have one for data exploration, one for the DuckDB runtime, and one for the REST API pipeline. Since this example uses the REST API, we install that toolkit, which downloads additional skills for finding new sources, improving skills, validating data, and creating REST API pipelines.


[20:31] – Code Walkthrough: GitHub Source, Async Generators, and Resources

Elvis: Let me walk through the code for the pipeline I already ran. The first thing is the GitHub source. Typically, you have a function that returns an async generator, and you add a @dlt.resource decorator to tell dlt that this is a data source. When we hit the API, dlt loads some data in memory first, introspects on that data, does schema generation, and then creates a table in your destination based on what it found in that yield.
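To make the "introspect, then generate a schema" step concrete, here is a simplified sketch in plain Python. It is not dlt's normalizer, which also handles nested data, type coercion, and schema evolution. It uses an ordinary generator rather than an async one, and the sample records are made up.

```python
def infer_schema(rows):
    """Map each column name to the name of the Python type seen in the rows."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            if value is None:
                continue  # a null says nothing about the column type
            seen = type(value).__name__
            previous = schema.setdefault(column, seen)
            if previous != seen:
                # Conflicting types: fall back to text in this sketch.
                schema[column] = "str"
    return schema


def repos():
    """Generator standing in for a resource function that yields API records."""
    yield {"id": 1, "name": "dlt", "stars": 4000, "archived": False}
    yield {"id": 2, "name": "verified-sources", "stars": None, "archived": False}


schema = infer_schema(repos())
```

The resulting mapping is the kind of information a loader needs before it can create a table in the destination.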

This is my overall source. I have an API client hitting GitHub. Within the source, we have different resources. This function pulls data from all the repos in our dlt org. And I have another function for GitHub issues, filtering by open issues. Eventually, I just run this in a pipeline. In this case, the destination is Iceberg. If I open the secrets and config file, the only things I need for my Iceberg credentials are a URI, a warehouse variable, and an Altinity token. I already ran this pipeline. I’ve changed those parameters so I don’t leak any secrets, but if I jump into Altinity, I can see my dev cluster is up, and I can start exploring data.

Altinity.Cloud makes it easy to explore data and also to change my deployment configuration, like adding more memory or changing the region. And the free trial is pretty generous. I was able to do this entire demo without even entering a credit card.


[24:48] – Data Exploration with marimo and Pipeline Introspection

Elvis: Once you have data loaded, you can start inspecting it. We have some experimental work for making it easy to explore data. We leverage marimo a lot. We’re trying to add primitives for making it easy for people to inspect data locally.

Typically, you have a pipeline, and you can attach to a particular pipeline and start introspecting on the data in it. You can query a dataset from the pipeline, and you can convert the data to Arrow and get a DataFrame back. You can also get your schemas back as Mermaid or as JSON, which is really helpful for sending schema context to other agents.


[26:38] – Data Quality Checks and Fast Feedback Loops

Elvis: We’re also working on data quality. This is still experimental, which is why I haven’t set it up with full ceremony. The concept is data quality checks: you pick a column you care about and want to guarantee has values, like making sure a price column is never null or always above zero. There’s a lot of work going into making it easier for teams using AI to generate pipelines to also validate that the AI-generated code is correct. Being able to do all of this locally is really important. Fast local feedback loops are the key.
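A check like the price example can be sketched in a few lines of plain Python. This does not use dlt's experimental data quality API, which is not shown in the transcript. The function name and sample rows are hypothetical.

```python
def check_column(rows, column, predicate):
    """Return the rows whose `column` value is missing or fails `predicate`."""
    return [
        row for row in rows
        if row.get(column) is None or not predicate(row[column])
    ]


orders = [
    {"order_id": 1, "price": 19.99},
    {"order_id": 2, "price": None},  # violates "never null"
    {"order_id": 3, "price": -5.0},  # violates "always above zero"
]

failures = check_column(orders, "price", lambda price: price > 0)
failed_ids = [row["order_id"] for row in failures]
```

Running a check like this against a small local sample is one way to get the fast feedback loop described above before pointing the pipeline at production.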


[27:43] – dlt runtime launch: Running Pipelines in the Cloud

Elvis: The next level up from running locally is the dlt runtime launch command. Instead of running python pipeline.py locally, you can run uv run dlt runtime launch and pass in your pipeline. Instead of running in DuckDB, it automatically switches to Altinity. There’s a separate config in the repository for the production environment, and dlt picks that up automatically when it runs in the remote environment.


[29:45] – Chaining Pipelines: Bronze, Silver, Gold Architecture

Elvis: Let me show a design pattern that a lot of people actually use. You can chain together different pipelines using the dlt.transformation decorator, which takes a dataset as input. From that dataset, which is the data already loaded in your data warehouse, you can query tables, do joins, do group bys, and aggregate results. You can then coalesce multiple transformations into a new source. So you might have an original pipeline for raw data, and then pass its dataset to a transformation function that computes customer metrics. You can chain these together as many times as you want, implementing a bronze, silver, and gold medallion architecture.
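The chaining pattern itself can be illustrated without any library. The sketch below uses plain Python functions instead of the dlt.transformation decorator, whose exact signature is not shown in the transcript. The table contents and function names are made up.

```python
from collections import defaultdict

# Bronze: raw records as loaded (made-up sample data).
bronze_orders = [
    {"customer": "acme", "amount": "120.50", "status": "paid"},
    {"customer": "acme", "amount": "30.00", "status": "refunded"},
    {"customer": "globex", "amount": "75.25", "status": "paid"},
]


def to_silver(rows):
    """Silver: cleaned rows with typed amounts and refunded orders dropped."""
    return [
        {"customer": r["customer"], "amount": float(r["amount"])}
        for r in rows
        if r["status"] == "paid"
    ]


def to_gold(rows):
    """Gold: one aggregated revenue figure per customer."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


# Each stage takes the previous stage's output as its input.
customer_revenue = to_gold(to_silver(bronze_orders))
```

Each function consumes the output of the previous layer, which mirrors how one transformation's dataset becomes the input to the next.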

And another cool thing about those transformations: you can actually export them as dbt projects. If you have data analysts who only use dbt, you can export the transformations as dbt and run them using dbt instead, which is pretty cool.


[32:15] – DLT Hub Observability Dashboard

Elvis: We have the DLT Hub app, which gives you observability around your runtime. I can see all the jobs I have: dashboards, notebooks, pipelines. I can see when I recently had runs, and I can see for each pipeline run how long each stage took: extract, normalize, and load. I can also see pipeline info like the destination name, credentials, and all the tables that were loaded.

The notebook we were using earlier to inspect locally loaded data can also be pointed at Altinity instead of DuckDB, with no code changes. The idea is that you have this exploration notebook you use locally, and you can use the exact same notebook in the cloud against your actual production workload.


[34:08] – Summary, Resources, and Q&A

Josh: That’s amazing. I know that was one of the questions that came up when we posted this on social media: how easy it is to switch.

Elvis: Yeah, and so to give a quick recap: use the dlt init command, pass in a source and a destination, and download files that help you and your agent move more quickly. You’ll get metadata files that give you context on how the API works. We’ve done this for tens of thousands of sources. You can end up with a pipeline up and running in under 20 minutes. Some of our design partners and consultants who were taking one to two days to prototype with a new customer are now doing it in about two hours. They can even vibe-code a pipeline live during an intro discovery call, which is kind of mind-blowing for these large enterprises.

You can find all these sources at dlthub.com/context. And there’s a link to a bunch of GitHub examples as well, including the Altinity demo using the Iceberg destination. You can literally just clone and start loading data.

Josh: Thanks for pointing that out. We do have an Iceberg catalog built into our cloud offering for those who are interested. Elvis didn’t have to set anything up. He just added the URI and the config.

If there are no more questions, feel free to join the Altinity Slack or the DLT Hub Slack and hit both of us up. And we have a webinar coming up next week for those interested in learning more about ClickHouse from a developer’s point of view. Thank you, Elvis. This has been awesome.

Elvis: Cool. Yep.


FAQ Section

Q: What is dlt and how does it relate to DuckDB and ClickHouse?

dlt (Data Load Tool) is an open-source Python library for building data movement pipelines. The dlt init command scaffolds a complete pipeline project from any of its thousands of supported sources to any supported destination. For ClickHouse in particular, dlt supports both a native ClickHouse destination and an Iceberg destination. The recommended development workflow is to prototype locally with DuckDB first (which requires no infrastructure), then switch to a ClickHouse instance on Altinity.Cloud for production by changing a single destination parameter and adding credentials.

Q: How does dlt handle incremental loading?

dlt tracks incremental load state as metadata on the destination. You specify an incremental parameter on a dlt resource, pointing to a column like modification_date. On each pipeline run, dlt reads the last-processed value from its metadata and only loads records newer than that value. This means you don’t need to write or manage watermark logic yourself. For file-based sources like S3 or a local folder, dlt uses the file’s metadata to track what was last loaded.

Q: How does dlt integrate with AI agents like Claude?

dlt provides a dlt ai command that downloads agent-specific skills and an MCP (Model Context Protocol) server configuration for use with tools like Claude, Cursor, or Codex. The skills teach the agent how to use dlt. The MCP server lets the agent run queries against already-loaded data to validate pipeline correctness. This addresses a common failure mode where AI-generated pipeline code technically runs but pulls the wrong data. Context files for specific APIs (available via dlt init) also help agents avoid hallucinating API-specific details.

Q: Can dlt write directly to Iceberg on Altinity.Cloud?

Yes. dlt supports an Iceberg destination. To use it with Altinity.Cloud, you provide a URI for the Altinity Iceberg catalog, a warehouse name, and a bearer token. Altinity.Cloud includes a built-in Iceberg REST catalog as part of Project Antalya, which can be enabled in your environment with minimal configuration. The demo showed this working entirely on a free trial account with no credit card.

Q: What are the benefits of running dlt against Altinity.Cloud vs. managing ClickHouse yourself?

Altinity.Cloud handles the operational complexity: Kubernetes cluster provisioning, ClickHouse configuration, backups, consensus management via ClickHouse Keeper, and upgrades. The Altinity Kubernetes Operator for ClickHouse underpins every Altinity.Cloud deployment. For teams that want data to stay in their own cloud account, Altinity’s Bring Your Own Cloud (BYOC) offering keeps ClickHouse running inside your own VPC while Altinity handles operations.

Q: What is dlt runtime launch, and how does it work with Altinity?

dlt runtime launch is a command that runs a pipeline in the cloud rather than locally. It automatically switches from DuckDB to the configured production destination, which can be a ClickHouse instance on Altinity.Cloud. The command reads a production config file from the repository to determine connection details and then runs the pipeline remotely. This allows developers to prototype locally and promote to production without changing pipeline logic, and without managing a separate execution environment themselves.


© Altinity, Inc. All rights reserved. Altinity®, Altinity.Cloud®, and Altinity Stable® are registered trademarks of Altinity, Inc. ClickHouse® is a registered trademark of ClickHouse, Inc.; Altinity is not affiliated with or associated with ClickHouse, Inc. Kubernetes, MySQL, and PostgreSQL are trademarks and property of their respective owners.
