

© 2026 Engineering Journey

Data Engineering
dataengineering
iceberg
metadata

Metadata Over Compute: Visualizing Table History with dlt and Iceberg

March 22, 2026
5 min read

Aman G.



If you’ve been following the Tabular and Databricks news, you’ve probably heard the phrase “metadata over compute.” It sounds like marketing, but it’s a real architectural idea. In this post we use dlt (data load tool) to ingest into Apache Iceberg and watch how the table’s metadata evolves, and why it stays independent of the catalog.


The Thesis

Most data platforms tie table metadata to the compute engine. Your schema, partition info, and history live inside Snowflake, BigQuery, or Databricks. Move tables, and you leave that metadata behind.

Apache Iceberg inverts that. The source of truth is the metadata itself: JSON and Avro files that sit next to your data. The catalog (Hive, Glue, Snowflake Polaris, Tabular) is just a pointer: “this table’s metadata lives here.”

The metadata is portable: you can swap catalogs without losing history or lineage. That’s “metadata over compute” in practice.
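A minimal sketch of that idea: the catalog is nothing more than a name-to-location mapping, while the history lives in the metadata file itself. The field names below are a hypothetical slice of a metadata.json; real files carry many more fields (schemas, partition specs, and so on).

```python
import json

# A minimal, hypothetical slice of an Iceberg metadata.json file.
metadata_json = json.dumps({
    "table-uuid": "d9a9b1c0-0000-0000-0000-000000000000",
    "location": "s3://lake/events",
    "current-snapshot-id": 2,
    "snapshots": [
        {"snapshot-id": 1, "parent-snapshot-id": None, "operation": "append"},
        {"snapshot-id": 2, "parent-snapshot-id": 1, "operation": "append"},
    ],
})

# Two different "catalogs": each is just a name -> metadata-location mapping.
hive_catalog = {"events": "s3://lake/events/metadata/00001-abc.metadata.json"}
polaris_catalog = dict(hive_catalog)  # "migrating" = copying the pointer

# The history travels with the file, not with the catalog.
snapshots = json.loads(metadata_json)["snapshots"]
print(len(snapshots))                                       # 2
print(hive_catalog["events"] == polaris_catalog["events"])  # True
```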


Operation → Metadata, in Succession


We use dlt’s filesystem destination with `table_format="iceberg"`. Same pipeline, three runs.

1. Append 5 rows

```python
@dlt.resource(table_format="iceberg", table_name="events")
def events_append():
    yield [{"id": 1, "event": "deploy", "ts": "..."}, ...]

pipeline.run(events_append())
```

Metadata responds:

```text
metadata/ has 2 file(s):
  00000-....metadata.json
  00001-....metadata.json
current-snapshot-id: 8570791162370934054
snapshots (history): 1
  [1] append: parent=None added=5 deleted=0 total_records=5
```

Each commit creates a new metadata file. Iceberg keeps the old ones.
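That accumulation can be modeled in a few lines; the file names here are illustrative, not Iceberg’s real naming scheme:

```python
# Toy model of metadata-file accumulation: each commit writes a new
# numbered metadata file and keeps every previous one.
metadata_files = []

def commit(seq):
    metadata_files.append(f"{seq:05d}-xxxx.metadata.json")

commit(0)  # table creation
commit(1)  # first append
print(metadata_files)
# ['00000-xxxx.metadata.json', '00001-xxxx.metadata.json']
```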

2. Overwrite 2 rows (id 1 and 2 get new values)

```python
@dlt.resource(
    table_format="iceberg",
    table_name="events",
    write_disposition="merge",
    primary_key="id",
)
def events_overwrite():
    yield [{"id": 1, "event": "deploy-v2", ...}, {"id": 2, "event": "build-v2", ...}, ...]

pipeline.run(events_overwrite())
```

Metadata responds:

```text
snapshots (history): 3
  [1] append: added=5 deleted=0 total_records=5
  [2] delete: added=0 deleted=5 total_records=0
  [3] append: added=5 deleted=0 total_records=5
```

Merge is modeled as delete + append. The chain is explicit.
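A toy reconstruction of the snapshot summaries above makes the arithmetic explicit: the running record count goes 5, 0, 5 because the merge is a delete snapshot followed by an append snapshot.

```python
# Each snapshot summary carries added/deleted counts; the table's record
# count at any snapshot is the running sum along the chain.
snapshots = [
    {"op": "append", "added": 5, "deleted": 0},
    {"op": "delete", "added": 0, "deleted": 5},
    {"op": "append", "added": 5, "deleted": 0},
]

total = 0
for s in snapshots:
    total += s["added"] - s["deleted"]
    print(s["op"], "-> total_records =", total)
# append -> total_records = 5
# delete -> total_records = 0
# append -> total_records = 5
```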

3. Delete 1 row (id == 3)

```python
@dlt.resource(table_format="iceberg", table_name="events", write_disposition="replace")
def events_after_delete():
    yield [{"id": 1, ...}, {"id": 2, ...}, {"id": 4, ...}, {"id": 5, ...}]

pipeline.run(events_after_delete())
```

Metadata responds:

```text
snapshots (history): 4
...
[4] overwrite: added=4 deleted=5 total_records=4
```

Replace drops the old data and writes only what remains. Final state: 4 rows. The full chain stays in the metadata.
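Summing the full chain from all three runs confirms the final state, using the added/deleted counts from the snapshot summaries above:

```python
# (operation, added, deleted) for each of the four snapshots.
chain = [
    ("append",    5, 0),
    ("delete",    0, 5),
    ("append",    5, 0),
    ("overwrite", 4, 5),
]

# The final record count is the net of the whole chain: 5 - 5 + 5 - 1 = 4.
total = sum(added - deleted for _, added, deleted in chain)
print(total)  # 4
```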


Where Does the Data Live?

The metadata points to the Parquet files. It doesn’t store the column values themselves.
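A toy model of what the manifests record, pointers to data files plus per-file statistics; the field names are illustrative, not the exact Iceberg manifest schema:

```python
# Manifests reference Parquet data files and carry stats about them;
# the column values themselves live only in the Parquet files.
manifest_entries = [
    {"file_path": "data/part-00000.parquet", "record_count": 2},
    {"file_path": "data/part-00001.parquet", "record_count": 2},
]

print(sum(e["record_count"] for e in manifest_entries))                  # 4
print(all(e["file_path"].endswith(".parquet") for e in manifest_entries))  # True
```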

Why This Matters


1. Portability - Metadata is just files. You can move them to another catalog (Polaris, Tabular) and keep the same history and lineage.
2. Auditability - `snapshot-log` and `metadata-log` give a clear record of what changed and when.
3. Time travel - Snapshots form a chain via `parent`. You can query the table as it looked at any past snapshot.
4. Catalog independence - The catalog is only a mapping from table name to metadata location. The same metadata files work with Hive, Glue, Polaris, or Tabular.
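The time-travel point can be sketched as a walk up the parent pointers: resolving any snapshot’s ancestry is just following `parent` links back to the first commit. The snapshot IDs below are illustrative.

```python
# Snapshots form a linked chain via parent pointers; time travel is
# picking an ancestor snapshot and reading the state it references.
snapshots = {
    1: {"parent": None, "op": "append"},
    2: {"parent": 1,    "op": "delete"},
    3: {"parent": 2,    "op": "append"},
    4: {"parent": 3,    "op": "overwrite"},
}

def ancestry(snapshot_id):
    """Walk parent pointers from a snapshot back to the first commit."""
    chain = []
    while snapshot_id is not None:
        chain.append(snapshot_id)
        snapshot_id = snapshots[snapshot_id]["parent"]
    return chain

print(ancestry(4))  # [4, 3, 2, 1]
```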
