This Week’s Learning Nuggets (W23-2025)

Anuvrat Singh

Jun 09, 2025

Trying something new: quick, imperfect notes from what I read this week. If I got something wrong, let me know!

Nugget #1: Hugging Face Spark Connector

What I used to think:
- Loading datasets into Spark was always clunky: Python → Pandas → Spark, with temp files in the middle.
- Hugging Face was aiming for “big data” scale.

What changed:
- Now you can load and write datasets directly from the Hugging Face Hub with native Spark syntax. Much less friction.
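- But there's no automatic predicate pushdown: Spark filters aren't pushed down to the source for you. You have to pass filters as reader options instead, which is awkward and doesn't scale well.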
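To make the "filters as options" point concrete, here is a rough sketch of what a read looks like. It assumes PySpark 4 and the pyspark_huggingface package, and it assumes the "filters" and "columns" options accept a Python-literal list of (column, operator, value) tuples and a JSON list of column names; double-check the connector docs before relying on that. The helper names build_filters_option and read_hub_dataset are mine, not part of the connector.

```python
import json
from typing import Any, List, Tuple

# (column, operator, value), e.g. ("score", ">", 0.99)
Filter = Tuple[str, str, Any]


def build_filters_option(filters: List[Filter]) -> str:
    """Serialize filter triples into the string passed as the "filters" option.

    Assumption: the connector parses a Python-literal list of tuples.
    """
    return repr([tuple(f) for f in filters])


def read_hub_dataset(spark, repo_id: str, filters: List[Filter], columns: List[str]):
    """Read a Hub dataset into a Spark DataFrame, filtering at the source.

    `spark` is an existing SparkSession; `repo_id` is a Hub dataset id.
    """
    # Importing the package is what makes the "huggingface" format available.
    import pyspark_huggingface  # noqa: F401

    return (
        spark.read.format("huggingface")
        .option("filters", build_filters_option(filters))  # explicit, not pushed down by Spark
        .option("columns", json.dumps(columns))            # column pruning, also explicit
        .load(repo_id)
    )
```

The awkward part is visible right away: the filter lives in a string option, so Spark's optimizer never sees it, and a later DataFrame filter on the same column would still scan whatever the option let through.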

Takeaways:
- Great for quick, reproducible ML workflows and fine-tuning.
- Not built for foundation model training or petabyte-scale ETL.
- The simplicity is a win. Just don’t expect data warehouse-level performance.

Nugget #2: Apache Iceberg v3 – Unification

What I used to think:
- Deleting or updating rows in the lakehouse = maintenance headaches. Compaction jobs, file bloat, schema pain.

What changed:
- Deletion vectors: Deletes are recorded in a bitmap of row positions, at most one per data file. No more pileup of tiny delete files, and less frequent cleanup.
- Row lineage: Every row gets a stable ID. Easier deletes, real CDC, and reliable auditing.
- VARIANT columns: Native support for JSON/semi-structured data with path pushdown. Less ETL, more query flexibility.
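Here is a toy model of the deletion-vector idea, just to pin down the mechanics for myself. It is not Iceberg's implementation (as far as I understand, the real format stores compressed bitmaps in separate files next to the data); the DataFile class and its methods are made up for illustration. The point is that a delete flips bits and never rewrites the data file.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class DataFile:
    """Hypothetical stand-in for an immutable data file plus its deletion vector."""

    rows: List[Row]
    deletion_vector: int = 0  # bit i set => row at position i is deleted

    def is_deleted(self, pos: int) -> bool:
        return (self.deletion_vector >> pos) & 1 == 1

    def delete_where(self, predicate: Callable[[Row], bool]) -> int:
        """Mark matching live rows as deleted; return how many were newly marked."""
        newly_deleted = 0
        for pos, row in enumerate(self.rows):
            if not self.is_deleted(pos) and predicate(row):
                self.deletion_vector |= 1 << pos  # flip one bit, data stays untouched
                newly_deleted += 1
        return newly_deleted

    def scan(self) -> List[Row]:
        """Return only live rows, skipping positions set in the bitmap."""
        return [row for pos, row in enumerate(self.rows) if not self.is_deleted(pos)]


data_file = DataFile(rows=[
    {"id": 1, "user": "a"},
    {"id": 2, "user": "b"},
    {"id": 3, "user": "a"},
])
data_file.delete_where(lambda row: row["user"] == "a")
print(data_file.scan())  # only the row for user "b" remains visible
```

Repeated deletes against the same file keep updating the same bitmap instead of adding another small delete file, which is where the reduced file bloat comes from. The deleted rows still physically exist until something rewrites the file.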

Takeaways:
- Less ops overhead. Much of the delete-file maintenance goes away.
- Easier to track changes, do GDPR audits, or build real-time materialized views.
- Semi-structured data is now first-class, not a workaround.

That’s it for this week! Still figuring out some of the deeper details, but already rethinking a few assumptions. Always open to corrections or new reading recommendations. Drop them my way.
