This Week’s Learning Nuggets (W23-2025)
Trying something new: quick, imperfect notes from what I read this week. If I got something wrong, let me know!
Nugget #1: Hugging Face Spark Connector
What I used to think:
Loading datasets into Spark was always clunky: Python → Pandas → Spark, with temp files in the middle.
I assumed Hugging Face was aiming for “big data” scale with its Spark integration.
What changed:
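Now you can read datasets from the Hugging Face Hub, and write them back, directly with Spark's usual DataFrame read/write syntax. Much less friction.
But there's no predicate pushdown: a filter on the DataFrame isn't pushed to the source, so you have to pass filters explicitly as reader options. Awkward, and it doesn't scale well.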
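Here's roughly what that looks like, as a minimal sketch rather than a definitive recipe. It assumes the pyspark_huggingface package, which (as I understand it) registers a "huggingface" format on import and accepts "columns" and "filters" options as Python-literal strings. The helper names (hub_read_options, read_from_hub, write_to_hub) are mine, not part of any library.

```python
def hub_read_options(columns=None, filters=None):
    """Build reader options. Spark options are strings, so lists are serialized.

    Assumption: the connector parses these strings as Python literals.
    """
    options = {}
    if columns is not None:
        options["columns"] = repr(list(columns))
    if filters is not None:
        options["filters"] = repr([tuple(f) for f in filters])
    return options


def read_from_hub(spark, repo_id, columns=None, filters=None):
    """Read a Hub dataset into a Spark DataFrame (hypothetical helper)."""
    # Assumption: importing the package makes the "huggingface" format available.
    import pyspark_huggingface  # noqa: F401

    reader = spark.read.format("huggingface")
    for key, value in hub_read_options(columns, filters).items():
        reader = reader.option(key, value)
    return reader.load(repo_id)


def write_to_hub(df, repo_id):
    """Write a Spark DataFrame back to the Hub (hypothetical helper)."""
    import pyspark_huggingface  # noqa: F401

    df.write.format("huggingface").save(repo_id)


# Example call (needs a SparkSession and network access, so left commented out):
# df = read_from_hub(spark, "stanfordnlp/imdb",
#                    columns=["text", "label"], filters=[("label", "=", 0)])
```

Note how the filter travels as an option string instead of a normal DataFrame filter. That's the awkward part: forget the option and Spark reads everything before filtering.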
Takeaways:
Great for quick, reproducible ML workflows and fine-tuning.
Not built for foundation model training or petabyte-scale ETL.
The simplicity is a win. Just don’t expect data warehouse-level performance.
Nugget #2: Apache Iceberg v3 – Unification
What I used to think:
Deleting or updating rows in the lakehouse = maintenance headaches. Compaction jobs, file bloat, schema pain.
What changed:
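Deletion vectors: Deletes are recorded in a bitmap of deleted row positions, one per data file, instead of piling up as lots of tiny delete files. Much less periodic cleanup.
Row lineage: Every row gets a stable ID, plus the sequence number of its last update. Easier deletes, real CDC, and reliable auditing.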
VARIANT columns: Native support for JSON/semi-structured data with path pushdown. Less ETL, more query flexibility.
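To make the deletion vector idea concrete, here's a toy model in plain Python. It is not Iceberg's actual on-disk format or API, just the concept: the data file stays immutable, and a per-file bitmap marks which row positions are gone. The DeletionVector class and the sample rows are made up for illustration.

```python
class DeletionVector:
    """Toy deletion vector: bit i set means row position i is deleted."""

    def __init__(self):
        self.bits = 0  # a Python int doubles as an arbitrary-size bitmap

    def delete(self, position):
        self.bits |= 1 << position

    def is_deleted(self, position):
        return ((self.bits >> position) & 1) == 1

    def live_rows(self, rows):
        """Apply the bitmap at read time; the data file itself is never rewritten."""
        return [row for pos, row in enumerate(rows) if not self.is_deleted(pos)]


data_file = ["alice", "bob", "carol", "dave"]  # stands in for an immutable data file
dv = DeletionVector()
dv.delete(1)
dv.delete(3)
dv.delete(1)  # deleting the same row again changes nothing: still one bitmap

print(dv.live_rows(data_file))  # ['alice', 'carol']
```

However many deletes hit the file, there's still a single bitmap to merge at read time, which is the contrast with accumulating one small delete file per operation.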
Takeaways:
Less ops overhead. Far less delete-file cleanup to schedule, though I wouldn't assume compaction goes away entirely.
Easier to track changes, do GDPR audits, or build real-time materialized views.
Semi-structured data is now first-class, not a workaround.
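On the change-tracking point, a small sketch of why stable row IDs help. This is a toy diff over two in-memory snapshots, not how any engine actually computes CDC. The field names _row_id and _last_updated_sequence_number follow the v3 row lineage fields as I understand them; diff_snapshots and the sample data are hypothetical.

```python
def diff_snapshots(old_rows, new_rows):
    """Classify changes between two snapshots using row lineage fields."""
    old = {row["_row_id"]: row for row in old_rows}
    new = {row["_row_id"]: row for row in new_rows}

    inserted = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    # Same ID in both snapshots, but a newer sequence number means an update.
    updated = sorted(
        rid
        for rid in new.keys() & old.keys()
        if new[rid]["_last_updated_sequence_number"]
        > old[rid]["_last_updated_sequence_number"]
    )
    return {"inserted": inserted, "updated": updated, "deleted": deleted}


snapshot_1 = [
    {"_row_id": 0, "_last_updated_sequence_number": 1, "email": "a@example.com"},
    {"_row_id": 1, "_last_updated_sequence_number": 1, "email": "b@example.com"},
    {"_row_id": 2, "_last_updated_sequence_number": 1, "email": "c@example.com"},
]
snapshot_2 = [
    {"_row_id": 0, "_last_updated_sequence_number": 1, "email": "a@example.com"},
    {"_row_id": 1, "_last_updated_sequence_number": 2, "email": "b@new.example.com"},
    {"_row_id": 3, "_last_updated_sequence_number": 2, "email": "d@example.com"},
]

changes = diff_snapshots(snapshot_1, snapshot_2)
print(changes)  # {'inserted': [3], 'updated': [1], 'deleted': [2]}
```

Without a stable ID, an update looks like an unrelated delete plus insert, which is what makes audits and incremental views painful.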
That’s it for this week! Still figuring out some of the deeper details, but already rethinking a few assumptions. Always open to corrections or new reading recommendations. Drop them my way.