11 Comments
Jul 5, 2023 · Liked by Brian "bits" Olsen

FYI - Delta Lake has had its spec published since the project was open sourced in 2019:

https://github.com/delta-io/delta/blob/master/PROTOCOL.md

author

Right, and again, Delta Lake is a much more formidable spec. So the question comes down to choosing a spec that everyone can (or simply has to) agree on. I've been surprised before, but I don't imagine Snowflake will ever come to a point where they can agree to build around the Delta Lake spec. The same goes for other engines that see Databricks as competition.

I'm not saying it's correct to view Delta Lake through such a lens, as there's a good amount of diversity in the maintainers (https://delta.io/community), but looking at it from the practical side of which spec is now most likely to be the most widely adopted, I believe it's clearly the Iceberg spec.

This, again, doesn't invalidate the work that Delta Lake is doing, nor their spec; it's simply a product of concern over Databricks' influence over the project. I don't personally hold a strong opinion on this, but I do empathize with running a project where the optics seem to overshadow what's actually going on inside.


Fundamentally, not everyone will agree on a common spec (or simply have to). Companies run heterogeneous tech stacks for many reasons. There are many companies that run both Delta and Iceberg. For them, winning means interoperability across this heterogeneous landscape. Delta UniForm helps to unify all three of the lakehouse formats, which is ultimately what customers are looking for.

author

These three table formats exist as iterations of Hive because most people using data didn't want to learn new things and only wanted to use SQL, which is itself a spec. Vendors will always have incentives to skirt a spec to hold onto lock-in.

Perhaps if you're Apple, Uber, or AWS, you can throw everything at the wall and see what sticks, but that's not considering the larger community. What I've heard from users in the Trino and data community is a preference to get down to fewer systems doing the same things and to have the option to move their use cases between these systems without a large cost involved. That is done through standards, and it's largely why SQL (as annoying as it is to follow) still lives today. Dislike specs all you want; the more you fight the demand, the more you fall prey to lock-in thinking.

I think the Delta Lake protocol is fine; it's just not gonna be the spec that gets mass adoption, and Databricks' decision with UniForm is just dipping a toe into the inevitable.

Jul 5, 2023 · Liked by Brian "bits" Olsen

Time will tell.

For the record, I love specs :)

Jul 5, 2023 · Liked by Brian "bits" Olsen

Very interesting read. As a person using Hudi at our org, I was very much inclined towards the Hudi table format. Can you shed more light on Hudi exposing Java dependencies?

author

I started having my doubts after seeing some of the conversations the Trino community had when implementing the Hudi connector. That gave me some negative intuition about the Hudi implementation. (https://github.com/trinodb/trino/pull/10228#discussion_r959007161)

I came to find out that Hudi isn't portable, with the spec supporting four different kinds of Java-specific serialization: Kryo, Hadoop Writable (where serialization is defined by Java code), DataOutputStream (which defines "modified UTF-8"), and HBase's HFile. (https://hudi.apache.org/tech-specs/#log-file-format, https://hudi.apache.org/tech-specs/#hfile-block-id-5)
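To make the portability concern concrete, here's a rough Python sketch of what a non-JVM reader has to do just to parse a string written by Java's DataOutputStream.writeUTF, which is the "modified UTF-8" mentioned above (the function name and demo bytes are mine, not from the Hudi spec):

```python
import struct

def read_java_modified_utf8(buf: bytes, offset: int = 0) -> tuple[str, int]:
    """Rough sketch: decode a string written by Java's DataOutputStream.writeUTF.

    writeUTF prefixes the payload with an unsigned 2-byte big-endian length,
    encodes U+0000 as the two bytes 0xC0 0x80, and encodes characters outside
    the BMP as CESU-8 style surrogate pairs -- none of which a plain UTF-8
    decoder handles out of the box.
    """
    (length,) = struct.unpack_from(">H", buf, offset)   # 2-byte length prefix
    payload = buf[offset + 2 : offset + 2 + length]

    # Undo the modified-UTF-8 encoding of NUL before decoding.
    payload = payload.replace(b"\xc0\x80", b"\x00")

    # "surrogatepass" tolerates the CESU-8 style surrogate pairs; pairing the
    # surrogates back into real code points is left out of this sketch.
    text = payload.decode("utf-8", errors="surrogatepass")
    return text, offset + 2 + length

# e.g. Java's writeUTF("abc") produces b"\x00\x03abc"
print(read_java_modified_utf8(b"\x00\x03abc"))  # ('abc', 5)
```

None of this is hard, but every non-JVM implementation has to rediscover and reimplement these Java-isms byte for byte.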

To add more flavor to my transaction semantics comment, there are serious drawbacks to using Hudi for CDC, since it doesn't support multiple writers safely and doesn't correctly isolate readers from writers. Hudi is only eventually consistent when used for CDC, and because it batches operations, it loses the history of granular changes, which is one of the reasons many are shifting to CDC today. (https://cwiki.apache.org/confluence/display/HUDI/RFC+-+22+%3A+Snapshot+Isolation+using+Optimistic+Concurrency+Control+for+multi-writers#RFC22:SnapshotIsolationusingOptimisticConcurrencyControlformultiwriters-Guarantees)

Based on my current understanding (which I'll admit is limited on the Hudi front), I just don't get a good intuition about the project. It feels more fruitful for the larger community to narrow in on one spec and as few implementations as the community finds useful. Again, I'm totally open to hearing defenses from Hudi folks on this and to broadening my vantage point.


We've looked at both Delta Lake and Iceberg. What stood out was that Delta Lake has implementations in non-JVM languages (e.g. Rust, and subsequently bindings to Python, etc.). Delta also now has a pure Java API that doesn't rely on Spark. This is a particularly useful feature, since we either don't want a dependency on Spark or want to use Delta from a non-JVM language.
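For example, something roughly like this works against the delta-rs Python bindings with no JVM involved at all (the table URI is just a placeholder):

```python
# Sketch of reading a Delta table without a JVM or Spark, using the
# Rust-based delta-rs bindings (pip install deltalake).
from deltalake import DeltaTable

dt = DeltaTable("s3://my-bucket/path/to/delta-table")  # placeholder location
print(dt.version())   # current table version from the Delta log
print(dt.files())     # data files in the current snapshot
df = dt.to_pandas()   # materialize the snapshot as a pandas DataFrame
```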

The last time we looked at Iceberg, there was no way to use it without something like Spark or another "integration". This was quite surprising. The docs state: "Spark is currently the most feature-rich compute engine for Iceberg operations." The Java API seems to be a bit secondary, which is disappointing...

author

These are implementation details that are being addressed in Iceberg. There is a collaboration on a Rust API, and the PyIceberg implementation is already maturing rapidly.
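For example, reading an Iceberg table from Python without Spark already looks roughly like this with PyIceberg (the catalog, namespace, and table names are placeholders):

```python
# Rough sketch of reading an Iceberg table from Python without Spark,
# using PyIceberg (pip install pyiceberg). Names below are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")               # resolved from PyIceberg config
table = catalog.load_table("analytics.events")  # placeholder namespace.table
arrow_table = table.scan().to_arrow()           # scan data files into an Arrow table
```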

This article focuses on the larger implications of the open spec. Each implementation can go tit for tat on who has the best X today, and that information will quickly become outdated. That's why I wanted this article to touch on the bigger picture.


> Hudi’s “spec” exposes Java dependencies making it unusable for any system not running on the JDK, doesn’t clarify schema,

I think the actual line in reference is this?

> the other metadata in the block is serialized using the Java DataOutputStream (DOS) serializer.

This does NOT mean that only Java's DataOutputStream can write this metadata, or that it's in Java-serialized form. What we write out is bytes. One can argue theoretically about endianness, but that's all there is to it. I finally understand where the comments from the Iceberg devs are coming from. Let me clarify the spec more. Thanks!

> has a lot of implementation details rather than leaving that to the systems that implement it.

What's pointed to as implementation details is basically a protocol/algorithm to read the timeline, similar to what Delta has. It's actually pretty common in distributed system literature.

author

See my comment (https://bitsondatadev.substack.com/p/iceberg-won-the-table-format-war/comment/18152049); there are just a lot of leaky abstractions that bubble up from Hadoop when using these libraries, along with the multitude of serializers mentioned in the spec, from HFile and DataOutputStream to Hadoop Writable and Kryo. It's just easy to get things wrong in this type of environment IMO.

> It's actually pretty common in distributed system literature.

Agreed, but it shouldn't be used as a spec. A spec is free of implementation details; a distributed system architecture doc is where you get into these details.
