Disclaimer: I am the Head of Developer Relations at Tabular and a Developer Advocate in both the Apache Iceberg and Trino communities. All of my 🌶️ takes are my biased opinion and not necessarily the opinion of Tabular, the Apache Software Foundation, the Trino Software Foundation, or the communities I work with. This post also gets into a bit of my personal story about leaving my previous company, but it relates to my reasoning, so I offer a TL;DR if you don’t care about the details.
TL;DR: I believe Apache Iceberg won the table format wars, not because of a feature race, but primarily because of the open Iceberg spec. Some features are available only in Iceberg because it broke compatibility with Hive, and that break was also a contributing factor in the adoption of the implementation.
My revelation with Iceberg
Two months ago, I made the difficult decision to leave Starburst, which was hands down the best job I’ve ever had up to this point. Since I left, I’ve gotten a lot of questions about my motivations and wanted to put some concerns to rest. The role allowed me to get deeply involved in open source during working hours and showed me how I could help the community gain traction on their work and help drive the roadmap for many in the project. This was a new calling that overlapped with many altruistic parts of how I define myself, and it was deeply rewarding.
I made some incredible friends, some of whom have become invaluable mentors in the process of learning the nuances and interplay between venture capital and an open-source community. So why did I leave this job that I love so much?
Apache Iceberg Baby
Let’s time-travel (pun intended) to the first Iceberg episode of the Trino Community Broadcast. In true ADHD form, I crammed learning about Apache Iceberg well into the night before the broadcast, which featured the creator of Iceberg, Ryan Blue. While setting up that demo, I really started to understand what a game-changer Iceberg was. I had heard Trino users and maintainers talk about Iceberg replacing Hive, but it just didn’t sink in for the first couple of months. I mean really, what could be better than Hive? 🥲
While researching, I learned about hidden partitioning, schema evolution, and, most importantly, the open specification. The whole package was such an elegant solution to problems that had caused me, and many in the Trino community, failed deployments and late-night calls. Just as I’d had the epiphany with Trino (Presto at the time) about how big a productivity booster SQL queries over multiple systems were, I had a similar experience with Iceberg that night. Preaching the combination of these two became somewhat of a mission of mine after that.
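To make those features concrete, here’s a minimal sketch of what hidden partitioning and schema evolution look like through Trino’s Iceberg connector (the catalog, schema, table, and column names are all hypothetical):

```sql
-- Partitioning is declared once and lives in Iceberg metadata,
-- not in directory paths or in your queries.
CREATE TABLE iceberg.sales.orders (
    order_id   BIGINT,
    total      DOUBLE,
    order_date DATE
)
WITH (partitioning = ARRAY['month(order_date)']);

-- Hidden partitioning: filter on the raw column and Iceberg prunes
-- data files for you; no year/month pseudo-columns to remember.
SELECT sum(total)
FROM iceberg.sales.orders
WHERE order_date >= DATE '2022-01-01';

-- Schema evolution: a metadata-only change, no rewriting of data files.
ALTER TABLE iceberg.sales.orders ADD COLUMN customer_id BIGINT;
```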
ANSI SQL + Iceberg + Parquet + S3
Immediately after that show, I wrote a four-part blog series on Trino and Iceberg, gave a talk, and built out the getting-started repository for Iceberg. I was hooked on the thought of these two technologies working in tandem. You start with a system that can connect to any data source you throw at it and sees each one as yet another SQL table. Take that system and add a table format that interacts with all the popular analytics engines people use today, from Spark, Snowflake, and DuckDB to Trino variants like EMR, Athena, and Starburst.
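As a rough illustration of the “everything is just another SQL table” idea, here’s the kind of federated query this stack enables, sketched in Trino SQL with hypothetical catalog, schema, and table names:

```sql
-- "pg" is a catalog backed by the PostgreSQL connector; "lake" is a
-- catalog backed by the Iceberg connector. Both read as plain SQL tables.
SELECT c.name, sum(o.total) AS lifetime_value
FROM pg.public.customers AS c
JOIN lake.sales.orders AS o
  ON c.customer_id = o.customer_id
GROUP BY c.name
ORDER BY lifetime_value DESC;
```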
Standards all the way down
This approach to data virtualization is so interesting because each layer gives you full leverage over vendors trying to lock you into their particular query language or storage format. It shifts the incentives so that vendors support these open standards, which puts them in a seemingly vulnerable position compared to locking users in. However, that vulnerability is a fallacy I hope vendors will slowly come to recognize. With the de facto open S3 storage standard, open file formats like Parquet, open table formats like Iceberg, and the ANSI SQL spec closely followed by Trino, the entire analytics warehouse has become a modular stack of truly open components. This is open not just in the sense of the freedom to use and contribute to a project, but in the sense of an open standard that gives you the freedom to simply move between different projects.
This new freedom gives users the features of a data warehouse, the scalability of the cloud, and a free market of a la carte services to handle their needs at whatever price point they require at any given time. All you need to do in this new ecosystem is shop around and choose any open project or vendor that implements the open standard, and your migration cost will be practically non-existent. This is the ultimate definition of future-proofing your architecture.
Back to why I left Starburst
Trino Community
I’ll quickly tie up why I left Starburst before I reveal why Iceberg won the table format wars. For the last three years, I have worked on building awareness around Trino. My partner in crime, Manfred Moser, had long been in the Trino community and literally wrote the book on Trino. Together we spent long days and nights growing the Trino community. I loved every minute of it and honestly didn’t see myself leaving Starburst, or shifting focus from Trino, until Trino became a standard across analytics organizations.
Something became apparent, though. The Trino community’s health was thriving, and many organic product movements were taking place in the community. Cole Bowden was streamlining the Trino release process, getting us to cut Trino releases every one to two weeks, which is nearly unprecedented in open source. Cole, Manfred, and I did a manual scan over the open pull requests and gracefully closed or revived the outdated or abandoned ones. The Trino community is in great shape.
Iceberg Community
As I looked at Iceberg, adoption and awareness were growing at an unprecedented rate, with Snowflake, BigQuery, Starburst, and Athena all announcing support between 2021 and 2022. However, nothing was moving the needle forward from a developer experience perspective. Sam Redai had done some amazing initial work, but there was still so much to be done. I noticed the Iceberg documentation needed improvement, and while many vendors were advocating for Iceberg, nobody was putting consistent work into the vanilla Iceberg site. PMC members like Ryan Blue, Jack Ye, Russell Spitzer, Dan Weeks, and many others do a great job of collectively driving roadmap features for Iceberg, but no individual currently has the time to dedicate to the cat herding, improving communication in the project, or bettering the developer and contributor experience. Since Trino was on stable ground, it felt imperative to move to Iceberg and fill in these gaps. When Ryan approached me with a Head of DevRel position at Tabular, I couldn’t pass up the opportunity. To be clear, I left Starburst, but not the Trino community. Working on Iceberg also helps my mission of continuing to forge together these two technologies I believe in so much.
Tell us why Iceberg won already!
Moving back to the meat of the subject. My first blog in the Trino community covered what I once called the invisible Hive spec, to alleviate confusion around why Trino would need a “Hive” connector if it’s a query engine itself. We called the Hive connector that because it translated Parquet files sitting in an S3 object store into tables that could be read and modified by a query engine that knew the Hive spec. This had nothing to do with Hive the query engine; it was about Hive the spec. Why was the Hive spec invisible 👻? Because nobody wrote it down. It lived in the minds of the engineers who put Hive together, spread across Cloudera forums by engineers who had bashed their heads against the wall and reverse-engineered the binaries to understand this “spec”.
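For the unfamiliar, the heart of that unwritten convention is a directory layout. Here’s a rough sketch in Hive DDL, with a hypothetical bucket and table, of what every engine had to agree on by convention alone:

```sql
-- A classic Hive-style table: the schema lives in a metastore, and
-- partition values are encoded in the directory paths themselves.
CREATE TABLE orders (order_id BIGINT, total DOUBLE)
PARTITIONED BY (order_date STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/warehouse/orders/';

-- Engines that "speak Hive" infer partitions from paths like:
--   s3://my-bucket/warehouse/orders/order_date=2022-06-01/part-00000.parquet
-- and discover a table's data by listing those directories.
```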
Why do you even need a spec?
Having an “invisible” spec was rather problematic, as every implementation of that spec ran on different assumptions. I spent days searching Hortonworks/Cloudera forums for Hive solutions, trying to fix an error with answers that, for whatever reason, didn’t work on my instance of Hive. There were also implicit dependencies required to use the Hive spec. It’s actually incorrect to call the Hive spec a spec at all: a true spec should be independent of platform, programming language, and hardware, and should include minimal implementation details beyond what’s required for interoperability between implementations. SQL, for instance, is a declarative language implemented by systems written in C++, Rust, and Java, and it doesn’t get into the business of telling query engines how to answer a query, just what behavior is expected.