Welcome to 2024. You made it, and this year is going to be big for Spark, lakehouses, stream processing engines and streaming data in general.
I’ve had O’Reilly’s Stream Processing with Apache Spark, Streaming Systems and Stream Processing with Apache Flink on my shelves for ages. I wondered a while ago whether I’d missed the boat and everyone was implementing this stuff in secret. Turns out what I saw with Flink a few years ago was the Gartner Hype Cycle in full swing.
A few years ago everyone was talking about Flink, streaming systems, Google Dataflow/Apache Beam and so on. I suspect that this turned out to be the Peak of Inflated Expectations. That’s not to say large companies didn’t make great use of Flink, Spark Streaming and so on, but that wasn’t the norm. Since then we’ve been through the Trough of Disillusionment, which I don’t really think was that big a trough; it was more a case of other companies biding their time, waiting for the right use case to come along. These use cases are becoming more and more frequent, and with streaming systems becoming easier to deploy, the Slope of Enlightenment and the Plateau of Productivity are on their way.
Databricks have made huge strides in coming up with ways to ease the integration pain, driving the Data Lakehouse concept and trying to help companies adopt a framework that allows structured sharing of data within teams in a business (no, not a data mesh). Then we’ve had companies like Dremio come along with a slightly different offering, allowing query and data manipulation over different data sources. Again, this uses the lakehouse concept and cloud-based solutions to let businesses bring together data from various sources and process it as one.
On top of this we also have other, less invasive frameworks. DuckDB can give you excellent analytics performance over your structured data, be it flat files or processed Parquet data. And if you want an open source Spark distribution to run on your own hardware, Canonical have recently entered the market with their Data Fabric (you can check out our introductory foray into it here).
As we go through 2024 we’ll also see the Parquet-based table format wars play out, with most companies favouring the Iceberg format but Databricks out there on a one-company crusade with their Delta table format. Both are wrappers around Parquet, and each has its own advantages and disadvantages. We’ve also got Apache Hudi and, to wrap all of this, companies like Onetable trying to be the players in the interoperability space.
Of course, to use these platforms we need to get the data into them, which takes me back to the Plateau of Productivity. I’ve seen a number of analysts state that 2024 will be the year for Apache Flink to really kick on in the stream processing space. In the same area, though not directly related, we’ve also got Apache Kafka, Apache Pulsar and more.
In this roundup of products, I’ve deliberately left out cloud vendor offerings. They follow their own trajectory, and their usage probably depends on whether you’re heavily invested in a specific cloud vendor. The other big player not to ignore is, of course, Snowflake, who are the cloud-based warehouse vendor.
All of this means we have a rich feast of platforms we can leverage in 2024, but how do you choose, and which ones do you need? As we get into Q1 and beyond, we’ll be taking a real look at these platforms: what they offer, where to use them and where not to use them. Stay tuned!