Why isn't differential dataflow more popular?

Published 2021-01-21

Differential dataflow is a library that lets you write simple dataflow programs and then a) runs them in parallel and b) efficiently updates the outputs when new inputs arrive. Compared to competitors like spark and kafka streams, it can handle more complex computations and provides dramatically better throughput and latency while using much less memory.
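To make "efficiently updates the outputs" concrete, here is a toy sketch of the general idea in python. It is not differential dataflow's actual api or implementation. The class name `IncrementalCount` and the `(key, diff)` change format are made up for illustration. The point is only that changes go in and changes come out, so work is proportional to the size of the change rather than the size of the whole input.

```python
from collections import Counter


class IncrementalCount:
    """Toy incrementally-maintained count-per-key (illustrative only)."""

    def __init__(self):
        self.counts = Counter()

    def update(self, changes):
        """Apply input changes and return only the output changes.

        changes: iterable of (key, diff), where diff is +1 for an insert
        and -1 for a delete.
        Returns a list of ((key, count), diff) describing how the output
        collection of (key, count) rows changed.
        """
        touched = {}
        for key, diff in changes:
            if key not in touched:
                touched[key] = self.counts[key]  # remember the old count
            self.counts[key] += diff

        out = []
        for key, old in touched.items():
            new = self.counts[key]
            if new == old:
                continue  # changes cancelled out, nothing to report
            if old != 0:
                out.append(((key, old), -1))  # retract the old output row
            if new != 0:
                out.append(((key, new), +1))  # assert the new output row
            else:
                del self.counts[key]  # don't keep state for absent keys
        return out
```

Feeding it `[("a", +1), ("b", +1), ("a", +1)]` and then `[("a", -1)]` only touches the row for `"a"` on the second call. The row for `"b"` is never recomputed.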

But I'm only aware of a few companies that use it in production, even though it's been around for 5 years.

Possible explanations:

These all seem plausible, but it's not clear which are most important.

Even more surprising is that no one has copied the ideas into some enterprise-friendly java monstrosity - despite the fact that differential dataflow is open source and is explained in depth in many papers and blog posts.

I'm interested because materialize is expending a huge amount of effort adding a SQL layer on top of differential dataflow. That's all very well for people who like SQL, but I'm curious whether there are also potential users who would have been perfectly happy with javascript/python/R bindings and a good tutorial. There are probably multiple niches to be served here.

If you considered using differential dataflow and decided against it, please let me know why.

I got feedback in the form of ~20 emails and ~100 comments on hn and lobsters. Thanks to everyone who took the time to reach out - it was very helpful.

(To clarify for many fine but confused commenters - I did not make differential dataflow. I'm just trying to find out what more needs to be done for it to be useful in that niche.)

Reasons given fell in a few buckets:

  1. Never heard of differential dataflow
  2. Want a complete drop-in solution (built-in integrations for various other tools, orchestration, monitoring, support, hosting, etc.) rather than a choose-your-own-adventure library
  3. API too difficult / docs not good enough
  4. Want to handle late arriving data

Materialize is doing a good job with 1-3 already.

I think differential dataflow actually can handle 4, since it can handle bitemporal timestamps, but this isn't something that has been well tested or advertised. That might be worth experimenting with. UPDATE: Frank McSherry posted a video demo.
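As a rough illustration of why bitemporal timestamps help with late data, here is a toy python sketch. It is not differential dataflow's api. `BiTime`, `less_equal` and `accumulate` are names invented for this example, and I'm assuming each update carries both an event time (when it happened) and a system time (when we learned about it), compared coordinate by coordinate.

```python
from typing import NamedTuple


class BiTime(NamedTuple):
    event: int   # when the thing happened
    system: int  # when we found out about it


def less_equal(a, b):
    """Product partial order: a <= b only if both coordinates are <=."""
    return a.event <= b.event and a.system <= b.system


def accumulate(updates, at):
    """Sum the diffs of every update whose timestamp is <= `at`.

    updates: iterable of (record, BiTime, diff).
    Returns {record: count} for records with a non-zero count.
    """
    total = {}
    for record, time, diff in updates:
        if less_equal(time, at):
            total[record] = total.get(record, 0) + diff
    return {r: d for r, d in total.items() if d != 0}


updates = [
    ("order-1", BiTime(event=1, system=1), +1),
    # Late arrival: happened at event time 1, but only seen at system time 3.
    ("order-2", BiTime(event=1, system=3), +1),
]
```

Asking "what did we believe about event time 1 as of system time 2?" gives just `order-1`. Asking the same question as of system time 3 also includes the late `order-2`. Nothing has to be thrown away or rewritten when the late record shows up, because it is just another update at a later system time.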

All of the people in group 2 talked about typical data processing tasks, but people in group 3 had a much wider range of tasks including large-scale code analysis and monitoring systems with strong latency/consistency requirements.

Group 3 includes many people who seriously evaluated differential dataflow (DD) but couldn't get past the hello world stage. It also includes several people who are using DD because nothing else can handle their requirements, but who still complain that the API is difficult to use.

API complaints included:

It sounds like there is some demand for a DD-like tool that:

This seems like a very distinct niche from the kafka/spark/flink niche that materialize is targeting - somewhere along a similar dimension to sqlite vs snowflake.