Differential dataflow is a library that lets you write simple dataflow programs and then a) runs them in parallel and b) efficiently updates the outputs when new inputs arrive. Compared to competitors like spark and kafka streams, it can handle more complex computations and provides dramatically better throughput and latency while using much less memory.
But I'm only aware of a few companies that use it in production, even though it's been around for 5 years.
- It's missing some important feature, like persistence?
- It's had very little advertising?
- The api is too hard to use?
- The docs / tutorials are not good enough?
- Rust is intimidating?
- No company to provide paid support?
These all seem plausible, but it's not clear which are most important.
Even more surprising is that no one has copied the ideas into some enterprise-friendly java monstrosity - despite the fact that differential dataflow is open source and is explained in depth in many papers and blog posts.
If you considered using differential dataflow and decided against it, please let me know why.
(To clarify for many fine but confused commenters - I did not make differential dataflow. I'm just trying to find out what more needs to be done for it to be useful in that niche.)
Reasons given fell in a few buckets:
1. Never heard of differential dataflow
2. Want a complete drop-in solution (builtin integrations for various other tools, orchestration, monitoring, support, hosting etc.) rather than a choose-your-own-adventure library
3. Api too difficult / docs not good enough
4. Want to handle late-arriving data
Materialize is doing a good job with 1-3 already.
I think differential dataflow actually can handle 4, since it supports bitemporal timestamps, but this hasn't been well tested or advertised. That might be worth experimenting with. UPDATE: Frank McSherry posted a video demo.
All of the people in group 2 talked about typical data processing tasks, but people in group 3 had a much wider range of tasks including large-scale code analysis and monitoring systems with strong latency/consistency requirements.
Group 3 includes many people who seriously evaluated DD but couldn't get past the hello-world stage, as well as several people who are using DD because nothing else can handle their requirements, yet still complain that the api is difficult to use.
Api complaints included:
- where is all the state? where do all these map/reduce calls actually end up living?
- which operators are internally stateful? how much memory will this use? how can I monitor how much memory each operator is using?
- too many single-letter type variables with unhelpfully-named bounds
- hard to figure out why various traits (eg Data) are not being satisfied
- hard to know what methods are available on a collection because they're all in trait impls with complicated bounds
- losing track of column names when everything is a tuple
- how to feed live data in - examples all show loading static data from a file
- how to get data out, especially how to pull results instead of pushing them
- hard to integrate threaded workers with tokio executors
It sounds like there is some demand for a DD-like tool that:
- has a simplified, opinionated api
- is easy to call from other languages
- is easy to target as a compiler backend
- is easy to integrate into other event loops
This seems like a very distinct niche from the kafka/spark/flink niche that materialize is targeting - somewhere along a similar dimension to sqlite vs snowflake.