This could do with some real-world application examples so I can understand where you might want to apply it.
Love the effort, but I would also love to see an “akka.rs” eventually make its way into the Rust ecosystem.
This is really exciting. Is anyone familiar with this space able to point to prior art? Have people built similar frameworks in other languages?
I know various people have worked on dataflow; I remember thinking Materialize was very cool, and I've used Kafka Streams at work before. Even then, I remember thinking that a framework probably made sense for stitching this all together.
At first glance it looks conceptually pretty similar to some work in the data-science space; I'm thinking of Spark (which they mention in their docs) and Dask.
My knee-jerk excitement is that this has the potential to be pretty powerful, specifically because it's based on Rust and so can play really nicely with other languages. Spark runs on the JVM, which is a good choice for portability but still introduces a bunch of complexity, and Dask runs in Python, which is a fairly hefty dependency you'd almost never bring in unless you're already on Python.
In terms of distributed Rust, I've also had a look at Lunatic before, which seems good but probably a bit more low-level than what Hydro is going for (although I haven't done anything beyond basic noodling around with it).
I was also going to say this looks similar to one layer of Dask: Dask takes arbitrary Python code and uses cloudpickle to serialise it in order to propagate dependencies to workers, and this seems to be an equivalent layer for Rust.
It looks like a mixture of Akka (https://getakka.net/, less enterprisey than the Java version), which is based on the actor model and has a focus on distributed systems, and reactive libraries like Rx (https://reactivex.io/). So maybe https://doc.akka.io/libraries/akka-core/current/stream/index... is the best fit.
Worth mentioning Pekko, the Akka fork.
https://pekko.apache.org/
This is a project out of the RISELab:
https://rise.cs.berkeley.edu/projects/
Most data-processing and distributed-systems work has some sort of link back to the research this lab has done.
So each "process" is deployed as a separate binary, and presumably runs as a separate OS process?
If so, this seems somewhat problematic in terms of increased overhead.
How is fast communication achieved? Some fast shared memory IPC mechanism?
Also, I don't see anything about integration with async. For better or worse, the overwhelming majority of code dealing with networking has migrated to async; you won't find good non-async libraries for many things that need networking.
By “distributed” I assumed it meant “distributed,” as in on entirely separate machines, thus necessitating each component running as an independent process.
How does this compare to using something like Ballista for data pipelines?
The latter benefits a lot from building on top of Apache Arrow and Apache DataFusion.
How does this compare to timely [0] in terms of data flow? Can you represent control flow like loops in the IR?
[0] https://github.com/TimelyDataflow/timely-dataflow
Their latest paper [0] refers to Naiad (timely dataflow) a few times, e.g.: "Inspired by ingress/egress nodes in Naiad [34], nested streams can be processed by nested dataflow graphs, which iteratively process chunks of data sourced from a larger stream with support for carrying state across iterations."
[0] https://hydro.run/papers/flo.pdf
Looks really cool, and I can see a few ways to use it, especially the deploy part, which seems unique. Looking forward to more fleshed-out documentation, especially the seemingly crucial Streams, Singletons, and Optionals part.