IshKebab 5 months ago

This could do with some real-world application examples so I can understand where you might want to apply it.

  • shadaj 5 months ago

    These code examples aren't fully documented yet (which is why we've not linked them in the documentation), but you can take a look at a (more-real) implementation of Paxos here: https://github.com/hydro-project/hydro/blob/main/hydro_test/.... We're also working on building more complex applications like a key-value store.

    • IshKebab 5 months ago

      I meant concrete real-world applications like "a WhatsApp clone" or whatever. Paxos is very abstract. Nobody deploys "a paxos".

sebstefan 5 months ago

If there's an intermediary language with its own runtime in the middle, does that mean we lose everything Rust brings?

I thought this would introduce the language to choreograph separate Rust binaries into a consistent and functional distributed system, but it looks more like you're writing DFIR the whole way through, not just as glue.

  • shadaj 5 months ago

    Hi, I'm one of the PhD students leading the work on Hydro!

    DFIR is more of a middle-layer DSL that allows us (the high-level language developers) to re-structure your Rust code to make it more amenable to low-level optimizations like vectorization. Because DFIR operators (like map, filter, etc.) take in Rust closures, we can pass those through all the way from the high-level language to the final Rust binaries. So as a user, you never interact with DFIR.
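
    Purely as an illustration (plain Rust iterators, not Hydro's actual API): the closures you pass to operators like map and filter are ordinary Rust closures, and that's what lets us carry them through unchanged into the final binaries.

        fn main() {
            // Standard Rust, illustrative only: the closures given to filter/map
            // are the same kind of user-written closures that dataflow operators
            // accept and thread through to the generated code.
            let requests = vec!["GET /a", "GET /b", "POST /c"];
            let gets: Vec<&str> = requests
                .into_iter()
                .filter(|r| r.starts_with("GET"))        // user closure
                .map(|r| r.trim_start_matches("GET "))   // another user closure
                .collect();
            assert_eq!(gets, vec!["/a", "/b"]);
        }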

  • cess11 5 months ago

    DFIR is implemented in Rust, if that's what you're asking.

    • sebstefan 5 months ago

      I can implement Lua in Rust and lose everything Rust brings when I code in Lua.

djtango 5 months ago

This is really exciting. Is anyone familiar with this space able to point to prior art? Have people built similar frameworks in other languages?

I know various people have worked on dataflow. I remember thinking Materialize was very cool, I've used Kafka Streams at work before, and I remember thinking that a framework for stitching this all together probably made sense.

  • benrutter 5 months ago

    At first glance it looks conceptually pretty similar to some work in the data-science space; I'm thinking of Spark (which they mention in their docs) and Dask.

    My knee-jerk excitement is that this has the potential to be pretty powerful, specifically because it's based on Rust and so can play really nicely with other languages. Spark runs on the JVM, which is a good choice for portability but still introduces a bunch of complexities, and Dask runs in Python, which is a fairly hefty dependency you'd almost never bring in unless you're already on Python.

    In terms of distributed Rust, I've also had a look at Lunatic before, which seems good but probably a bit more low-level than what Hydro is going for (although I haven't really done anything other than basic noodling around with it).

    • tomnicholas1 5 months ago

      I was also going to say this looks similar to one layer of Dask: Dask takes arbitrary Python code and uses cloudpickle to serialise it in order to propagate dependencies to workers, and this seems to be an equivalent layer for Rust.

      • FridgeSeal 5 months ago

        This looks to be a degree more sophisticated than that.

        Authors in the comments here mention that the Flo compiler (?) will accept and rewrite Rust code to make it more amenable to distribution. It also appears to be building and optimising the dataflow rather than just distributing the work. There are also comparisons to Timely, which I believe does some kind of incremental compute.

    • conor-23 5 months ago

      One of the creators of Hydro here. Yeah, one way to think about Hydro is bringing the dataflow/query-optimization/distributed-execution ideas from databases and data science to programming distributed systems. We are focused on executing latency-critical, long-running services in this way, though, rather than individual queries. The kinds of things we have implemented in Hydro include a key-value store and the Paxos protocol, but these compile down to dataflow just like a Spark or SQL query does!
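
      As a toy sketch of that framing (plain Rust, not Hydro's API): a key-value store's state is just a fold over its stream of writes, which is exactly the kind of shape a dataflow/query optimizer knows how to reason about and place across machines.

          use std::collections::HashMap;

          enum Command {
              Put(String, String),
              Delete(String),
          }

          fn main() {
              // The incoming request stream; in a real service this would arrive
              // over the network rather than from a Vec.
              let requests = vec![
                  Command::Put("k1".into(), "v1".into()),
                  Command::Put("k2".into(), "v2".into()),
                  Command::Delete("k1".into()),
              ];

              // The store's state is an accumulation (fold) over that stream.
              let state: HashMap<String, String> =
                  requests.into_iter().fold(HashMap::new(), |mut m, cmd| {
                      match cmd {
                          Command::Put(k, v) => { m.insert(k, v); }
                          Command::Delete(k) => { m.remove(&k); }
                      }
                      m
                  });

              assert_eq!(state.get("k2").map(String::as_str), Some("v2"));
              assert!(state.get("k1").is_none());
          }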

  • Paradigma11 5 months ago

    It looks like a mixture of Akka (https://getakka.net/, less enterprisey than the Java version), which is based on the actor model and has a focus on distributed systems, and reactive libraries like Rx (https://reactivex.io/). So maybe https://doc.akka.io/libraries/akka-core/current/stream/index... is the best fit.

    • Cyph0n 5 months ago

      Worth mentioning Pekko, the Akka fork.

      https://pekko.apache.org/

      • haolez 5 months ago

        Is this an active fork?

        • necubi 5 months ago

          In 2022 Lightbend relicensed Akka from Apache 2.0 to the BSL, which was a huge problem for all of the other open-source projects (like Flink) that used it as part of their coordination layer. At this point most or all of them have moved to Pekko, which is a fork of the last Apache 2.0 release of Akka.

    • pradn 5 months ago

      An important design consideration for Hydro, it seems, is being able to define a workflow in a higher-level language and then cut it into different binaries.

      Is that something Akka / RX offer? My quick thought is that they structure code in one binary.

  • sitkack 5 months ago

    This is a project out of the RISELab:

    https://rise.cs.berkeley.edu/projects/

    Most data processing and distributed systems have some sort of link back to the research this lab has done.

    • sriram_malhar 5 months ago

      > Most data processing and distributed systems have some sort of link back to the research this lab has done.

      Heh. "most data processing and distributed systems"? Surely you don't mean that the rest of the world was sitting tight working on uniprocessors until this lab got set up in 2017!

      • necubi 5 months ago

        I assume they're talking about the longer history of the distributed systems lab at Berkeley, which was AMP before RISE. (It's actually now Sky Lab [0]; each of the labs lives for 5 years.) AMP notably is the origin of Spark, Mesos, and Tachyon (now Alluxio), and RISE originated Ray.

        [0] https://sky.cs.berkeley.edu/

        • conor-23 5 months ago

          There is a nice article by David Patterson (who used to direct the lab and won the Turing Award) on why Berkeley changes the name and scope of the lab every five years: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-... . Unfortunately, there's no good name for the lab that spans the five-year boundaries, so people just say "RISE lab" or "AMP lab", etc.

          • irq-1 5 months ago

            Interesting.

            > Good Commandment 3. Thou shalt limit the duration of a center. ...

            > To hit home runs, it’s wise to have many at bats. ...

            > It’s hard to predict information technology trends much longer than five years. ...

            > US Graduate student lifetimes are about five years. ...

            > You need a decade after a center finishes to judge if it was a home run. Just 8 of the 12 centers in Table I are old enough, and only 3 of them—RISC, RAID, and the Network of Workstations center—could be considered home runs. If slugging .375 is good, then I’m glad that I had many 5-year centers rather than fewer long ones.

            (Network of Workstations > Google)

        • sriram_malhar 5 months ago

          Right ... the AMPLab was set up in 2011. The Dijkstra Prize for distributed computing was set up in 2006 ... people like Dijkstra and Lamport and Jim Gray and Barbara Liskov won Turing Awards for a lifetime's worth of work.

          Now, Berkeley has been a fount of research on the topic, no question about that. I myself worked there (on Bloom, with Joe Hellerstein). But forgetting the other top universities of the world is a bit ... amusing?

          Let's take one of the many lists of foundational papers of this field:

          http://muratbuffalo.blogspot.com/2021/02/foundational-distri...

          How many came out of Berkeley, let alone a recent entry like the AMPLab?

          • sitkack 5 months ago

            You are mischaracterizing my comment; what I said was true. Most distributed systems work (now) has a link back to Berkeley distributed systems labs. Someone wanted context about Hydro (Joe Hellerstein).

            I am not going to make every contextualizing comment an authoritative bibliography; you of all people could have added that w/o being snarky and starting this whole subthread.

            • sriram_malhar 5 months ago

              > Most distributed systems work (now) has a link back to Berkeley distributed systems labs.

              I didn't think you were saying that most distributed systems work happening at Berkeley harks back to earlier work at Berkeley. That's a bit obvious.

              The only way I can interpret "most distributed systems work now" is a statement about work happening globally. In which case it is a sweeping and false generalization.

              Is there another interpretation?

      • macintux 5 months ago

        I too was a bit surprised by the assertion, but it doesn't say "ancestry", just "link".

        And I'm guessing if you include BOOM[1], the links are even deeper.

        [1] http://boom.cs.berkeley.edu

halfmatthalfcat 5 months ago

Love the effort, but I would love an “akka.rs” to eventually make its way into the Rust ecosystem.

  • thelittlenag 5 months ago

    Be careful what you wish for!

    • weego 5 months ago

      We appear to be wishing for writing tons of boilerplate with nonsensical use of implicits

      • rozap 5 months ago

        I once heard it described as "Erlang except it's bad and 30 years later on the JVM", which, IME, is harsh but accurate.

    • halfmatthalfcat 5 months ago

      If it’s happening I’d love to know about it

stefanka 5 months ago

How does this compare to Timely [0] in terms of dataflow? Can you represent control flow like loops in the IR?

[0] https://github.com/TimelyDataflow/timely-dataflow

  • tel 5 months ago

    Reading a bit about it from the Flo paper:

    - Describes a dataflow graph, just like Timely
    - Comes from a more "semantic dataflow" kind of heritage (FRP, composition, flow-of-flows, algebraic operators, proof-oriented) as opposed to the more operationally minded background of Timely
    - Has a (very) different notion of "progress" than Timely, focused instead on ensuring the compositions are generative in light of potentially unbounded streaming inputs
    - In fact, Flo doesn't really have any notion of "timeliness", no timestamping at all
    - Supports nested looping like Timely, though via a very different mechanism. The basic algebra is extremely non-cyclic, but the nested streams/graphs formalism allows for iteration.

    The paper also makes a direct comparison with DBSP, which as I understand it, is also part of the Timely/Naiad heritage. Similar to Timely, the authors suggest that Flo could be a unifying semantic framework for several other similar systems (Flink, LVars, DBSP).

    So I'd say that the authors of Flo are aware of Naiad/Timely and took inspiration from its nested iterative graphs, but little else.

    • shadaj 5 months ago

      Flo lead-author here! This is spot on :) Flo aims to be a bit less opinionated than Timely in how the runtime should behave, so in particular we don't support the type of "time-traveling" computation that Timely needs when you have iterative computations on datasets with retractions.

      This is also one of the core differences between Timely and DBSP, which uses a flat representation (z-sets) to store retractions rather than versioned elements. This allows retractions to be propagated as just negative item counts, which fits into the Flo model (and therefore Hydro).
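
      A minimal sketch of the z-set idea (plain Rust, not DBSP's actual API): each element carries a signed weight, a retraction is just a -1, and merging updates is simply adding weights.

          use std::collections::HashMap;
          use std::hash::Hash;

          // A z-set: a map from element to signed multiplicity.
          type ZSet<T> = HashMap<T, i64>;

          fn update<T: Hash + Eq>(zset: &mut ZSet<T>, item: T, weight: i64) {
              // A real implementation would drop entries whose weight hits zero.
              *zset.entry(item).or_insert(0) += weight;
          }

          fn main() {
              let mut z: ZSet<&str> = HashMap::new();
              update(&mut z, "alice", 1);  // insertion
              update(&mut z, "alice", 1);  // another insertion
              update(&mut z, "alice", -1); // retraction, sent as a negative count
              assert_eq!(z["alice"], 1);
          }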

    • v3xro 5 months ago

      Thanks for the summary, really looks like something that is worth digging into!

  • leicmi 5 months ago

    Their latest paper [0] refers to Naiad(timely dataflow) a few times, e.g.: "Inspired by ingress/egress nodes in Naiad [34], nested streams can be processed by nested dataflow graphs, which iteratively process chunks of data sourced from a larger stream with support for carrying state across iterations."

    [0] https://hydro.run/papers/flo.pdf

the_duke 5 months ago

So each "process" is deployed as a separate binary, and so presumably runs as a separate process?

If so, this seems somewhat problematic in terms of increased overhead.

How is fast communication achieved? Some fast shared memory IPC mechanism?

Also, I don't see anything about integration with async? For better or worse, the overwhelming majority of code dealing with networking has migrated to async. You won't find good non-async libraries for many things that need networking.

  • mplanchard 5 months ago

    By “distributed” I assumed it meant “distributed,” as in on entirely separate machines, thus necessitating that each component run as an independent process.

  • shadaj 5 months ago

    Currently, Hydro is focused on networked applications, where most parallelism is across machines rather than within them. So there is some extra overhead if you want single-machine parallelism. It's something we definitely want to address in the future, via shared memory as you mentioned.

    At POPL 2025 (last week!), an undergraduate working on Hydro presented a compiler that automatically compiles blocks of async-await code into Hydro dataflow. You can check out that (WIP, undocumented) compiler here: https://github.com/hydro-project/HydraulicLift
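
    To give a feel for the kind of code that maps naturally onto dataflow (illustrative only, assuming the tokio crate; this is not HydraulicLift's input format or API): an async receive/process/send loop is equivalent to a dataflow "map", with the loop body becoming the map operator's closure.

        use tokio::sync::mpsc;

        // The loop body is what would become the closure of a "map" operator.
        async fn double_all(mut rx: mpsc::Receiver<i64>, tx: mpsc::Sender<i64>) {
            while let Some(x) = rx.recv().await {
                let y = x * 2;
                if tx.send(y).await.is_err() {
                    break; // downstream receiver was dropped
                }
            }
        }

        #[tokio::main]
        async fn main() {
            let (in_tx, in_rx) = mpsc::channel(8);
            let (out_tx, mut out_rx) = mpsc::channel(8);
            tokio::spawn(double_all(in_rx, out_tx));

            in_tx.send(21).await.unwrap();
            drop(in_tx); // close the input so the loop above terminates
            assert_eq!(out_rx.recv().await, Some(42));
        }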

Keyframe 5 months ago

Looks really cool, and I can see a few ways to use it, especially the deploy part, which seems unique. Looking forward to more fleshed-out documentation, especially the seemingly crucial Streams, Singletons, and Optionals part.

vikslab 5 months ago

I do like the programming model. Do you perform any network optimizations when rewriting applications, e.g. handling network bottlenecks/congestion?

GardenLetter27 5 months ago

How does this compare to using something like Ballista for data pipelines?

The latter benefits a lot from building on top of Apache Arrow and Apache Datafusion.

  • conor-23 5 months ago

    One of the Hydro creators here. Ballista (and the ecosystem around Arrow and Parquet) are much more focused on analytical query processing, whereas Hydro is bringing the concepts from the query-processing world to the implementation of distributed systems. Our goal isn't to execute a SQL query, but rather to treat your distributed systems code (e.g. a microservice implementation) as if it were a SQL query. Integration with Arrow and Parquet is definitely planned on our roadmap, though!

pauldemarco 5 months ago

Is this like the BEAM, but with Rust?

jmakov 5 months ago

Not sure what problem this is solving. For real applications one would need something like ray.io for Rust. Academia people: let's make another dataflow framework.