[HN Gopher] Ballista: Distributed compute platform implemented i...
       ___________________________________________________________________
        
       Ballista: Distributed compute platform implemented in Rust using
       Apache Arrow
        
       Author : Kinrany
       Score  : 185 points
       Date   : 2021-01-18 17:50 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | habitue wrote:
       | It saddens me a little bit that a nascent protocol like arrow
       | flight is using grpc + protobufs behind the scenes with some
       | hacks on top to skip the cost of deserializing protobufs. It
       | seems like a really common belief that protobufs have so much
       | engineering time behind them and are cross-language that it's a
       | no brainer to implement your new protocol on top of them.
       | 
       | In reality, all the engineering and optimization time is behind
       | the implementations for the google internal languages, and even
       | the python protobuf implementation is pretty bad.
       | 
        | Protobuf makes some _stunningly_ bad decisions, like using
        | varints, so you shouldn't make the immediate assumption
        | "google has tons of great engineers, google uses protobuf for
        | everything internally, therefore protobuf is a good foundation
        | to build my new thing on top of".
       | 
       | In reality, path dependence and the (amazing) internal tooling
       | ecosystem at google both play a huge part of why they use
       | protobuf so extensively.
       | 
       | (Grpc is a little overly complicated to be a universal
       | recommendation, but I could believe it's a good choice for Arrow
       | Flight. But it seems like they didn't do grpc + arrow or grpc +
       | flatbuffer + arrow in the hopes that "dumb" grpc + protobuf
       | implementations would be able to still benefit. In my opinion,
       | grpc implementations are so coupled, there's no reason to make
       | this unnecessary concession to protobufs)
        
         | sitkack wrote:
         | What you describe is truly an
         | https://github.com/TimelyDataflow/abomonation
        
         | jfim wrote:
         | I'm not sure I'd call varints a stunningly bad decision. It
         | seems more like the kind of tradeoff one would make if storage
         | and network transfer costs are considered to be more important
         | than the serialization/deserialization speed.
         | 
         | That being said, the fact that that particular tradeoff was
         | considered to be good for Google doesn't mean it actually is,
         | or that it's applicable to one's application.
        
           | habitue wrote:
           | You're right, it's probably fair to say it's a stunningly bad
           | tradeoff for most applications most of the time, given we
           | have fast compression like snappy & brotli available now and
           | cloud costs are heavily weighted towards CPU costing more
           | than storage & network transfer
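            To make the tradeoff above concrete, here is a minimal,
            hypothetical sketch of protobuf-style LEB128 varint encoding
            and decoding (not protobuf's actual implementation): each byte
            carries 7 payload bits plus a continuation bit, so small
            integers shrink on the wire, but decoding becomes a
            data-dependent loop with a branch per byte instead of a
            single fixed-width load.

```rust
// Encode a u64 as a protobuf-style varint: 7 bits of payload per byte,
// high bit set on every byte except the last.
fn encode_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // continuation bit: more bytes follow
    }
}

// Decode a varint, returning the value and the number of bytes consumed.
// The loop length depends on the data itself, which is what makes this
// slower than a fixed-width load.
fn decode_varint(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &b) in bytes.iter().enumerate().take(10) {
        value |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None // truncated or overlong input
}

fn main() {
    let mut buf = Vec::new();
    encode_varint(300, &mut buf);
    // 300 = 0b10_0101100 -> low 7 bits with continuation bit, then 0b10
    assert_eq!(buf, vec![0xac, 0x02]);
    assert_eq!(decode_varint(&buf), Some((300, 2)));
    println!("ok");
}
```

            A fixed-width u64 would always cost 8 bytes but decode with a
            single unaligned load, which is the CPU-vs-wire-size tradeoff
            being debated here.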
        
           | xiphias2 wrote:
            | I guess it was good before compression. Nowadays working
            | with protobufs is painful inside Google as well, but at
            | least it's supported by everything.
           | 
           | Most of what the CPUs at Google are doing is just copying
           | fields from one protobuffer to another.
        
             | cogman10 wrote:
             | I think that's a fair description of pretty much every
             | webapp :D
             | 
             | Most of them are doing nothing more than copying data from
             | the database into an http stream.
        
               | ithkuil wrote:
                | Well, all CPUs do is copy memory from one location to
                | another, perform some arithmetic, and do some
                | conditional jumps :-)
        
               | cogman10 wrote:
               | pop pop jump jump oh what a crunch it is.
               | 
               | (sung to this tune
               | https://www.youtube.com/watch?v=iENQXIQ8wH0 )
               | 
                | Side note: It really is incredible what happened in the
                | early days of computing when memory and computation were
                | limited. The amount of care taken in the precise layout
                | of memory, or even the timing of a calculation, was
                | insane.
        
               | xiphias2 wrote:
               | Most of the arithmetic is checking optional fields if
               | they are empty or not before copying :)
        
               | munk-a wrote:
               | Don't forget the validation!
               | 
               | Webapps are dumb middleware that pipes data from the
               | database into an http stream - but it needs to determine
               | which database calls to invoke and sanitize all the
               | incoming junk.
        
               | xiphias2 wrote:
               | Sure, but when you are using C++, protobufs are not the
               | best way to store data...but I guess it could be worse.
        
             | philsnow wrote:
             | At least when I was there, proto (de)serialization consumed
             | the plurality of cpu cycles, but not the majority.
             | 
             | It isn't really the majority these days, is it?
        
         | ampdepolymerase wrote:
         | What are some production ready alternatives to gRPC that have
         | both a pleasant developer experience and great performance?
         | Apache Avro? Apache Thrift?
        
           | jeffnappi wrote:
           | GraphQL is one.
           | 
           | Here's an article making the argument for it
           | https://blog.spaceuptech.com/posts/why-we-moved-from-grpc-
           | to...
        
             | e12e wrote:
             | I think that if you honestly consider GraphQL a better fit
             | than gRPC, you probably should never have considered gRPC
             | to begin with...
             | 
             | And much as we're considering GraphQL for some services as
             | work... I'm not sure I buy it as an RPC framework. I
             | suppose it has about the same appeal as SOAP for that
             | purpose.
        
           | habitue wrote:
           | I think for this kind of high performance stuff, grpc is a
           | reasonable choice. For ergonomics though, http + json is fine
           | for many/most applications and there is a lot more widely
           | available tooling for it than there is for grpc.
           | 
           | It's very possible that will change over time
           | 
           | (My implicit assumption here is that a project like Arrow
           | Flight wants a cross-language, widely used foundation for
           | their protocol, and there's not a ton of things that fit that
           | bill. But depending on your application's needs, implementing
           | a language-specific rpc system is perfectly acceptable, and
           | may have even better ergonomics. Rust and Python both have a
           | plethora of mono-lingual rpc frameworks)
        
           | e12e wrote:
           | Cap'n'proto?
           | 
           | https://capnproto.org/
        
           | hilbertseries wrote:
           | gRPC has a pleasant developer experience? This is news to me.
        
             | riku_iki wrote:
             | Maybe it is relatively pleasant when compared to
             | alternatives.
        
             | ampdepolymerase wrote:
             | It doesn't.
        
         | [deleted]
        
       | speps wrote:
       | > using Apache Arrow MEMORY MODEL
       | 
       | Probably got cut because of maximum title length but important
       | nonetheless.
        
         | superbcarrot wrote:
         | It has Rust in the title, that will be enough.
        
           | DSingularity wrote:
           | lol
        
         | davesque wrote:
         | Not sure what that omitted portion would have clarified. Apache
         | Arrow, at its core, is a memory model spec. Also, it appears
         | that this project is using the official Rust library which is
         | developed in the same repo as all the other language
         | implementations. So the simple statement that they're using
         | Apache Arrow seems adequate.
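            As an illustration of what "memory model spec" means here,
            below is a simplified, hypothetical sketch of an Arrow-style
            primitive array: a contiguous fixed-width values buffer plus a
            validity bitmap with one bit per slot (LSB first), following
            the Arrow columnar format. The real arrow crate additionally
            handles buffer alignment, padding, and offsets, all omitted
            here.

```rust
// A toy Arrow-style array of nullable i32 values. Nulls cost a cleared
// bit in the validity bitmap; the values buffer stays fixed-width so
// every slot is addressable by index without parsing.
struct Int32Array {
    values: Vec<i32>,  // fixed-width values buffer
    validity: Vec<u8>, // bit i of byte i/8: 1 = valid, 0 = null
}

impl Int32Array {
    fn from_options(items: &[Option<i32>]) -> Self {
        let mut values = Vec::with_capacity(items.len());
        let mut validity = vec![0u8; (items.len() + 7) / 8];
        for (i, item) in items.iter().enumerate() {
            match item {
                Some(v) => {
                    values.push(*v);
                    validity[i / 8] |= 1 << (i % 8);
                }
                None => values.push(0), // null slots still occupy space
            }
        }
        Int32Array { values, validity }
    }

    fn get(&self, i: usize) -> Option<i32> {
        if self.validity[i / 8] & (1 << (i % 8)) != 0 {
            Some(self.values[i])
        } else {
            None
        }
    }
}

fn main() {
    let arr = Int32Array::from_options(&[Some(1), None, Some(3)]);
    assert_eq!(arr.get(0), Some(1));
    assert_eq!(arr.get(1), None);
    assert_eq!(arr.get(2), Some(3));
    println!("ok");
}
```

            Because the layout is fully specified down to the byte, two
            processes (even in different languages) can share these
            buffers without any serialization step, which is the point
            being made in this subthread.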
        
           | eb0la wrote:
           | Author is the same guy that wrote Arrow Rust library ;-)
        
             | davesque wrote:
             | So he is, hah. Well there you go :).
        
       | secondcoming wrote:
       | Hopefully this is the beginning of the end for JVM use in data-
       | centric applications like this. I'm not particularly bothered if
       | it's Rust or C++
        
       | georgewfraser wrote:
       | This kind of data infrastructure is a great use case for Rust. A
       | lot of data infrastructure is memory-bound, so saving the memory
       | overhead of GC is a huge win.
       | 
       | The use of Arrow to support multiple programming languages is
       | also a great concept. Other distributed computing engines have
       | ended up tied to the JVM (Spark, Presto, Kafka) as a way of
       | avoiding serialization/deserialization costs when you go across a
       | language boundary. Arrow is a really elegant solution, as long as
       | you're willing to batch up operations.
        
         | MrPowers wrote:
         | Databricks recently rebuilt Spark in C++ in a project called
         | "Delta Engine" to overcome the JVM limitations you're pointing
         | out. You're right, Rust is a great way to sidestep the dreaded
         | JVM GC pauses.
        
           | sitkack wrote:
            | At the same time, the JVM is getting better memory tracking
            | analysis and incremental pauseless collectors (C4, ZGC,
            | Shenandoah, G1 improvements).
           | 
           | https://blogs.oracle.com/javamagazine/understanding-the-
           | jdks...
        
       | eb0la wrote:
       | Blog post that started the project. Worth reading.
       | https://andygrove.io/2019/07/announcing-ballista/
        
         | andygrove wrote:
         | There is also a more recent blog post which perhaps led to the
         | project being posted here (I am guessing).
         | 
         | https://andygrove.io/2021/01/ballista-2021/
        
       | lumost wrote:
        | I think it's telling about the state of Rust in 2021 that this
        | project can't compile fully for the latest Rust versions.
        | Maintaining these types of frameworks in the early days is an
        | intensive and often thankless job; having your language leave
        | you behind is a near-guaranteed way to kill off your project,
        | not to mention introduce the obvious "I tried using library X
        | and hit compilation issue Y" type of issues.
       | 
       | I'm curious to see how this evolves as there are a number of
       | motivated folks working on similar efforts such as Vega. I for
       | one would welcome a mature rust based distributed compute
       | platform.
        
         | jasonpbecker wrote:
         | > With the release of Apache Arrow 3.0.0 there are many
         | breaking changes in the Rust implementation and as a result it
         | has been necessary to comment out much of the code in this
         | repository and gradually get each Rust module working again
         | with the 3.0.0 release.
         | 
          | This appears to be an issue with Arrow's implementation
          | hitting a new major version and the Rust libraries not yet
          | being compatible with the newest versions of Arrow. That's
          | not something specific to the Rust ecosystem. It's not like a
          | new version of Rust broke this project.
         | 
         | But even if it had, maintenance is always hard and the health
         | of a project is better measured by how long it takes to be
         | working with new, major, stable versions after widespread
         | community adoption of those new, major, stable versions.
         | 
         | I don't know if Arrow 3.0 is the most commonly used
         | implementation-- it may not have even reached that milestone.
        
           | nevi-me wrote:
           | Arrow 3.0 will be released likely in the next week. The Rust
           | implementation has a lot of changes because we've had to make
           | public-facing changes, mostly for performance benefits.
        
             | jasonpbecker wrote:
             | Thanks! This is helpful context, and supports my notion.
        
         | andygrove wrote:
         | The project uses stable Rust. Which version are you trying to
         | compile with?
        
           | frankmcsherry wrote:
           | I think maybe they were confused by this text (which I agree
           | has nothing to do with Rust itself breaking):
           | 
           | > With the release of Apache Arrow 3.0.0 there are many
           | breaking changes in the Rust implementation and as a result
           | it has been necessary to comment out much of the code in this
           | repository and gradually get each Rust module working again
           | with the 3.0.0 release.
        
             | andygrove wrote:
             | Ah, yes, that makes sense. I can see how this could have
             | been misread.
        
       ___________________________________________________________________
       (page generated 2021-01-18 23:00 UTC)