[HN Gopher] Ballista: Distributed compute platform implemented i...
___________________________________________________________________
Ballista: Distributed compute platform implemented in Rust using Apache Arrow

Author : Kinrany
Score  : 185 points
Date   : 2021-01-18 17:50 UTC (5 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| habitue wrote:
| It saddens me a little bit that a nascent protocol like arrow
| flight is using grpc + protobufs behind the scenes with some
| hacks on top to skip the cost of deserializing protobufs. It
| seems like a really common belief that protobufs have so much
| engineering time behind them and are cross-language that it's a
| no-brainer to implement your new protocol on top of them.
|
| In reality, all the engineering and optimization time is behind
| the implementations for the google internal languages, and even
| the python protobuf implementation is pretty bad.
|
| Protobuf makes some _stunningly_ bad decisions, like using
| varints, so you shouldn't make the immediate assumption
| "google has tons of great engineers, google uses protobuf for
| everything internally, therefore, protobuf is a good foundation
| to build my new thing on top of".
|
| In reality, path dependence and the (amazing) internal tooling
| ecosystem at google both play a huge part in why they use
| protobuf so extensively.
|
| (Grpc is a little overly complicated to be a universal
| recommendation, but I could believe it's a good choice for Arrow
| Flight. But it seems like they didn't do grpc + arrow or grpc +
| flatbuffer + arrow in the hopes that "dumb" grpc + protobuf
| implementations would be able to still benefit. In my opinion,
| grpc implementations are so coupled, there's no reason to make
| this unnecessary concession to protobufs)

| sitkack wrote:
| What you describe is truly an
| https://github.com/TimelyDataflow/abomonation

| jfim wrote:
| I'm not sure I'd call varints a stunningly bad decision.
| It seems more like the kind of tradeoff one would make if storage
| and network transfer costs are considered to be more important
| than the serialization/deserialization speed.
|
| That being said, the fact that that particular tradeoff was
| considered to be good for Google doesn't mean it actually is,
| or that it's applicable to one's application.

| habitue wrote:
| You're right, it's probably fair to say it's a stunningly bad
| tradeoff for most applications most of the time, given we
| have fast compression like snappy & brotli available now and
| cloud costs are heavily weighted towards CPU costing more
| than storage & network transfer.

| xiphias2 wrote:
| I guess it was good before compression. Nowadays working with
| protobufs is painful inside Google as well, but at least it's
| supported by everything.
|
| Most of what the CPUs at Google are doing is just copying
| fields from one protobuffer to another.

| cogman10 wrote:
| I think that's a fair description of pretty much every
| webapp :D
|
| Most of them are doing nothing more than copying data from
| the database into an http stream.

| ithkuil wrote:
| Well, all that CPUs do is copy memory from one memory
| location to another, perform some arithmetic and do some
| conditional jumps :-)

| cogman10 wrote:
| pop pop jump jump oh what a crunch it is.
|
| (sung to this tune
| https://www.youtube.com/watch?v=iENQXIQ8wH0 )
|
| Side note: It really is incredible what happened in the
| early days of computing when memory and computation were
| limited. How much care was taken in the precise layout of
| memory or even the timing of a calculation was insane.

| xiphias2 wrote:
| Most of the arithmetic is checking whether optional fields
| are empty before copying :)

| munk-a wrote:
| Don't forget the validation!
|
| Webapps are dumb middleware that pipe data from the
| database into an http stream - but they need to determine
| which database calls to invoke and sanitize all the
| incoming junk.
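For reference, the varint tradeoff that habitue and jfim debate above (smaller encoded size versus extra per-byte work when decoding) can be illustrated with a minimal sketch. It assumes the base-128 varint scheme for unsigned integers, with 7 payload bits per byte and the high bit used as a continuation flag; the function names `encode_varint` and `decode_varint` are hypothetical and are not part of any protobuf library.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (illustrative only)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)


def decode_varint(data: bytes) -> Tuple[int, int]:
    """Decode one varint from the start of data; return (value, bytes_consumed)."""
    value = 0
    shift = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:          # the decoder must test every byte for this flag
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")


if __name__ == "__main__":
    # Compare the varint size with a fixed 4-byte encoding for a few values.
    for n in (1, 300, 2**32 - 1):
        print(n, "varint bytes:", len(encode_varint(n)), "fixed bytes: 4")
```

Small values shrink to a single byte, which is the storage and network saving jfim points to. The decoder, however, cannot know a field's length without inspecting each byte's continuation bit, which is the CPU cost habitue objects to.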
| xiphias2 wrote:
| Sure, but when you are using C++, protobufs are not the
| best way to store data...but I guess it could be worse.

| philsnow wrote:
| At least when I was there, proto (de)serialization consumed
| the plurality of cpu cycles, but not the majority.
|
| It isn't really the majority these days, is it?

| ampdepolymerase wrote:
| What are some production ready alternatives to gRPC that have
| both a pleasant developer experience and great performance?
| Apache Avro? Apache Thrift?

| jeffnappi wrote:
| GraphQL is one.
|
| Here's an article making the argument for it
| https://blog.spaceuptech.com/posts/why-we-moved-from-grpc-to...

| e12e wrote:
| I think that if you honestly consider GraphQL a better fit
| than gRPC, you probably should never have considered gRPC
| to begin with...
|
| And much as we're considering GraphQL for some services at
| work... I'm not sure I buy it as an RPC framework. I
| suppose it has about the same appeal as SOAP for that
| purpose.

| habitue wrote:
| I think for this kind of high performance stuff, grpc is a
| reasonable choice. For ergonomics though, http + json is fine
| for many/most applications and there is a lot more widely
| available tooling for it than there is for grpc.
|
| It's very possible that will change over time.
|
| (My implicit assumption here is that a project like Arrow
| Flight wants a cross-language, widely used foundation for
| their protocol, and there's not a ton of things that fit that
| bill. But depending on your application's needs, implementing
| a language-specific rpc system is perfectly acceptable, and
| may have even better ergonomics. Rust and Python both have a
| plethora of mono-lingual rpc frameworks)

| e12e wrote:
| Cap'n'proto?
|
| https://capnproto.org/

| hilbertseries wrote:
| gRPC has a pleasant developer experience? This is news to me.

| riku_iki wrote:
| Maybe it is relatively pleasant when compared to
| alternatives.

| ampdepolymerase wrote:
| It doesn't.
| [deleted]

| speps wrote:
| > using Apache Arrow MEMORY MODEL
|
| Probably got cut because of maximum title length but important
| nonetheless.

| superbcarrot wrote:
| It has Rust in the title, that will be enough.

| DSingularity wrote:
| lol

| davesque wrote:
| Not sure what that omitted portion would have clarified. Apache
| Arrow, at its core, is a memory model spec. Also, it appears
| that this project is using the official Rust library which is
| developed in the same repo as all the other language
| implementations. So the simple statement that they're using
| Apache Arrow seems adequate.

| eb0la wrote:
| Author is the same guy that wrote the Arrow Rust library ;-)

| davesque wrote:
| So he is, hah. Well there you go :).

| vasi26ro wrote:
| I have less than a year as a software tester and I have a web
| startup. I am earning around 300 EUR a month, but the level of
| knowledge that I have gained is amazing. Even a freelancer
| business is trying to sign a contract with me to establish
| themselves as an authority in the startup world. Don't worry,
| just try.

| secondcoming wrote:
| Hopefully this is the beginning of the end for JVM use in
| data-centric applications like this. I'm not particularly
| bothered if it's Rust or C++.

| georgewfraser wrote:
| This kind of data infrastructure is a great use case for Rust. A
| lot of data infrastructure is memory-bound, so saving the memory
| overhead of GC is a huge win.
|
| The use of Arrow to support multiple programming languages is
| also a great concept. Other distributed computing engines have
| ended up tied to the JVM (Spark, Presto, Kafka) as a way of
| avoiding serialization/deserialization costs when you go across a
| language boundary. Arrow is a really elegant solution, as long as
| you're willing to batch up operations.

| MrPowers wrote:
| Databricks recently rebuilt Spark in C++ in a project called
| "Delta Engine" to overcome the JVM limitations you're pointing
| out.
| You're right, Rust is a great way to sidestep the dreaded
| JVM GC pauses.

| sitkack wrote:
| At the same time the JVM is getting better memory tracking
| analysis and incremental pauseless collectors (C4, ZGC,
| Shenandoah, G1 improvements)
|
| https://blogs.oracle.com/javamagazine/understanding-the-jdks...

| eb0la wrote:
| Blog post that started the project. Worth reading.
| https://andygrove.io/2019/07/announcing-ballista/

| andygrove wrote:
| There is also a more recent blog post which perhaps led to the
| project being posted here (I am guessing).
|
| https://andygrove.io/2021/01/ballista-2021/

| lumost wrote:
| I think it's telling about the state of Rust in 2021 that this
| project can't compile fully for the latest Rust versions.
| Maintaining these types of frameworks in the early days is an
| intensive and often thankless job, and having your language leave
| you behind is a near guaranteed way to kill off your project, not
| to mention introduce the obvious "I tried using library X and hit
| compilation issue Y"-type issues.
|
| I'm curious to see how this evolves as there are a number of
| motivated folks working on similar efforts such as Vega. I for
| one would welcome a mature Rust-based distributed compute
| platform.

| jasonpbecker wrote:
| > With the release of Apache Arrow 3.0.0 there are many
| breaking changes in the Rust implementation and as a result it
| has been necessary to comment out much of the code in this
| repository and gradually get each Rust module working again
| with the 3.0.0 release.
|
| This appears to be an issue with Arrow's implementation hitting
| a new major version and the Rust libraries not yet being
| compatible with the newest versions of Arrow. That's not
| something specific to the Rust ecosystem. It's not like a new
| version of Rust broke this project.
|
| But even if it had, maintenance is always hard and the health
| of a project is better measured by how long it takes to be
| working with new, major, stable versions after widespread
| community adoption of those new, major, stable versions.
|
| I don't know if Arrow 3.0 is the most commonly used
| implementation; it may not have even reached that milestone.

| nevi-me wrote:
| Arrow 3.0 will likely be released in the next week. The Rust
| implementation has a lot of changes because we've had to make
| public-facing changes, mostly for performance benefits.

| jasonpbecker wrote:
| Thanks! This is helpful context, and supports my notion.

| andygrove wrote:
| The project uses stable Rust. Which version are you trying to
| compile with?

| frankmcsherry wrote:
| I think maybe they were confused by this text (which I agree
| has nothing to do with Rust itself breaking):
|
| > With the release of Apache Arrow 3.0.0 there are many
| breaking changes in the Rust implementation and as a result
| it has been necessary to comment out much of the code in this
| repository and gradually get each Rust module working again
| with the 3.0.0 release.

| andygrove wrote:
| Ah, yes, that makes sense. I can see how this could have
| been misread.
___________________________________________________________________
(page generated 2021-01-18 23:00 UTC)