[HN Gopher] Apache Arrow 3.0
       ___________________________________________________________________
        
       Apache Arrow 3.0
        
       Author : kylebarron
       Score  : 244 points
       Date   : 2021-02-03 19:43 UTC (3 hours ago)
        
 (HTM) web link (arrow.apache.org)
 (TXT) w3m dump (arrow.apache.org)
        
       | mbyio wrote:
       | I'm surprised they are still making breaking changes, and they
       | plan to make more (they are already working on a 4.0).
        
         | reilly3000 wrote:
         | I didn't see any breaking changes in the release notes but I
         | may have missed them. Maybe they don't use SemVer?
        
           | lidavidm wrote:
           | Arrow uses SemVer, but the library and the data format are
           | versioned separately:
           | https://arrow.apache.org/docs/format/Versioning.html
        
       | offtop5 wrote:
        | Does no one do load testing anymore? Anyone got a working mirror?
        
       | Thaxll wrote:
        | The last time I worked in ETL it was with Hadoop; it looks like
        | a lot has happened since.
        
         | macksd wrote:
         | There's actually a lot of overlap between Hadoop and Arrow's
         | origins - a lot of the projects that integrated early and the
         | founding contributors had been in the larger Hadoop ecosystem.
         | It's a very good sign IMO that you can hardly tell anymore -
         | very diverse community and wide adoption!
        
       | georgewfraser wrote:
        | Arrow is _the most important_ thing happening in the data
        | ecosystem right now. It's going to allow you to run your choice
        | of execution engine, on top of your choice of data store, as
        | though they were designed to work together. It will mostly be
        | invisible to users; the key thing that needs to happen is that
        | all the producers and consumers of batch data adopt Arrow as
        | the common interchange format.
       | 
       | BigQuery recently implemented the storage API, which allows you
       | to read BQ tables, in parallel, in Arrow format:
       | https://cloud.google.com/bigquery/docs/reference/storage
       | 
       | Snowflake has adopted Arrow as the in-memory format for their
       | JDBC driver, though to my knowledge there is still no way to
       | access data in _parallel_ from Snowflake, other than to export to
       | S3.
       | 
       | As Arrow spreads across the ecosystem, users are going to start
       | discovering that they can store data in one system and query it
       | in another, at full speed, and it's going to be amazing.
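        | 
        | For a sense of what that read path looks like from Python (a
        | rough sketch, assuming the google-cloud-bigquery client plus
        | pyarrow; the table name is made up):
        | 
        |     from google.cloud import bigquery
        | 
        |     client = bigquery.Client()
        |     # With the BigQuery Storage API client installed, the
        |     # result streams back as Arrow record batches rather
        |     # than row-by-row JSON.
        |     rows = client.query(
        |         "SELECT * FROM `my_dataset.my_table`").result()
        |     table = rows.to_arrow()
        |     print(table.schema)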
        
         | wesm wrote:
         | Microsoft is also on top of this with their Magpie project
         | 
         | http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
         | 
         | "A common, efficient serialized and wire format across data
         | engines is a transformational development. Many previous
         | systems and approaches (e.g., [26, 36, 38, 51]) have observed
         | the prohibitive cost of data conversion and transfer,
         | precluding optimizers from exploiting inter-DBMS performance
          | advantages. By contrast, in-memory data transfer cost between a
         | pair of Arrow-supporting systems is effectively zero. Many
         | major, modern DBMSs (e.g., Spark, Kudu, AWS Data Wrangler,
         | SciDB, TileDB) and data-processing frameworks (e.g., Pandas,
         | NumPy, Dask) have or are in the process of incorporating
         | support for Arrow and ArrowFlight. Exploiting this is key for
         | Magpie, which is thereby free to combine data from different
         | sources and cache intermediate data and results, without
         | needing to consider data conversion overhead."
        
           | data_ders wrote:
            | Way cool! Is Magpie end-user facing anywhere yet? We were
            | using the azureml-dataprep library for a while, which seems
            | similar but isn't all of Magpie.
        
         | breck wrote:
         | Arrow is definitely one of the top 10 new things I'm most
         | excited about in the data science space, but not sure I'd call
         | it _the_ most important thing. ;)
         | 
          | It is pretty awesome, however, particularly for folks like me
          | who are often hopping between Python/R/Javascript. I've
          | definitely got it on the roadmap for all my data science
          | libraries.
         | 
         | Btw, Arquero from that UW lab looks really neat as well, and is
         | supporting Arrow out of the gate
         | (https://github.com/uwdata/arquero).
        
         | tristanz wrote:
         | Agreed! Thank you Arrow community for such a great project.
         | It's a long road but it opens up tremendous potential for
         | efficient data systems that talk to one another. The future
         | looks bright with so many vendors backing Arrow independently
         | and Wes McKinney founding Ursa Labs and now Ursa Computing.
         | https://ursalabs.org/blog/ursa-computing/
        
         | rubicon33 wrote:
         | Wait so you're telling me I could store data as a PDF file, and
         | access it easily / quickly as SQL?
        
           | sethhochberg wrote:
            | If you found/wrote an adapter to translate your structured PDF
           | into Arrow's format, yes - the idea is that you can wire up
           | anything that can produce Arrow data to anything that can
           | consume Arrow data.
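            | 
            | A minimal sketch of the Arrow side of such an adapter in
            | Python (the columns here just stand in for whatever your
            | PDF parser extracted):
            | 
            |     import pyarrow as pa
            | 
            |     # Pretend these came out of your PDF parser of choice
            |     columns = {"name": ["a", "b"], "amount": [1.5, 2.0]}
            | 
            |     table = pa.Table.from_pydict(columns)
            |     # `table` can now be handed to anything that consumes
            |     # Arrow data (pandas, Spark, a Flight server, ...)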
        
             | staticassertion wrote:
             | I'm kinda confused. Is that not the case for literally
             | everything? "You can send me data of format X, all I ask is
             | that you be able to produce format X" ?
             | 
             | I'm assuming that I'm missing something fwiw, not trying to
             | diminish the value.
        
               | humbleMouse wrote:
                | The difference is that Arrow's mapping behind the scenes
                | enables automatic translation to any "plugin" implemented
                | in the user's Arrow implementation. You can extend
                | Arrow's format to make it automatically map to whatever
                | you want, basically.
        
               | [deleted]
        
         | shafiemukhre wrote:
          | True. Arrow is awesome, and Dremio is using it as its built-in
          | memory format as well. I tried it and it is incredibly fast.
          | The future of the data ecosystem is gonna be amazing.
        
         | waynesonfire wrote:
         | Uhh.. maybe. It's a serde that's trying to be cross-language /
         | platform.
         | 
         | I guess it also offers some APIs to process the data so you can
         | minimize serde operations. But, I dunno. It's been hard to
          | understand the benefit of the library and the posts here don't
         | help.
        
           | chrisweekly wrote:
           | For those wondering what a SerDe is:
           | https://docs.serde.rs/serde/
        
             | waynesonfire wrote:
             | I meant serialize / deserialize in the literal sense.
        
             | nevi-me wrote:
             | The term likely predates the Rust implementation. SerDe is
             | Serializer & Deserializer, which could be any framework or
             | tool that allows the serialization and deserialization of
             | data.
             | 
             | I first came across the concept in Apache Hive.
        
           | TuringTest wrote:
           | If it works as a universal intermediate exchange language, it
           | could help standardize connections among disparate systems.
           | 
           | When you have N systems, it takes N^2 translators to build
           | direct connections to transfer data between them; but it only
            | takes N translators if all of them can talk the same exchange
           | language.
        
             | waynesonfire wrote:
              | can you define what a translator is? I don't understand
             | the complexity you're constructing. I have N systems and
             | they talk protobuf. What's the problem?
        
               | TuringTest wrote:
               | By a translator, I mean a library that allows accessing
               | data from different subsystems (either languages or OS
               | processes).
               | 
               | In this case, the advantages are that 1) Arrow is
               | language agnostic, so it's likely that it can be used as
               | a native library in your program and 2) it doesn't copy
               | data to make it accessible to another process, so it
               | saves a lot of marshalling / unmarshalling steps
               | (assuming both sides use data in tabular format, which is
               | typical of data analysis contexts).
        
               | [deleted]
        
           | wesm wrote:
           | There's no serde by design (aside from inspecting a tiny
           | piece of metadata indicating the location of each constituent
           | block of memory). So data processing algorithms execute
           | directly against the Arrow wire format without any
           | deserialization.
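            | 
            | A minimal pyarrow illustration (the file name is made up):
            | memory-map an Arrow IPC file and the resulting columns point
            | straight into the mapped region, with no decode step.
            | 
            |     import pyarrow as pa
            | 
            |     source = pa.memory_map("data.arrow", "r")
            |     reader = pa.ipc.open_file(source)
            |     table = reader.read_all()  # buffers reference the map
            |     print(table.num_rows, table.schema)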
        
             | waynesonfire wrote:
             | Of course there is. There is always deserialization. The
             | data format is most definitely not native to the CPU.
        
               | wesm wrote:
               | I challenge you to have a closer look at the project.
               | 
               | Deserialization by definition requires bytes or bits to
               | be relocated from their position in the wire protocol to
               | other data structures which are used for processing.
               | Arrow does not require any bytes or bits to be relocated.
               | So if a "C array of doubles" is not native to the CPU,
               | then I don't know what is.
        
               | throwaway894345 wrote:
               | Perhaps "zero-copy" is a more precise or well-defined
               | term?
        
               | waynesonfire wrote:
               | CPUs come in many flavors. One area where they differ is
               | in the way that bytes of a word are represented in
               | memory. Two common formats are Big Endian and Little
               | Endian. This is an example where a "C array of doubles"
                | would be incompatible and some form of deserialization
               | would be needed.
               | 
               | My understanding is that an apache arrow library provides
               | an API to manipulate the format in a platform agnostic
               | way. But to claim that it eliminates deserialization is
               | false.
        
           | mumblemumble wrote:
           | It's not just a serde. One of its key use cases is
           | eliminating serde.
        
             | waynesonfire wrote:
             | I just don't believe you. My CPU doesn't understand Apache
             | Arrow 3.0.
        
               | mumblemumble wrote:
               | So, there are several components to Arrow. One of them
               | transfers data using IPC, and naturally needs to
               | serialize. The other uses shared memory, which eliminates
               | the need for serde.
               | 
               | Sadly, the latter isn't (yet) well supported anywhere but
               | Python and C++. If you can/do use it, though, data are
                | just kept as arrays in memory. Which is exactly what
               | the CPU wants to see.
        
               | kristjansson wrote:
               | Not GP post, but it might have been better stated as
               | 'eliminating serde overhead'. Arrow's RPC serialization
               | [1] is basically Protobuf, with a whole lot of hacks to
               | eliminate copies on both ends of the wire. So it's still
               | 'serde', but markedly more efficient for large blocks of
               | tabular-ish data.
               | 
               | [1]: https://arrow.apache.org/docs/format/Flight.html
        
               | wesm wrote:
               | > Arrow's serialization is Protobuf
               | 
               | Incorrect. Only Arrow Flight embeds the Arrow wire format
               | in a Protocol Buffer, but the Arrow protocol itself does
               | not use Protobuf.
        
               | kristjansson wrote:
               | Apologies, off base there. Edited with a pointer to
               | Flight :)
        
       | jrevels wrote:
       | Excited to see this release's official inclusion of the pure
       | Julia Arrow implementation [1]!
       | 
        | It's so cool to be able to mmap Arrow memory and natively manipulate
       | it from within Julia with virtually no performance overhead.
       | Since the Julia compiler can specialize on the layout of Arrow-
       | backed types at runtime (just as it can with any other type), the
       | notion of needing to build/work with a separate "compiler for
       | fast UDFs" is rendered obsolete.
       | 
       | It feels pretty magical when two tools like this compose so well
       | without either being designed with the other in mind - a
       | testament to the thoughtful design of both :) mad props to Jacob
       | Quinn for spearheading the effort to revive/restart Arrow.jl and
       | get the package into this release.
       | 
       | [1] https://github.com/JuliaData/Arrow.jl
        
       | jayd16 wrote:
       | Can someone dig into the pros and cons of the columnar aspect of
       | Arrow? To some degree there are many other data transfer formats
       | but this one seems to promote its columnar orientation.
       | 
        | Things like e.g. protobuffers support hierarchical data, which
       | seems like a superset of columns. Is there a benefit to a column
       | based format? Is it an enforced simplification to ensure greater
       | compatibility or is there some other reason?
        
         | aldanor wrote:
         | - Selective/lazy access (e.g. I have 100 columns but I want to
         | quickly pull out/query just 2 of them).
         | 
         | - Improved compression (e.g. a column of timestamps).
         | 
         | - Flexible schemas being easy to manage (e.g. adding more
         | columns, or optional columns).
         | 
         | - Vectorization/SIMD-friendly.
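          | 
          | To make the first bullet concrete, a small pyarrow sketch
          | (file and column names are made up):
          | 
          |     import pyarrow.parquet as pq
          | 
          |     # Only the two requested columns are read; the other
          |     # column chunks on disk are never touched.
          |     table = pq.read_table("events.parquet",
          |                           columns=["user_id", "ts"])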
        
         | wging wrote:
         | Redshift's explanation is pretty good. Among other things, if
         | you only need a few columns there are entire blocks of data you
         | don't need to touch at all.
         | https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_st...
         | 
         | It's truly magical when you scope down a SELECT to the columns
         | you need and see a query go blazing fast. Or maybe I'm easily
         | impressed.
        
         | mumblemumble wrote:
         | A columnar format is almost always what you want for analytical
         | workloads, because their access patterns tend to iterate ranges
         | of rows but select only a few columns at random.
         | 
         | About the only thing protocol buffers has in common is that
         | it's a standardized binary format. The use case is largely non-
         | overlapping, though. Protobuf is meant for transmitting
         | monolithic datagrams, where the entire thing will be
         | transmitted and then decoded as a monolithic blob. It's also,
         | out of the box, not the best for efficiently transmitting
         | highly repetitive data. Column-oriented formats cut down on
         | some repetition of metadata, and also tend to be more
         | compressible because similar data tends to get clumped
         | together.
         | 
         | Coincidentally, Arrow's format for transmitting data over a
         | network, Arrow Flight, uses protocol buffers as its messaging
         | format. Though the payload is still blocks of column-oriented
         | data, for efficiency.
        
         | tomnipotent wrote:
         | This is intended for analytical workloads where you're often
         | doing things that can benefit from vectorization (like SIMD).
         | It's much faster to SUM(X) when all values of X are neatly laid
         | out in-memory.
         | 
         | It also has the added benefit of eliminating serialization and
         | deserialization of data between processes - a Python process
         | can now write to memory which is read by a C++ process that's
         | doing windowed aggregations, which are then written over the
         | network to another Arrow compatible service that just copies
         | the data as-is from the network into local memory and resumes
         | working.
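          | 
          | Roughly what that SUM(X) looks like with pyarrow's compute
          | kernels (a sketch; the kernel runs over a contiguous buffer):
          | 
          |     import pyarrow as pa
          |     import pyarrow.compute as pc
          | 
          |     x = pa.array([1.0, 2.5, 3.5], type=pa.float64())
          |     total = pc.sum(x)     # vectorized reduction
          |     print(total.as_py())  # 7.0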
        
           | waynesonfire wrote:
           | > It also has the added benefit of eliminating serialization
           | and deserialization of data between processes
           | 
           | Is that accurate? It still has to deserialize from apache
           | arrow format to whatever the cpu understands.
        
             | ianmcook wrote:
             | The Arrow Feather format is an on-disk representation of
             | Arrow memory. To read a Feather file, Arrow just copies it
             | byte for byte from disk into memory. Or Arrow can memory-
             | map a Feather file so you can operate on it without reading
             | the whole file into memory.
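              | 
              | For example (a sketch with pyarrow's feather module; the
              | file name is made up):
              | 
              |     import pyarrow as pa
              |     import pyarrow.feather as feather
              | 
              |     table = pa.table({"x": [1, 2, 3]})
              |     feather.write_feather(table, "data.feather")
              | 
              |     # memory_map=True maps the file rather than copying
              |     # it into RAM
              |     roundtrip = feather.read_table("data.feather",
              |                                    memory_map=True)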
        
               | waynesonfire wrote:
               | That's exactly how I read every data format.
               | 
               | The advantage you describe is in the operations that can
                | be performed against the data. It would be nice to see what
               | this API looks like and how it compares to flatbuffers /
               | pq.
               | 
               | To help me understand this benefit, can you talk through
               | what it's like to add 1 to each record and write it back
               | to disk?
        
             | zten wrote:
             | The important part to focus on is _between processes_.
             | 
             | Consider Spark and PySpark. The Python bits of Spark are in
             | a sidecar process to the JVM running Spark. If you ask
             | PySpark to create a DataFrame from Parquet data, it'll
             | instruct the Java process to load the data. Its in-memory
             | form will be Arrow. Now, if you want to manipulate that
             | data in PySpark using Python-only libraries, prior to the
             | adoption of Arrow it used to serialize and deserialize the
             | data between processes on the same host. With Arrow, this
             | process is simplified -- however, I'm not sure if it's
             | simplified by exchanging bytes that don't require
             | serialization/deserialization between the processes or by
             | literally sharing memory between the processes. The docs do
              | mention zero-copy shared memory.
        
             | [deleted]
        
         | georgewfraser wrote:
          | The main benefit of a columnar representation in _memory_ is
          | that it's more cache friendly for a typical analytical
          | workload. For example, if I have a dataframe:
          | 
          |     (A int, B int, C int, D int)
          | 
          | And I write:
          | 
          |     A + B
         | 
         | In a columnar representation, all the As are next to each
         | other, and all the Bs are next to each other, so the process of
         | (A and B in memory) => (A and B in CPU registers) => (addition)
         | => (A + B result back to memory) will be a lot more efficient.
         | 
         | In a row-oriented representation like protobuf, all your C and
         | D values are going to get dragged into the CPU registers
         | alongside the A and B values that you actually want.
         | 
         | Column-oriented representation is also more friendly to SIMD
         | CPU instructions. You can still use SIMD with a row-oriented
         | representation, but you have to use gather-scatter operations
         | which makes the whole thing less efficient.
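          | 
          | In pyarrow terms that A + B is roughly the sketch below; the
          | add kernel walks two contiguous buffers and never touches C
          | or D:
          | 
          |     import pyarrow as pa
          |     import pyarrow.compute as pc
          | 
          |     t = pa.table({"A": [1, 2, 3], "B": [10, 20, 30],
          |                   "C": [0, 0, 0], "D": [0, 0, 0]})
          |     result = pc.add(t["A"], t["B"])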
        
       | skratlo wrote:
       | Yay, another ad-tech support engine from Apache, great
        
       | [deleted]
        
       | dang wrote:
       | If curious see also
       | 
       | 2020 https://news.ycombinator.com/item?id=23965209
       | 
       | 2018 (a bit) https://news.ycombinator.com/item?id=17383881
       | 
       | 2017 https://news.ycombinator.com/item?id=15335462
       | 
       | 2017 https://news.ycombinator.com/item?id=15594542 rediscussed
       | recently https://news.ycombinator.com/item?id=25258626
       | 
       | 2016 https://news.ycombinator.com/item?id=11118274
       | 
       | Also: related from a couple weeks ago
       | https://news.ycombinator.com/item?id=25824399
       | 
       | related from a few months ago
       | https://news.ycombinator.com/item?id=24534274
       | 
       | related from 2019 https://news.ycombinator.com/item?id=21826974
        
       | humbleMouse wrote:
       | I worked at a large company a few years ago on a team
       | implementing this. It's super cool and works great. Definitely
       | where the future is headed
        
       | mushufasa wrote:
       | Can someone ELI5 what problems are best solved by apache arrow?
        
         | hbcondo714 wrote:
         | The press release on the first version of Apache Arrow
         | discussed here 5 years ago is actually a good introduction IMO:
         | 
         | https://news.ycombinator.com/item?id=11118274
        
         | Diederich wrote:
         | Rather curious myself.
         | 
         | https://en.wikipedia.org/wiki/Apache_Arrow was interesting, but
         | I think many of us would benefit from a broader, problem
         | focused description of Arrow from someone in the know.
        
         | foobarian wrote:
         | To really grok why this is useful, put yourself in the shoes of
         | a data warehouse user or administrator. This is a farm of
         | machines with disks that hold data too big to have in a typical
         | RDBMS. Because the data is so big and stored spread out across
         | machines, a whole parallel universe of data processing tools
         | and practices exists that run various computations over shards
         | of the data locally on each server, and merge the results etc.
         | (Map-reduce originally from Google was the first famous
         | example). Then there is a cottage industry of wrappers around
         | this kind of system to let you use SQL to query the data, make
         | it faster, let you build cron jobs and pipelines with
         | dependencies, etc.
         | 
         | Now, so far all these tools did not really have a common
         | interchange format for data, so there was a lot of wheel
         | reinvention and incompatibility. Got file system layer X on
         | your 10PB cluster? Can't use it with SQL engine Y. And I guess
         | this is where Arrow comes in, where if everyone uses it then
         | interop will get a lot better and each individual tool that
         | much more useful.
         | 
         | Just my naive take.
        
         | adgjlsfhk1 wrote:
          | The big thing is that it is one of the first standardized,
          | cross-language binary data formats. CSV is an OK text format,
          | but parsing it is really slow because of string escaping. The
          | files it produces are also pretty big since it's text.
          | 
          | Arrow is really fast to parse (up to 1000x faster than CSV),
          | supports data compression, has enough data types to be useful,
          | and deals with metadata well. The closest competitor is
          | probably protobuf, but protobuf is a total pain to parse.
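          | 
          | For what it's worth, pyarrow ships a multithreaded CSV reader
          | that parses straight into Arrow columns (file name made up):
          | 
          |     import pyarrow.csv as csv
          | 
          |     table = csv.read_csv("big.csv")  # parallel parse
          |     print(table.num_rows, table.schema)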
        
           | MrPowers wrote:
            | CSV is a file format and Arrow is an in-memory data format.
           | 
           | The CSV vs Parquet comparison makes more sense. Conflating
           | Arrow / Parquet is a pet peeve of Wes:
           | https://news.ycombinator.com/item?id=23970586
        
             | nolta wrote:
             | To be fair, arrow absorbed parquet-cpp, so a little
             | confusion is to be expected.
        
           | aldanor wrote:
           | The closest competitor would be HDF5, not Protobuf.
        
           | makapuf wrote:
           | Seems nice. How does it compare to hdf5?
        
             | BadInformatics wrote:
             | HDF5 is pretty terrible as a wire format, so it's not a 1-1
             | comparison to Arrow. Generally people are not going to be
             | saving Arrow data to disk either (though you can with the
             | IPC format), but serializing to a more compact
             | representation like Parquet.
        
         | neovintage wrote:
          | The premise around Arrow is that when you want to share data
          | with another system, or even between processes on the same
          | machine, most of the compute time is spent serializing and
          | deserializing data. Arrow removes that step by defining a
          | common columnar format that can be used in many different
          | programming languages. There's more to Arrow than just the
          | format, too - things that make working with data even easier,
          | like better over-the-wire transfers (Arrow Flight). How would
          | this manifest for customers using your applications? They'd
          | likely see speeds increase. Arrow makes a lot of sense when
          | working with lots of data in analytical or data science use
          | cases.
        
           | RobinL wrote:
           | Yes.
           | 
           | A second important point is the recognition that data tooling
           | often re-implements the same algorithms again and again,
           | often in ways which are not particularly optimised, because
           | the in-memory representation of data is different between
           | tools. Arrow offers the potential to do this once, and do it
           | well. That way, future data analysis libraries (e.g. a
           | hypothetical pandas 2) can concentrate on good API design
           | without having to re-invent the wheel.
           | 
           | And a third is that Arrow allows data to be chunked and
           | batched (within a particular tool), meaning that computations
           | can be streamed through memory rather than the whole
           | dataframe needing to be stored in memory. A little bit like
           | how Spark partitions data and sends it to different nodes for
           | computation, except all on the same machine. This also
            | enables parallelisation by default. With the core counts of
            | modern CPUs, this means Arrow is likely to be extremely fast.
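            | 
            | A tiny pyarrow sketch of that streaming idea - reading one
            | record batch at a time from an IPC stream instead of the
            | whole table (the file name is made up):
            | 
            |     import pyarrow as pa
            | 
            |     with pa.OSFile("big.arrows", "rb") as f:
            |         reader = pa.ipc.open_stream(f)
            |         for batch in reader:  # one RecordBatch at a time
            |             print(batch.num_rows)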
        
             | ianmcook wrote:
             | Re this second point: Arrow opens up a great deal of
             | language and framework flexibility for data engineering-
             | type tasks. Pre-Arrow, common kinds of data warehouse ETL
             | tasks like writing Parquet files with explicit control over
             | column types, compression, etc. often meant you needed to
             | use Python, probably with PySpark, or maybe one of the
             | other Spark API languages. With Arrow now there are a bunch
             | more languages where you can code up tasks like this, with
             | consistent results. Less code switching, lower complexity,
             | less cognitive overhead.
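              | 
              | For instance, plain pyarrow can write Parquet with
              | explicit column types and compression, no Spark involved
              | (a sketch; the names and values are made up):
              | 
              |     from decimal import Decimal
              |     import pyarrow as pa
              |     import pyarrow.parquet as pq
              | 
              |     schema = pa.schema([("id", pa.int32()),
              |                         ("amount", pa.decimal128(12, 2))])
              |     table = pa.table(
              |         {"id": [1, 2],
              |          "amount": [Decimal("1.50"), Decimal("2.75")]},
              |         schema=schema)
              |     pq.write_table(table, "out.parquet",
              |                    compression="zstd")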
        
           | BenoitP wrote:
           | > most of the compute time spent is in serializing and
           | deserializing data.
           | 
            | This is to be viewed in light of how hardware evolves now. CPU
           | compute power is no longer growing as much (at least for
           | individual cores).
           | 
           | But one thing that's still doubling on a regular basis is
           | memory capacity of all kinds (RAM, SSD, etc) and bandwidth of
           | all kinds (PCIe lanes, networking, etc). This divide is
           | getting large and will only continue to increase.
           | 
           | Which brings me to my main point:
           | 
           | You can't be serializing/deserializing data on the CPU. What
           | you want is to have the CPU coordinate the SSD to copy chunks
           | directly -and as is- to the NIC/app/etc.
           | 
           | Short of having your RAM doing compute work*, you would be
           | leaving performance on the table.
           | 
           | ----
           | 
           | * Which is starting to appear
           | (https://www.upmem.com/technology/), but that's not quite
           | there yet.
        
             | jeffbee wrote:
             | An interesting perspective on the future of computer
             | architecture but it doesn't align well with my experience.
             | CPUs are easier to build and although a lot of ink has been
             | spilled about the end of Moore's Law, it remains the case
             | that we are still on Moore's curve for number of
             | transistors, and since about 15 years ago we are now also
             | on the same slope for # of cores per CPU. We also still
             | enjoy increasing single-thread performance, even if not at
             | the rates of past innovation.
             | 
             | DRAM, by contrast, is currently stuck. We need materials
             | science breakthroughs to get beyond the capacitor aspect
             | ratio challenge. RAM is still cheap but as a systems
             | architect you should get used to the idea that the amount
             | of DRAM per core will fall in the future, by amounts that
             | might surprise you.
        
             | jedbrown wrote:
             | This is backward -- this sort of serialization is
             | overwhelmingly bottlenecked on bandwidth (not CPU). (Multi-
             | core) compute improvements have been outpacing bandwidth
             | improvements for decades and have not stopped.
             | Serialization is a bottleneck because compute is fast/cheap
             | and bandwidth is precious. This is also reflected in the
             | relative energy to move bytes being increasingly larger
             | than the energy to do some arithmetic on those bytes.
        
           | MrPowers wrote:
           | Exactly. Some specific examples.
           | 
           | Read a Parquet file into a Pandas DataFrame. Then read the
           | Pandas DataFrame into a Spark DataFrame. Spark & Pandas are
           | using the same Arrow memory format, so no serde is needed.
           | 
           | See the "Standardization Saves" diagram here:
           | https://arrow.apache.org/overview/
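            | 
            | Roughly (a sketch; assumes Spark 3.x with pyarrow installed,
            | and the exact config key differs in older Spark versions):
            | 
            |     import pandas as pd
            |     from pyspark.sql import SparkSession
            | 
            |     spark = (SparkSession.builder
            |              .config("spark.sql.execution.arrow"
            |                      ".pyspark.enabled", "true")
            |              .getOrCreate())
            | 
            |     pdf = pd.read_parquet("data.parquet")
            |     # pandas -> Spark DataFrame via Arrow, no row-by-row
            |     # pickling
            |     sdf = spark.createDataFrame(pdf)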
        
             | pindab0ter wrote:
             | I hope you don't mind me asking dumb questions, but how
             | does this differ from the role that say Protocol Buffers
             | fills? To my ears they both facilitate data exchange. Are
             | they comparable in that sense?
        
               | arjunnarayan wrote:
               | protobufs still get encoded and decoded by each client
               | when loaded into memory. arrow is a little bit more like
               | "flatbuffers, but designed for common data-intensive
               | columnar access patterns"
        
               | hackcasual wrote:
               | Arrow does actually use flatbuffers for metadata storage.
        
               | poorman wrote:
               | https://github.com/apache/arrow/tree/master/format
        
               | hawk_ wrote:
               | is that at the cost of the ability to do schema
               | evolution?
        
               | rcoveson wrote:
               | Better to compare it to Cap'n Proto instead. Arrow data
               | is already laid out in a usable way. For example, an
               | Arrow column of int64s is an 8-byte aligned memory region
               | of size 8*N bytes (plus a bit vector for nullity), ready
               | for random access or vectorized operations.
               | 
               | Protobuf, on the other hand, would encode those values as
               | variable-width integers. This saves a lot of space, which
               | might be better for transfer over a network, but means
               | that writers have to take a usable in-memory array and
               | serialize it, and readers have to do the reverse on their
               | end.
               | 
               | Think of Arrow as standardized shared memory using
               | struct-of-arrays layout, Cap'n Proto as standardized
               | shared memory using array-of-structs layout, and Protobuf
               | as a lightweight purpose-built compression algorithm for
               | structs.
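                | 
                | You can see that layout directly from pyarrow (a small
                | sketch):
                | 
                |     import pyarrow as pa
                | 
                |     arr = pa.array([1, 2, None, 4],
                |                    type=pa.int64())
                |     validity, values = arr.buffers()
                |     # `values` holds the raw 8-byte-per-element data;
                |     # `validity` is the null bitmap
                |     print(values.size)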
        
               | jeffbee wrote:
               | Protobuf provides the fixed64 type and when combined with
               | `packed` (the default in proto3, optional in proto2)
               | gives you a linear layout of fixed-size values. You would
               | not get natural alignment from protobuf's wire format if
               | you read it from an arbitrary disk or net buffer; to get
               | alignment you'd need to move or copy the vector.
               | Protobuf's C++ generated code provides RepeatedField that
               | behaves in most respects like std::vector, but in as much
               | as protobuf is partly a wire format and partly a library,
               | users are free to ignore the library and use whatever
               | code is most convenient to their application.
               | 
               | TL;DR variable-width numbers in protobuf are optional.
        
               | cornstalks wrote:
               | > _Think of Arrow as standardized shared memory using
                | struct-of-arrays layout, Cap'n Proto as standardized
               | shared memory using array-of-structs layout_
               | 
               | I just want to say thank you for this part of the
               | sentence. I understand struct-of-arrays vs array-of-
               | structs, and now I finally understand what the heck Arrow
               | is.
        
           | poorman wrote:
           | Not only in between processes, but also in between languages
           | in a single process. In this POC I spun up a Python
            | interpreter in a Go process and passed the Arrow data buffer
            | between the two runtimes in constant time.
           | https://github.com/nickpoorman/go-py-arrow-bridge
        
           | didip wrote:
           | Question, doesn't Parquet already do that?
        
             | ianmcook wrote:
             | From https://arrow.apache.org/faq/: "Parquet files cannot
             | be directly operated on but must be decoded in large
             | chunks... Arrow is an in-memory format meant for direct and
             | efficient use for computational purposes. Arrow data is...
             | laid out in natural format for the CPU, so that data can be
             | accessed at arbitrary places at full speed."
        
             | snicker7 wrote:
             | Yes. But parquet is now based on Apache Arrow.
        
               | ianmcook wrote:
               | Parquet is not based on Arrow. The Parquet libraries are
               | built into Arrow, but the two projects are separate and
               | Arrow is not a dependency of Parquet.
        
         | CrazyPyroLinux wrote:
         | This is not the greatest answer, but one anecdote: I am looking
         | at it as an alternative to MS SSIS for moving batches of data
         | around between databases.
        
         | [deleted]
        
         | RobinL wrote:
         | I wrote a bit about this from the perspective of a data
         | scientist here:
         | https://www.robinlinacre.com/demystifying_arrow/
         | 
         | I cover some of the use cases, but more importantly try and
         | explain how it all fits together, justifying why - as another
          | commenter has said - it's the most important thing happening
         | in the data ecosystem right now.
         | 
          | I wrote it because I'd heard a lot about Arrow, and even used
         | it quite a lot, but realised I hadn't really understood what it
         | was!
        
         | juntan wrote:
         | In Perspective (https://github.com/finos/perspective), we use
         | Apache Arrow as a fast, cross-language/cross-network data
         | encoding that is extremely useful for in-browser data
         | visualization and analytics. Some benefits:
         | 
         | - super fast read/write compared to CSV & JSON (Perspective and
         | Arrow share an extremely similar column encoding scheme, so we
         | can memcpy Arrow columns into Perspective wholesale instead of
         | reading a dataset iteratively).
         | 
         | - the ability to send Arrow binaries as an ArrayBuffer between
         | a Python server and a WASM client, which guarantees
         | compatibility and removes the overhead of JSON
         | serialization/deserialization.
         | 
         | - because Arrow columns are strictly typed, there's no need to
         | infer data types - this helps with speed and correctness.
         | 
         | - Compared to JSON/CSV, Arrow binaries have a super compact
         | encoding that reduces network transport time.
         | 
         | For us, building on top of Apache Arrow (and using it wherever
         | we can) reduces the friction of passing around data between
         | clients, servers, and runtimes in different languages, and
         | allows larger datasets to be efficiently visualized and
         | analyzed in the browser context.
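          | 
          | The server side of that hand-off is small in Python (a
          | sketch; the client would parse the same bytes with the Arrow
          | JS library):
          | 
          |     import pyarrow as pa
          | 
          |     table = pa.table({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})
          | 
          |     sink = pa.BufferOutputStream()
          |     with pa.ipc.new_stream(sink, table.schema) as writer:
          |         writer.write_table(table)
          | 
          |     # bytes ready to ship to the browser as an ArrayBuffer
          |     payload = sink.getvalue().to_pybytes()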
        
         | prepend wrote:
         | I recently found it useful for the dumbest reason. A dataset
         | was about 3GB as a CSV and 20MB as a parquet file created and
         | consumed by arrow. The file also worked flawlessly across
         | different environments and languages.
         | 
         | So it's a good transport tool. It also happens to be fast to
         | load and query, but I only used it because of the compact way
         | it stores data without any hoops to jump through.
         | 
         | Of course one might say that it's stupid to try to pass around
          | a 3GB or 20MB file, but in my case I needed to do that.
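          | 
          | That kind of conversion is a couple of lines with pyarrow (a
          | sketch; file names made up):
          | 
          |     import pyarrow.csv as csv
          |     import pyarrow.parquet as pq
          | 
          |     table = csv.read_csv("data.csv")
          |     pq.write_table(table, "data.parquet",
          |                    compression="snappy")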
        
         | IshKebab wrote:
         | When you want to process large amounts of in-memory tabular
         | data from different languages.
         | 
         | You can save it to disk too using Apache Parquet but I
         | evaluated Parquet and it is very immature. Extremely incomplete
         | documentation and lots of Arrow features are just not supported
         | in Parquet unfortunately.
        
           | Lucasoato wrote:
           | Do you mean the Parquet format? I don't think Parquet is
            | immature, it is used in so many enterprise environments and
            | is one of the few columnar file formats for batch analysis
            | and processing. It performs so well... But I'm curious to know
           | your opinion on this, so feel free to add some context to
           | your position!
        
             | IshKebab wrote:
             | Yeah I do. For example Apache Arrow supports in memory
             | compression. But Parquet does not support that. I had to
             | look through the code to find that out, and I found many
             | instances of basically `throw "Not supported"`. And yeah as
             | I said the documentation is just non-existent.
             | 
             | If you are already using Arrow, or you absolutely must use
             | a columnar file format then it's probably a good option.
        
               | BadInformatics wrote:
               | Is that a problem in the Parquet format or in PyArrow? My
               | understanding is that Parquet is primarily meant for on-
               | disk storage (hence the default on-disk compression), so
               | you'd read into Arrow for in-memory compression or IPC.
        
       | liminal wrote:
       | Would really love to see first class support for
       | Javascript/Typescript for data visualization purposes. The
       | columnar format would naturally lend itself to an Entity-
       | Component style architecture with TypedArrays.
        
         | BadInformatics wrote:
         | https://arrow.apache.org/docs/js/ has existed for a while and
         | uses typed arrays under the hood. It's a bit of a chunky
         | dependency, but if you're at the point where that level of
         | throughput is required bundle size is probably not a big deal.
        
           | liminal wrote:
           | I was mostly looking at this:
           | https://arrow.apache.org/docs/status.html
        
             | BadInformatics wrote:
             | Not sure I follow, that page indicates that JS support is
             | pretty good for all but the more obscure features (e.g.
             | decimals) and doesn't mention data visualization at all?
             | Anyhow, I've successfully used
             | https://github.com/vega/vega-loader-arrow for in-browser
             | plots before, and Observable has a fair few notebooks
             | showing how to use the JS API (e.g.
             | https://observablehq.com/@theneuralbit/introduction-to-
             | apach...)
        
         | nevi-me wrote:
         | Have you seen https://github.com/finos/perspective? Though they
         | removed the JS library a few months ago in favour of a WASM
         | build of the C++ library.
        
       | anonyfox wrote:
        | So if I understand this correctly from an application developer's
       | perspective:
       | 
       | - for OLTP tasks, something row based like sqlite is great. Small
       | to medium amounts of data mixed reading/writing with transactions
       | 
       | - for OLAP tasks, arrow looks great. Big amounts of data, faster
       | querying (datafusion) and more compact data files with parquet.
       | 
       | Basically prevent the operational database from growing too
       | large, offload older data to arrow/parquet. Did I get this
       | correct?
       | 
       | Additionally there seem to be further benefits like sharing
       | arrow/parquet with other consumers.
       | 
       | Sounds convincing, I just have two very specific questions:
       | 
       | - if I load a ~2GB collection of items into arrow and query it
       | with datafusion, how much slower will this perform in comparison
       | to my current rust code that holds a large Vec in memory and
        | "queries" via iter/filter?
       | 
       | - if I want to move data from sqlite to a more permanent parquet
        | "archive" file, is there a better way than recreating the whole
        | file or writing additional files, like appending?
       | 
       | Really curious, could find no hints online so far to get an idea.
        
       ___________________________________________________________________
       (page generated 2021-02-03 23:00 UTC)