[HN Gopher] Demystifying Apache Arrow (2020)
       ___________________________________________________________________
        
       Demystifying Apache Arrow (2020)
        
       Author : dmlorenzetti
       Score  : 174 points
       Date   : 2023-01-09 15:36 UTC (1 days ago)
        
 (HTM) web link (www.robinlinacre.com)
 (TXT) w3m dump (www.robinlinacre.com)
        
       | RobinL wrote:
        | Author here. Since I wrote this, Arrow seems to be more and
        | more pervasive. As a data engineer, the adoption of Arrow (and
        | parquet) as a data exchange format has so much value. It's
        | amazing how much time my colleagues and I have spent on data
        | type issues that have arisen from the wide range of data
        | tooling (R, Pandas, Excel etc.). So much so that I try to
        | stick to parquet, using SQL where possible to easily preserve
        | data types (pandas is a particularly bad offender for managing
        | data types).
       | 
       | In doing so, I'm implicitly using Arrow - e.g. with Duckdb, AWS
       | Athena and so on. The list of tools using Arrow is long!
       | https://arrow.apache.org/powered_by/
       | 
       | Another interesting development since I wrote this is DuckDB.
       | 
       | DuckDB offers a compute engine with great performance against
       | parquet files and other formats. Probably similar performance to
       | Arrow. It's interesting they opted to write their own compute
       | engine rather than use Arrow's - but I believe this is partly
       | because Arrow was immature when they were starting out. I mention
       | it because, as far as I know, there's not yet an easy SQL
       | interface to Arrow from Python.
       | 
        | Nonetheless, DuckDB still uses Arrow for some of its other
        | features: https://duckdb.org/2021/12/03/duck-arrow.html
       | 
       | Arrow also has a SQL query engine:
       | https://arrow.apache.org/blog/2019/02/04/datafusion-donation...
       | 
       | I might be wrong about this - but in my experience, it feels like
       | there's more consensus around the Arrow format, as opposed to the
       | compute side.
       | 
       | Going forward, I see parquet continuing on its path to becoming a
       | de facto standard for storing and sharing bulk data. I'm
       | particularly excited about new tools that allow you to process it
       | in the browser. I've written more about this just yesterday:
       | https://www.robinlinacre.com/parquet_api/, discussion:
       | https://news.ycombinator.com/item?id=34310695.
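        | 
        | For what it's worth, the DuckDB route from that duck-arrow post
        | looks roughly like this from Python (just a sketch - the file
        | and column names are made up, and it assumes the duckdb and
        | pyarrow packages, with DuckDB's replacement scans picking up
        | the local Arrow table by its variable name):
        | 
        |     import duckdb
        |     import pyarrow.parquet as pq
        | 
        |     # Load a parquet file into an Arrow table in memory
        |     orders = pq.read_table("orders.parquet")
        | 
        |     # Run SQL against the Arrow table and get Arrow back
        |     result = duckdb.query(
        |         "SELECT customer_id, sum(amount) AS total "
        |         "FROM orders GROUP BY customer_id"
        |     ).arrow()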
        
         | hintymad wrote:
         | Thanks for sharing your insights. Any comments on Feather vs
         | Parquet? If we don't need to support tools that can only
         | interact with Parquet, how will Feather pan out as a Parquet
          | alternative (or can Feather not be such an alternative at
          | all)?
        
           | reichardt wrote:
            | I recently looked into this as well, specifically how the
            | two formats differ. As it stands right now, the "Feather"
            | file format seems to be a synonym for the Arrow IPC file
            | format or "Arrow files" [0]. There should be basically no
            | overhead while reading into the Arrow memory format [1].
            | Parquet files, on the other hand, are stored in a different
            | format and therefore incur some overhead while reading into
            | memory, but they offer more advanced mechanisms for on-disk
            | encoding and compression [1].
           | 
           | As far as I can tell the main trade-off seems to be around
           | deserialization overhead vs on disk file size. If anyone has
           | more information or experience with the topic I'd love to
           | hear!
           | 
           | [0] https://arrow.apache.org/faq/#what-about-the-feather-
           | file-fo... [1] https://arrow.apache.org/faq/#what-is-the-
           | difference-between...
           | 
           | EDIT:
           | 
           | More information:
           | https://news.ycombinator.com/item?id=34324649
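            | 
            | A small sketch of that trade-off with pyarrow (names are
            | made up; the exact numbers depend heavily on the data):
            | 
            |     import os
            |     import pyarrow as pa
            |     import pyarrow.feather as feather
            |     import pyarrow.parquet as pq
            | 
            |     n = 1_000_000
            |     table = pa.table({
            |         "id": list(range(n)),
            |         "value": [x * 0.5 for x in range(n)],
            |     })
            | 
            |     # Same table written in both formats
            |     feather.write_feather(table, "data.feather")
            |     pq.write_table(table, "data.parquet")
            | 
            |     # Parquet is usually smaller on disk; Feather is
            |     # usually cheaper to deserialise back into Arrow
            |     print(os.path.getsize("data.feather"),
            |           os.path.getsize("data.parquet"))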
        
             | RobinL wrote:
             | This is also my understanding - see
             | https://news.ycombinator.com/item?id=34324649
        
               | reichardt wrote:
               | Thanks! Just stumbled across your comment as well.
        
         | sanderjd wrote:
         | Since you know a bunch about this, I'm going to ask you a
         | question that I was about to research: If I have a dataset in
         | memory in Arrow, but I want to cache it to disk to read back in
         | later, what is the most efficient way to do that at this moment
         | in time? Is it to write to parquet and read the parquet back
         | into memory, or is there a more efficient way to write the
         | native Arrow format such that it can be read back in directly?
          | I think this sounds kind of like Flight, except that my
          | understanding is that Flight is intended for moving the data
          | across a network rather than temporally, across a disk.
        
           | sdz wrote:
           | You're probably looking for the Arrow IPC format [1], which
           | writes the data in close to the same format as the memory
           | layout. On some platforms, reading this back is just an mmap
           | and can be done with zero copying. Parquet, on the other
           | hand, is a somewhat more complex format and there will be
           | some amount of encoding and decoding on read/write. Flight is
           | an RPC framework that essentially sends Arrow data around in
           | IPC format.
           | 
           | [1] https://arrow.apache.org/docs/python/ipc.html
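            | 
            | A minimal sketch with pyarrow (the file name is made up):
            | 
            |     import pyarrow as pa
            | 
            |     table = pa.table({"x": [1, 2, 3]})
            | 
            |     # Write the table out as an Arrow IPC file
            |     with pa.OSFile("cache.arrow", "wb") as sink:
            |         with pa.ipc.new_file(sink, table.schema) as writer:
            |             writer.write_table(table)
            | 
            |     # Read it back via mmap; on most platforms this does
            |     # little or no copying of the underlying buffers
            |     with pa.memory_map("cache.arrow") as source:
            |         restored = pa.ipc.open_file(source).read_all()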
        
           | RobinL wrote:
           | I'm not an expert in the nuts and bolts of Arrow, but I think
           | you have two options:
           | 
           | - Save to feather format. Feather format is essentially the
           | same thing as the Arrow in-memory format. This is
           | uncompressed and so if you have super fast IO, it'll read
           | back to memory faster, or at least, with minimal CPU usage.
           | 
           | - Save to compressed parquet format. Because you're often IO
           | bound, not CPU bound, this may read back to memory faster, at
           | the expense of the CPU usage of decompressing.
           | 
           | On a modern machine with a fast SSD, I'm not sure which would
           | be faster. If you're saving to remote blob storage e.g. S3,
           | parquet will almost certainly be faster.
           | 
           | See also https://news.ycombinator.com/item?id=34324649
        
             | sanderjd wrote:
             | Thanks! Exactly what I was looking for. I'll do some
             | benchmarking of these two options for my workload.
        
       | kajika91 wrote:
        | I prefer looking at benchmarks:
       | https://towardsdatascience.com/the-best-format-to-save-panda...
       | 
        | I have used Arrow and even made my humble contribution to the
        | Go binding, but I don't like pretending it is so much better
        | than other solutions. It is not a silver bullet, and probably
        | its best pro is the "non-copy" goal of converting data into
        | different frameworks' objects. Depending on the use for the
        | data, a columnar layout can be better, but not always.
        
       | kordlessagain wrote:
       | FeatureBase uses Arrow for processing data stored in bitmap
       | format: https://featurebase.com/
        
       | mjburgess wrote:
       | > Learning more about a tool that can filter and aggregate two
       | billion rows on a laptop in two seconds
       | 
        | If someone has a code example to this effect, I'd be grateful.
       | 
       | I was once engaged in a salesy pitch by a cloud advocate that
       | BigQuery (et al.) can "process a billion rows a second".
       | 
       | I tried to create an SQLite example with a billion rows to show
       | that this isn't impressive, but I gave up after some obstacles to
       | generating the data.
       | 
        | It would be nice to have an example like this to show
        | developers (and engineers) who have become accustomed to the
        | extreme levels of CPU abuse today, to show that modern laptops
        | really are supercomputers.
       | 
        | It should be obvious that a laptop can rival a data centre at
        | 90% of ordinary tasks; that it isn't, in my view, has a lot to
        | do with the state of OS/browser/app/etc. design & performance.
       | Supercomputers, alas, dedicated to drawing pixels by way of a
       | dozen layers of indirection.
        
         | StreamBright wrote:
          | You can try the examples, or DataFusion with Flight. I have
          | been able to process data with that setup in Rust in
          | milliseconds when it usually takes tens of seconds with a
          | distributed query engine. I think Rust combined with Arrow,
          | Flight and Parquet can be a game changer for analytics after
          | a decade of Java with Hadoop & co.
        
           | cmollis wrote:
            | Completely agree with this. Rust and Arrow will be part of
            | the next set of toolsets for data engineering. Spark is
            | great and I use it every day, but it's big and cumbersome
            | to use. There are use-cases today that are being addressed
            | by DataFusion, DuckDB (and, to a certain extent, pandas)..
            | that will continue to evolve.. hopefully Ballista can
            | mature to a point where it's a real Spark alternative for
            | distributed computations. Spark isn't standing still of
            | course, and we're already seeing a lot of different drop-in
            | C++ SQL engines.. but moving entirely away from the JVM
            | would be a watershed, IMO.
        
         | RobinL wrote:
         | An example using R code is here:
         | https://arrow.apache.org/docs/r/articles/dataset.html
         | 
          | The speed comes from the raw speed of Arrow, but also a
          | 'trick'. If you apply a filter, it is pushed down to the raw
          | parquet files, so some files don't need to be read at all
          | thanks to the hive-style organisation.
         | 
         | Another trick is that parquet files store some summary
         | statistics in their metadata. This means, for example, that if
         | you want to find the max of a column, only the metadata needs
         | to be read, rather than the data itself.
         | 
          | I'm a Python user myself, but the code would be comparable on
          | the Python side.
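          | 
          | On the Python side, a rough equivalent uses pyarrow datasets
          | (the path and column names here are hypothetical):
          | 
          |     import pyarrow.dataset as ds
          | 
          |     # A directory of hive-partitioned parquet files, e.g.
          |     # data/year=2022/month=01/part-0.parquet
          |     dataset = ds.dataset("data/", format="parquet",
          |                          partitioning="hive")
          | 
          |     # The filter is pushed down, so partitions and row
          |     # groups that can't match are skipped rather than read
          |     table = dataset.to_table(
          |         filter=(ds.field("year") == 2022)
          |                & (ds.field("amount") > 100)
          |     )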
        
         | tveita wrote:
          | ClickHouse and DuckDB are databases I would look at that
          | support this use case pretty much "out of the box".
         | 
         | E.g. https://benchmark.clickhouse.com has some query times for
         | a 100 million row dataset.
        
           | spaniard89277 wrote:
            | DuckDB is so simple to work with. It's only worth looking
            | elsewhere with _real_ big data, or where you really need a
            | client-server setup.
           | 
           | I hope it receives more love.
        
             | IanCal wrote:
              | DuckDB is outrageously useful. Great on its own, but it
              | slots in perfectly, reading and handing back Arrow data
              | frames, meaning you can seamlessly swap between tools:
              | SQL for some parts and other tools where they're better.
              | Also very fast. I was able to throw away designs for
              | multi-machine setups, as DuckDB on its own was fast
              | enough to not worry about anything else.
        
           | intelVISA wrote:
           | Having used all three I'd go with Clickhouse/DuckDB over
           | Arrow every time.
        
             | sanderjd wrote:
             | Oh interesting - why?
        
               | intelVISA wrote:
               | They're easier to use and faster is the tl;dr.
        
               | nlittlepoole wrote:
               | 100% agree.
        
               | lmeyerov wrote:
               | Probably for SQL (top n, ...), but not for wrangling &
               | analytics & ML & ai & viz
        
         | [deleted]
        
         | thinkharderdev wrote:
         | You can see some of the benchmarks in DataFusion (part of the
         | Arrow project and built with Arrow as the underlying in-memory
         | format) https://github.com/apache/arrow-
         | datafusion/blob/master/bench...
         | 
         | Disclaimer: I'm a committer on the Arrow project and
         | contributor to DataFusion.
        
         | mihevc wrote:
         | Here are some cookbook examples:
         | https://arrow.apache.org/cookbook/py/data.html#group-a-table,
         | https://arrow.apache.org/cookbook/. Datasets would probably be
         | a good approach for the billions size, see:
         | https://blog.djnavarro.net/posts/2022-11-30_unpacking-arrow-...
        
         | cube2222 wrote:
         | Generally, operating on raw numbers in a columnar layout is
         | very very fast, even if you just write it as a straightforward
         | loop.
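          | 
          | As a toy illustration (not a benchmark), summing a column
          | that is already one contiguous array of numbers is just a
          | tight loop over memory:
          | 
          |     import pyarrow as pa
          |     import pyarrow.compute as pc
          | 
          |     col = pa.array(range(10_000_000), type=pa.int64())
          | 
          |     # The sum kernel is a vectorised loop over the
          |     # contiguous buffer behind the array
          |     total = pc.sum(col).as_py()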
        
       | agumonkey wrote:
       | Very interesting project
       | 
       | ps: a tiny video to explain storage layout optimizations
       | https://yewtu.be/watch?v=dPb2ZXnt2_U
        
       | alamb wrote:
       | Here is another blog post that offers some perspective on the
       | growth of Arrow over the intervening years and future directions:
       | https://www.datawill.io/posts/apache-arrow-2022-reflection/
        
         | RobinL wrote:
         | That's really good, thanks. Better than my blog, actually - the
         | author has a much deeper understanding and I learned a lot by
         | reading it. I was coming at it from the perspective of someone
         | very confused by Arrow, and wrote the blog to help myself
         | understand it!
        
       | gizmodo59 wrote:
        | There are a bunch of other projects that grew out of Arrow which
       | are also contributing a lot to data engineering:
       | https://www.dremio.com/blog/apache-arrows-rapid-growth-over-...
        
       | flakiness wrote:
        | FYI: A recent "Data Analysis Podcast" episode interviews Arrow
        | founder Wes McKinney on this topic.
       | 
       | https://roundup.getdbt.com/p/ep-37-what-does-apache-arrow-un...
        
       | rr888 wrote:
        | I always thought the file format was going to be tightly bound
        | to Arrow, but it looks like they aren't encouraging Feather.
        | Should we just be using Parquet for file storage?
        
         | RobinL wrote:
         | Yes - save to parquet. From the OP:
         | 
         | "Why not just persist the data to disk in Arrow format, and
         | thus have a single, cross-language data format that is the same
         | on-disk and in-memory? One of the biggest reasons is that
         | Parquet generally produces smaller data files, which is more
         | desirable if you are IO-bound. This will especially be the case
          | if you are loading data from cloud storage such as AWS S3.
         | 
         | Julien LeDem explains this further in a blog post discussing
         | the two formats:
         | 
         | >> The trade-offs for columnar data are different for in-
         | memory. For data on disk, usually IO dominates latency, which
         | can be addressed with aggressive compression, at the cost of
         | CPU. In memory, access is much faster and we want to optimise
         | for CPU throughput by paying attention to cache locality,
         | pipelining, and SIMD instructions.
         | https://www.kdnuggets.com/2017/02/apache-arrow-parquet-
         | colum..."
        
           | mempko wrote:
           | I opted to store feather for one particular reason. You can
           | open it using mmap and randomly index the data without having
            | to load it all in memory. Also, the data I have isn't very
            | compressible to begin with, so the CPU cost vs data savings
            | of parquet don't make sense. This only makes sense in that
           | narrow use case.
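            | 
            | Roughly like this (a sketch; the file name is made up),
            | since Feather v2 is just the Arrow IPC file format:
            | 
            |     import pyarrow as pa
            | 
            |     # Memory-map the file; nothing is read eagerly
            |     with pa.memory_map("big.feather") as source:
            |         reader = pa.ipc.open_file(source)
            |         # Grab one record batch and a slice of rows
            |         # without materialising the whole table
            |         batch = reader.get_batch(0)
            |         rows = batch.slice(1000, 10)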
        
             | _frkl wrote:
              | I'm doing the same. It's also quite nice for de-
              | duplication: a lot of operations on our data happen on a
              | column basis, and we need to assemble tables that are
              | basically the same, except for one or two computed columns.
              | I usually store each column in a separate file, and
             | assemble tables on the fly, also memory-mapped. Quite happy
             | with being able to do that. Not sure how easy that would be
             | with parquet.
        
       | Lyngbakr wrote:
        | This has been a game changer for us. When our analysts run
        | queries on parquet files using Arrow, they are orders of
        | magnitude faster than equivalent SQL queries on databases.
        
         | BeefWellington wrote:
         | Were you working off proper data warehouses, or just the
         | transactional db?
         | 
         | I ask because something a lot of people miss here is how much
         | performance you can get from the T part of ETL. Denormalizing
         | everything into big simple inflated tables makes things orders
         | of magnitude faster. It matters quite a bit what your
         | comparison is against.
        
           | Lyngbakr wrote:
           | We saw major improvements when we simply wrote full tables
           | from a transactional database to parquet, but also, as you
           | say, modelling the data appropriately produced significant
           | improvements, too.
        
             | benjaminwootton wrote:
              | A column-oriented database is probably the bigger
             | performance increase. Parquet and a good data warehouse
             | (something like Clickhouse, Druid or Snowflake) will both
             | use metadata and efficient scans to power through
             | aggregation queries.
        
         | RobinL wrote:
         | Author here! I've actually just written a separate blog post on
         | a similar topic! https://www.robinlinacre.com/parquet_api/
         | 
         | Parquet seems to be on a path to become the de facto standard
         | for storing and sharing bulk data - for good reason!
         | (discussion: https://news.ycombinator.com/item?id=34310695)
        
           | the_black_hand wrote:
           | very cool post. thanks
        
       | amayui wrote:
       | We've been thinking about using Parquet as a serialization format
       | during data exchange in our project as well.
        
       | d_burfoot wrote:
       | Can someone comment on the code quality of Arrow vs other Apache
       | data engineering tools?
       | 
        | I have been burned so many times by amateur-hour software
        | engineering failures from the Apache world that it's very hard
        | for me to ever willingly adopt anything from that brand again.
        | Just put it in gzipped JSON or TSV and hey, if there's a
        | performance penalty, it's better to pay a bit more for cloud
        | compute than to hate your job because of some nonsense
        | dependency issue caused by an org.apache library failing to
        | follow proper versioning guidelines.
        
         | rfoo wrote:
         | Arrow the format is pretty good, there are occasional quirks
         | (null bitmap has 1 = non-null etc) but no big deal.
         | 
          | From my experience, Arrow the C++ implementation is pretty
          | solid too, though I don't like it (taste). I just don't like
          | their "force std::shared_ptr over Array, Table, Schema and
          | basically everything" approach: why not use an intrusive ref
          | count if the object can only be held by shared_ptr anyway?
          | There are also a lot of const std::shared_ptr<Array>&
          | arguments on functions where it isn't obvious when they take
          | ownership. And there's immutable Array + ArrayBuilder (versus
          | COW/switching between mutable uniquely owned and immutable
          | shared in ClickHouse and friends), so if you have to fill the
          | data out of order you are forced to buffer it on your side.
         | 
          | Do note that a compute engine (e.g. Velox) may still need to
          | implement its own (Arrow-compatible) array types, as there
          | aren't many fancy encodings in Arrow the format.
        
         | sanderjd wrote:
         | Arrow (and the ecosystem around it that I've looked into,
         | namely DataFusion) seems really solid and well-engineered to
         | me.
        
       | hermitcrab wrote:
       | I have written a desktop data wrangling/ETL tool for Windows/Mac
       | (Easy Data Transform). It is designed to handle millions of rows
       | (but not billions). Currently it mostly inputs and outputs CSV,
       | Excel, XML and JSON. I am looking to add some additional formats
       | in future, such as SQLite, Parquet or DuckBD. Maybe I need to
       | look at Feather as well? I could also use one of these formats to
       | store intermediate datasets to disk, rather than holding
       | everything in memory. If anyone has any experience in integrating
       | any of these formats into a C++ application on Windows and/or
       | Mac, I would be interested to hear how you got on and what
       | libraries you used.
        
       ___________________________________________________________________
       (page generated 2023-01-10 23:01 UTC)