[HN Gopher] Demystifying Apache Arrow (2020)
___________________________________________________________________

Demystifying Apache Arrow (2020)

Author : dmlorenzetti
Score  : 174 points
Date   : 2023-01-09 15:36 UTC (1 day ago)

(HTM) web link (www.robinlinacre.com)
(TXT) w3m dump (www.robinlinacre.com)

| RobinL wrote:
| Author here. Since I wrote this, Arrow seems to be more and more
| pervasive. As a data engineer, I find the adoption of Arrow (and
| parquet) as a data exchange format has so much value. It's
| amazing how much time my colleagues and I have spent on data type
| issues that have arisen from the wide range of data tooling (R,
| Pandas, Excel, etc.). So much so that I try to stick to parquet,
| using SQL where possible to easily preserve data types (pandas is
| a particularly bad offender for managing data types).
|
| In doing so, I'm implicitly using Arrow - e.g. with DuckDB, AWS
| Athena and so on. The list of tools using Arrow is long!
| https://arrow.apache.org/powered_by/
|
| Another interesting development since I wrote this is DuckDB.
|
| DuckDB offers a compute engine with great performance against
| parquet files and other formats - probably similar performance to
| Arrow. It's interesting that they opted to write their own
| compute engine rather than use Arrow's - but I believe this is
| partly because Arrow was immature when they were starting out. I
| mention it because, as far as I know, there's not yet an easy SQL
| interface to Arrow from Python.
|
| Nonetheless, DuckDB still uses Arrow for some of its other
| features: https://duckdb.org/2021/12/03/duck-arrow.html
|
| Arrow also has a SQL query engine:
| https://arrow.apache.org/blog/2019/02/04/datafusion-donation...
|
| I might be wrong about this - but in my experience, it feels like
| there's more consensus around the Arrow format, as opposed to the
| compute side.
|
| Going forward, I see parquet continuing on its path to becoming a
| de facto standard for storing and sharing bulk data. I'm
| particularly excited about new tools that allow you to process it
| in the browser. I've written more about this just yesterday:
| https://www.robinlinacre.com/parquet_api/, discussion:
| https://news.ycombinator.com/item?id=34310695.

| hintymad wrote:
| Thanks for sharing your insights. Any comments on Feather vs
| Parquet? If we don't need to support tools that can only
| interact with Parquet, how will Feather pan out as a Parquet
| alternative (or is Feather unsuitable as such an alternative at
| all)?

| reichardt wrote:
| I recently looked into this as well, specifically how the two
| formats differ. As it stands right now, the "Feather" file
| format seems to be a synonym for the Arrow IPC file format, or
| "Arrow files" [0]. There should be basically no overhead while
| reading into the Arrow memory format [1]. Parquet files, on the
| other hand, are stored in a different format and therefore incur
| some overhead while reading into memory, but offer more advanced
| mechanisms for on-disk encoding and compression [1].
|
| As far as I can tell, the main trade-off seems to be around
| deserialization overhead vs on-disk file size. If anyone has
| more information or experience with the topic I'd love to hear!
|
| [0] https://arrow.apache.org/faq/#what-about-the-feather-file-fo...
| [1] https://arrow.apache.org/faq/#what-is-the-difference-between...
|
| EDIT: More information:
| https://news.ycombinator.com/item?id=34324649
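For illustration, a minimal pyarrow sketch of the trade-off reichardt
describes (assuming pyarrow is installed; the file names and example
table are illustrative): the same table written as uncompressed
Feather/Arrow IPC and as compressed Parquet, so on-disk size can be
traded against decode cost.

    import os

    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # A stand-in table; real datasets will compress differently.
    table = pa.table({
        "id": list(range(1_000_000)),
        "value": [float(i % 100) for i in range(1_000_000)],
    })

    # Feather (Arrow IPC): essentially the in-memory layout on disk,
    # so reading it back needs little or no decoding.
    feather.write_feather(table, "data.feather", compression="uncompressed")

    # Parquet: re-encoded and compressed - smaller files, more CPU to read.
    pq.write_table(table, "data.parquet", compression="snappy")

    print(os.path.getsize("data.feather"), os.path.getsize("data.parquet"))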
| RobinL wrote:
| This is also my understanding - see
| https://news.ycombinator.com/item?id=34324649

| reichardt wrote:
| Thanks! Just stumbled across your comment as well.

| sanderjd wrote:
| Since you know a bunch about this, I'm going to ask you a
| question that I was about to research: if I have a dataset in
| memory in Arrow, but I want to cache it to disk to read back in
| later, what is the most efficient way to do that at this moment
| in time? Is it to write to parquet and read the parquet back
| into memory, or is there a more efficient way to write the
| native Arrow format such that it can be read back in directly?
| This sounds kind of like Flight, except that my understanding is
| that Flight is intended for moving the data across a network
| rather than temporally across a disk.

| sdz wrote:
| You're probably looking for the Arrow IPC format [1], which
| writes the data in close to the same format as the memory
| layout. On some platforms, reading this back is just an mmap
| and can be done with zero copying. Parquet, on the other hand,
| is a somewhat more complex format, and there will be some
| amount of encoding and decoding on read/write. Flight is an RPC
| framework that essentially sends Arrow data around in IPC
| format.
|
| [1] https://arrow.apache.org/docs/python/ipc.html

| RobinL wrote:
| I'm not an expert in the nuts and bolts of Arrow, but I think
| you have two options:
|
| - Save to feather format. Feather format is essentially the
| same thing as the Arrow in-memory format. This is uncompressed,
| so if you have super fast IO it'll read back to memory faster,
| or at least with minimal CPU usage.
|
| - Save to compressed parquet format. Because you're often IO
| bound, not CPU bound, this may read back to memory faster, at
| the expense of the CPU cost of decompression.
|
| On a modern machine with a fast SSD, I'm not sure which would
| be faster. If you're saving to remote blob storage, e.g. S3,
| parquet will almost certainly be faster.
|
| See also https://news.ycombinator.com/item?id=34324649

| sanderjd wrote:
| Thanks! Exactly what I was looking for. I'll do some
| benchmarking of these two options for my workload.
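To make sdz's mmap point concrete, a hedged sketch in Python (the
file name is illustrative): an uncompressed Arrow IPC file written to
disk, then read back through a memory map, so the loaded table's
buffers reference the mapped file rather than being copied up front.

    import pyarrow as pa
    import pyarrow.ipc as ipc

    table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})

    # Write the table in the IPC file format (a.k.a. Feather v2).
    with pa.OSFile("cache.arrow", "wb") as sink:
        with ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Read it back via a memory map: close to zero-copy on load.
    with pa.memory_map("cache.arrow", "r") as source:
        loaded = ipc.open_file(source).read_all()

    print(loaded.equals(table))  # True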
| kajika91 wrote:
| I prefer looking at benchmarks:
| https://towardsdatascience.com/the-best-format-to-save-panda...
|
| I have used Arrow and even made my humble contribution to the Go
| binding, but I don't like pretending it is so much better than
| other solutions. It is not a silver bullet, and probably its best
| selling point is the zero-copy goal for converting data into
| different frameworks' objects. Depending on how the data is used,
| a columnar layout can be better, but not always.

| kordlessagain wrote:
| FeatureBase uses Arrow for processing data stored in bitmap
| format: https://featurebase.com/

| mjburgess wrote:
| > Learning more about a tool that can filter and aggregate two
| billion rows on a laptop in two seconds
|
| If someone has a code example to this effect, I'd be grateful.
|
| I was once engaged in a salesy pitch by a cloud advocate that
| BigQuery (et al.) can "process a billion rows a second".
|
| I tried to create an SQLite example with a billion rows to show
| that this isn't impressive, but I gave up after some obstacles to
| generating the data.
|
| It would be nice to have an example like this to show developers
| (and engineers) who have become accustomed to the extreme levels
| of CPU abuse today that modern laptops really are supercomputers.
|
| It should be obvious that a laptop can rival a data centre at 90%
| of ordinary tasks; that it isn't, in my view, has a lot to do
| with the state of OS/browser/app design and performance.
| Supercomputers, alas, dedicated to drawing pixels by way of a
| dozen layers of indirection.

| StreamBright wrote:
| You can try the examples of DataFusion with Flight. With that
| setup in Rust, I have been able to process in milliseconds data
| that usually takes tens of seconds with a distributed query
| engine. I think Rust combined with Arrow, Flight and Parquet can
| be a game changer for analytics after a decade of Java with
| Hadoop & co.

| cmollis wrote:
| Completely agree with this. Rust and Arrow will be part of the
| next set of toolsets for data engineering. Spark is great and I
| use it every day, but it's big and cumbersome to use. There are
| use cases today that are being addressed by DataFusion, DuckDB
| and (to a certain extent) pandas, and that will continue to
| evolve. Hopefully Ballista can mature to a point where it's a
| real Spark alternative for distributed computations. Spark isn't
| standing still, of course, and we're already seeing a lot of
| different drop-in C++ SQL engines, but moving entirely away from
| the JVM would be a watershed, IMO.

| RobinL wrote:
| An example using R code is here:
| https://arrow.apache.org/docs/r/articles/dataset.html
|
| The speed comes from the raw speed of Arrow, but also from a
| 'trick': if you apply a filter, it is pushed down to the raw
| parquet files, so thanks to the hive-style organisation some
| files don't need to be read at all.
|
| Another trick is that parquet files store some summary
| statistics in their metadata. This means, for example, that if
| you want to find the max of a column, only the metadata needs to
| be read, rather than the data itself.
|
| I'm a Python user myself, and the code would be comparable on
| the Python side.
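The Python equivalent is pyarrow.dataset. A hedged sketch (the
directory layout, column names and row-group index are illustrative):
the filter is pushed down, so hive partitions and row groups whose
statistics rule out a match are skipped, and per-column min/max
statistics can be read from the parquet footer without touching the
data pages.

    import pyarrow.compute as pc
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # Hive-partitioned directory, e.g. data/year=2022/part-0.parquet
    dataset = ds.dataset("data/", format="parquet", partitioning="hive")

    # Partition and row-group pruning happen inside to_table().
    table = dataset.to_table(
        columns=["fare", "passenger_count"],
        filter=(ds.field("year") == 2022) & (ds.field("fare") > 100.0),
    )
    print(pc.max(table["fare"]))

    # Row-group statistics live in the file footer, so questions like
    # "what is the max of this column?" can be answered from metadata.
    meta = pq.ParquetFile("data/year=2022/part-0.parquet").metadata
    print(meta.row_group(0).column(0).statistics.max)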
| tveita wrote:
| Clickhouse and DuckDB are databases I would look at that support
| this use case pretty much "out of the box".
|
| E.g. https://benchmark.clickhouse.com has some query times for a
| 100 million row dataset.

| spaniard89277 wrote:
| DuckDB is so simple to work with. It's only worth looking
| elsewhere with _real_ big data, or where you really need a
| client-server setup.
|
| I hope it receives more love.

| IanCal wrote:
| DuckDB is outrageously useful. Great on its own, but it also
| slots in perfectly, reading and handing back Arrow data frames,
| meaning you can seamlessly swap between tools - SQL for some
| parts, and other tools where they're better. Also very fast. I
| was able to throw away designs for multi-machine setups, as
| DuckDB on its own was fast enough that I didn't need to worry
| about anything else.

| intelVISA wrote:
| Having used all three, I'd go with Clickhouse/DuckDB over Arrow
| every time.

| sanderjd wrote:
| Oh interesting - why?

| intelVISA wrote:
| They're easier to use and faster, is the tl;dr.

| nlittlepoole wrote:
| 100% agree.

| lmeyerov wrote:
| Probably for SQL (top n, ...), but not for wrangling &
| analytics & ML & AI & viz.

| [deleted]

| thinkharderdev wrote:
| You can see some of the benchmarks in DataFusion (part of the
| Arrow project and built with Arrow as the underlying in-memory
| format):
| https://github.com/apache/arrow-datafusion/blob/master/bench...
|
| Disclaimer: I'm a committer on the Arrow project and a
| contributor to DataFusion.

| mihevc wrote:
| Here are some cookbook examples:
| https://arrow.apache.org/cookbook/py/data.html#group-a-table,
| https://arrow.apache.org/cookbook/. Datasets would probably be a
| good approach at the billion-row scale, see:
| https://blog.djnavarro.net/posts/2022-11-30_unpacking-arrow-...

| cube2222 wrote:
| Generally, operating on raw numbers in a columnar layout is
| very, very fast, even if you just write it as a straightforward
| loop.

| agumonkey wrote:
| Very interesting project.
|
| PS: a tiny video explaining storage layout optimizations:
| https://yewtu.be/watch?v=dPb2ZXnt2_U

| alamb wrote:
| Here is another blog post that offers some perspective on the
| growth of Arrow over the intervening years and future directions:
| https://www.datawill.io/posts/apache-arrow-2022-reflection/

| RobinL wrote:
| That's really good, thanks. Better than my blog, actually - the
| author has a much deeper understanding, and I learned a lot by
| reading it. I was coming at it from the perspective of someone
| very confused by Arrow, and wrote the blog to help myself
| understand it!

| gizmodo59 wrote:
| There are a bunch of other projects that grew out of Arrow which
| are also contributing a lot to data engineering:
| https://www.dremio.com/blog/apache-arrows-rapid-growth-over-...

| flakiness wrote:
| FYI: a recent "Data Analysis Podcast" episode interviews the
| Arrow founder Wes McKinney on this topic.
|
| https://roundup.getdbt.com/p/ep-37-what-does-apache-arrow-un...

| rr888 wrote:
| I always thought the file format was going to be tightly bound
| to Arrow, but it looks like they aren't encouraging Feather.
| Should we just be using Parquet for file storage?

| RobinL wrote:
| Yes - save to parquet. From the OP:
|
| "Why not just persist the data to disk in Arrow format, and thus
| have a single, cross-language data format that is the same
| on-disk and in-memory? One of the biggest reasons is that
| Parquet generally produces smaller data files, which is more
| desirable if you are IO-bound. This will especially be the case
| if you are loading data from cloud storage such as AWS S3.
|
| Julien LeDem explains this further in a blog post discussing the
| two formats:
|
| >> The trade-offs for columnar data are different for in-memory.
| For data on disk, usually IO dominates latency, which can be
| addressed with aggressive compression, at the cost of CPU. In
| memory, access is much faster and we want to optimise for CPU
| throughput by paying attention to cache locality, pipelining,
| and SIMD instructions.
| https://www.kdnuggets.com/2017/02/apache-arrow-parquet-colum..."

| mempko wrote:
| I opted to store feather for one particular reason: you can open
| it using mmap and randomly index the data without having to load
| it all into memory. Also, the data I have isn't very
| compressible to begin with, so the CPU cost vs data savings of
| parquet doesn't make sense. This only applies in that narrow use
| case.

| _frkl wrote:
| I'm doing the same. It's also quite nice for de-duplication: a
| lot of operations on our data happen on a column basis, and we
| need to assemble tables that are basically the same, except for
| one or two computed columns. I usually store each column in a
| separate file and assemble tables on the fly, also
| memory-mapped. Quite happy with being able to do that. Not sure
| how easy that would be with parquet.
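A hedged sketch of the per-column layout _frkl describes (assuming
pyarrow; the file names and load_column helper are illustrative):
each column is written to its own single-column Feather file, and
tables are assembled on the fly from memory-mapped reads, so swapping
in a computed column doesn't mean rewriting the shared ones.

    import pyarrow as pa
    import pyarrow.feather as feather

    source = pa.table({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

    # Write each column as its own single-column, uncompressed file.
    for name in source.column_names:
        feather.write_feather(source.select([name]),
                              f"col_{name}.feather",
                              compression="uncompressed")

    # Assemble a table on the fly from memory-mapped column files.
    def load_column(name):
        return feather.read_table(f"col_{name}.feather", memory_map=True)

    combined = pa.table({name: load_column(name)[name]
                         for name in ["a", "b"]})
    print(combined)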
| Lyngbakr wrote:
| This has been a game changer for us. When our analysts run
| queries on parquet files using Arrow, they are orders of
| magnitude faster than equivalent SQL queries on our databases.

| BeefWellington wrote:
| Were you working off proper data warehouses, or just the
| transactional DB?
|
| I ask because something a lot of people miss here is how much
| performance you can get from the T part of ETL. Denormalizing
| everything into big, simple, inflated tables makes things orders
| of magnitude faster. It matters quite a bit what your comparison
| is against.

| Lyngbakr wrote:
| We saw major improvements when we simply wrote full tables from
| a transactional database to parquet, but, as you say, modelling
| the data appropriately produced significant improvements too.

| benjaminwootton wrote:
| A column-oriented database is probably the bigger performance
| increase. Parquet and a good data warehouse (something like
| Clickhouse, Druid or Snowflake) will both use metadata and
| efficient scans to power through aggregation queries.

| RobinL wrote:
| Author here! I've actually just written a separate blog post on
| a similar topic: https://www.robinlinacre.com/parquet_api/
|
| Parquet seems to be on a path to become the de facto standard
| for storing and sharing bulk data - for good reason!
| (discussion: https://news.ycombinator.com/item?id=34310695)

| the_black_hand wrote:
| Very cool post, thanks.

| amayui wrote:
| We've been thinking about using Parquet as a serialization
| format during data exchange in our project as well.

| d_burfoot wrote:
| Can someone comment on the code quality of Arrow vs other Apache
| data engineering tools?
|
| I have been burned so many times by amateur-hour software
| engineering failures from the Apache world that it's very hard
| for me to ever willingly adopt anything from that brand again.
| Just put it in gzipped JSON or TSV, and hey, if there's a
| performance penalty, it's better to pay a bit more for cloud
| compute than to hate your job because of some nonsense
| dependency issue caused by an org.apache library failing to
| follow proper versioning guidelines.

| rfoo wrote:
| Arrow the format is pretty good; there are occasional quirks
| (the null bitmap uses 1 = non-null, etc.) but no big deal.
|
| From my experience, Arrow the C++ implementation is pretty solid
| too, though I don't like it (a matter of taste). I just don't
| like its "force std::shared_ptr over Array, Table, Schema and
| basically everything" approach - why not use an intrusive ref
| count if the object can only be held via shared_ptr anyway?
| There are also a lot of const std::shared_ptr<Array>& arguments
| on functions where it's not obvious when ownership is taken. And
| it uses immutable Array + ArrayBuilder (versus COW / switching
| between mutable uniquely-owned and immutable shared in
| ClickHouse and friends), so if you have to fill the data out of
| order, you are forced to buffer it on your side.
|
| Do note that a compute engine (e.g. Velox) may still need to
| implement its own (Arrow-compatible) array types, as there
| aren't many fancy encodings in Arrow the format.
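A small pyarrow sketch of the validity-bitmap convention rfoo
mentions (values illustrative): a set bit marks a slot as non-null,
so [1, None, 3] stores 0b00000101 in the first byte of its validity
buffer.

    import pyarrow as pa

    arr = pa.array([1, None, 3])

    # buffers()[0] is the validity bitmap: 1 means "valid" (non-null).
    first_byte = arr.buffers()[0].to_pybytes()[0]
    print(f"{first_byte:08b}")  # 00000101 -> slots 0 and 2 are set
    print(arr.null_count)       # 1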
| sanderjd wrote:
| Arrow (and the ecosystem around it that I've looked into, namely
| DataFusion) seems really solid and well-engineered to me.

| hermitcrab wrote:
| I have written a desktop data wrangling/ETL tool for Windows/Mac
| (Easy Data Transform). It is designed to handle millions of rows
| (but not billions). Currently it mostly inputs and outputs CSV,
| Excel, XML and JSON. I am looking to add some additional formats
| in future, such as SQLite, Parquet or DuckDB. Maybe I need to
| look at Feather as well? I could also use one of these formats
| to store intermediate datasets on disk, rather than holding
| everything in memory. If anyone has any experience integrating
| any of these formats into a C++ application on Windows and/or
| Mac, I would be interested to hear how you got on and what
| libraries you used.
___________________________________________________________________
(page generated 2023-01-10 23:01 UTC)