[HN Gopher] Apache Arrow 3.0 ___________________________________________________________________ Apache Arrow 3.0 Author : kylebarron Score : 244 points Date : 2021-02-03 19:43 UTC (3 hours ago) (HTM) web link (arrow.apache.org) (TXT) w3m dump (arrow.apache.org) | mbyio wrote: | I'm surprised they are still making breaking changes, and they | plan to make more (they are already working on a 4.0). | reilly3000 wrote: | I didn't see any breaking changes in the release notes but I | may have missed them. Maybe they don't use SemVer? | lidavidm wrote: | Arrow uses SemVer, but the library and the data format are | versioned separately: | https://arrow.apache.org/docs/format/Versioning.html | offtop5 wrote: | Does no one do load testing anymore? Anyone got a working mirror? | Thaxll wrote: | Last time I worked in ETL was with Hadoop, looks like a lot | happened. | macksd wrote: | There's actually a lot of overlap between Hadoop and Arrow's | origins - a lot of the projects that integrated early and the | founding contributors had been in the larger Hadoop ecosystem. | It's a very good sign IMO that you can hardly tell anymore - | very diverse community and wide adoption! | georgewfraser wrote: | Arrow is _the most important_ thing happening in the data | ecosystem right now. It's going to allow you to run your choice | of execution engine, on top of your choice of data store, as | though they were designed to work together. It will mostly be | invisible to users; the key thing that needs to happen is that | all the producers and consumers of batch data need to adopt Arrow | as the common interchange format. | | BigQuery recently implemented the storage API, which allows you | to read BQ tables, in parallel, in Arrow format: | https://cloud.google.com/bigquery/docs/reference/storage | | Snowflake has adopted Arrow as the in-memory format for their | JDBC driver, though to my knowledge there is still no way to | access data in _parallel_ from Snowflake, other than to export to | S3. | | As Arrow spreads across the ecosystem, users are going to start | discovering that they can store data in one system and query it | in another, at full speed, and it's going to be amazing. | wesm wrote: | Microsoft is also on top of this with their Magpie project | | http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf | | "A common, efficient serialized and wire format across data | engines is a transformational development. Many previous | systems and approaches (e.g., [26, 36, 38, 51]) have observed | the prohibitive cost of data conversion and transfer, | precluding optimizers from exploiting inter-DBMS performance | advantages. By contrast, in-memory data transfer cost between a | pair of Arrow-supporting systems is effectively zero. Many | major, modern DBMSs (e.g., Spark, Kudu, AWS Data Wrangler, | SciDB, TileDB) and data-processing frameworks (e.g., Pandas, | NumPy, Dask) have or are in the process of incorporating | support for Arrow and ArrowFlight. Exploiting this is key for | Magpie, which is thereby free to combine data from different | sources and cache intermediate data and results, without | needing to consider data conversion overhead." | data_ders wrote: | way cool! Is Magpie end-user facing anywhere yet? We were | using the azureml-dataprep library for a while which seems | similar but not all of Magpie | breck wrote: | Arrow is definitely one of the top 10 new things I'm most | excited about in the data science space, but not sure I'd call | it _the_ most important thing. ;) | | It is pretty awesome, however, particularly for folks like me | that are often hopping between Python/R/Javascript. I've | definitely got it on the roadmap for all my data science | libraries. | | Btw, Arquero from that UW lab looks really neat as well, and is | supporting Arrow out of the gate | (https://github.com/uwdata/arquero).
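As an aside, georgewfraser's BigQuery point above (parallel Arrow reads via the Storage Read API) can be tried from Python. A minimal sketch, assuming the google-cloud-bigquery-storage client and placeholder project/dataset/table names; treat the exact call names as approximate rather than authoritative:

    import pyarrow as pa
    from google.cloud.bigquery_storage import BigQueryReadClient, types

    client = BigQueryReadClient()
    session = client.create_read_session(
        parent="projects/my-project",  # hypothetical project
        read_session=types.ReadSession(
            table="projects/my-project/datasets/my_dataset/tables/my_table",
            data_format=types.DataFormat.ARROW,  # ask for Arrow record batches
        ),
        max_stream_count=4,  # streams can be read by separate workers in parallel
    )
    # Read a single stream here; each entry in session.streams could be
    # handed to its own process or thread for a parallel read.
    reader = client.read_rows(session.streams[0].name)
    batches = [page.to_arrow() for page in reader.rows(session).pages]
    table = pa.Table.from_batches(batches)  # already columnar, no row-by-row decoding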
| tristanz wrote: | Agreed! Thank you Arrow community for such a great project. | It's a long road but it opens up tremendous potential for | efficient data systems that talk to one another. The future | looks bright with so many vendors backing Arrow independently | and Wes McKinney founding Ursa Labs and now Ursa Computing. | https://ursalabs.org/blog/ursa-computing/ | rubicon33 wrote: | Wait so you're telling me I could store data as a PDF file, and | access it easily / quickly as SQL? | sethhochberg wrote: | If you found/wrote an adapter to translate your structured PDF | into Arrow's format, yes - the idea is that you can wire up | anything that can produce Arrow data to anything that can | consume Arrow data. | staticassertion wrote: | I'm kinda confused. Is that not the case for literally | everything? "You can send me data of format X, all I ask is | that you be able to produce format X" ? | | I'm assuming that I'm missing something fwiw, not trying to | diminish the value. | humbleMouse wrote: | The difference is that Arrow's mapping behind the scenes | enables automatic translation to any implemented "plugin" | that is on the user's implementation of Arrow. You can | extend Arrow's format to make it automatically map to | whatever you want, basically. | [deleted] | shafiemukhre wrote: | True. Arrow is awesome and Dremio is using it as their | built-in memory format as well. I tried it and it is incredibly | fast. The future of the data ecosystem is gonna be amazing. | waynesonfire wrote: | Uhh.. maybe. It's a serde that's trying to be cross-language / | platform. | | I guess it also offers some APIs to process the data so you can | minimize serde operations. But, I dunno. It's been hard to | understand the benefit of the library and the posts here don't | help. | chrisweekly wrote: | For those wondering what a SerDe is: | https://docs.serde.rs/serde/ | waynesonfire wrote: | I meant serialize / deserialize in the literal sense. | nevi-me wrote: | The term likely predates the Rust implementation. SerDe is | Serializer & Deserializer, which could be any framework or | tool that allows the serialization and deserialization of | data. | | I first came across the concept in Apache Hive. | TuringTest wrote: | If it works as a universal intermediate exchange language, it | could help standardize connections among disparate systems. | | When you have N systems, it takes N^2 translators to build | direct connections to transfer data between them; but it only | takes N translators if all of them can talk the same exchange | language. | waynesonfire wrote: | Can you define what a translator is? I don't understand | the complexity you're constructing. I have N systems and | they talk protobuf. What's the problem? | TuringTest wrote: | By a translator, I mean a library that allows accessing | data from different subsystems (either languages or OS | processes).
| | In this case, the advantages are that 1) Arrow is | language agnostic, so it's likely that it can be used as | a native library in your program and 2) it doesn't copy | data to make it accessible to another process, so it | saves a lot of marshalling / unmarshalling steps | (assuming both sides use data in tabular format, which is | typical of data analysis contexts). | [deleted] | wesm wrote: | There's no serde by design (aside from inspecting a tiny | piece of metadata indicating the location of each constituent | block of memory). So data processing algorithms execute | directly against the Arrow wire format without any | deserialization. | waynesonfire wrote: | Of course there is. There is always deserialization. The | data format is most definitely not native to the CPU. | wesm wrote: | I challenge you to have a closer look at the project. | | Deserialization by definition requires bytes or bits to | be relocated from their position in the wire protocol to | other data structures which are used for processing. | Arrow does not require any bytes or bits to be relocated. | So if a "C array of doubles" is not native to the CPU, | then I don't know what is. | throwaway894345 wrote: | Perhaps "zero-copy" is a more precise or well-defined | term? | waynesonfire wrote: | CPUs come in many flavors. One area where they differ is | in the way that bytes of a word are represented in | memory. Two common formats are Big Endian and Little | Endian. This is an example where a "C array of doubles" | would be incompatible and some form of deserialization | would be needed. | | My understanding is that an Apache Arrow library provides | an API to manipulate the format in a platform-agnostic | way. But to claim that it eliminates deserialization is | false. | mumblemumble wrote: | It's not just a serde. One of its key use cases is | eliminating serde. | waynesonfire wrote: | I just don't believe you. My CPU doesn't understand Apache | Arrow 3.0. | mumblemumble wrote: | So, there are several components to Arrow. One of them | transfers data using IPC, and naturally needs to | serialize. The other uses shared memory, which eliminates | the need for serde. | | Sadly, the latter isn't (yet) well supported anywhere but | Python and C++. If you can/do use it, though, data are | just kept as arrays in memory. Which is exactly what | the CPU wants to see. | kristjansson wrote: | Not GP post, but it might have been better stated as | 'eliminating serde overhead'. Arrow's RPC serialization | [1] is basically Protobuf, with a whole lot of hacks to | eliminate copies on both ends of the wire. So it's still | 'serde', but markedly more efficient for large blocks of | tabular-ish data. | | [1]: https://arrow.apache.org/docs/format/Flight.html | wesm wrote: | > Arrow's serialization is Protobuf | | Incorrect. Only Arrow Flight embeds the Arrow wire format | in a Protocol Buffer, but the Arrow protocol itself does | not use Protobuf. | kristjansson wrote: | Apologies, off base there. Edited with a pointer to | Flight :) | jrevels wrote: | Excited to see this release's official inclusion of the pure | Julia Arrow implementation [1]! | | It's so cool to be able to mmap Arrow memory and natively | manipulate it from within Julia with virtually no performance | overhead. Since the Julia compiler can specialize on the layout | of Arrow-backed types at runtime (just as it can with any other | type), the notion of needing to build/work with a separate | "compiler for fast UDFs" is rendered obsolete. | | It feels pretty magical when two tools like this compose so well | without either being designed with the other in mind - a | testament to the thoughtful design of both :) mad props to Jacob | Quinn for spearheading the effort to revive/restart Arrow.jl and | get the package into this release. | | [1] https://github.com/JuliaData/Arrow.jl
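The zero-copy mmap workflow jrevels describes for Julia has a close analogue in Python. A minimal sketch using pyarrow's memory_map and IPC file reader (the file name is a placeholder):

    import pyarrow as pa

    # Memory-map an Arrow IPC file ("Feather v2") rather than reading it into RAM.
    source = pa.memory_map("data.arrow", "r")
    table = pa.ipc.open_file(source).read_all()

    # The columns are views over the mapped buffers; nothing was decoded row by row.
    print(table.num_rows, table.schema)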
| jayd16 wrote: | Can someone dig into the pros and cons of the columnar aspect of | Arrow? To some degree there are many other data transfer formats | but this one seems to promote its columnar orientation. | | Things like e.g. protobuffers support hierarchical data which | seems like a superset of columns. Is there a benefit to a column- | based format? Is it an enforced simplification to ensure greater | compatibility or is there some other reason? | aldanor wrote: | - Selective/lazy access (e.g. I have 100 columns but I want to | quickly pull out/query just 2 of them). | | - Improved compression (e.g. a column of timestamps). | | - Flexible schemas being easy to manage (e.g. adding more | columns, or optional columns). | | - Vectorization/SIMD-friendly. | wging wrote: | Redshift's explanation is pretty good. Among other things, if | you only need a few columns there are entire blocks of data you | don't need to touch at all. | https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_st... | | It's truly magical when you scope down a SELECT to the columns | you need and see a query go blazing fast. Or maybe I'm easily | impressed. | mumblemumble wrote: | A columnar format is almost always what you want for analytical | workloads, because their access patterns tend to iterate ranges | of rows but select only a few columns at random. | | About the only thing protocol buffers has in common is that | it's a standardized binary format. The use case is largely non- | overlapping, though. Protobuf is meant for transmitting | monolithic datagrams, where the entire thing will be | transmitted and then decoded as a monolithic blob. It's also, | out of the box, not the best for efficiently transmitting | highly repetitive data. Column-oriented formats cut down on | some repetition of metadata, and also tend to be more | compressible because similar data tends to get clumped | together. | | Coincidentally, Arrow's format for transmitting data over a | network, Arrow Flight, uses protocol buffers as its messaging | format. Though the payload is still blocks of column-oriented | data, for efficiency. | tomnipotent wrote: | This is intended for analytical workloads where you're often | doing things that can benefit from vectorization (like SIMD). | It's much faster to SUM(X) when all values of X are neatly laid | out in-memory. | | It also has the added benefit of eliminating serialization and | deserialization of data between processes - a Python process | can now write to memory which is read by a C++ process that's | doing windowed aggregations, which are then written over the | network to another Arrow-compatible service that just copies | the data as-is from the network into local memory and resumes | working. | waynesonfire wrote: | > It also has the added benefit of eliminating serialization | and deserialization of data between processes | | Is that accurate? It still has to deserialize from Apache | Arrow format to whatever the CPU understands. | ianmcook wrote: | The Arrow Feather format is an on-disk representation of | Arrow memory. To read a Feather file, Arrow just copies it | byte for byte from disk into memory. Or Arrow can memory-map | a Feather file so you can operate on it without reading the | whole file into memory.
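To make the selective-access and SUM(X) points above concrete, a small pyarrow sketch (file and column names are made up); only the requested columns are read from the Parquet file, and the aggregation runs over a contiguous column:

    import pyarrow.parquet as pq
    import pyarrow.compute as pc

    # Read just two of the file's columns; the other column chunks on disk
    # are never touched.
    table = pq.read_table("events.parquet", columns=["user_id", "amount"])

    # Vectorized aggregation directly over the columnar buffers.
    total = pc.sum(table["amount"])
    print(total.as_py())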
| waynesonfire wrote: | That's exactly how I read every data format. | | The advantage you describe is in the operations that can be | performed against the data. It would be nice to see what | this API looks like and how it compares to flatbuffers / | pq. | | To help me understand this benefit, can you talk through | what it's like to add 1 to each record and write it back | to disk? | zten wrote: | The important part to focus on is _between processes_. | | Consider Spark and PySpark. The Python bits of Spark are in | a sidecar process to the JVM running Spark. If you ask | PySpark to create a DataFrame from Parquet data, it'll | instruct the Java process to load the data. Its in-memory | form will be Arrow. Now, if you want to manipulate that | data in PySpark using Python-only libraries, prior to the | adoption of Arrow it used to serialize and deserialize the | data between processes on the same host. With Arrow, this | process is simplified -- however, I'm not sure if it's | simplified by exchanging bytes that don't require | serialization/deserialization between the processes or by | literally sharing memory between the processes. The docs do | mention zero-copied shared memory. | [deleted] | georgewfraser wrote: | The main benefit of a columnar representation in _memory_ is | that it's more cache-friendly for a typical analytical workload. | For example, if I have a dataframe: | | (A int, B int, C int, D int) | | and I write: | | A + B | | In a columnar representation, all the As are next to each | other, and all the Bs are next to each other, so the process of | (A and B in memory) => (A and B in CPU registers) => (addition) | => (A + B result back to memory) will be a lot more efficient. | | In a row-oriented representation like protobuf, all your C and | D values are going to get dragged into the CPU registers | alongside the A and B values that you actually want. | | Column-oriented representation is also more friendly to SIMD | CPU instructions. You can still use SIMD with a row-oriented | representation, but you have to use gather-scatter operations | which makes the whole thing less efficient. | skratlo wrote: | Yay, another ad-tech support engine from Apache, great | [deleted] | dang wrote: | If curious see also | | 2020 https://news.ycombinator.com/item?id=23965209 | | 2018 (a bit) https://news.ycombinator.com/item?id=17383881 | | 2017 https://news.ycombinator.com/item?id=15335462 | | 2017 https://news.ycombinator.com/item?id=15594542 rediscussed | recently https://news.ycombinator.com/item?id=25258626 | | 2016 https://news.ycombinator.com/item?id=11118274 | | Also: related from a couple weeks ago | https://news.ycombinator.com/item?id=25824399 | | related from a few months ago | https://news.ycombinator.com/item?id=24534274 | | related from 2019 https://news.ycombinator.com/item?id=21826974 | humbleMouse wrote: | I worked at a large company a few years ago on a team | implementing this. It's super cool and works great. Definitely | where the future is headed. | mushufasa wrote: | Can someone ELI5 what problems are best solved by Apache Arrow? | hbcondo714 wrote: | The press release on the first version of Apache Arrow | discussed here 5 years ago is actually a good introduction IMO: | | https://news.ycombinator.com/item?id=11118274 | Diederich wrote: | Rather curious myself. | | https://en.wikipedia.org/wiki/Apache_Arrow was interesting, but | I think many of us would benefit from a broader, problem-focused | description of Arrow from someone in the know.
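Picking up waynesonfire's "add 1 to each record and write it back to disk" question above: a rough pyarrow sketch (file and column names are placeholders, and this rewrites the whole file rather than updating it in place):

    import pyarrow.feather as feather
    import pyarrow.compute as pc

    table = feather.read_table("data.feather")  # columnar read, no per-row parsing

    # Arrow data is immutable, so "add 1" builds a new column and a new
    # table; the untouched columns are reused as-is.
    idx = table.schema.get_field_index("x")
    bumped = table.set_column(idx, "x", pc.add(table["x"], 1))

    feather.write_feather(bumped, "data.feather")  # write the updated table back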
| foobarian wrote: | To really grok why this is useful, put yourself in the shoes of | a data warehouse user or administrator. This is a farm of | machines with disks that hold data too big to have in a typical | RDBMS. Because the data is so big and stored spread out across | machines, a whole parallel universe of data processing tools | and practices exists that run various computations over shards | of the data locally on each server, and merge the results etc. | (Map-reduce originally from Google was the first famous | example). Then there is a cottage industry of wrappers around | this kind of system to let you use SQL to query the data, make | it faster, let you build cron jobs and pipelines with | dependencies, etc. | | Now, so far all these tools did not really have a common | interchange format for data, so there was a lot of wheel | reinvention and incompatibility. Got file system layer X on | your 10PB cluster? Can't use it with SQL engine Y. And I guess | this is where Arrow comes in, where if everyone uses it then | interop will get a lot better and each individual tool that | much more useful. | | Just my naive take. | adgjlsfhk1 wrote: | The big thing is that it is one of the first standardized, | cross-language binary data formats. CSV is an OK text format, | but parsing it is really slow because of string escaping. The | files it produces are also pretty big since it's text. | | Arrow is really fast to parse (up to 1000x faster than CSV), | supports data compression, enough data-types to be useful, and | deals with metadata well. The closest competitor is probably | protobuf, but protobuf is a total pain to parse. | MrPowers wrote: | CSV is a file format and Arrow is an in-memory data format. | | The CSV vs Parquet comparison makes more sense. Conflating | Arrow / Parquet is a pet peeve of Wes: | https://news.ycombinator.com/item?id=23970586 | nolta wrote: | To be fair, Arrow absorbed parquet-cpp, so a little | confusion is to be expected. | aldanor wrote: | The closest competitor would be HDF5, not Protobuf. | makapuf wrote: | Seems nice. How does it compare to hdf5? | BadInformatics wrote: | HDF5 is pretty terrible as a wire format, so it's not a 1-1 | comparison to Arrow. Generally people are not going to be | saving Arrow data to disk either (though you can with the | IPC format), but serializing to a more compact | representation like Parquet. | neovintage wrote: | The premise around Arrow is that when you want to share data | with another system, or even on the same machine between | processes, most of the compute time is spent in serializing and | deserializing data. Arrow removes that step by defining a | common columnar format that can be used in many different | programming languages. There's more to Arrow than just the | format, with things that make working with data even easier, | like better over-the-wire transfers (Arrow Flight). How would | this manifest for your customers using your applications? | They'd likely see speeds increase. Arrow makes a lot of sense | when working with lots of data in analytical or data science | use cases.
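To make neovintage's point concrete, a minimal pyarrow sketch of handing a table from pandas to any other Arrow-aware consumer via the IPC stream format (the in-memory buffer stands in for a socket, file, or shared-memory region; the data is made up):

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

    # Producer: wrap the DataFrame as an Arrow table and write it to an
    # IPC stream, the wire format other Arrow implementations understand.
    table = pa.Table.from_pandas(df)
    sink = pa.BufferOutputStream()
    writer = pa.ipc.new_stream(sink, table.schema)
    writer.write_table(table)
    writer.close()
    payload = sink.getvalue()  # a pyarrow Buffer; send or share these bytes

    # Consumer: any Arrow library (C++, Rust, JS, ...) can read the same
    # bytes; here we simply read them back in Python.
    received = pa.ipc.open_stream(payload).read_all()
    print(received.to_pandas())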
| RobinL wrote: | Yes. | | A second important point is the recognition that data tooling | often re-implements the same algorithms again and again, | often in ways which are not particularly optimised, because | the in-memory representation of data is different between | tools. Arrow offers the potential to do this once, and do it | well. That way, future data analysis libraries (e.g. a | hypothetical pandas 2) can concentrate on good API design | without having to re-invent the wheel. | | And a third is that Arrow allows data to be chunked and | batched (within a particular tool), meaning that computations | can be streamed through memory rather than the whole | dataframe needing to be stored in memory. A little bit like | how Spark partitions data and sends it to different nodes for | computation, except all on the same machine. This also | enables parallelisation by default. With today's CPU core | counts, this means Arrow is likely to be extremely fast. | ianmcook wrote: | Re this second point: Arrow opens up a great deal of | language and framework flexibility for data engineering- | type tasks. Pre-Arrow, common kinds of data warehouse ETL | tasks like writing Parquet files with explicit control over | column types, compression, etc. often meant you needed to | use Python, probably with PySpark, or maybe one of the | other Spark API languages. With Arrow now there are a bunch | more languages where you can code up tasks like this, with | consistent results. Less code switching, lower complexity, | less cognitive overhead. | BenoitP wrote: | > most of the compute time is spent in serializing and | deserializing data. | | This is to be viewed in light of how hardware evolves now. CPU | compute power is no longer growing as much (at least for | individual cores). | | But one thing that's still doubling on a regular basis is | memory capacity of all kinds (RAM, SSD, etc) and bandwidth of | all kinds (PCIe lanes, networking, etc). This divide is | getting large and will only continue to increase. | | Which brings me to my main point: | | You can't be serializing/deserializing data on the CPU. What | you want is to have the CPU coordinate the SSD to copy chunks | directly - and as is - to the NIC/app/etc. | | Short of having your RAM doing compute work*, you would be | leaving performance on the table. | | ---- | | * Which is starting to appear | (https://www.upmem.com/technology/), but that's not quite | there yet. | jeffbee wrote: | An interesting perspective on the future of computer | architecture but it doesn't align well with my experience. | CPUs are easier to build and although a lot of ink has been | spilled about the end of Moore's Law, it remains the case | that we are still on Moore's curve for number of | transistors, and since about 15 years ago we are now also | on the same slope for # of cores per CPU. We also still | enjoy increasing single-thread performance, even if not at | the rates of past innovation. | | DRAM, by contrast, is currently stuck. We need materials | science breakthroughs to get beyond the capacitor aspect | ratio challenge. RAM is still cheap but as a systems | architect you should get used to the idea that the amount | of DRAM per core will fall in the future, by amounts that | might surprise you. | jedbrown wrote: | This is backward -- this sort of serialization is | overwhelmingly bottlenecked on bandwidth (not CPU). (Multi- | core) compute improvements have been outpacing bandwidth | improvements for decades and have not stopped.
| Serialization is a bottleneck because compute is fast/cheap | and bandwidth is precious. This is also reflected in the | relative energy to move bytes being increasingly larger | than the energy to do some arithmetic on those bytes. | MrPowers wrote: | Exactly. Some specific examples. | | Read a Parquet file into a Pandas DataFrame. Then read the | Pandas DataFrame into a Spark DataFrame. Spark & Pandas are | using the same Arrow memory format, so no serde is needed. | | See the "Standardization Saves" diagram here: | https://arrow.apache.org/overview/ | pindab0ter wrote: | I hope you don't mind me asking dumb questions, but how | does this differ from the role that say Protocol Buffers | fills? To my ears they both facilitate data exchange. Are | they comparable in that sense? | arjunnarayan wrote: | protobufs still get encoded and decoded by each client | when loaded into memory. arrow is a little bit more like | "flatbuffers, but designed for common data-intensive | columnar access patterns" | hackcasual wrote: | Arrow does actually use flatbuffers for metadata storage. | poorman wrote: | https://github.com/apache/arrow/tree/master/format | hawk_ wrote: | is that at the cost of the ability to do schema | evolution? | rcoveson wrote: | Better to compare it to Cap'n Proto instead. Arrow data | is already laid out in a usable way. For example, an | Arrow column of int64s is an 8-byte aligned memory region | of size 8*N bytes (plus a bit vector for nullity), ready | for random access or vectorized operations. | | Protobuf, on the other hand, would encode those values as | variable-width integers. This saves a lot of space, which | might be better for transfer over a network, but means | that writers have to take a usable in-memory array and | serialize it, and readers have to do the reverse on their | end. | | Think of Arrow as standardized shared memory using | struct-of-arrays layout, Cap'n Proto as standardized | shared memory using array-of-structs layout, and Protobuf | as a lightweight purpose-built compression algorithm for | structs. | jeffbee wrote: | Protobuf provides the fixed64 type and when combined with | `packed` (the default in proto3, optional in proto2) | gives you a linear layout of fixed-size values. You would | not get natural alignment from protobuf's wire format if | you read it from an arbitrary disk or net buffer; to get | alignment you'd need to move or copy the vector. | Protobuf's C++ generated code provides RepeatedField that | behaves in most respects like std::vector, but in as much | as protobuf is partly a wire format and partly a library, | users are free to ignore the library and use whatever | code is most convenient to their application. | | TL;DR variable-width numbers in protobuf are optional. | cornstalks wrote: | > _Think of Arrow as standardized shared memory using | struct-of-arrays layout, Cap 'n Proto as standardized | shared memory using array-of-structs layout_ | | I just want to say thank you for this part of the | sentence. I understand struct-of-arrays vs array-of- | structs, and now I finally understand what the heck Arrow | is. | poorman wrote: | Not only in between processes, but also in between languages | in a single process. In this POC I spun up a Python | interpreter in a Go process and pass the Arrow data buffer | between processes in constant time. | https://github.com/nickpoorman/go-py-arrow-bridge | didip wrote: | Question, doesn't Parquet already do that? 
| ianmcook wrote: | From https://arrow.apache.org/faq/: "Parquet files cannot | be directly operated on but must be decoded in large | chunks... Arrow is an in-memory format meant for direct and | efficient use for computational purposes. Arrow data is... | laid out in natural format for the CPU, so that data can be | accessed at arbitrary places at full speed." | snicker7 wrote: | Yes. But parquet is now based on Apache Arrow. | ianmcook wrote: | Parquet is not based on Arrow. The Parquet libraries are | built into Arrow, but the two projects are separate and | Arrow is not a dependency of Parquet. | CrazyPyroLinux wrote: | This is not the greatest answer, but one anecdote: I am looking | at it as an alternative to MS SSIS for moving batches of data | around between databases. | [deleted] | RobinL wrote: | I wrote a bit about this from the perspective of a data | scientist here: | https://www.robinlinacre.com/demystifying_arrow/ | | I cover some of the use cases, but more importantly try and | explain how it all fits together, justifying why - as another | commenters has said - it's the most important thing happening | in the data ecosystem right now. | | I wrote it because i'd heard a lot about Arrow, and even used | it quite a lot, but realised I hadn't really understood what it | was! | juntan wrote: | In Perspective (https://github.com/finos/perspective), we use | Apache Arrow as a fast, cross-language/cross-network data | encoding that is extremely useful for in-browser data | visualization and analytics. Some benefits: | | - super fast read/write compared to CSV & JSON (Perspective and | Arrow share an extremely similar column encoding scheme, so we | can memcpy Arrow columns into Perspective wholesale instead of | reading a dataset iteratively). | | - the ability to send Arrow binaries as an ArrayBuffer between | a Python server and a WASM client, which guarantees | compatibility and removes the overhead of JSON | serialization/deserialization. | | - because Arrow columns are strictly typed, there's no need to | infer data types - this helps with speed and correctness. | | - Compared to JSON/CSV, Arrow binaries have a super compact | encoding that reduces network transport time. | | For us, building on top of Apache Arrow (and using it wherever | we can) reduces the friction of passing around data between | clients, servers, and runtimes in different languages, and | allows larger datasets to be efficiently visualized and | analyzed in the browser context. | prepend wrote: | I recently found it useful for the dumbest reason. A dataset | was about 3GB as a CSV and 20MB as a parquet file created and | consumed by arrow. The file also worked flawlessly across | different environments and languages. | | So it's a good transport tool. It also happens to be fast to | load and query, but I only used it because of the compact way | it stores data without any hoops to jump through. | | Of course one might say that it's stupid to try to pass around | a 2GB or 20MB file, but in my case I needed to do that. | IshKebab wrote: | When you want to process large amounts of in-memory tabular | data from different languages. | | You can save it to disk too using Apache Parquet but I | evaluated Parquet and it is very immature. Extremely incomplete | documentation and lots of Arrow features are just not supported | in Parquet unfortunately. | Lucasoato wrote: | Do you mean the Parquet format? 
I don't think Parquet is | immature, it is used in so many enterprise environments, and it | is one of the few columnar file formats for batch analysis and | processing. It performs so well... But I'm curious to know | your opinion on this, so feel free to add some context to | your position! | IshKebab wrote: | Yeah I do. For example Apache Arrow supports in-memory | compression. But Parquet does not support that. I had to | look through the code to find that out, and I found many | instances of basically `throw "Not supported"`. And yeah as | I said the documentation is just non-existent. | | If you are already using Arrow, or you absolutely must use | a columnar file format, then it's probably a good option. | BadInformatics wrote: | Is that a problem in the Parquet format or in PyArrow? My | understanding is that Parquet is primarily meant for on- | disk storage (hence the default on-disk compression), so | you'd read into Arrow for in-memory compression or IPC. | liminal wrote: | Would really love to see first-class support for | Javascript/Typescript for data visualization purposes. The | columnar format would naturally lend itself to an Entity- | Component style architecture with TypedArrays. | BadInformatics wrote: | https://arrow.apache.org/docs/js/ has existed for a while and | uses typed arrays under the hood. It's a bit of a chunky | dependency, but if you're at the point where that level of | throughput is required, bundle size is probably not a big deal. | liminal wrote: | I was mostly looking at this: | https://arrow.apache.org/docs/status.html | BadInformatics wrote: | Not sure I follow, that page indicates that JS support is | pretty good for all but the more obscure features (e.g. | decimals) and doesn't mention data visualization at all? | Anyhow, I've successfully used | https://github.com/vega/vega-loader-arrow for in-browser | plots before, and Observable has a fair few notebooks | showing how to use the JS API (e.g. | https://observablehq.com/@theneuralbit/introduction-to- | apach...) | nevi-me wrote: | Have you seen https://github.com/finos/perspective? Though they | removed the JS library a few months ago in favour of a WASM | build of the C++ library. | anonyfox wrote: | So if I understand this correctly from an application developer's | perspective: | | - for OLTP tasks, something row-based like SQLite is great. Small | to medium amounts of data, mixed reading/writing with | transactions. | | - for OLAP tasks, Arrow looks great. Big amounts of data, faster | querying (DataFusion) and more compact data files with Parquet. | | Basically: prevent the operational database from growing too | large, offload older data to Arrow/Parquet. Did I get this | correct? | | Additionally there seem to be further benefits like sharing | Arrow/Parquet with other consumers. | | Sounds convincing, I just have two very specific questions: | | - if I load a ~2GB collection of items into Arrow and query it | with DataFusion, how much slower will this perform in comparison | to my current Rust code that holds a large Vec in memory and | "queries" via iter/filter? | | - if I want to move data from SQLite to a more permanent Parquet | "Archive" file, is there a better way than recreating the whole | file or writing additional files, like appending? | | Really curious, I could find no hints online so far to get an idea. ___________________________________________________________________ (page generated 2021-02-03 23:00 UTC)