[HN Gopher] Fast analysis with DuckDB and Pyarrow
___________________________________________________________________
Fast analysis with DuckDB and Pyarrow

Author : amrrs
Score  : 78 points
Date   : 2022-04-30 17:50 UTC (5 hours ago)

(HTM) web link (tech.gerardbentley.com)
(TXT) w3m dump (tech.gerardbentley.com)

| jagtesh wrote:
| First of all, thanks for sharing this OP! So glad to see a way to
| query a df using SQL without further transformation.
|
| Arrow has been truly revolutionary in this regard, providing a
| solid in-memory data format (with performant APIs in many
| languages) for interchange between different engines and even
| formats.
|
| You can go from ORC to Parquet to CSV on a local FS or S3.
|
| With DuckDB, it's like you can build your own AWS Athena at
| likely a fraction of the cost. Now if only someone would
| integrate vaex with DuckDB, it would make your powerful Apple
| Silicon machines a compelling alternative to running a
| full-fledged Spark/Hadoop cluster.

| singhrac wrote:
| This is mildly off topic, but I am very unhappy with Pandas.
| Every single API feels bolted on without any consideration of
| composability or ergonomics. After spending 4 years with a much
| better proprietary library I cannot deal with arbitrary functions
| I have to learn like "value_counts" or whatever the output of a
| "groupby" is.

| rdedev wrote:
| My go-to these days is Polars. You get good performance since
| it uses Arrow in the back. Couple that with built-in lazy
| evaluation and its API design, and it's pretty good for me.
| There are some caveats you need to be aware of, though. It
| doesn't always work as a drop-in replacement for pandas.

| bsg75 wrote:
| Are you referring to https://www.pola.rs ?

| rdedev wrote:
| Yup. My bad. Looks like autocorrect screwed me

| shankr wrote:
| Yeah, even after working for years with pandas, I never feel
| very confident writing it. I always have to look up even the
| simpler stuff.

| isoprophlex wrote:
| Pandas is an absolute horrorshow: poor performance,
| inconsistent API, terrible implicit behavior leading to
| footguns.
|
| And everyone uses it because it's what you do when your boss
| tells you "we're transforming the analytics team, you're all to
| become data scientists because everyone has data scientists
| now". You just grab whatever had the biggest mindshare on SO
| and in random yt tutorials. Can't blame them.
|
| But hooo boy does pd get on my nerves.
|
| Care to share what proprietary stuff you were using?

| singhrac wrote:
| I can't really share in any detail, I think, but the best
| part was that "Series" were immutable and had sorted keys
| (indexes). Essentially they were (math) functions, so
| "indexes" had unique elements. All the important bits had
| fast numpy/Cython implementations, but the semantics were
| good because of unique keys.
|
| Honestly I still feel like I'm missing some sort of larger
| story about the semantics of Pandas (like the "functions"
| explanation above), so if anyone knows of anything that made
| Pandas click, please let me know.

| minimaxir wrote:
| The fact that most data analysis/ETL tutorials on the internet
| have converged on the same CSV/pandas tactics over the past
| decade is disappointing when newer tools demonstrated here such
| as DuckDB/Arrow have practical advantages without much code
| complexity overhead.
|
| This post also links to another discussion about the Parquet data
| format (https://pythonspeed.com/articles/pandas-read-csv-fast/),
| also supported by Arrow, which is also extremely useful but I
| never see anyone talking about it. Granted, Parquet data can't
| natively be imported into Excel which is likely the main cause.

| teej wrote:
| These tools are very new compared to CSV+pandas. And most
| things you want to get data out of won't give it to you in
| Parquet.
|
| The future is very promising, I am personally very excited
| about DuckDB.
| But it's too soon to be griping about old tutorials.

| philshem wrote:
| As of Pandas 1.4, you can use the pyarrow engine for reading a
| CSV:
|
|     df = pd.read_csv("large.csv", engine="pyarrow")
|
| https://pythonspeed.com/articles/pandas-read-csv-fast/

| [deleted]

| mkl wrote:
| They do that in the article too.

| mritchie712 wrote:
| Is it me or do posts about data tools do better on HN than your
| average software post?

| minimaxir wrote:
| Higher signal-to-noise, at the least.

| tomrod wrote:
| I like pyarrow a lot, but this is my first time coming across
| DuckDB. I'll check it out!
|
| I'm curious how it loads so fast initially.
___________________________________________________________________
(page generated 2022-04-30 23:00 UTC)