[HN Gopher] Fast analysis with DuckDB and Pyarrow
       ___________________________________________________________________
        
       Fast analysis with DuckDB and Pyarrow
        
       Author : amrrs
       Score  : 78 points
       Date   : 2022-04-30 17:50 UTC (5 hours ago)
        
 (HTM) web link (tech.gerardbentley.com)
 (TXT) w3m dump (tech.gerardbentley.com)
        
       | jagtesh wrote:
        | First of all, thanks for sharing this, OP! So glad to see a way
        | to query a df using SQL without further transformation.
       | 
       | Arrow has been truly revolutionary in this regard, providing a
       | solid in-memory data format (with performant APIs in many
       | languages) for interchange between different engines and even
       | formats.
       | 
        | You can go from ORC to Parquet to CSV on a local FS or S3.
       | 
        | With DuckDB, it's like you can build your own AWS Athena at
        | likely a fraction of the cost. Now if only someone would
        | integrate vaex with DuckDB, it would make your powerful Apple
        | Silicon machines a compelling alternative to running a
        | full-fledged Spark/Hadoop cluster.
        
       | singhrac wrote:
       | This is mildly off topic, but I am very unhappy with Pandas.
       | Every single API feels bolted on without any consideration of
       | composability or ergonomics. After spending 4 years with a much
       | better proprietary library I cannot deal with arbitrary functions
       | I have to learn like "value_counts" or whatever the output of a
       | "groupby" is.
        
         | rdedev wrote:
          | My go-to these days is Polars. You get good performance since
          | it uses Arrow under the hood. Couple that with built-in lazy
          | evaluation and its API design, and it works pretty well for
          | me. There are some caveats you need to be aware of, though:
          | it doesn't always work as a drop-in replacement for pandas.
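          | 
          | Roughly what the lazy API looks like (file and column names
          | are just placeholders):
          | 
          |     import polars as pl
          | 
          |     # scan_csv is lazy: nothing is read until .collect(),
          |     # so Polars can push filters down and prune columns
          |     result = (
          |         pl.scan_csv("large.csv")
          |           .filter(pl.col("amount") > 0)
          |           .groupby("city")
          |           .agg([pl.col("amount").sum()])
          |           .collect()
          |     )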
        
           | bsg75 wrote:
           | Are you referring to https://www.pola.rs ?
        
             | rdedev wrote:
             | Yup. My bad. Looks like autocorrect screwed me
        
         | shankr wrote:
          | Yeah, even after working for years with pandas, I never feel
          | very confident writing it. I always have to look up even the
          | simpler stuff.
        
         | isoprophlex wrote:
         | Pandas is an absolute horrorshow: poor performance,
         | inconsistent API, terrible implicit behavior leading to
         | footguns.
         | 
         | And everyone uses it because it's what you do when your boss
         | tells you "we're transforming the analytics team, you're all to
         | become data scientists because everyone has data scientists
         | now". You just grab whatever had the biggest mindshare on SO
         | and in random yt tutorials. Can't blame them.
         | 
         | But hooo boy does pd get on my nerves.
         | 
            | Care to share what proprietary stuff you were using?
        
           | singhrac wrote:
           | I can't really share in any detail, I think, but the best
           | part was that "Series" were immutable and had sorted keys
           | (indexes). Essentially they were (math) functions, so
           | "indexes" had unique elements. All the important bits had
           | fast numpy/Cython implementations, but the semantics were
           | good because of unique keys.
           | 
           | Honestly I still feel like I'm missing some sort of larger
           | story about the semantics of Pandas (like the "functions"
           | explanation above), so if anyone knows of anything that made
           | Pandas click, please let me know.
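            | 
            | Not that library, obviously, but a toy sketch in plain
            | Python of what "a Series is a function over unique, sorted
            | keys" could look like:
            | 
            |     from bisect import bisect_left
            | 
            |     class FnSeries:
            |         """Toy immutable series: a map with unique, sorted keys."""
            | 
            |         def __init__(self, mapping):
            |             items = sorted(mapping.items())  # dict keys are unique
            |             self._keys = tuple(k for k, _ in items)
            |             self._values = tuple(v for _, v in items)
            | 
            |         def __getitem__(self, key):
            |             # binary search over the sorted keys
            |             i = bisect_left(self._keys, key)
            |             if i < len(self._keys) and self._keys[i] == key:
            |                 return self._values[i]
            |             raise KeyError(key)
            | 
            |     s = FnSeries({"b": 2, "a": 1})
            |     s["a"]  # -> 1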
        
       | minimaxir wrote:
        | The fact that most data analysis/ETL tutorials on the internet
        | have converged on the same CSV/pandas tactics over the past
        | decade is disappointing, given that newer tools demonstrated
        | here such as DuckDB/Arrow have practical advantages without
        | much code complexity overhead.
       | 
        | This post also links to another discussion of the Parquet data
        | format (https://pythonspeed.com/articles/pandas-read-csv-fast/),
        | which is likewise supported by Arrow and extremely useful, but I
        | never see anyone talking about it. Granted, Parquet data can't
        | natively be imported into Excel, which is likely the main cause.
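        | 
        | For reference, the pandas round-trip is just this (file name is
        | illustrative), with pyarrow doing the Parquet I/O under the
        | hood:
        | 
        |     import pandas as pd
        | 
        |     df = pd.DataFrame({"id": range(1000),
        |                        "value": [i * 0.5 for i in range(1000)]})
        | 
        |     df.to_parquet("data.parquet")      # write columnar file
        |     df2 = pd.read_parquet("data.parquet")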
        
         | teej wrote:
         | These tools are very new compared to CSV+pandas. And most
         | things you want to get data out of won't give it to you in
         | parquet.
         | 
         | The future is very promising, I am personally very excited
         | about DuckDB. But it's too soon to be griping about old
         | tutorials.
        
       | philshem wrote:
        | As of Pandas 1.4, you can use the pyarrow engine for reading a
        | csv:
        | 
        |     df = pd.read_csv("large.csv", engine="pyarrow")
       | 
       | https://pythonspeed.com/articles/pandas-read-csv-fast/
        
         | [deleted]
        
         | mkl wrote:
         | They do that in the article too.
        
       | mritchie712 wrote:
        | Is it me or do posts about data tools do better on HN than your
        | average software post?
        
         | minimaxir wrote:
         | Higher signal-to-noise, at the least.
        
       | tomrod wrote:
        | I like pyarrow a lot, but this is my first time coming across
        | DuckDB. I'll check it out!
       | 
       | I'm curious how it loads so fast initially.
        
       ___________________________________________________________________
       (page generated 2022-04-30 23:00 UTC)