[HN Gopher] Practical SQL for Data Analysis
       ___________________________________________________________________
        
       Practical SQL for Data Analysis
        
       Author : steve-chavez
       Score  : 420 points
       Date   : 2021-05-03 14:55 UTC (8 hours ago)
        
 (HTM) web link (hakibenita.com)
 (TXT) w3m dump (hakibenita.com)
        
       | michelpp wrote:
       | This is an excellent example of what I call the Copy-Object-Copy
       | effect. It's particularly apparent in frameworks with ORMs like
        | Django. In Pandas' case, devs will do 'SELECT *' and use pandas
        | as a sort of pseudo-ORM.
       | 
       | You run a query to get your data as a bunch of objects, but
       | you're copying the data over the db connection from the postgres
       | wire protocol into Python objects in memory, which are typically
       | then garbage collected at the end of the transaction, then copy
       | the result to a JSON buffer that is also garbage collected, and
        | then send the final result to the browser, which has been
        | waiting this whole time.
       | 
       | I regularly see Django apps require several gigs of RAM per
       | worker and when you look at the queries, it's just a bunch of
       | poorly rendered SELECTs. It's grotesque.
       | 
       | Contrast PostgREST: 100 megabytes of RAM per worker. I've seen
        | Django DRF workers require 50x the memory of PostgREST for the
        | same REST interface and result, with PostgREST being much
        | faster.
        | Since the database is generating the JSON, it can do so
        | _immediately_ upon generating the first result row, which is
        | then _streamed_ to the browser. No double buffering.
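        | 
        | To make that concrete, a rough sketch of letting Postgres build
        | the JSON and streaming it (a hypothetical users(id, name,
        | active) table and psycopg2; a named cursor keeps the result set
        | server-side so rows stream instead of being buffered):
        | 
        |     import psycopg2
        | 
        |     def stream_users_json(conn):
        |         # named cursor = server-side; rows are fetched lazily
        |         with conn.cursor(name="stream_users") as cur:
        |             cur.execute("""
        |                 SELECT json_build_object('id', id, 'name', name)
        |                 FROM users
        |                 WHERE active
        |             """)
        |             for (row_json,) in cur:
        |                 yield row_json  # hand each row to the response
        | 
        |     conn = psycopg2.connect("dbname=app")  # hypothetical DSN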
       | 
       | Unlike Django, it's not that I don't like Pandas, it's that
       | people way overuse it for SQL tasks as this article points out.
        | I've seen pandas scripts that could have been a simpler psql
        | script. If psql can't do what you want, resist the temptation
        | to 'SELECT *' into a data frame; instead, break the problem up
        | into stages where you get the database to do the maximum work
        | before the data reaches the data frame.
        
         | password4321 wrote:
         | I would appreciate pointers to any other low/no-copy db ->
         | browser technologies.
         | 
         | I was thinking it would be cool to use parts of JavaScript db
         | libraries to parse query results passed straight through in
         | (compressed?) db wire protocol format via WebRTC.
        
         | geraneum wrote:
          | Well, it's true that bringing rows of data from the database
          | into Python and then serializing them for any reason has
          | extra overhead.
         | 
          | But in Django's case, the same arguments can be made as for
          | Pandas. Django is a big, complex framework, and an
          | application using it might consume more memory for countless
          | other reasons. There are also best practices for using the
          | Django ORM.
          | 
          | But to say that a given Django instance consumes more memory
          | only because of the "Copy-Object-Copy effect"... I don't
          | think so.
        
         | tomnipotent wrote:
         | > it's that people way overuse it for SQL tasks as this article
         | points out
         | 
          | I'm very confused by this. I've used pandas for over a decade,
         | and in most cases it's a massive time saver. I can do a single
         | query to bring data into local memory in a Jupyter notebook,
         | and from there re-use that memory across hundreds or more
         | executions of pandas functions to further refine an analysis or
         | whatever task I'm up to.
         | 
         | Your "copy-object-copy" is not relevant in data analysis use
         | cases, and exists in pretty much any system that pulls data
         | from an external service be it SQL or not.
        
           | tomrod wrote:
           | I know a team that would read in multi-GB CSVs from an FTP
            | site into an orchestrator server using Pandas, write to
            | local CSV on the orchestrator, validate data after write,
            | reload
           | into pandas, painfully and manually check every data type,
           | then use a pandas connector to a database.
           | 
           | Every step along the way here is wrong. The database came
           | with SFTP connectors. The orchestrator should have simply
           | told the database server to pull from the SFTP site into a
            | landing table. Even if not, Pandas has data types you can
            | specify on ingestion, which is heaven if your CSV or other
            | files don't record data types yet are consistent in
            | structure. Further, if you have custom validation you're
            | applying (not a bad thing!), you likely rarely even need
            | pandas; the native csv library, or xlrd/openpyxl for Excel,
            | are fine.
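            | 
            | For instance, a sketch of pinning the types at ingestion
            | (the file and column names here are made up):
            | 
            |     import pandas as pd
            | 
            |     df = pd.read_csv(
            |         "export.csv",
            |         dtype={"patient_id": "int64", "code": "string"},
            |         parse_dates=["loaded_at"],
            |     )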
           | 
            | Ultimately, it is a training issue. The tool space is so
            | complex, with so many buzzwords describing simple things,
            | that people get lost in the whirlwind.
        
           | mistrial9 wrote:
           | parent post says "devs do SELECT *" and ...
           | 
           | relational databases can have a lot more data in one table,
           | or related tables, than one query needs. It is often very
           | wasteful of RAM to get ALL and filter again
        
             | protomyth wrote:
              | If your query does " * ", it gets all the columns in all
              | the tables. Often, when all the columns you need are in
              | the index, the optimizer never visits the actual table (I
              | remember the term "covering index"). " * " basically
              | screws this up. "Select *" should never be used in
              | anything in an actual production environment.
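              | 
              | A sketch of the difference (a hypothetical events table in
              | Postgres, with an index created as CREATE INDEX ON events
              | (user_id, created_at) INCLUDE (amount)):
              | 
              |     import pandas as pd
              |     import sqlalchemy as sa
              | 
              |     engine = sa.create_engine("postgresql:///analytics")
              | 
              |     # every referenced column is in the index, so an
              |     # index-only scan is possible
              |     narrow = pd.read_sql(
              |         "SELECT user_id, created_at, amount "
              |         "FROM events WHERE user_id = 42", engine)
              | 
              |     # SELECT * drags every column along and forces heap reads
              |     wide = pd.read_sql(
              |         "SELECT * FROM events WHERE user_id = 42", engine)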
        
               | tomnipotent wrote:
               | Data analysis workloads more often run into problems
               | solved by partitioning rather than indexes.
        
               | mistrial9 wrote:
               | yes and no - I trained on "larger than RAM datasets"
               | intentionally, and subsequently took on projects that
               | require "larger than RAM datasets". Two things happened
               | on my way, companies led by Google invented and deployed
               | datasets previously impossible e.g. BigTable, and
               | secondly the equipment I worked on went from modest to
               | mid-sized RAM (hard to define that). Granted that lots of
               | very useful (and common?) tasks are not "larger than RAM
               | datasets", and then have very different characteristics..
               | which is starting to sound like "Excel problems look like
               | this, they always fit in RAM, they need partitions not
               | indexes" and the like..
               | 
               | There is a blind-man-and-the-Elephant drift here, which
               | is not terrible, but might need calling out to improve..
        
             | tomnipotent wrote:
             | Not for data analysis, it's a pointless constraint unless
             | you're having issues.
             | 
             | Most data analysis isn't against datasets larger than
              | memory, and I'd rather have more data than I need than
              | spend time waiting to bring it all local again because I
              | forgot a
             | few columns that turn out to be useful later on.
        
               | disgruntledphd2 wrote:
               | True, but in that kind of exploratory environment, you'd
               | normally reduce data load and speed up iteration by using
               | sampling, which is super easy in SQL.
               | 
               | > Not for data analysis, it's a pointless constraint
               | unless you're having issues.
               | 
               | This will always happen, as time spent on the project
               | scales up (unless you use some kind of autoscaling magic
               | I suppose).
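                | 
                | e.g. (Postgres, hypothetical table name) pulling roughly
                | 1% of rows for fast iteration:
                | 
                |     import pandas as pd
                |     import psycopg2
                | 
                |     conn = psycopg2.connect("dbname=analytics")
                |     sample = pd.read_sql(
                |         "SELECT * FROM events TABLESAMPLE SYSTEM (1)",
                |         conn)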
        
               | EForEndeavour wrote:
               | I totally get what you're saying because most of my
               | datasets also used to be smaller than my laptop's memory.
               | 
               |  _Used to._
               | 
               | I'm in the process of moving more and more processing
               | "upstream" from local pandas to the DBMS and it's already
               | proving to be a bit of a superpower. I regret not
               | learning it earlier, but the second-best time is now.
        
           | tarsinge wrote:
           | The point is that if you only need to join and aggregate data
           | it's easier and more efficient to do it directly in SQL. Also
           | in production or when your database doesn't fit in memory the
           | idea is to first use SQL to generate an optimized view of
           | your data from the database before further analysis and
           | transformation.
        
             | jstrong wrote:
             | > The point is that if you only need to join and aggregate
             | data it's easier and more efficient to do it directly in
             | SQL
             | 
             | citation needed? I've seen plenty of cases where it would
             | have taken the db ages to do something that pandas does
             | fast, and I don't consider pandas to be particularly fast.
        
           | qsort wrote:
           | Pandas is great when you're working on the data manually
           | because it lets you interleave python code and "queries", but
           | it's strictly worse than a plain SQL statement if you're
           | writing, say, a REST service that needs to output some data.
           | 
           | SQL executed by the database is orders of magnitude more
           | efficient, way more expressive, and doesn't require you to
           | memorize pandas' absurd API.
           | 
           | And I say this as someone who is by no means a SQL wizard,
           | and fully acknowledging all of SQL's blemishes.
        
             | tomnipotent wrote:
             | > a REST service
             | 
             | This is a post about data analysis, and everyone wants to
             | point out that pandas isn't good at real-time service
             | requests.
             | 
             | > SQL executed by the database is orders of magnitude more
             | efficient
             | 
             | Compared to pandas? No, it's not. Once I have all the data
              | I need locally, it's MUCH faster to use pandas locally
              | than to re-issue queries to a remote database.
        
               | crazygringo wrote:
                | > _Once I have all the data I need locally, it's MUCH
                | faster to use pandas locally than to re-issue queries
                | to a remote database._
               | 
               | Both of you are right.
               | 
               | Sure, if you need to grab a huge chunk of the entire
               | database and then do tons of processing on every row that
               | SQL simply cannot do, then you're right.
               | 
               | But when most people think SQL and database, they're
               | thinking of grabbing a tiny fraction of rows, sped up by
               | many orders of magnitude because it utilizes indexes, and
               | doing all calculations/aggregations/joins server-side.
               | Where it absolutely _is_ going to be orders of magnitude
               | more efficient.
               | 
                | Traditional database client APIs often aren't going to
               | be particularly performant in your case, because they're
               | usually designed to read and store the entire result of a
               | query in-memory before you can access it. If you're lucky
               | you can enable streaming of results that bypasses this.
               | Other times you'll be far better off accessing the data
               | via some kind of command-line table export tool and
               | streaming its output directly into your program.
        
               | 3pt14159 wrote:
               | I agree, they're both right.
               | 
               | I've been doing data science since around 2008 and it's a
               | balance between local needs and repeated analysis and an
               | often more efficient one off query. Sure the SQL
               | optimizer is going to read off the index for a count(*),
               | but it doesn't really help if I need all the rows locally
               | for data mining _anyway_. The counts need to line up! So
                | I'll take the snapshot of the data locally for the one
               | off analysis and call it a day. If I need this type of
               | report to be run nightly, it will be off of the data
               | warehouse infrastructure not the production DB server.
               | 
                | Shrug. These things take time and experience to fully
               | internalize and appreciate.
        
               | tomnipotent wrote:
               | > But when most people think SQL and database
               | 
               | We're not talking about most people, but data analysts.
               | If we're just doing simple sum/count/average aggregated
               | by a column or two with some basic predicates, SQL and an
               | RDBMS are your best friend. Always better to keep compute
               | + storage as close together as possible.
               | 
                | > Traditional database client APIs
               | 
               | I'm not sure what point you're trying to make with this.
               | Most data analysts are not working against streaming
               | data, but performing one-off analyses for business
               | counterparts that can be as simple as "X,Y,Z by A,B,C"
               | reports to "which of three marketing treatments for our
               | snail mail catalogue was most effective at converting
               | customers".
        
               | crazygringo wrote:
               | I'm not talking about streaming live data, I'm talking
               | about streaming query results if you re-read my comment.
               | 
               | If you're reading many MB's or GB's of data from a
               | database, it's a lot more performant to stream it from
               | the database directly into your local data structure,
               | rather than rely on default processing of query results
               | as a single chunk which will be a lot worse for memory
               | and speed.
        
               | qsort wrote:
               | > Compared to pandas? No, it's not.
               | 
               | I'm not saying you shouldn't use pandas, it depends on
               | the size of the data. I'm working right now on a project
               | where a SELECT * of the entire fact table would be a
               | couple hundred gigabytes.
               | 
               | The flow is SQL -> pandas -> manipulation, and as always
                | in pipelines like those, the more work you can do at the
               | earliest stage, the better.
        
               | disgruntledphd2 wrote:
               | Yeah, speaking as a data person, the SQL argument is
               | correct. Python/R are much, much, much slower for this
               | kind of work.
               | 
               | OTOH, SQL is super limiting for a lot of data analysis
               | tasks and you'll inevitably need the data in weird forms
               | that require lots of munging.
               | 
               | Personally, I'm a big fan of using SQL/Airflow/whatever
               | to generate whatever data I'll need all the time at a
               | high level of granularity (user/action etc), and then
               | just run a (very quick) SQL query to get whatever you
               | need into your analytics environment.
               | 
               | Gives you the best of both worlds, IME.
        
             | waltherg wrote:
             | This is a great point and way of looking at it: pandas and
             | SQL live at opposite ends of the maturation level of data
             | wrangling pipelines.
        
         | dragonwriter wrote:
         | > If psql can't do what you want, resist the temptation to
          | 'SELECT *' into a data frame; instead, break the problem up
          | into stages where you get the database to do the maximum work
          | before the data reaches the data frame.
         | 
         | Why are we introducing an additional tool (psql) here
         | unnecessarily? Sure, if it can be a simpler postgres query,
         | that can be useful, but introducing psql and psql scripting
         | into a workflow that is still going to use python is...utterly
         | wasteful. And even simplifying by optimizing the use of SQL,
         | while a good tool to have in the toolbox, is probably pretty
         | low value for the effort for lots of data analysis tasks.
        
           | EForEndeavour wrote:
           | I'm pretty sure the premise here is that the many
           | gigabytes/terabytes of data reside in an RDBMS to begin with,
           | so we aren't introducing another tool -- we're saying it
           | makes sense to leverage multiple decades' worth of database
           | optimizations by doing a bunch of data filtering, processing,
           | and analysis in the database software and then exporting
           | intermediate results to a pandas environment for final
           | tweaks, visualization, etc.
        
             | dragonwriter wrote:
             | > I'm pretty sure the premise here is that the many
             | gigabytes/terabytes of data reside in an RDBMS to begin
             | with, so we aren't introducing another tool
             | 
             | psql is a separate scriptable client tool from the postgres
             | database server.
        
               | shkkmo wrote:
               | The point was about doing more work using the SQL DB
               | engine, the client library you do that with seems
               | irrelevant to making that point.
        
               | dragonwriter wrote:
               | > The point was about doing more work using the SQL DB
               | engine
               | 
               | The specific reference was to "psql script", which runs
               | in the psql client program, not the DB engine. Since I
               | suspected that might be an error intending to reference
               | SQL running on the server, I separately addressed, in my
               | initial response, both what was literally said (psql
               | script) and sql running in the DB.
        
               | shkkmo wrote:
               | Any psql script would be building and running SQL queries
                | so I fail to see how that has any impact on the point
               | being made.
               | 
               | Edit: The argument is that there are operations that
               | should be done in the SQL layer and it is worth time
               | learning enough about that layer to understand when and
               | how to use it for computation. Once you learn that, it
               | isn't really relevant if you are using psql or some other
               | client library to build those queries.
        
         | mixmastamyk wrote:
         | Interesting. I'm guessing the extra layer may be needed for
         | some manipulations of data at the application layer, but
         | perhaps that is almost always avoidable?
         | 
         | Have you ever tried Hasura? I'm thinking of giving it a go in a
         | future project.
        
           | tylerhannan wrote:
           | If/as you give it a go in the future...do let us know what is
           | working and what could be better. We do, in fact, care.
           | 
           | <disclaimer: I work at Hasura>
        
           | musingsole wrote:
           | It's only avoidable to the degree that your software
           | architecture and company architecture allow you to make the
           | more efficient decision to move the logic upstream.
           | 
           | The cases where this pattern emerges are invariably because
            | someone without sufficient access to the DB and its
           | configuration bites the bullet and instead builds whatever
           | they need to do it in the environment they have complete
           | access to.
           | 
            | I've seen similar problems lead to dead cycles in
            | _processor_ design where the impact is critical. These
            | problems aren't
           | software technology problems. They're people coordination
           | problems.
        
           | arcticfox wrote:
           | As a Hasura user, I highly recommend it for OLTP workloads.
           | And for the specific subject at hand (memory usage) it is
           | fantastic. Its (in-database) serialization speed just
           | completely runs circles around anything we could do before in
           | Ruby.
        
         | WhompingWindows wrote:
         | Within the VA hospital system, the data admins often got onto
         | their "soapboxes" to dress down the 1400+ data analysts for
         | writing inefficient SQL queries. If you need 6 million
         | patients, joining data from 15 tables, gathering 250 variables,
         | a beginning SQL user has the potential to take 15-20 hours
         | where they could be pulling for 1-2 if they do some up-front
         | filtering and sub-grouping within SQL. If you already know
         | you'll throw out NA's on certain variables, if you need a
         | certain date or age range, or even the way you format these
         | filters: these all lead to important savings. And this saves R
          | from running out of memory, which would often happen if you fed
         | it way too much data.
         | 
         | Within a system of 1400 analysts, it makes a big difference if
         | everyone's taking 2X or 4X the time pulling that they could be.
         | Then, even the efficiently written pulls get slowed down, so
         | you have to run things overnight...and what if you had an error
         | on that overnight pull? Suffice to say, it'd have been much
         | simpler if people wrote solid SQL from the start.
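          | 
          | A sketch of that kind of up-front filtering (table and column
          | names are invented; the same idea applies from R):
          | 
          |     import pandas as pd
          |     import psycopg2
          | 
          |     conn = psycopg2.connect("dbname=warehouse")  # hypothetical
          |     cohort = pd.read_sql("""
          |         SELECT patient_id, age, a1c, bmi
          |         FROM labs
          |         WHERE a1c IS NOT NULL                -- drop NAs in the DB
          |           AND age BETWEEN 40 AND 75          -- age range
          |           AND drawn_at >= DATE '2020-01-01'  -- date range
          |     """, conn)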
        
           | sagarm wrote:
           | How many TBs of data are we really talking here, if it's just
           | 6M rows? Surely this processing could easily be done on a
           | single machine.
        
             | sztanko wrote:
              | It doesn't say it is just 6m rows. It is 6m patients,
              | which only hints that one of the dimension tables is 6m.
              | Facts gathered on patients might be significantly larger.
              | Also, in my experience, if you have hundreds of queries
             | running simultaneously, it is not the volume of data that
             | can be a bottleneck. Depending on the system it can be
             | anything, starting from acquiring lock on a record or
             | initiating a transaction.
        
           | solaxun wrote:
           | I understand your point generally but I don't understand this
           | example. If you need data from 15 tables, you need to do
           | those joins, regardless of prefiltering or subgrouping,
           | right?
        
             | protomyth wrote:
              | Well, yes, but why do you need 15 tables for analysis
              | queries, and why isn't someone rolling those tables into
              | something a bit easier with some backend process?
        
               | [deleted]
        
               | alextheparrot wrote:
               | This is where I think the original data admins were
               | deluding themselves. Expecting 1,400 analysts to write
               | better code is a really non-trivial problem, but easy to
               | proclaim.
               | 
               | An actual solution is creating pre-joined tables and
               | having processes ("Hey your query took forever, have you
               | considered using X?") or connectors
               | (".getPrejoinedPatientTable()") that make sure those
               | tables are being used in practice.
        
               | steve-chavez wrote:
               | > why isn't someone rolling those tables into something a
               | bit easier with some backend process.
               | 
                | To put it in more concrete terms (plain SQL): the tables
                | could be aggregated in a set-returning function or a
               | view.
        
               | cgh wrote:
               | Yes, in these situations materialized views with indexes
               | are generally the correct answer.
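                | 
                | e.g. a rough Postgres sketch (invented table names),
                | refreshed by a backend job so analysts hit one flat,
                | indexed table:
                | 
                |     import psycopg2
                | 
                |     conn = psycopg2.connect("dbname=warehouse")
                |     with conn, conn.cursor() as cur:
                |         cur.execute("""
                |             CREATE MATERIALIZED VIEW patient_flat AS
                |             SELECT p.id, p.dob, v.visit_date, l.a1c
                |             FROM patients p
                |             JOIN visits v ON v.patient_id = p.id
                |             JOIN labs l ON l.visit_id = v.id
                |         """)
                |         cur.execute(
                |             "CREATE INDEX ON patient_flat (id, visit_date)")
                |         # nightly: REFRESH MATERIALIZED VIEW patient_flat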
        
               | tomrod wrote:
               | > why isn't someone rolling those tables into something a
               | bit easier with some backend process.
               | 
                | You've identified why we're all going to have jobs in 100
               | years.
               | 
               | Automation sounds great. It's exponentially augmenting to
               | some users in a defined user space.
               | 
               | Until you get to someone like me, who looks at the
               | production structure and goes: "this is wholly
               | insufficient for what I need to build, but it has good
               | bones, so I'm going to strip it out and rebuild it for my
               | use case, and only my use case, because to wait for a
               | team to prioritize it according to an arcane schedule
               | will push my team deadlines exceedingly far."
               | 
               | This is why you don't roll everything into backend
               | processes. Companies set up for production (high
               | automation value ROI) and analytics (high labor value
                | ROI) and have a hard time serving the mid-tail.
                | EVERYTHING in either direction works against the
                | mid-tail --
               | security policies, data access policies, software
               | approvals, you name it.
               | 
               | People, policy, and technology. These are the three
               | pillars. If your org isn't performing the way it should,
               | then by golly work at one of these and remember that
               | technology is only one of three.
        
               | punnerud wrote:
                | Three pillars where you can only choose two to be
                | perfect. Just like databases and the CAP theorem.
        
               | tomrod wrote:
               | Technology is the easiest to adjust, ironically.
        
           | protomyth wrote:
           | I don't disagree with writing solid SQL. I would go so far as
           | to say some things (most) need to be in stored procedures
           | that are reviewed by competent people. But, some folks don't
           | think about usage sometimes.
           | 
           | This is one of those things I just don't get about folks
           | setting up their databases. If you have a rather large
            | dataset that keeps building via daily transactions, then it's
           | time to recognize you really have some basic distinct
           | scenarios and to plan for them.
           | 
           | The most common is adding or querying data about a single
           | entity. Most application developers really only deal with
           | this scenario since that is what most applications care about
           | and how you get your transactional data. Basic database
           | knowledge gets most people to do this ok with proper primary
           | and secondary keys.
           | 
           | Next up is a simple rule, "if you put a state on an entity,
           | expect someone to need to know all the entities with this
           | state." This is a killer for application developers for some
            | reason. It actually requires some database knowledge to set
            | up correctly to be performant. If the data analysts have
            | problems with those queries, then it's time to get the DBA to
           | fix the damn schema and write some stored procedures for the
           | app team.
           | 
           | At some point, you will need to do actual reporting, excuse
           | me, business intelligence. You really should have some
           | process that takes the transactional data and puts it into a
           | form where the queries of data analysts can take place. In
           | the old days that would be something to load up Red Brick or
           | some equivalent. Transactional systems make horrid reporting
           | systems. Running those types of queries on the same database
           | as the transactional system is currently trying to work is
           | just a bad idea.
           | 
           | Of course, if you are buying something like IBM DB2 EEE
           | spreading queries against a room of 40 POWER servers, then
           | ignore the above. IBM will fix it for you.
        
             | jnsie wrote:
              | At its simplest, it's OLTP vs OLAP. Separate the data
              | entry/transactional side of things from the reporting part.
              | Make it efficient for data analysts to do their jobs.
        
           | fifilura wrote:
           | I can't help thinking that with better processes and tools
           | those 1400 analysts could instead be 50 analysts? (or even
           | 5?)
           | 
           | And that building an aggregation ETL pipeline, maybe inspired
           | by this post, could be the solution?
        
             | 1980phipsi wrote:
             | VA -> government -> bloat
        
         | allknowingfrog wrote:
         | Is it possible to do any kind of version control with
         | PostgREST? I have no doubt that raw SQL offers huge performance
         | advantages over my ORM, but I'm pretty fond of the ability to
         | roll back to a previous state when I push a bad change to
         | production. Performance is just one of a number of tradeoffs
         | that I consider when I'm choosing tools.
        
           | sitkack wrote:
           | It is customary to check in all DB assets, table creation
           | scripts, queries, stored procedures, etc. No different than
            | the rest of the development process. If something _isn't_
           | being versioned, there is a huge problem.
        
       | jonas_b wrote:
        | Question: if I have a CSV file that I'd like to do some quick
        | SQL queries on before moving the results into Pandas, what
        | would be a good resource for doing this? Preferably something
        | compatible with the rest of the Python-dataframe ecosystem and
        | as simple as pd.read_csv().
        
         | Vaslo wrote:
          | Try SQLAlchemy?
        
         | vincnetas wrote:
         | http://harelba.github.io/q/
         | 
         | q "SELECT COUNT(*) FROM ./clicks_file.csv WHERE c3 > 32.3"
         | 
         | It uses sqlite under the hood.
        
         | arusahni wrote:
         | In addition to the recommendation for sqlite, I've found
         | `csvkit` to be an extremely useful set of CLI tools for CSV
         | munging. The `csvsql` [1] entrypoint is especially handy
         | because it allows you to issue SQL queries against your CSV
         | file directly vs. loading then querying.
         | 
         | 1: https://csvkit.readthedocs.io/en/latest/scripts/csvsql.html
        
         | rahulj51 wrote:
         | Try duckdb to query and even transform the data and then export
         | to a pandas df.
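          | 
          | Something like this sketch (assuming a local clicks.csv):
          | 
          |     import duckdb
          | 
          |     df = duckdb.query("""
          |         SELECT c1, COUNT(*) AS n
          |         FROM 'clicks.csv'
          |         GROUP BY c1
          |     """).to_df()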
        
         | simonw wrote:
         | If you want to try it in SQLite (pros: no need to run a server
         | or install anything since it's in the Python standard library,
         | cons: not nearly as many advanced statistical analysis features
         | as PostgreSQL) my sqlite-utils CLI tool may help here: it can
         | import from CSV/TSV/JSON into a SQLite database:
         | https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...
        
           | saltcured wrote:
           | The sqlite3 connection object in Python allows you to
           | register callables which you can use as scalar or aggregate
           | functions in your SQL queries. With this, you can fill some
           | of the gaps compared to PostgreSQL by essentially importing
           | Python libraries. I just found this nice tutorial while
           | looking for relevant docs:
           | 
           | https://wellsr.com/python/create-scalar-and-aggregate-
           | functi...
           | 
           | However, I think the limited type system in SQLite means you
           | would still want to extract more data to process in Python,
           | whether via pandas, numpy, or scipy stats functions. Rather
           | introducing new composite types, I think you might be stuck
           | with just JSON strings and frequent
           | deserialization/reserialization if you wanted to build up
           | structured results and process them via layers of user-
           | defined functions.
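            | 
            | A minimal sketch of registering a scalar and an aggregate
            | (standard library only; the table and columns are made up):
            | 
            |     import math
            |     import sqlite3
            |     import statistics
            | 
            |     conn = sqlite3.connect("data.db")
            |     conn.create_function("log1p", 1, math.log1p)
            | 
            |     class Median:  # aggregates need step() and finalize()
            |         def __init__(self):
            |             self.vals = []
            |         def step(self, v):
            |             self.vals.append(v)
            |         def finalize(self):
            |             return statistics.median(self.vals)
            | 
            |     conn.create_aggregate("median", 1, Median)
            |     rows = conn.execute(
            |         "SELECT grp, median(log1p(x)) FROM t GROUP BY grp"
            |     ).fetchall()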
        
         | ithrow wrote:
         | You can easily import the csv file into sqlite and you don't
          | even have to create the table fields beforehand.
        
         | bruiseralmighty wrote:
         | I keep a non-commercial installation of SQLServer on my machine
         | for this reason, but this is likely overkill for most purposes
         | and requires a Windows machine.
         | 
         | It does have some nice import features for CSV data though.
        
         | wswope wrote:
          | SQLite is what you're looking for. If you want to use the CLI:
          | 
          |     .mode csv
          |     .import <path to csv file> <table name>
         | 
         | Alternatively, read your data into pandas and there's extremely
          | easy interop between a DBAPI connection from the Python
          | standard lib sqlite3 module and Pandas (to_sql, read_sql_query,
         | etc.).
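          | 
          | e.g. a quick round trip (assuming a local clicks.csv):
          | 
          |     import sqlite3
          |     import pandas as pd
          | 
          |     conn = sqlite3.connect(":memory:")
          |     pd.read_csv("clicks.csv").to_sql("clicks", conn, index=False)
          |     df = pd.read_sql_query(
          |         "SELECT c1, COUNT(*) AS n FROM clicks GROUP BY c1", conn)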
        
           | thamer wrote:
           | I would second this suggestion. I was querying CSV data
           | imported into SQLite this weekend, and it was extremely easy
           | to get started. SQLite is pretty snappy even for tables with
           | millions of rows.
           | 
           | SQLite supports defining columns without a type and will use
           | TEXT by default, so you can take the first line of your CSV
            | that lists the column names, put those inside the
            | parentheses of a CREATE TABLE statement, and then run the
           | .import described above (so just CREATE TABLE foo(x,y,z); if
           | x,y,z are your column names).
           | 
           | After importing the data don't forget to create indexes for
           | the queries you'll be using most often, and you're good to
           | go.
           | 
            | Another suggestion for once your data is imported: have
            | SQLite report it in table format:
            | 
            |     .mode column
            |     .headers on
        
         | zabzonk wrote:
         | Windows provides an ODBC driver for CSV files - I don't know
         | how this would play with Pandas.
        
       | jplr8922 wrote:
       | tldr ; if you hear ''but we can do this in SQL'', RUN!!!
       | 
       | My eyes hurt as I read this article. There are reasons why
        | analysts don't use SQL to do their job, and it has nothing to
        | do with saving RAM and memory.
       | 
        | 1) Data analysis is not a linear process; it involves playing
        | with and manipulating the data in different ways and letting
        | your mind drift a bit. You want your project in an IDE made for
        | that purpose, the ability to create charts, source control,
        | export and share the information, etc. Pandas is just a piece
        | of that puzzle
       | which is not possible to replicate in pure SQL.
       | 
       | 2) In 2020, there are numerical methods you want to try beyond a
       | traditional regression. Most real world data problems are not
        | made for the stats-101 tools included in SQL. Kurtosis?
       | Autocorrelation?
       | 
        | 3) Politics. Most database administrators are control freaks who
        | _hate_ the idea of somebody else doing stuff in their DB. Right
        | now where I work we still have to use SSIS-2013 instead of stored
        | procedures in order to avoid the DBA's bureaucratic refusal
        | process.
       | 
       | 4) Eventual professional development. If your analysis is good
       | and creates value, chances are it will become a 'real' program
       | and you will have to explain what you are doing to turn in into
       | an OOP tool. If you have CS101, good coding in python will make
       | this process much easier than a 3000 lines spagetti-SQL SQL
       | script.
       | 
        | 5) Data cleaning. Dealing with outliers, NaNs and all that jazz
        | really depends on the problem you are trying to solve. The
        | absence of a one-size-fits-all solution is a good case for
        | R/pandas/etc. These issues will break an SQL script in no time.
       | 
       | 6) Debugging in SQL. Hahahahahahahaha
       | 
        | If you are still preoccupied with the RAM usage of your PC to do
       | your project, here are two solutions which infuriate a lot of
       | DBAs I've worked with.
       | 
       | A)
       | https://www.amazon.ca/s?k=ram&__mk_fr_CA=%C3%85M%C3%85%C5%BD...
       | 
       | B) https://aws.amazon.com/
        
         | qwertox wrote:
         | What is your problem with this example, doesn't it get the job
          | done?
          | 
          |     WITH temperatures AS ( /* ... */ )
          |     SELECT
          |         *,
          |         MAX(c) OVER (
          |             ORDER BY t
          |             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
          |         ) AS hottest_temperature_last_three_days
          |     FROM temperatures;
          | 
          |          t      | c  | hottest_temperature_last_three_days
          |     ------------+----+-------------------------------------
          |      2021-01-01 | 10 |                                  10
          |      2021-01-02 | 12 |                                  12
          |      2021-01-03 | 13 |                                  13
          |      2021-01-04 | 14 |                                  14
          |      2021-01-05 | 18 |                                  18
          |      2021-01-06 | 15 |                                  18
          |      2021-01-07 | 16 |                                  18
          |      2021-01-08 | 17 |                                  17
         | 
         | Why should I fetch all the data and then form the result with
         | another tool?
         | 
         | What a great article.
        
           | jplr8922 wrote:
           | I re-read my comment, and I think that I expressed myself too
            | harshly. If you like SQL and it gets the job done for you, no
           | problem. I would see myself using that SQL query.
           | 
            | However, in order to get there, you need to know why you need
            | the ''hottest_temperature_last_three_days''. Why not 2 or 4
            | days? Why not use heating degree days? What about serial
            | correlation with other regions, or metering dysfunction? What
            | if you are working directly with raw data from instruments,
            | and still need to choose which cleaning method you will use?
           | What if you want to check the correlation with another
           | dataset which is not yet in your database (ex: private
           | dataset in excel from a potential vendor)?
           | 
           | If you know _exactly_ what you want, sure SQL is the way to
            | go. However, the first step of a data project is to admit
            | that you don't know what you want. You will perform a
            | literature review and might have a general idea, but starting
            | with a precise solution in mind is a recipe for failure. This
            | is
           | why you could need to fetch all the data and then form
           | results with another tool. How do you even know which factors
           | and features must be extracted before doing some exploration?
           | 
           | If you know the process you need and only care about RAM-CPU
           | optimization, sure SQL is the way to go. Your project is an
            | ETL and you have the job of a programmer who is coding a
           | report. There is no ''data analysis'' there...
        
             | sagarm wrote:
             | If you know SQL well, then doing exploratory analysis in
             | SQL is perfectly reasonable. Especially if you are using a
             | high-performance analytic database.
             | 
             | Disclosure: I develop high-performance analytic database
             | systems with SQL frontends.
        
               | jplr8922 wrote:
                | What do you mean by exploratory analysis? Let's assume
               | that my dataset contains features which are not normally
               | distributed, and I want to check for coskewness and
               | cokurtosis matrices before and after I remove outliers
               | with a method specific to my field of research, found in
                | an academic paper. Am I supposed to code all of this in
                | SQL? What is your definition of exploratory analysis, and
               | which metrics do you have in mind?
               | 
               | And what if I want to make graphs to present stuff to a
               | colleague? Am I supposed to use... excel to do that?
        
             | qwertox wrote:
             | > If you know exactly what you want, sure SQL is the way to
             | go.
             | 
             | I've been doing similar aggregations with MongoDB, simply
             | because when I start a project where I quickly want to
             | store data and then a week later check what I can do with
             | it, it's a tool which is trivial to use. I don't need to
             | think about so many factors like where and how to store the
             | data. I use MongoDB for its flexibility (schemaless),
             | compared to SQL.
             | 
             | But I also use SQL mostly because of the power relations
             | have, yet I use it only for storing stuff where it's clear
             | how the data will look for forever, and what (mostly
             | simple) queries I need to perform.
             | 
             | I've read your comment as if it was suggesting that these
              | kinds of articles are not good, yet for me it was a nice
             | overview of some interesting stuff that can be done with
             | SQL. I don't chase articles on SQL, so I don't get many to
             | read, but this one is among the best I've read.
             | 
             | To get back to the quote: How will I know if SQL can do
              | what I want, if I don't read these kinds of articles?
        
         | kbelder wrote:
         | Familiar. I think the biggest issue is understanding workflow.
         | My hunch is that some developers view a data analysis project
         | as a program... it needs to have these data processing steps
         | performed, summarized in this certain way, and saved off in
         | that certain format. Once those queries are defined, the
          | program is run and the analysis done.
         | 
         | The actual work the analyst is doing is far more ad-hoc than
         | that. It's iterative and frequently will get torn down,
         | rebuilt, and joined to other sources, as they discover new
         | aspects to the data. They need their hands on the raw data, not
         | on a specifically summarized version of it.
        
       | twobitshifter wrote:
       | I now do most of my data analysis in Julia instead of pandas, but
       | used pandas for a long time. SQL is a valuable skill and useful
       | when you need to optimize a query for performance, which can
       | sometimes happen with pandas and is much less likely with Julia.
       | 
       | However, even if a pandas query takes 5X longer to run than a SQL
       | query you must consider the efficiency to a developer. You can
       | chain pandas commands together that will accomplish something
       | that takes 10x the lines of SQL. With SQL you're more likely to
       | end up with many intermediate CTEs along the way. So while you
       | can definitely save processor cycles by using SQL, I don't think
        | you'll save wall-clock time by using it for most one-off tasks.
        | Data science is usually column-oriented and Julia and pandas allow
       | you to stay in that world.
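        | 
        | For example, a chain like this sketch (invented file and column
        | names) would need a CTE or two plus a GROUP BY in SQL:
        | 
        |     import pandas as pd
        | 
        |     top = (
        |         pd.read_csv("sales.csv", parse_dates=["day"])
        |           .query("region == 'EU'")
        |           .assign(month=lambda d: d["day"].dt.to_period("M"))
        |           .groupby(["month", "product"], as_index=False)["revenue"]
        |           .sum()
        |           .sort_values("revenue", ascending=False)
        |           .head(10)
        |     )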
        
         | nomel wrote:
         | For me, the main use case of a database server is so I can do
         | operations that require more memory than I have on my local
         | computer. Reading through this comment section makes me think
         | my use case is rare or something. Maybe everyone is ok with
         | chunked operations?
        
       | spamalot159 wrote:
       | I think for a lot of people, SQL is a skill that doesn't stick.
       | You learn enough to do the queries you need for your project,
       | they work then you forget about them as you work on the rest of
       | your project. These skills are perishable. Left outer join? Yeah,
        | I knew what that was some time ago, but not anymore.
       | 
        | The days of dedicated SQL programmers are mostly gone.
        
         | zabzonk wrote:
          | > The days of dedicated SQL programmers are mostly gone.
         | 
         | I've never met a "dedicated SQL programmer" - all the C++
         | programmers I've worked with in the investment banking world
         | were also expected to know, and did know, SQL pretty well -
          | asking questions about it was normally part of the interview
         | process.
        
         | simonw wrote:
         | I started learning SQL twenty years ago and it's one of the
         | only skills from the start of my career that has been
         | consistently useful ever since.
        
         | slver wrote:
         | We have a lot more clueless developers on average, which makes
         | it seem like no one uses SQL anymore beyond basics. But we're
         | still here, believe me. No one today is a dedicated X
          | programmer, but that doesn't mean we're all clueless about X.
        
         | nisegami wrote:
         | I held this view before graduating and moving back to my 3rd
         | world home country. For some reason, people here (the
         | established ones with 5+ years of experience) see SQL as a
         | general purpose programming language with which they do as much
         | as possible.
        
         | andrewmcwatters wrote:
         | Exactly. I own a software consultancy, and I tried explaining
         | this to an engineering manager and he just didn't get it.
         | 
         | For some reason, he couldn't understand that when you write SQL
         | queries for a project, you typically do it once, and basically
         | never again, with the exception of maybe adding or removing
         | columns from the query.
         | 
         | The "hard work," if you can call it that, is all in the joins.
         | Then, I completely forget about it.
         | 
         | You spend so much more time in development on everything else.
        
       | muxator wrote:
       | > This benchmark does not mention the memory consumed by the
       | database itself - this is intentional. [...] Whether you decide
       | to use the database or not, the memory is already paid for, so
       | you might as well use it!
       | 
        | This sentence is a big red flag for me. An analysis of a strategy
        | that pushes work towards a subsystem, and then purposely ignores
        | the performance implications on that subsystem, is methodologically
       | unsound.
       | 
       | Personally, I am all for using the DB and writing good SQL. But
       | if I weren't, this argument would not convince me.
        
       | truth_seeker wrote:
       | Pure SQL porn. Very exciting!
       | 
       | Lately I am seeing rather really creative blog posts on HN. Keep
       | it up guys.
        
       | flakiness wrote:
       | The more I learn SQL, the less I write Python.
       | 
       | Although the SQL syntax is weird and so dated, its portability
       | across tools trumps everything else. You finished the EDA and
       | decided to port the insights to a dashboard? With SQL it's
       | trivial. With Python... well, probably you'll have to port it to
       | SQL unless you have Netflix-like, Jupyter-backed dashboard
        | infrastructure in place. For many of us who only have a much-
        | more-prevalent SQL-based dashboard platform, why not start from
        | SQL? Copy-n-paste is your friend!
       | 
        | I still hate SQL as a programmer, but as a non-expert data
       | analyst I now have accepted it.
        
         | arcticfox wrote:
         | After 15 years programming I have come to love SQL - 10 yrs ago
         | I used to do everything in the book to avoid it out of hate,
         | even though I knew how to use it, but over time I have moved
         | more and more into the database / SQL and my code is just
         | better and better as a result.
        
         | smilbandit wrote:
         | I'm neutral on SQL but very big on the right tool for the job.
        | My rule of thumb is, if I'm querying data just to massage it
        | and then stuff it back into the database, then I first try to do
        | that as a stored procedure. I remember learning about windowing
        | in SQL late in a project; redoing some Python functions as
        | stored procedures really improved performance.
        
       | simonw wrote:
       | This article is so useful. It starts out with SQL basics but then
       | quickly leaps into all kinds of PostgreSQL tricks that I didn't
       | know about - binning, efficient sampling, calculating, even
       | linear regression.
        
       | nerdponx wrote:
       | SQL syntax sucks for doing non-trivial data analysis. I've tried
       | it. Verbose, no composability or code reuse, not really portable
       | across different databases, no easy interoperability with other
       | tools, limited editor/IDE support, etc.
       | 
       | I guess if you have huge amounts of data (10m+ rows) already
       | loaded into a database then sure, do your basic summary stats in
       | SQL.
       | 
       | For everything else, I'll continue using SQL to get the data from
       | the database and use Pandas or Data.Table to actually analyze it.
       | 
       | That said, this is a very comprehensive review of SQL techniques
       | which I think could be very useful for when you _do_ have bigger
       | datasets and /or just need to get data out of a database
       | efficiently. Great writeup and IMO required reading for anyone
       | looking to be a serious "independent" data scientist (i.e. not
       | relying on data engineers to do basic ETL for you).
       | 
       | I'd be a huge fan of something like the PySpark DataFrame API
       | that "compiles" to SQL* (but that doesn't require you to actually
       | be using PySpark which is its own can of worms). I think this
       | would be a lot nicer for data analysis than any traditional ORM
        | style of API, while providing better composability and IDE
        | support than writing raw SQL.
       | 
       | *I also want this for Numexpr:
       | https://numexpr.readthedocs.io/en/latest/user_guide.html, but
        | also want a lot of other things, like Arrow-backed data frames
       | and a Numexpr-like C library to interact with them.
        
         | ramraj07 wrote:
          | Completely disagree. I have a choice of using Snowflake or
          | Spark/pandas to do my EDA and I'll choose SQL every time. The
          | code is significantly more readable once you get used to it and
          | you can most definitely do things one step at a time using UDFs
          | and temp tables / CTEs. I've come back to EDA I did a year
          | back and it's always easier to read a long SQL script than a
          | notebook with pandas code.
        
           | disgruntledphd2 wrote:
            | There are pluses and minuses to both. That being said, how do I
           | write a function in SQL which can abstract over something I
           | do a lot (like the pivoting example earlier), or even some
           | date function like iff(date>'important_date', 'before',
           | 'after') as grouper.
           | 
           | Honestly, that's what ends up making me move away from doing
           | analytics in SQL.
        
             | saltcured wrote:
             | Edit: oops, I think I replied a level deeper than intended.
             | I was responding to the composition/abstraction topic. I
             | think I should just leave this here now though?
             | 
             | I assume you are talking about composition of generic set-
             | processing routines, but I wonder if others realize that?
             | It is easy enough to write a set-returning function and
             | wrap it in arbitrary SQL queries to consume its output.
             | But, it is not easy to write a set-consuming function that
             | can be invoked on an arbitrary SQL query to define its
             | input. Thus, you cannot easily build a library of set-
             | manipulation functions and then compose them into different
             | pipelines.
             | 
             | I think different RDBMS dialects have different approaches
             | here, but none feel like natural use of SQL. You might do
             | something terrible with cursors. Or you might start passing
             | around SQL string arguments to EXECUTE within the generic
             | function, much like an eval() step in other interpreted
             | languages. Other workarounds are to do everything as macro-
             | processing (write your compositions in a different
             | programming language and "compile" to SQL you pass to the
             | query engine) or to abuse arrays or other variable-sized
             | types (abuse some bloated "scalar" value in SQL as a quasi-
             | set).
             | 
             | What's missing is some nice, first-class query (closure)
             | and type system. It would be nice to be able to write a CTE
             | and use it as a named input to a function and to have a
             | sub-query syntax to pass an anonymous input to a function.
             | Instead, all we can do is expand the library of scalar and
             | aggregate functions but constantly repeat ourselves with
             | the boilerplate SQL query structures that orchestrate these
             | row-level operations.
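              | 
              | The macro-processing workaround ends up looking something
              | like this sketch: compose the query text in the host
              | language and hand only the final string to the engine:
              | 
              |     def grouper(query, col, cutoff):
              |         # wrap any query fragment; composition happens
              |         # outside SQL
              |         return (
              |             f"SELECT CASE WHEN {col} > '{cutoff}' "
              |             f"THEN 'after' ELSE 'before' END AS grp, q.* "
              |             f"FROM ({query}) q")
              | 
              |     sql = grouper("SELECT * FROM events",
              |                   "created_at", "2021-01-01")
              |     # sql is plain text again, so it can be wrapped further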
        
               | disgruntledphd2 wrote:
               | Yeah, that's exactly the issue. One can do this in
               | Python, but not in SQL, and this leads to boilerplate.
               | 
               | I actually think that SparkSQL is a good solution here,
               | as you can create easily reusable functions.
        
         | rdedev wrote:
         | For apache arrow backed dataframes check out
         | Polars(https://github.com/ritchie46/polars)
        
         | disgruntledphd2 wrote:
         | > (i.e. not relying on data engineers to do basic ETL for you).
         | 
         | Is this actually a thing? Surely it can't be a thing.
        
           | denimnerd42 wrote:
           | of course, that's my job... we do basic ETL on hundreds of
           | data sources to provide data to the analysts and quants
        
           | EForEndeavour wrote:
           | In a sufficiently large org, why not?
        
         | dreyfan wrote:
         | > SQL syntax sucks for doing non-trivial data analysis. I've
         | tried it. Verbose, no composability or code reuse, not really
         | portable across different databases, no easy interoperability
         | with other tools, limited editor/IDE support, etc.
         | 
         | You're entitled to your opinion and tooling choices of course,
         | but the problem is you don't know SQL.
        
           | dunefox wrote:
           | That could be said about any tool, framework, language,
           | library, etc. Invest enough time and you can do everything
           | with Malbolge, but that doesn't mean it doesn't suck.
        
           | mixmastamyk wrote:
           | A more useful comment would be to illustrate how. I'd like
           | to know as well, as I need to reuse queries in SQL
           | occasionally but am not an expert. Currently I believe
           | learning PL/pgSQL is the answer, but even it reeks of punch
           | cards and other sixtiesisms :-). Tough sell when you're used
           | to Python etc.
        
           | pandasusr wrote:
           | I am a pandas user considering refactoring part of my ETL
            | pipeline to SQL. I see the trade-off as memory efficiency vs
           | expressiveness, and for simple queries on big data, SQL wins.
           | Would you disagree that Pandas/Python is more expressive than
           | SQL? I'm less experienced in SQL but based on my limited
           | experience there, it seems Pandas is clearly more expressive.
           | What is the SQL equivalent of Pandas .apply(lambda x) ?
        
             | higeorge13 wrote:
             | Please define an example lambda function, in order to see
             | whether there is an SQL equivalent. IMO >90% of the data
             | issues I have seen can be solved with SQL queries. I have
             | seen some in the finance sector which would require super
             | complex UDFs, but other than those, SQL is the first
             | choice for solving a data issue.
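             | 
             | For instance, a simple row-wise apply like
             | df.apply(lambda r: 'senior' if r.salary > 100000 else
             | 'junior', axis=1) usually just becomes an expression (a
             | sketch; the table and column are made up):
             | 
             |     SELECT name,
             |            CASE WHEN salary > 100000 THEN 'senior'
             |                 ELSE 'junior' END AS band
             |     FROM emp;
             | 
             | It's the lambdas that close over state or call out to
             | other libraries that really need a UDF.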
        
               | pandasusr wrote:
               | I do work in finance sector and need complex functions :)
               | 
               | And it's interesting that pandas was invented at a hedge
               | fund, AQR.
               | 
               | I agree that SQL may be better for vanilla BI.
        
             | hodgesrm wrote:
             | ClickHouse has lambdas for arrays. They are very useful.
             | Here's an example.
             | 
             |     WITH ['a', 'bc', 'def', 'g'] AS array
             |     SELECT arrayFilter(v -> (length(v) > 1), array) AS filtered
             | 
             |     +-filtered-----+
             |     | ['bc','def'] |
             |     +--------------+
             | 
             | The lambda in this case is a selector for strings with more
             | than one character. I would not argue that they are as
             | general as Pandas, but they are mighty useful. More
             | examples in the following article:
             | 
             | https://altinity.com/blog/harnessing-the-power-of-
             | clickhouse...
        
           | default-kramer wrote:
           | "No composability or code reuse" is definitely a valid
           | criticism of SQL. Check out the very first example of my
           | personal project https://docs.racket-
           | lang.org/plisqin/Read_Me_First.html and let me know how you
           | would implement something akin to `(CategoryName p)` and
           | `(TotalSales p)` in SQL. A view does not count, because then
           | you either have one view per derived field and that is very
           | cumbersome to consume, or you have one view with all the
           | derived fields and you will end up paying performance costs
           | for fields that you are not using.
        
             | ok123456 wrote:
             | Don't views and materialized views give you reuse?
             | 
              | Can't you do composability with foreign references?
        
               | neurocean wrote:
               | Sure, views may be fine for small projects, but using
               | them for larger projects like data warehousing is
               | usually a mistake you'll come to regret.
               | 
               | SQL views are rarely unit tested so you always end up in
               | a regression nightmare when you need to make updates.
               | 
               | If you're going to go pure SQL, you should use something
               | like dbt.
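               | 
               | With dbt the reusable pieces are just SQL files that
               | reference each other and can be tested; a hypothetical
               | model might look like:
               | 
               |     -- models/total_sales.sql
               |     SELECT o.product_id, sum(o.amount) AS total_sales
               |     FROM {{ ref('stg_orders') }} o
               |     GROUP BY o.product_id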
        
         | hodgesrm wrote:
         | > I guess if you have huge amounts of data (10m+ rows) already
         | loaded into a database then sure, do your basic summary stats
         | in SQL.
         | 
         | This is not huge. This is quite small. Pandas can't handle data
         | that runs into billions of rows _or_ sub-second response on
         | arbitrary queries. Both are common requirements in many
         | analytic applications.
         | 
         | I like Pandas. It's flexible and powerful if you have
         | experience with it. But there's no question that SQL databases
         | (especially data warehouses) handle large datasets and low
         | latency response far better than anything in the Python
         | ecosystem.
        
           | nerdponx wrote:
           | _This is not huge. This is quite small. Pandas can't handle
           | data that runs into billions of rows or sub-second response
           | on arbitrary queries. Both are common requirements in many
           | analytic applications._
           | 
           | You're right. It's "huge" with respect to what you can expect
           | to load into Pandas and get instant results from. But it's
           | not even "medium" data on the small-medium-big spectrum.
           | 
           |  _I like Pandas. It's flexible and powerful if you have
           | experience with it. But there's no question that SQL
           | databases (especially data warehouses) handle large datasets
           | and low latency response far better than anything in the
           | Python ecosystem._
           | 
           | I agree with this 100%, but I think a lot of people missed it
           | in my post.
        
             | hodgesrm wrote:
             | OTOH Pandas/Numpy are great for results. One pattern that
             | I'm hoping to try is running heavy lifting in SQL, transfer
             | results via Apache Arrow, and manipulate the result set in
             | Python with aforesaid libraries.
             | 
             | I don't know enough about how Arrow handles streaming to
             | understand how this would really work but it would break
             | away from the single-threaded connectivity with expensive
             | ser/deser databases have used since the days of Sybase Db-
             | Library, the predecessor to ODBC. 36 years is probably long
             | enough for that model.
        
           | neurocean wrote:
           | Yeah but that's not even close to being Pandas' value
           | proposition. It's for the data scientist and analyst, not the
           | db admin, data engineer or production applications.
        
       | beforeolives wrote:
       | Nice article. Pandas gets the job done but it's such a step
       | backwards in terms of usability, API consistency and code feel.
       | You can do anything that you could possibly need with it, but
       | you regularly have to look up things that you've looked up
       | before, because the different parts of the library are patched
       | together and don't work consistently in an intuitive way. And
       | then you end up with long lines
       | of().chained().['expressions'].like_this(0).
        
         | nvilcins wrote:
         | You might be thinking of specific functionality that you find
         | is being implemented in an overly long/verbose fashion.. But
         | generally speaking, how are
         | 
         | > long lines of().chained().['expressions'].like_this(0)
         | 
         | a _bad thing_?
         | 
         | IMHO these pandas chains are easy to read and communicate
         | quite clearly what's being done. If anything, I've found that
         | in my day-to-day reading of pandas I parse the meaning of
         | those chains at least as efficiently as I would comments of
         | any level of specificity, or the equivalent code in other
         | languages that I have experience with.
        
           | musingsole wrote:
           | People don't like them because the information density of
           | pandas chains is soooo much higher than the rest of the
           | surrounding code. So, they're reading along at a happy place,
           | consuming a few concepts per statement
           | 
           | ...and then BOOM, pandas chain! One statement containing 29+
           | concepts and their implications.
           | 
           | Followed by more low density code. The rollercoaster leads to
           | complaints because it _feels_ harder. Not because of any
           | actual change in difficulty.
           | 
           | /that's my current working theory, anyway
        
             | disgruntledphd2 wrote:
             | Hmmm, interesting. I don't mind the information density,
             | coming from R which is even more terse, but the API itself
             | is just not that well thought out (which is fair enough, he
             | was learning as he went).
        
         | isoprophlex wrote:
         | This really grinds my gears too. There's something about the
         | pandas API that makes it impossible for me to do basic ops
         | without tedious manual browsing to get inplace or index
         | arguments right... assignments and conditionals are needlessly
         | verbose too.
         | 
         | Pyspark on the other hand just sticks in my brain, somehow.
         | Chained pyspark method calls look much neater.
        
           | ziml77 wrote:
           | Setting inplace=True isn't too bad, but I definitely have
           | had many issues with working with indexes in Pandas. I don't
           | understand why they didn't design it so that the index can
           | be referenced like any other column. It overcomplicates
           | things like having to know the subtle difference between
           | join() and merge().
        
             | nojito wrote:
             | Setting inplace=True is not recommended and should be used
             | with caution.
             | 
             | https://github.com/pandas-dev/pandas/issues/16529
        
               | ziml77 wrote:
               | I've never seen anything about inplace being a poor idea
               | to use. Is that documented anywhere or is that ticket the
               | only info about it?
        
               | Peritract wrote:
               | It's not particularly front-and-centre, but it's all over
               | various discussion boards if you go looking for it
               | directly. I'd like the debate to be more visible,
               | personally, particularly whenever the argument gets
               | deprecated for a particular method.
               | 
               | Essentially, `inplace=True` rarely actually saves memory,
               | and causes problems if you like chaining things together.
               | The people who maintain the library/populate the
               | discussion boards are generally pro-chaining, so
               | `inplace` is slowly and quietly on its way out.
        
           | disgruntledphd2 wrote:
           | Pandas is just a bad API, to be honest. I know base-R very,
           | very well (which was one of the inspirations, I believe) and
           | I still spend most of my time looking stuff up.
           | 
           | It's such a shame that python doesn't have a better DF
           | library.
        
             | shankr wrote:
             | I also switched from R to Python/pandas. I remember always
             | being frustrated with pandas since it tries to emulate
             | data.frame, but then just does its own thing without being
             | consistent.
        
             | EForEndeavour wrote:
             | I've used pandas regularly for the past ~5 years and find
             | its API intuitive enough not to complain. I can write and
             | read decently long pandas chained expressions fluently, but
             | I barely know any R. Am I unwittingly a hostage of an
             | inferior API and don't know what I'm missing?
        
               | disgruntledphd2 wrote:
               | Yes.
               | 
               | Base R is OK, but dplyr is magical.
               | 
               | For instance, integer indexing in base R is df[row,col]
               | rather than the iloc pandas stuff.
               | 
               | plot, print and summary (and generic-function OOP more
               | generally) are really underappreciated.
               | 
               | Python is a better programming language, but R is a
               | better data analysis environment.
               | 
               | And dplyr is an incredibly fluent DSL for doing data
               | analysis (not quite as good for modelling though).
               | 
               | Seriously, I read the original vignette for dplyr in late
               | 2013/early 2014 and within two weeks I'd switched most of
               | my new analytical code over to it. So very, very good.
               | Less idea-impedance match than any other environment, in
               | my experience.
        
               | shankr wrote:
               | I was actually using data.table. The syntax can be a bit
               | cryptic, but you get used to it.
        
               | [deleted]
        
               | shankr wrote:
               | These might not seem like big issues, but in the
               | beginning these were my problems:
               | 
               | 1. Not so easy way to rename columns during aggregation
               | 
               | 2. The group by generates its own grouped by data and
               | hence you almost always need `reset_index`
               | 
               | 3. Sometimes group by can convert a dataframe to series
               | 
               | 4. Now `.loc` provides somewhat more consistent
               | indexing/slicing, but earlier you had `.ix`, `.iloc` and
               | what not
               | 
               | These are just what I can remember off the top of my
               | head. Of course all of these have solutions, but they
               | make pandas much more verbose. In R, these are just much
               | more succinct.
        
         | carabiner wrote:
         | Can you honestly say you'd prefer to debug _thousands_ of
         | lines of SQL versus the usual <100 lines of Python/pandas that
         | does the same thing? It's no contest. A histogram in pandas:
         | df.col.hist(). Unique values: df.col.value_counts(). Off the
         | top of your head, what is the cleanest way of doing this in
         | SQL and how does it compare? How anyone can say that SQL is
         | objectively better (in readability, density, any metric), for
         | reasons other than that they learned it first and are now
         | frustrated at having to learn another new tool, baffles me.
         | 
         | I learned Pandas first. I have no issue with indexing,
         | different ways of referencing cells, modifying individual rows
         | and columns, numerous ways of slicing and dicing. It gets a
         | little sprawling but there's a method to the madness. I can
         | come back to it months later and easily debug. With SQL, it's
         | just madness and 10x more verbose.
        
           | beforeolives wrote:
           | > Can you honestly say you'd prefer to be debug thousands of
           | lines of SQL versus the usual <100 lines of Python/pandas
           | that does the same thing?
           | 
           | That's fair, I was using the opportunity to complain about
           | pandas and didn't point out all the problems SQL has for some
           | tasks. What I really want is a dataframe library for Python
           | that's designed in a more sensible way than pandas.
        
           | marktl wrote:
           | DISTINCT counts and GROUP BY in SQL will give you distinct
           | counts and histogram (numerical value) output. Like NoSQL
           | vs. relational, it seems there are use cases that lend
           | themselves to new vs. old technologies. It's hard for me to
           | put much stock in opinions from those who have vastly more
           | experience in one of the tools being compared than in the
           | other.
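           | 
           | Roughly, in Postgres terms (a sketch; table and column names
           | are made up):
           | 
           |     -- df.col.value_counts()
           |     SELECT col, count(*) AS n FROM t GROUP BY col ORDER BY n DESC;
           | 
           |     -- df.col.hist() with 10 bins
           |     SELECT width_bucket(col, m.lo, m.hi, 10) AS bucket,
           |            count(*) AS n
           |     FROM t, (SELECT min(col) AS lo, max(col) AS hi FROM t) m
           |     GROUP BY bucket
           |     ORDER BY bucket;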
        
         | nerdponx wrote:
         | I much prefer writing actual functions in a real programming
         | language to the verbose non-reusable non-composable relic that
         | is SQL syntax.
         | 
         | Give me a better query language and I will gladly drop Pandas
         | and Data.Table.
        
           | semitones wrote:
           | SQL is a real programming language.
        
             | thinkharderdev wrote:
             | It's not Turing complete, which I think is why it doesn't
             | feel like a "real" programming language.
        
               | sagarm wrote:
               | Recursive CTEs exist; while I haven't investigated
               | thoroughly, I would be surprised if they didn't make SQL
               | Turing complete.
               | 
               | More usefully, most databases allow you to write user-
               | defined functions in other languages, including
               | imperative SQL dialects.
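               | 
               | The basic shape, for anyone who hasn't written one:
               | 
               |     WITH RECURSIVE t(n) AS (
               |         SELECT 1
               |         UNION ALL
               |         SELECT n + 1 FROM t WHERE n < 10
               |     )
               |     SELECT sum(n) FROM t;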
        
               | steve-chavez wrote:
               | SQL is Turing complete. A demonstration with recursive
               | CTEs is done here[1].
               | 
               | [1]: https://wiki.postgresql.org/wiki/Cyclic_Tag_System
        
           | beforeolives wrote:
           | Yes, SQL does have the problem of bad composability and some
           | things being less explicit than they are when working with
           | dataframes. Dplyr is probably the most sane API out of all
           | of them.
        
             | alexilliamson wrote:
             | +1 for dplyr. I've used both pandas and dplyr daily for a
             | couple of years at different times in my career, and there
             | is no comparison in my mind when it comes to
             | usability/verbosity/number of times I need to look at the
             | documentation.
        
       | m16ghost wrote:
       | >Pandas is a very popular tool for data analysis. It comes built-
       | in with many useful features, it's battle tested and widely
       | accepted. However, pandas is not always the best tool for the
       | job.
       | 
       | SQL is very useful, but there are some data manipulations which
       | are much easier to perform in pandas/dplyr/data.table than in
       | SQL. For example, the article discusses how to perform a pivot
       | table, which takes data in a "long" format, and makes it "wider".
       | 
       | In the article, the pandas version is:
       | 
       | >pd.pivot_table(df, values='name', index='role',
       | columns='department', aggfunc='count')
       | 
       | Compared to the SQL version:
       | 
       | >SELECT role,
       |   SUM(CASE department WHEN 'R&D' THEN 1 ELSE 0 END) as "R&D",
       |   SUM(CASE department WHEN 'Sales' THEN 1 ELSE 0 END) as "Sales"
       |  FROM emp GROUP BY role;
       | 
       | Not only does the SQL code require you to know up front how many
       | distinct columns you are creating, it requires you to write a
       | line out for each new column. This is okay in simple cases, but
       | is untenable when you are pivoting on a column with hundreds or
       | more distinct values, such as dates or zip codes.
       | 
       | There are some SQL dialects which provide pivot functions like in
       | pandas, but they are not universal.
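       | 
       | For what it's worth, where such a pivot exists it can be close
       | to the pandas one-liner; e.g. DuckDB's PIVOT statement (syntax
       | roughly as in its docs, so treat this as a sketch):
       | 
       |     PIVOT emp ON department USING count(name) GROUP BY role;
       | 
       | No need to enumerate the departments up front.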
       | 
       | There are other examples in the article where the SQL code is
       | much longer and less flexible, such as binning, where the bins
       | are hardcoded into the query.
        
         | michelpp wrote:
         | Does `tablefunc.crosstab()` do what you want?
         | 
         | https://www.postgresql.org/docs/13/tablefunc.html
        
           | m16ghost wrote:
           | It is not much better than the canonical example given in the
           | article. It still has the following usability issues:
           | 
           | -You still need to enumerate and label each new column and
           | their types. This particular problem is fixed by crosstabN().
           | 
           | -You need to know upfront how many columns are created before
           | performing the pivot. In the context of data analysis, this
           | is often dynamic or unknown.
           | 
           | -The input to the function is not a dataframe, but a text
           | string that generates the pre-pivot results. This means your
           | analysis up to this point needs to be converted into a
           | string. Not only does this disrupt the flow of an analysis,
           | you also have to worry about escape characters in your
           | string.
           | 
           | -It is not standard across SQL dialects. This function is
           | specific to Postgres, and other dialects have their own
           | version of this function with their own limitations.
           | 
           | The article contains several examples like this where SQL is
           | much more verbose and brittle than the equivalent pandas
           | code.
        
           | papercrane wrote:
           | That's one of the non-standard ways to do it. MSSQL and
           | Oracle also have a pivot function to do this. Unfortunately
           | there is no standard way to do this.
        
         | ellis-bell wrote:
         | agreed that pivoting can be a pain.
         | 
         | a pattern that i converged on --- at least in postgres --- is
         | to aggregate your data into json objects and then go from
         | there. you don't need to know how many attributes (columns)
         | should be in the result of your pivot. you can also do this in
         | reverse (pivot from wide to long) with the same technique.
         | 
         | so for example if you have the schema `(obj_id, key, value)` in
         | a long-formatted table, where an `obj_id` will have data
         | spanning multiple rows, then you can issue a query like
         | 
         | ``` SELECT obj_id, jsonb_object_agg(key, value) FROM table
         | GROUP BY obj_id; ```
         | 
         | up to actual syntax...it's been awhile since i've had to do a
         | task requiring this, so details are fuzzy but pattern's there.
         | 
         | so each row in your query result would look like a json
         | document: `(obj_id, {"key1": "value", "key2": "value", ...})`
         | 
         | see https://www.postgresql.org/docs/current/functions-json.html
         | for more goodies.
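         | 
         | the reverse direction (wide to long) is something like this,
         | again up to actual syntax, with made-up names:
         | 
         |     SELECT w.obj_id, kv.key, kv.value
         |     FROM wide_table w,
         |          jsonb_each_text(to_jsonb(w) - 'obj_id') AS kv;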
        
         | [deleted]
        
         | hantusk wrote:
         | Agreed. https://ibis-project.org/ and
         | https://dbplyr.tidyverse.org/ can compile dataframe-like input
         | to SQL, which might bridge the gap in tooling (although there
         | still are small differences to the pure dataframe syntax)
        
         | kbelder wrote:
         | I've been doing a lot of data analysis in Pandas recently. I
         | started off thinking that for efficiency's sake, I should do as
         | much initial processing in the DB as possible, and use Pandas
         | just for the higher level functions that were difficult to do
         | in SQL.
         | 
         | But after some trial and error, I find it much faster to pull
         | relatively large, unprocessed datasets and do everything in
         | Pandas on the local client. Faster both in total analysis time,
         | and faster in DB cycles.
         | 
         | It seems like a couple of simple "select * from cars" and
         | "select * from drivers where age < 30", and doing all the
         | joining, filtering, and summarizing on my machine, is often
         | less burdensome on the db than doing it up-front in SQL.
         | 
         | Of course, this can change depending on the specific dataset,
         | how big it is, how you're indexed, and all that jazz. Just
         | wanted to mention how my initial intuition was misguided.
        
         | ziml77 wrote:
         | I've always been disappointed by the SQL pivot. It's hardly
         | useful for me if I have to know up-front all of the columns
         | it's going to pivot out into. The solution would be to use
         | another SQL query to generate a dynamic SQL query, but at that
         | point I would rather just use Pandas.
        
         | [deleted]
        
       | racl101 wrote:
       | I think I do a healthy mixture of both.
        
       | codeulike wrote:
       | I'm so glad to have been around long enough that SQL is now being
       | seen as exotic again
        
         | thinkharderdev wrote:
         | Not long after I got into software engineering is when the
         | first NoSQL craze began. I imagine that those who started not
         | long after me and spent a few years wrangling with MongoDB
         | queries (shudder) and trying to do joins, transactions and
         | consistency guarantees in application code must see RDBMS as
         | some sort of magical new technology.
        
       | dmitrykoval wrote:
       | Following similar observations I was wondering if one can
       | actually execute SQL queries inside of a Python process with the
       | access to native Python functions and Numpy as UDFs. Thanks to
       | Apache Arrow one can essentially combine DataFrame API with SQL
       | within data analysis workflows, without the need to copy the data
       | and write operators in a mix of C++ and Python, all within the
       | confines of the same Python process.
       | 
       | So I implemented Vinum, which allows you to execute queries that
       | may invoke NumPy or Python functions as UDFs available to the
       | interpreter. For example: "SELECT value, np.log(value) FROM t
       | WHERE ..".
       | 
       | https://github.com/dmitrykoval/vinum
       | 
       | Finally, DuckDB is making great progress integrating pandas
       | dataframes into its API, with UDF support coming soon. I would
       | certainly recommend giving it a shot for OLAP workflows.
        
         | justsomeuser wrote:
         | Also, I think SQLite lets you call Python functions from SQL.
        
           | dmitrykoval wrote:
           | That's correct, but SQLite would have to
           | serialize/deserialize the data sent to the Python function
           | (from C to Python and back), while Arrow lets you get a
           | "view" of the same data without making a copy. That's
           | probably not an issue in OLTP workloads, but may become more
           | visible in OLAP.
        
       | sbuttgereit wrote:
       | The code:
       | 
       |     WITH dt AS (
       |         SELECT unnest(array[1, 2]) AS n
       |     )
       |     SELECT * FROM dt;
       | 
       | Is more complex than necessary. This produces the same result:
       | 
       |     SELECT n FROM unnest(array[1, 2]) n;
       | 
       |     +---+
       |     | n |
       |     +---+
       |     | 1 |
       |     | 2 |
       |     +---+
       |     (2 rows)
       | 
       | I think I see some other opportunities as well.
       | 
       | I know the code is from a section dealing with CTEs, but CTEs
       | aren't needed for every situation, including things like using
       | VALUES lists. Most people in the target audience can probably
       | ignore this next point (they're probably on a newer version of
       | PostgreSQL), but older versions of PostgreSQL would materialize
       | the CTE results prior to processing the main query, which is not
       | immediately obvious.
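       | 
       | On newer versions (12+) you can at least make that choice
       | explicit, e.g.:
       | 
       |     WITH dt AS NOT MATERIALIZED (
       |         SELECT unnest(array[1, 2]) AS n
       |     )
       |     SELECT * FROM dt;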
        
         | simonw wrote:
         | I think that was a deliberate choice by the author to
         | consistently demonstrate CTEs.
        
           | sbuttgereit wrote:
           | Sorry was editing to address this point probably while you
           | were typing your response. It's well taken, but we should be
           | clear that it's not required.
        
       | liprais wrote:
       | It seems the only reason people on HN hate SQL is that they
       | don't understand SQL and/or have already bought into another
       | toolset. If you only have a hammer at hand, everything is a
       | nail.
        
       ___________________________________________________________________
       (page generated 2021-05-03 23:00 UTC)