[HN Gopher] Practical SQL for Data Analysis
___________________________________________________________________

Practical SQL for Data Analysis

Author : steve-chavez
Score  : 420 points
Date   : 2021-05-03 14:55 UTC (8 hours ago)

(HTM) web link (hakibenita.com)
(TXT) w3m dump (hakibenita.com)

| michelpp wrote:
| This is an excellent example of what I call the Copy-Object-Copy effect. It's particularly apparent in frameworks with ORMs, like Django. In Pandas' case, devs will do 'SELECT *' and use pandas as a sort of pseudo ORM.
|
| You run a query to get your data as a bunch of objects, but you're copying the data over the db connection from the postgres wire protocol into Python objects in memory, which are typically then garbage collected at the end of the transaction. Then you copy the result to a JSON buffer that is also garbage collected, and finally send the result to the browser, which has been waiting this whole time.
|
| I regularly see Django apps require several gigs of RAM per worker, and when you look at the queries, it's just a bunch of poorly rendered SELECTs. It's grotesque.
|
| Contrast PostgREST: 100 megabytes of RAM per worker. I've seen Django DRF workers require 50x the memory of PostgREST for the same REST interface and result, with PostgREST being much faster. Since the database is generating the JSON, it can do so _immediately_ upon generating the first result row, which is then _streamed_ to the browser. No double buffering.
|
| Unlike Django, it's not that I don't like Pandas; it's that people way overuse it for SQL tasks, as this article points out. I've seen pandas scripts that could have been a simpler psql script. If psql can't do what you want, resist the temptation to 'SELECT *' into a data frame; break the problem up into stages where you get the database to do the maximum work before it gets to the data frame.
| password4321 wrote:
| I would appreciate pointers to any other low/no-copy db -> browser technologies.
|
| I was thinking it would be cool to use parts of JavaScript db libraries to parse query results passed straight through in (compressed?) db wire protocol format via WebRTC.
| geraneum wrote:
| Well, it's true that bringing rows of data from the database into Python and then serializing them for any reason has extra overhead.
|
| But in Django's case, the same arguments can be made as for Pandas. Django is a big, complex framework, and an application using it might consume more memory for countless other reasons. There are also best practices for using the Django ORM.
|
| But to say that if a given Django instance consumes more memory it is only because of the "Copy-Object-Copy effect"... I don't think so.
| tomnipotent wrote:
| > it's that people way overuse it for SQL tasks as this article points out
|
| I'm very confused by this. I've used pandas for over a decade, and in most cases it's a massive time saver. I can do a single query to bring data into local memory in a Jupyter notebook, and from there re-use that memory across hundreds or more executions of pandas functions to further refine an analysis or whatever task I'm up to.
|
| Your "copy-object-copy" is not relevant in data analysis use cases, and exists in pretty much any system that pulls data from an external service, be it SQL or not.
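| To make michelpp's point concrete, here is a minimal sketch of letting the database assemble the JSON itself. This is psycopg2 rather than PostgREST, and the DSN and "emp" table are hypothetical:
|
|     import psycopg2
|
|     conn = psycopg2.connect("dbname=demo")  # hypothetical DSN
|     with conn, conn.cursor() as cur:
|         # The JSON document is built inside Postgres; Python receives
|         # one value instead of one object per row.
|         cur.execute("""
|             SELECT json_agg(row_to_json(e))
|             FROM (SELECT name, department, role FROM emp) AS e
|         """)
|         payload = cur.fetchone()[0]  # ready for the HTTP response
|
| PostgREST goes further by streaming the response as rows are produced instead of buffering the whole document.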
| tomrod wrote:
| I know a team that would read multi-GB CSVs from an FTP site into an orchestrator server using Pandas, write to local CSV on the orchestrator, validate the data after the write, reload into pandas, painfully and manually check every data type, then use a pandas connector to a database.
|
| Every step along the way here is wrong. The database came with SFTP connectors. The orchestrator should have simply told the database server to pull from the SFTP site into a landing table. Even if not, Pandas has data types you can specify on ingestion, which is heaven if your CSV or other files don't record data types yet are consistent in structure. Further, if you have custom validation you're applying (not a bad thing!), you likely rarely even need pandas; the native csv library (or xlrd/openpyxl for Excel) is fine.
|
| Ultimately, it is a training issue. The tool space is too complex, with too many buzzwords describing simple things, and people get lost in the whirlwind.
| mistrial9 wrote:
| parent post says "devs do SELECT *" and ...
|
| relational databases can have a lot more data in one table, or related tables, than one query needs. It is often very wasteful of RAM to get ALL and filter again.
| protomyth wrote:
| If your query does "*" it gets all the columns in all the tables. Often, when all the columns you need are on the index, the optimizer never visits the actual table (I remember the term: covering index). "*" basically screws this up. "SELECT *" should never be used in anything in an actual production environment.
| tomnipotent wrote:
| Data analysis workloads more often run into problems solved by partitioning rather than indexes.
| mistrial9 wrote:
| yes and no - I trained on "larger than RAM datasets" intentionally, and subsequently took on projects that require "larger than RAM datasets". Two things happened along the way: companies led by Google invented and deployed systems for previously impossible datasets, e.g. BigTable, and secondly the equipment I worked on went from modest to mid-sized RAM (hard to define that). Granted, lots of very useful (and common?) tasks are not "larger than RAM datasets", and those have very different characteristics... which is starting to sound like "Excel problems look like this, they always fit in RAM, they need partitions not indexes" and the like.
|
| There is a blind-men-and-the-elephant drift here, which is not terrible, but might need calling out to improve.
| tomnipotent wrote:
| Not for data analysis; it's a pointless constraint unless you're having issues.
|
| Most data analysis isn't against datasets larger than memory, and I'd rather have more data than I need than spend time waiting to bring it all local again because I forgot a few columns that turn out to be useful later on.
| disgruntledphd2 wrote:
| True, but in that kind of exploratory environment, you'd normally reduce data load and speed up iteration by using sampling, which is super easy in SQL.
|
| > Not for data analysis, it's a pointless constraint unless you're having issues.
|
| This will always happen as time spent on the project scales up (unless you use some kind of autoscaling magic, I suppose).
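| For reference, a sketch of what that server-side sampling can look like in PostgreSQL; the connection string, table, and columns are hypothetical, and pandas is just one way to consume the DB-API connection:
|
|     import pandas as pd
|     import psycopg2
|
|     con = psycopg2.connect("dbname=analytics")  # hypothetical DSN
|     # TABLESAMPLE draws the sample inside the database, so only about
|     # 1% of the table ever crosses the wire.
|     sample = pd.read_sql_query("""
|         SELECT user_id, action, created_at
|         FROM events TABLESAMPLE SYSTEM (1)
|         WHERE created_at >= '2021-01-01'
|     """, con)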
| EForEndeavour wrote:
| I totally get what you're saying, because most of my datasets also used to be smaller than my laptop's memory.
|
| _Used to._
|
| I'm in the process of moving more and more processing "upstream" from local pandas to the DBMS, and it's already proving to be a bit of a superpower. I regret not learning it earlier, but the second-best time is now.
| tarsinge wrote:
| The point is that if you only need to join and aggregate data, it's easier and more efficient to do it directly in SQL. Also, in production, or when your database doesn't fit in memory, the idea is to first use SQL to generate an optimized view of your data from the database before further analysis and transformation.
| jstrong wrote:
| > The point is that if you only need to join and aggregate data it's easier and more efficient to do it directly in SQL
|
| citation needed? I've seen plenty of cases where it would have taken the db ages to do something that pandas does fast, and I don't consider pandas to be particularly fast.
| qsort wrote:
| Pandas is great when you're working on the data manually because it lets you interleave python code and "queries", but it's strictly worse than a plain SQL statement if you're writing, say, a REST service that needs to output some data.
|
| SQL executed by the database is orders of magnitude more efficient, way more expressive, and doesn't require you to memorize pandas' absurd API.
|
| And I say this as someone who is by no means a SQL wizard, and fully acknowledging all of SQL's blemishes.
| tomnipotent wrote:
| > a REST service
|
| This is a post about data analysis, and everyone wants to point out that pandas isn't good at real-time service requests.
|
| > SQL executed by the database is orders of magnitude more efficient
|
| Compared to pandas? No, it's not. Once I have all the data I need locally, it's MUCH faster to use pandas locally than to re-issue queries to a remote database.
| crazygringo wrote:
| > _Once I have all the data I need locally, it's MUCH faster to use pandas locally than to re-issue queries to a remote database._
|
| Both of you are right.
|
| Sure, if you need to grab a huge chunk of the entire database and then do tons of processing on every row that SQL simply cannot do, then you're right.
|
| But when most people think SQL and database, they're thinking of grabbing a tiny fraction of rows, sped up by many orders of magnitude because it utilizes indexes, and doing all calculations/aggregations/joins server-side. Where it absolutely _is_ going to be orders of magnitude more efficient.
|
| Traditional database client APIs often aren't going to be particularly performant in your case, because they're usually designed to read and store the entire result of a query in memory before you can access it. If you're lucky you can enable streaming of results that bypasses this. Other times you'll be far better off accessing the data via some kind of command-line table export tool and streaming its output directly into your program.
| 3pt14159 wrote:
| I agree, they're both right.
|
| I've been doing data science since around 2008, and it's a balance between local needs for repeated analysis and an often more efficient one-off query. Sure, the SQL optimizer is going to read off the index for a count(*), but it doesn't really help if I need all the rows locally for data mining _anyway_. The counts need to line up! So I'll take the snapshot of the data locally for the one-off analysis and call it a day.
| If I need this type of report to be run nightly, it will be run off the data warehouse infrastructure, not the production DB server.
|
| Shrug. These things take time and experience to fully internalize and appreciate.
| tomnipotent wrote:
| > But when most people think SQL and database
|
| We're not talking about most people, but data analysts. If we're just doing a simple sum/count/average aggregated by a column or two with some basic predicates, SQL and an RDBMS are your best friend. Always better to keep compute + storage as close together as possible.
|
| > Traditional database client API's
|
| I'm not sure what point you're trying to make with this. Most data analysts are not working against streaming data, but performing one-off analyses for business counterparts that can range from simple "X,Y,Z by A,B,C" reports to "which of three marketing treatments for our snail mail catalogue was most effective at converting customers".
| crazygringo wrote:
| I'm not talking about streaming live data, I'm talking about streaming query results, if you re-read my comment.
|
| If you're reading many MBs or GBs of data from a database, it's a lot more performant to stream it from the database directly into your local data structure than to rely on default processing of query results as a single chunk, which will be a lot worse for memory and speed.
| qsort wrote:
| > Compared to pandas? No, it's not.
|
| I'm not saying you shouldn't use pandas; it depends on the size of the data. I'm working right now on a project where a SELECT * of the entire fact table would be a couple hundred gigabytes.
|
| The flow is SQL -> pandas -> manipulation, and as always in pipelines like those, the more work you can do at the earliest stage, the better.
| disgruntledphd2 wrote:
| Yeah, speaking as a data person, the SQL argument is correct. Python/R are much, much, much slower for this kind of work.
|
| OTOH, SQL is super limiting for a lot of data analysis tasks, and you'll inevitably need the data in weird forms that require lots of munging.
|
| Personally, I'm a big fan of using SQL/Airflow/whatever to generate whatever data I'll need all the time at a high level of granularity (user/action etc.), and then just running a (very quick) SQL query to get whatever you need into your analytics environment.
|
| Gives you the best of both worlds, IME.
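| A small sketch of that two-stage flow, with all names hypothetical and SQLite standing in for the warehouse: the database reduces the raw rows, and exploration continues locally on the small result.
|
|     import sqlite3
|     import pandas as pd
|
|     con = sqlite3.connect("warehouse.db")  # hypothetical file
|     # The filter and GROUP BY run in the database; pandas only ever
|     # sees the aggregated result, not the raw event rows.
|     df = pd.read_sql_query("""
|         SELECT user_id, COUNT(*) AS n, MAX(created_at) AS last_seen
|         FROM user_actions
|         WHERE created_at >= '2021-01-01'
|         GROUP BY user_id
|     """, con)
|     print(df["n"].describe())  # free-form exploration continues in pandas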
| waltherg wrote:
| This is a great point and way of looking at it: pandas and SQL live at opposite ends of the maturation level of data wrangling pipelines.
| dragonwriter wrote:
| > If psql can't do what you want, resist the temptation to 'SELECT *' into a data frame and break the problem up into stages where you get the database to do the maximum work before it gets to the data frame.
|
| Why are we introducing an additional tool (psql) here unnecessarily? Sure, if it can be a simpler postgres query, that can be useful, but introducing psql and psql scripting into a workflow that is still going to use python is... utterly wasteful. And even simplifying by optimizing the use of SQL, while a good tool to have in the toolbox, is probably pretty low value for the effort for lots of data analysis tasks.
| EForEndeavour wrote:
| I'm pretty sure the premise here is that the many gigabytes/terabytes of data reside in an RDBMS to begin with, so we aren't introducing another tool -- we're saying it makes sense to leverage multiple decades' worth of database optimizations by doing a bunch of data filtering, processing, and analysis in the database software and then exporting intermediate results to a pandas environment for final tweaks, visualization, etc.
| dragonwriter wrote:
| > I'm pretty sure the premise here is that the many gigabytes/terabytes of data reside in an RDBMS to begin with, so we aren't introducing another tool
|
| psql is a separate scriptable client tool from the postgres database server.
| shkkmo wrote:
| The point was about doing more work using the SQL DB engine; the client library you do that with seems irrelevant to making that point.
| dragonwriter wrote:
| > The point was about doing more work using the SQL DB engine
|
| The specific reference was to a "psql script", which runs in the psql client program, not the DB engine. Since I suspected that might be an error intending to reference SQL running on the server, I separately addressed, in my initial response, both what was literally said (psql script) and SQL running in the DB.
| shkkmo wrote:
| Any psql script would be building and running SQL queries, so I fail to see how that has any impact on the point being made.
|
| Edit: The argument is that there are operations that should be done in the SQL layer, and it is worth spending time learning enough about that layer to understand when and how to use it for computation. Once you learn that, it isn't really relevant whether you are using psql or some other client library to build those queries.
| mixmastamyk wrote:
| Interesting. I'm guessing the extra layer may be needed for some manipulations of data at the application layer, but perhaps that is almost always avoidable?
|
| Have you ever tried Hasura? I'm thinking of giving it a go in a future project.
| tylerhannan wrote:
| If/as you give it a go in the future... do let us know what is working and what could be better. We do, in fact, care.
|
| <disclaimer: I work at Hasura>
| musingsole wrote:
| It's only avoidable to the degree that your software architecture and company architecture allow you to make the more efficient decision to move the logic upstream.
|
| The cases where this pattern emerges are invariably because someone without sufficient access to the DB and its configuration bites the bullet and instead builds whatever they need in the environment they have complete access to.
|
| I've seen similar problems lead to dead cycles in _processor_ design, where the impact is critical. These problems aren't software technology problems. They're people coordination problems.
| arcticfox wrote:
| As a Hasura user, I highly recommend it for OLTP workloads. And for the specific subject at hand (memory usage) it is fantastic. Its (in-database) serialization speed just completely runs circles around anything we could do before in Ruby.
| WhompingWindows wrote:
| Within the VA hospital system, the data admins often got onto their "soapboxes" to dress down the 1400+ data analysts for writing inefficient SQL queries.
| If you need 6 million patients, joining data from 15 tables and gathering 250 variables, a beginning SQL user has the potential to take 15-20 hours on a pull that could take 1-2 with some up-front filtering and sub-grouping within SQL. If you already know you'll throw out NAs on certain variables, if you need a certain date or age range, or even just in how you format these filters: these all lead to important savings. And this keeps R from filling up memory, which would often happen if you fed it way too much data.
|
| Within a system of 1400 analysts, it makes a big difference if everyone's taking 2X or 4X the time pulling that they could be. Then even the efficiently written pulls get slowed down, so you have to run things overnight... and what if you had an error in that overnight pull? Suffice it to say, it'd have been much simpler if people wrote solid SQL from the start.
| sagarm wrote:
| How many TBs of data are we really talking here, if it's just 6M rows? Surely this processing could easily be done on a single machine.
| sztanko wrote:
| It doesn't say it is just 6m rows. It is 6m patients, which only hints that one of the dimension tables is 6m. Facts gathered on patients might be significantly larger. Also, in my experience, if you have hundreds of queries running simultaneously, it is not the volume of data that is the bottleneck. Depending on the system, it can be anything, from acquiring a lock on a record to initiating a transaction.
| solaxun wrote:
| I understand your point generally, but I don't understand this example. If you need data from 15 tables, you need to do those joins regardless of prefiltering or subgrouping, right?
| protomyth wrote:
| Well, yes, but. Why do you need 15 tables for analysis queries, and why isn't someone rolling those tables into something a bit easier with some backend process?
| [deleted]
| alextheparrot wrote:
| This is where I think the original data admins were deluding themselves. Expecting 1,400 analysts to write better code is a really non-trivial problem, but easy to proclaim.
|
| An actual solution is creating pre-joined tables and having processes ("Hey, your query took forever, have you considered using X?") or connectors (".getPrejoinedPatientTable()") that make sure those tables are being used in practice.
| steve-chavez wrote:
| > why isn't someone rolling those tables into something a bit easier with some backend process.
|
| To put it in more concrete terms (plain SQL): the tables could be aggregated in a set-returning function or a view.
| cgh wrote:
| Yes, in these situations materialized views with indexes are generally the correct answer.
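| For readers who haven't used them, a minimal sketch of that pattern in PostgreSQL via psycopg2; every table and column name here is hypothetical:
|
|     import psycopg2
|
|     conn = psycopg2.connect("dbname=warehouse")  # hypothetical DSN
|     with conn, conn.cursor() as cur:
|         # Pre-join the hot tables once, server-side.
|         cur.execute("""
|             CREATE MATERIALIZED VIEW IF NOT EXISTS patient_flat AS
|             SELECT p.patient_id, p.age, v.visit_date, d.diagnosis
|             FROM patients p
|             JOIN visits v USING (patient_id)
|             JOIN diagnoses d USING (visit_id)
|         """)
|         # Index the common access path; analysts then query
|         # patient_flat instead of repeating the 15-way join.
|         cur.execute("""
|             CREATE INDEX IF NOT EXISTS patient_flat_visit_date_idx
|             ON patient_flat (visit_date)
|         """)
|
| A periodic REFRESH MATERIALIZED VIEW patient_flat keeps it current as the source tables change.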
| tomrod wrote:
| > why isn't someone rolling those tables into something a bit easier with some backend process.
|
| You've identified why we're all going to have jobs in 100 years.
|
| Automation sounds great. It's exponentially augmenting to some users in a defined user space.
|
| Until you get to someone like me, who looks at the production structure and goes: "this is wholly insufficient for what I need to build, but it has good bones, so I'm going to strip it out and rebuild it for my use case, and only my use case, because waiting for a team to prioritize it according to an arcane schedule would push my team's deadlines exceedingly far."
|
| This is why you don't roll everything into backend processes. Companies set up for production (high automation value ROI) and analytics (high labor value ROI) and have a hard time serving the mid-tail. EVERYTHING in either direction works against the mid-tail: security policies, data access policies, software approvals, you name it.
|
| People, policy, and technology. These are the three pillars. If your org isn't performing the way it should, then by golly work on one of these, and remember that technology is only one of three.
| punnerud wrote:
| Three pillars where you can only choose two to be perfect. Just like databases and the CAP theorem.
| tomrod wrote:
| Technology is the easiest to adjust, ironically.
| protomyth wrote:
| I don't disagree with writing solid SQL. I would go so far as to say some things (most) need to be in stored procedures that are reviewed by competent people. But some folks don't think about usage sometimes.
|
| This is one of those things I just don't get about folks setting up their databases. If you have a rather large dataset that keeps building via daily transactions, then it's time to recognize you really have some basic distinct scenarios and to plan for them.
|
| The most common is adding or querying data about a single entity. Most application developers really only deal with this scenario, since that is what most applications care about and how you get your transactional data. Basic database knowledge gets most people through this OK, with proper primary and secondary keys.
|
| Next up is a simple rule: "if you put a state on an entity, expect someone to need to know all the entities with this state." This is a killer for application developers for some reason. It actually requires some database knowledge to set up correctly so it is performant. If the data analysts have problems with those queries, then it's time to get the DBA to fix the damn schema and write some stored procedures for the app team.
|
| At some point, you will need to do actual reporting, excuse me, business intelligence. You really should have some process that takes the transactional data and puts it into a form where the queries of data analysts can take place. In the old days that would be something like loading up Red Brick or some equivalent. Transactional systems make horrid reporting systems. Running those types of queries on the same database the transactional system is currently trying to work on is just a bad idea.
|
| Of course, if you are buying something like IBM DB2 EEE and spreading queries across a room of 40 POWER servers, then ignore the above. IBM will fix it for you.
| jnsie wrote:
| At its simplest, it's OLTP vs OLAP. Separate the data entry/transactional side of things from the reporting part. Make it efficient for data analysts to do their jobs.
| fifilura wrote:
| I can't help thinking that with better processes and tools those 1400 analysts could instead be 50 analysts? (or even 5?)
|
| And that building an aggregation ETL pipeline, maybe inspired by this post, could be the solution?
| 1980phipsi wrote:
| VA -> government -> bloat
| allknowingfrog wrote:
| Is it possible to do any kind of version control with PostgREST? I have no doubt that raw SQL offers huge performance advantages over my ORM, but I'm pretty fond of the ability to roll back to a previous state when I push a bad change to production. Performance is just one of a number of tradeoffs that I consider when I'm choosing tools.
| sitkack wrote:
| It is customary to check in all DB assets: table creation scripts, queries, stored procedures, etc. No different than the rest of the development process. If something _isn't_ being versioned, there is a huge problem.
| jonas_b wrote:
| Question: if I have a CSV file that I'd like to do some quick SQL queries on before moving the results into Pandas, what would be a good resource for this? Preferably compatible with the rest of the Python-dataframe ecosystem and as simple as pd.read_csv().
| Vaslo wrote:
| Try SQLAlchemy?
| vincnetas wrote:
| http://harelba.github.io/q/
|
|     q "SELECT COUNT(*) FROM ./clicks_file.csv WHERE c3 > 32.3"
|
| It uses sqlite under the hood.
| arusahni wrote:
| In addition to the recommendation for sqlite, I've found `csvkit` to be an extremely useful set of CLI tools for CSV munging. The `csvsql` [1] entrypoint is especially handy because it allows you to issue SQL queries against your CSV file directly vs. loading then querying.
|
| 1: https://csvkit.readthedocs.io/en/latest/scripts/csvsql.html
| rahulj51 wrote:
| Try duckdb to query and even transform the data, and then export to a pandas df.
| simonw wrote:
| If you want to try it in SQLite (pros: no need to run a server or install anything since it's in the Python standard library; cons: not nearly as many advanced statistical analysis features as PostgreSQL), my sqlite-utils CLI tool may help here: it can import from CSV/TSV/JSON into a SQLite database: https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...
| saltcured wrote:
| The sqlite3 connection object in Python allows you to register callables which you can use as scalar or aggregate functions in your SQL queries. With this, you can fill some of the gaps compared to PostgreSQL by essentially importing Python libraries. I just found this nice tutorial while looking for relevant docs:
|
| https://wellsr.com/python/create-scalar-and-aggregate-functi...
|
| However, I think the limited type system in SQLite means you would still want to extract more data to process in Python, whether via pandas, numpy, or scipy stats functions. Rather than introducing new composite types, I think you might be stuck with just JSON strings and frequent deserialization/reserialization if you wanted to build up structured results and process them via layers of user-defined functions.
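| A minimal, runnable sketch of the registration API being described; the function and table names are made up for illustration:
|
|     import math
|     import sqlite3
|     import statistics
|
|     con = sqlite3.connect(":memory:")
|     con.execute("CREATE TABLE t (x REAL)")
|     con.executemany("INSERT INTO t VALUES (?)", [(1.0,), (10.0,), (100.0,)])
|
|     # Scalar UDF: any Python callable becomes callable from SQL.
|     con.create_function("log10", 1, math.log10)
|
|     # Aggregate UDF: a class with step() and finalize().
|     class StdDev:
|         def __init__(self):
|             self.values = []
|         def step(self, value):
|             if value is not None:
|                 self.values.append(value)
|         def finalize(self):
|             return statistics.pstdev(self.values) if self.values else None
|
|     con.create_aggregate("stddev", 1, StdDev)
|
|     print(con.execute("SELECT stddev(log10(x)) FROM t").fetchone())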
| ithrow wrote:
| You can easily import the csv file into sqlite, and you don't even have to create the table fields beforehand.
| bruiseralmighty wrote:
| I keep a non-commercial installation of SQL Server on my machine for this reason, but this is likely overkill for most purposes, and it requires a Windows machine.
|
| It does have some nice import features for CSV data, though.
| wswope wrote:
| SQLite is what you're looking for. If you want to use the CLI:
|
|     .mode csv
|     .import <path to csv file> <table name>
|
| Alternatively, read your data into pandas; there's extremely easy interop between a DBAPI connection from the Python standard lib sqlite3 module and pandas (to_sql, read_sql_query, etc.).
| thamer wrote:
| I would second this suggestion. I was querying CSV data imported into SQLite this weekend, and it was extremely easy to get started. SQLite is pretty snappy even for tables with millions of rows.
|
| SQLite supports defining columns without a type and will use TEXT by default, so you can take the first line of your CSV, the one that lists the column names, put those names in the brackets of a CREATE TABLE statement, and then run the .import described above (so just CREATE TABLE foo(x,y,z); if x,y,z are your column names).
|
| After importing the data, don't forget to create indexes for the queries you'll be using most often, and you're good to go.
|
| Another suggestion for once your data is imported: have SQLite report it in table format:
|
|     .mode column
|     .headers on
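| The same import is only a few lines of the Python standard library, for anyone who wants to stay in one process; the file name is hypothetical, and the first CSV line is assumed to hold valid column names:
|
|     import csv
|     import sqlite3
|
|     con = sqlite3.connect("scratch.db")
|     with open("clicks.csv", newline="") as f:  # hypothetical file
|         rows = csv.reader(f)
|         header = next(rows)
|         con.execute("CREATE TABLE clicks (%s)" % ", ".join(header))
|         con.executemany(
|             "INSERT INTO clicks VALUES (%s)" % ", ".join("?" * len(header)),
|             rows,
|         )
|     con.commit()
|     # From here: plain SQL, or pd.read_sql_query("SELECT ...", con).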
| zabzonk wrote:
| Windows provides an ODBC driver for CSV files - I don't know how this would play with Pandas.
| jplr8922 wrote:
| tl;dr: if you hear ''but we can do this in SQL'', RUN!!!
|
| My eyes hurt as I read this article. There are reasons why analysts don't use SQL to do their job, and they have nothing to do with saving RAM and memory.
|
| 1) Data analysis is not a linear process; it involves playing with and manipulating the data in different ways and letting your mind drift a bit. You want your project in an IDE made for that purpose, with the ability to create charts, source control, export and share the information, etc. Pandas is just a piece of that puzzle, which is not possible to replicate in pure SQL.
|
| 2) In 2020, there are numerical methods you want to try beyond a traditional regression. Most real-world data problems are not made for the stats-101 tools included in SQL. Kurtosis? Autocorrelation?
|
| 3) Politics. Most database administrators are control freaks who _hate_ the idea of somebody else doing stuff in their DB. Right now where I work, we still have to use SSIS 2013 instead of stored procedures in order to avoid the DBA refusal bureaucratic process.
|
| 4) Eventual professional development. If your analysis is good and creates value, chances are it will become a 'real' program and you will have to explain what you are doing to turn it into an OOP tool. If you have CS101, good coding in Python will make this process much easier than a 3000-line spaghetti SQL script.
|
| 5) Data cleaning. Dealing with outliers, NaNs and all that jazz really depends on the problem you are trying to solve. The absence of a one-size-fits-all solution is a good case for R/pandas/etc. These issues will break an SQL script in no time.
|
| 6) Debugging in SQL. Hahahahahahahaha
|
| If you are still preoccupied with the RAM usage of your PC for your project, here are two solutions which infuriate a lot of DBAs I've worked with.
|
| A) https://www.amazon.ca/s?k=ram&__mk_fr_CA=%C3%85M%C3%85%C5%BD...
|
| B) https://aws.amazon.com/
| qwertox wrote:
| What is your problem with this example, doesn't it get the job done?
|
|     WITH temperatures AS (
|         /* ... */
|     )
|     SELECT *, MAX(c) OVER (
|         ORDER BY t
|         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
|     ) AS hottest_temperature_last_three_days
|     FROM temperatures;
|
|          t      | c  | hottest_temperature_last_three_days
|     ------------+----+-------------------------------------
|      2021-01-01 | 10 | 10
|      2021-01-02 | 12 | 12
|      2021-01-03 | 13 | 13
|      2021-01-04 | 14 | 14
|      2021-01-05 | 18 | 18
|      2021-01-06 | 15 | 18
|      2021-01-07 | 16 | 18
|      2021-01-08 | 17 | 17
|
| Why should I fetch all the data and then form the result with another tool?
|
| What a great article.
| jplr8922 wrote:
| I re-read my comment, and I think I expressed myself too harshly. If you like SQL and it gets the job done for you, no problem. I could see myself using that SQL query.
|
| However, in order to get there, you need to know why you need the ''hottest_temperature_last_three_days''. Why not 2 or 4 days? Why not use heating degree days? What about serial correlation with other regions, or metering dysfunction? What if you are working directly with raw data from instruments, and still need to choose which cleaning method you will use? What if you want to check the correlation with another dataset which is not yet in your database (e.g. a private dataset in Excel from a potential vendor)?
|
| If you know _exactly_ what you want, sure, SQL is the way to go. However, the first step of a data project is to admit that you don't know what you want. You will perform a literature review and might have a general idea, but starting with a precise solution in mind is a recipe for failure. This is why you might need to fetch all the data and then form results with another tool. How do you even know which factors and features must be extracted before doing some exploration?
|
| If you know the process you need and only care about RAM/CPU optimization, sure, SQL is the way to go. Your project is an ETL, and you have the job of a programmer who is coding a report. There is no ''data analysis'' there...
| sagarm wrote:
| If you know SQL well, then doing exploratory analysis in SQL is perfectly reasonable. Especially if you are using a high-performance analytic database.
|
| Disclosure: I develop high-performance analytic database systems with SQL frontends.
| jplr8922 wrote:
| What do you mean by exploratory analysis? Let's assume that my dataset contains features which are not normally distributed, and I want to check coskewness and cokurtosis matrices before and after I remove outliers with a method specific to my field of research, found in an academic paper. Am I supposed to code all of this in SQL? What is your definition of exploratory analysis, and which metrics do you have in mind?
|
| And what if I want to make graphs to present stuff to a colleague? Am I supposed to use... Excel for that?
| qwertox wrote:
| > If you know exactly what you want, sure SQL is the way to go.
|
| I've been doing similar aggregations with MongoDB, simply because when I start a project where I quickly want to store data and then a week later check what I can do with it, it's a tool which is trivial to use. I don't need to think about so many factors, like where and how to store the data. I use MongoDB for its flexibility (schemaless), compared to SQL.
|
| But I also use SQL, mostly because of the power relations have, yet I use it only for storing stuff where it's clear how the data will look forever, and what (mostly simple) queries I need to perform.
|
| I read your comment as if it was suggesting that these kinds of articles are not good, yet for me it was a nice overview of some interesting stuff that can be done with SQL. I don't chase articles on SQL, so I don't get many to read, but this one is among the best I've read.
|
| To get back to the quote: How will I know if SQL can do what I want, if I don't read these kinds of articles?
| kbelder wrote:
| Familiar. I think the biggest issue is understanding workflow. My hunch is that some developers view a data analysis project as a program... it needs to have these data processing steps performed, summarized in this certain way, and saved off in that certain format. Once those queries are defined, the program is run and the analysis done.
|
| The actual work the analyst is doing is far more ad hoc than that. It's iterative and frequently will get torn down, rebuilt, and joined to other sources as they discover new aspects of the data. They need their hands on the raw data, not on a specifically summarized version of it.
| twobitshifter wrote:
| I now do most of my data analysis in Julia instead of pandas, but I used pandas for a long time. SQL is a valuable skill, and useful when you need to optimize a query for performance, which can sometimes happen with pandas and is much less likely with Julia.
|
| However, even if a pandas query takes 5X longer to run than a SQL query, you must consider the efficiency for the developer. You can chain pandas commands together to accomplish something that would take 10x the lines in SQL. With SQL you're more likely to end up with many intermediate CTEs along the way. So while you can definitely save processor cycles by using SQL, I don't think you'll save clock-face time by using it for most one-off tasks. Data science is usually column-oriented, and Julia and pandas allow you to stay in that world.
| nomel wrote:
| For me, the main use case of a database server is so I can do operations that require more memory than I have on my local computer. Reading through this comment section makes me think my use case is rare or something. Maybe everyone is OK with chunked operations?
| spamalot159 wrote:
| I think for a lot of people, SQL is a skill that doesn't stick. You learn enough to do the queries you need for your project, they work, then you forget about them as you work on the rest of your project. These skills are perishable. Left outer join? Yeah, I knew what that was some time ago, but not anymore.
|
| The days of dedicated SQL programmers are mostly gone.
| zabzonk wrote:
| > The days of dedicated SQL programmers are mostly gone.
|
| I've never met a "dedicated SQL programmer" - all the C++ programmers I've worked with in the investment banking world were also expected to know, and did know, SQL pretty well - asking questions about it was normally part of the interview process.
| simonw wrote:
| I started learning SQL twenty years ago, and it's one of the only skills from the start of my career that has been consistently useful ever since.
| slver wrote:
| We have a lot more clueless developers on average, which makes it seem like no one uses SQL anymore beyond the basics. But we're still here, believe me. No one today is a dedicated X programmer, but that doesn't mean we're all clueless about X.
| nisegami wrote:
| I held this view before graduating and moving back to my 3rd world home country. For some reason, people here (the established ones with 5+ years of experience) see SQL as a general purpose programming language with which they do as much as possible.
| andrewmcwatters wrote:
| Exactly. I own a software consultancy, and I tried explaining this to an engineering manager, and he just didn't get it.
|
| For some reason, he couldn't understand that when you write SQL queries for a project, you typically do it once and basically never again, with the exception of maybe adding or removing columns from the query.
|
| The "hard work," if you can call it that, is all in the joins. Then I completely forget about it.
|
| You spend so much more time in development on everything else.
| muxator wrote:
| > This benchmark does not mention the memory consumed by the database itself - this is intentional. [...] Whether you decide to use the database or not, the memory is already paid for, so you might as well use it!
|
| This sentence is a big red flag for me. An analysis of a strategy that pushes work towards a subsystem, and then purposely ignores the performance implications for that subsystem, is methodologically unsound.
|
| Personally, I am all for using the DB and writing good SQL. But if I weren't, this argument would not convince me.
| truth_seeker wrote:
| Pure SQL porn. Very exciting!
|
| Lately I am seeing really creative blog posts on HN. Keep it up, guys.
| flakiness wrote:
| The more I learn SQL, the less I write Python.
|
| Although the SQL syntax is weird and dated, its portability across tools trumps everything else. You finished the EDA and decided to port the insights to a dashboard? With SQL it's trivial. With Python... well, probably you'll have to port it to SQL, unless you have Netflix-like, Jupyter-backed dashboard infrastructure in place. For those of us who only have a much-more-prevalent SQL-based dashboard platform, why not start from SQL? Copy-and-paste is your friend!
|
| I still hate SQL as a programmer, but as a non-expert data analyst I have now accepted it.
| arcticfox wrote:
| After 15 years of programming I have come to love SQL - 10 years ago I used to do everything in the book to avoid it out of hate, even though I knew how to use it, but over time I have moved more and more into the database / SQL, and my code is just better and better as a result.
| smilbandit wrote:
| I'm neutral on SQL but very big on the right tool for the job. My rule of thumb is: if I'm querying data just to massage it and then stuff it back into the database, then I first try to do that as a stored procedure. I remember learning about windowing in SQL late into a project, and redoing some Python functions as stored procedures really improved performance.
| simonw wrote:
| This article is so useful. It starts out with SQL basics but then quickly leaps into all kinds of PostgreSQL tricks that I didn't know about - binning, efficient sampling, even calculating linear regression.
| nerdponx wrote:
| SQL syntax sucks for doing non-trivial data analysis. I've tried it. Verbose, no composability or code reuse, not really portable across different databases, no easy interoperability with other tools, limited editor/IDE support, etc.
|
| I guess if you have huge amounts of data (10m+ rows) already loaded into a database, then sure, do your basic summary stats in SQL.
|
| For everything else, I'll continue using SQL to get the data from the database and using Pandas or Data.Table to actually analyze it.
|
| That said, this is a very comprehensive review of SQL techniques, which I think could be very useful for when you _do_ have bigger datasets and/or just need to get data out of a database efficiently. Great writeup, and IMO required reading for anyone looking to be a serious "independent" data scientist (i.e. not relying on data engineers to do basic ETL for you).
|
| I'd be a huge fan of something like the PySpark DataFrame API that "compiles" to SQL* (but that doesn't require you to actually be using PySpark, which is its own can of worms). I think this would be a lot nicer for data analysis than any traditional ORM style of API, while providing better composability and IDE support than writing raw SQL.
|
| *I also want this for Numexpr: https://numexpr.readthedocs.io/en/latest/user_guide.html, but I also want a lot of other things, like Arrow-backed data frames and a Numexpr-like C library to interact with them.
| ramraj07 wrote:
| Completely disagree. I have the choice of using Snowflake or Spark/pandas to do my EDA, and I'll choose SQL every time. The code is significantly more readable once you get used to it, and you can most definitely do things one step at a time using UDFs and temp tables / CTEs. I've come back to EDA I did a year back, and it's always easier to read a long SQL script than a notebook with pandas code.
| disgruntledphd2 wrote:
| There are pluses and minuses to both. That being said, how do I write a function in SQL which can abstract over something I do a lot (like the pivoting example earlier), or even some date function like iff(date > 'important_date', 'before', 'after') as grouper?
|
| Honestly, that's what ends up making me move away from doing analytics in SQL.
| saltcured wrote:
| Edit: oops, I think I replied a level deeper than intended. I was responding to the composition/abstraction topic. I think I should just leave this here now, though.
|
| I assume you are talking about composition of generic set-processing routines, but I wonder if others realize that? It is easy enough to write a set-returning function and wrap it in arbitrary SQL queries to consume its output. But it is not easy to write a set-consuming function that can be invoked on an arbitrary SQL query to define its input. Thus, you cannot easily build a library of set-manipulation functions and then compose them into different pipelines.
|
| I think different RDBMS dialects have different approaches here, but none feel like natural use of SQL. You might do something terrible with cursors. Or you might start passing around SQL string arguments to EXECUTE within the generic function, much like an eval() step in other interpreted languages. Other workarounds are to do everything as macro-processing (write your compositions in a different programming language and "compile" to SQL you pass to the query engine) or to abuse arrays or other variable-sized types (abuse some bloated "scalar" value in SQL as a quasi-set).
|
| What's missing is some nice, first-class query (closure) and type system. It would be nice to be able to write a CTE and use it as a named input to a function, and to have a sub-query syntax to pass an anonymous input to a function. Instead, all we can do is expand the library of scalar and aggregate functions, but constantly repeat ourselves with the boilerplate SQL query structures that orchestrate these row-level operations.
| disgruntledphd2 wrote:
| Yeah, that's exactly the issue. One can do this in Python, but not in SQL, and this leads to boilerplate.
|
| I actually think that SparkSQL is a good solution here, as you can create easily reusable functions.
| rdedev wrote:
| For Apache Arrow backed dataframes, check out Polars (https://github.com/ritchie46/polars).
| disgruntledphd2 wrote:
| > (i.e. not relying on data engineers to do basic ETL for you).
|
| Is this actually a thing? Surely it can't be a thing.
| denimnerd42 wrote:
| Of course; that's my job... we do basic ETL on hundreds of data sources to provide data to the analysts and quants.
| EForEndeavour wrote:
| In a sufficiently large org, why not?
| dreyfan wrote:
| > SQL syntax sucks for doing non-trivial data analysis. I've tried it. Verbose, no composability or code reuse, not really portable across different databases, no easy interoperability with other tools, limited editor/IDE support, etc.
|
| You're entitled to your opinion and tooling choices, of course, but the problem is you don't know SQL.
| dunefox wrote:
| That could be said about any tool, framework, language, library, etc. Invest enough time and you can do everything with Malbolge, but that doesn't mean it doesn't suck.
| mixmastamyk wrote:
| A more useful comment would be to illustrate how. I'd like to know as well, as I occasionally need to reuse queries in SQL but am not an expert. Currently I believe learning PL/pgSQL is the answer, but even it reeks of punch cards and other sixties-isms :-). Tough sell when you're used to Python etc.
| pandasusr wrote:
| I am a pandas user considering refactoring part of my ETL pipeline to SQL. I see the trade-off as memory efficiency vs expressiveness, and for simple queries on big data, SQL wins. Would you disagree that Pandas/Python is more expressive than SQL? I'm less experienced in SQL, but based on my limited experience there, it seems Pandas is clearly more expressive. What is the SQL equivalent of Pandas .apply(lambda x)?
| higeorge13 wrote:
| Please define an example lambda function, in order to see whether there is an SQL equivalent. IMO >90% of the data issues I have seen can be solved with SQL queries. I have seen some in the finance sector which would require super complex UDFs, but other than those, SQL is the first choice for solving a data issue.
| pandasusr wrote:
| I do work in the finance sector and need complex functions :)
|
| And it's interesting that pandas was invented at a hedge fund, AQR.
|
| I agree that SQL may be better for vanilla BI.
| hodgesrm wrote:
| ClickHouse has lambdas for arrays. They are very useful. Here's an example.
|
|     WITH ['a', 'bc', 'def', 'g'] AS array
|     SELECT arrayFilter(v -> (length(v) > 1), array) AS filtered
|
|     +-filtered-----+
|     | ['bc','def'] |
|     +--------------+
|
| The lambda in this case is a selector for strings with more than one character. I would not argue that they are as general as Pandas, but they are mighty useful. More examples in the following article.
|
| https://altinity.com/blog/harnessing-the-power-of-clickhouse...
| default-kramer wrote:
| "No composability or code reuse" is definitely a valid criticism of SQL. Check out the very first example of my personal project https://docs.racket-lang.org/plisqin/Read_Me_First.html and let me know how you would implement something akin to `(CategoryName p)` and `(TotalSales p)` in SQL. A view does not count, because then you either have one view per derived field, and that is very cumbersome to consume, or you have one view with all the derived fields, and you will end up paying performance costs for fields that you are not using.
| ok123456 wrote:
| Don't views and materialized views give you reuse?
|
| Can't you do composability with foreign references?
| neurocean wrote:
| Sure, while views may be fine for small projects, using them for larger projects like data warehousing is usually a mistake you'll come to regret.
|
| SQL views are rarely unit tested, so you always end up in a regression nightmare when you need to make updates.
|
| If you're going to go pure SQL, you should use something like dbt.
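| For readers following the reuse debate: the view approach ok123456 describes looks roughly like this sketch (SQLite syntax, invented names), with neurocean's caveat that nothing here is tested unless you add a tool like dbt on top:
|
|     import sqlite3
|
|     con = sqlite3.connect(":memory:")
|     con.executescript("""
|         CREATE TABLE sales (region TEXT, amount REAL, sold_on TEXT);
|
|         -- The derived field is defined once and reused everywhere.
|         CREATE VIEW sales_enriched AS
|         SELECT *,
|                CASE WHEN sold_on >= '2021-01-01'
|                     THEN 'after' ELSE 'before' END AS era
|         FROM sales;
|     """)
|     rows = con.execute("""
|         SELECT region, era, SUM(amount)
|         FROM sales_enriched
|         GROUP BY region, era
|     """).fetchall()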
| hodgesrm wrote:
| > I guess if you have huge amounts of data (10m+ rows) already loaded into a database then sure, do your basic summary stats in SQL.
|
| This is not huge. This is quite small. Pandas can't handle data that runs into billions of rows _or_ sub-second response on arbitrary queries. Both are common requirements in many analytic applications.
|
| I like Pandas. It's flexible and powerful if you have experience with it. But there's no question that SQL databases (especially data warehouses) handle large datasets and low-latency response far better than anything in the Python ecosystem.
| nerdponx wrote:
| _This is not huge. This is quite small. Pandas can't handle data that runs into billions of rows or sub-second response on arbitrary queries. Both are common requirements in many analytic applications._
|
| You're right. It's "huge" with respect to what you can expect to load into Pandas and get instant results from. But it's not even "medium" data on the small-medium-big spectrum.
|
| _I like Pandas. It's flexible and powerful if you have experience with it. But there's no question that SQL databases (especially data warehouses) handle large datasets and low latency response far better than anything in the Python ecosystem._
|
| I agree with this 100%, but I think a lot of people missed it in my post.
| hodgesrm wrote:
| OTOH Pandas/Numpy are great for results. One pattern that I'm hoping to try is running the heavy lifting in SQL, transferring results via Apache Arrow, and manipulating the result set in Python with the aforesaid libraries.
|
| I don't know enough about how Arrow handles streaming to understand how this would really work, but it would break away from the single-threaded connectivity with expensive ser/deser that databases have used since the days of Sybase DB-Library, the predecessor to ODBC. 36 years is probably long enough for that model.
| neurocean wrote:
| Yeah, but that's not even close to being Pandas' value proposition. It's for the data scientist and analyst, not the DB admin, data engineer or production applications.
| beforeolives wrote:
| Nice article. Pandas gets the job done, but it's such a step backwards in terms of usability, API consistency and code feel. You can do anything that you could possibly need with it, but you regularly have to look up things that you've looked up before, because the different parts of the library are patched together and don't work consistently in an intuitive way. And then you end up with long lines of().chained().['expressions'].like_this(0).
| nvilcins wrote:
| You might be thinking of specific functionality that you find is implemented in an overly long/verbose fashion. But generally speaking, how are
|
| > long lines of().chained().['expressions'].like_this(0)
|
| a _bad thing_?
|
| IMHO these pandas chains are easy to read and communicate quite clearly what's being done. If anything, I've found that in my day-to-day reading of pandas I parse the meaning of those chains at least as efficiently as comments of any level of specificity, or as what other languages (that I have had experience with) would've looked like.
| musingsole wrote:
| People don't like them because the information density of pandas chains is soooo much higher than the rest of the surrounding code. So they're reading along in a happy place, consuming a few concepts per statement...
| ...and then BOOM, pandas chain! One statement containing 29+ concepts and their implications.
|
| Followed by more low-density code. The rollercoaster leads to complaints because it _feels_ harder, not because of any actual change in difficulty.
|
| /that's my current working theory, anyway
| disgruntledphd2 wrote:
| Hmmm, interesting. I don't mind the information density, coming from R, which is even more terse, but the API itself is just not that well thought out (which is fair enough; he was learning as he went).
| isoprophlex wrote:
| This really grinds my gears too. There's something about the pandas API that makes it impossible for me to do basic ops without tedious manual browsing to get inplace or index arguments right... assignments and conditionals are needlessly verbose too.
|
| Pyspark on the other hand just sticks in my brain, somehow. Chained pyspark method calls look much neater.
| ziml77 wrote:
| Setting inplace=True isn't too bad, but I have definitely had many issues with working with indexes in Pandas. I don't understand why they didn't design it so that the index can be referenced like any other column. It overcomplicates things, like having to know the subtle difference between join() and merge().
| nojito wrote:
| Setting inplace=True is not recommended and should be used with caution.
|
| https://github.com/pandas-dev/pandas/issues/16529
| ziml77 wrote:
| I've never seen anything about inplace being a poor idea to use. Is that documented anywhere, or is that ticket the only info about it?
| Peritract wrote:
| It's not particularly front-and-centre, but it's all over various discussion boards if you go looking for it directly. I'd like the debate to be more visible, personally, particularly whenever the argument gets deprecated for a particular method.
|
| Essentially, `inplace=True` rarely actually saves memory, and it causes problems if you like chaining things together. The people who maintain the library and populate the discussion boards are generally pro-chaining, so `inplace` is slowly and quietly on its way out.
| disgruntledphd2 wrote:
| Pandas is just a bad API, to be honest. I know base R very, very well (which was one of the inspirations, I believe), and I still spend most of my time looking stuff up.
|
| It's such a shame that Python doesn't have a better DF library.
| shankr wrote:
| I also switched from R to Python/pandas. I remember always being frustrated with pandas since it tries to emulate data.frame, but then just does its own thing without being consistent.
| EForEndeavour wrote:
| I've used pandas regularly for the past ~5 years and find its API intuitive enough not to complain. I can write and read decently long pandas chained expressions fluently, but I barely know any R. Am I unwittingly a hostage of an inferior API who doesn't know what he's missing?
| disgruntledphd2 wrote:
| Yes.
|
| Base R is OK, but dplyr is magical.
|
| For instance, integer indexing in base R is df[row, col] rather than the iloc pandas stuff.
|
| plot, print and summary (and generic-function OOP more generally) are really underappreciated.
|
| Python is a better programming language, but R is a better data analysis environment.
|
| And dplyr is an incredibly fluent DSL for doing data analysis (not quite as good for modelling, though).
|
| Seriously, I read the original vignette for dplyr in late 2013/early 2014, and within two weeks I'd switched most of my new analytical code over to it. So very, very good.
|
| Less idea-impedance mismatch than any other environment, in my experience.
| shankr wrote:
| I was actually using data.table. The syntax can be a bit cryptic, but you get used to it.
| [deleted]
| shankr wrote:
| This might not seem like a big issue, but in the beginning these were my issues:
|
| 1. No easy way to rename columns during aggregation.
|
| 2. The group by generates its own grouped-by data, and hence you almost always need `reset_index`.
|
| 3. Sometimes group by can convert a dataframe to a series.
|
| 4. Now `.loc` has provided somewhat consistent indexing/slicing, but earlier you had `.ix`, `.iloc` and what not.
|
| These are the ones I can remember off the top of my head. Of course all of these have solutions, but it makes pandas much more verbose. In R, these are just much more succinct.
| carabiner wrote:
| Can you honestly say you'd prefer to debug _thousands_ of lines of SQL versus the usual <100 lines of Python/pandas that does the same thing? It's no contest. A histogram in pandas: df.col.hist(). Unique values: df.col.value_counts(). Off the top of your head, what is the cleanest way of doing this in SQL, and how does it compare? How anyone can say that SQL is objectively better (in readability, density, any metric), other than that they learned it first and are now frustrated that they have to learn another new tool, baffles me.
|
| I learned Pandas first. I have no issue with indexing, different ways of referencing cells, modifying individual rows and columns, numerous ways of slicing and dicing. It gets a little sprawling, but there's a method to the madness. I can come back to it months later and easily debug. With SQL, it's just madness and 10x more verbose.
| beforeolives wrote:
| > Can you honestly say you'd prefer to debug thousands of lines of SQL versus the usual <100 lines of Python/pandas that does the same thing?
|
| That's fair. I was using the opportunity to complain about pandas and didn't point out all the problems SQL has for some tasks. What I really want is a dataframe library for Python that's designed in a more sensible way than pandas.
| marktl wrote:
| Distinct count and group by in SQL will provide distinct-count and histogram (numerical value) output. Like NoSQL vs relational, it seems there are use cases that lend themselves to new vs old technologies. It's hard for me to put much stock in opinions from those who have vastly more experience in one of the compared tools than the other.
| nerdponx wrote:
| I much prefer writing actual functions in a real programming language to the verbose, non-reusable, non-composable relic that is SQL syntax.
|
| Give me a better query language and I will gladly drop Pandas and Data.Table.
| semitones wrote:
| SQL is a real programming language.
| thinkharderdev wrote:
| It's not Turing complete, which is, I think, why it doesn't feel like a "real" programming language.
| sagarm wrote:
| Recursive CTEs exist; while I haven't investigated thoroughly, I would be surprised if they didn't make SQL Turing complete.
|
| More usefully, most databases allow you to write user-defined functions in other languages, including imperative SQL dialects.
| steve-chavez wrote:
| SQL is Turing complete. A demonstration with recursive CTEs is done here[1].
| beforeolives wrote: | Yes, SQL does have the problem of bad composability, and some | things are less explicit than they are when working with | dataframes. Dplyr is probably the most sane API out of all of | them. | alexilliamson wrote: | +1 for dplyr. I've used both pandas and dplyr daily for a | couple of years at different times in my career, and there is | no comparison in my mind when it comes to | usability/verbosity/number of times I need to look at the | documentation. | m16ghost wrote: | >Pandas is a very popular tool for data analysis. It comes built- | in with many useful features, it's battle tested and widely | accepted. However, pandas is not always the best tool for the | job. | | SQL is very useful, but there are some data manipulations which | are much easier to perform in pandas/dplyr/data.table than in | SQL. For example, the article discusses how to perform a pivot | table, which takes data in a "long" format and makes it "wider". | | In the article, the pandas version is: | | >pd.pivot_table(df, values='name', index='role', | columns='department', aggfunc='count') | | Compared to the SQL version: | | >SELECT role, SUM(CASE department WHEN 'R&D' THEN 1 ELSE 0 END) | as "R&D", SUM(CASE department WHEN 'Sales' THEN 1 ELSE 0 END) as | "Sales" FROM emp GROUP BY role; | | Not only does the SQL code require you to know up front how many | distinct columns you are creating, it requires you to write a | line out for each new column. This is okay in simple cases, but | is untenable when you are pivoting on a column with hundreds or | more distinct values, such as dates or zip codes. | | There are some SQL dialects which provide pivot functions like | pandas does, but they are not universal. | | There are other examples in the article where the SQL code is | much longer and less flexible, such as binning, where the bins | are hardcoded into the query. | michelpp wrote: | Does `tablefunc.crosstab()` do what you want? | | https://www.postgresql.org/docs/13/tablefunc.html | m16ghost wrote: | It is not much better than the canonical example given in the | article. It still has the following usability issues: | | - You still need to enumerate and label each new column and | its type. This particular problem is fixed by crosstabN(). | | - You need to know up front how many columns are created before | performing the pivot. In the context of data analysis, this | is often dynamic or unknown. | | - The input to the function is not a dataframe, but a text | string that generates the pre-pivot results. This means your | analysis up to this point needs to be converted into a | string. Not only does this disrupt the flow of an analysis, | you also have to worry about escape characters in your | string. | | - It is not standard across SQL dialects. This function is | specific to Postgres, and other dialects have their own | versions of this function with their own limitations. | | The article contains several examples like this where SQL is | much more verbose and brittle than the equivalent pandas | code. | papercrane wrote: | That's one of the non-standard ways to do it. MSSQL and | Oracle also have pivot functions. Unfortunately there is no | standard way to do this. | ellis-bell wrote: | agreed that pivoting can be a pain. | | a pattern that i converged on --- at least in postgres --- is | to aggregate your data into json objects and then go from | there. you don't need to know how many attributes (columns) | should be in the result of your pivot. you can also do this in | reverse (pivot from wide to long) with the same technique. | | so for example if you have the schema `(obj_id, key, value)` in | a long-formatted table, where an `obj_id` will have data | spanning multiple rows, then you can issue a query like | | ``` SELECT obj_id, jsonb_object_agg(key, value) FROM long_table | GROUP BY obj_id; ``` | | up to actual syntax... it's been a while since i've had to do a | task requiring this, so details are fuzzy, but the pattern's | there. | | so each row in your query result would look like a json | document: `(obj_id, {"key1": "value", "key2": "value", ...})` | | see https://www.postgresql.org/docs/current/functions-json.html | for more goodies.
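A sketch of how that JSON-aggregation pattern can land in a dataframe without enumerating columns up front. Everything here is an assumption for illustration: the `long_table` schema from the comment above, a psycopg2 connection with a hypothetical DSN, and psycopg2's default decoding of jsonb values into Python dicts:

        import pandas as pd
        import psycopg2  # assumed driver; any DBAPI driver works similarly

        conn = psycopg2.connect("dbname=example")  # hypothetical DSN
        with conn.cursor() as cur:
            cur.execute(
                "SELECT obj_id, jsonb_object_agg(key, value) AS attrs "
                "FROM long_table GROUP BY obj_id"
            )
            rows = cur.fetchall()

        # psycopg2 decodes jsonb into dicts, so each row is
        # (obj_id, {"key1": ..., "key2": ...}); json_normalize spreads
        # whatever keys exist into columns, with no up-front column list.
        obj_ids, attrs = zip(*rows)
        wide = pd.DataFrame({"obj_id": list(obj_ids)}).join(
            pd.json_normalize(list(attrs))
        )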
[deleted] | hantusk wrote: | Agreed. https://ibis-project.org/ and | https://dbplyr.tidyverse.org/ can compile dataframe-like input | to SQL, which might bridge the gap in tooling (although there | are still small differences from the pure dataframe syntax). | kbelder wrote: | I've been doing a lot of data analysis in Pandas recently. I | started off thinking that, for efficiency's sake, I should do as | much initial processing in the DB as possible, and use Pandas | just for the higher-level functions that were difficult to do | in SQL. | | But after some trial and error, I find it much faster to pull | relatively large, unprocessed datasets and do everything in | Pandas on the local client. Faster both in total analysis time | and in DB cycles. | | It seems like a couple of simple "select * from cars" and | "select * from drivers where age < 30" queries, with all the | joining, filtering, and summarizing done on my machine, is often | less burdensome on the db than doing it up front in SQL. | | Of course, this can change depending on the specific dataset, | how big it is, how you're indexed, and all that jazz. Just | wanted to mention how my initial intuition was misguided. | ziml77 wrote: | I've always been disappointed by the SQL pivot. It's hardly | useful for me if I have to know up front all of the columns | it's going to pivot out into. The solution would be to use | another SQL query to generate a dynamic SQL query, but at that | point I would rather just use Pandas.
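A minimal sketch of the two-step dynamic pivot ziml77 describes, reusing the emp/role/department example quoted earlier; sqlite3 stands in for the database, and interpolating values into SQL like this is only reasonable for trusted analytical data, never untrusted input:

        import sqlite3

        conn = sqlite3.connect("example.db")  # hypothetical database

        # Step 1: ask the data which pivot columns exist.
        departments = [row[0] for row in conn.execute(
            "SELECT DISTINCT department FROM emp ORDER BY department"
        )]

        # Step 2: generate one CASE expression per distinct value.
        cases = ", ".join(
            f"SUM(CASE department WHEN '{d}' THEN 1 ELSE 0 END) AS \"{d}\""
            for d in departments
        )
        pivot_sql = f"SELECT role, {cases} FROM emp GROUP BY role"
        rows = conn.execute(pivot_sql).fetchall()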
[deleted] | racl101 wrote: | I think I do a healthy mixture of both. | codeulike wrote: | I'm so glad to have been around long enough that SQL is now being | seen as exotic again. | thinkharderdev wrote: | The first NoSQL craze began not long after I got into software | engineering. I imagine that those who started not long after me | and spent a few years wrangling with MongoDB queries (shudder) | and trying to do joins, transactions and consistency guarantees | in application code must see RDBMS as some sort of magical new | technology. | dmitrykoval wrote: | Following similar observations, I wondered whether one could | actually execute SQL queries inside a Python process with | access to native Python functions and NumPy as UDFs. Thanks to | Apache Arrow, one can essentially combine the DataFrame API with | SQL within data analysis workflows, without the need to copy the | data or write operators in a mix of C++ and Python, all within | the confines of the same Python process. | | So I implemented Vinum, which allows executing queries that may | invoke NumPy or Python functions as UDFs available to the | interpreter. For example: "SELECT value, np.log(value) FROM t | WHERE ..". | | https://github.com/dmitrykoval/vinum | | Finally, DuckDB is making great progress integrating pandas | dataframes into its API, with UDF support coming soon. I would | certainly recommend giving it a shot for OLAP workflows. | justsomeuser wrote: | Also, I think SQLite lets you call Python functions from SQL. | dmitrykoval wrote: | That's correct, but SQLite would need to | serialize/deserialize the data sent to the Python function | (from C to Python and back), while Arrow allows getting a | "view" of the same data without making a copy. That is | probably not an issue in OLTP workloads, but may become more | visible in OLAP. | sbuttgereit wrote: | The code:

        WITH dt AS (
            SELECT unnest(array[1, 2]) AS n
        )
        SELECT * FROM dt;

| is more complex than necessary. This produces the same result:

        SELECT n FROM unnest(array[1, 2]) n;

        +---+
        | n |
        +---+
        | 1 |
        | 2 |
        +---+
        (2 rows)

| I think I see some other opportunities as well. | | I know the code is from a section dealing with CTEs, but CTEs | aren't needed in every situation, including things like using | VALUES lists. Most people in the target audience (who are | probably on a newer version of PostgreSQL) can ignore this next | point, but older versions of PostgreSQL would materialize the | CTE results prior to processing the main query, which is not | immediately obvious. | simonw wrote: | I think that was a deliberate choice by the author to | consistently demonstrate CTEs. | sbuttgereit wrote: | Sorry, I was editing to address this point, probably while you | were typing your response. It's well taken, but we should be | clear that it's not required. | liprais wrote: | It seems the only reason people on HN hate SQL is that they don't | understand SQL and/or have already bought into another toolset. | If you only have a hammer at hand, everything is a nail. ___________________________________________________________________ (page generated 2021-05-03 23:00 UTC)