[HN Gopher] Stochastic gradient descent written in SQL
       ___________________________________________________________________
        
       Stochastic gradient descent written in SQL
        
       Author : Lemaxoxo
       Score  : 230 points
       Date   : 2023-03-07 12:57 UTC (10 hours ago)
        
 (HTM) web link (maxhalford.github.io)
 (TXT) w3m dump (maxhalford.github.io)
        
       | AndrewKemendo wrote:
       | This is genius I love it.
       | 
        | On-stream inference with something like "continuous" data in
        | the same structure is, I think, the final material form for
        | "AI", so this is a great step towards that.
       | 
       | Thanks for the writeup
        
       | yza wrote:
       | Next step would be to implement autodiff. If only PostgreSQL had
       | CUDA support.
        
       | Rickasaurus wrote:
       | Please don't do this in prod
        
         | dventimi wrote:
          | Many ML models are trained offline, and some online training
          | is just fine-tuning a model that was pretrained offline.
          | Hence, I see no particular reason to worry about doing this
          | "in prod."
        
       | agnosticmantis wrote:
       | Just to disambiguate, here machine learning = linear regression.
       | So no auto-grad needed as gradients are explicit and simple.
       | Still interesting though.
        
       | saydaark wrote:
       | This is interesting, but is not using the GPU.
        
       | kilgnad wrote:
       | An even MORE useful idea is to do this in CSS.
        
       | hubja wrote:
        | Super interesting! Love the scrappy mindset ;)
        
         | Lemaxoxo wrote:
         | Cheers, I learnt with the best :)
        
       | simonw wrote:
       | I tried to replicate this in SQLite. The first few steps worked
       | OK, e.g.
       | 
       | https://lite.datasette.io/?json=https://gist.github.com/simo...
       | 
       | (I replaced "figures" with "raw" due to the way Datasette Lite
       | assigns a default table name to the imported JSON)
       | 
        | But the more complex recursive queries gave me this error and
        | I'm not sure how to work around it:
        | 
        |     recursive reference in a subquery: state
       | 
       | E.g.
       | https://lite.datasette.io/?json=https://gist.github.com/simo...
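        | 
        | I think the restriction is that SQLite only lets the recursive
        | table be referenced once, directly in the FROM clause. A
        | minimal repro of the shape it rejects (hypothetical, not the
        | article's exact query):
        | 
        |     WITH RECURSIVE state(step, w) AS (
        |         SELECT 0, 0.0
        |         UNION ALL
        |         SELECT step + 1, w * 0.9
        |         -- recursive reference inside a subquery: rejected
        |         FROM (SELECT * FROM state)
        |         WHERE step < 3
        |     )
        |     SELECT * FROM state;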
        
         | Lemaxoxo wrote:
         | That's too bad, I would have expected it to work out of the
         | box. Other than rewriting the query in a different way, I'm not
         | sure I see an easy workaround. Are you still working on this?
        
       | andrenotgiant wrote:
       | This is really interesting, but a basic part I don't understand:
       | What would it actually look like to run this on a live dataset?
       | 
       | If I understand correctly: you'd run the recursive query, it
       | produces results for every step, effectively showing you the
       | progression of output over time, and then once it hits "present
       | day", it completes and stops?
       | 
        | How would you generate results going forward? I.e. a minute
        | elapses after the results return; do you have to re-run the
        | whole query for all time?
        
         | [deleted]
        
         | Lemaxoxo wrote:
         | Good question. I touched upon this in the conclusion.
         | Basically, if you run this in a streaming SQL database, such as
         | Materialize, then you would get a true online system which
         | doesn't restart from scratch.
        
           | noloblo wrote:
           | Does this work with Postgres?
        
             | Lemaxoxo wrote:
             | Postgres has excellent support for WITH RECURSIVE, so I see
             | no reason why it wouldn't. However, as I answered
             | elsewhere, you would need to set some stateful stuff up if
             | you don't want the query to start from scratch when you re-
             | run it.
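              | 
              | The recursive shape is roughly this (a minimal one-weight
              | sketch, not the article's exact query, assuming a table
              | points(i, x, y) with consecutive row numbers):
              | 
              |     WITH RECURSIVE state(i, w) AS (
              |         SELECT 0, 0.0  -- initial weight
              |         UNION ALL
              |         SELECT s.i + 1,
              |                -- squared-loss gradient step, rate 0.01
              |                s.w - 0.01 * 2 * (s.w * p.x - p.y) * p.x
              |         FROM state s
              |         JOIN points p ON p.i = s.i + 1
              |     )
              |     SELECT i, w FROM state;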
        
           | sergius wrote:
           | Did you try running it using DuckDB?
        
             | Lemaxoxo wrote:
              | DuckDB is what I used in the blog post. Re-running this
              | query simply recomputes everything from the start. I
              | didn't store the intermediary state that would allow
              | starting off from where the query stopped. But it's
              | possible!
        
           | hobs wrote:
            | In a non-streaming DB, what would prevent you from storing
            | the result set and just using the last iteration to
            | calculate the next?
        
             | swasheck wrote:
             | i'm trying to learn here so please pardon my ignorance.
             | wouldn't a pre-aggregated set affect the new aggregate
             | result? i suppose you could store avg, sum, count and then
             | add the new value(s) to sum, new count to count, and
             | recalculate average from that. or even just avg and count
             | and then re-aggregate as ((avg*count)+new values)/(count +
             | new values count) but i didn't know if there's a better way
             | to process new values into a set of data that's already
             | been aggregated
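              | 
              | e.g. something like this (a sketch, assuming a stats
              | table holding the running sum and count):
              | 
              |     UPDATE stats
              |     SET sum_x = stats.sum_x + new.sum_x,
              |         n     = stats.n + new.n
              |     FROM (SELECT SUM(x) AS sum_x, COUNT(*) AS n
              |           FROM new_rows) AS new;
              |     -- the running average is then stats.sum_x / stats.n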
        
               | hobs wrote:
               | Yep that would be what I would do - effectively a
               | functional approach where no memory is required besides
               | the current set and iteration.
               | 
                | A big part of materializing datasets for performance
                | is finding the right grain that is both easy enough to
                | calculate and can also do nice things like be
                | resumable for snapshots.
        
             | Lemaxoxo wrote:
             | This is definitely a possibility. What I meant to say is
             | that the implementation in the blog post doesn't support
             | that.
        
               | hobs wrote:
                | Makes sense; just an insert instead of a select. I
                | used a similar approach with Newton-Raphson to
                | implement XIRR in SQL and it worked well.
        
       | episode0x01 wrote:
       | From a comment by the author:
       | 
       | > Also, I can't justify why, but my gut feeling is that the
       | database should be an order of magnitude faster than Python,
       | provided the implementation were to be improved.
       | 
       | Would be curious how that could end up being the case. Perhaps if
       | NumPy wasn't used at all? That would mean no vectorization or
       | optimized storage.
       | 
        | Would be interesting to see how it scaled with length and
        | dimensionality.
        
       | jackschultz wrote:
       | >A machine learning algorithm which can be trained using SQL
       | opens a world of possibilities. The model and the data live in
       | the same space. This is as simple as it gets in terms of
       | architecture. Basically, you only need a database which runs SQL.
       | 
       | First paragraph of the conclusion, and this very much fits with
       | the mindset that's been growing in me in the data world over the
       | past few years. Databases are much more powerful than we think,
       | they're not going to go away, only get better, and having the
       | data and the logic in the same space really removes tons of
        | headaches. ML models, transformation of data, and generating
        | JSON for an API can all be done within the database rather than
        | in an outside scripting language.
       | 
       | Are others seeing this? Are the current tradeoffs just that more
       | people know python vs sql or database specific languages to where
       | moving logic to postgres or snowflake is looked down on?
        
         | noloblo wrote:
          | +1, SQL is extremely elegant and composable, and is
          | underrated.
          | 
          | Postgres is very powerful. I took a short detour into NoSQL
          | MongoDB land, but I'm now back in MySQL/PostgreSQL territory
          | and glad for it.
          | 
          | Being able to generate views and stored procedures is useful
          | as well. Having SQL take over more work, like ML and gradient
          | descent, does open up good possibilities.
          | 
          | Also, since SQL is declarative, it's rather easier to work
          | with than imperative scripting languages.
        
           | naasking wrote:
           | SQL has some positives but it is not composable. At all. This
           | is because relations are not first-class values in SQL.
        
             | dventimi wrote:
             | Is a query not a relation?
        
               | naasking wrote:
               | Basically, but queries are not first class in SQL. You
               | can't assign a query to a variable, or pass it as a
               | parameter to a stored procedure, for example. This would
                | make SQL composable:
                | 
                |     declare @x = (select * from Person)
                |     select Name from @x where Birthdate < '2000-01-01'
        
               | noloblo wrote:
                | Exactly. Since it's declarative, the style lends
                | itself to using stored procedure calls to become
                | composable.
        
               | naasking wrote:
               | You cannot abstract over stored procedures either, so
               | that's still not composable.
        
               | dventimi wrote:
               | No idea what this means
        
               | naasking wrote:
               | Think of first-class functions. Can't do that with stored
               | procedures, just like you can't do that with queries or
               | tables in SQL.
        
               | alphanumeric0 wrote:
               | Furthermore, stored procedures/functions are not queries.
        
               | dventimi wrote:
               | Stored procedures can be relations. Queries are
               | relations. Ergo, stored procedures can be queries.
        
               | Lemaxoxo wrote:
               | Isn't that the point of common table expressions (CTEs)?
        
               | naasking wrote:
               | See: https://news.ycombinator.com/item?id=35058927
        
               | Foobar8568 wrote:
               | You can do composition through CTE or Table-Valued
               | Function.
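                | 
                | E.g. an inline TVF in T-SQL (a hypothetical sketch):
                | 
                |     CREATE FUNCTION dbo.PeopleBornBefore (@cutoff date)
                |     RETURNS TABLE AS RETURN
                |         SELECT Name FROM Person
                |         WHERE Birthdate < @cutoff;
                | 
                |     SELECT * FROM dbo.PeopleBornBefore('2000-01-01');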
        
               | naasking wrote:
               | CTE and TVF still treat tables as second class citizens,
               | so while they enable some forms of composition, they're
               | still very restricted. This has been the consistent story
               | with SQL, 15 ways to use queries and tables, all to work
               | around the limitation that they are not first class
               | values.
        
               | datan3rd wrote:
                |     with persons as (select * from Person)
                |     select Name from persons
                |     where Birthdate < '2000-01-01'
        
               | naasking wrote:
               | Where is the assignment to a variable? Where can you
               | construct a query using a variable in table/query
               | position? That's the whole point of being first class and
               | composable, a query becomes like any other value so you
               | should be able to parameterize any query by another query
               | assigned to a variable that may have been set inside an
               | if-statement, or accepted as a parameter to a stored
               | procedure. You know, the same kinds of composition we see
               | in ordinary programming languages.
        
               | chrstr wrote:
                |     create table x as (select * from person);
                |     select name from x where ...;
                | 
                | there you go, just configure your editor to display
                | "create table x" as "declare x = " ;)
                | 
                | or even a version with lazy evaluation:
                | 
                |     create view x as (select * from person);
                |     select name from x where ...;
        
               | naasking wrote:
               | You're still not getting it. First-class status means
               | that anywhere a value or variable can be used, a query or
               | table should also be able to appear, and vice versa. This
               | means a table or query can appear as a return type, a
               | parameter to a stored procedure or query, a variable, and
               | so on.
               | 
               | SQL just does not have this, it instead has 15 different
               | second class ways to handle tables and queries that try
               | to make up for the fact that they are not first-class
               | values. These include CTEs, table valued functions,
               | views, etc.
        
               | alphanumeric0 wrote:
               | What if I wrote a very long, complicated query that I'd
               | like to test against different tables (like test tables),
               | and let's say I can't use stored functions or procedures.
               | How could I pass different tables to my query?
        
               | chrstr wrote:
                |     CREATE TABLE data_a AS
                |         (SELECT 'a' AS test_case, 1 AS value);
                |     CREATE TABLE data_b AS
                |         (SELECT 'b' AS test_case, 2 AS value);
                |     CREATE VIEW data AS
                |         (SELECT * FROM data_a
                |          UNION ALL SELECT * FROM data_b);
                |     CREATE VIEW complicated_query AS
                |         (SELECT test_case, value+1 FROM data);
                | 
                |     SELECT * FROM complicated_query WHERE test_case = 'a';
                |     SELECT * FROM complicated_query WHERE test_case = 'b';
        
               | alphanumeric0 wrote:
                | Nice, that is what I was looking for. Of course, it'd
                | need to point to production data as well, so maybe
                | test_case is null, in that case:
                | 
                |     CREATE TABLE data_a AS
                |         (SELECT 'a' AS test_case, 1 AS value);
                |     CREATE TABLE data_b AS
                |         (SELECT 'b' AS test_case, 2 AS value);
                |     CREATE TABLE data_prod AS
                |         (SELECT NULL AS test_case, prod_table.value
                |          FROM prod_table);
                |     CREATE VIEW data AS
                |         (SELECT * FROM data_a
                |          UNION ALL SELECT * FROM data_b
                |          UNION ALL SELECT * FROM data_prod);
                |     CREATE VIEW complicated_query AS
                |         (SELECT test_case, value+1 FROM data);
                | 
                |     -- when testing
                |     SELECT * FROM complicated_query WHERE test_case = 'a';
                |     SELECT * FROM complicated_query WHERE test_case = 'b';
                | 
                |     -- when in 'production'
                |     SELECT * FROM complicated_query WHERE test_case IS NULL;
        
               | naasking wrote:
               | You just reinvented defunctionalization, which is a
               | transformation from a domain that has first-class values
               | to a domain where support is only second-class.
               | Defunctionalization is typically used in programming
               | languages to simulate first-class functions in languages
               | where functions are only second-class citizens, like C
               | and Pascal.
               | 
               | This perfectly illustrates my point. You had to manually
               | defunctionalize your data model and queries to support
               | what I'm saying should be inherently part of SQL.
        
               | dventimi wrote:
               | If you can't use stored procedures which are good for
               | this very case, many databases offer dynamic SQL. That
               | might work in some cases.
        
               | alphanumeric0 wrote:
               | That's a good point. This would rule out SQLite for me.
        
               | Foobar8568 wrote:
               | TVF ( Table-Valued Function) with Cross apply.
        
               | adeelk93 wrote:
               | Check out dbt - it's a great tool for organizing queries
               | and solving such patterns
        
             | simonw wrote:
             | CTEs go a long way to making SQL more composable.
        
           | noloblo wrote:
            | Postgres has read/write parallelism that can scale across
            | millions of reads and writes. If the ML model lives in the
            | Postgres DB, it is indeed very elegant. It reminds me of
            | the glory days of using tsearch2 to do text search in
            | Postgres, for searching our internal document repository
            | using nothing but Postgres, and boy was it faster than
            | Java-based search systems.
        
         | somat wrote:
          | A DBMS is really its own operating system; usually it is
          | hosted on another operating system, one that understands the
          | hardware.
          | 
          | I remember one place I worked where we had several old
          | graybeard programmers who considered the DBMS[1] the
          | operating system. As a unix sysadmin, I had some interesting
          | discussions with them, as I was always confused and confined
          | by the DBMS and they felt the same about unix.
          | 
          | 1. UniData, if curious: a weird multi-value (not relational)
          | database, very vertically integrated compared to most
          | databases today.
        
         | CreRecombinase wrote:
          | I spend most of my time in the parallel universe that is
          | scientific computing/HPC. In this alternate reality SQL (not
          | to mention databases) never really took off. Instead of
          | scalable, performant databases, we have only the parallel
          | filesystem. I'm convinced the reason contemporary scientific
          | computing doesn't involve much SQL is sociological/path-
          | dependency, but there are also very good technical reasons.
          | Optimizing software in scientific computing involves two
          | steps: 1) think hard about your problem until you can frame
          | it as one or more matrix multiplications; 2) plug that into
          | a numerical linear algebra library. The SQL abstraction (in
          | my experience) takes you very much in the opposite direction.
        
         | dchftcs wrote:
          | There is a recent trend in database research: ML in
          | databases. Not sure how much of an impact it can make though.
          | The sweet spot is doing relatively linear stuff, arguably
          | just a step up from analytical functions in queries, while
          | cutting-edge ML needs dedicated GPUs for the compute load and
          | often uses unstructured and even binary data.
        
         | Scubabear68 wrote:
         | With "traditional" RDBMS systems, putting a lot of code in the
         | DB lead to a lot of scaling issues where you'd need gigantic
         | machines with a lot of RAM and CPU: because the DB was doing so
         | much work. It was expensive and clunky to get HA right.
         | 
         | In more modern DBs being distributed horizontally, this
         | approach may see a rebound. The big "but" is still costs, in my
         | experience in AWS as an example, managed Postgres Aurora was
         | surprisingly expensive in terms of monthly cost.
        
           | dventimi wrote:
           | Why would running code in a database process be intrinsically
           | more expensive than running it in some other process?
        
             | yunwal wrote:
             | The issue is that many relational databases are not
             | horizontally scalable, so you want to be frugal with their
             | resources.
        
               | dventimi wrote:
                | The resources I'm familiar with are I/O, memory, and
                | CPU. The only one I believe can be spared in the
                | database by using that resource outside of the
                | database is CPU. When the database is far from
                | saturated on CPU, and latency and throughput are
                | determined by I/O and memory, using CPU on some other
                | machine that isn't the database can't possibly have
                | any impact on latency and throughput.
        
               | yunwal wrote:
                | > When the database is far from saturated on CPU
                | 
                | The issue here is that if you scale enough to saturate
                | the database, you'll have to rewrite essentially all
                | your code if you're a typical CRUD webapp. Basically
                | all of your business logic is about data retrieval.
                | There are probably some companies that can get away
                | with this, but it would be way too expensive for most.
        
               | dventimi wrote:
               | If I scale up to saturate the database CPU...by doing
               | data retrieval? Setting aside my skepticism about
               | saturating the CPU with mere data retrieval, how is that
               | solved by moving the data to another host's CPU, when
               | moving the data involves the very data retrieval that's
               | saturating the database's CPU?
        
             | riku_iki wrote:
              | you can easily scale your app layer linearly across
              | multiple machines, while that is harder with previous-
              | gen databases.
        
               | dventimi wrote:
               | What good is that going to do if the data always resides
               | in the database?
        
         | bob1029 wrote:
         | > Databases are much more powerful than we think
         | 
         | The older I get the more I agree with this.
         | 
         | There is nothing you cannot build by combining SQL primitives.
         | Side effects can even be introduced - on purpose - by way of
         | UDFs that talk to the outside world.
         | 
          | I've seen more than one system where the database itself was
          | directly responsible for things like rendering final HTML
          | for use by the end clients. You might think this is
          | horrible, and indeed many manifestations are. But there
          | lurks an opportunity for incredible elegance down this path,
          | assuming you have the patience to engage in it.
        
           | giraffe_lady wrote:
           | > I've seen more than one system where the database itself
           | was directly responsible for things like rendering final HTML
           | for use by the end clients.
           | 
           | I did this for a side project a few months ago and even used
           | postgrest to serve the page with correct headers for html. It
           | felt simultaneously really cursed and obvious. Shit you could
           | even use plv8 to run mustache or whatever in the db if you
           | really wanted to piss people off.
        
             | dventimi wrote:
             | I'm doing this right now with a DO droplet built with
             | PostgreSQL, postgrest, nginx, and not much else. Do you
             | have any tips, tricks, or blog posts you can share based on
             | your experience? You should post it to HN. Strike while the
             | iron's hot. With a little luck you'll hit the front page.
        
               | giraffe_lady wrote:
               | Nah nothing that refined. Pretty predictably supabase is
               | doing some weird stuff along these lines and I found some
               | abandoned and semi-active repos associated with them and
               | people working for them that were useful examples of some
               | things.
               | 
               | As for posting to HN absolutely no thanks. These people
               | are so fucking hostile there is no accomplishment too
               | small to tear apart for entertainment here. I have no
               | interest in it.
        
               | dventimi wrote:
               | > As for posting to HN absolutely no thanks. These people
               | are so fucking hostile there is no accomplishment too
               | small to tear apart for entertainment here. I have no
               | interest in it.
               | 
               | You make a really good point.
        
         | nerdponx wrote:
         | In the data warehouse / OLAP space, I think we are heading
         | towards a world where the underlying data storage consists of
         | partitioned Arrow arrays, and you can run arbitrary UDFs in
         | arbitrary languages on the (read-only) raw data with zero copy
         | and essentially no overhead other than the language runtime
         | itself and marshalling of the data that is emitted from the
         | UDF.
         | 
         | Something like the Snowflake data storage model + DuckDB as an
         | execution engine + a Pandas/Polars-like API.
         | 
         | There is no reason why we have to be stuck with "database ==
         | SQL" all the time. SQL is extremely powerful, but sometimes you
         | need a bit more, and in that case we shouldn't be so
         | constrained.
         | 
         | But in general yes, the world is gradually waking up to the
         | idea that performance matters, and that data locality can be
         | extremely important for performance, and that doing your
         | computations and data processing on your warehouse data in-
         | place is going to be a huge time and money saver in the longer
         | term.
        
         | roflyear wrote:
         | > Are the current tradeoffs just that more people know python
         | vs sql or database specific languages to where moving logic to
         | postgres or snowflake is looked down on?
         | 
          | Yes, mostly development, deployment, etc. concerns. I
          | haven't ever seen an org that versions their SQL queries
          | unless they are in a codebase. The environment is just
          | unfriendly towards that type of management, never mind
          | testing! These things have solutions, but we haven't matured
          | enough here, because that type of development has been
          | happening in application code.
          | 
          | Also, SQL is generally more complex than application logic,
          | because they are designed to do different things. What is a
          | simple exercise in iteration or recursion can easily become
          | something more of a headache.
          | 
          | These are problems that can be resolved, but they are
          | problems.
        
         | _a_a_a_ wrote:
          | Self-proclaimed database expert here. What a database is
          | good at depends on what you're trying to get that database
          | to do, at least in part.
          | 
          | Break it into pieces: elegance and efficiency. These
          | correspond to a logical statement of what you're trying to
          | do, and to how quickly the database will actually do it in
          | practice.
          | 
          | SQL can do some nice things in areas, making it elegant in
          | those areas. Elsewhere it can be pretty wordy and ugly.
          | 
          | Efficiency comes down largely to how the database is
          | implemented, including the capability of the optimiser. Both
          | of these are out of your control. In my experience, trying
          | to turn a database into a number cruncher is just not going
          | to work.
          | 
          | I guess that's a long way round of me saying that I don't
          | think I agree with you!
        
         | maCDzP wrote:
          | I agree. I took a course in databases and SQL and was blown
          | away by its power. With CTEs and PL/SQL you can do a lot of
          | stuff inside the database.
          | 
          | I played with SQLite and its JSON columns. Once you get the
          | hang of the syntax for walking a JSON structure you can do
          | all sorts of neat things. Doing the same thing in Python
          | would be tedious.
          | 
          | And I also believe it ended up being way faster than what I
          | did in Python.
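          | 
          | For instance (a small sketch, assuming a docs table with a
          | JSON body column):
          | 
          |     SELECT j.key, json_extract(j.value, '$.name') AS name
          |     FROM docs, json_each(docs.body, '$.items') AS j
          |     WHERE json_extract(j.value, '$.price') > 10;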
        
         | ISL wrote:
         | The challenge I, with limited knowledge, see with developing
         | detailed algorithms in SQL is a lack of good testing,
         | abstraction, and review tooling. Similarly for a lot of the
         | user-defined-functions for the larger data warehouses
         | (redshift, bigQuery, etc.)
         | 
         | dbt solves a lot of it, but I'd love to learn more about good
         | resources for building reliable and readily-readable SQL
         | algorithms for complex logic.
        
           | AndrewKemendo wrote:
           | I mean this is an opportunity right? Build those things
        
             | Lemaxoxo wrote:
             | I agree. Databases are going to be here for a long time,
             | and we're barely scratching the surface of making people
             | productive with them. dbt is just the beginning.
        
           | adeelk93 wrote:
           | I've supported 3 different models over the years with
           | inference implemented in SQL. First one I inherited, loved it
           | so much that I implemented it twice again. Amazingly fast for
           | TBs of data and no waiting on other tech teams.
           | 
           | That tooling you're describing is definitely not there.
           | Bigquery has BQML but it's very much in its infancy. I tend
           | to do all the modeling in Python/R on sampled data and then
           | deploy to SQL.
        
             | fithisux wrote:
             | BQML should become standard.
        
         | actionfromafar wrote:
         | I have yet to see a decent IDE or system which allows great
         | version control, unit testing and collaboration with SQL source
         | code.
         | 
         | So I think a lot of the reluctance is from practical concerns.
        
           | noloblo wrote:
           | We store individual sql files in github and keep them in
           | separate folders
           | 
           | This is very simple and scales well for our purposes
        
         | jhd3 wrote:
         | > Databases are much more powerful than we think
         | 
         | and data has mass. One example of bringing the work to the data
         | is https://madlib.apache.org/ (works on Postgres and Greenplum)
         | 
         | [Disclaimer - former employee of Pivotal]
        
         | tomrod wrote:
         | Absolutely.
         | 
         | Consider too that PostgreSQL databases support different
         | languages, like Python.
         | 
         | Loads of for-profit companies have tried to cash in on this.
         | SAP HANA is one of the ones I've had recent experience with. It
         | is unfortunately a poor implementation. The right architecture
         | tends to be: put your model behind an API interface, not
         | internal on the system. Train your model separately from
         | production systems, and so on.
         | 
         | You might also be interested in checking out MLOps platforms
         | like Kubeflow, Flyte, and others.
        
         | 7thaccount wrote:
         | Database first designs make a lot of sense in a lot of ways.
         | I've worked for a company with an Oracle database that has SQL
         | scripts that do all the validation and create text files for
         | downstream usage. I think it makes more sense than a ton of
         | Java, but there are pros and cons. One is that SQL is
         | relational and the advanced stuff can be extra hard to
         | troubleshoot if you don't have enough experience. Even those
         | that can't code can usually understand a for loop and can think
         | imperatively.
         | 
          | Unfortunately it's an expensive commercial product, or I'd
          | recommend you look at kdb+ if you work with time series
          | data. The big banks use it and essentially put all their
          | latest RT data into kdb+, and then can write extremely
          | succinct queries with a SQL-like syntax, but with the
          | ability to approach it far more programmatically than is
          | typically doable with something like PL/SQL. You can even
          | write your neural network or whatever in less than a page of
          | code, as the language of kdb+ is extremely powerful,
          | although also basically incomprehensible until someone puts
          | some time into learning it. It's extremely lightweight
          | though, so very easy to deal with in an interactive fashion.
         | 
         | All that to say I agree with you that it's nice to just have
         | everything you want all in one spot rather than to deal with 4
         | different tools and pipelines and shared drives and so on.
        
         | daveguy wrote:
          | I would like for this to be the case. I was DBA of a
          | Postgres database for a LIMS almost 2 decades ago. Back then
          | you could code functions for the database to execute on
          | data, and it was very powerful, but also very clunky. The
          | tools to do software development in the database were not
          | mature at all. A lot has changed in the past 20 years and
          | SQL has evolved. Do you think SQL will expand that much, or
          | will there be APIs built into the database? Near-data
          | functions are powerful and useful, but I would want my
          | development environment to be more like version-controlled
          | code than "built-in".
          | 
          | I wonder if near-data functions on small databases are the
          | solution to the limit of statelessness that you have with
          | functions as a service.
        
         | nonethewiser wrote:
          | > Databases are much more powerful than we think
          | 
          | And what people think is a function of attitudes towards
          | working at the DB level. I see this often with ORMs in the
          | web dev sphere (rather than Data Science). Yes, ORMs are
          | great, but many people rely on them to completely abstract
          | away the database and are terrified by raw SQL or even query
          | building. You also see it with services that abstract away
          | the backend, like PocketBase, Firebase, etc. Writing a
          | window function or even a sub-select looks like black magic
          | to many.
          | 
          | I say this after several experiences with codebases where
          | joins and filtering were often done at the application layer
          | and raw SQL was seen as the devil.
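          | 
          | For instance, a running 7-row average per ticker (a sketch
          | over a hypothetical prices table) is a few lines of SQL but
          | a chore at the application layer:
          | 
          |     SELECT ticker, date,
          |            AVG(close) OVER (PARTITION BY ticker
          |                             ORDER BY date
          |                             ROWS BETWEEN 6 PRECEDING
          |                             AND CURRENT ROW) AS ma7
          |     FROM prices;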
        
           | boringg wrote:
            | Opposite here: don't like ORMs. Too much overhead, though
            | I get their value.
        
         | FpUser wrote:
         | It is tempting to combine web server, database and some
         | imperative language with built in data oriented / SQL features
         | in a single executable and call it an application server that
         | would communicate with the outside world using for example JSON
         | based RPC. I think there were / are some products in the area
         | even with the built in IDE (like Wakanda).
        
         | spprashant wrote:
          | The problem with throwing everything into a database is that
          | you end up with brittle stored procedures all over the
          | place, which are painful to debug. There is no good support
          | for version control or testing, which means you end up
          | creating a dozen copies of each function named sp_v1,
          | sp_v2, ..., etc. It is much harder to practice the iterative
          | development that the rest of software engineering seems to
          | implement effectively.
          | 
          | Also, traditional relational databases have a way to go
          | before they can support parallelized machine learning
          | workloads. You do not have control or the ability to spin up
          | threads or processes to boost your performance. You rely on
          | the query processor to make those decisions, and depending
          | on your platform the results will be mixed.
        
         | 62951413 wrote:
          | Junior developers like me were uncomfortable with SQL twenty
          | years ago. Java ORM frameworks became popular because of the
          | object-relational impedance mismatch. I see the same kind of
          | sentiment nowadays among newer generations, but in Python&Co.
          | 
          | The success of the Apache Spark engine can at least partially
          | be attributed to:
          | 
          | * being able to have the same expressive power as SQL but
          | with a real Scala API (including having reusable libraries
          | based on it)
          | 
          | * being able to embed it into unit tests at the low price of
          | an additional ~20 seconds of latency to spin up a local
          | Spark master
        
         | justsomeuser wrote:
         | I think general programming languages are better for general
         | programs than SQL.
         | 
         | Specifically they have: Type systems, compilers, debuggers,
         | text editors, package managers, C FFI etc.
         | 
         | But I agree that having the data and the program in the same
         | process has benefits.
         | 
         | Writing programs in SQL is one way.
         | 
         | Another way is to move your data to your general program with
         | SQLite.
         | 
         | I like using SQL for ACID, and queries as a first filter to get
         | the data into my program where I may do further processing with
         | the general language.
        
           | Serow225 wrote:
           | Another is MS SQL Server, which lets you run .NET on the
           | database server :D "you can author stored procedures,
           | triggers, user-defined functions, user-defined types, and
           | user-defined aggregates in managed code"
        
             | malnourish wrote:
             | I have had nothing but bad experiences trying to run .NET
             | in SSIS packages -- is there another way?
        
               | Serow225 wrote:
               | I've never had the pleasure(ha) of using SSIS, but this
               | is the stuff that I was talking about:
               | https://learn.microsoft.com/en-
               | us/dotnet/framework/data/adon...
        
           | fifilura wrote:
           | > Type systems
           | 
           | SQL has types
           | 
           | > compilers
           | 
           | For what specifically do you need a compiler?
           | 
           | > debuggers
           | 
           | Some tasks - like the concurrency SQL enables - are just very
           | difficult to debug with debuggers. It would be the same with
           | any other language. What SQL does here though is to allow you
           | to focus on the logic, not the actual concurrency,
           | 
           | > text editors, package managers
           | 
           | I feel like these two are just for filling up the space.
           | 
           | > C FFI
           | 
           | Many SQL engines have UDFs
        
         | swyx wrote:
         | had a very good chat with https://postgresml.org/ last week
         | which is focusing on bringing ML to postgres:
         | https://youtu.be/j8hE8-jZJGU
        
           | Lemaxoxo wrote:
           | I'm watching it, it's really good. Montana makes a great
           | point: you can move data to the models, or move the models to
           | the data. Data is typically larger than models, so it makes
           | sense to go with the latter.
        
         | fzeindl wrote:
         | Fully agree.
         | 
          | And using PostgREST [0] you can serve your PostgreSQL
          | database as a REST API. And if you throw foreign data
          | wrappers / Multicorn into the mix, you can map any other
          | data source into your PostgreSQL DB as a table.
         | 
         | [0] https://postgrest.org/en/stable/
        
           | JUNGLEISMASSIVE wrote:
           | [dead]
        
       | [deleted]
        
       | airstrike wrote:
        | This is great. The only thing I dislike about this is using
        | these variables to try to predict Adj Close when they are not
        | at all correlated.
       | 
       | There are countless meaningful correlations in financial data
       | that would have been just as easy to play around with. One truly
       | valuable example would be to look at trading multiples of
       | comparable companies. Sticking to P/E would be easier as P is
       | easily observable and forward-looking EPS estimates are generally
       | always available. This would limit the exercise to more mature
       | companies than the ones commonly discussed on HN but would make
       | it actually meaningful
        
         | Lemaxoxo wrote:
         | Do you have some data/resources on this? I'm a total snowflake
         | at this, but I'm willing to learn.
        
           | airstrike wrote:
            | _In hindsight this is harder than it seems if you don't
            | already have access to data, so I'll try to be as
            | informative as possible in my response_
           | 
           | It's hard to find this information out there, so here's ~all
           | you need to know.
           | 
           | Data is usually behind paywalls, unfortunately. Industry
           | standards are Bloomberg terminal (ridiculously expensive, 5
           | digits $), FactSet (very expensive, 4 digits), Capital IQ
           | (expensive, not sure)... but there are a number of up-and-
           | coming startups trying to disrupt the space so you may be
           | able to grab data from them. I think https://atom.finance has
           | a 7-day free trial you could use to play around with.
           | 
           | P/E simply means the company's _P_rice per share divided by
           | _E_arnings per share. Cancel out the "per share" terms and
           | you get total market capitalization (which is the value of
           | the total equity) divided by net income (since "earnings per
           | share" really means "net income per share")
           | 
           | So the "P" is easy to get. It's your Adj Close.
           | 
           | The "E" is trickier as it can mean a lot of things. Diluted
           | EPS from financial statements? Last year's EPS? Management's
           | guidance for EPS? None of those are actually correct even if
           | they are all "EPS"
           | 
           | Importantly--and contrary to 99% of the info you will find
           | online--the most relevant EPS number are _forward_ estimates
           | of EPS, usually for the next twelve months ( "NTM"). That is
           | based on an average or median of analyst estimates which is
           | called "consensus". These are analysts from financial
           | institutions who build their own little models based on their
           | own views of where the business is going to go, informed by
           | recent earnings, management's color in earnings calls and
           | filings, etc.
           | 
           | Believe it or not, as hairy as that sounds, EPS is fairly
           | easy to get as it's a metric that has less room for
           | interpretation than, say, EBITDA.
           | 
           | So you're not going to go out there, read all these (paid)
           | analyst reports, find their EPS, calculate the median, etc.
           | Bloomberg, Capital IQ, FactSet do this for you and it's
           | easily observable for the end user (that's their business).
           | 
           | The thing is, as you may have guessed, "next twelve months"
           | are a moving target across time. Analysts usually provide
           | estimates for the current fiscal year (i.e. FY 2023, ending
           | 12/31/2023 for most companies) and the following year, ending
           | 12/31/2024. Let's call these FY0_EPS and FY1_EPS, for
           | simplicity
           | 
           | You might be tempted to just take a moving average of these
           | two estimates, so that on 1/1/2023 it is 100% of FY0_EPS + 0%
           | of FY1_EPS, on 1/2/2023 it is 99.9% + 0.1% and gradually
           | "move forward in time" as the days pass. That sort of works
           | (and definitely checks the box for a proof-of-concept like in
           | your post) but for the sake of completeness, I'll just say
           | that the right-er approach is to only "move forward in time"
           | when new earnings are released. So it doesn't matter if we're
           | in 1/1/2023 or 2/1/2023--what matter is what is the latest
           | reported quarter. Take Coca-Cola for instance
           | (https://www.bamsec.com/companies/21344/coca-cola-co). Let's
           | roll the tape backward one year. They reported FY 2021
           | earnings on 2/22/2022, at which point analysts published new
            | estimates in revised models, so from that day forward
            | until the next quarterly earnings we take 100% FY0_EPS +
            | 0% FY1_EPS, where these correspond to estimates for FY
            | 2022 and FY 2023, respectively.
           | 
           | On 4/1/2022, Coca-Cola reported Q1 2022 results, analysts
           | published new estimates, and we now take 75% FY0_EPS + 25%
           | FY1_EPS. On 7/1/2022, we move forward another quarter so 50%
           | + 50%, then 25% + 75% starting on 10/26 and then back to
           | square one with 100% + 0% except FY0_EPS now means FY 2023 vs
           | FY 2022 previously, and FY1_EPS means FY 2024
           | 
            | So your table is something like (I'm making up numbers):
            | 
            |     +------------+--------+-----------+---------+---------+---------+-------------+-------------+
            |     | Date       | Ticker | Adj_Close | FY0_EPS | FY1_EPS | Period  | NTM0_Weight | NTM1_Weight |
            |     +------------+--------+-----------+---------+---------+---------+-------------+-------------+
            |     | 01/01/2022 |     KO |     90.10 |   20.00 |   24.00 | Q1 2022 |        1.00 |        0.00 |
            |     | 01/02/2022 |     KO |     91.14 |   20.00 |   24.00 | Q1 2022 |        1.00 |        0.00 |
            |     | 01/03/2022 |     KO |     89.30 |   20.00 |   24.00 | Q1 2022 |        1.00 |        0.00 |
            |     | 01/04/2022 |     KO |     91.09 |   20.00 |   24.00 | Q1 2022 |        1.00 |        0.00 |
            |     | 01/05/2022 |     KO |     92.01 |   20.00 |   24.00 | Q1 2022 |        1.00 |        0.00 |
            |     | 01/06/2022 |     KO |     89.05 |   20.00 |   24.00 | Q1 2022 |        1.00 |        0.00 |
            |     | ...        |    ... |       ... |     ... |     ... | ...     |         ... |         ... |
            |     | 07/02/2022 |     KO |     89.05 |   19.50 |   23.20 | Q2 2022 |        0.75 |        0.25 |
            |     +------------+--------+-----------+---------+---------+---------+-------------+-------------+
           | 
           | With that you can take NTM0_Weight and NTM1_Weight to
           | calculate NTM_EPS by multiplying those weights by FY0_EPS and
           | FY1_EPS. And then can take AdjClose / NTM_EPS to calculate
           | P/E
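            | 
            | In SQL terms (a sketch, assuming the table above is called
            | quotes):
            | 
            |     SELECT Date, Ticker,
            |            NTM0_Weight * FY0_EPS
            |                + NTM1_Weight * FY1_EPS AS NTM_EPS,
            |            Adj_Close
            |                / (NTM0_Weight * FY0_EPS
            |                   + NTM1_Weight * FY1_EPS) AS PE_NTM
            |     FROM quotes;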
           | 
           | Why is this useful? Because in theory you can take the
           | average P/E of companies X, Y and Z in one industry and
           | compare it to a fourth company W. Is W's P/E multiple above
           | or below the industry average? You now know if they are over
           | or undervalued, respectively, which means you know if you
           | should buy or sell that stock (if you believe you picked the
           | right "comparable" companies in that industry)
           | 
           | This is just one example... there are all sorts of similar
           | analyses done daily in the financial services industry. I'm
           | not saying it's easy to extract alpha from trading on these,
           | but that's the framework
        
             | Lemaxoxo wrote:
             | Thank you so much for this! It's very generous of you to
             | have taken the time.
        
               | airstrike wrote:
               | My pleasure! I've spent the better part of the last
               | decade doing this stuff and I appreciate how hard it is
               | to find resources on it, so thought I'd share since you
               | mentioned you were interested in learning
               | 
               | At the risk of exhausting said interest, here's one real
               | life example of a similar analysis: https://wsp-blog-
               | images.s3.amazonaws.com/uploads/2011/09/171...
               | 
               | See PDF page 14. Note the lines called "Composite P / NTM
               | EPS" which they built as a blend of American Eagle's,
               | Tilly's and Zumiez's P/E multiple, which are companies X,
               | Y and Z in my comment above (for some reason they gave AE
               | double the weight which is unusual) and compared it to
               | Heat's P/E multiple (Heat was the codename for retailer
               | Rue21, or hypothetical company W in my example above)
               | 
               | I got this deck from
               | https://www.wallstreetprep.com/knowledge/investment-
               | banking-... which has a few other examples and a variety
               | of other resources on the topic FYI
        
       | college_physics wrote:
       | Is there a risk that postgres becomes sentient in our lifetime?
        
       | zackmorris wrote:
       | This is great! Moving away from the proprietary nature of GPUs
       | and complex math gatekeeping should help democratize AI.
       | 
       | Has anyone converted stuff like gradient descent to set theory?
       | 
       | https://support.unicomsi.com/manuals/soliddb/7/SQL_Guide/2_G...
       | 
       | https://www.sqlshack.com/mathematics-sql-server-fast-introdu...
       | 
       | https://www.sqlshack.com/learn-sql-set-theory/
       | 
       | Right now AI algorithms kind of look imperative and stateful to
       | me, like state machines. But there should be a functional or
       | relational representation, where pattern matching would be used
       | to derive the current state instead.
       | 
       | It's trivial to go from functional to imperative representation,
       | but often nearly impossible to go the other way. That's why
       | monadic (sorry if I'm mincing terms) logic is so troublesome.
       | Mutable variables, futures/promises, async, etc can't be
       | statically analyzed, which is why most code today is difficult to
       | parallelize and stuff like C++ optimizers don't even increase
       | speed to the next order of magnitude. But functional languages
       | have nearly infinite optimization potential through divide and
       | conquer approaches like sharding and scatter-gather arrays that
       | can run internally without side effects. In other words, we can
       | throw hardware at SQL for linear speedup since it's
       | embarrassingly parallel, but might have limited success
       | optimizing something like Python.
        
         | dogcomplex wrote:
         | Replying to this to remind myself to research later, as this is
         | a very important question
        
       | krick wrote:
       | From the start I assumed this is a nice playful "Hexing the
       | technical interview" kinda joke. But given the tone of the
       | article, and some of the comments here... Uh, this cannot be
       | serious, right?
        
         | guhidalg wrote:
         | Why couldn't it be serious? Your database already has your data
         | so you don't have to stream it to another machine for training.
         | If you have a developed model that needs retraining on new
         | data, save yourself the network bandwidth (and time!) and just
         | make the DB retrain. This doesn't replace the exploration phase
         | of model-building but if you already know gradient descent
         | works for you then what's wrong with this approach?
        
         | sgu999 wrote:
         | I'm very confused as well. Are we facing a wall of GPT-
         | generated comments?
        
       | Terretta wrote:
       | Title here is wrong. Title in article and headings in article are
       | right: ONLINE gradient descent
       | 
       | It's specifically _not_ stochastic. From the article:
       | 
       |  _Online gradient descent_
       | 
       |  _Finally, we have enough experience to implement online gradient
       | descent. To keep things simple, we will use a very vanilla
       | version:_
       | 
       |  _- Constant learning rate, as opposed to a schedule._
       | 
       |  _- Single epoch, we only do one pass on the data._
       | 
        |  _- Not stochastic: the rows are not shuffled._
       | 
       |  _- Squared loss, which is the standard loss for regression._
       | 
       |  _- No gradient clipping._
       | 
       |  _- No weight regularisation._
       | 
       |  _- No intercept term._
        
         | [deleted]
        
         | Lemaxoxo wrote:
         | Hehe I was wondering if someone would catch that. Rest assured,
         | I know the difference between online and stochastic gradient
         | descent. I admit I used stochastic on Hacker News because I
         | thought it would generate more engagement.
        
           | airstrike wrote:
           | Then just call it Non-stochastic Gradient Descent? You can't
           | editorialize titles per HN guidelines
           | 
           | https://news.ycombinator.com/newsguidelines.html
        
             | Lemaxoxo wrote:
             | Thanks, I wasn't aware.
        
               | airstrike wrote:
               | My pleasure. You can still edit the title, by the way ;-)
        
       | rlewkov wrote:
       | Uhhhh, ya know that Python has a library for that :-)
        
       | glomgril wrote:
       | Incredible post. I laughed when I saw the title, snickered at the
       | first paragraph, and then proceeded to be blown away by the rest
       | of it. Thought I was in for a joke and instead I'm thinking about
       | the nature of ML Ops and what it's become.
        
       | civilized wrote:
        | In the comments here so far, we see a pattern we've seen
        | before. When someone suggests doing something in SQL, there's
        | a lot of concern about SQL being a very limited programming
        | language where it's hard to do proper engineering.
        | 
        | Here's what I would really love to know: why is it that SQL
        | is, to first order, the only language used to interact with
        | databases, and that SQL has about the same features as it did
        | in the 70s? It seems analogous to the general-purpose
        | programming world having stopped with C.
        
         | bob1029 wrote:
         | > why is it that SQL is, to first order, the only language used
         | to interact with databases, and SQL has about the same features
         | as it did in the 70s?
         | 
         | Because SQL is effectively a domain-specific language. If you
         | added 100+ additional keywords/functions/etc., do you think it
         | would be easier or more difficult for the average developer to
         | build something clean with it?
         | 
         | I look at SQL like a pile of bricks. You don't want
         | complicated, unicorn bricks or you won't be able to fit
         | anything meaningful together. Experienced masons almost
         | certainly prefer their materials to be as dumb and consistent
         | as possible.
        
         | dventimi wrote:
         | I believe the PC revolution gave a few generations of
         | programmers the mistaken belief that imperative languages with
         | familiar features (Basic, Pascal, C, Perl, Python, Java, etc.)
         | are the only or the best way to interact with computers. Forms
         | of interaction developed earlier (Forth, Prolog, SQL, etc.)
         | were learned later, grudgingly, if at all. SQL was viewed with
          | disdain, a necessary evil to be held at arm's length. That
          | stalled the evolution of SQL, the success of other query
          | languages, and really anything that didn't fit the Algol-
          | style "general purpose programming language" blinders.
          | 
          | Mercifully, finally, the cold war against SQL in particular
          | and against relational databases in general seems to be
          | thawing. Articles like this one exemplify that trend.
        
         | maxfurman wrote:
         | Plenty of SQL features have been added since the 70s, notably
         | window functions (which TFA relies on heavily). Most of the
         | major databases are from the 80s, and even new kid on the block
         | MySQL has been around since the mid 90s.
         | 
         | SQL draws a very hard line between the expression of the query
         | and the AST that is used in the actual implementation. Database
         | vendors like this aspect because they are free to implement
         | whichever optimizations they want, but application developers
         | want to build queries programmatically and optimize them
         | themselves before passing them to the db engine, hence the
         | tension in threads like these.
        
       | snookerdooker wrote:
       | I challenge you to add more emojis to your bio ;)
        
         | Lemaxoxo wrote:
         | Yes I know some people look down on that. I hope it doesn't
         | take away the merits of the article for you hehe.
        
       | Foobar8568 wrote:
        | R has been natively supported in SQL Server for a long while
        | (at least since 2017), same for Java and Python, e.g. writing
        | stored procedures in these languages.
        
       | kunalgupta wrote:
       | god tier
        
       | yttribium wrote:
       | At least one F500 company had iteratively reweighted least
       | squares implemented in MS SQL as of ~20y ago. When all you have
       | is a hammer...
        
       | gigatexal wrote:
       | I thought I was decent at SQL until I saw this. Nice article!
        
       | PaulHoule wrote:
       | Reminds me of the time I coded up an optimized nearest neighbor
       | search in SQL.
        
       | syats wrote:
       | Just don't.
       | 
       | SQL:
       | 
       | - does not allow for easy and clean importing of
       | modules/libraries
       | 
        | - is not easy to write tests for
       | 
       | - has limited support for a debugger
       | 
        | - lacks a consistent style for such large queries (plus most
        | textbooks cover fairly simple stuff), which means it's hard
        | for a developer to start reading someone else's code (more so
        | than in other languages)
       | 
       | - clearly indicates in its name that it is a Query language.
       | 
        | Save yourself the trouble, and all your collaborators the
        | pain, of working with this code in the future, of trying to
        | add new features, of trying to reuse it in another project.
       | 
       | If you want to operate near the data, use PL/Python for
       | PostgreSQL.
       | 
       | EDIT: Fixed formatting.
        
         | dventimi wrote:
         | -PostgreSQL extensions are easy to include and use.
         | 
         | -pgTAP exists for testing.
         | 
          | -A large query in SQL is not made smaller by translating it
          | into an ORM DSL.
         | 
         | -If "Query" in "SQL" means it's for querying data, then
         | evidently "Query" not being in say Java or Python means those
         | languages are NOT meant for querying data. If that's true, then
         | why would you use them for querying data?
        
           | nfw2 wrote:
           | > If "Query" in "SQL" means it's for querying data, then
           | evidently "Query" not being in say Java or Python means those
           | languages are NOT meant for querying data
           | 
           | If X then Y does not imply if not X then not Y. Java and
           | Python do not indicate a purpose in their name because they
           | are general-purpose.
        
             | dventimi wrote:
             | Are they meant for querying data?
        
           | syats wrote:
            | Re modules/libraries: I meant it is not easy to write a
            | piece of SQL code and then import it into several queries
            | to reuse it, or lend it to someone else for use on their
            | own schema. It is possible, yes, but seldom done, because
            | it is hell. PostgreSQL extensions could be used for this
            | purpose, but developing an extension requires a different
            | set of SQL statements (or, luckily, Python or C) than
            | those used by the user of the extension, which makes
            | composing them a bit hard. Not impossible, just hard to
            | maintain.
            | 
            | About your last point, I don't think that was my line of
            | reasoning, but, yes, for the love of what is precious,
            | don't open SQL files as Python/Java file objects and then
            | parse and rummage through them to find the data you are
            | looking for. Not impossible, just hard to maintain.
            | 
            | Thanks for pointing out pgTAP, didn't know about it.
            | 
            | For some reason, data-science folks haven't yet caught up
            | with ORMs. I don't know if this is good or bad, but (as
            | the OP shows) they are more used to rows and columns (or
            | graphs) than objects. Maybe that will change one day.
        
             | dventimi wrote:
             | > maybe that will change one day
             | 
             | I pray that it never does.
             | 
             | https://blog.codinghorror.com/object-relational-mapping-
             | is-t...
        
             | dventimi wrote:
             | As for sharing SQL, that's easy to do within a database
             | using views. Across databases with possibly different data
             | models, that's not something I personally ever want to do.
        
       | Narew wrote:
        | There was also a neural network module for Redis, if you want
        | to do training and inference in a Redis DB (quite old btw):
        | https://github.com/antirez/neural-redis
        
       | pradeepchhetri wrote:
       | You might want to consider checking out ClickHouse which supports
       | many ML functions natively:
       | 
       | - stochasticLinearRegression (https://clickhouse.com/docs/en/sql-
       | reference/aggregate-funct...)
       | 
       | - stochasticLogisticRegression
       | (https://clickhouse.com/docs/en/sql-reference/aggregate-funct...)
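        | 
        | The usage pattern from the linked docs is roughly this (a
        | sketch, assuming a train_data(y, x1, x2) table; the parameters
        | are learning rate, L2 regularisation, mini-batch size, and
        | update method):
        | 
        |     CREATE TABLE model ENGINE = Memory AS
        |     SELECT stochasticLinearRegressionState(
        |                0.1, 0.0, 5, 'SGD')(y, x1, x2) AS state
        |     FROM train_data;
        | 
        |     WITH (SELECT state FROM model) AS m
        |     SELECT evalMLMethod(m, x1, x2) AS prediction
        |     FROM test_data;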
        
         | Lemaxoxo wrote:
          | Thanks for the links, I wasn't aware of them. The Russians
          | often seem to be a step ahead in the ML world.
        
       ___________________________________________________________________
       (page generated 2023-03-07 23:00 UTC)