[HN Gopher] Stochastic gradient descent written in SQL ___________________________________________________________________ Stochastic gradient descent written in SQL Author : Lemaxoxo Score : 230 points Date : 2023-03-07 12:57 UTC (10 hours ago) (HTM) web link (maxhalford.github.io) (TXT) w3m dump (maxhalford.github.io) | AndrewKemendo wrote: | This is genius I love it. | | On stream inference with something like "continuous" data in the | same structure I think is the final material form for "AI" so | this is a great step towards that | | Thanks for the writeup | yza wrote: | Next step would be to implement autodiff. If only PostgreSQL had | CUDA support. | Rickasaurus wrote: | Please don't do this in prod | dventimi wrote: | Many ML models experience offline training, and some online | training is fine-tuning a model that was pretrained offline. | Hence, I see no particular reason to worry about doing this "in | prod." | agnosticmantis wrote: | Just to disambiguate, here machine learning = linear regression. | So no auto-grad needed as gradients are explicit and simple. | Still interesting though. | saydaark wrote: | This is interesting, but is not using the GPU. | kilgnad wrote: | An even MORE useful idea is to do this in CSS. | hubja wrote: | Super interesting! Love the scrappy mindset ;) | Lemaxoxo wrote: | Cheers, I learnt with the best :) | simonw wrote: | I tried to replicate this in SQLite. The first few steps worked | OK, e.g. | | https://lite.datasette.io/?json=https://gist.github.com/simo... | | (I replaced "figures" with "raw" due to the way Datasette Lite | assigns a default table name to the imported JSON) | | But the more complex recursive queries gave me this error and I'm | not sure how to work around it: recursive | reference in a subquery: state | | E.g. | https://lite.datasette.io/?json=https://gist.github.com/simo... | Lemaxoxo wrote: | That's too bad, I would have expected it to work out of the | box. 
Other than rewriting the query in a different way, I'm not | sure I see an easy workaround. Are you still working on this? | andrenotgiant wrote: | This is really interesting, but a basic part I don't understand: | What would it actually look like to run this on a live dataset? | | If I understand correctly: you'd run the recursive query, it | produces results for every step, effectively showing you the | progression of output over time, and then once it hits "present | day", it completes and stops? | | How would you generate results going forward? I.E. A minute | elapses after the results return, do you have to re-run the whole | query for all time? | [deleted] | Lemaxoxo wrote: | Good question. I touched upon this in the conclusion. | Basically, if you run this in a streaming SQL database, such as | Materialize, then you would get a true online system which | doesn't restart from scratch. | noloblo wrote: | Does this work with Postgres? | Lemaxoxo wrote: | Postgres has excellent support for WITH RECURSIVE, so I see | no reason why it wouldn't. However, as I answered | elsewhere, you would need to set some stateful stuff up if | you don't want the query to start from scratch when you | re-run it. | sergius wrote: | Did you try running it using DuckDB? | Lemaxoxo wrote: | DuckDB is what I used in the blog post. Re-running this | query simply recomputes everything from the start. I didn't | store the intermediary state that would allow starting off | from where the query stopped. But it's possible! | hobs wrote: | In a non-streaming db, what would prevent you from storing the | result set and just using the last iteration to calculate the | next? | swasheck wrote: | i'm trying to learn here so please pardon my ignorance. | wouldn't a pre-aggregated set affect the new aggregate | result? i suppose you could store avg, sum, count and then | add the new value(s) to sum, new count to count, and | recalculate average from that. 
or even just avg and count | and then re-aggregate as ((avg*count)+new values)/(count + | new values count) but i didn't know if there's a better way | to process new values into a set of data that's already | been aggregated | hobs wrote: | Yep that would be what I would do - effectively a | functional approach where no memory is required besides | the current set and iteration. | | A big part of materializing datasets for performance is | finding the right grain that is both easy enough to | calculate and also can do nice things like be resumable | for snapshots. | Lemaxoxo wrote: | This is definitely a possibility. What I meant to say is | that the implementation in the blog post doesn't support | that. | hobs wrote: | Makes sense, just an insert instead of a select, I used a | similar approach on newton-raphson to implement XIRR in | SQL and it worked well. | episode0x01 wrote: | From a comment by the author: | | > Also, I can't justify why, but my gut feeling is that the | database should be an order of magnitude faster than Python, | provided the implementation were to be improved. | | Would be curious how that could end up being the case. Perhaps if | NumPy wasn't used at all? That would mean no vectorization or | optimized storage. | | Would be interesting to see how it scaled with length and | dimensionality | jackschultz wrote: | >A machine learning algorithm which can be trained using SQL | opens a world of possibilities. The model and the data live in | the same space. This is as simple as it gets in terms of | architecture. Basically, you only need a database which runs SQL. | | First paragraph of the conclusion, and this very much fits with | the mindset that's been growing in me in the data world over the | past few years. Databases are much more powerful than we think, | they're not going to go away, only get better, and having the | data and the logic in the same space really removes tons of | headaches. 
ML models, transformation of data, generating json for | an API can all be done within the database rather than in an | outside scripting language. | | Are others seeing this? Are the current tradeoffs just that more | people know python vs sql or database specific languages to where | moving logic to postgres or snowflake is looked down on? | noloblo wrote: | +1, SQL is extremely elegant, composable, and underrated. | | Postgres is very powerful. While I took a short detour into | NoSQL MongoDB land, I'm now back in MySQL/PostgreSQL territory | and glad for it. | | Being able to generate views and stored procedures is useful | as well. Having SQL take over more, like ML and gradient | descent, does open up good possibilities. | | Also, since SQL is declarative, it makes things rather easier | than imperative scripting languages. | naasking wrote: | SQL has some positives but it is not composable. At all. This | is because relations are not first-class values in SQL. | dventimi wrote: | Is a query not a relation? | naasking wrote: | Basically, but queries are not first class in SQL. You | can't assign a query to a variable, or pass it as a | parameter to a stored procedure, for example. This would | make SQL composable: declare @x = | (select * from Person) select Name from @x where | Birthdate < '2000-01-01' | noloblo wrote: | Exactly, since it's declarative the style lends itself to | using stored procedure calls to become composable | naasking wrote: | You cannot abstract over stored procedures either, so | that's still not composable. | dventimi wrote: | No idea what this means | naasking wrote: | Think of first-class functions. Can't do that with stored | procedures, just like you can't do that with queries or | tables in SQL. | alphanumeric0 wrote: | Furthermore, stored procedures/functions are not queries. | dventimi wrote: | Stored procedures can be relations. Queries are | relations. Ergo, stored procedures can be queries. 
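The incremental re-aggregation that swasheck and hobs sketch upthread (folding new values into a stored avg/count pair instead of re-scanning everything) can be written as a small helper. This is a minimal sketch in Python; `update_mean` is a hypothetical name, not something from the post:

```python
def update_mean(avg, count, new_values):
    """Fold new observations into a stored (avg, count) pair without
    re-scanning the rows that were already aggregated."""
    total = avg * count + sum(new_values)
    count = count + len(new_values)
    return total / count, count

# Stored aggregate: mean 2.0 over 3 rows; two new rows arrive.
avg, count = update_mean(2.0, 3, [4.0, 6.0])  # (2*3 + 10) / 5 = 3.2
```

The same identity is what would let a materialized aggregate resume from a snapshot instead of recomputing from scratch.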
| Lemaxoxo wrote: | Isn't that the point of common table expressions (CTEs)? | naasking wrote: | See: https://news.ycombinator.com/item?id=35058927 | Foobar8568 wrote: | You can do composition through CTE or Table-Valued | Function. | naasking wrote: | CTE and TVF still treat tables as second class citizens, | so while they enable some forms of composition, they're | still very restricted. This has been the consistent story | with SQL, 15 ways to use queries and tables, all to work | around the limitation that they are not first class | values. | datan3rd wrote: | with persons as (select * from Person) select | Name from persons where Birthdate < '2000-01-01' | naasking wrote: | Where is the assignment to a variable? Where can you | construct a query using a variable in table/query | position? That's the whole point of being first class and | composable, a query becomes like any other value so you | should be able to parameterize any query by another query | assigned to a variable that may have been set inside an | if-statement, or accepted as a parameter to a stored | procedure. You know, the same kinds of composition we see | in ordinary programming languages. | chrstr wrote: | create table x as (select * from person); select | name from x where ...; | | there you go, just configure your editor to display | "create table x" as "declare x = " ;) | | or even a version with lazy evaluation: | create view x as (select * from person); select | name from x where ...; | naasking wrote: | You're still not getting it. First-class status means | that anywhere a value or variable can be used, a query or | table should also be able to appear, and vice versa. This | means a table or query can appear as a return type, a | parameter to a stored procedure or query, a variable, and | so on. | | SQL just does not have this, it instead has 15 different | second class ways to handle tables and queries that try | to make up for the fact that they are not first-class | values. 
These include CTEs, table valued functions, | views, etc. | alphanumeric0 wrote: | What if I wrote a very long, complicated query that I'd | like to test against different tables (like test tables), | and let's say I can't use stored functions or procedures. | How could I pass different tables to my query? | chrstr wrote: | CREATE TABLE data_a AS (SELECT 'a' AS test_case, 1 AS | value); CREATE TABLE data_b AS (SELECT 'b' AS | test_case, 2 AS value); CREATE VIEW data AS (SELECT | * FROM data_a UNION ALL SELECT * FROM data_b); | CREATE VIEW complicated_query AS (SELECT test_case, | value+1 FROM data); SELECT * FROM | complicated_query WHERE test_case = 'a'; SELECT * | FROM complicated_query WHERE test_case = 'b'; | alphanumeric0 wrote: | Nice, that is what I was looking for. Of course, it'd | need to point to production data as well, so maybe | test_case is null, in that case: CREATE | TABLE data_a AS (SELECT 'a' AS test_case, 1 AS value); | CREATE TABLE data_b AS (SELECT 'b' AS test_case, 2 AS | value); CREATE TABLE data_prod AS (SELECT NULL AS | test_case, prod_table.value FROM prod_table); | CREATE VIEW data AS (SELECT * FROM data_a UNION ALL | SELECT * FROM data_b UNION ALL SELECT * FROM data_prod); | CREATE VIEW complicated_query AS (SELECT test_case, | value+1 FROM data); -- when testing | SELECT * FROM complicated_query WHERE test_case = 'a'; | SELECT * FROM complicated_query WHERE test_case = 'b'; | -- when in 'production' SELECT * FROM | complicated_query WHERE test_case IS NULL; | naasking wrote: | You just reinvented defunctionalization, which is a | transformation from a domain that has first-class values | to a domain where support is only second-class. | Defunctionalization is typically used in programming | languages to simulate first-class functions in languages | where functions are only second-class citizens, like C | and Pascal. | | This perfectly illustrates my point. 
You had to manually | defunctionalize your data model and queries to support | what I'm saying should be inherently part of SQL. | dventimi wrote: | If you can't use stored procedures, which are good for | this very case, many databases offer dynamic SQL. That | might work in some cases. | alphanumeric0 wrote: | That's a good point. This would rule out SQLite for me. | Foobar8568 wrote: | TVF (Table-Valued Function) with Cross apply. | adeelk93 wrote: | Check out dbt - it's a great tool for organizing queries | and solving such patterns | simonw wrote: | CTEs go a long way to making SQL more composable. | noloblo wrote: | Postgres has read/write parallelism that can scale across | millions of reads and writes; if the ML model is inherent in | the Postgres db it is indeed very elegant. Reminds me of the | glory days of tsearch2 to do text search in Postgres for | searching our internal document repository using nothing but | Postgres, and boy was it faster than Java based search systems | somat wrote: | A dbms is really its own operating system; usually this is | hosted on another operating system, one that understands the | hardware. | | I remember one place I worked where we had several old | graybeard programmers who considered the dbms[1] the operating | system. As a unix sysadmin we had some interesting discussions, | as I was always confused and confined by the dbms and they felt | the same about unix. | | 1. unidata if curious, a weird multi-value (not relational) | database, very vertically integrated compared to most databases | today. | CreRecombinase wrote: | I spend most of my time in the parallel universe that is | scientific computing/HPC. In this alternate reality SQL (not to | mention databases) never really took off. Instead of scalable, | performant databases, we have only the parallel filesystem. 
I'm | convinced the reason contemporary scientific computing doesn't | involve much SQL is sociological/path-dependency, but there are | also very good technical reasons. Optimizing software in | scientific computing involves two steps: 1) Think hard about | your problem until you can frame it as one or more matrix | multiplications 2) Plug that into a numerical linear algebra | library. The SQL abstraction (in my experience) takes you very | much in the opposite direction. | dchftcs wrote: | There is a recent trend in database research, ML in databases. | Not sure how much of an impact it can make though, the sweet | spot is doing relatively linear stuff, arguably just a step up | from analytical functions in queries, while cutting edge ML | needs dedicated GPUs for compute load and often uses | unstructured and even binary data. | Scubabear68 wrote: | With "traditional" RDBMS systems, putting a lot of code in the | DB led to a lot of scaling issues where you'd need gigantic | machines with a lot of RAM and CPU: because the DB was doing so | much work. It was expensive and clunky to get HA right. | | In more modern DBs being distributed horizontally, this | approach may see a rebound. The big "but" is still costs, in my | experience in AWS as an example, managed Postgres Aurora was | surprisingly expensive in terms of monthly cost. | dventimi wrote: | Why would running code in a database process be intrinsically | more expensive than running it in some other process? | yunwal wrote: | The issue is that many relational databases are not | horizontally scalable, so you want to be frugal with their | resources. | dventimi wrote: | The resources I'm familiar with are I/O, memory, and CPU. | The only one I believe can be spared in the database by | using that resource outside of the database, is CPU. 
When | the database is far from saturated on CPU and latency and | throughput are determined by I/O and memory, using CPU on | some other machine that isn't the database can't possibly | have any impact on latency and throughput. | yunwal wrote: | > When the database is far from saturated on CPU | | The issue here is if you scale enough to saturate the | database, you'll have to rewrite essentially all your | code if you're a typical CRUD webapp. Basically all of | your business logic is about data retrieval. There's | probably some companies that can get away with this, but | it would be way too expensive for most. | dventimi wrote: | If I scale up to saturate the database CPU...by doing | data retrieval? Setting aside my skepticism about | saturating the CPU with mere data retrieval, how is that | solved by moving the data to another host's CPU, when | moving the data involves the very data retrieval that's | saturating the database's CPU? | riku_iki wrote: | you can easily scale linearly your app layer on multiple | machines, while it is harder with previous gen databases. | dventimi wrote: | What good is that going to do if the data always resides | in the database? | bob1029 wrote: | > Databases are much more powerful than we think | | The older I get the more I agree with this. | | There is nothing you cannot build by combining SQL primitives. | Side effects can even be introduced - on purpose - by way of | UDFs that talk to the outside world. | | I've seen more than one system where the database itself was | directly responsible for things like rendering final HTML for | use by the end clients. You might think this is horrible and | indeed many manifestations are. But, there lurks an opportunity | for incredible elegance down this path, assuming you have the | patience to engage in it. | giraffe_lady wrote: | > I've seen more than one system where the database itself | was directly responsible for things like rendering final HTML | for use by the end clients. 
| | I did this for a side project a few months ago and even used | postgrest to serve the page with correct headers for html. It | felt simultaneously really cursed and obvious. Shit you could | even use plv8 to run mustache or whatever in the db if you | really wanted to piss people off. | dventimi wrote: | I'm doing this right now with a DO droplet built with | PostgreSQL, postgrest, nginx, and not much else. Do you | have any tips, tricks, or blog posts you can share based on | your experience? You should post it to HN. Strike while the | iron's hot. With a little luck you'll hit the front page. | giraffe_lady wrote: | Nah nothing that refined. Pretty predictably supabase is | doing some weird stuff along these lines and I found some | abandoned and semi-active repos associated with them and | people working for them that were useful examples of some | things. | | As for posting to HN absolutely no thanks. These people | are so fucking hostile there is no accomplishment too | small to tear apart for entertainment here. I have no | interest in it. | dventimi wrote: | > As for posting to HN absolutely no thanks. These people | are so fucking hostile there is no accomplishment too | small to tear apart for entertainment here. I have no | interest in it. | | You make a really good point. | nerdponx wrote: | In the data warehouse / OLAP space, I think we are heading | towards a world where the underlying data storage consists of | partitioned Arrow arrays, and you can run arbitrary UDFs in | arbitrary languages on the (read-only) raw data with zero copy | and essentially no overhead other than the language runtime | itself and marshalling of the data that is emitted from the | UDF. | | Something like the Snowflake data storage model + DuckDB as an | execution engine + a Pandas/Polars-like API. | | There is no reason why we have to be stuck with "database == | SQL" all the time. 
SQL is extremely powerful, but sometimes you | need a bit more, and in that case we shouldn't be so | constrained. | | But in general yes, the world is gradually waking up to the | idea that performance matters, and that data locality can be | extremely important for performance, and that doing your | computations and data processing on your warehouse data | in-place is going to be a huge time and money saver in the | longer term. | roflyear wrote: | > Are the current tradeoffs just that more people know python | vs sql or database specific languages to where moving logic to | postgres or snowflake is looked down on? | | Yes, mostly development, deployment, etc. concerns. I haven't | ever seen an org that versions their SQL queries, unless they | are in a codebase. The environment is just unfriendly towards | that type of management. Nevermind testing! Things that have | solutions but we haven't matured enough here, because that type | of development has been happening in application code. | | Also, SQL is generally more complex than application logic, | because they are designed to do different things. What is a | simple exercise in iteration or recursion can more easily | become something a little more of a headache. | | Problems that can be resolved, but they are problems. | _a_a_a_ wrote: | Self-proclaimed database expert here. What a database is good | at depends on what you're trying to get that database to do, at | least in part. | | Take it in two pieces: elegance and efficiency. These will | correspond to a logical statement of what you're trying to do, | and how quickly the database will actually do it in practice. | | SQL can do some nice things in areas, making it elegant in | those areas. Elsewhere it can be pretty wordy and ugly. | | In efficiency, it comes down largely to how the database is | implemented and that also includes the capability of the | optimiser. Both of these are out of your control. 
In my | experience trying to turn a database into a number cruncher is | just not going to work. | | I guess that's a long way round of me saying that I don't think | I agree with you! | maCDzP wrote: | I agree. I took a course in databases and SQL and was blown | away by its power. With CTEs and PL/SQL you can do a lot of | stuff inside the database. | | I played with SQLite and its json columns. Once you get the | hang of the syntax for walking a json structure you can do all | sorts of neat things. Doing the same thing in Python would be | tedious. | | And I also believe it ended up being way faster than what I did | in python. | ISL wrote: | The challenge I, with limited knowledge, see with developing | detailed algorithms in SQL is a lack of good testing, | abstraction, and review tooling. Similarly for a lot of the | user-defined-functions for the larger data warehouses | (redshift, bigQuery, etc.) | | dbt solves a lot of it, but I'd love to learn more about good | resources for building reliable and readily-readable SQL | algorithms for complex logic. | AndrewKemendo wrote: | I mean this is an opportunity right? Build those things | Lemaxoxo wrote: | I agree. Databases are going to be here for a long time, | and we're barely scratching the surface of making people | productive with them. dbt is just the beginning. | adeelk93 wrote: | I've supported 3 different models over the years with | inference implemented in SQL. First one I inherited, loved it | so much that I implemented it twice again. Amazingly fast for | TBs of data and no waiting on other tech teams. | | That tooling you're describing is definitely not there. | Bigquery has BQML but it's very much in its infancy. I tend | to do all the modeling in Python/R on sampled data and then | deploy to SQL. | fithisux wrote: | BQML should become standard. 
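The train-in-Python, deploy-to-SQL pattern adeelk93 describes can be sketched with SQLite. Everything below (table, columns, coefficients) is made up for illustration; the fitted linear model is simply templated into a SELECT expression so that scoring runs inside the database:

```python
import sqlite3

# Hypothetical coefficients, as if fitted offline in Python/R.
intercept = 0.5
weights = {"x1": 2.0, "x2": -1.0}

# "Deploying" the model amounts to generating a SQL expression from it.
terms = [str(intercept)] + [f"({w}) * {col}" for col, w in weights.items()]
query = f"SELECT x1, x2, {' + '.join(terms)} AS prediction FROM features"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE features (x1 REAL, x2 REAL)")
con.executemany("INSERT INTO features VALUES (?, ?)", [(1.0, 2.0), (3.0, 0.0)])
rows = con.execute(query).fetchall()  # scoring happens inside the database
```

The first row scores as 0.5 + 2.0*1.0 - 1.0*2.0 = 0.5, and no feature data ever leaves the database.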
| actionfromafar wrote: | I have yet to see a decent IDE or system which allows great | version control, unit testing and collaboration with SQL source | code. | | So I think a lot of the reluctance is from practical concerns. | noloblo wrote: | We store individual sql files in github and keep them in | separate folders | | This is very simple and scales well for our purposes | jhd3 wrote: | > Databases are much more powerful than we think | | and data has mass. One example of bringing the work to the data | is https://madlib.apache.org/ (works on Postgres and Greenplum) | | [Disclaimer - former employee of Pivotal] | tomrod wrote: | Absolutely. | | Consider too that PostgreSQL databases support different | languages, like Python. | | Loads of for-profit companies have tried to cash in on this. | SAP HANA is one of the ones I've had recent experience with. It | is unfortunately a poor implementation. The right architecture | tends to be: put your model behind an API interface, not | internal on the system. Train your model separately from | production systems, and so on. | | You might also be interested in checking out MLOps platforms | like Kubeflow, Flyte, and others. | 7thaccount wrote: | Database first designs make a lot of sense in a lot of ways. | I've worked for a company with an Oracle database that has SQL | scripts that do all the validation and create text files for | downstream usage. I think it makes more sense than a ton of | Java, but there are pros and cons. One is that SQL is | relational and the advanced stuff can be extra hard to | troubleshoot if you don't have enough experience. Even those | that can't code can usually understand a for loop and can think | imperatively. | | Unfortunately it's an expensive commercial product or I'd | recommend you look at kdb+ if you work with time series data. 
| The big banks use it and essentially put all their latest RT | data into kdb+ and then can write extremely succinct queries | with a SQL-like syntax, but the ability to approach it far more | programmatically than what is typically doable with something | like PL-SQL. You can even write your neural network or whatever | code in less than a page of code as the language of kdb+ is | extremely powerful, although also basically incomprehensible | until someone puts some time into learning it. It's extremely | lightweight though, so very easy to deal with in an interactive | fashion. | | All that to say I agree with you that it's nice to just have | everything you want all in one spot rather than to deal with 4 | different tools and pipelines and shared drives and so on. | daveguy wrote: | I would like for this to be the case. I was DBA of a Postgres | database for LIMS almost 2 decades ago. Back then you could | code functions for the database to execute on data and it was | very powerful, but also very clunky. The tools to do software | development in the database were not mature at all. A lot has | changed in the past 20 years and SQL has evolved. Do you think | SQL will expand that much or there will be APIs built into the | database? Near-data functions are powerful and useful, but I | would want my development environment to be more like version | controlled code than "built-in". | | I wonder if near-data functions on small databases is the | solution to the limit of statelessness that you have with | functions as a service. | nonethewiser wrote: | > Databases are much more powerful than we think | | And a function of what people think is attitudes towards | working at the DB level. I see this often with ORM's in the web | dev sphere (rather than Data Science). Yes, ORM's are great but | many people rely on them to completely abstract away the | database and are terrified by raw sql or even query building. 
| You also see it with services that abstract away the backend | like PocketBase, Fireship, etc. Writing a window function or | even a sub select looks like black magic to many. | | I say this after several experiences with codebases where joins | and filtering were often done at the application layer and raw | sql was seen as the devil. | boringg wrote: | Opposite here - don't like ORMs. Too much overhead - though I | get their value. | FpUser wrote: | It is tempting to combine web server, database and some | imperative language with built in data oriented / SQL features | in a single executable and call it an application server that | would communicate with the outside world using for example JSON | based RPC. I think there were / are some products in the area | even with the built in IDE (like Wakanda). | spprashant wrote: | The problem with throwing everything in a database is you end | up with brittle stored procedures all over the place, which are | painful to debug. There is no good support for version control | or testing, which means you end up creating a dozen copies of | each function named (sp_v1, sp_v2, etc.). It is much harder | to practice iterative development, which the rest of software | development implements effectively. | | Also traditional relational databases have a way to go before | they can support parallelized machine learning workloads. You | do not have control or the ability to spin up threads or | processes to boost your performance. You rely on the query | processor to make those decisions, and depending on your | platform the results will be mixed. | 62951413 wrote: | Junior developers like me were uncomfortable with SQL twenty | years ago. Java ORM frameworks became popular because of the | Object-Relational impedance mismatch. I kind of see the same | kind of sentiment nowadays among newer generations but in | Python&Co. 
| | The success of the Apache Spark engine can at least partially | be attributed to | | * being able to have the same expressive power as SQL but with | a real Scala API (including having reusable libraries based on | it) | | * being able to embed it into unit tests at a low price of | additional ~20 seconds latency to spin up a local Spark master | justsomeuser wrote: | I think general programming languages are better for general | programs than SQL. | | Specifically they have: Type systems, compilers, debuggers, | text editors, package managers, C FFI etc. | | But I agree that having the data and the program in the same | process has benefits. | | Writing programs in SQL is one way. | | Another way is to move your data to your general program with | SQLite. | | I like using SQL for ACID, and queries as a first filter to get | the data into my program where I may do further processing with | the general language. | Serow225 wrote: | Another is MS SQL Server, which lets you run .NET on the | database server :D "you can author stored procedures, | triggers, user-defined functions, user-defined types, and | user-defined aggregates in managed code" | malnourish wrote: | I have had nothing but bad experiences trying to run .NET | in SSIS packages -- is there another way? | Serow225 wrote: | I've never had the pleasure(ha) of using SSIS, but this | is the stuff that I was talking about: | https://learn.microsoft.com/en- | us/dotnet/framework/data/adon... | fifilura wrote: | > Type systems | | SQL has types | | > compilers | | For what specifically do you need a compiler? | | > debuggers | | Some tasks - like the concurrency SQL enables - are just very | difficult to debug with debuggers. It would be the same with | any other language. What SQL does here though is to allow you | to focus on the logic, not the actual concurrency, | | > text editors, package managers | | I feel like these two are just for filling up the space. 
| | C FFI | | Many SQL engines have UDFs | swyx wrote: | had a very good chat with https://postgresml.org/ last week | which is focusing on bringing ML to postgres: | https://youtu.be/j8hE8-jZJGU | Lemaxoxo wrote: | I'm watching it, it's really good. Montana makes a great | point: you can move data to the models, or move the models to | the data. Data is typically larger than models, so it makes | sense to go with the latter. | fzeindl wrote: | Fully agree. | | And using PostgREST [0] you can serve your postgreSQL database | as a REST API. And if you throw foreign data wrappers / | multicorn in the mix, you can map any other datasource into | your postgreSQL db as a table. | | [0] https://postgrest.org/en/stable/ | JUNGLEISMASSIVE wrote: | [dead] | [deleted] | airstrike wrote: | This is great. The only thing I dislike from this is using these | variables to try to predict Adj Close when they are not at all | correlated. | | There are countless meaningful correlations in financial data | that would have been just as easy to play around with. One truly | valuable example would be to look at trading multiples of | comparable companies. Sticking to P/E would be easier as P is | easily observable and forward-looking EPS estimates are generally | always available. This would limit the exercise to more mature | companies than the ones commonly discussed on HN but would make | it actually meaningful | Lemaxoxo wrote: | Do you have some data/resources on this? I'm a total snowflake | at this, but I'm willing to learn. | airstrike wrote: | _In hindsight this is harder than it seems if you don't | already have access to data, so I'll try to be as informative | as possible in my response_ | | It's hard to find this information out there, so here's ~all | you need to know. | | Data is usually behind paywalls, unfortunately. Industry | standards are Bloomberg terminal (ridiculously expensive, 5 | digits $), FactSet (very expensive, 4 digits), Capital IQ | (expensive, not sure)... 
but there are a number of up-and-coming startups trying to
 | disrupt the space so you may be able to grab data from them. I
 | think https://atom.finance has a 7-day free trial you could use
 | to play around with.
 |
 | P/E simply means the company's _P_rice per share divided by
 | _E_arnings per share. Cancel out the "per share" terms and you
 | get total market capitalization (which is the value of the
 | total equity) divided by net income (since "earnings per share"
 | really means "net income per share").
 |
 | So the "P" is easy to get. It's your Adj Close.
 |
 | The "E" is trickier as it can mean a lot of things. Diluted EPS
 | from financial statements? Last year's EPS? Management's
 | guidance for EPS? None of those are actually correct even if
 | they are all "EPS".
 |
 | Importantly--and contrary to 99% of the info you will find
 | online--the most relevant EPS numbers are _forward_ estimates
 | of EPS, usually for the next twelve months ("NTM"). Those are
 | based on an average or median of analyst estimates which is
 | called "consensus". These are analysts from financial
 | institutions who build their own little models based on their
 | own views of where the business is going to go, informed by
 | recent earnings, management's color in earnings calls and
 | filings, etc.
 |
 | Believe it or not, as hairy as that sounds, EPS is fairly easy
 | to get as it's a metric that has less room for interpretation
 | than, say, EBITDA.
 |
 | So you're not going to go out there, read all these (paid)
 | analyst reports, find their EPS, calculate the median, etc.
 | Bloomberg, Capital IQ, FactSet do this for you and it's easily
 | observable for the end user (that's their business).
 |
 | The thing is, as you may have guessed, the "next twelve months"
 | is a moving target across time. Analysts usually provide
 | estimates for the current fiscal year (i.e. FY 2023, ending
 | 12/31/2023 for most companies) and the following year, ending
 | 12/31/2024.
Let's call these FY0_EPS and FY1_EPS, for simplicity.
 |
 | You might be tempted to just take a moving average of these
 | two estimates, so that on 1/1/2023 it is 100% of FY0_EPS + 0%
 | of FY1_EPS, on 1/2/2023 it is 99.9% + 0.1%, and gradually
 | "move forward in time" as the days pass. That sort of works
 | (and definitely checks the box for a proof-of-concept like in
 | your post) but for the sake of completeness, I'll just say
 | that the right-er approach is to only "move forward in time"
 | when new earnings are released. So it doesn't matter if we're
 | in 1/1/2023 or 2/1/2023--what matters is the latest reported
 | quarter. Take Coca-Cola for instance
 | (https://www.bamsec.com/companies/21344/coca-cola-co). Let's
 | roll the tape backward one year. They reported FY 2021
 | earnings on 2/22/2022, at which point analysts published new
 | estimates in revised models, so from that day forward until
 | the next quarterly earnings we take 100% FY0_EPS + 0%
 | FY1_EPS, where these correspond to estimates for FY 2022 and
 | FY 2023, respectively.
 |
 | On 4/1/2022, Coca-Cola reported Q1 2022 results, analysts
 | published new estimates, and we now take 75% FY0_EPS + 25%
 | FY1_EPS.
On 7/1/2022, we move forward another quarter so 50% + 50%,
 | then 25% + 75% starting on 10/26, and then back to square one
 | with 100% + 0%, except FY0_EPS now means FY 2023 vs FY 2022
 | previously, and FY1_EPS means FY 2024.
 |
 | So your table is something like (I'm making up numbers):
 |
 |  +------------+--------+-----------+---------+---------+---------+-------------+-------------+
 |  | Date       | Ticker | Adj_Close | FY0_EPS | FY1_EPS | Period  | NTM0_Weight | NTM1_Weight |
 |  +------------+--------+-----------+---------+---------+---------+-------------+-------------+
 |  | 01/01/2022 | KO     |     90.10 |   20.00 |   24.00 | Q1 2022 |        1.00 |        0.00 |
 |  | 01/02/2022 | KO     |     91.14 |   20.00 |   24.00 | Q1 2022 |        1.00 |        0.00 |
 |  | 01/03/2022 | KO     |     89.30 |   20.00 |   24.00 | Q1 2022 |        1.00 |        0.00 |
 |  | 01/04/2022 | KO     |     91.09 |   20.00 |   24.00 | Q1 2022 |        1.00 |        0.00 |
 |  | 01/05/2022 | KO     |     92.01 |   20.00 |   24.00 | Q1 2022 |        1.00 |        0.00 |
 |  | 01/06/2022 | KO     |     89.05 |   20.00 |   24.00 | Q1 2022 |        1.00 |        0.00 |
 |  | ...        | ...    |       ... |     ... |     ... | ...     |         ... |         ... |
 |  | 07/02/2022 | KO     |     89.05 |   19.50 |   23.20 | Q2 2022 |        0.75 |        0.25 |
 |  +------------+--------+-----------+---------+---------+---------+-------------+-------------+
 |
 | With that you can take NTM0_Weight and NTM1_Weight to
 | calculate NTM_EPS by multiplying those weights by FY0_EPS and
 | FY1_EPS. And then you can take Adj_Close / NTM_EPS to
 | calculate P/E.
 |
 | Why is this useful? Because in theory you can take the average
 | P/E of companies X, Y and Z in one industry and compare it to
 | a fourth company W. Is W's P/E multiple above or below the
 | industry average? You now know if they are over or
 | undervalued, respectively, which means you know if you should
 | buy or sell that stock (if you believe you picked the right
 | "comparable" companies in that industry).
 |
 | This is just one example... there are all sorts of similar
 | analyses done daily in the financial services industry. I'm
 | not saying it's easy to extract alpha from trading on these,
 | but that's the framework.
 | Lemaxoxo wrote:
 | Thank you so much for this! It's very generous of you to have
 | taken the time.
 | airstrike wrote:
 | My pleasure!
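The blending arithmetic described above reduces to a couple of one-liners. Here is a hedged sketch using the made-up numbers from the example table; the 25%-per-reported-quarter roll-forward mirrors the simplified scheme in the comment and the function names are invented for illustration:

```python
# Toy NTM (next-twelve-months) EPS blend, per the scheme above.
# All inputs are the made-up numbers from the example table.

def ntm_eps(fy0_eps, fy1_eps, quarters_reported):
    """Blend current-year and next-year EPS estimates.

    quarters_reported counts reported quarters of the current
    fiscal year: 0 -> 100%/0%, 1 -> 75%/25%, and so on.
    """
    w1 = 0.25 * quarters_reported      # weight on next fiscal year
    return (1 - w1) * fy0_eps + w1 * fy1_eps

def forward_pe(adj_close, fy0_eps, fy1_eps, quarters_reported):
    """Price divided by blended forward EPS."""
    return adj_close / ntm_eps(fy0_eps, fy1_eps, quarters_reported)

# After Q1 2022 results: 75% weight on FY0_EPS, 25% on FY1_EPS.
print(ntm_eps(19.50, 23.20, 1))   # 0.75 * 19.50 + 0.25 * 23.20 = 20.425
print(forward_pe(89.05, 19.50, 23.20, 1))
```

With a per-ticker NTM_EPS in hand, the comparison step is just averaging `forward_pe` over companies X, Y and Z and checking where W falls.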
I've spent the better part of the last | decade doing this stuff and I appreciate how hard it is | to find resources on it, so thought I'd share since you | mentioned you were interested in learning | | At the risk of exhausting said interest, here's one real | life example of a similar analysis: https://wsp-blog- | images.s3.amazonaws.com/uploads/2011/09/171... | | See PDF page 14. Note the lines called "Composite P / NTM | EPS" which they built as a blend of American Eagle's, | Tilly's and Zumiez's P/E multiple, which are companies X, | Y and Z in my comment above (for some reason they gave AE | double the weight which is unusual) and compared it to | Heat's P/E multiple (Heat was the codename for retailer | Rue21, or hypothetical company W in my example above) | | I got this deck from | https://www.wallstreetprep.com/knowledge/investment- | banking-... which has a few other examples and a variety | of other resources on the topic FYI | college_physics wrote: | Is there a risk that postgres becomes sentient in our lifetime? | zackmorris wrote: | This is great! Moving away from the proprietary nature of GPUs | and complex math gatekeeping should help democratize AI. | | Has anyone converted stuff like gradient descent to set theory? | | https://support.unicomsi.com/manuals/soliddb/7/SQL_Guide/2_G... | | https://www.sqlshack.com/mathematics-sql-server-fast-introdu... | | https://www.sqlshack.com/learn-sql-set-theory/ | | Right now AI algorithms kind of look imperative and stateful to | me, like state machines. But there should be a functional or | relational representation, where pattern matching would be used | to derive the current state instead. | | It's trivial to go from functional to imperative representation, | but often nearly impossible to go the other way. That's why | monadic (sorry if I'm mincing terms) logic is so troublesome. 
| Mutable variables, futures/promises, async, etc can't be | statically analyzed, which is why most code today is difficult to | parallelize and stuff like C++ optimizers don't even increase | speed to the next order of magnitude. But functional languages | have nearly infinite optimization potential through divide and | conquer approaches like sharding and scatter-gather arrays that | can run internally without side effects. In other words, we can | throw hardware at SQL for linear speedup since it's | embarrassingly parallel, but might have limited success | optimizing something like Python. | dogcomplex wrote: | Replying to this to remind myself to research later, as this is | a very important question | krick wrote: | From the start I assumed this is a nice playful "Hexing the | technical interview" kinda joke. But given the tone of the | article, and some of the comments here... Uh, this cannot be | serious, right? | guhidalg wrote: | Why couldn't it be serious? Your database already has your data | so you don't have to stream it to another machine for training. | If you have a developed model that needs retraining on new | data, save yourself the network bandwidth (and time!) and just | make the DB retrain. This doesn't replace the exploration phase | of model-building but if you already know gradient descent | works for you then what's wrong with this approach? | sgu999 wrote: | I'm very confused as well. Are we facing a wall of GPT- | generated comments? | Terretta wrote: | Title here is wrong. Title in article and headings in article are | right: ONLINE gradient descent | | It's specifically _not_ stochastic. From the article: | | _Online gradient descent_ | | _Finally, we have enough experience to implement online gradient | descent. 
To keep things simple, we will use a very vanilla version:_
 |
 | _- Constant learning rate, as opposed to a schedule._
 |
 | _- Single epoch, we only do one pass on the data._
 |
 | _- Not stochastic: the rows are not shuffled._
 |
 | _- Squared loss, which is the standard loss for regression._
 |
 | _- No gradient clipping._
 |
 | _- No weight regularisation._
 |
 | _- No intercept term._
 | [deleted]
 | Lemaxoxo wrote:
 | Hehe I was wondering if someone would catch that. Rest
 | assured, I know the difference between online and stochastic
 | gradient descent. I admit I used stochastic on Hacker News
 | because I thought it would generate more engagement.
 | airstrike wrote:
 | Then just call it Non-stochastic Gradient Descent? You can't
 | editorialize titles per HN guidelines
 |
 | https://news.ycombinator.com/newsguidelines.html
 | Lemaxoxo wrote:
 | Thanks, I wasn't aware.
 | airstrike wrote:
 | My pleasure. You can still edit the title, by the way ;-)
 | rlewkov wrote:
 | Uhhhh, ya know that Python has a library for that :-)
 | glomgril wrote:
 | Incredible post. I laughed when I saw the title, snickered at
 | the first paragraph, and then proceeded to be blown away by
 | the rest of it. Thought I was in for a joke and instead I'm
 | thinking about the nature of ML Ops and what it's become.
 | civilized wrote:
 | In the comments here so far, we see a pattern we've seen
 | before. When someone suggests doing something in SQL, there's
 | a lot of concern about SQL being a very limited programming
 | language where it's hard to do proper engineering.
 |
 | Here's what I would really love to know: why is it that SQL
 | is, to first order, the only language used to interact with
 | databases, and SQL has about the same features as it did in
 | the 70s? It seems analogous to the general-purpose programming
 | world having stopped with C.
 | bob1029 wrote:
 | > why is it that SQL is, to first order, the only language
 | used to interact with databases, and SQL has about the same
 | features as it did in the 70s?
| | Because SQL is effectively a domain-specific language. If you | added 100+ additional keywords/functions/etc., do you think it | would be easier or more difficult for the average developer to | build something clean with it? | | I look at SQL like a pile of bricks. You don't want | complicated, unicorn bricks or you won't be able to fit | anything meaningful together. Experienced masons almost | certainly prefer their materials to be as dumb and consistent | as possible. | dventimi wrote: | I believe the PC revolution gave a few generations of | programmers the mistaken belief that imperative languages with | familiar features (Basic, Pascal, C, Perl, Python, Java, etc.) | are the only or the best way to interact with computers. Forms | of interaction developed earlier (Forth, Prolog, SQL, etc.) | were learned later, grudgingly, if at all. SQL was viewed with | disdain, a necessary evil to be held at arm's length. That | stalled evolution of SQL, the success of other query languages, | or really anything that didn't fit the Algol-style "general | purpose programming language" blinders. | | Mercifully, finally, the cold war against SQL in particular and | against relational databases in general seems to be thawing. | Articles like this one exemplify that trend | maxfurman wrote: | Plenty of SQL features have been added since the 70s, notably | window functions (which TFA relies on heavily). Most of the | major databases are from the 80s, and even new kid on the block | MySQL has been around since the mid 90s. | | SQL draws a very hard line between the expression of the query | and the AST that is used in the actual implementation. Database | vendors like this aspect because they are free to implement | whichever optimizations they want, but application developers | want to build queries programmatically and optimize them | themselves before passing them to the db engine, hence the | tension in threads like these. 
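As a concrete illustration of how far those newer SQL features go, here is a hedged sketch of the article's core trick (one pass of online gradient descent expressed as a recursive CTE) in stdlib SQLite rather than the article's DuckDB. The table, data and learning rate are made up, and this is a toy single-feature version, not the article's exact query:

```python
import sqlite3

# Toy data: y = 2 * x, so the learned weight should approach 2.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (i INTEGER PRIMARY KEY, x REAL, y REAL)")
con.executemany("INSERT INTO data VALUES (?, ?, ?)",
                [(i, 1.0, 2.0) for i in range(1, 301)])

# One pass of online gradient descent as a recursive CTE:
# constant learning rate, squared loss, no intercept, no shuffling.
(w_sql,) = con.execute("""
    WITH RECURSIVE gd(i, w) AS (
        SELECT 0, 0.0                                 -- initial weight
        UNION ALL
        SELECT d.i, gd.w - 0.01 * 2 * (gd.w * d.x - d.y) * d.x
        FROM gd JOIN data d ON d.i = gd.i + 1         -- one step per row
    )
    SELECT w FROM gd ORDER BY i DESC LIMIT 1
""").fetchone()

# Cross-check against the same update rule in plain Python.
w_py = 0.0
for x, y in [(1.0, 2.0)] * 300:
    w_py -= 0.01 * 2 * (w_py * x - y) * x

print(round(w_sql, 2), round(w_py, 2))  # both approach 2.0
```

This flat shape keeps the recursive table out of any subquery, which is the restriction simonw ran into when porting the article's more complex queries to SQLite.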
| snookerdooker wrote:
 | I challenge you to add more emojis to your bio ;)
 | Lemaxoxo wrote:
 | Yes I know some people look down on that. I hope it doesn't
 | take away the merits of the article for you hehe.
 | Foobar8568 wrote:
 | R has been natively supported in SQL Server for a long while
 | (at least since 2017), same for Java or Python, e.g. writing
 | stored procedures in these languages.
 | kunalgupta wrote:
 | god tier
 | yttribium wrote:
 | At least one F500 company had iteratively reweighted least
 | squares implemented in MS SQL as of ~20y ago. When all you
 | have is a hammer...
 | gigatexal wrote:
 | I thought I was decent at SQL until I saw this. Nice article!
 | PaulHoule wrote:
 | Reminds me of the time I coded up an optimized nearest
 | neighbor search in SQL.
 | syats wrote:
 | Just don't.
 |
 | SQL:
 |
 | - does not allow for easy and clean importing of
 | modules/libraries
 |
 | - is not easy to write tests for
 |
 | - has limited support for a debugger
 |
 | - lacks a consistent style for such large queries (plus most
 | textbooks cover fairly simple stuff) which means it's hard
 | for a developer to start reading someone else's code (more
 | than in other languages)
 |
 | - clearly indicates in its name that it is a Query language.
 |
 | Save yourself the trouble and all your collaborators the pain
 | of working with this code in the future, of trying to add new
 | features, of trying to reuse it in another project.
 |
 | If you want to operate near the data, use PL/Python for
 | PostgreSQL.
 |
 | EDIT: Fixed formatting.
 | dventimi wrote:
 | - PostgreSQL extensions are easy to include and use.
 |
 | - pgTAP exists for testing.
 |
 | - A large query in SQL is not made smaller by translating it
 | into an ORM DSL.
 |
 | - If "Query" in "SQL" means it's for querying data, then
 | evidently "Query" not being in say Java or Python means those
 | languages are NOT meant for querying data. If that's true,
 | then why would you use them for querying data?
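On the testability point: besides pgTAP, plain SQL is also straightforward to test from a host language against in-memory fixtures. A minimal sketch with Python's stdlib sqlite3; the schema and query are invented for illustration:

```python
import sqlite3

# The query under test (made up for this example).
QUERY = "SELECT AVG(amount) FROM orders WHERE status = 'paid'"

def run_query(rows):
    """Run QUERY against a throwaway in-memory database seeded
    with the given (amount, status) fixture rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (amount REAL, status TEXT)")
    con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return con.execute(QUERY).fetchone()[0]

# Known fixtures in, known answers out.
assert run_query([(10.0, "paid"), (30.0, "paid"), (99.0, "void")]) == 20.0
assert run_query([(5.0, "void")]) is None  # AVG over zero rows is NULL
```

The same pattern works against Postgres with a test database, which is roughly what pgTAP formalises in-engine.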
| nfw2 wrote:
 | > If "Query" in "SQL" means it's for querying data, then
 | evidently "Query" not being in say Java or Python means those
 | languages are NOT meant for querying data
 |
 | If X then Y does not imply if not X then not Y. Java and
 | Python do not indicate a purpose in their name because they
 | are general-purpose.
 | dventimi wrote:
 | Are they meant for querying data?
 | syats wrote:
 | Re modules/libraries: I meant it is not easy to write a piece
 | of SQL code, and then import it into several queries to reuse
 | it, or lend it to someone else for use on their own schema.
 | It is possible, yes, but seldom done, because it is hell.
 | PostgreSQL extensions could be used for this purpose, but
 | developing an extension requires a different set of SQL
 | statements (or luckily, Python or C) than those used by the
 | user of the extension, which makes compounding them a bit
 | hard. Not impossible, just hard to maintain.
 |
 | About your last point, I don't think that was my line of
 | reasoning, but, yes, for the love of what is precious, don't
 | open SQL files as Python/Java file objects and then parse and
 | rummage through them to find the data you are looking for.
 | Not impossible, just hard to maintain.
 |
 | Thanks for pointing out pgTAP, didn't know about this.
 |
 | For some reason, data-science folks haven't yet caught up
 | with ORMs... I don't know if this is good or bad, but (as the
 | OP shows) they are more used to rows and columns (or graphs)
 | than objects. Maybe that will change one day.
 | dventimi wrote:
 | > maybe that will change one day
 |
 | I pray that it never does.
 |
 | https://blog.codinghorror.com/object-relational-mapping-is-t...
 | dventimi wrote:
 | As for sharing SQL, that's easy to do within a database using
 | views. Across databases with possibly different data models,
 | that's not something I personally ever want to do.
 | Narew wrote:
 | There was also this neural network module in redis if you
 | want to do training and inference in a Redis DB.
(quite old btw) https://github.com/antirez/neural-redis
 | pradeepchhetri wrote:
 | You might want to consider checking out ClickHouse which
 | supports many ML functions natively:
 |
 | - stochasticLinearRegression
 | (https://clickhouse.com/docs/en/sql-reference/aggregate-funct...)
 |
 | - stochasticLogisticRegression
 | (https://clickhouse.com/docs/en/sql-reference/aggregate-funct...)
 | Lemaxoxo wrote:
 | Thanks for the links, I wasn't aware of them. The Russians
 | often seem to be a step ahead in the ML world.
___________________________________________________________________
(page generated 2023-03-07 23:00 UTC)