[HN Gopher] ArcticDB: Why a Hedge Fund Built Its Own Database
       ___________________________________________________________________
        
       ArcticDB: Why a Hedge Fund Built Its Own Database
        
       Author : todsacerdoti
       Score  : 93 points
       Date   : 2024-08-21 13:17 UTC (3 days ago)
        
 (HTM) web link (www.infoq.com)
 (TXT) w3m dump (www.infoq.com)
        
       | dang wrote:
       | Related:
       | 
       |  _ArcticDB: A high-performance, serverless Pandas DataFrame
       | database_ - https://news.ycombinator.com/item?id=35198131 - March
       | 2023 (1 comment)
       | 
       |  _Introducing ArcticDB: Powering data science at Man Group_ -
       | https://news.ycombinator.com/item?id=35181870 - March 2023 (1
       | comment)
       | 
       |  _Introducing ArcticDB: A Database for Observability_ -
       | https://news.ycombinator.com/item?id=31260597 - May 2022 (31
       | comments)
        
         | Nelkins wrote:
         | I don't think the last link is related. Different database.
        
           | silisili wrote:
            | Correct. They renamed it to FrostDB; here is the
            | announcement -
           | 
           | https://www.polarsignals.com/blog/posts/2022/06/16/arcticdb-.
           | ..
        
       | OutOfHere wrote:
       | https://github.com/man-group/arcticDB
        
       | stackskipton wrote:
        | Read the presentation. The answer was what I expected: we had a
        | unique problem, and because we make oil drums' worth of cash,
        | dipping a bucket in and using that cash to solve the problem was
        | an easy justification.
        | 
        | These are really smart people solving problems they have, but
        | many companies don't have buckets of cash to hire really smart
        | people to solve those problems.
        | 
        | Also, the questions after the presentation pointed out that the
        | data isn't always analyzed in their database, so it's more like
        | a storage system than a database.
       | 
       | >Participant 1: What's the optimization happening on the pandas
       | DataFrames, which we obviously know are not very good at scaling
       | up to billions of rows? How are you doing that? On the pandas
       | DataFrames, what kind of optimizations are you running under the
       | hood? Are you doing some Spark?
       | 
        | >Munro: The general pattern we have internally, and that the
        | users have, is that the pandas DataFrames you get back are
        | usable. They fit in memory. You're doing the querying, so limit
        | your results to that. Then, once people have got their DataFrame
        | back, they might choose another technology like Polars or DuckDB
        | to do their analytics, depending on whether they don't like
        | pandas or they think it's too slow.
        
         | datahack wrote:
         | This comment is underrated comedy gold. You clearly have worked
         | with big data.
        
         | primitivesuave wrote:
         | I skipped to the "why build a database" section and then
         | skipped another two minutes of his tangential thoughts - seems
         | like the answer is "because Moore's law"?
        
       | tda wrote:
        | I know there are tons of problems that are solved in Excel when
        | they really shouldn't be. Instead of getting the expert business
        | analyst to use a better tool (like pandas), money is spent to
        | "fix" Excel.
        | 
        | Apparently there is also a class of problems that outgrow
        | pandas. And instead of the business side switching to more
        | suitable tools, some really smart people are hired to build
        | crutches around pandas.
        | 
        | Oh well, they probably had fun doing it. Maybe they get to work
        | on nogil Python next.
        
         | laweijfmvo wrote:
          | I'm so sorry someone built something that you don't agree
          | with.
        
         | beckingz wrote:
         | There's value in 'backwards compatibility' from a
         | process/skills perspective. I agree that companies usually pay
         | too high a premium on that, but there is value.
        
       | bdjsiqoocwk wrote:
        | Isn't it constrained to minutely timestamps or something like
        | that?
        
         | dnadler wrote:
         | Nope, it's fairly generic. I worked at Man for a number of
         | years and with Arctic a lot during that time.
         | 
         | In my role I was mostly working with daily data, but also
         | higher and lower frequency data, and never had issues.
        
       | chirau wrote:
       | Two Sigma did a similar thing a few years back. It's called
       | Smooth Storage.
       | 
       | https://www.twosigma.com/articles/smooth-storage-a-distribut...
        
       | dnadler wrote:
       | If it wasn't clear from the article, this is open source and
       | available on Man's GitHub page:
       | 
       | https://github.com/man-group/arcticDB
       | 
        | I used to work at Man, so take this with a grain of salt, but I
       | really liked this and have spun it up at home for side projects
       | over the years.
       | 
       | I'm not aware of other specialized storage options for
       | dataframes, but would be curious if anyone knows of any.
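        | 
        | For anyone curious what "spinning it up at home" looks like, a
        | minimal local setup might be something like this (a sketch
        | assuming the pip-installable arcticdb package and an LMDB
        | directory as the storage backend; names are invented):
        | 
        |     import pandas as pd
        |     from arcticdb import Arctic
        | 
        |     # A local LMDB directory is enough; there is no server
        |     # process to run. Object stores such as S3 are also
        |     # supported via other connection URIs.
        |     ac = Arctic("lmdb:///tmp/arctic_home")
        |     lib = ac.get_library("side_projects", create_if_missing=True)
        | 
        |     df = pd.DataFrame({"x": [1.0, 2.0, 3.0]},
        |                       index=pd.date_range("2024-01-01", periods=3))
        |     lib.write("my_symbol", df)         # versioned write
        |     print(lib.read("my_symbol").data)  # back as a pandas DataFrame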
        
         | cgio wrote:
          | Apparently you need a licence; it's source-available.
        
         | camkego wrote:
         | The license is the Business Source License, which is not open
         | source: https://en.wikipedia.org/wiki/Business_Source_License
         | 
          | The current license says the terms change to Apache once a
          | release has been available for two years.
         | 
         | Not for me, but some might find it interesting.
        
         | flockonus wrote:
          | Few doc choices are as annoying as using a GIF to show code
          | that I should be able to read in 2 seconds.
        
       | andrewstuart wrote:
       | Blog post from same company in two years:
       | 
       | "How we switched from a custom database to Postgres".
        
         | Loughla wrote:
         | I don't understand your cynicism here. Should people/businesses
         | not try new things?
        
           | andrewstuart wrote:
           | >> Should people/businesses not try new things?
           | 
            | Well yes, in the right context, like hobby/personal
            | programming. But things like "we built our own database"
            | tend to be really hard to justify and mostly represent
            | technical people playing with toys, when they actually have
            | an obligation to the business that is spending the money to
            | spend it wisely and build a sensible architecture,
            | especially for business-critical systems.
            | 
            | It's indulgent and irresponsible to do otherwise if standard
            | technology will get you there. The other point is that there
            | are very few applications in 2024 with data storage
            | requirements that cannot be met by Postgres or some other
            | common database, and if there are, then perhaps the
            | architecture should be changed to do things in a way that is
            | compatible with existing data storage systems.
            | 
            | Databases in particular, as the core of a business system,
            | have not only the basic "put data in/get data out"
            | requirement but hundreds of sub-requirements relating to
            | deployment, operations and a million other things. You build
            | a database for that core set/get requirement and before you
            | know it you're wondering how to fulfill that vast array of
            | other requirements.
            | 
            | This happens everywhere in corporate development: some CTO
            | with a personal liking for a technology makes the business
            | use it when in fact the business needs nothing more than
            | ordinary technology, which would avoid all sorts of issues
            | such as recruiting. The CTO moves on, leaving behind a
            | project built with the technology flavor of the month, and
            | the business either struggles to maintain it into the future
            | or has to replace it with a more ordinary way of doing
            | things. Likely even the CTO has by now lost interest in the
            | playtime technology fad that the industry toyed with and
            | decided isn't a great idea.
           | 
           | So I stand by my comment - they are likely to replace this
           | within a few years with something normal, likely Postgres.
        
         | ateng wrote:
         | I used to work at Man, and they have been using ArcticDB
         | (including the earlier iteration) for over 10 years now.
        
         | oxfordmale wrote:
          | They have tons of money, enough to support a development team
          | improving their database. In addition, there is a long legacy,
          | both technical and political, making it hard to even propose
          | getting rid of it. The only likely switch is a gradual
          | decline, with parts of the system moved to another database.
        
       | faizshah wrote:
        | I still didn't get why they built this; there's a better
        | explanation of the feature set in the FAQ's comparison with
        | Parquet: https://docs.arcticdb.io/latest/faq/
       | 
        | > How does ArcticDB differ from Apache Parquet?
        | 
        | > Both ArcticDB and Parquet enable the storage of columnar data
        | without requiring additional infrastructure.
        | 
        | > ArcticDB however uses a custom storage format that means it
        | offers the following functionality over Parquet:
        | 
        | > Versioned modifications ("time travel") - ArcticDB is
        | bitemporal.
        | 
        | > Timeseries indexes. ArcticDB is a timeseries database and as
        | such is optimised for slicing and dicing timeseries data
        | containing billions of rows.
        | 
        | > Data discovery - ArcticDB is built for teams. Data is
        | structured into libraries and symbols rather than raw filepaths.
        | 
        | > Support for streaming data. ArcticDB is a fully functional
        | streaming/tick database, enabling the storage of both batch and
        | streaming data.
        | 
        | > Support for "dynamic schemas" - ArcticDB supports datasets
        | with changing schemas (column sets) over time.
        | 
        | > Support for automatic data deduplication.
       | 
       | The other answer I was looking for was why not kdb since this is
       | a hedge fund.
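        | 
        | To make a couple of those bullet points concrete, here is a
        | sketch (assuming ArcticDB's Python API; the library and symbol
        | names are invented) of the "libraries and symbols" layout and of
        | appending rows to an existing symbol:
        | 
        |     import pandas as pd
        |     from arcticdb import Arctic
        | 
        |     ac = Arctic("lmdb:///tmp/arcticdb")
        | 
        |     # "Data discovery": data is organised into libraries and
        |     # symbols rather than raw file paths.
        |     lib = ac.get_library("equities", create_if_missing=True)
        |     print(ac.list_libraries())
        |     print(lib.list_symbols())
        | 
        |     # Batch write, then append further rows later; the
        |     # streaming/tick use case is an append-heavy variant of this.
        |     idx = pd.date_range("2024-01-01", periods=3)
        |     lib.write("ACME", pd.DataFrame({"close": [1.0, 2.0, 3.0]},
        |                                    index=idx))
        |     more = pd.DataFrame({"close": [4.0]},
        |                         index=pd.date_range("2024-01-04", periods=1))
        |     lib.append("ACME", more)
        |     print(lib.read("ACME").data)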
        
         | mianos wrote:
          | I think people are getting a little tired of being held to
          | ransom by kdb/q and kdb consultants. Even if you have 'oil
          | barrels' of money, eventually it is annoying enough to look
          | elsewhere.
        
           | blitzar wrote:
           | You don't get 'oil barrels' of money if you are in the habit
           | of giving big chunks of it to other people.
           | 
           | Handing over seed capital levels of cash to external vendors
           | inevitably gives someone the idea to seed a competitor
           | instead.
        
         | porker wrote:
          | > Versioned modifications ("time travel") - ArcticDB is
          | bitemporal.
          | 
          | > Timeseries indexes.
         | 
         | That's a nice addition.
        
           | blitzar wrote:
            | For how high a selling point this is, it's something I
            | cannot recall ever using in 5+ years of working with
            | ArcticDB. It's (financial) time series data; the past
            | doesn't change.
            | 
            | Back-adjusting futures contracts or back-adjusting for
            | dividends / splits come to mind as I write this, but I would
            | just reprocess those from the raw data "as of date" if
            | needed.
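            | 
            | For anyone who hasn't used it, the "time travel" being
            | discussed is roughly the as_of read. A sketch, assuming
            | ArcticDB's Python API and made-up data:
            | 
            |     import pandas as pd
            |     from arcticdb import Arctic
            | 
            |     ac = Arctic("lmdb:///tmp/arcticdb")
            |     lib = ac.get_library("tt_demo", create_if_missing=True)
            | 
            |     idx = pd.date_range("2024-01-01", periods=2)
            |     # Assumes a fresh library, so the first write is version 0.
            |     lib.write("sym", pd.DataFrame({"px": [100.0, 101.0]},
            |                                   index=idx))  # version 0
            |     lib.write("sym", pd.DataFrame({"px": [100.0, 999.0]},
            |                                   index=idx))  # version 1
            | 
            |     latest = lib.read("sym").data             # sees the fix
            |     original = lib.read("sym", as_of=0).data  # first version
            |     # as_of can also take a timestamp, i.e. "what did we
            |     # believe at time T?"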
        
       | jmakov wrote:
        | Is there any reason to use that instead of Delta Lake?
        
       ___________________________________________________________________
       (page generated 2024-08-25 09:01 UTC)