[HN Gopher] Building a new vector based storage model
       ___________________________________________________________________
        
       Building a new vector based storage model
        
       Author : bluestreak
       Score  : 51 points
       Date   : 2021-05-11 13:51 UTC (9 hours ago)
        
 (HTM) web link (questdb.slab.com)
 (TXT) w3m dump (questdb.slab.com)
        
       | bluestreak wrote:
       | We launched QuestDB last summer [1, 2]. Our storage model is
       | vector-based and append-only. This meant that all incoming data
       | had to arrive in the correct time order. This worked well for
       | some use cases but we increasingly saw real-world cases where
       | data doesn't always land at the database in chronological order.
       | We saw plenty of developers and users come and go specifically
       | because of this technical limitation. So it became a priority to
       | deal with out-of-order data.
       | 
       | The big decision was which direction to take to tackle the
       | problem. LSM trees seemed an obvious choice, but we chose an
       | alternative route so we wouldn't lose the performance we spent
       | years building. Our latest release supports out-of-order
       | ingestion by re-ordering data on the fly. That's what this
       | article is about.
       | 
       | Also, we had many people asking about the differences between
       | QuestDB and other open-source databases and why users should
       | consider giving it a try instead of other systems. When we
       | launched on HN, readers showed a lot of interest in side-by-side
       | comparisons to other databases on the market. One suggestion [3]
       | that we thought would be great to try out was to benchmark
       | ingestion and query speeds using the Time Series Benchmark Suite
       | (TSBS) [4] developed by TimescaleDB. We're super excited to share
       | the results in the article.
       | 
       | [1] https://news.ycombinator.com/item?id=23975807
       | 
       | [2] https://news.ycombinator.com/item?id=23616878
       | 
       | [3] https://news.ycombinator.com/item?id=23977183
       | 
       | [4] https://github.com/timescale/tsbs
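
For readers who want a concrete picture of "re-ordering data on the fly" in an append-only, time-ordered store, here is a minimal Python sketch. It is not QuestDB's implementation (the linked article describes that); it only illustrates the general idea of sorting a late-arriving batch and merging it into the affected tail of already-ordered data. The function name `merge_out_of_order` and the `(timestamp, value)` row layout are assumptions made for this example.

```python
from bisect import bisect_right


def merge_out_of_order(committed, batch):
    """Merge an unordered batch into rows that are already sorted by timestamp.

    committed: list of (timestamp, value) tuples, sorted by timestamp.
    batch:     list of (timestamp, value) tuples in arrival order.
    Returns a new list sorted by timestamp. Hypothetical helper, for
    illustration only.
    """
    if not batch:
        return list(committed)

    # Sort the staged rows once; sorted() is stable, so rows with equal
    # timestamps keep their arrival order.
    staged = sorted(batch, key=lambda row: row[0])

    # Rows at or before the earliest staged timestamp cannot be displaced,
    # so only the tail after this split point needs to be rewritten.
    keys = [ts for ts, _ in committed]
    split = bisect_right(keys, staged[0][0])
    prefix, tail = committed[:split], committed[split:]

    # Standard two-way merge of the affected tail with the staged rows.
    merged = []
    i = j = 0
    while i < len(tail) and j < len(staged):
        if tail[i][0] <= staged[j][0]:
            merged.append(tail[i])
            i += 1
        else:
            merged.append(staged[j])
            j += 1
    merged.extend(tail[i:])
    merged.extend(staged[j:])
    return prefix + merged


committed = [(1, "a"), (3, "b"), (5, "c")]
batch = [(4, "x"), (2, "y"), (6, "z")]
result = merge_out_of_order(committed, batch)
```

In this sketch, only the rows after the earliest late timestamp are touched, which is why data that arrives mostly in order stays cheap to ingest.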
        
         | Darkphibre wrote:
         | Oh, this is fascinating. Seven years ago I architected a true-
         | realtime telemetry pipeline with end-to-end sequential
         | guarantees (with roundtrip times <200ms excluding network
         | latencies, and cloud processing times <20ms, leveraging
         | BOND/ProtocolBuffer over AMQP over Websocket). It's still used
         | by every 1st-party game for a large publisher.
         | 
         | It allowed for non-windowed event sequence analytics, enabling
         | realtime feedback (think achievements that have multiple
         | conditions).
         | 
         | And then the requirement was dropped, and (as you've found),
         | everyone just uses it like a standard telemetry stream and is
         | OK with 5-15min bins. :P
         | 
         | I still have a passion for the space, will definitely be
         | reading up on this. I firmly believe this is the future of
         | telemetry analytics. Congratulations on your efforts seeing the
         | light of day!!
         | 
         | Disclaimer, I currently work for Microsoft, all words here are
         | my own and do not necessarily reflect those of my employer,
         | etc. ;)
        
           | j1897 wrote:
           | Thanks for the kind words and your perspective!
        
         | [deleted]
        
       | alcio wrote:
       | Excited to see this new release. Seems to me this would
       | (slightly?) negatively impact query performance for recent data
       | (when the queried data is in both the O3 and persisted zones),
       | is that the case?
        
         | bluestreak wrote:
         | Query performance would be affected insofar as ingest jobs
         | share the same thread pool as query jobs. As I am writing this
         | I am also realising that perhaps we should have an option to
         | separate these jobs... If we ignore resource usage and commit()
         | latency, query performance would remain unaffected. The reader
         | remains lockless and largely unchanged code-wise. One of our
         | major objectives was to keep the data model, as seen by the
         | readers, unchanged. I hope I'm making sense here?
        
       | hartem_ wrote:
       | Congrats on the release! The benchmark results look really
       | impressive :).
       | 
       | Curious to learn more about your approach to verifying the
       | correctness of the implementation. Did you try testing it with
       | Jepsen or something similar?
        
         | bluestreak wrote:
         | Thank you! We are not yet distributed. That's coming right up
         | along with Jepsen-style tests. We are really serious about
         | testing!
        
       ___________________________________________________________________
       (page generated 2021-05-11 23:00 UTC)