[HN Gopher] Apache Pinot 1.0
       ___________________________________________________________________
        
       Apache Pinot 1.0
        
       Author : PeterCorless
       Score  : 36 points
       Date   : 2023-09-19 19:44 UTC (3 hours ago)
        
 (HTM) web link (pinot.apache.org)
 (TXT) w3m dump (pinot.apache.org)
        
       | gregw2 wrote:
       | I poked around trying to find a high level understanding.
       | 
       | Here's the best place to start from what I could tell:
       | https://docs.pinot.apache.org/basics/concepts
       | 
       | Based on that, it's an MPP columnar database focused on
       | low-latency, streaming-ingested/realtime-ish use cases, open
       | sourced by LinkedIn's infra teams:
       | 
       |  _"Pinot is designed to deliver low latency queries on large
       | datasets. To achieve this performance, Pinot stores data in a
       | columnar format and adds additional indices to perform fast
       | filtering, aggregation and group by._
       | 
       |  _Raw data is broken into small data shards. Each shard is
       | converted into a unit called a segment. One or more segments
       | together form a table, which is the logical container for
       | querying Pinot using SQL/PQL._
       | 
       |  _... Logically, a cluster is simply a group of tenants. As with
       | the classical definition of a cluster, it is also a grouping of a
       | set of compute nodes. Typically, there is only one cluster per
       | environment/data center. There is no need to create multiple
       | clusters since Pinot supports the concept of tenants. At
       | LinkedIn, the largest Pinot cluster consists of 1000+ nodes
       | distributed across a data center. The number of nodes in a
       | cluster can be added in a way that will linearly increase
       | performance and availability of queries."_
       | 
       | Also per
       | https://docs.pinot.apache.org/basics/getting-started/frequen...
       | 
       |  _Q: When are new events queryable when getting ingested into a
       | real-time table?_
       | 
       |  _A: Events are available to queries as soon as they are
       | ingested. This is because events are instantly indexed in memory
       | upon ingestion._
       | 
       |  _The ingestion of events into the real-time table is not
       | transactional, so replicas of the open segment are not
       | immediately consistent. Pinot trades consistency for availability
       | upon network partitioning (CAP theorem) to provide ultra-low
       | ingestion latencies at high throughput. However, when the open
       | segment is closed and its in-memory indexes are flushed to
       | persistent storage, all its replicas are guaranteed to be
       | consistent, with the commit protocol._
       | 
       |  _... Q: Why are segments not strictly time-partitioned?_
       | 
       |  _A: It might seem odd that segments are not strictly time-
       | partitioned, unlike similar systems such as Apache Druid. This
       | allows real-time ingestion to consume out-of-order events. Even
       | though segments are not strictly time-partitioned, Pinot will
       | still index, prune, and query segments intelligently by time
       | intervals for the performance of hybrid tables and time-filtered
       | data. When generating offline segments, the segments are
       | generated such that segments only contain one time interval and
       | are well partitioned by the time column._
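
The passages quoted above describe segments that store data column by column, add indices for fast filtering and aggregation, and are pruned by time interval even though they are not strictly time-partitioned. A minimal Python sketch of that idea follows. It is a toy model under stated assumptions, not Pinot's actual implementation or API: the `Segment` class, `sum_where`, `query_table`, and the sample columns are all hypothetical names made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical toy segment: one list per column plus optional inverted indexes."""
    columns: dict          # column name -> list of values (columnar layout)
    time_column: str
    inverted: dict = field(default_factory=dict)  # column -> value -> row ids

    def __post_init__(self):
        times = self.columns[self.time_column]
        # Min/max time metadata lets a query skip the whole segment.
        self.min_time, self.max_time = min(times), max(times)

    def build_inverted_index(self, column):
        index = defaultdict(list)
        for row_id, value in enumerate(self.columns[column]):
            index[value].append(row_id)
        self.inverted[column] = dict(index)

    def sum_where(self, metric, column, value, start, end):
        """SUM(metric) WHERE column = value AND start <= time < end."""
        # Prune: the segment's time range does not overlap the query interval.
        if self.max_time < start or self.min_time >= end:
            return 0
        if column in self.inverted:
            row_ids = self.inverted[column].get(value, [])
        else:
            # No index on this column: fall back to scanning it.
            row_ids = [i for i, v in enumerate(self.columns[column]) if v == value]
        times = self.columns[self.time_column]
        metrics = self.columns[metric]
        return sum(metrics[i] for i in row_ids if start <= times[i] < end)


def query_table(segments, metric, column, value, start, end):
    """A table is modeled as a list of segments; per-segment results are summed."""
    return sum(s.sum_where(metric, column, value, start, end) for s in segments)


# Made-up sample data. seg_b holds a late event (ts=90), so its time range
# overlaps seg_a's: the segments are not strictly time-partitioned.
seg_a = Segment({"country": ["US", "DE", "US"], "clicks": [3, 5, 4],
                 "ts": [100, 110, 120]}, time_column="ts")
seg_b = Segment({"country": ["US", "FR"], "clicks": [7, 1],
                 "ts": [200, 90]}, time_column="ts")
for seg in (seg_a, seg_b):
    seg.build_inverted_index("country")
segments = [seg_a, seg_b]

us_clicks = query_table(segments, "clicks", "country", "US", 100, 150)
fr_clicks = query_table(segments, "clicks", "country", "FR", 0, 95)
```

Here the second query skips `seg_a` entirely based on its min/max time, while `seg_b` still has to be consulted because its out-of-order event widens its time range. That is the trade-off the FAQ answer describes.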
        
       | emmanueloga_ wrote:
       | Does anyone understand how the Apache foundation works? Do
       | projects receive monetary funding, or is it just the "prestige"
       | of becoming an Apache project? What's the advantage of being
       | under their umbrella?
       | 
       | At this point, any legitimacy of working with their foundation
       | may be lost under the weight of hundreds or even thousands of
       | projects of unknown quality levels (I'm not talking about this
       | project's merits, which I know nothing about).
        
         | latchkey wrote:
         | 20+ year Apache Member here... yea, it is pretty much prestige.
         | But you get community, infrastructure, legal, branding, as well
         | as mentoring on 'how to do open source'.
         | 
         | It is all pretty well documented. Here are a couple good links
         | to get you started...
         | 
         | The Apache Way
         | 
         | https://www.apache.org/theapacheway/
         | 
         | The PMC oversees the projects:
         | 
         | https://www.apache.org/dev/pmc.html
        
         | drewda wrote:
         | Apache Foundation provides a well-trod legal path for large
         | corporations to release their internal code as open-source.
         | 
         | They do have some competition. Linux Foundation is another
         | large non-profit that creates umbrella entities for a bunch of
         | open-source software originally created within larger tech
         | companies. I get the impression that Apache Foundation goes for
         | breadth, taking any and all donations, while Linux Foundation
         | goes for depth in specific topics.
         | 
         | In terms of funding, for open-source projects originally
         | created within a larger company, that company will often
         | provide a financial donation to the foundation that is taking
         | on its ongoing management. The foundation will also take a cut
         | of future donations to the project, to pay for the
         | administrative overhead of the non-profit.
        
       | politelemon wrote:
       | So is this similar to Amazon's Athena? I'm trying to place what a
       | 'realtime distributed OLAP datastore' is, or competes with, in
       | cloudy/naive terms.
        
         | fiddlerwoaroof wrote:
         | My impression is that it's in the same space as RedShift,
         | Snowflake, Citus, Greenplum, ClickHouse.
        
           | glogla wrote:
           | RedShift, Snowflake, Citus, Greenplum and Athena are OLAP
           | engines, but not real-time focused. For this one, it is more
           | similar to Druid, ClickHouse or RockSet.
           | 
           | The 1.0 version of Pinot seems to bring a lot of maturity;
           | they seem to have added a new engine that can do joins now.
           | I'm not sure how stable it is, but it seems interesting.
           | 
           | As for what this kind of database is useful for, this is for
           | operational analytics on large data that also update in real
           | time. In my domain that would be things like having insight
           | into large supply chains or manufacturing operations, like
           | power plants or factories, just in general for monitoring
           | stuff. I know it's also used in security and finance (for
           | fraud).
        
           | zX41ZdbW wrote:
           | It's hardly comparable with ClickHouse. Even loading a table
           | with 100M rows is not an easy endeavor in Pinot:
           | https://github.com/ClickHouse/ClickBench/blob/main/pinot/ben...
        
       ___________________________________________________________________
       (page generated 2023-09-19 23:00 UTC)