[HN Gopher] Apache Pinot 1.0
___________________________________________________________________
Apache Pinot 1.0
Author : PeterCorless
Score  : 36 points
Date   : 2023-09-19 19:44 UTC (3 hours ago)
(HTM) web link (pinot.apache.org)
(TXT) w3m dump (pinot.apache.org)
| gregw2 wrote:
| I poked around trying to find a high level understanding.
|
| Here's the best place to start from what I could tell:
| https://docs.pinot.apache.org/basics/concepts
|
| Based on that, it's an MPP columnar database focused on
| low-latency streaming-ingested/realtimeish use cases, open sourced
| by LinkedIn's infra teams:
|
| _"Pinot is designed to deliver low latency queries on large
| datasets. To achieve this performance, Pinot stores data in a
| columnar format and adds additional indices to perform fast
| filtering, aggregation and group by._
|
| _Raw data is broken into small data shards. Each shard is
| converted into a unit called a segment. One or more segments
| together form a table, which is the logical container for
| querying Pinot using SQL/PQL._
|
| _... Logically, a cluster is simply a group of tenants. As with
| the classical definition of a cluster, it is also a grouping of a
| set of compute nodes. Typically, there is only one cluster per
| environment/data center. There is no need to create multiple
| clusters since Pinot supports the concept of tenants. At
| LinkedIn, the largest Pinot cluster consists of 1000+ nodes
| distributed across a data center. The number of nodes in a
| cluster can be added in a way that will linearly increase
| performance and availability of queries."_
|
| Also per
| https://docs.pinot.apache.org/basics/getting-started/frequen...
|
| _Q: When are new events queryable when getting ingested into a
| real-time table?_
|
| _A: Events are available to queries as soon as they are ingested.
| This is because events are instantly indexed in memory upon
| ingestion._
|
| _The ingestion of events into the real-time table is not
| transactional, so replicas of the open segment are not
| immediately consistent. Pinot trades consistency for availability
| upon network partitioning (CAP theorem) to provide ultra-low
| ingestion latencies at high throughput. However, when the open
| segment is closed and its in-memory indexes are flushed to
| persistent storage, all its replicas are guaranteed to be
| consistent, with the commit protocol._
|
| _... Q: Why are segments not strictly time-partitioned?_
|
| _A: It might seem odd that segments are not strictly
| time-partitioned, unlike similar systems such as Apache Druid.
| This allows real-time ingestion to consume out-of-order events.
| Even though segments are not strictly time-partitioned, Pinot will
| still index, prune, and query segments intelligently by time
| intervals for the performance of hybrid tables and time-filtered
| data. When generating offline segments, the segments are generated
| such that segments only contain one time interval and are well
| partitioned by the time column._
| emmanueloga_ wrote:
| Does anyone understand how the Apache foundation works? Do
| projects receive monetary funding, or is it just the "prestige"
| of becoming an Apache project? What's the advantage of being
| under their umbrella?
|
| At this point, any legitimacy of working with their foundation
| may be lost under the weight of hundreds or even thousands of
| projects of unknown quality levels (I'm not talking about this
| project's merits, which I know nothing about).
| latchkey wrote:
| 20+ year Apache Member here... yea, it is pretty much prestige.
| But you get community, infrastructure, legal, branding, as well
| as mentoring on 'how to do open source'.
|
| It is all pretty well documented. Here are a couple good links
| to get you started...
|
| The Apache Way
|
| https://www.apache.org/theapacheway/
|
| The PMC oversees the projects:
|
| https://www.apache.org/dev/pmc.html
| drewda wrote:
| Apache Foundation provides a well-trod legal path for large
| corporations to release their internal code as open-source.
|
| They do have some competition. Linux Foundation is another
| large non-profit that creates umbrella entities for a bunch of
| open-source software originally created within larger tech
| companies. I get the impression that Apache Foundation goes for
| breadth, taking any and all donations, while Linux Foundation
| goes for depth in specific topics.
|
| In terms of funding, for open-source projects originally
| created within a larger company, that company will often
| provide a financial donation to the foundation that is taking
| on its ongoing management. The foundation will also take a cut
| of future donations to the project, to pay for the
| administrative overhead of the non-profit.
| politelemon wrote:
| So is this similar to Amazon's Athena? I'm trying to place what a
| 'realtime distributed OLAP datastore' is, or competes with, in
| cloudy/naive terms.
| fiddlerwoaroof wrote:
| My impression is that it's in the same space as RedShift,
| Snowflake, Citus, Greenplum, ClickHouse.
| glogla wrote:
| RedShift, Snowflake, Citus, Greenplum and Athena are OLAP
| engines, but not real-time focused. For this one, it is more
| similar to Druid, ClickHouse or RockSet.
|
| The 1.0 version of Pinot seems to bring a lot of maturity; they
| seem to have added a new engine that can do joins now. I'm not
| sure how stable it is, but it seems interesting.
|
| As for what this kind of database is useful for: operational
| analytics on large data that also updates in real time. In my
| domain that would be things like having insight into large
| supply chains or manufacturing operations, like power plants or
| factories, just in general for monitoring stuff.
| I know it's also used in security and finance (for fraud).
| zX41ZdbW wrote:
| It's hardly comparable with ClickHouse. Even loading a table
| with 100M rows is not an easy endeavor in Pinot:
| https://github.com/ClickHouse/ClickBench/blob/main/pinot/ben...
___________________________________________________________________
(page generated 2023-09-19 23:00 UTC)