[HN Gopher] The Rise and Fall of the OLAP Cube (2020) ___________________________________________________________________ The Rise and Fall of the OLAP Cube (2020) Author : smartmic Score : 129 points Date : 2021-07-05 11:18 UTC (11 hours ago) (HTM) web link (www.holistics.io) (TXT) w3m dump (www.holistics.io) | random3 wrote: | Brings up nostalgic memories from my previous life building | distributed analytics systems. We had a fully distributed, | scalable, real-time OLAP cube 10 years ago on top of HBase and | Storm (or any other real-time processing pipeline). It was less | than 30 KLOC and could be used as a library as well as a fully | distributed system. | | There are several OSS solutions, but I still weep every time I | think about the stupid reasons for which we delayed open-sourcing | it. It was such a simple and scalable data model, including many | concepts that were not common at the time. | https://medium.com/hstackdotorg/hbasecon-low-latency-olap-wi... | | I still believe there's room for a similar simple approach, | likely taking advantage of some of the developments related to | Algebird, Calcite, Gandiva, etc. | georgewfraser wrote: | The OLAP cube needs to come back, as an _optimization_ of a data- | warehouse-centric workflow. If you are routinely running queries | like: SELECT dimension, measure FROM | table WHERE filter = ? GROUP BY 1 | | you can save a lot of compute time by creating a materialized | view [1] of: SELECT dimension, filter, measure | FROM table GROUP BY 1, 2 | | and the query optimizer will even automatically redirect your | query for you! At this point, the main thing we need is for BI | tools to take advantage of this capability in the background. | | [1] https://docs.snowflake.com/en/user-guide/views- | materialized.... | motogpjimbo wrote: | Did you mean to write HAVING in your first query? Otherwise | your second query is not equivalent to the first, because the | WHERE will not be performed prior to the aggregation. 
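The pre-aggregation georgewfraser describes can be sketched with SQLite. The `sales` table and its columns below are invented for illustration, and since SQLite has no materialized views, a plain CREATE TABLE AS stands in for one. The trick is sound here only because SUM can be re-aggregated from partial sums; a measure like AVG would have to be carried as a (sum, count) pair instead.

```python
# Sketch of the pre-aggregation idea upthread: instead of scanning the
# full fact table for every query, materialize one GROUP BY over
# (dimension, filter) and answer filtered aggregates from the much
# smaller rollup. Table and column names are made up; SQLite has no
# materialized views, so CREATE TABLE AS emulates one.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, channel TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("us", "web", 10.0), ("us", "web", 20.0),
     ("us", "store", 5.0), ("eu", "web", 7.0)],
)

# The "materialized view": one row per (region, channel) combination.
con.execute("""
    CREATE TABLE sales_rollup AS
    SELECT region, channel, SUM(amount) AS amount
    FROM sales GROUP BY 1, 2
""")

# Original query against the fact table...
base = con.execute(
    "SELECT region, SUM(amount) FROM sales WHERE channel = 'web' GROUP BY 1"
).fetchall()

# ...and the same answer from the rollup, touching far fewer rows.
# Filtering after the rollup is safe because `channel` is a group key.
fast = con.execute(
    "SELECT region, SUM(amount) FROM sales_rollup WHERE channel = 'web' GROUP BY 1"
).fetchall()

assert sorted(base) == sorted(fast)
print(sorted(fast))  # [('eu', 7.0), ('us', 30.0)]
```

This also shows why motogpjimbo's objection matters less once `filter` is included as a grouping column: the WHERE can be applied to the rollup after the fact and still match the base-table answer.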
| buremba wrote: | While the native materialized view feature is a great start, | unfortunately it isn't practically useful if you have data | lineage. Materialized views work like a black box, and they can't | guarantee query performance. | | The new generation of ELT tools such as dbt partially solve this | problem. You can model your data and incrementally update the | tables that can be used in your BI tools. Looker's Aggregate | Awareness is also a great start, but unfortunately it only works | for Looker. | | We try to solve this problem with metriql as well: | https://metriql.com/introduction/aggregates The idea is to | define these measures & dimensions once and use them everywhere: | your BI tools, data science tools, etc. | | Disclaimer: I'm the tech lead of the project. | glogla wrote: | Is that related to lightdash.com somehow? | | It seems like a very similar technology, and the webpages | for both are almost identical. | buremba wrote: | We both use dbt as the data modeling layer, but we don't | actually develop a standalone BI tool. Instead, we | integrate with third-party BI tools such as Data Studio, | Tableau, Metabase, etc. | | We love Looker and wanted to bring the LookML experience to | existing BI tools rather than introducing a new BI tool; | that's how metriql was born. I believe that Lightdash is a | cool project, especially for data analysts who are | extensively using dbt, but metriql targets users who are | already using a BI tool. I'm not sure which pages are | identical, can you point me to them? | glogla wrote: | Compare https://metriql.com/introduction/creating- | datasets and https://docs.lightdash.com/guides/how-to- | create-metrics | | I thought you were affiliated somehow, but looking at it | now, it seems you just use the same documentation website | generator :) | mulmen wrote: | But those queries aren't equivalent, so how is anything saved by | materializing the second one? | | e: | | I believe (I could be wrong!) 
you edited the second query from | SELECT dimension, measure FROM table GROUP BY 1 | | to | SELECT dimension, filter, measure | FROM table GROUP BY 1, 2 | | This addresses the filtering, but how is that any different from | the original table? Presumably `table` could have been a finer | grain than the filter and dimension, but you'd do better to add | the rest of the dimensions as well, at which point you're most | of the way to a star schema. | | This kind of pre-computed aggregate is typical in data | warehousing. But is it really an "OLAP cube"? | | In general I agree there is value in the methods of the past, | and we would be well served to adapt those concepts to our work | today. | georgewfraser wrote: | It's much smaller than the original table. If you compute | lots of these, then voila, you have an OLAP cube. | motogpjimbo wrote: | He did edit his comment, but unfortunately didn't acknowledge | the edit. | _dark_matter_ wrote: | looker does exactly this, though you do have to specify which | dimensions to aggregate: https://docs.looker.com/data- | modeling/learning-lookml/aggreg... | unklefolk wrote: | I'm not a big fan of OLAP cubes / data warehouses, but it should | be noted that these aren't always just about scale and speed. | They also serve to provide a single view of the data when the | data is held across multiple databases / systems. Your orders | may be in PostgreSQL, your inventory tracked in MySQL, etc. So | bringing all that data into a single database for analytics can | be valuable. | chipsa wrote: | This isn't against data warehouses, but just the cubes. | Bringing everything into a single database for analysis is | still required. But you extract, load, then transform, instead | of extract, transform, load. | rossdavidh wrote: | While that's true, it feels to me that your comment made this | point a lot more clearly than the article did. | AtlasBarfed wrote: | Which is why hadoop took off? 
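georgewfraser's "compute lots of these and, voila, an OLAP cube" can be made concrete: a cube is one rollup per subset of the dimension columns, the way SQL's GROUP BY CUBE materializes them. A toy sketch with invented data (in a real warehouse each rollup would be its own aggregate table):

```python
# Rough sketch of "lots of pre-computed aggregates == an OLAP cube":
# materialize SUM(amount) for every subset of the dimension columns.
# Data and column names are illustrative only.
from itertools import combinations
from collections import defaultdict

rows = [
    {"region": "us", "channel": "web", "amount": 10.0},
    {"region": "us", "channel": "store", "amount": 5.0},
    {"region": "eu", "channel": "web", "amount": 7.0},
]
dims = ("region", "channel")

cube = {}  # maps a dimension subset to its rollup table
for r in range(len(dims) + 1):
    for subset in combinations(dims, r):
        agg = defaultdict(float)
        for row in rows:
            key = tuple(row[d] for d in subset)
            agg[key] += row["amount"]
        cube[subset] = dict(agg)

print(cube[()])                     # grand total: {(): 22.0}
print(cube[("region",)])            # {('us',): 15.0, ('eu',): 7.0}
print(cube[("region", "channel")])  # finest grain: 3 cells
```

With k dimensions there are 2^k such rollups, which is exactly why classic cubes struggled to scale out as dimensionality grew.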
| debarshri wrote: | I am not a subject matter expert, but I spent the first few years | of my career doing data warehousing. I think OLAP cubes are | falling out of fashion because traditional BI and reporting in | general are falling out of fashion. | | With self-service BI tools like Looker, business folks are | running more ad-hoc queries rather than relying on data | traditionally piped through ETL processes. | dang wrote: | One past thread: | | _The Rise and Fall of the OLAP Cube_ - | https://news.ycombinator.com/item?id=22189178 - Jan 2020 (53 | comments) | gengstrand wrote: | My first exposure to OLAP was on a team developing a front end to | Essbase that semi-automated report generation after reloading the | cube. Since then, I have worked with various OLAP technologies | from Microsoft to Mondrian. | | I agree with Mr. Chin, the author of this blog. OLAP cubes | can scale up but not out, which limits their potential size, so | they can't really accommodate big data scenarios. My previous | company used Apache Druid instead. Druid can do a lot of what | OLAP did, but using a distributed approach. Although Druid does | not provide tight integration with spreadsheets, it does provide | near real-time ingestion, on-the-fly rollup, and first-class time | series analysis. | iblaine wrote: | I'm a bit opinionated on data modelling, so a couple of points. | | 1. "OLAP Cubes" arguably belong to Microsoft and refer to SQL | Server cubes that require MDX queries. It's a solution given to | us by Microsoft that comes with well-understood features. "OLAP" | itself is a general term used to describe any database used for | analytical processing, so OLAP has a wide range of uses. | | 2. OLAP Cubes (as defined above) started to decline in | popularity around 2015 (I'd argue). | | 3. Any solution to fixing "OLAP" that comes from a commercial | vendor is suspect. As painful as Kimball & Inmon are, they are | ideas that don't require vendor lock-in. | | 4. 
At my current company, we recently wrapped up a process where | we set out to define the future of our DW. Our DW was encumbered | by various design patterns and contractors that came and went | over the past decade. We analyzed the various tables & ETLs to | come up with a clear set of principles. The end result looks | eerily close to Kimball, but with our own naming conventions. Our | new set of principles clearly solves our business needs. The point | being: you don't need Kimball, Inmon, Looker, etc. to solve data | modelling problems. | | 5. Columnar databases are so pervasive these days that you should | no longer need to worry about a lack of options when choosing the | right storage engine for your OLAP db. | | 6. More and more data marts are being defined by BI teams | with little modeling/CS skill. To that end, I think it's important | to educate your BI teams as a means to minimize the fallout, and | to be accommodating in your DW as a means to work efficiently with | your BI teams. This is to say: settle on a set of data modelling | principles that work for you, but may not work for someone else. | willvarfar wrote: | I think the article is spot on about workflow but off on tech. | Under the hood, cubes will be back. | | Snowflake is the cubeless column store. | | And now Firebolt comes along, promising orders-of-magnitude | improvements. How? Largely "join indexes" and "aggregation | indexes", which seem to smell like cubes. | | So take the modern column store and slap some cubes in front. | | In Firebolt this seems to be added by programmers. But I think | the future will see OLAP MPP engines that transparently pick | cubes, maintain them, reuse them, and discard them all | automatically. | dominotw wrote: | > How? Largely "join indexes" and "aggregation indexes" which | seem to smell like cubes. | | Just curious: what stops Snowflake from adding these too? | Trying to understand what innovation Firebolt is doing. 
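willvarfar's idea of engines that transparently pick and reuse cubes can be caricatured in a few lines: route an aggregate query to any pre-built rollup whose grouping columns cover the request (safe for an additive measure like SUM), and fall back to scanning the base rows otherwise. All names and data here are invented; real systems such as Snowflake's materialized-view rewrite or Firebolt's aggregating indexes do this inside the query optimizer, not in application code.

```python
# Toy "transparent cube picking": answer a GROUP BY from a pre-built
# rollup when one covers the requested dimensions, else scan the base rows.
from collections import defaultdict

def scan_aggregate(rows, dims, measure):
    # Baseline: full scan of the base table.
    agg = defaultdict(float)
    for row in rows:
        agg[tuple(row[d] for d in dims)] += row[measure]
    return dict(agg)

def answer(rows, rollups, dims, measure):
    # A rollup grouped by a superset of `dims` can be re-aggregated,
    # because SUM is additive.
    for rdims, table in rollups.items():
        if set(dims) <= set(rdims):
            agg = defaultdict(float)
            for key, val in table.items():
                named = dict(zip(rdims, key))
                agg[tuple(named[d] for d in dims)] += val
            return dict(agg)
    return scan_aggregate(rows, dims, measure)  # fallback: full scan

rows = [
    {"region": "us", "channel": "web", "amount": 10.0},
    {"region": "us", "channel": "store", "amount": 5.0},
    {"region": "eu", "channel": "web", "amount": 7.0},
]
rollups = {
    ("region", "channel"): scan_aggregate(rows, ("region", "channel"), "amount")
}

# Served from the rollup, no base-table scan:
print(answer(rows, rollups, ("region",), "amount"))
# {('us',): 15.0, ('eu',): 7.0}
```

The hard parts the comment alludes to (deciding which rollups to keep, maintaining them under inserts, discarding cold ones) are exactly what would make such an engine interesting.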
| elthor89 wrote: | "Be suspicious of companies that are heavily locked into the OLAP | cube workflow. (Learn how to do this here)." | | The opposite is true too. Be suspicious of companies trying to | lock you into their self-service data visualization tool. I | have seen many BI tool vendors try to lock their clients into | their tool. As a result, they end up running unnecessarily | complex models. | | I find OLAP a pretty good mental model that developers and | users can understand. | | The sheer amount of handwritten, tailor-made ETL or ELT pipelines | is something I would like to see automated or replaced with | something better. | glogla wrote: | Yup. At my job we ended up with some far-too-important stuff | "modelled in the BI tool" in a way that nobody could understand how | it actually worked, and we had to do a full rewrite to model it in | SQL. | | But not all BI tools are alike - something like Superset, which | internally uses SQL and expects users to do self-service | transformation using views, is easy to industrialize since you | already have the queries. Something like Tableau, not so much. | | (If only Superset were more mature) | [deleted] | [deleted] | malkia wrote: | At a previous job, on a human ads evaluation team, we had a big | row table that got turned into a column-oriented one every couple | of hours. I think it got close to a million columns (expanded from | leaf protobuf fields). We had some black/white listing as the | schema grew too big (~20GB? just the schema, and the limits were | a few gigs). That was years ago, so it has probably changed a lot. | | But the use case was for various analysts (statisticians, | linguists, etc., or any other internal teams using the table). | As their queries might use any column they like - there was | probably a way to track which columns got used over time, | but since it was used for launching new things, you would not | know - better be safe than sorry I guess. 
| | Anyway, I'm still fascinated that this terabyte thing got | converted using 20k machines for a few hours and worked! (It | wasn't streaming; it was converting over and over from the | beginning. This might've changed too!) | intrasight wrote: | I've been doing OLAP for over 20 years, so I can say with | authority that OLAP is generally a bag of hurt. The tooling | sucks, the languages suck, the costs suck. The name itself harks | back to seemingly a time before computers. "Online"? As opposed | to paper tablets? | [deleted] | myohmy wrote: | OLAP is legacy hell. I built an MDX query to solve a report | problem, but now I can't replace it because my seniors failed | to deconstruct it, because it is stupidly difficult to wrap your | head around. So now it sits in limbo until I can find time to | do the work myself. Let me just take a look at my calendar and | find my next free week... oh, there is none. | goodpoint wrote: | Online as opposed to batch. | intrasight wrote: | I know. But still, you must agree, it's a stupid name. The rest | of the IT industry would say "real-time vs batch". And OLAP is | batch for most implementations. | purerandomness wrote: | The name was coined way before "online" meant what it means | today. | bradleyjg wrote: | Real-time is engine control software, not a webpage with a | spinner. | darepublic wrote: | I remember, on a dark gloomy day, going to the bookstore and | picking up a book about SQL Server report server. Some of the | toughest times of my life. | crazydoggers wrote: | I always felt the value of an OLAP cube was uncovering questions | you may not know to ask. If you've got a cube with a good | interface, it means more people in your organization can inspect | the data, looking for trends across dimensions that might not be | obvious. As an example, in a previous organization, someone just | playing with an OLAP cube once found that certain DUI searches | became much more common on weekend mornings (perhaps obvious in | retrospect). 
That insight allowed better targeting and sales, and | made concrete what otherwise was just a guess. | | Sometimes just exploring the data allows stuff like that to pop | out. | | Are there similar interfaces with columnar stores? Or do all the | analytics need to be pre-compiled? The ability to | slice/dice/filter/aggregate your data in perhaps non-obvious ways | is really the value of business analytics, in my opinion. | mulmen wrote: | I don't think this benefit is unique to OLAP _cubes_. This is | just generally the reason you have a data warehouse at all. The | implementation is up to you, but if you model the _business | process_ then you will be ready to answer questions that have | not been asked. | lou1306 wrote: | The author mentions the Codd-OLAP conflict of interest, lamenting | that despite what happened "Codd is still regarded as the father | of the relational database". But this remark sounds like a non | sequitur to me. No matter the scandal he got into in the 90s, he | still invented relational algebra back in the 70s, didn't he? | data_ders wrote: | the "scandal": | | > coined the term Online analytical processing (OLAP) and wrote | the "twelve laws of online analytical processing". Controversy | erupted, however, after it was discovered that this paper had | been sponsored by Arbor Software (subsequently Hyperion, now | acquired by Oracle), a conflict of interest that had not been | disclosed, and Computerworld withdrew the paper. | | https://en.m.wikipedia.org/wiki/Edgar_F._Codd | brazzy wrote: | He published a _scientific paper_ that was to some degree | paid advertising, and didn't disclose the payment. | | That kind of thing still totally counts as a scandal. | ryanmarsh wrote: | Indeed, but was the paper wrong? | skeeter2020 wrote: | It doesn't matter. The undisclosed conflict of interest | should eliminate it from consideration. 
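On crazydoggers' question above ("Are there similar interfaces with columnar stores?"): with columns held as parallel arrays, ad-hoc slice-and-dice needs no pre-compiled aggregates, and a query touches only the columns it mentions. A minimal sketch with an invented search-log table, echoing the DUI example (a real engine would vectorize this loop):

```python
# Ad-hoc slice/dice over columnar data: filter on one column, group by
# another, aggregate a third -- no pre-computed cube required.
# The data is made up to mirror the DUI-searches anecdote upthread.
from collections import defaultdict

columns = {
    "day":   ["sat", "sat", "mon", "sun"],
    "hour":  [9, 22, 9, 10],
    "query": ["dui lawyer", "pizza", "dui lawyer", "dui lawyer"],
    "count": [40, 15, 3, 35],
}

def slice_dice(cols, filter_col, pred, group_col, measure_col):
    # Touch only the three columns the query mentions -- the columnar win.
    agg = defaultdict(int)
    for f, g, m in zip(cols[filter_col], cols[group_col], cols[measure_col]):
        if pred(f):
            agg[g] += m
    return dict(agg)

# "How do DUI searches break down by day?" -- an exploratory query.
print(slice_dice(columns, "query", lambda q: q == "dui lawyer",
                 "day", "count"))
# {'sat': 40, 'mon': 3, 'sun': 35} -- the weekend spike pops out.
```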
| coldacid wrote: | > If you're running a query like "give me the average price of | all transactions over the past 5 years", a relational database | would have to load all the rows from the previous 5 years even | though it merely wants to aggregate the price field; a columnar | database would only have to examine one column -- the price | column. | | Wouldn't that be _two_ columns? | glogla wrote: | Yeah, but it's nitpicky :) | | Maybe "past 5 years" means "over everything", and then it might | technically be one column. | | Or if you're using something like S3 with Parquet as the | "columnar database" and it is partitioned by date, then the date | values are stored in metadata - so you would logically read two | columns but physically only read one from storage. Same story | for something like Redshift using date as the sortkey. | citrin_ru wrote: | > Wouldn't that be two columns? | | Columnar databases can also have indices. If there is an index | by date (or time), then your DB will know the row range from the | index and will read only the price column within the given row | range. If there is no index by date, it would be two columns, | but that is still much less than a full row with many columns. | Animats wrote: | If the number of transactions is small enough to query quickly, | a simple SQL query will give you that. If the number of | transactions is very large, querying a random sample of perhaps | 1 in 1000 will give you reasonably accurate results. You can | calculate the probable error from the sample size. | | There are questions which require examining all the data, but | often you don't really need to. | ryanmarsh wrote: | You can pry the star schema database from my cold dead hands. You | can get a lot of work done, without much effort, in multi- | dimensional row stores with bare SQL. The ETLs are easier to | write than the map-reduce jobs I've seen on columnar stores. ETL | pipelines get the data into the structure I need before I begin | analysis. 
ELT requires me to know too much about the original | structure of the data. Sure, that's useful when you find a novel | case, but shouldn't that be folded back into the data warehouse | for everyone? | garethrowlands wrote: | Columnar stores speak SQL these days, though they may map- | reduce under the hood. ELT vs ETL just moves some | transformation into SQL; the data still ends up in the data | warehouse. | mulmen wrote: | I absolutely love Looker for this reason. It understands | foreign keys, so if you model everything as stars it just writes | SQL for you. | | So simple, so powerful. I wish something open source would do | this painfully obvious thing. | marcinzm wrote: | > Sure, that's useful when you find a novel case, but shouldn't | that be folded back into the data warehouse for everyone? | | And why exactly isn't this possible to do in the data warehouse? | Almost every company seems to use dbt, so even analysts can | write transforms to generate tables and views in the warehouse | from other tables. Hell, even Fivetran lets you run dbt or | transformations after loading data. | eyeball wrote: | One big table combining all of your facts and dimensions, for | me. | | Less compute time, and less time for business users, since they | don't need to figure out star schema joins. Storage is cheap. | RantyDave wrote: | I'm not sure we can call NoSQL "a fad that has gone away". | ridaj wrote: | I would say yes, it's now possible to do analytics without | cubes, except when butting up against two things: | | * Really big data (~ PB/day), which makes it impractical to | process every analytics query by going back to the source | | * User data subject to GDPR-like retention constraints, where | you cannot usually hold on to the base data for as long as you | need the analytics | didibus wrote: | One thing about columnar stores that annoys me is that when you | do want to retrieve the full records for something you've found, | it's really slow. 
I wish there were something of a hybrid. | mwexler wrote: | "We model data according to rigorous frameworks like Kimball or | Inmon because we must regularly construct OLAP cubes for our | analyses." | | Actually, no. We model our data this way so it can be used for | business decisions. It doesn't take long for any entity of any | scale to discover that the logging and eventing done for | heartbeat status, debugging, and scaling is just different from | what you need to make a variety of business decisions. | | You can solve this with more events at different scales (button | clicks nested in screen views), or pick events or event rollups | that appear to be clean business stages ("completed checkout"), | but still, your finance team and marketing group all have | different needs. | | So, you decide to have some core shared metrics, derived and | defined, and make them usable by everyone. Folks agree on the | definitions, and thanks to ease and trust, you see more data | supporting more decisions. | | You discover that some folks are doing 10-table joins to get an | answer; it's fast but difficult to extend to new questions. You | decide to build a view that solves some of these pains, | refactoring to allow a better time dimension. Your version links | with the metrics you created, and the resulting queries shed tons | of CTEs while becoming readable to the average user. | | And now you have some ELT pipelines and some event transforms | that result in counts and filters that map nicely to your | business needs but still allow you to get at the atomic raw data, | and you and your teams start to trust in consistent results. Your | metrics are mostly clearly summable, and the ones that aren't are | in table views that pre-calculate the "daily uniques" or other | metrics that may need a bit of special handling. | | You've started modeling your data. | | No, we don't need OLAP cubes. But we do need some type of rigor | around analytic data. 
Otherwise, why go to all the trouble to | collect it, count it, and predict from it, if it may be wrong | with no measure of that uncertainty? | | And yeah, Kimball et al. are from a world where OLAP was the | answer, but it turns out they solved a broader set of SQL | problems. So, it's worth learning the good, tossing the dated, | and seeing what good data modeling can do for your analysis and | predictions. | glogla wrote: | > Actually, no. We model our data this way so it can be used | for business decisions | | Yeah, that would be my disagreement as well. | | Sure, some of Kimball might be obsoleted by modern technology. | But I don't think the point of Kimball is cubes, and even if it | were, what better is there? | | I'd be really interested in what else is out there that is more | modern and suited to the modern world. | | Data Vault? That probably isn't it, most of all because it's | all about the "platform to build your Kimball on" but not the | actual business modelling. (But the parts about the Metrics | Vault and Quality Vault and the "service stuff" are really | good.) | | Anchor Modelling? That one seems, looking with 2021 eyes, like a | way to implement columnar storage and schema-on-read in MS SQL | Server using stored procedures... which is probably not | actually a good idea. | | The Puppini Bridge, aka the Unified Star Schema, seems like an | interesting concept, especially in the self-service space. But | even its proponents warn you it doesn't scale performance-wise, | and it is also kind of an incremental change on Kimball. (But | the way the bridge table is kind of an adjacency matrix of the | graph of your objects tickles my inner computer scientist's | fancy.) | | So really, what else is there? | mwexler wrote: | BTW, their follow-up piece at | https://www.holistics.io/blog/olap-is-not-olap-cube/ is a nicer | drill-down into some of these issues. Also a good read. | tresil wrote: | Could not have written this any better. I've been in the data | business for over 10 years. 
A number of technologies have come | and gone during that time, but dimensional modeling and solid | SQL skills keep delivering remarkable value. | chrstr wrote: | Indeed, it is generally good for analytics to have a conceptual | data model that is reasonably efficient both on the technical | side (to quickly get answers for queries), and also on the | logical side (to quickly translate questions to queries). | | This model doesn't need to be OLAP cubes, but it's also not | that easy to find something better. ___________________________________________________________________ (page generated 2021-07-05 23:00 UTC)