[HN Gopher] Modern Data Lakes Overview
___________________________________________________________________

Modern Data Lakes Overview

Author : developersh
Score  : 46 points
Date   : 2020-02-23 17:02 UTC (5 hours ago)

(HTM) web link (developer.sh)
(TXT) w3m dump (developer.sh)

| effnorwood wrote:
| *Swamps

| cateye wrote:
| Isn't it just a paradox to store infinite data so you can use it
| later for very specific things, without having to define it
| first?
|
| It sounds like common sense not to "limit the potential of
| intelligence by enforcing schema on write," while in reality the
| same problem just shifts to (or gets hidden in) the next steps.
|
| For example: there are 10 data sources, each with 100TB of data.
| I aggregate these into my shiny new data lake with a fast
| adapter, sucking it all in without any worries about schema. Now
| I have 1PB of semi-unstructured data.
|
| How do I find the fields X and Y when they are named differently
| in all 10 sources? Can I even find them without business domain
| experts for each data source? How do I keep things in sync when
| the structure of my data sources changes (frequently)?
|
| It seems like there is an underlying social/political problem
| that technology can't really fix.
|
| Reminds me of the quote: "There are only two hard things in
| Computer Science: cache invalidation and naming things."

| nixpulvis wrote:
| So where are we on Data Lakes vs NewSQL [1]?
|
| [1]: https://en.wikipedia.org/wiki/NewSQL

  | ozkatz wrote:
  | Most "NewSQL" databases are designed for OLTP use cases (i.e.
  | many small queries that do little aggregation). Data lakes are
  | optimized for OLAP (i.e. a smaller number of queries, but
  | aggregating over large amounts of data).
  |
  | As an example, Athena would do a terrible job at finding a
  | specific user by its ID, while Spanner would behave just as
  | poorly at calculating the cumulative sales of all products for
  | a given category, grouped by store location (assuming many
  | millions of rows representing sales).
  |
  | Hope this analogy makes sense.

| sologoub wrote:
| Not a bad read, but it's written from the perspective of large,
| mature operations. If your company is just starting out, the
| advice is actually in there but not quite spelled out: use
| S3/GCS to store data (ideally in Parquet format) and query it
| using Athena/BigQuery.

  | mattbillenstein wrote:
  | Can you query Parquet from BigQuery without loading it into a
  | table from GCS?
  |
  | I've gotten pretty far with JSONL on GCS and BigQuery - even
  | some BigQuery streaming for more real-time stuff.

    | lmkg wrote:
    | If the data is in Cloud Storage, BigQuery can query it
    | in-place without loading it. BigQuery calls this an External
    | Data Source.
    |
    | https://cloud.google.com/bigquery/external-data-sources
    |
    | My biggest papercut with using this was having to make sure
    | that all of the locations matched exactly.

  | meritt wrote:
  | Fairly new to this topic and coming from a traditional RDBMS
  | background. How do you go about deciding how many rows/records
  | to store per object? And how does Athena/BigQuery know which
  | objects to query? Do people use partitioning methods (e.g. by
  | time or customer ID) to reduce the need to scan the entire
  | corpus every time a query runs?

    | dswalter wrote:
    | If you're using AWS Athena for querying, you're also using
    | the AWS Glue catalog (a managed, Hive-metastore-ish service)
    | to know where partitions are, but yeah, you'll need to
    | partition and sort your data to make sure you're not doing
    | full table scans.
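A minimal sketch of the partitioned layout dswalter describes,
assuming pyarrow is installed; the bucket, columns, and values are
illustrative:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # A toy sales table; in practice this comes from an ingest job.
    table = pa.table({
        "event_date": ["2020-02-22", "2020-02-22", "2020-02-23"],
        "store_location": ["nyc", "sfo", "nyc"],
        "amount": [9.99, 24.50, 3.25],
    })

    # write_to_dataset lays files out Hive-style, e.g.
    # sales/event_date=2020-02-23/<part>.parquet. Once those
    # directories are registered as partitions in the Glue catalog
    # (say, by a crawler), an Athena query that filters on
    # event_date reads only the matching prefixes instead of the
    # whole corpus.
    pq.write_to_dataset(
        table,
        root_path="sales",  # then sync to s3://example-bucket/sales/
        partition_cols=["event_date"],
    )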
    | lmkg wrote:
    | From the Google side: in traditional BigQuery, the answers
    | to all three questions are related. You shard the files by
    | partition key and put the key into the file name. You can
    | filter on the file name in the WHERE clause, and the query
    | will skip the filtered-out objects, but otherwise fully scan
    | every object it touches.
    |
    | There is apparently now experimental support for using Hive
    | partitions natively. Never used it, literally found out two
    | minutes ago.
    |
    | The number of records per object is usually "all of them"
    | (restricted by partition keys). The main exception is live
    | queries of compressed JSON or CSV data, because BigQuery
    | can't parallelize those. But generally you trust the tool to
    | handle workload distribution for you.
    |
    | This works a little differently if you load the data into
    | BigQuery instead of querying data that lives in Cloud
    | Storage. There you can use partitioning and clustering
    | columns to cut down on full-table scans.
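A sketch of the query-in-place path lmkg describes, assuming the
google-cloud-bigquery client library; the bucket, table alias, and
column names are illustrative:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Describe the Parquet files sitting in Cloud Storage as an
    # external data source; no load job is required.
    external_config = bigquery.ExternalConfig("PARQUET")
    external_config.source_uris = [
        "gs://example-bucket/sales/*.parquet"
    ]

    job_config = bigquery.QueryJobConfig(
        table_definitions={"sales": external_config}
    )

    # _FILE_NAME is a pseudo-column on Cloud Storage external
    # tables, so date-stamped file names can be filtered in the
    # WHERE clause and non-matching objects are skipped.
    query = """
        SELECT store_location, SUM(amount) AS total_sales
        FROM sales
        WHERE _FILE_NAME LIKE '%event_date=2020-02-23%'
        GROUP BY store_location
    """
    for row in client.query(query, job_config=job_config).result():
        print(row.store_location, row.total_sales)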
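And the load-it-in variant from the last paragraph, again only a
sketch with hypothetical project/dataset names, assuming event_date
is a DATE column in the Parquet files:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Load the same files into a managed table, declaring a date
    # partition column and a clustering column so queries that
    # filter on them prune data instead of scanning the full table.
    load_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        time_partitioning=bigquery.TimePartitioning(field="event_date"),
        clustering_fields=["store_location"],
    )

    load_job = client.load_table_from_uri(
        "gs://example-bucket/sales/*.parquet",
        "example-project.analytics.sales",  # hypothetical table
        job_config=load_config,
    )
    load_job.result()  # block until the load job completes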
| FridgeSeal wrote:
| Having spent the last ~8 months at work grappling with the
| consequences and downsides of a data lake, all I want to do is
| never deal with one again.
|
| Nothing about it was superior to, or even on par with, simply
| fixing the shortcomings of our current OLAP database setup.
|
| The data lake is not faster to write to, and it's definitely not
| faster to read from. Querying with Athena etc. was slow, painful
| to use, and broke exceedingly often, and stapling schemas back
| on would have taken so much work that we'd have been net better
| off just doing things properly from the start and using a
| database. The data lake also does not have better access
| semantics, and our implementation has resulted in some of my
| teammates practically reinventing consistency from first
| principles. By hand. Except worse.
|
| Save yourself the pain: find the right database and figure out
| how to use it; don't reinvent one from first principles.

  | towelpluswater wrote:
  | Completely agree with you. Data lakes were marketed well
  | because, well... data warehousing is hard, and a lot of work.
  | Data lakes don't make that hard work disappear; they just
  | change how and where it happens.
  |
  | I've found data lakes complement DWs (in databases) well. Keep
  | the raw data in the lake and query it as needed for discovery,
  | then load it into structured tables as business needs arise.
  |
  | Data lakes alone are doomed to be failures.

    | luckydata wrote:
    | I don't think anyone ever suggested that. The use case for
    | a data lake is precisely the one you describe: it lets you
    | start collecting data without doing a lot of work ahead of
    | time, before you know how you actually want to structure
    | things. It allows for schema evolution too. It's not a
    | panacea, just a way to avoid the inertia most large data
    | projects have.

      | towelpluswater wrote:
      | Nobody here suggested it, just something I see
      | organizations doing quite often.

| killjoywashere wrote:
| So, let's say I have a DB of a million rows, anticipate having
| 100M rows of archived data, and then add 5M rows per year; each
| of my rows has some metadata and points to an image on the order
| of 10 gigapixels, in a bucket.
|
| There is presently strong interest in associating this data with
| other DBs (I'm aware of about 80, with a total of probably
| 500-1000 tables), along with some very old "NoSQL" B-tree
| datastores in MUMPS. New $10M+ projects are coming online around
| the enterprise roughly every day.
|
| Where would you start?

| georgewfraser wrote:
| Modern data warehouses (Snowflake, BigQuery, and maybe Redshift
| RA3) have incorporated all the key features of data lakes:
|
| - The cost of storage is the same as S3.
|
| - Storage and compute can be scaled independently.
|
| - You can store multiple levels of curation in the same system:
| a normalized schema that reflects the source, alongside a
| dimensional schema that has been thoroughly ETL'd.
|
| - Compute can be scaled horizontally to basically any level of
| parallelism you desire.
|
| Given these facts, it is unclear what rationale still exists for
| data lakes. The only remaining major advantage of a data lake is
| that you aren't subject to as much vendor lock-in.

  | cjalmeida wrote:
  | Not being subject to vendor lock-in is huge in itself.
  |
  | You can save plenty of money if you have the scale to move out
  | of S3. That matters because you can usually trade storage for
  | CPU by keeping the data in multiple formats, each optimized
  | for a different access pattern.
  |
  | But mostly, the Hadoop ecosystem is very open. The tools are
  | still maturing, and it's easier to debug open-source tools
  | than to deal with the generally poor support in most managed
  | solutions.
___________________________________________________________________
(page generated 2020-02-23 23:00 UTC)