[HN Gopher] Modern Data Lakes Overview
       ___________________________________________________________________
        
       Modern Data Lakes Overview
        
       Author : developersh
       Score  : 46 points
       Date   : 2020-02-23 17:02 UTC (5 hours ago)
        
 (HTM) web link (developer.sh)
 (TXT) w3m dump (developer.sh)
        
       | effnorwood wrote:
       | *Swamps
        
       | cateye wrote:
       | Isn't it just a paradox to store infinite data, to use it later
       | for very specific things without having to define it first?
       | 
        | It sounds like common sense not to "limit the potential of
        | intelligence by enforcing schema on write", while in reality the
        | same problem just shifts to (or gets hidden in) the next steps.
       | 
        | For example: there are 10 data sources with 100 TB of data each.
        | I aggregate these into my new shiny data lake with a fast adapter
        | and just suck it all in without any worries about schema. So now
        | I have 1 PB of semi-unstructured data.
       | 
        | How do I find the fields X and Y when they are named differently
        | in each of the 10 sources? Can I even find them without business
        | domain experts for each data source? How do I keep things in sync
        | when the structure of my data sources changes (frequently)?
       | 
       | It seems like there is an underlying social/political problem
       | that technology can't really fix.
       | 
        | Reminds me of the quote: "There are only two hard things in
        | Computer Science: cache invalidation and naming things."
        
       | nixpulvis wrote:
        | So where are we on data lakes vs. NewSQL [1]?
       | 
       | [1]: https://en.wikipedia.org/wiki/NewSQL
        
         | ozkatz wrote:
          | Most "NewSQL" databases are designed for OLTP use cases (i.e.
          | many small queries that do little aggregation). Data lakes are
          | optimized for OLAP (i.e. running a smaller number of queries,
          | but aggregating over large amounts of data).
         | 
         | As an example, Athena would do a terrible job at finding a
         | specific user by its ID, while Spanner would behave just as
         | poorly at calculating the cumulative sales of all products for
         | a given category, grouped by store location (assuming many
         | millions of rows representing sales).
         | 
         | Hope this analogy makes sense.
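          | 
          | To make that concrete, here's a rough sketch of the OLAP-style
          | query above run through Athena with boto3 (the database, table
          | and column names are made up for illustration):
          | 
          |   import boto3
          | 
          |   athena = boto3.client("athena")
          | 
          |   # Aggregate sales per store location for one category: a
          |   # scan-heavy query that OLAP engines over a lake handle well.
          |   sql = """
          |       SELECT store_location, SUM(amount) AS total_sales
          |       FROM sales
          |       WHERE category = 'electronics'
          |       GROUP BY store_location
          |   """
          | 
          |   athena.start_query_execution(
          |       QueryString=sql,
          |       QueryExecutionContext={"Database": "analytics"},
          |       ResultConfiguration={"OutputLocation": "s3://results/"},
          |   )
          | 
          | A point lookup ("find user 42") would scan just as much data
          | here, which is why an OLTP store like Spanner is the better fit
          | for that case.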
        
       | sologoub wrote:
        | Not a bad read, but it's written from the perspective of large,
        | mature operations. If your company is just starting out, the
        | advice is actually there but not quite spelled out - use S3/GCS
        | to store data (ideally in Parquet format) and query it using
        | Athena/BigQuery.
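        | 
        | As a minimal sketch of that starting point on the AWS side, one
        | could land Parquet on S3 with pandas/pyarrow (the bucket, prefix
        | and columns here are hypothetical):
        | 
        |   import pandas as pd
        | 
        |   df = pd.DataFrame({
        |       "user_id": [1, 2, 3],
        |       "event": ["signup", "login", "purchase"],
        |   })
        | 
        |   # Needs pyarrow (or fastparquet) plus s3fs for the s3:// URL.
        |   df.to_parquet("s3://my-company-lake/events/events.parquet")
        | 
        | Once files are in the bucket, you point a table definition at the
        | prefix (Glue/Athena on AWS, an external table on BigQuery/GCS)
        | and query it with plain SQL.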
        
         | mattbillenstein wrote:
          | Can you query Parquet from BigQuery without loading it into a
          | table from GCS?
          | 
          | I've gotten pretty far with JSONL on GCS and BigQuery - even
          | some BigQuery streaming for more real-time stuff.
        
           | lmkg wrote:
           | If the data is in Cloud Storage, BigQuery can query it in-
           | place without loading it. BigQuery calls this an External
           | Data Source.
           | 
           | https://cloud.google.com/bigquery/external-data-sources
           | 
            | My biggest papercut with using this was having to make sure
            | that the dataset and bucket locations (regions) matched
            | exactly.
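            | 
            | A rough sketch of defining such an external table with the
            | google-cloud-bigquery client (the project, dataset, bucket
            | and file format here are placeholders):
            | 
            |   from google.cloud import bigquery
            | 
            |   client = bigquery.Client()
            | 
            |   # Point BigQuery at Parquet files sitting in Cloud Storage;
            |   # nothing is loaded, queries read the files in place.
            |   config = bigquery.ExternalConfig("PARQUET")
            |   config.source_uris = ["gs://my-lake/events/*.parquet"]
            | 
            |   table = bigquery.Table("my-project.analytics.events_ext")
            |   table.external_data_configuration = config
            |   client.create_table(table)
            | 
            | The dataset and the bucket have to live in compatible
            | locations, which is where the papercut above comes from.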
        
         | meritt wrote:
         | Fairly new to this topic and coming from a traditional RDBMS
         | background. How do you go about deciding how many rows/records
          | to store per object? And how does Athena/BigQuery know which
          | objects to query? Do people use partitioning methods (e.g. by
          | time or customer ID) to reduce the need to scan the entire
          | corpus every time you run a query?
        
           | dswalter wrote:
            | If you're using AWS Athena for querying, you're also using
            | the AWS Glue catalog (a managed Hive-metastore-ish service)
            | to know where partitions are. But yeah, you'll need to
            | partition and sort your data to make sure you're not doing
            | full table scans.
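            | 
            | As a sketch, the layout and DDL for a partitioned table might
            | look something like this, submitted via boto3 (bucket, table
            | and column names are made up):
            | 
            |   import boto3
            | 
            |   # Files laid out as s3://my-lake/events/dt=2020-02-23/...
            |   ddl = """
            |       CREATE EXTERNAL TABLE events (
            |           user_id BIGINT,
            |           event   STRING
            |       )
            |       PARTITIONED BY (dt STRING)
            |       STORED AS PARQUET
            |       LOCATION 's3://my-lake/events/'
            |   """
            | 
            |   athena = boto3.client("athena")
            |   athena.start_query_execution(
            |       QueryString=ddl,
            |       ResultConfiguration={"OutputLocation": "s3://results/"},
            |   )
            | 
            | Once the partitions are registered in the Glue catalog (MSCK
            | REPAIR TABLE, a crawler, or explicit ADD PARTITION), queries
            | that filter on dt only read the matching prefixes.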
        
           | lmkg wrote:
            | From the Google side: in traditional BigQuery, the answers to
            | all three questions are related. You shard the files by
            | partition key and put the key into the file name. You can
            | filter on the file name in the WHERE clause, and the query
            | will skip the filtered-out objects, but otherwise fully scan
            | every object it touches.
           | 
           | There is apparently now experimental support for using Hive
           | partitions natively. Never used it, literally found out two
           | minutes ago.
           | 
           | The number of records per object is usually "all of them"
           | (restricted by partition keys). The main exception is live
           | queries of compressed JSON or CSV data, because BigQuery
           | can't parallelize them. But generally you trust the tool to
           | handle workload distribution for you.
           | 
           | This works a little differently if you load the data into
           | BigQuery instead of doing queries against data that lives in
           | Cloud Storage. You can use partitioning and clustering
           | columns to cut down on full-table scans.
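            | 
            | For example, with an external table over Cloud Storage you
            | can filter on the _FILE_NAME pseudo-column; the sketch below
            | uses placeholder project/table names and assumes files keyed
            | by date in their paths:
            | 
            |   from google.cloud import bigquery
            | 
            |   client = bigquery.Client()
            | 
            |   # Files named like gs://my-lake/events/dt=2020-02-23/...,
            |   # so the partition key is recoverable from the file name.
            |   sql = """
            |       SELECT COUNT(*) AS events
            |       FROM `my-project.analytics.events_ext`
            |       WHERE _FILE_NAME LIKE '%dt=2020-02-23%'
            |   """
            |   for row in client.query(sql).result():
            |       print(row.events)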
        
       | FridgeSeal wrote:
       | Having spent the last ~8 months at my work grappling with the
       | consequences and downsides of a Data lake, all I want to do is
       | never deal with one again.
       | 
        | Nothing about it was superior or even on par with simply fixing
        | the shortcomings of our current OLAP database setup.
       | 
        | The data lake is not faster to write to; it's definitely not
        | faster to read from. Querying via Athena etc. was slow, painful
        | to use, and broke exceedingly often, and it would have meant so
        | much work stapling schemas back on that we would have been net
        | better off just doing things properly from the start and using a
        | database. The data lake also does not have better access
        | semantics, and our implementation has resulted in some of my
        | teammates practically reinventing consistency from first
        | principles. By hand. Except worse.
       | 
        | Save yourself from this pain: find the right database and figure
        | out how to use it; don't reinvent one from first principles.
        
         | towelpluswater wrote:
          | Completely agree with you. Data lakes were marketed well
          | because, well... data warehousing is hard, and a lot of work.
          | Data lakes don't make that hard work disappear; they just
          | change how and where it happens.
         | 
          | I've found data lakes complement DWs (in databases) well. Keep
          | the raw data in the lake and query it as needed for discovery,
          | and load it into structured tables as the business needs arise.
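          | 
          | As a rough sketch of that "promote when needed" step, using the
          | google-cloud-bigquery client (the bucket, dataset and table
          | names are placeholders):
          | 
          |   from google.cloud import bigquery
          | 
          |   client = bigquery.Client()
          | 
          |   # Load a curated slice of the raw lake data into a managed,
          |   # structured warehouse table once the need is clear.
          |   job = client.load_table_from_uri(
          |       "gs://my-lake/curated/orders/*.parquet",
          |       "my-project.warehouse.orders",
          |       job_config=bigquery.LoadJobConfig(
          |           source_format=bigquery.SourceFormat.PARQUET,
          |       ),
          |   )
          |   job.result()  # wait for the load job to finish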
         | 
         | Data lakes alone are doomed to be failures.
        
           | luckydata wrote:
            | I don't think anyone ever suggested that. The use case for a
            | data lake is precisely the one you describe: it allows you to
            | start collecting data without having to do a lot of work up
            | front, before you know how you actually want to structure
            | things. It allows for schema evolution too. It's not a
            | panacea, just a way to avoid the inertia most large data
            | projects have.
        
             | towelpluswater wrote:
             | Nobody here suggested it, just something I see
             | organizations doing quite often.
        
       | killjoywashere wrote:
       | So, let's say I have a DB of a million rows, anticipate having
       | 100M rows of archived data, then adding 5M rows per year; each of
       | my rows has some metadata and points to an image on the order of
       | 10 gigapixels, in a bucket.
       | 
        | There is presently strong interest in associating this data with
        | other DBs - I am aware of about 80, with a total of probably
        | 500-1000 tables - along with some very old "NoSQL" B-tree
        | datastores in MUMPS. There are new $10M+ projects coming online
        | around the enterprise roughly every day.
       | 
       | Where would you start?
        
       | georgewfraser wrote:
       | Modern data warehouses (Snowflake, BigQuery, and maybe Redshift
       | RA3) have incorporated all the key features of data lakes:
       | 
       | - The cost of storage is the same as S3.
       | 
       | - Storage and compute can be scaled independently.
       | 
       | - You can store multiple levels of curation in the same system: a
       | normalized schema that reflects the source, alongside a
       | dimensional schema that has been thoroughly ETL'd.
       | 
       | - Compute can be scaled horizontally to basically any level of
       | parallelism you desire.
       | 
       | Given these facts, it is unclear what rationale still exists for
       | data lakes. The only remaining major advantage of a data lake is
       | that you aren't subject to as much vendor lock-in.
        
         | cjalmeida wrote:
         | Not being subject to vendor lock-in is huge in itself.
         | 
          | You can save plenty of money if you have the scale to move out
          | of S3. That's important because you can usually trade storage
          | for CPU by storing the data in multiple formats, each optimized
          | for a different access pattern.
          | 
          | But mostly, the Hadoop ecosystem is very open. The tools are
          | still maturing, and it's easier to debug open source tools than
          | to deal with the generally poor support in most managed
          | solutions.
        
       ___________________________________________________________________
       (page generated 2020-02-23 23:00 UTC)