[HN Gopher] Shopify's Data Science and Engineering Foundations (...
       ___________________________________________________________________
        
       Shopify's Data Science and Engineering Foundations (2020)
        
       Author : mooreds
       Score  : 102 points
       Date   : 2022-03-11 18:09 UTC (4 hours ago)
        
 (HTM) web link (shopify.engineering)
 (TXT) w3m dump (shopify.engineering)
        
       | kevinsundar wrote:
       | Having recently worked on a data team at FAANG, I can say all
       | this is an ops nightmare for the team running the platform
       | itself if you want to ensure data quality for everyone querying
       | the data. I'm talking about when you have hundreds of data
       | sources and hundreds of query use cases.
       | 
       | Anyone have any solutions you've tried?
        
         | atwebb wrote:
         | FAANG seems to be an outlier, but it sounds a lot like the
         | enterprise data mart strategy covered under a mix of stuff from
         | principle #1.
         | 
         | If you want quality, you need structure and review. Accessible
         | data is helpful and needed to develop some of the mature
         | processes, but for most day-to-day analysis/reporting, no one
         | wants to create their own data model from scratch.
         | 
         | A lot of what holds at FAANG doesn't apply to other companies,
         | so it may just be a case of having a wholly unique use case.
         | Though I'm surprised there isn't something already in place at
         | this point (of course, I have very little knowledge of the
         | case). The dims/facts/marts tend to be focused on business use
         | cases rather than on sources/data, which can reduce the number
         | of targets significantly, since business use cases tend to
         | repeat (or rhyme).
        
         | bushbaba wrote:
         | Check out Apache Iceberg. It does a great job of handling many
         | readers and few writers, with data consistency and query
         | consistency.
         | 
         | It's a great approach for your data lake and data warehousing
         | needs.
        
           | faizshah wrote:
           | This time travel/rollback feature is really interesting:
           | https://iceberg.apache.org/docs/latest/spark-queries/#time-t...
        
       | xhevahir wrote:
       | I've read stuff before about Shopify's use of Nix. Since this
       | post doesn't mention Nix, I take it they don't use it in this
       | department of the company?
        
       | csears wrote:
       | It sounds like they have data science and data engineering in one
       | organization. Is that team structure something that others have
       | seen work well?
        
         | erulabs wrote:
         | One of the most interesting bits of devops work I've done was
         | when I was embedded with a data science team. Infrastructure
         | for data science is just so different than traditional ops -
         | but I feel like I was able to both help the team move more
         | quickly and also prevent them from spending all of the
         | company's money - so at least in that case, it worked quite
         | well.
         | 
         | I've never understood why data science teams are typically so
         | far removed from "normal" engineering teams. Maybe it's the
         | DevOps Kool-Aid speaking, but in my opinion, teams should be
         | more horizontal than vertical!
        
         | cromd wrote:
         | I've been in orgs where they were on the same team and on
         | different teams, both as a modeler and a data engineer. So far,
         | I personally prefer when they're on the same team.
         | 
         | Pros of same-team: fewer ideas "lost in translation" between
         | data scientists and data engineers, better understanding of
         | which datasets/flows are top priority, can sometimes share some
         | stack components and help data scientists improve their code,
         | better chances of getting data scientists to contribute their
         | own batch jobs (there's just more trust as opposed to dealing
         | with some "engineering" team that is less connected to you).
         | 
         | Cons of same-team: data engineers may not be as in-the-loop on
         | what's happening with production datasets, may not be as
         | tightly integrated with a devops team, may get overly caught up
         | in "business logic" as opposed to "plumbing".
        
         | quadrature wrote:
         | Data scientists are embedded in product teams, and data platform
         | engineers are in a platform engineering org.
        
         | thenipper wrote:
         | I work with operations research teams in a blended model, with
         | engineers embedded with the OR scientists. I really
         | prefer it. Code can get to prod a lot quicker and we don't have
         | the "throw it over the fence to engineering" issues that can
         | arise.
        
       | mooreds wrote:
       | I liked how they took some of the essentials of software
       | development (one set of tooling, DRY, reuse) and applied them to
       | the data science arena.
        
       | JHonaker wrote:
       | I started this expecting to be disappointed, but I really like
       | all of the principles they're describing. I've been pushing for
       | more of this attitude at my own company.
        
       ___________________________________________________________________
       (page generated 2022-03-11 23:00 UTC)