[HN Gopher] The Need for Data Engineers
       ___________________________________________________________________
        
       The Need for Data Engineers
        
       Author : eaguyhn
       Score  : 34 points
       Date   : 2020-02-21 18:54 UTC (1 days ago)
        
 (HTM) web link (thenewstack.io)
 (TXT) w3m dump (thenewstack.io)
        
       | ianamartin wrote:
       | I'm interviewing for a Data Engineering position right now, and
       | one of the questions I was told to prepare for is "What is data
       | engineering?" I think it's far more than just the data science
       | aspects this article talks about. Data Engineering touches more
       | aspects of your engineering projects than most people think.
       | Curious what this crowd has to say about my idea here. Also, I'm
       | looking for work. If you like my thoughts, hit me up.
       | 
       | I think there are 4 buckets of data engineering problems, each
       | with their own challenges and solutions.
       | 
       | Operational Data Engineering: This is the detritus that grows like
       | weeds as parts of other projects and often isn't recognized as a
       | data engineering problem. We need to pull a file off an FTP
       | server or hit an API and do something with it. Next thing you
       | know, there are dozens of these little things that are not
       | individually hard, but having visibility into dependency trees
       | and failure cases becomes difficult because they are spread out
       | everywhere and it's not obvious where to look when things go
       | wrong. Tools like Apache Airflow are a good solution even if you
       | don't use them in other ways because they can centralize
       | monitoring, logging, and graphs. Scaling isn't resource intensive
       | for these tasks because they are discrete. You can fan out. The
       | scaling challenge for this type of data engineering is really
       | about tending your garden and keeping things coherently
       | organized.
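       | 
       | A minimal sketch of the "one place to look" idea, using only the
       | Python standard library rather than Airflow itself. The job names
       | and the JOBS registry are hypothetical stand-ins for the FTP pulls
       | and API calls described above, and one job fails on purpose:

```python
import logging
from graphlib import TopologicalSorter  # Python 3.9+

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger("ops_jobs")


def fetch_file():
    return "raw bytes"  # stand-in for pulling a file off an FTP server


def call_api():
    raise ConnectionError("API timed out")  # simulated failure


def load_report():
    return "report loaded"


# Hypothetical registry: job name -> (callable, upstream dependencies)
JOBS = {
    "fetch_file": (fetch_file, []),
    "call_api": (call_api, []),
    "load_report": (load_report, ["fetch_file", "call_api"]),
}


def run_all(jobs):
    """Run jobs in dependency order and skip anything downstream of a failure."""
    graph = {name: deps for name, (_, deps) in jobs.items()}
    status = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = jobs[name]
        blocked = [d for d in deps if status[d] != "ok"]
        if blocked:
            status[name] = "skipped"
            log.warning("%s skipped, upstream not ok: %s", name, blocked)
            continue
        try:
            func()
            status[name] = "ok"
            log.info("%s succeeded", name)
        except Exception:
            status[name] = "failed"
            log.exception("%s failed", name)
    return status


status = run_all(JOBS)
```

       | Every job logs through the same logger and reports into the same
       | status table, so a failure and the jobs it blocked show up
       | together. A scheduler like Airflow gives you this plus retries,
       | scheduling, and a UI.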
       | 
       | Business Logic Data Engineering: This is processing where the data
       | is highly structured and sometimes even ordered or sequenced.
       | It's hard to scale because you can't just throw things into a
       | stream and apply multiple workers. You have to have a managed
       | process and likely shared in-memory state that collects the
       | worker results and applies strict rules to a process. This is the
       | opposite problem from big data. It's small data, rigidly
       | organized, and carefully managed.
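       | 
       | A rough sketch of that shape, assuming a toy ledger where each
       | event carries a sequence number. The event format, the
       | normalize_event helper, and the "no overdraft" rule are all made
       | up for illustration. Workers do the stateless part in parallel,
       | and a single coordinator applies results to shared state strictly
       | in order:

```python
from concurrent.futures import ThreadPoolExecutor


def normalize_event(event):
    """Stateless per-event work that is safe to run on any worker."""
    seq, kind, amount = event
    return seq, kind.strip().lower(), int(amount)


def apply_in_order(events, workers=4):
    """Fan the work out, then apply results to shared state by sequence number."""
    balance = 0
    rejected = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(normalize_event, events))
    # The strict rule: a withdrawal may never exceed the current balance,
    # so the outcome depends on processing events in sequence order.
    for seq, kind, amount in sorted(results):
        if kind == "deposit":
            balance += amount
        elif kind == "withdraw" and amount <= balance:
            balance -= amount
        else:
            rejected.append(seq)
    return balance, rejected


# Events arrive out of order, as they might from independent workers.
events = [(3, "Withdraw", 50), (1, "deposit", 100), (4, "deposit ", 20), (2, "withdraw", 70)]
balance, rejected = apply_in_order(events)
```

       | Applying the same events in arrival order would reject event 3
       | for a different reason and at a different balance, which is why
       | you can't just add workers to this kind of pipeline.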
       | 
       | Data Science Data Engineering: This is sort of classic ETL with a
       | twist. ETL systems are typically pretty static once the E, T, and
       | L are known quantities. But working with Data Scientists requires
       | flexible pipelines, because scientists are doing experiments. The
       | results also have to be repeatable and comparable, which means
       | your pipeline has to be versioned.
       | This is also the area where you are most likely to encounter Big
       | Data, so you have to be prepared to change your mental model and
       | be able to use tools like Hadoop and Spark to bring compute to
       | where your data is.
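       | 
       | One way to think about the versioning part, as a sketch only:
       | derive a run ID from everything that can change a result. The
       | run_fingerprint function, the parameter names, and the "v1" code
       | tag below are assumptions for illustration, not any particular
       | tool's API:

```python
import hashlib
import json


def run_fingerprint(params, input_rows, code_version):
    """Return a deterministic ID built from the config, the data, and the code tag."""
    payload = json.dumps(
        {"params": params, "rows": input_rows, "code": code_version},
        sort_keys=True,  # key order must not change the fingerprint
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


rows = [[1, 0.5], [2, 0.7]]
run_a = run_fingerprint({"window": 7, "scale": True}, rows, "v1")
run_b = run_fingerprint({"scale": True, "window": 7}, rows, "v1")   # same config, keys reordered
run_c = run_fingerprint({"window": 14, "scale": True}, rows, "v1")  # an experiment changed a knob
```

       | Two runs with the same fingerprint should be directly comparable,
       | and a changed parameter shows up as a new ID instead of silently
       | overwriting the old result.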
       | 
       | Analytics Data Engineering: These are the classic ETL pipelines that
       | move data from point A to data lakes or data warehouses. The key
       | thing to understand here is what you are modeling at the
       | endpoint. If it's a legit data warehouse, you are modeling
       | business processes. If you aren't doing that, you are--by
       | definition--pushing data to a lake. Understanding your endpoint
       | is key to choosing your reporting and analytics tools to lay on
       | top of your data source. Data lakes are a good use case for
       | ad-hoc, SQL-driven reporting tools like Metabase. But if you are
       | sitting on top of a well-structured fact/dimension type of
       | warehouse, you will want more formal tools like Tableau, Pentaho,
       | or Cognos.
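       | 
       | For anyone who hasn't seen a fact/dimension layout, here is a tiny
       | sketch using Python's built-in sqlite3 module. The table names,
       | columns, and rows are invented for illustration. The fact table
       | records a business process (sales), and the dimension table
       | describes what was sold:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category TEXT
);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    sale_date TEXT,
    quantity INTEGER,
    revenue REAL
);
INSERT INTO dim_product VALUES (1, 'widget', 'hardware'), (2, 'manual', 'books');
INSERT INTO fact_sales VALUES
    (1, 1, '2020-02-01', 2, 20.0),
    (2, 1, '2020-02-02', 1, 10.0),
    (3, 2, '2020-02-02', 3, 15.0);
""")

# A typical warehouse question: measures from the fact table,
# grouped by an attribute from the dimension table.
revenue_by_category = conn.execute("""
    SELECT d.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS d USING (product_id)
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
```

       | Formal BI tools tend to assume this kind of structure. A lake of
       | raw files has no such model, which is where ad-hoc SQL tools fit
       | better.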
        
         | thedudeabides5 wrote:
         | Good description. I've found it easy to explain to people that
         | data scientists are often your explorers or researchers, the
         | folks that go out and deal with raw, uncleaned, poorly modeled
         | information, looking for relationships that are relevant to the
         | business/study.
         | 
         | Data Engineers are the folks that show up once the boss says
         | 'yeah that's good enough we want to see the result of that
         | process/model/algorithm on an ongoing basis'...now what was
         | likely a pile of unsystematized Jupyter notebooks and Excel needs
         | to get cleaned, systematized, and productionized, preferably in
         | tools designed to handle pipelines and scheduled jobs etc.
        
       | AznHisoka wrote:
       | If one wants to become a data engineer, what specific
       | vendors/technologies are increasing in demand? E.g., Databricks,
       | Talend, Cloudera?
        
         | guessmyname wrote:
         | Here is a good infographic [1] taken from DataCamp [2].
         | 
         | The infographic and article show what skills and tools are
         | relevant for a job as a web developer _(more specifically doing
         | Python Web Development)_ and compares them with similarly
         | important skills and tools for data science. It includes
         | average salary expectations and links to websites where you can
         | learn, practice, and search for a job.
         | 
         | [1]
         | https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Pyt...
         | 
         | [2] https://www.datacamp.com/community/blog/web-development-
         | data...
        
       ___________________________________________________________________
       (page generated 2020-02-22 23:00 UTC)