[HN Gopher] The Need for Data Engineers
___________________________________________________________________
The Need for Data Engineers

Author : eaguyhn
Score : 34 points
Date : 2020-02-21 18:54 UTC (1 days ago)

(HTM) web link (thenewstack.io)
(TXT) w3m dump (thenewstack.io)

| ianamartin wrote:
| I'm interviewing for a Data Engineering position right now, and
| one of the questions I was told to prepare for is "What is data
| engineering?" I think it's far more than just the data science
| aspects this article talks about. Data Engineering touches more
| aspects of your engineering projects than most people think.
| Curious what this crowd has to say about my idea here. Also, I'm
| looking for work. If you like my thoughts, hit me up.
|
| I think there are 4 buckets of data engineering problems, each
| with its own challenges and solutions.
|
| Operational Data Engineering: This is the detritus that grows like
| weeds as part of other projects and often isn't recognized as a
| data engineering problem. We need to pull a file off an FTP
| server or hit an API and do something with it. Next thing you
| know, there are dozens of these little things that are not
| individually hard, but having visibility into dependency trees
| and failure cases becomes difficult because they are spread out
| everywhere and it's not obvious where to look when things go
| wrong. Tools like Apache Airflow are a good solution even if you
| don't use them in other ways because they can centralize
| monitoring, logging, and graphs. Scaling isn't resource-intensive
| for these tasks because they are discrete. You can fan out. The
| scaling challenge for this type of data engineering is really
| about tending your garden and keeping things coherently
| organized.
|
| Business Logic Data Engineering: This is processing where the data
| is highly structured and sometimes even ordered or sequenced.
| It's hard to scale because you can't just throw things into a
| stream and apply multiple workers. You have to have a managed
| process and likely shared in-memory state that collects the
| worker results and applies strict rules to the process. This is
| the opposite problem from big data. It's small data, rigidly
| organized, and carefully managed.
|
| Data Science Data Engineering: This is sort of classic ETL with a
| twist. ETL systems are typically pretty static once the E, T, and
| L are known quantities. But working with Data Scientists requires
| that your pipelines be pretty flexible because scientists are
| doing experiments. But they also have to be repeatable and
| comparable, which means your pipeline has to maintain versions.
| This is also the area where you are most likely to encounter Big
| Data, so you have to be prepared to change your mental model and
| be able to use tools like Hadoop and Spark to bring compute to
| where your data is.
|
| Analytics Data Engineering: This is classic ETL pipelines that
| move data from point A to data lakes or data warehouses. The key
| thing to understand here is what you are modeling at the
| endpoint. If it's a legit data warehouse, you are modeling
| business processes. If you aren't doing that, you are--by
| definition--pushing data to a lake. Understanding your endpoint
| is key to choosing the reporting and analytics tools to layer on
| top of your data source. Data lakes are a good use case for
| ad-hoc, SQL-driven reporting tools like Metabase. But if you are
| sitting on top of a well-structured fact/dimension type of
| warehouse, you will want more formal tools like Tableau, Pentaho,
| or Cognos.
| thedudeabides5 wrote:
| Good description. I've found it easy to explain to people that
| data scientists are often your explorers or researchers, the
| folks that go out and deal with raw, uncleaned, poorly modeled
| information, looking for relationships that are relevant to the
| business/study.
|
| Data Engineers are the folks that show up once the boss says
| 'yeah that's good enough we want to see the result of that
| process/model/algorithm on an ongoing basis'. Now what was
| likely a pile of unsystematized Jupyter notebooks and Excel needs
| to get cleaned, systematized, and productionalized, preferably in
| tools designed to handle pipelines, scheduled jobs, etc.
| AznHisoka wrote:
| If one wants to become a data engineer, what specific
| vendors/technologies are increasing in demand? E.g., Databricks,
| Talend, Cloudera?
| guessmyname wrote:
| Here is a good infographic [1] taken from DataCamp [2].
|
| The infographic and article show what skills and tools are
| relevant for a job as a web developer _(more specifically doing
| Python Web Development)_ and compare them with similarly
| important skills and tools for data science. They include
| average salary expectations and links to websites where you can
| learn, practice, and search for a job.
|
| [1]
| https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Pyt...
|
| [2] https://www.datacamp.com/community/blog/web-development-data...
___________________________________________________________________
(page generated 2020-02-22 23:00 UTC)