[HN Gopher] The evolution of the data engineer role
___________________________________________________________________
 
The evolution of the data engineer role
 
Author : Arimbr
Score  : 123 points
Date   : 2022-10-24 14:30 UTC (8 hours ago)
 
(HTM) web link (airbyte.com)
(TXT) w3m dump (airbyte.com)
 
| unholyguy001 wrote:
| It's fascinating that the conclusion/forecast is that tools will
| abstract away engineering problems and DE will move closer to the
| business, while over the last 20 years the exact opposite has
| happened: the toolset has actually become harder (not easier) to
| use but orders of magnitude more powerful, and DE has moved
| closer to engineering, to the point where a good data engineer
| basically is a specialized software engineer.
|
| The absolute pinnacle of "easy to use" was probably the
| Informatica/Oracle stack of the late 90s and early 00s. It just
| wasn't powerful or scalable enough to meet the needs of the Big
| Data shift.
|
| Of course, I guess this makes sense given that the author works
| for a company with a vested interest in reversing that trend.
| hnews_account_1 wrote:
| I think those tools were easy to use within the zeitgeist of
| their day. Even advanced versions of those tools would struggle
| against today's data needs, which have become incredibly bespoke.
| My skill set extends all the way from my actual industry
| (finance) to the boundary of software development. I also have
| data, big data, and cluster usage skills (Slurm, etc.). I don't
| use everything every day, and obviously I cannot be a specialist
| in most of this stuff (I concentrate on finance more than
| anything else) considering the incredible range, but this is just
| the past 2 years for me.
|
| I cannot imagine a less specialized future, looking around today,
| where some nice tool does 80% of my work. Not because the work I
| do is difficult to automate, but because the work I do won't
| match the work other industries may do (beyond existing
| generalizations of pandas, regression toolkits, and other
| low-level stuff). There's no point building a full automation
| suite just for my single work profile, which itself will differ
| from other areas of finance.
| prions wrote:
| IMO data engineering is already a specialized form of software
| engineering. However, what people interpret as DEs being slow to
| adopt best practices from traditional software engineering is
| more about the unique difficulties of working with data
| (especially at scale) and less about the awareness of, or desire
| to use, best practices.
|
| Speaking from my DE experience at Spotify and previously in
| startup land, the biggest challenge is the slow and distant
| feedback loop. The vast majority of data pipelines don't run on
| your machine and don't behave like they do on a local machine.
| They run as massively distributed processes, and their state is
| opaque to the developer.
|
| Validating the correctness of a large-scale data pipeline can be
| incredibly difficult, as the successful operation of a pipeline
| doesn't conclusively determine whether the data is actually
| correct for the end user. People working seriously in this space
| understand that traditional practices like unit testing only go
| so far. And integration testing really needs to work at scale,
| with easily recyclable infrastructure (and data), to not be a
| massive drag on developer productivity. Even getting the correct
| kind of data to be fed into a test can be very difficult if the
| ops/infra of the org isn't designed for it.
|
| The best data tooling isn't going to look exactly like
| traditional SWE tooling. Tools that vastly reduce the feedback
| loop of developing (and debugging) distributed pipelines running
| in the cloud, and that also provide means of validating the
| output on meaningful data, are where tooling should be going.
| Trying to shoehorn in traditional SWE best practices will really
| only take off once that kind of developer experience is realized.
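| To make the "success != correctness" point concrete, here is a
| minimal sketch of the kind of post-run assertion that has to
| supplement unit tests (PySpark; the table name, columns, and
| invariants are all illustrative, not from any real pipeline):
|
|     from pyspark.sql import SparkSession, functions as F
|
|     spark = SparkSession.builder.getOrCreate()
|     # Hypothetical output table produced by the pipeline run
|     out = spark.read.table("analytics.daily_orders")
|
|     # The job finished "successfully" -- but is the data plausible?
|     assert out.count() > 0, "output table is empty"
|
|     null_ids = out.filter(F.col("order_id").isNull()).count()
|     assert null_ids == 0, f"{null_ids} rows with null order_id"
|
|     negatives = out.filter(F.col("total") < 0).count()
|     assert negatives == 0, f"{negatives} rows with negative totals"
|
| Checks like these only catch the invariants someone thought to
| write down, which is exactly why a green pipeline run proves so
| little on its own.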
| mywittyname wrote:
| > Validating the correctness of a large scale data pipeline can
| be incredibly difficult as the successful operation of a pipeline
| doesn't conclusively determine whether the data is actually
| correct for the end user. People working seriously in this space
| understand that traditional practices here like unit testing only
| go so far.
|
| I'm glad to see someone calling this out, because the comments
| here are a sea of "data engineering needs more unit tests."
| Reliably getting data into a database is rarely where I've
| experienced issues. That's the easy part.
|
| This is the biggest opportunity in this space, IMHO, since
| validation and data completeness/accuracy is where I spend the
| bulk of my work. Something that can analyze datasets and provide
| some sort of ongoing monitoring for confidence in the
| completeness and accuracy of the data would be great. These tools
| seem to exist mainly in the network security realm, but I'm sure
| they could be generalized to the DE space. When I can't leverage
| a second system for validation, I will generally run some
| rudimentary statistics to check whether the volume and types of
| data I'm getting are similar to what is expected.
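| A minimal sketch of that kind of rudimentary check, assuming a
| pandas DataFrame for the incoming batch and a stored baseline of
| expected volume and dtypes (all names here are illustrative):
|
|     import pandas as pd
|
|     def check_batch(df: pd.DataFrame, expected_rows: int,
|                     expected_dtypes: dict, tolerance: float = 0.2):
|         """Flag a batch whose volume or column types drift from baseline."""
|         problems = []
|         # Volume: within +/- tolerance of the typical row count
|         if abs(len(df) - expected_rows) > tolerance * expected_rows:
|             problems.append(f"row count {len(df)}, expected ~{expected_rows}")
|         # Types: dtypes should match what earlier batches looked like
|         for col, dtype in expected_dtypes.items():
|             if col not in df.columns:
|                 problems.append(f"missing column {col}")
|             elif str(df[col].dtype) != dtype:
|                 problems.append(f"{col} is {df[col].dtype}, expected {dtype}")
|         return problems
|
| Crude, but it catches the "the feed silently shrank by half"
| class of problem that a successful load never will.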
| abrazensunset wrote:
| There is a huge round of "data observability" startups that
| address exactly this. As a category it was overfunded prior to
| the VC squeeze. Some of them are actually good.
|
| They all have various strengths and weaknesses with respect to
| anomaly detection, schema change alerts, rules-based approaches,
| sampled diffs on PRs, incident management, tracking lineage for
| impact analysis, and providing usage/performance monitoring.
|
| Datafold, Metaplane, Validio, Monte Carlo, Bigeye
|
| Great Expectations has always been an open source standby as well
| and is being turned into a product.
| robertlagrant wrote:
| I've worked with medium-sized ETL, and not only does it have
| unique challenges, it's a sub-domain that seems to reward quick
| and dirty and "it works" over strong validation.
|
| The key problem is that the more you validate incoming data, the
| more you can demonstrate correctness, but then the more often
| data coming in will be rejected, and you will be paged out of
| hours :)
| conkeisterdoor wrote:
| I also manage a medium-sized set of ETL pipelines (approx 40
| pipelines across 13k-ish lines of Python) and have a very similar
| experience.
|
| I've never been in a SWE role before, but I am related to and
| have known a number of SWEs, and have a general sense of what
| being a SWE entails. That disclaimer out of the way, it's my gut
| feeling that a DE typically does more "hacky" coding than a SWE,
| whereas SWEs have much more clearly established standards for how
| to do certain things.
|
| My first modules were a hot nasty mess. I've been refactoring and
| refining them over the past 1.5 years so they're more effective,
| efficient, and easier to maintain. But they've always just
| worked, and that has been good enough for my employer.
|
| I have one 1,600-line module solely dedicated to validating a set
| of invoices from a single source. It took me months of trial and
| error to get that monster working reliably.
| azurezyq wrote:
| This is actually a great observation. Data pipelines are often
| written in various languages, running on heterogeneous systems,
| with different time alignment schemes. I always found it tricky
| to "fully trust" a piece of result. Hmm, any best practices from
| your side?
| oa335 wrote:
| Not OP, but as a Data Engineer with 4 years of experience in the
| space, I think the key is to first build the feedback loop -
| i.e., anything that helps you answer how you know the data
| pipeline is flowing and that the data is correct - and then get
| sign-off from both the producers and consumers of the data.
| Actually getting the data flowing is usually pretty easy after
| both parties agree about what that actually means.
| Tycho wrote:
| I would describe myself as a _dataframe_ engineer.
| usgroup wrote:
| I live in the world of data lakes and elaborate pipelines. Now
| and again I get to use a traditional star schema data warehouse
| and... it is an absolute pleasure to use in contrast to modern
| data access patterns.
| sherifnada wrote:
| In some sense, data engineering today is where software
| engineering was a decade ago:
|
| - Infrastructure as code is not the norm. Most tools are
| UI-focused. It's the equivalent of setting up your infra via the
| AWS UI.
|
| - Prod/staging/dev environments are not the norm.
|
| - Version control is not a first-class concept.
|
| - DRY and component re-use are exceedingly difficult (how many
| times did you walk into a meeting where 3 people had 3 different
| definitions of the same metric?)
|
| - API interfaces are rarely explicitly defined, and fickle when
| they are (the hot name for this nowadays is "data contracts"; see
| the sketch below).
|
| - Unit/integration/acceptance testing is not nearly as ubiquitous
| as it is in software.
|
| On the bright side, I think this means DE doesn't need to
| reinvent the wheel on a lot of these issues. We can borrow a lot
| from software engineering.
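| To illustrate the "data contracts" bullet above, a minimal sketch
| using pydantic (the event shape is made up for the example):
|
|     from datetime import datetime
|     from pydantic import BaseModel
|
|     # The contract: the producer promises records of this shape.
|     class OrderEvent(BaseModel):
|         order_id: str
|         amount_cents: int
|         created_at: datetime
|
|     def validate_batch(records: list) -> list:
|         # Raises a ValidationError with field-level detail on a bad
|         # record, instead of letting it flow silently downstream.
|         return [OrderEvent(**r) for r in records]
|
| The point isn't the library; it's that the interface is written
| down in one place and enforced at the boundary.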
| beckingz wrote:
| You're talking about analytics, not data engineering.
|
| But yes, data analysis still needs more of this, though the
| smarter folks are getting on the Analytics Engineering / DataOps
| trains.
| mywittyname wrote:
| My DE team has all of these, and I've never worked on a team
| without them. I speak as someone whose official title has been
| Data Engineer since 2015, and I've consulted for lots of F500
| companies.
|
| Unit testing is the only thing we tend to skip, mainly because
| it's more reliable to allow for fluidity in the data that's being
| ingested - which is really easy now that so many databases
| support automatic schema detection. External APIs can change
| without notice, so it's better to just design for that, then use
| the time you would spend on unit tests to build alerts around
| automated data validation.
| jimcavel888 wrote:
| swyx wrote:
| > Titles and responsibilities will also morph, potentially
| deeming the "data engineer" term obsolete in favor of more
| specialized and specific titles.
|
| "Analytics engineer" is mentioned, and it also just had its first
| conference at dbt's conference. All the talks are already up:
| https://coalesce.getdbt.com/agenda/keynote-the-end-of-the-ro...
| pjot wrote:
| Just to clarify, last week was dbt's first _in person_
| conference. Third overall.
| iblaine wrote:
| A full history of DE should include some of the original low-code
| tools (Cognos, Informatica, SSIS). To some extent, the failure of
| these tools to adapt to the evolution of the DE role has led to
| our modern data stack.
| Eumenes wrote:
| Agreed. This is the first thing I thought about - the evolution
| from reporting systems to ETL code to Hadoop to Spark, etc.
| MrPowers wrote:
| Great article.
|
| > data engineers have been increasingly adopting software
| engineering best practices
|
| I think the data engineering field is starting to adopt some
| software engineering best practices, but it's still really early
| days. I am the author of popular Spark testing libraries
| (spark-fast-tests, chispa), and they definitely have a large
| userbase, but they could also grow a lot.
|
| > The way organizations structure data teams has changed over the
| years. Now we see a shift towards decentralized data teams,
| self-serve data platforms, and ways to store data beyond the data
| warehouse - such as the data lake, data lakehouse, or the
| previously mentioned data mesh - to better serve the needs of
| each data consumer.
|
| I think the Lakehouse architecture is the real future of data
| engineering; see the paper:
| https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
|
| Disclosure: I am on the Delta Lake team, but I joined because I
| believe in the Lakehouse architecture vision.
|
| It will take a long time for folks to understand all the
| differences between data lakes, Lakehouses, data warehouses, etc.
| Over time, I think mass adoption of the Lakehouse architecture is
| inevitable (benefits of open file formats, no lock-in, separation
| of compute from storage, cost management, scalability, etc.).
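| For anyone who hasn't seen these libraries, a chispa-style test
| looks roughly like this (the transform under test is made up for
| the example):
|
|     from pyspark.sql import SparkSession, functions as F
|     from chispa import assert_df_equality
|
|     spark = SparkSession.builder.master("local[1]").getOrCreate()
|
|     def with_greeting(df):
|         # The transform under test: a plain function on DataFrames
|         return df.withColumn("greeting", F.lit("hi"))
|
|     def test_with_greeting():
|         source = spark.createDataFrame([("alice",), ("bob",)], ["name"])
|         expected = spark.createDataFrame(
|             [("alice", "hi"), ("bob", "hi")], ["name", "greeting"])
|         assert_df_equality(with_greeting(source), expected)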
| victor106 wrote:
| > It will take a long time for folks to understand all the
| differences between data lakes, Lakehouses, data warehouses,
| etc.
|
| What are some good resources that can help educate folks on these
| differences?
| claytonjy wrote:
| Short version:
|
| - Data warehouse: schema on write. You have to know the end form
| before you load it. Breaks every time upstream changes (a lot, in
| this world).
|
| - Data lake: schema on read. Load everything into S3 and deal
| with it later. Mongo for data platforms.
|
| - Data lakehouse: something in between. Store everything loosely
| like a lake, but have in-lakehouse processes present
| user-friendly transforms or views like a warehouse. Made possible
| by cheap storage (Parquet on S3); reduces schema breakage by
| keeping both sides of the T in the same place.
| DougBTX wrote:
| Materialised views for cloud storage?
| MrPowers wrote:
| I am working on some blogs/videos that will hopefully help
| clarify the differences. I'm working on a Delta Lake vs Parquet
| blog post right now, and I gave a "5 Reasons Parquet Files Are
| Better than CSV" talk last year: https://youtu.be/9LYYOdIwQXg
|
| Most of the content that I've seen in this area is really
| high-level. I'm trying to write posts that are a bit more
| concrete, with some code snippets, high-level benchmarks, etc.
| Hopefully this will help.
| abrazensunset wrote:
| "Lakehouse" usually means a data lake (a bunch of files in object
| storage with some arbitrary structure) that has an open source
| "table format" making it act like a database - e.g., using
| Iceberg or Delta Lake to handle deletes, transactions, and
| concurrency control on top of Parquet (the "file format").
|
| The advantage is that various query engines will make it quack
| like a database, but you have a completely open interop layer
| that will let any combination of query engines (or just SDKs that
| implement the table format, or whatever) coexist. And in
| addition, you can feel good about "owning" your data and not
| being overtly locked in to Snowflake or Databricks.
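| As a concrete taste of that interop, a minimal sketch using the
| standalone delta-rs Python bindings - no Spark or Databricks
| involved (paths and columns are illustrative):
|
|     import pandas as pd
|     from deltalake import DeltaTable, write_deltalake  # pip install deltalake
|
|     df = pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]})
|
|     # Writes Parquet files *plus* a transaction log; the log is what
|     # turns "a pile of files" into a table with ACID semantics.
|     write_deltalake("/tmp/orders", df, mode="append")
|
|     dt = DeltaTable("/tmp/orders")
|     print(dt.version())    # versioned commits enable time travel
|     print(dt.to_pandas())  # any engine that speaks Delta reads the same table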
| swyx wrote:
| (OP's coworker) We actually published a guide on data
| lakes/lakehouses last month!
| https://airbyte.com/blog/data-lake-lakehouse-guide-powered-b...
|
| covering:
|
| - What's a Data Lake and Why Do You Need One?
|
| - What's the Difference between a Data Lake, Data Warehouse, and
| Data Lakehouse?
|
| - Components of a Data Lake
|
| - Data Lake Trends in the Market
|
| - How to Turn Your Data Lake into a Lakehouse
| oa335 wrote:
| I am a data engineer, and I STILL don't understand the
| differences between the following terms:
|
| 1. Data Warehouse
|
| 2. Data Lake
|
| 3. Data Lakehouse
|
| 4. Data Mesh
|
| Can someone please clearly explain the differences between these
| concepts?
| chrisjc wrote:
| You mean Lake 'Shanty' Architecture (think DataSwamp vs
| DataLake), am I right?
|
| But in all seriousness, I totally agree with your opinion on
| Lakehouse architecture and am especially excited about Apache
| Iceberg (an external table format) and the support and attention
| it's getting.
|
| Although, I don't think that selecting any of these data
| technologies/philosophies comes down to making a mutually
| exclusive decision. In my opinion, they either build on or
| complement each other quite nicely.
|
| For those that are interested, here are my descriptions of
| each...
|
| Data Lake architecture - all of your data is stored on blob
| storage (S3, etc.) in a way that is partitioned thoughtfully and
| easily accessible, along with a meta-index/catalogue of what data
| is there and where it is.
|
| Lakehouse architecture - similar to a data lake, but the data is
| structured and mutable, and hopefully allows for
| transactions/atomic ops, schema evolution/drift,
| time-travel/rollback, and so on... Ideally all of the properties
| that you usually assume you get with any sort of OLAP (maybe even
| OLTP) DB table. But the most important property, in my opinion,
| is that the table is accessible through any compatible
| compute/query engine/layer. Separating storage and compute has
| revolutionized the data warehouse as we know it, and this is the
| next iteration of that movement, in my opinion.
|
| Data Mesh/Grid architecture - designing how the data moves from a
| source all the way through each and every target, while
| storing/capturing this information in an accessible
| catalogue/meta-database even as things transpire and change. As a
| result it provides data lineage and provenance, potentially
| labeling/tagging, inventory, data-dictionary-like information,
| etc. This one is the most ambiguous, probably the most difficult
| to describe and to design/implement, and to be honest I've never
| seen a real-life working example. I do think this functionality
| is a critical missing piece of the data stack, whether the
| solution is a Data Mesh/Grid or something else. Data engineers
| have their work cut out for them on this one, mostly because this
| is where their paths cross with those of application/service
| developers and software engineers. In my opinion, developers are
| usually creating services/applications that are glorified CRUD
| wrappers around some kind of operational/transactional data store
| like MySQL, Postgres, Mongo, etc. Analytics, reporting,
| retention, volume, etc. are usually an afterthought and not their
| problem. Until someone hooks the operational data store up to
| their SQL IDE or Tableau/Looker and takes down prod. Then along
| comes the data engineer to come up with yet another ETL/ELT to
| get the data out of the operational data store and into a data
| warehouse, so that reports and analytics can be run without
| taking prod down again.
|
| Data Warehouse (modern) - massively parallel processing (MPP)
| over detached/separated columnar (for now) data. Some data
| warehouses are already somewhat compatible with data lakes, since
| they can use their MPP compute to index and access external
| tables. Some are already planning to be even more Lakehouse
| compatible by not only leveraging their own MPP compute against
| externally managed tables, but also managing external tables in
| the first place. That includes managing schemas and running all
| of the DDLs (CREATE, ALTER, DROP, etc.) as well as DQLs (SELECT)
| and DMLs (MERGE, INSERT, UPDATE, DELETE, ...). Querying data
| across native DB tables and external tables (potentially from
| multiple Lakehouses and data lakes) all becomes possible with a
| join in a SQL statement. Additionally, this allows for all kinds
| of governance-related functionality as well: masking, row/column
| level security, logging, auditing, and so on.
|
| As you might be able to tell from this post (and my post
| history), I'm a big fan of Snowflake. I'm excited for
| Snowflake-managed Iceberg tables, where you then consume the data
| with a different compute/query engine. Snowflake (or another
| modern DW) could prepare the data (ETL/calc/crunch/etc.) and then
| manage (DDL & DML) it in an Iceberg table. Then something like
| DuckDB could consume the Iceberg table schema and listen for
| table changes (oplog?), and then read/query the data, performing
| last-mile analytics (pagination, ordering, filtering, aggs,
| etc.).
|
| DuckDB doesn't support Apache Iceberg, but it can read the
| Parquet files which are used internally by Iceberg. Obviously,
| supporting external tables is far more complex than just reading
| a Parquet file, but I don't see why this isn't in their future.
| DuckDB guys, I know you're out there reading this :)
|
| https://iceberg.apache.org/
|
| https://www.snowflake.com/guides/olap-vs-oltp
|
| https://www.snowflake.com/blog/iceberg-tables-powering-open-...
|
| Finally, one of my favorite articles:
|
| https://engineering.linkedin.com/distributed-systems/log-wha...
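| On the DuckDB point: even without Iceberg support, querying the
| underlying Parquet directly is already a one-liner (file path and
| columns are invented for the example):
|
|     import duckdb
|
|     con = duckdb.connect()
|     # Queries the files in place -- no load step required.
|     totals = con.execute("""
|         SELECT customer_id, sum(amount) AS total
|         FROM read_parquet('warehouse/orders/*.parquet')
|         GROUP BY customer_id
|     """).df()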
| beckingz wrote:
| I'm going to use "Lake Shanty" in the future. Powerful phrase to
| describe what happens when you run aground on the shore of a data
| swamp.
| oa335 wrote:
| Great write-up. I would add that I actually have seen something
| like a "Data Mesh" architecture, at a bank of all places. The key
| was a very stable, solid infrastructure and dev platform, as well
| as a custom Python library that worked across that platform and
| was capable of ELT across all supported datastores, and would
| properly log/annotate/catalog the data flows. Such a thing is
| really only possible when the platform is actually very stable
| and devs are somewhat forced to use the library.
| z3c0 wrote:
| For those of you who are genuinely curious why this field has so
| many similarly-named roles, here's a sincere, non-snarky,
| non-ironic explanation:
|
| A Data Analyst is distinct from a Systems Analyst or a Business
| Analyst. They may perform both systems and business analysis
| tasks, but their distinction comes from their understanding of
| statistics and how they apply it to other forms of analysis.
|
| An ML Specialist is not a Data Scientist. Did you successfully
| build and deploy an ML model to production? Great! That's still
| uncommon, despite the hype. However, that would land you in the
| former position. You can claim the latter once you've walked that
| model through the scientific method, complete with hypothesis
| verification and democratization of your methodology.
|
| A BI Engineer and a Data Engineer are going to overlap a lot, but
| the former is going to lean more towards report development,
| where the latter will spend more time with ELTs/ETLs. As a data
| engineer, most of the report development that I do is to report
| on the state of data pipelines. BI BI, I like to call it.
|
| A Big Data Engineer or Specialist is a subset of data engineers
| and architects angled towards the problems of big data. This
| distinction actually matters now, because I'm encountering data
| professionals these days who have never worked outside the cloud
| or with small enterprise datasets (unthinkable only half a decade
| ago).
|
| It doesn't help that lack of understanding often leads to
| misnomer positions, but anybody who has spent time in this field
| gets used to the subtle differences quickly.
| itsoktocry wrote:
| What about _Analytics Engineer_, the hypiest-of-the-hyped right
| now?
| claytonjy wrote:
| A BI Engineer that knows dbt.
| claytonjy wrote:
| This strikes me as incredibly rosy; I want to live in this world,
| but I don't. The world I live in:
|
| - Data Analyst: someone who knows some SQL but not enough
| programming, so we can pay < 6 figures
|
| - ML Specialist: someone who figured out DS is a race to the
| bottom and ML in a title gets you paid more. Spends most of their
| time installing PyTorch in various places
|
| - BI Engineer: Data Analyst, but paid a bit more
|
| - Data Engineer: Airflow babysitter
|
| - Big Data Engineer: middle-aged Scala user, Hadoop babysitter
| tdj wrote:
| In my experience, I have come to believe ML Engineer is short for
| "YAML Engineer".
| z3c0 wrote:
| Out of all the snark in this thread, this is the only bit to
| elicit a chuckle from me. Thank you.
| mywittyname wrote:
| > The declarative concept is highly tied to the trend of moving
| away from data pipelines and embracing data products
|
| Of course an Airbyte article would say this, because they are
| selling these tools, but my experience has been the opposite.
| People buy these tools because they claim to make it easier for
| non-software people to build pipelines. But the problem is that
| these tools seem to end up being far more complicated and less
| reliable than pipelines built in code.
|
| There's a reason this domain is saturated with so. many. tools.
| None of them do a great job. And when a company invariably hits
| the limits of one, they start shopping for a replacement, which
| will have its own set of limitations. Lather, rinse, repeat.
|
| I built a solid career over the past 8 or so years replacing
| these "no code" pipeline tools with code once companies hit the
| ceilings of those tools. You can get surprisingly far in the data
| world with Airflow + a large-scale database, but all of the major
| cloud providers have great tool offerings in this space. Plus,
| for platforms that these tools don't interface with, you're going
| to have to write code anyway.
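| For reference, the "surprisingly far" setup really isn't much
| code. A minimal Airflow DAG along those lines (task bodies
| omitted; all names are illustrative):
|
|     from datetime import datetime
|     from airflow import DAG
|     from airflow.operators.python import PythonOperator
|
|     def extract():
|         ...  # pull from the source API
|
|     def load():
|         ...  # write to the warehouse
|
|     with DAG(dag_id="orders_pipeline",
|              start_date=datetime(2022, 1, 1),
|              schedule_interval="@daily",
|              catchup=False) as dag:
|         extract_task = PythonOperator(task_id="extract",
|                                       python_callable=extract)
|         load_task = PythonOperator(task_id="load",
|                                    python_callable=load)
|         extract_task >> load_task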
| muspimerol wrote:
| > I built a solid career over the past 8 or so years of replacing
| these "no code" pipeline tools with code once companies hit the
| ceilings of these tools.
|
| I'm sure you earn a nice living doing this, but surely this is
| not a convincing argument against using off-the-shelf data
| products. It will always come down to the cost (including ongoing
| maintenance) for the business. Bespoke in-house software is
| always the most flexible route, but rarely the cheaper one.
| Arimbr wrote:
| Oh, declarative doesn't necessarily mean no-code. Airbyte data
| integration connectors are built with an SDK in Python or Java,
| and with a low-code SDK that was just released...
|
| You can then build custom connectors on top of these, and many
| users actually need to modify an existing connector but would
| rather start from a template than from scratch.
|
| Airbyte also provides a CLI and a YAML configuration language
| that you can use to declare sources, destinations, and
| connections without the UI:
| https://github.com/airbytehq/airbyte/blob/master/octavia-cli...
|
| I agree with you that code is here to stay and that power users
| need to see the code and modify it. That's why Airbyte's code is
| open source.
| buscoquadnary wrote:
| Business Analyst, Big Data Specialist, Data Mining Engineer, Data
| Scientist, Data Engineer.
|
| Why is this field so prone to hype and to repeating the same
| things with a new coat of paint? I mean, whatever happened to
| OLAP, data cubes, Big Data, and whatever other super big next
| thing has come along in the past 2 decades?
|
| Methinks the problem with Business Intelligence solving problems
| is the first part of the term and not the second.
| hatware wrote:
| > Why is this field so prone to hype and repeating the same
| things with a new coat of paint.
|
| Money and marketing. It's no different from how Hadoop was a big
| deal around 2010, or how functional programming became the new
| thing from 2015 onwards.
|
| Personally, I think this is a failure of regulatory agencies.
| MonkeyMalarky wrote:
| I dunno, I have to put my data somewhere first. But where... in a
| warehouse? A silo? A data lake? A lake house? (I really despise
| that last one - who could coin that phrase with a straight
| face?)
| MrPowers wrote:
| Data warehouse: bundles compute & storage and comes at a
| comparatively high price point. A great option for certain
| workflows; not as great for scaling & non-SQL workflows.
|
| Data lake: typically refers to Parquet/CSV files in some storage
| system (cloud or HDFS). Data lakes are better for non-SQL
| workflows compared to data warehouses, but have a number of
| disadvantages.
|
| Lakehouse storage formats: based on OSS file formats, and they
| solve a number of data lake limitations. Options are Delta Lake,
| Iceberg, and Hudi. Lakehouse storage formats offer a ton of
| advantages and basically no downsides compared to plain Parquet
| tables, for example.
|
| Lakehouse architecture: an architectural paradigm for storing
| data in a way such that it's easily accessible for SQL-based and
| non-SQL-based workflows; see the paper:
| https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
|
| There are a variety of tradeoffs to be weighed when selecting the
| optimal solution for your particular needs.
| marktangotango wrote:
| If this is satire, it's brilliant. I don't doubt it's factual,
| but the last sentence is a slayer.
| fabatka wrote:
| Can you explain why you find the above explanation amusing? I
| honestly don't see the absurdity of it, although my livelihood
| may depend on me not seeing it :)
| victor106 wrote:
| It seems like you are just being negative toward a reply that was
| meant to genuinely clarify confusing terminology.
|
| If not, please elaborate on why you doubt this is factual.
| PubliusMI wrote:
| IMHO,
|
| "data lake" = collection of all data sources: HDFS, S3, SQL DB,
| etc.
|
| "lake house" = tool to query and combine results from all such
| sources: Dremio
| buscoquadnary wrote:
| But what lake house is complete without a boat?
|
| That's why my company is looking for investors who are interested
| in being at the forefront of the data revolution, using our data
| rowboat that will allow you to proactively leverage your data
| synergies to break down organizational data silos and use
| analytics to address your core competencies in order to leverage
| a strategic advantage to become the platform of choice in a
| holistic environment.
|
| Tell me if this sounds familiar: your company has tons of data,
| but it is spread out all over the place and you can't seem to get
| good info. You end up hounding engineers to get your reports and
| provide you information so you can look like you are making
| data-driven decisions. Maybe you've implemented a data lake but
| now have no idea how to use it. We've got you covered with our
| patent-pending data rowboat solution.
|
| This will allow you to impress everyone else in the mid-level
| staff meetings by allowing you to say you are doing something
| around the "data revolution" in your org. The best part is that
| every implementation will come with a team of our in-house
| consultants that will allow the project to drag on forever, so
| that you always have something to report on in staff meetings and
| look good to your higher-ups.
|
| Now, you may be an engineer looking to revolutionize your career
| and get involved in the next step of the glorious October data
| revolution. Well, we've got you covered: for a very reasonable
| price you can enroll in our "data rowboat boot camp", where you
| will spend hours locked in a room where someone who barely speaks
| English will read documentation to you.
|
| But act quick, or you'll end up as one of the data kulaks as the
| new data rowboat revolution proceeds into a glorious future with
| our 5-year plan.
| jdgoesmarching wrote:
| Brb, running to trademark every nautical data metaphor I can get
| my hands on.
|
| What happens when your data rowboat runs ashore? Introducing Data
| Tugboat(tm), your single-pane-of-glass solution for shoring up
| undocumented ETLs and reeling your data lineage safely into
| harbor.
| MonkeyMalarky wrote:
| Need to run ML on your data? Try our DeepSea data drilling rigs,
| delivered in containers!
| MonkeyMalarky wrote:
| Sir, I'm sorry, but a rowboat just won't scale; my needs are too
| vast. What I'm proposing is the next level of data extraction.
| You've heard of data mining? Well, meet the aquatic equivalent:
| the Data Trawler. To find out more, contact our solution
| consultants today!
| MikeDelta wrote:
| 'Tis a field riddled with yachts... but where are all the
| customers' yachts?
| Avicebron wrote:
| I think the really interesting point is slapping the title of
| engineer/scientist onto anything and everything, regardless of
| the accreditation actually handed out. Coming soon: "cafeteria
| engineer", "janitorial engineer"...
| the-printer wrote:
| Woah, woah, woah. Cool it, buddy.
|
| That's already begun.
| Test0129 wrote:
| The difference, of course, being that other types of engineers
| have to take a PE. The idea of requiring a PE to have that title
| is protectionism, no different than limiting the number of
| graduating doctors to keep salaries high. No one will ask a
| software engineer to build a bridge - relax, your protection
| racket is safe. Software engineer is a title conferred on someone
| who builds systems. It is fitting. And, if we're being honest,
| the average "job threat level" of a so-called "real" engineer is
| about the same as a software engineer's these days anyway. With
| the exception of some niche jobs, every engineer I know is just a
| CATIA/SW/etc. jockey, and the real work is gatekept by
| greybeards.
|
| No one will call someone a cafeteria engineer or janitorial
| engineer. The premise is ridiculous. There is a title called
| "operations engineer" for people who use math to optimize
| processes. Does this one bother you too?
| iamjk wrote:
| I won't be surprised if DE ends up just falling under the
| "software engineering" umbrella as the jobs grow closer together.
| With hybrid OLAP/OLTP databases becoming more popular, the
| skillset delta is definitely smaller than it used to be. Data
| engineers are higher-leverage assets to an organization than they
| have ever been before.
| snapetom wrote:
| I think it's mostly already there, but your big, enterprise
| houses were late getting the memo. About 12 years ago, I switched
| to a DE role/title and have held it ever since. I worked in a
| variety of startups doing DE - moving data from over here to over
| there - with a variety of tools, from orchestration frameworks to
| homegrown code in a variety of languages.
|
| About six years ago, I walked into a local hospital to interview
| for a DE role, and it was very clear that their definition of DE
| was different than mine. The whole dept worked in nothing but
| SQL. I thought I was good with SQL, but they absolutely crucified
| me on SQL and data architecture theory. I ended up getting kicked
| over to a software engineering role, doing DE in another
| capacity, which made more sense for me.
|
| Only now I'm hearing that they're migrating to other tools like
| dbt and requiring their DEs to learn programming languages.
| buscoquadnary wrote:
| Well, my understanding is that a Data Engineer is basically just
| a DevOps engineer, but instead of building infra to run
| applications, they build infra to process, sanitize, and
| normalize data.
| Avalaxy wrote:
| IMHO that is absolutely not doing the role justice. For some
| people that may hold true, but I would expect a data engineer to
| know everything about distributed systems, database indexes, how
| different databases work and why you pick them, partitioning,
| replication, and transactions/locking. These are topics a
| software engineer is typically familiar with. A DevOps engineer
| wouldn't be.
| hbarka wrote:
| Or to denormalize data - the why and how of which the data
| engineer would be most familiar with.
| tbarrera wrote:
| Author here. Of course, data engineering involves building infra
| and being knowledgeable about DevOps practices, but that's not
| the only area data engineers should be familiar with. There are
| many, many more! In my personal experience, sometimes we end up
| not using DevOps best practices because we are spread too thin.
| That's why I believe in specialization within data engineering
| and the surge of "data reliability engineer" and similar titles.
| 3minus1 wrote:
| Yeah, maybe this will happen. Where I work (FAANG), I know that
| DEs get lower compensation than SWEs and SREs.
| rectang wrote:
| I'm a software dev who's been bumping up against the data
| engineering field lately, and I've been dismayed at how many
| popular tools shunt you towards unmaintainable, unevolvable
| system design:
|
| - A predilection for SQL, yielding "get the answer right once"
| big-ball-of-SQL solutions which are infeasible to debug or modify
| without causing regressions.
|
| - Poor support for unit testing.
|
| - Poor support for version control.
|
| - Frameworks over libraries (because the vendors want to lock you
| in).
|
| > data engineers have been increasingly adopting software
| engineering best practices
|
| We can only hope. I think it's more likely that in the near term,
| data engineers will get better and better at prototyping within
| low-code frameworks, and that transitioning from the prototype to
| an evolvable system will get harder.
| MrPowers wrote:
| > A predilection for SQL, yielding "get the answer right once"
| big-ball-of-SQL solutions which are infeasible to debug or modify
| without causing regressions.
|
| Yea, thankfully some frameworks have Python/Scala APIs that let
| you abstract "SQL logic" into programmatic functions that can be
| chained and reused independently, to avoid the big-ball-of-SQL
| problem. The best ones also allow for SQL, because that's the
| best way to express some logic.
|
| > Poor support for unit testing.
|
| I've written pandas, PySpark, Scala Spark, and Dask testing libs.
| Not sure which framework you're referring to.
|
| > Poor support for version control.
|
| The execution platform should make it easy to package up code in
| JAR files / Python wheel files and attach them to your cluster,
| so you can version control the business logic. If not, yea,
| that's a huge limitation.
|
| > Frameworks over libraries (because the vendors want to lock you
| in)
|
| Not sure what you're referring to, but I'm interested in learning
| more.
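| The chaining pattern referred to here, sketched in PySpark
| (DataFrame.transform needs Spark 3.0+; the transforms and input
| frame are invented for the example):
|
|     from pyspark.sql import DataFrame, functions as F
|
|     # Each step is a plain function: reusable and unit-testable alone.
|     def with_normalized_email(df: DataFrame) -> DataFrame:
|         return df.withColumn("email", F.lower(F.trim(F.col("email"))))
|
|     def with_is_active(df: DataFrame) -> DataFrame:
|         return df.withColumn("is_active", F.col("last_seen_days") < 30)
|
|     # users_df: some existing DataFrame with email/last_seen_days columns
|     result = (users_df
|               .transform(with_normalized_email)
|               .transform(with_is_active))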
| forgetfulness wrote:
| They were talking about the "modern data stack", no doubt.
|
| The trend has been to shift as much work as possible to the
| current generation of data warehouses, which hide the programming
| model that Spark on columnar storage provided behind a SQL-only
| interface, reducing the space where you'd use Spark.
|
| That makes it very accessible to write data pipelines using dbt
| (which outcompeted Dataform, though the latter is still kicking),
| but then you don't have the richer programming facilities,
| stricter type systems, tooling, and practices of Python or Scala
| programming. You're in the world of SQL, set back a decade or two
| in testing and checking (and a culture of doing them), with
| little tooling to organize your code.
|
| That is, if the team has resisted the siren songs of a myriad of
| cloud low-code platforms for this or that, which offer even fewer
| facilities to keep the data pipelines under control - "control"
| meaning any of: access, versioning, monitoring, data quality,
| anything really.
| morelisp wrote:
| Let me offer the more blunt materialist analysis: data engineers
| are being deskilled into data analysts and are too blinded by
| shiny cloud advertisements to notice.
|
| (In this view, though, "lack of tests" or whatever is the least
| concern - until someone figures out how to spin up another
| expensive cloud tool selling "testable queries".)
| forgetfulness wrote:
| The "data engineer" became a distinct role to bring software
| engineering practices to data processing; such as those practices
| are, they were a marked improvement over their absence.
|
| Building a bridge from one shore to the other with application
| programming languages and data processing tools that worked much
| closer to other forms of programming was a huge part of that
| push.
|
| Of course, the big data tools were intricate machines that were
| easy to learn and very hard to master, and data engineers had to
| be pretty sophisticated.
|
| So it became cheaper to move much of that apparatus to data
| warehouses and, as you said, commoditize the building of data
| pipelines that way.
|
| Software is as widespread as it is today because in every
| generation the highly skilled priestly classes that were needed
| to get the job done were displaced by people with less training,
| enabled by new tools or hardware; otherwise it'd all still be
| rocket simulations done by PhD physicists.
|
| But the technical debt from this shift will be hefty.
| chrisjc wrote:
| FYI:
|
| > write data pipelines then using dbt (which outcompeted
| Dataform, though the latter is still kicking), but then you don't
| have the richer programming facilities, stricter type systems,
| tooling and the practices of Python or Scala programming, you're
| in the world of SQL...
|
| Recently announced, and limited to only a handful of data
| platforms, but dbt now supports Python models:
|
| https://docs.getdbt.com/docs/building-a-dbt-project/building...
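| For context, a dbt Python model is just a function that returns a
| DataFrame. A minimal sketch (dbt >= 1.3 on one of the supported
| warehouse adapters; model and column names are invented):
|
|     # models/orders_enriched.py
|     def model(dbt, session):
|         dbt.config(materialized="table")
|         orders = dbt.ref("stg_orders")  # upstream dbt model as a DataFrame
|         # Whatever frame is returned is what dbt materializes
|         # in the warehouse.
|         return orders.where(orders["amount"] > 0)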
| MrPowers wrote:
| > The trend has been to shift as much work possible to the
| current generation of Data Warehouses, that abstract the
| programming model that Spark on columnar storage provided with
| only a SQL interface, reducing the space where you'd use Spark.
|
| I feel like there are some data professionals that only want to
| use SQL, and other data professionals that only want to use
| Python. I feel like the trend is to provide users with interfaces
| that let them be productive. I could be misreading the trend, of
| course.
| morelisp wrote:
| It's very unclear to me that anyone is more productive under
| these new tooling stacks. I'm certain they're not more productive
| commensurately with the new costs and long-term risks.
| marcosdumay wrote:
| > stricter type systems ... the practices of Python or Scala
|
| I do understand what you are talking about. But I really think
| you and the OP are both complaining about the wrong problem.
|
| SQL doesn't require bad practices, doesn't inherently harm
| composability (in the way the OP was referring to), and doesn't
| inherently harm verification. Instead, it has stronger support
| for many of those than the languages you want to replace it with.
|
| The problems you are talking about are very real. But they do not
| come from the language. (SQL does bring a few problems by itself,
| but they are much more subtle than those.)
| forgetfulness wrote:
| At least BigQuery does a fair bit of typechecking and gives error
| messages in a way that's up to the par of application programming
| (e.g., not letting you pass a timestamp to a DATE function, and
| stating that there's no matching signature).
|
| But a tool that doesn't "require" bad practices, yet doesn't
| require good practices either, makes your work harder in the long
| run.
|
| Tooling is poor; the best IDE-similes you got until recently were
| of the type that connects to a live environment but doesn't tie
| to your codebase, and encourages you to put your code directly in
| the database rather than in version control - the problems of
| developing with a REPL, with little in the way of mitigating
| them. I'm talking, of course, about the problem of having view
| and function definitions live in the database with no tools to
| statically navigate the code.
|
| Testing used to be completely hand-rolled, if anyone bothered
| with it at all.
|
| That was until now, when data pipeline orchestration tools exist
| and let you navigate the pipeline as a dependency graph - a
| marked improvement. But until dbt's Python support is ready for
| production, we're talking here of a graph of Jinja templates and
| YAML definitions, with modest support for unit testing.
|
| Dataform is a bit better, but virtually unknown, and it was
| greatly hindered by the Google acquisition.
|
| Functions have always been clunky and still are.
|
| RDDs and then, to a lesser extent, DataFrames offered a much
| stronger programming model, but they were still subject to a lack
| of programming discipline from data engineers in many shops. The
| results of undisciplined SQL programming, however, are on a
| different scale, and it's downright hard to be disciplined when
| using it.
|
| The trend of moving from ETL to ELT, I feel, shouldn't have been
| unquestioningly handed over to untyped DataFrames and then SQL.
| mynameisash wrote:
| I'm also a software engineer, though I've had the unofficial
| title "data engineer" applied to me for quite some time now.
|
| The more I work with tools like Spark, the more dissatisfied I
| become with the data engineering world. Spark is a hot mess of
| fiddling with configuration - I've lost more productivity to
| dealing with memory limits, executor counts, and other
| configuration than I think is reasonable.
|
| Pandas is another one. It was good enough at making quick
| processing concise that it got significant uptake and became de
| facto. The API is a pain, though, and processing is slow. Now,
| couple pandas and Spark in your day-to-day job and you get what I
| see from my data science colleagues: "I'll hack together some
| processing in pandas until my machine can't handle any more data,
| at which point I'll throw it into Spark and provision a bunch of
| nodes to do that work for me." I don't mean that to sound
| pejorative, as they're generally just trying to be productive,
| but there's so little attention paid to real efficiency in the
| frameworks and infrastructure that we're blowing through compute,
| memory, and person-hours unnecessarily (IMHO).
| rectang wrote:
| > Pandas is another one.
|
| But at least if I write a transform in pandas, it's
| straightforward to unit test: create a DataFrame with some dummy
| data, send it through the function which wraps the transform, and
| test that what comes out is what's expected.
|
| Validating a transform done in SQL is not nearly as
| straightforward. For starters, it needs to be an integration test
| rather than a unit test (because you need a database). And that's
| assuming there's even a way to hook unit tests into your
| framework.
|
| I'm not a huge fan of pandas - it's _way_ too prone to silent
| failure. I've written wrappers around e.g. read_csv which are
| there to defeat all the magic. But at least I can do that with
| Python code instead of being stuck with the big-ball-of-SQL
| (e.g. some complicated view SELECT statement that does a bunch of
| JOINs, CASE tests, and casts).
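| The pandas version of that test really is as short as it sounds -
| a sketch, with the transform and data invented for the example:
|
|     import pandas as pd
|
|     def add_order_size(df: pd.DataFrame) -> pd.DataFrame:
|         out = df.copy()
|         # Bucket order amounts into named size categories
|         out["size"] = pd.cut(out["amount"],
|                              bins=[0, 50, 500, float("inf")],
|                              labels=["small", "medium", "large"])
|         return out
|
|     def test_add_order_size():
|         dummy = pd.DataFrame({"amount": [10, 100, 1000]})
|         result = add_order_size(dummy)
|         assert list(result["size"]) == ["small", "medium", "large"]
|
| No database, no cluster; it runs in milliseconds under pytest.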
| dang wrote:
| Please stick to plain text - formatting gimmicks are distracting,
| not the convention here, and tend to start an arms race of
| copycats.
|
| I've put hyphens there now. Your comment is otherwise fine, of
| course.
| nmarinov wrote:
| Out of curiosity, could you give or point to an example of
| "formatting gimmicks"?
|
| I tried searching the FAQ but only found the formatdoc guide:
| https://news.ycombinator.com/formatdoc
| rectang wrote:
| OT...
|
| I used one of the Unicode bullets instead of a hyphen (*). I
| understand what dang is getting at here - it's in the same spirit
| as stripping emojis (because otherwise we'd be drowning in them).
| I'm a past graphic designer, so I'm accustomed to working with
| the medium to achieve maximum communicative impact, but I don't
| mind operating within constraints.
|
| The convention of using hyphens to start a bulleted list isn't
| officially documented, AFAIK. Having to separate bulleted list
| items with double newlines is a little weird, but it's fine.
| webshit2 wrote:
| As someone who knows nothing about this stuff, I'm looking at the
| "Data Mart" wiki page: https://en.wikipedia.org/wiki/Data_mart.
| OK, so the entire diagram here is labelled "Data Warehouse", and
| within that there's a "Data Warehouse" block which seems to be
| solely comprised of a "Data Vault". Do you need a special data
| key to get into the data vault in the data warehouse? Okay,
| naturally the data marts are divided into normal marts and
| strategic marts - seems smart. But all the arrows between
| everything are labelled "ETL". Seems redundant. What does it mean
| anyway? OK, apparently it's just... moving data.
|
| Now I look at
| https://en.wikipedia.org/wiki/Online_analytical_processing.
| What's that? First sentence: "is an approach to answer
| multi-dimensional analytical (MDA)". I click through to
| https://en.wikipedia.org/wiki/Multidimensional_analysis ... MDA
| "is a data analysis process that groups data into two categories:
| data dimensions and measurements". What the fuck? Who wrote this?
| Alright, back on the OLAP wiki page... "The measures are placed
| at the intersections of the hypercube, which is spanned by the
| dimensions as a vector space." Ah yes, the intersections... why
| not keep math out of it if you have no idea how to talk about it?
| Also, there's no actual mention of why this is considered
| "online" in the first place. I feel like I'm in a nightmare where
| the pandas documentation was rewritten in MBA-speak.
| rjbwork wrote:
| It's a difficult sphere of knowledge to penetrate. All of that is
| perfectly coherent to me, FWIW.
|
| For first principles, I can highly recommend Ralph Kimball's
| primer, The Data Warehouse Toolkit: The Definitive Guide to
| Dimensional Modeling. [1]
|
| [1] https://www.amazon.com/gp/product/B00DRZX6XS
___________________________________________________________________
(page generated 2022-10-24 23:00 UTC)