[HN Gopher] The evolution of the data engineer role
       ___________________________________________________________________
        
       The evolution of the data engineer role
        
       Author : Arimbr
       Score  : 123 points
       Date   : 2022-10-24 14:30 UTC (8 hours ago)
        
 (HTM) web link (airbyte.com)
 (TXT) w3m dump (airbyte.com)
        
       | unholyguy001 wrote:
        | It's fascinating that the conclusion / forecast is that tools
        | will abstract away engineering problems and DE will move closer
        | to the business, while over the last 20 years the exact opposite
        | has happened: the toolset has actually become harder to use (not
        | easier) but orders of magnitude more powerful, and DE has moved
        | closer to engineering, to the point where a good data engineer
        | basically is a specialized software engineer.
       | 
        | The absolute pinnacle of "easy to use" was probably the
        | Informatica / Oracle stack of the late 90's and early 00's. It
        | just wasn't powerful or scalable enough to meet the needs of the
        | Big Data shift.
       | 
       | Of course I guess this makes sense given the author works for a
       | company with a vested interest in reversing that trend.
        
         | hnews_account_1 wrote:
          | I think those tools were easy to use for their time. Even
          | advanced versions of those tools would struggle against the
          | data needs of today, which have become incredibly bespoke. My skill
         | set extends all the way from my actual industry (finance) to
         | the boundary of software development. I also have data, big
          | data and cluster usage skills (Slurm, etc.). I don't use
         | everything every day and obviously I cannot be a specialist in
         | most of this stuff (I concentrate on finance more than anything
         | else) considering the incredible range, but this is just the
         | past 2 years for me.
         | 
          | Looking around today, I cannot imagine a less specialized
          | future where some nice tool does 80% of my work. Not because
          | the work
         | I do is difficult to automate. But because the work I do won't
         | match the work other industries may do (beyond existing
         | generalizations of pandas, regression toolkits and other low
         | level stuff). There's no point building a full automation suite
         | just for my single work profile which itself will differ from
         | other areas of finance.
        
       | prions wrote:
        | IMO data engineering is already a specialized form of software
        | engineering. However, what people interpret as DEs being slow to
        | adopt best practices from traditional software engineering is
        | more about the unique difficulties of working with data
        | (especially at scale) and less about a lack of awareness or
        | desire to use best practices.
       | 
       | Speaking from my DE experience at Spotify and previously in
       | startup land, the biggest challenge is the slow and distant
       | feedback loop. The vast majority of data pipelines don't run on
       | your machine and don't behave like they do on a local machine.
       | They run as massively distributed processes and their state is
       | opaque to the developer.
       | 
       | Validating the correctness of a large scale data pipeline can be
       | incredibly difficult as the successful operation of a pipeline
       | doesn't conclusively determine whether the data is actually
       | correct for the end user. People working seriously in this space
       | understand that traditional practices here like unit testing only
       | go so far. And integration testing really needs to work at scale
       | with easily recyclable infrastructure (and data) to not be a
       | massive drag on developer productivity. Even getting the correct
       | kind of data to be fed into a test can be very difficult if the
       | ops/infra of the org isn't designed for it.
       | 
       | The best data tooling isn't going to look exactly like
       | traditional swe tooling. Tools that vastly reduce the feedback
       | loop of developing (and debugging) distributed pipelines running
       | in the cloud and also provide means of validating the output on
        | meaningful data is where tooling should be going. Efforts to
        | shoehorn in traditional SWE best practices will really only take
        | off once that kind of developer experience is realized.
        
         | mywittyname wrote:
         | > Validating the correctness of a large scale data pipeline can
         | be incredibly difficult as the successful operation of a
         | pipeline doesn't conclusively determine whether the data is
         | actually correct for the end user. People working seriously in
         | this space understand that traditional practices here like unit
         | testing only go so far.
         | 
          | I'm glad to see someone calling this out because the comments
          | here are a sea of "data engineering needs more unit tests."
         | Reliably getting data into a database is rarely where I've
         | experienced issues. That's the easy part.
         | 
         | This is the biggest opportunity in this space, IMHO, since
         | validation and data completeness/accuracy is where I spend the
         | bulk of my work. Something that can analyze datasets and
         | provide some sort of ongoing monitoring for confidence on the
         | completeness and accuracy of the data would be great. These
         | tools seem to exist mainly in the network security realm, but
          | I'm sure they could be generalized to the DE space. When I
          | can't leverage a second system for validation, I will generally
          | run some rudimentary statistics to check whether the volume and
          | types of data I'm getting are similar to what's expected.
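          | 
          | Something like this minimal pandas sketch (paths, columns, and
          | thresholds are all made up):
          | 
          |     import pandas as pd
          |     
          |     df = pd.read_parquet("landing/orders/2022-10-24.parquet")
          |     
          |     # volume check: today's row count vs. a trailing baseline
          |     baseline = 1_200_000  # e.g. a trailing 7-day mean
          |     assert 0.5 * baseline <= len(df) <= 2.0 * baseline, \
          |         "row count anomaly"
          |     
          |     # type checks: required columns exist with expected dtypes
          |     expected = {"order_id": "int64", "amount": "float64"}
          |     for col, dtype in expected.items():
          |         assert col in df.columns, f"missing column {col}"
          |         assert str(df[col].dtype) == dtype, f"bad dtype: {col}"
          |     
          |     # completeness check: no nulls in the key column
          |     assert df["order_id"].notna().all(), "null order_id values"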
        
           | abrazensunset wrote:
           | There is a huge round of "data observability" startups that
           | address exactly this. As a category it was overfunded prior
           | to the VC squeeze. Some of them are actually good.
           | 
           | They all have various strengths and weaknesses with respect
           | to anomaly detection, schema change alerts, rules-based
           | approaches, sampled diffs on PRs, incident management,
           | tracking lineage for impact analysis, and providing
           | usage/performance monitoring.
           | 
           | Datafold, Metaplane, Validio, Monte Carlo, Bigeye
           | 
           | Great Expectations has always been an open source standby as
           | well and is being turned into a product.
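            | 
            | For a taste of the rules-based approach, a minimal Great
            | Expectations sketch (classic pandas API; file and column
            | names made up):
            | 
            |     import great_expectations as ge
            |     
            |     # load a CSV as a PandasDataset with expect_* methods
            |     df = ge.read_csv("exports/users.csv")
            |     df.expect_column_values_to_not_be_null("user_id")
            |     df.expect_column_values_to_be_between(
            |         "age", min_value=0, max_value=120)
            |     print(df.validate())  # which expectations passed/failed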
        
         | robertlagrant wrote:
         | I've worked with medium-sized ETL, and not only does it have
         | unique challenges, it's a sub-domain that seems to reward quick
         | and dirty and "it works" over strong validation.
         | 
          | The key problem is that the more you validate incoming data,
          | the more you can demonstrate correctness, but then the more
          | often data coming in will be rejected, and you will be paged
          | out of hours :)
        
           | conkeisterdoor wrote:
            | I also manage a medium-sized set of ETL pipelines (approx 40
            | pipelines across 13k-ish lines of Python) and have a very
           | similar experience.
           | 
           | I've never been in a SWE role before, but am related to and
           | have known a number of them, and have a general sense of what
           | being a SWE entails. That disclaimer out of the way, it's my
           | gut feeling that a DE typically does more "hacky" kind of
           | coding than a SWE. Whereas SWEs have much more clearly
           | established standards for how to do certain things.
           | 
           | My first modules were a hot nasty mess. I've been refactoring
           | and refining them over the past 1.5 years so they're more
           | effective, efficient, and easier to maintain. But they've
           | always just worked, and that has been good enough for my
           | employer.
           | 
           | I have one 1600 line module solely dedicated to validating a
           | set of invoices from a single source. It took me months of
           | trial and error to get that monster working reliably.
        
         | azurezyq wrote:
          | This is actually a great observation. Data pipelines are often
          | written in various languages, running on heterogeneous systems,
          | with different time alignment schemes. I've always found it
          | tricky to "fully trust" a piece of result. Hmm, any best
          | practices from your side?
        
           | oa335 wrote:
            | Not OP, but a Data Engineer with 4 years of experience in the
            | space - I think the key is to first build the feedback loop -
            | i.e. anything that helps you answer how you know the data
            | pipeline is flowing and that the data is correct - then get
            | sign-off from both the producers and consumers of the data.
            | Actually getting the data flowing is usually pretty easy
            | after both parties agree about what that actually means.
        
       | Tycho wrote:
       | I would describe myself as a _dataframe_ engineer.
        
       | usgroup wrote:
       | I live in the world of data lakes and elaborate pipelines. Now
       | and again I get to use a traditional star schema data warehouse
       | and ... it is an absolute pleasure to use in contrast to modern
       | data access patterns.
        
       | sherifnada wrote:
       | In some sense, Data engineering today is where software
       | engineering was a decade ago:
       | 
       | - Infrastructure as code is not the norm. Most tools are UI-
       | focused. It's the equivalent of setting up your infra via the AWS
       | UI.
       | 
       | - Prod/Staging/Dev environments are not the norm
       | 
       | - Version Control is not a first class concept
       | 
       | - DRY and component re-use is exceedingly difficult (how many
       | times did you walk into a meeting where 3 people had 3 different
       | definitions of the same metric?)
       | 
        | - API interfaces are rarely explicitly defined, and fickle when
        | they are (the hot name for this nowadays is "data contracts";
        | see the sketch below)
        | 
        | - unit/integration/acceptance testing is not nearly as
        | ubiquitous as it is in software
       | 
       | On the bright side, I think this means DE doesn't need to re-
       | invent the wheel on a lot of these issues. We can borrow a lot
       | from software engineering.
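        | 
        | To make the "data contracts" point concrete, a minimal sketch
        | using pydantic (v1-style API; the event and its fields are made
        | up):
        | 
        |     from datetime import datetime
        |     from pydantic import BaseModel
        |     
        |     class OrderEvent(BaseModel):
        |         """Schema both producer and consumer agree to honor."""
        |         order_id: int
        |         amount: float
        |         created_at: datetime
        |     
        |     # reject malformed records at the boundary, not downstream
        |     record = {"order_id": "42", "amount": "19.99",
        |               "created_at": "2022-10-24T14:30:00"}
        |     event = OrderEvent(**record)  # coerces; raises if invalid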
        
         | beckingz wrote:
          | You're talking about analytics, not data engineering.
         | 
         | But yes, Data Analysis still needs more of this, though the
         | smarter folks are getting on the Analytics Engineering /
         | DataOps trains.
        
         | mywittyname wrote:
         | My DE team has all of these, and I've never worked on a team
         | without them. I speak as someone whose official title has been
         | Data Engineer since 2015 and I've consulted for lots of F500
         | companies.
         | 
         | Unit testing is the only thing we tend to skip, mainly because
         | it's more reliable to allow for fluidity in the data that's
         | being ingested. Which is really easy now that so many databases
         | can support automatic schema detection. External APIs can
         | change without notice, so it's better to just design for that,
         | then use the time you would spend on unit tests to build alerts
         | around automated data validation.
        
       | swyx wrote:
       | > Titles and responsibilities will also morph, potentially
       | deeming the "data engineer" term obsolete in favor of more
       | specialized and specific titles.
       | 
       | "analytics engineer" is mentioned but also just had its first
       | conference at dbt's conference. all the talks are already up
       | https://coalesce.getdbt.com/agenda/keynote-the-end-of-the-ro...
        
         | pjot wrote:
         | Just to clarify, last week was dbt's first _in person_
         | conference. Third overall.
        
       | iblaine wrote:
        | A full history of DE should include some of the original low-code
        | tools (Cognos, Informatica, SSIS). To some extent, the failure of
        | these tools to adapt to the evolution of the DE role has led to
        | our modern data stack.
        
         | Eumenes wrote:
         | Agreed. This is the first thing I thought about - the evolution
         | from reporting systems to ETL code to Hadoop to Spark, etc.
        
       | MrPowers wrote:
       | Great article.
       | 
       | > data engineers have been increasingly adopting software
       | engineering best practices
       | 
       | I think the data engineering field is starting to adopt some
       | software engineering best practices, but it's still really early
       | days. I am the author of popular Spark testing libraries (spark-
       | fast-tests, chispa) and they definitely have a large userbase,
       | but could also grow a lot.
       | 
       | > The way organizations structure data teams has changed over the
       | years. Now we see a shift towards decentralized data teams, self-
       | serve data platforms, and ways to store data beyond the data
       | warehouse - such as the data lake, data lakehouse, or the
       | previously mentioned data mesh - to better serve the needs of
       | each data consumer.
       | 
       | I think the Lakehouse architecture is the real future of data
       | engineering, see the paper:
       | https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
       | 
       | Disclosure: I am on the Delta Lake team, but joined because I
       | believe in the Lakehouse architecture vision.
       | 
       | It will take a long time for folks to understand all the
       | differences between data lakes, Lakehouses, data warehouses, etc.
       | Over time, I think mass adoption of the Lakehouse architecture is
       | inevitable (benefits of open file formats, no lock in, separating
       | compute from storage, cost management, scalability, etc.).
        
         | victor106 wrote:
         | > It will take a long time for folks to understand all the
         | differences between data lakes, Lakehouses, data warehouses,
         | etc.
         | 
         | What are some good resources that can help educate folks on
         | these differences?
        
           | claytonjy wrote:
           | Short version:
           | 
           | - data warehouse: schema on write. you have to know the end
           | form before you load it. breaks every time upstream changes
           | (a lot, in this world)
           | 
           | - data lake: schema on read. load everything into S3 and deal
           | with it later. Mongo for data platforms
           | 
           | - data lakehouse: something in between. store everything
           | loosely like a lake, but have in-lakehouse processes present
           | user-friendly transforms or views like a warehouse. Made
           | possible by cheap storage (parquet on S3), reduces schema
           | breakage by keeping both sides of the T in the same place
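            | 
            | Roughly, in pandas terms (paths and columns made up):
            | 
            |     import pandas as pd
            |     
            |     # schema on write: cast/validate before loading;
            |     # fails fast when upstream drifts
            |     df = pd.read_json("raw/events.json", lines=True)
            |     df = df.astype({"user_id": "int64", "amount": "float64"})
            |     
            |     # schema on read: land raw data untouched, impose
            |     # structure only at query time
            |     raw = pd.read_json("raw/events.json", lines=True)
            |     typed = raw[["user_id", "amount"]].apply(
            |         pd.to_numeric, errors="coerce")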
        
             | DougBTX wrote:
             | Materialised views for cloud storage?
        
           | MrPowers wrote:
           | I am working on some blogs / videos that will hopefully help
           | clarify the differences. I'm working on a Delta Lake vs
           | Parquet blog post right now and gave a 5 Reasons Parquet
           | files are better than CSV talk last year:
           | https://youtu.be/9LYYOdIwQXg
           | 
           | Most of the content that I've seen in this area is really
           | high-level. I'm trying to write posts that are a bit more
           | concrete with some code snippets / high level benchmarks,
           | etc. Hopefully this will help.
        
           | abrazensunset wrote:
           | "Lakehouse" usually means a data lake (bunch of files in
           | object storage with some arbitrary structure) that has an
           | open source "table format" making it act like a database.
           | E.g. using Iceberg or Delta Lake to handle deletes,
           | transactions, concurrency control on top of parquet (the
           | "file format").
           | 
           | The advantage is that various query engines will make it
           | quack like a database, but you have a completely open interop
           | layer that will let any combination of query engines (or just
           | SDKs that implement the table format, or whatever) coexist.
           | And in addition, you can feel good about "owning" your data
           | and not being overtly locked in to Snowflake or Databricks.
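            | 
            | A minimal PySpark sketch of what the table format buys you
            | (assumes the delta-spark package is installed; path made up):
            | 
            |     from pyspark.sql import SparkSession
            |     
            |     spark = (SparkSession.builder
            |         .config("spark.sql.extensions",
            |                 "io.delta.sql.DeltaSparkSessionExtension")
            |         .config("spark.sql.catalog.spark_catalog",
            |                 "org.apache.spark.sql.delta.catalog"
            |                 ".DeltaCatalog")
            |         .getOrCreate())
            |     
            |     # parquet files plus a transaction log = ACID writes
            |     spark.range(5).write.format("delta").save("/tmp/events")
            |     
            |     # time travel: read an older version of the same table
            |     (spark.read.format("delta")
            |         .option("versionAsOf", 0).load("/tmp/events").show())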
        
           | swyx wrote:
           | (OP's coworker) We actually published a guide on data
           | lakes/lakehouses last month! https://airbyte.com/blog/data-
           | lake-lakehouse-guide-powered-b...
           | 
           | covering:
           | 
           | - What's a Data Lake and Why Do You Need One?
           | 
           | - What's the Differences between a Data Lake, Data Warehouse,
           | and Data Lakehouse
           | 
           | - Components of a Data Lake
           | 
           | - Data Lake Trends in the Market
           | 
           | - How to Turn Your Data Lake into a Lakehouse
        
         | oa335 wrote:
         | I am a data engineer, and I STILL don't understand the
         | differences between the following terms:
         | 
         | 1. Data Warehouse
         | 
         | 2. Datalake
         | 
         | 3. Data Lakehouse
         | 
         | 4. Data Mesh
         | 
         | Can someone please clearly explain the differences between
         | these concepts?
        
         | chrisjc wrote:
         | You mean Lake 'Shanty' Architecture (think DataSwamp vs
         | DataLake) am I right?
         | 
         | But in all seriousness, I totally agree with your opinion on
         | LakeHouse Architecture and am especially excited about Apache
         | Iceberg (external table format) and the support and attention
         | it's getting.
         | 
          | That said, I don't think that selecting any of these data
          | technologies/philosophies comes down to making a mutually
          | exclusive decision. In my opinion, they either build on or
          | complement each other quite nicely.
         | 
         | For those that are interested, here are my descriptions of
         | each...
         | 
         | Data Lake Arch - all of your data is stored on blob-storage
         | (S3, etc) in a way that is partitioned thoughtfully and easily
         | accessible, along with a meta-index/catalogue of what data is
         | there, and where it is.
         | 
         | Lake House Arch - similar to a DataLake but data is structured
         | and mutable, and hopefully allows for transactions/atomic-ops,
         | schema evolution/drift, time-travel/rollback, so on... Ideally
         | all of the properties that you usually assume to get with any
         | sort of OLAP (maybe even OLTP) DB table. But the most important
         | property in my opinion is that the table is accessible through
         | any compatible compute/query engine/layer. Separating storage
         | and compute has revolutionized the Data Warehouse as we know
         | it, and this is the next iteration of this movement in my
         | opinion.
         | 
         | Data Mesh/Grid Arch - designing how the data moves from a
         | source all the way through each and every target while
         | storing/capturing this information in an accessible
         | catalogue/meta-database even as things transpire and change. As
         | a result it provides data lineage and provenance, potentially
         | labeling/tagging, inventory, data-dictionary-like information
         | etc... This one is the most ambiguous and maybe most difficult
         | to describe and probably design/implement, and to be honest
         | I've never seen a real-life working example. I do think this
         | functionality is a critical missing piece of the data stack,
         | whether the solution is a Data Mesh/Grid or something else.
          | Data Engineers have their work cut out for them on this one,
          | mostly because this is where their paths cross with those of
          | Application/Service Developers and Software Engineers. In my
         | opinion, developers are usually creating services/applications
         | that are glorified CRUD wrappers around some kind of
         | operational/transactional data store like MySQL, Postgres,
          | Mongo, etc. Analytics, reporting, retention, volume, etc. are
          | usually an afterthought and not their problem. Until someone
         | hooks the operational data store up to their SQL IDE or
         | Tableau/Looker and takes down prod. Then along comes the data
         | engineer to come up with yet another ETL/ELT to get the data
         | out of the operational data store and into a data warehouse so
         | that reports and analytics can be run without taking prod down
         | again.
         | 
         | Data Warehouse (modern) - Massive Parallel Processing (MPP)
         | over detached/separated columnar (for now) data. Some Data
         | Warehouses are already somewhat compatible with Data Lakes
         | since they can use their MPP compute to index and access
         | external tables. Some are already planning to be even more Lake
         | House compatible by not only leveraging their own MPP compute
         | against externally managed tables (eg), but also managing
         | external tables in the first place. That includes managing
         | schemas and running all of the DDLs (CREATE, ALTER, DROP, etc)
         | as well as DQLs (SELECT) and DMLs (MERGE, INSERT, UPDATE,
         | DELETE, ...). Querying data across native DB tables, external
         | tables (potentially from multiple Lake Houses, Data Lakes) all
         | becomes possible with a join in a SQL statement. Additionally
         | this allows for all kinds of governance related functionality
         | as well. Masking, row/column level security, logging, auditing,
         | so on.
         | 
          | As you might be able to tell from this post (and my post
          | history), I'm a big fan of Snowflake. I'm excited for
          | Snowflake-managed Iceberg tables whose data can then be
          | consumed with a different compute/query engine. Snowflake (or
          | another modern
         | DW) could prepare the data (ETL/calc/crunch/etc) and then
         | manage (DDL & DML) it in an Iceberg table. Then something like
         | DuckDB could consume the Iceberg table schema and listen for
         | table changes (oplog?), and then read/query the data performing
         | last-mile analytics (pagination, order, filter, aggs, etc).
         | 
         | DuckDB doesn't support Apache Iceberg, but it can read parquet
         | files which are used internally in Iceberg. Obviously
         | supporting external tables is far more complex than just
         | reading a parquet file, but I don't see why this isn't in their
         | future. DuckDB guys, I know you're out there reading this :)
         | 
         | https://iceberg.apache.org/
         | 
         | https://www.snowflake.com/guides/olap-vs-oltp
         | 
         | https://www.snowflake.com/blog/iceberg-tables-powering-open-...
         | 
         | Finally one of my favorite articles:
         | 
         | https://engineering.linkedin.com/distributed-systems/log-wha...
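          | 
          | In the meantime, DuckDB can already do last-mile analytics
          | straight off the parquet files - a minimal sketch (bucket and
          | columns made up; httpfs is DuckDB's extension for s3:// paths):
          | 
          |     import duckdb
          |     
          |     con = duckdb.connect()
          |     con.execute("INSTALL httpfs")
          |     con.execute("LOAD httpfs")
          |     daily = con.execute("""
          |         SELECT date_trunc('day', ts) AS day, count(*) AS n
          |         FROM read_parquet('s3://lake/events/*.parquet')
          |         GROUP BY 1 ORDER BY 1
          |     """).df()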
        
           | beckingz wrote:
           | I'm going to use "Lake Shanty" in the future. Powerful phrase
           | to describe what happens when you run aground on the shore of
           | a data swamp.
        
           | oa335 wrote:
           | Great write-up. I would add that I actually have seen
           | something like a "Data Mesh" architecture, at a bank of all
           | places. The key was a very stable, solid infrastructure and
           | dev platform, as well as a custom Python library that worked
           | across that Platform which was capable of ELT across all
           | supported datastores and would properly log/annotate/catalog
           | the data flows. Such a thing is really only possible when the
           | platform is actually very stable and devs are somewhat forced
           | to use the library.
        
       | z3c0 wrote:
       | For those of you who are genuinely curious why this field has so
       | many similarly-named roles, here's a sincere, non-snarky, non-
       | ironic explanation:
       | 
       | A Data Analyst is distinct from a Systems Analyst or a Business
       | Analyst. They may perform both systems and business analysis
       | tasks, but their distinction comes from their understanding of
       | statistics and how they apply that to other forms of analysis.
       | 
        | An ML specialist is not a Data Scientist. Did you successfully
       | build and deploy an ML model to production? Great! That's still
       | uncommon, despite the hype. However, that would land you in the
       | former position. You can claim the latter once you've walked that
       | model through the scientific method, complete with hypothesis
       | verification and democratization of your methodology.
       | 
       | A BI Engineer and a Data Engineer are going to overlap a lot, but
       | the former is going to lean more towards report development,
       | where the latter will spend more time with ELTs/ETLs. As a data
       | engineer, most of the report development that I do is to report
       | on the state of data pipelines. BI BI, I like to call it.
       | 
       | A Big Data Engineer or Specialist is a subset of data engineers
       | and architects angled towards the problems of big data. This
        | distinction actually matters now, because I'm encountering data
        | professionals these days who have never worked outside the cloud
        | or with small enterprise datasets (unthinkable only half a
        | decade ago).
       | 
       | It doesn't help that lack of understanding often leads to
       | misnomer positions, but anybody who has spent time in this field
       | gets used to the subtle differences quickly.
        
         | itsoktocry wrote:
          | What about _Analytics Engineer_, the hypiest-of-the-hyped
         | right now?
        
           | claytonjy wrote:
           | BI engineer that knows dbt
        
         | claytonjy wrote:
         | This strikes me as incredibly rosy; I want to live in this
         | world, but I don't. The world I live in:
         | 
         | - Data Analyst: someone who knows some SQL but not enough
         | programming, so we can pay < 6 figures
         | 
         | - ML specialist: someone who figured out DS is a race to the
         | bottom and ML in a title gets you paid more. Spends most of
         | their time installing pytorch in various places
         | 
         | - BI Engineer: Data Analyst but paid a bit more
         | 
         | - Data Engineer: Airflow babysitter
         | 
          | - Big Data Engineer: middle-aged Scala user, Hadoop babysitter
        
           | tdj wrote:
           | In my experience, I have started to believe ML Engineer is
           | short for "YAML Engineer".
        
             | z3c0 wrote:
              | Out of all the snark in this thread, this is the only bit
              | to elicit a chuckle from me. Thank you.
        
       | mywittyname wrote:
       | > The declarative concept is highly tied to the trend of moving
       | away from data pipelines and embracing data products -
       | 
       | Of course an Airbyte article would say this, because they are
       | selling these tools, but my experience has been the opposite.
       | People buy these tools because they claim to make it easier for
       | non-software people to build pipelines. But the problem is that
       | these tools seem to end up being far more complicated and less
       | reliable than pipelines built in code.
       | 
       | There's a reason that this domain is saturated with so. many.
        | tools. None of them do a great job. And when a company invariably
        | hits the limits of one, they start shopping for a replacement,
        | which will have its own set of limitations. Lather-rinse-repeat.
       | 
       | I built a solid career over the past 8 or so years of replacing
       | these "no code" pipeline tools with code once companies hit the
       | ceilings of these tools. You can get surprisingly far in the data
       | world with Airflow + a large scale database, but all of the major
       | cloud providers have great tool offerings in this space. Plus,
       | for platforms that these tools don't interface with, you're going
       | to have to write code anyway.
        
         | muspimerol wrote:
         | > I built a solid career over the past 8 or so years of
         | replacing these "no code" pipeline tools with code once
         | companies hit the ceilings of these tools.
         | 
         | I'm sure you earn a nice living doing this, but surely this is
         | not a convincing argument against using off-the-shelf data
         | products. It will always come down to the cost (including
         | ongoing maintenance) for the business. Bespoke in-house
         | software is always the most flexible route, but rarely the
         | cheaper one.
        
         | Arimbr wrote:
          | Oh, declarative doesn't necessarily mean no-code. Airbyte data
          | integration connectors are built with SDKs in Python and Java,
          | plus a low-code SDK that was just released...
         | 
         | You can then build custom connectors on top of these and many
         | users actually need to modify an existing connector, but would
         | rather start from a template than from scratch.
         | 
         | Airbyte also provides a CLI and YAML configuration language
         | that you can use to declare sources, destinations and
         | connections without the UI:
         | https://github.com/airbytehq/airbyte/blob/master/octavia-cli...
         | 
         | I agree with you that code is here to stay and power users need
         | to see the code and modify it. That's why Airbyte code is open-
         | source.
        
       | buscoquadnary wrote:
       | Business Analyst, Big Data Specialist, Data Mining Engineer, Data
       | Scientist, Data Engineer.
       | 
        | Why is this field so prone to hype and to repeating the same
        | things with a new coat of paint? I mean, whatever happened to
        | OLAP, data cubes, Big Data, and every other super big next thing
        | that has happened in the past 2 decades?
        | 
        | Methinks the problem with Business Intelligence solving problems
        | is the first part of the term and not the second.
        
         | hatware wrote:
          | > Why is this field so prone to hype and to repeating the same
          | things with a new coat of paint?
         | 
         | Money and Marketing. It's no different from how Hadoop was a
         | big deal around 2010, or how Functional Programming became the
         | new thing from 2015 onwards.
         | 
         | Personally I think this is a failure of regulatory agencies.
        
         | MonkeyMalarky wrote:
          | I dunno, I have to first put my data somewhere though. But
          | where... In a warehouse? Silo? Data lake? Lake house? (I
          | really despise that last one - who could coin that phrase
          | with a straight face?)
        
           | MrPowers wrote:
           | Data warehouse: bundles compute & storage and comes at a
           | comparatively high price point. Great option for certain
           | workflows. Not as great for scaling & non-SQL workflows.
           | 
           | Data lake: typically refers to Parquet files / CSV files in
           | some storage system (cloud or HDFS). Data lakes are better
           | for non-SQL workflows compared to data warehouses, but have a
           | number of disadvantages.
           | 
           | Lakehouse storage formats: Based on OSS files and solve a
           | number of data lake limitations. Options are Delta Lake,
           | Iceberg, and Hudi. Lakehouse storage formats offer a ton of
           | advantages and basically no downsides compared to Parquet
           | tables for example.
           | 
           | Lakehouse architecture: An architectural paradigm to store
            | data in a way such that it's easily accessible for SQL-
           | based and non-SQL-based workflows, see the paper:
           | https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
           | 
           | There are a variety of tradeoffs to be weighed when selecting
           | the optimal solution for your particular needs.
        
             | marktangotango wrote:
             | If this is satire it's brilliant. I don't doubt it's
             | factual, but the last sentence is a slayer.
        
               | fabatka wrote:
                | Can you explain why you find the above explanation
                | amusing? I honestly don't see the absurdity of it,
                | although my livelihood may depend on me not seeing it :)
        
               | victor106 wrote:
                | It seems like you are just being negative about a reply
                | that was meant to genuinely clarify confusing
                | terminology.
                | 
                | If not, please elaborate on why you doubt it's factual.
        
           | PubliusMI wrote:
           | IMHO,
           | 
           | "data lake" = collection of all data sources: HDFS, S3, SQL
           | DB, etc
           | 
           | "lake house" = tool to query and combine results from all
           | such sources: Dremio
        
             | buscoquadnary wrote:
             | But what lake house is complete without a boat.
             | 
             | That's why my company is looking for investors who are
             | interested in being at the forefront of the data
             | revolution, using our data rowboat that will allow you to
             | proactively leverage your data synergies to break down
             | organizational data silos and use analytics to address your
             | core competencies in order to leverage a strategic
             | advantage to become the platform of choice in a holistic
             | environment.
             | 
             | Tell me if this sounds familiar, your company has tons of
             | data but it is spread out all over the place and you can't
             | seem to get good info, you end up hounding engineers to get
             | your reports and provide you information so you can look
             | like you are making data driven decisions. Maybe you've
             | implemented a data lake but now have no idea how to use it.
             | We've got you covered with our patent pending data rowboat
             | solution.
             | 
              | This will allow you to impress everyone else in the mid-
              | level staff meetings by allowing you to say you are doing
             | something around the "data revolution" in your org. The
             | best part is that every implementation will come with a
             | team of our in house consultants that will allow the
             | project to drag on forever so that you always have
             | something to report on in staff meetings and make you look
             | good to your higher ups.
             | 
             | Now you may be an engineer looking to revolutionize your
             | career and get involved in the next step of the glorious
             | October data revolution. Well we've got you covered for a
             | very reasonable price you can enroll in our "data rowboat
             | boot camp", where you will spend hours locked in a room
             | where someone who barely speaks English will read
             | documentation to you.
             | 
              | But act quickly, otherwise you'll end up as one of the data
             | kulaks as the new data rowboat revolution proceeds into a
             | glorious future with our 5 year plan.
        
               | jdgoesmarching wrote:
               | Brb, running to trademark every nautical data metaphor I
               | can get my hands on.
               | 
               | What happens when your data rowboat runs ashore?
               | Introducing Data Tugboat(tm), your single pane of glass
               | solution for shoring up undocumented ETLs and reeling
               | your data lineage safely into harbor.
        
               | MonkeyMalarky wrote:
               | Need to run ML on your data? Try our DeepSea data
               | drilling rigs, delivered in containers!
        
               | MonkeyMalarky wrote:
               | Sir, I'm sorry, but a rowboat just won't scale, my needs
               | are too vast. What I'm proposing is the next level of
               | data extraction. You've heard of data mining? Well meet
               | the aquatic equivalent, the Data Trawler. To find out
               | more, contact our solution consultants today!
        
               | MikeDelta wrote:
               | 'Tis a field riddled with yachts... but where are all the
               | customers' yachts?
        
         | Avicebron wrote:
          | I think the really interesting point is slapping the title
          | of engineer/scientist onto anything and everything
          | regardless of the accreditation actually handed out. Soon
          | coming up: "cafeteria engineer", "janitorial engineer"...
        
           | the-printer wrote:
           | Woah, woah, woah. Cool it, buddy.
           | 
           | That's already begun.
        
           | Test0129 wrote:
           | The difference of course being that other types of engineers
           | have to take a PE. The idea of requiring a PE to have that
           | title is protectionism no different than limiting the number
           | of graduating doctors to keep salaries high. No one will ask
           | a software engineer to build a bridge - relax. Your
           | protection racket is safe. Software engineer is a title
           | conferred on someone who builds systems. It is fitting. And,
           | if we're being honest, the average "job threat level" of a
           | so-called "real" engineer is about the same as a software
            | engineer these days anyway. With the exception of some niche
            | jobs, every engineer I know is just a CATIA/SW/etc. jockey
            | and the real work is gatekept by greybeards.
           | 
           | No one will call someone a cafeteria engineer or janitorial
           | engineer. The premise is ridiculous. There is a title called
           | "operations engineer" that uses math to optimize processes.
           | Does this one bother you too?
        
       | iamjk wrote:
       | I won't be surprised if DE ends up just falling under the
       | "software engineering" umbrella as the jobs grow closer together.
       | With hybrid OLAP/OLTP databases becoming more popular, the
       | skillset delta is definitely smaller than it used to be. Data
       | Engineers are higher leverage assets to an organization than they
       | ever have been before.
        
         | snapetom wrote:
         | I think it's mostly already there, but your big, enterprise
         | houses were late getting the memo. About 12 years ago, I
         | switched to a DE role/title and held it ever since. I worked in
         | a variety of startups doing DE - moving data from over here to
         | over there, with a variety of tools from orchestration
         | frameworks to homegrown code in a variety of languages.
         | 
         | About six years ago, I walked into a local hospital to
         | interview for a DE role and it was very clear that their
         | definition of DE was different than mine. The whole dept worked
         | in nothing but SQL. I thought I was good with SQL, but they
         | absolutely crucified me on SQL and data architecture theories.
         | I ended up getting kicked over to a software engineering role,
         | doing DE in another capacity, which made more sense for me.
         | 
         | Only now I'm hearing that they're migrating to other tools like
         | dbt and requiring their DEs to learn programming languages.
        
         | buscoquadnary wrote:
         | Well my understanding is that a Data Engineer is basically just
         | a DevOps engineer but instead of building infra to run
         | applications they build infra to process, sanitize and
         | normalize data.
        
           | Avalaxy wrote:
           | Imho that is absolutely not doing the role justice. For some
           | people that may hold true, but I would expect a data engineer
           | to know everything about distributed systems, database
           | indexes, how different databases work and why you pick them,
           | partitioning, replication, transactions/locking. These are
           | topics a software engineer is typically familiar with. A
           | DevOps engineer wouldn't be.
        
           | hbarka wrote:
           | Or to denormalize data, the distinction of which the data
           | engineer would be most familiar with why and how.
        
           | tbarrera wrote:
           | Author here - Of course, data engineering involves building
           | infra and being knowledgeable about DevOps practices, but
           | that's not the only area data engineers should be familiar
            | with. There are many, many more! In my personal experience,
            | sometimes we end up not using DevOps best practices because
            | we are spread too thin. That's why I believe in
            | specialization within data engineering and the rise of the
            | "data reliability engineer" and similar roles.
        
         | 3minus1 wrote:
         | Yeah, maybe this will happen. Where I work (FAANG), I know that
         | DEs get lower compensation than SWEs and SREs.
        
       | rectang wrote:
       | I'm a software dev who's been bumping up against the data
       | engineering field lately, and I've been dismayed as to how many
       | popular tools shunt you towards unmaintainable, unevolvable
       | system design.
       | 
       | - A predilection for SQL, yielding "get the answer right once"
       | big-ball-of-SQL solutions which are infeasible to debug or modify
       | without causing regressions.
       | 
       | - Poor support for unit testing.
       | 
       | - Poor support for version control.
       | 
       | - Frameworks over libraries (because the vendors want to lock you
       | in).
       | 
       | > data engineers have been increasingly adopting software
       | engineering best practices
       | 
       | We can only hope. I think it's more likely that in the near term
       | data engineers will get better and better at prototyping within
       | low-code frameworks, and that transitioning from the prototype to
       | an evolvable system will get harder.
        
         | MrPowers wrote:
         | > A predilection for SQL, yielding "get the answer right once"
         | big-ball-of-SQL solutions which are infeasible to debug or
         | modify without causing regressions.
         | 
          | Yea, thankfully some frameworks have Python / Scala APIs that
          | let you abstract "SQL logic" into programmatic functions that
          | can be chained and reused independently to avoid the big-ball-
          | of-SQL problem. The best ones also allow for SQL because that's
          | the best way to express some logic.
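          | 
          | For instance, with PySpark's DataFrame.transform (a minimal
          | sketch; the column names and logic are made up):
          | 
          |     from pyspark.sql import SparkSession, DataFrame
          |     from pyspark.sql import functions as F
          |     
          |     spark = SparkSession.builder.getOrCreate()
          |     
          |     # small, independently testable transforms instead of
          |     # one giant query
          |     def only_completed(df: DataFrame) -> DataFrame:
          |         return df.filter(F.col("status") == "completed")
          |     
          |     def with_amount_usd(df: DataFrame) -> DataFrame:
          |         return df.withColumn("amount_usd",
          |                              F.col("amount") * 0.98)
          |     
          |     orders = spark.createDataFrame(
          |         [(1, 10.0, "completed"), (2, 5.0, "pending")],
          |         ["order_id", "amount", "status"])
          |     result = (orders.transform(only_completed)
          |                     .transform(with_amount_usd))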
         | 
         | > Poor support for unit testing.
         | 
         | I've written pandas, PySpark, Scala Spark, and Dask testing
         | libs. Not sure which framework you're referring to.
         | 
         | > Poor support for version control.
         | 
         | The execution platform should make it easy to package up code
         | in JAR files / Python wheel files and attach them to your
         | cluster, so you can version control the business logic. If not,
         | yea, that's a huge limitation.
         | 
         | > Frameworks over libraries (because the vendors want to lock
         | you in)
         | 
         | Not sure what you're referring to, but interested in learning
         | more.
        
           | forgetfulness wrote:
           | They were talking about the "modern data stack" no doubt.
           | 
            | The trend has been to shift as much work as possible to the
            | current generation of Data Warehouses, which hide the
            | programming model that Spark on columnar storage provided
            | behind a SQL-only interface, reducing the space where you'd
            | use Spark.
           | 
            | This makes it very accessible to write data pipelines using
            | dbt (which outcompeted Dataform, though the latter is still
            | kicking), but then you don't have the richer programming
            | facilities, stricter type systems, tooling, and practices of
            | Python or Scala programming. You're in the world of SQL, set
            | back a decade or two in testing and checking (and the culture
            | of using them), with few tools to organize your code.
           | 
            | That is, if the team has resisted the siren songs of a
            | myriad of low-code cloud platforms for this or that, which
            | offer even fewer facilities to keep the data pipelines under
            | control - whether by "control" we mean access, versioning,
            | monitoring, data quality, anything really.
        
             | morelisp wrote:
             | Let me offer the more blunt materialist analysis: Data
             | engineers are being deskilled into data analysts and too
             | blinded by shiny cloud advertisements to notice.
             | 
             | (In this view though, "lack of tests" or whatever is the
             | least concern - until someone figures out how to spin up
             | another expensive cloud tool selling "testable queries".)
        
               | forgetfulness wrote:
               | The "data engineer" became a distinct role to bring over
               | Software Engineering practices to data processing; such
               | as those practices are, they were a marked improvement
               | over their absence.
               | 
               | Building a bridge from one shore to the other with
               | application programming languages and data processing
               | tools that worked much closer to other forms of
               | programming was a huge part of that push.
               | 
               | Of course, the big data tools were intricate machines
               | that were easy to learn and very hard to master, and data
               | engineers had to be pretty sophisticated.
               | 
               | So, it became cheaper to move much of that apparatus to
               | data warehouses and, as you said, commoditize the
               | building of data pipelines that way.
               | 
               | Software is as widespread as it is today because in every
               | generation the highly skilled priestly classes that were
               | needed to get the job done were displaced by people with
               | less training enabled by new tools or hardware; else it'd
               | be all rocket simulations done by PhD physicists still.
               | 
               | But the technical debt will be hefty from this shift.
        
             | chrisjc wrote:
             | FYI
             | 
             | > write data pipelines then using dbt (which outcompeted
             | Dataform, though the latter is still kicking), but then you
             | don't have the richer programming facilities, stricter type
             | systems, tooling and the practices of Python or Scala
             | programming, you're in the world of SQL...
             | 
             | Recently announced and limited to only a handful of data
             | platforms, but dbt now supports python models.
             | 
             | https://docs.getdbt.com/docs/building-a-dbt-
             | project/building...
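              | 
              | The shape of one, roughly (a minimal sketch assuming a
              | PySpark-backed platform; the model names are made up):
              | 
              |     # models/orders_enriched.py - a dbt Python model
              |     def model(dbt, session):
              |         # upstream model comes back as a DataFrame
              |         orders = dbt.ref("stg_orders")
              |         return orders.where("amount > 0")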
        
             | MrPowers wrote:
             | > The trend has been to shift as much work possible to the
             | current generation of Data Warehouses, that abstract the
             | programming model that Spark on columnar storage provided
             | with only a SQL interface, reducing the space where you'd
             | use Spark.
             | 
              | I feel like there are some data professionals that
             | only want to use SQL. Other data professionals only want to
             | use Python. I feel like the trend is to provide users with
             | interfaces that let them be productive. I could be
             | misreading the trend of course.
        
               | morelisp wrote:
               | It's very unclear to me that anyone is more productive
               | under these new tooling stacks. I'm certain they're not
               | more productive commensurately with new costs and long-
               | term risks.
        
             | marcosdumay wrote:
             | > stricter type systems ... the practices of Python or
             | Scala
             | 
             | I do understand what you are talking about. But I really
             | think you and the OP are both complaining about the wrong
             | problem.
             | 
              | SQL doesn't require bad practices, doesn't inherently harm
              | composability (in the way the OP was describing), and
              | doesn't inherently harm verification. Instead, it has
              | stronger support for many of those than the languages you
              | want to replace it with.
             | 
             | The problems you are talking about are very real. But they
             | do not come from the language. (SQL does bring a few
             | problems by itself, but they are much more subtle than
             | those.)
        
               | forgetfulness wrote:
                | At least BigQuery does a fair bit of typechecking, and
                | gives error messages in a way that's on par with
                | application programming (e.g. not letting you pass a
                | timestamp to a DATE function, and stating that there's no
                | matching signature).
               | 
               | But a tool that doesn't "require" bad practices but
               | doesn't require good practices either makes your work
               | harder in the long run.
               | 
                | Tooling is poor: the best IDE-similes you got until
                | recently were of the type that connects to a live
                | environment but doesn't tie to your codebase, and
                | encourages you to put your code directly in the database
                | rather than in version control - the problems of
                | developing with a REPL, with little in the way of
                | mitigating them. I'm talking, of course, about the
                | problem of having view and function definitions live in
                | the database with no tools to statically navigate the
                | code.
               | 
               | Testing used to be completely hand rolled if anyone
               | bothered with it at all.
               | 
                | That was until now, when data pipeline orchestration
                | tools exist and let you navigate the pipeline as a
                | dependency graph, a marked improvement. But until dbt's
                | Python support is ready for production, we're talking
                | here of a graph of Jinja templates and YAML definitions,
                | with modest support for unit testing.
               | 
               | Dataform is a bit better but virtually unknown and was
               | greatly hindered by the Google acquisition.
               | 
               | Functions have always been clunky and still are.
               | 
               | RDDs and then, to a lesser extent, Dataframes offered a
               | much stronger programming model, but they were still
               | subject to a lack of programming discipline from data
               | engineers in many shops. The results of that, however,
               | are on a different scale with undisciplined SQL
               | programming, and it's downright hard to be disciplined
               | when using it.
               | 
                | I feel the trend of moving from ETL to ELT shouldn't
                | have been an unquestioning transition to untyped
                | Dataframes and then to SQL.
        
         | mynameisash wrote:
         | I'm also a software engineer, though I've had the unofficial
         | title "data engineer" applied to me for quite some time now.
         | 
         | The more I work with tools like Spark, the more dissatisfied I
         | become with the data engineering world. Spark is a hot mess of
         | fiddling with configuration - I've lost more productivity to
         | dealing with memory limit, executor count, and other
         | configuration than I think is reasonable.
         | 
          | Pandas is another one. It was good enough at making quick
          | processing concise that it got significant uptake and became
          | the de facto standard. The API is a pain, though, and
          | processing is slow. Now,
         | couple Pandas and Spark in your day-to-day job and you get what
         | I see from my data science colleagues: "I'll hack together some
         | processing in Pandas until my machine can't handle any more
         | data, at which point I'll throw it into Spark and provision a
         | bunch of nodes to do that work for me." I don't mean that to
         | sound pejorative, as they're generally just trying to be
         | productive, but there's so little attention paid to real
         | efficiency in the frameworks and infrastructure that we're
         | blowing through compute, memory, and person-hours unnecessarily
         | (IMHO).
        
           | rectang wrote:
           | > Pandas is another one.
           | 
           | But at least if I write a transform in pandas it's
           | straightforward to unit test it: create a DataFrame with some
           | dummy data, send it through the function which wraps the
           | transform, test that what comes out is what's expected.
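            | 
            | For example (a minimal sketch with made-up logic):
            | 
            |     import pandas as pd
            |     import pandas.testing as pdt
            |     
            |     def add_gross(df: pd.DataFrame) -> pd.DataFrame:
            |         # the transform under test
            |         return df.assign(gross=df["net"] * 1.2)
            |     
            |     def test_add_gross():
            |         input_df = pd.DataFrame({"net": [10.0, 20.0]})
            |         expected = pd.DataFrame({"net": [10.0, 20.0],
            |                                  "gross": [12.0, 24.0]})
            |         pdt.assert_frame_equal(add_gross(input_df), expected)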
           | 
           | Validating a transform done in SQL is not nearly as
           | straightforward. For starters it needs to be an integration
           | test not a unit test (because you need a database). And
           | that's assuming there's even a way to hook unit tests into
           | your framework.
           | 
           | I'm not a huge fan of Pandas -- it's _way_ too prone to
            | silent failure. I've written wrappers around e.g. read_csv
           | which are there to defeat all the magic. But at least I can
           | do that with Python code instead of being stuck with the big-
           | ball-of-SQL (e.g. some complicated view SELECT statement that
           | does a bunch of JOINs, CASE tests and casts).
        
         | dang wrote:
         | Please stick to plain text - formatting gimmicks are
         | distracting, not the convention here, and tend to start an arms
         | race of copycats.
         | 
         | I've put hyphens there now. Your comment is otherwise fine of
         | course.
        
           | nmarinov wrote:
           | Out of curiosity, could you give or point to an example of
           | "formatting gimmicks"?
           | 
           | I tried searching the FAQ but only found the formatdoc guide:
           | https://news.ycombinator.com/formatdoc
        
             | rectang wrote:
             | OT...
             | 
             | I used one of the Unicode bullets instead of a hyphen (*).
             | I understand what dang is getting at here -- it's in the
             | same spirit as stripping emojis (because otherwise we'd be
             | drowning in them). I'm a past graphic designer so I'm
             | accustomed to working with the medium to achieve maximum
             | communicative impact, but I don't mind operating within
             | constraints.
             | 
             | The convention of using hyphens to start a bulleted list
             | isn't officially documented AFAIK. Having to separate
             | bulleted list items with double newlines is a little weird,
             | but it's fine.
        
       | webshit2 wrote:
       | As someone who knows nothing about this stuff, I'm looking at the
       | "Data Mart" wiki page: https://en.wikipedia.org/wiki/Data_mart.
       | Ok, so the entire diagram here is labelled "Data Warehouse", and
       | within that there's a "Data Warehouse" block which seems to be
       | solely comprised of a "Data Vault". Do you need a special data
       | key to get into the data vault in the data warehouse? Okay,
       | naturally the data marts are divided into normal marts and
       | strategic marts - seems smart. But all the arrows between
       | everything are labelled "ETL". Seems redundant. What does it mean
       | anyway? Ok apparently it's just... moving data.
       | 
       | Now I look at
       | https://en.wikipedia.org/wiki/Online_analytical_processing.
       | What's that? First sentence: "is an approach to answer multi-
       | dimensional analytical (MDA)". I click through to
       | https://en.wikipedia.org/wiki/Multidimensional_analysis ... MDA
       | "is a data analysis process that groups data into two categories:
       | data dimensions and measurements". What the fuck? Who wrote this?
       | Alright, back on the OLAP wiki page... "The measures are placed
       | at the intersections of the hypercube, which is spanned by the
       | dimensions as a vector space." Ah yes, the intersections... why
       | not keep math out of it if you have no idea how to talk about it?
       | Also, there's no actual mention of why this is considered
       | "online" in the first place. I feel like I'm in a nightmare where
       | the pandas documentation was rewritten in MBA-speak.
        
         | rjbwork wrote:
         | It's a difficult sphere of knowledge to penetrate. All of that
         | is perfectly coherent to me, FWIW.
         | 
          | From first principles, I can highly recommend Ralph Kimball's
         | primer, The Data Warehouse Toolkit: The Definitive Guide to
         | Dimensional Modeling.[1]
         | 
         | [1]https://www.amazon.com/gp/product/B00DRZX6XS
        
       ___________________________________________________________________
       (page generated 2022-10-24 23:00 UTC)