[HN Gopher] How Airbnb Built "Wall" to prevent data bugs
___________________________________________________________________

How Airbnb Built "Wall" to prevent data bugs

Author : charlysl
Score  : 23 points
Date   : 2022-05-18 18:43 UTC (2 days ago)

(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)

| SOLAR_FIELDS wrote:
| I've been in the end stage of this (worked on data validation for
| a good chunk of my career) and these are my thoughts on the
| article:
|
| Determining blocking vs. non-blocking is a big issue: deciding
| which checks should be stoppers and which shouldn't is often a
| matter of extensive debate. In my experience, only a few data
| checks are absolute showstoppers under any circumstance, and a
| lot of things need to spawn tickets that get routed to the
| correct team and followed up on. Some type of tracking system is
| necessary for this.
|
| Defining the logic of the checks themselves in YAML is a trap. We
| went down this DSL route first, and it basically falls apart
| once you want to add moderately complex logic to your check.
| Airbnb will almost certainly discover this eventually. YAML does
| work well for specifying how the check should behave, though
| (e.g., the metadata of the data check). The solution we were
| eventually able to scale up with was coupling a specification in
| a human-readable but parseable file with code, in a single unit
| known as the check. These could then be grouped according to
| various pipeline use cases.
|
| A model that plugs into an Airflow DAG, as Airbnb has designed,
| seems like a good approach. Often when it was time to incorporate
| checks into the pipeline we had heterogeneous strategies to
| invoke our check engines.
| Having a standardized approach helps drive
| adoption across the organization. Oftentimes I've found that
| people are reluctant to run non-critical checks if it's a
| significant time and effort cost, and will only run critical ones
| to try and push data quality accountability either upstream or
| downstream. If it's really easy to turn on and incorporate, that's
| one less excuse that can be used to not run the checks.
|
| testbjjl wrote:
| Maybe Jim Buckmaster and Craig Newmark are taking notes.
___________________________________________________________________
(page generated 2022-05-20 23:00 UTC)
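
The "check" unit SOLAR_FIELDS describes (a parseable spec holding only metadata, coupled with the logic written as ordinary code) could look roughly like the sketch below. This is an illustration under stated assumptions, not Airbnb's Wall or the commenter's actual system: the field names (`owner_team`, `blocking`), the `Check` class, and `run_checks` are all hypothetical, and JSON stands in for YAML only so the example needs no third-party parser.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]

# Hypothetical spec file contents: metadata only, no check logic.
# (JSON is used here in place of YAML to keep the sketch stdlib-only.)
SPEC_TEXT = """
{
  "name": "no_negative_prices",
  "owner_team": "pricing-data",
  "blocking": false
}
"""


@dataclass
class Check:
    """One unit: metadata from the spec plus the logic as plain code."""
    name: str
    owner_team: str
    blocking: bool
    logic: Callable[[List[Row]], List[str]]  # returns failure messages


class BlockingCheckFailed(Exception):
    """Raised when a check marked as blocking finds failures."""


def no_negative_prices(rows: List[Row]) -> List[str]:
    # The logic lives in code, so it can grow arbitrarily complex.
    return [
        f"row {i}: negative price {row['price']}"
        for i, row in enumerate(rows)
        if row["price"] < 0
    ]


def load_check(spec_text: str, logic: Callable[[List[Row]], List[str]]) -> Check:
    spec = json.loads(spec_text)
    return Check(spec["name"], spec["owner_team"], spec["blocking"], logic)


def run_checks(checks: List[Check], rows: List[Row]) -> List[dict]:
    """Stop on blocking failures; turn non-blocking ones into routed tickets."""
    tickets = []
    for check in checks:
        failures = check.logic(rows)
        if not failures:
            continue
        if check.blocking:
            raise BlockingCheckFailed(f"{check.name}: {failures}")
        tickets.append(
            {"check": check.name, "route_to": check.owner_team, "failures": failures}
        )
    return tickets


check = load_check(SPEC_TEXT, no_negative_prices)
tickets = run_checks([check], [{"price": 10.0}, {"price": -3.0}])
```

Flipping `"blocking"` to `true` in the spec turns the same failure into a pipeline stop without touching the logic, which is the kind of split between specification and code the comment argues for.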