[HN Gopher] Launch HN: Elementary (YC W22) - Open-source data ob...
       Launch HN: Elementary (YC W22) - Open-source data observability
       Hey HN! We're Maayan and Or, and we are building Elementary
       (https://github.com/elementary-data/elementary), an open-source
       framework that continuously monitors your data and sends alerts
       when anomalies are detected.

       Elementary can alert you, for example, when a table in Snowflake
       hasn't been updated as expected or when a revenue column has too
       many nulls. It also monitors operations in the data stack and
       supports both impact and root-cause analysis. For example, it can
       alert you when your dbt runs or tests fail, including the
       impacted dependencies. A data lineage graph visualizes the data
       flow and can be used to find the source of invalid data.

       We have both been working in the data space for over a decade,
       Maayan in analytics and Or in data engineering. Despite working
       at very different companies with different data stacks and use
       cases, we had the same reliability problems. Data is incomplete
       and inconsistent, and the abundance of technologies creates more
       complexity and inconsistency. Data reliability issues cause
       distrust, delays, and bad decisions. It's hard to achieve high
       data reliability, detect issues fast, understand impact, and
       resolve them quickly.

       We also found that we had built similar solutions, and as we
       talked to other developers, we learned that most data teams have
       their own version of the same thing. They usually don't go for a
       commercial observability solution unless they have major
       incidents and technical debt. Until that point, they prefer to
       build for themselves, for two reasons: to avoid the overhead of
       procurement and security compliance; and to customize to their
       own stack, data sources, business logic, etc., and have all the
       metadata and metrics in their stack to support additional use
       cases.

       We decided to build an open-source alternative: one that can be
       implemented easily, hosted yourself, and customized. This solves
       the compliance and the data ownership problem. It also solves the
       build-your-own problem, because teams can deploy an extensive
       solution early on, instead of waiting till later when there are
       major problems.

       Elementary stores all the logs, metadata, and metrics it collects
       and generates in the data warehouse, so users can easily add
       their own detections and logic to it. Additionally, the solution
       is dbt native, which means it provides a dbt package that can be
       easily installed in a dbt environment as well as configured
       directly from a dbt project. Because it's part of the existing
       workflow and environment, it is convenient for data engineers,
       analytics engineers, and data analysts to enhance and contribute.

       Open source eliminates the need to pay for getting started or to
       grant access to a third party. A managed service and additional
       enterprise features will be available in the cloud in the future.
       Critically, though, the user's metadata will continue to reside
       in their environment, under their control, and they will still
       have full customization available.

       Currently Elementary supports Snowflake, BigQuery and dbt. It
       collects metadata such as schemas, query logs and dbt artifacts.
       It generates and monitors data quality metrics, sends Slack
       alerts, and visualizes the data lineage. If this is your data
       stack, we'd love you to give it a try!

       We would love to hear your feedback, experiences and ideas from
       trying to solve data observability in your organizations.
       Author : Maayansa
       Score  : 91 points
       Date   : 2022-03-04 15:12 UTC (7 hours ago)
       | tutuca wrote:
       | What's with this comment section? There are a lot of dead
       | comments...
         | dang wrote:
         | Someone sent out a link or something and it led to a bunch of
         | booster comments. That's not allowed on HN, so I killed the
         | comments and emailed the founders asking them to make it stop.
         | The guidelines say not to do it:
         | https://news.ycombinator.com/newsguidelines.html, the FAQ says
         | not to do it: https://news.ycombinator.com/newsfaq.html, and
         | the Launch HN advice we give to YC startups says to "make sure"
         | (in bold!) not to do it: https://news.ycombinator.com/yli.html.
         | But people who aren't familiar with HN's conventions still end
         | up doing it sometimes.
         | We make a distinction between innocent mistakes (which are
         | usually obvious) and repeat offenses (where people usually try
         | to cover their tracks). The former isn't a big deal, the latter
         | we ban accounts for.
       | edublancas wrote:
       | Congrats on the launch! As a former data scientist, I suffered
       | from bad data on a daily basis. Can you provide some details on
       | how anomalies are detected? Is it some kind of threshold-based
       | approach defined by the user or are you running statistical
       | analysis on user's data? Curious to learn more!
         | oravidov wrote:
         | Anomalies are detected through statistical analysis of the
         | data and are measured in standard deviations from the mean. We
         | have many improvements planned, and we're curious to learn how
         | you would approach this problem.
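A minimal sketch of the "standard deviations from the mean" idea described in the reply above. This is not Elementary's implementation: the function names, the sample row counts, and the threshold of 3 are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def z_score(history, value):
    """Return how many sample standard deviations `value` lies from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat history: an identical value is normal, anything else is unbounded.
        if value == mu:
            return 0.0
        return float("inf") if value > mu else float("-inf")
    return (value - mu) / sigma


def is_anomaly(history, value, threshold=3.0):
    """Flag `value` when it is more than `threshold` standard deviations from the mean."""
    return abs(z_score(history, value)) > threshold


# Hypothetical daily row counts for one table.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

print(is_anomaly(daily_row_counts, 1003))  # close to the usual volume
print(is_anomaly(daily_row_counts, 400))   # a sharp drop in volume
```

The same check applies to any numeric metric collected per run, such as a null rate or the hours since a table was last updated.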
       | brryant wrote:
       | this is awesome - much needed and your OSS starting point will be
       | really attractive. will give it a go!
         | oravidov wrote:
         | Thanks! We would really love to hear your feedback, and let us
         | know if we can help somehow!
       | llambda wrote:
       | I'm disappointed to see there isn't Redshift support. What's on
       | the roadmap to address that?
         | Maayansa wrote:
         | Hi, it's definitely in our plans for the next few weeks. As I
         | mentioned in the post, we leverage dbt for the whole data
         | monitoring layer. We wrote the package following cross-db best
         | practices but haven't tested it on Redshift, so there are
         | probably some gaps. A few Redshift users in our Slack community
         | will help us test in production with real data. Hopefully it
         | will not take much to make it run smoothly.
       | aspectmin wrote:
       | This sounds very cool. Looking forward to trying it out on our
       | data layer.
         | Maayansa wrote:
         | Thank you, would love to get your feedback and thoughts on what
         | we should add.
       | mateuszklimek wrote:
       | Hey Maayan and Or, nice project. At re_data we just went through
       | a lot of your new updates, and it seems quite a large part of
       | your project is "inspired" by code from our library
       | https://github.com/re-data/re-data, even the parts we are not
       | especially proud of ;)
       | If you decide to copy not only ideas but a big part of the
       | internal implementation, I think you should include that
       | information in your LICENSE.
       | Cheers
         | idomi wrote:
         | Pretty strong accusation. Are you sure re-data isn't "inspired"
         | by Monte Carlo? :)
           | mateuszklimek wrote:
           | It is! But it doesn't have Monte Carlo code in it :)
           | And it's open-source so it's generally okay to do that, but
           | it should be reflected in the LICENSE.
         | windsquirrel wrote:
         | Is the idea here that it's inspired by re_data due to using dbt
         | transformations underneath or because it's reposted looking
         | nearly the same? (or both?)
         | Looks like much of the lineage code is also largely a wrapper
         | around this library: https://github.com/reata/sqllineage
         | Would be curious to understand the project's purpose and unique
         | contributions vs. the underlying dependencies powering it as
         | there seems to be some ambiguity. Is this just a wrapper around
         | dbt transformations and a lineage library in one package? Can I
         | just use them directly?
           | mateuszklimek wrote:
           | The "inspired" part is the dbt transformations: it uses the
           | same models and logic, and part of the code for generating
           | them. We, for example, had a funny approach of computing
           | metrics in 4 threads via multiple dbt models, and this is also
           | done in Elementary in a very similar way :)
           | The lineage part is independent (re_data uses lineage from
           | dbt), so I haven't looked into that much.
             | Maayansa wrote:
             | While developing Elementary we looked into more than 60 dbt
             | projects to learn from prior work, and we have been inspired
             | by different things in different places. You're right that
             | we were inspired by a couple of techniques you used, one
             | being that creative way to improve performance (though the
             | 4-thread setting itself is the dbt recommendation in their
             | docs). Another is using z-score for anomaly detection, which
             | we saw in a number of related projects and which is widely
             | used in the industry.
             | In terms of the lineage, you can see in the code that we
             | mostly rely on the query and access history that exists in
             | Snowflake and BigQuery, parsing the queries to learn about
             | the connections between nodes in the graph. We use other
             | Python libraries like sqlfluff and sqllineage as low-level
             | parsers for some specific use cases, and we extend them and
             | solve many things on top of them. Actually we're heavy open
             | source users, depending on around 20 libraries, all MIT or
             | Apache.
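A toy sketch of the lineage idea in the comment above: read past queries and record which tables each one writes to and reads from. The comment says Elementary uses sqlfluff and sqllineage for parsing; this sketch swaps in two naive regular expressions so it runs on the standard library alone, and the table names and queries are made up. It will miss CTEs, subqueries, quoted identifiers, and much else.

```python
import re
from collections import defaultdict

# Naive patterns, for illustration only; a real SQL parser is needed in practice.
TARGET_RE = re.compile(
    r"\b(?:insert\s+into|create\s+(?:or\s+replace\s+)?table)\s+([\w.]+)",
    re.IGNORECASE,
)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def build_lineage(query_history):
    """Map each written table to the set of tables its queries read from."""
    upstream = defaultdict(set)
    for sql in query_history:
        target = TARGET_RE.search(sql)
        if target is None:
            continue  # plain SELECTs write nothing, so they add no edges
        sources = (name.lower() for name in SOURCE_RE.findall(sql))
        upstream[target.group(1).lower()].update(sources)
    return dict(upstream)


def trace_upstream(lineage, table):
    """Return every table that `table` depends on, directly or transitively."""
    seen, stack = set(), [table]
    while stack:
        for parent in lineage.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical query history, standing in for a warehouse query log.
history = [
    "INSERT INTO analytics.orders_clean SELECT * FROM raw.orders",
    "CREATE OR REPLACE TABLE analytics.revenue AS "
    "SELECT o.id, p.amount FROM analytics.orders_clean o "
    "JOIN raw.payments p ON p.order_id = o.id",
    "SELECT count(*) FROM analytics.revenue",
]

lineage = build_lineage(history)
print(sorted(trace_upstream(lineage, "analytics.revenue")))
```

Walking the graph upstream from a table with bad data is one way to narrow down where the problem was introduced.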
               | nuclearnice1 wrote:
               | I think mateuszklimek is pointing out that the MIT
               | license requires you to include the re_data copyright in
               | your source.
             | windsquirrel wrote:
             | Gotcha - I can see what you mean, appreciate the
             | clarification
         | mritchie712 wrote:
         | If you're going to make an accusation like this on HN, you
         | should provide line-by-line evidence. Saying "you copied us"
         | without any examples hurts your credibility.
       | theboat wrote:
       | For any dbt users, their reliability package has the best and
       | most comprehensive way to upload artifacts directly to the
       | warehouse after a dbt invocation.
       | https://github.com/elementary-data/dbt-data-reliability
         | Maayansa wrote:
         | Thank you! We believe that this upload is super valuable and
         | could unlock a lot of additional use cases. We are already
         | working on some of these and will release them in the next few
         | weeks.
       | idomi wrote:
       | Great job, Elementary team! Is this in essence similar to the
       | AWS Deequ project, but fancier and more inclusive of edge cases
       | and common scenarios? (https://github.com/awslabs/deequ)
         | Maayansa wrote:
         | Hi, thank you! The way we see it, AWS Deequ, as well as Great
         | Expectations and dbt tests, are used for static data testing.
         | This is great for many use cases, but there are problems you
         | will only detect by monitoring continuously. It's just like
         | software engineering, where you use both unit testing and
         | monitoring.
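One way to picture the distinction drawn above between a static test and continuous monitoring. Everything here is hypothetical: the 5% limit, the "twice the trailing mean" rule, and the null rates are invented, and neither function reflects how Deequ, Great Expectations, dbt tests, or Elementary work internally.

```python
from statistics import mean

STATIC_MAX_NULL_RATE = 0.05  # hypothetical fixed limit, as in a unit-style data test


def static_test(null_rate):
    """Pass or fail against a fixed limit; knows nothing about earlier runs."""
    return null_rate <= STATIC_MAX_NULL_RATE


def monitor(history, null_rate, max_ratio=2.0):
    """Pass only if the new rate is at most `max_ratio` times the trailing mean."""
    baseline = mean(history)
    return null_rate <= baseline * max_ratio


# Hypothetical null rates for one column over the previous five runs.
past_null_rates = [0.010, 0.012, 0.009, 0.011, 0.010]
today = 0.040

print("static test passes:", static_test(today))
print("monitor passes:", monitor(past_null_rates, today))
```

A null rate of 4% stays under the fixed 5% limit, so the static test passes, while the monitor fails because the rate is roughly four times its recent baseline.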
       (page generated 2022-03-04 23:00 UTC)