[HN Gopher] Launch HN: Hubble (YC S20) - Monitor data quality in...
       ___________________________________________________________________
        
       Launch HN: Hubble (YC S20) - Monitor data quality inside data
       warehouses
        
       Hey everyone! We're Oliver and Hamzah from Hubble
       (https://gethubble.io/hn). Hubble runs tests on your data warehouse
       so you can identify issues with data quality. You can test for
       things like missing values, uniqueness of data or how frequently
       data is added/updated.  We worked together for the last 4 years at
       a startup where we built and managed data products for insurers and
       banks. A common pattern we saw was teams taking data from their
       internal tools (CRM, HR system, etc.), application databases, and
       3rd party data and storing it in a warehouse for analysis. However,
       when analysts/data scientists used the data for reports they would
       spot something suspicious and the engineering team would have to
       manually go through the data pipelines to find the source of the
       problem. More often than not it was simple things like a spike in
       missing values because an ETL job failed or stale data because a
        3rd party data source hadn't updated correctly. We realised that
        the reliability and trustworthiness of the raw data had to come
        before the more interesting tasks like analysis, insights or
        predictions.  We wanted to do this without
       having to write and maintain lots of individual tests in our code.
       So we built Hubble, which connects to a data warehouse and creates
        tests based on the type of data being stored (e.g. freshness of
       timestamps, the cardinality of strings, max value of numbers,
       missing values, etc.). We've also added the ability to write any
       custom tests using a built-in SQL editor. All the tests run on a
        schedule and you'll get an email or Slack alert when they fail.
       We're also building webhooks and an Airflow operator so you can run
       tests immediately after running an ETL job or trigger a process to
       fix a failing test.  Instead of asking users to send their data to
       us, the tests are run in the data warehouse and we track the test
       results over time. Today we support BigQuery, Snowflake and Rockset
       (which lets us work with MongoDB and DynamoDB) and are adding more
       on request.  We're planning on charging $200 a month for a few
       seats, and $30-50 for extra users after that.  We're still at an
        early-access stage but want the HN community's feedback, so we've
        opened up access to the app for a few days; you can try it out at
        https://gethubble.io/hn. We've added a demo data warehouse you can
       start with that has data on COVID-19 cases in Italy and bike-share
       trips in San Francisco. Thanks and looking forward to hearing your
       ideas, experiences and feedback!
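
        To make the idea concrete, here's a minimal sketch of the kind of
        auto-generated checks described above (missing values, cardinality),
        using an in-memory SQLite table as a stand-in for a warehouse table.
        The table, columns, and thresholds are all made up, and Hubble's
        actual queries aren't public:

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; table, columns and
# thresholds are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", "IT"), ("u2", "IT"), ("u3", None), ("u4", "US")],
)

# Missing-values test: the null rate of a column must stay below a
# threshold.
total, nulls = conn.execute(
    "SELECT COUNT(*), SUM(country IS NULL) FROM events"
).fetchone()
null_rate = nulls / total
assert null_rate <= 0.5, f"too many missing countries: {null_rate:.0%}"

# Cardinality test: a low-cardinality string column should not
# suddenly explode in distinct values.
distinct = conn.execute(
    "SELECT COUNT(DISTINCT country) FROM events"
).fetchone()[0]
assert distinct <= 10, f"unexpected cardinality: {distinct}"

print(null_rate, distinct)  # 0.25 2
```

        In a real deployment each failed assertion would become an alert
        rather than an exception.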
        
       Author : oliver101
       Score  : 104 points
       Date   : 2020-08-20 15:38 UTC (7 hours ago)
        
       | hribo wrote:
       | I signed up and I think the concept is promising. It was very
        | easy to add a couple of tests. The SQL interface is handy and
        | convenient, but sometimes still limited. It would be good to add
        | support for custom scripts (e.g. Python, R). Another
       | important thing for my team would also be seamless integration
       | with other tools (i.e. email, SMS, Slack) to notify the team
       | about the failed test(s).
        
       | verhey wrote:
        | How does Hubble compare to Great Expectations or DBT for
        | pipeline testing? It looks like there's more emphasis on
        | automated profiling than "having to write and maintain lots of
        | individual tests", and obviously Hubble being a SaaS offering is
        | the big difference?
       | 
       | Also any plans to profile and test file-based stores as well?
       | There's a lot that can go wrong in a pipeline before data even
       | reaches BigQuery or Snowflake, and you may help your customers
       | save money if you could profile data in S3 before it goes through
       | a potentially expensive transform process.
       | 
       | Best of luck, though! Data testing is a very real need in most
       | data organizations I've been in, and I'm glad more and more tools
       | seem to be popping up recently to help with it.
        
         | oliver101 wrote:
         | Thanks! We love DBT and take a lot of inspiration from their
         | work. We're putting a lot of effort into suggesting the right
         | tests based on the data types, sources, and field names. A lot
         | of these tests are pretty repetitive to write so we want to
         | make it easy to spin them up.
         | 
         | We've also found that keeping a history of the state of the
         | warehouse over time is really useful context for determining
         | whether a test has failed (example: this table tends to update
         | every 30-40 minutes so we'll set a threshold at an hour).
         | 
         | We also handle the scheduling, which is surprisingly annoying
         | to manage (we built a couple of internal tools for this in the
          | past). That's something we really missed with Great
          | Expectations (you get this with DBT Cloud). Testing files is an
          | interesting use case; to an extent we support this using Athena
          | or BigQuery external tables for JSON/CSV/Parquet. We're
          | intentionally limiting it to SQL for now.
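
        The heuristic mentioned above (a table that tends to update every
        30-40 minutes gets roughly an hour before it counts as stale) can
        be sketched like this. The timestamps and the 1.5x slack factor
        are illustrative assumptions, not Hubble's actual rule:

```python
from datetime import datetime, timedelta

# Hypothetical history of observed "last updated" timestamps for one
# table; in the product this would come from tracked test results.
observed_updates = [
    datetime(2020, 8, 20, 10, 0),
    datetime(2020, 8, 20, 10, 35),
    datetime(2020, 8, 20, 11, 10),
    datetime(2020, 8, 20, 11, 40),
]

# Largest gap between consecutive updates.
gaps = [b - a for a, b in zip(observed_updates, observed_updates[1:])]
max_gap = max(gaps)  # 35 minutes here

# Add some slack before calling the table stale: a table updating
# every 30-40 minutes gets roughly an hour.
threshold = max_gap * 1.5

now = datetime(2020, 8, 20, 12, 50)
stale = (now - observed_updates[-1]) > threshold
print(f"threshold={threshold}, stale={stale}")
```

        The point is that the threshold is derived from each table's own
        history rather than being a single global default.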
        
       | iblaine wrote:
       | What does the tech stack look like?
       | 
       | Is there any caching for those situations where you may read the
       | same historical data over & over?
        
       | 12ian34 wrote:
       | +1 for alleviating data scientists/engineers of boring,
       | repetitive manual tasks and empowering them to focus on the more
       | challenging stuff
        
       | LittlePeter wrote:
       | Running a full table scan on BigQuery every hour can get quite
       | expensive. Do you support some sort of deltas?
       | 
       | I signed up. Unlike the video, I do not see Redshift as an
       | option. Any idea when Redshift will be supported?
       | 
       | How does billing per user make sense here? What prevents me
       | monitoring thousands of tables under single user? Your workload
       | costs will be higher than $200 here, no?
       | 
       | Do you have a set of fixed IPs you're connecting from to allow me
       | to whitelist you?
        
         | oliver101 wrote:
         | Full table scans can get expensive. We're adding support for
         | incremental tests so for append-only tables you'll only test
         | the recent rows. This is especially useful if you use
          | partitioned tables in BigQuery.
         | 
         | Actually in the first version of the product we automatically
         | tested every column in every table. The tests are more
         | selective now, which is partially due to cost and partially
         | because nobody wants to navigate through 10,000 tests.
         | 
         | Redshift will be supported this week! We have a list of new
         | sources to get through and it's right at the top. We've been
         | emailing over the IP for whitelisting but we'll add it to the
         | connection page too.
         | 
         | As for pricing, we're experimenting. Our costs do scale with
         | number of tests (more scheduled tasks, more historical results
         | stored). At the moment we retain the last month or so of test
         | results, which is manageable for pretty large workloads.
        
           | LittlePeter wrote:
           | Looking forward to Redshift!
           | 
           | BTW, you don't need to navigate 10K tests... you only need to
           | navigate the failing ones.
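
        The incremental approach described above can be sketched as
        follows, assuming an append-only table with a monotonically
        increasing id. The table, column names, and the duration sanity
        check are hypothetical:

```python
import sqlite3

# Append-only table of bike-share trips; SQLite stands in for the
# warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, duration_sec INTEGER)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(1, 300), (2, 420), (3, 600), (4, 99999), (5, 180)],
)

watermark = 3  # last id covered by the previous test run

# Only the newly appended rows are scanned, so cost is proportional
# to what was added since the last run, not to the whole table.
bad = conn.execute(
    "SELECT COUNT(*) FROM trips WHERE id > ? AND duration_sec > 86400",
    (watermark,),
).fetchone()[0]
assert bad == 1  # id 4 has an implausible duration (> 1 day)

# Advance the watermark for the next scheduled run.
watermark = conn.execute("SELECT MAX(id) FROM trips").fetchone()[0]
print(bad, watermark)  # 1 5
```

        With BigQuery partitioned tables the same idea applies, with the
        partition filter also limiting the bytes billed per scan.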
        
       | mushufasa wrote:
       | this is interesting! running tests on data is certainly a pain
       | point for me, and there doesn't seem to be nearly the kind of
       | infrastructure available as for, say, tests for code
       | functionality.
       | 
       | Is this open source? Sending my data to a third party is a no-go,
       | as is having a third-party connect to the database. Something
       | part of a managed hosting service, though, or an add-on to an
       | existing trusted hosted service that has gone through compliance
       | (e.g. Heroku, AWS), would be more palatable.
        
         | hamzahc wrote:
          | This was exactly the pain point we had when we saw how much
          | better the tools were for testing our software than our data.
         | 
         | It's not open source but we can deploy on-prem (or cloud-prem
          | more accurately) pretty easily. We're also going to set up as an
         | add-on available through AWS marketplace. Feel free to shoot me
         | an email if you want to see if this can work for you
         | hamzah[at]gethubble.io
        
       | _Microft wrote:
       | Have you considered picking a different name? Searching for
       | "Hubble" for whatever reason is going to return millions of
       | irrelevant results for your customers.
        
         | 12ian34 wrote:
         | I'm sure Jobs and Woz heard similar...
        
           | vikramkr wrote:
           | Yes of course, because of how important search engine
           | optimization was in 1976. Nothing has changed in the business
           | environment between now and then.
        
             | dang wrote:
             | Please don't be snarky.
             | 
             | https://news.ycombinator.com/newsguidelines.html
        
         | tapoxi wrote:
         | Yeah it immediately brings to mind https://hubblestack.io/
        
         | xyzzy_plugh wrote:
         | My first thought was https://github.com/cilium/hubble
        
         | oliver101 wrote:
         | Yeah we called this project hubble long before we were worried
         | about SEO.
         | 
         | Actually, the name does relate back to Edwin Hubble. We
         | previously worked together on an internal data tool called
         | Telescope (it was used for annotating medical images for
         | computer vision). The telescope project slowly evolved into the
         | product we have today. So we changed the name to our favourite
         | telescope. I have a fondness for the Hubble telescope: there
         | was a huge poster of it on the way into the computational
          | physics dept. and it takes me back to my grad school days!
        
         | anticsapp wrote:
         | I can't think of a worse name for SEO purposes. You'd have to
          | fight through a well-loved and well-known space telescope, the
          | astronomer it was named after, and Hubble contact lenses, which
          | has raised ~$74MM.
        
           | gk1 wrote:
           | > Search for "hubble"
           | 
           | > See irrelevant results
           | 
           | > Search for "hubble data"
           | 
           | Problem solved. People are smart enough to modify their
           | search if the initial results are about telescopes and not
           | data pipelines.
           | 
           | One of my clients had a similar name to a global pizza chain.
           | It hasn't been an issue at all, besides having to hear the
           | same pizza puns over and over.
        
           | switz wrote:
           | If a customer is looking for you specifically, they will find
           | you (e.g. "hubble data" as stated above). If they are looking
           | for a "data quality monitor" then the SEO will need to
           | reflect that. The name is largely irrelevant at that point,
           | it's merely a moniker.
           | 
           | In the grand scheme of problems a new company has, this is so
           | trivially minor that I can't fathom this having any tangible
           | effect on the success of a company. It's one thing if there's
           | another data warehousing company called "hubble", but that's
           | not the case you're making.
        
             | Kye wrote:
             | Hubble data brings up, as I would expect, data from the
             | Hubble Space Telescope. Not one of the first page of
              | results points to anything but HST information.
        
               | switz wrote:
               | The product literally just launched -- give it a few
               | weeks, it'll show up.
        
               | Kye wrote:
               | I don't know who's advising you on SEO, but you will not
                | ever outrank STScI, NASA, ESA, AWS Open Data's HST
               | archive, The Planetary Society, the National Academy of
               | Sciences, or the ESO on "hubble data" as long as Hubble
               | is still what people think of when they hear Hubble. The
               | telescope and related sites/agencies/organizations have a
               | 22 year head start building a relevant link profile in
               | Google. And if you did, Google would get suspicious.
               | 
               | Hubble is fine as a name if you pick the right keywords
               | to target in your marketing, but "Hubble data" is never
               | going to show a link to something that isn't at least
               | tangentially related to the telescope.
        
       | [deleted]
        
       | jeremynevans wrote:
       | Customer here (comment not solicited!). We've been trying out
       | Hubble for a month or so and it's looking really promising.
       | 
       | I love the idea of being able to outsource the creativity/problem
       | solving of predicting things that could go wrong with our data to
       | a service that specialises in just that, and I can totally see
       | how they can automate this in a big way as they grow.
        
       | scapecast wrote:
       | Co-founder of intermix.io here (which we sold in March). We came
       | more from the performance monitoring angle (specifically for
       | Redshift), but then shifted to a product that works horizontally
       | across all warehouses, to track usage, workflows and user
       | engagement. "Shift to Data Products" was the narrative we started
       | using in Q4 2019. If you read the copy on the current intermix.io
       | website, I think you'll find yourself nodding. (FYI - we got
       | bought by a small PE Fund that is rolling the product into
       | Xplenty, an ETL product).
       | 
        | My experience is that monitoring data quality is still an
        | under-appreciated discipline. I've found that most teams still
        | have a "not invented here" mentality, or don't even know they
        | have the problem! That can lead to an "oh, we can just fix it when
       | it happens" type of mentality. But your timing may be better than
       | ours - we started back in 2016.
       | 
       | I haven't played with your product (yet), only took a look at
       | this thread and your website. Some observations:
       | 
       | - SQL Editor - big plus! I think giving your users a space where
       | they can take action is a super value-add, we didn't have that.
       | 
       | - nice work running the tests inside the customer's warehouse.
       | That has two benefits for you. 1) you're not incurring the cost
        | to crunch the metadata, which can get quite expensive depending
        | on the number of tables in the warehouse. 2) you're avoiding data
        | access issues; getting access to the warehouse was always a
        | hurdle, even though we only needed access to the system tables.
       | 
       | - pricing model. I think the per-seat model is the way to go. We
        | tried charging by number of rows and by size of the warehouse
        | (number of nodes), but then you run into weird situations with
        | customers who are dealing with huge historic datasets but really
        | only look at the last 30 days of data.
       | 
       | My unsolicited $0.02 is that you think hard about distribution. I
       | think you want to think about hitching your wagon to the cloud
       | marketplaces, and Snowflake's marketplace. For example, attaching
       | themselves to Snowflake is what made all the difference for
       | Fivetran.
       | 
        | I have a bunch more scars that I can share if you care to know
       | them :-)
        
         | hamzahc wrote:
         | > My experience is that monitoring data quality is still an
         | under-appreciated discipline.
         | 
          | We agree with this a lot. We found there are often a lot of
          | unknown unknowns driving data issues, and a lot of teams
          | aren't sure where to start. That's why we're spending so much
         | time on trying to make relevant tests in Hubble that are easy
         | to set up and use (and then let users create custom tests once
         | they get the hang of it).
         | 
          | Great point on distribution. We do think being close to the
          | data warehouses is really important for us: most teams already
          | have one set up, but don't know if what's inside it is correct
         | or useful. We're looking to get set up on their marketplaces
         | soon!
         | 
          | It sounds super relevant; we'd love to hear more - you can get
         | me at hamzah[at]gethubble.io
        
         | texasbigdata wrote:
         | Fantastic blog post, thanks for sharing.
         | 
         | So I guess if you had to pick arbitrary revenue/data/fte
         | cutoffs, do you see the org chart of these adopters as you've
         | described looking a certain way? Let me try to rephrase that.
         | 
         | Do you think there's a step function of "here you need one DBA
          | who is a holy librarian" and "here we need a GitLab-style data
         | team with SLAs and the data equivalent of HR business partners
         | who get assigned to the BU"?
         | 
         | Tangential to your comment but curious if you believe the human
         | side scales akin to the infrastructure side.
        
         | lifeisstillgood wrote:
          | FYI: Snowflake seems to have a commercial marketplace that
          | lets users download data sets (weather, marketing, etc.) and
          | presumably lets people upload their own data sets:
         | 
         | https://www.snowflake.com/data-marketplace/
         | 
          | I assume there is an open version that's really good but less
         | cool
        
       ___________________________________________________________________
       (page generated 2020-08-20 23:00 UTC)