[HN Gopher] Datastream for BigQuery Preview
       ___________________________________________________________________
        
       Datastream for BigQuery Preview
        
       Author : rntn
       Score  : 87 points
       Date   : 2022-09-16 16:04 UTC (6 hours ago)
        
 (HTM) web link (cloud.google.com)
 (TXT) w3m dump (cloud.google.com)
        
       | mousetree wrote:
        | Finally! We use both Cloud SQL (Postgres) and BigQuery. Given
        | both are managed GCP products, you'd expect a simple way to
        | replicate Postgres into BigQuery. Instead we've had to pay for
        | Stitch/Fivetran/Airbyte or use some janky "BigQuery external
        | connections" to do this. Very happy they're finally supporting
        | an obvious use case. Surprised it's taken so long.
        
         | mywittyname wrote:
         | We've been using the external queries feature of BQ against
         | Cloud SQL, then redirecting the output to a new table. It's not
         | perfect, but it's functional, so long as your tables are very
         | simple.
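          | 
          | Roughly, a sketch of that pattern with the Python BigQuery
          | client (the connection and table names here are made up):
          | 
          |   # Federated query against Cloud SQL, with the output
          |   # redirected to a new BigQuery table.
          |   from google.cloud import bigquery
          | 
          |   client = bigquery.Client()
          |   job_config = bigquery.QueryJobConfig(
          |       destination="my-project.analytics.orders_snapshot",
          |       write_disposition="WRITE_TRUNCATE",
          |   )
          |   sql = """
          |   SELECT * FROM EXTERNAL_QUERY(
          |     'my-project.us.cloudsql-conn',
          |     'SELECT id, total, created_at FROM orders'
          |   )
          |   """
          |   client.query(sql, job_config=job_config).result()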
        
       | jimmyechan wrote:
       | On a related note, Salesforce and Snowflake announced a
       | partnership for real-time data sharing. Not exactly the same, but
       | it's interesting to see data warehouse companies building
       | upstream. I wonder if there's a trend going on here where data
       | warehouse companies start natively integrating with more
        | databases, apps, and other data sources in (near) real-time.
        | 
        | Edit: Link to announcement:
       | https://techcrunch.com/2022/09/15/salesforce-snowflake-partn...
        
         | mywittyname wrote:
          | Yes, I think this will be the case. Google makes it very
          | easy to integrate with customers' BigQuery instances. Have
          | the customer supply a service account token and a target
          | dataset, then run a query with the output table configured
          | in that target dataset. Done.
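          | 
          | As a sketch (the dataset, table, and key file names here are
          | invented, and the query is just a placeholder):
          | 
          |   # Run a query as the customer's service account and land
          |   # the result in the dataset they shared.
          |   from google.cloud import bigquery
          |   from google.oauth2 import service_account
          | 
          |   creds = service_account.Credentials.from_service_account_file(
          |       "customer-sa.json")
          |   client = bigquery.Client(credentials=creds,
          |                            project="customer-project")
          | 
          |   job_config = bigquery.QueryJobConfig(
          |       destination="customer-project.shared.daily_metrics",
          |       write_disposition="WRITE_APPEND",
          |   )
          |   sql = """
          |   SELECT CURRENT_DATE() AS day, COUNT(*) AS n
          |   FROM `customer-project.shared.events`
          |   """
          |   client.query(sql, job_config=job_config).result()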
        
       | gigatexal wrote:
       | Managed debezium?
        
       | ta3411 wrote:
        | This seems like an ideal use case for us. I have a naive
        | picture of my workflow; can someone please comment if I am
        | off track?
        | 
        | I am building an e-commerce product on PostgreSQL on AWS.
        | Every day, I want to be able to do analytics on order volume,
        | new customers, etc.
        | 
        | - For us to track internally: we fire client and backend
        | events into Amplitude
        | 
        | - For sellers to track: we directly query PostgreSQL to export
        | 
        | Now with this, I am thinking of continuously streaming our SQL
        | tables to BigQuery. Any analysis could then be done on top of
        | this BigQuery instance, for both internal tracking and
        | external export.
        | 
        | Is Redshift the AWS equivalent of this?
        
         | te_chris wrote:
          | BQ is proper SaaS, vs. Redshift where you have to pick
          | instance sizes etc. It's amazing, true superpower stuff, in
          | how little you have to think about it to get a lot out of
          | it.
        
         | mchaynes wrote:
          | Basically yes. Kinesis -> Firehose -> S3.
        
         | faizshah wrote:
          | As a heavy BQ user on my side projects, I find there isn't
          | really an alternative to BQ in AWS. Redshift does not
          | provide a lot of the functionality and ease of use that BQ
          | provides.
          | 
          | That said, the closest thing is Amazon Athena.
          | 
          | The architecture would basically be Kinesis -> S3 <- Athena,
          | where S3 is your data lake, or you can do it like AWS DMS ->
          | S3 <- Athena.
          | 
          | To accomplish this or the Redshift solution, you need to
          | implement change data capture from your relational DB. For
          | that you can use AWS Database Migration Service, like this
          | for Redshift: https://aws.amazon.com/blogs/apn/change-data-
          | capture-from-on...
          | 
          | Or like this for Kinesis: https://aws.amazon.com/blogs/big-
          | data/stream-change-data-to-...
          | 
          | The reason you may want to use Kinesis is that you can use
          | Flink in Kinesis Data Analytics, just like you can use
          | Dataflow in GCP, to aggregate some metrics before dumping
          | them into your data lake/warehouse.
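          | 
          | For the Athena half, a rough sketch with boto3 (the bucket,
          | database, and table names here are all made up):
          | 
          |   # Run an Athena query over the S3 data lake and print
          |   # the query execution id.
          |   import boto3
          | 
          |   athena = boto3.client("athena", region_name="us-east-1")
          |   resp = athena.start_query_execution(
          |       QueryString=(
          |           "SELECT COUNT(*) FROM orders "
          |           "WHERE dt = '2022-09-16'"
          |       ),
          |       QueryExecutionContext={"Database": "ecommerce_lake"},
          |       ResultConfiguration={
          |           "OutputLocation": "s3://my-athena-results/"
          |       },
          |   )
          |   print(resp["QueryExecutionId"])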
        
       | pramsey wrote:
       | No support for geometry or ranges, alas! "Datastream doesn't
       | support columns of the Geometric data types or Range data types.
       | The values in these columns will be replaced with NULL values."
        
       | neeleshs wrote:
        | Shameless plug: Syncari supports this as well, without most of
        | the limits around backfills and schema changes.
        
       | tehalex wrote:
        | Hopefully it 'just works' and is easier than AWS's DMS.
        | 
        | I use AWS DMS to replicate data from RDS => Redshift and it's
        | a never-ending source of pain.
        
         | johndfsgdgdfg wrote:
          | Doesn't matter. Google will shut this service down. Moreover,
          | Google is an ad company. It can never compete with the
          | customer focus of Amazon.
        
           | mritchie712 wrote:
            | Doesn't Amazon have an ecomm business or something? Thought
            | I heard something about that.
        
       | brad0 wrote:
       | I'm curious about customer use cases for this. Can anyone
       | elaborate?
        
         | skadamat wrote:
          | A few come to mind!
          | 
          | - A startup that's running analytics off their production
          | Postgres DB can use this as a gateway drug to try out
          | BigQuery. Bonus for not having to ETL the data themselves,
          | pay Fivetran, or host Airbyte to do it.
          | 
          | - Replace existing internally maintained sync jobs / ETL
          | pipelines with this, if you're somewhat all-in on BigQuery /
          | GCP for analytics (BQ is quite popular in analytics because
          | of its pricing structure).
        
         | dtauzell wrote:
          | If you use some sort of ETL to copy data from your OLTP
          | database to a data warehouse, you might use this to stream
          | that data into BigQuery and use BigQuery as the warehouse.
        
         | jerglingu wrote:
          | The lifting-and-dropping process from transactional stores
          | into analytics stores is usually the first step in analytics
          | data pipelines. This appears to fulfill that initial
          | ingestion step specifically for Google BigQuery. A lot of
          | times these days this ingestion step is run through some
          | third-party service: Fivetran, Stitch, Matillion, etc.
          | Similar to AWS's offerings for Redshift, this looks like
          | Google's "native" ingestion tool for BigQuery.
          | 
          | I'd be curious to know how it handles the wrinkles Fivetran
          | effectively solves (schema drift, logging, general
          | performance) and whether it does so at a much cheaper price.
        
       | gfarah wrote:
        | We have been using RDS -> DMS -> Kinesis -> BigQuery for 3
        | years now. Very little maintenance, and it has worked without
        | issues. I doubt this service will be cost-competitive with a
        | setup like ours. Perhaps for Cloud SQL -> BigQuery it will
        | make sense?
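        | 
        | In case it helps anyone, the Kinesis -> BigQuery hop can look
        | roughly like the sketch below (the stream, shard, and table
        | names are made up; checkpointing and error handling omitted):
        | 
        |   # Poll a Kinesis shard and stream the CDC records into a
        |   # BigQuery table via the streaming insert API.
        |   import json, time
        |   import boto3
        |   from google.cloud import bigquery
        | 
        |   kinesis = boto3.client("kinesis", region_name="us-east-1")
        |   bq = bigquery.Client()
        | 
        |   shard_it = kinesis.get_shard_iterator(
        |       StreamName="dms-cdc-stream",
        |       ShardId="shardId-000000000000",
        |       ShardIteratorType="LATEST",
        |   )["ShardIterator"]
        | 
        |   while True:
        |       out = kinesis.get_records(ShardIterator=shard_it,
        |                                 Limit=500)
        |       rows = [json.loads(r["Data"]) for r in out["Records"]]
        |       if rows:
        |           errors = bq.insert_rows_json(
        |               "my-project.cdc.orders", rows)
        |           if errors:
        |               print("insert errors:", errors)
        |       shard_it = out["NextShardIterator"]
        |       time.sleep(1)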
        
         | stingraycharles wrote:
         | Kinesis -> BigQuery? How does that work? Or did you mean Kafka?
        
         | faizshah wrote:
          | Is this using Anthos? If not, aren't the bandwidth costs
          | high for this solution from AWS -> GCP?
        
       | polskibus wrote:
        | I wonder how well it handles schema changes. Adding new
        | nullable columns is easy to implement, but more complex
        | changes can be a pain.
        
       | kppullin wrote:
        | We currently use the Kafka Connect BigQuery connector, along
        | with Debezium, to "stream" both the "changelog"/"transaction
        | log" and to "mirror" our RDBMS instances into BigQuery. While
        | this works, it's been a fair amount of effort to iron out
        | issues over time. We have also had to work around BigQuery
        | limits, including exceeding the concurrent query limit (we
        | switched to batch mode, which has its own issues) and the
        | frequency of writes (we've had to throttle to flushing every
        | minute, which is good enough, but we did have a use case for
        | faster updates). We also have issues related to partitioning
        | and clustering, and more...
        | 
        | So seeing this as a potential replacement for the Kafka
        | Connect BigQuery connector looked appealing. However,
        | according to the docs and listed limitations
        | (https://cloud.google.com/datastream/docs/sources-postgresql),
        | it does not handle schema changes well, nor Postgres array
        | types. Not that any of these tools handle this well, but given
        | that the BigQuery connector is open source, we've been able to
        | work around this with customizations to the code. Hopefully
        | they'll continue to iterate on the product, and I'll be
        | keeping an eye out.
        
         | atwebb wrote:
          | Yeah, the Debezium connectors have some issues that really
          | get in the way. I'm less familiar with BQ, but for some
          | other DBs the data typing is really, really basic
          | 
          | (case when string then varchar(4000/max))
          | 
          | and similar. It looks like a relatively easy thing to
          | incrementally improve.
        
           | gunnarmorling wrote:
            | Hey, Gunnar here from the Debezium team. I would love to
            | learn more about those issues you encountered with
            | Debezium. Perhaps it's something we could improve. So if
            | you could share your feedback either here or on our
            | mailing list (https://groups.google.com/g/debezium), that
            | would be awesome. Thanks!
        
             | kppullin wrote:
              | For us, Debezium has been working rather well in recent
              | memory (there were a couple of PRs I submitted that I'm
              | proud of, even if tiny!). Most of the issues are on the
              | Kafka connector side, whether it's the BigQuery sink,
              | the JDBC sink, or the S3 sink.
              | 
              | A couple of things that do pop to mind as they relate to
              | Debezium include better capabilities around backups +
              | restores and disaster recovery, and any further hints
              | around schema changes. Admittedly I haven't looked at
              | these areas for 6+ months, so they may be improved.
        
       | yumraj wrote:
       | What do they mean by _near real-time_?
       | 
        | Is it microseconds, milliseconds, seconds, minutes, hours...?
        
         | rochoa wrote:
         | From the documentation
         | (https://cloud.google.com/datastream/docs/behavior-
         | overview#p...):
         | 
         | > When using the default stream settings, the latency from
         | reading the data in the source to streaming it into the
         | destination is between 10 and 120 seconds.
        
         | lmkg wrote:
          | This is a common thing for BigQuery to claim. It's quite
          | fast for an analytics database, but it is not a
          | transactional database. Even for static data, the simplest
          | queries on the tiniest data still take 1/10th of a second
          | or so, and realistic queries run in the 1-5s range.
          | 
          | Basically, you could use BigQuery to build tools supporting
          | human-in-the-loop processes (BI reports, exec dashboards),
          | and you could call those "real-time" to the humans involved.
          | But it will not support transactional workloads, and it does
          | not provide SLAs to support that. I don't know about this
          | particular feature; I'm guessing seconds, but maybe minutes.
        
           | yolo3000 wrote:
           | They are also introducing a sort of search/indexing soon,
           | which will probably speed up certain types of queries.
        
       | carlineng wrote:
       | I wonder if the underlying technology is from their acquisition
       | of Alooma [1] several years back. High volume change data capture
       | from an OLTP database to a data warehouse is a problem that seems
       | simple on its face, but has a lot of nasty edge cases. Alooma did
       | a decent job of handling the problem at moderate scale, so I'm
       | cautiously optimistic about this announcement.
       | 
       | [1]: https://techcrunch.com/2019/02/19/google-acquires-cloud-
       | migr...
        
         | base3 wrote:
          | Definitely Alooma. It even uses Alooma's semantics for
          | naming variables and staging tables.
        
         | dominotw wrote:
         | > problem that seems simple on its face, but has a lot of nasty
         | edge cases.
         | 
          | I highly recommend that teams not roll their own solutions,
          | because of all these edge cases. My boss refused to pay for
          | a vendor solution because it was an 'easy problem' according
          | to him. He denied the whole team promotions because we spent
          | a lot of time solving the 'easy problem', and outright asked
          | us 'what's so hard about it that it deserves a promotion?'
          | Worst person I've ever worked for.
        
           | anonymousDan wrote:
           | Well what are the hard parts then?
        
             | mywittyname wrote:
              | Keeping track of new records without hammering the
              | database.
              | 
              | Keeping track of updates without a dedicated column.
              | 
              | Ditto for deletes.
              | 
              | Schema drift.
              | 
              | Managing datatype differences. I.e., Postgres supports
              | complex datatypes that other DBs do not.
              | 
              | Stored procedures.
              | 
              | Basically any feature of a DB that is beyond the bare
              | minimum necessary to call something a relational
              | database is an edge case.
              | 
              | I rolled a solution for Postgres -> BQ by hand and it
              | was much more involved than I expected. I had to set up
              | a read-only replica and completely copy over the
              | databases each run, which is only possible because they
              | are so tiny.
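              | 
              | In rough Python, a full-copy run looks something like
              | the sketch below (the connection string and table names
              | are invented; per-table looping is omitted):
              | 
              |   # Full reload: read a table from the read replica
              |   # and truncate-load it into BigQuery.
              |   import pandas as pd
              |   from sqlalchemy import create_engine
              |   from google.cloud import bigquery
              | 
              |   url = "postgresql+psycopg2://ro_user@replica/app"
              |   engine = create_engine(url)
              |   df = pd.read_sql("SELECT * FROM orders", engine)
              | 
              |   bq = bigquery.Client()
              |   job_config = bigquery.LoadJobConfig(
              |       write_disposition="WRITE_TRUNCATE")
              |   bq.load_table_from_dataframe(
              |       df, "my-project.replica.orders",
              |       job_config=job_config,
              |   ).result()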
        
             | carlineng wrote:
             | Just to name a few off the top of my head --
             | 
             | - Data type mismatches between systems
             | 
             | - Differences in handling ambiguous or bad data (e.g., null
             | characters)
             | 
             | - Handling backfills
             | 
             | - Handling table schema changes
             | 
              | - Writing merge queries to handle deletes/updates in a
              | cost-effective way (rough sketch at the end of this
              | comment)
             | 
             | - Scrubbing the binlog of PII or other information that
             | shouldn't make its way into the data warehouse
             | 
             | - Determining which tables to replicate, and which to leave
             | behind
             | 
             | - Ability to replay the log from a point-in-time in case of
             | an outage or other incident
             | 
             | And I'm sure there are a lot more I'm not thinking of. None
             | of these are terribly difficult in isolation, but there's a
             | long tail of issues like these that need to be solved.
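              | 
              | On the merge-query point, a minimal sketch of what that
              | tends to look like in BigQuery, run here via the Python
              | client (table and column names are made up, and the
              | dedup/ordering logic is simplified):
              | 
              |   # Apply staged CDC rows onto the warehouse table:
              |   # keep only the latest change per id, then merge.
              |   from google.cloud import bigquery
              | 
              |   merge_sql = """
              |   MERGE `my-project.dw.orders` AS t
              |   USING (
              |     SELECT * EXCEPT (rn) FROM (
              |       SELECT *, ROW_NUMBER() OVER (
              |         PARTITION BY id
              |         ORDER BY change_ts DESC) AS rn
              |       FROM `my-project.staging.orders_changelog`
              |     ) WHERE rn = 1
              |   ) AS s
              |   ON t.id = s.id
              |   WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
              |   WHEN MATCHED THEN UPDATE SET t.total = s.total
              |   WHEN NOT MATCHED AND s.op != 'DELETE' THEN
              |     INSERT (id, total) VALUES (s.id, s.total)
              |   """
              |   bigquery.Client().query(merge_sql).result()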
        
       ___________________________________________________________________
       (page generated 2022-09-16 23:00 UTC)