[HN Gopher] Datastream for BigQuery Preview
___________________________________________________________________

Datastream for BigQuery Preview

Author : rntn
Score  : 87 points
Date   : 2022-09-16 16:04 UTC (6 hours ago)

(HTM) web link (cloud.google.com)
(TXT) w3m dump (cloud.google.com)

| mousetree wrote:
| Finally! We use both Cloud SQL (Postgres) and BigQuery. Given
| that both are managed GCP products, you'd expect a simple way
| to replicate Postgres into BigQuery. Instead we've had to pay
| for Stitch/Fivetran/Airbyte or some janky "BigQuery external
| connections" to do this. Very happy they're finally supporting
| an obvious use case. Surprised it's taken so long.

| mywittyname wrote:
| We've been using the external queries feature of BQ against
| Cloud SQL, then redirecting the output to a new table. It's not
| perfect, but it's functional, so long as your tables are very
| simple.
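For readers who haven't used it, here is a minimal sketch of the
external-query approach mywittyname describes, using the
google-cloud-bigquery Python client. It assumes a BigQuery
connection resource to the Cloud SQL instance already exists; the
project, connection, and table names are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Redirect the federated query's result into a regular
    # BigQuery table, replacing the previous snapshot on each run.
    job_config = bigquery.QueryJobConfig(
        destination="my-project.analytics.orders_snapshot",  # hypothetical
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    # EXTERNAL_QUERY ships the inner SQL to the Cloud SQL
    # (Postgres) instance via the named connection resource and
    # returns its rows to BigQuery.
    sql = """
    SELECT * FROM EXTERNAL_QUERY(
      'my-project.us.cloudsql-connection',
      'SELECT id, customer_id, total, created_at FROM orders;'
    )
    """
    client.query(sql, job_config=job_config).result()

As the comment notes, this holds up only while tables stay small:
every run re-executes the full query against the live Postgres
instance.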
| jimmyechan wrote:
| On a related note, Salesforce and Snowflake announced a
| partnership for real-time data sharing. Not exactly the same,
| but it's interesting to see data warehouse companies building
| upstream. I wonder if there's a trend going on here where data
| warehouse companies start natively integrating with more
| databases, apps, and other data sources in (near) real time.
|
| Edit: Link to announcement:
| https://techcrunch.com/2022/09/15/salesforce-snowflake-partn...

| mywittyname wrote:
| Yes, I think this will be the case. Google makes it very easy
| to integrate with customers' BigQuery instances. Have the
| customer supply a service account token and a target dataset,
| then run a query with the output table configured in that
| target dataset. Done.

| gigatexal wrote:
| Managed Debezium?

| ta3411 wrote:
| This seems like an ideal use case for us. Here is my naive
| thinking on the workflow; can someone please comment if I am
| off track?
|
| I am building an e-commerce product on AWS PostgreSQL. Every
| day, I want to be able to do analytics on order volume, new
| customers, etc.
|
| - For internal tracking: we fire client and backend events into
|   Amplitude
|
| - For sellers: we directly query PostgreSQL to export
|
| Now I am thinking of continuously streaming our SQL tables to
| BigQuery, so that any analysis, across both internal tracking
| and external export, can be done on top of this BigQuery
| instance.
|
| Is Redshift the AWS equivalent of this?

| te_chris wrote:
| BQ is SaaS proper, vs Redshift where you have to pick instance
| sizes etc. It's amazing: true superpower stuff in how little
| you have to think about it to get loads out of it.

| mchaynes wrote:
| Basically yes. Kinesis -> Firehose -> S3

| faizshah wrote:
| As a heavy BQ user on my side projects, there isn't really an
| alternative to BQ in AWS. I find that Redshift does not provide
| a lot of the functionality and ease of use that BQ provides.
|
| That said, the closest thing is Amazon Athena.
|
| The architecture would basically be Kinesis -> S3 <- Athena,
| where S3 is your data lake, or you can do it like
| AWS DMS -> S3 <- Athena.
|
| To accomplish this or the Redshift solution, you need to
| implement change data capture from your relational DB. For that
| you can use AWS Database Migration Service, like this for
| Redshift:
| https://aws.amazon.com/blogs/apn/change-data-capture-from-on...
|
| Like this for Kinesis:
| https://aws.amazon.com/blogs/big-data/stream-change-data-to-...
|
| The reason you may want to use Kinesis is that you can use
| Flink in Kinesis Data Analytics, just like you can use Dataflow
| in GCP, to aggregate some metrics before dumping them into your
| data lake/warehouse.
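To make the Kinesis leg of that architecture concrete, here is a
minimal boto3 sketch of pushing one change event onto a stream;
the stream name and event shape are hypothetical. Firehose would
then deliver batches of these records to S3, where Athena can
query them.

    import json

    import boto3

    kinesis = boto3.client("kinesis")

    # A hypothetical change event, as AWS DMS or an
    # application-level hook might emit it for one row of an
    # `orders` table.
    event = {
        "table": "orders",
        "op": "INSERT",
        "row": {"id": 42, "customer_id": 7, "total": "19.99"},
        "ts": "2022-09-16T16:04:00Z",
    }

    kinesis.put_record(
        StreamName="orders-cdc",  # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        # Partitioning by primary key keeps each row's changes
        # ordered within a shard.
        PartitionKey=str(event["row"]["id"]),
    )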
| pramsey wrote:
| No support for geometry or ranges, alas! "Datastream doesn't
| support columns of the Geometric data types or Range data
| types. The values in these columns will be replaced with NULL
| values."

| neeleshs wrote:
| Shameless plug: Syncari supports this as well, without most of
| the limits around backfills and schema changes.

| tehalex wrote:
| Hopefully it 'just works' and is easier than AWS's DMS.
|
| I use AWS DMS to replicate data from RDS => Redshift and it's a
| never-ending source of pain.

| johndfsgdgdfg wrote:
| Doesn't matter. Google will shut this service down. Moreover,
| Google is an ad company. It can never compete with the customer
| focus of Amazon.

| mritchie712 wrote:
| Doesn't Amazon have an ecomm business or something? Thought I
| heard something about that.

| brad0 wrote:
| I'm curious about customer use cases for this. Can anyone
| elaborate?

| skadamat wrote:
| A few come to mind!
|
| - A startup that's running analytics off its production
| Postgres DB can use this as a gateway drug to try out BigQuery.
| Bonus for not having to ETL the data themselves or pay Fivetran
| / host Airbyte to do it
|
| - Replace existing internally maintained sync jobs / ETL
| pipelines with this, if you're somewhat all-in on BigQuery /
| GCP for analytics (BQ is quite popular in analytics because of
| the pricing structure)

| dtauzell wrote:
| If you use some sort of ETL to copy data from your OLTP
| database to a data warehouse, you might use this to stream that
| data into BigQuery and use BigQuery as the data warehouse.

| jerglingu wrote:
| The lift-and-drop process from transactional stores into
| analytics stores is usually the first step in analytics data
| pipelines. This appears to fulfill that initial ingestion step
| specifically for Google BigQuery. A lot of the time these days,
| this ingestion step is run through some third-party service:
| Fivetran, Stitch, Matillion, etc. Similar to AWS's offerings
| for Redshift, this looks like Google's "native" ingestion tool
| for BigQuery.
|
| I'd be curious to know how it handles the wrinkles Fivetran
| effectively solves (schema drift, logging, general performance)
| and whether it does so at a much cheaper price.

| gfarah wrote:
| We have been using RDS -> DMS -> Kinesis -> BigQuery for 3
| years now. Very little maintenance, and it has worked without
| issues. I doubt this service will be cost-competitive with a
| setup like ours. Perhaps for Cloud SQL -> BigQuery it will make
| sense?

| stingraycharles wrote:
| Kinesis -> BigQuery? How does that work? Or did you mean Kafka?

| faizshah wrote:
| Is this using Anthos? If not, aren't the bandwidth costs high
| for this solution from AWS -> GCP?

| polskibus wrote:
| I wonder how well it handles schema changes. Adding new
| nullable columns is easy to implement, but more complex changes
| can be a pain.
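On that last point, the easy case, a new nullable column, is a
one-call patch on the BigQuery side: append the field to the
destination table's schema. Below is a sketch with the
google-cloud-bigquery client, using hypothetical table and column
names; non-additive changes such as type changes or column drops
generally force a table rebuild or a manual migration.

    from google.cloud import bigquery

    client = bigquery.Client()
    table = client.get_table("my-project.analytics.orders")  # hypothetical

    # Appending a NULLABLE column is one of the few in-place
    # schema changes BigQuery allows; patch only the schema field.
    table.schema = list(table.schema) + [
        bigquery.SchemaField("coupon_code", "STRING", mode="NULLABLE"),
    ]
    client.update_table(table, ["schema"])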
| kppullin wrote:
| We currently use the Kafka Connect BigQuery connector, along
| with Debezium, to both "stream" the changelog/transaction log
| and "mirror" our RDBMS instances into BigQuery. While this
| works, it's been a fair amount of effort to iron out issues
| over time.
|
| We also have had to work around BigQuery limits, including
| issues exceeding concurrent queries (we switched to batch mode,
| which has its own issues) and frequency of writes (we've had to
| throttle to flushing every minute, which is good enough, but we
| did have a use case for faster updates). We also have issues
| related to partitioning and clustering, and more...
|
| So seeing this as a potential replacement for the Kafka Connect
| BigQuery connector looked appealing. However, according to the
| docs and listed limitations
| (https://cloud.google.com/datastream/docs/sources-postgresql),
| it does not handle schema changes well, nor Postgres array
| types. Not that any of these tools handle those well, but since
| the BigQuery connector is open source, we've been able to work
| around this with customizations to the code. Hopefully they'll
| continue to iterate on the product, and I'll be keeping an eye
| out.

| atwebb wrote:
| Yeah, the Debezium connectors have some issues that really get
| in the way. I'm less familiar with BQ, but for some other DBs
| the data typing is really, really basic
|
| (case when string then varchar(4000/max))
|
| and similar. It looks like a relatively easy thing to
| incrementally improve.

| gunnarmorling wrote:
| Hey, Gunnar here from the Debezium team. I would love to learn
| more about those issues you encountered with Debezium. Perhaps
| it's something we could improve. So if you could share your
| feedback either here or on our mailing list
| (https://groups.google.com/g/debezium), that would be awesome.
| Thanks!

| kppullin wrote:
| For us, Debezium has been working rather well in recent memory
| (there were a couple of PRs I submitted that I'm proud of, even
| if tiny!). Most of the issues are on the Kafka connector side,
| whether it's the BigQuery sink, the JDBC sink, or the S3 sink.
|
| A couple of things that do pop to mind as they relate to
| Debezium: better capabilities around backup/restore and
| disaster recovery, and any further hints around schema changes.
| Admittedly I haven't looked at these areas for 6+ months, so
| they may have improved.

| yumraj wrote:
| What do they mean by _near real-time_?
|
| Is it microseconds, milliseconds, seconds, minutes, hours...?

| rochoa wrote:
| From the documentation
| (https://cloud.google.com/datastream/docs/behavior-overview#p...):
|
| > When using the default stream settings, the latency from
| reading the data in the source to streaming it into the
| destination is between 10 and 120 seconds.

| lmkg wrote:
| This is a common thing for BigQuery to claim. It's quite fast
| for an analytics database, but it is not a transactional
| database. Even for static data, the simplest queries on the
| tiniest data still take 1/10th of a second or so, and realistic
| queries run in the 1-5s range.
|
| Basically you could use BigQuery to build tools supporting
| human-in-the-loop processes (BI reports, exec dashboards), and
| you could call those "real-time" to the humans involved. But it
| will not support transactional workloads, and it does not
| provide SLAs to support that. I don't know about this
| particular feature; I'm guessing seconds, but maybe minutes.

| yolo3000 wrote:
| They are also introducing a sort of search/indexing soon, which
| will probably speed up certain types of queries.

| carlineng wrote:
| I wonder if the underlying technology is from their acquisition
| of Alooma [1] several years back. High-volume change data
| capture from an OLTP database to a data warehouse is a problem
| that seems simple on its face, but has a lot of nasty edge
| cases. Alooma did a decent job of handling the problem at
| moderate scale, so I'm cautiously optimistic about this
| announcement.
|
| [1]: https://techcrunch.com/2019/02/19/google-acquires-cloud-migr...

| base3 wrote:
| Definitely Alooma. It even uses Alooma's semantics for naming
| variables and staging tables.

| dominotw wrote:
| > problem that seems simple on its face, but has a lot of nasty
| edge cases.
|
| I highly recommend that teams not roll their own solutions,
| because of all these edge cases. My boss refused to pay for a
| vendor solution because it was an 'easy problem' according to
| him. He denied the whole team promotions because we spent a lot
| of time solving the 'easy problem', and outright asked us
| 'what's so hard about it that it deserves a promotion'. Worst
| person I've ever worked for.

| anonymousDan wrote:
| Well, what are the hard parts then?

| mywittyname wrote:
| Keeping track of new records without hammering the database.
|
| Keeping track of updates without a dedicated column.
|
| Ditto for deletes.
|
| Schema drift.
|
| Managing datatype differences, i.e., Postgres supports complex
| datatypes that other DBs do not.
|
| Stored procedures.
|
| Basically, any feature of a DB beyond the bare minimum
| necessary to call something a relational database is an edge
| case.
|
| I rolled a solution for Postgres -> BQ by hand and it was much
| more involved than I expected. I had to set up a read-only
| replica and completely copy over the databases each run, which
| is only possible because they are so tiny.

| carlineng wrote:
| Just to name a few off the top of my head --
|
| - Data type mismatches between systems
|
| - Differences in handling ambiguous or bad data (e.g., null
| characters)
|
| - Handling backfills
|
| - Handling table schema changes
|
| - Writing merge queries to handle deletes/updates in a
| cost-effective way
|
| - Scrubbing the binlog of PII or other information that
| shouldn't make its way into the data warehouse
|
| - Determining which tables to replicate, and which to leave
| behind
|
| - Ability to replay the log from a point in time in case of an
| outage or other incident
|
| And I'm sure there are a lot more I'm not thinking of. None of
| these are terribly difficult in isolation, but there's a long
| tail of issues like these that need to be solved.
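One worked example of the "merge queries" bullet above: a CDC
pipeline typically lands raw change rows in a staging table and
periodically folds them into the target. Here is a sketch of such
a merge in BigQuery SQL, run via the Python client; the table and
column names are hypothetical (`op` marks each change as
INSERT/UPDATE/DELETE, `ts` is the change timestamp).

    from google.cloud import bigquery

    client = bigquery.Client()

    merge_sql = """
    MERGE `my-project.analytics.orders` AS t
    USING (
      -- keep only the newest change per primary key
      SELECT * EXCEPT (rn)
      FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn
        FROM `my-project.analytics.orders_changes`
      )
      WHERE rn = 1
    ) AS s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'DELETE' THEN
      DELETE
    WHEN MATCHED THEN
      UPDATE SET customer_id = s.customer_id, total = s.total
    WHEN NOT MATCHED AND s.op != 'DELETE' THEN
      INSERT (id, customer_id, total)
      VALUES (s.id, s.customer_id, s.total)
    """
    client.query(merge_sql).result()

Batching changes and merging on a schedule, against a partitioned
or clustered target, is the usual way to keep this cost-effective;
merging change-by-change would rescan the target constantly.
___________________________________________________________________
(page generated 2022-09-16 23:00 UTC)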