[HN Gopher] Show HN: Syncing data to your customer's Google Sheets
       ___________________________________________________________________
        
       Show HN: Syncing data to your customer's Google Sheets
        
       Hey HN! Charles here from Prequel (https://prequel.co). We just
       launched the ability to sync data from your own app/db/data
       warehouse to any of your customer's Google Sheets, CSV, or Excel -
       and I wanted to share a bit more about how we built the Google
       Sheets integration. If you're curious, see here for a quick GIF
       demo of our Google Sheets destination:
       https://storage.googleapis.com/hn_asset/Prequel_GoogleSheets....

        Quick background on us: we make it easy to integrate with and sync
        data to data warehouses. Problem is, there are plenty of folks who
        want access to their data, but don't have or don't know how to use
        a data warehouse. For example, FP&A teams, customer success teams,
        etc.

        To get around that, we added some non-db destinations to Prequel:
        Google Sheets, CSV, and Excel. We had to rework some core
        assumptions in order to get Google Sheets to work.

        By default, Prequel does incremental syncs, meaning we only write
        net new or updated data to the destination. To avoid duplicate
        rows, we typically perform those writes as upserts - this is pretty
        trivial in most SQL dialects. But since Google Sheets is not
        actually a db, it doesn't have a concept of upserts, and we had to
        get creative.
        We had two options. The first was to force all Google Sheets syncs
        to be "full refreshes" every time (ie grab all the data and
        brute-force write it to the sheet). The downside is that this can
        get expensive quickly for our customers, especially when data gets
        refreshed at higher frequencies (eg every 15 minutes).

        The other, and better, option was to figure out how to perform
        upserts in Sheets. To do so, we read the data from the sheet we're
        about to write to into memory and store it in a large map keyed by
        primary key. We reconcile it with the data we're about to write,
        then dump the contents of the map back to the sheet. To make the
        user experience smoother, we also sort the rows by timestamp before
        writing them back. This guarantees that we don't accidentally
        shuffle rows with every transfer, which might leave users feeling
        confused.
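
        For the curious, the reconcile step amounts to something like the
        sketch below, using the Sheets API v4 Python client. The column
        positions, range name, and function shape are illustrative
        assumptions, not our actual implementation:

          # Illustrative in-memory upsert against a sheet. Assumes a
          # `sheets` client built via googleapiclient.discovery.build(
          # "sheets", "v4", credentials=...); the key/timestamp column
          # positions are placeholders.
          def upsert_rows(sheets, spreadsheet_id, incoming,
                          key_col=0, ts_col=1):
              # Read the current sheet contents (header + data rows).
              resp = sheets.spreadsheets().values().get(
                  spreadsheetId=spreadsheet_id, range="Sheet1").execute()
              values = resp.get("values", [])
              header, existing = values[:1], values[1:]

              # Index existing rows by primary key -- the "large map".
              by_key = {row[key_col]: row for row in existing}

              # Reconcile: new keys are added, existing keys overwritten.
              for row in incoming:
                  by_key[row[key_col]] = row

              # Sort by timestamp so rows don't shuffle between transfers.
              merged = sorted(by_key.values(), key=lambda r: r[ts_col])

              # Dump the whole map back to the sheet in one write.
              # (A real implementation would also clear leftover rows.)
              sheets.spreadsheets().values().update(
                  spreadsheetId=spreadsheet_id, range="Sheet1",
                  valueInputOption="RAW",
                  body={"values": header + merged}).execute()
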
        "Wait, you keep all the data in memory... so how do you avoid
        blowing up your pods?" Great question! Luckily, Google Sheets has
        pretty stringent cell and row size limits. This lets us cap the
        amount of data that can be written to these destinations (we throw
        a nice error if someone tries to sync too much data), which also
        guarantees that we don't OOM our poor pods.
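
        As a rough point of reference, the guard boils down to a check
        like the one below. The cell cap reflects Google's documented
        limit of roughly 10 million cells per spreadsheet (as of early
        2023); the exception class is made up for illustration:

          # Illustrative pre-sync size guard; the cap and exception type
          # are placeholders, not Prequel's actual values.
          SHEETS_CELL_LIMIT = 10_000_000

          class TooMuchDataError(ValueError):
              pass

          def check_fits_in_sheet(num_rows, num_cols):
              if num_rows * num_cols > SHEETS_CELL_LIMIT:
                  raise TooMuchDataError(
                      f"{num_rows} rows x {num_cols} columns exceeds the "
                      f"Google Sheets cell limit of {SHEETS_CELL_LIMIT}")
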
        Another interesting problem we had to solve was auth: how do we let
        users give us access to their sheets in a way that both feels
        intuitive and upholds strong security guarantees? It seemed like
        the cleanest user experience was to ask the spreadsheet owner to
        share access with a new user - much like they would with any real
        human user. To make this possible without creating a superuser
        that would have access to _all_ the sheets, we had to
        programmatically generate a different user for each of our
        customers. We do this via the GCP IAM API, creating a new service
        account every time. We then auth into the sheet through this
        service account.
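
        Sketched out, the per-customer account creation looks roughly like
        this with the IAM API's Python client; the project ID and naming
        scheme here are placeholders, not our actual setup:

          # Simplified sketch of per-customer service account creation
          # via the GCP IAM API (google-api-python-client).
          import googleapiclient.discovery

          def create_customer_user(credentials, project_id, customer_id):
              iam = googleapiclient.discovery.build(
                  "iam", "v1", credentials=credentials)
              account = iam.projects().serviceAccounts().create(
                  name=f"projects/{project_id}",
                  body={
                      "accountId": f"sync-{customer_id}",
                      "serviceAccount": {
                          "displayName": f"Sheets sync ({customer_id})"},
                  }).execute()
              # The spreadsheet owner shares the sheet with this email,
              # exactly as they would with a human collaborator.
              return account["email"]
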
        One last fun UX challenge to think through was how to prevent
        users from editing the "golden" data we just synced. It might not
        be immediately clear to them that this data is meant as a
        source-of-truth record, rather than a playground. To get around
        this, we create protected ranges and prevent them from editing the
        sheets we write to. Sheets even adds a little padlock icon to the
        relevant sheets, which helps convey the "don't mess with this"
        message.
        If you want to take it for a spin, you can sign up on our site or
        reach us at hello (at) prequel.co. Happy to answer any other
        questions about the design!
        
       Author : ctc24
       Score  : 55 points
       Date   : 2023-01-27 17:15 UTC (5 hours ago)
        
       | rubenv wrote:
       | Not very familiar with the sheets API, but was wondering: how are
       | rate limits applied? Per sheet or in total?
        
         | ctc24 wrote:
         | They're applied to the caller, so "in total" in this instance.
         | They're pretty high, however -- 300 write requests per minute
         | per GCP project.
        
           | rubenv wrote:
           | That's 5 per second over all your customers? Not too worried
           | about reaching that quickly?
        
             | ctc24 wrote:
             | Not particularly. A large portion of our customers who sync
             | data to Google Sheets use a daily frequency, so the
             | theoretical upper limit is close to a half million sheets
             | being written to (per GCP project). We have other projects
             | available that we can start using once this gets close to
             | becoming an issue.
        
               | rubenv wrote:
               | That's more reassuring indeed.
               | 
               | Cool project, good luck!
        
               | potmat wrote:
               | At my company we routinely write to hundreds/thousands of
                | sheets a minute; you can get much higher quotas if you
                | ask.
               | ask.
        
       | typingmonkey wrote:
       | Is there a way to get a stream of changes out of prequel (via
       | websocket)? That would make it possible to live-sync between the
       | customer's devices.
        
         | ctc24 wrote:
         | Can you tell us a bit more about the type of use-case you have
         | in mind?
         | 
         | We support S3 as a destination, so you could listen for changes
         | on a given S3 bucket and pipe that to a stream (eg as outlined
         | here [0] or here [1]).
         | 
         | [0]: https://aws.amazon.com/blogs/big-data/streaming-data-from-
         | am... [1]: https://stackoverflow.com/questions/48147123/how-to-
         | read-dat...
        
           | typingmonkey wrote:
            | The use case is to build an RxDB plugin that backs up the
            | user's data into the Prequel API.
        
       | char_star_star wrote:
        | I wonder how long it will be until you have a customer using this
        | just to share data with their own internal teams, so the data
        | team doesn't have to mess around with Sheets!
       | 
       | Also--I want Prequel for Zendesk and Greenhouse (and Asana and
       | ...) so badly. There are so many more interesting things I want
       | to be doing with my time at work than babysitting pipelines.
        
         | ctc24 wrote:
         | We're actually starting to get those requests!
         | 
          | Really appreciate the kind words. We'll do our best to help
          | those teams offer data warehouse integrations, so you can focus
          | on higher-value data engineering work!
        
       | mritchie712 wrote:
       | You should get Apollo to buy Prequel. Fivetran doesn't support
       | Apollo and I could really use my Apollo data in Snowflake.
        
         | ctc24 wrote:
         | Ha, thanks for the suggestion. We'll reach out to them.
        
       | orlp wrote:
       | That's a bit of an unfortunate name clash with PRQL, also
       | pronounced prequel: https://prql-lang.org/
        
       | stephen wrote:
        | Interesting! Any plans for Salesforce support?
       | 
       | We'd love to have a way to easily sync our internal system's data
       | in/out of SFDC.
       | 
       | ...and a source of GraphQL? :-)
       | 
       | One of our engineers recently suggested syncing our PG database
       | to Airtable, solely b/c Airtable has out-of-the-box SFDC
       | integration (webhooks/etc), so our SFDC team could get at the
       | data easier than they could from our PG database.
       | 
       | I'm hesitant about "Airtable as our official 3rd-party
       | integration strategy", but it does make me pine for a "protocol"
       | for one-way/two-way real-time/batch syncing between two systems
       | that just want to share "dumb" tables+fields/entities.
       | 
       | I was thinking Zapier might have that, like if we implemented the
       | Zapier-TM backend protocol on top of our custom system, and it
       | would ~insta integrate with everything, but in my ~10 minutes of
       | scanning their docs, it seemed less "protocol" and more
       | "configure all the things via clicking around".
        
       ___________________________________________________________________
       (page generated 2023-01-27 23:01 UTC)