[HN Gopher] How Spotify ran a large Google Dataflow job for Wrap...
       ___________________________________________________________________
        
       How Spotify ran a large Google Dataflow job for Wrapped 2019
        
       Author : jhatax
       Score  : 151 points
       Date   : 2020-02-18 19:20 UTC (3 hours ago)
        
 (HTM) web link (labs.spotify.com)
 (TXT) w3m dump (labs.spotify.com)
        
       | rsmets wrote:
       | I thought this was such a marvel! However, my excitement level
       | was tempered when I realized the playlist Best of the Decade was
       | not created by only my music listening habits.
       | 
       | Seems as though users were pinned to some general playlist that
       | had characteristics similar to listening habits? Still hats off
       | from an engineering perspective. I as well wish there was more
       | technical detail provided.
       | 
       | The year recap playlists, though, are a fun personal snapshot in
       | time.
        
         | aabeshou wrote:
         | It's interesting to confirm that because anecdotally my best of
         | the decade playlist sucked lol. It had songs that I really
         | don't think I listened to that much or liked that much. It was
         | weird.
        
       | booboolayla wrote:
       | This smells like an initiative to boost "women in tech" storyline
       | - the task at hand is something many, many companies do daily,
       | the blog has no technical substance, and also just take a look at
       | the team members at the bottom of the page.
       | 
       | Kind of reminds me of the time a trans person from Google
       | "calculated the number Pi" and we were all supposed to cheer it
       | as some kind of an accomplishment.
        
       | dvtrn wrote:
       | I thought we had a thing about preserving post titles from the
       | source?
        
         | capableweb wrote:
         | That's still true, submission used to link to
         | https://techcrunch.com/2020/02/18/how-spotify-ran-the-larges...
         | 
         | See https://news.ycombinator.com/item?id=22359865
        
         | jsnell wrote:
         | The source changed, the title didn't.
        
       | dna_polymerase wrote:
       | Basically the perfect use case for cloud computing. Tons of
       | compute for a short time. In this case there can't possibly be
       | people arguing for their own datacenter over cloud.
        
         | wrkronmiller wrote:
         | > Basically the perfect use case for cloud computing. Tons of
         | compute for a short time.
         | 
         | I completely agree.
         | 
         | > In this case there can't possibly be people arguing for their
         | own datacenter over cloud.
         | 
         | Devil's advocate time: This solution was great for the cloud
         | because it was designed for the cloud. There might be equally
         | good or even superior solutions designed for on-prem or even
         | on-device computing. For example, this ceases to be a big-data
         | problem if you are simply aggregating listening metrics for a
         | single user on a single device.
        
         | hinkley wrote:
         | That works until the bean counters invade and someone gets the
         | bright idea to reduce the ratio of surplus hardware to reduce
         | CAPEX and boost quarterly profits.
         | 
         | We've seen that in every industry including healthcare. Every
         | health crisis now takes us back to field hospitals.
        
       | drdoooom wrote:
       | Was a neat little feature, too bad the share functionality didn't
       | actually work.
        
       | fs111 wrote:
       | why is this link doing a redirect through some ad network?
        
         | dang wrote:
         | We've since changed the link, which originally was
         | https://techcrunch.com/2020/02/18/how-spotify-ran-the-larges....
        
         | C14L wrote:
         | Because more and more browsers are limiting access to cookies
         | depending on whether they are set in a first-party or a
         | third-party context, tracking users by web bugs becomes less
         | reliable. By redirecting through the domain, they can set and
         | access cookies in a first-party context.
        
         | Swtrz wrote:
         | I wonder why I never see this behavior despite every other
         | person mentioning it
        
           | corndoge wrote:
           | ublock origin shows a confirm page when it happens
        
           | jdormit wrote:
           | It's really quick. Open the network tab and check the
           | "persist logs" checkbox to ensure that the request logs don't
           | disappear after every redirect, then clear your cookies for
           | advertising.com and guce.techcrunch.com and reload the page.
           | You'll see the request for techcrunch.com redirect to
           | guce.techcrunch.com, which redirects to guce.advertising.com,
           | which redirects back to techcrunch.com. It happens so fast
           | it's not noticeable on page load.
        
       | stilisstuk wrote:
       | No tech crunch... You can't have my cookies..
        
       | data4lyfe wrote:
       | One massive SQL query across a billion plus users.
        
         | ipnon wrote:
         | Databases are the one area of computer science that makes me
         | realize these machines can do magical things.
        
       | gwittel wrote:
       | Interesting. I wish it had more details as far as inputs/outputs,
       | data sizes in different phases.
       | 
       | One thing that I wonder about is how much work could they do to
       | collect this data on a forward moving basis. Often I see huge
       | lookback jobs that answer predictable/static questions -- prime
       | candidates for aggregation during ingest.
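
To make the "aggregation during ingest" idea concrete, here is a minimal sketch in Python. It is not how Spotify's pipeline works; the class name `PlayCountAggregator` and the `(user_id, track_id)` event shape are assumptions made for illustration. Counts are updated as each play event arrives, so a year-end "top tracks" question becomes a lookup instead of a lookback job over raw events.

```python
from collections import Counter


class PlayCountAggregator:
    """Hypothetical ingest-time aggregator: keeps per-user play counts current."""

    def __init__(self):
        # user_id -> Counter mapping track_id -> number of plays seen so far
        self.counts = {}

    def ingest(self, user_id, track_id):
        """Fold one play event into the running totals."""
        self.counts.setdefault(user_id, Counter())[track_id] += 1

    def top_tracks(self, user_id, n=5):
        """Answer the predictable question from the pre-aggregated state."""
        counter = self.counts.get(user_id, Counter())
        return [track for track, _ in counter.most_common(n)]


agg = PlayCountAggregator()
for user, track in [("u1", "a"), ("u1", "b"), ("u1", "a"), ("u2", "c")]:
    agg.ingest(user, track)
print(agg.top_tracks("u1", n=2))
```

The trade-off is that only questions anticipated at ingest time can be answered this way; anything new still needs the raw events.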
        
         | wobblykiwi wrote:
         | This is the thing I was most looking forward to reading about
         | in the article, but there were no figures about how large the
         | "largest Google Dataflow job ever" is. There are a bunch of
         | relative figures, 5x 2018 - but what does that translate to?
         | How long did it take?
        
       | deepsun wrote:
       | I'd recommend them to check out Clickhouse for exactly the same
       | purposes. Works well for Cloudflare, Yandex, Sentry.
       | 
       | Another idea is to run probabilistic queries instead of exact
       | ones, could bring down costs way more.
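
As one illustration of what a "probabilistic query" can look like, the sketch below estimates a distinct count (say, distinct listeners) with a K-Minimum-Values estimator instead of an exact count. This is a generic technique, not something the article or ClickHouse is claimed to use here; the function names and the default `k` are assumptions. It keeps only the `k` smallest hash values, so memory stays bounded no matter how many events stream through.

```python
import hashlib
import heapq


def _hash01(item):
    """Map a string to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_distinct(items, k=1024):
    """Estimate the number of distinct items using the k smallest hashes."""
    heap = []     # negated hashes, so -heap[0] is the largest of the kept values
    kept = set()  # hashes currently in the heap, used to skip duplicates
    for item in items:
        h = _hash01(item)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: count is exact
    return (k - 1) / -heap[0]    # standard KMV estimate from the k-th smallest hash


events = (f"user{i % 20000}" for i in range(60000))
print(round(approx_distinct(events)))
```

The relative error shrinks roughly with the square root of `k`, so accuracy can be traded against memory and compute.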
        
       | fmjrey wrote:
       | This may be a more appropriate source, from the source:
       | 
       | https://labs.spotify.com/2019/11/12/spotifys-event-delivery-...
        
         | stingraycharles wrote:
         | Much better article, thanks for sharing.
        
         | mackey wrote:
         | This is the correct link:
         | https://labs.spotify.com/2020/02/18/wrapping-up-the-decade-a...
        
           | dang wrote:
           | Ok, we've changed to that from
           | https://techcrunch.com/2020/02/18/how-spotify-ran-the-larges....
           | Thanks all!
        
             | gabagool wrote:
             | The new Spotify blog only states that "the Wrapped Campaign
             | data pipeline had one of the largest Dataflow jobs to ever
             | run on GCP," without claiming that it was the largest ever.
             | I didn't see any additional evidence in the TechCrunch
             | article to support this being the largest either.
             | 
             | Not sure if a better title is warranted ("How Spotify ran
             | its massive Google Dataflow job for Wrapped 2019", "How
             | Spotify ran one of the largest Google Dataflow jobs ever
             | for Wrapped 2019"?).
        
               | dang wrote:
               | Ok, we've knocked "the largest" down to size in the title
               | above. I always tell startups not to use superlatives on
               | HN. Modest language sounds stronger.
        
       | dang wrote:
       | There's more info at
       | https://techcrunch.com/2020/02/18/how-spotify-ran-the-larges....
       | 
       | (via https://news.ycombinator.com/item?id=22359528)
        
       | downerending wrote:
       | Impressive, but I'd be more impressed if they fixed their random
       | shuffle.
        
         | nvarsj wrote:
         | What's wrong with the spotify shuffle?
         | 
         | edit: Did a search, seems like there's quite a few problems
         | (only playing recently added songs, only playing 100 songs out
         | of the playlist, etc.). I know google music has also had long
         | standing issues with shuffle play - and in fact I left it over
         | these kind of issues. Is it really difficult to implement a
         | shuffle?!
        
           | fuzzmz wrote:
           | It's not really random, in the sense that if you have a
           | playlist and hit shuffle it'll always play in the same order
           | instead of randomizing the play order each time you listen to
           | that playlist. Basically, with the current behavior, once you
           | learned the order of the shuffled songs you can always know
           | what comes next.
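
The behavior described above is what a shuffle with a fixed seed would look like. The sketch below is not Spotify's code, and a fixed seed is only one hypothetical way such behavior could arise; `shuffled` and its `seed` parameter are names invented for this example. It contrasts a reused seed, which gives the same order every time, with a fresh seed per call.

```python
import random


def shuffled(tracks, seed=None):
    """Return a shuffled copy; seed=None draws fresh entropy on every call."""
    rng = random.Random(seed)
    order = list(tracks)
    rng.shuffle(order)  # Fisher-Yates shuffle from the standard library
    return order


playlist = [f"track{i}" for i in range(50)]
print(shuffled(playlist, seed=42) == shuffled(playlist, seed=42))  # same seed, same order
print(shuffled(playlist) == shuffled(playlist))                    # fresh seeds, almost surely different
```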
        
             | joegahona wrote:
             | Is there a technical reason it does this, and why it's so
             | difficult to correct?
        
               | tjoff wrote:
               | Technical debt.
        
           | mrkeen wrote:
           | It may be the case that 100 tracks are sent to the device and
           | the shuffle logic chooses from them locally.
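
A minimal sketch of that hypothesis, under the assumption that only the first `window` tracks ever reach the device (the function name, the default of 100, and the choice of the first tracks are all illustrative, not confirmed behavior): however good the local shuffle is, tracks outside the window can never come up.

```python
import random


def windowed_shuffle(playlist, window=100, rng=None):
    """Shuffle only the first `window` tracks, as if only those were sent to the device."""
    if rng is None:
        rng = random.Random()
    local = list(playlist[:window])  # the only tracks the client knows about
    rng.shuffle(local)
    return local


library = list(range(3000))  # stand-in for a 3000-track collection
order = windowed_shuffle(library, window=100)
print(len(order), max(order) < 100)
```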
        
             | kingosticks wrote:
             | Not sure why you are being down voted, this is essentially
             | how Spotify's shuffle works. At least, if you MITM the
             | official client and load a large playlist/context you'll
             | only see a small window worth of tracks being loaded. And
             | you won't see any request from the client when you then
             | shuffle that playlist, it's done locally.
             | 
             | This may, of course, have changed. My experiments while
             | (badly) implementing librespot's shuffle functionality were
             | a few years ago now.
        
             | downerending wrote:
             | In my case, all of the tracks are already on the device.
             | But yes, it's possible that they're doing something like
             | this anyway.
        
           | The_Latecomer wrote:
           | Google stopped supporting Play Music a while back to be fair.
           | Have you tried using YouTube music? Would you say you find
           | this same issue there?
        
             | anoonmoose wrote:
             | What do you mean, stopped supporting?
        
               | reciprocornous wrote:
               | https://www.digitaltrends.com/music/what-happens-to-google-p...
        
           | downerending wrote:
           | For me, I listen only from "Songs" (my entire collection,
           | which is about 3000 tracks). Even when shuffled, almost
           | everything I hear is something I've heard within the last
           | week or two.
           | 
           | When I use the Amazon app under the same conditions, I often
           | hear a track I haven't heard for a long time. Which is what
           | I'd expect when random sampling from 200 hours of music.
           | 
           | (I don't use playlists, as they're simply too much work.)
        
         | sushisource wrote:
         | Or the "queue album/song" functionality. It's amazing how
         | absolutely dogshit the Spotify UX is. I keep using it because
         | they have the best selection / device compatibility but god the
         | UI is just awful.
        
         | Barrin92 wrote:
         | while we're asking for spotify features and in case someone at
         | spotify sees this post: You've put a lot of money into
         | podcasting, please add the 'new episodes' feature of the mobile
         | app to the desktop/web app. Essential feature that's still
         | missing.
        
         | kossTKR wrote:
         | Yeah it's pretty interesting that they undertake this huge task
         | when one of the basic features still doesn't work.
         | 
         | Simply put, when you shuffle from all of your liked songs you
         | will mostly get the same tracks over and over - some tracks
         | will stay hidden forever, - pretty weird and annoying.
         | 
         | It seems to stem from issues in relation to this post, i.e. SQL
         | queries and caching to prevent too much CPU use on their end.
        
           | 2bitencryption wrote:
           | I think the root cause is because spotify shuffle isn't true
           | "shuffle" in the mathematical, random sense.
           | 
           | They perform some analysis to increase the "perceived
           | randomness" - e.g., if the true random seed picks the same
           | artist twice in a row (totally possible), pick another song
           | by a different artist, or else people will perceive the
           | shuffle as not "random" enough.
           | 
           | Unfortunately I don't have the source for this right now, but
           | I'm sure someone will hop in and provide it if I'm wrong
           | about this :)
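
The "perceived randomness" idea can be sketched as a shuffle followed by a greedy pass that avoids playing the same artist twice in a row. This is a simplified illustration and not the algorithm Spotify describes or uses; `artist_spread_shuffle` and the `(artist, title)` tuple shape are assumptions.

```python
import random


def artist_spread_shuffle(tracks, rng=None):
    """Shuffle (artist, title) pairs, preferring a different artist than the previous pick."""
    if rng is None:
        rng = random.Random()
    remaining = list(tracks)
    rng.shuffle(remaining)  # start from a uniformly random order
    order = []
    while remaining:
        last_artist = order[-1][0] if order else None
        # Take the first remaining track by another artist; fall back to index 0
        # when every remaining track is by the same artist as the last pick.
        idx = next(
            (i for i, t in enumerate(remaining) if t[0] != last_artist), 0
        )
        order.append(remaining.pop(idx))
    return order


tracks = [("A", "a1"), ("A", "a2"), ("A", "a3"), ("B", "b1"), ("B", "b2"), ("B", "b3")]
print(artist_spread_shuffle(tracks))
```

Because the pass is greedy, back-to-back repeats can still appear near the end of a list, or when one artist dominates the playlist.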
        
             | downerending wrote:
             | I'm familiar with the idea. Their custom algorithm seems to
             | do the opposite. The order actually being generated has
             | very little perceived randomness, far less than what a true
             | random shuffle would look like.
        
             | claudiulodro wrote:
             | They have also further modified the shuffle algorithm
             | within the last year or two to favor putting songs at the
             | top that the user hasn't listened to a lot. There are
             | definitely a variety of heuristics involved with their
             | shuffling algorithm.
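
One way such a heuristic could be built is a weighted random permutation in which rarely played tracks get larger weights. The sketch below uses the Efraimidis-Spirakis keying trick in log form; the weight `1 / (1 + plays)` and the function name are assumptions chosen for illustration, not details of Spotify's algorithm.

```python
import math
import random


def weighted_shuffle(play_counts, rng=None):
    """Order tracks randomly, biased so that less-played tracks tend to come first."""
    if rng is None:
        rng = random.Random()
    keyed = []
    for track, plays in play_counts.items():
        weight = 1.0 / (1 + plays)  # hypothetical weighting: fewer plays, higher weight
        u = 1.0 - rng.random()      # uniform in (0, 1], so log(u) is always defined
        keyed.append((math.log(u) / weight, track))
    keyed.sort(reverse=True)        # largest key first
    return [track for _, track in keyed]


print(weighted_shuffle({"rare": 0, "sometimes": 9, "common": 99}))
```

Every track can still land anywhere in the order; the weights only shift the odds.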
        
             | sorenjan wrote:
             | https://labs.spotify.com/2014/02/28/how-to-shuffle-songs/
        
               | downerending wrote:
               | Amusingly, the comments at the bottom are from a large
               | number of others also noting that their algorithm doesn't
               | work as described.
        
       ___________________________________________________________________
       (page generated 2020-02-18 23:00 UTC)