[HN Gopher] How Spotify ran a large Google Dataflow job for Wrap...
___________________________________________________________________
How Spotify ran a large Google Dataflow job for Wrapped 2019
Author : jhatax
Score : 151 points
Date : 2020-02-18 19:20 UTC (3 hours ago)
(HTM) web link (labs.spotify.com)
(TXT) w3m dump (labs.spotify.com)
| rsmets wrote:
| I thought this was such a marvel! However, my excitement level
| was tempered when I realized the playlist Best of the Decade was
| not created from my music listening habits alone.
|
| Seems as though users were pinned to some general playlist that
| had characteristics similar to their listening habits? Still, hats
| off from an engineering perspective. I also wish there was more
| technical detail provided.
|
| The year recap playlists though are a fun personal snapshot of
| time.
| aabeshou wrote:
| It's interesting to confirm that because anecdotally my best of
| the decade playlist sucked lol. It had songs that I really
| don't think I listened to that much or liked that much. It was
| weird.
| booboolayla wrote:
| This smells like an initiative to boost "women in tech" storyline
| - the task at hand is something many, many companies do daily,
| the blog has no technical substance, and also just take a look at
| the team members at the bottom of the page.
|
| Kind of reminds me of the time a trans person from Google
| "calculated the number Pi" and we were all supposed to cheer it
| as some kind of an accomplishment.
| dvtrn wrote:
| I thought we had a thing about preserving post titles from the
| source?
| capableweb wrote:
| That's still true, submission used to link to
| https://techcrunch.com/2020/02/18/how-spotify-ran-the-larges...
|
| See https://news.ycombinator.com/item?id=22359865
| jsnell wrote:
| The source changed, the title didn't.
| dna_polymerase wrote:
| Basically the perfect use case for cloud computing. Tons of
| compute for a short time.
| In this case there can't possibly be
| people arguing for their own datacenter over cloud.
| wrkronmiller wrote:
| > Basically the perfect use case for cloud computing. Tons of
| compute for a short time.
|
| I completely agree.
|
| > In this case there can't possibly be people arguing for their
| own datacenter over cloud.
|
| Devil's advocate time: This solution was great for the cloud
| because it was designed for the cloud. There might be equally
| good or even superior solutions designed for on-prem or even
| on-device computing. For example, this ceases to be a big-data
| problem if you are simply aggregating listening metrics for a
| single user on a single device.
| hinkley wrote:
| That works until the bean counters invade and someone gets the
| bright idea to reduce the ratio of surplus hardware to reduce
| CAPEX and boost quarterly profits.
|
| We've seen that in every industry including healthcare. Every
| health crisis now takes us back to field hospitals.
| drdoooom wrote:
| Was a neat little feature, too bad the share functionality didn't
| actually work.
| fs111 wrote:
| why is this link doing a redirect through some ad network?
| dang wrote:
| We've since changed the link, which originally was
| https://techcrunch.com/2020/02/18/how-spotify-ran-the-larges....
| C14L wrote:
| Because more and more browsers are limiting access to cookies
| not only depending on first-party context but also third-party
| context. So tracking users by web bugs becomes less reliable.
| By redirecting through the domain, they can set and access
| cookies in a first-party context.
| Swtrz wrote:
| I wonder why I never see this behavior despite every other
| person mentioning it
| corndoge wrote:
| ublock origin shows a confirm page when it happens
| jdormit wrote:
| It's really quick.
| Open the network tab and check the
| "persist logs" checkbox to ensure that the request logs don't
| disappear after every redirect, then clear your cookies for
| advertising.com and guce.techcrunch.com and reload the page.
| You'll see the request for techcrunch.com redirect to
| guce.techcrunch.com, which redirects to guce.advertising.com,
| which redirects back to techcrunch.com. It happens so fast
| it's not noticeable on page load.
| stilisstuk wrote:
| No tech crunch... You can't have my cookies..
| data4lyfe wrote:
| One massive SQL query across a billion plus users.
| ipnon wrote:
| Databases are the one area of computer science that makes me
| realize these machines can do magical things.
| gwittel wrote:
| Interesting. I wish it had more details as far as inputs/outputs,
| data sizes in different phases.
|
| One thing that I wonder about is how much work they could do to
| collect this data on a forward moving basis. Often I see huge
| lookback jobs that answer predictable/static questions -- prime
| candidates for aggregation during ingest.
| wobblykiwi wrote:
| This is the thing I was most looking forward to reading about in
| the article, but there were no figures about how large the
| "largest Google Dataflow job ever" is. There are a bunch of
| relative figures, 5x 2018 - but what does that translate to? How
| long did it take?
| deepsun wrote:
| I'd recommend they check out ClickHouse for exactly the same
| purposes. Works well for Cloudflare, Yandex, Sentry.
|
| Another idea is to run probabilistic queries instead of exact
| ones, could bring down costs way more.
| fmjrey wrote:
| This may be a more appropriate source, from the source:
|
| https://labs.spotify.com/2019/11/12/spotifys-event-delivery-...
| stingraycharles wrote:
| Much better article, thanks for sharing.
| mackey wrote:
| This is the correct link
| https://labs.spotify.com/2020/02/18/wrapping-up-the-decade-a...
| dang wrote:
| Ok, we've changed to that from
| https://techcrunch.com/2020/02/18/how-spotify-ran-the-larges....
| Thanks all!
| gabagool wrote:
| The new Spotify blog only states that "the Wrapped Campaign
| data pipeline had one of the largest Dataflow jobs to ever
| run on GCP," without claiming that it was the largest ever.
| I didn't see any additional evidence in the TechCrunch
| article to support this being the largest either.
|
| Not sure if a better title is warranted ("How Spotify ran
| its massive Google Dataflow job for Wrapped 2019", "How
| Spotify ran one of the largest Google Dataflow jobs ever
| for Wrapped 2019"?).
| dang wrote:
| Ok, we've knocked "the largest" down to size in the title
| above. I always tell startups not to use superlatives on
| HN. Modest language sounds stronger.
| dang wrote:
| There's more info at
| https://techcrunch.com/2020/02/18/how-spotify-ran-the-larges....
|
| (via https://news.ycombinator.com/item?id=22359528)
| downerending wrote:
| Impressive, but I'd be more impressed if they fixed their random
| shuffle.
| nvarsj wrote:
| What's wrong with the spotify shuffle?
|
| edit: Did a search, seems like there's quite a few problems
| (only playing recently added songs, only playing 100 songs out
| of the playlist, etc.). I know google music has also had
| long-standing issues with shuffle play - and in fact I left it
| over these kinds of issues. Is it really difficult to implement a
| shuffle?!
| fuzzmz wrote:
| It's not really random, in the sense that if you have a
| playlist and hit shuffle it'll always play in the same order
| instead of randomizing the play order each time you listen to
| that playlist. Basically, with the current behavior, once you
| have learned the order of the shuffled songs you always know
| what comes next.
| joegahona wrote:
| Is there a technical reason it does this, and why it's so
| difficult to correct?
| tjoff wrote:
| Technical debt.
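For reference on the question above of whether a shuffle is hard to implement: a uniform shuffle is short. The sketch below is a textbook Fisher-Yates shuffle in Python, not Spotify's code. The function name `fisher_yates_shuffle` and the sample track names are made up for illustration. Drawing from an unseeded generator on each call is what would give a different order each time a playlist is played.

```python
import random


def fisher_yates_shuffle(tracks, rng=None):
    """Return a new list with the tracks in uniformly random order."""
    if rng is None:
        rng = random.Random()  # unseeded: a different order on each call
    shuffled = list(tracks)    # copy so the caller's playlist is untouched
    # Walk from the last slot down, swapping each slot with a randomly
    # chosen slot at or before it (a slot may swap with itself).
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


# Hypothetical playlist, for illustration only.
playlist = ["track-a", "track-b", "track-c", "track-d", "track-e"]
print(fisher_yates_shuffle(playlist))
```

Passing a seeded `random.Random` reproduces the same order every time, which is the behavior fuzzmz describes.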
| mrkeen wrote:
| It may be the case that 100 tracks are sent to the device and
| the shuffle logic chooses from them locally.
| kingosticks wrote:
| Not sure why you are being downvoted, this is essentially
| how Spotify's shuffle works. At least, if you MITM the
| official client and load a large playlist/context you'll
| only see a small window worth of tracks being loaded. And
| you won't see any request from the client when you then
| shuffle that playlist, it's done locally.
|
| This may, of course, have changed. My experiments while
| (badly) implementing librespot's shuffle functionality were
| a few years ago now.
| downerending wrote:
| In my case, all of the tracks are already on the device.
| But yes, it's possible that they're doing something like
| this anyway.
| The_Latecomer wrote:
| Google stopped supporting Play Music a while back to be fair.
| Have you tried using YouTube Music? Would you say you find
| this same issue there?
| anoonmoose wrote:
| What do you mean, stopped supporting?
| reciprocornous wrote:
| https://www.digitaltrends.com/music/what-happens-to-google-p...
| downerending wrote:
| For me, I listen only from "Songs" (my entire collection,
| which is about 3000 tracks). Even when shuffled, almost
| everything I hear is something I've heard within the last
| week or two.
|
| When I use the Amazon app under the same conditions, I often
| hear a track I haven't heard for a long time. Which is what
| I'd expect when random sampling from 200 hours of music.
|
| (I don't use playlists, as they're simply too much work.)
| sushisource wrote:
| Or the "queue album/song" functionality. It's amazing how
| absolutely dogshit the Spotify UX is. I keep using it because
| they have the best selection / device compatibility but god the
| UI is just awful.
| Barrin92 wrote:
| while we're asking for spotify features and in case someone at
| spotify sees this post: You've put a lot of money into
| podcasting, please add the 'new episodes' feature of the mobile
| app to the desktop/web app. Essential feature that's still
| missing.
| kossTKR wrote:
| Yeah it's pretty interesting that they undertake this huge task
| when one of the basic features still doesn't work.
|
| Simply put, when you shuffle from all of your liked songs you
| will mostly get the same tracks over and over - some tracks
| will stay hidden forever - pretty weird and annoying.
|
| It seems to stem from issues in relation to this post, i.e. SQL
| queries and caching to prevent too much CPU use on their end.
| 2bitencryption wrote:
| I think the root cause is that spotify shuffle isn't true
| "shuffle" in the mathematical, random sense.
|
| They perform some analysis to increase the "perceived
| randomness" - e.g., if the true random seed picks the same
| artist twice in a row (totally possible), pick another song
| by a different artist, or else people will perceive the
| shuffle as not "random" enough.
|
| Unfortunately I don't have the source for this right now, but
| I'm sure someone will hop in and provide it if I'm wrong
| about this :)
| downerending wrote:
| I'm familiar with the idea. Their custom algorithm seems to
| do the opposite. The order actually being generated has
| very little perceived randomness, far less than what a true
| random shuffle would look like.
| claudiulodro wrote:
| They have also further modified the shuffle algorithm
| within the last year or two to favor putting songs at the
| top that the user hasn't listened to a lot. There is
| definitely a variety of heuristics involved with their
| shuffling algorithm.
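The "perceived randomness" idea described above (skip a pick when it repeats the previous artist) can be sketched roughly as follows. This is a guess at the general shape of such a heuristic, not Spotify's algorithm: the function name `shuffle_avoiding_repeats`, the `(artist, title)` tuple layout, and the sample data are all assumptions made for illustration. It is greedy, so it can still end with a run by one artist when only that artist's tracks remain.

```python
import random


def shuffle_avoiding_repeats(tracks, rng=None):
    """Shuffle (artist, title) pairs, avoiding back-to-back tracks
    by the same artist whenever another artist is still available."""
    if rng is None:
        rng = random.Random()
    pool = list(tracks)
    rng.shuffle(pool)  # start from a uniform shuffle
    ordered = []
    while pool:
        last_artist = ordered[-1][0] if ordered else None
        # Take the first remaining track by a different artist; if every
        # remaining track is by the same artist, accept the repeat.
        pick = next(
            (i for i, track in enumerate(pool) if track[0] != last_artist),
            0,
        )
        ordered.append(pool.pop(pick))
    return ordered


# Hypothetical library, for illustration only.
library = [
    ("artist-1", "song-a"), ("artist-1", "song-b"),
    ("artist-2", "song-c"), ("artist-2", "song-d"),
    ("artist-3", "song-e"),
]
for artist, title in shuffle_avoiding_repeats(library):
    print(artist, title)
```

Note the trade-off: the result is no longer a uniform shuffle, since orderings with adjacent same-artist tracks become less likely than others.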
| sorenjan wrote:
| https://labs.spotify.com/2014/02/28/how-to-shuffle-songs/
| downerending wrote:
| Amusingly, the comments at the bottom are from a large
| number of others also noting that their algorithm doesn't
| work as described.
___________________________________________________________________
(page generated 2020-02-18 23:00 UTC)