[HN Gopher] Show HN: Goodreads Data Pipeline
___________________________________________________________________
Show HN: Goodreads Data Pipeline

Author : san089
Score  : 150 points
Date   : 2020-02-27 15:41 UTC (7 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| bilater wrote:
| Nice - You can use UNION ALL instead of UNION in your query at
| the end if you're confident the datasets don't overlap. The query
| is less expensive. I'm also curious what the backfilling/recovery
| process is if something goes wrong and you have to stop your 10
| min load jobs.

| jonluca wrote:
| As an aside, Goodreads is the slowest site I use on a regular
| basis. It's genuinely shocking how there are 5 to 6 second page
| load times. I'm not sure what their stack is but I'm always blown
| away by how anyone continues to use this. It feels like a
| competitor could beat them by just being faster.

| richie5um wrote:
| Couldn't agree more. It must have one of the worst
| usage-to-enjoyment ratios of any site.

| spullara wrote:
| Probably Ruby on Rails.
| https://stackshare.io/goodreads/goodreads

| traverseda wrote:
| That stackshare site won't let me view content without
| creating an account.
|
| So that's a downvote from me, please try to find a source
| that doesn't require you to register an account.

| idkris wrote:
| I just accessed the stackshare URL without signing in.

| jaredmosley wrote:
| I was able to access that link but couldn't do anything
| else without having to sign in. I also can no longer go
| back to that link without signing in.

| FalconSensei wrote:
| Downvoting someone for contributing to the conversation,
| without providing any better source. Great

| FalconSensei wrote:
| Thanks for the downvote bro!

| kgraves wrote:
| Interesting, would like some detail on the cost of this ETL setup
| on AWS; unfortunately I can't see anything on this from the
| project page.

| jmedefind wrote:
| Since they are owned by Amazon.
| I would think their cost is
| close to nothing.

| dswalter wrote:
| That's a bit of a misconception. From what I understand,
| Amazon's non-AWS branches don't get deeply-discounted
| services from AWS. There is a discount, but it's not enough
| to turn dark skies into sunshine and rainbows.
|
| Amazon tends to want every part of itself to be in
| ship-shape, and giving itself a massive discount would
| discourage efficiency in non-AWS parts of the business.
|
| Disclosure: neither a current nor former Amazon employee.

| augmachina wrote:
| This is a misconception. Amazon wants to depict itself as
| wanting every part to be in ship-shape, but it does not
| operate that way and AWS is treated like any internal
| resource, such as printers and staplers.

| [deleted]

| devcpp wrote:
| AWS basically finances the rest of Amazon. It's 70% of its
| revenue (that is public info). Except for retail, the rest
| is all losses. So the discounts don't matter much; other
| branches just try to save money (frugality is one of
| Amazon's core values) but basically get what they need.

| danial wrote:
| This repository is not associated with Amazon.

| gwern wrote:
| Is this limited strictly to the Goodreads API or does it pull in
| more interesting data like the shelf-tags? When I did
| https://www.gwern.net/GoodReads the other month, I had to
| literally scrape shelves by hand because the API doesn't cover
| them and they lie to bots.

| habosa wrote:
| Someone, somewhere has to be able to make a better alternative to
| Goodreads, right? The site is slow, ugly, and buggy. The
| functionality is so simple: I tell you when I read a book and
| what I think about it.
|
| I'm just shocked Amazon has been able to own this niche with so
| little effort.

| drusepth wrote:
| I've been working on a competitor for a while now, and the
| hardest part of replicating functionality is the data.
| OpenLibrary is probably the best source for book metadata
| online, but even their library dumps are riddled with mistakes
| that manifest in weird ways as you start building your own
| library. The Goodreads site sucks, but they've got surprising
| data quality that I don't think anyone else has; and they have
| a super restrictive data policy so you can't repurpose book
| data, reviews, shelf data, etc., even when users auth with a
| Goodreads account.
|
| It's a small moat, but definitely penetrable with more than a
| little effort.

| hombre_fatal wrote:
| Demanding perfect data is a waste of time that will let you
| procrastinate on your product indefinitely.

| maw wrote:
| If you're aiming for "like X, but for people who are
| actually interested in the market X supposedly serves" then
| maybe this isn't true.

| barkingcat wrote:
| The good data quality is actually an artifact of humans being
| involved in every step of the cataloging process. There's a
| large group on Goodreads called the Goodreads Librarians, and
| that group has around a hundred thousand dedicated people who
| go through and flag anomalies, correct titles and indexes, etc.
|
| Book publishers or people who've worked in book publishing
| will know that the book database is one area you don't want
| to mess with unless you know what you are doing. ISBNs are
| not the be-all and end-all of the story, and when you start
| taking into account special editions, covers, ebook editions,
| language translations, you'll start to realize that the Book
| Catalog system going back in history, including the Dewey
| Decimal System, is a marvel of human achievement.
|
| Of course establishing a good-quality index is going to take
| work. People often forget that quality takes human work and
| effort.
|
| EDIT: I lied. I changed the number from my original estimate
| of a "few hundred" to "hundred thousand".
| The Goodreads
| Librarians group has 103718 members as of when I peeked just
| now - so it's actually a large number of humans submitting
| fixes to their catalog.
|
| https://www.goodreads.com/group/show/220-goodreads-librarian...
|
| If you take a look at the kind of discussions taking place,
| those are the kinds of things any competitor to Goodreads
| needs to know about.

| spullara wrote:
| Facts are not copyrightable and scraping has been determined
| to be legal. IANAL, but I'm not sure the law would protect them
| from the factual metadata about books being repurposed.

| jjeaff wrote:
| Not being illegal wouldn't protect you from a crushing
| lawsuit though. Especially since the details likely vary
| (LinkedIn data was publicly accessible, not sure if
| Goodreads requires a login).

| trollied wrote:
| You could nick the scraping code from Calibre...

| dewey wrote:
| The scraping part is probably not the complicated portion
| of the endeavor.

| aaron-santos wrote:
| Just signed up for FediReads[1] last week. It's a decentralized
| Goodreads with ActivityPub, and open source[2].
|
| [1] http://fedireads-test.glitch.me/
|
| [2] https://github.com/mouse-reeve/fedireads/

| FalconSensei wrote:
| > This is just a demo, any data here may be deleted without
| warning. sign up for email updates
|
| So basically, keep using Goodreads for now?

| 101008 wrote:
| I've been working on something like this. Super simple, like a
| spreadsheet of what you read but as a SaaS. I was thinking of
| monetizing it a la Pinboard: focused on privacy. Like, $3 per
| month and you have it, without Amazon or Google knowing what
| books you read and how you rate them.

| grimgrin wrote:
| Not to imply this functionality is complex, but really the most
| important thing for me is the lists:
|
| I _love_ that I can take a book I enjoyed, see it's on a list
| of "Best Magic Systems", and note what was rated even better
| for its magic system.
|
| A simple method of discovery for me.
|
| https://www.goodreads.com/list/show/871.Most_Interesting_Mag...
|
| https://www.goodreads.com/list/show/8497.Aliens_First_Contac...

| spillguard wrote:
| As a regular Goodreads user, I've never cared about the site's
| relatively slow load times. What's important to me is the trust
| that the site will still be around in 20 years, largely thanks
| to the Amazon ownership and Kindle integrations. I wouldn't
| have that same faith in an anonymous competitor.

| FalconSensei wrote:
| The Kindle integration, the amount of correct data, and the
| fact that it's not going to vanish in the next year are what
| keep me using Goodreads.

| kirubakaran wrote:
| Goodreads was that anonymous site once upon a time. You're
| just not an early adopter and that's okay. That's no reason
| to not create a better alternative.

| trollied wrote:
| The thing that Goodreads has that will be hard to replicate is
| the Kindle integration.

| mmanfrin wrote:
| This is the Achilles heel of any potential competitor. The
| lazy integration means there is a big subset of users who
| simply won't engage with a competitor because it requires
| more work. Couple that with the social graph Goodreads
| already has and you're looking at a huge moat.

| sidthekid wrote:
| I find the site decently fast, definitely ugly but then again I
| don't want it to get a Reddit-style redesign either. The
| information density is ok right now, and I'm actually impressed
| by the wide range of functionalities they have, related to
| reviewing books and updating your progress.

| hombre_fatal wrote:
| Have you tried looking? There's LibraryThing and a couple
| others.
|
| I don't think there's much value left on the table in the
| niche, though. Kindles have first-class Goodreads sync and even
| a Goodreads button in their global navbar. And Goodreads'
| competitors, for the few people who don't want to use
| Goodreads, already have a deep rut of incumbency.
|
| Even you, who supposedly have great issues with Goodreads,
| apparently weren't bothered enough to even see if competitors
| existed all this time, much less before writing your comment.
| Doesn't bode well for the Goodreads competitor market, lol.

| dlsso wrote:
| None of the competitors I'm aware of have fuzzy search
| though, which is pretty annoying.
|
| "color prple"
|
| LibraryThing: 0 results
|
| Goodreads: 2,000+ results and they're well sorted

| hombre_fatal wrote:
| Reddit has terrible search too, but you can appreciate that
| "Reddit but with good search" isn't all it takes to compete
| with Reddit. That's 0.001% of the work.
|
| And of course Goodreads has issues of its own, but none of
| them are show-stoppers for most people, especially for
| the people who just use it as a glorified Excel
| spreadsheet.
|
| I only chuckle about this because, like many enterprising
| HNers, I myself have considered building a Goodreads
| competitor in the past and even managed to build the ol'
| weekend prototype (i.e. 0.001% of the work). It's one of
| those projects where you start and, after you get some of
| the easy things done like fuzzy search, you go "wait, wtf
| am I doing? Who would switch to this?"
|
| Using and improving OpenLibrary is also alluring, but
| pretty hard to do without an application with actual users
| that have some sort of "edit book" functionality that you
| can then moderate and submit upstream to the OpenLibrary
| data source.
|
| For example, look how ListenNotes.com lets users edit its
| podcast database:
| https://www.listennotes.com/podcasts/the-joe-rogan-experienc...
| -> the "Edit" tab.
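
The fuzzy-search behaviour described in the comments above ("color prple" still finding the right book) can be sketched roughly as follows. This is only an illustration using Python's standard difflib module, not how Goodreads or LibraryThing actually implement search; the TITLES list, the fuzzy_search helper, and the 0.5 cutoff are all made up for the example.

```python
import difflib

# Hypothetical mini-catalog; a real site would query its own book database.
TITLES = [
    "The Color Purple",
    "The Color of Magic",
    "Purple Hibiscus",
    "Dune",
]


def fuzzy_search(query, titles, limit=3, cutoff=0.5):
    """Return up to `limit` titles, most similar to `query` first.

    Similarity is difflib's SequenceMatcher ratio on lowercased strings;
    titles scoring below `cutoff` (an arbitrary threshold) are dropped.
    """
    q = query.lower()
    scored = []
    for title in titles:
        ratio = difflib.SequenceMatcher(None, q, title.lower()).ratio()
        if ratio >= cutoff:
            scored.append((ratio, title))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:limit]]


if __name__ == "__main__":
    # An exact substring search finds nothing for the misspelled query...
    exact = [t for t in TITLES if "color prple" in t.lower()]
    print("exact:", exact)
    # ...while the similarity-based search still surfaces a candidate.
    print("fuzzy:", fuzzy_search("color prple", TITLES))
```

Comparing every title against the query like this only works for toy catalogs; at real scale one would more likely lean on a search engine or a database's trigram/full-text features, which is part of why the thread treats this as the easy bit of the work.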

| FalconSensei wrote:
| Goodreads already doesn't look awesome, but whoa,
| LibraryThing looks like it hasn't been updated in the past
| decade.

| ldng wrote:
| For my own education, what is a "Data Lake"? Is data warehouse
| "has-been" and that's the new hype way to call it?

| bdibs wrote:
| A data lake is raw, unstructured data vs. a warehouse where
| everything is already parsed, processed, and currently
| queryable, is my understanding.

| [deleted]

| prions wrote:
| Really similar to the pipelines that I engineer/manage at my
| current company. Although we have our Airflow on Kubernetes.
|
| One optimization though is separating your loading tasks from
| compute tasks. This makes the pipeline more resilient and makes
| backfilling/reprocessing less of a headache.

| skandl wrote:
| Amazon literally siphons off the data and has invested so little
| in its users.
|
| Any recommendations for an alternative?

| FalconSensei wrote:
| LibraryThing if you don't mind the ugly website

| adam-_- wrote:
| This is potentially quite interesting to me, as we are having
| conversations on/off at work about data reporting, visualising
| etc., which is leading me to pay attention to related topics.
|
| However, it's lacking in any context explaining what you're
| trying to achieve and why.
|
| It's probably obvious to some people but for me, it's not, which
| I think is a shame.

| mrlatinos wrote:
| Beyond just data replication/archival purposes, it seems you
| can use this to run analysis against Goodreads' entire
| public dataset. This is much more efficient than using their
| API alone.

| lcfcjs2 wrote:
| Wow, this is cool.

| cal97g wrote:
| This doesn't seem to be Goodreads' actual ETL pipeline, it's just
| an example ETL pipeline from some guy.

| krmmalik wrote:
| Super interesting. Surely there are some business cases of how
| someone could use this data for good (?)
|
| For example, someone could show the disparity between a New York
| Times bestseller and the book getting the most activity
| on Goodreads (added to most shelves for example)
___________________________________________________________________
(page generated 2020-02-27 23:00 UTC)