[HN Gopher] Show HN: Goodreads Data Pipeline
       Show HN: Goodreads Data Pipeline
       Author : san089
       Score  : 150 points
       Date   : 2020-02-27 15:41 UTC (7 hours ago)
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
       | bilater wrote:
       | Nice - You can use UNION ALL instead of UNION in your query at
       | the end if you're confident the datasets don't overlap. Query is
       | less expensive. I'm also curious what the backfilling/recovery
       | process is if something goes wrong and you have to stop your 10
       | min load jobs.
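A minimal sketch of the UNION vs. UNION ALL difference described above, using Python's built-in sqlite3 module. The table names and rows are hypothetical and are not taken from the project's actual queries; the cost difference itself depends on the database engine, so this only shows the behavioral difference (UNION has to deduplicate, UNION ALL does not).

```python
import sqlite3

# In-memory database with two small, hypothetical batches of reviews.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reviews_batch_a (review_id INTEGER, rating INTEGER);
    CREATE TABLE reviews_batch_b (review_id INTEGER, rating INTEGER);
    INSERT INTO reviews_batch_a VALUES (1, 5), (2, 4);
    INSERT INTO reviews_batch_b VALUES (2, 4), (3, 3);
""")

# UNION removes duplicate rows, which requires extra work (a sort or hash).
union_rows = conn.execute(
    "SELECT review_id, rating FROM reviews_batch_a "
    "UNION SELECT review_id, rating FROM reviews_batch_b"
).fetchall()

# UNION ALL simply concatenates the two result sets, duplicates included.
union_all_rows = conn.execute(
    "SELECT review_id, rating FROM reviews_batch_a "
    "UNION ALL SELECT review_id, rating FROM reviews_batch_b"
).fetchall()

print(len(union_rows), len(union_all_rows))
```

Here the row (2, 4) appears in both batches, so the two queries return different row counts. If the datasets truly never overlap, both queries return the same rows and UNION ALL skips the deduplication step.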
       | jonluca wrote:
       | As an aside, Goodreads is the slowest site I use on a regular
       | basis. It's genuinely shocking how there are 5 to 6 second page
       | load times. I'm not sure what their stack is but I'm always blown
       | away by how anyone continues to use this. It feels like a competitor
       | could beat them by just being faster.
         | richie5um wrote:
         | Couldn't agree more. It must have one of the worst usage-to-
         | enjoyment ratios of any site.
         | spullara wrote:
         | Probably Ruby on Rails.
         | https://stackshare.io/goodreads/goodreads
           | traverseda wrote:
           | That stackshare site won't let me view content without
           | creating an account.
           | So that's a downvote from me, please try to find a source
           | that doesn't require you register an account.
             | idkris wrote:
             | I just accessed the stackshare URL without signing in.
               | jaredmosley wrote:
               | I was able to access that link but couldn't do anything
               | else without having to sign in. I also can no longer go
               | back to that link without signing in.
             | FalconSensei wrote:
             | Downvoting someone for contributing to the conversation,
             | without providing any better source. Great
               | FalconSensei wrote:
               | Thanks for the downvote bro!
       | kgraves wrote:
       | Interesting, would like some detail on the cost of this ETL setup
       | on AWS, unfortunately I can't see anything on this from the
       | project page.
         | jmedefind wrote:
         | Since they are owned by Amazon, I would think their cost is
         | close to nothing.
           | dswalter wrote:
           | That's a bit of a misconception. From what I understand,
           | Amazon's non-AWS branches don't get deeply-discounted
           | services from AWS. There is a discount, but it's not enough
           | to turn dark skies into sunshine and rainbows.
           | Amazon tends to want every part of itself to be in ship-
           | shape, and giving itself a massive discount would discourage
           | efficiency in non-AWS parts of the business.
           | Disclosure: neither a current nor former Amazon employee.
             | augmachina wrote:
             | This is a misconception. Amazon wants to depict itself as
             | wanting every part to be in ship-shape, but it does not
             | operate that way and AWS is treated like any internal
             | resource like printers and staplers.
               | [deleted]
             | devcpp wrote:
             | AWS basically finances the rest of Amazon. It's 70% of its
             | revenue (that is public info). Except for retail, the rest
             | is all losses. So the discounts don't matter much, other
             | branches just try to save money (frugality is one of
             | Amazon's core values) but basically get what they need.
           | danial wrote:
           | This repository is not associated with Amazon.
       | gwern wrote:
       | Is this limited strictly to the GoodReads API or does it pull in
       | more interesting data like the shelf-tags? When I did
       | https://www.gwern.net/GoodReads the other month, I had to
       | literally scrape shelves by hand because the API doesn't cover
       | them and they lie to bots.
       | habosa wrote:
       | Someone, somewhere has to be able to make a better alternative to
       | Goodreads right? The site is slow, ugly, and buggy. The
       | functionality is so simple: I tell you when I read a book and
       | what I think about it.
       | I'm just shocked Amazon has been able to own this niche with so
       | little effort.
         | drusepth wrote:
         | I've been working on a competitor for a while now, and the
         | hardest part of replicating functionality is the data.
         | OpenLibrary is probably the best source for book metadata
         | online, but even their library dumps are riddled with mistakes
         | that manifest in weird ways as you start building your own
         | library. The Goodreads site sucks, but they've got surprising
         | data quality that I don't think anyone else has; and they have
         | a super restrictive data policy so you can't repurpose book
         | data, reviews, shelf data, etc, even when users auth with a
         | Goodreads account.
         | It's a small moat, but definitely penetrable with more than a
         | little effort.
           | hombre_fatal wrote:
           | Demanding perfect data is a waste of time that will let you
           | procrastinate your product indefinitely.
             | maw wrote:
             | If you're aiming for "like X, but for people who are
             | actually interested in the market X supposedly serves" then
             | maybe this isn't true.
           | barkingcat wrote:
           | The good data quality is actually an artifact of humans being
           | involved in every step of the cataloging process. There's a
           | large group in goodreads called the GoodReads Librarians, and
           | that group has around a hundred thousand dedicated people who
           | go through and flag anomalies, correct titles and indexes, etc.
           | Book publishers or people who've worked in book publishing
           | will know that the book database is one area you don't want
           | to mess with unless you know what you are doing. ISBNs are
           | not the be all and end all of the story, and when you start
           | taking into account special editions, covers, ebook editions,
           | language translations, you'll start to realize that the Book
           | Catalog system going back in history, including the Dewey
           | Decimal system, is a marvel of human achievement.
           | Of course establishing a good quality index is going to take
           | work. People often forget that quality takes human work and
           | effort.
           | EDIT: I lied. I changed the number from my original estimate
           | of a "few hundred" to "hundred thousand". The Goodreads
           | Librarians group has 103718 members as of when I just peeked
           | now - so it's actually a large number of humans submitting
           | fixes to their catalog.
           | https://www.goodreads.com/group/show/220-goodreads-
           | librarian...
           | If you take a look at the kind of discussions taking place,
           | those are the kinds of things any competitor to Goodreads
           | needs to know about.
           | spullara wrote:
           | Facts are not copyrightable and scraping has been determined
           | to be legal. IANAL, but I'm not sure law would protect them
           | from the factual metadata about books being repurposed.
             | jjeaff wrote:
             | Not being illegal wouldn't protect you from a crushing
             | lawsuit though. Especially since the details likely vary
             | (LinkedIn data was publicly accessible, not sure if
             | Goodreads requires a login).
           | trollied wrote:
           | You could nick the scraping code from Calibre...
             | dewey wrote:
             | The scraping part is probably not the complicated portion
             | of the endeavor.
         | aaron-santos wrote:
         | Just signed up for FediReads[1] last week. It's a decentralized
         | Goodreads with ActivityPub, and open source[2].
         | [1] http://fedireads-test.glitch.me/
         | [2] https://github.com/mouse-reeve/fedireads/
           | FalconSensei wrote:
           | > This is just a demo, any data here may be deleted without
           | warning. sign up for email updates
           | So basically, keep using GoodReads for now?
         | 101008 wrote:
         | I've been working on something like this. Super simple, like a
         | spreadsheet of what you read but as a SaaS. I was thinking of
         | monetizing it a la Pinboard: focused on privacy. Like, $3 per
         | month and you have it, without Amazon or Google knowing what
         | books you read and how you rate them.
         | grimgrin wrote:
         | Not to imply this functionality is complex, but really the most
         | important thing for me is the lists:
         | I _love_ that I can take a book I enjoyed, see it's on a list
         | of "Best Magic Systems", and note what was rated even better
         | for its magic system
         | A simple method of discovery for me
         | https://www.goodreads.com/list/show/871.Most_Interesting_Mag...
         | https://www.goodreads.com/list/show/8497.Aliens_First_Contac...
         | spillguard wrote:
         | As a regular Goodreads user, I've never cared about the site's
         | relatively slow load times. What's important to me is the trust
         | that the site will still be around in 20 years, largely thanks
         | to the Amazon ownership and Kindle integrations. I wouldn't
         | have that same faith in an anonymous competitor.
           | FalconSensei wrote:
           | The kindle integration, the amount of correct data, and the
           | fact that it's not going to vanish in the next year is what
           | keeps me using GoodReads.
           | kirubakaran wrote:
           | Goodreads was that anonymous site once upon a time. You're
           | just not an early adopter and that's okay. That's no reason
           | to not create a better alternative.
         | trollied wrote:
         | The thing that goodreads has that will be hard to replicate is
         | the Kindle integration.
           | mmanfrin wrote:
           | This is the Achilles heel of any potential competitor. The
           | lazy integration means there is a big subset of users who
           | simply won't engage with a competitor because it requires
           | more work. Couple that with the social graph Goodreads
           | already has and you're looking at a huge moat.
         | sidthekid wrote:
         | I find the site decently fast, definitely ugly but then again I
         | don't want it to get a reddit-style redesign either. The
         | information density is ok right now, and I'm actually impressed
         | by the wide range of functionalities they have, related to
         | reviewing books and updating your progress.
         | hombre_fatal wrote:
         | Have you tried looking? There's LibraryThing and a couple
         | others.
         | I don't think there's much value left on the table in the
         | niche, though. Kindles have first-class Goodreads sync and even
         | a Goodreads button in their global navbar. And Goodreads'
         | competitors, for the few people who don't want to use
         | Goodreads, already have a deep rut of incumbency.
         | Even you, who has supposed great issues with Goodreads,
         | apparently wasn't bothered enough to even see if competitors
         | existed all this time, much less before writing your comment.
         | Doesn't bode well for the Goodreads' competitor market, lol.
           | dlsso wrote:
           | None of the competitors I'm aware of have fuzzy search
           | though, which is pretty annoying.
           | "color prple"
           | LibraryThing: 0 results
           | Goodreads: 2,000+ results and they're well sorted
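A minimal sketch of the kind of fuzzy matching described above, using `difflib` from Python's standard library. How Goodreads or LibraryThing actually implement search is not stated in this thread; the title list, the `fuzzy_search` helper, and the 0.5 cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical catalog; a real one would hold millions of titles.
titles = ["The Color Purple", "The Color of Magic", "Purple Hibiscus", "Dune"]


def exact_search(query, titles):
    """Naive substring search: a single typo yields no results."""
    return [t for t in titles if query.lower() in t.lower()]


def fuzzy_search(query, titles, cutoff=0.5):
    """Return titles whose similarity to the query is at least `cutoff`,
    best match first, tolerating typos such as a dropped letter."""
    lowered = {t.lower(): t for t in titles}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=3, cutoff=cutoff
    )
    return [lowered[m] for m in matches]


print(exact_search("color prple", titles))
print(fuzzy_search("color prple", titles))
```

With the misspelled query from the comment, the substring search finds nothing while the similarity-based search still ranks "The Color Purple" first. Comparing the query against every title like this would not scale to a full catalog; production search engines typically rely on indexed approaches instead.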
             | hombre_fatal wrote:
             | Reddit has terrible search too, but you can appreciate that
             | "Reddit but with good search" isn't all it takes to compete
             | with Reddit. That's 0.001% of the work.
             | And of course Goodreads has issues of its own, but none of
             | them are show-stoppers for most people, least of all for
             | the people who just use it as a glorified Excel
             | spreadsheet.
             | I only chuckle about this because, like many enterprising
             | HNers, I myself have considered building a Goodreads
             | competitor in the past and even managed to build the ol'
             | weekend prototype (i.e. 0.001% of the work). It's one of
             | those projects where you start and, after you get some of
             | the easy things done like fuzzy search, you go "wait, wtf
             | am I doing? Who would switch to this?"
             | Using and improving OpenLibrary is also alluring, but
             | pretty hard to do without an application with actual users
             | that have some sort of "edit book" functionality that you
             | can then moderate and submit upstream to the OpenLibrary
             | data source.
             | For example, look how ListenNotes.com lets users edit its
             | podcast database: https://www.listennotes.com/podcasts/the-
             | joe-rogan-experienc... -> the "Edit" tab.
           | FalconSensei wrote:
           | GoodReads already doesn't look awesome, but whoa,
           | LibraryThing looks like it hasn't been updated in the past
           | decade
       | ldng wrote:
       | For my own education, what is a "Data Lake"? Data warehouse is
       | "has-been" and that's the new hype way to call it?
         | bdibs wrote:
         | My understanding is that a data lake holds raw, unstructured
         | data, whereas in a warehouse everything is already parsed,
         | processed, and queryable.
         | [deleted]
       | prions wrote:
       | Really similar to the pipelines that I engineer/manage at my
       | current company. Although we have our Airflow on kubernetes.
       | One optimization though is separating your loading tasks from
       | compute tasks. This makes the pipeline more resilient and makes
       | backfilling/reprocessing less of a headache.
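A minimal sketch of the loading/compute separation suggested above, in plain Python rather than Airflow. The function names, the batch identifier, the staging directory layout, and the sample records are all hypothetical; the point is only that the compute step reads from staged raw data, so it can be re-run during a backfill without fetching anything again.

```python
import json
import tempfile
from pathlib import Path


def load_batch(batch_id, records, staging_dir):
    """Loading task: persist the raw records untouched so they can be replayed."""
    path = Path(staging_dir) / f"{batch_id}.json"
    path.write_text(json.dumps(records))
    return path


def compute_batch(batch_id, staging_dir):
    """Compute task: reads only from staging, never from the upstream source."""
    path = Path(staging_dir) / f"{batch_id}.json"
    records = json.loads(path.read_text())
    ratings = [r["rating"] for r in records]
    return sum(ratings) / len(ratings)


staging = tempfile.mkdtemp()

# The load runs once per batch (hypothetical batch id and records).
load_batch("2020-02-27T15-40", [{"rating": 4}, {"rating": 5}], staging)

# The compute step can be run, and later re-run, independently of the load.
first = compute_batch("2020-02-27T15-40", staging)
rerun = compute_batch("2020-02-27T15-40", staging)
print(first, rerun)
```

If the compute logic has a bug, only `compute_batch` needs to be re-executed over the staged batches; the raw data does not have to be pulled from the source a second time.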
       | skandl wrote:
       | Amazon literally siphons off the data and has invested so little
       | in its users.
       | Any recommendations for an alternative?
         | FalconSensei wrote:
         | LibraryThing if you don't mind the ugly website
       | adam-_- wrote:
       | This is potentially quite interesting to me, as we are having
       | conversations on/off at work about data reporting, visualising
       | etc., which is leading me to pay attention to related topics.
       | However, it's lacking in any context explaining what you're
       | trying to achieve and why.
       | It's probably obvious to some people but for me, it's not, which
       | I think is a shame.
         | mrlatinos wrote:
         | Beyond just data replication/archival purposes, it seems you
         | can use this to run analysis against Goodreads' entire
         | public dataset. This is much more efficient than using their
         | API alone.
       | lcfcjs2 wrote:
       | Wow this is cool.
       | cal97g wrote:
       | This doesn't seem to be goodreads' actual etl pipeline, it's just
       | an example ETL pipeline from some guy.
       | krmmalik wrote:
       | Super interesting. Surely there are some business cases of how
       | someone could use this data for good (?)
       | For example someone could show the disparity between a New York
       | Times bestseller and the book getting the most activity on
       | GoodReads (added to most shelves for example)
       (page generated 2020-02-27 23:00 UTC)