[HN Gopher] Scrape like the big boys
       ___________________________________________________________________
        
       Scrape like the big boys
        
       Author : incolumitas
       Score  : 286 points
       Date   : 2021-11-05 09:22 UTC (13 hours ago)
        
 (HTM) web link (incolumitas.com)
 (TXT) w3m dump (incolumitas.com)
        
       | InvOfSmallC wrote:
        | Where I worked, we stopped caring about IPs, browsers, etc.
        | because it was just a race. Instead we analyzed click
        | behaviour and acted on that. When we recognized a bot, we
        | served it a fake page. That also cut costs a bit, because
        | the fake pages were static. In general it took them a long
        | time to discover the pattern, and it was far more manageable
        | for us.
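A minimal sketch of the behaviour-based detection described above. The feature (inter-click timing regularity) and the threshold are illustrative choices, not the commenter's actual rules:

```python
import statistics

def looks_like_bot(click_times, min_jitter=0.15):
    """Flag a session whose inter-click intervals are suspiciously regular.

    Humans click at irregular intervals; naive bots fire on a timer.
    """
    if len(click_times) < 5:
        return False  # not enough signal yet
    gaps = [b - a for a, b in zip(click_times, click_times[1:])]
    mean = statistics.mean(gaps)
    if mean <= 0:
        return True
    # Coefficient of variation: low jitter relative to the mean is machine-like.
    return statistics.stdev(gaps) / mean < min_jitter

def handle(session, render_real, render_decoy):
    """Serve the cheap static decoy page to sessions that look automated."""
    if looks_like_bot(session["click_times"]):
        return render_decoy()
    return render_real()
```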
        
       | devops000 wrote:
        | Could you share your code for AWS Lambda and Puppeteer? It's
        | definitely interesting for other websites.
        
         | incolumitas wrote:
         | Sure.
         | 
         | https://github.com/NikolaiT/Crawling-Infrastructure
         | 
          | And here I am writing about it (but it's quite old):
         | https://incolumitas.com/2019/08/31/web-scraping-puppeteer-aw...
        
       | joekrill wrote:
       | A little pet-peeve I have is when an obscure(ish) acronym is used
       | and never defined. Is SERP a well-known acronym? Perhaps this is
       | a niche blog and I'm not the intended audience.
        
         | tptacek wrote:
         | Yes; a SERP is a Google search result page. It's the most
         | important acronym in SEO.
        
           | nomdep wrote:
            | I don't remember ever hearing it, and I've been in the
            | industry for some time.
        
             | Kiro wrote:
             | You can't be serious.
        
             | hollerith wrote:
             | Huh. I've never been in the industry, but noticed "SERP" at
             | least 15, maybe 20, years ago and have remembered it since.
             | 
             | (If I were writing something to be published, though, I
             | would write "search-engine results page" instead of
             | "SERP".)
        
             | weird-eye-issue wrote:
             | You've been in the SEO industry for some time and never
             | heard SERP?
        
         | daveguy wrote:
         | I had to look it up.
         | 
         | SERP: Search Engine Results Page
        
         | bgroat wrote:
         | Not the OP, but I thought it was well known.
         | 
         | That said, I do a lot of SEO work.
         | 
         | Still, it should be best practice to define any acronym or
         | initialism the first time you use it
        
         | fergie wrote:
         | Unintroduced acronyms should always be avoided.
        
           | nsotelo wrote:
           | As English speakers we often take for granted acronyms such
           | as DB or even USA. For foreigners these can also be
           | inscrutable.
        
           | praptak wrote:
           | Depends on the audience-acronym pair. I don't think HTTP
           | needs an introduction in a technical article, OTOH (on the
           | other hand ;) ) a general newspaper should probably expand
           | HTTP but not WWW.
        
         | joncp wrote:
         | An all-too-common occurrence in HN comments as well.
        
         | marginalia_nu wrote:
          | The word SERP feels like a bit of a shibboleth for SEO
          | people. They seem to take it for granted; the rest of the
          | world just looks puzzled when they hear it.
        
       | hall0ween wrote:
       | Basic question, how does one profit from scraping data and what
       | kinda data?
       | 
       | Taking a stab at answering it: you scrape the data and build a
       | business around selling it. Stock prices? But that's boring, plus
       | how many others are doing it? I bet a lot.
        
         | 323 wrote:
          | These people are scraping artificially limited releases of
          | clothes/shoes. You buy a shoe at $100 and immediately sell
          | it at $1000.
         | 
         | Artificial scarcity - every week you release a "limited edition
         | item", but if you do the math, it's not limited edition at all
         | if you integrate over a year.
        
         | throw1234651234 wrote:
          | 1. Be a job site. 2. Have employees (who cost money) call
          | facilities and get job listings. 3. Establish
          | relationships with facilities to list jobs. 4. Buy job
          | listings from 3rd parties. 5. List them for free, hoping
          | to make a margin. 6. A scraper steals all the jobs, lags
          | the site, and gets the value of your hard work for free.
        
           | hall0ween wrote:
           | ahh thanks
        
       | IceWreck wrote:
        | The author says proxies are expensive and then proceeds to
        | spend a shitton of money buying all that hardware.
        
         | incolumitas wrote:
          | 4G proxies are just so much better than so-called
          | "residential" or straight datacenter proxies. It makes
          | sense to build your own 4G proxy farm if you conduct
          | business in that area.
          | 
          | With only 10 dongles and 10 data plans, you can have a lot
          | of IP addresses that are extremely hard to block. It's a
          | one-time investment, whereas paying proxy providers is a
          | recurring cost.
        
           | bsder wrote:
           | Where do you get 4G dongles that don't suck nowadays?
           | 
           | We tried to get some, but all of the ones we could get were
           | various levels of broken or unsupported.
        
         | palijer wrote:
          | That was not the author's main argument against proxies;
          | that was just an additional point. You ignored the
          | primary argument in your judgment.
         | 
         | >>Because I could not fully trust the other customers with whom
         | I shared the proxy bandwidth. What if I share proxy servers
         | with criminals that do more malicious stuff than the somewhat
         | innocent SERP scraping?
        
           | RandomThrow321 wrote:
           | Can they not call out a secondary point?
        
       | ebbp wrote:
       | Having spent a week battling a particularly inconsiderate
       | scraping attempt, I'm quite unsurprised by the juvenile tone and
       | fairly glib approach to the ethics of bots/scraping presented by
       | the piece.
       | 
       | For the site I work for, about 20-30% of our monthly hosting
       | costs go towards servicing bot/scraping traffic. We've generally
       | priced this into the cost of doing business, as we've prioritised
       | making our site as freely accessible as possible.
       | 
       | But after this week, where some amateur did real damage to us
       | with a ham-fisted attempt to scrape too much too quickly, we're
       | forced to degrade the experience for ALL users by introducing
       | captchas and other techniques we'd really rather not.
        
         | paco3346 wrote:
         | I'm right there with you. I'm the lead engineer for an
         | automotive SaaS provider (with ~6000 customers and ~4 billion
         | requests per month) and we recently started moving all our
         | services to Cloudflare's WAF to take advantage of their bot
          | protection. We were getting scraped by botnets at 100,000+
          | requests per minute, which was affecting performance.
         | 
         | We chose to switch to the JS challenge screen as it requires no
         | human interaction. We now block 75% (estimated to the best of
         | our knowledge) of bot traffic but some customers are livid over
         | the challenge screen.
        
           | [deleted]
        
           | EdwardDiego wrote:
           | What were they scraping, if I can ask? Was it targeted or
           | just wget -r style?
        
             | paco3346 wrote:
             | It was a hybrid of low-effort vulnerability scanning and
             | targeted inventory scraping. Many dealerships in the
             | automotive space will pay gray-hat third parties to scrape
             | and compile data on their competitors.
             | 
             | The irony for us as a provider is that it's one of our
             | customers (party A) paying a third party to scrape data
             | from another one of our customers (party B) which in turn
             | affects the performance of party A's site. We've started
             | blocking these third parties and directing them to paid
             | APIs that we offer.
        
               | RobSm wrote:
               | And how do you get your 'inventory data'? Aren't you
               | scraping (or using scraped data) yourself? Oh the irony
               | :)
        
               | paco3346 wrote:
               | No, we're a contracted provider for these customers. They
               | ingest their data into our network through APIs or CSVs.
        
           | Andoryuuta wrote:
           | I'm really surprised that the JS challenges helped so much,
           | given that there are open source libraries for bypassing them
           | (e.g. cloudscraper[0]).
           | 
           | [0]: https://github.com/venomous/cloudscraper
        
             | paco3346 wrote:
             | If someone wanted to get past it they probably could. We've
             | had a few sources of traffic that we've had to straight up
             | block (as opposed to challenge) because of this exact
             | issue. So far it's been a "good enough" solution that
             | blocks enough of the bot traffic to be effective.
        
           | RobSm wrote:
           | Why do you think those bots were scraping your data in the
           | first place?
        
         | devwastaken wrote:
         | If an amateur can do that to your service by scraping, imagine
         | what someone can do if they actually intend to do you harm.
         | With cloud pricing models someone could find a little
         | misconfiguration or oversight and put you in the hole in
         | operating costs. Anti-abuse is a necessary design when your
         | service is exposed to the internet.
         | 
         | Not saying that doesn't suck - it does, it's why many ideas
         | don't work in practice as an online service.
        
         | kulikalov wrote:
          | Why not create an API endpoint and charge a modest price
          | for that data? You'll make money instead of spending it.
        
           | scarygliders wrote:
           | Do you honestly believe all site scraper people/companies are
           | ethical enough to go to whoever pays /them/ to scrape data
           | from a competitor's site and say "oh they offer an API to
           | access this data let's pay for that", instead of "why pay for
           | that data when we can scrape it right off their site"?
           | 
           | Also, not all types of company will provide API endpoints. It
           | all depends on the type of site - for example, an online shop
           | might not wish to provide easily accessible data on offered
           | products and prices, to their competitors who may wish to
           | undercut them. Why would an online shop do that?
        
             | jadell wrote:
             | I run a large scraper farm against several large sites.
             | They're not online shops, and we don't compete with them.
             | But they do have hundreds of thousands of data points that
             | we use to provide reports and analytics for our clients,
             | who also do not compete with the sites.
             | 
             | I absolutely would pay for an API that provides that data.
             | I'd be willing to pay 10x more than the cost of maintaining
             | and running the scrapers.
             | 
             | But the sites being scraped have no interest in that.
        
               | texasbigdata wrote:
                | Building and maintaining the scraper is not the cost
                | they would use to measure it internally. It's the
                | cost to build and support the API, and perhaps any
                | perverse incentive it creates where even more data
                | flows out to competitors.
        
               | wolverine876 wrote:
               | And the cost of being scraped.
        
               | RobSm wrote:
                | Building an API is five times easier than building
                | routes for your public webpages, which are basically
                | an 'API' as well.
        
               | CWuestefeld wrote:
               | Have you tried approaching those sites and asking them to
               | provide an API, pointing out that it would be easier for
               | both of you in the long run? Or are you just assuming
                | they wouldn't do it?
               | 
                | Because right now the bots - which comprise probably
                | 2/3 of my traffic - are causing me huge headaches,
                | and I sure wish the people doing it would tell me
                | what the heck they want.
        
             | zivkovicp wrote:
              | Well, you don't need an API, just a CSV file with a
              | catalog.
             | 
             | The scraping company WILL use the API/CSV file... they will
             | probably also still charge their customer for scraping, so
             | it's a win-win :D
             | 
             | You can think of it this way, the prices and product data
             | are publicly visible already on the website, there are no
             | real secrets, none of it is password protected.
             | 
              | You can be principled and insist on blocking bots,
              | spending a lot of time and money on tools, people, and
              | ultimately hosting, because the bots will _always_
              | win; or you can offer the data for free or a minimal
              | fee, serve it at almost zero cost, and cache it so a
              | micro-sized server can handle it.
             | 
             | You can always lie about some of the prices if you want,
             | but you will just encourage bots again.
             | 
             | Ethics are nice, but let's be honest, very lacking.
             | Sometimes it's better to be pragmatic.
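The CSV-catalog idea above is about as small as it sounds; a sketch, with field names that are purely illustrative:

```python
import csv
import io

def catalog_csv(products):
    """Render a product list as CSV text, suitable for serving as a
    periodically regenerated static file behind a cache."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["sku", "name", "price"])
    writer.writeheader()
    for p in products:
        writer.writerow({"sku": p["sku"], "name": p["name"],
                         "price": p["price"]})
    return buf.getvalue()
```

Regenerating the file on a schedule and serving it statically means scrapers never touch the dynamic site at all.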
        
               | scarygliders wrote:
               | > You can think of it this way, the prices and product
               | data are publicly visible already on the website, there
               | are no real secrets, none of it is password protected.
               | 
               | There's the problem right there. The prices and product
                | data are publicly visible - because there is a
                | target audience of /humans/ for whom the site is
                | designed and intended. The site is not there to
                | cater to a competitor's scrapers.
               | 
               | I don't care how much people couch their unethical
               | behaviour in "the data is publically available", the
               | basic fact is most if not all websites exist for human
               | eyeballs to look at them. They do not exist for arseholes
               | to DOS them by inundating them with scrapers.
        
               | zivkovicp wrote:
                | I agree 100%, but it is a fact of life, and
                | sometimes it's better to just minimize the fuss and
                | focus on the things that matter.
               | 
               | Your argument is perfectly valid and applies to offline
               | activities as well (what stops a competitor from walking
               | through the aisles of a Walmart or Costco?), but this is
               | a battle that can't be won, there are too many parasitic
               | actors. It is human nature.
        
               | mcdonje wrote:
               | > (what stops a competitor from walking through the
               | aisles of a Walmart or Costco?)
               | 
               | That's a significant portion of Nielsen's business model.
        
               | TeMPOraL wrote:
               | > _the basic fact is most if not all websites exist for
               | human eyeballs to look at them._
               | 
                | There's a whole ethical subthread here of websites
                | trying to make the experience for those humans
                | miserable, and taking away the agency necessary to
                | protect oneself from that. A browser is _a_ user
                | agent. So is a screen reader.
               | So is a script one writes to not deal with bullshit
               | fluff, when all one wants is a simple table of products,
               | features and prices.
        
               | 0xdeadbeefbabe wrote:
               | Let's not encourage these unethical people to even think
               | of using human eyeballs and manual data entry for their
               | scraping instead of bots. That sounds pretty darn
               | unethical.
        
               | zo1 wrote:
               | From my perspective, the problem is that the data that is
               | offered isn't really "for humans". The data is for
                | _convincing_ the humans to buy/pay or, worse, browse and
               | watch ads as a result.
               | 
               | But overall, information is one of those goods that has
               | intrinsic properties like no other. It can be copied,
               | infinitely. And we haven't yet figured out the dynamics
               | of how to reason about it, so it feels like we're
               | pretending they're physical goods.
               | 
               | Edit. Side note. I'd go further and say that some of the
               | data is even worse, it's "offered" with the real
               | intention being to confuse the users into performing non-
               | optimally in the market. Look at
               | Amazon/Ebay/AliExpress/Google listings for evidence of
                | that. Just take Google - Google is an ML and
                | scraping powerhouse, and the best they can muster is
                | to be spammed with
               | fake websites and duplicate/confusing listings.
        
               | TeMPOraL wrote:
               | You hit the nail on the head. It's hard to have sympathy
                | for site operators complaining about scraping when
                | almost every site does its best[0] to make using it
                | a time-consuming, potentially risky and overall
                | annoying
               | ordeal. Not to mention, information asymmetry is anathema
               | to a well-functioning market, and yet no. 1 reason for
               | fighting bots given in the whole thread here is a desire
               | to maintain that information asymmetry.
               | 
               | And that's also the dirty secret behind the "attention
               | economy": it's whole point is to make things _as
               | inefficient as possible_ , because if you're making money
               | on people's attention, you need to first steal it (by
               | distracting them from what they're trying to achieve),
               | and then either direct towards your goals (vs. those of
               | the users), or stretch it out to maximize their exposure
               | to advertising.
               | 
               | --
               | 
               | [0] - Sometimes unintentionally. Unfortunately, the
               | overall zeitgeist of UX design is heavily influenced by
               | bad players, so default advice in the industry is often
               | already intrinsically user-hostile.
        
             | matheusmoreira wrote:
             | > Why would an online shop do that?
             | 
             | Because otherwise the HTML will become the API.
        
         | marginalia_nu wrote:
         | Bots are one of those things that are easy to build and hard to
         | get right, and there's really no way of preparing for the
         | chaotic reality of real web pages other than fixing the
         | problems as they show up. Weird and unexpected interactions are
         | going to happen. Crawling the real web involves navigating a
         | fractal of unexpected, undocumented and non-standard corner
         | cases. Nobody gets that right on the first try. Because of that
         | I do think we need to be a bit patient with bots.
         | 
         | At the same time, even as someone who runs a web crawler, I
         | have zero qualms about blocking misbehaving bots.
        
           | chillfox wrote:
            | I kinda feel like rate limiting your requests to
            | individual domains and IP addresses is an easy thing
            | that goes a long way towards getting it right.
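A per-domain throttle along those lines can be very small. This sketch (interval value is arbitrary) just remembers when each host was last hit and sleeps out the remainder of a fixed interval:

```python
import time
import urllib.parse

class DomainThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval  # seconds between hits per host
        self.last_hit = {}

    def wait(self, url):
        """Block until it is polite to hit the host of `url` again."""
        host = urllib.parse.urlsplit(url).netloc
        now = time.monotonic()
        last = self.last_hit.get(host)
        if last is not None and now - last < self.min_interval:
            time.sleep(self.min_interval - (now - last))
            now = time.monotonic()
        self.last_hit[host] = now
```

Call `throttle.wait(url)` immediately before each fetch; different hosts never delay each other.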
        
             | marginalia_nu wrote:
             | There are still snags with that.
             | 
             | Stuff like redirect resolution is very easy to overlook.
             | You may think you're fetching 1 URL per second, but if you
             | are using the wrong tool and you're on a server that has
             | you bouncing around like in a pinball machine and takes you
             | through a dozen redirects for every request, the reality
             | may be closer to 10 requests per second.
             | 
             | On top of that, sometimes the same server has multiple
             | domains. Sometimes the same IP-address serves a large
             | number of servers (maybe it's a CDN).
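One way to avoid the redirect pitfall described above is to follow redirects by hand, so every hop is charged against the same budget as a normal request. In this sketch, `do_request` and `charge` are placeholder hooks for your HTTP client and rate limiter:

```python
import urllib.parse

def fetch_counting_hops(url, do_request, charge, max_hops=10):
    """Follow redirects manually so each hop counts as a real request.

    `do_request(url)` returns (status, location_header_or_None, body);
    `charge(url)` is the rate-limiting/accounting hook, called per hop.
    """
    hops = 0
    while True:
        charge(url)  # every hop is a real request against the server
        status, location, body = do_request(url)
        if status not in (301, 302, 303, 307, 308) or location is None:
            return url, body
        hops += 1
        if hops > max_hops:
            raise RuntimeError("redirect chain too long: " + url)
        url = urllib.parse.urljoin(url, location)
```

With a library that resolves redirects silently, only the first `charge` would ever fire, which is exactly how "1 URL per second" becomes 10 requests per second.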
        
               | RobSm wrote:
               | If you build your site in a way that multiplies each
               | request 10x, well then that's what you get. Don't do that
                | and you won't have issues with requests. Or handle those
               | requests properly. There are solutions to that. You know
               | how many requests your local google CDN gets? They know
               | how to manage load.
        
               | marginalia_nu wrote:
               | Most pages have at least a http->https redirect, many
               | contain a lot of old links to http content.
               | 
               | Usually it's error pages that really drive the large
               | redirect chains. They often have a vibe of like some
               | forgotten stopgap put in place to help with some
               | migration to a version of the site that is no longer in
               | existence.
               | 
               | Of course you don't know it's an error page until you
               | reach the end of the redirect chain.
        
         | [deleted]
        
         | krzyk wrote:
        | As a programmer who just sometimes wants to check whether a
        | given item is available in a store, I would like to be able
        | to use an API for that. But if one is not available, you
        | have to scrape.
        
         | scarygliders wrote:
         | Right with you there.
         | 
         | I had a particularly bad time not so long ago, when a
         | customer's site - a shop - was brought to its knees because
         | someone, probably a competitor, hired some scraper-company of
         | some sort to scrape every product and price.
         | 
         | The scraper would systematically go through every single
         | product page.
         | 
         | And by scraper, I mean - 100's of them. All at the same time,
         | using the old trick of 1 scraper requesting 3 or 4 product
         | pages at a time then pausing for a while.
         | 
         | They used umpteen different IP address blocks from all over the
         | globe - but mainly using OVH vps IP address blocks from France.
         | 
         | Now, maybe if they'd just thrown, say, 5 or 10 of the scraper
         | "units" at the site, no one would have noticed in amongst
         | Googlebot (which they wanted to use anyway because they are
         | using Google Shopping to try to bring in more sales).
         | 
         | But no. This shower of arseholes threw 100's of scraper "tasks"
         | at the site. They got greedy.
         | 
         | Now, the site was robust enough to handle this load - barely -
         | which was massive, however, having to do that /and/ also handle
         | normal day-to-day traffic? Nah. The bastards got greedy and
         | like you I spent a few days unfucking the damage they were
         | causing.
         | 
         | Seriously, I hate scrapers. I hate the people who make
         | scrapers. I hate their lack of ethics. Fuck those guys.
        
           | thatwasunusual wrote:
           | It sucks when this happens, but it's easily avoidable by
           | using a caching frontend of some sort.
           | 
            | My favorite is Varnish,[0] which I have used with great
            | success for _many_ web sites over the years. Even a web
            | site with 10+ million requests per day ran from a single
            | web server for a long time, a decade-ish ago.
           | 
           | [0] https://varnish-cache.org/
        
           | mdoms wrote:
           | If your site is so poorly written it can't handle a few
           | hundred computers trying to do something as simple as loading
           | your product pages then sorry, but that's on you. The
           | information is on the public web and scrapers are as entitled
           | to access it as any web browser.
        
             | [deleted]
        
           | matheusmoreira wrote:
           | > Seriously, I hate scrapers. I hate the people who make
           | scrapers. I hate their lack of ethics. Fuck those guys.
           | 
           | Not everybody in this space is out to destroy your site. Some
           | of us actively try to put as little load on your site as
           | possible. My scraper puts less load on sites than I do when I
           | browse them normally, I've measured it. Really sucks when we
           | get lumped together with the other abusers and blocked.
        
             | ligerzer0 wrote:
             | Exactly, some of us use scrapers because while we can't go
             | full Richard Stallman, we also don't want to visually sift
             | through ridiculous UI just to look at some basic data/text.
        
           | _jal wrote:
           | In a past life, we were consulting with a startup that
           | offered a subscription data service. They were very sensitive
            | about scrapers, especially on the time-limited
            | try-before-you-buy accounts, which competitors were
            | abusing.
           | 
           | At their request, we built a method to flag accounts for data
           | poisoning. Once flagged, those accounts would start getting
           | plausible-ish looking garbage data.
           | 
           | It was pretty effective. One competitor went offline for a
           | few days about a week after that started, and had a more
           | limited offering when they came back up.
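A sketch of the flag-and-poison idea: flagged accounts receive deterministic, plausible-looking noise instead of real values, so re-querying the same record returns the same wrong number and the poisoning is hard to spot. The flagging mechanism and the ±15% perturbation are assumptions, not the consultancy's actual method:

```python
import hashlib
import random

FLAGGED = set()  # account ids marked for data poisoning

def serve_value(account_id, key, real_value):
    """Return the real value, or stable plausible garbage for flagged accounts."""
    if account_id not in FLAGGED:
        return real_value
    # Seed a private RNG from (account, record) so repeated queries agree.
    seed = hashlib.sha256(f"{account_id}:{key}".encode()).hexdigest()
    rng = random.Random(seed)
    # Perturb by up to +/-15% so the number stays plausible but wrong.
    return round(real_value * rng.uniform(0.85, 1.15), 2)
```

The determinism matters: if the garbage changed on every request, a scraper could detect the poisoning simply by fetching twice and diffing.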
        
             | scarygliders wrote:
             | That's a good way of going about dealing with this kind of
             | abuse indeed. Wish I'd thought of doing that at the time,
             | but due to the nature of this shop you didn't need a user
             | account to browse the products/prices.
             | 
             | I'm now making an entirely new shop for them - I shall bear
             | this in mind. Thanks for that!
        
             | brightball wrote:
              | Yea. Detecting and messing with them is the only approach
             | that seems to work for a lot of abusive activity. Banning
             | doesn't work because they will just start over from
             | scratch. The only thing you can really do is make them
             | think you haven't "caught" them yet and during that stretch
             | make sure their time is wasted.
        
           | funnyflamigo wrote:
           | > Seriously, I hate scrapers. I hate the people who make
           | scrapers. I hate their lack of ethics. Fuck those guys.
           | 
           | Wait till you find out what half of Google's business is
           | based on (spoiler - scraping).
           | 
           | I really don't think scraping itself is an issue 90% of the
           | time. It's the behavior of the out of control scrapers that
           | are the problem. A well behaved scraper should barely be
           | noticeable, if at all.
        
             | RobSm wrote:
             | Exactly. I am surprised that the 'devs' can't figure out a
             | way to block only annoying/excessive scrapers. Most likely
             | they are just lazy and then just put 3rd party 'solution'
             | and job done. Pay me.
        
             | jjeaff wrote:
              | At least Google's scraping does result in your website
              | being discoverable by users, so you get something out
              | of it. That's not to say that Google never misuses or
              | steals data it scrapes, but at least there is some
              | benefit. Many other scrapers are merely taking the
              | data to compete.
        
               | funnyflamigo wrote:
               | I strongly feel that if a human can get to it manually,
               | we have to accept that either it will be botted or humans
               | will be paid to do it by hand (They call these people
               | "analysts" or "market researchers").
               | 
                | I might argue that what Google actually uses their
               | scraped data for is their search engine - which is
               | private. They simply allow us access to specially crafted
               | queries, which they can and do manipulate (for many
               | reasons, some good some bad).
               | 
               | The only thing I'd say meets that definition would be
               | like Common Crawl.
        
               | [deleted]
        
           | [deleted]
        
         | jtdev wrote:
         | Considering the demand for your content, why haven't you
         | created and provided an API? Maybe you could monetize?
        
           | chewmieser wrote:
           | Like everyone and their brother has a web spider. And some of
           | them are VERY badly designed. We block them when they use too
           | many resources, although we'd rather just let them be.
           | 
           | Can't speak for the op but we have APIs and move the ones
           | scraping and reselling our content to APIs. The majority are
           | just a worthless suck on resources though.
        
           | ebbp wrote:
           | We do offer an API - the scrapers are trying to circumvent
           | using that, presumably.
        
             | halfmatthalfcat wrote:
             | Maybe the API terms/cost are prohibitive? I'm sure there's
             | some equilibrium where they would rather pay you than go
             | through the trouble of scraping.
        
               | kulikalov wrote:
               | Maybe docs or infra are unbearable
        
             | purerandomness wrote:
              | Why do you think they are trying to circumvent it?
             | 
             | Does your API provide all the information that can be found
             | on the site, or are they scraping because the API is
             | incomplete?
             | 
              | We once had to scrape Amazon product pages: they have a
              | lot of API endpoints, but those didn't contain the data
              | we needed.
        
               | scarygliders wrote:
                | Why would Amazon wish to provide you with easy-to-access
               | data on their products and prices when you could either
               | be a competitor wishing to undercut those prices, or be a
               | scraper company hired by such a competitor?
               | 
               | In what universe is providing such a straightforward way
               | of helping a competitor considered sane business
               | practice?
        
               | matheusmoreira wrote:
               | Because they will get the data regardless of what you do
               | and if you don't make an API it will cost you more due to
               | overhead.
        
               | jtdev wrote:
               | In the end, they still get the data, just in a much less
               | desirable way for both you and the customer.
        
               | manquer wrote:
               | Most sellers who are on Amazon platform give Amazon that
               | information and a lot more, knowing full well Amazon will
               | use their sales data to launch an Amazon Basics
                | competitor.
               | 
               | It is a sane business approach when you are a pragmatic
               | business who knows the limits that constrain your
               | business.
               | 
                | Either the content company is going to build a simple
                | API (could be just a static CSV file hosted on S3 or
                | whatever) with useful information, or try to
                | monetize/hide this information and force scrapers to
                | use the website.
                | 
                | A bot is always going to win unless you want to give
                | users a lot of friction too. In the era of deepfakes
                | and fairly robust AI tooling, the difference between
                | bot action and human action is not all that much.
                | 
                | If you are going to be aggressive with captchas, IP
                | blocks and other fingerprinting, users who get
                | identified as false positives or get annoyed will
                | leave.
                | 
                | When the cost of losing those users is more than
                | allowing access to scrapers, you would absolutely set
                | up the API.
        
               | weird-eye-issue wrote:
               | Man your comment is hilarious because in fact Amazon DOES
               | provide an API for exactly that
        
               | scarygliders wrote:
               | And yet...
               | 
                | > We once had to scrape Amazon product pages: they
                | have a lot of API endpoints, but those didn't contain
                | the data we needed.
               | 
               | ...only a couple of comments up.
        
               | matheusmoreira wrote:
               | This is the number one reason to scrape websites. It's
               | always nice when there's an API with documentation and
               | rate limiting rules you can follow. Sometimes the data I
               | need just isn't there, though. Then I open up their site
               | and find a huge amount of private API endpoints that do
               | exactly what I want. Then I open up a ticket about it and
                | it gets 200 replies but they ignore it for _years_.
                | It's fucking stupid and it's really no wonder people
                | scrape their site.
        
             | 1cvmask wrote:
             | What is your site may I ask?
             | 
             | Just curious about the difference in value from using your
             | API and web scraping as there is a cost to web scraping as
             | well.
        
               | bryanrasmussen wrote:
                | If you make your scraper well, and it convincingly
                | impersonates a real user, you end up with a solution
                | that can be tweaked as needed to handle whatever traps
                | people put in to try to defeat your scrapers.
               | 
               | If you make your api client well, you don't have the
               | problems of a scraper - but if the api owner decides to
               | change rules for api and you can't do what your business
               | is based on being able to do (think of api owner as
               | Twitter) then you need to make a scraper.
        
             | gmanis wrote:
              | Is it not viable to put the majority of your data behind
              | a login, so that bots only get a very limited snapshot
              | while legitimate users get the rest through a free
              | login?
              | 
              | I'm asking this because I'm going through a very similar
              | situation and would love to see other opinions around
              | this.
        
               | weird-eye-issue wrote:
               | You are defining legitimate users as those that have a
               | valid session cookie? Good luck
        
             | aninteger wrote:
             | Wait, why wouldn't you have rate limiting on your API?
             | Providers like Cloudflare offer this although I guess you
             | could roll your own too since our industry loves to
             | reinvent the wheel.
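Per-login rate limiting like the parent suggests is also easy to roll yourself. A minimal token-bucket sketch in Python; the function names and the rate/burst numbers are illustrative, not any particular provider's API:

```python
import time

class TokenBucket:
    """Per-login token bucket: refills at `rate` tokens/second,
    holds at most `cap` tokens (the allowed burst size)."""

    def __init__(self, rate: float, cap: float):
        self.rate, self.cap = rate, cap
        self.tokens, self.last = cap, time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at burst size.
        now = time.monotonic()
        self.tokens = min(self.cap, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per login (in practice this would live in Redis or similar).
buckets: dict[str, TokenBucket] = {}

def allow_request(login: str, rate: float = 5.0, burst: float = 10.0) -> bool:
    bucket = buckets.setdefault(login, TokenBucket(rate, burst))
    return bucket.allow()
```

Keying the bucket on the login rather than the IP sidesteps the cgNAT problem discussed elsewhere in the thread: thousands of legitimate users can share one mobile-carrier IP, but not one account.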
        
           | throwaway2993 wrote:
           | I wrote a scraper a couple of years ago to get a single data
           | point from a website where my client was already a paying
           | customer. This website had an API, which they were also
           | paying for, but the API didn't cover that data point, so at
           | the time they had one of their admin people populating that
           | missing piece of data manually, which was taking them around
           | ten minutes a day.
           | 
           | I asked them if my customer could pay to access this data
           | point via their API and they quoted 3600 EUR/month! Enter the
           | scraper...
        
         | [deleted]
        
         | taytus wrote:
         | >where some amateur did real damage to us
         | 
         | If an amateur can do damage to you, then I have some bad news
         | for you...
        
           | Goronmon wrote:
           | _If an amateur can do damage to you, then I have some bad
           | news for you..._
           | 
            | I believe the point wasn't surprise that damage occurred
            | at all, but frustration that damage can occur just out of
            | laziness/ignorance rather than malice.
        
             | scarygliders wrote:
             | Indeed, that was precisely their point, and "bad news for
             | you" is disingenuous as there are many techniques used by
             | incompetent, or just downright unethical and greedy scraper
             | companies which, no matter how robust the target is, can
             | still give it a major headache.
             | 
             | I've witnessed a site being basically DOS'ed due to
             | particularly greedy and aggressive mass scraping attempts.
        
           | convolutionart wrote:
            | This is nonsense. It's always easier to destroy than to
            | build/maintain. If you have any real advice, by all
            | means...
        
       | biosed wrote:
        | I used to lead Sys Eng for a FTSE 100 company. Our data was
        | valuable but only for a short amount of time. We were
        | constantly scraped, which cost us in hosting etc. We even saw
        | competitors use our figures (good ones used them to offset
        | their prices, bad ones just used them straight). As the
        | article suggests, we couldn't block mobile operator IPs; some
        | had over 100k customers behind them. Forcing the users to log
        | in did little as the scrapers just created accounts. We had a
        | few approaches that minimised the scraping:
        | 
        | Rate limiting by login,
        | 
        | Limiting data to known workflows ...
        | 
        | But our most fruitful effort was when we removed limits and
        | started giving "bad" data. By bad I mean alter the price up or
        | down by a small percentage. This hit them in the pocket but
        | again, wasn't a silver bullet. If the customer made a
        | transaction on the altered figure we informed them and took it
        | at the correct price.
       | 
       | It's a cool problem to tackle but it is just an arms race.
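The "bad data" tactic above can be sketched in a few lines. This is a toy Python version, not the poster's actual system: it assumes some upstream heuristic has already flagged a session as a likely scraper, and derives a session-stable jitter from a hash so repeated requests look consistent rather than obviously noisy:

```python
import hashlib

def displayed_price(real_price: float, session_id: str,
                    flagged: bool, max_jitter: float = 0.03) -> float:
    """Return the price to render. Flagged (suspected-scraper)
    sessions see the real price nudged up or down by a small,
    session-stable percentage; everyone else sees the real price."""
    if not flagged:
        return real_price
    # Map the session id to a stable factor in [-max_jitter, +max_jitter)
    # so the same scraper always sees the same (wrong) price.
    digest = hashlib.sha256(session_id.encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # in [0, 1)
    jitter = (fraction * 2 - 1) * max_jitter
    return round(real_price * (1 + jitter), 2)
```

As the poster notes, any real deployment also needs the safety valve of confirming the true price before a transaction completes, and contract language permitting that confirmation step.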
        
         | wolverine876 wrote:
         | > But our most fruitful effort was when we removed limits and
         | started giving "bad" data. By bad I mean alter the price up or
          | down by a small percentage. ... If the customer made a
          | transaction on the altered figure we informed them and took
          | it at the correct price.
         | 
         | Is that legal? It would be a big blow to trust if I was the
         | customer, but that's without knowing what you were selling and
         | in what market.
        
           | killingtime74 wrote:
           | It's legal if it's in the contract. Standard for contracts to
           | allow for mistakes and confirmations of prices
        
             | kwhitefoot wrote:
              | It's not a mistake if you do it deliberately!
        
               | killingtime74 wrote:
                | Yes (not saying it's a mistake), but a confirmation
                | step can be put in the contract; no law says you only
                | get one chance to display a price.
        
         | rootusrootus wrote:
         | I know a guy at Nike that had to deal with a similar problem.
         | As I recall, they basically gave in -- instead of trying to
         | fight the scrapers, they built them an API so they'd quit
         | trashing the performance of the retail site with all the
         | scraping.
        
           | chadwittman wrote:
           | The real Jedi move
        
             | wrycoder wrote:
             | Especially if you charge for it, which would save them
             | money, because they wouldn't have to redo their code every
             | time you changed your website.
        
           | gonzo41 wrote:
            | I think there's an opportunity for a new JS framework to
            | have something like a randomly generated DOM that will
            | always display the page and elements the same to a human
            | but constantly break paths for computers.
           | 
           | Like displaying a table with semantic elements, then divs,
           | then using an iframe with css grid and floating values over
           | the top.
           | 
           | This almost seems like a problem for AI to solve.
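A toy sketch of that idea for server-rendered HTML: randomize class names on every render (emitting matching CSS) so the page looks identical to a human while CSS-selector-based scrapers break each time. The function name and templating approach here are illustrative, not an existing framework:

```python
import secrets

def randomize_classes(template_html: str, class_names: list[str]) -> str:
    """Replace stable class names with per-render random tokens and
    emit a matching <style> block, so the rendered page is visually
    unchanged but selector paths differ on every request."""
    mapping = {name: "c" + secrets.token_hex(6) for name in class_names}
    html = template_html
    for original, randomized in mapping.items():
        html = html.replace(f'class="{original}"', f'class="{randomized}"')
    # In a real system the original rules would be copied over; here the
    # placeholder comment just marks where they would go.
    style = "".join(
        f".{rnd}{{/* rules formerly on .{orig} */}}"
        for orig, rnd in mapping.items()
    )
    return f"<style>{style}</style>{html}"
```

The obvious limitation, echoed elsewhere in the thread: scrapers can key on text content, element order, or visual position instead of selectors, so this raises their maintenance cost rather than stopping them.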
        
           | matheusmoreira wrote:
           | Yes. That's exactly what everyone should do.
        
             | echelon wrote:
             | If data is your competitive advantage or product, then
             | what? Accept that your market no longer exists and that
             | there's no way to stop theft?
        
               | Grimm1 wrote:
               | You're going to need to explain how scraping publicly
               | available information on a website is theft.
               | 
               | If information is your competitive advantage maybe you
               | shouldn't have it on a publicly accessible website, and
               | should instead stick it behind an API with pay tiers and
               | a very clear license regarding what you may do with it as
               | an end user.
               | 
               | Note, a simple sign up being required to view a website
               | makes it not publicly available information any longer
               | and you can cover usage, again, in a license.
               | 
               | Then you have a whole bunch of legal avenues you can use
               | to protect your work. Assuming you can afford it that is.
        
               | achillesheels wrote:
                | It is copyrighted information, no? So technically it
                | is intellectual property theft if the scraping use is
                | for commercial purposes.
        
               | Grimm1 wrote:
                | No? If you place information publicly on a website
                | it's pretty much fair game; no copyright violation,
                | especially regarding user-generated information.
                | That's my take, but legally it's a gray area that's
                | still going back and forth in the courts (at least in
                | the US). For a while, before a decision was vacated by
                | the Supreme Court, scraping publicly available
                | information on a site was legally protected, seemingly
                | in line with my thoughts on it.
        
         | ransom1538 wrote:
          | I love the honeypot approach. Put tons of valued hrefs on
          | the page that are invisible (CSS) that the scraper would
          | find. Then just rate limit that IP address and randomize the
          | data coming back. Profit.
        
         | endymi0n wrote:
         | > It's a cool problem to tackle but it is just an arms race.
         | 
         | Plus, it's one you're going to lose. I was once asked at an
         | All-Hands why we don't defend ourselves against bots even more
         | vigorously.
         | 
          | My answer was: "Because I don't know how to build a publicly
         | available website that I could not scrape myself if I really
         | wanted to."
        
       | DeathArrow wrote:
       | You can put some wasm crypto mining code and at least profit from
       | bots. :D
        
       | abc03 wrote:
        | I scrape government sites a lot as they don't provide APIs. For
       | mobile proxies, I use the proxidize dongles and mobinet.io (free,
       | with Android devices). As stated in the article, with cgNAT it's
       | basically impossible to block them as in my case, half the
       | country couldn't access the sites anymore (if you place them in
       | several locations and use one carrier each there).
        
       | kerokerokero wrote:
       | Thanks for the share. Great stuff.
       | 
       | I used to scrape websites to generate content for higher SERPs.
       | 
       | Ended up going into the adult industry lols.
       | (https://javfilms.net)
        
         | anon9001 wrote:
         | Neat! I've run across your site organically :P
         | 
         | I've always wondered, and since you're right here... how do
         | sites like this make money?
         | 
         | It looks like you're probably crawling all the JAV vendors,
         | finding free clips of today's releases, embedding them in your
         | own site to draw traffic, and making money with affiliate links
         | to buy the full content?
         | 
         | Am I missing anything? It seems hard to believe you'd get
         | enough affiliate signups to make it worthwhile.
         | 
         | I can imagine your site as being a few hours a year of script
         | maintenance and a money printer, or a 40hr/week SEO job with
         | 1000s of similar sites across the adult industry.
         | 
         | I'd love to know anything you're willing to share about how the
         | business works.
        
       | wilg wrote:
       | Not the same kind of scraping, but does anyone have
       | thoughts/resources/best practices for doing link previews (like
       | Twitter/iMessage/Facebook)?
        
       | mrg3_2013 wrote:
       | wow! That was an interesting read.
        
       | neals wrote:
        | For a particularly hard-to-scrape website, with some kind of
        | bot protection that I just couldn't reliably get working (if
        | anybody wants to know what that was exactly, I'll go and check
        | it), I now
       | have a small Intel NUC running with firefox that listens to a
        | local server and uses Tampermonkey to perform commands. Works
        | like a charm and I can actually see what it's doing and where
        | it's going wrong. (Though it's not scalable, of course.)
       | 
       | We use it for data-entry on a government website. A human would
       | average around 10 minutes of clicking and typing, where the bot
       | takes maybe 10 seconds. Last year we did 12000 entries. Good bot.
        
         | nkozyra wrote:
         | You can use chromium/chrome/cdp and turn headless off and see
         | the same thing.
        
         | funnyflamigo wrote:
          | I'm curious what bot protection it was? It couldn't have
          | been trying too hard unless you were employing multiple
          | anti-fingerprinting techniques. I'm assuming you used
          | Firefox's built-in anti-fingerprinting?
        
       ___________________________________________________________________
       (page generated 2021-11-05 23:00 UTC)