[HN Gopher] Web scraping is legal, US appeals court reaffirms
       ___________________________________________________________________
        
       Web scraping is legal, US appeals court reaffirms
        
       Author : spenvo
       Score  : 432 points
       Date   : 2022-04-18 19:37 UTC (3 hours ago)
        
 (HTM) web link (techcrunch.com)
 (TXT) w3m dump (techcrunch.com)
        
       | bobajeff wrote:
       | I had no idea this was even being discussed. I'm glad they are
       | reasonable on this. Wish they had been as reasonable on breaking
       | encryption/DRM schemes.
        
       | amelius wrote:
       | Where is the lobbying? Something seems wrong ...
        
       | 8bitsrule wrote:
       | Wondered how this works WRT copyright (since the article did not
       | contain the word). Here's Kent State's (short) IPR advice
       | [https://libguides.library.kent.edu/data-management/copyright] It
       | says "Data are considered 'facts' under U.S. law. They are not
       | copyrightable.... Creative arrangement, annotation, or selection
       | of data can be protected by copyright."
        
       | ricardo81 wrote:
        | For what it's worth, LinkedIn was incredibly easy to scrape
        | back in the day, wrt profile/email correlation. I can't buy
        | any aggressive stance they may have against 'scrapers'.
        | 
        | There were two options.
        | 
        | Their LinkedIn IDs are base 12, and the site would redirect
        | you if you simply wanted to enumerate them.
        | 
        | You could also upload your 'contacts', 200-300 at a time, and
        | it'd leak profile IDs (Twitter and Facebook mitigated this ~5
        | years ago). I still have a @pornhub or some such "contact"
        | that I can't delete from testing this.
        
       | infiniteL0Op wrote:
        
       | altdataseller wrote:
       | Does the ruling make it illegal to block scrapers?
        
       | ricardo81 wrote:
        | Interesting, as I've seen a few search engine startups that
        | seem to scrape other search engines' results, depending on
        | your definition of scraping. My definition would be a user
        | agent that doesn't uniquely identify itself and isn't using an
        | authorised API.
        
       | cloudyporpoise wrote:
        | These anti-scraping corporate activists need to get with the
        | times and allow access to their data, legitimately, through an
        | API. Third parties will scrape and sell the data regardless,
        | so why not cut them out and even charge individuals to use the
        | API legitimately? API keys could be tied to an individual, and
        | at least LinkedIn would know who was conducting what action.
       | 
       | Make it easier to get the data through an API than having to
       | scrape it.
        
       | CWuestefeld wrote:
       | While I have sympathy for what the scrapers are trying to do in
       | many cases, it bothers me that this doesn't seem to address what
       | happens when badly-behaved scrapers cause, in effect, a DOS on
       | the site.
       | 
       | For the family of sites I'm responsible for, bot traffic
       | comprises a majority of traffic - that is, to a first
       | approximation, the lion's share of our operational costs are from
       | needing to scale to handle the huge amount of bot traffic. Even
       | when it's not as big as a DOS, it doesn't seem right to me that I
       | can't tell people they're not welcome to cause this additional
       | system load.
       | 
        | Or even if there were some standardized way that we could
        | provide a dumb API, just giving them raw data, so we don't
        | need to incur the additional processing expense of the
        | creature comforts on the page, designed to make our users
        | happier, which the bots won't notice anyway.
        
         | hardtke wrote:
          | The problem with many sites (and LinkedIn in particular) is
          | that they whitelist a bunch of specific websites, presumably
          | based on their business interests, but disallow everyone
          | else in their robots.txt. You should either allow all
          | scrapers that respect certain load requirements or allow
          | none. Anything that Google is allowed to see and include in
          | their search results should be fair game.
         | 
         | Here's the end of LinkedIn's robots.txt:
         | 
          |     User-agent: *
          |     Disallow: /
          | 
          |     # Notice: If you would like to crawl LinkedIn,
          |     # please email whitelist-crawl@linkedin.com to apply
          |     # for white listing.
        
           | diamondage wrote:
            | And this is what the HiQ case hinged on. LinkedIn were
            | essentially selectively applying the Computer Fraud and
            | Abuse Act based on their business interests - that was
            | never going to sit well with judges.
        
         | car_analogy wrote:
         | > Even when it's not as big as a DOS, it doesn't seem right to
         | me that I can't tell people they're not welcome to cause this
         | additional system load.
         | 
          | You _can_ tell them. You just can't prosecute them if they
          | don't obey.
        
         | davidhyde wrote:
          | I don't know what kind of data you serve up, but perhaps
          | you could serve low-quality or inaccurate content from
          | addresses that are guessed from your API. That is, endpoints
          | not reachable in the normal functioning of your web app
          | should return plausible junk. A mixture of accurate and
          | inaccurate data becomes worthless for bots, and worthless
          | data is not worth scraping. Just an idea!
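          | 
          | A minimal sketch of that idea as a Flask honeypot route
          | (all names here are illustrative, not from any real app):
          | 
          |     import random
          |     from flask import Flask, jsonify
          | 
          |     app = Flask(__name__)
          | 
          |     # A route outside the app's normal link graph: real
          |     # users never land here, so traffic to it is likely
          |     # a bot guessing URLs.
          |     @app.route("/api/v1/items/<int:item_id>")
          |     def honeypot_item(item_id):
          |         # Seed on the ID so repeat requests see
          |         # consistent (but junk) data, which looks more
          |         # plausible than pure noise.
          |         rng = random.Random(item_id)
          |         return jsonify({
          |             "id": item_id,
          |             "name": "Item %d" % item_id,
          |             "price": round(rng.uniform(5, 500), 2),
          |         })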
        
         | ryan_j_naughton wrote:
          | As others have said, (A) there are plenty of
          | countermeasures you can take, but also (B) you are
          | frustrated that you are providing something free to the
          | public, and then annoyed that the "wrong" customers are
          | using your product and costing you money. I'm sorry, but
          | this is a failure of your business model.
         | 
         | If we were to analogize this to a non-internet example: (1) A
         | company throws a free concert/event and believes they will make
         | money by alcohol sales. (2) A bunch of sober/non-drinking folks
         | attend the concert but only drink water (3) Company blames the
         | concert attendees for "taking advantage" of them when they
         | really just had poor company policies and a bad business model.
         | 
          | Put things behind authentication and authorization. Add a
          | paywall. Implement DDoS detection and banning approaches for
          | scrapers. Etc, etc.
         | 
         | But don't make something public and then get mad at THE PUBLIC
         | for using it. Behind that machine is a person, who happens to
         | be a member of the public.
        
           | noisenotsignal wrote:
            | There are certain classes of websites where the proposed
            | solutions aren't a great fit. For example, a shopping site
            | hiding its catalog behind paywalls or authentication would
            | raise barriers to entry such that a lot of genuine
            | customers would be lost. I don't think the business model
            | is generally to blame here, and it's OK to acknowledge the
            | unfortunate overhead and costs added by site usage
            | patterns (e.g. scraping) that run counter to expectations.
        
         | Nextgrid wrote:
         | But don't you already have countermeasures to deter DoS attacks
         | or malicious _human_ users (what if someone pays or convinces
         | people to open your site and press F5 repeatedly)?
         | 
         | If not, you should, and the badly-behaved scrapers are actually
         | a good wake-up call.
        
         | colinmhayes wrote:
          | I'm sympathetic to this. I built a search engine for my
          | senior project, and my half-baked scraper ended up taking
          | down Duke Law's site during their registration period. I
          | ended up getting a not-so-kindly-worded email from them, but
          | honestly this wasn't an especially hard problem to solve.
          | All of my traffic was coming from a cluster on my
          | university's subnet; it wouldn't have been that hard for
          | them to apply per-IP-address timeouts when my crawler
          | started scraping thousands of pages a second on their site.
          | Not to victim blame - this was totally my fault - but I was
          | a bit surprised that they hadn't experienced this before,
          | given how much automated scraping goes on.
        
           | brightball wrote:
           | I'm honestly more interested in bot detection than anything
           | else at this point.
           | 
           | It seems like it should be perfectly legal to detect and then
           | hold the connection open for a long period of time without
           | giving a useful response. Or even send highly compressed gzip
           | responses designed to fill their drives.
           | 
           | Legal or not, I can't see any good reason that we can't make
           | it painful.
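            | 
            | A toy tarpit along those lines (a sketch only: here
            | "bot" detection is just a User-Agent substring check,
            | standing in for something smarter):
            | 
            |     import asyncio
            | 
            |     BOT_UAS = (b"python-requests", b"curl", b"scrapy")
            | 
            |     async def handle(reader, writer):
            |         request = await reader.read(4096)
            |         if any(ua in request.lower() for ua in BOT_UAS):
            |             # Promise a large body, then drip it out a
            |             # byte at a time to tie the client up.
            |             writer.write(b"HTTP/1.1 200 OK\r\n"
            |                          b"Content-Length: 100000\r\n"
            |                          b"\r\n")
            |             for _ in range(100000):
            |                 writer.write(b".")
            |                 await writer.drain()
            |                 await asyncio.sleep(10)
            |         else:
            |             writer.write(b"HTTP/1.1 200 OK\r\n"
            |                          b"Content-Length: 2\r\n\r\nok")
            |             await writer.drain()
            |         writer.close()
            | 
            |     async def main():
            |         server = await asyncio.start_server(
            |             handle, "0.0.0.0", 8080)
            |         async with server:
            |             await server.serve_forever()
            | 
            |     asyncio.run(main())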
        
             | fjabre wrote:
             | Make it painful if they abuse the site.
             | 
             | We all benefit from open data. Polite scrapers are just
             | fine and a natural part of the web ecosystem.
             | 
             | Google has been scraping the web all day every day for
             | decades now.
        
         | rmbyrro wrote:
          | I have sympathy for your operational issues and costs, but
          | isn't this kind of complaint the same as a shopping
          | mall/center complaining about people who go in, check some
          | info, and go out without buying?
          | 
          | I understand that bots have leverage and automation, but so
          | do you, to reach a larger audience. Should we continue to
          | benefit from one side of the leverage while complaining
          | about the other side?
        
           | wlesieutre wrote:
           | It's more like a mall complaining that while they're trying
           | to serve 1000 customers, someone has gone and dumped 10000000
           | roombas throughout the stores which are going around scanning
           | all the price tags.
        
           | kortilla wrote:
           | No, because those are people going to the mall. Not robots
           | 100x the quantity of real people.
        
           | CWuestefeld wrote:
           | No. When I say that bots exceed the amount of real traffic,
           | I'm including people "window shopping" on the good side.
           | 
           | My complaint is more like, somebody wants to know the prices
           | of all our products, and that we have roughly X products
           | (where X is a very large number). They get X friends to all
           | go into the store almost simultaneously, each writing down
           | the price of the particular product they've been assigned to
           | research. When they do this, there's scant space left in the
           | store for even the browsing kind of customers to walk in. (of
           | course I exaggerate a bit, but that's the idea)
        
             | mrobins wrote:
             | I'm sympathetic to the complaints about "rude" scraping
             | behavior but there's an easy solution. Rather than make
             | people consume boatloads of resources they don't want
             | (individual page views, images, scripts, etc.) just build
             | good interoperability tools that give the people what they
             | want. In the physical example above that would be a product
             | catalog that's easily replicated with a CSV product listing
             | or an API.
        
               | jensensbutton wrote:
                | You don't know why any random scraper is scraping
                | you, and thus you don't know what API to build that
                | will keep them from scraping. Also, it's likely
                | easier for them to continue scraping than to write a
                | bunch of code to integrate with your API, so there's
                | no incentive for them to do so either.
        
               | mcronce wrote:
               | Writing a scraper for a webpage is typically far more
               | development effort than writing an API wrapper
        
               | withinboredom wrote:
                | Just advertise the API in the headers. Or better
                | yet, make the buttons/links only accessible via a
                | .usetheapi-dammit selector. Lastly, provide an API
                | and a "developers.whatever.com" domain to report
                | issues with the API, get API keys, and pay for more
                | requests. It should be pretty easy to set up,
                | especially if there's an internal API available
                | behind the frontend already. I'd venture a dev team
                | could devote 20% of a few sprints and have an MVP up
                | and running.
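                | 
                | Advertising the API in headers could be as
                | simple as adding Link relations (RFC 8631
                | defines "service-desc" and "service-doc"); a
                | Flask sketch with placeholder URLs:
                | 
                |     from flask import Flask
                | 
                |     app = Flask(__name__)
                | 
                |     @app.after_request
                |     def advertise_api(resp):
                |         # Point scrapers at the machine-readable
                |         # API description and the human docs.
                |         resp.headers.add(
                |             "Link",
                |             '<https://developers.example.com'
                |             '/openapi.json>; rel="service-desc"')
                |         resp.headers.add(
                |             "Link",
                |             '<https://developers.example.com'
                |             '/docs>; rel="service-doc"')
                |         return resp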
        
               | mrobins wrote:
                | I think lots of website owners know exactly where
                | the value in their content lies. Whether or not they
                | want to share that in a convenient way, especially
                | with competitors etc., is another story.
               | 
               | That said if scraping is inevitable, it's immensely
               | wasteful effort to both the scraper and the content owner
               | that's often avoidable.
        
               | altdataseller wrote:
                | For the 2nd part: I have done scraping, and would
                | always opt for an API, if the price is reasonable,
                | over paying nosebleed amounts for residential
                | proxies.
        
               | CWuestefeld wrote:
               | Yes, exactly. Nobody is standing up and saying "we're the
               | ones doing this, and here's what we wish you'd put in an
               | API".
               | 
               | Also, I'm a big Jenson Button fan.
        
         | [deleted]
        
         | bastardoperator wrote:
          | Have you considered using a caching service like Cloudflare?
        
         | KennyBlanken wrote:
         | > While I have sympathy for what the scrapers are trying to do
         | in many cases, it bothers me that this doesn't seem to address
         | what happens when badly-behaved scrapers cause, in effect, a
         | DOS on the site.
         | 
          | Like when Aaron Swartz spent months hammering JSTOR, causing
          | it to become so slow it was almost unusable, and despite
          | knowing that he was causing widespread problems (including
          | the eventual banning of MIT's entire IP range) actually
          | worked to add additional laptops and improve his scraping
          | speed... all the while going out of his way to subvert MIT's
          | netops group as they tried to figure out where he was on the
          | network.
         | 
         | JSTOR, by the way, is a non-profit that provides aggregate
         | access to their cataloged archive of journals, for schools and
         | libraries to access journals they would otherwise never be able
         | to afford. In many cases, free access.
        
         | linuxdude314 wrote:
         | If most of your traffic is bots, is the site even worth
         | running?
         | 
         | This really is akin to the question, "Should others be allowed
         | to take my photo or try to talk to me in public?"
         | 
          | Of course the answer should be yes; the internet is the
          | digital equivalent of a public space. If you make it
          | accessible, anyone should be able to consume it.
          | 
          | If you don't want it scraped, add auth!
        
         | [deleted]
        
         | voxic11 wrote:
          | The court just ruled that scraping on its own isn't a
          | violation of the CFAA, meaning it doesn't count as the crime
          | of "accessing a protected computer without authorization or
          | exceeding authorized access and obtaining information".
          | 
          | However, presumably all the other provisions of the CFAA
          | still apply, so if your scraping damages the functioning of
          | an internet service then you would still have committed the
          | crime of "damaging a protected computer by intentional
          | access". Negligently damaging a protected computer is
          | punishable by 1 year in prison on the first offense.
          | Recklessly damaging a protected computer is punishable by
          | 1-5 years on the first offense. And intentionally damaging a
          | protected computer is punishable by 1-10 years for the first
          | offense. These penalties can go up to 20 years for repeat
          | offenses.
        
         | gdulli wrote:
         | When the original ruling in favor of HiQ came out, it still
         | allowed for LinkedIn to block certain kinds of malicious
         | scraping. LinkedIn had been specifically blocking HiQ, and was
         | ordered to stop doing that.
        
         | kstrauser wrote:
         | I've told this story before, but it was fun, so I'm sharing it
         | again:
         | 
         | I'll skip the details, but a previous employer dealt with a
         | large, then-new .mil website. Our customers would log into the
         | site to check on the status of their invoices, and each page
         | load would take approximately 1 minute. Seriously. It took
         | about 10 minutes to log in and get to the list of invoices
         | available to be checked, then another minute to look at one of
         | them, then another minute to get out of it and back into the
         | list, and so on.
         | 
         | My job was to write a scraper for that website. It ran all
         | night to fetch data into our DB, and then our website could
         | show the same information to our customers in a matter of
         | milliseconds (or all at once if they wanted one big aggregate
          | report). Our customers _loved_ this. The .mil website's
         | developer _hated_ it, and blamed all sorts of their tech
         | problems on us, although:
         | 
         | - While optimizing, I figured out how to skip lots of
         | intermediate page loads and go directly to the invoices we
         | wanted to see.
         | 
         | - We ran our scraper at night so that it wouldn't interfere
         | with their site during the day.
         | 
         | - Because each of our customers had to check each one of their
         | invoices every day if they wanted to get paid, and we were
         | doing it more efficiently, our total load on their site was
         | lower than the total load of our customers would be.
         | 
          | Their site kept crashing, and we were their scapegoat. It was
         | great fun when they blamed us in a public meeting, and we
         | responded that we'd actually disabled our crawler for the past
         | week, so the problem was still on their end.
         | 
         | Eventually, they threatened to cut off all our access to the
         | site. We helpfully pointed out that their brand new site wasn't
         | ADA compliant, and we had vision-impaired customers who weren't
         | able to use it. We offered to allow our customers to run the
         | same reports from our website, for free, at no cost to the .mil
         | agency, so that they wouldn't have to rebuild their website
         | from the ground up. They saw it our way and begrudgingly
         | allowed us to keep scraping.
        
           | dwater wrote:
           | I have worked with .mil customers who paid us to scrape and
           | index their website because they didn't have a better way to
           | access their official, public documents.
        
             | oneoff786 wrote:
             | Me too but for a private company
             | 
             | In reality it was probably more like org sub group A wanted
             | to leverage org sub group B's data but they didn't
             | cooperate
        
             | jll29 wrote:
             | This is not .mil specific: I've been told of a case where
             | an airline first legally attacked a flight search engine
             | (Skyscanner) for scraping, and then told them to continue
             | when they realized that their own search engine couldn't
             | handle all the traffic, and even if it could, it was more
             | expensive per query than routing via Skyscanner.
        
         | brailsafe wrote:
          | You probably can, at the protocol level, with JSON-LD or
          | other rich-data packages that generate XML or standardized
          | JSON endpoints. I did this for an open data portal, and it's
          | something most G7 governments do with their federal open
          | data portals using off-the-shelf packages (worth researching
          | a bit first, obviously), particularly in the Python and
          | Flask world. We were still getting hammered by China at our
          | Taiwanese-language subdomain, but that was a different
          | concern.
         | bequanna wrote:
          | As someone who has been on the other end, I can tell you
          | devs don't want to use Selenium or inspect requests to
          | reverse engineer your UI, and _wish_ there were more clean
          | APIs.
         | 
         | Have you tried making your UI more challenging to scrape and
         | adding a simple API that requires free registration?
        
         | YPCrumble wrote:
          | Reading your comment, my impression is that this is either
          | an exaggeration or a very unusual type of site, if bots make
          | up the majority of traffic to the point that scrapers are
          | anywhere near the primary load factor.
          | 
          | Would someone let me know if I'm just plain wrong in this
          | assumption? I've run many types of sites, and scrapers have
          | never been anywhere close to the main source of traffic, or
          | even particularly noticeable compared to regular users.
         | 
         | Even considering a very commonly scraped site like LinkedIn or
         | Craigslist - for any site of any magnitude like this public
         | pages are going to be cached so additional scrapers are going
         | to have negligible impact. And a rate limit is probably one
         | line of config.
         | 
          | I'm not saying you are necessarily wrong, but I can't
          | imagine the scenario you're describing and would love to
          | hear of one.
        
           | CWuestefeld wrote:
            | It's a B2B ecommerce site. Our annual revenue from the
            | site would put us on the list of top 100 ecommerce sites
            | [1] (we're not listed because ecommerce isn't the only
            | business we do). With that much potential revenue to steal
            | from us, perhaps the stakes are higher.
           | 
           | As described elsewhere, rate limiting doesn't work. The bots
           | come from hundreds to thousands of separate IPs
           | simultaneously, cooperating in a distributed fashion. Any one
           | of them is within reasonable behavioral ranges.
           | 
           | Also, caching, even through a CDN doesn't help. As a B2B
           | site, all our pricing is custom as negotiated with each
           | customer. (What's ironic is that this means that the pricing
           | data that the bots are scraping isn't even representative -
           | it only shows what we offer walkup, non-contract customers.)
           | And because the pricing is dynamic, it also means that the
           | scraping to get these prices is one of the more
           | computationally expensive activities they could do.
           | 
           | To be fair, there is some low-hanging fruit in blocking many
           | of them. Like, it's easy to detect those that are flooding
           | from a single address, or sending SQL injection attacks, or
           | just plain coming from Russia. I assume those are just the
           | script kiddies and stuff. The problem is that it still leaves
           | a whole lot of bad actors once these are skimmed off the top.
           | 
           | [1] https://en.wikipedia.org/wiki/List_of_largest_Internet_co
           | mpa...
        
             | mindslight wrote:
             | > _As a B2B site, all our pricing is custom as negotiated
             | with each customer ... the pricing is dynamic_
             | 
             | So your company is deliberately trying to frustrate the
             | market, and doesn't like the result of third parties
             | attempting to help market efficiency? It seems like this is
             | the exact kind of scraping that we generally want more of!
             | I'm sorry about your personal technical predicament, but it
             | doesn't sound like your perspective is really coming from
             | the moral high ground here.
        
             | YPCrumble wrote:
             | Thanks for the explanation!
             | 
              | The thing I still don't understand is why caching
              | (edit: server, not CDN) doesn't work - you have to
              | identify customers somehow, and can provide everyone
              | else a cached response at the server level. For that
              | matter, rate limit non-customers also.
        
               | CWuestefeld wrote:
               | The pages getting most of the bot action are search and
               | product details.
               | 
               | Search results obviously can't be cached, as it's
               | completely ad hoc.
               | 
               | Product details can't be cached either, or more
               | precisely, there are parts of each product page that
               | can't be cached because
               | 
               | * different customers have different products in the
               | catalog
               | 
                | * different customers have different prices for a
                | given product
               | 
               | * different products have customer-specific aliases
               | 
               | * there's a huge number of products (low millions) and
               | many thousands of distinct catalogs (many customers have
               | effectively identical catalogs, and we've already got
               | logic that collapses those in the backend)
               | 
               | * prices are also based on costs from upstream suppliers,
               | which are themselves changing dynamically.
               | 
                | Putting all this together, the number of times a
                | given [product,customer] tuple will be requested
                | within a reasonable cache TTL isn't very much greater
                | than 1. The exception is walk-up pricing for
                | non-contract users, and we've been talking about how
                | we might optimize that particular case.
        
               | YPCrumble wrote:
                | Ahhhhh, search results make a whole lot more sense!
                | Thank you. Search can't be cached, and the people who
                | want to use your search functionality as a
                | high-availability API endpoint use different IP
                | addresses to get around rate limiting.
                | 
                | The low millions of products also makes some sense, I
                | suppose, but it's hard to imagine why this doesn't
                | simply take a login for the customer to see the
                | products, if they're unique to each customer.
               | 
               | On the other hand, I suspect the price this company is
               | paying to mitigate scrapers is akin to a drop of water in
               | the ocean, no? As a percent of the development budget it
               | might seem high and therefore seem big to the developer,
               | but I suspect the CEO of the company doesn't even know
               | that scrapers are scraping the site. Maybe I'm wrong.
               | 
               | Thanks again for the multiple explanations in any case,
               | it opened my eyes to a way scrapers could be problematic
               | that I hadn't thought about.
        
           | toast0 wrote:
           | If you've got a site with a _lot_ of pages, bot traffic can
           | get pretty big. Things like a shopping site with a large
           | number of products, a travel site with pages for hotels and
           | things to do, something to do with movies or tv shows and
           | actors, basically anything with a large catalog will drive a
           | lot of bot traffic.
           | 
            | It's been forever since I worked at Yahoo Travel, but bot
            | traffic was significant then - I'd guess roughly 5-10% of
            | the traffic was declared bots, and Yandex and Baidu
            | weren't aggressive crawlers yet - so I wouldn't be
            | terribly surprised if a site with a large catalog that
            | wasn't top 3 with humans would have a majority of traffic
            | as bots. For the most part,
           | we didn't have availability issues as a result of bot
           | traffic, but every once in a while, a bot would really ramp
           | up traffic and cause issues, and we would have to carefully
           | design our list interfaces to avoid bots crawling through a
           | lot of different views of the same list (while also trying to
           | make sure they saw everything in the list). Humans may very
           | well want to have all the narrowing options, but it's not
           | really helpful to expose hotels near Las Vegas starting with
           | the letter M that don't have pools to Google.
        
             | YPCrumble wrote:
             | I appreciate the response but I'm still perplexed. It's not
             | about the percent of traffic if that traffic is cached. And
             | rate limiting also prevents any problems. It just doesn't
             | seem plausible that scrapers are going to DDoS a site per
             | the original comment. I suppose you'd get bad traffic
             | reports and other problems like log noise, but claiming it
             | to be a general form of DDoS really does sound like
             | hyperbole.
        
           | breischl wrote:
            | As another example, I used to work on a site that was
            | roughly hotel stays. A regular person might search where
            | to stay in a small set of areas and dates, usually for
            | the same number of people.
           | 
           | Bots would routinely try to scrape pricing for every
           | combination of {property, arrival_date, departure_date,
           | num_guests} in the next several years. The load to serve this
           | would have been _vastly_ higher than real customers, but our
           | frontend was mostly pretty good at filtering them out.
           | 
           | We also served some legitimate partners that wanted basically
           | the same thing via an API... and the load was in fact
           | enormous. But at least then it was a real partner with some
           | kind of business case that would ultimately benefit us, and
           | we could make some attempt to be smart about what they asked
           | for.
        
           | VWWHFSfQ wrote:
           | > a very unique type of site if bots make up the majority of
           | traffic
           | 
           | Pretty much Twitter and the majority of such websites.
        
             | YPCrumble wrote:
             | Do you really believe bots make up a significant amount of
             | Twitter's operating cost? Like I said they're just
             | accessing cached tweets and are rate limited. How can the
             | bot usage possibly be more than a small part of twitter's
             | operating cost?
        
           | nojito wrote:
           | Bandwidth isn't free.
        
             | YPCrumble wrote:
             | I didn't say it is free, I said that the bandwidth for bots
             | is negligible compared to that of regular users.
        
               | nojito wrote:
               | Negligible isn't free either.
        
         | svnpenn wrote:
         | Implement TLS fingerprinting on your server. People can still
         | fake that if they are determined, but it should cut the abuse
         | way down.
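          | 
          | One sketch of how that can look at the app layer, assuming
          | a fronting proxy computes a JA3 hash and injects it as a
          | header (the header name and hash below are made up):
          | 
          |     BAD_JA3 = {
          |         "0000aaaa1111bbbb2222cccc3333dddd",  # placeholder
          |     }
          | 
          |     def ja3_filter(app):
          |         # WSGI middleware: refuse known-bad fingerprints
          |         # before the request reaches the application.
          |         def middleware(environ, start_response):
          |             ja3 = environ.get("HTTP_X_JA3_FINGERPRINT", "")
          |             if ja3 in BAD_JA3:
          |                 start_response(
          |                     "403 Forbidden",
          |                     [("Content-Type", "text/plain")])
          |                 return [b"forbidden"]
          |             return app(environ, start_response)
          |         return middleware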
        
           | userbinator wrote:
           | TLS fingerprinting is one of the ways minority browsers and
           | OS setups get unfairly excluded. I have an intense hatred of
           | Cloudflare for popularising that. Yes, there are ways around
           | it, but I still don't think I should have to fight to use the
           | user-agent I want.
        
             | oh_sigh wrote:
              | I don't want to say tough cookies, but if OP's
              | characterization isn't hyperbole ("the lion's share of our
              | operational costs are from needing to scale to handle the
              | huge amount of bot traffic"), then it can be a situation
             | where you have to choose between 1) cut off a huge chunk of
             | bots, but upset a tiny percent of users, and improve the
             | service for everyone else, or 2) simply not provide the
             | service at all due to costs.
        
             | nyuszika7h wrote:
             | I don't think it's likely to cause issues if implemented
             | properly. Realistically you can't really build a list of
             | "good" TLS fingerprints because there are a lot of
             | different browser/device combinations, so in my experience
             | most sites usually just block "bad" ones known to belong to
             | popular request libraries and such.
        
           | CWuestefeld wrote:
           | No, nor can we just do it by IP. The bots are MUCH more
           | sophisticated than that. More often than not, it's a
           | cooperating distributed net of hundreds of bots, coming from
           | multiple AWS, Azure, and GCP addresses. So they can pop up
           | anywhere, and that IP could wind up being a real customer
           | next week. And they're only recognizable as a botnet with
           | sophisticated logic looking at the gestalt of web logs.
           | 
           | We do use a 3rd party service to help with this - but that on
           | its own is imposing a 5- to 6-digit annual expense on our
           | business.
        
             | z3c0 wrote:
             | Have you considered setting up an API to allow the bots to
             | get what they want without hammering your front-end
             | servers?
        
               | CWuestefeld wrote:
               | Yes. And if I could get the perpetrators to raise their
               | hands so I could work out an API for them, it would be
               | the path of least resistance. But they take great pains
               | to be anonymous, although I know from circumstantial
               | evidence that at least a good chunk of it is various
               | competitors (or services acting on behalf of competitors)
               | scraping price data.
               | 
               | IANAL, but I also wonder if, given that I'd be designing
               | something specifically for competitors to query our
               | prices in order to adjust their own prices, this would
               | constitute some form of illegal collusion.
        
               | marginalia_nu wrote:
               | What seems to actually work is to identify the bots and
               | instead of giving up your hand by blocking them, to
               | quietly poison the data. Critically, it needs to be
               | subtle enough that it's not immediately obvious the data
               | is manipulated. It should look like a plausible response,
               | only with some random changes.
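                | 
                | A sketch of what "subtle" could mean for, say,
                | price data (bot detection is assumed to have
                | happened already):
                | 
                |     import hashlib
                | 
                |     def poisoned_price(real_price, product_id):
                |         # Deterministic jitter in [-5%, +5%],
                |         # keyed on the product, so the bot sees
                |         # stable but quietly wrong numbers.
                |         d = hashlib.sha256(
                |             product_id.encode()).digest()
                |         jitter = (d[0] / 255.0 - 0.5) * 0.10
                |         return round(real_price * (1 + jitter), 2)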
        
               | kayodelycaon wrote:
               | What makes you think they would use it?
        
               | z3c0 wrote:
                | It's in their interest. I've scraped a lot, and it's
                | not easy to build a reliable process on top of
                | scraping. Why parse a human interface when there's an
                | application interface available?
        
             | thaumaturgy wrote:
             | There's a lot of metadata available for IPs, and that
             | metadata can be used to aggregate clusters of IPs, and that
             | in turn can be datamined for trending activity, which can
             | be used to sift out abusive activity from normal browsing.
             | 
             | If you're dropping 6 figs annually on this and it's still
             | frustrating, I'd be interested in talking with you. I built
             | an abuse prediction system out of this approach for a small
             | company a few years back, it worked well and it'd be cool
             | to revisit the problem.
        
         | borski wrote:
         | You could ban their IPs?
        
           | KMnO4 wrote:
           | IP bans are equivalent to residential door locks. They're
           | only deterring the most trivial attacks.
           | 
           | In school I needed to scrape a few hundred thousand pages of
           | a proteomics database website. For some reason you had to
           | view each entry one at a time. There was IP throttling which
           | banned you if you made requests too quickly. But slowing the
           | script to 1 request per second would have taken days to
           | scrape the site. So I paid <$5 for a list of 500 proxy
           | servers and distributed it, completing the task in under half
           | an hour.
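            | 
            | Mechanically that can be as simple as round-robin over
            | the proxy list (a sketch; the file name and URL are
            | placeholders):
            | 
            |     import itertools
            |     import requests
            | 
            |     with open("proxies.txt") as f:
            |         proxies = [p.strip() for p in f if p.strip()]
            |     pool = itertools.cycle(proxies)
            | 
            |     def fetch(entry_id):
            |         # Each request leaves via the next proxy, so
            |         # no single IP trips the throttle.
            |         proxy = next(pool)
            |         resp = requests.get(
            |             "https://example.org/entry/%d" % entry_id,
            |             proxies={"http": proxy, "https": proxy},
            |             timeout=30)
            |         resp.raise_for_status()
            |         return resp.text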
        
             | borski wrote:
             | I agree it's not perfect. It's also significantly better
             | than nothing.
        
             | l33t2328 wrote:
              | Using proxies to hide your identity to get around a
              | denial of access seems to get awfully close to violating
              | the Computer Fraud and Abuse Act (in the USA, at least).
              | 
              | I'm surprised your school was okay with it.
        
         | throw10920 wrote:
         | Have you considered serving a proof-of-work challenge to
         | clients accessing your website? Minimal cost on legit users,
         | but large costs on large-scale web-scraping operations, and it
         | doesn't matter if they split up their efforts across a bunch of
         | IP addresses - they're still going to have to do those
         | computations.
         | 
         | https://en.wikipedia.org/wiki/Hashcash
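          | 
          | The server-side check is tiny; a hashcash-style sketch
          | (the difficulty constant is illustrative):
          | 
          |     import hashlib
          |     import os
          | 
          |     DIFFICULTY = 20  # leading zero bits required
          | 
          |     def issue_challenge():
          |         return os.urandom(16).hex()
          | 
          |     def verify(challenge, nonce):
          |         h = hashlib.sha256(
          |             (challenge + nonce).encode()).digest()
          |         v = int.from_bytes(h, "big")
          |         return v >> (256 - DIFFICULTY) == 0
          | 
          |     def solve(challenge):  # the work the client burns
          |         n = 0
          |         while not verify(challenge, str(n)):
          |             n += 1
          |         return str(n)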
        
           | nyuszika7h wrote:
           | No thanks, as a user I would stay far away from such
           | websites. This is akin to crypto miners. I don't need them to
           | drive up my electricity costs and also contribute to global
           | warming in the process. It's not worth the cost.
        
         | userbinator wrote:
         | That's what rate-limiting is for. Don't be so aggressive with
         | it that you start hitting the faster visitors, however, or they
         | may soon go somewhere else (has happened to me a few times).
        
           | loceng wrote:
           | Do you know if there's a way to rate limit logged-in users
           | differently than visitors of a site?
        
             | rolph wrote:
              | Rate limiting can be a double-edged sword; you can be
              | better off giving a scraper the highest bandwidth so
              | it's gone sooner. Otherwise, something like making a
              | zip or other sort of compilation of the site available
              | may be an option.
              | 
              | Just what kind of scraper you have is a concern. Does
              | the scraper just want a bunch of stock images; or does
              | the scraper have FOMO on web trinkets; or does the
              | scraper want to mirror/impersonate your site?
              | 
              | The last option is the most concerning, because then
              | either the scraper is mirroring because your site is
              | cool and a local UI/UX is wanted, or the scraper is
              | phishing, smishing, or otherwise duping your users.
        
               | loceng wrote:
                | Yeah, good points to consider. I think the sites
                | that would be scraped the most would be those where
                | the data is regularly and reliably up-to-date, and a
                | large volume of it at that - so not just one scraper
                | but many different parties may try to scrape every
                | page on a daily or weekly basis.
                | 
                | I feel the ruling should have the caveat that if a
                | fairly priced paid API for the publicly listed data
                | exists, then the scrapers must legally use it (say,
                | priced no more than 5% above the CPU/bandwidth/etc.
                | cost of the scraping behaviour it replaces). Ideally
                | there would also be a rule that, at minimum, there
                | be a delay before scrapers can republish that data
                | without your permission, so that you, as the
                | platform/source/reason for the data being
                | up-to-date, aren't harmed too - which could
                | otherwise kill the source platform over time, if
                | regular visitors start going to the competitor
                | publishing the data.
        
             | patmorgan23 wrote:
              | Absolutely - you just have to check the session cookie.
        
             | minusf wrote:
             | nginx can be set up to do that using the session cookie.
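              | 
              | Something along these lines (a sketch, assuming the
              | session cookie is named "sessionid"; the rates are
              | illustrative):
              | 
              |     # Anonymous traffic keys on IP; an empty key
              |     # means the zone is skipped, so sessions are
              |     # not counted here.
              |     map $cookie_sessionid $anon_key {
              |         ""      $binary_remote_addr;
              |         default "";
              |     }
              |     limit_req_zone $anon_key
              |         zone=anon:10m rate=1r/s;
              |     limit_req_zone $cookie_sessionid
              |         zone=sessions:10m rate=10r/s;
              | 
              |     server {
              |         location / {
              |             limit_req zone=anon burst=5;
              |             limit_req zone=sessions burst=50 nodelay;
              |         }
              |     }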
        
           | CWuestefeld wrote:
           | Rate limiting isn't an effective defense for us.
           | 
            | First, as a B2B site, many of our users from a given
            | customer (and with huge customers, that can be many) are
            | coming through the same proxy server, effectively
            | presenting to us as the same IP.
            | 
            | Second, the bots years back became much more
            | sophisticated than a single, or even relatively finite,
            | set of IPs. Today they work across AWS, Azure, GCP, and
            | other cloud services, so the IPs they're assigned today
            | will be different tomorrow. Worse, the IPs they're
            | assigned today may well be used by a real customer
            | tomorrow.
        
             | gregsadetsky wrote:
             | Have you tried including the recaptcha v3 library and
             | looking at the distribution of scores? --
             | https://developers.google.com/recaptcha/docs/v3 --
             | "reCAPTCHA v3 returns a score for each request without user
             | friction"
             | 
             | It obviously depends on how motivated the scrapers are
             | (i.e. whether their headless browsers are actually
             | headless, and/or doing everything they can to not appear
             | headless, whether Google has caught on to their latest
             | tricks etc. etc.) but it would at least be interesting to
             | look at the score distribution and then see whether you can
             | cut off or slow down the < 0.3 scoring requests (or
             | redirect them to your API docs)
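              | 
              | The server-side half is a single POST to Google's
              | siteverify endpoint (a sketch; SECRET is your v3
              | secret key):
              | 
              |     import requests
              | 
              |     SECRET = "..."  # reCAPTCHA v3 secret key
              | 
              |     def recaptcha_score(token, remote_ip):
              |         r = requests.post(
              |             "https://www.google.com/recaptcha"
              |             "/api/siteverify",
              |             data={"secret": SECRET,
              |                   "response": token,
              |                   "remoteip": remote_ip},
              |             timeout=10)
              |         data = r.json()
              |         # 1.0 ~ likely human, 0.0 ~ likely bot
              |         if not data.get("success"):
              |             return 0.0
              |         return data.get("score", 0.0)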
        
               | 9dev wrote:
               | It sounds great, until you have Chinese customers. That's
               | when you'll figure out Recaptcha just doesn't really work
               | in China, and have to begrudgingly ditch it altogether...
        
             | kevincox wrote:
              | If your users are logged in, you can rate limit by
              | user instead of by IP, which mostly solves this
              | problem. Generally what I do is rate limit logged-in
              | users by user, then rate limit not-logged-in users
              | aggressively by IP. If they hit the limit, the message
              | lets them know that they can get around it by logging
              | in. Of course, this depends on user accounts having
              | some sort of cost to create. I've never actually
              | implemented it, but I've considered having only users
              | who have made at least one purchase bypass the IP
              | limit, or otherwise get a bigger rate limit.
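              | 
              | The tiering is just two buckets with different
              | budgets; a sliding-window sketch (limits are
              | illustrative):
              | 
              |     import time
              |     from collections import defaultdict
              | 
              |     LIMITS = {"user": (100, 60.0),  # reqs, seconds
              |               "ip": (10, 60.0)}
              |     hits = defaultdict(list)
              | 
              |     def allowed(key, kind):
              |         max_reqs, window = LIMITS[kind]
              |         now = time.time()
              |         hits[key] = [t for t in hits[key]
              |                      if now - t < window]
              |         if len(hits[key]) >= max_reqs:
              |             return False
              |         hits[key].append(now)
              |         return True
              | 
              |     def check_request(user_id, ip):
              |         # Logged in: generous per-user budget.
              |         if user_id is not None:
              |             return allowed("user:%s" % user_id,
              |                            "user")
              |         # Anonymous: strict per-IP budget.
              |         return allowed("ip:%s" % ip, "ip")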
        
         | forgotmypw17 wrote:
         | Yes, I think working to accommodate the non-humans along with
         | the humans is the right approach here.
         | 
         | Scrapers have a limited range of IPs, so rate-limiting them and
         | stalling (or dropping) request responses is one way to deal
         | with the DoS scenario.
         | 
         | For my sites, I have placed the majority behind HTTP Basic
         | Auth...
        
           | KptMarchewa wrote:
           | You realistically can't. There are services like [0][1] that
           | mean any IP could be a scraper node.
           | 
           | [0] https://brightdata.com/proxy-types/residential-proxies
           | [1] https://oxylabs.io/products/residential-proxy-pool
        
             | orlp wrote:
             | > How does Bright Data acquire its residential IPs?
             | 
             | > Bright Data has built a unique consumer IP model by which
             | all involved parties are fairly compensated for their
             | voluntary participation. App owners install a unique
             | Software Development Kit (SDK) to their applications and
             | receive monthly remuneration based on the number of users
             | who opt-in. App users can voluntarily opt-in and are
             | compensated through an ad-free user experience or enjoy an
             | upgraded version of the app they are using for free. These
             | consumers or 'peers' serve as the basis of our network and
             | can opt-out at any time. This model has brought into
             | existence an unrivaled, first of its kind, ethically sound,
             | and compliant network of real consumers.
             | 
             | I don't know how they can say with a straight face that
             | this is 'ethically sound'. They have, essentially, created
             | a botnet, but apparently because it's "AdTech" and the user
             | "opts-in" (read: they click on random buttons until they
             | hit one that makes the banner/ad go away) it's suddenly not
             | malware.
        
               | TedDoesntTalk wrote:
               | NordVPN (Tesonet) has another business doing the same
               | thing. They sell the IP addresses/bandwidth of their
               | NordVPN customers to anyone who needs bulk mobile or
               | residential IP addresses. That's right, installing their
               | VPN software adds your IP address to a pool that NordVPN
               | then resells. Xfinity/Comcast sort of pioneered this with
               | their wifi routers that automatically expose an isolated
               | wifi network called 'xfinity' (IIRC) whether you agree or
               | not.
        
               | rascul wrote:
               | > They sell the IP addresses/bandwidth of their NordVPN
               | customers to anyone who needs bulk mobile or residential
               | IP addresses
               | 
               | I would be interested in a reference for this if you have
               | one.
        
               | duskwuff wrote:
               | The Comcast access points do, at least, have the saving
               | grace that they're on a separate network segment from the
               | customer's hardware, and don't share an IP address or
               | bandwidth/traffic limit with the customer.
               | 
                | Tesonet and other similar services (e.g. Luminati)
                | don't have that. As far as anyone -- including web
                | services, the ISP, or law enforcement -- is
                | concerned, their traffic is the subscriber's traffic.
        
       | lazyjeff wrote:
        | Now I wonder whether "retrieving" your own OAuth token from an
        | app to make REST calls that extract your own data from cloud
        | services is legal. It seems to fall under the same guideline -
        | that exceeding authorization is not unauthorized access - so
        | even though it's usually against the terms of service, it
        | doesn't violate the CFAA?
        
       | MWil wrote:
       | Next up, we just need all public court records to be freely
       | available to BE scraped and not $3 per page
       | 
       | https://patentlyo.com/patent/2020/08/court-pacer-should.html
        
         | KennyBlanken wrote:
          | Really the problem is that PACER has been turned into a cash
          | cow for the federal court system, with fees and profits
          | growing despite costs being virtually nil.
         | 
         | But yeah, the irony of the federal court system legalizing
         | screen scraping, something PACER contractually prohibits.
        
       | jakelazaroff wrote:
        | If I have to stake out a binary position here, I'm
        | pro-scraping. But I really wish we could find a way to be more
        | nuanced. The scraper in question is looking at public LinkedIn
        | profiles so that it can snitch to employers about which
        | employees might be looking for new jobs. That's not at all the
        | same as archival; it's using my data to harm me.
        
         | xboxnolifes wrote:
          | It's a public page. Your employer could just as well check
          | your page themselves. It may be a tragedy of efficiency, but
          | it's not like the scraper is grabbing hidden data. The issue
          | lies in something else. Maybe it's the fact that your
          | current employer would punish you for looking for a new job.
          | Or maybe LinkedIn's public "looking for job" status is not
          | sustainable in its current form.
        
           | notch656a wrote:
            | Weev was charged, and eventually convicted, based merely
            | on scraping from AT&T [0]. When the charge was vacated,
            | it was only on venue/jurisdiction grounds, not on the
            | basis of the scraping being legal. It seems there's
            | precedent that merely scraping this information is
            | felonious behavior.
           | 
           | https://www.eff.org/deeplinks/2013/07/weevs-case-flawed-
           | begi...
        
           | jakelazaroff wrote:
           | Yes, and if I have a public Twitter account it's perfectly
           | possible for someone to flood me with spam messages. That
           | doesn't mean we should do nothing to prevent it. As I said
           | elsewhere, we should strive to make it possible for people to
           | exist in public digital spaces without worrying about bad
           | actors.
        
             | xboxnolifes wrote:
             | Someone can manually spam you, and I don't think that
             | should be allowed. That is a separate topic and discussion.
             | Unless you are arguing that your employer should not be
             | allowed to check your LinkedIn status.
        
               | jakelazaroff wrote:
               | I'm just using it as an example of a case in which a
               | public profile doesn't automatically mean anything goes.
               | I had hoped to generate discussion about how to throw out
               | some of the bathwater without throwing out the baby too,
               | but I guess no one is really interested.
        
         | brians wrote:
         | Yes, but it's specifically using data you published to harm
         | you. Compare Blind, which is engineered to not be attributable
         | in this way.
        
           | jakelazaroff wrote:
           | I understand that I published it. That doesn't mean I should
           | accept that hostile parties will use it against me.
           | 
           | This is kinda like telling someone who is being harassed on
           | social media that they're consenting to it by having a public
           | account. We should strive to make our digital personae safe
           | from bad actors, not throw our hands up and say "if you put
           | yourself out there, you have no recourse".
        
       | TedDoesntTalk wrote:
       | This is great news. A win for the Internet Archive and other
       | archivists.
        
         | ghaff wrote:
          | IANAL, but it's not immediately obvious to me that this
          | ruling covers bulk scraping and _republishing_
          | untransformed. I'm genuinely curious about this personally.
          | I presumably can't just grab anything I feel like off the
          | web, curate it, and sell it.
        
       | 1vuio0pswjnm7 wrote:
       | "On LinkedIn, our members trust us with their information, which
       | is why we prohibit unauthorized scraping on our platform."
       | 
        | This is an unpersuasive argument because it ignores all the
        | computer users who are not "members". Whether or not "members"
        | trust LinkedIn should have no bearing on whether other
        | computer users, who may or may not be "members", can retrieve
        | others' public information.
        | 
        | Even more, this statement does not decry so-called scraping,
        | only "unauthorised" scraping. Who provides "authorisation"?
        | Surely not the LinkedIn members.
        | 
        | It is presumptuous, if not ridiculous, for "tech" companies to
        | claim computer users "trust" them. Most of these companies
        | receive no feedback from the majority of their "members". Tech
        | companies generally have no "customer service" for the members
        | they target with data collection.
        | 
        | Further, there is an absence of meaningful choice. It is like
        | saying people "trust" credit bureaus with their information.
        | History shows these data collection intermediaries could not
        | be trusted, and that is why Americans have the Fair Credit
        | Reporting Act.
        
         | MWil wrote:
         | Great point about non-members/members
        
       | ketzu wrote:
        | I am not versed in law, especially not US law, but this case
        | seems to be very specific that scraping is no violation of the
        | CFAA. I do support this interpretation.
        | 
        | However, the case of scraping I personally find more
        | problematic is when personal data I provide to one party is
        | then used by scrapers without my knowledge or permission. I
        | truly wonder which way we are better off on that issue as a
        | society. Independent of the current law, should anything that
        | is accessible be essentially free-for-all, or should there be
        | limitations on what you are allowed to do? Cases highlighted
        | in the article: facial recognition by third parties on social
        | media profiles, Facebook scraping for personal data, search
        | engines, journalists, and archives. (Not all need to have the
        | same answer to the question "do we want this?") Besides that,
        | a point I care slightly less about is the idea that allowing
        | scraping with very lenient limits leads to even more closed-up
        | systems.
        
         | snarf21 wrote:
          | Serious question (IANAL): if I write down some information,
          | at what point does that information have copyright
          | protection? Do I have to claim it with a (c)?
        
           | henryfjordan wrote:
           | Never. "Mere listings" of data (like the phone book) are not
           | copyrightable.
           | 
            | But also, anything you write which is copyrightable is
            | copyrighted immediately. You can register the work with
            | the copyright office for some extra perks, but it's not
            | strictly necessary.
        
           | butlerm wrote:
            | You might want to check out Feist v. Rural Telephone
            | Service Co. (1991), and also look up the Berne Convention
            | (on copyright), which the U.S. joined in 1989.
           | 
           | If by "information" you mean mere facts without creativity in
           | selection or arrangement, those are generally not protectable
           | by copyright in the United States, although possibly in some
           | other countries. Copyright generally protects works of
           | authorship, and nothing else. No creativity no copyright.
        
         | henryfjordan wrote:
         | "I gave my data to Linkedin and now scrapers are reading it
         | from the public web". Be mad at Linkedin before you are mad at
         | the scraper.
        
           | ViViDboarder wrote:
           | I may edit and delete my information from LinkedIn, but I
           | have no idea who has persisted that data beyond there.
           | 
            | There is such a thing as scraping responsibly and
            | irresponsibly. Both kinds happen.
        
           | ketzu wrote:
           | This seems to have various angles to it.
           | 
            | First, one question is whether the intent of the original
            | owner of the data is important. When I put data on
            | LinkedIn (or Facebook, my private website, Hacker News,
            | or my employer's website), I might have an opinion on who
            | gets to do what with my data (see also GDPR discussions).
            | Should I blame LinkedIn (or Meta/myself/my employer) for
            | doing what I expected them to do, or should I blame those
            | who do what I don't want them to do? Should I just be
            | blamed directly, because I even want to make a
            | distinction between those? If I didn't want my data used,
            | I could just not provide it (or not participate in/surf
            | the web at all, if we extend the idea to more general
            | data collection).
           | 
           | Secondly, it touches on the idea that linkedin should not
           | make the data publicly available (i.e., without
           | authentication), and we end up with a less open system. Is
           | that better? Is it what we want? Maybe there are also other
           | ways that I am not aware just now. (Competing purely on value
           | added is probably futile for data aggregators.)
        
             | henryfjordan wrote:
             | Your intent as the original owner of the data is important!
             | You have to explicitly give Linkedin the right to display
             | your data. It's in their Terms of Service. If Linkedin does
             | something with your data that is outside the ToS, then that
             | is on them, but if they do something within the ToS that
             | you don't like then maybe you should not have provided them
             | with your data.
             | 
             | As for whether the data should be public, that's a decision
             | we each have to make.
        
         | ct0 wrote:
          | Consider that scrapers may be far less interested in you as
          | an individual than in your contribution to the aggregated
          | data points of you and those like you.
        
           | altdataseller wrote:
           | The scraper example in this case HiQ had a product that
           | tracked employee profile changes to predict employee churn.
           | 
            | So they were specifically interested in you personally,
            | not the aggregate.
        
           | xboxnolifes wrote:
           | That's the same argument for all major internet tracking
           | cookies. I don't think that's going to convince this site's
           | userbase.
        
       | nomoreusernames wrote:
        | So I don't have the right to not be scraped? That's like
        | sending radio waves and making me pay for a license for a
        | radio I don't use. Same with spam: I should have to give you a
        | token before you can send me emails, if I maybe want to look
        | at your stuff.
        
       | notch656a wrote:
        | An interesting departure, considering weev was convicted of
        | merely scraping [0] AT&T. Although his charge was vacated, it
        | was on venue/jurisdiction grounds, not because scraping was
        | found to be legal.
       | 
       | [0] https://www.eff.org/deeplinks/2013/07/weevs-case-flawed-
       | begi...
        
       | whatever1 wrote:
        | Is it legal, though, to log in to a website and then scrape
        | data (that would not be accessible if I were just browsing as
        | a guest)?
        
       ___________________________________________________________________
       (page generated 2022-04-18 23:00 UTC)