[HN Gopher] Web scraping is legal, US appeals court reaffirms ___________________________________________________________________ Web scraping is legal, US appeals court reaffirms Author : spenvo Score : 432 points Date : 2022-04-18 19:37 UTC (3 hours ago) (HTM) web link (techcrunch.com) (TXT) w3m dump (techcrunch.com) | bobajeff wrote: | I had no idea this was even being discussed. I'm glad they are | reasonable on this. Wish they had been as reasonable on breaking | encryption/DRM schemes. | amelius wrote: | Where is the lobbying? Something seems wrong ... | 8bitsrule wrote: | Wondered how this works WRT copyright (since the article did not | contain the word). Here's Kent State's (short) IPR advice | [https://libguides.library.kent.edu/data-management/copyright] It | says "Data are considered 'facts' under U.S. law. They are not | copyrightable.... Creative arrangement, annotation, or selection | of data can be protected by copyright." | ricardo81 wrote: | For what it's worth, LinkedIn was incredibly easy to scrape back | in the day, wrt profile/email correlation. Can't buy any | aggressive stance they may have against 'scrapers'. | | Two options. | | Their LinkedIn IDs are base 12, and would redirect you if you | simply wanted to enumerate them. | | You could also upload your 'contacts', 200-300 at a time, and it'd | leak profile IDs (Twitter and Facebook mitigated this ~5 years | ago). I still have a @pornhub or some such "contact" that I can't | delete from testing this. | infiniteL0Op wrote: | altdataseller wrote: | Does the ruling make it illegal to block scrapers? | ricardo81 wrote: | Interesting, as I've seen a few search engine start-ups that seem | to scrape other search engines' results, depending on your | definition of scraping. My definition would be a user agent that | doesn't uniquely identify itself and that isn't using an | authorised API. | cloudyporpoise wrote: | These anti-scraping corporate activists need to get with the | times and allow access to their data, legitimately, through an | API. Third parties will scrape and sell the data regardless, so | why not just cut them out and even charge individuals to | legitimately use the API. API keys could be tied to an individual, | and at least LinkedIn would know who was conducting what action. | | Make it easier to get the data through an API than having to | scrape it. | CWuestefeld wrote: | While I have sympathy for what the scrapers are trying to do in | many cases, it bothers me that this doesn't seem to address what | happens when badly-behaved scrapers cause, in effect, a DoS on | the site. | | For the family of sites I'm responsible for, bot traffic | comprises a majority of traffic - that is, to a first | approximation, the lion's share of our operational costs comes | from needing to scale to handle the huge amount of bot traffic. | Even when it's not as big as a DoS, it doesn't seem right to me | that I can't tell people they're not welcome to cause this | additional system load. | | Or even if there were some standardized way that we could provide | a dumb API, just giving them raw data, so we don't need to incur | the additional expense of the creature comforts on the page, | designed to make our users happier, that the bots won't notice. | hardtke wrote: | The problem with many sites (and LinkedIn in particular) is | that they whitelist a bunch of specific websites, presumably | based on their business interests, but disallow everyone else in | their robots.txt. You should either allow all scrapers that | respect certain load requirements or allow none. Anything that | Google is allowed to see and include in their search results | should be fair game. | | Here's the end of LinkedIn's robots.txt: | | User-agent: * | Disallow: / | | # Notice: If you would like to crawl LinkedIn, | # please email whitelist-crawl@linkedin.com to apply | # for white listing.
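The whitelist policy hardtke describes is easy to observe directly against the robots.txt quoted above. A minimal sketch using only Python's standard library; the user-agent names are illustrative, and which crawlers LinkedIn actually whitelists can change:

      import urllib.robotparser

      # Fetch and parse the robots.txt excerpted above.
      rp = urllib.robotparser.RobotFileParser()
      rp.set_url("https://www.linkedin.com/robots.txt")
      rp.read()

      # Whitelisted crawlers get their own per-agent rules; everyone else
      # falls through to the catch-all "User-agent: *" / "Disallow: /".
      for agent in ("Googlebot", "my-hobby-crawler"):
          print(agent, rp.can_fetch(agent, "https://www.linkedin.com/in/example"))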
| diamondage wrote: | And this is what the HiQ case hinged on. LinkedIn were | essentially selectively applying the Computer Fraud and Abuse | Act based on their business interests - that was never going | to sit well with judges. | car_analogy wrote: | > Even when it's not as big as a DoS, it doesn't seem right to | me that I can't tell people they're not welcome to cause this | additional system load. | | You _can_ tell them. You just can't prosecute them if they | don't obey. | davidhyde wrote: | I don't know what kind of data you serve up, but perhaps you | could serve low-quality or inaccurate content from addresses | that are guessed from your API. I.e., endpoints not reachable | in the normal functioning of your web app should | return reasonable junk. A mixture of accurate and inaccurate | data becomes worthless for bots, and worthless data is not worth | scraping. Just an idea! | ryan_j_naughton wrote: | As others have said, (A) there are plenty of countermeasures you | can take, but also (B) you are frustrated that you are | providing something free to the public and then annoyed that the | "wrong" customers are using your product and costing you money. | I'm sorry, but this is a failure of your business model. | | If we were to analogize this to a non-internet example: (1) A | company throws a free concert/event and believes they will make | money by alcohol sales. (2) A bunch of sober/non-drinking folks | attend the concert but only drink water. (3) Company blames the | concert attendees for "taking advantage" of them when they | really just had poor company policies and a bad business model. | | Put things behind authentication and authorization. Add a | paywall. Implement DDoS detection and banning approaches | for scrapers. Etc., etc. | | But don't make something public and then get mad at THE PUBLIC | for using it. Behind that machine is a person, who happens to | be a member of the public. | noisenotsignal wrote: | There are certain classes of websites where the proposed | solutions aren't a great fit. For example, a shopping site | hiding their catalog behind paywalls or authentication would | raise barriers to entry such that a lot of genuine customers | would be lost. I don't think the business model is in general | to be blamed here, and it's ok to acknowledge the unfortunate | overhead and costs added by site usage patterns (e.g. | scraping) that are counter to the expectation. | Nextgrid wrote: | But don't you already have countermeasures to deter DoS attacks | or malicious _human_ users (what if someone pays or convinces | people to open your site and press F5 repeatedly)? | | If not, you should, and the badly-behaved scrapers are actually | a good wake-up call.
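For concreteness, the kind of countermeasure being discussed is often just a token bucket per client. A minimal in-memory sketch (Python; the names are illustrative, and production setups usually enforce this at the proxy or CDN layer instead):

      import time
      from collections import defaultdict

      RATE = 5.0    # tokens refilled per second
      BURST = 20.0  # bucket capacity (allowed burst size)

      buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

      def allow(client_key: str) -> bool:
          """client_key is an IP or, better, a logged-in user id."""
          b = buckets[client_key]
          now = time.monotonic()
          b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
          b["ts"] = now
          if b["tokens"] >= 1.0:
              b["tokens"] -= 1.0
              return True   # serve the request
          return False      # reject, delay, or tarpit the request

As several replies further down point out, keying on IP alone breaks down once the scraper is distributed across many addresses.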
| colinmhayes wrote: | I'm sympathetic to this. I built a search engine for my senior | project, and my half-baked scraper ended up taking down Duke | Law's site during their registration period. Ended up getting a | not-so-kindly-worded email from them, but honestly this wasn't | an especially hard problem to solve. All of my traffic was | coming from the cluster that was on my university's subnet; it | wouldn't have been that hard for them to apply IP address | timeouts when my crawler started scraping thousands of pages a | second on their site. Not to victim blame, this was totally my | fault, but I was a bit surprised that they hadn't experienced | this before with how much automated scraping goes on. | brightball wrote: | I'm honestly more interested in bot detection than anything | else at this point. | | It seems like it should be perfectly legal to detect and then | hold the connection open for a long period of time without | giving a useful response. Or even send highly compressed gzip | responses designed to fill their drives. | | Legal or not, I can't see any good reason that we can't make | it painful. | fjabre wrote: | Make it painful if they abuse the site. | | We all benefit from open data. Polite scrapers are just | fine and a natural part of the web ecosystem. | | Google has been scraping the web all day every day for | decades now. | rmbyrro wrote: | I have sympathy for your operational issues and costs, but | isn't this kind of complaint the same as a shopping mall/center | complaining about people who go in, check some info, and go out | without buying? | | I understand that bots have leverage and automation, but so do | you, to reach a larger audience. Should we continue to | benefit from one side of the leverage, while complaining about | the other side? | wlesieutre wrote: | It's more like a mall complaining that while they're trying | to serve 1000 customers, someone has gone and dumped 10000000 | roombas throughout the stores which are going around scanning | all the price tags. | kortilla wrote: | No, because those are people going to the mall. Not robots at | 100x the quantity of real people. | CWuestefeld wrote: | No. When I say that bots exceed the amount of real traffic, | I'm including people "window shopping" on the good side. | | My complaint is more like, somebody wants to know the prices | of all our products, and we have roughly X products | (where X is a very large number). They get X friends to all | go into the store almost simultaneously, each writing down | the price of the particular product they've been assigned to | research. When they do this, there's scant space left in the | store for even the browsing kind of customer to walk in. (Of | course I exaggerate a bit, but that's the idea.) | mrobins wrote: | I'm sympathetic to the complaints about "rude" scraping | behavior, but there's an easy solution. Rather than make | people consume boatloads of resources they don't want | (individual page views, images, scripts, etc.), just build | good interoperability tools that give the people what they | want. In the physical example above, that would be a product | catalog that's easily replicated with a CSV product listing | or an API. | jensensbutton wrote: | You don't know why any random scraper is scraping you, and | thus you don't know what API to build that will keep them | from scraping. Also, it's likely easier for them to | continue scraping than to write a bunch of code to | integrate with your API, so there's no incentive for | them to do so either. | mcronce wrote: | Writing a scraper for a webpage is typically far more | development effort than writing an API wrapper | withinboredom wrote: | Just advertise the API in the headers. Or better yet, set | the buttons/links only to be accessible via a .usetheapi- | dammit selector. Lastly, provide an API and a | "developers.whatever.com" domain to report issues with | the API, get API keys, and pay for more requests. It | should be pretty easy to set up, especially if there's an | internal API available behind the frontend already. I'd | venture a dev team could devote 20% for a few sprints and | have an MVP thing up and running.
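One way to read "advertise the API in the headers": attach an RFC 8288 Link header to every HTML response, pointing at the machine-readable interface. A sketch using Python's standard library; the developer-portal URL is hypothetical, and "service-desc" is the link relation registered for API descriptions (RFC 8631):

      from http.server import BaseHTTPRequestHandler, HTTPServer

      class Handler(BaseHTTPRequestHandler):
          def do_GET(self):
              body = b"<html>...rendered page...</html>"
              self.send_response(200)
              # Tell scrapers where the machine-readable interface lives.
              self.send_header(
                  "Link", '<https://developers.example.com/api>; rel="service-desc"')
              self.send_header("Content-Type", "text/html")
              self.send_header("Content-Length", str(len(body)))
              self.end_headers()
              self.wfile.write(body)

      HTTPServer(("", 8080), Handler).serve_forever()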
| mrobins wrote: | I think lots of website owners know exactly where the | value in their content exists. Whether or not they want | to share that in a convenient way, especially to | competitors etc., is another story. | | That said, if scraping is inevitable, it's immensely | wasteful to both the scraper and the content owner, and | often avoidable. | altdataseller wrote: | For the 2nd part, I have done scraping and would always | opt for an API, if the price is reasonable, over paying | nosebleed amounts for residential proxies. | CWuestefeld wrote: | Yes, exactly. Nobody is standing up and saying "we're the | ones doing this, and here's what we wish you'd put in an | API". | | Also, I'm a big Jenson Button fan. | [deleted] | bastardoperator wrote: | Have you considered using a cache service like Cloudflare? | KennyBlanken wrote: | > While I have sympathy for what the scrapers are trying to do | in many cases, it bothers me that this doesn't seem to address | what happens when badly-behaved scrapers cause, in effect, a | DoS on the site. | | Like when Aaron Swartz spent months hammering JSTOR, causing it | to become so slow it was almost unusable, and despite knowing | that he was causing widespread problems (including the eventual | banning of MIT's entire IP range) actually worked to add | additional laptops and improve his scraping speed...all the | while going out of his way to subvert MIT's netops group trying | to figure out where he was on the network. | | JSTOR, by the way, is a non-profit that provides aggregate | access to their cataloged archive of journals, for schools and | libraries to access journals they would otherwise never be able | to afford. In many cases, free access. | linuxdude314 wrote: | If most of your traffic is bots, is the site even worth | running? | | This really is akin to the question, "Should others be allowed | to take my photo or try to talk to me in public?" | | Of course the answer should be yes; the internet is the digital | equivalent of a public space. If you make it accessible, anyone | should be able to consume. | | If you don't want it scraped, add auth! | [deleted] | voxic11 wrote: | The court just ruled that scraping on its own isn't a violation | of the CFAA. Meaning it doesn't count as the crime of | "accessing a protected computer without authorization or | exceeding authorized access and obtaining information". | | However, presumably all the other provisions of the CFAA still | apply, so if your scraping damages the functioning of an | internet service, then you still would have committed the crime | of "damaging a protected computer by intentional access". | Negligently damaging a protected computer is punishable by 1 | year in prison on the first offense. Recklessly damaging a | protected computer is punishable by 1-5 years on the first | offense. And intentionally damaging a protected computer is | punishable by 1-10 years for the first offense. These penalties | can go up to 20 years for repeated offenses. | gdulli wrote: | When the original ruling in favor of HiQ came out, it still | allowed for LinkedIn to block certain kinds of malicious | scraping.
LinkedIn had been specifically blocking HiQ, and was | ordered to stop doing that. | kstrauser wrote: | I've told this story before, but it was fun, so I'm sharing it | again: | | I'll skip the details, but a previous employer dealt with a | large, then-new .mil website. Our customers would log into the | site to check on the status of their invoices, and each page | load would take approximately 1 minute. Seriously. It took | about 10 minutes to log in and get to the list of invoices | available to be checked, then another minute to look at one of | them, then another minute to get out of it and back into the | list, and so on. | | My job was to write a scraper for that website. It ran all | night to fetch data into our DB, and then our website could | show the same information to our customers in a matter of | milliseconds (or all at once if they wanted one big aggregate | report). Our customers _loved_ this. The .mil website's | developer _hated_ it, and blamed all sorts of their tech | problems on us, although: | | - While optimizing, I figured out how to skip lots of | intermediate page loads and go directly to the invoices we | wanted to see. | | - We ran our scraper at night so that it wouldn't interfere | with their site during the day. | | - Because each of our customers had to check each one of their | invoices every day if they wanted to get paid, and we were | doing it more efficiently, our total load on their site was | lower than the total load of our customers would be. | | Their site kept crashing, and we were their scapegoat. It was | great fun when they blamed us in a public meeting, and we | responded that we'd actually disabled our crawler for the past | week, so the problem was still on their end. | | Eventually, they threatened to cut off all our access to the | site. We helpfully pointed out that their brand-new site wasn't | ADA compliant, and we had vision-impaired customers who weren't | able to use it. We offered to allow our customers to run the | same reports from our website, for free, at no cost to the .mil | agency, so that they wouldn't have to rebuild their website | from the ground up. They saw it our way and begrudgingly | allowed us to keep scraping. | dwater wrote: | I have worked with .mil customers who paid us to scrape and | index their website because they didn't have a better way to | access their official, public documents. | oneoff786 wrote: | Me too, but for a private company. | | In reality it was probably more like org sub-group A wanted | to leverage org sub-group B's data but they didn't | cooperate. | jll29 wrote: | This is not .mil specific: I've been told of a case where | an airline first legally attacked a flight search engine | (Skyscanner) for scraping, and then told them to continue | when they realized that their own search engine couldn't | handle all the traffic, and even if it could, it was more | expensive per query than routing via Skyscanner. | brailsafe wrote: | You probably can, on the protocol level, with JSON-LD or other | rich-data packages that generate XML or standardized JSON | endpoints. I did this for an open data portal, and this is | something most G7 governments do with their federal open data | portals using off-the-shelf packages (that are obviously worth | researching a bit first), particularly in the Python and Flask | world. We were still getting hammered by China at our | Taiwanese-language subdomain, but that was a different concern.
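For reference, the JSON-LD approach amounts to embedding a schema.org block in each page so well-behaved scrapers can read structured data instead of parsing the UI. A minimal sketch; the product record is invented:

      import json

      # Hypothetical record; schema.org/Product is a standard vocabulary.
      product = {
          "@context": "https://schema.org",
          "@type": "Product",
          "name": "Widget 9000",
          "offers": {"@type": "Offer", "price": "19.99",
                     "priceCurrency": "USD"},
      }

      # Embedded in the page head; crawlers look for this script type.
      print('<script type="application/ld+json">%s</script>'
            % json.dumps(product))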
| bequanna wrote: | As someone who has been on the other end, I can tell you devs | don't want to use Selenium or inspect requests to reverse | engineer your UI, and _wish_ there were more clean APIs. | | Have you tried making your UI more challenging to scrape and | adding a simple API that requires free registration? | YPCrumble wrote: | Reading your comment, my impression is that this is either an | exaggeration or a very unusual type of site if bots make up the | majority of traffic to the point that scrapers are anywhere | near the primary load factor. | | Would someone let me know if I'm just plain wrong in this | assumption? I've run many types of sites, and scrapers have | never been anywhere close to the main source of traffic or even | particularly noticeable compared to regular users. | | Even considering a very commonly scraped site like LinkedIn or | Craigslist - for any site of any magnitude like this, public | pages are going to be cached, so additional scrapers are going | to have negligible impact. And a rate limit is probably one | line of config. | | I'm not saying you are necessarily wrong, but I can't imagine a | scenario like the one you're describing and would love to hear | of one. | CWuestefeld wrote: | It's a B2B ecommerce site. Our annual revenue from the site | would put us on the list of top 100 ecommerce sites [1] | (we're not listed because ecommerce isn't the only business | we do). With that much potential revenue to steal from us, | perhaps the stakes are higher. | | As described elsewhere, rate limiting doesn't work. The bots | come from hundreds to thousands of separate IPs | simultaneously, cooperating in a distributed fashion. Any one | of them is within reasonable behavioral ranges. | | Also, caching, even through a CDN, doesn't help. As a B2B | site, all our pricing is custom, as negotiated with each | customer. (What's ironic is that this means that the pricing | data that the bots are scraping isn't even representative - | it only shows what we offer walk-up, non-contract customers.) | And because the pricing is dynamic, it also means that the | scraping to get these prices is one of the more | computationally expensive activities they could do. | | To be fair, there is some low-hanging fruit in blocking many | of them. Like, it's easy to detect those that are flooding | from a single address, or sending SQL injection attacks, or | just plain coming from Russia. I assume those are just the | script kiddies and stuff. The problem is that it still leaves | a whole lot of bad actors once these are skimmed off the top. | | [1] https://en.wikipedia.org/wiki/List_of_largest_Internet_compa... | mindslight wrote: | > _As a B2B site, all our pricing is custom, as negotiated | with each customer ... the pricing is dynamic_ | | So your company is deliberately trying to frustrate the | market, and doesn't like the result of third parties | attempting to help market efficiency? It seems like this is | the exact kind of scraping that we generally want more of! | I'm sorry about your personal technical predicament, but it | doesn't sound like your perspective is really coming from | the moral high ground here. | YPCrumble wrote: | Thanks for the explanation! | | The thing I still don't understand is why (edit: server, not | CDN) caching doesn't work - you have to identify customers | somehow, and provide everyone else a cached response at the | server level.
For that matter, rate limit non-customers | also. | CWuestefeld wrote: | The pages getting most of the bot action are search and | product details. | | Search results obviously can't be cached, as search is | completely ad hoc. | | Product details can't be cached either, or more | precisely, there are parts of each product page that | can't be cached because | | * different customers have different products in the | catalog | | * different customers have different prices for a given | product | | * different products have customer-specific aliases | | * there's a huge number of products (low millions) and | many thousands of distinct catalogs (many customers have | effectively identical catalogs, and we've already got | logic that collapses those in the backend) | | * prices are also based on costs from upstream suppliers, | which are themselves changing dynamically. | | Putting all this together, the number of times a given | [product,customer] tuple will be requested in a | reasonable cache TTL isn't very much greater than 1. The | exception being for walk-up pricing for non-contract | users, and we've been talking about how we might optimize | that particular case. | YPCrumble wrote: | Ahhhhh, search results make a whole lot more sense! | Thank you. Search can't be cached, and the people who want | to use your search functionality as a high-availability | API endpoint use different IP addresses to get around | rate limiting. | | The low millions of products also makes some sense, I | suppose, but it's hard to imagine why this doesn't simply | take a login for the customer to see the products if | they're unique to each customer. | | On the other hand, I suspect the price this company is | paying to mitigate scrapers is akin to a drop of water in | the ocean, no? As a percent of the development budget it | might seem high and therefore seem big to the developer, | but I suspect the CEO of the company doesn't even know | that scrapers are scraping the site. Maybe I'm wrong. | | Thanks again for the multiple explanations in any case; it | opened my eyes to a way scrapers could be problematic | that I hadn't thought about. | toast0 wrote: | If you've got a site with a _lot_ of pages, bot traffic can | get pretty big. Things like a shopping site with a large | number of products, a travel site with pages for hotels and | things to do, something to do with movies or tv shows and | actors - basically anything with a large catalog will drive a | lot of bot traffic. | | It's been forever since I worked at Yahoo Travel, but bot | traffic was significant then; I'd guess roughly 5-10% of the | traffic was declared bots, but Yandex and Baidu weren't | aggressive crawlers yet, so I wouldn't be terribly surprised | if a site with a large catalog that wasn't top 3 with humans | would have a majority of traffic as bots. For the most part, | we didn't have availability issues as a result of bot | traffic, but every once in a while a bot would really ramp | up traffic and cause issues, and we would have to carefully | design our list interfaces to avoid bots crawling through a | lot of different views of the same list (while also trying to | make sure they saw everything in the list). Humans may very | well want to have all the narrowing options, but it's not | really helpful to expose hotels near Las Vegas starting with | the letter M that don't have pools to Google. | YPCrumble wrote: | I appreciate the response, but I'm still perplexed. It's not | about the percent of traffic if that traffic is cached.
And rate limiting also prevents | any problems. It just doesn't seem plausible that scrapers are | going to DDoS a site per the original comment. I suppose you'd | get bad traffic reports and other problems like log noise, but | claiming it to be a general form of DDoS really does sound like | hyperbole. | breischl wrote: | As another example, I used to work on a site that was roughly | hotel stays. A regular person might search a small set of | areas and dates, usually with the same number of people. | | Bots would routinely try to scrape pricing for every | combination of {property, arrival_date, departure_date, | num_guests} in the next several years. The load to serve this | would have been _vastly_ higher than that of real customers, | but our frontend was mostly pretty good at filtering them out. | | We also served some legitimate partners that wanted basically | the same thing via an API... and the load was in fact | enormous. But at least then it was a real partner with some | kind of business case that would ultimately benefit us, and | we could make some attempt to be smart about what they asked | for. | VWWHFSfQ wrote: | > a very unusual type of site if bots make up the majority of | traffic | | Pretty much Twitter and the majority of such websites. | YPCrumble wrote: | Do you really believe bots make up a significant amount of | Twitter's operating cost? Like I said, they're just | accessing cached tweets and are rate limited. How can the | bot usage possibly be more than a small part of Twitter's | operating cost? | nojito wrote: | Bandwidth isn't free. | YPCrumble wrote: | I didn't say it is free, I said that the bandwidth for bots | is negligible compared to that of regular users. | nojito wrote: | Negligible isn't free either. | svnpenn wrote: | Implement TLS fingerprinting on your server. People can still | fake that if they are determined, but it should cut the abuse | way down. | userbinator wrote: | TLS fingerprinting is one of the ways minority browsers and | OS setups get unfairly excluded. I have an intense hatred of | Cloudflare for popularising that. Yes, there are ways around | it, but I still don't think I should have to fight to use the | user-agent I want. | oh_sigh wrote: | I don't want to say tough cookies, but if OP's | characterization isn't hyperbole ("the lion's share of our | operational costs comes from needing to scale to handle the | huge amount of bot traffic"), then it can be a situation | where you have to choose between 1) cut off a huge chunk of | bots, but upset a tiny percent of users, and improve the | service for everyone else, or 2) simply not provide the | service at all due to costs. | nyuszika7h wrote: | I don't think it's likely to cause issues if implemented | properly. Realistically you can't really build a list of | "good" TLS fingerprints because there are a lot of | different browser/device combinations, so in my experience | most sites usually just block "bad" ones known to belong to | popular request libraries and such. | CWuestefeld wrote: | No, nor can we just do it by IP. The bots are MUCH more | sophisticated than that. More often than not, it's a | cooperating distributed net of hundreds of bots, coming from | multiple AWS, Azure, and GCP addresses. So they can pop up | anywhere, and that IP could wind up being a real customer | next week. And they're only recognizable as a botnet with | sophisticated logic looking at the gestalt of web logs.
| | We do use a 3rd party service to help with this - but that on | its own is imposing a 5- to 6-digit annual expense on our | business. | z3c0 wrote: | Have you considered setting up an API to allow the bots to | get what they want without hammering your front-end | servers? | CWuestefeld wrote: | Yes. And if I could get the perpetrators to raise their | hands so I could work out an API for them, it would be | the path of least resistance. But they take great pains | to be anonymous, although I know from circumstantial | evidence that at least a good chunk of it is various | competitors (or services acting on behalf of competitors) | scraping price data. | | IANAL, but I also wonder if, given that I'd be designing | something specifically for competitors to query our | prices in order to adjust their own prices, this would | constitute some form of illegal collusion. | marginalia_nu wrote: | What seems to actually work is to identify the bots and, | instead of giving up your hand by blocking them, to | quietly poison the data. Critically, it needs to be | subtle enough that it's not immediately obvious the data | is manipulated. It should look like a plausible response, | only with some random changes. | kayodelycaon wrote: | What makes you think they would use it? | z3c0 wrote: | It's in their interest. I've scraped a lot, and it's not | easy to build a reliable process on. Why parse a human | interface when there's an application interface | available? | thaumaturgy wrote: | There's a lot of metadata available for IPs, and that | metadata can be used to aggregate clusters of IPs, and that | in turn can be datamined for trending activity, which can | be used to sift out abusive activity from normal browsing. | | If you're dropping 6 figs annually on this and it's still | frustrating, I'd be interested in talking with you. I built | an abuse prediction system out of this approach for a small | company a few years back; it worked well, and it'd be cool | to revisit the problem. | borski wrote: | You could ban their IPs? | KMnO4 wrote: | IP bans are equivalent to residential door locks. They're | only deterring the most trivial attacks. | | In school I needed to scrape a few hundred thousand pages of | a proteomics database website. For some reason you had to | view each entry one at a time. There was IP throttling which | banned you if you made requests too quickly. But slowing the | script to 1 request per second would have taken days to | scrape the site. So I paid <$5 for a list of 500 proxy | servers and distributed it, completing the task in under half | an hour. | borski wrote: | I agree it's not perfect. It's also significantly better | than nothing. | l33t2328 wrote: | Using proxies to hide your identity to get around a denial | of access seems to get awfully close to violating the | Computer Fraud and Abuse Act (in the USA, at least). | | I'm surprised your school was okay with it. | throw10920 wrote: | Have you considered serving a proof-of-work challenge to | clients accessing your website? Minimal cost on legit users, | but large costs on large-scale web-scraping operations, and it | doesn't matter if they split up their efforts across a bunch of | IP addresses - they're still going to have to do those | computations. | | https://en.wikipedia.org/wiki/Hashcash
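A minimal sketch of the proof-of-work idea (illustrative only, not the actual Hashcash stamp format): the server issues a random nonce and serves the request only once the client finds a suffix whose SHA-256 hash starts with enough zero bits. A legitimate visitor pays the cost occasionally; a bulk scraper pays it on every request, however many IPs it uses:

      import hashlib
      import os

      DIFFICULTY = 20  # required leading zero bits; tune to taste

      def make_challenge() -> str:
          return os.urandom(8).hex()

      def check(challenge: str, answer: str) -> bool:
          digest = hashlib.sha256((challenge + answer).encode()).digest()
          return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

      def solve(challenge: str) -> str:
          # The CPU burn the client must perform (~2**DIFFICULTY hashes).
          i = 0
          while not check(challenge, str(i)):
              i += 1
          return str(i)

      c = make_challenge()
      assert check(c, solve(c))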
| nyuszika7h wrote: | No thanks; as a user I would stay far away from such | websites. This is akin to crypto miners. I don't need them to | drive up my electricity costs and also contribute to global | warming in the process. It's not worth the cost. | userbinator wrote: | That's what rate-limiting is for. Don't be so aggressive with | it that you start hitting the faster visitors, however, or they | may soon go somewhere else (has happened to me a few times). | loceng wrote: | Do you know if there's a way to rate limit logged-in users | differently than visitors of a site? | rolph wrote: | Rate limiting can be a double-edged sword; you can be | better off giving a scraper the highest bandwidth so they are | gone sooner. Otherwise, something like making a zip or other | sort of compilation of the site available may be an option. | | Just what kind of scraper you have is a concern. | | Does the scraper just want a bunch of stock images; | | or does the scraper have FOMO on web trinkets; | | or does the scraper want to mirror/impersonate your site? | | The last option is the most concerning, because then: | | the scraper is mirroring because your site is cool and the | local UI/UX is wanted; | | or the scraper is phishing, smishing, or otherwise duping your | users. | loceng wrote: | Yeah, good points to consider. I think the sites that | would be scraped the most would be where the data is | regularly and reliably up-to-date, and a large volume of | it at that - so not just one scraper but many different | parties may on a daily or weekly basis try to scrape every | page. | | I feel that the ruling should have the caveat that if a fairly | priced paid API exists for getting the publicly listed data, | then the scrapers must legally use that (say, priced no more | than 5% above the CPU/bandwidth/etc. cost of the scraping | behaviour); ideally a rule too that at minimum there be a | delay if they are republishing that data without your | permission, so at least you as the platform/source/reason | for the data being up-to-date aren't harmed too - otherwise | regular visitors may start going to the competitor publishing | the data, which may then kill the source platform over time. | patmorgan23 wrote: | Absolutely - you just have to check the session cookie. | minusf wrote: | nginx can be set up to do that using the session cookie. | CWuestefeld wrote: | Rate limiting isn't an effective defense for us. | | First, as a B2B site, many of our users from a given customer | (and with huge customers, that can be many) are coming | through the same proxy server, effectively presenting to us | as the same IP. | | Second, the bots years back became much more sophisticated | than a single, or even relatively finite, IP. Today they work | across AWS, Azure, GCP, and other cloud services. So the IPs | that they're assigned today will be different tomorrow. Worse, | the IPs that they're assigned today may well be used by a real | customer tomorrow. | gregsadetsky wrote: | Have you tried including the reCAPTCHA v3 library and | looking at the distribution of scores? -- | https://developers.google.com/recaptcha/docs/v3 -- | "reCAPTCHA v3 returns a score for each request without user | friction" | | It obviously depends on how motivated the scrapers are | (i.e. whether their headless browsers are actually | headless, and/or doing everything they can to not appear | headless, whether Google has caught on to their latest | tricks, etc.) but it would at least be interesting to | look at the score distribution and then see whether you can | cut off or slow down the < 0.3 scoring requests (or | redirect them to your API docs).
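Server-side, the score check gregsadetsky describes is a single POST to Google's siteverify endpoint. A sketch with the standard library; the secret key is a placeholder, and the 0.3 cutoff is just the example threshold from the comment above:

      import json
      import urllib.parse
      import urllib.request

      def recaptcha_score(token: str, remote_ip: str) -> float:
          data = urllib.parse.urlencode({
              "secret": "YOUR-SECRET-KEY",  # placeholder server-side key
              "response": token,            # token the page's JS posted back
              "remoteip": remote_ip,
          }).encode()
          with urllib.request.urlopen(
                  "https://www.google.com/recaptcha/api/siteverify", data) as r:
              result = json.load(r)
          return result.get("score", 0.0) if result.get("success") else 0.0

      # e.g. route low-scoring requests to the API docs instead of blocking
      if recaptcha_score("token-from-client", "203.0.113.9") < 0.3:
          pass  # serve the "please use the API" page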
| 9dev wrote: | It sounds great, until you have Chinese customers. That's | when you'll figure out reCAPTCHA just doesn't really work | in China, and have to begrudgingly ditch it altogether... | kevincox wrote: | If your users are logged in, you can rate limit by user | instead of by IP. This mostly solves the problem. | Generally what I do is, for logged-in users I rate limit by | user, then for not-logged-in users I rate limit | aggressively by IP. If they hit the limit, the message lets | them know that they can get around it by logging in. Of | course this depends on user accounts having some sort of | cost to create. I've never actually implemented it, but I've | considered having only users who have made at least one | purchase bypass the IP limit or otherwise get a bigger rate | limit. | forgotmypw17 wrote: | Yes, I think working to accommodate the non-humans along with | the humans is the right approach here. | | Scrapers have a limited range of IPs, so rate-limiting them and | stalling (or dropping) request responses is one way to deal | with the DoS scenario. | | For my sites, I have placed the majority behind HTTP Basic | Auth... | KptMarchewa wrote: | You realistically can't. There are services like [0][1] that | mean any IP could be a scraper node. | | [0] https://brightdata.com/proxy-types/residential-proxies | [1] https://oxylabs.io/products/residential-proxy-pool | orlp wrote: | > How does Bright Data acquire its residential IPs? | | > Bright Data has built a unique consumer IP model by which | all involved parties are fairly compensated for their | voluntary participation. App owners install a unique | Software Development Kit (SDK) to their applications and | receive monthly remuneration based on the number of users | who opt-in. App users can voluntarily opt-in and are | compensated through an ad-free user experience or enjoy an | upgraded version of the app they are using for free. These | consumers or 'peers' serve as the basis of our network and | can opt-out at any time. This model has brought into | existence an unrivaled, first of its kind, ethically sound, | and compliant network of real consumers. | | I don't know how they can say with a straight face that | this is 'ethically sound'. They have, essentially, created | a botnet, but apparently because it's "AdTech" and the user | "opts-in" (read: they click on random buttons until they | hit one that makes the banner/ad go away) it's suddenly not | malware. | TedDoesntTalk wrote: | NordVPN (Tesonet) has another business doing the same | thing. They sell the IP addresses/bandwidth of their | NordVPN customers to anyone who needs bulk mobile or | residential IP addresses. That's right: installing their | VPN software adds your IP address to a pool that NordVPN | then resells. Xfinity/Comcast sort of pioneered this with | their wifi routers that automatically expose an isolated | wifi network called 'xfinity' (IIRC) whether you agree or | not. | rascul wrote: | > They sell the IP addresses/bandwidth of their NordVPN | customers to anyone who needs bulk mobile or residential | IP addresses | | I would be interested in a reference for this if you have | one. | duskwuff wrote: | The Comcast access points do, at least, have the saving | grace that they're on a separate network segment from the | customer's hardware, and don't share an IP address or | bandwidth/traffic limit with the customer. | | Tesonet and other similar services (e.g. Luminati) don't | have that.
As far as anyone -- including web services, | the ISP, or law enforcement -- is concerned, their | traffic is the subscriber's traffic. | lazyjeff wrote: | Now I wonder whether "retrieving" your own OAuth token from an | app to make REST calls that extract your own data from cloud | services is legal. It seems to fall under the same guideline, | that exceeding authorization is not unauthorized access, so even | though it's usually against the terms of service it doesn't | violate the CFAA? | MWil wrote: | Next up, we just need all public court records to be freely | available to BE scraped and not $3 per page. | | https://patentlyo.com/patent/2020/08/court-pacer-should.html | KennyBlanken wrote: | Really the problem is that PACER has been turned into a cash | cow for the federal court system, with fees and profits growing | despite costs being virtually nil. | | But yeah, the irony of the federal court system legalizing | screen scraping, something PACER contractually prohibits. | jakelazaroff wrote: | If I have to stake out a binary position here, I'm pro-scraping. | But I really wish we could find a way to be more nuanced here. | The scraper in question is looking at public LinkedIn profiles so | that it can snitch to employers about which employees might be | looking for new jobs. That's not at all the same as archival; | it's using my data to harm me. | xboxnolifes wrote: | It's a public page. Your employer could just as well check your | page themselves. It may be a tragedy of efficiency, but it's not | like the scraper is grabbing hidden data. The issue is | something else. Maybe it's the fact that your current employer | would punish you for looking for a new job. Or maybe LinkedIn's | public "looking for job" status is not sustainable in its | current form. | notch656a wrote: | Weev was charged, and eventually convicted, based merely on | scraping from AT&T [0]. When the charge was vacated, it was | only on venue/jurisdiction, not on the basis of the scraping | being legal. Seems there's precedent that merely scraping this | information is felonious behavior. | | https://www.eff.org/deeplinks/2013/07/weevs-case-flawed-begi... | jakelazaroff wrote: | Yes, and if I have a public Twitter account it's perfectly | possible for someone to flood me with spam messages. That | doesn't mean we should do nothing to prevent it. As I said | elsewhere, we should strive to make it possible for people to | exist in public digital spaces without worrying about bad | actors. | xboxnolifes wrote: | Someone can manually spam you, and I don't think that | should be allowed. That is a separate topic and discussion. | Unless you are arguing that your employer should not be | allowed to check your LinkedIn status. | jakelazaroff wrote: | I'm just using it as an example of a case in which a | public profile doesn't automatically mean anything goes. | I had hoped to generate discussion about how to throw out | some of the bathwater without throwing out the baby too, | but I guess no one is really interested. | brians wrote: | Yes, but it's specifically using data you published to harm | you. Compare Blind, which is engineered to not be attributable | in this way. | jakelazaroff wrote: | I understand that I published it. That doesn't mean I should | accept that hostile parties will use it against me. | | This is kinda like telling someone who is being harassed on | social media that they're consenting to it by having a public | account.
We should strive to make our digital personae safe | from bad actors, not throw our hands up and say "if you put | yourself out there, you have no recourse". | TedDoesntTalk wrote: | This is great news. A win for the Internet Archive and other | archivists. | ghaff wrote: | IANAL, but it's not immediately obvious to me that this ruling | covers bulk scraping and _republishing_ untransformed. I'm | genuinely curious about this personally. I presumably can't | just grab anything I feel like off the web, curate it, and sell | it. | 1vuio0pswjnm7 wrote: | "On LinkedIn, our members trust us with their information, which | is why we prohibit unauthorized scraping on our platform." | | This is an unpersuasive argument because it ignores all the | computer users who are not "members". Whether or not "members" | trust LinkedIn should have no bearing on whether other computer | users, who may or may not be "members", can retrieve others' | public information. | | Even more, this statement does not decry so-called scraping, only | "unauthorised" scraping. Who provides "authorisation"? Surely not | the LinkedIn members. | | It is presumptuous, if not ridiculous, for "tech" companies to | claim computer users "trust" them. Most of these companies | receive no feedback from the majority of their "members". Tech | companies generally have no "customer service" for the members | they target with data collection. | | Further, there is an absence of meaningful choice. It is like | saying people "trust" credit bureaus with their information. | History shows these data collection intermediaries could not be | trusted, and that is why Americans have the Fair Credit Reporting | Act. | MWil wrote: | Great point about non-members/members | ketzu wrote: | I am not versed in law, especially not in US law, but this case | seems to be very specific that scraping is not a violation of the | CFAA. I do support this interpretation. | | However, the case of scraping I personally find more problematic | is the use of personal data I provide to one side, then used by | scrapers without my knowledge or permission. I truly wonder which | way we are better off on that issue as a society. Independent of | the current law, should anything that is accessible be | essentially free-for-all, or should there be limitations on what | you are allowed to do? Cases highlighted in the article: facial | recognition by third parties on social media profiles, Facebook | scraping for personal data, search engines, journalists or | archives. (Not all need to have the same answer to the question | "do we want this".) Besides that, the point I care slightly less | about is the idea that allowing scraping with very loose limits | leads to even more closed-up systems. | snarf21 wrote: | Serious question (IANAL): If I write down some information, at | what point does that information have copyright protection? Do | I have to claim it with a (c) notice? | henryfjordan wrote: | Never. "Mere listings" of data (like the phone book) are not | copyrightable. | | But also, anything you write which is copyrightable is | copyrighted immediately. You can register the work w/ the | copyright office for some extra perks but it's not strictly | necessary. | butlerm wrote: | You might want to check out Feist v. Rural Telephone Company | (1991), and also look up the Berne Convention (on copyright), | which the U.S. joined in 1989.
| | If by "information" you mean mere facts without creativity in | selection or arrangement, those are generally not protectable | by copyright in the United States, although possibly in some | other countries. Copyright generally protects works of | authorship, and nothing else. No creativity, no copyright. | henryfjordan wrote: | "I gave my data to LinkedIn and now scrapers are reading it | from the public web". Be mad at LinkedIn before you are mad at | the scraper. | ViViDboarder wrote: | I may edit and delete my information from LinkedIn, but I | have no idea who has persisted that data beyond there. | | There is such a thing as scraping responsibly and | irresponsibly. Both kinds happen. | ketzu wrote: | This seems to have various angles to it. | | First, one question is whether the intent of the original owner | of the data is important. When I put data on LinkedIn (or | Facebook, my private website, Hacker News, or my employer's | website) I might have an opinion on who gets to do what with my | data (see also GDPR discussions). Should I blame LinkedIn (or | Meta/myself/my employer) for doing what I expected them to do, or | should I blame those that do what I don't want them to do? | Should I just be blamed directly because I even want to make | a distinction between those? If I didn't want my data used, I | could just not provide it (or not participate in/surf the web at | all, if we extend the idea to more general data collection). | | Secondly, it touches on the idea that LinkedIn should not | make the data publicly available (i.e., without | authentication), and we end up with a less open system. Is | that better? Is it what we want? Maybe there are also other | ways that I am not aware of just now. (Competing purely on value | added is probably futile for data aggregators.) | henryfjordan wrote: | Your intent as the original owner of the data is important! | You have to explicitly give LinkedIn the right to display | your data. It's in their Terms of Service. If LinkedIn does | something with your data that is outside the ToS, then that | is on them, but if they do something within the ToS that | you don't like, then maybe you should not have provided them | with your data. | | As for whether the data should be public, that's a decision | we each have to make. | ct0 wrote: | Consider that scrapers may be far less interested in you as an | individual than they are in your input into the | aggregated data points of you and those like you. | altdataseller wrote: | The scraper in this case, HiQ, had a product that | tracked employee profile changes to predict employee churn. | | So they were specifically interested in you personally, | not the aggregate. | xboxnolifes wrote: | That's the same argument for all major internet tracking | cookies. I don't think that's going to convince this site's | userbase. | nomoreusernames wrote: | So I don't have the right to not be scraped? That's like sending | radio waves and making me pay a license for a radio I don't use. | Same with spam: I should have to give you a token to send me | emails, in case I maybe want to look at your stuff. | notch656a wrote: | An interesting departure, considering weev was convicted for | merely scraping [0] AT&T. Although his charge was vacated, it | was on the venue/jurisdiction, not because scraping was found to | be legal. | | [0] https://www.eff.org/deeplinks/2013/07/weevs-case-flawed-begi... | whatever1 wrote: | Is it legal though to log in to a website and then scrape data | (that would not be accessible if I were just browsing as a guest)?
___________________________________________________________________ (page generated 2022-04-18 23:00 UTC)