[HN Gopher] Ask HN: Best practices for ethical web scraping?
___________________________________________________________________

Ask HN: Best practices for ethical web scraping?

Hello HN! As part of my learning in data science, I need/want to
gather data. One relatively easy way to do that is web scraping.
However, I'd like to do that in a respectful way. Here are three
things I can think of:

1. Identify my bot with a user agent/info URL, and provide a way
   to contact me
2. Don't DoS websites with tons of requests.
3. Respect the robots.txt

What else would be considered good practice when it comes to web
scraping?

Author : aspyct
Score  : 192 points
Date   : 2020-04-04 13:27 UTC (9 hours ago)

| mfontani wrote:
| If all scrapers did what you did, I'd curse a lot less at $work.
| Kudos for that.
|
| Re 2 and 3: do you parse/respect the "Crawl-delay" robots.txt
| directive, and do you ensure that works properly across your
| fleet of crawlers?
| the8472 wrote:
| In addition to Crawl-delay there's also HTTP 429 and the
| Retry-After header.
|
| https://tools.ietf.org/html/rfc6585#page-3
| greglindahl wrote:
| Sites also use 403 and 503 to send rate-limit signals,
| despite what the RFCs say.
| aspyct wrote:
| Hehe, my "fleet of crawlers" is a single machine in a closet so
| far :) I'll think about that kind of synchronization later.
|
| However, I do parse and respect the Crawl-delay now, thanks
| for pointing it out!
| greglindahl wrote:
| A large fraction of websites with Crawl-delay set it a decade
| ago and promptly forgot about it. No modern crawler uses it for
| anything other than a hint. The primary factors for crawl rate
| are usually site page count and response time.
| abannin wrote:
| Don't fake identity. If the site requires a login, don't fake
| that login. This has legal implications.
| RuedigerVoigt wrote:
| Common CMSes are fairly good at caching and can handle a high
| load, but quite often someone deems a badly programmed extension
| "mission critical". In that case one of your requests might
| trigger dozens of database calls. If multiple sites share a
| database backend, an accidental DoS might bring down a whole
| organization.
|
| If the bot has a distinct IP (or distinct user agent), then a
| good setup can handle this situation automatically. If the
| crawler switches IPs to circumvent a rate limit or for other
| reasons, then it often causes trouble in the form of tickets and
| phone calls to the webmasters. Few care about some gigabytes of
| traffic, but they do care about overtime.
|
| Some react by blocking whole IP ranges. I have seen sites that
| blocked every request from the network of Deutsche Telekom (Tier
| 1 / former state monopoly in Germany) for weeks. So you might
| affect many others on your network.
|
| So:
|
| * Most of the time it does not matter whether you scrape all the
| information you need in minutes or overnight. For crawl jobs I
| try to avoid the times of day when I assume the site gets heavy
| traffic. So I would not crawl restaurant sites at lunch time,
| but 2 a.m. local time should be fine. If the response time goes
| up suddenly at that hour, it may be due to a backup job. Simply
| wait a bit.
|
| * The software you choose has an impact: if you use Selenium or
| headless Chrome, you load images and scripts. If you do not need
| those, analyzing the source (with, for example, Beautiful Soup)
| draws less of the server's resources and might be much faster.
|
| * Keep track of your requests. A specific file might be linked
| from a dozen pages of the site you crawl. Download it just once.
| This can be tricky if a site uses A/B testing for headlines and
| changes the URL.
|
| * If you provide contact information, read your emails. This
| sounds silly, but at my previous job we had problems with a
| friendly crawler with known owners. It tried to crawl our sites
| once a quarter and was blocked each time, because they did not
| react to our friendly requests to change their crawling rate.
|
| Side note: I happen to work on a Python library for a polite
| crawler. It is about a week away from stable (one important bug
| fix and a database schema change for a new feature). In case it
| is helpful: https://github.com/RuedigerVoigt/exoskeleton
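
A minimal sketch pulling together the politeness points above --
the Crawl-delay directive, the 429/Retry-After backoff, and the
"download it just once" rule. This is not code from the thread; it
assumes the requests library, and polite_get is an illustrative
name:

    import time
    import urllib.robotparser

    import requests

    USER_AGENT = "my-research-bot/0.1 (+https://example.org/bot-info)"

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.org/robots.txt")
    robots.read()

    # Fall back to a conservative delay if no Crawl-delay is set.
    delay = robots.crawl_delay(USER_AGENT) or 5.0

    seen = set()  # fetch every URL at most once

    def polite_get(url):
        if url in seen or not robots.can_fetch(USER_AGENT, url):
            return None
        seen.add(url)
        resp = requests.get(url, headers={"User-Agent": USER_AGENT})
        if resp.status_code == 429:
            # Honor Retry-After; it may also be an HTTP date, but
            # plain seconds are assumed here for brevity.
            time.sleep(float(resp.headers.get("Retry-After", 60)))
            return None
        time.sleep(delay)
        return resp
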
| Someone wrote:
| IMO, the best practice is "don't". If you think the data you're
| trying to scrape is freely available, contact the site owner and
| ask them whether dumps are available.
|
| Certainly, if your goal is "learning in data science", and thus
| not tied to a specific subject, there are enough open datasets to
| work with, for example from https://data.europa.eu/euodp/en/home
| or https://www.data.gov/
| aspyct wrote:
| I'm a lot more motivated to do data science on topics I
| actually care about :) Unfortunately those topics (or websites,
| in this case) don't expose ready-made databases or CSV files.
| pxtail wrote:
| Where does this _'best practice is "don't"'_ idea come from? I've
| seen it a couple of times when the topic of scraping surfaces. I
| think it is a kind of hypocrisy, and actually acting against
| one's own good and even the good of the internet as a whole,
| because it artificially limits who can do what.
|
| Why are there entities which are allowed to scrape the web
| however they want (and who got into their position because of
| scraping the web), while the regular Joe is discouraged from
| doing so?
| xzel wrote:
| This might be overboard for most projects, but here is what I
| recently did. There is a website I use heavily that provides
| sales data for a specific type of product. I actually e-mailed
| them to make sure this was allowed, because they took down their
| public API a few years ago. They said yes, everything that is on
| the website is fair game, and I could even do it on my main
| account. It was actually a surprisingly nice response.
| sys_64738 wrote:
| Ethical web scraping? Is that even a thing?
| RhodesianHunter wrote:
| How do you think Google provides search results?
| sys_64738 wrote:
| You're claiming Google is ethical? Bit of a stretch.
| haddr wrote:
| Some time ago I wrote an answer on Stack Overflow:
| https://stackoverflow.com/questions/38947884/changing-proxy-...
|
| Maybe that can help.
| johnnylambada wrote:
| You should probably just paste your answer here if it's that
| good.
| ok_coo wrote:
| I work with a scientific institution and it's still amazing to me
| that people don't check or ask if there are downloadable full
| datasets that anyone can have for free. They just jump right in
| to scraping websites.
|
| I don't know what kind of data you're looking for, but please
| verify that there isn't a quicker/easier way of getting the data
| than scraping first.
| jakelazaroff wrote:
| I think your main obligation is not to the entity from which
| you're scraping the data, but to the people whom the data is
| about.
|
| For example, the recent case between LinkedIn and hiQ centered on
| the latter not respecting the former's terms of service.
But even
| if they had followed those to a T, what hiQ is doing -- scraping
| people's profiles and snitching to their employers when it looks
| like they are job hunting -- is incredibly unethical.
|
| Invert power structures. Think about how the information you
| scrape could be misused. Allow people to opt out.
| aspyct wrote:
| That's a fair point indeed. I don't think I will ever expose
| non-anonymized data, because that's just too sensitive. But if
| I ever do, I'll make sure people are made aware they are
| listed, and that they can opt out easily.
| monkpit wrote:
| I tried to find a source to back up what you're saying about
| hiQ "snitching" to employers about employees searching for
| jobs, but all I can find is vague documentation about the
| hiQ v. LinkedIn lawsuit.
|
| Do you have a link to an article or something?
| jakelazaroff wrote:
| Sure, it's mentioned in the EFF article about the lawsuit:
| https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...
|
| _> HiQ Labs' business model involves scraping publicly
| available LinkedIn data to create corporate analytics tools
| that could determine when employees might leave for another
| company, or what trainings companies should invest in for
| their employees._
| lkjdsklf wrote:
| It's their actual product, Keeper.
|
| > Keeper is the first HCM tool to offer predictive attrition
| insights about an organization's employees based on publicly
| available data.
| sudoaza wrote:
| Those three are the main ones. Sharing the data at the end could
| also be a way to avoid future scraping.
| mrkramer wrote:
| That's an interesting proposition. For example, there is Google
| Dataset Search, where you can "locate online data that is freely
| available for use".
| aspyct wrote:
| Didn't know about that search engine. Thanks a lot! Actually
| found a few fun datasets, made my day :)
| rectang wrote:
| In addition to the steps you're already taking, and the ethical
| suggestions from other commenters, I suggest that you acquaint
| yourself thoroughly with intellectual property (IP) law. If you
| eventually decide to publish anything based on what you learn,
| copyright and possibly trademark law will come into play.
|
| Knowing early on what rights you have to use the material you're
| scraping could guide you towards seeking out alternative sources
| in some cases, sparing you trouble down the line.
| aspyct wrote:
| That's a good point! So far I'm not planning on publicly
| disclosing any of my results, but that may come, I guess.
| yjftsjthsd-h wrote:
| I'm curious how this would be an issue; factual information
| isn't copyrightable, and most of the obvious things that I can
| think to do with a scraper amount to pulling factual
| information in bulk. Even if it's information like "this is
| the average price for this item across 13 different stores".
| (Although I'm not a lawyer and only pay attention to American
| law, so take all of this with the appropriate amount of salt.)
| rectang wrote:
| How much can you quote from a crawled document? Can you
| republish the entire crawl? What can you do under "fair use"
| of copyrighted material, and what can't you do? Can you
| articulate a solid defense that your publication truly
| contains only pure factual information? Will BigCo dislike
| having its name associated with the study, and can you protect
| yourself by limiting yourself to "nominative use" of its
| trademarks? What is the practical risk of someone raising a
| stink if the legality of your usage is ambiguous?
Who
| actually holds copyright on the crawled documents?
|
| You have a lot of rights and you can do a lot. Understanding
| those rights and where they end lets you do _more_, and with
| confidence.
| elorant wrote:
| My policy on scraping is to never use asynchronous methods. I've
| seen a lot of small e-commerce sites that can't really handle
| the load; even a few hundred requests per second can crash the
| server. So even if it takes me longer to scrape a site, I prefer
| not to cause any real harm as long as I can avoid it.
| moooo99 wrote:
| The rules you named are ones I personally followed. One other
| extremely important thing is privacy when you want to crawl
| personal data, as on social networks. I personally avoid
| crawling data that inexperienced users might accidentally
| expose, like email addresses, phone numbers or their friends
| list. A good rule of thumb for social networks, for me, has
| always been to only scrape the data that is visible when my bot
| is not logged in (this also helps to not break the provider's
| ToS).
|
| The most elegant way would be to ask the site provider if they
| allow scraping their website and which rules you should obey. I
| was surprised how open some providers were, but some don't even
| bother replying. If they don't reply, apply the rules you set
| and follow the obvious ones, like not overloading their service.
| aspyct wrote:
| I tried the elegant way before, after creating a mobile
| application to find fuel pumps around the country for a
| specific brand. My request was greeted with a "don't publish;
| we're busy making one; we'll sue you anyway". I guess where I'm
| from, people don't share their data yet...
|
| Totally agree with the point on accidental personal data,
| thanks for pointing that out!
|
| PS: they never released their app...
| [deleted]
| montroser wrote:
| Nice of you to ask this question and to think about how to be as
| considerate as you can.
|
| Some other thoughts:
|
| - Find the most minimal, least expensive (for you and them both)
| way to get the data you're looking for. Sometimes you can
| iterate through search results pages and get all you need from
| there in bulk, rather than iterating through detail pages one at
| a time.
|
| - Even if they don't have an official/documented API, they may
| very likely have internal JSON routes or RSS feeds that you can
| consume directly, which may be easier for them to accommodate.
|
| - Pay attention to response times. If you get your results back
| in 50ms, it probably was trivially easy for them, and you can
| request a bunch without troubling them too much. On the other
| hand, if responses are taking 5s to come back, then be gentle.
| If you are using internal undocumented APIs, you may find that
| you get faster/cheaper cached results if you stick to the same
| sets of parameters the site uses on its own (e.g., when the
| site's front end makes AJAX calls).
| [deleted]
| aspyct wrote:
| That's great advice! Especially the one about response times. I
| didn't think of that, and will integrate it in my sleep timer
| :)
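
A sketch of what that sleep timer could look like, scaling the
pause to the server's response time as montroser suggests. This is
not from the thread; the minimum delay and multiplier are arbitrary
illustrative choices, and the requests library is assumed:

    import time

    import requests

    def gentle_get(url, min_delay=1.0, factor=2.0):
        resp = requests.get(url, headers={"User-Agent": "my-bot/0.1"})
        # A fast answer (e.g. 50ms) was probably cheap for the
        # server; a slow one (e.g. 5s) suggests it is working hard,
        # so back off proportionally.
        elapsed = resp.elapsed.total_seconds()
        time.sleep(max(min_delay, elapsed * factor))
        return resp
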
| snidane wrote:
| When scraping, just behave so as not to piss off the site owner
| -- whatever that means. E.g. don't cause excessive load, and
| make sure you don't leak out sensitive data.
|
| Next, put yourself in their shoes and realize they usually don't
| monitor their traffic that much, or simply don't care as long as
| you don't slow down their site. It's usually only certain big
| sites with heavy bot traffic, such as LinkedIn or sneaker shoe
| sites, which implement bot protections. Most others don't care.
|
| Some websites are created almost as if they want to be scraped.
| The JSON API used by the front end is ridiculously clean and
| accessible. Perhaps they benefit when people see their results
| and invest in their stock. You never fully know whether a site
| wants to be scraped or not.
|
| The reality of the scraping industry, as it relates to your
| question, is this:
|
| 1. Scraping companies generally don't use an honest user agent
| such as "my friendly data science bot"; they hide behind a set
| of fake ones and/or route the traffic through a proxy network.
| You don't want to get banned so stupidly easily by revealing
| your user agent when you know your competitors don't reveal
| theirs.
|
| 2. This one is obvious. The general rule is to scrape over a
| long time period, continuously, and add large delays of at least
| 1 second between requests. If you go below 1 second, be careful.
|
| 3. robots.txt is controversial and doesn't serve its original
| purpose. It should be renamed to google_instructions.txt,
| because site owners use it to guide googlebot in navigating
| their site. It is generally ignored by the industry, again
| because you know your competitors ignore it.
|
| Just remember the rule of not pissing off the site owner, and
| then go ahead and scrape. Also keep in mind that you are in a
| free country, and we don't discriminate here, whether for racial
| or gender reasons or by whether you are a biological or
| mechanical website visitor.
|
| I have simply described the reality of the data science industry
| around scraping, after several years of being in it. Note that
| this will probably not be liked by the HN audience, as they are
| mostly website devs and site owners.
| hutzlibu wrote:
| "or making sure you don't leak out sensitive data"
|
| If sensitive data can be scraped, it is not really being stored
| securely in the first place. So I would not care too much about
| it, and would just notify the owner if I noticed it.
| HenryBemis wrote:
| Keep in mind that if you end up with data that is protected
| under the GDPR, merely having it puts you in a damning
| position. The intended owner will be fried for not protecting
| it adequately, but you are also violating the GDPR, since the
| people in the data never agreed to you collecting, processing,
| etc. And imagine the world of pain if you are caught with
| children's data.
| aspyct wrote:
| Well, having a few websites of my own, I really do think that
| point 1 is the worst. I can't filter bots that disguise
| themselves as users out of my access logs, and they actually
| hurt my work (i.e. figuring out what people read).
|
| Totally agree with the rest though. Maybe adapt the "large
| delay" of 1 second to the kind of website I'm scraping, though.
|
| Thanks for your feedback!
| the8472 wrote:
| > I can't filter bots that disguise themselves as users out of
| my access logs, and they actually hurt my work (i.e. figuring
| out what people read).
|
| If the bots aren't querying from residential IPs, you could
| match their IPs to ASNs and then filter based on that to
| separate domestic and data center origins.
| aspyct wrote:
| Ha, that's a good idea! Is there a list somewhere of the
| CIDR blocks that are assigned to residential use vs. server
| farms? I mean, how can I tell an IP is residential?
| the8472 wrote:
| The other way around may be easier, i.e. excluding known
| datacenter ranges. There are some commercial databases for
| that; I'm not sure if there are any free ones. But you can
| also do this manually by running a whois on an IP, then
| extracting the ranges from the whois response and caching
| them. Then you can look at the OrgName or something like
| that. You can also download the whois databases from the
| RIRs, but they don't contain the information about what kind
| of entities they are.
|
|     $ dig +short reddit.com
|     151.101.1.140
|     $ whois 151.101.1.140
|     NetRange:       151.101.0.0 - 151.101.255.255
|     CIDR:           151.101.0.0/16
|     OrgName:        Fastly
|     [...]
|
| So if you see a known hoster here, then you can exclude it
| from your statistics.
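
A minimal sketch of that manual approach, shelling out to the
whois command-line tool and caching the results. The field name
varies by registry (RIPE uses "org-name"/"netname" rather than
"OrgName"), and the hoster list is purely illustrative:

    import re
    import subprocess
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def whois_org(ip):
        # Requires the whois CLI; lookups are slow and often
        # rate-limited, hence the cache.
        out = subprocess.run(["whois", ip], capture_output=True,
                             text=True).stdout
        m = re.search(r"^(?:OrgName|org-name|netname):\s*(.+)$",
                      out, re.MULTILINE | re.IGNORECASE)
        return m.group(1).strip() if m else "unknown"

    KNOWN_HOSTERS = {"Fastly", "Amazon.com, Inc.", "Google LLC"}

    def is_datacenter(ip):
        return whois_org(ip) in KNOWN_HOSTERS
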
| capableweb wrote:
| What I've done in the past is to pull down all the IPs of the
| requests I see, filter them for uniqueness, do a whois for each
| one (you're going to need a backoff/rate limit here, as whois
| services are usually rate-limited) and save the organization
| name, ASN and CIDR blocks, again filtering for uniqueness, then
| create a new list with the organizations of interest and match
| it with the CIDR blocks. Now you have an allow/blocklist you can
| use.
| codingdave wrote:
| You are correct that I don't like this advice... not because I
| find it to be wrong, but because you are approaching it solely
| from a competitive perspective -- "Your competitors don't have
| ethics, so you shouldn't either." That doesn't help someone who
| is engaging in research and trying to hold themselves to a
| higher standard.
| lordgrenville wrote:
| I'm neither a web dev nor a site owner, but OP literally asked
| for tips on _ethical_ web scraping, not "what's the most I can
| get away with".
| wizzwizz4 wrote:
| 1 is the only one I don't like. I think you should use your
| real user agent first on any given site, as a courtesy; whether
| you give up or change to a more "normal" user agent if you get
| banned is up to you.
|
| Oh, and for 3: if you can, apply some heuristics to your
| reading of the robots.txt. If it's just "deny everything", then
| ignore it, but you really don't want to be responsible for
| crawling all of the GET /delete/:id pages of a badly designed
| site... (those should definitely be POST, and authenticated, by
| the way).
| mpclark wrote:
| Also, if a target site is behind Cloudflare then you probably
| won't be able to masquerade as any of the popular bots -- they
| block fake google/yandex bots.
| gilad wrote:
| As for delete, use authenticated DELETE, not POST; it's why
| it's there in the first place.
| chatmasta wrote:
| I disagree. The risks are similar to those of disclosing a
| security vulnerability to a company without a bug bounty. You
| cannot know how litigious or technically illiterate the
| company will be. What if they decide you're "hacking" them
| and call the FBI with the helpful information you included in
| your user agent? Crazier things have happened.
|
| Anonymity is part of the right to privacy; IMO, such a right
| should extend to bots as well. There should be no shame in
| anonymously accessing a website, whether via automated means
| or otherwise.
| a1369209993 wrote:
| > such a right should extend to bots as well
|
| No, it very much shouldn't, but (as you probably meant) it
| _should_ extend to the _person_ (not, e.g., a company) _using_
| a bot, which amounts to the same thing in this case.
| erdos4d wrote:
| Perhaps I am behind the curve here, but why would sneaker shoe
| sites get scraped hard?
| abannin wrote:
| There is a very active secondary market for sneakers. If you
| can buy before supply is exhausted, you can make some decent
| money.
| [deleted]
| mettamage wrote:
| Indirectly related: if you have some time to spare, follow
| Harvard's course in ethics! [1]
|
| Here is why: while it didn't teach me anything new (in a sense),
| it did give me a vocabulary to better articulate myself. Having
| new words to describe certain ideas means you have more
| analytical tools at your disposal. So you'll be able to examine
| your own ethical stance better.
|
| It takes some time, but instead of watching Netflix (if that's a
| thing you do), watch this instead! Although The Good Place is a
| pretty good Netflix show, sprinkling some basic ethics in there.
|
| [1] https://www.youtube.com/watch?v=kBdfcR-8hEY
| lapnitnelav wrote:
| Thanks for sharing that Harvard course.
|
| The cost-benefit analysis part reminds me a lot of some of the
| comments you see here (and elsewhere) with regards to Covid-19
| and the economic shutdown of societies. Quite timely.
| aspyct wrote:
| Great recommendations, thanks!
| aspyct wrote:
| I must insist. This course is great! Thanks :)
| JackC wrote:
| In some cases, especially during development, local caching of
| responses can help reduce load. You can write a little wrapper
| that tries to return URL contents from a local cache and then
| falls back to a live request.
| sairamkunala wrote:
| Simple:
|
| Respect robots.txt.
|
| Find your data from sitemaps, and ensure you query at a slow
| rate. robots.txt has a cool-off period (Crawl-delay). See
| https://en.wikipedia.org/wiki/Robots_exclusion_standard#Craw...
|
| Example: https://www.google.com/robots.txt
| aspyct wrote:
| Yeah, that's a must-do, but I think most websites don't even
| bother making a robots.txt beyond "please index us, Google".
| However, that wouldn't necessarily mean they're happy about
| someone vacuuming up their whole website in a few days.
| brainzap wrote:
| Ask for permission and have nice timeouts/retries.
| jll29 wrote:
| The only sound advice one can give is that there are two
| elements to consider:
|
| 1) Ethics is different from law.
| 1.1) The ethical way: respect the robots.txt protocol.
| 2) Consult a lawyer.
| 2.1) Prior written consent, they will say, prevents you from
| being sued, and not much else.
| tdy721 wrote:
| Schema.org is a nice resource. If you can find that metadata on
| a site, you can be just a little more sure they don't mind
| getting that data scraped. It's the instruction book for
| teaching Google and other crawlers extra information and
| context. Your scraper would be wise to parse this extra meta
| information.
| mapgrep wrote:
| I always add an "Accept-Encoding" header to my request to
| indicate I will accept a gzip response (or deflate if
| available). Your HTTP library (in whatever language your bot is
| written) probably supports this with a near-trivial amount of
| additional code, if any. Meanwhile you are saving the target
| site some bandwidth.
|
| Look into the If-Modified-Since and If-None-Match/ETag headers
| as well if you are querying resources that support them (RSS
| feeds, for example, commonly support these, as do static
| resources). They prevent the target site from having to send
| anything other than a 304, saving bandwidth and possibly
| compute.
| adrianhel wrote:
| I like this approach. Personally, I wait an hour if I get an
| invalid response, and use timeouts of a few seconds between
| other requests.
| tedivm wrote:
| I've gone through this process twice: once about six months
| ago, and once just this week.
|
| In the first case the content wasn't clearly licensed and the
| site was somewhat small, so I didn't want to break it. I emailed
| them and they gave us permission, but only if we crawled one
| page per ten seconds. Took us a weekend, but we got all the data
| and did so in a way that respected their site.
|
| The second one was this last week and was part of a personal
| project. All of the content was under an open license (Creative
| Commons), and the site was hosted on a platform that can take a
| ton of traffic. For this one I made sure we weren't hitting it
| too hard (Scrapy has some great autothrottle options), but
| otherwise didn't worry about it too much.
|
| Since the second project is personal, I open sourced the crawler
| if you're curious: https://github.com/tedivm/scp_crawler
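
Putting JackC's caching wrapper together with the conditional
request headers mapgrep describes above, a sketch might look like
this (assuming the requests library; the in-memory dict stands in
for a real on-disk cache):

    import requests

    cache = {}  # url -> (etag, last_modified, body)

    def cached_get(url, ua="my-bot/0.1"):
        headers = {"User-Agent": ua,
                   "Accept-Encoding": "gzip, deflate"}
        if url in cache:
            etag, last_mod, _ = cache[url]
            if etag:
                headers["If-None-Match"] = etag
            if last_mod:
                headers["If-Modified-Since"] = last_mod
        resp = requests.get(url, headers=headers)
        if resp.status_code == 304:
            # Not modified: serve the local copy, which costs the
            # site almost nothing.
            return cache[url][2]
        cache[url] = (resp.headers.get("ETag"),
                      resp.headers.get("Last-Modified"),
                      resp.text)
        return resp.text
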
| coderholic wrote:
| Another option is to not scrape at all, and use an existing
| dataset. Common Crawl is one good example, and HTTP Archive is
| another.
|
| If you just want metadata from the homepages of all domains, we
| scrape that every month at https://host.io and make the data
| available over our API: https://host.io/docs
| tyingq wrote:
| Be careful about making the data you've scraped visible to
| Google's search engine scrapers.
|
| That's often how site owners get riled up. They search for some
| unique phrase on Google, and your site shows up in the search
| results.
| lazyjones wrote:
| It's incredibly ironic that one has to avoid doing what Google
| does in order to be kept in their index.
| MarcellusDrum wrote:
| This isn't really an "ethical" practice; it's more "how to hide
| that you are scraping data". If you have to hide the fact that
| you are scraping their data, maybe you shouldn't be doing it in
| the first place.
| tyingq wrote:
| Depends. Maybe, for example, you're doing some competitive
| price analysis and never plan on exposing scraped things like
| product descriptions... you only plan to use those internally
| to confirm you're comparing like products. But then you expose
| them accidentally. Avoid that.
| throwaway777555 wrote:
| The suggestions in the comments are excellent. One thing I would
| add is this: contact the site owner in advance and ask for their
| permission. If they are okay with it, or if you don't hear back,
| credit the site in your work. Then send the owner a message with
| where they can see the information being used.
|
| Some sites will have rules or guidelines for attribution already
| in place. For example, the DMOZ had a Required Attribution page
| to explain how to credit them:
| https://dmoz-odp.org/docs/en/license.html. Discogs mentions that
| use of their data also falls under CC0:
| https://data.discogs.com/. Other sites may have these details in
| their Terms of Service, About page, or similar.
| avip wrote:
| Contact the site owner, tell them who you are and what you're
| doing, and ask about a data dump or API.
| pfarrell wrote:
| It won't help you learn to write a scraper, but using the Common
| Crawl dataset will get you access to a crazy amount of data
| without paying to acquire it yourself.
|
| https://commoncrawl.org/the-data/
| aspyct wrote:
| Cool, didn't know about this. Thanks!
| Reelin wrote:
| > As part of my learning in data science, I need/want to
| gather data.
|
| Also not web scraping, but a few other public dataset sources
| to check:
|
| https://registry.opendata.aws
|
| https://github.com/awesomedata/awesome-public-datasets
| aspyct wrote:
| Thanks!
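
As a closing illustration of the "don't scrape, reuse an existing
crawl" suggestions above: Common Crawl exposes a URL index that can
be queried before ever touching a live site. A sketch follows; the
collection name (CC-MAIN-2020-16) was current around the time of
this thread and changes with each monthly crawl:

    import json

    import requests

    INDEX = "https://index.commoncrawl.org/CC-MAIN-2020-16-index"

    def cc_lookup(url_pattern):
        # One JSON record per captured page matching the pattern
        # (timestamp, url, status, WARC location, ...).
        resp = requests.get(INDEX, params={"url": url_pattern,
                                           "output": "json"})
        return [json.loads(line) for line in resp.text.splitlines()
                if line]

    for record in cc_lookup("example.com/*")[:5]:
        print(record["timestamp"], record["url"], record["status"])
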
___________________________________________________________________ (page generated 2020-04-04 23:00 UTC)