[HN Gopher] Ask HN: Best practices for ethical web scraping?
___________________________________________________________________

Ask HN: Best practices for ethical web scraping?

Hello HN! As part of my learning in data science, I need/want to
gather data. One relatively easy way to do that is web scraping.
However, I'd like to do that in a respectful way. Here are three
things I can think of:

1. Identify my bot with a user agent/info URL, and provide a way
   to contact me
2. Don't DoS websites with tons of requests.
3. Respect the robots.txt

What else would be considered good practice when it comes to web
scraping?

Author : aspyct
Score  : 192 points
Date   : 2020-04-04 13:27 UTC (9 hours ago)

| mfontani wrote:
| If all scrapers did what you did, I'd curse a lot less at $work.
| Kudos for that.
|
| Re 2 and 3: do you parse/respect the "Crawl-delay" robots.txt
| directive, and do you ensure that works properly across your
| fleet of crawlers?
| the8472 wrote:
| In addition to Crawl-delay there's also HTTP 429 and the
| Retry-After header.
|
| https://tools.ietf.org/html/rfc6585#page-3
| greglindahl wrote:
| Sites also use 403 and 503 to send rate-limit signals,
| despite what the RFCs say.
| aspyct wrote:
| Hehe, my "fleet of crawlers" is a single machine in a closet so
| far :) I'll think about that kind of synchronization later.
|
| However, I do parse and respect the Crawl-delay now, thanks
| for pointing it out!
| greglindahl wrote:
| A large fraction of websites with Crawl-delay set it a decade
| ago and promptly forgot about it. No modern crawler uses it for
| anything other than a hint. The primary factors for crawl rate
| are usually site page count and response time.
| abannin wrote:
| Don't fake identity. If the site requires a login, don't fake
| that login. This has legal implications.
| RuedigerVoigt wrote:
| Common CMSes are fairly good at caching and can handle a high
| load, but quite often someone deems a badly programmed extension
| "mission critical". In that case one of your requests might
| trigger dozens of database calls. If multiple sites share a
| database backend, an accidental DoS might bring down a whole
| organization.
|
| If the bot has a distinct IP (or distinct user agent), then a
| good setup can handle this situation automatically. If the
| crawler switches IPs to circumvent a rate limit or for other
| reasons, then it often causes trouble in the form of tickets and
| phone calls to the webmasters. Few care about some gigabytes of
| traffic, but they do care about overtime.
|
| Some react by blocking whole IP ranges. I have seen sites that
| blocked every request from the network of Deutsche Telekom (Tier
| 1 / former state monopoly in Germany) for weeks. So you might
| affect many others on your network.
|
| So:
|
| * Most of the time it does not matter whether you scrape all the
| information you need in minutes or overnight. For crawl jobs I
| try to avoid the times of day when I assume the site gets heavy
| traffic. So I would not crawl restaurant sites at lunch time,
| but 2 a.m. local time should be fine. If the response time goes
| up suddenly at that hour, it may be due to a backup job. Simply
| wait a bit.
|
| * The software you choose has an impact: if you use Selenium or
| headless Chrome, you load images and scripts. If you do not need
| those, analyzing the source (with, for example, Beautiful Soup)
| draws less of the server's resources and might be much faster.
|
| * Keep track of your requests. A specific file might be linked
| from a dozen pages of the site you crawl. Download it just once.
| This can be tricky if a site uses A/B testing for headlines and
| changes the URL.
|
| * If you provide contact information, read your emails. This
| sounds silly, but at my previous job we had problems with a
| friendly crawler with known owners. It tried to crawl our sites
| once a quarter and was blocked each time, because they did not
| react to our friendly requests to change their crawling rate.
|
| Side note: I happen to work on a Python library for a polite
| crawler. It is about a week away from stable (one important bug
| fix and a database schema change for a new feature). In case it
| is helpful: https://github.com/RuedigerVoigt/exoskeleton
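
A minimal sketch pulling together the politeness points above --
the Crawl-delay directive, the 429/Retry-After backoff, and the
"download it just once" rule. This is not code from the thread; it
assumes the requests library, and polite_get is an illustrative
name:

    import time
    import urllib.robotparser

    import requests

    USER_AGENT = "my-research-bot/0.1 (+https://example.org/bot-info)"

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.org/robots.txt")
    robots.read()

    # Fall back to a conservative delay if no Crawl-delay is set.
    delay = robots.crawl_delay(USER_AGENT) or 5.0

    seen = set()  # fetch every URL at most once

    def polite_get(url):
        if url in seen or not robots.can_fetch(USER_AGENT, url):
            return None
        seen.add(url)
        resp = requests.get(url, headers={"User-Agent": USER_AGENT})
        if resp.status_code == 429:
            # Honor Retry-After; it may also be an HTTP date, but
            # plain seconds are assumed here for brevity.
            time.sleep(float(resp.headers.get("Retry-After", 60)))
            return None
        time.sleep(delay)
        return resp
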
| Someone wrote:
| IMO, the best practice is "don't". If you think the data you're
| trying to scrape is freely available, contact the site owner and
| ask them whether dumps are available.
|
| Certainly, if your goal is "learning in data science", and thus
| not tied to a specific subject, there are enough open datasets to
| work with, for example from https://data.europa.eu/euodp/en/home
| or https://www.data.gov/
| aspyct wrote:
| I'm a lot more motivated to do data science on topics I
| actually care about :) Unfortunately those topics (or websites,
| in this case) don't expose ready-made databases or CSV files.
| pxtail wrote:
| Where does this _'best practice is "don't"'_ idea come from? I've
| seen it a couple of times when the topic of scraping surfaces. I
| think it is a kind of hypocrisy, and actually acting against
| one's own good and even the good of the internet as a whole,
| because it artificially limits who can do what.
|
| Why are there entities which are allowed to scrape the web
| however they want (and who got into their position because of
| scraping the web), while the regular Joe is discouraged from
| doing so?
| xzel wrote:
| This might be overboard for most projects, but here is what I
| recently did. There is a website I use heavily that provides
| sales data for a specific type of product. I actually e-mailed
| them to make sure this was allowed, because they took down their
| public API a few years ago. They said yes, everything that is on
| the website is fair game, and I could even do it on my main
| account. It was actually a surprisingly nice response.
| sys_64738 wrote:
| Ethical web scraping? Is that even a thing?
| RhodesianHunter wrote:
| How do you think Google provides search results?
| sys_64738 wrote:
| You're claiming Google is ethical? Bit of a stretch.
| haddr wrote:
| Some time ago I wrote an answer on Stack Overflow:
| https://stackoverflow.com/questions/38947884/changing-proxy-...
|
| Maybe that can help.
| johnnylambada wrote:
| You should probably just paste your answer here if it's that
| good.
| ok_coo wrote:
| I work with a scientific institution and it's still amazing to me
| that people don't check or ask if there are downloadable full
| datasets that anyone can have for free. They just jump right in
| to scraping websites.
|
| I don't know what kind of data you're looking for, but please
| verify that there isn't a quicker/easier way of getting the data
| than scraping first.
| jakelazaroff wrote:
| I think your main obligation is not to the entity from which
| you're scraping the data, but to the people whom the data is
| about.
|
| For example, the recent case between LinkedIn and hiQ centered on
| the latter not respecting the former's terms of service.
But even
| if they had followed those to a T, what hiQ is doing -- scraping
| people's profiles and snitching to their employers when it looks
| like they are job hunting -- is incredibly unethical.
|
| Invert power structures. Think about how the information you
| scrape could be misused. Allow people to opt out.
| aspyct wrote:
| That's a fair point indeed. I don't think I will ever expose
| non-anonymized data, because that's just too sensitive. But if
| I ever do, I'll make sure people are made aware they are
| listed, and that they can opt out easily.
| monkpit wrote:
| I tried to find a source to back up what you're saying about
| hiQ "snitching" to employers about employees searching for
| jobs, but all I can find is vague documentation about the
| hiQ v. LinkedIn lawsuit.
|
| Do you have a link to an article or something?
| jakelazaroff wrote:
| Sure, it's mentioned in the EFF article about the lawsuit:
| https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...
|
| _> HiQ Labs' business model involves scraping publicly
| available LinkedIn data to create corporate analytics tools
| that could determine when employees might leave for another
| company, or what trainings companies should invest in for
| their employees._
| lkjdsklf wrote:
| It's their actual product, Keeper.
|
| > Keeper is the first HCM tool to offer predictive attrition
| insights about an organization's employees based on publicly
| available data.
| sudoaza wrote:
| Those three are the main ones. Sharing the data at the end could
| also be a way to avoid future scraping.
| mrkramer wrote:
| That's an interesting proposition. For example, there is Google
| Dataset Search, where you can "locate online data that is freely
| available for use".
| aspyct wrote:
| Didn't know about that search engine. Thanks a lot! Actually
| found a few fun datasets, made my day :)
| rectang wrote:
| In addition to the steps you're already taking, and the ethical
| suggestions from other commenters, I suggest that you acquaint
| yourself thoroughly with intellectual property (IP) law. If you
| eventually decide to publish anything based on what you learn,
| copyright and possibly trademark law will come into play.
|
| Knowing early on what rights you have to use the material you're
| scraping could guide you towards seeking out alternative sources
| in some cases, sparing you trouble down the line.
| aspyct wrote:
| That's a good point! So far I'm not planning on publicly
| disclosing any of my results, but that may come, I guess.
| yjftsjthsd-h wrote:
| I'm curious how this would be an issue; factual information
| isn't copyrightable, and most of the obvious things that I can
| think to do with a scraper amount to pulling factual
| information in bulk. Even if it's information like "this is
| the average price for this item across 13 different stores".
| (Although I'm not a lawyer and only pay attention to American
| law, so take all of this with the appropriate amount of salt.)
| rectang wrote:
| How much can you quote from a crawled document? Can you
| republish the entire crawl? What can you do under "fair use"
| of copyrighted material, and what can't you do? Can you
| articulate a solid defense that your publication truly
| contains only pure factual information? Will BigCo dislike
| having its name associated with the study, and can you protect
| yourself by limiting yourself to "nominative use" of its
| trademarks? What is the practical risk of someone raising a
| stink if the legality of your usage is ambiguous?
Who
| actually holds copyright on the crawled documents?
|
| You have a lot of rights and you can do a lot. Understanding
| those rights and where they end lets you do _more_, and with
| confidence.
| elorant wrote:
| My policy on scraping is to never use asynchronous methods. I've
| seen a lot of small e-commerce sites that can't really handle
| the load; even a few hundred requests per second can crash the
| server. So even if it takes me longer to scrape a site, I prefer
| not to cause any real harm as long as I can avoid it.
| moooo99 wrote:
| The rules you named are ones I personally followed. One other
| extremely important thing is privacy when you want to crawl
| personal data, as on social networks. I personally avoid
| crawling data that inexperienced users might accidentally
| expose, like email addresses, phone numbers or their friends
| list. A good rule of thumb for social networks, for me, has
| always been to only scrape the data that is visible when my bot
| is not logged in (this also helps to not break the provider's
| ToS).
|
| The most elegant way would be to ask the site provider if they
| allow scraping their website and which rules you should obey. I
| was surprised how open some providers were, but some don't even
| bother replying. If they don't reply, apply the rules you set
| and follow the obvious ones, like not overloading their service.
| aspyct wrote:
| I tried the elegant way before, after creating a mobile
| application to find fuel pumps around the country for a
| specific brand. My request was greeted with a "don't publish;
| we're busy making one; we'll sue you anyway". I guess where I'm
| from, people don't share their data yet...
|
| Totally agree with the point on accidental personal data,
| thanks for pointing that out!
|
| PS: they never released their app...
| [deleted]
| montroser wrote:
| Nice of you to ask this question and to think about how to be as
| considerate as you can.
|
| Some other thoughts:
|
| - Find the most minimal, least expensive (for you and them both)
| way to get the data you're looking for. Sometimes you can
| iterate through search results pages and get all you need from
| there in bulk, rather than iterating through detail pages one at
| a time.
|
| - Even if they don't have an official/documented API, they may
| very likely have internal JSON routes or RSS feeds that you can
| consume directly, which may be easier for them to accommodate.
|
| - Pay attention to response times. If you get your results back
| in 50ms, it probably was trivially easy for them, and you can
| request a bunch without troubling them too much. On the other
| hand, if responses are taking 5s to come back, then be gentle.
| If you are using internal undocumented APIs, you may find that
| you get faster/cheaper cached results if you stick to the same
| sets of parameters the site uses on its own (e.g., when the
| site's front end makes AJAX calls).
| [deleted]
| aspyct wrote:
| That's great advice! Especially the one about response times. I
| didn't think of that, and will integrate it in my sleep timer
| :)
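
A sketch of what that sleep timer could look like, scaling the
pause to the server's response time as montroser suggests. This is
not from the thread; the minimum delay and multiplier are arbitrary
illustrative choices, and the requests library is assumed:

    import time

    import requests

    def gentle_get(url, min_delay=1.0, factor=2.0):
        resp = requests.get(url, headers={"User-Agent": "my-bot/0.1"})
        # A fast answer (e.g. 50ms) was probably cheap for the
        # server; a slow one (e.g. 5s) suggests it is working hard,
        # so back off proportionally.
        elapsed = resp.elapsed.total_seconds()
        time.sleep(max(min_delay, elapsed * factor))
        return resp
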
| snidane wrote:
| When scraping, just behave so as not to piss off the site owner
| -- whatever that means. E.g. don't cause excessive load, and
| make sure you don't leak out sensitive data.
|
| Next, put yourself in their shoes and realize they usually don't
| monitor their traffic that much, or simply don't care as long as
| you don't slow down their site. It's usually only certain big
| sites with heavy bot traffic, such as LinkedIn or sneaker shoe
| sites, which implement bot protections. Most others don't care.
|
| Some websites are created almost as if they want to be scraped.
| The JSON API used by the front end is ridiculously clean and
| accessible. Perhaps they benefit when people see their results
| and invest in their stock. You never fully know whether a site
| wants to be scraped or not.
|
| The reality of the scraping industry, as it relates to your
| question, is this:
|
| 1. Scraping companies generally don't use an honest user agent
| such as "my friendly data science bot"; they hide behind a set
| of fake ones and/or route the traffic through a proxy network.
| You don't want to get banned so stupidly easily by revealing
| your user agent when you know your competitors don't reveal
| theirs.
|
| 2. This one is obvious. The general rule is to scrape over a
| long time period, continuously, and add large delays of at least
| 1 second between requests. If you go below 1 second, be careful.
|
| 3. robots.txt is controversial and doesn't serve its original
| purpose. It should be renamed to google_instructions.txt,
| because site owners use it to guide googlebot in navigating
| their site. It is generally ignored by the industry, again
| because you know your competitors ignore it.
|
| Just remember the rule of not pissing off the site owner, and
| then go ahead and scrape. Also keep in mind that you are in a
| free country, and we don't discriminate here, whether for racial
| or gender reasons or by whether you are a biological or
| mechanical website visitor.
|
| I have simply described the reality of the data science industry
| around scraping, after several years of being in it. Note that
| this will probably not be liked by the HN audience, as they are
| mostly website devs and site owners.
| hutzlibu wrote:
| "or making sure you don't leak out sensitive data"
|
| If sensitive data can be scraped, it is not really being stored
| securely in the first place. So I would not care too much about
| it, and would just notify the owner if I noticed it.
| HenryBemis wrote:
| Keep in mind that if you end up with data that is protected
| under the GDPR, merely having it puts you in a damning
| position. The intended owner will be fried for not protecting
| it adequately, but you are also violating the GDPR, since the
| people in the data never agreed to you collecting, processing,
| etc. And imagine the world of pain if you are caught with
| children's data.
| aspyct wrote:
| Well, having a few websites of my own, I really do think that
| point 1 is the worst. I can't filter bots that disguise
| themselves as users out of my access logs, and they actually
| hurt my work (i.e. figuring out what people read).
|
| Totally agree with the rest though. Maybe adapt the "large
| delay" of 1 second to the kind of website I'm scraping, though.
|
| Thanks for your feedback!
| the8472 wrote:
| > I can't filter bots that disguise themselves as users out of
| my access logs, and they actually hurt my work (i.e. figuring
| out what people read).
|
| If the bots aren't querying from residential IPs, you could
| match their IPs to ASNs and then filter based on that to
| separate domestic and data center origins.
| aspyct wrote:
| Ha, that's a good idea! Is there a list somewhere of the
| CIDR blocks that are assigned to residential use vs. server
| farms? I mean, how can I tell an IP is residential?
| the8472 wrote:
| The other way around may be easier, i.e. excluding known
| datacenter ranges. There are some commercial databases for
| that; I'm not sure if there are any free ones. But you can
| also do this manually by running a whois on an IP, then
| extracting the ranges from the whois response and caching
| them. Then you can look at the OrgName or something like
| that. You can also download the whois databases from the
| RIRs, but they don't contain the information about what kind
| of entities they are.
|
|     $ dig +short reddit.com
|     151.101.1.140
|     $ whois 151.101.1.140
|     NetRange:       151.101.0.0 - 151.101.255.255
|     CIDR:           151.101.0.0/16
|     OrgName:        Fastly
|     [...]
|
| So if you see a known hoster here, then you can exclude it
| from your statistics.
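
A minimal sketch of that manual approach, shelling out to the
whois command-line tool and caching the results. The field name
varies by registry (RIPE uses "org-name"/"netname" rather than
"OrgName"), and the hoster list is purely illustrative:

    import re
    import subprocess
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def whois_org(ip):
        # Requires the whois CLI; lookups are slow and often
        # rate-limited, hence the cache.
        out = subprocess.run(["whois", ip], capture_output=True,
                             text=True).stdout
        m = re.search(r"^(?:OrgName|org-name|netname):\s*(.+)$",
                      out, re.MULTILINE | re.IGNORECASE)
        return m.group(1).strip() if m else "unknown"

    KNOWN_HOSTERS = {"Fastly", "Amazon.com, Inc.", "Google LLC"}

    def is_datacenter(ip):
        return whois_org(ip) in KNOWN_HOSTERS
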
| capableweb wrote:
| What I've done in the past is to pull down all the IPs of the
| requests I see, filter them for uniqueness, do a whois for each
| one (you're going to need a backoff/rate limit here, as whois
| services are usually rate-limited) and save the organization
| name, ASN and CIDR blocks, again filtering for uniqueness, then
| create a new list with the organizations of interest and match
| it with the CIDR blocks. Now you have an allow/blocklist you can
| use.
| codingdave wrote:
| You are correct that I don't like this advice... not because I
| find it to be wrong, but because you are approaching it solely
| from a competitive perspective -- "Your competitors don't have
| ethics, so you shouldn't either." That doesn't help someone who
| is engaging in research and trying to hold themselves to a
| higher standard.
| lordgrenville wrote:
| I'm neither a web dev nor a site owner, but OP literally asked
| for tips on _ethical_ web scraping, not "what's the most I can
| get away with".
| wizzwizz4 wrote:
| 1 is the only one I don't like. I think you should use your
| real user agent first on any given site, as a courtesy; whether
| you give up or change to a more "normal" user agent if you get
| banned is up to you.
|
| Oh, and for 3: if you can, apply some heuristics to your
| reading of the robots.txt. If it's just "deny everything", then
| ignore it, but you really don't want to be responsible for
| crawling all of the GET /delete/:id pages of a badly designed
| site... (those should definitely be POST, and authenticated, by
| the way).
| mpclark wrote:
| Also, if a target site is behind Cloudflare then you probably
| won't be able to masquerade as any of the popular bots -- they
| block fake google/yandex bots.
| gilad wrote:
| As for delete, use authenticated DELETE, not POST; it's why
| it's there in the first place.
| chatmasta wrote:
| I disagree. The risks are similar to those of disclosing a
| security vulnerability to a company without a bug bounty. You
| cannot know how litigious or technically illiterate the
| company will be. What if they decide you're "hacking" them
| and call the FBI with the helpful information you included in
| your user agent? Crazier things have happened.
|
| Anonymity is part of the right to privacy; IMO, such a right
| should extend to bots as well. There should be no shame in
| anonymously accessing a website, whether via automated means
| or otherwise.
| a1369209993 wrote:
| > such a right should extend to bots as well
|
| No, it very much shouldn't, but (as you probably meant) it
| _should_ extend to the _person_ (not, e.g., a company) _using_
| a bot, which amounts to the same thing in this case.
| erdos4d wrote:
| Perhaps I am behind the curve here, but why would sneaker shoe
| sites get scraped hard?
| abannin wrote:
| There is a very active secondary market for sneakers. If you
| can buy before supply is exhausted, you can make some decent
| money.
| [deleted]
| mettamage wrote:
| Indirectly related: if you have some time to spare, follow
| Harvard's course in ethics! [1]
|
| Here is why: while it didn't teach me anything new (in a sense),
| it did give me a vocabulary to better articulate myself. Having
| new words to describe certain ideas means you have more
| analytical tools at your disposal. So you'll be able to examine
| your own ethical stance better.
|
| It takes some time, but instead of watching Netflix (if that's a
| thing you do), watch this instead! Although The Good Place is a
| pretty good Netflix show, sprinkling some basic ethics in there.
|
| [1] https://www.youtube.com/watch?v=kBdfcR-8hEY
| lapnitnelav wrote:
| Thanks for sharing that Harvard course.
|
| The cost-benefit analysis part reminds me a lot of some of the
| comments you see here (and elsewhere) with regards to Covid-19
| and the economic shutdown of societies. Quite timely.
| aspyct wrote:
| Great recommendations, thanks!
| aspyct wrote:
| I must insist. This course is great! Thanks :)
| JackC wrote:
| In some cases, especially during development, local caching of
| responses can help reduce load. You can write a little wrapper
| that tries to return URL contents from a local cache and then
| falls back to a live request.
| sairamkunala wrote:
| Simple:
|
| Respect robots.txt.
|
| Find your data from sitemaps, and ensure you query at a slow
| rate. robots.txt has a cool-off period (Crawl-delay). See
| https://en.wikipedia.org/wiki/Robots_exclusion_standard#Craw...
|
| Example: https://www.google.com/robots.txt
| aspyct wrote:
| Yeah, that's a must-do, but I think most websites don't even
| bother making a robots.txt beyond "please index us, Google".
| However, that wouldn't necessarily mean they're happy about
| someone vacuuming up their whole website in a few days.
| brainzap wrote:
| Ask for permission and have nice timeouts/retries.
| jll29 wrote:
| The only sound advice one can give is that there are two
| elements to consider:
|
| 1) Ethics is different from law.
| 1.1) The ethical way: respect the robots.txt protocol.
| 2) Consult a lawyer.
| 2.1) Prior written consent, they will say, prevents you from
| being sued, and not much else.
| tdy721 wrote:
| Schema.org is a nice resource. If you can find that metadata on
| a site, you can be just a little more sure they don't mind
| getting that data scraped. It's the instruction book for
| teaching Google and other crawlers extra information and
| context. Your scraper would be wise to parse this extra meta
| information.
| mapgrep wrote:
| I always add an "Accept-Encoding" header to my request to
| indicate I will accept a gzip response (or deflate if
| available). Your HTTP library (in whatever language your bot is
| written) probably supports this with a near-trivial amount of
| additional code, if any. Meanwhile you are saving the target
| site some bandwidth.
|
| Look into the If-Modified-Since and If-None-Match/ETag headers
| as well if you are querying resources that support them (RSS
| feeds, for example, commonly support these, as do static
| resources). They prevent the target site from having to send
| anything other than a 304, saving bandwidth and possibly
| compute.
| adrianhel wrote:
| I like this approach. Personally, I wait an hour if I get an
| invalid response, and use timeouts of a few seconds between
| other requests.
| tedivm wrote:
| I've gone through this process twice: once about six months
| ago, and once just this week.
|
| In the first case the content wasn't clearly licensed and the
| site was somewhat small, so I didn't want to break it. I emailed
| them and they gave us permission, but only if we crawled one
| page per ten seconds. Took us a weekend, but we got all the data
| and did so in a way that respected their site.
|
| The second one was this last week and was part of a personal
| project. All of the content was under an open license (Creative
| Commons), and the site was hosted on a platform that can take a
| ton of traffic. For this one I made sure we weren't hitting it
| too hard (Scrapy has some great autothrottle options), but
| otherwise didn't worry about it too much.
|
| Since the second project is personal, I open sourced the crawler
| if you're curious: https://github.com/tedivm/scp_crawler
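
Putting JackC's caching wrapper together with the conditional
request headers mapgrep describes above, a sketch might look like
this (assuming the requests library; the in-memory dict stands in
for a real on-disk cache):

    import requests

    cache = {}  # url -> (etag, last_modified, body)

    def cached_get(url, ua="my-bot/0.1"):
        headers = {"User-Agent": ua,
                   "Accept-Encoding": "gzip, deflate"}
        if url in cache:
            etag, last_mod, _ = cache[url]
            if etag:
                headers["If-None-Match"] = etag
            if last_mod:
                headers["If-Modified-Since"] = last_mod
        resp = requests.get(url, headers=headers)
        if resp.status_code == 304:
            # Not modified: serve the local copy, which costs the
            # site almost nothing.
            return cache[url][2]
        cache[url] = (resp.headers.get("ETag"),
                      resp.headers.get("Last-Modified"),
                      resp.text)
        return resp.text
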
| coderholic wrote:
| Another option is to not scrape at all, and use an existing
| dataset. Common Crawl is one good example, and HTTP Archive is
| another.
|
| If you just want metadata from the homepages of all domains, we
| scrape that every month at https://host.io and make the data
| available over our API: https://host.io/docs
| tyingq wrote:
| Be careful about making the data you've scraped visible to
| Google's search engine scrapers.
|
| That's often how site owners get riled up. They search for some
| unique phrase on Google, and your site shows up in the search
| results.
| lazyjones wrote:
| It's incredibly ironic that one has to avoid doing what Google
| does in order to be kept in their index.
| MarcellusDrum wrote:
| This isn't really an "ethical" practice; it's more "how to hide
| that you are scraping data". If you have to hide the fact that
| you are scraping their data, maybe you shouldn't be doing it in
| the first place.
| tyingq wrote:
| Depends. Maybe, for example, you're doing some competitive
| price analysis and never plan on exposing scraped things like
| product descriptions... you only plan to use those internally
| to confirm you're comparing like products. But then you expose
| them accidentally. Avoid that.
| throwaway777555 wrote:
| The suggestions in the comments are excellent. One thing I would
| add is this: contact the site owner in advance and ask for their
| permission. If they are okay with it, or if you don't hear back,
| credit the site in your work. Then send the owner a message with
| where they can see the information being used.
|
| Some sites will have rules or guidelines for attribution already
| in place. For example, the DMOZ had a Required Attribution page
| to explain how to credit them:
| https://dmoz-odp.org/docs/en/license.html. Discogs mentions that
| use of their data also falls under CC0:
| https://data.discogs.com/. Other sites may have these details in
| their Terms of Service, About page, or similar.
| avip wrote:
| Contact the site owner, tell them who you are and what you're
| doing, and ask about a data dump or API.
| pfarrell wrote:
| It won't help you learn to write a scraper, but using the Common
| Crawl dataset will get you access to a crazy amount of data
| without paying to acquire it yourself.
|
| https://commoncrawl.org/the-data/
| aspyct wrote:
| Cool, didn't know about this. Thanks!
| Reelin wrote:
| > As part of my learning in data science, I need/want to
| gather data.
|
| Also not web scraping, but a few other public dataset sources
| to check:
|
| https://registry.opendata.aws
|
| https://github.com/awesomedata/awesome-public-datasets
| aspyct wrote:
| Thanks!
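
As a closing illustration of the "don't scrape, reuse an existing
crawl" suggestions above: Common Crawl exposes a URL index that can
be queried before ever touching a live site. A sketch follows; the
collection name (CC-MAIN-2020-16) was current around the time of
this thread and changes with each monthly crawl:

    import json

    import requests

    INDEX = "https://index.commoncrawl.org/CC-MAIN-2020-16-index"

    def cc_lookup(url_pattern):
        # One JSON record per captured page matching the pattern
        # (timestamp, url, status, WARC location, ...).
        resp = requests.get(INDEX, params={"url": url_pattern,
                                           "output": "json"})
        return [json.loads(line) for line in resp.text.splitlines()
                if line]

    for record in cc_lookup("example.com/*")[:5]:
        print(record["timestamp"], record["url"], record["status"])
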
___________________________________________________________________ (page generated 2020-04-04 23:00 UTC)