[HN Gopher] Web scraping with Python open knowledge
       ___________________________________________________________________
        
       Web scraping with Python open knowledge
        
       Author : PigiVinci83
       Score  : 114 points
       Date   : 2022-05-27 16:45 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | account-5 wrote:
       | I was reading another thread about webscraping, someone mentioned
       | CSS selectors being way quicker than xpath. I'm easy either way
       | but apart from a more powerful syntax what other benefits are
       | there?
        
         | PigiVinci83 wrote:
         | With a large codebase like ours, we find XPath more readable,
         | but I understand that's a personal preference. We don't do
         | high-frequency scraping, so we never considered CSS vs. XPath
         | performance. It's an interesting point I'd like to write more
         | about, thanks for sharing.
        
         | showerst wrote:
         | CSS is nice because it's more readable than XPATH for longer
         | queries, and is friendlier to newer programmers who didn't come
         | up when XML was big.
         | 
         | XPATH is generally more powerful for really gnarly things and
         | for backtracking. "Show me the 3rd paragraph that's a sibling
         | of the fourth div id="subhed" and contains the text "starting"."
        
           | dotancohen wrote:
           | > XPATH is generally more powerful...
           | 
           | That is a convincing argument if you can back it up with an
           | XPATH expression.
        
             | mdaniel wrote:
             | Well, the rest of their sentence summed it up pretty well;
             | try and implement that example using CSS selectors
             | 
             | Hell, even "find id=subhead and _go up one element_" isn't
             | possible in CSS because that's not a problem it was
             | designed to solve
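
A minimal sketch of the "find the id and go up one element" case, using only the XPath subset in Python's standard-library `xml.etree.ElementTree` rather than Scrapy or parsel. The sample markup and the `parent_of_id` helper are made up for illustration, and ElementTree only handles well-formed markup, so treat this as a demonstration of the `..` step and not as a scraping recipe.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ElementTree parses XML, not messy HTML).
SAMPLE = """
<html>
  <body>
    <section class="article">
      <h2 id="subhead">Pricing</h2>
      <p>Starting at 10 EUR</p>
    </section>
  </body>
</html>
"""


def parent_of_id(root, element_id):
    """Return the parent of the first element with the given id, or None."""
    # ".." steps from the matched element up to its parent.
    return root.find(f".//*[@id='{element_id}']/..")


root = ET.fromstring(SAMPLE)
parent = parent_of_id(root, "subhead")
print(parent.tag, parent.get("class"))
```

With a full XPath engine the same idea is usually written as `//*[@id='subhead']/..`.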
        
         | mdaniel wrote:
         | In my experience, it's not that CSS selectors are "more
         | powerful," but rather "more legible." XPath is for sure more
         | powerful, but also usually lower signal to noise ratio
         | response.css("#the-id")         # vs
         | response.xpath("//*[@id='the-id']")
         | 
         | Thankfully, Scrapy (well, pedantically "parsel") allows mixing
         | and matching, using the one which makes the most sense
         | response.css(".someClass").xpath(".//*[starts-with(text(), 'Price')]")
        
       | PigiVinci83 wrote:
       | A work-in-progress guide about web scraping in Python, anti-bot
       | software and techniques, and so on. Please feel free to share and
       | contribute your own experience too.
        
         | alexchamberlain wrote:
         | The tab formatting seems like an odd (and rather unPythonic)
         | addition. What's the intention there?
        
           | datalopers wrote:
           | Why is whitespace even discussed in a tutorial about web
           | scraping? I think it speaks to how amateur this documentation
           | is, but hey it's a tutorial on web scraping in python, which
           | is a HN crowd favorite and guaranteed to hit the front page
           | [1][2].
           | 
           | [1] 3 days ago: https://news.ycombinator.com/item?id=31500007
           | 
           | [2] 12 days ago:
           | https://news.ycombinator.com/item?id=31387248
        
             | PigiVinci83 wrote:
             | Because it's not a tutorial on web scraping but a mix of
             | what we recommend internally and what we've learnt from
             | our experience in this field over the years. For our
             | codebase we prefer tabs instead of spaces, but I understand
             | it's a subject for debates that last decades :) But thanks
             | for the point, I'll rephrase the topic in the guide.
        
               | civilized wrote:
               | It's not a best practice, it's just a random thing your
               | team does. It does make the team sound amateurish if it
               | can't distinguish between meaningful best practices and
               | just conventions the team happens to have.
        
               | datalopers wrote:
               | It's odd to me that your apparent revenue stream is from
               | scraping difficult-to-scrape sites and you're
               | broadcasting the exact tactics you use to bypass anti-
               | scraping systems. You're making your own life difficult
               | by giving Cloudflare/PerimeterX/etc the information
               | necessary to improve their tooling.
               | 
               | You also seem to advertise many of the sites/datasets
               | you're scraping, which opens you up to litigation.
               | Especially if they're employing anti-scraping tooling and
               | you're brazenly bypassing those. It doesn't matter that
               | it's legal in most jurisdictions of the world, you'll
               | still have to handle cease and desists or potential
               | lawsuits, which is a major cost and distraction.
        
               | punnerud wrote:
               | << You also seem to advertise many of the sites/datasets
               | you're scraping, which opens you up to litigation.>>
               | 
               | Isn't that settled now after the "LinkedIn vs HiQ" case:
               | public information is only covered by copyright, but you
               | can use the by-product as you see fit for a new business?
        
               | datalopers wrote:
               | The only clear outcome from the LinkedIn case, afaik, is
               | that scraping of publicly accessible data is not a
               | federal crime under CFAA [1]. There are still plenty of
               | other civil ways that someone can sue you to stop
               | scraping their site: breach of contract, trespass to
               | chattels, trademark infringement, etc. And they can do so
               | over and over again til you're broke. OP is based in
               | Italy anyway so I have absolutely no clue what does and
               | doesn't apply.
               | 
               | I'd like to point out that, while HiQ Labs "won" the
               | case, that company is basically dead. The CEO and CTO are
               | both working for other companies now. So I think the
               | bigger takeaway is: don't get yourself sued while you're
               | a tiny little startup.
               | 
               | [1] https://www.natlawreview.com/article/hiq-labs-v-
               | linkedin
        
           | hrbf wrote:
           | Same here. It feels out of place, unnecessary and its
           | rationalization unconvincing. Considering Python, an outright
           | weird suggestion.
        
             | wheelerof4te wrote:
             | Python allows indenting using tabs, so I don't understand
             | why it's a weird decision.
             | 
             | In fact, they even stated their reasoning in the document.
             | I don't see why anyone has to blindly follow PEP8 nor do I
             | get why 4 spaces indent has to be considered a standard.
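
As a quick check of the claim that Python accepts tab indentation, the sketch below compiles two small snippets: one indented with tabs only, and one that uses a tab on one line and spaces on the next within the same block. The snippets and the `compiles` helper are illustrative only.

```python
def compiles(source):
    """Return True if the source compiles, False on a syntax or indentation error."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:  # IndentationError and TabError are subclasses
        return False


# Indented consistently with tabs: accepted by Python 3.
TABS_ONLY = "def f():\n\tx = 1\n\treturn x\n"

# Same block, but a tab on one line and eight spaces on the next:
# Python 3 rejects this as inconsistent indentation (TabError).
MIXED = "def f():\n\tx = 1\n        return x\n"

print(compiles(TABS_ONLY), compiles(MIXED))
```

So tabs are fine as long as a file does not mix them with spaces in a way that makes the indentation ambiguous.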
        
         | etskinner wrote:
         | On the page about canvas fingerprinting[0], it only mentions
         | Cloudflare. From what I can tell, reCaptcha v3 also uses canvas
         | fingerprinting [1]
         | 
         | [0] https://github.com/reanalytics-databoutique/webscraping-
         | open...
         | 
         | [1] https://brianwjoe.com/2019/02/06/how-does-
         | recaptcha-v3-work/
        
           | PigiVinci83 wrote:
           | Thanks for sharing, I'll update the page soon.
        
         | jamestimmins wrote:
         | I appreciate the inclusion of anti-bot software. As someone who
         | builds plugins for enterprise apps (currently Airtable), I
         | really want to build automated tests for my apps with Selenium,
         | but keep getting foiled by anti-bot measures.
         | 
         | Can anyone recommend other resources for understanding anti-bot
         | tech and their workarounds?
        
       | captn3m0 wrote:
       | Good list, confused about the "tabs weighing less" bit. Isn't
       | that a preference left for the end-devs?
       | 
       | Another tip I've found is to check if the data is accessible on a
       | mobile app and proxy it to see if there is a JSON API available.
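
To illustrate why a JSON API found behind a mobile app is attractive, here is a sketch that reads product fields from a captured response body instead of parsing HTML. The payload shape, the field names, and the `extract_products` helper are hypothetical; a real API observed through a proxy will look different.

```python
import json

# Hypothetical response body, as it might be captured with an intercepting proxy.
CAPTURED_BODY = """
{
  "products": [
    {"id": 101, "name": "Leather bag", "price": {"amount": 249.0, "currency": "EUR"}},
    {"id": 102, "name": "Silk scarf", "price": {"amount": 89.5, "currency": "EUR"}}
  ],
  "next_page": null
}
"""


def extract_products(body):
    """Flatten the assumed payload into (id, name, amount, currency) tuples."""
    payload = json.loads(body)
    rows = []
    for item in payload.get("products", []):
        price = item.get("price", {})
        rows.append(
            (item["id"], item["name"], price.get("amount"), price.get("currency"))
        )
    return rows


for row in extract_products(CAPTURED_BODY):
    print(row)
```

The fields arrive already structured, so there are no selectors to break when the page layout changes.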
        
         | PigiVinci83 wrote:
         | Thanks for your reply, mobile data is something I need to add
         | soon. Usually we use Fiddler to check whether there's an API
         | behind the site, but only for really problematic websites.
        
       | [deleted]
        
       | Xeoncross wrote:
       | Plug for https://commoncrawl.org/ if you need billions of pages
       | but don't want to deal with scraping the web yourself.
        
         | [deleted]
        
         | squiggy22 wrote:
         | Are there subsets of Common Crawl anywhere for individual
         | sites, e.g. YouTube?
        
           | magundu wrote:
           | You can query a subset for specific sites from Common Crawl
           | itself.
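
One way to do this is through Common Crawl's URL index, which can be queried per domain. The sketch below only builds the query URL and parses newline-delimited JSON of the kind the index returns; it makes no network call. The crawl ID `CC-MAIN-2022-21`, the sample record, and both helper names are assumptions for illustration, so check the index site for current crawl IDs and the exact fields returned.

```python
import json
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id, url_pattern):
    """Build an index query URL for one crawl and one URL pattern."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_index_lines(text):
    """Parse newline-delimited JSON records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Assumed crawl ID, and a made-up record in the shape the index is expected to return.
query = build_index_query("CC-MAIN-2022-21", "example.com/*")
sample = (
    '{"url": "https://example.com/", "status": "200", '
    '"filename": "crawl-data/sample.warc.gz", "offset": "0", "length": "1234"}\n'
)
records = parse_index_lines(sample)
print(query)
print(records[0]["url"], records[0]["status"])
```

Each record points at a byte range inside an archive file, so the pages for one site can be fetched without downloading a whole crawl.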
        
         | mdaniel wrote:
         | It would thrill me if common crawl were updated with such
         | frequency that it would allow new search engines to enter the
         | market
         | 
         | I haven't dug into it enough to know if there's some technical
         | reason it's not currently the case, or just lack of
         | (interest|willpower)
        
           | input_sh wrote:
           | I'd argue that one broad crawl every 2-3 months in addition
           | to their updated-daily news crawl[0] should be good enough to
           | make a rudimentary search engine.
           | 
           | [0] https://data.commoncrawl.org/crawl-data/CC-
           | NEWS/index.html
        
             | mdaniel wrote:
             | You're right about the "rudimentary" part, because I don't
             | know how they do it but the major players have some not-
             | kidding-around freshness:
             | 
             | https://www.google.com/search?hl=en&q=%22thrill%20me%22%20%
             | 2...
             | 
             | https://www.bing.com/search?q=%22thrill+me%22+%22common+cra
             | w... _(and DDG similarly, because bing)_
             | 
             | ed: I was curious if maybe HN publishes a sitemap, and it
             | seems no. Then again, hnreplies knows about the HN API so
             | maybe it's special-cased by the big crawlers
             | https://github.com/ggerganov/hnreplies#hnreplies
        
       | afandian wrote:
       | If you're the kind of person who wants "open data" (read as
       | broadly as you like) and could get it in snapshots direct from
       | the source without having to scrape, what would your ideal format
       | be?
       | 
       | I know it's a very open ended question.
        
         | PigiVinci83 wrote:
         | Thanks for the question. I can speak from what we've
         | encountered in these years of web scraping: nothing beats an
         | API and JSON, but I'm sure there are formats even more
         | friendly to read.
        
         | jshen wrote:
         | Probably RDF serialized as hextuples
         | https://github.com/ontola/hextuples
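
As a rough illustration of the format: per the linked project's description, HexTuples is newline-delimited JSON in which each line is an array of six strings (subject, predicate, value, datatype, language, graph). The sketch below parses such lines with the standard library. The `HexTuple` type, the `parse_hextuples` helper, and the sample data are illustrative and not part of any official library.

```python
import json
from typing import NamedTuple


class HexTuple(NamedTuple):
    subject: str
    predicate: str
    value: str
    datatype: str
    language: str
    graph: str


def parse_hextuples(text):
    """Parse NDJSON where every non-blank line is a six-element JSON array."""
    tuples = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = json.loads(line)
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}")
        tuples.append(HexTuple(*fields))
    return tuples


# Made-up statements; the empty last field leaves the graph unspecified.
SAMPLE = (
    '["https://example.com/alice", "http://schema.org/name", "Alice", '
    '"http://www.w3.org/2001/XMLSchema#string", "", ""]\n'
    '["https://example.com/alice", "http://schema.org/description", "Engineer", '
    '"http://www.w3.org/1999/02/22-rdf-syntax-ns#langString", "en", ""]\n'
)

for t in parse_hextuples(SAMPLE):
    print(t.subject, t.predicate, t.value)
```

Because every line is independent, a large snapshot can be streamed and processed one statement at a time.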
        
           | afandian wrote:
           | Looks interesting. From that page I couldn't see what 'graph'
           | field relates to. Is it the identifier for a distinct named
           | graph? It was blank in the examples.
           | 
           | Do you use it? What for?
        
       | sgtquack wrote:
       | As someone who recently dealt with scraping sites behind
       | cloudflare...I never want to scrape again
        
       | jonatron wrote:
       | As the second sentence says, it's a cat and mouse game, so
       | there's no incentive on either side of bot vs anti-bot to share
       | information.
        
         | PigiVinci83 wrote:
         | I'm sure no one will add their secret sauce here :)
        
       ___________________________________________________________________
       (page generated 2022-05-27 23:00 UTC)