[HN Gopher] Web scraping with Python open knowledge
___________________________________________________________________
 
Web scraping with Python open knowledge
 
Author : PigiVinci83
Score  : 114 points
Date   : 2022-05-27 16:45 UTC (6 hours ago)
 
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
 
| account-5 wrote:
| I was reading another thread about web scraping; someone mentioned
| CSS selectors being way quicker than XPath. I'm easy either way,
| but apart from a more powerful syntax, what other benefits are
| there?
 
| PigiVinci83 wrote:
| Having a large codebase like ours, we found that XPath is more
| readable, but I understand that's a personal feeling. We don't do
| high-frequency scraping, so the performance of CSS vs XPath was
| not a consideration. It's an interesting point I'd like to write
| more about, thanks for sharing.
 
| showerst wrote:
| CSS is nice because it's more readable than XPath for longer
| queries, and it's friendlier to newer programmers who didn't come
| up when XML was big.
|
| XPath is generally more powerful for really gnarly things and for
| backtracking: "Show me the 3rd paragraph that's a sibling of the
| fourth div id="subhed" and contains the text 'starting'."
 
| dotancohen wrote:
| > XPath is generally more powerful...
|
| That is a convincing argument if you can back it up with an XPath
| expression.
 
| mdaniel wrote:
| Well, the rest of their sentence summed it up pretty well; try to
| implement that example using CSS selectors.
|
| Hell, even "find id=subhed and _go up one element_" isn't
| possible in CSS, because that's not a problem it was designed to
| solve.
 
| mdaniel wrote:
| In my experience, it's not that CSS selectors are "more
| powerful," but rather "more legible."
| XPath is for sure more powerful, but it also usually has a lower
| signal-to-noise ratio:
|     response.css("#the-id")
|     # vs
|     response.xpath("//*[@id='the-id']")
|
| Thankfully, Scrapy (well, pedantically "parsel") allows mixing
| and matching, using whichever makes the most sense:
|     response.css(".someClass").xpath(".//*[starts-with(text(), 'Price')]")
 
| PigiVinci83 wrote:
| A work-in-progress guide about web scraping in Python, anti-bot
| software and techniques, and so on. Please feel free to share,
| and to contribute your own experience too.
 
| alexchamberlain wrote:
| The tab formatting seems like an odd (and rather unPythonic)
| addition. What's the intention there?
 
| datalopers wrote:
| Why is whitespace even discussed in a tutorial about web
| scraping? I think it speaks to how amateur this documentation is,
| but hey, it's a tutorial on web scraping in Python, which is an
| HN crowd favorite and guaranteed to hit the front page [1][2].
|
| [1] 3 days ago: https://news.ycombinator.com/item?id=31500007
|
| [2] 12 days ago: https://news.ycombinator.com/item?id=31387248
 
| PigiVinci83 wrote:
| Because it's not a tutorial on web scraping but a mix of what we
| suggest internally and what we've learnt from our experience in
| this field over the years. For our codebase we prefer tabs
| instead of spaces, but I understand it's a debate that has lasted
| for decades :) Thanks for the point, though; I'll rephrase the
| topic in the guide.
 
| civilized wrote:
| It's not a best practice, it's just a random thing your team
| does. It does make the team sound amateurish if it can't
| distinguish between meaningful best practices and conventions the
| team just happens to have.
 
| datalopers wrote:
| It's odd to me that your apparent revenue stream is from scraping
| difficult-to-scrape sites, and you're broadcasting the exact
| tactics you use to bypass anti-scraping systems.
| You're making your own life difficult by giving
| Cloudflare/PerimeterX/etc the information necessary to improve
| their tooling.
|
| You also seem to advertise many of the sites/datasets you're
| scraping, which opens you up to litigation, especially if they're
| employing anti-scraping tooling and you're brazenly bypassing it.
| It doesn't matter that it's legal in most jurisdictions of the
| world; you'll still have to handle cease-and-desists or potential
| lawsuits, which is a major cost and distraction.
 
| punnerud wrote:
| << You also seem to advertise many of the sites/datasets you're
| scraping, which opens you up to litigation. >>
|
| Is that a settled matter now, after the "LinkedIn vs HiQ" case:
| public information only holds copyright, but you can use the
| by-product as it fits you for a new business?
 
| datalopers wrote:
| The only clear outcome from the LinkedIn case, afaik, is that
| scraping of publicly accessible data is not a federal crime under
| the CFAA [1]. There are still plenty of other civil ways that
| someone can sue you to stop scraping their site: breach of
| contract, trespass to chattels, trademark infringement, etc. And
| they can do so over and over again until you're broke. OP is
| based in Italy anyway, so I have absolutely no clue what does and
| doesn't apply.
|
| I'd like to point out that, while HiQ Labs "won" the case, that
| company is basically dead. The CEO and CTO are both working for
| other companies now. So I think the bigger takeaway is: don't get
| yourself sued while you're a tiny little startup.
|
| [1] https://www.natlawreview.com/article/hiq-labs-v-linkedin
 
| hrbf wrote:
| Same here. It feels out of place and unnecessary, and its
| rationalization is unconvincing. Considering Python, an outright
| weird suggestion.
 
| wheelerof4te wrote:
| Python allows indenting using tabs, so I don't understand why
| it's a weird decision.
|
| In fact, they even stated their reasoning in the document.
| I don't see why anyone has to blindly follow PEP 8, nor do I get
| why a 4-space indent has to be considered a standard.
 
| etskinner wrote:
| On the page about canvas fingerprinting [0], it only mentions
| Cloudflare. From what I can tell, reCAPTCHA v3 also uses canvas
| fingerprinting [1].
|
| [0] https://github.com/reanalytics-databoutique/webscraping-open...
|
| [1] https://brianwjoe.com/2019/02/06/how-does-recaptcha-v3-work/
 
| PigiVinci83 wrote:
| Thanks for sharing, I'll update the page soon.
 
| jamestimmins wrote:
| I appreciate the inclusion of anti-bot software. As someone who
| builds plugins for enterprise apps (currently Airtable), I really
| want to build automated tests for my apps with Selenium, but I
| keep getting foiled by anti-bot measures.
|
| Can anyone recommend other resources for understanding anti-bot
| tech and its workarounds?
 
| captn3m0 wrote:
| Good list, though I'm confused about the "tabs weighing less"
| bit. Isn't that a preference left to the end devs?
|
| Another tip I've found is to check whether the data is accessible
| in a mobile app and proxy it to see if there is a JSON API
| available.
 
| PigiVinci83 wrote:
| Thanks for your reply; mobile data is a thing I need to add soon.
| Usually we check with Fiddler if there's an API inside, but only
| for really problematic websites.
 
| [deleted]
 
| Xeoncross wrote:
| Plug for https://commoncrawl.org/ if you need billions of pages
| but don't want to deal with scraping the web yourself.
 
| [deleted]
 
| squiggy22 wrote:
| Are there subsets of Common Crawl anywhere for individual sites,
| e.g. YouTube?
 
| magundu wrote:
| You can query a subset of specific sites from Common Crawl
| itself.
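[ed: the "query a subset of specific sites" suggestion above can be sketched against Common Crawl's public CDX index API at index.commoncrawl.org. The crawl identifier ("CC-MAIN-2022-21") and the helper names below are illustrative assumptions, not code from the guide under discussion.]

```python
import json
from urllib.parse import urlencode

# Hypothetical helper: build a query URL for the Common Crawl CDX
# index API. Each crawl has its own index endpoint; the list of
# crawl identifiers is published at https://index.commoncrawl.org/.
def cc_index_url(site_pattern, crawl="CC-MAIN-2022-21"):
    params = urlencode({"url": site_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl}-index?{params}"

# The API answers with newline-delimited JSON, one capture per line;
# each record carries the WARC filename, offset, and length needed to
# range-request just that page from the archive.
def parse_cc_records(body):
    return [json.loads(line) for line in body.splitlines() if line.strip()]
```

[Usage would be along the lines of `parse_cc_records(requests.get(cc_index_url("youtube.com/*")).text)`, which returns only the captures for that one site rather than the whole crawl.]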
| mdaniel wrote:
| It would thrill me if Common Crawl were updated with such
| frequency that it would allow new search engines to enter the
| market.
|
| I haven't dug into it enough to know if there's some technical
| reason that's not currently the case, or just a lack of
| (interest|willpower).
 
| input_sh wrote:
| I'd argue that one broad crawl every 2-3 months, in addition to
| their updated-daily news crawl [0], should be good enough to make
| a rudimentary search engine.
|
| [0] https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html
 
| mdaniel wrote:
| You're right about the "rudimentary" part, because I don't know
| how they do it, but the major players have some not-kidding-
| around freshness:
|
| https://www.google.com/search?hl=en&q=%22thrill%20me%22%20%2...
|
| https://www.bing.com/search?q=%22thrill+me%22+%22common+craw...
| _(and DDG similarly, because bing)_
|
| ed: I was curious whether HN publishes a sitemap, and it seems
| not. Then again, hnreplies knows about the HN API, so maybe it's
| special-cased by the big crawlers:
| https://github.com/ggerganov/hnreplies#hnreplies
 
| afandian wrote:
| If you're the kind of person who wants "open data" (read as
| broadly as you like) and could get it in snapshots direct from
| the source without having to scrape, what would your ideal format
| be?
|
| I know it's a very open-ended question.
 
| PigiVinci83 wrote:
| Thanks for the question. I can speak to what we've encountered in
| these years of web scraping, and nothing beats APIs and JSON, but
| I'm sure there are formats even more friendly to read.
 
| jshen wrote:
| Probably RDF serialized as hextuples:
| https://github.com/ontola/hextuples
 
| afandian wrote:
| Looks interesting. From that page I couldn't see what the 'graph'
| field relates to. Is it the identifier for a distinct named
| graph? It was blank in the examples.
|
| Do you use it? What for?
 
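[ed: per the ontola/hextuples README linked above, the format is NDJSON where each line is a JSON array of six strings: subject, predicate, value, datatype, language, graph. The sketch below assumes that layout; as far as I can tell the last field names the graph a statement belongs to, with an empty string for the default graph, which would explain why it was blank in the examples.]

```python
import json

# A minimal sketch of reading hextuples: each NDJSON line is a
# six-element array [subject, predicate, value, datatype, language,
# graph]. Blank lines are skipped.
def parse_hextuples(ndjson_text):
    statements = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        subject, predicate, value, datatype, language, graph = json.loads(line)
        statements.append({
            "subject": subject, "predicate": predicate, "value": value,
            "datatype": datatype, "language": language, "graph": graph,
        })
    return statements
```
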
| sgtquack wrote:
| As someone who recently dealt with scraping sites behind
| Cloudflare... I never want to scrape again.
 
| jonatron wrote:
| As the second sentence says, it's a cat-and-mouse game, so
| there's no incentive on either side of bot vs anti-bot to share
| information.
 
| PigiVinci83 wrote:
| I'm sure no one will add their secret sauce here :)
___________________________________________________________________
(page generated 2022-05-27 23:00 UTC)
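[ed: the "find id=subhed and go up one element" point from the selector discussion upthread can be demonstrated with nothing but the standard library. CSS selectors have no parent axis, while even the limited XPath subset in xml.etree.ElementTree supports stepping up with "..". The markup below is an invented toy document; Scrapy/parsel expose full XPath 1.0, so the same idea applies there.]

```python
import xml.etree.ElementTree as ET

# Invented toy document: one div carries id="subhed".
doc = ET.fromstring(
    "<body>"
    "<section><div id='subhed'>Heading</div><p>text</p></section>"
    "<section><div>other</div></section>"
    "</body>"
)

# Find the div with id="subhed", then step up to its parent element
# via the ".." XPath step -- something CSS selectors cannot express.
parents = doc.findall(".//div[@id='subhed']/..")
```
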