[HN Gopher] Show HN: Flyscrape - A standalone and scriptable web...
___________________________________________________________________
Show HN: Flyscrape - A standalone and scriptable web scraper in Go
Author : philippta
Score  : 138 points
Date   : 2023-11-11 14:18 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| lucgagan wrote:
| This looks great. I wish I had this a few months ago! Giving it a try.

| philippta wrote:
| Glad to hear! You're welcome to leave any feedback on GitHub (as an issue) or right here.

| bryanrasmussen wrote:
| Looks like there's no way to run it as (or identify as) a particular browser, etc. Which I guess makes it fine for a lot of pages, but a lot of scraping tasks would be affected. Am I right, or did I miss something?

| philippta wrote:
| Yes, this is correct. As of right now there is no built-in support for running as a browser.
|
| What is possible, though, is to use a service like ScrapingBee (not affiliated) and set it as the proxy. This would render the page on their end, in a browser.

| acheong08 wrote:
| Try tls-client. It gets around TLS fingerprinting by Cloudflare.

| snake117 wrote:
| Looks interesting, and thank you for sharing this! One common issue with scraping web pages is dealing with data that is dynamically loaded. Is there a solution for this? For example, when using Scrapy, you can have Splash running in Docker via scrapy-splash (https://github.com/scrapy-plugins/scrapy-splash).

| figmert wrote:
| Can't you load the URL that is being dynamically loaded directly within your scraper?
| mdaniel wrote:
| Not only can you, in my experience it is substantially less drama and arguably less load on the target system, since the full page may make many, many other requests that a presentation layer would care about but that I don't.
|
| The trade-offs usually fall into:
|
| - authing to the endpoint can sometimes be weird
|
| - it for sure makes the traffic stand out, since it isn't otherwise surrounded by those extraneous requests
|
| - it, as with all good things scraping, carries its own maintenance and monitoring burden
|
| However, similar to those trade-offs, it's also been my experience that a full page load offers a ton more tracking opportunities that are not present in a direct endpoint fetch. I mean, look at how many "stealth" plugins are out there designed to mask the fact that a headless browser is headless.
|
| But, having said all of that: without question the biggest risk to modern-day scraping is Cloudflare and Akamai gatekeeping. I do appreciate the arguments of "but ddos!11", and yet I would rather only actors that are actually exhibiting bad behavior[1] be blocked, instead of everyone with a copy of Python who has set reasonable rate limits.
|
| [1] This is setting aside that "bad behavior" can be defined as "downloading data that the site makes freely available to Chrome but not freely available to Python".

| philippta wrote:
| Thanks! As mentioned in another comment, there is currently no built-in support for this yet.
|
| As a workaround, one could use a service like ScrapingBee (not affiliated) as a proxy that renders the page in a browser for you.
|
| Of course, relying on a service for this is not always ideal. I am also working on a small wrapper that turns Chrome into an HTTPS proxy, which you could plug right into flyscrape. Unfortunately it is still very experimental and not public yet. I have not yet decided whether I'll release it as part of flyscrape or as a separate project.
| xyzzy_plugh wrote:
| I like web scraping in Go. The support for parsing HTML in x/net/html is pretty good, and libraries like github.com/PuerkitoBio/goquery go a long way to matching the ergonomics of other tools. This project uses both, but then also goes on to use github.com/dop251/goja, which is a JavaScript VM _and_ its accompanying Node.js compatibility layer _and_ even esbuild, in order to _interpret scraping instruction scripts_.
|
| I mean, at this point I am not sure Go is the right tool for the job (I am _actually_ pretty confident that it is _not_).
|
| A pretty neat stack of engineering, sure! This is cool, nicely done. But I can't help but feel disturbed.

| cxr wrote:
| Your comment was posted 4 minutes ago. That means you still have enough time to edit your comment so it contains real URLs that link to the project repos for the packages mentioned:
|
| <https://github.com/PuerkitoBio/goquery>
|
| <https://github.com/dop251/goja>
|
| (Please do not reply to this comment of mine--if you do, I won't be able to delete it once the previous post is fixed, because the existence of the replies will prevent that.)

| cheapgeek wrote:
| Ok

| xyzzy_plugh wrote:
| Even if I had seen this post in time, I wouldn't have edited it. They are all proper Go package names.

| sunshadow wrote:
| These days, I'm not even using Go for scraping that much, as webpage changes drive me crazy and JS code evaluation is a lifesaver, so I moved to TypeScript+Playwright. (The Crawlee framework is cool, while not strictly necessary.)
|
| It's been 8+ years since I started scraping. I even wrote a popular Go web scraping framework previously: https://github.com/geziyor/geziyor.
|
| My favorite stack as of 2023: TypeScript+Playwright+Crawlee (optional). If you're serious about scraping, you should learn JavaScript, so Playwright should serve you well.
|
| Note: There are niche cases where a lower-level language would be required (C++, Go, etc.), but probably only <5%.

| mikercampbell wrote:
| Have you seen Crul?
|
| I love the JS flow, but I thought Crul was an interesting newer tool!
|
| But I agree, you gotta get in there, and it's easier with JS.

| reyostallenberg wrote:
| Can you add a link to it?

| mdaniel wrote:
| I'm sorry to hear that your searches for that very specific name didn't provide the information you were looking for.
|
| its show hn: https://news.ycombinator.com/item?id=34970917
|
| tfl: https://www.crul.com/

| sunshadow wrote:
| Crul looks nice; though, you cannot imagine how many startups I've seen fail doing a very similar thing to Crul. I wouldn't rely on it. The problem is complex: humans generating messy pages.

| hipadev23 wrote:
| How does that help you mitigate when a site changes? If you're fetching some value in a given <div> under a long XPath and they decide to change that path?

| sunshadow wrote:
| You don't use XPath/CSS selectors at all (except if you don't have a choice). You rely on more generic stuff, e.g., "the button that has 'Sign in' on it": await page.getByRole('button', { name: 'Sign in' }).click();
|
| See Playwright locators: https://playwright.dev/docs/locators

| 8n4vidtmkvmk wrote:
| I started putting data-testid attributes in my web app for automated testing using Playwright. Prevents me from breaking my own script, but it sure would make me more scrapable if anyone cared. Well... I guess I only do it on inputs, not the rendered page, which is what scrapers care most about.

| sunshadow wrote:
| Unless you start a war against scrapers, you don't need to worry about that, as I'll always find a way to scrape your site as long as it's valuable to 'me'. Even if it requires a real browser + OCR :)

| erhaetherth wrote:
| Oh I know I couldn't prevent it.
But if you wanted to scrape me, you'd have to pay the monthly subscription, because everything is behind a paywall/login. And then you'd only have access to data you entered, because it's just that kind of app :-)

| latchkey wrote:
| This is where you just train an LLM so you can write:
|
| 'get button named "sign in" and click'
|
| Then on the back end, it generates your example code.

| nurettin wrote:
| Don't know about the poster, but I try to find divs and buttons in a fuzzy way, usually via element text. Sometimes it mitigates changes, sometimes it doesn't. It's a guessing game, especially when they start using shadow elements or iframes in the page. If I'm looking for something specific like a price or dimensions, I can sometimes get away with it by collecting dollar amounts or "X x Y x Z" from the raw text.

| slig wrote:
| Thanks for sharing! Just a small nit: the links at the bottom of this page are broken [1].
|
| [1]: https://github.com/philippta/flyscrape/blob/master/docs/read...

| fyzix wrote:
| What happens if 'find()' returns a list and you call '.text()'? Intuition tells me it should fail, but maybe it implicitly gets the text from the first item if it exists.
|
| Either way, I think you could create a separate method 'find_all()' that returns a list to make the API easier to reason about.
___________________________________________________________________
(page generated 2023-11-11 23:00 UTC)
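As an aside, nurettin's fallback in the thread above, collecting dollar amounts or "X x Y x Z" dimensions from the raw page text when selectors are too brittle, can be sketched with Go's standard regexp package (the patterns here are illustrative, not exhaustive):

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative patterns for pulling structured values out of raw page
// text: dollar amounts like "$1,299.99" and dimensions like "120 x 60 x 75".
var (
	priceRe = regexp.MustCompile(`\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?`)
	dimsRe  = regexp.MustCompile(`(?i)\b\d+(?:\.\d+)?\s*x\s*\d+(?:\.\d+)?\s*x\s*\d+(?:\.\d+)?\b`)
)

// extractPrices returns every dollar amount found in the text.
func extractPrices(text string) []string {
	return priceRe.FindAllString(text, -1)
}

// extractDims returns every "X x Y x Z" dimension triple found in the text.
func extractDims(text string) []string {
	return dimsRe.FindAllString(text, -1)
}

func main() {
	text := `Big Desk - now only $1,299.99 (was $1,500).
Dimensions: 120 x 60 x 75 cm. Ships flat-packed.`

	fmt.Println(extractPrices(text)) // [$1,299.99 $1,500]
	fmt.Println(extractDims(text))   // [120 x 60 x 75]
}
```

The trade-off is the one the thread names: scanning raw text survives markup changes but is a guessing game, so results usually need validation downstream.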