[HN Gopher] Show HN: Flyscrape - A standalone and scriptable web...
       ___________________________________________________________________
        
       Show HN: Flyscrape - A standalone and scriptable web scraper in Go
        
       Author : philippta
       Score  : 138 points
       Date   : 2023-11-11 14:18 UTC (8 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | lucgagan wrote:
       | This looks great. I wish I had this a few months ago! Giving it a
       | try.
        
         | philippta wrote:
          | Glad to hear! You're welcome to leave any feedback on GitHub
          | (as an issue) or right here.
        
       | bryanrasmussen wrote:
        | Looks like there's no option to run it as a particular browser,
        | etc. I guess that makes it fine for a lot of pages, but a lot of
        | scraping tasks would be affected. Am I right, or did I miss
        | something?
        
         | philippta wrote:
         | Yes, this is correct. As of right now there is no built-in
         | support for running as a browser.
         | 
         | What is possible though, is to use a service like ScrapingBee
         | (not affiliated) and set it as the proxy. This would render the
         | page on their end, in a browser.
        
           | acheong08 wrote:
           | Try tls-client. It gets around TLS fingerprinting by
           | Cloudflare
        
       | snake117 wrote:
       | Looks interesting, and thank you for sharing this! One common
       | issue with scraping web pages is dealing with data that is
       | dynamically loaded. Is there a solution for this? For example,
       | when using Scrapy, you can have Splash running in Docker via
       | scrapy-splash (https://github.com/scrapy-plugins/scrapy-splash).
        
         | figmert wrote:
         | Can't you load the URL that is being dynamically loaded
         | directly within your scraper?
        
           | mdaniel wrote:
            | Not only can you, but in my experience it's substantially
            | less drama and arguably less load on the target system, since
            | the full page may make many, many other requests that a
            | presentation layer would care about but I don't.
           | 
           | The trade-offs usually fall into:
           | 
           | - authing to the endpoint can sometimes be weird
           | 
           | - it for sure makes the traffic stand out since it isn't
           | otherwise surrounded by those extraneous requests
           | 
           | - it, as with all good things scraping, carries its own
           | maintenance and monitoring burden
           | 
           | However, similar to those tradeoffs, it's also been my
           | experience that a full page load offers a ton more tracking
           | opportunities that are not present in a direct endpoint
            | fetch. I mean, look at how many "stealth" plugins are out
            | there designed to mask the fact that a headless browser is
            | headless.
           | 
           | But, having said all of that: without question the biggest
           | risk to modern day scraping is Cloudflare and Akamai
           | gatekeeping. I do appreciate the arguments of "but ddos!11"
            | and yet I would rather see only actors that are actually
            | exhibiting bad behavior[1] blocked, instead of everyone with
            | a copy of Python who has set reasonable rate limits.
           | 
            | 1 = this sets aside that "bad behavior" can be defined as
            | "downloading data that the site makes freely available to
            | Chrome but not freely available to Python"
        
         | philippta wrote:
          | Thanks! As mentioned in another comment, there is currently no
          | built-in support for this.
         | 
          | As a workaround, one could use a service like ScrapingBee (not
          | affiliated) as a proxy, which renders the page in a browser
          | for you.
         | 
          | Of course, relying on a service for this is not always ideal.
          | I am also working on a small wrapper that turns Chrome into an
          | HTTPS proxy, which you could plug right into flyscrape.
          | Unfortunately it is still very experimental and not public
          | yet. I have not yet decided whether to release it as part of
          | flyscrape or as a separate project.
        
       | xyzzy_plugh wrote:
        | I like web scraping in Go. The support for parsing HTML in
        | x/net/html is pretty good, and libraries like
        | github.com/PuerkitoBio/goquery go a long way toward matching the
        | ergonomics of other tools. This project uses both, but then also
        | goes on to use github.com/dop251/goja, which is a JavaScript VM
        | _and_ its accompanying Node.js compatibility layer _and_ even
        | esbuild, in order to _interpret scraping instruction scripts_.
       | 
        | I mean, at this point I am not sure Go is the right tool for the
        | job (I am _actually_ pretty confident that it is _not_).
       | 
        | A pretty neat stack of engineering, sure! This is cool, nicely
        | done. But I can't help but feel disturbed.
        
         | cxr wrote:
         | Your comment was posted 4 minutes ago. That means you still
         | have enough time to edit your comment to change it so it
         | contains real URLs that link to the project repos for the
         | packages mentioned:
         | 
         | <https://github.com/PuerkitoBio/goquery>
         | 
         | <https://github.com/dop251/goja>
         | 
         | (Please do not reply to this comment of mine--if you do, I
         | won't be able to delete it once the previous post is fixed,
         | because the existence of the replies will prevent that.)
        
           | cheapgeek wrote:
           | Ok
        
           | xyzzy_plugh wrote:
           | Even if I saw this post in time, I wouldn't have edited it.
           | They are all proper Go package names.
        
       | sunshadow wrote:
        | These days I'm not even using Go for scraping that much, as the
        | webpage changes drive me crazy and JS code evaluation is a
        | lifesaver, so I moved to TypeScript+Playwright. (The Crawlee
        | framework is cool, though not strictly necessary.)
       | 
        | It's been 8+ years since I started scraping. I even wrote a
        | popular Go web scraping framework:
        | https://github.com/geziyor/geziyor.
       | 
        | My favorite stack as of 2023:
        | TypeScript+Playwright+Crawlee (optional). If you're serious
        | about scraping, you should learn JavaScript, so Playwright is a
        | good fit.
       | 
        | Note: There are niche cases where a lower-level language (C++,
        | Go, etc.) would be required, but probably <5% of them.
        
         | mikercampbell wrote:
         | Have you seen Crul??
         | 
          | I love the JS flow, but I thought Crul was an interesting
          | newer tool!!
         | 
         | But I agree, you gotta get in there and it's easier with JS
        
           | reyostallenberg wrote:
           | Can you add a link to it?
        
             | mdaniel wrote:
             | I'm sorry to hear that your searches for that very specific
             | name didn't provide the information you were looking for
             | 
             | its show hn: https://news.ycombinator.com/item?id=34970917
             | 
             | tfl: https://www.crul.com/
        
           | sunshadow wrote:
            | Crul looks nice, though you cannot imagine how many startups
            | I've seen fail doing something very similar to Crul. I
            | wouldn't rely on it. The problem is complex: humans
            | generating messy pages.
        
         | hipadev23 wrote:
          | How does that help you mitigate when a site changes? If you're
          | fetching some value in a given <div> under a long XPath and
          | they decide to change that path?
        
           | sunshadow wrote:
            | You don't use XPath/CSS selectors at all (except if you
            | don't have a choice). You rely on more generic things, e.g.
            | "the button that has 'Sign in' on it":
            | 
            |     await page.getByRole('button', { name: 'Sign in' }).click();
            | 
            | See Playwright locators: https://playwright.dev/docs/locators
        
             | 8n4vidtmkvmk wrote:
             | I started putting data-testid attributes in my web app for
             | automated testing using playwright. Prevents me from
             | breaking my own script but it sure would make me more
             | scrapable if anyone cared. Well.. I guess I only do it on
             | inputs, not the rendered page which is what scrapers care
             | most about.
        
               | sunshadow wrote:
                | Unless you start a war against scrapers, you don't need
                | to worry about that, as I'll always find a way to scrape
                | your site as long as it's valuable to 'me'. Even if it
                | requires a real browser + OCR :)
        
               | erhaetherth wrote:
               | Oh I know I couldn't prevent it. But if you wanted to
               | scrape me, you'd have to pay the monthly subscription
               | because everything is behind a pay wall/login. And then
               | you'd only have access to data you entered because it's
               | just that kind of app :-)
        
             | latchkey wrote:
             | This is where you just train an LLM so you can write:
             | 
             | 'get button named "sign in" and click'
             | 
             | Then on the back end, it generates your example code.
        
           | nurettin wrote:
            | Don't know about the poster, but I try to find divs and
            | buttons in a fuzzy way, usually via element text. Sometimes
            | that mitigates changes, sometimes it doesn't. It's a
            | guessing game, especially when they start using shadow
            | elements or iframes in the page. If I'm looking for
            | something specific, like a price or dimensions, I can
            | sometimes get away with collecting dollar amounts or
            | "X x Y x Z" patterns from the raw text.
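            | 
            | A minimal sketch of that raw-text trick in Go, using only
            | the standard library (the page snippet and class name below
            | are made up):

```go
package main

import (
	"fmt"
	"regexp"
)

// extractPrices pulls dollar amounts out of raw page text, ignoring
// the markup entirely -- a last resort when selectors keep breaking.
func extractPrices(raw string) []string {
	re := regexp.MustCompile(`\$\d+(?:,\d{3})*(?:\.\d{2})?`)
	return re.FindAllString(raw, -1)
}

func main() {
	page := `<div class="x9f2"><span>Now only $1,299.99</span>
	         <del>$1,499.99</del></div>`
	fmt.Println(extractPrices(page)) // [$1,299.99 $1,499.99]
}
```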
        
       | slig wrote:
       | Thanks for sharing! Just a small nit: the links at the bottom of
       | this page are broken [1].
       | 
       | [1]:
       | https://github.com/philippta/flyscrape/blob/master/docs/read...
        
       | fyzix wrote:
        | What happens if 'find()' returns a list and you call '.text()'?
        | Intuition tells me it should fail, but maybe it implicitly gets
        | the text from the first item if it exists.
        | 
        | Either way, I think you could create a separate method
        | 'find_all()' that returns a list, to make the API easier to
        | reason about.
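        | 
        | One way to make that contract explicit, sketched in Go (the
        | type and method names are invented for illustration, not
        | flyscrape's actual API):

```go
package main

import "fmt"

// Selection is a hypothetical result of find().
type Selection struct {
	texts []string
}

// Text returns the text of the first match (or "" if there is none),
// making .text() on a multi-element result predictable.
func (s Selection) Text() string {
	if len(s.texts) == 0 {
		return ""
	}
	return s.texts[0]
}

// TextAll returns every match's text: the explicit "find_all" flavor.
func (s Selection) TextAll() []string {
	return s.texts
}

func main() {
	sel := Selection{texts: []string{"first", "second"}}
	fmt.Println(sel.Text())    // first
	fmt.Println(sel.TextAll()) // [first second]
}
```

        | (For comparison, goquery's Selection.Text() returns the
        | combined text of all matched nodes.)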
        
       ___________________________________________________________________
       (page generated 2023-11-11 23:00 UTC)