[HN Gopher] Pup: Parsing HTML at the command line
___________________________________________________________________
Pup: Parsing HTML at the command line

Author : tosh
Score  : 67 points
Date   : 2022-11-30 18:55 UTC (4 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| bijoo wrote:
| It looks like the project became inactive for a bit and there are
| alternatives such as htmlq, etc.
| https://github.com/ericchiang/pup/issues/150

| mananaysiempre wrote:
| From the looks of it, htmlq doesn't have anything comparable to
| pup's JSON output. That JSON is cumbersome to work with, but
| combined with jq it allows one to extend the shell hackery just
| a little bit beyond what CSS can do.

| dang wrote:
| Related:
|
| _Pup - Like Jq, but for HTML_ -
| https://news.ycombinator.com/item?id=24797697 - Oct 2020 (2
| comments)
|
| _Show HN: Pup - A command-line HTML parser_ -
| https://news.ycombinator.com/item?id=8312249 - Sept 2014 (27
| comments)
|
| Random bit of history: that Show HN was a very early choice for
| what is now called the second-chance pool:
|
| _Ask HN: Why did three HN stories jump 100 ranking points in 5
| mins?_ - https://news.ycombinator.com/item?id=8313505 - Sept 2014
| (6 comments)

| John23832 wrote:
| While this is nice, it's three years old.
|
| Direct installation of brew scripts isn't supported anymore.
| `go get` installs aren't either.
|
| It needs an update.

| JodieBenitez wrote:
| It may need an update, but not for installation:
|
|     go install github.com/ericchiang/pup@latest

| thangalin wrote:
| https://www.w3.org/Tools/HTML-XML-utils/

| bsnnkv wrote:
| I use this extensively in bash scripts where I need to scrape
| HTML reliably and consistently. Cannot recommend it highly
| enough.
|
| In fact, all of the source data for a project of mine, Baytyab[1]
| (couplet-finder) was scraped using bash + pup.
|
| [1]: https://baytyab.com

| pipeline_peak wrote:
| Would you recommend it over BeautifulSoup?

| turtlebits wrote:
| Will check it out, but would have preferred XPath selectors
| instead of CSS.

| undume wrote:
| xmllint can do that:
|
|     curl example.org | xmllint --html --xpath '//some/xpath/selector' -

| natrys wrote:
| Also: https://github.com/benibela/xidel

| curben wrote:
| I'm using xmlstarlet in Alpine as a bare-minimum way to scrape a
| webpage in a CI pipeline.

| BossHogg wrote:
| There's also https://github.com/charmparticle/xpe

| wmichelin wrote:
| Why? CSS selectors are the normal web developer way to select
| content from a document. Even JavaScript adopted the approach.

| tuukkah wrote:
| XPath supports more complex queries. In JavaScript, XPath is
| available as document.evaluate.

| ducktective wrote:
| I'm looking for something like this but for scraping SPAs and
| JS-rich web content. A _single static binary_ like this, not
| ChromeDriver or Selenium, etc.
___________________________________________________________________
(page generated 2022-11-30 23:00 UTC)