[HN Gopher] Htmlq: like jq, but for html ___________________________________________________________________ Htmlq: like jq, but for html Author : jabo Score : 849 points Date : 2021-09-07 07:12 UTC (15 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | dfederschmidt wrote: | This looks very useful, big fan of all the ^[a-z]+q$ utilities | out there. But as a user, I would probably want to use XPath[0] | notation here. Maybe that is just me. A quick search revealed | xidel[1] which seems to be similar, but supports XPath. | | [0]https://en.wikipedia.org/wiki/XPath | [1]https://github.com/benibela/xidel | chriswarbo wrote: | My web scraping tends to start with xidel. If I need a little | bit more power I'll use xmlstarlet. If neither of those is | enough, I'll use Python's beautifulsoup package :) | waynenilsen wrote: | part of the problem with this is that HTML is mostly not valid | XML | akie wrote: | I'd like to state my support for the author's choice of CSS | selectors in this particular use case. I think it's a natural | fit for this domain and already very well known, perhaps even | known better than XPath. | mirekrusin wrote: | Playwright ppl had to solve this for themselves, you can mix | them as they are distinct, have few small custom | modifications to help with selectors. Playwright compatible | selectors would be nice. | berkes wrote: | I'd like to add my support here too, but with a note. | | When scraping and parsing (or writing integration test DSL), | I always start out with CSS selectors. But always hit cases | where they lack or require hoop-jumping and then fall back on | Xpath. I then have a codebase with both CSS-Sel and Xpath, | which is arguably worse than having only one method. | | I suspect here, one uses this tool until CSS selector | limitations are getting in the way, after which one switches | to another tool(chain) | alpha_squared wrote: | Do you mind giving an example? 
I'm having trouble following | where CSS is limited for selection. | berkes wrote: | Like the other commenter says: parent/child. But also | selecting by content (e.g. "click the button with the | delete-icon" or "find the link with '@harrypotter'") or | selecting by attributes (e.g. click the pager-item that | goes to next page) or selecting items outside of body | (e.g. og-tags, title etc). All are doable in CSS3 | selectors, but everything shouts that they are not meant | for this; whereas xpath does this far more naturally. | unspecified wrote: | Searching text content is my main remaining use of XPath. | benibela wrote: | XPath does general data processing, not just selection | | E.g. when you have a list of numbers on the website, | XPath can calculate the sum or the maximum of the numbers | | Or you have a list of names "Last name, First name", then | you can remove the last name and sort the first names | alphabetically. Or count how often each name occurs and | return the most popular name. | | Then it goes back to selection, e.g. select all numbers | that are smaller than the average. Or calculate the most | popular name, then select all elements containing that | name | vlunkr wrote: | Well, the big one is selecting a parent from the child. | androceium wrote: | You could do this with the :has() CSS pseudo-class[0], | though inverted (select a parent that _has_ the child | matching a selector). | | Looks like that pseudo-class has not been implemented in | the kuchiki library that htmlq uses though. | | [0]: https://developer.mozilla.org/en-US/docs/Web/CSS/:has | Jenk wrote: | I've not had much friction using either, they are "close | enough" that the time to (re)write a query from one to the | other is not very significant. | lilyball wrote: | This looks really neat! It supports a bunch of different query | types, and can even do things like follow links to get info | about the linked-to pages! 
| | It's also in nixpkgs, though for some reason the nixpkgs | derivation is marked as linux-only (i.e. not Darwin). (Edit: | probably because the fpc dependency is also Linux-only, with a | linux-specific patch and a comment suggesting that supporting | other platforms would require adding per-platform patches) | exyi wrote: | Thanks, this looks more powerful. It supports CSS, XPath and | XQuery. Maybe I could learn a bit of XQuery when I have a use | case for it :) | dmit wrote: | Well, here's your first lesson then: if you prepend (: to | your comment it will become a valid XQuery document! | | (: XQuery comments are marked by mirrored smiley faces, like | this. :) | bdcravens wrote: | Nice - I've been writing XQuery for years and I had no clue | firefoxd wrote: | Super useful. You've created a fantastic tool here. Thank you. | d--b wrote: | Just being that guy: is there a reason you didn't call it hq? | zamadatix wrote: | Not author but neither is the poster: Jq got away with it | because it's one of the few 2 letter combinations that wasn't | absolutely overloaded and "jquery" was already taken. OTOH | nobody shortens HTML to H and HQ is an extremely common | acronym, if not one of the most popular 2 letter acronyms you | could pick. | OJFord wrote: | jq didn't get away with it! Have you never tried searching | for anything to do with it? How I _wish_ it were called | `jsonq`! | mgdm wrote: | I just wanted to be slightly more descriptive and less likely | to collide with other tools. | notRobot wrote: | Hahah, I love how this is your second comment in 10 years on | HN. | mgdm wrote: | Hah. Yeah. I had another account for a little while but | then HN started to let me reset the password for this one | quite recently, so here I am. | Snd_ wrote: | This is great! Thanks | who-shot-jr wrote: | Good work! | harperlee wrote: | Nice! 
| | This is the kind of obvious tool that once it exists, you can't | really grok the fact it did not earlier, and that it took until | now to exist. | dmos62 wrote: | > grok | | A good opportunity to introduce `gron` to those unfamiliar! | > gron "https://api.github.com/repos/tomnomnom/gron/commits?per | _page=1" | fgrep "commit.author" json[0].commit.author | = {}; json[0].commit.author.date = | "2016-07-02T10:51:21Z"; json[0].commit.author.email = | "mail@tomnomnom.com"; json[0].commit.author.name = "Tom | Hudson"; | | https://github.com/tomnomnom/gron | rsync wrote: | "A good opportunity to introduce `gron` to those unfamiliar!" | | Thank you - appreciated. | | I haven't done much work with json but have had reasons | recently to do so - and I immediately saw how difficult it | was to pipeline to grep ... | | But what I still don't understand is that some json outputs I | see have multiple values _with the exact same name_ (!) and | that still seems "un-grep-able" to me ... | | What am I missing ? | dmos62 wrote: | You might be missing a change in index: `obj[0].prop` vs | `obj[1].prop`. Or, your JSON might have the same property | defined multiple times: `{a:1, a:2}` (though I'm not sure | how gron handles that situation). | lvncelot wrote: | > (though I'm not sure how gron handles that situation). | | It seems both gron and jq only use the value that has | been defined last: ~ echo | '{"a":1,"a":2}' | gron | json = {}; json.a = 2; ~ echo | '{"a":1,"a":2}' | jq | { "a": 2 } | croon wrote: | The json output likely contains multiple objects. Can you | request more specifically the object(s) you need and grep | on that? | dotancohen wrote: | > But what I still don't understand is that some json | > outputs I see have multiple values with the exact same | name | | This is neither explicitly allowed nor explicitly forbidden | by the JSON spec. It is implementation dependent upon how | to handle - does one value override the other? 
Should they | be treated as an array? | | In practice, this situation is usually carefully avoided by | services that produce JSON. If you are interfacing with a | service that does produce duplicate values, I'd be | interested in seeing it for curiosity's sake. If you are | writing a service and this is the output, then I implore | you to reconsider! | ptwt wrote: | It did write it a few years ago. | | https://github.com/plainas/tq | matsemann wrote: | There are already tools for xpath, but using css selectors is | much more aligned with what I write every day, so that's nice. | harperlee wrote: | Yes, and awk and others. I meant something semantically | closer to the need, with css selectors. | natrys wrote: | It's not novel obviously. I have been using pup[1] for years. | And xidel[2] is probably older. | | [1] https://github.com/ericchiang/pup | | [2] https://github.com/benibela/xidel | ducktective wrote: | Looks nice! Any comparisons with pup? | | https://github.com/ericchiang/pup | notorandit wrote: | Next is xmlq: https://github.com/dscape/xmlq | Ronak123 wrote: | https://techflashes.com/top-upcoming-futuristic-technologies... | unityByFreedom wrote: | Why not just jquery? | purplecats wrote: | brilliant. does this spin up a heavy DOM implementation in the | background or do something lighter such as regexp? | mdzn wrote: | You can't parse HTML with regexps. It's not a regular language. | underdeserver wrote: | https://stackoverflow.com/questions/1732348/regex-match- | open... | carnitine wrote: | What language implements regexps that actually correspond to | regular languages though? | Deukhoofd wrote: | Looks like it uses servos html5ever (through kuchiki), so no | DOM representation. | chrismorgan wrote: | Kuchiki materialises what they call a "DOM-like tree". I'd | consider it a DOM tree, myself, despite the differences in | precise API. | | But it's not using a full browser to back it, which I suspect | is what's really being asked. 
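For a sense of what sits between "a heavy DOM implementation" and "regexp", here is a minimal Python sketch using only the standard library's html.parser. It is not what htmlq does (the comments above say htmlq goes through html5ever and kuchiki in Rust), and html.parser is an event-based tokenizer rather than a tree builder. The LinkCollector class and the SAMPLE markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags as the parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of pairs.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Hypothetical sample: unclosed <li> tags, mixed case and quoting,
# and a link that only exists inside a comment.
SAMPLE = """
<ul>
  <li><A HREF='/learn'>Learn</A>
  <li><a class="x" href="/tools">Tools</a>
  <!-- <a href="/hidden">not a real link</a> -->
</ul>
"""

collector = LinkCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.links)  # ['/learn', '/tools']
```

The commented-out link is skipped because the parser reports it as a comment, not as a start tag, which is the kind of case a naive regexp over the raw text tends to get wrong.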
| mrweasel wrote: | It looks to be using html5ever to parse the HTML, similar to | something like BeautifulSoup in Python. | delusional wrote: | The source is right there. You can read it. It uses html5ever | (part of the servo project). | [deleted] | gostsamo wrote: | You can't parse html with regular expressions :) | | https://stackoverflow.com/questions/1732348/regex-match-open... | bmn__ wrote: | "Oh Yes You Can Use Regexes to Parse HTML!" | | https://stackoverflow.com/a/4234491 | anon4242 wrote: | Yeah, if you allow yourself some Perl to help you with | those parts that regexes can't handle... | akie wrote: | Technically correct, but did you see the regex he uses? It | spans 82 lines... | andybak wrote: | And the obligatory caveat from the comments: | | > While arbitrary HTML with only a regex is impossible, it's | sometimes appropriate to use them for parsing a limited, | known set of HTML. | hnbad wrote: | The emphasis here is on "known". The tool is general | purpose (i.e. handling _unknown_ HTML) so using regexes | would be ill-advised. | dredmorbius wrote: | See also the html-xml-utils from w3c. | | hxextract and hxselect perform similar extract functions. | | hxclean and hxnormalize (combined) will pretty-print HTML. | | https://www.w3.org/Tools/HTML-XML-utils/ | mozey wrote: | Funny, a couple of years ago I thought someone should create | something for JSON similar to what | [XSLT](https://en.wikipedia.org/wiki/XSLT) is for XML. See | example here https://www.w3schools.com/xml/xsl_intro.asp | | Then I found out about jq because awscli was using it in | example docs. | | I guess `htmlq` makes sense if it has the exact same syntax as | `jq`, and the user is already familiar with the latter? | desktopninja wrote: | Very nice tool. I've long spoiled myself with Powershell's: | Invoke-WebRequest eg. # what is the latest release of | apache-tomcat? 
$LINKS=$(Invoke-WebRequest -Uri | 'https://tomcat.apache.org/download-80.cgi' | Select-Object | -ExpandProperty Links) $LATEST=$($Links | Where-Object | -Property href -Match '#8.5.[0-9]+').href.substring(1) | $FETCH=$($Links | Where-Object -Property href -match "apache- | tomcat-${LATEST}.zip$").href | Tepix wrote: | Should it be $LINKS instead of $Links (2x)? | desktopninja wrote: | "$links" works too because PWSH is not case sensitive. But I | should have used $LINKS like you said for cleaner write-up. | systemvoltage wrote: | This is nifty! Python + bs4 takes some googling to remember how | to parse a webpage. This is just straight forward, thanks so | much. | jillesvangurp wrote: | If you make the html well formed, xpath also works great. Great | stuff if you ever need to pick html apart. Used this quite a bit | when microformats were still a thing together with jtidy. | | Jq is very loosely inspired by that, I guess. Might come full | circle here and use some XSL transformations ... | qw wrote: | You can usually find a html parser for your language, that you | can use xpath/xsl on. It will just make the same assumptions | that the browser does, by adding missing closing tags etc. | | I made a tool that extracted parts of web pages 10-15 years | ago, and it worked well. There are of course cases where the | html is so unstructured that the results were unpredictable, | but it worked well in general. | ludovicianul wrote: | And a Java version with pre-compiled binaries: | https://github.com/ludovicianul/hq | srg0 wrote: | "htmlq: like jq, but for HTML" | | "jq is like sed for JSON data" | | sed: "While in some ways similar to an editor which permits | scripted edits (_such as ed_), sed works by making only one pass | over the input(s)" | | ed: "ed is a line-oriented text editor". | | Software definition through a reference to another software is | somewhat confusing. 
Potential users come from different | backgrounds (I had no idea what jq is), and it is not clear what | the defining features of each project are. Is jq line oriented? | Is htmlq operating in a single pass? | digitalsushi wrote: | "htmlq is like jq but for html" is a very specific 'dog | whistle' for people who use jq. I agree that people who don't | know what jq is will get no value and pay no attention. But for | people who use jq, the claim is, like a dog whistle, clear, | concise, and means exactly what it says. In two seconds, | everyone using both jq and html will instantly know what is | available and log it away. | | So for general purposes, it's a terrible marketing pitch. And | yet I think it's a very, very valuable demonstration of knowing | some of their 'customers'. | acomar wrote: | this isn't what a dogwhistle is. it's just explanation by | analogy to a model presumed to be shared by the intended | audience. a dogwhistle offers a surface meaning to the | uninitiated that's anodyne but communicates a hidden, coded | message to those who possess some undisclosed, shared | knowledge with the author. this kind of analogy entirely | lacks the surface meaning and the message shared via jargon | also communicates something about how you might learn enough | to understand the analogy. | philipswood wrote: | I can't speak for people who don't know jq, but knowing jq, | this is a great tagline: it gives me an immediate | understanding of what it does, how I could expect to use it | and what value and ease of use I can expect. | | I'll be trying it out next time I'm on a PC. | rendall wrote: | > _I can't speak for people who don't know jq,_ | | I can, and it's not illuminating at all. | throwaway2016a wrote: | I agree, however if you do know how to use jq then "like jq, | but for html" is extremely effective. I use jq all the time and | that title hooked me, I immediately wanted to try it. 
| | But if you haven't used Jq then I can see how that title is | less than helpful. | ducktective wrote: | The first three are not proper definitions per se but kind of | an advertisement, trying to familiarize by self-comparison with | a _tried & true_ tool that has proven its worth. | | _You know Jimmy the famous mechanic? I'm Timmy, _his brother_ | but an electrician._ | | IMO, at least `jq` has proven itself as _the_ indispensable | tool for json-data manipulation. | corporealshift wrote: | I mean...if you read the github readme it literally describes | what it does in the next line: "Uses CSS selectors to extract | bits content from HTML files". | kbenson wrote: | > Software definition through a reference to another software | is somewhat confusing. | | Possibly, depending on background as you note, but not all | promotion is aimed at the same audience. When submitting to | HN, "like jq, but for X" is short and conveys what it is to | most of the people that would care, I think. jq has been submitted | and talked about here _many_ times with lively discussion over | the years.[1] At this point I think most of those that are | interested in what that is and what this is will understand | fairly quickly from the title. Those that don't might be | missed, or they might look it up like you, or they might see it | through some other submission some other time with a different | title which isn't based on a chain of references. | | 1: | https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu... | zamadatix wrote: | 1st sentence - Explaining the tool for those the tool was made | for without beating around the bush. | | 2nd sentence - Explaining the tool to folks in the general web | domain what it can do for them. | | 3rd sentence - Explaining where to learn how to use the tool if | you've stumbled across it but web is not your area of | expertise. 
| | All that info fits in nearly 25 words then it lists the options | for the tool and jumps straight into multiple examples (with | outputs!). If the only explanation had been "htmlq: like jq, | but for HTML" I'd agree but having the comparison to explain | what it does isn't a bad thing; it's _only_ having the | comparison that would be bad. | | Personally I think this is a model example of an opening for a | Github readme. | zerocount wrote: | I disagree. The 2nd sentence contains, "extract bits | content." What is that? | | If you're going to write a minimal introduction, at least | make sure it's not confusing. | | I get the feeling the author felt compelled to write an | introduction and did so with as little effort as possible. | cyberge99 wrote: | I believe he tailored it to his target audience. If you | find it confusing, you are likely not it. | ritchiea wrote: | As a web developer for over a decade "bits content" doesn't | mean anything to me. But I understand what the tool does | from the rest of the description. Try running a google | search for "bits content," [0] it's not a commonly used | phrase in web development or anything. It's a poor choice | of words. | | 0. https://www.google.com/search?hl=en&q=%22bits%20content%22 | chownie wrote: | It's supposed to be "bits of content", it's not jargon. | The author's just accidentally a word, we all do it. | ritchiea wrote: | It's more than fair to say in technical documentation you | intend others to use having a grammatical error or | missing word is confusing and a problem. It's the writing | equivalent of having a bug in your code. And it's | definitely not "writing to a target audience" as the | parent comment suggested. We all make mistakes but don't | try to call a mistake effective documentation. | RobertKerans wrote: | Of course it is, but neither parent nor anyone else is | saying anything close to the mistake being effective | documentation. 
There's a single missing word which needs | to be added in, but the overall text is clearly writing | to a target audience. You are aware of this, and of how | small the mistake is, and you understand what the | sentence should read as, so I'm not sure what your point | is? | theIV wrote: | My hunch is that this is a typo and it should read "extract | bits OF content." | mgdm wrote: | Exactly this! I'll fix it after work. | rendall wrote: | Maybe have the line about "jq" be 2nd. Have the first | line be a brief description of what it actually does. | ritchiea wrote: | I agree and having a missing word in your text often | leads to confusion :) | | Honestly you could drop the "bits" which is a bit | redundant and use the phrase "Uses CSS selectors to | extract content from HTML files." | oauea wrote: | What's this thing called a "computer" that people keep going on | about, anyway? | da_chicken wrote: | It's a person who does mathematic calculations all day. For | example, creating range tables for artillery, calculating | averages or totals of a large range of values, or solving | complex integrals or differential equations, and so on. | They're commonly used in industry or government, especially | in astronomy, aerospace and civil engineering for both | simulation and analysis. Perhaps the most well-known | computers were the Harvard Computers, which operated in the | late 19th and early 20th centuries. | | As a job, computers were largely automated out of existence | by solid-state transistor based automated computers and | integrated circuit transistor automated computers in the 60s, | 70s and 80s, which replaced the enormously expensive and | often largely experimental electro-mechanical automated | computers while radically reducing cost and improving | performance both by several orders of magnitude. 
| samstave wrote: | Here - this explains it really succinctly: | | https://www.youtube.com/watch?v=lE1bS-Mn2Mk | theandrewbailey wrote: | It's like a programmable loom, but for logical and | mathematical operations. | samhw wrote: | You may be interested in the symbol grounding problem | (https://en.wikipedia.org/wiki/Symbol_grounding_problem#Groun...). | It's like the binding problem, but for symbols. | [deleted] | sundarurfriend wrote: | Sort of related: [Expecting Short Inferential | Distances](https://www.readthesequences.com/Expecting-Short-Inferential...) | nextaccountic wrote: | jq isn't line-oriented, it's json-oriented. it's operating on a | stream of JSON values from stdin, so its query is applied to each | value in sequence. | | I would expect that htmlq runs the query a single time for a | single html; just like jquery $('#something') or | document.querySelector('#something') | zatkin wrote: | Why not incorporate this into jq itself, like perhaps adding some | command line arguments to switch to HTML mode? | Deukhoofd wrote: | What would the benefits of fitting a HTML parser into a JSON | parser tool be? | lmm wrote: | JQ is not just a parser but a tool for doing operations, many | of which are (or should be) generic across any tree-like data | format. Reusing that part across different input formats | makes a lot of sense. | mjburgess wrote: | Well once there's an HTML parser, then a pdf viewer, and then | everything needed for PDFs (ie., programming, emailer, video | support, etc.) we'll finally have that ideal operating system | we've been waiting for. | mro_name wrote: | sounds a lot more like blockchain. | e12e wrote: | Would probably be more useful to implement html2json, and pipe | in html? | | Ed: eg: https://github.com/Jxck/html2json | downWidOutaFite wrote: | Why? I find xpath's syntax much simpler and regular than jq's. | [deleted] | pabs3 wrote: | I tend to reach for XPath selectors before CSS ones when querying | HTML. 
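Two XPath features mentioned in this thread, selecting a parent from a child and matching on text content, can be tried with nothing but Python's standard library. This is a minimal sketch under stated assumptions: xml.etree.ElementTree only parses well-formed XML (so it will reject a lot of real-world HTML) and supports only a limited XPath subset, and the DOC sample is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; real-world HTML often will not parse as XML.
DOC = """
<html><body>
  <ul>
    <li id="first"><a href="/keep">Keep</a></li>
    <li id="second"><a href="/delete">Delete</a></li>
  </ul>
</body></html>
"""

root = ET.fromstring(DOC)

# Match <a> elements by their text content, then step up to the parent with "..".
parents = root.findall(".//a[.='Delete']/..")
parent_ids = [li.get("id") for li in parents]
print(parent_ids)  # ['second']
```

The `[.='Delete']` text predicate needs Python 3.7 or later. For markup that is not well-formed, a tolerant HTML parser would have to run first.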
| necovek wrote: | Nice, I expected something based on XPath (like xpd), but web | developers dealing with HTML are infinitely more familiar with | CSS selectors, so a great choice! | busterarm wrote: | I want the option to use both, like Nokogiri gives you. | necovek wrote: | Sure, that sounds nice, but having two simple tools each | doing the job well in its own space is perfectly fine for me | -- do you imagine needing to combine Xpath and CSS queries in | a single run? | busterarm wrote: | I've had to do it when dealing with some poorly-designed | XML apis in the past. Nokogiri was a godsend. | rendall wrote: | What is jq? | mro_name wrote: | it's statically linkable rust, isn't it? Awesome. I'm looking for | a successor to | | $ xmllint --html --xpath ... | | that doesn't choke on inline svg. | gigatexal wrote: | This is very cool. This will make scraping the web even easier! | elif wrote: | When I saw the title I thought this was some machine learning- | specific rmq/0mq message passing tech called HT. Very excited to | zero. | m4r35n357 wrote: | Should be HQ . . . | pkrumins wrote: | Call it "hq". | jhatemyjob wrote: | Crazy how a 300-line codebase manages to amass 2000 stars on | Github and 700 upvotes on HN. Amazing ROI. | gizdan wrote: | Once upon a time I was using pup[0] for such thing as well as | later I changed to cascadia[1] which seemed much more advanced. | | Comparing the two repos, it seems pup is dead, but cascadia may | not be. | | These tools, including htmlq, seem to sell themselves as "jq for | html", which is far from the truth. Jq is closer to the awk where | you can do just about everything with json. Cascadia, htmlq, and | pup seem closer to grep for html. They can essentially only | select data from a html source. | | [0] https://github.com/EricChiang/pup [1] | https://github.com/suntong/cascadia | heavyset_go wrote: | I've used pup for a few projects, but was unaware of cascadia. | Thanks for pointing it out. 
| croon wrote: | Well, jq is grep _as well_ as sed and awk, but yeah, htmlq | seems to be just grep, for sake of comparison. | | But I don't think html has any need for a sed/awk tool, or at | least not as much. Json output could very well be piped forward | to the next CLI tool after you've changed it slightly with jq. | I don't see this scenario as likely with html. | gizdan wrote: | > Well, jq is grep as well as sed and awk, but yeah, htmlq | seems to be just grep, for sake of comparison. | | Exactly, and that is what I mean. If you want to compare, | compare it with grep, not jq. | | Someone else posted xidel[0] in this thread, which I've not | used, but it seems to be the "jq but for html". | | [0] https://github.com/benibela/xidel | bamdadd wrote: | is there a brew install command ? | mcovalt wrote: | I'd like to see a tool using lol-html [0] and their CSS selector | API as a streaming HTML editor. | | [0] https://github.com/cloudflare/lol-html | hyperpallium2 wrote: | From examples, this is only like jq in the sense that the q | stands for the same thing. Even the way it does that is | different. | | An xmlq that was really like jq would be fun, about 20 years ago. | cerved wrote: | I would still like xmlq, there are (regrettably) still a lot of | applications that store data and configuration in xml | dotancohen wrote: | There is `xq` today, which parses XML like `jq`. I think that | it is relatively unknown because it is part of the `yq` package | for parsing YAML. So just install `yq` via pip and you'll get | `xq` as well. | | There is also `xmlstarlet` for parsing XML in a similar | fashion. | hyperpallium2 wrote: | xmlstarlet is really nothing like jq, as a language. But yes, | I use it because it is the best commandline xml processor I'd | found. That's the only similarity to jq. | | Is this the yq? 
https://kislyuk.github.io/yq/ It does contain | an 'xq', as a literal wrapper for jq, piping output into it | after transcoding XML to JSON using xmltodict | https://github.com/martinblech/xmltodict (which explodes xml | into separate JSON data structures). | | This is a bash one-liner! But TBF it really is a 'jq for | xml'. I think it would be horrible for some things, but you | could also do a lot of useful things painlessly. | dotancohen wrote: | Thank you for the comments. I've only recently discovered | both tools, and literally used them once each. Of the two | `xq` was easier for my particular work case (parsing a | Magento config) but I keep both tools in my virtual | toolbox. | | If you have any other suggestions for parsing XML for | exploratory purposes I'm very happy to hear them. | hyperpallium2 wrote: | Thanks! Not actually a recommendation, but I have used | xsltproc (command line xslt), but it is horrible to use | because xslt syntax is horrible (though xslt's concepts | are pretty cool). One thing is it enables you to use | XPath in all its glory. | | Just installed xq. It's nice just seeing the pretty- | printed json output, so thanks for the pointer. Probably | better than xmlstarlet for my usage, which just queries | and outputs text, not xml. hmmm, that's probably true for | most commandline uses... | jle17 wrote: | Just looked into this and I think it's worth mentioning that | there are two different projects called `yq`. The first one | that came up (written in go instead of python) is not the | right one and doesn't have the `xq` tool. | abledon wrote: | is anyone else using the https://github.com/json-path/JsonPath | over the jq route? 
| | I hope we standardize on some jq query language, like we have | with a base set of SQL syntax | andybak wrote: | > like jq | | "jq is a lightweight and flexible command-line JSON processor" | chefandy wrote: | If anyone is looking for a good library to do this in Python, | PyQuery works well: | | https://pythonhosted.org/pyquery/ | teitoklien wrote: | Maybe call it hq ? | Simplicitas wrote: | My thoughts EXACTLY... but anyway, great new utility indeed! | teitoklien wrote: | Haha, Indeed it's a very good utility :D | oauea wrote: | https://jsoup.org/ has been around for a long time and seems a | bit more mature & maintained than this two-code-files 2-year-old | repo. Highly recommend. | avereveard wrote: | what's wrong with using html tidy + xmllint ? | mro_name wrote: | nothing wrong. Searching unmodified html though is sometimes | preferable. | soheil wrote: | I'd use something like this script that you can put together | yourself: #!/usr/bin/env ruby require | 'nokogiri'; p Nokogiri::HTML(STDIN.read).css(ARGV[0]).text | | Just save it to a file in your _/usr/local/bin/hq_ and do _chmod | +x !$_ | | Then you can do: curl -s | "https://news.ycombinator.com/news"|hq "tr:first-child | .storylink" | | It uses Nokogiri[0], which is much more battle tested and works | with CSS and XPath selectors. | | [0] | https://nokogiri.org/tutorials/parsing_an_html_xml_document.... | triska wrote: | This is very nice! | | For reasoning about tree-based data such as HTML, I also highly | recommend the declarative programming language Prolog. HTML | documents map naturally to Prolog terms and can be readily | reasoned about with built-in language mechanisms. 
For instance, | here is the sample query from the htmlq README, fetching all | elements with id _get-help_ from https://www.rust-lang.org, using | Scryer Prolog and its SGML and HTTP libraries in combination with | the XPath-inspired query language from library(xpath): | ?- http_open("https://www.rust-lang.org", Stream, []), | load_html(stream(Stream), DOM, []), xpath(DOM, | //(*(@id="get-help")), E). | | Yielding: E = element(div,[class="flex flex- | colum ...",id="get-help"],["\n ",element(h4,[],["Get | help!"]),"\n ",element(ul,[],["\n | ...",element(li,[],[element(a,[... = ...],[...])]),"\n | ...",element(li,[],[...]),...|...]),"\n | ...",element(div,[class="la ..."],["\n | ...",element(label,[...],[...]),...|...]),"\n ..."]) ; | false. | | The selector //(*(@id="get-help")) is used to obtain all HTML | elements whose _id_ attribute is get-help. On backtracking, all | solutions are reported. | | The other example from the README, extracting all _links_ from | the page, can be obtained with Scryer Prolog like this: | ?- http_open("https://www.rust-lang.org", Stream, []), | load_html(stream(Stream), DOM, []), xpath(DOM, | //a(@href), Link), portray_clause(Link), | false. | | This query uses forced backtracking to write all links on | standard output, yielding: "/". | "/tools/install". "/learn". "https://play.rust- | lang.org/". "/tools". "/governance". | "/community". "https://blog.rust-lang.org/". | "/learn/get-started". etc. | chriswarbo wrote: | Thanks, that's a rare example of something which is (a) simple | enough to understand for a Prolog-newbie like me, and (b) more | practical than ubiquitous family-tree example. | | I'm always looking for opportunities to dip my toes into | Prolog; in hindsight it's clearly a good fit for tree- | structured data structures. | samhw wrote: | Interestingly, the only other context in which I've come | across Prolog is from friends who studied at Cambridge, here | in the UK. 
| For some reason, the CS 'tripos' (course) there is really heavily focussed on Prolog, and everyone I know from there ended up a huge fan of the language. I'm not sure why that's the case, though, given that almost all other universities seem to use more common languages (Java, C++, etc.).
| zimpenfish wrote:
| cs.man.ac.uk, at least back in 1992, had a compulsory Prolog module in the first year. Don't know anyone from then who didn't hate that module with a burning passion.
|
| (There was no Java, C++, etc. either. It was SML, Pascal, 68000, and Oracle Pascal-Embedded-SQL.)
| ramses0 wrote:
| "Prolog as a library" => Given "functional" constraints => $CONSTRAINTS.prolog( "query..." ) => results
|
| ...many languages (similar to regex / state-machine) can benefit greatly from offloading a portion to something Prolog-ish, but it's unfortunate that Prolog knowledge isn't as widely distributed.
| WickyNilliams wrote:
| I studied CS at a different university in the UK and we used Prolog for one module on AI or perhaps machine vision. I really enjoyed working with it. This was 15 years ago. Looking through their current curriculum I can't see Prolog being mentioned anymore. Shame!
| pandatigox wrote:
| I tried to run this on my computer just now, but as a complete Prolog noob, I'm getting errors running the script. How do you load the http_open module/library in the first place? I tried following some Prolog tutorials in the past but I always get stuck trying to run something in the REPL. I'm using scryer-prolog. Thanks in advance!
| triska wrote:
| The libraries I mentioned can be loaded by invoking the use_module/1 predicate on the toplevel. Here is the complete transcript that loads the SGML, HTTP and XPath libraries in Scryer Prolog:
|     ?- use_module(library(sgml)).
|        true.
|     ?- use_module(library(http/http_open)).
|        true.
|     ?- use_module(library(xpath)).
|        true.
|
| The second query also uses portray_clause/1 from library(format), which you can load with:
|     ?- use_module(library(format)).
|        true.
|
| After all these libraries are loaded, you can post the sample queries from above, and they should work.
|
| There are also other ways to load these libraries: A very common way to load a library is to use the use_module/1 _directive_ in Prolog source files. In that case, you would put for example the following 4 directives in a Prolog source file, say sample.pl:
|     :- use_module(library(sgml)).
|     :- use_module(library(http/http_open)).
|     :- use_module(library(xpath)).
|     :- use_module(library(format)).
|
| And then run sample.pl with:
|     $ scryer-prolog sample.pl
|
| You can then again post the goals from above on the toplevel, and it will work too.
|
| Another way is to put these directives in your ~/.scryerrc configuration file, which is automatically consulted when Scryer Prolog starts. I recommend doing this for libraries you frequently need. Common candidates for this are for example library(dcgs), library(lists) and library(reif).
|
| Personally, I start Scryer Prolog from within Emacs, and I have set up Emacs so that I can consult a buffer with Prolog code, and also post queries and interact with the Prolog toplevel from within Emacs.
| pandatigox wrote:
| Wow, that works fantastically! Thank you for that. It almost seems like magic.
| okasaki wrote:
| It's pretty easy in Python too, e.g.:
|     >>> import requests
|     >>> from bs4 import BeautifulSoup
|     >>> soup = BeautifulSoup(requests.get("https://www.rust-lang.org").text, "html.parser")
|     >>> [x["href"] for x in soup.find_all("a")]
|     ['/', '/tools/install', '/learn', 'https://play.rust-lang.org/', '/tools', '/governance', '/community', 'https://blog.rust-lang.org/',...
| triska wrote:
| In a certain sense (for example, when measuring brevity), it is indeed easy to write this example in Python.
| However, the Python version also illustrates that many different language constructs are needed to express the intended functionality. In comparison to Prolog, Python is quite a complex language with many different language constructs, including loops, objects, methods, assignment, dictionaries etc., all of which are used in this example.
|
| As I see it, a key attraction of Prolog is its simplicity: With a single language construct (Horn clauses), you are able to express all known computations, and the example queries I posted show that only a single language element, namely again Horn clauses to express a query, is needed to run the code. The Prolog query, and also every Prolog clause, is itself a Prolog term and can be inspected with built-in mechanisms.
|
| As a consequence, an immediate benefit of using Prolog for such use cases is that you can easily reason about user-specified queries in your applications, and for example easily allow only a safe subset of code to be run by users, or execute a user-specified query with different execution strategies etc. In comparison, Python code is much harder to analyze and restrict to a particular subset due to the language's comparatively high syntactic complexity.
| the_jeremy wrote:
| The benefit of Python is that developers already know about these language constructs, and that more developers know Python than Prolog.
| lostcolony wrote:
| I don't think the OP's point was "how easy it would be to hire developers", or even "taking all the considerations a business is under, I feel Prolog makes sense". He was just touting how easy Prolog's built-in pattern matching and declarative style make implementing and using selectors at a language level.
|
| Honestly, if we didn't talk about the benefits of a language irrespective of how easy it is to hire for it, we'd never have introduced anything beyond FORTRAN, if we even made it that far.
| Bringing "X is easier to hire for" into a conversation about the language is, at best, a non-sequitur.
| notriddle wrote:
| We might have been better off that way. FORTRAN does have its downsides, but language churn itself has downsides that almost always outweigh the assumed upsides of a better language.
|
| If we had just stuck with FORTRAN forever, how many problems would have been completely avoided!? There'd be better, and more, IDEs, since even if the language is hard to parse, it's still just one parser that needs all the effort. So many unfortunate problems in education caused by language and ecosystem churn would have been avoided (the infamous "by the time you graduate, it's always outdated" problem).
|
| The only problem is that FORTRAN is too new. Should've stuck with the Hollerith tabulator.
| jfmc wrote:
| AFAIK, this was first proposed and implemented in Ciao Prolog back in the late 90s (modern versions here: https://ciao-lang.org/ciao/build/doc/ciao.html/html.html). It was way before Python was popular and JavaScript ever existed.
| parhamn wrote:
| I've been looking for a library that can find the best set of selectors to most consistently find the element you're looking for in a page.
|
| Any pointers to something that exists? Interestingly, I've also found very little for DOM extraction in the OS ML space.
___________________________________________________________________
(page generated 2021-09-07 23:01 UTC)