hngopher.com

       [HN Gopher] Htmlq: like jq, but for html
       ___________________________________________________________________
        
       Author : jabo
       Score  : 849 points
       Date   : 2021-09-07 07:12 UTC (15 hours ago)
        
 (HTM) web link (github.com)
        
       | dfederschmidt wrote:
       | This looks very useful, big fan of all the ^[a-z]+q$ utilities
       | out there. But as a user, I would probably want to use XPath[0]
       | notation here. Maybe that is just me. A quick search revealed
       | xidel[1] which seems to be similar, but supports XPath.
       | 
       | [0]https://en.wikipedia.org/wiki/XPath
       | [1]https://github.com/benibela/xidel
        
         | chriswarbo wrote:
         | My web scraping tends to start with xidel. If I need a little
         | bit more power I'll use xmlstarlet. If neither of those is
         | enough, I'll use Python's beautifulsoup package :)
        
         | waynenilsen wrote:
         | part of the problem with this is that HTML is mostly not valid
         | XML
        
         | akie wrote:
         | I'd like to state my support for the author's choice of CSS
         | selectors in this particular use case. I think it's a natural
         | fit for this domain and already very well known, perhaps even
         | known better than XPath.
        
           | mirekrusin wrote:
           | The Playwright people had to solve this for themselves: you
           | can mix CSS and XPath selectors because the two are
           | distinct, and they added a few small custom modifications
           | to help with selectors. Playwright-compatible selectors
           | would be nice.
        
           | berkes wrote:
           | I'd like to add my support here too, but with a note.
           | 
           | When scraping and parsing (or writing integration test DSL),
           | I always start out with CSS selectors. But always hit cases
           | where they lack or require hoop-jumping and then fall back on
           | Xpath. I then have a codebase with both CSS-Sel and Xpath,
           | which is arguably worse than having only one method.
           | 
           | I suspect that here, one uses this tool until CSS selector
           | limitations get in the way, after which one switches to
           | another tool(chain).
        
             | alpha_squared wrote:
             | Do you mind giving an example? I'm having trouble following
             | where CSS is limited for selection.
        
               | berkes wrote:
               | Like the other commenter says: parent/child. But also
               | selecting by content (e.g. "click the button with the
               | delete-icon" or "find the link with '@harrypotter'"), or
               | selecting by attributes (e.g. click the pager-item that
               | goes to next page), or selecting items outside of body
               | (e.g. og-tags, title etc). All are doable in CSS3
               | selectors, but everything shouts that they are not meant
               | for this, whereas XPath does this far more naturally.
        
               | unspecified wrote:
               | Searching text content is my main remaining use of XPath.
        
               | benibela wrote:
               | XPath does general data processing not just selection
               | 
               | E.g. when you have a list of numbers on the website,
               | XPath can calculate the sum or the maximum of the numbers
               | 
               | Or you have a list of names "Last name, First name", then
               | you can remove the last name and sort the first names
               | alphabetically. Or count how often each name occurs and
               | return the most popular name.
               | 
               | Then it goes back to selection, e.g. select all numbers
               | that are smaller than the average. Or calculate the most
               | popular name, then select all elements containing that
               | name
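The kind of processing described above can also be sketched outside XPath. The following is a minimal Python illustration of the same computations (sum, maximum, values below the average), not XPath itself; the HTML snippet and the `NumberCollector` class are made up for the example. XPath 2.0 offers functions such as sum(), max() and avg() for this.

```python
from html.parser import HTMLParser


class NumberCollector(HTMLParser):
    """Collect the numeric text of every <li> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.numbers = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_li and text:
            self.numbers.append(float(text))


# Assumed sample input: a list of numbers on a page.
html = "<ul><li>3</li><li>10</li><li>5</li></ul>"
collector = NumberCollector()
collector.feed(html)
collector.close()

numbers = collector.numbers
total = sum(numbers)            # what an XPath sum() over the items would give
largest = max(numbers)          # what an XPath 2.0 max() would give
average = total / len(numbers)
# "Back to selection": keep only the numbers smaller than the average.
below_average = [n for n in numbers if n < average]
```

The point of the comment stands: with XPath the aggregation and the follow-up selection live in one query, whereas here they are spread over a parser class and ordinary Python.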
        
               | vlunkr wrote:
               | Well, the big one is selecting a parent from the child.
        
               | androceium wrote:
               | You could do this with the :has() CSS pseudo-class[0],
               | though inverted (select a parent that _has_ the child
               | matching a selector).
               | 
               | Looks like that pseudo-class has not been implemented in
               | the kuchiki library that htmlq uses though.
               | 
               | [0]: https://developer.mozilla.org/en-US/docs/Web/CSS/:has
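For comparison, here is a small sketch of both directions using the limited XPath subset in Python's standard `xml.etree.ElementTree`. It assumes well-formed markup (ElementTree is an XML parser, not an HTML one), and the sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumed sample input; it is well-formed, so an XML parser accepts it.
doc = ET.fromstring(
    "<ul>"
    "<li id='a'><a href='/x'>link</a></li>"
    "<li id='b'>plain text</li>"
    "<li id='c'><a href='/y'>another</a></li>"
    "</ul>"
)

# Predicate form: every <li> that has an <a> child.
# This is similar in spirit to the CSS selector li:has(> a).
parents = doc.findall(".//li[a]")
parent_ids = [li.get("id") for li in parents]

# Child-to-parent form: start from each <a> and step up with "..".
parents_via_child = doc.findall(".//a/..")
parents_via_child_ids = [li.get("id") for li in parents_via_child]
```

Both queries return the same two list items; the second one is the "select the parent from the child" direction that plain CSS selectors lack.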
        
             | Jenk wrote:
             | I've not had much friction using either, they are "close
             | enough" that the time to (re)write a query from one to the
             | other is not very significant.
        
         | lilyball wrote:
         | This looks really neat! It supports a bunch of different query
         | types, and can even do things like follow links to get info
         | about the linked-to pages!
         | 
         | It's also in nixpkgs, though for some reason the nixpkgs
         | derivation is marked as linux-only (i.e. not Darwin). (Edit:
         | probably because the fpc dependency is also Linux-only, with a
         | linux-specific patch and a comment suggesting that supporting
         | other platforms would require adding per-platform patches)
        
         | exyi wrote:
         | Thanks, this looks more powerful. It supports CSS, XPath and
         | XQuery. Maybe I could learn a bit of XQuery when I have a use
         | case for it :)
        
           | dmit wrote:
           | Well, here's your first lesson then: if you prepend (: to
           | your comment it will become a valid XQuery document!
           | 
           | (: XQuery comments are marked by mirrored smilie faces, like
           | this. :)
        
             | bdcravens wrote:
             | Nice - I've been writing XQuery for years and I had no clue
        
       | firefoxd wrote:
       | Super useful. You've created a fantastic tool here. Thank you.
        
       | d--b wrote:
       | Just being that guy: is there a reason you didn't call it hq?
        
         | zamadatix wrote:
         | Not author but neither is the poster: Jq got away with it
         | because it's one of the few 2 letter combinations that wasn't
         | absolutely overloaded and "jquery" was already taken. OTOH
         | nobody shortens HTML to H and HQ is an extremely common
         | acronym, if not one of the most popular 2 letter acronyms you
         | could pick.
        
           | OJFord wrote:
           | jq didn't get away with it! Have you never tried searching
           | for anything to do with it? How I _wish_ it were called
           | `jsonq`!
        
         | mgdm wrote:
         | I just wanted to be slightly more descriptive and less likely
         | to collide with other tools.
        
           | notRobot wrote:
           | Hahah, I love how this is your second comment in 10 years on
           | HN.
        
             | mgdm wrote:
             | Hah. Yeah. I had another account for a little while but
             | then HN started to let me reset the password for this one
             | quite recently, so here I am.
        
       | Snd_ wrote:
       | This is great! Thanks
        
       | who-shot-jr wrote:
       | Good work!
        
       | harperlee wrote:
       | Nice!
       | 
       | This is the kind of obvious tool that, once it exists, makes it
       | hard to grok that it did not exist earlier and took until now
       | to appear.
        
         | dmos62 wrote:
         | > grok
         | 
         | A good opportunity to introduce `gron` to those unfamiliar!
         | 
         |     > gron "https://api.github.com/repos/tomnomnom/gron/commits?per_page=1" | fgrep "commit.author"
         |     json[0].commit.author = {};
         |     json[0].commit.author.date = "2016-07-02T10:51:21Z";
         |     json[0].commit.author.email = "mail@tomnomnom.com";
         |     json[0].commit.author.name = "Tom Hudson";
         | 
         | https://github.com/tomnomnom/gron
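The idea behind that output, one full path per line so that grep can match on it, can be sketched in a few lines of Python. This is a simplified illustration and not gron's actual implementation: the `flatten` helper is hypothetical, it does not sort keys, and it does not quote keys that are not valid identifiers.

```python
import json


def flatten(value, path="json"):
    """Yield gron-style assignment lines for a parsed JSON value (simplified)."""
    if isinstance(value, dict):
        yield f"{path} = {{}};"
        for key, child in value.items():
            yield from flatten(child, f"{path}.{key}")
    elif isinstance(value, list):
        yield f"{path} = [];"
        for index, child in enumerate(value):
            yield from flatten(child, f"{path}[{index}]")
    else:
        # Scalars are written back out as JSON literals.
        yield f"{path} = {json.dumps(value)};"


# Assumed sample input, loosely modelled on the API response above.
sample = json.loads('[{"commit": {"author": {"name": "Tom Hudson"}}}]')
lines = list(flatten(sample))

# The equivalent of piping through fgrep "commit.author".
matches = [line for line in lines if "commit.author" in line]
```

Because every line carries its complete path, a plain substring match is enough to find a nested field.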
        
           | rsync wrote:
           | "A good opportunity to introduce `gron` to those unfamiliar!"
           | 
           | Thank you - appreciated.
           | 
           | I haven't done much work with json but have had reasons
           | recently to do so - and I immediately saw how difficult it
           | was to pipeline to grep ...
           | 
           | But what I still don't understand is that some json outputs I
           | see have multiple values _with the exact same name_ (!) and
           | that still seems "un-grep-able" to me ...
           | 
           | What am I missing ?
        
             | dmos62 wrote:
             | You might be missing a change in index: `obj[0].prop` vs
             | `obj[1].prop`. Or, your JSON might have the same property
             | defined multiple times: `{a:1, a:2}` (though I'm not sure
             | how gron handles that situation).
        
               | lvncelot wrote:
               | > (though I'm not sure how gron handles that situation).
               | 
               | It seems both gron and jq only use the value that has
               | been defined last:
               | 
               |     ~  echo '{"a":1,"a":2}' | gron
               |     json = {};
               |     json.a = 2;
               |     ~  echo '{"a":1,"a":2}' | jq
               |     {
               |       "a": 2
               |     }
        
             | croon wrote:
             | The json output likely contains multiple objects. Can you
             | request more specifically the object(s) you need and grep
             | on that?
        
             | dotancohen wrote:
             | > But what I still don't understand is that some json
             | > outputs I see have multiple values with the exact same
             | > name
             | 
             | This is neither explicitly allowed nor explicitly forbidden
             | by the JSON spec. How to handle it is implementation
             | dependent: does one value override the other? Should they
             | be treated as an array?
             | 
             | In practice, this situation is usually carefully avoided by
             | services that produce JSON. If you are interfacing with a
             | service that does produce duplicate values, I'd be
             | interested in seeing it for curiosity's sake. If you are
             | writing a service and this is the output, then I implore
             | you to reconsider!
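To make "implementation dependent" concrete, here is a sketch using Python's standard `json` module. By default it keeps the last value for a repeated name; the `object_pairs_hook` parameter lets a caller choose a different policy. The two hook functions below are hypothetical helpers written for this example.

```python
import json


def reject_duplicates(pairs):
    """object_pairs_hook that raises on repeated keys instead of overwriting."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


def collect_duplicates(pairs):
    """object_pairs_hook that gathers the values of repeated keys into a list."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    # Keys seen once keep their plain value; repeated keys become lists.
    return {k: (v[0] if len(v) == 1 else v) for k, v in grouped.items()}


text = '{"a": 1, "a": 2, "b": 3}'

default = json.loads(text)  # default policy: the last value wins
collected = json.loads(text, object_pairs_hook=collect_duplicates)

try:
    json.loads(text, object_pairs_hook=reject_duplicates)
    rejected = False
except ValueError:
    rejected = True
```

The same input yields three different outcomes depending on the policy, which is why producers are better off never emitting duplicate names.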
        
         | ptwt wrote:
         | It did write it a few years ago.
         | 
         | https://github.com/plainas/tq
        
         | matsemann wrote:
         | There are already tools for xpath, but using css selectors is
         | much more aligned with what I write every day, so that's nice.
        
           | harperlee wrote:
           | Yes, and awk and others. I meant something semantically
           | closer to the need, with css selectors.
        
         | natrys wrote:
         | It's not novel obviously. I have been using pup[1] for years.
         | And xidel[2] is probably older.
         | 
         | [1] https://github.com/ericchiang/pup
         | 
         | [2] https://github.com/benibela/xidel
        
       | ducktective wrote:
       | Looks nice! Any comparisons with pup?
       | 
       | https://github.com/ericchiang/pup
        
       | notorandit wrote:
       | Next is xmlq: https://github.com/dscape/xmlq
        
       | unityByFreedom wrote:
       | Why not just jquery?
        
       | purplecats wrote:
       | brilliant. does this spin up a heavy DOM implementation in the
       | background or do something lighter such as regexp?
        
         | mdzn wrote:
         | You can't parse HTML with regexps. It's not a regular language.
        
           | underdeserver wrote:
           | https://stackoverflow.com/questions/1732348/regex-match-open...
        
           | carnitine wrote:
           | What language implements regexps that actually correspond to
           | regular languages though?
        
         | Deukhoofd wrote:
         | Looks like it uses Servo's html5ever (through kuchiki), so no
         | DOM representation.
        
           | chrismorgan wrote:
           | Kuchiki materialises what they call a "DOM-like tree". I'd
           | consider it a DOM tree, myself, despite the differences in
           | precise API.
           | 
           | But it's not using a full browser to back it, which I suspect
           | is what's really being asked.
        
         | mrweasel wrote:
         | It looks to be using html5ever to parse the HTML, similar to
         | something like BeautifulSoup in Python.
        
         | delusional wrote:
         | The source is right there. You can read it. It uses html5ever
         | (part of the servo project).
        
         | gostsamo wrote:
         | You can't parse html with regular expressions :)
         | 
         | https://stackoverflow.com/questions/1732348/regex-match-open...
        
           | bmn__ wrote:
           | "Oh Yes You Can Use Regexes to Parse HTML!"
           | 
           | https://stackoverflow.com/a/4234491
        
             | anon4242 wrote:
             | Yeah, if you allow yourself some Perl to help you with
             | those parts that regexes can't handle...
        
             | akie wrote:
             | Technically correct, but did you see the regex he uses? It
             | spans 82 lines...
        
           | andybak wrote:
           | And the obligatory caveat from the comments:
           | 
           | > While parsing arbitrary HTML with only a regex is
           | > impossible, it's sometimes appropriate to use them for
           | > parsing a limited, known set of HTML.
        
             | hnbad wrote:
             | The emphasis here is on "known". The tool is general
             | purpose (i.e. handling _unknown_ HTML) so using regexes
             | would be ill-advised.
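A small sketch of why unknown HTML is the hard case, using only Python's standard library. The sample markup and the naive pattern are invented for the example; the regex is deliberately simple and is not meant as the best possible pattern.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from real <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Assumed sample input with two traps: a link inside a comment, and a ">"
# inside a quoted attribute value.
html = '''<p><a href="/one">one</a>
<!-- <a href="/commented-out">hidden</a> -->
<a title="x > y" href="/two">two</a></p>'''

# Naive regex: anything that looks like an <a ... href="..."> tag.
naive = re.findall(r'<a[^>]*href="([^"]*)"', html)

# Parser-based extraction.
collector = LinkCollector()
collector.feed(html)
collector.close()
parsed = collector.links
```

On this input the naive pattern picks up the commented-out link and misses the one whose tag contains a quoted `>`, while the parser reports the two real links. For a limited, known set of pages the regex may still be good enough, which is the caveat quoted above.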
        
       | dredmorbius wrote:
       | See also the html-xml-utils from w3c.
       | 
       | hxextract and hxselect perform similar extract functions.
       | 
       | hxclean and hxnormalize (combined) will pretty-print HTML.
       | 
       | https://www.w3.org/Tools/HTML-XML-utils/
        
         | mozey wrote:
         | Funny, couple of years ago I thought someone should create
         | something for JSON similar to what
         | [XSLT](https://en.wikipedia.org/wiki/XSLT) is for XML. See
         | example here https://www.w3schools.com/xml/xsl_intro.asp
         | 
         | Then I found out about jq because awscli was using it in
         | example docs.
         | 
         | I guess `htmlq` makes sense if it has the exact same syntax as
         | `jq`, and the user is already familiar with the latter?
        
       | desktopninja wrote:
       | Very nice tool. I've long spoiled myself with Powershell's
       | Invoke-WebRequest, e.g.:
       | 
       |     # what is the latest release of apache-tomcat?
       |     $LINKS=$(Invoke-WebRequest -Uri 'https://tomcat.apache.org/download-80.cgi' | Select-Object -ExpandProperty Links)
       |     $LATEST=$($Links | Where-Object -Property href -Match '#8.5.[0-9]+').href.substring(1)
       |     $FETCH=$($Links | Where-Object -Property href -match "apache-tomcat-${LATEST}.zip$").href
        
         | Tepix wrote:
         | Should it be $LINKS instead of $Links (2x)?
        
           | desktopninja wrote:
           | "$links" works too because PWSH is not case sensitive. But I
           | should have used $LINKS like you said for cleaner write-up.
        
       | systemvoltage wrote:
       | This is nifty! Python + bs4 takes some googling to remember how
       | to parse a webpage. This is just straightforward, thanks so
       | much.
        
       | jillesvangurp wrote:
       | If you make the html well formed, xpath also works great. Great
       | stuff if you ever need to pick html apart. Used this quite a bit
       | when microformats were still a thing together with jtidy.
       | 
       | Jq is very loosely inspired by that, I guess. Might come full
       | circle here and use some XSL transformations ...
        
         | qw wrote:
         | You can usually find a html parser for your language, that you
         | can use xpath/xsl on. It will just make the same assumptions
         | that the browser does, by adding missing closing tags etc.
         | 
         | I made a tool that extracted parts of web pages 10-15 years
         | ago, and it worked well. There are of course cases where the
         | html is so unstructured that the results were unpredictable,
         | but it worked well in general.
        
       | ludovicianul wrote:
       | And a Java version with pre-compiled binaries:
       | https://github.com/ludovicianul/hq
        
       | srg0 wrote:
       | "htmlq: like jq, but for HTML"
       | 
       | "jq is like sed for JSON data"
       | 
       | sed: "While in some ways similar to an editor which permits
       | scripted edits (_such as ed_), sed works by making only one pass
       | over the input(s)"
       | 
       | ed: "ed is a line-oriented text editor".
       | 
       | Software definition through a reference to another software is
       | somewhat confusing. Potential users come from different
       | backgrounds (I had no idea what jq is), and it is not clear what
       | the defining features of each project are. Is jq line oriented?
       | Is htmlq operating in a single pass?
        
         | digitalsushi wrote:
         | "htmlq is like jq but for html" is a very specific 'dog
         | whistle' for people who use jq. I agree that people who don't
         | know what jq is will get no value and pay no attention. But for
         | people who use jq, the claim is, like a dog whistle, clear,
         | concise, and means exactly what it says. In two seconds,
         | everyone using both jq and html will instantly know what is
         | available and log it away.
         | 
         | So for general purposes, it's a terrible marketing pitch. And
         | yet I think it's a very, very valuable demonstration of knowing
         | some of their 'customers'.
        
           | acomar wrote:
           | this isn't what a dogwhistle is. it's just explanation by
           | analogy to a model presumed to be shared by the intended
           | audience. a dogwhistle offers a surface meaning to the
           | uninitiated that's anodyne but communicates a hidden, coded
           | message to those who possess some undisclosed, shared
           | knowledge with the author. this kind of analogy entirely
           | lacks the surface meaning and the message shared via jargon
           | also communicates something about how you might learn enough
           | to understand the analogy.
        
           | philipswood wrote:
           | I can't speak for people who don't know jq, but knowing jq,
           | this is a great tagline: it gives me an immediate
           | understanding of what it does, how I could expect to use it
           | and what value and ease of use I can expect.
           | 
           | I'll be trying it out next time I'm on a PC.
        
             | rendall wrote:
             | > _I can't speak for people who don't know jq,_
             | 
             | I can, and it's not illuminating at all.
        
         | throwaway2016a wrote:
         | I agree, however if you do know how to use jq then "like jq,
         | but for html" is extremely effective. I use jq all the time and
         | that title hooked me, I immediately wanted to try it.
         | 
         | But if you haven't used jq then I can see how that title is
         | less than helpful.
        
         | ducktective wrote:
         | The first three are not proper definitions per-se but kind of
         | an advertisement, trying to familiarize by self-comparison with
         | a _tried & true_ tool that has proven its worth.
         | 
         | _You know Jimmy the famous mechanic? I'm Timmy, _his brother_
         | but an electrician._
         | 
         | IMO, at least `jq` has proven itself as _the_ indispensable
         | tool for json-data manipulation.
        
         | corporealshift wrote:
         | I mean...if you read the github readme it literally describes
         | what it does in the next line: "Uses CSS selectors to extract
         | bits content from HTML files".
        
         | kbenson wrote:
         | > Software definition through a reference to another software
         | is somewhat confusing.
         | 
         | Possibly, depending on background as you note, but not all
         | promotion is aimed at the same audience. When submitting to
         | HN, "like jq, but for X" is short and conveys what it is to
         | most of the people that would care, I think. jq has been
         | submitted and talked about here _many_ times with lively
         | discussion over the years.[1] At this point I think most of
         | those that are interested in what that is and what this is
         | will understand fairly quickly from the title. Those that
         | don't might be missed, or they might look it up like you, or
         | they might see it through some other submission some other
         | time with a different title which isn't based on a chain of
         | references.
         | 
         | 1:
         | https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
        
         | zamadatix wrote:
         | 1st sentence - Explaining the tool for those the tool was made
         | for without beating around the bush.
         | 
         | 2nd sentence - Explaining the tool to folks in the general web
         | domain what it can do for them.
         | 
         | 3rd sentence - Explaining where to learn how to use the tool if
         | you've stumbled across it but web is not your area of
         | expertise.
         | 
         | All that info fits in nearly 25 words, then it lists the options
         | for the tool and jumps straight into multiple examples (with
         | outputs!). If the only explanation had been "htmlq: like jq,
         | but for HTML" I'd agree, but having the comparison to explain
         | what it does isn't a bad thing; it's _only_ having the
         | comparison that would be bad.
         | 
         | Personally I think this is a model example of an opening for a
         | Github readme.
        
           | zerocount wrote:
           | I disagree. The 2nd sentence contains, "extract bits
           | content." What is that?
           | 
           | If you're going to write a minimal introduction, at least
           | make sure it's not confusing.
           | 
           | I get the feeling the author felt compelled to write an
           | introduction and did so with as little effort as possible.
        
             | cyberge99 wrote:
             | I believe he tailored it to his target audience. If you
             | find it confusing, you are likely not it.
        
               | ritchiea wrote:
               | As a web developer for over a decade, "bits content" doesn't
               | mean anything to me. But I understand what the tool does
               | from the rest of the description. Try running a google
               | search for "bits content," [0] it's not a commonly used
               | phrase in web development or anything. It's a poor choice
               | of words.
               | 
               | 0. https://www.google.com/search?hl=en&q=%22bits%20content%22
        
               | chownie wrote:
               | It's supposed to be "bits of content", it's not jargon.
               | The author's just accidentally a word, we all do it.
        
               | ritchiea wrote:
               | It's more than fair to say in technical documentation you
               | intend others to use having a grammatical error or
               | missing word is confusing and a problem. It's the writing
               | equivalent of having a bug in your code. And it's
               | definitely not "writing to a target audience" as the
               | parent comment suggested. We all make mistakes but don't
               | try to call a mistake effective documentation.
        
               | RobertKerans wrote:
               | Of course it is, but neither parent nor anyone else is
               | saying anything close to the mistake being effective
               | documentation. There's a single missing word which needs
               | to be added in, but the overall text is clearly writing
               | to a target audience. You are aware of this, and of how
               | small the mistake is, and you understand what the
               | sentence should read as, so I'm not sure what your point
               | is?
        
             | theIV wrote:
             | My hunch is that this is a typo and it should read "extract
             | bits OF content."
        
               | mgdm wrote:
               | Exactly this! I'll fix it after work.
        
               | rendall wrote:
               | Maybe have the line about "jq" be 2nd. Have the first
               | line be a brief description of what it actually does.
        
               | ritchiea wrote:
               | I agree and having a missing word in your text often
               | leads to confusion :)
               | 
               | Honestly you could drop the "bits" which is a bit
               | redundant and use the phrase "Uses CSS selectors to
               | extract content from HTML files."
        
         | oauea wrote:
         | What's this thing called a "computer" that people keep going on
         | about, anyway?
        
           | da_chicken wrote:
           | It's a person who does mathematical calculations all day. For
           | example, creating range tables for artillery, calculating
           | averages or totals of a large range of values, or solving
           | complex integrals or differential equations, and so on.
           | They're commonly used in industry or government, especially
           | in astronomy, aerospace and civil engineering for both
           | simulation and analysis. Perhaps the most well-known
           | computers were the Harvard Computers, which operated in the
           | late 19th and early 20th centuries.
           | 
           | As a job, computers were largely automated out of existence
           | by solid-state transistor based automated computers and
           | integrated circuit transistor automated computers in the 60s,
           | 70s and 80s, which replaced the enormously expensive and
           | often largely experimental electro-mechanical automated
           | computers while radically reducing cost and improving
           | performance both by several orders of magnitude.
        
           | samstave wrote:
           | Here - this explains it really succinctly:
           | 
           | https://www.youtube.com/watch?v=lE1bS-Mn2Mk
        
           | theandrewbailey wrote:
           | It's like a programmable loom, but for logical and
           | mathematical operations.
        
             | samhw wrote:
             | You may be interested in the symbol grounding problem
             | (https://en.wikipedia.org/wiki/Symbol_grounding_problem#Groun...).
             | It's like the binding problem, but for symbols.
        
           | sundarurfriend wrote:
           | Sort of related: [Expecting Short Inferential
           | Distances](https://www.readthesequences.com/Expecting-Short-Inferential...)
        
         | nextaccountic wrote:
         | jq isn't line-oriented, it's json-oriented. It operates on a
         | stream of JSON values from stdin, so its query is applied to
         | each value in sequence.
         | 
         | I would expect htmlq to run the query a single time for a
         | single HTML document, just like jquery $('#something') or
         | document.querySelector('#something').
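The "query applied to each value in a stream" model can be sketched with Python's standard `json` module. This is an illustration of the idea and not how jq is implemented; `iter_json_values` is a hypothetical helper and the input is made up.

```python
import json


def iter_json_values(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # Skip whitespace (including newlines) between values.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        # raw_decode parses one value and reports where it ended.
        value, pos = decoder.raw_decode(text, pos)
        yield value


# Assumed input: three objects, separated by a newline or a space.
stream = '{"name": "a", "n": 1}\n{"name": "b", "n": 2} {"name": "c", "n": 3}'

# Roughly what `jq '.name'` does: run the same query against every value.
names = [doc["name"] for doc in iter_json_values(stream)]
```

Note that the values are not required to sit one per line, which is the difference between json-oriented and line-oriented processing.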
        
       | zatkin wrote:
       | Why not incorporate this into jq itself, like perhaps adding some
       | command line arguments to switch to HTML mode?
        
         | Deukhoofd wrote:
         | What would the benefits of fitting a HTML parser into a JSON
         | parser tool be?
        
           | lmm wrote:
           | JQ is not just a parser but a tool for doing operations, many
           | of which are (or should be) generic across any tree-like data
           | format. Reusing that part across different input formats
           | makes a lot of sense.
        
           | mjburgess wrote:
           | Well once there's an HTML parser, then a pdf viewer, and then
           | everything needed for PDFs (ie., programming, emailer, video
           | support, etc.) we'll finally have that ideal operating system
           | we've been waiting for.
        
           | mro_name wrote:
           | sounds a lot more like blockchain.
        
         | e12e wrote:
         | Would probably be more useful to implement html2json, and pipe
         | in html?
         | 
         | Ed: eg: https://github.com/Jxck/html2json
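A rough sketch of that idea using only Python's standard library follows. The output shape (`tag`, `attrs`, `children`) is an assumption made for this example and is not the format produced by the linked html2json project. Void elements such as `<br>` written without a closing slash are not handled.

```python
import json
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Build a nested dict/list structure from HTML start and end tags."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "attrs": {}, "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1]["children"].append(text)


# Assumed sample input.
builder = TreeBuilder()
builder.feed('<div id="main"><p class="x">Hello</p></div>')
builder.close()

tree = builder.root["children"][0]
as_json = json.dumps(tree, indent=2)  # this text could then be piped into jq
```

With a converter like this in front, a jq filter could address elements by their position in the tree, although it would not offer CSS selector matching.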
        
       | downWidOutaFite wrote:
       | Why? I find xpath's syntax much simpler and regular than jq's.
        
       | pabs3 wrote:
       | I tend to reach for XPath selectors before CSS ones when querying
       | HTML.
        
       | necovek wrote:
       | Nice, I expected something based on XPath (like xpd), but web
       | developers dealing with HTML are infinitely more familiar with
       | CSS selectors, so a great choice!
        
         | busterarm wrote:
         | I want the option to use both, like Nokogiri gives you.
        
           | necovek wrote:
           | Sure, that sounds nice, but having two simple tools each
           | doing the job well in its own space is perfectly fine for me
           | -- do you imagine needing to combine Xpath and CSS queries in
           | a single run?
        
             | busterarm wrote:
             | I've had to do it when dealing with some poorly-designed
             | XML apis in the past. Nokogiri was a godsend.
        
       | rendall wrote:
       | What is jq?
        
       | mro_name wrote:
       | it's statically linkable rust, isn't it? Awesome. I'm looking for
       | a successor to
       | 
       | $ xmllint --html --xpath ...
       | 
       | that doesn't choke on inline svg.
        
       | gigatexal wrote:
       | This is very cool. This will make scraping the web even easier!
        
       | elif wrote:
       | When I saw the title I thought this was some machine learning-
       | specific rmq/0mq message passing tech called HT. Very excited to
       | zero.
        
       | m4r35n357 wrote:
       | Should be HQ . . .
        
       | pkrumins wrote:
       | Call it "hq".
        
       | jhatemyjob wrote:
       | Crazy how a 300-line codebase manages to amass 2000 stars on
       | Github and 700 upvotes on HN. Amazing ROI.
        
       | gizdan wrote:
       | Once upon a time I was using pup[0] for such things; later I
       | changed to cascadia[1], which seemed much more advanced.
       | 
       | Comparing the two repos, it seems pup is dead, but cascadia may
       | not be.
       | 
       | These tools, including htmlq, seem to sell themselves as "jq for
       | html", which is far from the truth. Jq is closer to awk, in that
       | you can do just about everything with json. Cascadia, htmlq, and
       | pup seem closer to grep for html. They can essentially only
       | select data from an html source.
       | 
       | [0] https://github.com/EricChiang/pup
       | [1] https://github.com/suntong/cascadia
        
         | heavyset_go wrote:
         | I've used pup for a few projects, but was unaware of cascadia.
         | Thanks for pointing it out.
        
         | croon wrote:
         | Well, jq is grep _as well_ as sed and awk, but yeah, htmlq
         | seems to be just grep, for sake of comparison.
         | 
         | But I don't think html has any need for a sed/awk tool, or at
         | least not as much. Json output could very well be piped forward
         | to the next CLI tool after you've changed it slightly with jq.
         | I don't see this scenario as likely with html.
        
           | gizdan wrote:
           | > Well, jq is grep as well as sed and awk, but yeah, htmlq
           | seems to be just grep, for sake of comparison.
           | 
           | Exactly, and that is what I mean. If you want to compare,
           | compare it with grep, not jq.
           | 
           | Someone else posted xidel[0] in this thread, which I've not
           | used, but it seems to be the "jq but for html".
           | 
           | [0] https://github.com/benibela/xidel
        
       | bamdadd wrote:
       | is there a brew install command ?
        
       | mcovalt wrote:
       | I'd like to see a tool using lol-html [0] and their CSS selector
       | API as a streaming HTML editor.
       | 
       | [0] https://github.com/cloudflare/lol-html
        
       | hyperpallium2 wrote:
       | From examples, this is only like jq in the sense that the q
       | stands for the same thing. Even the way it does that is
       | different.
       | 
       | An xmlq that was really like jq would be fun, about 20 years ago.
        
         | cerved wrote:
         | I would still like xmlq, there are (regrettably) still a lot of
         | applications that store data and configuration in xml
        
         | dotancohen wrote:
         | There is `xq` today, which parses XML like `jq`. I think that
         | it is relatively unknown because it is part of the `yq` package
         | for parsing YAML. So just install `yq` via pip and you'll get
         | `xq` as well.
         | 
         | There is also `xmlstarlet` for parsing XML in a similar
         | fashion.
        
           | hyperpallium2 wrote:
           | xmlstarlet is really nothing like jq, as a language. But yes,
           | I use it because it is the best commandline xml processor I'd
           | found. That's the only similarity to jq.
           | 
           | Is this the yq? https://kislyuk.github.io/yq/ It does contain
           | an 'xq', as a literal wrapper for jq, piping output into it
           | after transcoding XML to JSON using xmltodict
           | https://github.com/martinblech/xmltodict (which explodes xml
           | into separate JSON data structures).
           | 
           | This is a bash one-liner! But TBF it really is a 'jq for
           | xml'. I think it would be horrible for some things, but you
           | could also do a lot of useful things painlessly.
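
The XML-to-JSON step described above can be sketched with only the Python
standard library. This is a rough illustration, not the code that `xq` or
xmltodict actually runs: the function names `element_to_dict` and
`xml_to_json` and the sample document are made up for the example, and only
the `@` prefix for attributes and the `#text` key are borrowed from
xmltodict's conventions.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an element to a dict or string, loosely in xmltodict's style."""
    node = {}
    # Attributes get an "@" prefix so they cannot collide with child tags.
    for name, value in elem.attrib.items():
        node["@" + name] = value
    # Repeated child tags collapse into a list; a single child stays scalar.
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if not node:
        # Leaf element without attributes: just its text (None when empty).
        return text or None
    if text:
        node["#text"] = text
    return node


def xml_to_json(xml_text):
    """Parse an XML string and return it as pretty-printed JSON."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_dict(root)}, indent=2)


# Hypothetical sample input, standing in for a real config file.
sample = (
    '<config>'
    '<store id="1"><name>Main</name></store>'
    '<store id="2"><name>Outlet</name></store>'
    '</config>'
)
print(xml_to_json(sample))
```

Once the data is JSON, a jq filter such as `.config.store[].name` can be
applied to it, which is roughly the division of labour the comment describes.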
        
             | dotancohen wrote:
             | Thank you for the comments. I've only recently discovered
             | both tools, and literally used them once each. Of the two
             | `xq` was easier for my particular work case (parsing a
             | Magento config) but I keep both tools in my virtual
             | toolbox.
             | 
             | If you have any other suggestions for parsing XML for
             | exploratory purposes I'm very happy to hear them.
        
               | hyperpallium2 wrote:
               | Thanks! Not actually a recommendation, but I have used
               | xsltproc (command line xslt), but it is horrible to use
               | because xslt syntax is horrible (though xslt's concepts
               | are pretty cool). One thing is it enables you to use
               | XPath in all its glory.
               | 
               | Just installed xq. It's nice just seeing the pretty-
               | printed json output, so thanks for the pointer. Probably
               | better than xmlstarlet for my usage, which just queries
               | and outputs text, not xml. hmmm, that's probably true for
               | most commandline uses...
        
           | jle17 wrote:
           | Just looked into this and I think it's worth mentioning that
           | there are two different projects called `yq`. The first one
           | that came up (written in go instead of python) is not the
           | right one and doesn't have the `xq` tool.
        
       | abledon wrote:
       | is anyone else using the https://github.com/json-path/JsonPath
       | over the jq route?
       | 
       | I hope we standardize on some jq query language, like we have
       | with a base set of SQL syntax
        
       | andybak wrote:
       | > like jg
       | 
       | "jq is a lightweight and flexible command-line JSON processor"
        
       | chefandy wrote:
       | If anyone is looking for a good library to do this in Python,
       | PyQuery works well:
       | 
       | https://pythonhosted.org/pyquery/
        
       | teitoklien wrote:
       | Maybe call it hq ?
        
         | Simplicitas wrote:
         | My thoughts EXACTLY... but anyway, great new utility indeed!
        
           | teitoklien wrote:
           | Haha, indeed it's a very good utility :D
        
       | oauea wrote:
       | https://jsoup.org/ has been around for a long time and seems a
       | bit more mature & maintained than this two-code-files 2-year-old
       | repo. Highly recommend.
        
       | avereveard wrote:
       | what's wrong with using html tidy + xmllint ?
        
         | mro_name wrote:
         | nothing wrong. Searching unmodified html though is sometimes
         | preferable.
        
       | soheil wrote:
       | I'd use something like this script that you can put together
       | yourself:
       | 
       |     #!/usr/bin/env ruby
       |     require 'nokogiri'; p Nokogiri::HTML(STDIN.read).css(ARGV[0]).text
       | 
       | Just save it as _/usr/local/bin/hq_ and do _chmod +x !$_
       | 
       | Then you can do:
       | 
       |     curl -s "https://news.ycombinator.com/news" | hq "tr:first-child .storylink"
       | 
       | It uses Nokogiri[0], which is much more battle tested and works
       | with CSS and XPath selectors.
       | 
       | [0]
       | https://nokogiri.org/tutorials/parsing_an_html_xml_document....
        
       | triska wrote:
       | This is very nice!
       | 
       | For reasoning about tree-based data such as HTML, I also highly
       | recommend the declarative programming language Prolog. HTML
       | documents map naturally to Prolog terms and can be readily
       | reasoned about with built-in language mechanisms. For instance,
       | here is the sample query from the htmlq README, fetching all
       | elements with id _get-help_ from https://www.rust-lang.org, using
       | Scryer Prolog and its SGML and HTTP libraries in combination with
       | the XPath-inspired query language from library(xpath):
       | 
       |     ?- http_open("https://www.rust-lang.org", Stream, []),
       |        load_html(stream(Stream), DOM, []),
       |        xpath(DOM, //(*(@id="get-help")), E).
       | 
       | Yielding:
       | 
       |     E = element(div,[class="flex flex-colum ...",id="get-help"],
       |           ["\n        ",element(h4,[],["Get help!"]),"\n        ",
       |            element(ul,[],["\n ...",element(li,[],[element(a,[... = ...],[...])]),
       |                           "\n ...",element(li,[],[...]),...|...]),
       |            "\n ...",element(div,[class="la ..."],
       |                             ["\n ...",element(label,[...],[...]),...|...]),
       |            "\n    ..."])
       |     ;   false.
       | 
       | The selector //(*(@id="get-help")) is used to obtain all HTML
       | elements whose _id_ attribute is get-help. On backtracking, all
       | solutions are reported.
       | 
       | The other example from the README, extracting all _links_ from
       | the page, can be obtained with Scryer Prolog like this:
       | 
       |     ?- http_open("https://www.rust-lang.org", Stream, []),
       |        load_html(stream(Stream), DOM, []),
       |        xpath(DOM, //a(@href), Link),
       |        portray_clause(Link),
       |        false.
       | 
       | This query uses forced backtracking to write all links on
       | standard output, yielding:
       | 
       |     "/".
       |     "/tools/install".
       |     "/learn".
       |     "https://play.rust-lang.org/".
       |     "/tools".
       |     "/governance".
       |     "/community".
       |     "https://blog.rust-lang.org/".
       |     "/learn/get-started".
       |     etc.
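
For readers who do not know Prolog, the idea of treating a document as nested
terms and querying every descendant can be sketched in Python with the
standard library. This is only an analogy under stated assumptions: the
`("element", name, attrs, children)` tuple shape imitates the
`element(Name, Attrs, Children)` terms shown above, and `TermBuilder`,
`descendants`, `parse` and the sample markup are hypothetical names and data
invented for this example, not part of Scryer Prolog or htmlq.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be left open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TermBuilder(HTMLParser):
    """Build nested ("element", name, attrs, children) tuples from HTML."""

    def __init__(self):
        super().__init__()
        self.root = ("element", "#root", [], [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ("element", tag, attrs, [])
        self.stack[-1][3].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][1] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1][3].append(data)


def descendants(term):
    """Yield every element term in document order (like the `//` axis)."""
    for child in term[3]:
        if isinstance(child, tuple):
            yield child
            yield from descendants(child)


def parse(html):
    """Parse an HTML string into a root term."""
    builder = TermBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Hypothetical markup standing in for the fetched page.
doc = parse('<div id="get-help"><h4>Get help!</h4>'
            '<a href="/learn">Learn</a></div><a href="/tools">Tools</a>')

# Roughly //(*(@id="get-help")): any element whose id is get-help.
by_id = [e for e in descendants(doc) if ("id", "get-help") in e[2]]

# Roughly //a(@href): the href of every <a> element that has one.
links = [dict(e[2])["href"] for e in descendants(doc)
         if e[1] == "a" and "href" in dict(e[2])]

print([e[1] for e in by_id], links)
```

The list comprehensions play the role that backtracking plays in the Prolog
queries: each match of the pattern becomes one result.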
        
         | chriswarbo wrote:
         | Thanks, that's a rare example of something which is (a) simple
         | enough to understand for a Prolog-newbie like me, and (b) more
         | practical than the ubiquitous family-tree example.
         | 
         | I'm always looking for opportunities to dip my toes into
         | Prolog; in hindsight it's clearly a good fit for tree-
         | structured data structures.
        
           | samhw wrote:
           | Interestingly, the only other context in which I've come
           | across Prolog is from friends who studied at Cambridge, here
           | in the UK. For some reason, the CS 'tripos' (course) there is
           | really heavily focussed on Prolog, and everyone I know from
           | there ended up a huge fan of the language. I'm not sure why
           | that's the case, though, given that almost all other
           | universities seem to use more common languages (Java, C++,
           | etc).
        
             | zimpenfish wrote:
             | cs.man.ac.uk, at least back in 1992, had a compulsory
             | Prolog module in the first year. Don't know anyone from
             | then who didn't hate that module with a burning passion.
             | 
             | (There was no Java, C++, etc. either. It was SML, Pascal,
             | 68000, and Oracle Pascal-Embedded-SQL.)
        
             | ramses0 wrote:
             | "Prolog as a library" => Given "functional" constraints =>
             | $CONSTRAINTS.prolog( "query..." ) => results
             | 
             | ...many languages (similar to regex / state-machine) can
             | benefit greatly from offloading a portion to something
             | prolog-ish, but it's unfortunate that prolog knowledge
             | isn't as widely distributed.
        
             | WickyNilliams wrote:
             | I studied CS at a different university in UK and we used
             | Prolog for one module on AI or perhaps machine vision. I
             | really enjoyed working with it. This was 15 years ago.
             | Looking through their current curriculum I can't see prolog
             | being mentioned anymore. Shame!
        
         | pandatigox wrote:
         | I tried to run this on my computer now, but as a complete
         | Prolog noob, I'm having errors running the script? How do you
         | load the http_open module/library in the first place? I tried
         | following some Prolog tutorials in the past but I always get
         | stuck trying to run something in the REPL. I'm using scryer-
         | prolog. Thanks in advance!
        
           | triska wrote:
           | The libraries I mentioned can be loaded by invoking the
           | use_module/1 predicate on the toplevel, here is the complete
           | transcript that loads the SGML, HTTP and XPath libraries in
           | Scryer Prolog:
           | 
           |     ?- use_module(library(sgml)).
           |        true.
           |     ?- use_module(library(http/http_open)).
           |        true.
           |     ?- use_module(library(xpath)).
           |        true.
           | 
           | The second query also uses portray_clause/1 from
           | library(format), which you can load with:
           | 
           |     ?- use_module(library(format)).
           |        true.
           | 
           | After all these libraries are loaded, you can post the sample
           | queries from above, and it should work.
           | 
           | There are also other ways to load these libraries: A very
           | common way to load a library is to use the use_module/1
           | _directive_ in Prolog source files. In that case, you would
           | put for example the following 4 directives in a Prolog source
           | file, say sample.pl:
           | 
           |     :- use_module(library(sgml)).
           |     :- use_module(library(http/http_open)).
           |     :- use_module(library(xpath)).
           |     :- use_module(library(format)).
           | 
           | And then run sample.pl with:
           | 
           |     $ scryer-prolog sample.pl
           | 
           | You can then again post the goals from above on the toplevel,
           | and it will work too.
           | 
           | Another way is to put these directives in your ~/.scryerrc
           | configuration file, which is automatically consulted when
           | Scryer Prolog starts. I recommend doing this for libraries
           | you frequently need. Common candidates for this are for
           | example library(dcgs), library(lists) and library(reif).
           | 
           | Personally, I start Scryer Prolog from within Emacs, and I
           | have set up Emacs so that I can consult a buffer with Prolog
           | code, and also post queries and interact with the Prolog
           | toplevel from within Emacs.
        
             | pandatigox wrote:
             | Wow that works fantastically! Thank you for that. It almost
             | seems like magic.
        
         | okasaki wrote:
         | It's pretty easy in Python too, e.g.:
         | 
         |     >>> import requests
         |     >>> from bs4 import BeautifulSoup
         |     >>> soup = BeautifulSoup(requests.get("https://www.rust-lang.org").text, "html.parser")
         |     >>> [x["href"] for x in soup.find_all("a", href=True)]
         |     ['/', '/tools/install', '/learn', 'https://play.rust-lang.org/',
         |      '/tools', '/governance', '/community', 'https://blog.rust-lang.org/',...
        
           | triska wrote:
           | In a certain sense (for example, when measuring brevity), it
           | is indeed easy to write this example in Python. However, the
           | Python version also illustrates that many different language
           | constructs are needed to express the intended functionality.
           | In comparison to Prolog, Python is a quite complex language
           | with many different language constructs, including loops,
           | objects, methods, assignment, dictionaries etc. all of which
           | are used in this example.
           | 
           | As I see it, a key attraction of Prolog is its simplicity:
           | With a single language construct (Horn clauses), you are able
           | to express all known computations, and the example queries I
           | posted show that only a single language element, namely again
           | Horn clauses to express a query, is needed to run the code.
           | The Prolog query, and also every Prolog clause, is itself a
           | Prolog term and can be inspected with built-in mechanisms.
           | 
           | As a consequence, an immediate benefit of using Prolog for
           | such use cases is that you can easily reason about user-
           | specified queries in your applications, and for example
           | easily allow only a safe subset of code to be run by users,
           | or execute a user-specified query with different execution
           | strategies etc. In comparison, Python code is much harder to
           | analyze and restrict to a particular subset due to the
           | language's comparatively high syntactic complexity.
        
             | the_jeremy wrote:
             | The benefit of Python is that developers already know about
             | these language constructs, and that more developers know
             | Python than Prolog.
        
               | lostcolony wrote:
               | I don't think the op's point was "how easy it would be to
               | hire developers", or even "taking all the considerations
               | a business is under, I feel Prolog makes sense". He was
               | just touting how easy Prolog's built-in pattern matching
               | and declarative style make implementing and using
               | selectors at a language level.
               | 
               | Honestly, if we didn't talk about the benefits of a
               | language irrespective of how easy it is to hire for it,
               | we'd never have introduced anything beyond FORTRAN, if we
               | even made it that far. Bringing "X is easier to hire for"
               | into a conversation about the language is, at best, a
               | non-sequitur.
        
               | notriddle wrote:
               | We might have been better off that way. FORTRAN does have
               | its downsides, but language churn itself has downsides
               | that almost always outweigh the assumed upsides of a
               | better language.
               | 
               | If we had just stuck with FORTRAN forever, how many
               | problems would have been completely avoided!? There'd be
               | better, and more, IDEs, since even if the language is
               | hard to parse, it's still just one parser that needs all
               | the effort. So many unfortunate problems in education
               | caused by language and ecosystem churn would have been
               | avoided (the infamous "by the time you graduate, it's
               | always outdated" problem).
               | 
               | The only problem is that FORTRAN is too new. Should've
               | stuck with the Hollerith tabulator.
        
         | jfmc wrote:
         | AFAIK, this was first proposed and implemented in Ciao Prolog
         | back in the late 90s (modern versions here:
         | https://ciao-lang.org/ciao/build/doc/ciao.html/html.html). It
         | was way before Python was popular and JavaScript ever existed.
        
       | parhamn wrote:
       | I've been looking for a library that can find the best set of
       | selectors to most consistently find the element you're looking for
       | in a page.
       | 
       | Any pointers to something that exists? Interestingly I've also
       | found very little for dom extraction in the OS ML space.
        
       ___________________________________________________________________
       (page generated 2021-09-07 23:01 UTC)