[HN Gopher] Understanding HTML with Large Language Models
       ___________________________________________________________________
        
       Understanding HTML with Large Language Models
        
       Author : PaulHoule
       Score  : 35 points
       Date   : 2022-10-11 19:26 UTC (3 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | bootcat wrote:
       | Looking forward to the code and the model.
        
       | drothlis wrote:
       | Related, "natbot" uses the stock GPT-3 model (no fine-tuning
       | apart from the examples in the prompt) to drive a browser:
       | 
       | https://github.com/nat/natbot
        
       | ankrgyl wrote:
       | There is a visual demo here:
       | https://sites.google.com/view/llm4html/home.
       | 
       | This work is very exciting to me for a few reasons:
       | 
       | - HTML is an incredibly rich source of visually structured
       | information, with a semi-structured representation. This is as
       | opposed to PDFs, which are usually fed into models with a "flat"
       | representation (words + bounding boxes). Intuitively, this offers
       | the model a more direct way to learn about nested structure, over
       | an almost unlimited source of unsupervised pre-training data.
       | 
       | - Many projects (e.g. Pix2Struct
       | https://arxiv.org/pdf/2210.03347.pdf, also from Google) operate
       | on pixels, which are expensive (both to render and process in the
       | transformer). Operating on HTML directly means smaller, faster,
       | more efficient models.
       | 
       | - (If open sourced) it will be the first (AFAIK) open pre-trained
       | ready-to-go model for the RPA/automation space (there are several
       | closed projects). They claim they plan to open source the dataset
       | at least, which is very exciting.
       | 
       | I'm particularly excited to extend this and similar work
       | (https://arxiv.org/abs/2110.08518) for HTML question answering
       | and web scraping.
       | 
       | Disclaimer: I'm the CEO of Impira, which creates OSS
       | (https://github.com/impira/docquery) and proprietary
       | (http://impira.com/) tools for analyzing business documents. I am
       | not affiliated with this project.
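
The contrast in the first bullet above (nested HTML versus a "flat" word list) can be illustrated with a small sketch. This is not the paper's preprocessing. It is only an assumption-laden illustration using Python's standard `html.parser`; the names `OutlineParser`, `nested_view`, `flat_view`, and the `SAMPLE` markup are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class OutlineParser(HTMLParser):
    """Collects (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def nested_view(html):
    """Text nodes with the element path that encloses them."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def flat_view(html):
    """The same text as a bare word list, with all nesting discarded."""
    return [word for _, text in nested_view(html) for word in text.split()]


# Hypothetical sample markup, not taken from the paper.
SAMPLE = "<form><label>Email</label><input name='email'><button>Sign up</button></form>"

if __name__ == "__main__":
    print(nested_view(SAMPLE))
    print(flat_view(SAMPLE))
```

In the nested view each string keeps its element path (for example, which text is a label and which is a button), while the flat view only has the words in order. Bounding boxes, which PDF pipelines usually add, are left out of this sketch.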
        
         | hwers wrote:
         | This is Google; they for sure aren't releasing the weights.
        
         | ShamelessC wrote:
         | Exciting/scary stuff! A sophisticated enough version could
         | carry out any range of tasks that a typical computer
         | user/browser could, from just a few sentences, with a somewhat
         | high chance of success.
         | 
         | We will overuse this tech, forgetting important processes that
         | are perhaps wise to keep a "human backup" for redundancy. Then
         | again, RPA is already a case where a "proper" rewrite of some
         | multi-program pipeline is impossible.
        
           | ankrgyl wrote:
           | This is a "classic" tension. Having worked in the (broader)
           | RPA space for a while, I would say that the true north star
           | of most processes is (a) rewriting the internal procedures to
           | be transformations on data (not UIs) and (b) standardizing
           | communication across companies.
           | 
           | There is a lot of momentum to solve (a) with no code, but
           | it's slow because processes are impossibly complex. I think
           | AI will accelerate this and could result in the "human
           | backup" dystopia. On the other hand, AI can also be used to
           | generate code, and I'm optimistic that technology like this
           | can accelerate humans' ability to encode complex processes
           | robustly (as transformations of data) and would be 10 or 100x
           | less work than no/low code.
        
             | ShamelessC wrote:
             | > On the other hand, AI can also be used to generate code,
             | and I'm optimistic that technology like this can accelerate
             | humans' ability to encode complex processes robustly (as
             | transformations of data) and would be 10 or 100x less work
             | than no/low code.
             | 
             | Ah right, lots of angles to consider! A hybrid system would
             | certainly be interesting. Let the AI runtime generate and
             | evaluate code to perform tasks (e.g. selenium/puppeteer in
             | python/java). Upon failure, "escalate permissions" to
             | enable DOM control, or full mouse/keyboard to complete the
             | task (probably best not to let the thing open up a code
             | editor with M/KB controls though, heh).
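
The hybrid idea in the comment above (try generated code first, then "escalate permissions" on failure) could look roughly like the sketch below. It assumes each tier is just a callable that raises on failure. The tier names and the stand-in functions are hypothetical placeholders, not real selenium/puppeteer calls and not anything from the paper.

```python
from typing import Callable, Sequence, Tuple

# A tier is a (name, attempt) pair; attempt takes the task text and
# either returns a result string or raises an exception.
Strategy = Tuple[str, Callable[[str], str]]


def run_with_escalation(task: str, tiers: Sequence[Strategy]) -> Tuple[str, str]:
    """Try each tier in order, least privileged first.

    Returns (tier_name, result) from the first tier that succeeds and
    raises RuntimeError if every tier fails.
    """
    errors = []
    for name, attempt in tiers:
        try:
            return name, attempt(task)
        except Exception as exc:  # any failure triggers escalation
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all tiers failed: " + "; ".join(errors))


# Hypothetical stand-ins for the three levels mentioned in the comment.
def generated_script(task: str) -> str:
    raise RuntimeError("selector not found")  # simulate a broken script


def dom_control(task: str) -> str:
    return f"done via DOM: {task}"


def mouse_keyboard(task: str) -> str:
    return f"done via M/KB: {task}"


TIERS = [
    ("generated-code", generated_script),
    ("dom-control", dom_control),
    ("mouse-keyboard", mouse_keyboard),
]

if __name__ == "__main__":
    print(run_with_escalation("submit form", TIERS))
```

Ordering the tiers from least to most privileged keeps the broad mouse/keyboard fallback as a last resort, which matches the caution at the end of the comment.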
        
       ___________________________________________________________________
       (page generated 2022-10-11 23:00 UTC)