[HN Gopher] Understanding HTML with Large Language Models
___________________________________________________________________
Understanding HTML with Large Language Models

Author : PaulHoule
Score  : 35 points
Date   : 2022-10-11 19:26 UTC (3 hours ago)

(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)

| bootcat wrote:
| Looking forward to the code and the model.
| drothlis wrote:
| Related, "natbot" uses the stock GPT-3 model (no fine-tuning
| apart from the examples in the prompt) to drive a browser:
|
| https://github.com/nat/natbot
| ankrgyl wrote:
| There is a visual demo here:
| https://sites.google.com/view/llm4html/home.
|
| This work is very exciting to me for a few reasons:
|
| - HTML is an incredibly rich source of visually structured
| information, with a semi-structured representation. This is as
| opposed to PDFs, which are usually fed into models with a "flat"
| representation (words + bounding boxes). Intuitively, this offers
| the model a more direct way to learn about nested structure, over
| an almost unlimited source of unsupervised pre-training data.
|
| - Many projects (e.g. Pix2Struct
| https://arxiv.org/pdf/2210.03347.pdf, also from Google) operate
| on pixels, which are expensive (both to render and process in the
| transformer). Operating on HTML directly means smaller, faster,
| more efficient models.
|
| - (If open sourced) it will be the first (AFAIK) open pre-trained
| ready-to-go model for the RPA/automation space (there are several
| closed projects). They claim they plan to open source the dataset
| at least, which is very exciting.
|
| I'm particularly excited to extend this and similar
| (https://arxiv.org/abs/2110.08518) for HTML question answering
| and web scraping.
|
| Disclaimer: I'm the CEO of Impira, which creates OSS
| (https://github.com/impira/docquery) and proprietary
| (http://impira.com/) tools for analyzing business documents. I am
| not affiliated with this project.
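To make the "flat" versus nested contrast in ankrgyl's first bullet concrete, here is a minimal sketch using only Python's standard `html.parser`. It is not the paper's preprocessing; the `PathCollector` class, the `flat_and_nested` helper, and the sample form are hypothetical names invented for illustration. The flat view keeps only the words, while the nested view keeps the chain of enclosing tags for each piece of text.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class PathCollector(HTMLParser):
    """Record each text node together with the chain of tags enclosing it."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.nodes = []  # (tag path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def flat_and_nested(html):
    """Return (flat text, list of (tag path, text)) for an HTML snippet."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    flat = " ".join(text for _, text in parser.nodes)
    return flat, parser.nodes


# Hypothetical sample page, not taken from the paper.
SAMPLE = "<form><label>Email</label><input name='email'><button>Sign up</button></form>"
flat, nested = flat_and_nested(SAMPLE)
print(flat)
for path, text in nested:
    print(path, "->", text)
```

The flat string loses the fact that "Sign up" is a button inside a form, which is the kind of structure the comment argues a model can pick up more directly from HTML.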
| hwers wrote:
| This is Google; they for sure aren't releasing the weights.
| ShamelessC wrote:
| Exciting/scary stuff! A sophisticated enough version could
| carry out the range of tasks that a typical computer/browser
| user could, from just a few sentences, with a somewhat high
| chance of success.
|
| We will overuse this tech, forgetting important processes that
| are perhaps wise to keep a "human backup" for redundancy. Then
| again, RPA is already a case where a "proper" rewrite of some
| multi-program pipeline is impossible.
| ankrgyl wrote:
| This is a "classic" tension. Having worked in the (broader)
| RPA space for a while, I would say that the true north star
| of most processes is (a) rewriting the internal procedures to
| be transformations on data (not UIs) and (b) standardizing
| communication across companies.
|
| There is a lot of momentum to solve (a) with no code, but
| it's slow because processes are impossibly complex. I think
| AI will accelerate this and could result in the "human
| backup" dystopia. On the other hand, AI can also be used to
| generate code, and I'm optimistic that technology like this
| can accelerate humans' ability to encode complex processes
| robustly (as transformations of data) and would be 10x or
| 100x less work than no/low code.
| ShamelessC wrote:
| > On the other hand, AI can also be used to generate code,
| and I'm optimistic that technology like this can accelerate
| humans' ability to encode complex processes robustly (as
| transformations of data) and would be 10x or 100x less work
| than no/low code.
|
| Ah right, lots of angles to consider! A hybrid system would
| certainly be interesting. Let the AI runtime generate and
| evaluate code to perform tasks (e.g. Selenium/Puppeteer in
| Python/Java). Upon failure, "escalate permissions" to
| enable DOM control, or full mouse/keyboard to complete the
| task (probably best not to let the thing open up a code
| editor with M/KB controls though, heh).
___________________________________________________________________
(page generated 2022-10-11 23:00 UTC)
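The hybrid idea in ShamelessC's last comment (try generated code first, then "escalate permissions" on failure) can be sketched as a simple ladder of strategies. This is only an illustration under stated assumptions: the `Strategy` type, `run_with_escalation`, and the three stand-in functions are hypothetical, and no real Selenium or Puppeteer calls are made.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Strategy:
    name: str                   # label for the privilege level
    run: Callable[[str], bool]  # returns True if the task succeeded


def run_with_escalation(task: str, ladder: List[Strategy]) -> Optional[str]:
    """Try strategies from least to most privileged.

    Returns the name of the first strategy that succeeds, or None if all fail.
    """
    for strategy in ladder:
        try:
            if strategy.run(task):
                return strategy.name
        except Exception:
            # A crash counts as a failure at this level; move up the ladder.
            continue
    return None


# Hypothetical stand-ins for the three levels mentioned in the comment.
def generated_script(task: str) -> bool:
    raise RuntimeError("selector not found")  # simulate the generated code failing


def dom_control(task: str) -> bool:
    return True  # simulate direct DOM control succeeding


def mouse_keyboard(task: str) -> bool:
    return True  # last resort, never reached in this example


LADDER = [
    Strategy("generated_script", generated_script),
    Strategy("dom_control", dom_control),
    Strategy("mouse_keyboard", mouse_keyboard),
]

print(run_with_escalation("submit the signup form", LADDER))
```

Ordering the ladder from least to most privileged keeps full mouse/keyboard control as a last resort, which matches the caution voiced in the comment.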