[HN Gopher] GPT-3, Esq? Evaluating AI Legal Summaries [pdf]
       ___________________________________________________________________
        
       GPT-3, Esq? Evaluating AI Legal Summaries [pdf]
        
       Author : gavelin
       Score  : 44 points
       Date   : 2021-02-18 17:53 UTC (5 hours ago)
        
 (HTM) web link (www.davidvictorrodriguez.com)
 (TXT) w3m dump (www.davidvictorrodriguez.com)
        
       | DoomHotel wrote:
       | That article just reinforces what I gathered from this one:
       | 
       |  _GPT-3, Bloviator: OpenAI's language generator has no idea what
       | it's talking about_
       | 
       | https://www.technologyreview.com/2020/08/22/1007539/gpt3-ope...
        
         | joe_the_user wrote:
          | It's remarkable that the article seems to be saying "it's not
          | working now but maybe with a few tweaks we could swing this",
          | whereas the real situation seems to be that this won't be doing
          | any "meaning-critical" tasks for a long time, if ever.
         | 
          | GPT-3 seems like an extended Eliza-effect device [1]. It seems
          | to be a general database of word-fragments sufficient to give
          | the impression it's following along on any topic, but one that
          | doesn't involve any coherence; rather, it shows how much of
          | ordinary language is "just associations" (which isn't entirely
          | unimportant, but still).
         | 
          | Altogether, it even seems less sensible than narrow, hand-tuned
          | chatbots like Alicebot[2], but it's harder to see its
          | limitations because of its huge dataset.
         | 
         | [1] https://en.wikipedia.org/wiki/ELIZA_effect [2]
         | https://en.wikipedia.org/wiki/Artificial_Linguistic_Internet...
        
       | blamestross wrote:
       | I'm getting concerned that even researchers who know better are
        | anthropomorphising GPT-3 in their descriptions of its output.
        
         | make3 wrote:
          | As a researcher, it's pretty obvious from the language and from
          | the analysis that the author is a lawyer and a novice at
          | machine learning, not a machine learning researcher.
        
           | gavelin wrote:
           | Correct. I hope that meant that this was more accessible than
           | the average write-up (and not less accurate!) :)
        
           | YeGoblynQueenne wrote:
           | The author seems to have a very clear understanding of how
           | GPT-3 works and their down-to-Earth, plain language analysis
           | is miles away from the wild flights of fancy we're used to
           | reading about GPT-3.
           | 
           | As a for instance, they didn't even use the word "understand"
           | once, to refer to what GPT-3 is doing.
        
       | make3 wrote:
        | The author seems to forget that fine-tuning exists altogether,
        | which is how real-world NLP applications are actually made. In
        | reply to the following paragraph: in a real-world setting, this
        | would be done through model fine-tuning on high-quality data.
       | "First, there must be greater transparency. The sources of
       | GPT-3's references must at least be referenceable and perhaps
       | tweaked to follow a proper hierarchy of authorities. Although it
       | is challenging to audit a 175 billion parameter algorithm, it
       | would be beneficial to understand the most influential semantic
       | parameters used to generate the output text. This would ideally
       | enable users to choose word choice (perhaps as a level of
       | sophistication), appropriate voice, and tone"
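        | 
        | As a concrete (simplified) sketch of what that fine-tuning looks
        | like - using an open GPT-2 checkpoint and the Hugging Face
        | libraries as stand-ins, since GPT-3's weights aren't public, and
        | a hypothetical legal_summaries.jsonl file of vetted clause and
        | summary pairs:
        | 
        |   from datasets import load_dataset
        |   from transformers import (AutoModelForCausalLM,
        |                             AutoTokenizer, Trainer,
        |                             TrainingArguments)
        | 
        |   tok = AutoTokenizer.from_pretrained("gpt2")
        |   tok.pad_token = tok.eos_token
        |   model = AutoModelForCausalLM.from_pretrained("gpt2")
        | 
        |   # hypothetical file; each record looks like
        |   # {"text": "<clause text>\nSummary: <vetted summary>"}
        |   ds = load_dataset("json", data_files="legal_summaries.jsonl")
        | 
        |   def encode(ex):
        |       out = tok(ex["text"], truncation=True,
        |                 padding="max_length", max_length=512)
        |       out["labels"] = out["input_ids"].copy()
        |       return out
        | 
        |   ds = ds.map(encode, remove_columns=["text"])
        |   trainer = Trainer(
        |       model=model,
        |       args=TrainingArguments("finetuned", num_train_epochs=3),
        |       train_dataset=ds["train"])
        |   trainer.train()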
        
         | gavelin wrote:
         | You are quite correct that fine-tuning is necessary to improve
         | accuracy. I did not intend to dismiss that. The point I
         | intended to make there is that it would be nice to be able to
         | see whether, when interpreting a statute, GPT-3 relied heavily
         | on a blog interpreting a statute v. legislative history v.
         | Supreme Court precedent. The right approach would be to control
         | the proper hierarchy of authorities. It would be helpful to
         | understand, even as a textual matter, what GPT-3 was most
         | heavily relying on for a given prediction. That would speed up
         | categorically fixing bad prediction patterns.
         | 
         | It is also true that high quality data is necessary. There is a
         | reason why lawyers rely on Westlaw and LexisNexis to search for
         | relevant laws and even scholarly articles. They are trusted
         | sources. A better approach would rely on something like those
         | with a more narrow universe of quality sources. There is a ton
         | of labeling work that needs to be done, even beyond the
         | "KeyCite" type of labels Westlaw applies to documents. Note
         | that YC company www.rossintelligence.com ran into some trouble
         | recently with Westlaw and LexisNexis.
         | 
         | The quality v. quantity of data debate is particularly relevant
         | here. The power of GPT-3 is in part supposed to come from the
          | sheer scale of its training dataset. It would be nice to
          | leverage some of the semantic training from a large non-legal
          | dataset to be able to produce output in layman's terms
         | while sourcing the authorities from more closely vetted
         | sources.
        
       | MasterScrat wrote:
       | Thank you for your analysis, it's great to have the insights of
       | someone with both legal and ML backgrounds.
       | 
       | I want to point out that a big part of building anything on GPT-3
       | (or other large LMs for that matter) is "prompt engineering",
       | which means you try out thousands of prompts and sampling
       | parameters until you find something that works reasonably well
       | for your use case.
       | 
       | Taking two default templates and a few different temperatures is
       | like taking some tutorials for a new framework, building a proof
       | of concept from them, then making a judgement from that. Sure, it
       | can provide a good first assessment, but that's it. You would
       | need much deeper experience to come to a meaningful conclusion.
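        | 
        | As a rough sketch of what that prompt and parameter search
        | looks like in practice (score_summary() is a hypothetical
        | stand-in for however you grade outputs for your use case, and
        | the completion call is the OpenAI API as it existed at the
        | time):
        | 
        |   import itertools, openai
        | 
        |   # assumes openai.api_key has already been set
        |   clause_text = "The Company may terminate this Agreement."
        |   prompts = ["Summarize this clause:\n{t}\nSummary:",
        |              "Explain in plain English:\n{t}\nAnswer:"]
        |   temps = [0.0, 0.3, 0.7]
        | 
        |   def score_summary(text):   # hypothetical quality metric
        |       return -abs(len(text) - 300)
        | 
        |   candidates = []
        |   for tmpl, temp in itertools.product(prompts, temps):
        |       resp = openai.Completion.create(
        |           engine="davinci",
        |           prompt=tmpl.format(t=clause_text),
        |           temperature=temp, max_tokens=150)
        |       candidates.append(resp["choices"][0]["text"])
        |   best = max(candidates, key=score_summary)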
        
         | NovemberWhiskey wrote:
         | > _I want to point out that a big part of building anything on
         | GPT-3 (or other large LMs for that matter) is "prompt
         | engineering", which means you try out thousands of prompts and
         | sampling parameters until you find something that works
         | reasonably well for your use case._
         | 
         | As someone who is instinctively skeptical about these language
         | models, this kind of statement makes my antennae twitch. You
          | have this black box model that generates all sorts of plausible
         | outputs, you jiggle the handle until those outputs meet your
         | expectations for some range of tested inputs, and then ... you
         | assume it's just going to work?
         | 
          | For parlor tricks, or even low-stakes real-world activities,
          | that might be enough - but how can you trust it?
        
           | make3 wrote:
           | (As one of them) every professional researcher in NLP at
           | every large company (incl me) knows you can't rely on
           | generation right now, and huge teams everywhere are working
           | on reliability in text generation
        
             | joe_the_user wrote:
             | So, you take this "general purpose model" (with a huge
             | corpus of standard text) and you attempt to use it for a
              | narrow purpose. The model requires a lot of prompt tweaking
              | and other adjustments for this narrow purpose, but
              | eventually you "make it work".
             | 
             | How do you know you're not just "programming" a chatbot (by
             | indirectly filtering for the text-pieces you want) but in
             | the most indirect and unguaranteed fashion possible? I
             | suppose the advantage is you can say "look, it's
             | intelligent".
        
       | gwern wrote:
        | Also of interest is Arbel's paper on GPT-3 legal summaries
        | (written with AID for now, but he got GPT-3 access recently):
       | https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3740356
       | 
       | One thing I would note is the potential for a feedback loop
       | between summaries and the model: if a summary of a specific ToS
       | or piece of law is wrong, you can hardwire an expert-vetted one
       | (there's only so many ToSes or pieces of law and it'll be a long
       | tail), and you can feed back in the correct one as training data
       | to finetune the model. The bigger the GPT model, the smarter it
       | is, and the less feedback it takes to correct its summaries:
       | https://openai.com/blog/learning-to-summarize-with-human-fee...
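        | 
        | A crude sketch of that loop (the vetted store, review queue, and
        | generate_summary() below are all placeholders, not any real
        | product):
        | 
        |   from hashlib import sha256
        | 
        |   vetted = {}        # hypothetical sha256(ToS text) -> summary
        |   review_queue = []  # corrections become finetuning pairs later
        | 
        |   def generate_summary(text):    # stand-in for the model call
        |       return "DRAFT: " + text[:60] + "..."
        | 
        |   def summarize(tos_text):
        |       key = sha256(tos_text.encode()).hexdigest()
        |       if key in vetted:          # hardwired expert-vetted text
        |           return vetted[key]
        |       draft = generate_summary(tos_text)
        |       review_queue.append((tos_text, draft))  # for experts
        |       return draft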
        
         | gavelin wrote:
         | Thanks for the paper link! I think your reasoning of hardwiring
         | boilerplate to an expert-vetted (or written) summary is a much
         | more accurate approach. If only we could scrape the clause
         | explanation footnotes from quality sources I would not be
         | forced to write them!
        
       | minimaxir wrote:
       | > Second, text ought to be tokenized (a term used in natural
       | language processing wherein text is assigned a numerical value)
       | at many different levels (character, word, sentence, paragraph,
       | section, etc.) in order to make predictions that are relevant to
       | the excerpt of a text, while also remaining consistent to the
       | broader document. This is challenging because doing so involves a
       | tremendous amount of computational resources. However, it may be
       | necessary to accurately capture meaning at different levels of
       | abstraction.
       | 
        | I don't think this would be an improvement per se. The Byte Pair
        | Encodings that GPT-3 (and GPT-2) use are constructed so that
        | they already take higher-level text representations and compress
        | them down (into tokens), which are then reflected in the training
        | of the model.
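        | 
        | You can see the compression directly with the GPT-2 tokenizer
        | (GPT-3 reuses the same BPE vocabulary; the sketch assumes the
        | Hugging Face transformers package):
        | 
        |   from transformers import GPT2TokenizerFast
        | 
        |   tok = GPT2TokenizerFast.from_pretrained("gpt2")
        |   text = "The parties hereby agree to binding arbitration."
        |   print(tok.tokenize(text))
        |   # frequent words and word pieces come out as single tokens
        |   # rather than characters, so some higher-level structure is
        |   # already baked into the vocabulary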
        
         | Tarq0n wrote:
         | And the attention mechanism should take care of "sentence,
         | paragraph, section" level context. Attributing this to
         | tokenization is a weird mistake to make.
        
           | gavelin wrote:
           | Tarq0n: Also a great point. The question then is whether the
           | attention mechanism is being triggered on the proper word or
           | character sequences. Lawyers employ a form of attention
           | mechanism when "issue spotting." For example, in a non-
           | compete I might scan for the duration (how many years until
           | the client can join a similar venture?) and breadth (what is
           | the definition of a similar venture?). For an attention
           | mechanism to work well for legal summaries, it seems to me
            | that it must trigger on many relevant contextual cues at
           | different levels.
           | 
           | As an extreme example, if I put 50 contracts involving
           | multiple different parties into one big document and the
           | attention mechanism was triggered on the first document
           | title, would the subsequent document titles sufficiently
           | demarcate a new contract to reset the context? Or would the
           | attention stay "on" the first high accuracy context match and
            | mix up the terms and parties? In the context of the paper,
            | GPT-3 missed a ton of issues, indicating to me that the
            | attention mechanism is not being properly triggered. Again, I
            | may be
           | wrong that tokenizing and predicting based off of clauses,
           | paragraphs and sections would help improve the output, but
           | that was one way I thought would capture the different levels
           | of context.
        
         | gavelin wrote:
         | minimaxir: Great insight and probably merits an edit for
         | precision. My understanding is that Byte-Pair encoding is done
         | at the character and word level (and maybe even the sub-
         | character level), but not at the higher-level representations
         | such as paragraph, section--and beyond. Am I mistaken? Taking a
         | few steps back, is that an effective way to pinpoint context?
         | 
         | The goal should be to properly ascertain context at multiple
         | levels. When I am reviewing a document, I scan the document
         | title and section headings to grasp the structure of the
         | document prior to diving into the relevant clauses and their
         | elements. If there are external references, I will integrate
         | them before reading the clause in order to capture the complete
         | rule. A crucial mistake in contractual interpretation is
         | falsely attributing an element from one rule or section to
         | another, or excluding an element. What applies in A context
         | might not apply in B context, or it may be conditional on
         | another factor C.
         | 
         | The criticism I intended to make was that GPT-3 likely is not
         | (accurately) identifying the right context "bucket" before
         | making the prediction and that perhaps could be improved by
         | tokenizing different levels of context. I may have falsely
         | reasoned that this could be best accomplished through
          | tokenization at different levels. In the context of the paper,
          | GPT-3 referenced Tinder and MommyMeet when I inputted
          | LinkedIn's Privacy Policy. Also, if GPT-3 contains Section 230
          | in its training data, it did not look to the definition list at
          | the bottom of the statute to define a key term (I excluded
         | the definitions from the input). My hunch is that a better
         | approach would localize based on the document, section and
         | clause type to precisely narrow the context before utilizing
         | character and word level predictions.
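          | 
          | One crude way to approximate that localization (purely my own
          | sketch, not something from the paper):
          | 
          |   import re
          | 
          |   def summarize_section(section):   # stand-in for the model
          |       return "summary of: " + section[:50]
          | 
          |   def split_by_section(doc):
          |       # naive: numbered headings like "3. Termination"
          |       # mark section boundaries, narrowing the context
          |       parts = re.split(r"\n(?=\d+\.\s+[A-Z])", doc)
          |       return [p.strip() for p in parts if p.strip()]
          | 
          |   def summarize_document(doc):
          |       # summarize clause by clause rather than feeding the
          |       # whole contract into one prediction
          |       return [summarize_section(s)
          |               for s in split_by_section(doc)]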
        
           | joe_the_user wrote:
            | I would claim it is easy to _think_ you're seeing GPT-3 fail
           | because it's taking the wrong association path (not noticing
           | the hierarchical decomposition of the context). But the
           | general problem is that there is no fixed decomposition of
           | the meaning of a text.
           | 
            | It's tempting to think a "simple" procedure like
            | summarization doesn't need "deep" meaning, but it doesn't
            | seem like that's the case. It should especially be noted that
            | a lot of the
           | plausible "summaries" could be summaries of any privacy
           | policy or just what people say about privacy policies on the
           | net. It would have been interesting to give the system a
           | novel bit of text to interpret instead.
        
       | Aulig wrote:
       | Great read, thanks for sharing. It's always interesting to find
       | out how AI is impacting occupations outside the technology
       | sector.
        
         | gavelin wrote:
         | Thank you for reading!
        
       | dweekly wrote:
       | tl;dr - Don't use GPT-3 to summarize your legal documents yet.
        
         | gavelin wrote:
         | "tl;dr" may very well be the problem! There is a tendency of
         | many tl;dr summaries on the web to oversimplify and skew
         | concepts. If those are included in GPT-3's dataset, GPT-3's
         | output will try to match the dataset style (according to the
         | parameters) and likely not meet the legal standard. There is a
         | section following the conclusion in the paper where I touch on
         | other ways GPT-3 might be improved for the legal summarization
         | use case. The only way we get there is by delving into the
         | nuance of WHY GPT-3 is not yet good enough to replace lawyers,
         | and HOW we can improve on it as an architecture.
        
       | skybrian wrote:
       | GPT-3 picks words literally at random (according to a probability
       | distribution) so it would be good to run each experiment multiple
       | times to get a sense of the probability distribution. I doubt it
       | would change the conclusions, though.
        
         | gavelin wrote:
         | Good tip! I repeated a few inputs to see if the variation was
         | significant enough to warrant including that, and as your
         | intuition suggested, it was not with the parameters I selected.
         | A more robust experiment should definitely include repeat
         | attempts.
        
         | minimaxir wrote:
          | There's a _big_ difference between sampling from a learned
          | probability distribution and "picks words literally at
          | random". The temperature = 0 examples have zero sampling by
          | construction, while higher temperatures involve only a small
          | amount of sampling.
         | 
         | That said, it doesn't hurt to have multiple attempts (aside
         | from the cost of using GPT-3, ahem).
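          | 
          | (A toy sketch of that distinction, just to make it concrete:)
          | 
          |   import numpy as np
          | 
          |   logits = np.array([4.0, 3.5, 1.0])  # toy next-token scores
          | 
          |   def next_token(logits, temperature):
          |       if temperature == 0:            # greedy: no sampling
          |           return int(np.argmax(logits))
          |       p = np.exp(logits / temperature)
          |       p /= p.sum()                    # softmax over scores
          |       return int(np.random.choice(len(logits), p=p))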
        
         | make3 wrote:
         | In a professional setting, this is definitely done, and the
         | generations are then ranked with separate models that predict
         | different quality metrics, such as "interestingness", "safety"
          | in an inclusiveness type of way, whether the answer seems to
          | fit a style that you want, whether the facts in the answer seem
          | to make sense, etc. It makes a big difference, actually.
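          | 
          | Roughly like the sketch below (the generator and scorer are
          | dummy stand-ins for the real models):
          | 
          |   import random
          | 
          |   def generate(prompt, temp=0.8):    # stand-in LM call
          |       return f"candidate summary #{random.randint(0, 999)}"
          | 
          |   def quality(cand):
          |       # in practice: safety * style * factuality scores,
          |       # each from its own separately trained classifier
          |       return random.random()
          | 
          |   def pick_best(prompt, n=16):
          |       cands = [generate(prompt) for _ in range(n)]
          |       return max(cands, key=quality)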
        
         | blackbear_ wrote:
         | I thought generating text from language models was a
         | deterministic operation, searching for the maximum likelihood
         | sequence using beam search?
        
           | make3 wrote:
           | It is now common knowledge in the NLP community that beam
           | search only works for situations where the output space is
           | very constrained, specifically, in neural machine
           | translation.
           | 
           | In more open ended generation such as summarization, question
           | answering and story generation, beam search leads to poor and
           | repetitive outputs. Different (stochastic) sampling methods
            | lead to more interesting, diverse and... functional outputs.
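            | 
            | With the Hugging Face transformers library (as a stand-in,
            | since the GPT-3 API doesn't expose beam search), it's just a
            | different set of decoding arguments:
            | 
            |   from transformers import (AutoModelForCausalLM,
            |                             AutoTokenizer)
            | 
            |   tok = AutoTokenizer.from_pretrained("gpt2")
            |   model = AutoModelForCausalLM.from_pretrained("gpt2")
            |   ids = tok("The policy says", return_tensors="pt")
            | 
            |   # beam search: fine for constrained tasks (translation)
            |   beams = model.generate(**ids, num_beams=5,
            |                          max_length=60)
            | 
            |   # nucleus sampling: better for open-ended generation
            |   sampled = model.generate(**ids, do_sample=True,
            |                            top_p=0.9, temperature=0.8,
            |                            max_length=60)
            | 
            |   print(tok.decode(sampled[0], skip_special_tokens=True))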
        
           | ansk wrote:
           | It can be as deterministic as you want it to be -- there are
           | parameters that control how much randomness is used during
           | the sampling process. Finding the most probable sequence from
           | the learned distribution is intractable for all but the
           | shortest of sequences. As you've said, beam search is used,
           | but this is just a local search heuristic and provides no
           | guarantees of producing the most probable output.
        
       | jll29 wrote:
       | To those that are interested in the state of the art in
       | commercial legal summarization (US law) as of Q1/2021:
       | https://arxiv.org/pdf/2102.05757.pdf
        
         | gavelin wrote:
         | Great link. Thank you for sharing this.
        
       | Der_Einzige wrote:
       | This is at the intersection of my research interests. I'd love to
       | see what happens if you run it on a large debate case with slight
       | variations of the input prompt queries.
       | 
       | I love to see all of these articles about legal summarization -
       | but it's always covering abstractive summarization! I want an
       | effective highlighter model. To be fair, I did build a system for
       | using transformers to do extractive summarization in an
       | unsupervised manner, but the results aren't that great. I'm sure
        | that lawyers would find something that could highlight the most
       | legally salient sections to be extremely useful if it's
       | reasonably accurate.
       | 
       | I really did figure that we'd get effective word-level extractive
       | summaries before we'd get effective abstractive summaries but I
       | guess that intuition was wrong...
        
       ___________________________________________________________________
       (page generated 2021-02-18 23:01 UTC)