[HN Gopher] GPT-3, Esq? Evaluating AI Legal Summaries [pdf]
___________________________________________________________________

GPT-3, Esq? Evaluating AI Legal Summaries [pdf]

Author : gavelin
Score  : 44 points
Date   : 2021-02-18 17:53 UTC (5 hours ago)

(HTM) web link (www.davidvictorrodriguez.com)
(TXT) w3m dump (www.davidvictorrodriguez.com)

| DoomHotel wrote:
| That article just reinforces what I gathered from this one:
|
| _GPT-3, Bloviator: OpenAI's language generator has no idea what it's talking about_
|
| https://www.technologyreview.com/2020/08/22/1007539/gpt3-ope...
| joe_the_user wrote:
| It's remarkable that the article seems to be saying "it's not working now, but maybe with a few tweaks we could swing this," when the real situation seems to be that this won't be doing any "meaning-critical" tasks for a long time, if ever.
|
| GPT-3 seems like an extended Eliza-effect device [1]. It seems to be a general database of word fragments sufficient to give the impression that it's following along on any topic, but one that doesn't involve any coherence; rather, it shows how much of ordinary language is "just associations" (which isn't entirely unimportant, but still).
|
| Altogether, it even seems less sensible than narrow, hand-tuned chatbots like Alicebot [2], but it's harder to see its limitations because of its huge dataset.
|
| [1] https://en.wikipedia.org/wiki/ELIZA_effect
| [2] https://en.wikipedia.org/wiki/Artificial_Linguistic_Internet...
| blamestross wrote:
| I'm getting concerned that even researchers who know better are anthropomorphising GPT-3 in their descriptions of its output.
| make3 wrote:
| As a researcher, it's pretty obvious from the language and the analysis that the author is a lawyer who is a novice at machine learning, not a machine learning researcher.
| gavelin wrote:
| Correct. I hope that made this more accessible than the average write-up (and not less accurate!) :)
| YeGoblynQueenne wrote:
| The author seems to have a very clear understanding of how GPT-3 works, and their down-to-earth, plain-language analysis is miles away from the wild flights of fancy we're used to reading about GPT-3.
|
| For instance, they didn't use the word "understand" even once to refer to what GPT-3 is doing.
| make3 wrote:
| The author seems to forget that fine-tuning exists altogether, which is how real-world NLP applications are actually built. In reply to the following paragraph: in a real-world setting, this would be done through model fine-tuning on high-quality data.
|
| "First, there must be greater transparency. The sources of GPT-3's references must at least be referenceable and perhaps tweaked to follow a proper hierarchy of authorities. Although it is challenging to audit a 175 billion parameter algorithm, it would be beneficial to understand the most influential semantic parameters used to generate the output text. This would ideally enable users to choose word choice (perhaps as a level of sophistication), appropriate voice, and tone"
| gavelin wrote:
| You are quite correct that fine-tuning is necessary to improve accuracy. I did not intend to dismiss that. The point I intended to make there is that it would be nice to be able to see whether, when interpreting a statute, GPT-3 relied heavily on a blog post v. legislative history v. Supreme Court precedent. The right approach would be to control the proper hierarchy of authorities.
|
| It would be helpful to understand, even as a textual matter, what GPT-3 was most heavily relying on for a given prediction. That would speed up categorically fixing bad prediction patterns.
|
| It is also true that high-quality data is necessary. There is a reason why lawyers rely on Westlaw and LexisNexis to search for relevant laws and even scholarly articles: they are trusted sources. A better approach would rely on something like those, with a narrower universe of quality sources. There is a ton of labeling work that needs to be done, even beyond the "KeyCite" type of labels Westlaw applies to documents. Note that YC company www.rossintelligence.com ran into some trouble recently with Westlaw and LexisNexis.
|
| The quality v. quantity of data debate is particularly relevant here. The power of GPT-3 is supposed to come in part from the sheer scale of its training dataset. It would be nice to leverage some of the semantic training from a large non-legal dataset so the output can be styled in layman's terms, while sourcing the authorities from more closely vetted material.
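|
| For readers who have not seen it done, here is a minimal sketch of what fine-tuning on a curated legal corpus might look like. This is illustrative only: the model, file path, and hyperparameters are assumptions (GPT-2 stands in, since GPT-3 itself was not publicly fine-tunable at the time), not anything from the paper.
|
|     # Sketch: adapt a pretrained language model to vetted legal text
|     # using the Hugging Face Trainer API.
|     from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
|                               TextDataset,
|                               DataCollatorForLanguageModeling,
|                               Trainer, TrainingArguments)
|
|     tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
|     model = GPT2LMHeadModel.from_pretrained("gpt2")
|
|     # Curated, high-quality sources only (e.g., vetted statutes,
|     # opinions, and expert-written clause summaries).
|     train_set = TextDataset(tokenizer=tokenizer,
|                             file_path="vetted_legal_corpus.txt",
|                             block_size=512)
|     collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
|                                                mlm=False)
|
|     args = TrainingArguments(output_dir="legal-lm",
|                              num_train_epochs=3,
|                              per_device_train_batch_size=2)
|     Trainer(model=model, args=args, data_collator=collator,
|             train_dataset=train_set).train()
|
| The curation step carries the weight here: fine-tuning shifts the model's distribution toward whatever the training file contains, so a vetted corpus matters more than the loop itself.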
| MasterScrat wrote:
| Thank you for your analysis, it's great to have the insights of someone with both legal and ML backgrounds.
|
| I want to point out that a big part of building anything on GPT-3 (or other large LMs for that matter) is "prompt engineering", which means you try out thousands of prompts and sampling parameters until you find something that works reasonably well for your use case.
|
| Taking two default templates and a few different temperatures is like taking some tutorials for a new framework, building a proof of concept from them, and then making a judgement from that. Sure, it can provide a good first assessment, but that's it. You would need much deeper experience to come to a meaningful conclusion.
| NovemberWhiskey wrote:
| > _I want to point out that a big part of building anything on GPT-3 (or other large LMs for that matter) is "prompt engineering", which means you try out thousands of prompts and sampling parameters until you find something that works reasonably well for your use case._
|
| As someone who is instinctively skeptical about these language models, this kind of statement makes my antennae twitch. You have this black-box model that generates all sorts of plausible outputs, you jiggle the handle until those outputs meet your expectations for some range of tested inputs, and then ... you assume it's just going to work?
|
| For parlor tricks, or even low-stakes real-world activities, that might be enough - but how can you trust it?
| make3 wrote:
| (As one of them) every professional researcher in NLP at every large company (incl. me) knows you can't rely on generation right now, and huge teams everywhere are working on reliability in text generation.
| joe_the_user wrote:
| So, you take this "general purpose model" (with a huge corpus of standard text) and you attempt to use it for a narrow purpose. The model requires a lot of prompt tweaking and other adjustments for this narrow task, but eventually you "make it work".
|
| How do you know you're not just "programming" a chatbot (by indirectly filtering for the text pieces you want), but in the most indirect and unguaranteed fashion possible? I suppose the advantage is you can say "look, it's intelligent".
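|
| Concretely, the prompt-and-parameter search being described might look something like the sketch below, assuming the 2021-era OpenAI completion API. The engine name and templates are illustrative, not those tested in the paper.
|
|     # Sweep prompt templates and sampling temperatures, then inspect
|     # which combination summarizes the clause best.
|     import openai
|
|     TEMPLATES = [
|         "Summarize this contract clause in plain English:\n{text}\nSummary:",
|         "My second grader asked me what this passage means:\n{text}\nI rephrased it for him, in plain language:",
|     ]
|     clause = open("clause.txt").read()  # the clause under test
|
|     for template in TEMPLATES:
|         for temp in (0.0, 0.4, 0.8):
|             out = openai.Completion.create(
|                 engine="davinci",
|                 prompt=template.format(text=clause),
|                 max_tokens=150,
|                 temperature=temp,
|             )
|             print(f"--- temp={temp} ---")
|             print(out.choices[0].text.strip())
|
| A serious effort would automate the inspection step with held-out examples and a scoring rubric; doing it by hand over thousands of prompts is exactly the unguaranteed filtering the parent comment worries about.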
| gwern wrote:
| Also of interest is Arbel's paper on GPT-3 legal summaries (AID currently, but he got GPT-3 access recently): https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3740356
|
| One thing I would note is the potential for a feedback loop between summaries and the model: if a summary of a specific ToS or piece of law is wrong, you can hardwire an expert-vetted one (there are only so many ToSes or pieces of law, and it'll be a long tail), and you can feed the correct one back in as training data to finetune the model. The bigger the GPT model, the smarter it is, and the less feedback it takes to correct its summaries: https://openai.com/blog/learning-to-summarize-with-human-fee...
| gavelin wrote:
| Thanks for the paper link! I think your suggestion of hardwiring boilerplate to an expert-vetted (or expert-written) summary is a much more accurate approach. If only we could scrape the clause-explanation footnotes from quality sources, I would not be forced to write them!
| minimaxir wrote:
| > Second, text ought to be tokenized (a term used in natural language processing wherein text is assigned a numerical value) at many different levels (character, word, sentence, paragraph, section, etc.) in order to make predictions that are relevant to the excerpt of a text, while also remaining consistent to the broader document. This is challenging because doing so involves a tremendous amount of computational resources. However, it may be necessary to accurately capture meaning at different levels of abstraction.
|
| I don't think this would be an improvement per se. The Byte Pair Encodings that GPT-3 (and GPT-2) use are constructed so that they already take higher-level text representations and compress them down into tokens, which are then reflected in the training of the model.
| Tarq0n wrote:
| And the attention mechanism should take care of "sentence, paragraph, section" level context. Attributing this to tokenization is a weird mistake to make.
| gavelin wrote:
| Tarq0n: Also a great point. The question then is whether the attention mechanism is being triggered on the proper word or character sequences. Lawyers employ a form of attention mechanism when "issue spotting." For example, in a non-compete I might scan for the duration (how many years until the client can join a similar venture?) and breadth (what is the definition of a similar venture?). For an attention mechanism to work well for legal summaries, it seems to me that it must trigger on many relevant context cues at different levels.
|
| As an extreme example, if I put 50 contracts involving multiple different parties into one big document and the attention mechanism was triggered on the first document title, would the subsequent document titles sufficiently demarcate a new contract to reset the context? Or would the attention stay "on" the first high-accuracy context match and mix up the terms and parties? In the context of the paper, GPT-3 missed a ton of issues, indicating to me that the attention mechanism is not being properly triggered. Again, I may be wrong that tokenizing and predicting based on clauses, paragraphs, and sections would help improve the output, but that was one way I thought the different levels of context might be captured.
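|
| For what it's worth, the tokenization being debated here is easy to inspect directly. A quick sketch using the Hugging Face GPT-2 tokenizer (GPT-3 reuses essentially the same byte-pair vocabulary); the example sentence is made up:
|
|     # Tokens are characters and word fragments, not sentences,
|     # paragraphs, or sections.
|     from transformers import GPT2TokenizerFast
|
|     tok = GPT2TokenizerFast.from_pretrained("gpt2")
|     text = "The indemnification obligations survive termination."
|     print(tok.tokenize(text))  # subword pieces; 'Ġ' marks a leading space
|     print(tok.encode(text))    # the integer IDs the model actually sees
|
| Any sense of clause, section, or document structure has to be learned by the model from these flat token sequences; the encoding itself carries none of it.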
| gavelin wrote:
| minimaxir: Great insight, and it probably merits an edit for precision. My understanding is that Byte-Pair Encoding is done at the character and word level (and maybe even the sub-character level), but not at higher-level representations such as the paragraph or section level and beyond. Am I mistaken? Taking a few steps back, is that an effective way to pinpoint context?
|
| The goal should be to properly ascertain context at multiple levels. When I am reviewing a document, I scan the document title and section headings to grasp the structure of the document prior to diving into the relevant clauses and their elements. If there are external references, I will integrate them before reading the clause in order to capture the complete rule. A crucial mistake in contractual interpretation is falsely attributing an element from one rule or section to another, or excluding an element. What applies in context A might not apply in context B, or it may be conditional on another factor C.
|
| The criticism I intended to make was that GPT-3 is likely not (accurately) identifying the right context "bucket" before making the prediction, and that this could perhaps be improved by tokenizing different levels of context. I may have falsely reasoned that this could best be accomplished through tokenization at different levels. In the context of the paper, GPT-3 referenced Tinder and MommyMeet when I inputted Linkedin's Privacy Policy. Also, if GPT-3 contains Section 230 in its training data, it did not look to the definition list at the bottom of the statute to define a key term (I excluded the definitions from the input). My hunch is that a better approach would localize based on the document, section, and clause type to precisely narrow the context before utilizing character- and word-level predictions.
| joe_the_user wrote:
| I would claim it is easy to _think_ you're seeing GPT-3 fail because it's taking the wrong association path (not noticing the hierarchical decomposition of the context). But the general problem is that there is no fixed decomposition of the meaning of a text.
|
| It's tempting to think a "simple" procedure like summarization doesn't need "deep" meaning, but it doesn't seem like that's the case. It should especially be noted that a lot of the plausible "summaries" could be summaries of any privacy policy, or just what people say about privacy policies on the net. It would have been interesting to give the system a novel bit of text to interpret instead.
| Aulig wrote:
| Great read, thanks for sharing. It's always interesting to find out how AI is impacting occupations outside the technology sector.
| gavelin wrote:
| Thank you for reading!
| dweekly wrote:
| tl;dr - Don't use GPT-3 to summarize your legal documents yet.
| gavelin wrote:
| "tl;dr" may very well be the problem! Many tl;dr summaries on the web tend to oversimplify and skew concepts. If those are included in GPT-3's dataset, GPT-3's output will try to match the dataset style (according to the parameters) and likely not meet the legal standard. There is a section following the conclusion in the paper where I touch on other ways GPT-3 might be improved for the legal summarization use case. The only way we get there is by delving into the nuance of WHY GPT-3 is not yet good enough to replace lawyers, and HOW we can improve on it as an architecture.
| skybrian wrote:
| GPT-3 picks words literally at random (according to a probability distribution), so it would be good to run each experiment multiple times to get a sense of the probability distribution. I doubt it would change the conclusions, though.
| gavelin wrote:
| Good tip! I repeated a few inputs to see if the variation was significant enough to warrant including, and as your intuition suggested, it was not with the parameters I selected. A more robust experiment should definitely include repeat attempts.
| minimaxir wrote:
| There's a _big_ difference between sampling from a learned probability distribution and "picks words literally at random". The temperature = 0 examples involve no sampling by construction (the most likely token is always chosen), while higher temperatures introduce a slight amount of sampling.
|
| That said, it doesn't hurt to have multiple attempts (aside from the cost of using GPT-3, ahem).
| make3 wrote:
| In a professional setting, this is definitely done, and the generations are then ranked with separate models that predict different quality metrics: "interestingness", "safety" in an inclusiveness sense, whether the answer fits a style that you want, whether the facts in the answer seem to make sense, etc. It makes a big difference, actually.
| blackbear_ wrote:
| I thought generating text from language models was a deterministic operation, searching for the maximum-likelihood sequence using beam search?
| make3 wrote:
| It is now common knowledge in the NLP community that beam search only works for situations where the output space is very constrained, specifically in neural machine translation.
|
| In more open-ended generation such as summarization, question answering, and story generation, beam search leads to poor and repetitive outputs. Different (stochastic) sampling methods lead to more interesting, diverse, and functional outputs.
| ansk wrote:
| It can be as deterministic as you want it to be: there are parameters that control how much randomness is used during the sampling process. Finding the most probable sequence from the learned distribution is intractable for all but the shortest of sequences. As you've said, beam search is used, but this is just a local search heuristic and provides no guarantee of producing the most probable output.
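|
| Concretely, the main such parameter is temperature: the logits are divided by it before the softmax, so a temperature of 0 collapses to a deterministic argmax, while higher values flatten the distribution. A minimal sketch with made-up logits:
|
|     # Temperature sampling over a toy next-token distribution.
|     import numpy as np
|
|     def sample_token(logits, temperature, rng):
|         if temperature == 0.0:      # deterministic: always the top token
|             return int(np.argmax(logits))
|         scaled = logits / temperature
|         probs = np.exp(scaled - scaled.max())
|         probs /= probs.sum()        # softmax over the scaled logits
|         return int(rng.choice(len(logits), p=probs))
|
|     logits = np.array([3.2, 2.9, 0.5, -1.0])  # toy scores for 4 tokens
|     rng = np.random.default_rng(0)
|     for t in (0.0, 0.4, 1.0):
|         print(t, [sample_token(logits, t, rng) for _ in range(10)])
|
| At t=0.0 the output is always token 0 (the top score); at t=1.0 tokens 0 and 1 trade off and the low-scoring tokens can occasionally appear.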
| jll29 wrote:
| To those who are interested in the state of the art in commercial legal summarization (US law) as of Q1/2021: https://arxiv.org/pdf/2102.05757.pdf
| gavelin wrote:
| Great link. Thank you for sharing this.
| Der_Einzige wrote:
| This is at the intersection of my research interests. I'd love to see what happens if you run it on a large debate case with slight variations of the input prompt queries.
|
| I love to see all of these articles about legal summarization - but they always cover abstractive summarization! I want an effective highlighter model. To be fair, I did build a system that uses transformers to do extractive summarization in an unsupervised manner, but the results aren't that great. I'm sure that lawyers would find something that could highlight the most legally salient sections extremely useful, if it's reasonably accurate.
|
| I really did figure that we'd get effective word-level extractive summaries before we'd get effective abstractive summaries, but I guess that intuition was wrong...
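|
| For what it's worth, a common unsupervised recipe for that kind of highlighter is centrality ranking over sentence embeddings. A minimal sketch (the sentence-transformers model name is an assumption, and this works at the sentence level, not the word level hoped for above):
|
|     # Embed each sentence, score it by mean cosine similarity to the
|     # rest of the document (centrality), keep the top-k as highlights.
|     import numpy as np
|     from sentence_transformers import SentenceTransformer
|
|     def highlight(sentences, k=3):
|         model = SentenceTransformer("all-MiniLM-L6-v2")
|         emb = model.encode(sentences)
|         emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
|         sim = emb @ emb.T              # pairwise cosine similarity
|         centrality = sim.mean(axis=1)  # central sentences score high
|         top = np.argsort(-centrality)[:k]
|         return [sentences[i] for i in sorted(top)]  # document order
|
| Centrality favors sentences that restate the document's main topic, which is not the same as legal salience; a re-ranker trained on lawyer-highlighted clauses would likely be needed for the latter.
___________________________________________________________________
(page generated 2021-02-18 23:01 UTC)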