[HN Gopher] Wikidata, with 12B facts, can ground LLMs to improve... ___________________________________________________________________ Wikidata, with 12B facts, can ground LLMs to improve their factuality Author : raybb Score : 187 points Date : 2023-11-17 14:44 UTC (8 hours ago) (HTM) web link (twitter.com) (TXT) w3m dump (twitter.com) | bxhdndnjdbj wrote: | Ahh.. feed the LLM a special sauce. Then it will speak the Truth | dang wrote: | We've banned this account for posting unsubstantive comments. | Can you please not create accounts to break HN's rules with? It | will eventually get your main account banned as well. | | If you'd please review | https://news.ycombinator.com/newsguidelines.html and stick to | the rules when posting here, we'd appreciate it. | audiala wrote: | Wikidata is such a treasure. There is quite a learning curve to | master the SPARQL query language but it is really powerful. We | are testing it to provide context to LLMs when generating | audio-guides and the results are very impressive so far. | karencarits wrote: | I wish there was a way to add results from scientific papers to | wikidata - imagine doing meta-analyses by SPARQL queries | gaogao wrote: | You totally can! - https://www.wikidata.org/wiki/Q30249683 | | It's just pretty sparse, so you would need a focused effort | to fill out predicates of interest. | uneekname wrote: | Am I missing something? I do not see any results indicated | in the statements of that entity. | gaogao wrote: | Right, such a result would need to be marked with a new | predicate (verb) like: ``` Subject - Transformer's Paper; | Predicate - Score; Object - BLEU (28.4) ``` One of the | trickiest things about using a semantic triple store like this is | that there are a lot of ways of phrasing the data, lots of | ambiguity. LLMs help in this case by being able to more | gracefully handle cases like having both 'Score' and | 'Benchmark' predicates, merging the two together.
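The SPARQL workflow audiala and gaogao describe can be sketched in a few lines of Python - a minimal example against the public Wikidata endpoint. Q42 (Douglas Adams) and P31 ("instance of") are real identifiers used purely for illustration; the User-Agent string is a placeholder.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint (rate-limited; always send a User-Agent).
ENDPOINT = "https://query.wikidata.org/sparql"

def label_query(qid: str, pid: str) -> str:
    """Build a SPARQL query returning English labels of a property's values."""
    return (
        "SELECT ?valueLabel WHERE { "
        f"wd:{qid} wdt:{pid} ?value . "
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } }'
    )

def run(sparql: str) -> dict:
    """Send the query and parse the JSON response (network access required)."""
    url = ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": sparql, "format": "json"})
    req = urllib.request.Request(url, headers={"User-Agent": "sparql-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example: build a query for the instance-of (P31) values of Q42.
query = label_query("Q42", "P31")
```

Calling `run(query)` hits the live endpoint, so results depend on the data at query time.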
| karencarits wrote: | Indeed, and hopefully - if there were a structured way of | doing it - people might want to make that effort in relation | to doing reviews or meta-analyses, to make the underlying | data available for others, and make it easier to reproduce | or update the results over time | bugglebeetle wrote: | One of my favorite things about ChatGPT is that I pretty much | never have to write SPARQL myself anymore. I've had zero | problems with the resulting queries either, except in cases | where I've prompted it incorrectly. | gaogao wrote: | Yeah, it works so well, I wonder if it's just a natural fit | due to the attention mechanism and graph databases sharing | some common semantic triple foundations | gaogao wrote: | And in the reverse, Wikidata has a lot of gaps in its | annotations, where human labelling could be augmented by LLMs. I | wrote some stuff on both grounding responses and adding more data to | Wikidata a while ago | https://friend.computer/jekyll/update/2023/04/30/wikidata-ll... | jsemrau wrote: | Would be a good idea to create an annotation model like DALL-E | 3 did. | unshavedyak wrote: | Yup. This is my gut to where LLMs will really explode. Let them | augment data just a bit, train on improved data, augment more, | train again - etc. If we take things slow I suspect in the | long run it'll be really beneficial to multiple paradigms. | | I know people say training bots on bot data is bad, but A: it's | happening anyway, and B: it can't be worse than the actual | garbage they get trained on in a lot of cases anyway.. can it? | Filligree wrote: | Pixart-alpha provides an example of C: the bot labels can be | dramatically better than the human labels. | | Even though they used LLaVA, and LLaVA isn't all that good | compared to GPT-4. | gaogao wrote: | Training on bot data can be bad when it's ungrounded and | basically hallucinations on hallucinations. | | Having LLMs help curate something grounded is generally | reasonable.
Functionally, it's somewhat similar to how some | training pipelines use generated subtitles of videos as | video/text pairs; it's very feasible to also go and clean | those up, even though it is bot data. | ori_b wrote: | > _Yup. This is my gut to where LLMs will really explode_ | | Yes, indeed. This is one place where LLMs can make it look | like a bomb went off. | prosqlinjector wrote: | We can do polynomial regression of data sets that look | equally plausible, but it's not real data. | foobarchu wrote: | > A: it's happening anyway | | This is never a valid defense for doing more of something. | akjshdfkjhs wrote: | Please no! | | The cost of now having unknown false data in there would | completely ruin the value of the whole effort. | | The entire value of the data (which is already everywhere | anyway) is the "cost" contributors paid via heavy moderation. | If you do not understand why that is diametrically opposite of | adding/enriching/augmenting/whichever euphemism with LLMs, I | don't know what else to say. | cj wrote: | 100%. I live in a hamlet of a larger town in the US, and was | curious what the population of my hamlet is. | | There's a Wikipedia page for the hamlet, but it's empty. No | population data, etc. | | I'd _much_ rather see no data than an LLM's best guess. I'm | guessing an LLM using the data would also perform better | without approximated or "probably right" information. | gaogao wrote: | Yeah, adding directly with an LLM is a bad idea. Instead, | this would basically be making suggestions, linked back to the | Wikipedia snippet, that a person could approve, edit, or | reject. This is a flow for scaling up annotation of data that | works pretty well, since it also sucks having a ton of gaps | in the structured data when the information is sitting right there in the | linked Wikipedia page. | StableAlkyne wrote: | Did they ever create the bridge from Wikipedia to Wikidata? I | remember hearing talk about it as a way of helping the lack | of data.
The problem I had with Wikidata a couple years ago | was that it was usually an incomplete subset of Wikipedia's | infoboxes. | | Checking again for m-xylene, | https://m.wikidata.org/wiki/Q3234708 | | You get physical property data and citations. | | Now compare that to the chem infobox in Wikipedia: | https://en.m.wikipedia.org/wiki/M-Xylene | | You get a lot more useful data, like the dipole moment and | solubility (kinda important for a solvent like xylene), and | tons of other properties that Wikidata just doesn't have. All | in the infobox. | | It's weird that they don't just copy the Wikipedia infobox | for the chemicals in Wikidata. It's already there and | organized. And frequently cited. | | Maybe it's more useful for other fields, but I can't think of | a good use I'd get from the chemical section of Wikidata over | the databases it cites or Wikipedia itself... | YoshiRulz wrote: | I'm not that familiar with the subject, but I did read[1] | that Wikidata's adoption has been slowed by the fact that | triples can only be used on one page (per localisation). | There is some support for using it with infoboxes | though[2]. | | [1]: https://meta.wikimedia.org/wiki/Help:Array#Wikidata | | [2]: | https://en.wikipedia.org/wiki/Help:Wikidata#In_infoboxes | raybb wrote: | It would be really cool if there was a tool that could help | extract data from, say, a news article and then populate Wikidata | with it after human review. I find that adding simple | fields like date founded takes too many clicks with the default | GUI. | huytersd wrote: | Only if it is human validated, and even then not really. | gibsonf1 wrote: | Hmm, there is a lot of opinion in Wikidata - so I would not call | all of them facts, although some items are. Even if it was all | factual, the statistical nature of LLMs would still invent | things from the input, as per the nature of the technology.
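The coverage gap StableAlkyne describes can be checked programmatically - a small sketch using Wikidata's `wbgetentities` API (a real endpoint) to list which properties an item actually carries. The stub payload below stands in for a live response; its property set (P31 "instance of", P231 "CAS Registry Number") is illustrative.

```python
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"

def fetch_entity(qid: str) -> dict:
    """Fetch an item's full statement set via wbgetentities (network required)."""
    url = API + "?" + urllib.parse.urlencode(
        {"action": "wbgetentities", "ids": qid, "format": "json"})
    req = urllib.request.Request(url, headers={"User-Agent": "gap-check/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def property_ids(entity_json: dict, qid: str) -> list[str]:
    """List which property IDs an item has, e.g. to spot infobox gaps."""
    return sorted(entity_json["entities"][qid]["claims"])

# Offline example with a stub shaped like the real API response:
stub = {"entities": {"Q3234708": {"claims": {"P31": [], "P231": []}}}}
```

Diffing `property_ids` output against the fields of the corresponding Wikipedia infobox would surface exactly the missing dipole-moment-style properties the comment complains about.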
| sharemywin wrote: | You just need to tell it to use the facts: | | "Use the information from the following list of facts to answer | the questions I ask, without qualifications. Answer | authoritatively. If the question can't be answered with the | following facts, just say I don't know. | | Absolute Facts: | | The sky is purple. | | The sun is red and green. | | When it rains, animals fall from the sky." | sharemywin wrote: | If you tried to make a customer-facing chatbot I wouldn't let | it generate responses directly. I would have it pick from a | list of canned responses, and/or have it contact a rep to | intervene on complicated questions. But there's no reason | this tech couldn't be used for some commercial situations | now. | prosqlinjector wrote: | The sky is not one color and changes color depending on | weather, sun, and global location. | behnamoh wrote: | But their example uses GPT-3, a completely outdated model which | was prone to hallucinations. GPT-4 has got much better in | that regard, so I wonder what the marginal benefit of Wikidata is | for really huge LLMs such as GPT-4. | not2b wrote: | GPT-4 is not immune to making things up, and a smaller model | that doesn't have as much garbage and nonsense in its training | data might achieve results that are nearly as good for much | less cost. | behnamoh wrote: | Clearly you read my comment wrong, I said "GPT-4 has got much | better in that regard". | jakobson14 wrote: | On facts in the Wikidata dataset? Sure. | | But if you think this will stem the tide of LLM hallucinations, | you're high too. LLMs' primary function is to bullshit. | | In chess many games play out with the same opening but within a | few moves become a game no one has played before. Being outside | the dataset is the default for any sufficiently long | conversation. | mrtesthah wrote: | I've been waiting for the OpenCYC knowledge ontology to be used | for this purpose as well.
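sharemywin's fact-restricted prompt can be wrapped in a small helper - a sketch with wording adapted from the comment above; the exact instructions are illustrative, not from the linked paper.

```python
def grounded_prompt(facts: list[str], question: str) -> str:
    """Assemble a prompt that restricts the model to a supplied fact list."""
    header = ("Use the information from the following list of facts to answer "
              "the question. If the facts are insufficient, answer exactly: "
              "I don't know.")
    fact_block = "\n".join(f"- {fact}" for fact in facts)
    return f"{header}\n\nAbsolute Facts:\n{fact_block}\n\nQuestion: {question}"

prompt = grounded_prompt(
    ["The sky is purple.", "The sun is red and green."],
    "What color is the sky?")
```

The helper only builds the prompt string; how faithfully a given model obeys the "I don't know" escape hatch is exactly the point under debate in the thread.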
| gibsonf1 wrote: | That ontology is quite a mess actually. | euroderf wrote: | Me too. But if OpenCYC has been completely absent from the | public discourse about A.I., does that mean there's a super | secret collaboration going on? Or... hmm, maybe the NSA gets | to throw a few hundred million bucks at the problem? | riku_iki wrote: | You would also need a facts base, and in my understanding OpenCYC | is small compared to Wikidata | stri8ed wrote: | Do existing LLMs not already train on this data? | kfrzcode wrote: | Nope. Training data for the big LLMs is a corpus of text, not | structured data. As far as I understand, structured data would | add much more dimensionality with regard to parameterization. | brlewis wrote: | The linked tweet has a diagram where you can pretty quickly see | that this isn't just about using Wikidata as a training set. | The paper linked from the tweet also gives a good summary on | its first page. | crazygringo wrote: | Can it though? | | LLMs are currently trained on actual language patterns, and pick | up facts that are repeated consistently, not one-off things -- | and within all sorts of different contexts. | | Adding a bunch of unnatural "From Wikidata, <noun> <verb> <noun>" | sentences to the training data, severed from any kind of context, | seems like it would run the risk of: | | - Not increasing factual accuracy because there isn't enough | repetition of them | | - Not increasing factual accuracy because these facts aren't | being repeated consistently across other contexts, so they result | in a walled-off part of the model that doesn't affect normal | writing | | - And if they are massively repeated, all sorts of problems with | overtraining and learning exact sentences rather than the | conceptual content | | - Either way, introducing linguistic confusion to the LLM, | thinking that making long lists of "From Wikidata, ..."
is a | normal way of talking | | If this is a technique that actually works, I'll believe it when | I see it. | | (Not to mention that I don't think most of the stuff | people are asking LLMs for is represented in Wikidata. | Wikidata-type facts are already pretty decently handled by | regular Google.) | ivalm wrote: | Fine-tuning. You can autogenerate all kinds of factual questions | with one-word answers based on these triplets. | Closi wrote: | Well that's not actually how it works - they are just getting a | model (WikiSP & EntityLinker) to write a query that responds | with the fact from Wikidata. Did you read the post or just the | headline? | | Besides, let's not forget that humans are _also_ trained on | language data, and although humans can also be wrong, if a | human memorised all of Wikidata (by reading sentences/facts in | 'training data') it would be pretty good in a pub quiz. | | Also, we obviously can't see anything inside how OpenAI train | GPT, but I wouldn't be surprised if sources with a higher | authority (e.g. Wikidata) can be given a higher weight in the | training data, and also if sources such as Wikidata could be | used with reinforcement learning to ensure that answers within | the dataset are 'correctly' answered without hallucination. | toomuchtodo wrote: | In this context, these are more expert systems vs LLMs, and | as you enumerate, they can work well if built well. For | example, Google surfaces search engine results directly. This | is similar, but more powerful, because the Wikimedia Foundation | can actually improve results, gaps, and overall performance, while | Google DGAF. | | I would expect that as the tide rises with regard to this tech, | self-hosting of training and serving prompts | becomes easier. For Wikimedia, it'll just be another cluster | and data pipeline system(s) at their datacenter.
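The pipeline Closi describes - link the entities in a question, then have a model emit a Wikidata query - can be sketched as toy control flow. The lookup tables below stand in for the paper's fine-tuned EntityLinker and WikiSP models (which map free text to identifiers neurally, not by table lookup); the QID is a placeholder, while P915 is Wikidata's real "filming location" property.

```python
# Toy stand-ins for the fine-tuned models; entries are illustrative.
ENTITIES = {"a bronx tale": "Q1130705"}    # placeholder QID
PROPERTIES = {"filming location": "P915"}  # real Wikidata property

def question_to_sparql(entity: str, relation: str) -> str:
    """Link the entity and relation, then fill a SPARQL template."""
    qid = ENTITIES[entity.lower()]
    pid = PROPERTIES[relation.lower()]
    return f"SELECT ?v WHERE {{ wd:{qid} wdt:{pid} ?v . }}"

sparql = question_to_sparql("A Bronx Tale", "filming location")
```

Executing the generated query against the Wikidata endpoint, rather than sampling the answer from model weights, is what lets the system return "I don't know" when the fact is absent.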
| crazygringo wrote: | Ah, I did misunderstand how it worked, thanks -- I was | looking at the flow chart and just focusing on the part that | said "From Wikidata, the filming location of 'A Bronx Tale' | includes New Jersey and New York" that had an arrow feeding | it into GPT-3... | | I'm not really sure how useful something this simple is, | then. If it's not actually improving the factual accuracy in | the training of the model itself, it's really just a hack | that makes the whole system even harder to reason about. | westurner wrote: | The objectively true data part? | | Also there's Retrieval Augmented Generation (RAG) | https://www.promptingguide.ai/techniques/rag : | | > _For more complex and knowledge-intensive tasks, it's | possible to build a language model-based system that | accesses external knowledge sources to complete tasks. This | enables more factual consistency, improves reliability of | the generated responses, and helps to mitigate the problem | of "hallucination"._ | | > _Meta AI researchers introduced a method called Retrieval | Augmented Generation (RAG) to address such | knowledge-intensive tasks. RAG combines an information retrieval | component with a text generator model. RAG can be | fine-tuned and its internal knowledge can be modified in an | efficient manner and without needing retraining of the | entire model._ | | > _RAG takes an input and retrieves a set of | relevant/supporting documents given a source (e.g., Wikipedia). | The documents are concatenated as context with the original | input prompt and fed to the text generator which produces | the final output. This makes RAG adaptive for situations | where facts could evolve over time._
_This is very useful as | LLMs' parametric knowledge is static._ | | > _RAG allows language models to bypass retraining, | enabling access to the latest information for generating | reliable outputs via retrieval-based generation._ | | > _Lewis et al. (2021) proposed a general-purpose | fine-tuning recipe for RAG. A pre-trained seq2seq model is used | as the parametric memory and a dense vector index of | Wikipedia is used as non-parametric memory (accessed using | a neural pre-trained retriever)._ [...] | | > _RAG performs strongly on several benchmarks such as | Natural Questions, WebQuestions, and CuratedTrec. RAG | generates responses that are more factual, specific, and | diverse when tested on MS-MARCO and Jeopardy questions. RAG | also improves results on FEVER fact verification._ | | > _This shows the potential of RAG as a viable option for | enhancing outputs of language models in knowledge-intensive | tasks._ | | So, with various methods, I think having ground facts in | the process somehow should improve accuracy. | rcfox wrote: | Isn't repetition essentially a way of adding weight? If you | could increase the inherent weight of Wikidata, wouldn't that | provide the same effect? | NovemberWhiskey wrote: | If you want to increase the likelihood that answers will read | like Wikipedia entries, sure. | brandonasuncion wrote: | Is it possible to finetune an LLM on the factual content | without altering its linguistic characteristics? | | With Stable Diffusion, you're able to use LoRAs to | introduce specific characters, objects, concepts, etc. | while maintaining the same visual qualities of the base | model. | | Why can't something similar be done with an LLM? | visarga wrote: | If I had the funds I'd run the whole training set (GPT-4 used 13 | trillion tokens) through an LLM to mine factual statements, then | do reconciliation - or even better, I'd save a summary description | of the diverse results. In the end we'd end up with a universal | KB.
Even for controversial topics, it would at least model the | distribution of opinions, and be able to confirm if a statement | doesn't exist in the database. | | Besides mining KB triplets I'd also use the LLM with contextual | material to generate Wikipedia-style articles based off external | references. It should write 1000x more articles covering all | known names and concepts, creating trillions of synthetic tokens | of high quality. This would be added to the pre-training stage. | boznz wrote: | Having an indexed database of facts - not half-facts or | untruths - is the only way AI is ever going to be useful; and until | it can fact-check for itself, these databases will need to be the | training wheels. | prosqlinjector wrote: | Curating and presenting facts is a form of narrative and is not | at all objective. | mike_hock wrote: | But then it needs extra filters so it doesn't accidentally say | something based. | Vysak wrote: | I think this typo works | TaylorAlexander wrote: | I don't think it was a typo. | night-rider wrote: | 'Facts' based on citations that no longer exist, or if they do | exist, they remain on Archive.org's Wayback Machine. And then | when you visit the resource in question, the author is not | credible enough to be believed and their 'facts' are on shaky | ground. It's turtles all the way down. | benopal64 wrote: | I question the sentiment. I think people CAN argue the basis of | a fact; however, being pragmatic and holistic can help provide | some understanding. Truth is always relative and always has | been. However, the human perspective holds real, tangible, | recordable, and testable evidence. We rely on the multitude of | perspectives to fully flesh out reality and determine the | details and TRUTH of reality at multiple scales. The value of | diverse human perspectives is similar to the value of | perceiving an idea, concept, or object at different scales. | riku_iki wrote: | Sounds like an algorithmically solvable problem..
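visarga's mining-and-reconciliation idea can be sketched with a counter-backed triple store that keeps the distribution of conflicting claims and can confirm whether a statement exists at all. This is a toy illustration under stated assumptions, not the proposed system; the example triples are illustrative.

```python
from collections import Counter, defaultdict

class TripleStore:
    """Toy store counting every mined (subject, predicate, object) claim."""

    def __init__(self):
        self._claims = defaultdict(Counter)

    def add(self, subj: str, pred: str, obj: str) -> None:
        self._claims[(subj, pred)][obj] += 1

    def distribution(self, subj: str, pred: str) -> dict:
        """Fraction of mined statements supporting each object value."""
        counts = self._claims[(subj, pred)]
        total = sum(counts.values())
        return {o: c / total for o, c in counts.items()}

    def contains(self, subj: str, pred: str, obj: str) -> bool:
        """Confirm whether a statement appears in the store at all."""
        return self._claims[(subj, pred)][obj] > 0

store = TripleStore()
store.add("Pluto", "instance of", "dwarf planet")
store.add("Pluto", "instance of", "dwarf planet")
store.add("Pluto", "instance of", "planet")  # minority/outdated claim
```

Keeping counts rather than a single winner is what lets the store "model the distribution of opinions" instead of forcing premature reconciliation.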
| Racing0461 wrote: | What if wiki articles are written using LLMs from now on? That | would be "AI incest" if it's used as training/ground-truth data. | | I foresee data created before AI/LLMs being very valuable going | forward, in much the same way steel made before the detonation of | the first atomic bomb is used for nuclear devices/MRIs/etc. | ElectricalUnion wrote: | There is even an XKCD for that: https://xkcd.com/978/ | | s/a user's brain/llm/g | klysm wrote: | Strong doubt. The problem is LLMs don't have a robust | epistemology (they can't), and are structurally unable to provide | a basis over which they've "reasoned". | blackbear_ wrote: | Don't get too hung up on the present technical definition of | LLM. Perhaps it is possible to find new architectures that are | better suited to grounding their claims. | SrslyJosh wrote: | > Don't get too hung up on the present technical definition | of LLM. | | The paper is literally about LLMs. Speculation about future | model architectures is irrelevant. | lukev wrote: | Using retrieval to look up a fact and then citing that fact in | response to a query (with attribution) is absolutely within the | capabilities of current LLMs. | | LLMs "hallucinate" all the time when pulling data from their | weights (because that's how that works, it's all just token | generation). But if the correct data is placed within their | context they are very capable of presenting the data in natural | language. | kelseyfrog wrote: | Humans, when probed, don't have a robust epistemology either. | | Our knowledge (and reality) is grounded in heuristics we reify | into natural order, and it's easy for us to forget that our | conclusions exist as a veneer on our perceptions. Nearly every | bit of knowledge we hold has an opposite twin that we hold as | well. We favor completeness over consistency.
| | When pressed, humans tend to justify their heuristics rather | than reexamine them, because our minds have a clarity bias - i.e., | we would rather feel like things are clear even if they are | wrong. Oftentimes we can't go back and test if they are wrong, | which biases epistemological justifications even more. | | So no, our rationality, the vast proportion of the time, is used to | rationalize rather than conclude. | Animats wrote: | Isn't this usually done by having something that takes in a | query, finds possibly relevant info in a database, and adds it to | the LLM prompt? That allows the use of very large databases | without trying to store them inside the LLM weights. | Linell wrote: | Yes, it's called retrieval augmented generation. | dmezzetti wrote: | I think a better approach is using retrieval augmented generation | with Wikipedia. | | This data source is designed for that: | https://huggingface.co/NeuML/txtai-wikipedia. | | With this source, you can also select the articles that are viewed | the most, which is another important factor in validating facts. | An article which has no views might not be the best source of | information. | audiala wrote: | Part of the issue is to select the right Wikipedia article. | Wikidata offers a way to know for sure that you query the LLM | with the right data. Also, the Wikipedia txtai dataset is for | English only. | itissid wrote: | I feel that if you use a pre-trained model to do these things without | knowing the set intersection of the test set and that dataset, it makes | it very tough to know weather inference is in the transitive | closure of generated text the models were trained on or weather | they really improved. | | There was another approach to grounding LLMs the other day, from | Normal Computing: | https://blog.normalcomputing.ai/posts/2023-09-12-supersizing... | in which they use Mosaic, but they also did not mention that | this was actually done.
| | Sentient or not, I feel there should be a standard on | aggressively filtering out overlap between training and test datasets | for approaches like this. | chrisweekly wrote: | It's "whether". (weather is eg sunny or raining) | thewanderer1983 wrote: | Please don't. Wikipedia long ago abandoned neutrality. They aren't | the bearers of truth. | antipaul wrote: | Is it AI, or just a lookup table? | Scene_Cast2 wrote: | For some more reading on using facts for ML, check out this | discussion: https://news.ycombinator.com/item?id=37354000 | xacky wrote: | I know for a fact that there are a lot of unreverted vandal edits | in Wikidata, because Wikidata's bots enter data so fast that it | is too fast for Special:RecentChanges to monitor. Even Wikipedia | still regularly gets 15+ year old hoaxes added to its hoax | list. | born-jre wrote: | When a 100B+ model hallucinates, that's a problem - but does a | Mistral 7B (a ~4 GB file at Q4 quantization) have the parameters | to encode enough information to be hallucination-proof? LLMs | cannot know what they do not know. | | So maybe we should be building smaller models where we use their | generation abilities, not their facts, and instead teach | them to query another knowledge base system (reverse RAG) for | facts. ___________________________________________________________________ (page generated 2023-11-17 23:00 UTC)