[HN Gopher] Wikidata, with 12B facts, can ground LLMs to improve...
       ___________________________________________________________________
        
       Wikidata, with 12B facts, can ground LLMs to improve their
       factuality
        
       Author : raybb
       Score  : 187 points
       Date   : 2023-11-17 14:44 UTC (8 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | bxhdndnjdbj wrote:
       | Ahh.. feed the LLM a special sauce. Then it will speak the Truth
        
         | dang wrote:
         | We've banned this account for posting unsubstantive comments.
         | Can you please not create accounts to break HN's rules with? It
         | will eventually get your main account banned as well.
         | 
         | If you'd please review
         | https://news.ycombinator.com/newsguidelines.html and stick to
         | the rules when posting here, we'd appreciate it.
        
       | audiala wrote:
       | Wikidata is such a treasure. There is quite a learning curve to
       | master the SPARQL query language but it is really powerful. We
       | are testing it to provide context to LLMs when generating audio-
       | guides and the results are very impressive so far.
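        | 
        | A minimal sketch of that flow (the SPARQL endpoint, Q42, and
        | the label service are real Wikidata features; the prompt
        | wording is just an illustration):
        | 
        |   import requests
        | 
        |   ENDPOINT = "https://query.wikidata.org/sparql"
        | 
        |   # All truthy statements about Douglas Adams (Q42), with
        |   # human-readable labels via the label service.
        |   QUERY = """
        |   SELECT ?propLabel ?valueLabel WHERE {
        |     wd:Q42 ?p ?value .
        |     ?prop wikibase:directClaim ?p .
        |     SERVICE wikibase:label {
        |       bd:serviceParam wikibase:language "en" .
        |     }
        |   } LIMIT 50
        |   """
        | 
        |   rows = requests.get(
        |       ENDPOINT,
        |       params={"query": QUERY, "format": "json"},
        |       headers={"User-Agent": "audioguide-demo/0.1"},
        |   ).json()["results"]["bindings"]
        | 
        |   facts = [
        |       f"{r['propLabel']['value']}: {r['valueLabel']['value']}"
        |       for r in rows
        |   ]
        | 
        |   # Ground the generation step by prepending the facts.
        |   prompt = ("Using only the facts below, write a short "
        |             "audio-guide paragraph.\n\nFacts:\n"
        |             + "\n".join(facts))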
        
         | karencarits wrote:
         | I wish there was a way to add results from scientific papers to
         | wikidata - imagine doing meta-analyses by SPARQL queries
        
           | gaogao wrote:
           | You totally can! - https://www.wikidata.org/wiki/Q30249683
           | 
           | It's just pretty sparse, so you would need a focused effort
           | to fill out predicates of interest.
        
             | uneekname wrote:
             | Am I missing something? I do not see any results indicated
             | in the statements of that entity.
        
               | gaogao wrote:
                | Right, such a result would need to be marked with a new
                | predicate (verb) like: ``` Subject - Transformer's Paper
                | Predicate - Score Object - BLEU (28.4) ``` One of the
                | trickiest things about using a semantic triple store like
                | this is that there are a lot of ways of phrasing the data,
                | lots of ambiguity. LLMs help in this case by being able to
                | more gracefully handle cases like having both 'Score' and
                | 'Benchmark' predicates, merging the two together.
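                | 
                | A toy illustration of that merge step (these predicate
                | names are made up for the example, not real Wikidata
                | properties):
                | 
                |   # Two phrasings of the same fact, as triples.
                |   triples = [
                |       ("Transformer paper", "Score", "BLEU 28.4"),
                |       ("Transformer paper", "Benchmark", "BLEU 28.4"),
                |   ]
                | 
                |   # Normalize synonymous predicates before loading
                |   # them into the triple store.
                |   ALIASES = {"Benchmark": "Score"}
                | 
                |   merged = {
                |       (s, ALIASES.get(p, p), o) for s, p, o in triples
                |   }
                |   print(merged)
                |   # {('Transformer paper', 'Score', 'BLEU 28.4')}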
        
             | karencarits wrote:
              | Indeed, and hopefully, if there were a structured way of
              | doing it, people might want to make that effort when doing
              | reviews or meta-analyses, to make the underlying data
              | available for others and make it easier to reproduce or
              | update the results over time
        
         | bugglebeetle wrote:
         | One of my favorite things about ChatGPT is that I pretty much
         | never have to write SPARQL myself anymore. I've had zero
         | problems with the resulting queries either, except in cases
         | where I've prompted it incorrectly.
        
           | gaogao wrote:
           | Yeah, it works so well, I wonder if it's just a natural fit
           | due to the attention mechanism and graph databases sharing
           | some common semantic triple foundations
        
       | gaogao wrote:
        | And in the reverse direction, Wikidata has a lot of gaps in its
        | annotations, where human labelling could be augmented by LLMs. I
        | wrote some stuff a while ago on both grounding responses and
        | adding more data to Wikidata:
        | https://friend.computer/jekyll/update/2023/04/30/wikidata-ll...
        
         | jsemrau wrote:
         | Would be a good idea to create an annotation model like DALL-E
         | 3 had done.
        
         | unshavedyak wrote:
         | Yup. This is my gut to where LLMs will really explode. Let them
         | augment data just a bit, train on improved data, augment more,
          | train again - etc, etc. If we take things slow I suspect in the
         | long run it'll be really beneficial to multiple paradigms.
         | 
         | I know people say training bots on bot data is bad, but A: it's
         | happening anyway, and B: it can't be worse than the actual
         | garbage they get trained on in a lot of cases anyway.. can it?
        
           | Filligree wrote:
           | Pixart-alpha provides an example of C: The bot labels can be
           | dramatically better than the human labels.
           | 
           | Even though they used LLaVA, and LLaVA isn't all that good
           | compared to gpt-4.
        
           | gaogao wrote:
           | Training on bot data can be bad when it's ungrounded and
           | basically hallucinations on hallucinations.
           | 
           | Having LLMs help curate something grounded is generally
           | reasonable. Functionally, it's somewhat similar to how some
           | training is using generated subtitles of videos for training
           | video/text pairs; it's very feasible to also go and clean
           | those up, even though it is bot data.
        
           | ori_b wrote:
           | > _Yup. This is my gut to where LLMs will really explode_
           | 
           | Yes, indeed. This is one place where LLMs can make it look
           | like a bomb went off.
        
           | prosqlinjector wrote:
            | We can do polynomial regression of data sets that look
            | equally plausible, but it's not real data.
        
           | foobarchu wrote:
           | > A: it's happening anyway
           | 
           | This is never a valid defense for doing more of something.
        
         | akjshdfkjhs wrote:
         | please no!
         | 
         | The cost of now having unknown false data in there would
         | completely ruin the value of the whole effort.
         | 
         | The entire value of the data (which is already everywhere
         | anyway) is the "cost" contributors paid via heavy moderation.
         | If you do not understand why that is diametrically opposite of
         | adding/enriching/augmenting/whichever euphemism with LLMs, I
         | don't know what else to say.
        
           | cj wrote:
           | 100%. I live in a hamlet of a larger town in the US, and was
           | curious what the population of my hamlet is.
           | 
           | There's a Wikipedia page for the hamlet, but it's empty. No
           | population data, etc.
           | 
            | I'd _much_ rather see no data than an LLM's best guess. I'm
            | guessing an LLM using the data would also perform better
            | without approximated or "probably right" information.
        
           | gaogao wrote:
            | Yeah, adding directly with an LLM is a bad idea. Instead,
            | this would basically be making suggestions, linked back to
            | the Wikipedia snippet, that a person could approve, edit, or
            | reject. This is a flow for scaling up data annotation that
            | works pretty well, since it also sucks having a ton of gaps
            | in the structured data when the information is sitting right
            | there in the linked Wikipedia page.
        
           | StableAlkyne wrote:
           | Did they ever create the bridge from Wikipedia to Wikidata? I
           | remember hearing talk about it as a way of helping the lack
           | of data. The problem I had with Wikidata a couple years ago
           | was that it was usually an incomplete subset of Wikipedia's
           | infoboxes.
           | 
           | Checking again for m-xylene,
           | https://m.wikidata.org/wiki/Q3234708
           | 
           | You get physical property data and citations.
           | 
           | Now compare that to the chem infobox in wikipedia:
           | https://en.m.wikipedia.org/wiki/M-Xylene
           | 
           | You get a lot more useful data, like the dipole moment and
           | solubility (kinda important for a solvent like Xylene), and
           | tons of other properties that Wikidata just doesn't have. All
           | in the infobox.
           | 
           | It's weird that they don't just copy the Wikipedia infobox
           | for the chemicals in Wikidata. It's already there and
           | organized. And frequently cited.
           | 
           | Maybe it's more useful for other fields, but I can't think of
           | a good use I'd get from the chemical section of Wikidata over
           | the databases it cites or Wikipedia itself...
        
             | YoshiRulz wrote:
             | I'm not that familiar with the subject, but I did read[1]
             | that Wikidata's adoption has been slowed by the fact that
             | triples can only be used on one page (per localisation).
             | There is some support for using it with infoboxes
             | though[2].
             | 
             | [1]: https://meta.wikimedia.org/wiki/Help:Array#Wikidata
             | 
             | [2]:
             | https://en.wikipedia.org/wiki/Help:Wikidata#In_infoboxes
        
         | raybb wrote:
          | It would be really cool if there were a tool that could help
          | extract data from, say, a news article and then populate
          | Wikidata with it after human review. I find the task of adding
          | simple fields like date founded to take too many clicks with
          | the default GUI.
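          | 
          | A rough sketch of that review loop (ask_llm is a placeholder
          | for whatever model/API you call; the prompt wording is made
          | up):
          | 
          |   import json
          | 
          |   def review_candidates(article_text, ask_llm):
          |       prompt = (
          |           "Extract (subject, property, value) statements "
          |           "from this article as a JSON list. Only include "
          |           "statements made verbatim in the text.\n\n"
          |           + article_text
          |       )
          |       candidates = json.loads(ask_llm(prompt))
          | 
          |       approved = []
          |       for cand in candidates:
          |           print(cand)
          |           # Nothing reaches Wikidata without a human yes.
          |           if input("approve? [y/N] ").lower() == "y":
          |               approved.append(cand)
          |       return approved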
        
         | huytersd wrote:
         | Only if it is human validated and even then not really.
        
       | gibsonf1 wrote:
        | Hmm, there is a lot of opinion in Wikidata - so I would not call
        | all of it facts, although some items are. Even if it were all
        | factual, the statistical nature of LLMs would still lead them to
        | invent things from the input, as per the nature of the
        | technology.
        
         | sharemywin wrote:
         | You just need to tell it to use the facts:
         | 
         | "Us the information from the following list of facts to answer
         | the questions I ask without qualifications. answer
         | authoritatively. If the question can't be answered with the
         | following facts just say I don't know.
         | 
         | Absolute Facts:
         | 
         | The sky is purple.
         | 
         | The sun is red and green
         | 
         | When it rains animals fall from the sky."
        
           | sharemywin wrote:
            | If you tried to make a customer-facing chatbot, I wouldn't
            | let it generate responses directly. I would have it pick from
            | a list of canned responses, and/or have it contact a rep to
            | intervene on complicated questions. But there's no reason
            | this tech couldn't be used for some commercial situations
            | now.
        
           | prosqlinjector wrote:
           | The sky is not one color and changes color depending on
           | weather, sun, and global location.
        
       | behnamoh wrote:
        | But their example uses GPT-3, a completely outdated model which
        | was prone to hallucinations. GPT-4 has got much better in that
        | regard, so I wonder what the marginal benefit of Wikidata is for
        | really huge LLMs such as GPT-4.
        
         | not2b wrote:
         | GPT-4 is not immune to making things up, and a smaller model
         | that doesn't have as much garbage and nonsense in its training
         | data might achieve results that are nearly as good for much
         | less cost.
        
           | behnamoh wrote:
           | Clearly you read my comment wrong, I said "GPT-4 has got much
           | better in that regard".
        
       | jakobson14 wrote:
        | On facts in the Wikidata dataset? Sure.
       | 
       | But if you think this will stem the tide of LLM hallucinations,
       | you're high too. LLMs' primary function is to bullshit.
       | 
       | In chess many games play out with the same opening but within a
       | few moves become a game no one has played before. Being outside
       | the dataset is the default for any sufficiently long
       | conversation.
        
       | mrtesthah wrote:
       | I've been waiting for the OpenCYC knowledge ontology to be used
       | for this purpose as well.
        
         | gibsonf1 wrote:
         | That ontology is quite a mess actually.
        
         | euroderf wrote:
         | Me too. But if OpenCYC has been completely absent from the
         | public discourse about A.I., does that mean there's a super
          | secret collaboration going on? Or... hmm, maybe the NSA gets
          | to throw a few hundred million bucks at the problem?
        
         | riku_iki wrote:
          | You would also need a facts base, and in my understanding
          | OpenCYC is small compared to Wikidata
        
       | stri8ed wrote:
       | Do existing LLM's not already train on this data?
        
         | kfrzcode wrote:
          | Nope. Training data for the big LLMs is a corpus of text, not
          | structured data. As far as I understand, structured data would
          | introduce much more dimensionality with regard to
          | parameterization.
        
         | brlewis wrote:
         | The linked tweet has a diagram where you can pretty quickly see
         | that this isn't just about using wikidata as a training set.
         | The paper linked from the tweet also gives a good summary on
         | its first page.
        
       | crazygringo wrote:
       | Can it though?
       | 
        | LLMs are currently trained on actual language patterns, and pick
       | up facts that are repeated consistently, not one-off things --
       | and within all sorts of different contexts.
       | 
       | Adding a bunch of unnatural "From Wikidata, <noun> <verb> <noun>"
       | sentences to the training data, severed from any kind of context,
       | seems like it would run the risk of:
       | 
       | - Not increasing factual accuracy because there isn't enough
       | repetition of them
       | 
       | - Not increasing factual accuracy because these facts aren't
       | being repeated consistently across other contexts, so they result
       | in a walled-off part of the model that doesn't affect normal
       | writing
       | 
       | - And if they are massively repeated, all sorts of problems with
       | overtraining and learning exact sentences rather than the
       | conceptual content
       | 
       | - Either way, introducing linguistic confusion to the LLM,
       | thinking that making long lists of "From Wikidata, ..." is a
       | normal way of talking
       | 
       | If this is a technique that actually works, I'll believe it when
       | I see it.
       | 
       | (Not to mention the fact that I don't think most of the stuff
        | people are asking LLMs for is represented in Wikidata.
       | Wikidata-type facts are already pretty decently handled by
       | regular Google.)
        
         | ivalm wrote:
          | Fine-tuning. You can autogenerate all kinds of factual
          | questions with one-word answers based on these triplets.
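          | 
          | For instance, roughly (the question template is made up; the
          | first triple paraphrases the example in the linked tweet):
          | 
          |   import json
          | 
          |   triples = [
          |       ("A Bronx Tale", "filming location", "New Jersey"),
          |       ("Douglas Adams", "place of birth", "Cambridge"),
          |   ]
          | 
          |   # Turn each triple into a short-answer QA pair for
          |   # supervised fine-tuning (prompt/completion JSONL).
          |   with open("wikidata_qa.jsonl", "w") as f:
          |       for subj, pred, obj in triples:
          |           row = {
          |               "prompt": f"What is the {pred} of {subj}?",
          |               "completion": obj,
          |           }
          |           f.write(json.dumps(row) + "\n")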
        
         | Closi wrote:
         | Well that's not actually how it works - they are just getting a
         | model (WikiSP & EntityLinker) to write a query that responds
         | with the fact from Wikidata. Did you read the post or just the
         | headline?
         | 
         | Besides, let's not forget that humans are _also_ trained on
         | language data, and although humans can also be wrong, if a
          | human memorised all of Wikidata (by reading sentences/facts in
          | 'training data') they would be pretty good at a pub quiz.
         | 
         | Also, we obviously can't see anything inside how OpenAI train
         | GPT, but I wouldn't be surprised if sources with a higher
         | authority (e.g. wikidata) can be given a higher weight in the
         | training data, and also if sources such as wikidata could be
         | used with reinforcement learning to ensure that answers within
         | the dataset are 'correctly' answered without hallucination.
        
           | toomuchtodo wrote:
            | In this context, these are more expert systems than LLMs,
            | and as you enumerate, they can work well if built well. For
           | example, Google surfaces search engine results directly. This
           | is similar, but more powerful, because Wikimedia Foundation
           | can actually improve results, gaps, overall performance while
           | Google DGAF.
           | 
           | I would expect as the tide rises with regards to this tech,
           | self hosting of training and providing services to prompts
           | becomes easier. For Wikimedia, it'll just be another cluster
           | and data pipeline system(s) at their datacenter.
        
           | crazygringo wrote:
           | Ah, I did misunderstand how it worked, thanks -- I was
           | looking at the flow chart and just focusing on the part that
           | said "From Wikidata, the filming location of 'A Bronx Tale'
           | includes New Jersey and New York" that had an arrow feeding
              | it into GPT-3...
           | 
           | I'm not really sure how useful something this simple is,
           | then. If it's not actually improving the factual accuracy in
           | the training of the model itself, it's really just a hack
           | that makes the whole system even harder to reason about.
        
             | westurner wrote:
             | The objectively true data part?
             | 
             | Also there's Retrieval Augmented Generation (RAG)
             | https://www.promptingguide.ai/techniques/rag :
             | 
              | > _For more complex and knowledge-intensive tasks, it's
             | possible to build a language model-based system that
             | accesses external knowledge sources to complete tasks. This
             | enables more factual consistency, improves reliability of
             | the generated responses, and helps to mitigate the problem
             | of "hallucination"._
             | 
             | > _Meta AI researchers introduced a method called Retrieval
             | Augmented Generation (RAG) to address such knowledge-
             | intensive tasks. RAG combines an information retrieval
             | component with a text generator model. RAG can be fine-
             | tuned and its internal knowledge can be modified in an
             | efficient manner and without needing retraining of the
             | entire model._
             | 
             | > _RAG takes an input and retrieves a set of relevant
             | /supporting documents given a source (e.g., Wikipedia)._
             | The documents are concatenated as context with the original
             | input prompt and fed to the text generator which produces
             | the final output. This makes RAG adaptive for situations
             | where facts could evolve over time. _This is very useful as
              | LLMs' parametric knowledge is static._
             | 
             | > _RAG allows language models to bypass retraining,
             | enabling access to the latest information for generating
             | reliable outputs via retrieval-based generation._
             | 
             | > _Lewis et al., (2021) proposed a general-purpose fine-
             | tuning recipe for RAG. A pre-trained seq2seq model is used
             | as the parametric memory and a dense vector index of
             | Wikipedia is used as non-parametric memory (accessed using
             | a neural pre-trained retriever)._ [...]
             | 
             | > _RAG performs strong on several benchmarks such as
             | Natural Questions, WebQuestions, and CuratedTrec. RAG
             | generates responses that are more factual, specific, and
             | diverse when tested on MS-MARCO and Jeopardy questions. RAG
             | also improves results on FEVER fact verification._
             | 
             | > _This shows the potential of RAG as a viable option for
             | enhancing outputs of language models in knowledge-intensive
             | tasks._
             | 
             | So, with various methods, I think having ground facts in
             | the process somehow should improve accuracy.
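              | 
              | A bare-bones version of that pipeline (retrieve and
              | generate are stand-ins for any dense retriever / LLM
              | pair; the prompt wording is illustrative):
              | 
              |   def rag_answer(question, retrieve, generate, k=3):
              |       # 1. Retrieve supporting documents.
              |       docs = retrieve(question, k)
              | 
              |       # 2. Concatenate them as context with the
              |       #    original input prompt.
              |       context = "\n\n".join(docs)
              |       prompt = (
              |           "Answer using only the context below. "
              |           "If it is not enough, say you don't "
              |           "know.\n\nContext:\n" + context
              |           + "\n\nQ: " + question + "\nA:"
              |       )
              | 
              |       # 3. The generator sees fresh facts at query
              |       #    time; its parametric knowledge can stay
              |       #    static.
              |       return generate(prompt)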
        
         | rcfox wrote:
         | Isn't repetition essentially a way of adding weight? If you
         | could increase the inherent weight of Wikidata, wouldn't that
         | provide the same effect?
        
           | NovemberWhiskey wrote:
           | If you want to increase the likelihood that answers will read
           | like Wikipedia entries, sure.
        
             | brandonasuncion wrote:
             | Is it possible to finetune an LLM on the factual content
             | without altering its linguistic characteristics?
             | 
             | With Stable Diffusion, you're able to use LoRAs to
             | introduce specific characters, objects, concepts, etc.
             | while maintaining the same visual qualities of the base
             | model.
             | 
             | Why can't something similar be done with an LLM?
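              | 
              | The text-side analogue would look roughly like this
              | with the peft library (the model name and target
              | modules are just an example; whether such an adapter
              | adds facts without changing style is exactly the
              | question):
              | 
              |   from transformers import AutoModelForCausalLM
              |   from peft import LoraConfig, get_peft_model
              | 
              |   base = AutoModelForCausalLM.from_pretrained(
              |       "mistralai/Mistral-7B-v0.1"
              |   )
              | 
              |   # Low-rank adapters on attention projections;
              |   # the base weights stay frozen.
              |   config = LoraConfig(
              |       r=8,
              |       lora_alpha=16,
              |       lora_dropout=0.05,
              |       target_modules=["q_proj", "v_proj"],
              |       task_type="CAUSAL_LM",
              |   )
              |   model = get_peft_model(base, config)
              |   # ...then fine-tune on triple-derived QA pairs
              |   # and swap the adapter in at inference time.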
        
       | visarga wrote:
        | If I had the funds I'd run the whole training set (GPT-4 used 13
        | trillion tokens) through an LLM to mine factual statements, then
        | do reconciliation or, even better, save a summary description of
        | the diverse results. In the end we'd end up with a universal KB.
        | Even for controversial topics, it would at least model the
        | distribution of opinions, and be able to confirm if a statement
        | doesn't exist in the database.
       | 
       | Besides mining KB triplets I'd also use the LLM with contextual
       | material to generate Wikipedia-style articles based off external
       | references. It should write 1000x more articles covering all
       | known names and concepts, creating trillions of synthetic tokens
       | of high quality. This would be added to the pre-training stage.
        
       | boznz wrote:
        | Having an indexed database of facts, not half-facts, half-truths,
        | or untruths, is the only way AI is ever going to be useful; and
        | until it can fact-check for itself, these databases will need to
        | be the training wheels.
        
         | prosqlinjector wrote:
         | Curating and presenting facts is a form of narrative and is not
         | at all objective.
        
       | mike_hock wrote:
       | But then it needs extra filters so it doesn't accidentally say
       | something based.
        
         | Vysak wrote:
         | I think this typo works
        
           | TaylorAlexander wrote:
           | I don't think it was a typo.
        
       | night-rider wrote:
       | 'Facts' based on citations that no longer exist, or if they do
       | exist, they remain on Archive.org's Wayback Machine. And then
       | when you visit the resource in question, the author is not
       | credible enough to be believed and their 'facts' are on shaky
       | ground. It's turtles all the way down.
        
         | benopal64 wrote:
         | I question the sentiment. I think people CAN argue the basis of
         | a fact, however, being pragmatic and holistic can help provide
         | some understanding. Truth is always relative and always has
         | been. However, the human perspective holds real, tangible,
         | recordable, and testable evidence. We rely on the multitude of
         | perspectives to fully flesh out reality and determine the
         | details and TRUTH of reality at multiple scales. The value of
         | diverse human perspectives is similar to the value of
          | perceiving an idea, concept, or object at different scales.
        
         | riku_iki wrote:
          | sounds like an algorithmically solvable problem..
        
       | Racing0461 wrote:
        | What if wiki articles are written using LLMs from now on? That
        | would be "ai incest" if it's used as training/ground truth data.
        | 
        | I foresee data created before AI/LLMs being very valuable going
        | forward, in much the same way that steel mined before the
        | detonation of the first atomic bomb is prized for use in nuclear
        | devices/MRIs/etc.
        
         | ElectricalUnion wrote:
          | There is even an XKCD for that: https://xkcd.com/978/
         | 
         | s/a user's brain/llm/g
        
       | klysm wrote:
       | Strong doubt. The problem is LLMs don't have a robust
       | epistemology (they can't), and are structurally unable to provide
       | a basis over which they've "reasoned".
        
         | blackbear_ wrote:
         | Don't get too hung up on the present technical definition of
         | LLM. Perhaps it is possible to find new architectures that are
         | more suited to ground their claims.
        
           | SrslyJosh wrote:
           | > Don't get too hung up on the present technical definition
           | of LLM.
           | 
           | The paper is literally about LLMs. Speculation about future
           | model architectures is irrelevant.
        
         | lukev wrote:
         | Using retrieval to look up a fact and then citing that fact in
         | response to a query (with attribution) is absolutely within the
         | capabilities of current LLMs.
         | 
         | LLMs "hallucinate" all the time when pulling data from their
         | weights (because that's how that works, it's all just token
         | generation). But if the correct data is placed within their
         | context they are very capable of presenting the data in natural
         | language.
        
         | kelseyfrog wrote:
         | Humans, when probed, don't have a robust epistemology either.
         | 
          | Our knowledge (and reality) is grounded in heuristics we reify
         | into natural order and it's easy for us to forget that our
         | conclusions exist as a veneer on our perceptions. Nearly every
         | bit of knowledge we hold has an opposite twin that we hold as
         | well. We favor completeness over consistency.
         | 
         | When pressed, humans tend to justify their heuristics rather
         | than reexamine them because our minds have a clarity bias - ie:
         | we would rather feel like things are clear even if they are
         | wrong. Often times we can't go back and test if they are wrong
         | which biases epistemological justifications even more.
         | 
          | So no, our rationality, the vast proportion of the time, is
          | used to rationalize rather than to conclude.
        
       | Animats wrote:
       | Isn't this usually done by having something that takes in a
       | query, finds possibly relevant info in a database, and adds it to
       | the LLM prompt? That allows the use of very large databases
       | without trying to store them inside the LLM weights.
        
         | Linell wrote:
         | Yes, it's called retrieval augmented generation.
        
       | dmezzetti wrote:
       | I think a better approach is using retrieval augmented generation
       | with Wikipedia.
       | 
       | This data source is designed for that:
       | https://huggingface.co/NeuML/txtai-wikipedia.
       | 
       | With this source, you can also select articles that are viewed
       | the most, which is another important factor in validating facts.
       | An article which has no views might not be the best source of
       | information.
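        | 
        | For example, something like this (see the dataset card for the
        | exact API; the percentile field reflects page views, so you
        | can filter to well-viewed articles):
        | 
        |   from txtai.embeddings import Embeddings
        | 
        |   # Load the prebuilt index from the Hugging Face Hub.
        |   embeddings = Embeddings()
        |   embeddings.load(
        |       provider="huggingface-hub",
        |       container="neuml/txtai-wikipedia",
        |   )
        | 
        |   # Popular, relevant articles to use as RAG context.
        |   results = embeddings.search(
        |       "SELECT id, text FROM txtai "
        |       "WHERE similar('xylene') AND percentile >= 0.99"
        |   )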
        
         | audiala wrote:
          | Part of the issue is selecting the right Wikipedia article.
          | Wikidata offers a way to know for sure that you query the LLM
          | with the right data. Also, the Wikipedia txtai dataset is
          | English only.
        
       | itissid wrote:
        | I feel that if you use a pre-trained model to do these things
        | without knowing the intersection of the test set and that
        | dataset, it's very tough to know weather inference is in the
        | transitive closure of the generated text the models were trained
        | on or weather they really improved.
       | 
        | There was another approach to grounding LLMs the other day from
        | Normal Computing:
        | https://blog.normalcomputing.ai/posts/2023-09-12-supersizing...
        | in which they use Mosaic, but they also did not mention that
        | this was actually done.
       | 
       | Sentient or not, I feel there should be a standard on
       | aggressively filtering out overlap on training and test datasets
       | for approaches like this.
        
         | chrisweekly wrote:
         | It's "whether". (weather is eg sunny or raining)
        
       | thewanderer1983 wrote:
        | Please don't. Wikipedia long ago abandoned neutrality. They
        | aren't the bearers of truth.
        
       | antipaul wrote:
       | Is it AI, or just a look up table?
        
       | Scene_Cast2 wrote:
       | For some more reading on using facts for ML, check out this
       | discussion: https://news.ycombinator.com/item?id=37354000
        
       | xacky wrote:
        | I know for a fact that there are a lot of unreverted vandal edits
        | in Wikidata, because Wikidata's bots enter data too fast for
        | Special:RecentChanges to keep up with. Even Wikipedia still
        | regularly gets 15+ year old hoaxes added to its hoax list.
        
       | born-jre wrote:
        | When a 100+B model hallucinates, that's a problem, but does a
        | Mistral 7B (qk_4, around a 4GB file) even have the params to
        | encode enough information to be hallucination-proof, since LLMs
        | cannot know what they do not know?
        | 
        | So maybe we should be building smaller models where we use their
        | generation abilities, not their facts, and instead teach them to
        | query another knowledge base system (reverse RAG) for facts.
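        | 
        | Something like: the small model only writes the query and
        | verbalises the result, while the fact itself comes from the
        | knowledge base (gen and run_sparql are placeholders here):
        | 
        |   # gen() is a stand-in for a small local model; the KB,
        |   # not the model, supplies the final fact.
        |   def kb_grounded_answer(question, gen, run_sparql):
        |       sparql = gen(
        |           "Write a Wikidata SPARQL query answering: "
        |           + question
        |       )
        |       rows = run_sparql(sparql)
        |       if not rows:
        |           return "I don't know."
        |       return gen(
        |           f"Question: {question}\nKB result: {rows}\n"
        |           "State the answer in one sentence."
        |       )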
        
       ___________________________________________________________________
       (page generated 2023-11-17 23:00 UTC)