[HN Gopher] RAG is more than just embedding search
       ___________________________________________________________________
        
       RAG is more than just embedding search
        
       Author : jxnlco
       Score  : 104 points
       Date   : 2023-09-21 16:18 UTC (6 hours ago)
        
 (HTM) web link (jxnl.github.io)
 (TXT) w3m dump (jxnl.github.io)
        
       | dsr_ wrote:
       | So instead of asking google or wikipedia, you ask a natural
       | language tokenizer to break your query into several possible
       | queries, then feed that to an LLM in order to get an essay that
       | might answer your question.
       | 
       | Do I have that basically correct?
       | 
       | edit, 43 minutes later: the first three responders say yes. So,
       | it's a way of increasing the verbosity and reducing the
       | reliability of responses to search queries. Yay! Who would not
       | want such a thing?
       | 
       | (me. And probably you.)
        
         | fnordpiglet wrote:
         | At its most basic perhaps. But the LLM has an enormous semantic
         | corpus embedded in its model that augments the retrieved
         | document. The retrieved document in a way cements the context
         | better to help prevent wandering into hallucinations. So the
         | LLM would indeed be able to summarize the retrieved document,
         | but also synthesize it with other "knowledge" embedded in its
         | model.
         | 
         | But the more important thing is you can interrogate the LLM to
         | ask it the specific questions you have based on what it has
         | said and your goals. Contrast this with information-retrieval-
         | based methods, where you read the article hoping your questions
         | are answered, and when they aren't you are stuck digging
         | through less and less relevant results or refining a search
         | string hoping to find the right incantation that tweaks the
         | index in the right way, sifting through documents that may
         | contain the kernel of information somewhere if it wasn't SEO'ed
         | out of existence. This is a really unnatural way of discovering
         | information - the natural way, say with a teacher, is to be
         | told background, ask questions, and iterate to understanding.
         | This is how chat based LLMs work.
         | 
         | However with RAG you can ground them more concretely, as their
         | model is a massive mishmash of everything that may or may not
         | embed the information sought, but it's also mixed in with
         | everything else it was trained on. You can bring factual
         | information into the context that may not even have been in the
         | training data. However, facts are only a small aspect of
         | knowledge - the overall semantics of the total corpus support
         | the facts in adjacent areas.
        
           | throwaway4aday wrote:
           | You could also introduce a classifier step that takes the
           | result of the query and asks the LLM if the results truly are
           | relevant or not before passing them on to the summarization
           | step. You can even add more steps (with possibly diminishing
           | returns) such as taking the more relevant results and
           | crafting a new query that is a very condensed summary,
           | embedding it and then finding more results that are
           | semantically similar to it.
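           | 
           | A rough sketch of that relevance-check step (assuming a
           | generic llm(prompt) -> str completion helper rather than any
           | particular API):
           | 
           |     def filter_relevant(question, chunks, llm):
           |         # keep only the chunks the LLM judges relevant,
           |         # before they reach the summarization prompt
           |         kept = []
           |         for chunk in chunks:
           |             verdict = llm(
           |                 f"Question: {question}\n\nPassage: {chunk}\n\n"
           |                 "Reply YES if the passage helps answer the "
           |                 "question, otherwise reply NO."
           |             )
           |             if verdict.strip().upper().startswith("YES"):
           |                 kept.append(chunk)
           |         return kept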
        
             | fnordpiglet wrote:
             | Yep. But the idea that a RAG-backed LLM is merely an
             | efficient summarizer misses the real power, which is that it
             | can summarize and then be interrogated iteratively to
             | refine, in a semantic sense, the actual questions you have,
             | or to explore adjacent spaces. It's not just a search engine
             | that can
             | summarize, it's a search engine that you can interrogate in
             | natural language and it responds directly to your
             | questions, as opposed to throwing a bunch of documents at
             | you that have a probability of being related to your query.
        
         | simcop2387 wrote:
         | Not quite, the advantage is that you can give it any documents
         | you want to search even ones that aren't available to google or
         | wikipedia. But I think otherwise that is essentially what is
         | proposed here. The nice part is that since you know which
         | documents got looked up when formulating the answer, you can
         | also provide those as part of the output to the user so they
         | can then go check the source data to confirm what was stated by
         | the LLM.
        
         | lukev wrote:
         | Yes, a bit, though an important feature here is it's still
         | searching the underlying data sources (e.g. Google, Wikipedia,
         | or others) and then using an LLM to summarize the results.
         | 
         | The "natural language tokenizer" itself is often an LLM (they
         | do a pretty good job of this).
         | 
         | A further extension this article doesn't talk about is to have
         | an LLM with a different prompt analyze the answer before
         | returning to the user, and do more queries if it doesn't
         | believe the question has been well answered (imagine clicking
         | "next page" of google search results under the hood).
         | 
         | The potential complexity of this scales all the way up to a
         | full "research assistant" LLM "agent" that calls itself
         | recursively.
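         | 
         | A minimal sketch of that check-and-requery loop (search, answer
         | and critique here stand in for whatever retrieval call and LLM
         | prompts you actually use):
         | 
         |     def rag_with_check(question, search, answer, critique,
         |                        max_rounds=3):
         |         query = question
         |         draft = ""
         |         for _ in range(max_rounds):
         |             docs = search(query)
         |             # an LLM writes a draft answer from the results
         |             draft = answer(question, docs)
         |             # a second prompt judges whether it's well answered
         |             verdict = critique(question, draft)
         |             if verdict["answered"]:
         |                 return draft
         |             # unsatisfied: requery ("next page" under the hood)
         |             query = verdict.get("better_query", query)
         |         return draft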
        
           | pplonski86 wrote:
           | Before learning about RAG I thought it was a recurrent LLM
           | agent that traverses over documents. After some study I must
           | say that VectorDBs are boring.
        
             | throwaway4aday wrote:
             | It can be as simple or as complicated as you want. The
             | article starts off by saying that the naive approach of just
             | embedding the query and looking for similar documents is a
             | bad one, and that what you actually want to embed and
             | compare is something similar to the expected result. They
             | don't go into detail on this but using their example of
             | "what is the capital of France" you would conceivably
             | transform that into "list of European capital cities" or
             | "list of cities in France" using an LLM, embed that, find
             | the similar documents, feed those documents into an LLM
             | along with the query and some system instructions about how
             | to format the response, and then return that. Keep in mind
             | this is an absurdly simplified example query, and none of
             | this process is needed to answer the actual question, which
             | the LLM would know from its training data; but you would
             | want this process in place to ensure accurate results for
             | more complex or specialized queries.
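             | 
             | A rough sketch of that flow, assuming generic llm(), embed()
             | and vector_db.search() helpers (the names here are
             | illustrative, not any specific library):
             | 
             |     def rag_answer(question, llm, embed, vector_db):
             |         # 1. rewrite the question into something phrased
             |         #    more like the documents being searched over
             |         rewrite = llm(
             |             "Rewrite this question as a short passage "
             |             f"likely to contain the answer: {question}"
             |         )
             |         # 2. embed the rewrite and pull nearest neighbours
             |         docs = vector_db.search(embed(rewrite), k=5)
             |         # 3. answer from the retrieved context
             |         context = "\n\n".join(docs)
             |         return llm(
             |             f"Context:\n{context}\n\n"
             |             f"Question: {question}\n"
             |             "Answer using only the context above."
             |         )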
        
               | lukev wrote:
               | > none of this process is needed to answer the actual
               | question which the LLM would know from its training data.
               | 
               | I think this isn't true; even if the model has the answer
               | stored implicitly in its weights, it has no way of
               | "citing it's source" or demonstrating that the answer is
               | correct.
        
               | throwaway4aday wrote:
               | if your model can't predict the completion of "the
               | capital of France is _" then it's going to really suck
               | for other completions
        
               | lukev wrote:
               | This is a great example of something GPT-4 gets
               | confidently wrong, today. I just ran this query:
               | 
               | Prompt: "The year is 894 AD. The capital of France is:
               | Response: "In 894 AD, the capital of France was Paris."
               | 
               | This is incorrect. According to Wikipedia, "In the 10th
               | century Paris was a provincial cathedral city of little
               | political or economic significance..."
               | 
               | The problem is that there's no good way to tell from this
               | interaction whether it's true or false, because the
               | mechanism that GPT-4 uses to return an answer is the same
               | whether it's correct or incorrect.
               | 
               | Unless you already know the answer, the _only_ way to be
               | confident that an LLM is answering correctly is to use RAG
               | to find a citation.
        
       | simonw wrote:
       | "Query-Document Mismatch: This model assumes that query embedding
       | and the content embedding are similar in the embedding space,
       | which is not always true based on the text you're trying to
       | search over."
       | 
       | There are embeddings models that take this into account, which
       | are pretty fascinating.
       | 
       | I've been exploring https://huggingface.co/intfloat/e5-large-v2
       | which lets you calculate two different types of embeddings in the
       | same space. Example from their README:
       | 
       |     passage: As a general guideline, the CDC's average
       |     requirement of protein for women ages 19 to 70 is 46 grams
       |     per day
       | 
       |     query: how much protein should a female eat
       | 
       | You can then build your embedding database out of "passage: "
       | embeddings, then run "query: " embeddings against it to try and
       | find passages that can answer the question.
       | 
       | I've had pretty great initial results trying that out against
       | paragraphs from my blog:
       | https://til.simonwillison.net/llms/embed-paragraphs#user-con...
       | 
       | This won't help address other challenges mentioned in that post,
       | like "what problems did we fix last week?" - but it's still a
       | useful starting point.
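       | 
       | Roughly, with sentence-transformers (just a sketch - the model
       | card also shows a raw transformers + average-pooling recipe):
       | 
       |     from sentence_transformers import SentenceTransformer, util
       | 
       |     model = SentenceTransformer("intfloat/e5-large-v2")
       | 
       |     # index side: every chunk gets the "passage: " prefix
       |     paragraphs = ["your text chunks go here"]
       |     p_emb = model.encode(
       |         ["passage: " + p for p in paragraphs],
       |         normalize_embeddings=True,
       |     )
       | 
       |     # search side: the question gets the "query: " prefix
       |     q_emb = model.encode(
       |         ["query: how much protein should a female eat"],
       |         normalize_embeddings=True,
       |     )
       |     scores = util.cos_sim(q_emb, p_emb)  # rank passages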
        
         | bthomas wrote:
         | Neat! Do you happen to have the analogous similarity queries
         | with a default embedding? Curious to see them side by side.
         | 
         | (I know I can reproduce myself and I appreciate all the code
         | you posted there - thought I'd ask first!)
        
           | simonw wrote:
           | No, I haven't been disciplined enough to have good examples
           | for that yet.
           | 
           | One of my goals right now is to put together a solid RAG
           | system built on top of LLM and Datasette that makes it really
           | easy to compare different embedding models, chunking
           | strategies and prompts to figure out what works best - but
           | that's still just an idea in my head at the moment.
        
         | [deleted]
        
       | eshack94 wrote:
       | https://archive.ph/lEynt
        
       | binarymax wrote:
       | I agree with the premise of the article, but I'm not sure about
       | the proposed solution.
       | 
       | Search relevance tuning is a thing. Learn how to use a search
       | engine and combine multiple features into ranking signals with
       | relevance judgement data.
       | 
       | I recommend the books "Relevant Search" and "AI Powered Search"
       | (I'm a contributing author of the latter).
       | 
       | You'll find that having a well tuned retriever is the backbone
       | for most complex text AI. Learn the best practices from people
       | who have been in the field for years, instead of trying to
       | reinvent the wheel.
        
         | majorbadass wrote:
         | Agree with your sentiment, though the article explicitly
         | mentions precision/recall, suggesting at least some level of
         | tuning. Query understanding via structured attributes is SOTA
         | and used at top companies. Rewriting the query as a method is
         | weird, and yeah I'm not so convinced.
         | 
         | One recurring problem - the hacker ethos doesn't scale with AI
         | products. "Mess around until it works" is ok to prototype. This
         | is effectively using the dev's intuition on the 10 examples
         | they look at as the offline eval function.
         | 
         | But many (most?) new-wave AI products don't have consistent
         | offline metrics they optimize for. I think this quickly stops
         | working when you've absorbed the obvious gains.
        
           | natsucks wrote:
           | Do you know of a good example demonstrating RAG with query
           | understanding via structured attributes?
        
             | ivalm wrote:
             | A bit of a plug but
             | 
             | https://auxhealth.io/try
             | 
             | It does its generation with RAG, using a mix of structured
             | attributes + semantic retrieval.
        
         | sroussey wrote:
         | I went to buy it, but apparently I already have an account, so
         | I did a password reset, and then it wants my previous password
         | to activate the account, and well, I can't buy it.
        
           | binarymax wrote:
           | Hi! Send me an email (it's in my profile) and maybe we can
           | figure it out for you!
        
         | ramoz wrote:
         | It also seems costly to deploy such a robust search backend (e.g.
         | Elastic cluster, vector db, reranking ensemble, LLM for complex
         | parsing... these are not cheap technologies)
        
         | [deleted]
        
         | softwaredoug wrote:
         | I actually wonder why people dump gobs of user input to the
         | vector db, or try to tokenize it into something smart, instead
         | of being smarter and asking for queries to be generated. Such
         | as:
         | 
         | --
         | 
         | Given a Jira issue database, I want to give you additional
         | context to answer a question about a project called FooBar. The
         | Jira project id is FOOBAR. Please generate JQL that you would
         | like to use to answer this question
         | 
         | My question is: what are the major areas of technical debt in
         | project FOOBAR?
         | 
         | --
         | 
         | Given a search engine for the wiki for project foobar, generate
         | queries that help you answer this question:
         | 
         | What's the current status of project foobar?
         | 
         | ---
         | 
         | Or somesuch...
         | 
         | (and hi Max, thanks for plugging our book :-p )
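         | 
         | A rough sketch of that pattern - llm() and run_jql() here are
         | stand-ins for whatever completion helper and Jira client you
         | actually use:
         | 
         |     question = ("What are the major areas of technical "
         |                 "debt in project FOOBAR?")
         | 
         |     # 1. ask the model to write the retrieval query itself
         |     jql = llm(
         |         "Given a Jira project with id FOOBAR, write a JQL "
         |         "query whose results would help answer the question "
         |         f"below. Return only the JQL.\n\n{question}"
         |     )
         | 
         |     # 2. run it, then answer from the returned issues
         |     issues = run_jql(jql)
         |     answer = llm(
         |         f"Question: {question}\n\nIssues:\n{issues}\n\n"
         |         "Answer the question using only these issues."
         |     )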
        
           | binarymax wrote:
           | :waves: Hi Doug! (he's co-author of Relevant Search and
           | contributing author of AI Powered Search too)
           | 
           | That's definitely a thing. But alarms go off in my head when
           | I think about query latency and cost. Can't imagine running
           | 1k qps while sending every single one to GPT or Llama - that's
           | the stuff of production nightmares for me!
           | 
           | If you've got less demand and have a couple queries a second,
           | then maybe it's OK - but you're probably adding a good second
           | on top of your query latency.
        
         | natsucks wrote:
         | So in your opinion what are some examples of highly effective
         | RAG systems/implementations?
        
           | binarymax wrote:
           | Any good search you used before all this LLM stuff started
           | happening is a perfect candidate for RAG. How do you know if
           | a search was good? If you weren't pulling your hair out and
           | actually got decent results for your queries (search is a
           | thankless job like that - everyone expects it to work and
           | complains when it doesn't).
           | 
           | The reason good search is best for RAG is because the prompt
           | is seeded by the top results for the query. The only thing
           | RAG does is summarize things for you and give you answers
           | instead of a list of documents.
           | 
           | And now I gotta confess something: after making RAG systems
           | for clients and having to use them with all the web search
           | engines these days - I kinda miss the list of documents, and
           | find myself just skipping the summary at the top half the
           | time and going back to reading the 10 blue links.
        
             | natsucks wrote:
             | Interesting. Do you think that points to the current
             | limitations of RAG or a mismatch in what a user truly wants
             | from search?
        
               | binarymax wrote:
               | I think it works well when it's not a blob of text. One
               | issue is that most of them are really long-winded. For
               | example, if the answer can be nouns, just give me the
               | list of nouns instead of a full sentence or paragraph.
               | 
               | Take for example this search:
               | https://search.brave.com/search?q=what+are+the+captain+ameri...
               | 
               | Why the paragraph? Just give me a bulleted list! It's
               | hard to read and kinda annoying.
               | 
               | Another issue for me is trust. Web search is oft polluted
               | with web spam (this is not new). Mentally, one can see a
               | URL and skip a site that doesn't have strong authority.
               | So now in RAG, I either need to trust the answer, or I
               | need to look at the embedded citation and find the
               | document and then see if it's trustworthy. This adds
               | friction.
               | 
               | This is also not unique to web search. Private search can
               | also have poor relevance - do I know the LLM is being
               | given the best context? Or is it getting bad context and
               | hallucinating? I need to look at the results to be sure
               | anyway.
               | 
               | I think when used in appropriate ways it can be good. But
               | the experience of "summarize these 10 results for me"
               | might not be the best for every query.
        
               | DebtDeflation wrote:
               | You're referring to what in the NLP subfield of Question
               | Answering Systems would be known as a "factoid question".
               | Historically, things like knowledge graphs and RDF triple
               | stores would be used for answering these types of
               | questions. I'm still not sold on the idea that an LLM is
               | the answer to all QA/Chat problems and this is one
               | example.
        
         | darkteflon wrote:
         | To someone not familiar with the space, search seems like an
         | incredibly complex and difficult problem to get right. In your
         | view, is it reasonable for the average developer prepared to
         | read both of those books to expect to come out the other side
         | and construct something ready for production? Thanks!
        
       | cuuupid wrote:
       | I think these have common solutions that don't require building
       | more systems, running more queries and jacking up GPU bills,
       | which is the direction we should be moving in at this point.
       | 
       | e.g. asymmetric embeddings, instruct-based embeddings, and
       | retrieval-rerank all address parts of the problems the author is
       | presenting, all while keeping things generally light on infra.
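       | 
       | The retrieve-then-rerank piece in particular stays light: pull a
       | generous candidate set from the cheap retriever, then rescore
       | with a small cross-encoder. A sketch with sentence-transformers
       | (candidates is assumed to come from your existing retriever):
       | 
       |     from sentence_transformers import CrossEncoder
       | 
       |     reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
       |     query = "what problems did we fix last week?"
       |     # score each (query, doc) pair, then sort by that score
       |     scores = reranker.predict([(query, d) for d in candidates])
       |     reranked = [d for _, d in sorted(
       |         zip(scores, candidates), key=lambda t: t[0], reverse=True
       |     )]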
        
         | jxnlco wrote:
         | > asymmetric embeddings, instruct-based embeddings, and
         | retrieval-rerank
         | 
         | how would you handle a relative time range? what if you're
         | provided a search client that does not support embeddings?
         | 
         | (maybe say, google calendar api)
        
         | darkteflon wrote:
         | Could you elaborate on "asymmetric embeddings"? That's the
         | first time I've heard that term used in this context.
        
           | rolisz wrote:
           | It's embeddings that are generated differently for queries
           | and for documents. The idea is that queries are usually
           | short, while documents are longer, so if you embed them in
           | the same way, the most relevant docs will be far from the
           | query. Instruct from HKU is an example of such asymmetric
           | embeddings
        
             | fudged71 wrote:
             | I've heard the concept of applying a linear transformation
             | to an embedding model output, is that a similar idea?
        
               | rolisz wrote:
               | That's a way of adapting an embedding to a certain
               | task/domain. But I wouldn't call it asymmetric
               | embeddings.
        
       | hexterPOP wrote:
       | Interesting article. I did this using LangChain; it has a
       | MultiQueryRetriever which does the same thing. Check it out:
       | https://python.langchain.com/docs/modules/data_connection/re...
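       | 
       | Basic usage looks something like this (API as of the current
       | 0.0.x releases, so double-check against the docs page above; the
       | Chroma store is just a placeholder for whatever vector store you
       | use):
       | 
       |     from langchain.chat_models import ChatOpenAI
       |     from langchain.embeddings import OpenAIEmbeddings
       |     from langchain.retrievers.multi_query import MultiQueryRetriever
       |     from langchain.vectorstores import Chroma
       | 
       |     vectordb = Chroma(embedding_function=OpenAIEmbeddings())
       |     retriever = MultiQueryRetriever.from_llm(
       |         retriever=vectordb.as_retriever(),
       |         llm=ChatOpenAI(temperature=0),  # generates the variants
       |     )
       |     docs = retriever.get_relevant_documents(
       |         "What are the approaches to task decomposition?"
       |     )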
        
       | natsucks wrote:
       | Author has a good point, but no mention of hybrid search?
       | 
       | ...not that hybrid search solves everything.
        
         | jxnlco wrote:
         | I'd just see that as modeling
         | 
         | {query: str, keywords: List[str]}
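         | 
         | i.e. something like this (Pydantic, in the spirit of the post;
         | the names are just illustrative):
         | 
         |     from typing import List
         |     from pydantic import BaseModel
         | 
         |     class HybridSearch(BaseModel):
         |         query: str           # goes to the embedding index
         |         keywords: List[str]  # goes to the keyword/BM25 index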
        
       | nobodyminus wrote:
       | > Query-Document Mismatch: This model assumes that query
       | embedding and the content embedding are similar in the embedding
       | space, which is not always true based on the text you're trying
       | to search over. Only using queries that are semantically similar
       | to the content is a huge limitation!
       | 
       | It seems like fine tuning for joint embeddings between your
       | queries and content is a far more elegant way to solve this
       | problem.
        
       | tunesmith wrote:
       | The pattern of "one request yields multiple kinds of responses"
       | is challenging. You're basically looking at either having the
       | client ask for the results and get them back, and then send the
       | results to the backend to get back the summary, OR you're setting up
       | some sort of sockets/server-sent-events thing where the frontend
       | request establishes a connection and subscribes, while the
       | backend sends back different sorts of "response events" as they
       | become available.
        
         | jxnlco wrote:
         | why not just asyncio.gather?
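         | 
         | i.e. fan the work out on the backend and return one payload - a
         | sketch, assuming async search() and summarize() helpers:
         | 
         |     import asyncio
         | 
         |     async def handle(question, queries, search, summarize):
         |         # run every generated query against the index at once
         |         results = await asyncio.gather(
         |             *(search(q) for q in queries)
         |         )
         |         docs = [d for r in results for d in r]
         |         summary = await summarize(question, docs)
         |         return {"results": docs, "summary": summary}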
        
       ___________________________________________________________________
       (page generated 2023-09-21 23:01 UTC)