[HN Gopher] RAG is more than just embedding search ___________________________________________________________________ RAG is more than just embedding search Author : jxnlco Score : 104 points Date : 2023-09-21 16:18 UTC (6 hours ago) (HTM) web link (jxnl.github.io) (TXT) w3m dump (jxnl.github.io) | dsr_ wrote: | So instead of asking google or wikipedia, you ask a natural | language tokenizer to break your query into several possible | queries, then feed that to an LLM in order to get an essay that | might answer your question. | | Do I have that basically correct? | | edit, 43 minutes later: the first three responders say yes. So, | it's a way of increasing the verbosity and reducing the | reliability of responses to search queries. Yay! Who would not | want such a thing? | | (me. And probably you.) | fnordpiglet wrote: | At its most basic perhaps. But the LLM has an enormous semantic | corpus embedded in its model that augments the retrieved | document. The retrieved document in a way cements the context | better to help prevent wandering into hallucinations. So the | LLM would indeed be able to summarize the retrieved document, | but also synthesize it with other "knowledge" embedded in its | model. | | But the more important thing is you can interrogate the LLM to | ask it the specific questions you have based on what it has | said and your goals. Contrast this to information-retrieval-based | methods where you read the article hoping your questions | are answered, and when they aren't you are stuck digging | through less and less relevant results or refining a search | string hoping to find the right incantation that tweaks the | index in the right way, sifting through documents that may | contain the kernel of information somewhere if it wasn't SEO'ed | out of existence. This is a really unnatural way of discovering | information - the natural way, say with a teacher, is to be | told background, ask questions, and iterate to understanding.
| This is how chat based LLMs work. | | However with RAG you can ground them more concretely, as their | model is a massive mishmash of everything that may or may not | embed the information sought, but it's also mixed in with | everything else trained. You can bring factual information | into context that may not even have been trained on. However the | facts are a small aspect of knowledge - the overall semantics | in the total corpus supports the facts in adjacent areas. | throwaway4aday wrote: | You could also introduce a classifier step that takes the | result of the query and asks the LLM if the results truly are | relevant or not before passing them on to the summarization | step. You can even add more steps (with possibly diminishing | returns) such as taking the more relevant results and | crafting a new query that is a very condensed summary, | embedding it and then finding more results that are | semantically similar to it. | fnordpiglet wrote: | Yep. But the idea that a RAG backed LLM is merely an | efficient summarizer is missing the real power, which is it | can summarize then be interrogated iteratively to refine in | a semantic sense the actual questions you have, or explore | adjacent spaces. It's not just a search engine that can | summarize, it's a search engine that you can interrogate in | natural language and it responds directly to your | questions, as opposed to throwing a bunch of documents at | you that have a probability of being related to your query. | simcop2387 wrote: | Not quite, the advantage is that you can give it any documents | you want to search even ones that aren't available to google or | wikipedia. But I think otherwise that is essentially what is | proposed here. The nice part is that since you know which | documents got looked up when formulating the answer you can | also provide those as part of the output to the user so they | can then go check the source data to confirm what was stated by | the LLM.
| lukev wrote: | Yes, a bit, though an important feature here is it's still | searching the underlying data sources (e.g. Google, Wikipedia, | or others) and then using an LLM to summarize the results. | | The "natural language tokenizer" itself is often an LLM (they | do a pretty good job of this). | | A further extension this article doesn't talk about is to have | an LLM with a different prompt analyze the answer before | returning to the user, and do more queries if it doesn't | believe the question has been well answered (imagine clicking | "next page" of google search results under the hood). | | The potential complexity of this scales all the way up to a | full "research assistant" LLM "agent" that calls itself | recursively. | pplonski86 wrote: | Before learning about RAG I thought it was a recurrent LLM | agent that traverses over documents. After some study I must | say that VectorDBs are boring. | throwaway4aday wrote: | It can be as simple or as complicated as you want. The | article starts off by saying the naive approach of just | embedding the query and looking for similar documents is a | bad approach and what you actually want to embed and | compare is something similar to the expected result. They | don't go into detail on this but using their example of | "what is the capital of France" you would conceivably | transform that into "list of European capital cities" or | "list of cities in France" using an LLM, embed that, find | the similar documents, feed those documents into an LLM | along with the query and some system instructions about how | to format the response and then return that. Keep in mind | this is an absurdly simplified example query and none of | this process is needed to answer the actual question which | the LLM would know from its training data but you would | want this process in place to ensure accurate results for | more complex or specialized queries.
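The query-transformation idea throwaway4aday describes (rewrite the question into text that resembles the documents you want to match, then embed that) might look something like this sketch. `rewrite_query` stands in for an LLM call, and `embed`/`similarity` are toy bag-of-words stand-ins for a real embedding model:

```python
# Sketch: embed a rewritten query rather than the raw question.
# rewrite_query is a stand-in for an LLM prompt such as
# "Rewrite this question as a phrase resembling a document that
# answers it"; the canned mapping keeps the sketch runnable.

def rewrite_query(question):
    canned = {
        "what is the capital of france": "list of capital cities in europe",
    }
    return canned.get(question.lower(), question)

def embed(text):
    """Toy 'embedding': a set of lowercase tokens (a real system
    would call an embedding model here)."""
    return set(text.lower().split())

def similarity(a, b):
    """Jaccard overlap as a stand-in for cosine similarity."""
    return len(a & b) / max(len(a | b), 1)

DOCS = [
    "list of european capital cities: Paris, Berlin, Madrid ...",
    "recipe for french onion soup",
]

def best_doc(question):
    q = embed(rewrite_query(question))
    return max(DOCS, key=lambda d: similarity(q, embed(d)))
```

With the rewrite in place, the question lands on the "list of capital cities" style document rather than on whatever happens to share surface wording with the raw query.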
| lukev wrote: | > none of this process is needed to answer the actual | question which the LLM would know from its training data. | | I think this isn't true; even if the model has the answer | stored implicitly in its weights, it has no way of | "citing its source" or demonstrating that the answer is | correct. | throwaway4aday wrote: | if your model can't predict the completion of "the | capital of France is _" then it's going to really suck | for other completions | lukev wrote: | This is a great example of something GPT-4 gets | confidently wrong, today. I just ran this query: | | Prompt: "The year is 894 AD. The capital of France is:" | Response: "In 894 AD, the capital of France was Paris." | | This is incorrect. According to Wikipedia, "In the 10th | century Paris was a provincial cathedral city of little | political or economic significance..." | | The problem is that there's no good way to tell from this | interaction whether it's true or false, because the | mechanism that GPT-4 uses to return an answer is the same | whether it's correct or incorrect. | | Unless you already know the answer, the _only_ way to be | confident that an LLM is answering correctly is to use RAG | to find a citation. | simonw wrote: | "Query-Document Mismatch: This model assumes that query embedding | and the content embedding are similar in the embedding space, | which is not always true based on the text you're trying to | search over." | | There are embeddings models that take this into account, which | are pretty fascinating. | | I've been exploring https://huggingface.co/intfloat/e5-large-v2 | which lets you calculate two different types of embeddings in the | same space.
Example from their README: "passage: | As a general guideline, the CDC's average requirement of protein | for women ages 19 to 70 is 46 grams per day" and "query: how | much protein should a female eat" | | You can then build your embedding database out of "passage: " | embeddings, then run "query: " embeddings against it to try and | find passages that can answer the question. | | I've had pretty great initial results trying that out against | paragraphs from my blog: | https://til.simonwillison.net/llms/embed-paragraphs#user-con... | | This won't help address other challenges mentioned in that post, | like "what problems did we fix last week?" - but it's still a | useful starting point. | bthomas wrote: | Neat! Do you happen to have the analogous similarity queries | with a default embedding? Curious to see them side by side. | | (I know I can reproduce myself and I appreciate all the code | you posted there - thought I'd ask first!) | simonw wrote: | No, I haven't been disciplined enough to have good examples | for that yet. | | One of my goals right now is to put together a solid RAG | system based on top of LLM and Datasette that makes it really | easy to compare different embedding models, chunking | strategies and prompts to figure out what works best - but | that's still just an idea in my head at the moment. | [deleted] | eshack94 wrote: | https://archive.ph/lEynt | binarymax wrote: | I agree with the premise of the article, but I'm not sure about | the proposed solution. | | Search relevance tuning is a thing. Learn how to use a search | engine and combine multiple features into ranking signals with | relevance judgement data. | | I recommend the books "Relevant Search" and "AI Powered Search" | (the latter of which I'm a contributing author). | | You'll find that having a well tuned retriever is the backbone | for most complex text AI. Learn the best practices from people | who have been in the field for years, instead of trying to | reinvent the wheel.
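binarymax's point about combining multiple features into ranking signals can be illustrated with a minimal sketch: blend a keyword-overlap score with a recency score. The documents and weights below are invented for illustration; in a real system the weights would be fit against relevance judgement data rather than hand-picked:

```python
# Toy relevance tuning: rank documents by a weighted blend of two
# features. keyword_score is a crude overlap measure standing in
# for something like BM25; recency_score decays with document age.

from datetime import date

DOCS = [
    {"id": "a", "text": "tuning search relevance with ranking signals",
     "updated": date(2023, 9, 1)},
    {"id": "b", "text": "unrelated release notes",
     "updated": date(2023, 9, 20)},
]

def keyword_score(query, text):
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def recency_score(updated, today=date(2023, 9, 21)):
    age_days = (today - updated).days
    return 1.0 / (1.0 + age_days)

def rank(query, docs, w_kw=0.8, w_rec=0.2):
    # w_kw / w_rec are made-up weights; tuning them against
    # relevance judgements is the actual work of the field.
    return sorted(
        docs,
        key=lambda d: w_kw * keyword_score(query, d["text"])
                      + w_rec * recency_score(d["updated"]),
        reverse=True,
    )
```

Even this two-feature version shows the shape of the problem: a newer but off-topic document should not outrank an older on-topic one, and the weights encode that trade-off.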
| majorbadass wrote: | Agree with your sentiment, though the article explicitly | mentions precision/recall, suggesting at least some level of | tuning. Query understanding via structured attributes is SOTA | and used at top companies. Rewriting the query as a method is | weird, and yeah I'm not so convinced. | | One recurring problem - the hacker ethos doesn't scale with AI | products. "Mess around until it works" is ok to prototype. This | is effectively using the dev's intuition on the 10 examples | they look at as the offline eval function. | | But many (most?) new-wave AI products don't have consistent | offline metrics they optimize for. I think this quickly stops | working when you've absorbed the obvious gains. | natsucks wrote: | Do you know of a good example demonstrating RAG with query | understanding via structured attributes? | ivalm wrote: | A bit of a plug but | | https://auxhealth.io/try | | Does its generations with RAG with a mix of structured | attributes + semantic retrieval. | sroussey wrote: | I went to buy it, but apparently I already have an account, so | I did a password reset, and then it wants my previous password | to activate the account, and well, I can't buy it. | binarymax wrote: | Hi! Send me an email (it's in my profile) and maybe we can | figure it out for you! | ramoz wrote: | it also seems costly to deploy such a robust search backend (eg | Elastic cluster, vector db, reranking ensemble, LLM for complex | parsing... these are not cheap technologies) | [deleted] | softwaredoug wrote: | I actually wonder why people dump gobs of user input to the | vector db, or try to tokenize it into something smart, instead | of being smarter and asking for queries to be generated. Such | as: | | -- | | Given a Jira issue database, I want to give you additional | context to answer a question about a project called FooBar. The | Jira project id is FOOBAR.
Please generate JQL that you would | like to use to answer this question | | My question is: what are the major areas of technical debt in | project FOOBAR? | | -- | | Given a search engine for the wiki for project foobar, generate | queries that help you answer this question: | | What's the current status of project foobar? | | --- | | Or somesuch... | | (and hi Max, thanks for plugging our book :-p ) | binarymax wrote: | :waves: Hi Doug! (he's co-author of Relevant Search and | contributing author of AI Powered Search too) | | That's definitely a thing. But alarms go off in my head when | I think about query latency and cost. Can't imagine running | 1k qps while sending every single one to GPT or LLaMA - that's | the stuff of production nightmares for me! | | If you've got less demand and have a couple queries a second, | then maybe it's OK - but you're probably adding a good second | on top of your query latency. | natsucks wrote: | So in your opinion what are some examples of highly effective | RAG systems/implementations? | binarymax wrote: | Any good search you used before all this LLM stuff started | happening is a perfect candidate for RAG. How do you know if | a search was good? If you weren't pulling your hair out and | actually got decent results for your queries (search is a | thankless job like that - everyone expects it to work and | complains when it doesn't). | | The reason good search is best for RAG is because the prompt | is seeded by the top results for the query. The only thing | RAG does is summarize things for you and give you answers | instead of a list of documents. | | And now I gotta confess something, after making RAG systems | for clients and having to use them with all the web search | engines these days - I kinda miss the list of documents, and | find myself just skipping the summary at the top half the | time and going back to reading the 10 blue links. | natsucks wrote: | Interesting.
| Do you think that points to the current | limitations of RAG or a mismatch in what a user truly wants | from search? | binarymax wrote: | I think it works well when it's not a blob of text. One | issue is that most of them are really long-winded. For | example, if the answer can be nouns, just give me the | list of nouns instead of a full sentence or paragraph. | | Take for example this search: | https://search.brave.com/search?q=what+are+the+captain+ameri... | | Why the paragraph? Just give me a bulleted list! It's | hard to read and kinda annoying. | | Another issue for me is trust. Web search is oft polluted | with web spam (this is not new). Mentally, one can see a | URL and skip a site that doesn't have strong authority. | So now in RAG, I either need to trust the answer, or I | need to look at the embedded citation and find the | document and then see if it's trustworthy. This adds | friction. | | This is also not unique to web search. Private search can | also have poor relevance - do I know the LLM is being | given the best context? Or is it getting bad context and | hallucinating? I need to look at the results to be sure | anyway. | | I think when used in appropriate ways it can be good. But | the experience of "summarize these 10 results for me" | might not be the best for every query. | DebtDeflation wrote: | You're referring to what in the NLP subfield of Question | Answering Systems would be known as a "factoid question". | Historically, things like knowledge graphs and RDF triple | stores would be used for answering these types of | questions. I'm still not sold on the idea that an LLM is | the answer to all QA/Chat problems and this is one | example. | darkteflon wrote: | To someone not familiar with the space, search seems like an | incredibly complex and difficult space to get right.
In your | view, is it reasonable for the average developer prepared to | read both of those books to expect to come out the other side | and construct something ready for production? Thanks! | cuuupid wrote: | I think these have common solutions that don't require building | more systems, running more queries and jacking up GPU bills, | which is the direction we should be moving in at this point. | | e.g. asymmetric embeddings, instruct-based embeddings, and | retrieval-rerank all address parts of the problems the author is | presenting, all while keeping things generally light on infra. | jxnlco wrote: | > asymmetric embeddings, instruct-based embeddings, and | retrieval-rerank | | how would you handle a relative time range? what if you're | provided a search client that does not support embeddings | | (maybe say, google calendar api) | darkteflon wrote: | Could you elaborate on "asymmetric embeddings"? That's the | first time I've heard that term used in this context. | rolisz wrote: | It's embeddings that are generated differently for queries | and for documents. The idea is that queries are usually | short, while documents are longer, so if you embed them in | the same way, the most relevant docs will be far from the | query. Instructor from HKU is an example of such asymmetric | embeddings. | fudged71 wrote: | I've heard the concept of applying a linear transformation | to an embedding model output, is that a similar idea? | rolisz wrote: | That's a sort of adapting an embedding to a certain | task/domain. But I wouldn't call it asymmetric | embeddings. | hexterPOP wrote: | Interesting article, I did this thing using langchain, it has a | Multi Query Retriever which does the same thing, check it out | https://python.langchain.com/docs/modules/data_connection/re... | natsucks wrote: | Author has a good point, but no mention of hybrid search? | | ...not that hybrid search solves everything.
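One common way to implement the hybrid search natsucks mentions is reciprocal rank fusion (RRF): run a keyword search and an embedding search separately, then merge the two ranked lists by summed reciprocal ranks. A minimal sketch, where the doc ids and hit lists are invented placeholders for real BM25 and vector-DB results:

```python
# Reciprocal rank fusion: fuse any number of ranked doc-id lists.
# Each list contributes 1 / (k + rank) per document; documents that
# rank well in several lists rise to the top. k=60 is the value
# commonly used in the RRF literature.

def rrf(rankings, k=60):
    """rankings: list of ranked doc-id lists; returns fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]   # e.g. from BM25 / keyword search
vector_hits = ["doc1", "doc9", "doc3"]    # e.g. from a vector DB
fused = rrf([keyword_hits, vector_hits])
```

A nice property of rank-based fusion is that it sidesteps score calibration: BM25 scores and cosine similarities live on incompatible scales, but ranks are always comparable.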
| jxnlco wrote: | I'd just see that as modeling | | {query: str, keywords: List[str]} | nobodyminus wrote: | > Query-Document Mismatch: This model assumes that query | embedding and the content embedding are similar in the embedding | space, which is not always true based on the text you're trying | to search over. Only using queries that are semantically similar | to the content is a huge limitation! | | It seems like fine tuning for joint embeddings between your | queries and content is a far more elegant way to solve this | problem. | tunesmith wrote: | The pattern of "one request yields multiple kinds of responses" | is challenging. You're basically looking at either having the | client ask for the results and get them back, and then send the | results to the backend to get back the summary, OR, you're setting up | some sort of sockets/server-sent-events thing where the frontend | request establishes a connection and subscribes, while the | backend sends back different sorts of "response events" as they | become available. | jxnlco wrote: | why not just asyncio.gather? ___________________________________________________________________ (page generated 2023-09-21 23:01 UTC)