[HN Gopher] Oracle of Zotero: LLM QA of Your Research Library
___________________________________________________________________
 
  Oracle of Zotero: LLM QA of Your Research Library
 
  Author : SubiculumCode
  Score  : 31 points
  Date   : 2023-11-26 18:13 UTC (4 hours ago)
 
  (HTM) web link (github.com)
  (TXT) w3m dump (github.com)
 
| dmezzetti wrote:
| Nice project!
|
| I've spent quite a lot of time in the medical/scientific
| literature space. With regard to LLMs, specifically RAG, how the
| data is chunked is quite important. With that in mind, I have a
| couple of projects that might be beneficial additions.
|
| paperetl (https://github.com/neuml/paperetl) - supports parsing
| arXiv and PubMed, and integrates with GROBID to parse metadata
| and text from arbitrary papers.
|
| paperai (https://github.com/neuml/paperai) - builds embeddings
| databases of medical/scientific papers. Supports LLM prompting,
| semantic workflows and vector search. Built with txtai
| (https://github.com/neuml/txtai).
|
| While arbitrary chunking/splitting can work, I've found that
| parsing which understands the structure of medical/scientific
| papers improves the accuracy and overall experience of
| downstream applications.
  | panabee wrote:
  | these are awesome projects. thanks for sharing.
  |
  | it would accelerate research so much if LLM accuracy on
  | biomedical papers increased.
  |
  | very much agreed on the potential to extract signal from
  | paper structure.
  |
  | two questions if you don't mind:
  |
  | 1. did you post a summary of your chunking analysis
  | somewhere? i'm curious which method maximized accuracy, and
  | which sentence-overlap methods were most effective.
  |
  | 2. do you think general tokenization methods limit LLMs on
  | scientific/biomedical papers?
    | dmezzetti wrote:
    | Appreciate it!
    |
    | > 1. did you post a summary of your chunking analysis
    | somewhere? i'm curious which method maximized accuracy,
    | and which sentence-overlap methods were most effective.
    |
    | Good idea, but nothing posted yet. In general, grouping by
    | the sections of a paper has worked best (e.g. methods,
    | results, conclusions). GROBID is helpful with arbitrary
    | papers.
    |
    | > 2. do you think general tokenization methods limit LLMs
    | on scientific/biomedical papers?
    |
    | Possibly. For vectorization, specifically with medical
    | text, I have this model
    | (https://huggingface.co/NeuML/pubmedbert-base-embeddings),
    | a fine-tuned sentence embeddings model built on this base
    | model (https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-
    | base-u...). The base model has a custom vocabulary.
    |
    | In terms of LLMs, I've found that this model
    | (https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca)
    | works well, but I haven't experimented with domain-specific
    | LLMs.
___________________________________________________________________
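The section-based chunking dmezzetti describes can be sketched in a
few lines of Python. This is an illustrative sketch, not paperetl's
actual API: the chunk_by_section helper and the (section, text)
input format are hypothetical, and in practice the sections would
first be extracted from GROBID's TEI XML output.

    # Illustrative sketch of section-aware chunking. Assumes sections
    # have already been extracted (e.g. from GROBID output); the input
    # format here is hypothetical.
    from typing import Iterable

    def chunk_by_section(sections: Iterable[tuple[str, str]],
                         max_chars: int = 2000) -> list[tuple[str, str]]:
        """Keep paragraphs grouped by paper section (methods, results,
        ...) and split any section that exceeds max_chars."""
        chunks = []
        for name, text in sections:
            current = ""
            for para in text.split("\n\n"):
                if current and len(current) + len(para) > max_chars:
                    chunks.append((name, current.strip()))
                    current = ""
                current += para + "\n\n"
            if current.strip():
                chunks.append((name, current.strip()))
        return chunks

    sections = [
        ("methods", "Papers were parsed with GROBID...\n\nSections..."),
        ("results", "Section-aware chunking improved accuracy...")
    ]
    for section, chunk in chunk_by_section(sections):
        print(section, "->", chunk[:60])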
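paperai is built on txtai, whose Embeddings API covers the indexing
and vector search described in the thread. A minimal sketch, using
the pubmedbert-base-embeddings model dmezzetti links; the chunk text
is illustrative:

    from txtai.embeddings import Embeddings

    # Illustrative section-level chunks; real input would come from a
    # parser such as paperetl/GROBID
    data = [
        "Methods: papers were parsed with GROBID and chunked by section",
        "Results: section-aware chunking improved retrieval accuracy",
        "Conclusions: structure-aware parsing helps downstream RAG"
    ]

    # Build a vector index with a sentence embeddings model
    embeddings = Embeddings({"path": "NeuML/pubmedbert-base-embeddings"})
    embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

    # search returns (id, score) tuples, best match first
    for uid, score in embeddings.search("how were the papers chunked?", 2):
        print(f"{score:.3f}", data[uid])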
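The pubmedbert-base-embeddings model is a standard
sentence-transformers model, so it can also be used directly for the
vectorization dmezzetti mentions; the example sentences here are
illustrative:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("NeuML/pubmedbert-base-embeddings")

    # Encode two biomedical sentences and compare them
    vectors = model.encode([
        "The hippocampus supports episodic memory consolidation",
        "Memory consolidation depends on hippocampal replay"
    ])
    print(util.cos_sim(vectors[0], vectors[1]))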
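Pairing the retrieval step with an LLM such as Mistral-7B-OpenOrca
gives the RAG flow the thread is discussing. A hedged sketch using
txtai's LLM pipeline: the single-string prompt is simplified
(Mistral-7B-OpenOrca normally expects a ChatML-style prompt), and
running a 7B model locally needs a suitable GPU or quantization.

    from txtai.pipeline import LLM

    # Load the LLM discussed in the thread
    llm = LLM("Open-Orca/Mistral-7B-OpenOrca")

    # Context would normally come from the vector search step above
    context = "Results: section-aware chunking improved retrieval accuracy"
    prompt = ("Answer the question using only the context below.\n"
              f"Context: {context}\n"
              "Question: Does chunking strategy affect accuracy?")
    print(llm(prompt))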