[HN Gopher] Oracle of Zotero: LLM QA of Your Research Library
       ___________________________________________________________________
        
       Oracle of Zotero: LLM QA of Your Research Library
        
       Author : SubiculumCode
       Score  : 31 points
       Date   : 2023-11-26 18:13 UTC (4 hours ago)
        
       | dmezzetti wrote:
       | Nice project!
       | 
        | I've spent quite a lot of time in the medical/scientific
        | literature space. With regard to LLMs, specifically RAG, how
        | the data is chunked is quite important. With that in mind, I
        | have a couple of projects that might be beneficial additions.
       | 
       | paperetl (https://github.com/neuml/paperetl) - supports parsing
       | arXiv, PubMed and integrates with GROBID to handle parsing
       | metadata and text from arbitrary papers.
       | 
       | paperai (https://github.com/neuml/paperai) - builds embeddings
       | databases of medical/scientific papers. Supports LLM prompting,
       | semantic workflows and vector search. Built with txtai
       | (https://github.com/neuml/txtai).
       | 
        | While arbitrary chunking/splitting can work, I've found that
        | parsing informed by the structure of medical/scientific
        | papers improves both the accuracy and the user experience of
        | downstream applications.
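        | 
        | As a rough illustration, here's a minimal txtai sketch in the
        | spirit of paperai (the section texts and query are made up
        | for illustration; this isn't paperai's exact pipeline):
        | 
        |     from txtai.embeddings import Embeddings
        | 
        |     # Hypothetical section-level chunks from one paper
        |     sections = [
        |         "Methods: chunks were grouped by paper section.",
        |         "Results: section-aware chunks improved accuracy.",
        |         "Conclusions: structure-aware parsing helps RAG.",
        |     ]
        | 
        |     # Build a vector index over the section chunks
        |     embeddings = Embeddings(
        |         {"path": "neuml/pubmedbert-base-embeddings",
        |          "content": True})
        |     embeddings.index(
        |         (i, text, None) for i, text in enumerate(sections))
        | 
        |     # Semantic search returns the best-matching sections
        |     print(embeddings.search("which chunking worked?", 2))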
        
         | panabee wrote:
          | These are awesome projects. Thanks for sharing.
          | 
          | It would accelerate research so much if LLM accuracy
          | improved on biomedical papers.
          | 
          | Very much agreed on the potential to extract signal from
          | paper structure.
          | 
          | Two questions, if you don't mind:
          | 
          | 1. Did you post a summary of your chunking analysis
          | somewhere? I'm curious which method maximized accuracy, and
          | which sentence-overlap methods were most effective.
          | 
          | 2. Do you think general tokenization methods limit LLMs on
          | scientific/biomedical papers?
        
           | dmezzetti wrote:
           | Appreciate it!
           | 
            | > 1. Did you post a summary of your chunking analysis
            | somewhere? I'm curious which method maximized accuracy,
            | and which sentence-overlap methods were most effective.
           | 
            | Good idea, but nothing posted yet. In general, grouping
            | by paper sections (e.g., methods, results, conclusions)
            | has worked best. GROBID is helpful with arbitrary papers.
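            | 
            | As a rough sketch of that grouping, assuming GROBID's TEI
            | XML output (the element names below reflect typical
            | GROBID/TEI structure; treat them as an assumption):
            | 
            |     import xml.etree.ElementTree as ET
            | 
            |     # TEI namespace used by GROBID output
            |     TEI = "{http://www.tei-c.org/ns/1.0}"
            | 
            |     def sections(tei_file):
            |         # Yield (section title, section text) chunks
            |         root = ET.parse(tei_file).getroot()
            |         for div in root.iter(TEI + "div"):
            |             head = div.find(TEI + "head")
            |             title = (head.text
            |                      if head is not None else "")
            |             text = " ".join(
            |                 "".join(p.itertext())
            |                 for p in div.findall(TEI + "p"))
            |             if text:
            |                 yield title, text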
           | 
            | > 2. Do you think general tokenization methods limit
            | LLMs on scientific/biomedical papers?
           | 
            | Possibly. For vectorization, specifically with medical
            | text, I do have this model
            | (https://huggingface.co/NeuML/pubmedbert-base-embeddings),
            | which is a sentence embeddings model fine-tuned from this
            | base model (https://huggingface.co/microsoft/BiomedNLP-
            | BiomedBERT-base-u...). The base model does have a custom
            | vocabulary.
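            | 
            | For reference, a minimal sentence-transformers sketch
            | with that model (the sentences are made up):
            | 
            |     from sentence_transformers import SentenceTransformer
            | 
            |     # Load the fine-tuned medical embeddings model
            |     model = SentenceTransformer(
            |         "neuml/pubmedbert-base-embeddings")
            | 
            |     # Encode sentences into dense vectors
            |     vectors = model.encode([
            |         "Metformin reduced HbA1c in the treatment arm.",
            |         "The control group received a placebo.",
            |     ])
            |     print(vectors.shape)  # e.g. (2, 768)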
           | 
            | In terms of LLMs, I've found that this model
            | (https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca)
            | works well, but I haven't experimented with
            | domain-specific LLMs.
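            | 
            | A minimal RAG-style sketch, assuming txtai's LLM pipeline
            | (txtai 6.x; the prompt format and context are purely
            | illustrative, not a recommended template):
            | 
            |     from txtai.pipeline import LLM
            | 
            |     # Load the instruction-tuned model
            |     llm = LLM("Open-Orca/Mistral-7B-OpenOrca")
            | 
            |     context = "Methods: chunks grouped by section."
            |     question = "How were the chunks grouped?"
            | 
            |     # Answer using only the retrieved context
            |     print(llm(f"Answer based on this context:\n"
            |               f"{context}\nQuestion: {question}"))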
        
       ___________________________________________________________________
       (page generated 2023-11-26 23:00 UTC)