       [HN Gopher] Haystack 1.0 - open-source NLP framework to build NL...
       ___________________________________________________________________
        
       Haystack 1.0 - open-source NLP framework to build NLProc back end
       applications
        
       Author : antti909
       Score  : 81 points
       Date   : 2021-12-09 18:27 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | artembugara wrote:
       | Interesting. Could anyone think of a use case with news data? We
       | index over 1 million news articles daily.
       | 
       | Maybe, there's a way to have something for a specific industry?
       | 
       | https://newscatcherapi.com/
        
         | whalesalad wrote:
         | This is perhaps just an innocent question for the community -
         | but it's simultaneously an ingenious approach to inbound
         | marketing.
        
           | artembugara wrote:
           | That's both and I feel no shame for it.
        
         | antti909 wrote:
         | Whoa, this is cool :) I could think of a ton of marketing (or
         | maybe even 'devrel') applications for it.
        
       | visarga wrote:
       | Tried the demo, but it could not correctly answer any of my
       | questions; maybe it didn't have the answers in its index.
        
         | tholor wrote:
         | The demo corpus there just contains documents about countries
         | and capital cities. So you could try asking questions like
         | "What's the climate of Beijing?" or "How many people live in
         | the capital of the US?".
        
           | timomo wrote:
           | btw the demo can be found at
           | https://haystack-demo.deepset.ai/
        
           | anentropic wrote:
           | Haystack looks great, but the demo maybe highlights some
           | difficulties with this kind of task.
           | 
           | "What is the population of Italy?" ...gives the population of
           | Rome as first answer at 78.32 relevance :)
           | 
           | I get similar result for some other countries.
           | 
           | "What is the population of Cambridge?" ...to be fair, this is
           | an ambiguous place name as there are several around the
           | world. However the answer it gives is quite far removed from
           | any of them: "In 1788, Kingston had a population of 25,000",
           | Relevance: 93.14
        
             | antti909 wrote:
             | Yep, that's definitely a challenge with commonly
             | available models. In real-life product development
             | there's most often an important step of evaluating the
             | model(s) and fine-tuning if necessary.
        
             | antti909 wrote:
             | Re "Kingston" - interesting! :) Probably, because of
             | "Cambridgeshire"?
        
             | szanz wrote:
             | (Disclaimer: I'm a Haystack maintainer and I helped
             | create this demo)
             | 
             | I had to try out the questions you asked, because your
             | first seems totally answerable to me. And indeed I do get
             | the right answer in the first position (60 million). Did
             | you ask exactly the same question you posted?
             | 
             | For the second, unfortunately we included only country
             | pages and capital city pages, so it's likely that the
             | information about the population of Cambridge simply wasn't
             | there.
             | 
             | In general though I agree this task is not perfect for a
             | demo. It's hard to tell whether the model is wrong because
             | it doesn't have enough info, or whether it does have the
             | data but couldn't find it. The best way to evaluate it will
             | always be to try it out on your own data :)
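
The distinction drawn above (the answer is missing from the corpus, versus it is present but was not found) can be checked mechanically when a gold answer is known. The following is only a minimal sketch under stated assumptions: `diagnose` and its labels are hypothetical names, not part of Haystack, and plain substring matching stands in for a real answer-matching metric.

```python
def diagnose(gold_answer, corpus, retrieved, predicted):
    """Classify the outcome of one question (hypothetical helper, not Haystack API).

    gold_answer: the expected answer string
    corpus:      every document text in the index
    retrieved:   the document texts the retriever passed to the reader
    predicted:   the answer string the system returned, or None
    """
    gold = gold_answer.lower()

    # The system's answer contains the gold answer: nothing to diagnose.
    if predicted is not None and gold in predicted.lower():
        return "correct"

    # The answer does not appear anywhere in the indexed documents.
    if not any(gold in doc.lower() for doc in corpus):
        return "answer_not_in_corpus"

    # The answer is in the index, but not in what the retriever returned.
    if not any(gold in doc.lower() for doc in retrieved):
        return "retriever_missed"

    # The reader saw a document containing the answer and still missed it.
    return "reader_missed"
```

Run over a small labeled question set on your own data, the counts per label give a rough idea of whether to add documents, tune the retriever, or fine-tune the reader.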
        
               | timomo wrote:
               | What is the population of Cambridge?
               | 
               | for me the demo returns that the model did not find an
               | answer...
        
       | JPKab wrote:
       | I've been looking at this for a while now.
       | 
       | My company has a need to accurately identify, with high
       | precision, records in elasticsearch, but with a bit more of a
       | semantic match that existing elasticsearch plugins don't support.
       | Ideally the best of huggingface on top of elasticsearch.
       | 
       | Has anyone on here tried this out? Curious what your experiences
       | are.
        
         | tholor wrote:
         | Semantic document search is one of the core use cases we see in
         | the community (besides Question Answering) and Haystack was
         | pretty much started because we saw that you need much more than
         | just models. It's so much pain to integrate models properly
         | with document storage (e.g. elasticsearch), route requests
         | effectively in larger pipelines or track user feedback in
         | production. Have you tried using DPR or sentence transformers
         | for your case?
         | 
         | Disclaimer: I am one of the maintainers of Haystack:)
        
         | antti909 wrote:
         | Happy to help too - feel free to ask in the community channels
         | as well :)
        
         | matheist wrote:
         | I had a client who was using sentence transformers with
         | elasticsearch already. My colleague suggested switching to
         | haystack to enable a larger number of model architectures.
         | Switching over to haystack was pretty straightforward because
         | we just used it as a wrapper around sentence transformers, but
         | I do remember some inconvenience around all the other
         | dependencies that haystack pulled in.
         | 
         | Haystack does a lot more besides just wrapping sentence
         | transformers, and we weren't using the rest of it, so it was
         | just a lot of extra dependencies sitting around taking up disk
         | space and memory (I think we had to go up to a larger instance
         | size). I remember feeling a bit frustrated that the
         | dependencies weren't split up into "core" and "optional" in a
         | more fine-grained way, but maybe most users don't mind and so
         | it doesn't make sense for them to prioritize that?
         | 
         | [edit: looks like there's an open issue related to this:
         | https://github.com/deepset-ai/haystack/issues/1070]
         | 
         | [edit 2: 'JPKab happy to share more about using huggingface and
         | elasticsearch. email is in my profile]
        
           | antti909 wrote:
           | Noted, we've been discussing dependencies internally indeed
           | :) Thanks for the highlight above!!
        
         | eriklarsonr wrote:
         | Yeah I'm using their FAISS document store and QA pipeline to
         | run semantic search over a set of YouTube transcripts. Was
         | easier to set up than Jina AI in my specific use case, and the
         | search results are actually useful. The only real constraint
         | for me is GPU access: creating the embeddings to store in the
         | FAISS index sans a GPU takes an unreasonable amount of time.
        
           | antti909 wrote:
           | Glad it worked for you - thanks for sharing!
        
       | dexter89_kp3 wrote:
       | Interesting. My current side project is exploring semantic search
       | using sentence transformers. Will definitely check this out.
        
         | antti909 wrote:
         | Happy to help! - we've got Slack and everything :)
        
       | dragosbulugean wrote:
       | does it work on a typical ES index, or do you have to re-index in
       | a certain way for it to work?
        
       | sorenbs wrote:
       | This is really cool!
       | 
       | Can Haystack be used to index structured data, or just text?
       | 
       | Is it required to use elastic as the backend, or can you use a
       | simpler file-based or in-memory backend?
        
         | antti909 wrote:
         | Re structured data - in theory, yes :) We have to work a bit
         | more in that direction. Here's the first step - querying table
         | data, which could be really helpful for reports, financial
         | data, etc. In regards to the storage backend - it's currently
         | Elasticsearch, OpenSearch, SQL+FAISS/Milvus/Weaviate (when
         | using dense vectors/dense passage retrieval). There is also an
         | in-memory datastore using python primitives for fast
         | prototyping.
         | 
         | (Also, the latest feature highlights are here:
         | https://www.deepset.ai/blog/new-features-in-haystack-v1.0)
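
To illustrate what an in-memory store built on Python primitives might look like for prototyping, here is a rough sketch. It is not Haystack's implementation or API: `InMemoryStore`, `embed`, and `cosine` are hypothetical names, and a toy bag-of-words vector stands in for the dense embeddings a real retriever would produce.

```python
import math
import re
from collections import Counter

# Tiny stopword list so common words do not dominate the toy similarity.
STOPWORDS = {"a", "an", "the", "is", "of", "on", "in", "what"}


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a dense model)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryStore:
    """Keeps documents and their vectors in a plain Python list."""

    def __init__(self):
        self.docs = []  # list of (text, vector) pairs

    def write(self, texts):
        for text in texts:
            self.docs.append((text, embed(text)))

    def query(self, question, top_k=3):
        """Return up to top_k (score, text) pairs, best match first."""
        q = embed(question)
        scored = [(cosine(q, vec), text) for text, vec in self.docs]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


store = InMemoryStore()
store.write([
    "Notes on the climate of Beijing.",
    "Notes on the capital of France.",
])
results = store.query("What is the climate of Beijing?", top_k=1)
```

A linear scan like this is fine for a few thousand documents; the FAISS, Milvus, and Weaviate backends mentioned above exist because it does not scale beyond that.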
        
       | Der_Einzige wrote:
       | Is there any path forward to make Haystack do word-level
       | extractive summarization? e.g. like this:
       | https://github.com/Hellisotherpeople/CX_DB8
       | 
       | or like this:
       | https://huggingface.co/spaces/Hellisotherpeople/Unsupervised...
       | 
       | I am trying to find anything better than these two for this task.
       | I feel like Haystack could be an option - but I am not sure.
        
       ___________________________________________________________________
       (page generated 2021-12-09 23:00 UTC)