[HN Gopher] Haystack 1.0 - open-source NLP framework to build NL...
___________________________________________________________________
Haystack 1.0 - open-source NLP framework to build NLProc back end applications

Author : antti909
Score : 81 points
Date : 2021-12-09 18:27 UTC (4 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| artembugara wrote:
| Interesting. Could anyone think of a use case with news data? We index over 1 million news articles daily.
|
| Maybe there's a way to have something for a specific industry?
|
| https://newscatcherapi.com/
| whalesalad wrote:
| This is perhaps just an innocent question for the community - but it's simultaneously an ingenious approach to inbound marketing.
| artembugara wrote:
| That's both and I feel no shame for it.
| antti909 wrote:
| Whoa, this is cool :) I could think of a ton of marketing (or maybe even 'devrel') applications for it.
| visarga wrote:
| Tried the demo but it could not answer any of my questions correctly, maybe it didn't have the answers in its index.
| tholor wrote:
| The demo corpus there just contains documents about countries and capital cities. So you could try asking questions like "What's the climate of Beijing?" or "How many people live in the capital of the US?".
| timomo wrote:
| btw the demo can be found at https://haystack-demo.deepset.ai/
| anentropic wrote:
| Haystack looks great, but the demo maybe highlights some difficulties with this kind of task.
|
| "What is the population of Italy?" ...gives the population of Rome as first answer at 78.32 relevance :)
|
| I get similar results for some other countries.
|
| "What is the population of Cambridge?" ...to be fair, this is an ambiguous place name as there are several around the world.
| However the answer it gives is quite far removed from any of them: "In 1788, Kingston had a population of 25,000", Relevance: 93.14
| antti909 wrote:
| Yep, that's definitely a challenge with commonly available models. In real-life product development there's usually an important step of evaluating the model(s) and fine-tuning if necessary.
| antti909 wrote:
| Re "Kingston" - interesting! :) Probably because of "Cambridgeshire"?
| szanz wrote:
| (Disclaimer: I'm a Haystack maintainer and I helped create this demo)
|
| I had to try out the questions you asked, because your first one seems totally answerable to me. And indeed I do get the right answer in the first position (60 million). Did you ask exactly the same question you posted?
|
| For the second, unfortunately we included only country pages and capital city pages, so it's likely that the information about the population of Cambridge simply wasn't there.
|
| In general though I agree this task is not perfect for a demo. It's hard to tell whether the model is wrong because it doesn't have enough info, or whether it does have the data but couldn't find it. The best way to evaluate it will always be to try it out on your own data :)
| timomo wrote:
| What is the population of Cambridge?
|
| for me the demo returns that the model did not find an answer...
| JPKab wrote:
| I've been looking at this for a while now.
|
| My company has a need to accurately identify, with high precision, records in elasticsearch, but with a bit more of a semantic match that existing elasticsearch plugins don't support. Ideally the best of huggingface on top of elasticsearch.
|
| Has anyone on here tried this out? Curious what your experiences are.
| tholor wrote:
| Semantic document search is one of the core use cases we see in the community (besides Question Answering) and Haystack was pretty much started because we saw that you need much more than just models.
| It's such a pain to integrate models properly with document storage (e.g. elasticsearch), route requests effectively in larger pipelines or track user feedback in production. Have you tried using DPR or sentence transformers for your case?
|
| Disclaimer: I am one of the maintainers of Haystack :)
| antti909 wrote:
| Happy to help too - feel free to ask in the community channels as well :)
| matheist wrote:
| I had a client who was using sentence transformers with elasticsearch already. My colleague suggested switching to haystack to enable a larger number of model architectures. Switching over to haystack was pretty straightforward because we just used it as a wrapper around sentence transformers, but I do remember some inconvenience around all the other dependencies that haystack pulled in.
|
| Haystack does a lot more besides just wrapping sentence transformers, and we weren't using the rest of it, so it was just a lot of extra dependencies sitting around taking up disk space and memory (I think we had to go up to a larger instance size). I remember feeling a bit frustrated that the dependencies weren't split up into "core" and "optional" in a more fine-grained way, but maybe most users don't mind and so it doesn't make sense for them to prioritize that?
|
| [edit: looks like there's an open issue related to this: https://github.com/deepset-ai/haystack/issues/1070]
|
| [edit 2: 'JPKab happy to share more about using huggingface and elasticsearch. email is in my profile]
| antti909 wrote:
| Noted, we've been discussing dependencies internally indeed :) Thanks for the highlight above!!
| eriklarsonr wrote:
| Yeah I'm using their FAISS document store and QA pipeline to run semantic search over a set of YouTube transcripts. Was easier to set up than Jina AI in my specific use case, and the search results are actually useful.
| Only real constraint for me is GPU access: creating the embeddings to store in the FAISS index sans a GPU takes an unreasonable amount of time.
| antti909 wrote:
| Glad it worked for you - thanks for sharing!
| dexter89_kp3 wrote:
| Interesting. My current side project is exploring semantic search using sentence transformers. Will definitely check this out.
| antti909 wrote:
| Happy to help! - we've got Slack and everything :)
| dragosbulugean wrote:
| does it work on a typical ES index, or do you have to re-index in a certain way for it to work?
| sorenbs wrote:
| This is really cool!
|
| Can Haystack be used to index structured data, or just text?
|
| Is it required to use elastic as the backend, or can you use a simpler file-based or in-memory backend?
| antti909 wrote:
| Re structured data - in theory, yes :) We have to work a bit more in that direction. Here's the first step - querying table data, which could be really helpful for reports, financial data, etc. In regards to the storage backend - it's currently Elasticsearch, OpenSearch, SQL+FAISS/Milvus/Weaviate (when using dense vectors/dense passage retrieval). There is also an in-memory datastore using Python primitives for fast prototyping.
|
| (Also, latest feature highlights here https://www.deepset.ai/blog/new-features-in-haystack-v1.0)
| Der_Einzige wrote:
| Is there any path forward to make Haystack do word-level extractive summarization? e.g. like this: https://github.com/Hellisotherpeople/CX_DB8
|
| or like this: https://huggingface.co/spaces/Hellisotherpeople/Unsupervised...
|
| I am trying to find anything better than these two for this task. I feel like Haystack could be an option - but I am not sure.
___________________________________________________________________
(page generated 2021-12-09 23:00 UTC)