[HN Gopher] LLM Python library now provides tools for working wi...
___________________________________________________________________

LLM Python library now provides tools for working with embeddings

Author : simonw
Score  : 18 points
Date   : 2023-09-04 20:37 UTC (2 hours ago)

(HTM) web link (simonwillison.net)
(TXT) w3m dump (simonwillison.net)

| haxton wrote:
| Curious to know what value you've seen out of these clusters. In
| my experience k means clustering was very lackluster. Having to
| define the number of clusters was a big pain point too.
|
| You almost certainly want a graph like structure (overlapping
| communities rather than clusters).
|
| But unsupervised clustering was almost entirely ineffective for
| every use case I had :/
  | simonw wrote:
  | I only got the clustering working this morning, so aside from
  | playing around with it a bit I've not had any results that have
  | convinced me it's a tool I should throw at lots of different
  | problems.
  |
  | I mainly like it as another example of the kind of things you
  | can use embeddings for.
  |
  | My implementation is very naive - it's just this:
  |
  |     sklearn.cluster.MiniBatchKMeans(n_clusters=n, n_init="auto")
  |
  | I imagine there are all kinds of improvements that could be
  | made to this kind of thing.
  |
  | I'd love to understand if there's a good way to automatically
  | pick an interesting number of clusters, as opposed to picking a
  | number at the start.
  |
  | https://github.com/simonw/llm-cluster/blob/main/llm_cluster....
    | haxton wrote:
    | Elbow method is a good place to start for finding the number
    | of clusters.
| simonw wrote:
| There's a lot of stuff in this release.
|
| Don't miss the new llm-cluster plugin, which can both calculate
| clusters from embeddings and use another LLM call to generate a
| name for each cluster: https://github.com/simonw/llm-cluster
|
| Example usage:
|
| Fetch all issues, embed them and store the embeddings and content
| in SQLite:
|
|     paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
|       | jq '[.[] | {id: .id, title: .title}]' \
|       | llm embed-multi llm-issues - \
|         --database issues.db \
|         --model sentence-transformers/all-MiniLM-L6-v2 \
|         --store
|
| Group those in 10 clusters and generate a summary for each one
| using a call to GPT-4:
|
|     llm cluster llm-issues --database issues.db 10 --summary --model gpt-4
| quickthrower2 wrote:
| I would change the title to:
|
|     Python Library "llm" now provides tools for working with embeddings
|
| I initially was trying to parse that, thinking "is this an open
| AI thing?". Of course the answer is just a click away, but people
| might miss this if they are interested in Python coding and AI.
  | dang wrote:
  | OK, we've put Python library up there.
    | simonw wrote:
    | Looks like you missed my reply by seconds pointing out that
    | it's not just a Python library, it's also a CLI tool:
    | https://news.ycombinator.com/item?id=37385788
      | quickthrower2 wrote:
      | Aah! Sorry about that both of you. I didn't think dang
      | would see this and simon would update the title and sanity
      | check it.
  | simonw wrote:
  | It's not just a Python library though: it's also a CLI tool.
  |
  | I put a bunch of work into getting it into Homebrew so that
  | people who aren't Python developers can "brew install llm" and
  | start using it.
  |
  | Details on the CLI here:
  | https://llm.datasette.io/en/stable/usage.html and
  | https://llm.datasette.io/en/stable/embeddings/cli.html
| thatcherthorn wrote:
| This is a fantastic library. I plan to use some of the search
| functionality with a system that tries to figure out how to
| manipulate/work with/add features to existing code.
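
Editor's sketch: the thread mentions the elbow method as a starting point for choosing the number of clusters, alongside the MiniBatchKMeans call simonw quotes. The snippet below is one rough way that could look. The helper names `inertia_curve` and `pick_elbow`, the `seed` parameter, and the "furthest below the chord" heuristic are all assumptions made for illustration; none of this is taken from llm-cluster, and other elbow heuristics would be equally valid.

```python
def inertia_curve(vectors, ks, seed=0):
    """Fit MiniBatchKMeans once per candidate k and collect the inertia.

    `vectors` is assumed to be a 2D array-like of embedding vectors.
    scikit-learn is imported here so pick_elbow() works without it.
    """
    from sklearn.cluster import MiniBatchKMeans

    return [
        MiniBatchKMeans(n_clusters=k, n_init="auto", random_state=seed)
        .fit(vectors)
        .inertia_
        for k in ks
    ]


def pick_elbow(ks, inertias):
    """Return the k whose point lies furthest below the straight line
    joining the first and last (k, inertia) points.

    Assumes ks is sorted ascending and inertia broadly decreases with k.
    """
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (k, inertia) pairs")

    k_span = ks[-1] - ks[0]
    inertia_span = inertias[0] - inertias[-1]
    if k_span == 0 or inertia_span == 0:
        return ks[0]  # flat or degenerate curve: no elbow to find

    best_k, best_gap = ks[0], -1.0
    for k, inertia in zip(ks, inertias):
        x = (k - ks[0]) / k_span                    # 0 at first k, 1 at last
        y = (inertia - inertias[-1]) / inertia_span  # 1 at first k, 0 at last
        gap = (1 - x) - y  # how far the point sits below the chord y = 1 - x
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k
```

With embeddings loaded into `vectors`, something like `ks = list(range(2, 21))` followed by `pick_elbow(ks, inertia_curve(vectors, ks))` would give a candidate cluster count to try. It is a heuristic, so the result is worth checking by eye against the clusters it produces.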
___________________________________________________________________ (page generated 2023-09-04 23:00 UTC)