[HN Gopher] Launch HN: Vellum (YC W23) - Dev Platform for LLM Apps
       ___________________________________________________________________
        
       Launch HN: Vellum (YC W23) - Dev Platform for LLM Apps
        
        Hi HN - Noa, Akash, and Sidd here. We're building Vellum
        (https://www.vellum.ai), a developer platform for building on LLMs
        like OpenAI's GPT-3 and Anthropic's Claude. We provide tools for
        efficient prompt engineering, semantic search, performance
        monitoring, and fine-tuning, helping you bring LLM-powered features
        from prototype to production.

        The MLOps industry has matured rapidly for traditional ML
        (typically open-source models hosted in-house), but companies using
        LLMs are suffering from a lack of tooling to support things like
        experimentation, version control, and monitoring. They're forced to
        build these tools themselves, taking valuable engineering time away
        from their core product.

        There are 4 main pain points:

        (1) Prompt engineering is tedious and time-consuming. People
        iterate on prompts in the playgrounds of individual model providers
        and store results in spreadsheets or documents. Testing across many
        test cases is usually skipped because of the manual nature of
        prompt engineering.

        (2) LLM calls against a corpus of text are not possible without
        semantic search. Due to limited context windows, any time an LLM
        has to return factual data from a set of documents, companies need
        to create embeddings, store them in a vector database, and host
        semantic search models to query for relevant results at runtime;
        building this infrastructure is complex and time-consuming.

        (3) There is limited observability / monitoring once LLMs are used
        in production. With no baseline for how something is performing,
        it's scary to make changes for fear of making it worse.

        (4) Creating fine-tuned models and re-training them as new data
        becomes available is rarely done despite the potential gains
        (higher quality, lower cost, lower latency, more defensibility).
        Companies don't usually have the capacity to build the
        infrastructure for collecting high-quality training data and the
        automation pipelines used to re-train and evaluate new models.

        We know these pain points from experience. Sidd and Noa are
        engineers who worked at Quora and DataRobot building ML tooling.
        Then the three of us worked together for a couple of years at Dover
        (YC S19), where we built features powered by GPT-3 when it was
        still in beta. Our first production feature was a job description
        writer, followed by a personalized recruiting email generator and
        then a classifier for email responses.

        We found it was easy enough to prototype, but taking features to
        production and improving them was a different story. It was a pain
        to keep track of what prompts we had tried and to monitor how they
        were performing under real user inputs. We wished we could version
        control our prompts, roll back, and even A/B test. We found
        ourselves investing in infrastructure that had nothing to do with
        our core features (e.g. semantic search). We ended up being scared
        to change prompts or try different models for fear of breaking
        existing behavior. As new LLM providers and foundation models were
        released, we wished we could compare them and use the best tool for
        the job, but didn't have the time to evaluate them ourselves. And
        so on.

        It's clear that better tools are required for businesses to adopt
        LLMs at scale, and we realized we were in a good position to build
        them, so here we are! Vellum consists of 4 systems to address the
        pain points mentioned above:

        (1) Playground--a UI for iterating on prompts side-by-side and
        validating them against multiple test cases at once. Prompt
        variants may differ in their text, underlying model, model
        parameters (e.g. "temperature"), and even LLM provider. Each run is
        saved as a history item and has a permanent URL that can be shared
        with teammates.

        (2) Search--upload a corpus of text (e.g. your company help docs)
        in our UI (PDF/TXT) and Vellum will convert the text to embeddings
        and store it in a vector database to be used at runtime. While
        making an LLM call, we inject relevant context from your documents
        into the query and instruct the LLM to only answer factually using
        the provided context. This helps prevent hallucination and avoids
        you having to manage your own embeddings, vector store, and
        semantic search infra.
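
        To make that retrieval flow concrete, here is a minimal,
        illustrative sketch of the pattern Search implements for you. The
        tiny in-memory "vector store", the example chunks, and the prompt
        template are stand-ins rather than Vellum's API, and it assumes
        the openai Python client:

            # Illustrative retrieval-augmented generation, not Vellum's API.
            import numpy as np
            import openai

            EMBED_MODEL = "text-embedding-ada-002"

            def embed(texts):
                # One embedding vector per input string.
                resp = openai.Embedding.create(model=EMBED_MODEL, input=texts)
                return [np.array(d["embedding"]) for d in resp["data"]]

            def cosine(a, b):
                return float(np.dot(a, b) /
                             (np.linalg.norm(a) * np.linalg.norm(b)))

            # 1) Index time: chunk your docs and store their embeddings.
            chunks = ["Refunds are issued within 5 business days.",
                      "Support is available Monday through Friday."]
            index = list(zip(chunks, embed(chunks)))

            # 2) Query time: find the chunks most relevant to the question.
            def top_k(question, k=3):
                q = embed([question])[0]
                ranked = sorted(index, key=lambda it: cosine(q, it[1]),
                                reverse=True)
                return [chunk for chunk, _ in ranked[:k]]

            # 3) Inject the retrieved context and instruct the model to
            #    answer only from it.
            def answer(question):
                context = "\n".join(top_k(question))
                prompt = ("Answer using ONLY the context below. If the "
                          "answer isn't there, say you don't know.\n\n"
                          f"Context:\n{context}\n\nQuestion: {question}\n"
                          "Answer:")
                resp = openai.Completion.create(model="text-davinci-003",
                                                prompt=prompt, max_tokens=200)
                return resp["choices"][0]["text"].strip()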

        (3) Manage--a low-latency, high-reliability API wrapper that's
        provider-agnostic across OpenAI, Cohere, and Anthropic (with more
        coming soon). Every request is captured and persisted in one place,
        providing full observability into what you're sending these models,
        what they're giving back, and their performance. Prompts and model
        providers can be updated without code changes. You can replay
        historical requests, and version history is maintained. This serves
        as a data layer for metrics, monitoring, and soon, alerting.
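
        To give a feel for what "provider-agnostic" plus request capture
        looks like in practice, here is a hedged sketch of that kind of
        thin abstraction. The class and function names are hypothetical
        placeholders rather than Vellum's actual API; only the OpenAI call
        is a real SDK call:

            # Illustrative provider-agnostic interface with request logging.
            import json
            import time

            import openai

            class LLMProvider:
                def complete(self, prompt: str, **params) -> str:
                    raise NotImplementedError

            class OpenAIProvider(LLMProvider):
                def complete(self, prompt, **params):
                    resp = openai.Completion.create(prompt=prompt, **params)
                    return resp["choices"][0]["text"]

            def generate(provider, prompt, log_path="requests.jsonl", **params):
                # Persist every request/response pair so it can be
                # monitored, replayed, and later mined for training data.
                start = time.time()
                output = provider.complete(prompt, **params)
                with open(log_path, "a") as f:
                    f.write(json.dumps({
                        "prompt": prompt,
                        "output": output,
                        "params": params,
                        "latency_s": round(time.time() - start, 3),
                    }) + "\n")
                return output

            # Swapping providers (or prompt templates) becomes a config
            # change rather than a code change.
            text = generate(OpenAIProvider(), "Summarize: ...",
                            model="text-davinci-003", max_tokens=128)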

        (4) Optimize--the data collected in Manage is used to passively
        build up training data, which can be used to fine-tune your own
        proprietary models. With enough high-quality input/output pairs
        (minimum 100, but it depends on the use case), Vellum can produce
        fine-tuned models that provide better quality, lower cost, or lower
        latency. If a new model solves a problem better, it can be swapped
        in without code changes.

        We also offer periodic evaluation against alternative models (i.e.
        we can see whether fine-tuning Curie produces results of comparable
        quality to Davinci, but at a lower price). Even though OpenAI is
        the dominant model provider today, we expect there to be many
        providers with strong foundation models, and in that case model
        interoperability will be key!

        Here's a video demo showcasing Vellum (feel free to watch on
        1.5x!):
        https://www.loom.com/share/5dbdb8ae87bb4a419ade05d92993e5a0
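
        To make the Optimize loop concrete, here is a minimal,
        illustrative sketch of turning logged input/output pairs into the
        JSONL prompt/completion format that OpenAI's fine-tuning endpoint
        expects. The file names and the "approved" filter are hypothetical
        rather than Vellum's actual pipeline:

            # Illustrative sketch: build a fine-tuning dataset from request
            # logs. File names and the filtering rule are made up.
            import json

            SEPARATOR = "\n\n###\n\n"  # marks the end of each prompt
            STOP = " END"              # marks the end of each completion

            with open("requests.jsonl") as src, \
                 open("finetune.jsonl", "w") as dst:
                for line in src:
                    record = json.loads(line)
                    # Keep only examples marked as good after human review.
                    if not record.get("approved"):
                        continue
                    dst.write(json.dumps({
                        "prompt": record["prompt"] + SEPARATOR,
                        "completion": " " + record["output"].strip() + STOP,
                    }) + "\n")

            # The resulting file can then be passed to a fine-tuning job,
            # e.g. with OpenAI's CLI:
            #   openai api fine_tunes.create -t finetune.jsonl -m curie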

        We currently charge a flat monthly platform fee that varies based
        on the quantity and complexity of your use cases. In the future, we
        plan on having more transparent pricing that's made up of a fixed
        platform fee + some usage-based component (e.g. number of tokens
        used or requests made).

        If you look at our website you'll notice the dreaded "Request early
        access" rather than "Try now". That's because the LLM Ops space is
        evolving extremely quickly right now. To maximize our learning
        rate, we need to work intensively with a few early customers to
        help get their AI use cases into production. We'll invite
        self-serve signups once that core feature set has stabilized a bit
        more. In the meantime, if you're interested in being one of our
        early customers, we'd love to hear from you and you can request
        early access here: https://www.vellum.ai/landing-pages/hacker-news

        We deeply value the expertise of the HN community! We'd love to
        hear your comments and get your perspective on our overall
        direction, the problems we're aiming to solve, our solution so far,
        and anything we may be missing. We hope this post and our demo
        video provide enough material to start a good conversation, and we
        look forward to your thoughts, questions, and feedback!
        
       Author : noaflaherty
       Score  : 82 points
       Date   : 2023-03-06 16:20 UTC (6 hours ago)
        
       | barefeg wrote:
       | Do you have any plans for automated acceptance testing?
        
         | noaflaherty wrote:
          | Great question! We're starting with the manually triggered unit
          | tests in Playground and the back-tests you see in the demo
          | video (run prior to updating existing deployments), but we
          | absolutely envision automated tests as a natural extension once
          | we learn what works well from manually triggered runs.
        
       | ajhai wrote:
       | Congrats on the launch! I'm glad to see all the tooling come up
       | in this space.
       | 
        | Regarding tests, how do you evaluate the generated completions?
        | Allowing users to execute a set of tests against a prompt and
        | showing the completions for visual inspection is a good start,
        | but imho it doesn't scale when the app is in production with a
        | large corpus of tests. Something we are exploring right now is
        | generating a similarity/divergence score between generated
        | completions to make this easy at scale.
       | 
       | Disclosure: We are building something very similar at Promptly
       | (https://trypromptly.com) out of our experience using GPT-3 at
       | MakerDojo
        
         | noaflaherty wrote:
         | Thanks! We totally agree that spot-checking won't scale long
         | term. We're currently testing a feature in beta that allows you
         | to provide an "expected output" and then choose from a variety
         | of comparison metrics (e.g. exact match, semantic similarity,
         | Levenshtein distance, etc.) to derive a quantitative measure of
          | output quality. The jury's still out on whether this is
         | sufficient, but we're excited to continue pushing in this
         | direction.
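          | 
          | For illustration, a stripped-down version of those checks might
          | look something like this (Python, standard library only; the
          | semantic-similarity metric is left out since it needs an
          | embedding model, and the names here are just placeholders):
          | 
          |     import difflib
          | 
          |     def exact_match(expected, actual):
          |         return expected.strip() == actual.strip()
          | 
          |     def similarity(expected, actual):
          |         # 0..1 ratio from difflib; a cheap stand-in for
          |         # edit-distance metrics like normalized Levenshtein.
          |         return difflib.SequenceMatcher(None, expected,
          |                                        actual).ratio()
          | 
          |     def run_suite(cases, generate, threshold=0.8):
          |         # cases: list of (input, expected_output) pairs;
          |         # generate: function mapping an input to a completion.
          |         return [(inp, similarity(exp, generate(inp)) >= threshold)
          |                 for inp, exp in cases]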
         | 
         | p.s. it's cool to hear from another company that's helping
         | expand this market!
        
           | ajhai wrote:
           | Letting users pick a comparison metric of their choice is a
           | good option till something better comes along. Good luck with
           | Vellum!
        
       | lukasb wrote:
       | What if I'm building a service that leverages LLMs for my
       | customers? Would I be able to use an API to upload my customers'
       | data and have embeddings created for that? Or is this not a use
       | case you're building for?
        
         | noaflaherty wrote:
         | Hi yes, that's the idea! The example shown in the demo video
         | uses internal help docs as the "source of knowledge" for
         | embeddings, but the same principles apply to customer data.
        
           | lukasb wrote:
           | Great! Would I be able to provide customers any guarantees
           | about the privacy of their data? Could you create embeddings
           | based on data encrypted homomorphically?
        
             | noaflaherty wrote:
             | We'd love to learn more about what types of guarantees your
             | customers expect - it's likely we can provide many of them
             | now and will inevitably offer even more down the line. Feel
             | free to reach out directly to noa@vellum.ai if you'd like
             | to discuss!
             | 
             | Vellum currently embeds any text you send it, but to be
             | honest, we haven't experimented with performing semantic
             | search across homomorphically encrypted text and can't
             | speak to its performance. If this becomes a recurring theme
             | from our customers, we'd be excited to dig into it deeper!
        
               | lukasb wrote:
               | Yeah I understand that operating on opaque data might not
               | be one of the first items on your roadmap. Thanks for the
               | quick responses.
        
       | jacky2wong wrote:
       | Congrats on launching!
       | 
        | I personally think the target audience for this is a little hard
        | to find when compared to products like Langchain that do
        | something similar already (I wouldn't be surprised if you guys
        | built on top of it).
        | 
        | As a developer, I wouldn't have much difficulty spinning up a
        | Colab instance and getting Langchain up and running (takes a few
        | minutes), compared to a solution like yours. It would be awesome
        | to get a pros/cons table of your solution compared to Langchain
        | so developers can best figure out how to dedicate their time
        | without having to try both tools.
        
         | noaflaherty wrote:
         | Appreciate the feedback! A comparison table is a great idea and
         | something we'll look into.
         | 
          | We fully anticipate having tighter integrations with Langchain
          | in the near future. We view them as complementary frameworks in
          | many ways. For example, we might subclass the `BaseLLM` class
          | so that you can interact with Vellum deployments and get all
          | the monitoring/observability that Vellum provides, but invoke
          | them via your Langchain chain.
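          | 
          | As a rough, hypothetical sketch of what that could look like
          | (the deployment endpoint and client call below are made up, and
          | LangChain's custom-LLM interface may change):
          | 
          |     from typing import List, Optional
          | 
          |     import requests
          |     from langchain.llms.base import LLM
          | 
          |     class VellumDeploymentLLM(LLM):
          |         # Hypothetical: a deployed prompt's endpoint + API key.
          |         deployment_url: str
          |         api_key: str
          | 
          |         @property
          |         def _llm_type(self) -> str:
          |             return "vellum-deployment"
          | 
          |         def _call(self, prompt: str,
          |                   stop: Optional[List[str]] = None) -> str:
          |             resp = requests.post(
          |                 self.deployment_url,
          |                 headers={"X-API-Key": self.api_key},
          |                 json={"input": prompt, "stop": stop})
          |             resp.raise_for_status()
          |             return resp.json()["text"]
          | 
          | A chain could then use such a class anywhere it would use any
          | other LangChain LLM, while requests still flow through the
          | managed deployment for monitoring.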
        
       | instagary wrote:
       | Congrats on the launch! I'm building an app that interfaces with
       | OpenAI GPT models and they recently released an API to upload and
       | create text embeddings.
       | 
        | I watched most of your Loom and was left wondering why I wouldn't
        | use them directly vs you.
        
         | noaflaherty wrote:
         | Thank you and good question! If you're comfortable with the
         | quality of OpenAI's embeddings, performing your own chunking,
         | rolling your own integration with a vector db, and don't need
         | Vellum's other features that surround the usage of those
         | embeddings, then Vellum is probably not a good fit. Vellum's
         | Search offering is most valuable to companies that want to be
         | able to experiment with different embedding models, don't want
         | to manage their own semantic search infra, and want a tight
         | integration with how those embeddings are used downstream.
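          | 
          | (For anyone unfamiliar, "chunking" here just means splitting
          | long documents into overlapping windows before embedding them,
          | roughly like the illustrative snippet below, which is not how
          | Vellum does it internally.)
          | 
          |     def chunk(text, size=1000, overlap=200):
          |         # Split a long document into overlapping character
          |         # windows so each piece fits within the embedding
          |         # model's input limit.
          |         step = size - overlap
          |         return [text[i:i + size]
          |                 for i in range(0, max(len(text) - overlap, 1),
          |                                step)]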
        
       | Nowado wrote:
       | Do you provide optimization options for finetuning, RLHF or both?
        
         | noaflaherty wrote:
         | Thanks for the question! Would you mind elaborating on what you
         | mean by "optimization options?" We've helped a number of our
         | customers fine tune models and optimize for increased quality,
         | lower cost, or decreased latency (e.g. fine-tune curie to
         | perform as well as regular davinci, but at a lower cost and
         | latency).
         | 
         | We offer UIs and APIs for "feeding back actuals" and providing
          | indications on the quality of the model's output / what it
         | should have output. This feedback loop is used to then
         | periodically re-train fine-tuned models.
         | 
         | Hopefully this answers your question, but happy to respond with
         | follow-ups if not!
        
           | Nowado wrote:
           | I'm thinking about improving model response quality.
           | 
            | Training of preexisting LLMs, as I'm familiar with it,
            | consists of two aspects/sides/options: fine-tuning the model
            | with additional, domain-specific data (like internal company
            | documentation), and RLHF (like comparing model responses to
            | actual customer service responses) to further improve how
            | well it uses that data and the original resources it already
            | has access to.
           | That's how https://github.com/CarperAI sets up the process,
           | for example.
           | 
           | What you're describing seems closer to the latter, but I'm
           | not entirely sure if you're following the same structure at
           | all.
        
             | siddseethepalli wrote:
             | Hey, Sidd from Vellum here!
             | 
             | Right now we offer traditional fine tuning with
             | prompt/completion pairs but not training a reward model.
              | This works great for a lot of use cases, including
              | classification, extracting structured data, and responding
              | with a very specific tone and style.
             | 
             | For making use of domain specific data we recommend using
             | semantic search to pull in the correct context at runtime
             | instead of trying to fine tune a model on the entire corpus
             | of knowledge.
        
       | IncRnd wrote:
       | Congrats on the Launch!
       | 
        | So that you know, Vellum [1] is the name of an often-used and
        | well-known piece of software for writing books. It's an
        | absolutely fantastic piece of software. Vellum has been around
        | since 2015. [2]
       | 
       | Vellum (the word) is prepared animal skin or membrane, typically
       | used as writing material. [3]
       | 
       | [1] https://vellum.pub/
       | 
       | [2]
       | https://web.archive.org/web/20151112064306/http://vellum.pub...
       | 
       | [3] https://en.wikipedia.org/wiki/Vellum
        
         | noaflaherty wrote:
          | Thank you for flagging! We've come across them in prior
          | searches, but it's interesting to learn how well-known they are.
        
       | bradhilton wrote:
        | Cool, I just requested early access! I have been using OpenAI's
        | APIs for text summarization tasks and have also played around
        | with a few other platforms.
        
         | noaflaherty wrote:
         | I appreciate your interest! We'll be reaching out soon :)
        
       | swyx wrote:
       | congrats on launching! 1) how do you evaluate the opportunity
       | here vs previous players like Humanloop (seems to have pivoted to
       | weak labeling) and Dust.tt (unclear traction)?
       | 
        | and 2) with OpenAI being so far ahead of everyone else
        | (https://crfm.stanford.edu/helm/latest/?group=core_scenarios), I
        | think "model interoperability" is a key assumption that needs to
        | be tested. Nobody's talking about "model interoperability"
        | between dalle, midjourney, or stable diffusion - they each have
        | their strengths, and that's that. Prompts aren't code that can be
        | shipped indiscriminately everywhere; they only exist within the
        | context of the model they are run against.
        
         | noaflaherty wrote:
         | Thank you for the thoughtful questions!
         | 
          | 1) We believe that timing is a critical piece of this
          | opportunity. With the recent media buzz around ChatGPT, we have
          | found that leadership at companies large and small is actively
          | considering how best to make use of LLMs in their business. The
          | problems we've identified emerged as clear patterns across
          | hundreds of calls with companies that are either currently
          | managing LLM-powered features in production or aspiring to. The
          | level of interest was much lower just 6 months ago, has grown
          | quickly, and we expect it to keep growing in the near future.
         | 
         | 2) We agree that with OpenAI's current dominance in the space,
         | being provider-agnostic is not top of mind for most at the
         | moment. We are betting that this will become increasingly
         | important as the space evolves. We are already seeing Google
         | investing hundreds of millions in Anthropic
         | (https://www.bloomberg.com/news/articles/2023-02-03/google-
          | in...), Google working on their own LLMs (e.g. Bard), and
         | Facebook launching their own LLM
         | (https://ai.facebook.com/blog/large-language-model-llama-
         | meta...). We expect this to become an increasingly competitive
         | space and hope to provide companies with the tools needed to
         | effectively evaluate their options.
        
       ___________________________________________________________________
       (page generated 2023-03-06 23:00 UTC)