[HN Gopher] Show HN: Langfuse - Open-source observability and an...
       ___________________________________________________________________
        
       Show HN: Langfuse - Open-source observability and analytics for LLM
       apps
        
        Hi HN! Langfuse is OSS observability and analytics for LLM
        applications (repo: https://github.com/langfuse/langfuse,
        2 min demo: https://langfuse.com/video, try it yourself:
        https://langfuse.com/demo).
         
        Langfuse makes capturing and viewing LLM calls (execution
        traces) a breeze. On top of this data, you can analyze the
        quality, cost and latency of LLM apps.
         
        When GPT-4 dropped, we started building LLM apps - a lot of
        them! [1, 2] But they all suffered from the same issue: it's
        hard to assure quality in 100% of cases and even to have a
        clear view of user behavior. Initially, we logged all
        prompts/completions to our production database to understand
        what works and what doesn't. We soon realized we needed more
        context, more data and better analytics to sustainably
        improve our apps. So we started building a homegrown tool.
         
        Our first task was to track and view what is going on in
        production: what user input is provided, how prompt
        templates or vector db requests perform, and which steps of
        an LLM chain fail. We built async SDKs and a slick frontend
        to render chains in a nested way, which is a good way to
        look at LLM logic 'natively'. Then we added some basic
        analytics to understand token usage and quality over time,
        for the entire project or single users (pre-built
        dashboards).
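         
        To make the nesting concrete, here is a minimal sketch of
        the data model behind such traces (illustrative only - not
        our exact SDK surface or schema):
         
          # Sketch: one trace per user interaction; each chain step
          # becomes a nested observation. All names are made up.
          from dataclasses import dataclass, field
          from datetime import datetime, timezone
          from typing import Any
         
          @dataclass
          class Observation:
              name: str            # e.g. "vector-db-lookup"
              input: Any = None
              output: Any = None
              children: list = field(default_factory=list)
              start: datetime = field(
                  default_factory=lambda: datetime.now(timezone.utc))
         
              def child(self, name, input=None):
                  obs = Observation(name=name, input=input)
                  self.children.append(obs)
                  return obs
         
          trace = Observation("qa-request",
                              input={"question": "..."})
          retrieval = trace.child("vector-db-lookup",
                                  input={"query": "..."})
          retrieval.output = {"documents": 3}
          llm = trace.child("llm-call",
                            input={"model": "gpt-4", "prompt": "..."})
          llm.output = {"completion": "...", "total_tokens": 640}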
         
        Under the hood, we use the T3 stack (TypeScript, Next.js,
        Prisma, tRPC, Tailwind, NextAuth), which lets us move fast
        and makes it easy to contribute to our repo. The SDKs are
        heavily influenced by the design of the PostHog SDKs [3] for
        stable implementations of async network requests. Converting
        OpenAPI specs to boilerplate Python code was surprisingly
        inconvenient, so we ended up using Fern [4] for it. We're
        fans of Tailwind + shadcn/ui + tremor.so for the speed and
        flexibility they give us when building tables and
        dashboards.
         
        Our SDKs run fully asynchronously and make network requests
        in the background. We did our best to reduce any impact on
        application performance to a minimum: we never block the
        main execution path.
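         
        As a sketch of the pattern we borrowed (not our actual SDK
        code): the application thread only enqueues, and a daemon
        thread batches events and does the network I/O.
         
          # capture() never blocks; a background consumer thread
          # batches events and ships them over HTTP.
          import queue
          import threading
         
          events: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)
         
          def capture(event: dict) -> None:
              """Hot path: enqueue only; drop if the queue is full."""
              try:
                  events.put_nowait(event)
              except queue.Full:
                  pass  # shedding load beats blocking the app
         
          def send(batch: list) -> None:
              ...  # POST to the ingestion API, retries/backoff
         
          def consumer() -> None:
              while True:
                  batch = [events.get()]   # wait for one event
                  while len(batch) < 100:  # then drain up to a batch
                      try:
                          batch.append(events.get_nowait())
                      except queue.Empty:
                          break
                  send(batch)
         
          threading.Thread(target=consumer, daemon=True).start()
          # A real SDK also flushes the queue on shutdown (atexit).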
         
        We've made two engineering decisions we felt uncertain
        about: using a Postgres database and Looker Studio for the
        analytics MVP. Supabase performs well at our scale and
        integrates seamlessly into our tech stack, but we will need
        to move to an OLAP database soon and are debating whether we
        need to start batching ingestion and whether we can keep
        using Vercel. Any experience you could share would be
        helpful!
         
        Integrating Looker Studio got us to our first analytics
        charts in half a day. As it is not open source and does not
        fit our UI/UX, we are looking to swap it out for an OSS
        solution that lets us flexibly generate charts and
        dashboards. We've had a look at Lightdash and would be happy
        to hear your thoughts.
         
        We're borrowing our OSS business model from PostHog and
        Supabase, which make it easy to self-host, reserve some
        features for enterprise (no plans yet), and offer a paid
        managed cloud service. Right now all of our code is
        available under a permissive license (MIT).
         
        Next, we're going deep on analytics. For quality
        specifically, we will build out model-based evaluations and
        labeling to be able to cluster traces by scores and use
        cases. Looking forward to your thoughts and questions -
        we'll be in the comments. Thanks!
         
        [1] https://learn-from-ai.com/
        [2] https://www.loom.com/share/5c044ca77be44ff7821967834dd70cba
        [3] https://posthog.com/docs/libraries
        [4] https://buildwithfern.com/
        
       Author : marcklingen
       Score  : 105 points
       Date   : 2023-08-29 16:14 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | steventey wrote:
       | > We will need to move to an OLAP database soon and are debating
       | if we need to start batching ingestion
       | 
       | Highly recommend https://tinybird.com for this - they're a
       | fantastic OLAP DB for ingesting & visualizing time-series data!
        
         | [deleted]
        
         | mdeichmann wrote:
          | Hi, this is Max - one of the founders of Langfuse, super
          | excited to show Langfuse to HN today. Thanks a lot for
          | the suggestion. I had not heard of Tinybird, but it
          | seems like a great product. It could be valuable to use
          | their materialized views to calculate aggregates for our
          | analytics UI. We will need to discuss whether we can use
          | them, as they are not open source. However, for anyone
          | reading this: they use ClickHouse under the hood and
          | have created a knowledge base
          | (https://github.com/tinybirdco/clickhouse_knowledge_base).
          | I will browse it to learn more.
        
           | NathanFlurry wrote:
           | GPT wrappers [handshake emoji] ClickHouse wrappers
        
       | v3np wrote:
       | Cool stuff and congrats on the Show HN! Out of curiosity, at what
       | point do you see teams usually adopting something like langfuse?
       | In regular development, you sometimes even have test-driven
       | development - I imagine this doesn't really apply for LLMs. Do
       | you see this changing over time as the process of building LLM
       | apps becomes more mature?
        
         | mdeichmann wrote:
          | Thanks a lot! We see teams adopt Langfuse quite early.
          | Say you have one or two engineers working on a rather
          | complex LLM feature: they look for a solution like
          | Langfuse in a test environment, before going to
          | production. The majority observe their LLM features in
          | production, though. We don't see test-driven development
          | as much, but we do think that model- and rule-based
          | evals will become more important in the future, with CI
          | only passing if a certain score is achieved.
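          | 
          | For illustration, such a CI gate could be as simple as
          | the following (hypothetical sketch; score_completion
          | stands in for whatever model- or rule-based eval you
          | run):
          | 
          |   # Hypothetical gate: evaluate fixture prompts and fail
          |   # the build if the mean score drops below a threshold.
          |   FIXTURES = [
          |       {"input": "Summarize our refund policy briefly."},
          |       # ...
          |   ]
          | 
          |   def score_completion(prompt: str, text: str) -> float:
          |       """Score in [0, 1]: a rule or an LLM judge."""
          |       return 1.0 if text.strip() else 0.0  # placeholder
          | 
          |   def test_llm_quality_gate():
          |       scores = []
          |       for case in FIXTURES:
          |           text = "stub"  # call the app under test here
          |           scores.append(
          |               score_completion(case["input"], text))
          |       avg = sum(scores) / len(scores)
          |       assert avg >= 0.8, "eval score regressed below 0.8"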
        
       | elamje wrote:
       | Awesome. There is a definitely a need for LLM product analytics
       | that is currently completely underserved by traditional tools
       | like GA, Mixpanel, etc.
        
       | phillipcarter wrote:
       | Congrats on the release! I'm keenly interested in this space, as
       | I believe that Observability is one of the top ways to steer LLMs
       | to be more reliable in production.
       | 
       | I noticed your SDKs use tracing concepts! Are there plans to
       | implement OpenTelemetry support?
        
         | mdeichmann wrote:
          | Thank you so much - we fully share your sentiment on
          | this and aligned our domain language with OpenTelemetry.
          | Currently, users add lots of metadata and configuration
          | details to the trace by manually instrumenting it with
          | the SDKs (or via the Langchain integration). We are
          | thinking about integrating OpenTelemetry, as this would
          | be a step change in making integrations with apps
          | easier. However, we haven't had the time yet to figure
          | out how to capture all the metadata that's relevant as
          | context to the trace.
        
           | phillipcarter wrote:
           | Makes sense! If you're curious, I added an
           | autoinstrumentation library for openai's python client here:
           | https://github.com/cartermp/opentelemetry-instrument-
           | openai-...
           | 
            | The main challenge I see is that since there's no
            | standard across LLM providers for inputs/outputs (let
            | alone retrieval APIs!), any kind of automatic
            | instrumentation will need a bunch of adapters. I
            | suppose LangChain helps here, but with so many folks
            | ripping it out for production, you're still in the
            | same place.
           | 
           | Happy to collaborate on any design thinking for how to
           | incorporate OTel support!
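            | 
            | For a concrete picture, the core of such a wrapper is
            | small (sketch; the llm.* attribute names are invented,
            | as there are no semantic conventions for LLM calls
            | yet) - it's the per-provider mapping that multiplies:
            | 
            |   # Wrap one provider's client and map its response
            |   # onto generic span attributes.
            |   import openai  # pre-1.0 client
            |   from opentelemetry import trace
            | 
            |   tracer = trace.get_tracer("llm-instr-sketch")
            | 
            |   def traced_chat_completion(**kwargs):
            |       with tracer.start_as_current_span(
            |               "openai.chat") as span:
            |           span.set_attribute("llm.vendor", "openai")
            |           span.set_attribute("llm.model",
            |                              kwargs.get("model", ""))
            |           resp = openai.ChatCompletion.create(**kwargs)
            |           usage = resp["usage"]
            |           span.set_attribute(
            |               "llm.prompt_tokens",
            |               usage["prompt_tokens"])
            |           span.set_attribute(
            |               "llm.completion_tokens",
            |               usage["completion_tokens"])
            |           return resp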
        
             | mdeichmann wrote:
              | Yes, we were thinking about the lack of standards as
              | well. I would be super happy to have a design
              | discussion around the topic - I will reach out to
              | you.
        
       | kaspermarstal wrote:
        | I'm curious whether you investigated the TimescaleDB
        | extension that is built into Supabase for your use case?
        | And if so, what were the pros and cons?
        
         | marcklingen wrote:
          | Thanks for the hint. Having only one fully managed DB is
          | appealing: it reduces the need for joins across
          | databases and means less operational overhead (managing
          | uptime, setting up infra for testing, CI, etc.) for our
          | 2-person engineering "team". Timescale is definitely on
          | the list, at least as an intermediate solution that
          | would be faster to adopt than migrating to e.g.
          | ClickHouse. Since you mention it: any obvious
          | limitations to watch out for?
        
           | kaspermarstal wrote:
            | None that I am aware of - which is why I am very
            | interested in any red flags you might have found that
            | motivated your decision to move off Postgres/Supabase
            | and that others should be aware of.
        
             | marcklingen wrote:
              | Currently the project is all Postgres and the
              | managed version runs on Supabase. No scalability
              | issues yet; we are considering different OLAP
              | options because we are mostly interested in faster
              | analytical queries.
        
             | zacmps wrote:
             | I've been using it recently and I will say it is definitely
             | harder to perform common time series queries than something
             | like InfluxDB.
        
       | [deleted]
        
       | fiehtle wrote:
        | If you're looking to replace Looker with something open
        | source that you can style to your needs, maybe a mix of
        | cube.dev plus tremor.so would do the trick?
        
         | mdeichmann wrote:
          | Thanks for the suggestion. We love tremor as it
          | perfectly fits into our React/Tailwind setup. Cube is
          | great for collecting data from multiple sources, caching
          | aggregates, and providing an API to call from our React
          | frontend. I think this could be a solution for the
          | future in case we run into performance issues or end up
          | with data stored in different databases. I am rather
          | wondering how we can give our users a Datadog-like
          | dashboard experience. We would love to provide many
          | different charts, the ability to select and filter data,
          | and maybe even SQL-like queries from the frontend.
        
       | pranay01 wrote:
        | Congrats on the launch! Curious to learn what specific use
        | cases you have seen around observability of LLM apps that
        | are not covered by standard observability tools like
        | DataDog, SigNoz, etc.
        | 
        | Also, how do you compare in terms of features with
        | DataDog's recently launched LLM monitoring product?
        | 
        | Disclaimer: I am a maintainer at SigNoz.
        
         | mdeichmann wrote:
          | This is Max, one of the co-founders. We appreciate
          | existing observability tools - they have saved us a lot
          | of time in the past - and we are excited to get your
          | view on this! We've found many observability demands to
          | be quite different when working on LLM applications.
          | Mainly: input is unpredictable (users provide free-form
          | text that cannot be fully tested for), control flow is
          | highly dynamic when it runs on the textual output of a
          | previous step, and the quality of the output is not
          | known at runtime (to the application it is just text).
          | Many teams manually read through the LLM inputs and
          | outputs to get a feeling for correctness, or ask for
          | user feedback. In addition, we are currently working on
          | an abstraction for model-based evals, to make it simple
          | to try which one works best for a use case and to run it
          | automatically on all production prompts/completions. One
          | user described the difference like this: they usually
          | use observability to know that nothing is going wrong,
          | whereas they use Langfuse many hours per day to
          | understand how best to improve the application and
          | navigate cost/latency/quality trade-offs.
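          | 
          | As a rough sketch of the direction (hypothetical; the
          | prompt and scale are illustrative), a model-based eval
          | can be as small as:
          | 
          |   # A second model grades a question/answer pair; the
          |   # score is then attached to the trace.
          |   import openai  # pre-1.0 client
          | 
          |   def judge(question: str, answer: str) -> float:
          |       resp = openai.ChatCompletion.create(
          |           model="gpt-4",
          |           temperature=0,
          |           messages=[{
          |               "role": "user",
          |               "content": (
          |                   "Rate from 0 to 10 how well the answer "
          |                   "addresses the question. Reply with "
          |                   "the number only.\n"
          |                   f"Q: {question}\nA: {answer}"
          |               ),
          |           }],
          |       )
          |       msg = resp["choices"][0]["message"]["content"]
          |       return float(msg) / 10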
        
       | addisonj wrote:
       | Congrats on the launch!
       | 
        | I have quite a few years of observability experience
        | behind me and hadn't really considered some of the unique
        | aspects that LLMs bring into the picture. Here are a few
        | thoughts, responses to your questions, and feedback items.
        | 
        | * Generally, I think you do a good job of telling a clear,
        | concise story with a solid value proposition, fairly early
        | in a market where the number of people hitting these
        | problems is rapidly growing - a pretty nice place to be!
        | But that can also be a challenge, in that you have to help
        | people recognize the problem, which often means lots of
        | content and lots of outreach.
       | 
        | * I think going open source and following a PLG model of
        | cloud/managed services is a pretty reasonable way to go
        | and can certainly be a leg up over the existing players,
        | but I noticed in your pricing a note about enterprise
        | support for self-hosting in the customer's VPC and
        | dedicated instances. There is lots of money there... but
        | it can also be an _extremely_ big time sink for early-
        | stage teams, so I would be careful, or at least make sure
        | you price it such that it supports hiring.
       | 
        | * Also on pricing, I wonder if basing it on storage
        | matches how people think about this data? Generally, I
        | think about observability data in terms of events/sec
        | first and then retention period. If you can make it work
        | with a single usage-based metric of storage, that is
        | great! But I would be concerned that 1) you aren't telling
        | the user which plan supports their throughput, and 2) you
        | could end up with large variance in cost across different
        | usage patterns.
       | 
        | * The biggest question I have is: how much did you explore
        | OpenTelemetry? Obviously, it is not as simple as just
        | going and building your own API and SDK... but when I look
        | at the capabilities, I could see OpenTelemetry being the
        | underlying protocol with some thinner convenience wrappers
        | on top. From your other comments, I understand that you
        | see some ways in which this data is different from typical
        | trace/observability data, but I do wonder if that choice
        | will 1) scare off some companies that are already "all in"
        | on OTel, and 2) cost you the opportunity to use everything
        | around OTel, for example Kafka integration if you someday
        | need that.
       | 
        | * As far as your question about OLAP: I wouldn't rush
        | it... In general, once you are big enough that the
        | cost/scalability limitations of PG are looming, you will
        | be a different company and know a lot more about the real
        | requirements. I will also say that in all likelihood
        | ClickHouse is probably the right choice, but even knowing
        | that, there are lots of different ways to tackle the
        | problem (like hosted vs self-managed), and the right way
        | to do it will depend on usage patterns, cost structure,
        | where you end up with enterprise dedicated / self-hosted,
        | etc. I will mention, though, that TimescaleDB is not a bad
        | way to buy yourself a bit of headroom, but it is important
        | to note that the TimescaleDB offered by Supabase shouldn't
        | be compared to TimescaleDB community / cloud. The Supabase
        | version isn't bad, it just isn't quite the same thing
        | (i.e. no horizontal scalability).
       | 
       | Anyways, congrats again! It looks like you are off to a good
       | start.
       | 
       | If you have any other questions for me, my email is in my
       | profile.
        
         | hrpnk wrote:
          | +1 on the OTel mention. With telemetry already in place
          | in a system, one would typically implement a single
          | behavioral tracking SDK on top. Adding yet another SDK
          | just for LLMs is a hard ask, given how specific the
          | implementation will be. By building on a standard, you
          | can offer value-added insights on top.
          | 
          | On the other hand, if you target just the applications
          | that put an LLM behind an API, you will have customers
          | expecting value-added services on top of telemetry, like
          | prompt optimization, classification, result caching,
          | etc.
          | 
          | It's your choice which direction and target group you
          | focus on first.
        
         | mdeichmann wrote:
          | This reads like a book, thank you so much for putting it
          | together!
          | 
          | > About value prop: Thanks for the feedback! We are
          | already trying to be as vocal about it as possible, e.g.
          | by writing great docs, but we can probably do better.
          | 
          | > PLG & OSS: Thanks for the hint - we will be careful
          | about managing deployments within customer VPCs.
          | 
          | > Pricing: We picked storage as the first metric to
          | price on, as it varies a lot across users. Some use
          | Langfuse to track complex embedding processes with a lot
          | of context, others just simple chat messages with
          | relatively low-context, low-value events.
          | 
          | > OTel: We looked into it but did not go into all the
          | details. We wanted to get a product out there fast and
          | liked the experience of e.g. the PostHog SDKs. I might
          | reach out to you on this topic after investing more time
          | in it. Thanks for the offer!
          | 
          | > OLAP: Agreed - I also learned to tackle scaling issues
          | once they appear, and so far we are good. Interesting
          | that Supabase's version has no horizontal scaling - that
          | would be one of the main reasons to use it, IMO.
        
       | anirudhrx wrote:
        | Congrats on the launch! This is really cool. Would love to
        | see OTel integration in the future. I'm curious whether
        | this might eventually work with request-context-based
        | routing, i.e. using the propagated metadata between layers
        | to dynamically test different versions of the stack,
        | replay requests, or route to specific underlying
        | implementation versions at different levels of the stack.
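        | 
        | Something like OpenTelemetry baggage could carry that
        | routing signal between layers (rough sketch; the key and
        | chain functions are made up):
        | 
        |   # Tag a request upstream; downstream layers read the tag
        |   # to pick an implementation version.
        |   from opentelemetry import baggage, context
        | 
        |   def run_chain_v1() -> str: return "v1 answer"
        |   def run_chain_v2() -> str: return "v2 answer"
        | 
        |   token = context.attach(
        |       baggage.set_baggage("stack_version", "v2-exp"))
        |   try:
        |       # Downstream, even across services via propagation:
        |       if baggage.get_baggage("stack_version") == "v2-exp":
        |           result = run_chain_v2()
        |       else:
        |           result = run_chain_v1()
        |   finally:
        |       context.detach(token)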
        
       | ij23 wrote:
       | [dead]
        
       ___________________________________________________________________
       (page generated 2023-08-29 23:00 UTC)