[HN Gopher] Show HN: Dataherald AI - Natural Language to SQL Engine
       ___________________________________________________________________
        
       Show HN: Dataherald AI - Natural Language to SQL Engine
        
       Hi HN community. We are excited to open source Dataherald's
       natural-language-to-SQL engine today
       (https://github.com/Dataherald/dataherald). This engine allows you
       to set up an API from your structured database that can answer
       questions in plain English.  GPT-4 class LLMs have gotten
        remarkably good at writing SQL. However, out-of-the-box LLMs and
        existing frameworks did not work with our own structured data at
        the quality level we needed. For example, given the question "what
       was the average rent in Los Angeles in May 2023?" a reasonable
       human would either assume the question is about Los Angeles, CA or
       would confirm the state with the question asker in a follow up.
       However, an LLM translates this to:  select price from rent_prices
       where city="Los Angeles" AND month="05" AND year="2023"  This pulls
       data for Los Angeles, CA and Los Angeles, TX without getting
       columns to differentiate between the two. You can read more about
       the challenges of enterprise-level text-to-SQL in this blog post I
       wrote on the topic: https://medium.com/dataherald/why-enterprise-
       natural-languag...  Dataherald comes with "batteries-included." It
       has best-in-class implementations of core components, including,
        but not limited to: a state-of-the-art NL-to-SQL agent and an
        LLM-based SQL-accuracy evaluator. The architecture is modular, allowing
       these components to be easily replaced. It's easy to set up and use
       with major data warehouses.  There is a "Context Store" where
       information (NL2SQL examples, schemas and table descriptions) is
        used in the LLM prompts so that the engine improves with usage.
       And we even made it fast!  This version allows you to easily
       connect to PG, Databricks, BigQuery or Snowflake and set up an API
       for semantic interactions with your structured data. You can then
        add business and data context that is used for few-shot prompting
       by the engine.  The NL-to-SQL agent in this open source release was
       developed by our own Mohammadreza Pourreza, whose DIN-SQL algorithm
       is currently top of the Spider (https://yale-lily.github.io/spider)
        and Bird (https://bird-bench.github.io/) NL-to-SQL benchmarks. This
        agent has outperformed the Langchain SQLAgent by anywhere from
        12% to 250% (depending on the provided context) in our own internal
       benchmarking while being only ~15s slower on average.  Needless to
       say, this is an early release and the codebase is under swift
       development. We would love for you to try it out and give us your
       feedback! And if you are interested in contributing, we'd love to
       hear from you!
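        To make the "Los Angeles" ambiguity above concrete, here is a small
        self-contained sketch. It uses sqlite3 and an invented rent_prices
        table (the column names follow the example query; the data is made
        up) to show how the naive LLM-generated query mixes the two cities,
        while adding a state filter returns the answer the asker wanted.

```python
import sqlite3

# Toy rent_prices table: two cities named "Los Angeles" in different states.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rent_prices (city TEXT, state TEXT, month TEXT, year TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO rent_prices VALUES (?, ?, ?, ?, ?)",
    [
        ("Los Angeles", "CA", "05", "2023", 2800.0),
        ("Los Angeles", "CA", "05", "2023", 3200.0),
        ("Los Angeles", "TX", "05", "2023", 900.0),
    ],
)

# The naive query: no state filter, so CA and TX rows are averaged together.
naive = conn.execute(
    "SELECT AVG(price) FROM rent_prices "
    "WHERE city='Los Angeles' AND month='05' AND year='2023'"
).fetchone()[0]

# The query a careful human would write after disambiguating the state.
disambiguated = conn.execute(
    "SELECT AVG(price) FROM rent_prices "
    "WHERE city='Los Angeles' AND state='CA' AND month='05' AND year='2023'"
).fetchone()[0]

print(naive)          # 2300.0 -- skewed by the Texas rows
print(disambiguated)  # 3000.0 -- the answer the asker actually wanted
```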
        
       Author : aazo11
       Score  : 70 points
       Date   : 2023-08-23 18:38 UTC (4 hours ago)
        
        
       | ajhai wrote:
        | We were using OpenAI's Codex model with a bunch of few-shot
        | examples for natural language queries over blockchain data,
        | coupled with a finetuned completions model to plot the results,
        | when we were working on makerdojo.io.
       | 
       | Without the ability to fine tune Codex, we had to employ a bunch
       | of techniques to pick the right set of examples to provide in the
       | context to get good SQL. We have come very far since then and it
        | has been great to see projects like sqlcoder and this.
       | 
       | Pieces of the pipeline powering makerdojo eventually became
       | LLMStack (https://github.com/trypromptly/LLMStack). We are
       | looking to integrate one of these SQL generators as processors in
       | LLMStack so we can build higher level applications.
        
         | aazo11 wrote:
         | We are looking to partner with other open source projects.
         | Let's connect. Shoot me a line at amir at dataherald.com
        
         | [deleted]
        
       | xfalcox wrote:
       | We are experimenting in this area, as a way to provide a way for
       | community managers to query their Discourse PG database using
       | natural language.
       | 
       | Your product is very interesting. Maybe we can collaborate here,
       | as Discourse is a standard Rails database schema on PostgreSQL
       | and could be a good way to show your tool working in a real
       | database.
        
         | aazo11 wrote:
         | Hi -- would love to connect and discuss more. Shoot me an email
         | amir at dataherald.com.
        
       | o_____________o wrote:
       | How are you dealing with stored procedures?
        
         | aazo11 wrote:
          | TLDR -- We currently do not automatically load stored
          | procedures as context for in-context learning/prompting, but
          | it is on the roadmap.
         | 
         | Additional color: Once you configure a connection to a DB, you
         | can trigger a 'scan' of the DB through the API which detects
         | various tables/columns, identifies low cardinality
         | (categorical) columns and their values and also loads in VIEWS
         | stored on tables. In our default NL2SQL agent, these are then
         | used together with 'verified' NL2SQL pairs and with other
         | metadata you can manually add to generate the SQL.
         | 
         | We are looking to automatically add context from Stored
         | procedures and other sources like dbt and various data
         | dictionaries as well.
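        The 'scan' step described above could be sketched roughly like
        this: a simplified, hypothetical version using sqlite3 and an
        invented orders table. The function name, threshold, and logic are
        illustrative assumptions, not the engine's actual implementation.

```python
import sqlite3

def scan_low_cardinality(conn, table, threshold=20):
    """Return {column: distinct_values} for columns with few distinct values.

    A column whose number of distinct values is at or below `threshold` is
    treated as categorical, and its values are collected so they can later
    be injected into the LLM prompt as context.
    """
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    categorical = {}
    for col in cols:
        values = [r[0] for r in conn.execute(
            f"SELECT DISTINCT {col} FROM {table} LIMIT {threshold + 1}")]
        if len(values) <= threshold:
            categorical[col] = values
    return categorical

# Toy table: `status` is low cardinality, `order_id` is not.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, "open" if i % 2 else "closed") for i in range(100)])

print(scan_low_cardinality(conn, "orders", threshold=20))
# {'status': [...]} -- order_id has 100 distinct values and is skipped
```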
        
       | gourabmi wrote:
        | How is this approach different from NaLIR?
       | https://dl.acm.org/doi/10.1145/2588555.2594519
        
         | [deleted]
        
       | klft wrote:
       | Could you provide more details about your approach? Is the
        | context store based on vector embeddings?
        
         | MrezaPourreza wrote:
         | I'd be delighted to assist you with your inquiries. Indeed, the
         | context store interacts with vector databases to retrieve
         | samples based on vector embeddings and cosine similarity.
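        As a rough illustration of that retrieval step, here is a toy
        sketch with hand-written three-dimensional embeddings standing in
        for a real embedding model and vector database; the store contents
        and function names are invented for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy context store: (embedding, verified NL2SQL example) pairs.
# A real system would embed with a model and store vectors in a vector DB.
store = [
    ([1.0, 0.1, 0.0], ("what was the average rent in LA?",
                       "SELECT AVG(price) FROM rent_prices WHERE ...")),
    ([0.0, 1.0, 0.2], ("how many users signed up in May?",
                       "SELECT COUNT(*) FROM users WHERE ...")),
]

def top_k(query_embedding, k=1):
    """Return the k stored examples most similar to the query embedding."""
    ranked = sorted(store,
                    key=lambda item: cosine_similarity(query_embedding, item[0]),
                    reverse=True)
    return [example for _, example in ranked[:k]]

# A question about rent should retrieve the rent example for the prompt.
print(top_k([0.9, 0.2, 0.0], k=1))
```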
        
       | DylanDmitri wrote:
       | I've seen impressive demos with just LLM direct to SQL. Can't
       | wait to see what people build with these new tools.
        
         | aazo11 wrote:
          | Hi Dylan -- new GPT-4 class LLMs have gotten good at writing
          | SQL. However, while the SQL they generate almost always
          | executes without error, it often returns incorrect answers to
          | the question. Common reasons are that the business context or
          | definitions are not in the table schema, or that the correct
          | query is complex and requires multiple joins.
        
           | verdverm wrote:
            | It's also because hallucinations are an inherent property of
            | LLMs, likely impossible to remove completely -- always
            | double-check the LLM's work.
        
             | aazo11 wrote:
             | Totally agree. In the hosted version built on top of the
             | engine we block the answer from going to the question asker
              | until an authorized user 'verifies' the answer. However,
              | these 'verified' answers are then stored in the context
              | store and retrieved for few-shot prompting.
        
           | thom wrote:
           | Yeah, you really shouldn't be building this (or really any
           | sort of important or destructive interface) with LLMs unless
           | you have an intermediate representation which users can
           | inspect and understand. Even if the understanding of English
           | were perfect there are still ambiguities you'd need to sort
           | out. You also need to be able to suggest something useful if
           | the user asks a nonsense question (for example data that
           | doesn't exist in your schema or its contents). When I worked
           | on this a decade ago we were quite good at mapping unknown
           | queries to various suggestions drawn from a canonical subset
           | of English for which we had very explicit semantics. But
            | obviously the space of unknowns was huge.
           | 
           | I've no doubt the AIs will reach human parity at some point
           | (and therefore still make mistakes!) but I'd be terrified
           | deploying any sort of one-shot text-to-results pipeline
           | today.
        
       | tillvz wrote:
       | _" So is enterprise conversational BI impossible in 2023? Will
       | there be a few more years of academic papers and company AI
       | hackathon projects before a solution can be deployed in
       | production? We don't think so."_ -- from the medium article.
       | 
       | Want to put your attention to https://www.veezoo.com as well. A
        | conversational self-service analytics solution that's been around
        | since 2016, deployed in production at Fortune 500 companies and
        | used by thousands of users daily :)
       | 
       | Congrats on the launch - and also reaching the top of the Spider
       | dataset. We're very familiar with that dataset and its
       | difficulties :)!
       | 
       | Happy to have a chat as well!
        
         | aazo11 wrote:
         | Amazing! Feel free to shoot me a line at amir at dataherald
        
         | saigal wrote:
         | Very cool. I'm checking out veezoo.com now!
        
         | [deleted]
        
       | derekx wrote:
       | [flagged]
        
       | cpursley wrote:
        | Neat! How difficult would it be to extend this to SQL-like query
        | languages such as Hasura and PostgREST (via Supabase)?
       | 
       | - https://hasura.io/docs/latest/api-reference/graphql-api/quer...
       | 
       | - https://postgrest.org/en/stable/references/api/tables_views....
       | 
       | ^ The advantage would be, these tools handle authorization.
        
         | aazo11 wrote:
         | Interesting. So far we have been focused on allowing non-
         | technical users to self-serve from enterprise data warehouses.
         | It might be possible to enable this scenario with a simple
          | addition to the NL-2-SQL engine to translate the generated SQL
          | to API calls. If that does not work, a custom implementation
          | of the NL-2-SQL module would definitely solve this.
         | 
         | Do you have a specific use-case in mind? Would love to
         | collaborate on adding it to the open-source library if you are
         | open to it.
        
           | cpursley wrote:
           | Mostly my concerns are around authorization (who can access
           | what) and those two tools handle that before generating a
           | database query. So maybe the approach would be to go from
           | natural lang -> SQL (as you do now) then to Hasura's or
           | PostgREST's specific query syntax (which are already pretty
           | SQL like). And specifically with Hasura, since it's GraphQL -
            | you can get all the available types. But that might all be
            | too far out of scope of what you're trying to do.
        
         | jcjc712 wrote:
          | It shouldn't be difficult: if we provide many examples and the
          | db/table schema, the agent can produce an accurate response.
        
       | steno132 wrote:
       | Excellent! This blows writing SQL by hand out of the water.
       | 
       | In 10 years, programming languages like Java, SQL, etc will be
       | dead.
       | 
       | I predict everyone will write in their native language (English,
        | Chinese, etc) and an AI-based compiler will write the actual code
       | for them.
        
         | catchnear4321 wrote:
         | said like a true swe
        
           | xigency wrote:
           | A mix of magic and imagination.
        
         | saigal wrote:
         | the future indeed!
        
       ___________________________________________________________________
       (page generated 2023-08-23 23:00 UTC)