       Show HN: Sketch - AI code-writing assistant that understands data
       content
        
       Hey HN!  I'm excited to share sketch: a tool to help anyone who
       uses python and pandas quickly iterate and get to answers for their
       data questions.  Sketch installs as a pandas extension that offers
       utility functions that operate on natural language prompts. Using
       the `ask` interface you can get answers in natural language. Using
       the `howto` interface you can get python and pandas code
       directly. The primary benefit of this over copilot and chatGPT is
       that this adds data-content based context so that the generated
       answers are much more accurate and relevant to the data problem at
       hand.  Check out the demo video[1] and try it out using the colab
       notebook (on github)!  [1] https://user-
       images.githubusercontent.com/916073/212602281-4...
        
       Author : bluecoconut
       Score  : 176 points
       Date   : 2023-01-16 13:33 UTC (9 hours ago)
        
        
       | tdebroc wrote:
       | Looks really nice, but I tried it:
       | 
       |     import sketch
       |     import pandas as pd
       | 
       |     data_pd = pd.read_csv("input.csv", sep=';')
       |     print(data_pd)
       |     print(data_pd.sketch.ask("Is there any PII in this dataset ?"))
       |     print(data_pd.sketch.ask("Which columns are integer type?"))
       | 
       | With this input.csv:
       | 
       |     name;age;address;phone
       |     Bob;34;106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD;1-541-754-3010
       |     Anna;34;694 Short Street, Austin, Texas;001-541-754-3010
       | 
       | And I have no results (and no runtime error as well) :-( Here is
       | the console output:
       | 
       |        name  age                                         address             phone
       |     0   Bob   34  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD    1-541-754-3010
       |     1  Anna   34                 694 Short Street, Austin, Texas  001-541-754-3010
       |     <IPython.core.display.HTML object>
       |     None
       |     <IPython.core.display.HTML object>
       |     None
       | 
       | Am I missing something? The "ask" interface doesn't seem to
       | need external OpenAI credentials, right?
        
         | bluecoconut wrote:
         | to get the strings of the results back out, add the kwarg
         | `call_display=False` to the functions.
         | 
         | So:
         | 
         |     print(data_pd.sketch.ask("Is there any PII in this dataset ?", call_display=False))
         | 
         | should work for you.
         | 
         | Right now it assumes by default that it's in an IPython context
         | that can display HTML objects.
        
           | tdebroc wrote:
           | Ah yes it displayed the string, thanks!
           | 
           | But the result looks wrong with this input:
           | 
           |        age                                         address
           |     0   34  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD
           |     1   34                 694 Short Street, Austin, Texas
           | 
           | It says:
           | 
           |     No, there is no PII (personally identifiable information)
           |     in this dataset. The only columns are index, age, and
           |     address, none of which contain any sensitive information.
           | 
           | Sometimes, it seems to work with phone number though. Here:
           | 
           |        age                                         address             phone
           |     0   34  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD    1-541-754-3010
           |     1   34                 694 Short Street, Austin, Texas  001-541-754-3010
           | 
           |     Yes, this dataset contains PII (personally identifiable
           |     information) such as age, address, and phone number.
           | 
           | I retried:
           | 
           |        pirce                                         address             phone
           |     0    123  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD    1-541-754-3010
           |     1  43543                 694 Short Street, Austin, Texas  001-541-754-3010
           | 
           |     No, there is no personally identifiable information (PII)
           |     in this dataset. The columns contain only generic
           |     information such as index, price, address, and phone
           |     number. None of these columns contain any information
           |     that could be used to identify an individual.
           | 
           | Which is wrong. Is there an explanation?
        
       | ibestvina wrote:
       | Great work, and a really interesting application of GPT3. Some
       | time ago I developed Datasloth [1] which might be a nice
       | complementary feature to Sketch. Ping me if you're interested to
       | bounce ideas :)
       | 
       | [1] https://github.com/ibestvina/datasloth
        
       | gcatalfamo wrote:
       | Cool project, although the name kinda clashes with the well-known
       | https://www.sketch.com/ in the UI/UX design space
        
       | daveguy wrote:
       | This is very cool. A useful case for gpt. One question / concern:
       | isn't a person's address considered PII? Is the system flexible
       | enough to add pre-statements such as "treat an address as PII"?
        
         | harvey9 wrote:
         | Related question: is this done on my machine or do I end up
         | sending possible pii to a cloud service for evaluation?
        
           | bluecoconut wrote:
           | This is sending summary statistics to a cloud machine by
           | default (for ease of immediate use).
           | https://github.com/approximatelabs/sketch#sketch-
           | currently-u...
           | 
           | You can run using your own OpenAI key by setting 2
           | environment variables:
           | 
           | (1) SKETCH_USE_REMOTE_LAMBDAPROMPT=False
           | (2) OPENAI_API_KEY=YOUR_API_KEY
           | 
           | To run entirely locally (using your own GPU and a model like
           | Bloom) one would have to add a new prompt type to
           | `lambdaprompt` (the package that this depends on), have a
           | machine with enough GPU resources, and then add a slight
           | modification to sketch.
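
A minimal sketch of the two-variable setup described above, done from Python instead of the shell. The variable names come from the comment; the key value is a placeholder. It is an assumption that the variables need to be set before `sketch` is imported, and the commented-out usage lines assume `sketch` and `pandas` are installed.

```python
import os

# Variable names are taken from the comment above; the values are placeholders.
os.environ["SKETCH_USE_REMOTE_LAMBDAPROMPT"] = "False"
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"  # placeholder, not a real key

# Assumption: the variables are read when sketch (or lambdaprompt) is imported
# or first used, so they are set before the import. Uncomment in an
# environment where sketch and pandas are installed.
# import sketch
# import pandas as pd
# df = pd.read_csv("input.csv", sep=";")
# print(df.sketch.ask("Which columns are integer type?", call_display=False))
```

Even with these set, prompts still go to OpenAI under your own key; only the fully local route described above keeps everything on your machine.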
        
             | adabyron wrote:
             | Not sure if this is a business you're building out of this
             | or an experiment. For real use for any of my customers, I
             | would need to run this entirely locally.
             | 
             | I think it's really awesome though!
             | 
             | Curious what "enough GPU resources" looks like? Would a
             | GeForce RTX 40 or 30 series card with 12-24GB of RAM be
             | sufficient per user running locally on their machine?
        
       | irthomasthomas wrote:
       | This is very cool! I've literally today been noodling with ideas
       | to use probabilistic data structures in LLMs.
       | 
       | And TIL you can embed mp4s in a GitHub readme. Is that new?
        
       | sean_the_geek wrote:
       | Really cool and helpful. Is there anything similar for R?
        
         | pklee wrote:
         | GPT3 model generates a SQL. You can sqldf on top of your
         | data.table. We will be demo'ing at one of the events shortly.
         | BTW, you could do somewhat similar with other LLMs such as GPTJ
         | and GPT NEOX if you have worked with them
        
           | rafaelmelhem wrote:
           | is GPTJ/NEOX good enough to generate code? tried it with SQL
           | and it was really disappointing
        
       | jerpint wrote:
       | Does using this mean sending all of your potentially private data
       | via an api call to openAI?
        
         | abrichr wrote:
         | From
         | https://github.com/approximatelabs/sketch/blob/main/sketch/p...
         | it appears that this library is calling a remote API, which
         | obviates the utility of the demonstrated use case.
         | 
         | Upon closer inspection, it looks like
         | https://github.com/approximatelabs/sketch interfaces with the
         | model via https://github.com/approximatelabs/lambdaprompt,
         | which is made by the same organization. This suggests to me
         | that the former may be a toy demonstration of the latter.
         | 
         | Interesting how as of the time of writing this, most of the
         | comments here (i.e. dozens) are praising this as a legitimate
         | use case. Maybe I'm missing something obvious, but it seems
         | clear to me that uploading data to a third party to verify
         | whether that data contains PII is a non-starter for any serious
         | application.
        
         | teaearlgraycold wrote:
         | "Does this data contain PII?"
         | 
         | "Yes, and you just shared it all with Microsoft :D"
        
       | jonwinstanley wrote:
       | Very cool demo!
       | 
       | Regarding the choice of name, presumably you already know about
       | Sketch, the popular image editing software.
       | 
       | I wonder if the image editing guys will in the future incorporate
       | AI functionality too? Which might make "Googling" for your
       | product difficult for your potential customers?
        
         | Jugglerofworlds wrote:
         | There's also a program synthesis project called Sketch, which
         | is much closer to the domain of what the user posted:
         | https://people.csail.mit.edu/asolar/
        
       | pfd1986 wrote:
       | Hi, cool stuff! Which LLM is being used in the background? I may
       | have missed that info in the readme. Thanks!
        
         | swyx wrote:
         | digging thru the code
         | https://github.com/approximatelabs/sketch/blob/9d567ec161015...
         | 
         | this seems to be using their gpt3 framework:
         | https://github.com/approximatelabs/lambdaprompt
         | 
         | which uses text-davinci-003 by default
         | https://github.com/approximatelabs/lambdaprompt/blob/main/la...
        
         | bluecoconut wrote:
         | Thanks!
         | 
         | Right now this is running off of GPT-3 (`text-davinci-003`) and
         | via a small code change can run on codex (`code-davinci-002`)
         | but the quality only improves a little bit with that change.
         | 
         | That said, this is the first version to show that the interface
         | is viable; we are currently working on training our own
         | foundation model on a hybrid tokenization of data and word
         | tokens. I hope to improve this same toolkit in the future with
         | these new models of our own that we are training.
        
       | ethanwillis wrote:
       | Well, I'm locked out of my github account right now and don't
       | feel like going through all those hoops right now but I wanted to
       | point something minor out.
       | 
       | In this line,
       | https://github.com/approximatelabs/sketch/blob/9d567ec161015...
       | 
       | I think you can end up marking control characters as "UNKNOWN"
       | characters by accident by assuming that in all
       | contexts/environments that dictionary.items() always returns
       | items in a consistent order. This isn't always true.
       | 
       | edit: actually with the way the code is written if you have any
       | overlapping ranges at all you'll end up double/triple/etc.
       | counting a character into multiple categories.
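
To make the double-counting concern concrete, here is a small sketch. It is not Sketch's code: the category names and ranges are hypothetical and overlap on purpose. Checking every range counts one character several times, while walking an explicit priority list and stopping at the first match counts it once. Note that since Python 3.7 a dict does preserve insertion order, so the ordering concern mainly applies to older interpreters; the overlap problem exists either way.

```python
# Hypothetical category ranges (code points); they overlap on purpose.
RANGES = {
    "ASCII": (0x00, 0x7F),
    "DIGIT": (0x30, 0x39),
    "CONTROL": (0x00, 0x1F),
}

# Explicit priority: most specific category first.
PRIORITY = [
    ("CONTROL", RANGES["CONTROL"]),
    ("DIGIT", RANGES["DIGIT"]),
    ("ASCII", RANGES["ASCII"]),
]


def count_all_matches(text):
    """Count a character in every range it falls into (can double count)."""
    counts = {name: 0 for name in RANGES}
    for ch in text:
        for name, (lo, hi) in RANGES.items():
            if lo <= ord(ch) <= hi:
                counts[name] += 1
    return counts


def count_first_match(text):
    """Count each character exactly once, in the first matching category."""
    counts = {name: 0 for name, _ in PRIORITY}
    counts["UNKNOWN"] = 0
    for ch in text:
        for name, (lo, hi) in PRIORITY:
            if lo <= ord(ch) <= hi:
                counts[name] += 1
                break
        else:
            # No range matched this character.
            counts["UNKNOWN"] += 1
    return counts


sample = "a1\t"  # a letter, a digit, and a tab (a control character)
print(count_all_matches(sample))
print(count_first_match(sample))
```

With the first function the totals add up to more than the length of the input; with the second they always add up to exactly the number of characters.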
        
       | mmaia wrote:
       | Very promising. I believe the uses of OpenAI that will stick in
       | the long term are like this, and other tools should be
       | experimenting with this kind of integration.
       | 
       | Otherwise, there's room for other solutions, such as airops
       | sidekick [1], which uses browser extensions to embed itself in
       | other data tools.
       | 
       | 1- https://www.airops.com/
        
       | hgarg wrote:
       | I spent a few weeks last year building a text to sql tool using
       | codex model to do something like this but for all kinds of data
       | sources. We pivoted away to something else for various reasons.
       | 
       | But your approach is much better. Pandas is used a lot. Build a
       | tool on top of pandas. This is awesome.
        
       | javierluraschi wrote:
       | https://hal9.com is focused on building data apps with LLMs,
       | would love to explore integrating and contributing to Sketch. If
       | this sounds interesting I'm at javier at hal9.ai
        
       | drcongo wrote:
       | I use TabNine [0] for local context aware AI suggestions, and I
       | find it spookily good at guessing what I'm half way through
       | typing. Sadly they've left the Sublime plugin to rot and it's
       | mostly a hindrance in ST4.
       | 
       | [0] https://www.tabnine.com
        
       | ldh0011 wrote:
       | So... Microsoft bought 48 or 49% of OpenAI right? Integrating
       | this into Excel would make everyone an excel power user.
        
         | bufferoverflow wrote:
         | But if it makes a logical mistake, it would take a real power
         | user to notice it.
        
         | localhost wrote:
         | But wouldn't you need to integrate Python into Excel for this
         | to work?
        
         | mmaia wrote:
         | A lot of people already use excelformulabot. The impact of
         | something integrated into Excel would be pretty big.
        
           | davidbressler wrote:
           | It's already integrated into Excel with the add-on.
           | 
           | What else did you have in mind?
        
       | blakeburch wrote:
       | This is fantastic and exactly where our team at Shipyard is
       | expecting the data space to go. Context aware, AI driven. Great
       | work on this!
       | 
       | We were just talking last week about how we should create a
       | feature to describe transformations you want in Natural Language
       | that get compiled to pandas/SQL. Input data is everything
       | associated with the original file/dataframe.
       | 
       | Visual transformation tools are typically limited and non-
       | reproducible. If you could switch it around to be code-compiled
       | but description-driven, that would open up new possibilities.
       | 
       | I'd love to chat if you're open to it. Email in bio.
        
         | jadbox wrote:
         | I'd love something like a standalone SQL IDE where I can ask an
         | AI to generate queries or migration scripts.
         | 
         | Sadly to be honest, I don't think I'd pay a subscription for
         | such a service. I would prefer to pay a one time tooling fee
         | and just run trained model in the IDE locally.
        
           | rafaelmelhem wrote:
           | I did something similar for my own use. Using natural
           | language, it makes sql queries against your .csv or xlsx
           | files (soon I'll add features so you can connect to
           | databases), but it is not mature enough to sell as a service.
           | Feel free to reach me at info [at] rafaelmelhem . com if you
           | want and I'll send a demo :)
        
           | vorpalhex wrote:
           | Yeah the risk of your sql walking off to an AI vendor is not
           | worth the time savings.
        
       | swyx wrote:
       | This is a great demo, OP.
       | 
       | I'm wondering about the UX of this vs Copilot. Is this basically
       | just a way to get around the fact that you don't have Copilot
       | inside of notebooks? What else am I missing about this
       | experience?
        
         | bluecoconut wrote:
         | Thanks!
         | 
         | That is definitely a big part of it, getting to use copilot
         | style answers without having to install any plugins to the IDE
         | (so getting to use this in colab or jupyter notebooks directly
         | feels great).
         | 
         | That said, I use both copilot and sketch in my VScode
         | notebooks, and find that they have slightly different feelings
         | to the iteration loop.
         | 
         | Sketch offers a more "local" data context (pinning the
         | text/prompt to the specific dataframe) which increases the
         | quality of the suggestions (since more relevant information is
         | within the token limit).
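
A rough sketch of what "pinning the prompt to the specific dataframe" could look like, using only the standard library. This is not Sketch's actual summary format or prompt: the statistics chosen, the function names `summarize_columns` and `build_prompt`, and the prompt layout are all assumptions made for illustration. The sample data is the CSV posted earlier in the thread.

```python
import csv
import io

CSV_TEXT = """name;age;address;phone
Bob;34;106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD;1-541-754-3010
Anna;34;694 Short Street, Austin, Texas;001-541-754-3010
"""


def summarize_columns(csv_text, sep=";"):
    """Compute a few small per-column statistics (hypothetical choice)."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=sep)
    rows = list(reader)
    summary = {}
    for col in reader.fieldnames or []:
        values = [row[col] for row in rows]
        summary[col] = {
            "rows": len(values),
            "distinct": len(set(values)),
            "all_integers": bool(values)
            and all(v.lstrip("-").isdigit() for v in values),
            "max_length": max((len(v) for v in values), default=0),
        }
    return summary


def build_prompt(question, summary):
    """Put the column summaries ahead of the question in a single prompt."""
    lines = ["Column summaries:"]
    for col, stats in summary.items():
        lines.append(f"- {col}: {stats}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


summary = summarize_columns(CSV_TEXT)
print(build_prompt("Which columns are integer type?", summary))
```

The point of the design is that a compact summary costs far fewer tokens than the raw rows, so more of the relevant data context fits in the prompt.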
        
       | allisdust wrote:
       | I don't have any experience with pandas. Can this directly
       | connect to a db and run queries there? (The video seems to load
       | a csv file.)
        
         | harvey9 wrote:
         | If you can already write SQL to return a data set then you can
         | get that set to pandas with pyodbc.
        
       | [deleted]
        
       | jamal-kumar wrote:
       | Damn, this looks pretty useful. I was finding that github copilot
       | was really good at reading a CSV file and writing all the imports
       | from that into migrations for DB import, but this looks like it
       | does these data transformations even more robustly.
       | 
       | Are there any plans on getting this to work outside of the
       | python/pandas ecosystem or is it intrinsically tied to that
       | environment?
        
       ___________________________________________________________________
       (page generated 2023-01-16 23:00 UTC)