[HN Gopher] Show HN: Sketch - AI code-writing assistant that und...
___________________________________________________________________

Show HN: Sketch - AI code-writing assistant that understands data content

Hey HN! I'm excited to share sketch: a tool to help anyone who uses
python and pandas quickly iterate and get to answers for their data
questions.

Sketch installs as a pandas extension that offers utility functions that
operate on natural language prompts. Using the `ask` interface you can
get answers in natural language. Using the `howto` interface you can get
python and pandas code directly.

The primary benefit of this over copilot and chatGPT is that this adds
data-content based context so that the generated answers are much more
accurate and relevant to the data problem at hand.

Check out the demo video[1] and try it out using the colab notebook (on
github)!

[1] https://user-images.githubusercontent.com/916073/212602281-4...

Author : bluecoconut
Score  : 176 points
Date   : 2023-01-16 13:33 UTC (9 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| tdebroc wrote:
| Looks really nice, but I tried it:
|
|     import sketch
|     import pandas as pd
|
|     data_pd = pd.read_csv("input.csv", sep=';')
|     print(data_pd)
|     print(data_pd.sketch.ask("Is there any PII in this dataset ?"))
|     print(data_pd.sketch.ask("Which columns are integer type?"))
|
| With this input.csv:
|
|     name;age;address;phone
|     Bob;34;106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD;1-541-754-3010
|     Anna;34;694 Short Street, Austin, Texas;001-541-754-3010
|
| And I have no results (and no runtime error as well) :-( Here is the
| console output:
|
|        name  age                                         address             phone
|     0   Bob   34  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD    1-541-754-3010
|     1  Anna   34                 694 Short Street, Austin, Texas  001-541-754-3010
|     <IPython.core.display.HTML object>
|     None
|     <IPython.core.display.HTML object>
|     None
|
| Am I missing something? The "ask" interface doesn't seem to need
| external OpenAI credentials, right?
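Side note on the second question in the script above: "Which columns are
integer type?" can also be checked deterministically with plain pandas,
without any model call. A minimal sketch, assuming the input.csv content
is exactly the two rows posted above (inlined here via `io.StringIO` so
no file is needed). The names `CSV_TEXT` and `integer_columns` are made
up for illustration and are not part of sketch:

```python
import io

import pandas as pd

# Assumption: the two-row sample from the comment above, inlined as text.
CSV_TEXT = (
    "name;age;address;phone\n"
    "Bob;34;106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD;1-541-754-3010\n"
    "Anna;34;694 Short Street, Austin, Texas;001-541-754-3010\n"
)

# Same parsing options as the original script (semicolon-separated).
data_pd = pd.read_csv(io.StringIO(CSV_TEXT), sep=";")

# Collect the columns that pandas parsed with an integer dtype.
integer_columns = [
    column
    for column in data_pd.columns
    if pd.api.types.is_integer_dtype(data_pd[column])
]
print(integer_columns)
```

With this sample only `age` parses as an integer column; `phone` stays a
string column because of the dashes. This is only a baseline to compare
an `ask` answer against, not a replacement for it.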
| bluecoconut wrote:
| To get the strings of the results back out, add the kwarg
| `call_display=False` to the functions.
|
| so:
| ```
| print(data_pd.sketch.ask("Is there any PII in this dataset ?", call_display=False))
| ```
| should work for you.
|
| Right now it by default assumes it's in an ipython context that can
| display HTML objects.
| tdebroc wrote:
| Ah yes it displayed the string, thanks!
|
| But the result looks wrong with this input:
|
|        age                                         address
|     0   34  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD
|     1   34                 694 Short Street, Austin, Texas
|
| It says:
|
|     No, there is no PII (personally identifiable information) in this
|     dataset. The only columns are index, age, and address, none of
|     which contain any sensitive information.
|
| Sometimes, it seems to work with phone number though. Here:
|
|        age                                         address             phone
|     0   34  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD    1-541-754-3010
|     1   34                 694 Short Street, Austin, Texas  001-541-754-3010
|
|     Yes, this dataset contains PII (personally identifiable
|     information) such as age, address, and phone number.
|
| I retried:
|
|        pirce                                         address             phone
|     0    123  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD    1-541-754-3010
|     1  43543                 694 Short Street, Austin, Texas  001-541-754-3010
|
|     No, there is no personally identifiable information (PII) in this
|     dataset. The columns contain only generic information such as
|     index, price, address, and phone number. None of these columns
|     contain any information that could be used to identify an
|     individual.
|
| Which is wrong. Is there an explanation?
| ibestvina wrote:
| Great work, and a really interesting application of GPT3. Some time
| ago I developed Datasloth [1] which might be a nice complementary
| feature to Sketch. Ping me if you're interested to bounce ideas :)
|
| [1] https://github.com/ibestvina/datasloth
| gcatalfamo wrote:
| Cool project, although the name kinda clashes with the well-known
| https://www.sketch.com/ in the UI/UX design space
| daveguy wrote:
| This is very cool.
| A useful case for gpt. One question / concern: isn't a person's
| address considered PII? Is the system flexible enough to add
| pre-statements such as "treat an address as PII"?
| harvey9 wrote:
| Related question: is this done on my machine or do I end up sending
| possible PII to a cloud service for evaluation?
| bluecoconut wrote:
| This is sending summary statistics to a cloud machine by default (for
| ease of immediate use).
| https://github.com/approximatelabs/sketch#sketch-currently-u...
|
| You can run using your own OpenAI key by setting 2 environment
| variables:
| (1) SKETCH_USE_REMOTE_LAMBDAPROMPT=False
| (2) OPENAI_API_KEY=YOUR_API_KEY
|
| To run entirely locally (using your own GPU and a model like Bloom)
| one would have to add a new prompt type to `lambdaprompt` (the package
| that this depends on), have a machine with enough GPU resources, and
| then add a slight modification to sketch.
| adabyron wrote:
| Not sure if this is a business you're building out of this or an
| experiment. For real use for any of my customers, I would need to run
| this entirely locally.
|
| I think it's really awesome though!
|
| Curious what "enough GPU resources" looks like? Would a GeForce RTX 40
| or 30 series card with 12-24GB of RAM be sufficient per user running
| locally on their machine?
| irthomasthomas wrote:
| This is very cool! I've literally today been noodling with ideas to
| use probabilistic data structures in LLMs.
|
| And TIL you can embed mp4s in a GitHub readme. Is that new?
| sean_the_geek wrote:
| Really cool and helpful. Is there anything similar for R?
| pklee wrote:
| GPT3 model generates a SQL. You can sqldf on top of your data.table.
| We will be demo'ing at one of the events shortly. BTW, you could do
| somewhat similar with other LLMs such as GPTJ and GPT NEOX if you have
| worked with them
| rafaelmelhem wrote:
| is GPTJ/NEOX good enough to generate code?
| tried it with SQL and it was really disappointing
| jerpint wrote:
| Does using this mean sending all of your potentially private data via
| an api call to openAI?
| abrichr wrote:
| From
| https://github.com/approximatelabs/sketch/blob/main/sketch/p...
| it appears that this library is calling a remote API, which obviates
| the utility of the demonstrated use case.
|
| Upon closer inspection, it looks like
| https://github.com/approximatelabs/sketch interfaces with the model
| via https://github.com/approximatelabs/lambdaprompt, which is made by
| the same organization. This suggests to me that the former may be a
| toy demonstration of the latter.
|
| Interesting how as of the time of writing this, most of the comments
| here (i.e. dozens) are praising this as a legitimate use case. Maybe
| I'm missing something obvious, but it seems clear to me that uploading
| data to a third party to verify whether that data contains PII is a
| non-starter for any serious application.
| teaearlgraycold wrote:
| "Does this data contain PII?"
|
| "Yes, and you just shared it all with Microsoft :D"
| jonwinstanley wrote:
| Very cool demo!
|
| Regarding the choice of name, presumably you already know about
| Sketch, the popular image editing software.
|
| I wonder if the image editing guys will in the future incorporate AI
| functionality too? Which might make "Googling" for your product
| difficult for your potential customers?
| Jugglerofworlds wrote:
| There's also a program synthesis project called Sketch, which is much
| closer to the domain of what the user posted:
| https://people.csail.mit.edu/asolar/
| pfd1986 wrote:
| Hi, cool stuff! Which LLM is being used in the background? I may have
| missed that info in the readme. Thanks!
| swyx wrote:
| digging thru the code
| https://github.com/approximatelabs/sketch/blob/9d567ec161015...
|
| this seems to be using their gpt3 framework:
| https://github.com/approximatelabs/lambdaprompt
|
| which uses text-davinci-003 by default
| https://github.com/approximatelabs/lambdaprompt/blob/main/la...
| bluecoconut wrote:
| Thanks!
|
| Right now this is running off of GPT-3 (`text-davinci-003`) and via a
| small code change can run on codex (`code-davinci-002`) but the
| quality only improves a little bit with that change.
|
| That said, this is the first version to show that the interface is
| viable; we are currently working on training our own foundation model
| on a hybrid tokenization of data and word tokens. I hope to improve
| this same toolkit in the future with these new models of our own that
| we are training.
| ethanwillis wrote:
| Well, I'm locked out of my github account right now and don't feel
| like going through all those hoops right now but I wanted to point
| something minor out.
|
| In this line,
| https://github.com/approximatelabs/sketch/blob/9d567ec161015...
|
| I think you can end up marking control characters as "UNKNOWN"
| characters by accident by assuming that in all contexts/environments
| dictionary.items() always returns items in a consistent order. This
| isn't always true.
|
| edit: actually with the way the code is written if you have any
| overlapping ranges at all you'll end up double/triple/etc. counting a
| character into multiple categories.
| mmaia wrote:
| Very promising. I believe the uses of OpenAI that will stick in the
| long term are like this, and other tools should be experimenting with
| this kind of integration.
|
| Otherwise, there's room for other solutions, such as airops sidekick
| [1], which uses browser extensions to embed itself in other data
| tools.
|
| [1] https://www.airops.com/
| hgarg wrote:
| I spent a few weeks last year building a text to sql tool using the
| codex model to do something like this but for all kinds of data
| sources. We pivoted away to something else for various reasons.
|
| But your approach is much better. Pandas is used a lot. Build a tool
| on top of pandas. This is awesome.
| javierluraschi wrote:
| https://hal9.com is focused on building data apps with LLMs, would
| love to explore integrating and contributing to Sketch. If this sounds
| interesting I'm at javier at hal9.ai
| drcongo wrote:
| I use TabNine [0] for local context aware AI suggestions, and I find
| it spookily good at guessing what I'm half way through typing. Sadly
| they've left the Sublime plugin to rot and it's mostly a hindrance in
| ST4.
|
| [0] https://www.tabnine.com
| ldh0011 wrote:
| So... Microsoft bought 48 or 49% of OpenAI right? Integrating this
| into Excel would make everyone an excel power user.
| bufferoverflow wrote:
| But if it makes a logical mistake, it would take a real power user to
| notice it.
| localhost wrote:
| But wouldn't you need to integrate Python into Excel for this to work?
| mmaia wrote:
| A lot of people already use excelformulabot. The impact of something
| integrated into Excel would be pretty big.
| davidbressler wrote:
| It's already integrated into Excel with the add-on.
|
| What else did you have in mind?
| blakeburch wrote:
| This is fantastic and exactly where our team at Shipyard is expecting
| the data space to go. Context aware, AI driven. Great work on this!
|
| We were just talking last week about how we should create a feature to
| describe transformations you want in Natural Language that get
| compiled to pandas/SQL. Input data is everything associated with the
| original file/dataframe.
|
| Visual transformation tools are typically limited and
| non-reproducible. If you could switch it around to be code-compiled
| but description-driven, that would open up new possibilities.
|
| I'd love to chat if you're open to it. Email in bio.
| jadbox wrote:
| I'd love something like a standalone SQL IDE where I can ask an AI to
| generate queries or migration scripts.
|
| Sadly to be honest, I don't think I'd pay a subscription for such a
| service. I would prefer to pay a one time tooling fee and just run a
| trained model in the IDE locally.
| rafaelmelhem wrote:
| I did something similar to it for my own use. Using natural language
| it makes SQL queries against your .csv, xlsx (soon I'll add features
| so you can connect to databases). But it is not mature enough to sell
| as a service. Feel free to reach me at info [at] rafaelmelhem . com if
| you want and I'll send a demo :)
| vorpalhex wrote:
| Yeah the risk of your sql walking off to an AI vendor is not worth the
| time savings.
| swyx wrote:
| This is a great demo, OP.
|
| I'm wondering about the UX of this vs Copilot. Is this basically just
| a way to get around the fact that you don't have Copilot inside of
| notebooks? What else am I missing about this experience?
| bluecoconut wrote:
| Thanks!
|
| That is definitely a big part of it, getting to use copilot style
| answers without having to install any plugins to the IDE (so getting
| to use this in colab or jupyter notebooks directly feels great).
|
| That said, I use both copilot and sketch in my VScode notebooks, and
| find that they have slightly different feelings to the iteration loop.
|
| Sketch offers a more "local" data context (pinning the text/prompt to
| the specific dataframe) which increases the quality of the suggestions
| (since more relevant information is within the token limit).
| allisdust wrote:
| I don't have any experience with pandas. Can this directly connect to
| a db and run queries there? (The video seems to load a csv file.)
| harvey9 wrote:
| If you can already write SQL to return a data set then you can get
| that set to pandas with pyodbc.
| [deleted]
| jamal-kumar wrote:
| Damn, this looks pretty useful.
| I was finding that github copilot was really good at reading a CSV
| file and writing all the imports from that into migrations for DB
| import, but this looks like it does these data transformations even
| more robustly.
|
| Are there any plans on getting this to work outside of the
| python/pandas ecosystem or is it intrinsically tied to that
| environment?
___________________________________________________________________
(page generated 2023-01-16 23:00 UTC)