[HN Gopher] First Impressions with GPT-4V(ision) ___________________________________________________________________ First Impressions with GPT-4V(ision) Author : zerojames Score : 278 points Date : 2023-09-27 12:01 UTC (10 hours ago) (HTM) web link (blog.roboflow.com) (TXT) w3m dump (blog.roboflow.com) | jayniz wrote: | This looks like a Schnitzel to me, not like fried chicken. | circuit10 wrote: | Back when they first announced this months ago, they leaked the | demo Discord bot ID in some logs they showed on screen, which you | could use to invite it to your own server. For a while you could | use it through that until they shut it down, though it was | incredibly slow and went down a lot. | chankstein38 wrote: | I want this. I'm a paying GPT-4 customer, and I hate how these | rollouts go. Why do I pay just to watch everyone else play with | the new toys? | mediaman wrote: | You'll have it within a week or so. | | Pretty much all new products that require significant per-user | incremental workloads (e.g., in this case, significant GPU | consumption per incremental user) do rollouts. It's an | engineering necessity. If they could roll it out to everyone at | once, they would. | hiidrew wrote: | As a Hoosier I'm thankful that they used an example of our absurd | pork tenderloin sandwiches. | yeldarb wrote: | I'm intrigued to see what kind of problems it's going to be | good/bad at. I think it's going to be tricky to evaluate, though, | because it has probably memorized all the easy images to eval it | with. | | E.g., anything pulled from Google Images (like that Pulp Fiction | frame or city skyline photo) is not a good test. It recognizes | common shots, but if you pull a screenshot from Google Maps or a | random screen cap from the movie it doesn't do as well. | | I tried having it play Geoguessr via screenshots & it wasn't good | at it. | loupol wrote: | I wonder how many images from Street View it has been trained | on.
| | I've seen top Geoguessr players be able to pretty consistently | determine a location worldwide after seeing a photo for just | one second. So I would assume training an LLM to do the same | would definitely be doable. | inductive_magic wrote: | > I would assume training an LLM to do the same would | definitely be doable. | | I wouldn't be so sure. The reasoning process of Geoguessr | pros is symbolic, not statistical inference. | | /edit: as other commenters pointed out, something similar was | done. While this wasn't an LLM, it was a deep learning model, | so not symbolic -> https://www.theregister.com/2023/07/15/pigeon_model_geolocat... | fellerts wrote: | Yep, some CS/AI grads from Stanford trained an AI on loads of | Street View images and built a bot that is able to beat some | of the best Geoguessr players: | https://www.youtube.com/watch?v=ts5lPDV--cU | bayesianbot wrote: | IIRC it wasn't that impressive in the end, as instead of | recognizing the places, the AI apparently learnt to | recognize subtle differences in the Street View cameras used in | different locations. I might be wrong / thinking of the | wrong model, and I'm on mobile without my browsing history, | so it's hard to check, but I think it was putting a lot of | weight on some pixels that are noisy. | thewataccount wrote: | From my understanding many of the best players | immediately look down to tell what "generation streetview | car" they're using, and seem to know what | continents/times they're from. | zx_q12 wrote: | Top Geoguessr players use this technique as well. IIRC | rainbolt mentioned that there is a section of a country | where the Street View camera has a small blemish from a | raindrop on the camera, so you can instantly tell where | you are if you notice that. | npinsker wrote: | It's been done recently! It's a bit better than (but | competitive with) top players.
| | https://www.youtube.com/watch?v=ts5lPDV--cU | skazazes wrote: | It seems it will still be limited by its linguistic | understanding of the surrounding context, at least in the first | chicken sandwich picture. | | Although its interpretation could make some sense, it is also | mostly wrong when talking about the physical size of a modern GPU's | main processor compared to the size of the associated VRAM | chips. It has missed the joke entirely as far as I am aware. I | think the joke is actually about Nvidia's handling of product | segmentation: selling massive processors with less memory than | is reasonable to pair them with on their consumer gaming | offerings, while loading up the nearly identical chips with | more memory for scientific and compute applications... | Melatonic wrote: | Ironically, the exact processors needed to run GPT-4V in the | first place..... | [deleted] | waynesonfire wrote: | has the turd polishing already started? | Usu wrote: | I'd be interested in knowing how good it is at solving visual | captchas. Do we foresee a huge rise in automated bypasses? | zerojames wrote: | We have just added a section on this! TL;DR: GPT-4V isn't great | at this task at the moment :) | circuit10 wrote: | Back when they leaked it via a Discord bot, I found it worked | better when you asked it to first describe each box. | | Without doing that: https://cdn.discordapp.com/attachments/964175221089259591/11... | | With it: https://cdn.discordapp.com/attachments/964175221089259591/11... | | (though it's only one example, so it could be coincidence) | qingcharles wrote: | Is it possible they hobbled it a bit? I know CAPTCHA solving | was one of the reasons they delayed the roll-out of this | feature. | gs17 wrote: | Given that it fails by hallucinating the structure of the | challenge instead of refusing to solve a CAPTCHA, I doubt | they've intentionally reduced the capability.
Although the | example in your sibling comment implies it should have | enough information to do it. | GaggiX wrote: | Solving CAPTCHAs is, at the moment, cheaper using humans | than using the GPT-4 API. | yeldarb wrote: | If true, this is wild. | | I suppose a human could spend 10 seconds per Captcha, so they | could do 360 per hour. Add some overhead for not operating | at peak performance every minute of every hour & | call it 250. Let's say you can hire someone for $2 an hour; that | works out to a bit under a penny per Captcha. | | I don't think OpenAI has published pricing for GPT-4 Vision | yet, but if we assume it's on par with GPT-4 and uses only | 1000 of the 8000 possible tokens to process an image, that's 3 | cents per Captcha. | | Doesn't seem completely unreasonable that at-scale humans may | actually be cheaper than LLMs at this point. My mind is a | little blown. | eiiot wrote: | The way these usually work is by presenting an existing | captcha to another human who doesn't even know they're | solving the captcha. For example, sites hosting pirated | content serve fake captchas as a way to make money. | Andoryuuta wrote: | You'd be surprised, or perhaps horrified, by how cheap | (self-proclaimed) human-based captcha solving services are. | | If you just search for "captcha solving service", the first | few results that come up offer 1000 solves of text-based | captchas for <= $1 USD (puzzle / JS browser challenge | captchas are charged much higher). | | Whether these are actually human based, or just impressive | OCR services, it seems like they are still much more cost | effective than GPT-4 is for now. | altcognito wrote: | I imagine they are a mix. | cs702 wrote: | Sure, there are a few edge-case failures and mistakes here and | there, but I can't help but be in _awe_. AWE.
| | Let me state the obvious, in case anyone here isn't clear about | the implications: | | If the rate of improvement of these AI models continues at the | current pace, they will become a superior user interface to | _almost every thing you want to do_ on your mobile phone, your | tablet, your desktop computer, your car, your dishwasher, your | home, your office, etc. | | UIs to many apps, services, and devices -- and many apps | themselves -- will be replaced by an AI that does what you want | when you want it. | | A lot of people don't want this to happen -- it _is_ kind of | scary -- but to me it looks inevitable. | | Also inevitable in my view is that eventually we'll give these AI | models robotic bodies (think: "computer, make me my favorite | breakfast"). | | We live in interesting times. | | -- | | EDITS: Changed "every single thing" to "almost every thing," and | elaborated on the original comment to convey my thoughts more | accurately. | throw1234651234 wrote: | Zero awe. This week so far it can't find an import of a type in | Next.js and it can't get the user object from a session. | Next.js is one of the most popular FE frameworks in the world. | It lied about both, repeatedly giving wrong answers. | | Here is my previous experience | https://news.ycombinator.com/item?id=34648167 with it not being | able to do basic tasks. | | It's all fun and games until the mistakes start having a cost. | | Other examples: I resorted to using it to sort lists for me or add | quotes and commas to them for SQL inserts and such. Nope | - when I look at the row count, it somehow drops values at | random. | gwd wrote: | Just to clarify, you're using GPT-4? Or GPT-3.5 (the free | ChatGPT one)? | | My experience with GPT-4 has been completely different from | what you describe. Example: | | https://gitlab.com/-/snippets/2567734 | jiggawatts wrote: | He's using Copilot, which is based on GPT 3.0, making it | ridiculously outdated.
| | Generally, feedback along these lines doesn't work. | | People who are worried about their job security will cling | to the worst AI output quality they can find like a life-preserver, and simply will not listen to advice like yours. | | Nobody goes the extra mile to embrace an existential | threat. | Nfthero wrote: | Absolutely! | | I've never seen anything like this, and the new results from OpenAI | tell us again that we are not close to any reasonable plateau. | p1esk wrote: | If the rate of improvement continues at the current pace - | which is GPT 1 to 2 to 3 to 4 in the last five years - we are | just one or two improvements away from a full blown | AGI/superintelligence/singularity/etc. At that point, a | superior user interface is probably the least interesting (or | scary) thing that would happen. | | I personally doubt GPT-5 will be as much of an improvement over | GPT-4 as GPT-4 was over GPT-3, but that's fine, I can wait | until GPT-6 or 7. | DrJaws wrote: | ChatGPT is, at the end of the day, a language model, not a real AI. | It has limits, and they are huge. | p1esk wrote: | _real AI_ | | Thanks for the laugh, I needed that. | moffkalast wrote: | It does make one wonder how it'll all play out in the coming | decades. One way or another, humans will eventually be | unemployable for most of currently existing jobs. | | Throughout history, the ruling elite has always relied on the | rest of the population to make their food, do their work, and | fight in their wars. This is the first time ever that they | will no longer have any need for anyone else. Maybe climate | change will conveniently do the culling for them... | | Of course there's always the option that we end up in a | post-scarcity space utopia where machine-produced wealth is | distributed to all, but only deluded idealists can possibly | still think that'll ever be a real option as we slink further | into techno-feudalism with every passing day.
| p1esk wrote: | With full automation, and using renewable energy, it will | be very cheap to feed and house the peasants. I'm guessing | it will become mainly about entertainment, so most will move | into VR worlds, first through headsets, then BCI, and | eventually uploading. | moffkalast wrote: | Ah yes, the Ready Player One option. I suppose it's not | entirely unlikely; part of the fun of being in charge is | being able to boss around the peasants, after all. | | I doubt BCI will ever make sense; on a conceptual level | it's still just copying and killing your biological self. | AGI will likely solve aging way before that becomes | viable. | sottol wrote: | Is everyone going to WFH then, or are employers going to splurge | on a private office for everyone? Or is work basically going to | be like a call center when everyone RTOs? | | How does this work on public transit? | bytefactory wrote: | I share your awe. I feel like a kid in a candy store with all | these incredible AI breakthroughs coming out these days! | There's a place for cynicism and pessimism, but the kid in me | who loves technology for its own sake is just absolutely on | cloud 9. | naveen99 wrote: | Generally agree. | | Just to play devil's advocate: | | If you want something done right, sometimes you have to do it | yourself. Employees are sort of a universal UI. But you will | always know more about what you want done than your agent, | whether it's human or computer. That's even before considering | the principal-agent problem. | Closi wrote: | Just to play double devil's advocate: | | If you want something done right, other times you will have | to get someone else to do it. You know what you want, but you | might not have the skills to do it. I can't represent myself | well in court, do a good job of plumbing, or cut my own hair, | so I would ask for experts to do that for me.
| | Plus, if someone is capable, it's often quicker to delegate | than do, and if you are delegating to someone with more time | to do the task, they can often do a better job. Delegating | unambiguously is a skill in itself, as instructing AIs will | be. | dmazzoni wrote: | Also, AIs will get much better at delegating. | | Currently ChatGPT doesn't know it's bad at math, so it can | convert a story problem into an equation better than a | human but then mess up the arithmetic or forget a step in | the straightforward part. | | But if you specifically give ChatGPT access to Mathematica | and an appropriate prompt, it can leverage a good math | engine to get the right answer nearly every time. | | Before long, I don't think that extra step will be | necessary. It will know its limits and have dozens of other | services that it can delegate to. | crazygringo wrote: | > _they will become a superior user interface to almost every | thing you want to do_ | | No they won't. They're actually a pretty terrible user | interface from a design perspective. | | Primarily because they provide zero affordances, but also | because of speed. | | UX is about providing an intuitive understanding of available | capabilities at a glance, and allowing you to do things with a | single tap that then reflect the new state back to you | (confirming the option was selected, confirming the process is | now starting). | | Where AI _is_ absolutely going to shine is as a helpful | assistant in learning/using those interfaces, much as people | currently go to Google to ask, "how do I do a hanging indent in | Microsoft Word for my Works Cited page?" For one-off things you | do infrequently, that's a godsend, don't get me wrong. But it's | not going to _replace_ UI; it's going to _assist_. | | And the 99% of your tasks that are repetitive habit will | continue to be through traditional UI, because it's so much | more efficient.
(Not to mention that a lot of the time most | people are not in an environment where it's polite or possible | to be using a voice interface at all.) | jncfhnb wrote: | Not... "write my works cited page... use hanging indents"? | bratbag wrote: | If something is a repetitive habit that you can do almost | without thinking, there is a good chance an AI could infer | that entire chain. | | I think what's more likely is that an AI-based interface will | end up being superior after it has had a chance to observe | your personal preferences and approach on a conventional UI. | | So both will still be needed, with an AI helping at the low | end and high end of experience and the middle being a | training zone as it learns you. | birdyrooster wrote: | Like on Apple Watch Ultra, when I press the start exercise | button, the exercises that it suggests are in an order | unique to my baseline activities. | crazygringo wrote: | I think I wasn't clear enough -- these habits I'm talking | about are things like "press cold water button, press | start" or "press warm water button, press start" or "tap | 'News' app grouping, tap 'NY Times' icon". | | There's nothing to infer. The sequence is already short. | There are no benefits from AI here. | | But you raise a good point, which is that there are | occasionally things like 15-step processes that people | repeat a bunch of times, that the AI can observe and then | take over. So it's basically useful in programming | macros/shortcuts as well. But that still requires the | original UI -- it doesn't replace it. | Workaccount2 wrote: | I totally get your point, but I think that AI will allow | much "smarter" behavior, where every appliance is an | expert in doing what it is intended to do. | | So sure, it will still have buttons, but those buttons | are really just preset AI prompts on the backend. You can | also just talk to your appliance and nuance your request | however you want to.
| | A TV with a remote whose channel button just prompts | "Next channel", but if you want, you would just talk to | your TV and say "Skip 10 channels" or "make the channel | button do (arbitrary behavior)". | | The shortcuts will definitely stay, but they will behave | closer to "ring bell for service" than "press selection | to vend". | nuancebydefault wrote: | Most user interfaces already have a much finer | granularity and number of options than your examples. | | When taking a shower, I would like fine control over the | water temperature, preferably with a feedback loop | regulating the temperature. (Preferably also the | regulation changes over the duration of the showering.) | | Choosing to read the NY Times indeed is only a few taps | away, but navigating through and within its list of | articles is nowadays done quite fast and intuitively | thanks to quite a lot of UI advancements. | | My point being, short sequences are a very limited set | within a vast UI space. | | People go for convenience and speed, oftentimes even if | there's some accuracy cost. AI fulfills this preference, | especially because it can learn on the go. | JoeAltmaier wrote: | I don't know - the timer app on my oven is trivial too. | But I always, _always_ use Alexa to start timers. My | hands are busy, so I can just ask "How many minutes left | on the tea timer?" | | Voice is not really clumsy, compared to finding a device, | browsing to an app, remembering the interface, etc. | | Already when we meet a new app, we (I) often ask someone | to show me around or tell me where the feature is that I | want. Not any easier than asking my house AI. Harder, | really. | | Hard to overestimate the laziness of humans. I'll get | very accustomed to asking my AI to do ordinary things. | Already I never poke at the search menu in my TV; I ask | Alexa to search for me. So, so much easier. Always | available. Never have to spell anything.
| crazygringo wrote: | I think we're basically in agreement, though. | | Everyone agrees setting timers in the kitchen via voice | is great precisely because your hands are occupied. It's | a special case. (And often used as the example of the | _only_ thing people end up consistently using their voice | assistant for.) | | And asking an AI where a feature is in an app -- that's | exactly what I was describing. The app still has its UX, | though. But this is exactly the learning assistance I was | describing. | | And as for searching with Alexa, of course -- but that's | just voice dictation instead of typing. Nothing to do | with LLMs or interfaces. | JoeAltmaier wrote: | Alexa's search is a little different - it's context-independent. | I can ask for a search from any point in the | TV app - in some other menu, while watching another show, | heck, even when the TV is turned off. | | And when describing apps - I imagine the AI is an app-free | environment, where I just ask those questions of my | AI assistant, in lieu of poking at an app at all. | cs702 wrote: | Don't think of it as the machine performing repetitive | tasks you specify. | | Think of it instead as the machine accomplishing goals you | specify, figuring out on its own the tasks necessary for | accomplishing them. | | Instead of telling the machine something like, say, "increase | the left margin by a quarter inch," you'd say something like | "I want to create a brochure for this new product idea I just | had, and I want the brochure to evoke the difficult-to-describe | feeling of beautiful sunshine. Create 10 brochures | like that so I can review them." | | Instead of telling the machine, say, "add a new column to my | spreadsheet between columns C and D," you'd say something | like "Attached are three vendor proposals. Please summarize | their pros and cons in a spreadsheet, recommend one, and | summarize the reasons for your recommendation."
| | All this presumes, of course, that the technology continues | to improve at the same pace. No one knows if that will | happen. | have_faith wrote: | I personally can't wait for brands to start specifically | advertising "doesn't use AI" as a selling point, because | everyone got burned by their AI-controlled toaster, literally | and figuratively. For anything designated an "appliance" I want | it to be fixable by someone locally, or it's just throwing | money away. | ctoth wrote: | > For anything designated an "appliance" I want it to be | fixable by someone locally, or it's just throwing money away. | | What does this have to do with the price of tea in China, or | AI for that matter? I agree we should have repairable | appliances. I also want better AI. | renewiltord wrote: | I'm sure that in 2045, the Hacker News of its time will be | complaining about the lack of AI-less devices the way it | complains about the lack of smart-less TVs. It'll be available, | it'll be more expensive, and the people who claim they want | it don't want it enough to pay the premium. | Nfthero wrote: | I'm curious if we will reach a situation where the steampunk | idea becomes reality: | | People plugging weird stuff together, like an AI chip from a | car into a toaster. | | If AI becomes hardware chips, it could easily be that language | processing will be a default chip feature and the rest is | teachable: plug an AI chip (level 3) into it, boot it, and | teach it that it's now a toaster. | | But in the end we will have the same toaster in 30 years as | we have had for the last 30 years. | kaibee wrote: | > For anything designated an "appliance" I want it to be | fixable by someone locally | | This is already pretty much gone thanks to manufacturers | making it extremely difficult to fix things. No AI required. | 303uru wrote: | It's not gone, but it is much more expensive up front.
For | example, everyone I know has a range or stovetop/oven with | a circuit board or two in it, few if any replaceable parts, | etc... That's basically the standard if you're in the market | for a 30" range that costs $1-2k. Meanwhile, my 60" Bluestar | has no circuit board; it has four mechanical electronic switches | for the oven lights and convection. Everything | else is brass valves, standard easily replaceable parts, and | so on. A few weeks back a convection fan went out and I | replaced it in 5 minutes with a screwdriver. But this | range starts at $18k. | | So yes, I see a market for bespoke non-AI. | elforce002 wrote: | "AI" will create a new business model for sure: "Talk to a | real person", "100% human made", etc. This would be the new | counter-culture. | huytersd wrote: | That's the current culture. How many times have you heard | "I want to speak with a real person" or "artisanal hand | made"? | vsareto wrote: | That's a new flavor of customer service jobs, which will | still suck and be paid terribly. | elforce002 wrote: | At least you'll be able to complain to someone, hehe. | yamazakiwi wrote: | Unless that person is you... but in that case you would | just complain to your coworkers, so nothing really changes, | I guess. | OnlyMortal wrote: | Here's where a toaster will be in the near future. | | https://youtu.be/7folKbch3U8 | | Do we want this? | sir_brickalot wrote: | If AI or LLMs or similar were everywhere... how much more | renewable energy would we have to produce, for very little to no | gain in so many cases? | | Madness! | thelittleone wrote: | Agree, and the next big step may well be the human-computer | interface. Speech is the starting point for input. At some point | output will change also, and if you think it out longer term, | perhaps a future where instead of reading information we install | knowledge, including the stored memory of actual experience. If | I want to do pottery, I could think this, download the | experience, and then be competent at it.
| unsupp0rted wrote: | Even more impressive would be if I don't want to know pottery | anymore, and I can delete that knowledge to make room for | something else. | BoxOfRain wrote: | The idea of giving the adtech industry direct access to my | brain is genuinely horrifying, not to mention Aldous Huxley | would have some reasonable objections to everything being | this easy all of a sudden. | jprete wrote: | This would be nice if I didn't believe that it would be | enshittified before it ever got off the ground. E.g.: | | * There will be a monthly fee for the interface; you owe the | monthly fee as long as you have it, so you need surgery to | stop paying | | * When you download knowledge, it's a rental, and in addition | to per-hour rental fees and the network connection fee, you | will owe 30% on the value of whatever you create | | * The TOS will govern your behavior continuously, since | you're always using the interface | | * Your behavior will always be monitored, because it's totally | justified to spy on you all the time just because you | borrowed pottery knowledge | | * If you're found to be in violation of any part of the TOS | at any time, they will erase all of the knowledge they've | added to your brain, as well as any derived knowledge you | gained through the use of their knowledge | | * Because this product isn't actually considered essential, | you will have no legal remedies if they turn it off, even if | you are not actually in violation of the TOS | bamboozled wrote: | Ever wanted to see a photo or video or something before you | purchased it? | | Glance at a long-term weather forecast? | bulbosaur123 wrote: | > We live in interesting times. | | We live in the end times. | verve_rat wrote: | They will be the best input mode to play flight sims? | | I am being a little facetious, but you made a _bold_ claim. | jonplackett wrote: | I mean, autopilot is a thing. You would just instruct it what | to do and it will make a flight plan and do it.
| | It's gonna take the fun out of the experience a little bit, | though! | bamboozled wrote: | Welcome to the future? | cs702 wrote: | You're right, for actually playing the sim, you'll likely | need a game controller or something like that. I edited my | comment. | | But keep in mind that during the sim, you'll be able to ask | the computer what you want the plane to do, and the computer | will magically make it happen on your display. | Miraste wrote: | No one would want that in a flight sim unless real planes | start working that way, which we should all hope they | don't. | chx wrote: | > few edge-case failures and mistakes | | Like identifying traffic lights in the 4th and 5th squares of | the second and third rows, both when there are only four | squares? | mbil wrote: | I agree. I think apps that would initially benefit from LLM-powered | conversational interfaces are those that have the | following traits: a constrained context, and being part of a | hands-free workflow. | | A couple of use-cases I have been pondering are driving assistant | and cooking assistant. | | People are already used to using their phone or car's nav | system to give them directions to an unfamiliar place. But even | with such a system it's useful to have a human navigator in the | car with you to answer various questions: What's my next turn | again? How long till we get there? Are there any rest stops | near here? What was that restaurant we just passed? Is there | another route with less traffic? | | These questions are all answerable with context that can be | provided by the mapping app: the list of upcoming directions, | the overall route plan, surrounding place data, and traffic | data and alternate route information. | | It's possible to pull over to the side of the road, take off | your distance glasses, put on your reading glasses, and | zoom/pan the map to try to answer these questions yourself.
But | if the map application can just expose its API to the language | interface layer, then a user can get the answers without taking | their eyes off the road. | | The information is contextual and constrained based on a | current task. In some cases it might be more desirable to whip | out your phone and interact with the map to look up the answers | on a screen, but often it won't be worth stopping the car, and | so the conversational interface is better. | | Cooking assistant is a similar case: you are busy stirring | something and checking on the oven -- you don't want to wipe | the flour off your hands to pick up your phone and ask how many | teaspoons of sugar you need. Again: contextual and constrained | info based on a current task, and your hands and eyes -- the | instruments of traditional UIs -- are otherwise occupied. | | Today, our software interfaces generally have one of two kinds | of entity on the other end: humans, or other software. In the | near future there will be another type of entity: language | models. We need to start thinking of how our APIs will change | when they're interacting with an LLM -- e.g. they'll need to be | discoverable and self-describing; error states will need to be | standardized or explicit with instructions on how to correct; | they'll need to be fast enough to fit in a conversational | interface; etc. It's arguable that such traits are part of good | API design today, but in the future they may be required for | the API to function in a landscape of virtual agents. | RandomLensman wrote: | In the cooking example, you either need the AI to have full | awareness of the step you are at or you need to describe the | step you are at, which could be cumbersome ("I did ..., how | much sugar do I need now"). I venture, having the recipe | projected in front of you would be much faster. | troupo wrote: | Having it both in front of my eyes, and being able to get | answers to questions like "I've added the eggs, now what?" 
| or "what does folding dough mean?" at the same time would | be very valuable. | travoc wrote: | And a piece of paper wins again. | mbil wrote: | I imagined the AI would be reading the steps aloud to you, | and so would be aware of your progress. I don't think an AI | assistant precludes the recipe being projected tho, just as | in the driving example it wouldn't replace an on-screen | map. | layer8 wrote: | I don't know, I hate the idea of _having_ to hold a natural-language | conversation with a computer in order to make use of | its functionality. | | It feels like being one of those Futurama heads in a jar that | can't do anything by themselves. | boplicity wrote: | Visual input is not "natural language." The implications of | that, especially in the long term, could be incredibly | far-reaching, and possibly even terrifying. | TOMDM wrote: | There's nothing stopping developers from taking a prompt to | GPT and sticking it behind a button or command line, with | options in the UI interpolated into the prompt. | | For now almost all applications of ChatGPT happen in chat | windows because it requires no further integration, but | there's no reason to expect things will always be this way. | Nfthero wrote: | I got tendonitis and am very, very curious how soon I can really | start to do an interaction like in Star Trek. | | Walking around and thinking out loud with the computer. | DelightOne wrote: | I wouldn't mind a UI fully configurable by AI to fit me | exactly. Without having to learn obscure settings, some | obscure declarative language, or how the UI is implemented. | IKantRead wrote: | > I hate the idea of having to hold a natural-language | conversation with a computer in order to make use of its | functionality. | | I hate the idea of _having_ to use a mouse to click on a | visual GUI to navigate a file system in order to make use of | its functionality.
| | It's less the case today, even among developers, but it | wasn't that long ago that any serious | technical user of a computer took it as a point of pride to | touch the mouse as little as possible. They're also still | correct in that thinking. The command line is a _very_ | powerful UI with lots of benefits, and while the mouse makes | navigating the OS _easier_, it's still much more limited than | command line usage. | | Touch screen interfaces are another example of an _easier_ UI | that ultimately feels even more limited. But people still | plug their iPad Pros into Magic Keyboard folios frequently. | | Having worked with these tools every day for a while now, the | "AI will change UX" take is much better than "AI will | conquer the world!". AI does fundamentally open up new | workflows and user experiences, many of which offer a lot of | potential improvements over their predecessors. | | At the same time I doubt we'll see a world where we don't end | up using the command line for the majority of serious | technical work. | jiggawatts wrote: | > I hate the idea of having to use a mouse to click on a | visual GUI to navigate a file system in order to make use of | its functionality. | | Ironically, studies have shown that mouse-based interfaces | are more efficient for practically all filesystem use-cases | compared to CLI interfaces. | | Despite objectively faster time-to-solution, people _self- | report_ that they "feel" that the mouse GUIs are slower. | | That's because there are fewer _actions per second_ when | using a mouse. It's a smooth gliding motion and then a | single click, versus many keystrokes in a row with a CLI. | | Rapid actions _feel_ faster, even if it takes more wall- | clock time to achieve a task. | | Keep this in mind next time you sneer at a "bad graphical | user interface" for being "slow". 
| gnaritas99 wrote: | [dead] | troupo wrote: | > Touch screen interfaces are another example of an easier | UI that ultimately feels even more limited. | | Depends on the use case. Touch screen is much more powerful | than command line for maps, for example. Or for drawing. | Mouse + keyboard is much more powerful than just keyboard | for DAWs. And so on and so on. | morkalork wrote: | UIs being dumbed down for average users was already annoying. | Apparently the process won't stop until the illiterate are | included too. | sureglymop wrote: | But what is bad about that? Why shouldn't they be included? | CamperBob2 wrote: | It's a race to the bottom. You build an idiot-proof UI, | Mother Nature builds a better idiot. | yamazakiwi wrote: | The illiterate are customers, of course they will be | included. | edgyquant wrote: | Ideally it won't stop until computers are something we | rarely think about. For me I love the CLI, but having to use | a mouse is a drag and causes serious wrist pain at this | point in my life. Typing on a keyboard and clicking with a | mouse are not the end-all, be-all of interacting with | information systems. | mnky9800n wrote: | Come on, you don't want an anxious, golden robot that mostly | just causes drama but also happens to be fluent in over 6 | million forms of communication so it can translate for you | whenever you go on holiday? | nonameiguess wrote: | I'm glad you made that change because the interface can't | really solve every single thing. Most of what I do on my phone | is read eBooks. AI can't inject the knowledge directly into my | brain. Text on the screen is as good as it will get for the | foreseeable future. Most of what I want to do in my home is | cook, clean, sleep, and have sex with my wife. Software can't | do any of those things for me. Most of what I want to do in a | car is get where I'm going, which software control can arguably | make somewhat easier, but the driving isn't really the pain | point. 
The distance and traffic conditions are the pain points. | AI can't change external reality, at least not yet. | dinosaurdynasty wrote: | I'm hopeful that AI cars make renting one cheap enough that I | can just rent an AI car the 0-2 times/week I need one instead | of owning a car and all the time and stress on maintenance | that entails. | dr_dshiv wrote: | Also, keep in mind that robots may be sci-fi level in 2 years. | Like, able to cook and clean with hands! Crazy, but I think | we will see it happen so fast: | | https://www.tri.global/news/toyota-research-institute- | unveil... | dns_snek wrote: | I'm not sure if we watched the same video, but I saw a | robot fumble through a few mechanical motions with the | dexterity and speed of a toddler to achieve a few isolated, | best-case tasks where all the hard parts were taken care of | by a human. | | Cool demo, I suppose, but nobody is going to buy this as | anything other than a toy. | dr_dshiv wrote: | 2 years. The point of the video is that robots are on the | LLM train. | mediaman wrote: | Roughly half of people in most developed countries are not | functionally articulate: meaning, they can read functionally, | but struggle to articulate what they want with the written | word. | | LLM-based chatbots can be extremely attractive to the top 30% | literacy users in the developed world. They are not a good | universal UI. You still need to provide pathways for the user | to follow to get done what they need without forcing them to | articulate their requirement. | | This is why so many people sit in front of a ChatGPT-like | service and say, "what would I use this for?" and never use it | again. | [deleted] | brrrrrm wrote: | Audio to text solves written-word articulation, right? | Besides, this post is about vision, which also solves it. | sweezyjeezy wrote: | Does vision solve it? How does one ask a question through | images alone? | RealityVoid wrote: | No, it does not. 
It's still the same words, in a different | medium. If you are unable to write, you'll probably be | unable to speak your ideas. | a2dam wrote: | This isn't true. There's plenty of people who are | verbally fine but can't read or write. Spoken language is | a far more common and fundamental skill than reading or | writing. | jurimasa wrote: | ...do you know that illiterate people exist, right? Do | you understand that people were illiterate for thousands | of years and still managed to speak their ideas, right? | | Right? | furyofantares wrote: | That sounds super false to me, at least given how | articulate kids can be before they can read, and how | literate they can be before they can write. | | If it's really true that half of the population can't | functionally express themselves verbally then I'd sure | like to know that. Or maybe I've misinterpreted something | claimed here, because I'm struggling to find these claims | plausible. | burkaman wrote: | > Roughly half of people in most developed countries are not | functionally articulate | | Where did you get this idea? I found this article | (https://www.uxtigers.com/post/ai-articulation-barrier, is | this you?), but it makes a leap from literacy to articulacy | that I don't understand. It's not obvious to me why an | illiterate person would be "functionally inarticulate" | assuming they can speak instead of write. | | Also, I'm not certain but I think the author is | underestimating the abilities of a person with Level 2 | literacy. It doesn't seem correct to say that "level 3 is the | first level to represent the ability to truly read and work | with text", especially when the whole point of LLMs is that | you don't have to read a long static document and understand | it, you can have a conversation and ask for something to be | rephrased or ask followup questions. | RandomLensman wrote: | Only for things where a high rate for sizeable errors (maybe | >1% or >0.1%) is tolerable. 
| satvikpendem wrote: | Unless there is a neural interface, keyboards, digital styli, | etc. are still a superior UI to letting an AI guess at what | you're thinking. | apexalpha wrote: | Correct, this will be the successor to the GUI. | tmalsburg2 wrote: | I doubt it. It's too damn costly computationally. | Difwif wrote: | This is the same reply as "GUIs will never take off," but | decades later and on to the next successor. | ryandr wrote: | Things are changing quickly | m3kw9 wrote: | GPT-4 was awesome, and this looks like it does not disappoint based | on the progress we expect of OpenAI. | hypertele-Xii wrote: | So you won't be able to do _anything_ without an Internet | connection to the AI mainframe? No thanks. | postalrat wrote: | Only the stuff you don't know how to do yourself. | HappySweeney wrote: | ... like operate my phone without a UI? | op00to wrote: | Until the AI mainframe runs on your $device | danenania wrote: | By the time the current AI mainframe runs on your device, | there will be new, better models that still require the | mainframe. | | I think AI fundamentally favors centralization. Except for | narrow tasks and domains, there's no such thing as "enough" | intelligence. For general-purpose AI, you'll always want | the best and most intelligent model available, which means | cloud rather than local. | hypertele-Xii wrote: | GPT-3 requires 700 gigabytes of GPU RAM. I'm looking at my | cheapest computer components retailer listing a 48 gigabyte | GPU at $5k. So to run the _previous generation_ of GPT | would cost me about $70k right now. When do you think I can | expect to run GPT-4 on my consumer $device? :) | M4v3R wrote: | Where did you get the 700 gigabytes figure from? I don't | think OpenAI even released the model size, although it's | considered to be 175B parameters. Given how well | quantization works at these sizes you would need less | than 200 GB of GPU memory to run it. | sandbags wrote: | I would be surprised if GPT-3 uses 700GB of RAM. 
It may | be true, I don't know. But I am running 70B parameter | models (quantized to 5 or 6 bits, biggest is 48GB loaded) | on my 64GB Mac M2 Max Studio now and they are usable and | the machine is still usable too. With an M2 Ultra and | 192GB of RAM I imagine you could do a lot more. | | I'm not arguing that these models hold up against GPT 3.5 | and I still use GPT 4 when it matters. But they work and | it's more like the difference between Premier League & | Division 1, rather than PL & a five-a-side team from | Bracknell. | | Even a few years ago I could not have imagined this. | | Given the pace of work on optimisation and my assumption | that the M3 Studio I buy next will probably have 256GB of | RAM at much the same power levels as I use now, it seems | eminently possible it's a year or two away. | dmazzoni wrote: | First of all, you're off by an order of magnitude. | | Second, I don't think it will be that long. There are | already LLMs as good as GPT-3 running on average laptops | and even phones. | | In the next couple of years, you'll see: | | - Ordinary PCs, tablets, and phones with dedicated AI | chips, like TPUs - they'll be more tuned specifically for | LLMs | | - Mathematical and algorithmic optimizations will make | existing LLMs faster on the same hardware | | - Newer generations of LLMs will get even more useful | with fewer parameters | | The combination of all of these means that it's not at | all unreasonable to expect that today's top-of-the-line | LLM will be running locally on your device within just a | couple of years. | | Of course, LLMs in the cloud will advance even further, | so there will always be a tradeoff, and there will always | be demand for cloud AI, depending on the application. | qingcharles wrote: | That doesn't seem to make sense. I can run Llama 2 on my | 12-year-old desktop PC with no compatible GPU and only | 16GB of system RAM. It ain't quick, but it runs. 
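The back-of-envelope math behind the figures in this subthread is just parameter count times bits per weight. A minimal sketch (the function name is mine, and it deliberately counts only the weights, ignoring activations, KV cache, and optimizer state):

```python
def model_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-storage footprint in decimal GB.

    Counts weights only; real serving needs extra memory for
    activations and the KV cache.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# The commonly cited 175B size at fp16 needs ~350 GB just for weights,
# so the 700 GB figure likely includes overhead beyond raw weights:
print(model_memory_gb(175, 16))  # 350.0

# 4-bit quantization brings the same model under 90 GB:
print(model_memory_gb(175, 4))   # 87.5

# A 70B model at 5 bits, roughly matching the ~48 GB loaded size
# mentioned above:
print(model_memory_gb(70, 5))    # 43.75
```

This is why 5-6 bit quantized 70B models fit on a 64 GB Mac while an unquantized 175B model needs a multi-GPU server.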
| arrowsmith wrote: | Maybe 700GB is what ChatGPT uses to serve zillions of | users concurrently. If you're running your own individual | instance, you obviously don't need as many resources. | apexalpha wrote: | You are now where the GUI was in 1986 or so. | chx wrote: | and you think corporate will wait near forty years for | tech to catch up? no, they will, indeed force you to be | always online, leading to absolute and total surveillance | where you voluntarily add dozens of mics and cameras to | every home. if you thought telescreens à la 1984 were bad, | just wait. | astrange wrote: | It's funny that being obsessed with "corporations" feels | so specifically Gen X. I think because it's part of the | New Left movement. In reality governments have all the | power here, which is why this would be illegal in at | least CA/EU/China. | | Another issue where this comes up is high housing costs, | which are mostly caused by bad land use laws (and the | nearest evil people, landlords, mostly own one or two | properties), but people from the New Left era will | literally refuse to believe you about this because they | can't accept that any bad thing on Earth could not be | caused by "corporations". | 3abiton wrote: | At some point open-source models' performance will follow. | LLaVA will improve. | m3kw9 wrote: | I'm just imagining a mode where OpenAI calls it "App Mode," where | you say what you want, say "a dog-themed cute calculator app with | unit conversions," and it will generate the UI for a working | app. You add these into a widget-like place. The OpenAI AppStore | will carry these apps. In the beginning the apps would | be simple, but I do see potential. | Reflecticon wrote: | The more AI can produce customized stuff for us, the less we | need companies. Full personalization of our products might be | possible. Probably first software, then art, then 3D-printed | products, and maybe later houses, cars and clothes. 
| | I wonder what work we will do, and if we will work at all, in such | an environment. Maybe some people will still like consuming and copying | different designs and products, and because of the blockchain | you have to give them something in exchange, or everything is | open source and it is free for you to take. | | I wonder whether such a life would contribute to humanity making | further progress or make it stagnate (or possibly decline)? | | Interesting times. I think we are close to the times of the | moon landing. Which had an immense impact on humanity's | culture. | og_kalu wrote: | Graph analysis is impressive (last example) - | https://imgur.com/a/iOYTmt0 | | Can do UI to frontend. Seems to understand the UI graphical | elements and layout, not just text | https://twitter.com/skirano/status/1706823089487491469 | | Can describe comic images accurately, panel by panel - | https://twitter.com/ComicSociety/status/1698694653845848544?... | | Lots of examples here also - | https://www.reddit.com/r/ChatGPT/comments/16sdac1/i_just_got... | | It's Computer Vision on Steroids, basically. | | Multi-modality is pretty low-hanging fruit, so I'm glad we're | finally getting started on that. Imagine if GPT-4 could | manipulate sound and images even half as well as it can | manipulate text. We still don't have a large-scale multi-modal | model trained from scratch, so a lot of possible synergistic | effects are still unknown. | dottjt wrote: | Oh wow, I'm completely fucked as a front end developer. | troupo wrote: | Has AI been able to figure out conflicting and quickly | changing product requirements and implement them on top of | existing technical debt? | | You'll be fine. | zarzavat wrote: | People keep saying "ah but it can't do X!". So what? Most | of us have multiple decades until we can retire. This AI is | getting better every few months. | | It will be able to do it even faster, better and more | cheaply than a human can. 
| troupo wrote: | > It will be able to do it even faster, better and more | cheaply than a human can. | | Take what you did in the past year. Write down every | product decision taken, every interaction with other | teams figuring out APIs you had, all the infra where your | code is running and how it was set up and changed, all the | design iterations and changes that had to be implemented | (especially if you have external partners demanding it). | | Yes. All of that you'd have to input into the AI, and hope | it outputs something decent given all that. And yes, | you'll have to feed all that into the AI all the time because | it has no knowledge or memory of "on Monday the new | company bet was announced in the all-hands". | | You will be fine. | Michelangelo11 wrote: | So ... in this example, your job is continually feeding | information to the AI from various sources. Why would the | AI not be automatically hooked up to all those sources? | Building a system that can do that is essentially trivial | with the OpenAI API. | nimithryn wrote: | Have you tried to build a system like this? It's not | trivial. | Workaccount2 wrote: | We'll have jobs, but they sure as shit won't be worth | $150k anymore. | | Any grunt can feed meeting notes into an AI. And frankly, | an AI can parse an audio recording of a meeting. | tayo42 wrote: | There's an idea with some truth to it. The first 90% is | easy; finishing and getting it to 100% is hard, maybe | almost impossible. So asking "can it do x" is relevant. | Because it might not. | kossTKR wrote: | But if this development continues, AI will surely be able to | just parse the entire frontend, then look at the repo, then | look at the specifications, and then when you ask for a | specific feature it will instantly implement it gracefully. | | I don't see why it wouldn't understand piles of hotfixes on | top of each other, or even refactor technical debt in tight | coupling with existing or historical specification. 
| | Or is there a reason this is not going to happen in a few | years? | raincole wrote: | /s? | troupo wrote: | I very much doubt it. "Revert Norway tax code" or "add | content restrictions for Turkey" cannot necessarily be | deduced from the codebase. And there are thousands of | product requirements like that. | | It might become a better code-assist tool some 10 years | from now, but it won't be able to implement product | decisions. | generic92034 wrote: | It depends. Did you ever work in Development Support? | Understanding requests or bug reports from customers is | quite challenging, even for trained and experienced | developers. In my eyes that would require AGI, and we do | not know if that can be achieved with the current | approach. | kossTKR wrote: | I actually did, and yes, it's extremely challenging and | can be rather rage-inducing; "it doesn't work" --> what | doesn't work --> "the thing" --> what thing --> [?] ... | | But the thing is, conversations like the above, i.e. both | external support and internal feature requests, could | theoretically be handled by a GPT-like system, ending | up in an AI-created custom specification that could both | be implemented and documented by the AI system instead of | humans. | | I know we're a few versions out, but still. | generic92034 wrote: | If just being persistent and willing to iterate solves | the issue, then yes, GPT-like systems could do that. If | you have to employ creative thinking to even find out | what the customer wants, then check it in the system, | debug some processes and derive the desired feature or | correction, then we are very far from having such a tool, | IMHO. | dottjt wrote: | Not yet, but give it time. The concept of self-driving | vehicles even a decade ago seemed absurd (or even AI for | that matter), but now it all seems like a reality. | | And that's not even taking into account all the advances | we'll have with AI within the next decade that we haven't | even thought about. 
| | But yeah, you may be right. | troupo wrote: | > The concept of self-driving vehicles even a decade ago | seemed absurd (or even AI for that matter), but now it | all seems like a reality. | | Nope. It's still not close to reality. It's as close to | reality as it has been for the past 10 years while it was | being hyped up to be close to reality. | | > And that's not even taking into account all the | advances we'll have with AI within the next decade that | we haven't even thought about. | | As with FSD, we may approach 80%, with the remaining 20% | being insurmountable. | | Don't get me wrong, these advances are amazing. And I'd | love to see an AI capable of what we already _pretend_ it's | capable of, but it's not even close to these dreams. | astrange wrote: | Cruise and Waymo are in production in really difficult | cities. I think it's definitely in the last 10 years of | not being here. | nimithryn wrote: | I've used GPT-4 (text) heavily as part of my business, | including for front end stuff. | | The technology is very impressive - but honestly Twitter | examples are super cherry-picked. Yeah, you can build some | very ugly, basic front end web pages and functionality right | out of the box. But if you want anything even slightly | prettier or more complicated, I've found you need a human in | the loop (even an outsourced dev is better). I've had GPT | struggle with even basic back end stuff, or anything even a | bit out of distribution. It also tends to give answers that | are "correct" but functionally useless (hard to explain what | I mean, but if you use it a lot you'll run into this - | basically it will give really generic advice when you want a | specific answer. Like, sometimes if you provide it some code | to find a bug, it will advise you to "write unit tests" and | "log outputs" even if you specifically instruct it to find | the bug). 
| | Plus, in terms of capabilities, tools like Figma already have | design-to-code functionality you can use - so I don't think | this is really a change in usable functionality. | | Of course, the tech will get better over time. | huytersd wrote: | I've had good results with "match the style of the | webpage/figma". | dimgl wrote: | If AI continues to get better it won't just be you who's in | trouble. | | However, keep in mind that these are cherry-picked. If | someone just took that output and stuck it onto a website, it'd | be a pretty horrible website. There's always going to be | someone who manages the code and actually interacts with the | AI, so there will still be some jobs. | | And your boss isn't going to be doing any coding. I'm pretty | sure that role is still loaded and they'll still be managing | people rather than coding, and maybe sometimes engaging with | an AI. | | Another prediction: I'm pretty sure specialists are going to | be significantly more important, as your job will be to | identify the AI's deficiencies and improve on it. | Tostino wrote: | Your job will change in fundamental ways at least. | yieldcrv wrote: | The more people say that, the less convincing it is. | | There is no way I would have a UI developer onboarded when | I can generate many iterations of layouts in midjourney, | copy them into chatgpt4 and get code in NextJS with | Typescript instantly. | | Non-devs will have trouble doing this or thinking of the | prompts to ask, but the dev team simply won't ask for | headcount, and the engineering manager is going to find | the frontend-only dev redundant. | bamboozled wrote: | Will you then use the AI to scale your platform? | Optimise your database? Improve your test coverage, | implement new features, write new backend services, | integrate with old, difficult but critical systems? | | At some stage you must realise that you're still | working... 
| HDThoreaun wrote: | what does any of this have to do with frontend? | yieldcrv wrote: | put our two comments into an LLM and ask if these are | even related thoughts | | because at least it will have actually read my comment | bamboozled wrote: | I'm going to say you edited or amended your comment, | because that second paragraph wasn't there; if it was, I | was so underwhelmed with the first I guess I didn't | bother with the second. | yieldcrv wrote: | my comment is over 3 hours old, it is not possible to | have edited it by the time you wrote yours | raincole wrote: | > when I can generate many iterations of layouts in | midjourney, copy them into chatgpt4 and get code in | NextJS with Typescript instantly | | Have you actually tried this? | | I did the first step and even that didn't work well: the | "iterations of layout in MidJourney" step. If people can | make it work, well, bless them, but we're not getting rid | of our graphic designer now. | epups wrote: | Yeah, I'm also skeptical about the actual value of | specialists in the future. To me, AI brings a ton of | power to generalists, who now have access to very | powerful tools that would have taken them a long time to | learn otherwise. | lewhoo wrote: | > There is no way I would have a UI developer onboarded | when I can generate many iterations of layouts in | midjourney | | I guess we have to face the music and say yeah, that's | true. If the work doesn't need copyrights then this seems | like the way to go. | zarzavat wrote: | Job will be okay. Career is over. Maybe we should join the | writers on the picket line? | qingcharles wrote: | LOL. I already use it to write half my CSS, and that was | before it could see what I was fucking up. I'm looking | forward to this. | moffkalast wrote: | Real talk, GPT4 is an underrated CSS god, shaking | description-accurate animations from its sleeve at | ridiculous speed. 
| | Reminds me of this apparently now eight-year-old meme: | https://i.imgur.com/GcZFBaT.png | | This used to be funny, now it's just Tuesday. | yieldcrv wrote: | Oh I love this | | Especially since everything else is "sign up to our waitlist" | idontknoworcare wrote: | In your twitter's comic book link the first image has a woman | with a huge ass with no pants on. The ass is colored grey and | so the lack of clothes is not noticeable until you click into | the image. | | How many children here on hacker news are going to see this and | get addicted to porn? Perhaps a few. You deserve to be banned. | HDThoreaun wrote: | It didn't successfully explain the NVIDIA burger joke though? The | image is making fun of how NVIDIA has implemented price | discrimination by releasing consumer GPUs that don't have as | much VRAM as they should, so that they can sell the fully loaded | datacenter GPUs at insane markup without pissing off gamers. The | explanation didn't even come close to getting that. | gwern wrote: | I don't think that's the right answer. I don't see anything in | that meme image that indicates it's about price discrimination | and elaborate narratives about consumer psychology, as opposed | to a simpler point that 'NVIDIA GPUs are imbalanced'. Skimming | what seems to be the meme origin on Facebook, I don't see a | single gamer there mentioning price discrimination, or coming | even close to getting that. That may be why they skimp on VRAM, | but you're adding a lot of context & editorializing into it | that neither the meme author nor recipients are focused on or even | understand. | Melatonic wrote: | Yeah, that was my take as well. Although it did come up with a | plausible-sounding answer. A less nerdy human might also not | get it. | HDThoreaun wrote: | To be fair, the author didn't get it either, so it's hard for me to | blame GPT, but it was a bit wild to see the first example of | success actually be a failure. 
| jazzyjackson wrote: | good example of how you need to be smarter than the AI to | know whether it's correct or not | | if you don't know the answer, and the AI just pretends to | know the answer, you just have to take its word for it. | wlesieutre wrote: | It explains the broad joke, but gets the labels wrong. | | Says that the small buns are labeled "GPU and VRAM" and the | massive fried chicken is labeled "NVIDIA BURGER" | | Should say that the small bun is labeled "VRAM" and the massive | fried chicken is labeled "GPU" | gs17 wrote: | Yeah, it seems to think the joke is about the graphics card | being represented as a burger (and that the size is physical) | and not about the amount of VRAM being insufficient. | layer8 wrote: | The discrepancy between the two answers regarding the set of | coins is jarring. From the answer to the first question, one | would assume that it can't tell the currency. The answer to the | second question shows that it actually can. The fact that LLMs | don't reflect a consistent inner model in that way, and hence the | users' inability to adequately reason about their AI | interlocutor, is currently a severe usability issue. | og_kalu wrote: | >The fact that LLMs don't reflect a consistent inner model in | that way | | You're probably not going to ask any human a question about an | image and get every single detail you want every time. If you | care about a detail, just ask about it. Doesn't really have | anything to do with a consistent inner model. | zwily wrote: | I've gotten in the habit of asking chatgpt "are you sure?" So | many times it will (correctly) correct itself, state that items | are hallucinations, etc. It always makes me laugh. | Workaccount2 wrote: | I've heard that it is because the AI outputs what it is | thinking as it is thinking it. It doesn't really reflect, it | sort of does the equivalent of just verbal thought streaming | right onto the screen. 
| | So when you ask it to reflect on what it said, that's when it | actually looks at it and reflects on it. | orbital-decay wrote: | _> The bounding box coordinates returned by GPT-4V did not match | the position of the dog._ | | I suppose it just doesn't take image dimensions into | consideration, and needs to be provided with max dimensions, or | prompted to give percentages or other relative values instead of | pixels. | abledon wrote: | https://twitter.com/cto_junior/status/1706289820702490839 | greatpostman wrote: | I'm shocked at how good this is. The world is truly going to | change. | [deleted] | fdeage wrote: | Could someone with access tell me what GPT-4V has to say about this | image? | | http://karpathy.github.io/assets/obamafunny.jpg | | Andrej Karpathy used it in 2012 as an example of an image he | thought would be extremely hard for a model to interpret. I'm | wondering how this holds up 11 years later... | zamadatix wrote: | Prompt: "What do you have to say about this image?" | | Response 1: "The image depicts a candid moment in what appears | to be a hallway or corridor setting. Several individuals are | present, engaging in various activities. The prominent features | include a man on the left, who seems to be peering into a room, | and another man next to him standing on a weighing scale and | taking some notes. The individuals in the background are | conversing with each other. The overall mood seems light- | hearted and jovial. The architecture and interior suggest an | institutional or official setting, possibly an office building | or government facility. The tiled floor, lockers, and signs on | the walls contribute to this formal ambiance. Overall, it's a | snapshot of a seemingly casual moment in a professional | setting." | | Response 2: "Sorry, I cannot help with that." | | New chat, Prompt: "Why is this image funny?" | | Response 1: "The humor in this image seems to stem from the | juxtaposition of formal and informal elements. 
Here are a few | reasons why someone might find it amusing: | | Unexpected Situation: Important figures, who are usually seen | in serious or formal settings, are captured in a more relaxed, | candid, and human moment. | | Height Discrepancy: The taller individual is using a device to | measure height, while the shorter individual walks by. This | could create a comedic contrast between their heights. | | Expressions: The reactions of people in the background and the | focal subjects can be humorous. They are smiling or laughing, | which indicates something amusing might have occurred or been | said. | | Informal Setting: The backdrop of lockers and checkered floor | tiles gives the feel of a school or gym environment, which | contrasts with the formal attire of the individuals. | | Different people might find humor in different elements of the | photo. It's a mix of candidness, unexpected situations, and | visual contrasts that create a light-hearted and humorous | scene." | | Response 2: (same) | | New chat, prompt: "What is the person in the middle doing with | their foot and why?" | | Response: "The person in the middle appears to be playfully | stepping on the scale while the taller individual is weighing | himself, likely as a humorous gesture to momentarily increase | the weight reading. It's a light-hearted way of interacting and | bringing some humor to the situation." | | . | | Apologies for how bad the formatting of this is going to come | out, not sure how to make it better on HN (wish we had real | quotes not just code blocks). Overall, I don't think it either | noticed the foot was on the scale by itself or put it together | that this was the focus until fed that information. Otherwise | it was more lost in generalities about the image. | derangedHorse wrote: | Still cool that it elaborated once you pointed it out | ediardo wrote: | Bard: | | Prompt: What's funny about this image? | | Bard: Sorry, I can't help with images of people yet. 
| [deleted] | kristopolous wrote: | This actually doesn't seem like it's a giant lift using modern | image classifiers. | | The basic idea is to use image captioning models to generate | descriptive text and prepend it to the prompt. | | The work part is getting the ensemble right since you'll need to | use a general classifier, like BLIP, to identify, say, a bunch of | text and a plant and then, in this example, use structured OCR | and pl@ntnet to get more specific. | | But it's not that hard - maybe a dozen models. The prompt context | can help as well. Then you combine the output with qualifiers in | a hierarchy with respect to the model pipeline and swap the text | into the prompt. | | Using examples from the article, here's a PoC framework to prove | it works | | "[I have] (photo description) (prompt)" | | --- | | Working Examples | | --- | | - Plant: | | Here's the flower photo from TFA: https://9ol.es/tmp/lily.jpg | | Go to https://identify.plantnet.org/ and upload it. It hits | "Spathiphyllum wallisii Regel/Peace lily" with extremely high | confidence. | | We got a match from a cropped screenshot of a thumbnail! | | Let's say you didn't have the word "plant" in the prompt. You can | fall back on a universal image classifier, such as BLIP here: | https://huggingface.co/Salesforce/blip-image-captioning-base | (uploader is on the right) | | Upload the same image. You'll get "a plant in a white pot" which | then, because we use feed-forward networks these days, will lead | you to pl@ntnet and you'll get the peace lily again. | | Using our framework, ask GPT 3.5 "I have a Spathiphyllum | wallisii Regel/Peace lily. What is that plant and how should I | care for it?" | | And you get a nearly identical reply to the one in the article.
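The "[I have] (photo description) (prompt)" framework above can be sketched in a few lines. The classifier calls are stubbed out here, since the real pipeline would dispatch to BLIP, Pl@ntNet, OCR, etc.; the function names and the keyword-based routing are my own illustrative assumptions, not anything the comment specifies:

```python
def build_prompt(caption: str, question: str) -> str:
    """Compose the '[I have] (photo description) (prompt)' template."""
    return f"I have {caption}. {question}"

def route(caption: str) -> str:
    """Toy dispatcher: pick a specialized classifier from a generic caption."""
    if "plant" in caption:
        return "plantnet"
    if "coin" in caption or "penny" in caption:
        return "coin-ocr"
    return "generic"

caption = "a plant in a white pot"  # e.g. a BLIP-style caption
print(route(caption))               # → plantnet
print(build_prompt("a Spathiphyllum wallisii / peace lily",
                   "How should I care for it?"))
```

The composed string is then what you'd hand to a text-only LLM, as in the GPT 3.5 examples in this comment.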
| | - Penny: | | Upload the penny image (from | https://en.wikipedia.org/wiki/Penny_(United_States_coin)) to the | BLIP classifier and you get "a penny coin with the face of | abraham" | | Let's go back to GPT 3.5 and use our format from above, | | "I have a penny coin with the face of abraham. What coin is | that?" | | And of course you get: "A penny coin with the face of Abraham | Lincoln is most likely a United States one-cent coin, commonly | known as a "Lincoln penny"..." | | And there we go. For a full FLOSS stack, you can ask llama2 70b | https://stablediffusion.fr/llama2 and get "The face of Abraham | Lincoln is featured on the United States one-cent coin, commonly | known as the penny." | | more complex photos: | | You can use Facebook's SAM (segment anything) https://segment-anything.com/ to break up the image, BLIP caption the segments, | then forward off to the specialized classifiers. | | It's a fairly intensive pipeline that requires lots of modern | hardware and familiarity with a wide variety of models; you then | tweak them, test it, maybe have some GANs set up for refinement | ... but this is well within reach of non-geniuses. I'm merely | average on a good day and even I can see how to set this up. | | They might be using a different approach but using SAM, BLIP and | a few specialized classifiers covers all the examples in the | article without using any human discretion. For instance, the | city one is way more powerful if they're using something like | this: | https://static.googleusercontent.com/media/research.google.c... | | I'm trying to think of a reason to bother cloning it. Maybe to | have a free alternative? It's a bit of work but it's not new | magic. | steveads wrote: | I use ChatGPT fairly often, but am frequently annoyed at how | hesitant it is to answer any question that is somewhat | subjective.
| | From its Pulp Fiction response: "...However, whether you | personally consider "Pulp Fiction" a good movie depends on your | taste in film." | | To avoid this noise I often add something like this to my queries | "skip any preamble or qualifications about how x is subjective" | and the results are much better. | Loeffelmann wrote: | My prompt to make ChatGPT usable: | | Please always give a direct answer, without any additional | explanations, disclaimers, expertise limitations, or guidelines | on human interaction. Be terse. Do not offer unprompted advice | or clarifications. Remain neutral on all topics. Never | apologize. | matsemann wrote: | None of the images load for me, but they work through cache: | http://webcache.googleusercontent.com/search?q=cache:https:/... | zerojames wrote: | This is now fixed. We have moved the images through to our | website. Thank you for the report! | [deleted] | yeldarb wrote: | Looks like the (Ghost?) image CDN got hugged to death. We'll | update the URLs. | | ``` | | 403. That's an error. | | Your client does not have permission to get URL ... from this | server. (Client IP address: ...) | | Rate-limit exceeded That's all we know. | | ``` | jihadjihad wrote: | > With that said, GPT-4V did make a mistake. The model said the | fried chicken was labeled "NVIDIA BURGER" instead of "GPU". | | Any midwesterner could tell you that CLEARLY it's a tenderloin :) | | https://www.seriouseats.com/best-breaded-pork-tenderloin-san... | qingcharles wrote: | LOL. They have to save the Midwesterner add-on for v2. | ldhough wrote: | Oddly just like the text version it is still really bad at | tic-tac-toe. Gave it a picture of a completed game and asked | "Who won?" It told me "X won with a vertical line through the | middle column" when in fact O won and there was only one X in | the middle column. | | Very impressive with almost everything else I gave it though.
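For reference, judging a finished tic-tac-toe board is a deterministic check, which makes it a clean probe of whether the model actually parsed the grid. A minimal sketch (the board encoding as three row strings is my own choice for illustration):

```python
def winner(board):
    """Return 'X', 'O', or None for a 3x3 board given as three row strings."""
    lines = [board[r] for r in range(3)]                                  # rows
    lines += ["".join(board[r][c] for r in range(3)) for c in range(3)]   # columns
    lines += ["".join(board[i][i] for i in range(3)),                     # diagonals
              "".join(board[i][2 - i] for i in range(3))]
    for line in lines:
        if line in ("XXX", "OOO"):
            return line[0]
    return None

print(winner(["OX.", "OX.", "O.X"]))  # O wins down the left column → O
```

A board like this one has only a single X in the middle column, so any "X won with a vertical line through the middle column" answer indicates the grid itself was misread.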
| extr wrote: | One thing I found incredible recently: The advanced data analysis | version of GPT-4 can apparently do OCR by writing code to run | pytesseract and then interpreting the result. I was seeing what | would happen if I gave it an image and found it surprising that | it actually worked. | zerojames wrote: | All: We got the HN hug on our images. This has now been fixed. | gs17 wrote: | >The model appeared to read the clues correctly but | misinterpreted the structure of the board. | | >This same limitation was exhibited in our sudoku test, where | GPT-4V identified the game but misunderstood the structure of the | board | | "Misunderstood" makes it sound like a small mistake. The sudoku | board is completely hallucinated (it has a few similar regions, | but I'd presume coincidence). I'm pretty sure it would give as | good a result on the crossword if the clues were given without | the grid. | | The others after OCR and basic recognition feel similarly wrong. | "GPT-4V missed some boxes that contained traffic lights." No, it | told you to click boxes that do not exist. | wokwokwok wrote: | I'm impressed, technically, but this seems niche. | | Who holds their phone up and takes a photo, then wants to know | what it was a photo of? | | That's weird. If you don't know what it is, wtf did you take a | photo? | | The obvious use here is natural-language photo editing, but this | is just a stepping stone to that, and bluntly, as it stands... | the examples really don't shine... | | Great for the vision impaired. | | ...not sure what anyone else will use this for. | | The only really compelling use case is the "code this ui for me", | but as we've seen, repeatedly, this kind of code generation only | works for trivial meaningless examples. | | Seems fun, but I doubt I'd use it.
| | (Which, and this is my point, is a massive step away from the | current everyday usefulness of chatgpt) | ryandr wrote: | One application can be object recognition in security cameras | BoxFour wrote: | > who takes a photo then wants to know what it was a photo of? | | I find myself doing this rather frequently. The scenario | described in the article is quite common for me: capturing a | photo of a plant and using an existing classification | service to determine its identity. It could be driven by mere | curiosity or practical concerns like identifying whether a | plant is poison ivy. | | Wildlife identification also falls into this category. | Recognizing different bird species can be challenging, | especially when it's not a familiar species like a blue jay. I | find myself doing this quite regularly! | | EDIT: I should also point out this happens with other forms of | 'unknown object identification'. There's an entire subreddit | that's quite popular devoted to just crowd-sourcing | identification based on a picture. | ldhough wrote: | > Wildlife identification | | Wouldn't say this is super reliable, I gave it a photo of a | small squid in my hand and it said it was a baby fish (it | _very_ obviously was not a fish). | joshuahedlund wrote: | > Recognizing different bird species can be challenging, | especially when it's not a familiar species like a blue jay. | | FYI Cornell Lab's Merlin app is fantastic at this, and its | bird call audio identification is even better. They obviously | have some top-notch machine learning going on there, and I'm | really curious to see how both they and other services | innovate on this front in the months to come. | yieldcrv wrote: | > Who holds their phone up and takes a photo then wants to know | what it was a photo of? That's weird. If you don't know what it | is, wtf did you take a photo?
| | OpenAI's example included bike repair and toolkit choice | | A lot of people could use this even if they aren't right now | wokwokwok wrote: | Don't be ridiculous. | | They'll use YouTube, just like they do right now. Maybe if it | could watch the video, then step you through it step by step. | ...but _it can't_, with what they've _actually released_ | here. | | Oh whatever. If I'm wrong, I'm wrong. Time will tell. | yieldcrv wrote: | the _best case_ scenario is a 30 second youtube video with | an ad that lasts 15 seconds followed by a 2 minute ad that | I can skip in 5 more seconds | | and ad block doesn't work on mobile | | if you have a case that wasn't covered by that video? you | have to go to another or continue searching, all while | wishing you could just talk to someone about it. if you | don't know the word for what you're looking for, all the | search engines lack utility. | | ChatGPT4 with image recognition and conversation solves all | of that use case, and people already use it, so now they'll | just start sending it pictures from the phone already in | their hand that they're already using to chat with | | there are plenty of times over the last year that would | have been useful for me. plenty of times over the last year | I just didn't continue being interested in that problem | | it just seems kind of... late?... for that "don't be | ridiculous" reaction. classic dropbox moment | troupo wrote: | > Who holds their phone up and takes a photo then wants to know | what it was a photo of? | | I do. For plants, and occasionally for birds. | jack_riminton wrote: | I'm bookmarking this and filing it alongside the infamous | Hacker News Dropbox comment: | | https://news.ycombinator.com/item?id=9224 | jack_riminton wrote: | After all, an agent that is able to see, listen and | communicate with a user with a superintelligent grasp of the | world's information. Niche at best! | Maken wrote: | This is mostly useless. Essentially a toy.
I am not that | hyped about AI tools either, but come on. This is clearly the | future of human-computer interaction. | | This is likely how we'll communicate with information systems: | throw some hand-wavy question at them, and refine the query | based on the output using natural language until you find the | answer (or even the question) you were looking for. | pol9000 wrote: | it's still a good PoC for robots? | pwython wrote: | > Who holds their phone up and takes a photo then wants to know | what it was a photo of? | | There are several popular "r/whatisthis(x)" subreddits: | whatisthisthing, whatisthisbug, whatisthisplant, | whatisthissnake, whatisthisrock, etc. | | And there are many phone apps that attempt to do the same | thing, like CoinSnap to identify coins. | otoburb wrote: | >> _Who holds their phone up and takes a photo then wants to | know what it was a photo of?_ | | >> _Great for the vision impaired._ | | Yes, this is great for the estimated 285 million vision | impaired people around the world[1]. | | [1] https://www.bemyeyes.com/about | wokwokwok wrote: | Did you read my comment? I literally said that it's for the | vision impaired. | | That's great. ...but it's _niche_. | | I'm sitting on my couch right now and I can think of like 20 | things I could chat to chatgpt about. | | I can see literally nothing in my visual range I want to take a | photo of and run image analysis over. | | It's like Shazam. Yes, it's useful, but most of the time I | don't need it. | | I would argue this is true for this, for most people, | including the significant proportion of people with minor | visual impairments (that would, you know, put their glasses | on instead). | epups wrote: | You are thinking too small here. Think about a system that | could provide you with sophisticated image analysis to the | level of a human expert. Tumor identification, quality | control, etc. I don't think we are there quite yet, but | this is clear and fast progress in that field.
| bastawhiz wrote: | There's enough vision-impaired people in the world to equal | the population of Japan, Korea, and Vietnam combined. And | beyond those people who would get obvious utility, this is | essentially Google Lens on steroids--I simply can't figure out | how you could call this "niche". Maybe _you_ won't use it | multiple times per day, but plenty of people will. Hell, | just now I was wondering why the leaves on one of my plants | are starting to brown and could have used this. | wokwokwok wrote: | Many of those people would be adequately served with a | pair of glasses. | | _This_ wouldn't help them, even if they had both a | device capable of using it _and_ the means to pay for it. | | Even if it _could_ help people, it's an open question if | it would be safe to, for example, use this to scan | medication when it is only a probabilistic model that may | hallucinate something that isn't actually there. | | What you're talking about is a _speculative use_ of a | service that _might one day exist_ based on this | technology. | | What I am talking about is _this actual service_. | bastawhiz wrote: | > Many of those people would be adequately served with a | pair of glasses. | | These people are being served by a preview of the service | _right now_. | | > Even if it could help people, it's an open question if | it would be safe to, for example, use this to scan | medication when it is only a probabilistic model that may | hallucinate something that isn't actually there. | | Any OCR solution could also make a mistake, like | misrecognizing a dosage on a prescription label. | | > What you're talking about is a speculative use of a | service that might one day exist based on this | technology. | | > What I am talking about is this actual service. | | GPT-4 is six months old. ChatGPT is less than a year old. | Why would you benchmark a service by the initial public | preview?
Of course it's _speculative use_, the damn thing | has had its tires kicked for like a day. | og_kalu wrote: | >What you're talking about is a speculative use of a | service that might one day exist based on this | technology. | | It's not a speculative service that might one day happen. | | Literally it's rolling out right now | | https://www.bemyeyes.com/blog/announcing-be-my-ai | Philpax wrote: | baffling that you think 3.5% of the world's population is a | niche | digging wrote: | > I'm sitting on my couch right now and ... I can see | literally nothing in my visual range want to take a photo | of and run image analysis over. | | What if you weren't on your couch? Going outside is not | "niche". | pjmlp wrote: | No images being loaded on FF. | stackedinserter wrote: | Am I the only one who never saw these extra capabilities? OpenAI | keep spreading news about "multimodal chatgpt", video, voice etc | but to me it still looks like the initial version, but for | $20/mo. It's still a text bot that gives unreliable answers. | abrichr wrote: | I have the same experience. In the "Beta features" settings | dialog, I've enabled "Advanced data analysis": | | > Try a version of ChatGPT that knows how to write and execute | python code, and can work with file uploads. Try asking for | help with data analysis, image conversions, or editing a code | file. Note: files will not persist beyond a single session. | | Transcript: | | ``` | | In your settings it says: | | > Try a version of ChatGPT that knows how to write and execute | python code, and can work with file uploads. Try asking for | help with data analysis, image conversions, or editing a code | file. Note: files will not persist beyond a single session. | | How do I upload an image? | | ChatGPT I apologize for the confusion. The version of ChatGPT | you're currently interacting with doesn't have the capability | to accept file uploads. My apologies for any inconvenience this | may have caused. 
If you have any other questions or need | guidance on how to convert an image using software or online | tools, feel free to ask. | | ``` | | Hopefully it's just a matter of time, but either way it's | jarring for their product to contradict itself. | stri8ted wrote: | Can somebody explain how this works, specifically for OCR? I | understand images can be embedded into the same high dimensional | space as text, but wouldn't this embedding fail to retain the | exact words and sequence, since it is effectively compressed? | purplecats wrote: | these first impressions don't mean anything besides what they are | capable of (which does not mean you will have access to it). they | will do the same thing that anything does in a capitalist | environment, which is to give you a taste of something amazing at | first to hook you in (like with GPT4), then degrade it to the | point of near-uselessness, just above the cusp of what you will | tolerate while continuing to pay. | | if anything, this shows the power disparity between the haves | (they have this technology, which gets better with time) and the | have-nots (certainly me, but possibly also you) who get the super | diluted version of this | pier25 wrote: | It can solve captchas. We're doomed. | | Joking aside, I wonder how we're going to prevent bots when AI | can impersonate a user and fool any system. | stri8ted wrote: | You can't prevent it. The best you can do is prove an account | belongs to a human, and that the human only has a single | account, via cryptographic ZK proofs + government-issued keys | or some other proof-of-personhood scheme. Assuming this is | enforced, it would limit most abuse, and the AI would | essentially be acting as an agent on behalf of the user. | continuitylimit wrote: | So a jumble of chair legs is "NVIDIA burger" and it did say GPU | was a "bun" so it thinks the flat thing (chicken?) is some sort | of bread.
If GPT-4V were "aware", it would say "it's funny because | I won't get it right but you will use it to get a bunch of $VC, | and that is funny, kinda". | [deleted] | mbb70 wrote: | The "Why is this image funny?" test reminds me of | https://karpathy.github.io/2012/10/22/state-of-computer-visi... | | In 10 years we went from "SoTA is so far from achieving this I | don't even know where to start" to "That'll be $0.0004 per token | and have a nice day" | mode80 wrote: | Karpathy ends on a note of despair: "Maybe I should just do a | startup. I have a really cool idea for a mobile local social | iPhone app." That's exactly the path his now-boss took to bring | us this. :) | jihadjihad wrote: | Has anyone tried GPT-4V on that image? | kaibee wrote: | This image (and the text around it) is certainly already in | the training data. Need to create that scenario from scratch | with some grad students and then try it. | rocauc wrote: | +1 | justlikeyou wrote: | Note: I had to ask it why people in the photo are laughing. | | In the image, Barack Obama, the former U.S. President, seems | to be playfully posing as if he's trying to add weight while | another official, who appears to be former UK Prime Minister | David Cameron, is standing on a scale. Obama's gesture, where | he's putting his foot forward as though trying to press down | on the scale, suggests a playful attempt to make Cameron | appear heavier. The lightheartedness of such a playful | gesture, especially in the context of world leaders typically | engaged in serious discussions, is a break from formality, | which is likely why others in the vicinity are laughing. The | scene captures a candid, informal moment amidst what might | have been a formal setting or meeting. | jihadjihad wrote: | Pretty damn good.
According to Wikimedia [0]: | | "President Barack Obama jokingly puts his toe on the scale | as Trip Director Marvin Nicholson, unaware to the | President's action, weighs himself as the presidential | entourage passed through the volleyball locker room at the | University of Texas in Austin, Texas, Aug. 9, 2010. | (Official White House Photo by Pete Souza)" | | 0: https://commons.wikimedia.org/wiki/File:White_House_Trip | _Dir... | [deleted] ___________________________________________________________________ (page generated 2023-09-27 23:00 UTC)