https://blog.roboflow.com/gpt-4-vision/


Computer Vision News

First Impressions with GPT-4V(ision)

 
 
 
James Gallagher, Piotr Skalski
Sep 27, 2023
9 min read




[First-Impr]

On September 25th, 2023, OpenAI announced the rollout of two new
features that extend how people can interact with its recent and most
advanced model, GPT-4: the ability to ask questions about images and
to use speech as an input to a query.

This functionality marks GPT-4's move into being a multimodal model.
This means that the model can accept multiple "modalities" of input -
text and images - and return results based on those inputs. Bing
Chat, developed by Microsoft in partnership with OpenAI, and Google's
Bard model both support images as input, too. Read our comparison
post to see how Bard and Bing perform with image inputs.

In this guide, we are going to share our first impressions with the
GPT-4V image input feature. We will run through a series of
experiments to test the functionality of GPT-4V, showing where the
model performs well and where it struggles.

Note: This article shows a limited series of tests our team
performed; your results will vary depending on the questions you ask
and the images you use in a prompt. Tag us on social media @roboflow
with your findings using GPT-4V. We would love to see more tests
using the model!

Without further ado, let's get started!

What is GPT-4V?

GPT-4V(ision) (GPT-4V) is a multimodal model developed by OpenAI.
GPT-4V allows a user to upload an image as an input and ask a
question about the image, a task type known as visual question
answering (VQA).

GPT-4V is rolling out as of September 24th and will be available in
both the OpenAI ChatGPT iOS app and the web interface. You must have
a GPT-4 subscription to use the tool.

Let's experiment with GPT-4V and test its capabilities!

Test #1: Visual Question Answering

One of our first experiments with GPT-4V was to inquire about a
computer vision meme. We chose this experiment because it allows us
to gauge the extent to which GPT-4V understands context and
relationships in a given image.

[2023-09-26-17]

GPT-4V was able to successfully describe why the image was funny,
making reference to various components of the image and how they
connect. Notably, the provided meme contained text, which GPT-4V was
able to read and use to generate a response. With that said, GPT-4V
did make a mistake. The model said the fried chicken was labeled
"NVIDIA BURGER" instead of "GPU".

We then went on to test GPT-4V with currency, running a couple of
different tests. First, we uploaded a photo of a United States penny.
GPT-4V was able to successfully identify the origin and denomination
of the coin:

[Screenshot-2023-09-26-at-19]

We then uploaded an image with multiple coins and prompted GPT-4V
with the text: "How much money do I have?"

[2023-09-26-17]

GPT-4V was able to identify the number of coins but did not ascertain
the currency type. With a follow-up question, GPT-4V successfully
identified the currency type:

[2023-09-26-18]

Moving on to another topic, we decided to try using GPT-4V with a
photo from a popular movie: Pulp Fiction. We wanted to know: could
GPT-4V answer a question about the movie without being told in text
what movie it was?

We uploaded a photo from Pulp Fiction with the prompt "Is it a good
movie?" GPT-4V responded with a high-level description of the movie
and a summary of the attributes of the movie considered to be
positive and negative.

We then asked about the IMDB score for the movie, and GPT-4V
responded with the score as of January 2022. This suggests that,
like other GPT models released by OpenAI, GPT-4V has a knowledge
cutoff after which it has no more recent knowledge.

[2023-09-26-18]

We then explored GPT-4V's question answering capabilities by asking a
question about a place. We uploaded a photo of San Francisco with the
text prompt "Where is this?" GPT-4V successfully identified the
location, San Francisco, and noted that the Transamerica Pyramid,
pictured in the image we uploaded, is a notable landmark in the city.

[Screenshot-2023-09-26-at-19]

Moving over to the realm of plants, we provided GPT-4V with a photo
of a peace lily and asked the question "What is that plant and how
should I care about it?":

[2023-09-27-13]

The model successfully identified that the plant is a peace lily and
provided advice on how to care for the plant. This illustrates the
utility of having text and vision combined in a multimodal model
such as GPT-4V. The model returned a fluent answer to our question
without us having to build our own two-stage process (i.e.
classification to identify the plant, then GPT-4 to provide plant
care advice).
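
For contrast, here is a minimal sketch of the two-stage process
described above. Everything in it is hypothetical: classify_plant
stands in for an image classifier and ask_language_model for a
text-only language model. Neither is a real API, and both return
canned values so the sketch runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    label: str
    confidence: float


def classify_plant(image_path: str) -> Classification:
    """Hypothetical stage 1: an image classifier. Returns a canned result."""
    return Classification(label="peace lily", confidence=0.93)


def build_care_prompt(label: str) -> str:
    """Turn the classifier's label into a text-only question."""
    return f"What is a {label} and how should I care for it?"


def ask_language_model(prompt: str) -> str:
    """Hypothetical stage 2: a text-only language model. Echoes the prompt."""
    return f"[model answer to: {prompt}]"


def two_stage_answer(image_path: str, min_confidence: float = 0.5) -> str:
    """Classify first, then ask the language model about the predicted label."""
    result = classify_plant(image_path)
    if result.confidence < min_confidence:
        return "The classifier was not confident enough to name the plant."
    return ask_language_model(build_care_prompt(result.label))


if __name__ == "__main__":
    print(two_stage_answer("plant.jpg"))
```

The language model in stage 2 never sees the image, so a wrong label
from stage 1 carries straight through to the answer. A multimodal
model removes that hand-off, although it can still misread the image
itself.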

Test #2: Optical Character Recognition (OCR)

We conducted two tests to explore GPT-4V's OCR capabilities: OCR on
an image with text on a car tire and OCR on a photo of a paragraph
from a digital document. Our intent was to build an understanding of
how GPT-4V performs at OCR in the wild, where text may have less
contrast and be at an angle, versus digital documents with clear
text.

[2023-09-26-17]


GPT-4V was unable to correctly identify the serial number in an image
of a tire. Some numbers were correct but there were several errors in
the result from the model.

In our document test, we presented text from a web page and asked
GPT-4V to read the text in the image. The model was able to
successfully identify the text in the image.

[File]

GPT-4V does an excellent job translating words in an image to
individual characters in text. This is a useful insight for tasks
related to extracting text from documents.
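
If you want to put a number on "some numbers were correct but there
were several errors", one common option is the character error rate:
the edit distance between the expected text and the model's output,
divided by the length of the expected text. The sketch below is one
way to compute it. The serial numbers in the usage example are made
up for illustration and are not the ones from our tire image.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum single-character edits between strings."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    expected = "AB12CD34"   # made-up ground truth
    model_out = "AB1ZCD84"  # made-up OCR output with two wrong characters
    print(f"CER: {character_error_rate(expected, model_out):.2f}")
```

A rate of 0 means the text was read perfectly. For serial numbers and
similar identifiers, any value above 0 usually means the result
cannot be used as-is.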

Test #3: Math OCR

Math OCR is a specialized form of OCR pertaining specifically to math
equations. Math OCR is often considered its own discipline because
the syntax of what the OCR model needs to identify extends to a vast
range of symbols.

We presented GPT-4V with a math question. This math question was in a
screenshot taken from a document. The question concerns calculating
the length of a zip wire given two angles. We presented the image
with the prompt "Solve it."

[2023-09-27-13][photo_2023-09-27-13]

The model identified that the problem can be solved with
trigonometry, identified the function to use, and presented a
step-by-step walkthrough of how to solve the problem. Then, GPT-4V
provided the correct answer to the question.

With that said, the GPT-4V system card notes that the model may miss
mathematical symbols. Different tests, including tests where an
equation or expression is written by hand on paper, may indicate
deficiencies in the model's ability to answer math questions.

Test #4: Object Detection

Object detection is a fundamental task in the field of computer
vision. We asked GPT-4V to identify the location of various objects
to evaluate its ability to perform object detection tasks.

In our first test, we asked GPT-4V to detect a dog in an image and
provide the x_min, y_min, x_max, and y_max values associated with the
position of the dog. The bounding box coordinates returned by GPT-4V
did not match the position of the dog.

[photo_2023-09-26-18]

While GPT-4V's capabilities at answering questions about an image are
powerful, the model is not a substitute for fine-tuned object
detection models in scenarios where you want to know where an object
is in an image.
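
A standard way to check whether a returned box matches an object's
position is intersection over union (IoU): the area where the
predicted and ground-truth boxes overlap, divided by the area they
cover together. Below is a minimal sketch using the same x_min,
y_min, x_max, y_max convention as our prompt. The coordinates in the
usage example are invented for illustration and are not the values
GPT-4V returned.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        # Clamp at zero so an "inverted" box (no overlap) has no area.
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def intersection_over_union(a: Box, b: Box) -> float:
    """Return IoU in [0, 1]; 1.0 means the boxes are identical."""
    overlap = Box(
        max(a.x_min, b.x_min),
        max(a.y_min, b.y_min),
        min(a.x_max, b.x_max),
        min(a.y_max, b.y_max),
    )
    intersection = overlap.area()
    union = a.area() + b.area() - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    ground_truth = Box(120, 80, 360, 300)  # hypothetical labeled box
    predicted = Box(10, 10, 100, 90)       # hypothetical model output
    print(f"IoU: {intersection_over_union(ground_truth, predicted):.2f}")
```

Many object detection benchmarks count a prediction as correct only
when IoU is at least 0.5, so a box that does not overlap the object
at all scores 0 and fails by a wide margin.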

Test #5: CAPTCHA

We decided to test GPT-4V with CAPTCHAs, a task OpenAI studied in
their research and wrote about in their system card. We found that
GPT-4V was able to identify that an image contained a CAPTCHA but
often failed the tests. In a traffic light example, GPT-4V missed
some boxes that contained traffic lights.

[photo_2023-09-27-13]

In the following crosswalk example, GPT-4V classified a few boxes
correctly but incorrectly classified one box in the CAPTCHA as a
crosswalk.

[sUn71XmNZHeS4C9U1KGZm9T12MPiDaWSnj]
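
The two failure modes above, missed boxes and wrongly selected boxes,
map onto recall and precision. The sketch below scores a grid
selection under the assumption that cells are numbered from 0 in
reading order. The cell numbers in the usage example are made up and
do not correspond to the CAPTCHAs in our screenshots.

```python
def score_grid_selection(expected: set[int], selected: set[int]) -> dict:
    """Compare the cells a model selected against the correct cells."""
    missed = sorted(expected - selected)  # correct cells the model skipped
    wrong = sorted(selected - expected)   # cells the model picked by mistake
    hits = len(expected & selected)
    return {
        "missed": missed,
        "wrong": wrong,
        "precision": hits / len(selected) if selected else 0.0,
        "recall": hits / len(expected) if expected else 0.0,
        "solved": not missed and not wrong,
    }


if __name__ == "__main__":
    # Hypothetical 3x3 grid: cells 1, 2, 4 and 5 contain traffic lights.
    result = score_grid_selection(expected={1, 2, 4, 5}, selected={1, 2, 6})
    print(result)
```

A CAPTCHA only passes when the selection is exact, so "solved" is the
figure that matters in practice. Precision and recall are useful for
seeing which kind of mistake the model tends to make.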

Test #6: Crosswords and Sudokus

We decided to test how GPT-4V performs on crosswords and sudokus.

First, we prompted GPT-4V with photos of a crossword with the text
instruction "Solve it." GPT-4V inferred the image contained a
crossword and attempted to provide a solution to the crossword. The
model appeared to read the clues correctly but misinterpreted the
structure of the board. As a result, the provided answers were
incorrect.

[bXAg1SiRBcs-huLBicWFzkeKI8NxB5OE1zoa1]

This same limitation was exhibited in our sudoku test, where GPT-4V
identified the game but misunderstood the structure of the board and
thus returned inaccurate results:

[U9cH5wYei3jZN8mmAA6etp3ngH8Zu0YrpLisXW6CEO0uSDB-FW3UO7PDLm-u]
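
Because a sudoku has strict rules, a proposed solution can be checked
mechanically rather than by eye. The sketch below assumes the puzzle
and the model's answer have already been transcribed into 9x9 lists
of integers, with 0 marking an empty cell in the puzzle. The example
grid is generated from a simple formula for illustration and is not
the puzzle from our test.

```python
Grid = list[list[int]]


def is_valid_solution(puzzle: Grid, solution: Grid) -> bool:
    """Check that a 9x9 solution keeps the givens and obeys sudoku rules."""
    if len(solution) != 9 or any(len(row) != 9 for row in solution):
        return False
    # Every pre-filled cell in the puzzle must be unchanged.
    for r in range(9):
        for c in range(9):
            if puzzle[r][c] != 0 and puzzle[r][c] != solution[r][c]:
                return False
    digits = set(range(1, 10))
    rows = [set(row) for row in solution]
    columns = [set(column) for column in zip(*solution)]
    boxes = [
        {solution[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    # Each row, column and 3x3 box must contain 1-9 exactly once.
    return all(group == digits for group in rows + columns + boxes)


def example_solution() -> Grid:
    """Build a known-valid grid by shifting a base row (illustrative only)."""
    return [
        [(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
        for r in range(9)
    ]


if __name__ == "__main__":
    solved = example_solution()
    # Keep every other cell as a given; blank the rest with 0.
    puzzle = [
        [v if (r + c) % 2 == 0 else 0 for c, v in enumerate(row)]
        for r, row in enumerate(solved)
    ]
    print(is_valid_solution(puzzle, solved))
```

A check like this catches a misread board structure immediately,
because misplaced digits almost always break a row, column, or box
constraint or overwrite a given.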

GPT-4V Limitations and Safety

OpenAI conducted research with an alpha version of the vision model
available to a small group of users, as outlined in the official
GPT-4V(ision) System Card. During this process, they were able to
gather feedback and insights on how GPT-4V works with prompts
provided by a range of people. This was supplemented with "red
teaming", wherein external experts were engaged "to qualitatively
assess the limitations and risks associated with the model and
system".

Based on OpenAI's research, the GPT-4V system card notes numerous
limitations with the model such as:

 1. Missing text or characters in an image
 2. Missing mathematical symbols
 3. Being unable to recognize spatial locations and colors

In addition to limitations, OpenAI identified, researched, and
attempted to mitigate several risks associated with the model. For
example, GPT-4V avoids identifying a specific person in an image and
does not respond to prompts pertaining to hate symbols.

With that said, there is further work to be done in model
safeguarding. For example, OpenAI notes in the model system card that
"If prompted, GPT-4V can generate content praising certain lesser
known hate groups in response to their symbols."

GPT-4V for Computer Vision and Beyond

GPT-4V is a notable advancement in the field of machine learning and
natural language processing. With GPT-4V, you can ask questions about
an image - and follow-up questions - in natural language and the
model will attempt to answer your question.

GPT-4V performed well at various general image questions and
demonstrated awareness of context in some images we tested. For
instance, GPT-4V was able to successfully answer questions about a
movie featured in an image without being told in text what the movie
was.

For general question answering, GPT-4V is exciting. While models
existed for this purpose in the past, they often lacked fluency in
their answers. GPT-4V is able to answer both questions and follow-up
questions about an image, and to do so in depth.

With GPT-4V, you can ask questions about an image without creating a
two-stage process (i.e. classification, then using the results to ask
a question to a language model like GPT). There will likely be
limitations to what GPT-4V can understand, so testing your use case
to understand how the model performs is crucial.

With that said, GPT-4V has its limitations. The model did
"hallucinate", returning inaccurate information in some responses.
This is a risk with using language models to answer questions.
Furthermore, the model was unable to accurately return bounding boxes
for object detection, suggesting it is currently unfit for this use
case.

We also observed that GPT-4V does not answer questions about the
identity of people. When given a photo of Taylor Swift and asked who
was featured in the image, the model declined to answer. OpenAI
defines this as an expected behavior in the published system card.

Interested in reading more of our experiments with multi-modal
language models and GPT-4's impact on computer vision? Check out the
following guides:

  * Speculating on How GPT-4 Changes Computer Vision (Video)
  * How Good Is Bing (GPT-4) Multimodality?
  * ChatGPT Code Interpreter for Computer Vision

Cite this post:

James Gallagher, Piotr Skalski. "First Impressions with
GPT-4V(ision)." Roboflow Blog, Sep 27, 2023.
https://blog.roboflow.com/gpt-4-vision/


James Gallagher

James is a Technical Marketer at Roboflow, working toward
democratizing access to computer vision.

 
TOPICS:
Computer Vision, News
