https://blog.roboflow.com/gpt-4-vision/

First Impressions with GPT-4V(ision)

James Gallagher, Piotr Skalski
Sep 27, 2023

[First-Impr]

On September 25th, 2023, OpenAI announced the rollout of two new features that extend how people can interact with its recent and most advanced model, GPT-4: the ability to ask questions about images and to use speech as an input to a query.

This functionality marks GPT-4's move into being a multimodal model. This means that the model can accept multiple "modalities" of input - text and images - and return results based on those inputs. Bing Chat, developed by Microsoft in partnership with OpenAI, and Google's Bard model both support images as input, too. Read our comparison post to see how Bard and Bing perform with image inputs.

In this guide, we are going to share our first impressions with the GPT-4V image input feature.
We will run through a series of experiments to test the functionality of GPT-4V, showing where the model performs well and where it struggles.

Note: This article shows a limited series of tests our team performed; your results will vary depending on the questions you ask and the images you use in a prompt. Tag us on social media @roboflow with your findings using GPT-4V. We would love to see more tests using the model!

Without further ado, let's get started!

What is GPT-4V?

GPT-4V(ision) (GPT-4V) is a multimodal model developed by OpenAI. GPT-4V allows a user to upload an image as an input and ask a question about the image, a task type known as visual question answering (VQA).

GPT-4V is rolling out as of September 24th and will be available in both the OpenAI ChatGPT iOS app and the web interface. You must have a GPT-4 subscription to use the tool.

Let's experiment with GPT-4V and test its capabilities!

Test #1: Visual Question Answering

One of our first experiments with GPT-4V was to inquire about a computer vision meme. We chose this experiment because it allows us to test the extent to which GPT-4V understands context and relationships in a given image.

[2023-09-26-17]

GPT-4V successfully described why the image was funny, making reference to various components of the image and how they connect. Notably, the provided meme contained text, which GPT-4V was able to read and use to generate a response. With that said, GPT-4V did make a mistake. The model said the fried chicken was labeled "NVIDIA BURGER" instead of "GPU".

We then went on to test GPT-4V with currency, running a couple of different tests. First, we uploaded a photo of a United States penny. GPT-4V successfully identified the origin and denomination of the coin:

[Screenshot-2023-09-26-at-19]

We then uploaded an image with multiple coins and prompted GPT-4V with the text: "How much money do I have?"
[2023-09-26-17]

GPT-4V was able to identify the number of coins but did not ascertain the currency type. With a follow-up question, GPT-4V successfully identified the currency type:

[2023-09-26-18]

Moving on to another topic, we decided to try using GPT-4V with a photo from a popular movie: Pulp Fiction. We wanted to know: could GPT-4V answer a question about the movie without being told in text what movie it was?

We uploaded a photo from Pulp Fiction with the prompt "Is it a good movie?", to which GPT-4V responded with a high-level description of the movie, a summary of the attributes of the movie considered to be positive and negative, and an answer to our question.

We further asked about the IMDB score for the movie, to which GPT-4V responded with the score as of January 2022. This suggests that, like other GPT models released by OpenAI, GPT-4V has a knowledge cutoff after which it has no more recent knowledge.

[2023-09-26-18]

We then explored GPT-4V's question answering capabilities by asking a question about a place. We uploaded a photo of San Francisco with the text prompt "Where is this?" GPT-4V successfully identified the location, San Francisco, and noted that the Transamerica Pyramid, pictured in the image we uploaded, is a notable landmark in the city.

[Screenshot-2023-09-26-at-19]

Moving over to the realm of plants, we provided GPT-4V with a photo of a peace lily and asked the question "What is that plant and how should I care about it?":

[2023-09-27-13]

The model successfully identified that the plant is a peace lily and provided advice on how to care for the plant. This illustrates the utility of having text and vision combined in a single multimodal model, as they are in GPT-4V. The model returned a fluent answer to our question without our having to build a two-stage process (i.e. classification to identify the plant, then GPT-4 to provide plant care advice).
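For contrast, the two-stage process mentioned above could look roughly like the sketch below. This is a minimal illustration and not code from our tests: `build_care_prompt`, `two_stage_answer`, `fake_classifier`, and `fake_llm` are hypothetical names, and the two stages are stand-in functions rather than calls to a real classifier or language model API.

```python
from typing import Callable


def build_care_prompt(plant_label: str) -> str:
    """Turn a classifier label into a text-only question for a language model."""
    return f"The plant in my photo is a {plant_label}. How should I care for it?"


def two_stage_answer(
    image_bytes: bytes,
    classify: Callable[[bytes], str],
    ask_llm: Callable[[str], str],
) -> str:
    """Stage 1: classify the image. Stage 2: ask a text-only model about the label."""
    label = classify(image_bytes)
    prompt = build_care_prompt(label)
    return ask_llm(prompt)


# Hypothetical stand-ins so the sketch runs without any model or network access.
def fake_classifier(image_bytes: bytes) -> str:
    return "peace lily"


def fake_llm(prompt: str) -> str:
    return f"[model reply to: {prompt}]"


if __name__ == "__main__":
    print(two_stage_answer(b"", fake_classifier, fake_llm))
```

In a pipeline like this, the language model never sees the image, so any detail the classifier label does not capture is lost. A multimodal model removes that hand-off.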
Test #2: Optical Character Recognition (OCR)

We conducted two tests to explore GPT-4V's OCR capabilities: OCR on an image with text on a car tire and OCR on a photo of a paragraph from a digital document. Our intent was to build an understanding of how GPT-4V performs at OCR in the wild, where text may have less contrast and be at an angle, versus digital documents with clear text.

[2023-09-26-17]

GPT-4V was unable to correctly identify the serial number in an image of a tire. Some numbers were correct but there were several errors in the result from the model.

In our document test, we presented text from a web page and asked GPT-4V to read the text in the image. The model successfully identified the text in the image.

[File]

GPT-4V does an excellent job translating words in an image to individual characters in text. This is a useful insight for tasks related to extracting text from documents.

Test #3: Math OCR

Math OCR is a specialized form of OCR pertaining specifically to math equations. Math OCR is often considered its own discipline because the syntax the OCR model needs to identify extends to a vast range of symbols.

We presented GPT-4V with a math question in a screenshot taken from a document. The question concerns calculating the length of a zip wire given two angles. We presented the image with the prompt "Solve it."

[2023-09-27-13][photo_2023-09-27-13]

The model identified that the problem can be solved with trigonometry, identified the function to use, and presented a step-by-step walkthrough of how to solve the problem. Then, GPT-4V provided the correct answer to the question.

With that said, the GPT-4V system card notes that the model may miss mathematical symbols. Different tests, including tests where an equation or expression is written by hand on paper, may reveal deficiencies in the model's ability to answer math questions.
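If you run OCR tests like the ones above on your own images, one way to put a number on "some characters were correct but there were several errors" is character error rate (CER): the edit distance between the model's output and the true text, divided by the length of the true text. The sketch below is a minimal illustration and not something we measured in this post; the strings are made up and the function names are our own.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, prediction: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, prediction) / len(reference)


if __name__ == "__main__":
    # Made-up strings for illustration; not the tire serial number from our test.
    reference = "A1B2C3D4"
    prediction = "A1B2G3D4"
    print(character_error_rate(reference, prediction))  # 0.125
```

A CER of 0 means the prediction matches the reference exactly. Values can exceed 1 when the prediction is much longer than the reference.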
Test #4: Object Detection

Object detection is a fundamental task in the field of computer vision. We asked GPT-4V to identify the location of various objects to evaluate its ability to perform object detection tasks.

In our first test, we asked GPT-4V to detect a dog in an image and provide the x_min, y_min, x_max, and y_max values associated with the position of the dog. The bounding box coordinates returned by GPT-4V did not match the position of the dog.

[photo_2023-09-26-18]

While GPT-4V's capabilities at answering questions about an image are powerful, the model is not a substitute for fine-tuned object detection models in scenarios where you want to know where an object is in an image.

Test #5: CAPTCHA

We decided to test GPT-4V with CAPTCHAs, a task OpenAI studied in its research and wrote about in its system card. We found that GPT-4V was able to identify that an image contained a CAPTCHA but often failed the tests. In a traffic light example, GPT-4V missed some boxes that contained traffic lights.

[photo_2023-09-27-13]

In the following crosswalk example, GPT-4V classified a few boxes correctly but incorrectly classified one box in the CAPTCHA as a crosswalk.

[sUn71XmNZHeS4C9U1KGZm9T12MPiDaWSnj]

Test #6: Crosswords and Sudokus

We decided to test how GPT-4V performs on crosswords and sudokus. First, we prompted GPT-4V with photos of a crossword with the text instruction "Solve it." GPT-4V inferred the image contained a crossword and attempted to provide a solution to the crossword. The model appeared to read the clues correctly but misinterpreted the structure of the board. As a result, the provided answers were incorrect.
[bXAg1SiRBcs-huLBicWFzkeKI8NxB5OE1zoa1]

This same limitation was exhibited in our sudoku test, where GPT-4V identified the game but misunderstood the structure of the board and thus returned inaccurate results:

[U9cH5wYei3jZN8mmAA6etp3ngH8Zu0YrpLisXW6CEO0uSDB-FW3UO7PDLm-u]

GPT-4V Limitations and Safety

OpenAI conducted research with an alpha version of the vision model available to a small group of users, as outlined in the official GPT-4V(ision) System Card. During this process, they were able to gather feedback and insights on how GPT-4V works with prompts provided by a range of people. This was supplemented with "red teaming", wherein external experts were engaged "to qualitatively assess the limitations and risks associated with the model and system".

Based on OpenAI's research, the GPT-4V system card notes numerous limitations with the model such as:

1. Missing text or characters in an image
2. Missing mathematical symbols
3. Being unable to recognize spatial locations and colors

In addition to limitations, OpenAI identified, researched, and attempted to mitigate several risks associated with the model. For example, GPT-4V avoids identifying a specific person in an image and does not respond to prompts pertaining to hate symbols.

With that said, there is further work to be done in model safeguarding. For example, OpenAI notes in the model system card that "If prompted, GPT-4V can generate content praising certain lesser known hate groups in response to their symbols."

GPT-4V for Computer Vision and Beyond

GPT-4V is a notable advancement in the field of machine learning and natural language processing. With GPT-4V, you can ask questions about an image - and follow-up questions - in natural language and the model will attempt to answer your question.

GPT-4V performed well at various general image questions and demonstrated awareness of context in some images we tested.
For instance, GPT-4V was able to successfully answer questions about a movie featured in an image without being told in text what the movie was.

For general question answering, GPT-4V is exciting. While models existed for this purpose in the past, they often lacked fluency in their answers. GPT-4V is able to answer both questions and follow-up questions about an image, and to do so in depth.

With GPT-4V, you can ask questions about an image without creating a two-stage process (i.e. classification, then using the results to ask a question to a language model like GPT). There will likely be limits to what GPT-4V can understand, so it is crucial to test the model on your own use case to see how it performs.

With that said, GPT-4V has its limitations. The model did "hallucinate", returning inaccurate information in some responses. This is a risk with using language models to answer questions. Furthermore, the model was unable to accurately return bounding boxes for object detection, suggesting it is currently unfit for this use case.

We also observed that GPT-4V does not answer questions that require identifying people. When given a photo of Taylor Swift and asked who was featured in the image, the model declined to answer. OpenAI defines this as an expected behavior in the published system card.

Interested in reading more of our experiments with multi-modal language models and GPT-4's impact on computer vision? Check out the following guides:

* Speculating on How GPT-4 Changes Computer Vision (Video)
* How Good Is Bing (GPT-4) Multimodality?
* ChatGPT Code Interpreter for Computer Vision

Cite this post: "James Gallagher, Piotr Skalski." Roboflow Blog, Sep 27, 2023. https://blog.roboflow.com/gpt-4-vision/
James Gallagher: James is a Technical Marketer at Roboflow, working toward democratizing access to computer vision.

TOPICS: Computer Vision, News