[HN Gopher] MiniGPT-4
       ___________________________________________________________________
        
       MiniGPT-4
        
       Author : GaggiX
       Score  : 790 points
       Date   : 2023-04-17 08:56 UTC (14 hours ago)
        
 (HTM) web link (minigpt-4.github.io)
 (TXT) w3m dump (minigpt-4.github.io)
        
       | macrolime wrote:
        | A modified BLIP2 that works with diffusion models rather than
        | vision transformers would be pretty cool. Using Vicuna-13B or
        | another large language model as the language model of Stable
        | Diffusion, instead of the text encoder of CLIP, would be a game
        | changer: it would completely change how you prompt Stable
        | Diffusion, from something kinda like a tag list to being able
        | to follow instructions in English instead.
        
       | jokoon wrote:
        | Are there more university/research-centric resources or
        | communities regarding AI? I mean things on a more epistemic and
        | higher level, that disregard current ML techniques and
        | implementations.
       | 
       | I have no problem when there is a lot of money spent to implement
       | things that can be used and tested, but I wish I could read about
       | the next step of AI, something that involves science and not
       | software engineering.
       | 
       | I've yet to read anything related to the analysis of trained data
       | or trained neural networks, the structure of the brains of small
       | insects, psychology on general intelligence, AI and emotions, the
       | real ability of an AI to think and write meaningful code or
       | understand what is a problem/solution etc, I mean anything that
       | is related to AI that can be treated by computer science.
       | 
       | I have no problems with algorithms and implementations, but don't
       | university/professors also try to go beyond, with epistemology,
       | to connect the dots with biology and neurology?
       | 
        | Unless I see science cover several fields of research, it's not
        | real, worthy research in my view, which is why I'm so stubbornly
        | skeptical of ChatGPT.
        
         | thewarrior wrote:
          | We've tried, but we understand even less about how "real"
          | intelligence works. At least in a way that we can replicate
          | it.
          | 
          | The people doing so-called alchemy are making far better
          | systems than anyone who's trying to actually understand
          | things. And we've been trying for decades now.
        
         | visarga wrote:
          | If the engineering part is of secondary importance for you,
          | then at least keep the dataset in mind. It is in the
          | composition and quality of the training data that all the
          | skills find their origin.
          | 
          | Most of our discussions are about model size, but few are
          | about the dataset. Yet all the scaling laws hint at the great
          | usefulness of more data. Sometimes even a little data can have
          | a great impact in the fine-tuning phase. In the end it is the
          | training data that transforms a random init into the model.
        
       | Tepix wrote:
       | I think it's poor form that they are taking the GPT-4 name for an
       | unrelated project. After all, the underlying Vicuna is merely a
       | fine-tuned LLaMA. Plus they use the smaller 13B version.
       | 
       | The results look interesting, however.
       | 
        | Here's hoping that they'll add GPTQ 4-bit quantization so the
        | 65B version of the model can be run on 2x 3090.
        
         | seydor wrote:
          | The way things are going, very soon GPT-4 will not be something
          | to aspire to.
        
         | twelvechairs wrote:
          | It's poor form to be calling it 'Open' AI. So I guess it's
          | swings and roundabouts. Everyone is leeching where they can.
        
           | iandanforth wrote:
           | To be fair they were open when that name was picked and it
           | looks like they may be trying to transition to just 'ai.com'.
        
             | teacpde wrote:
             | Wow, that must be an expensive domain name.
        
               | was_a_dev wrote:
               | I'm sure they can afford it
               | 
               | But justai.com would also be apt
        
               | philosopher1234 wrote:
                | They already did afford it. It redirects to ChatGPT.
        
               | xiphias2 wrote:
               | wow, how come it's not known?
        
               | sebzim4500 wrote:
               | What about this url redirect is interesting enough to you
               | that you think it should be widely known?
        
         | alsodumb wrote:
          | It's pretty clear that they're doing it for citation farming.
          | I've seen another paper from KAUST that uses a fancy title on
          | a trivial paper.
        
         | collaborative wrote:
          | Adding GPT4 to anything now increases marketing x4. So much AI
          | news is coming out lately that not adding it risks drowning in
          | a sea of info, even in the case of a good project.
        
           | Uehreka wrote:
           | The word for this is "trademark infringement". You are
           | specifically not allowed to capitalize on the marketing of
           | another entity's product to bolster yours by implying through
           | your name that you are somehow related. This is why "DALL-E
           | Mini" had to change their name to craiyon.
        
             | HarHarVeryFunny wrote:
             | It's also just (deliberately) misleading. It's based on the
             | 13B Vicuna/Llama model, not 175B GPT-3 or 1T GPT-4. There
             | is zero justification for calling it MiniGPT-4.
             | 
             | A more honest name would be Visual-Vicuna or Son-of-BLIP.
        
               | sebzim4500 wrote:
               | I don't see how it's misleading. MiniGPT-4 makes it sound
                | like a smaller alternative to GPT-4; if it were based on
               | GPT-4 there would be nothing 'mini' about it.
        
               | HarHarVeryFunny wrote:
               | It has more in common with GPT-3 than GPT-4 in terms of
               | size, but in reality it's based on Vicuna/Llama which is
               | 10x smaller than either, so as far as the LLM part of it
                | goes it's not mini-anything - it's just straight-up Vicuna
               | 13B.
               | 
               | The model as a whole is just BLIP-2 with a larger linear
               | layer, and using Vicuna as the LLM. If you look at their
               | code it's literally using the entire BLIP-2 encoder
               | (Salesforce code).
               | 
               | https://arxiv.org/pdf/2301.12597.pdf
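                | 
                | For anyone curious what "BLIP-2 plus a linear layer"
                | means in practice, here's a minimal sketch in PyTorch
                | (my own simplification, not the authors' code; the
                | dimensions are just illustrative for a Q-Former feeding
                | a 13B LLaMA-class model):
                | 
                |   import torch
                |   import torch.nn as nn
                |   
                |   class VisualPrefix(nn.Module):
                |       # frozen Q-Former features -> frozen LLM embeddings
                |       def __init__(self, qformer_dim=768, llm_dim=5120):
                |           super().__init__()
                |           # the projection is the only newly trained piece
                |           self.proj = nn.Linear(qformer_dim, llm_dim)
                |   
                |       def forward(self, query_tokens, text_embeds):
                |           # query_tokens: (batch, 32, qformer_dim), vision side
                |           # text_embeds: (batch, seq, llm_dim), LLM side
                |           visual_embeds = self.proj(query_tokens)
                |           # prepend image tokens, feed to the frozen LLM
                |           return torch.cat([visual_embeds, text_embeds], dim=1)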
        
               | [deleted]
        
               | Tepix wrote:
               | > _1T GPT-4_
               | 
               | The number of parameters used for GPT-4 is unknown.
        
               | HarHarVeryFunny wrote:
               | I got the 1T GPT-4 number from here - this is the video
               | that goes with the Microsoft "Sparks of AGI" paper, by a
               | Microsoft researcher that had early access to GPT-4 as
               | part of their relationship with OpenAI.
               | 
               | https://www.youtube.com/watch?v=qbIk7-JPB2c
        
               | sandkoan wrote:
                | Bubeck has clarified that the "1 trillion" number he was
                | throwing around was just a hypothetical metaphor--it was
                | in no way, shape, or form implying that GPT-4 has 1
                | trillion parameters [0].
               | 
               | [0] https://twitter.com/SebastienBubeck/status/1644151579
               | 7238251...
        
               | HarHarVeryFunny wrote:
               | OK - thanks!
               | 
               | So we're back to guessing ...
               | 
               | A couple of years ago Altman claimed that GPT-4 wouldn't
               | be much bigger than GPT-3 although it would use a lot
               | more compute.
               | 
               | https://news.knowledia.com/US/en/articles/sam-altman-q-
               | and-a...
               | 
               | OTOH, given the massive performance gains scaling from
               | GPT-2 to GPT-3, it's hard to imagine them not wanting to
               | increase the parameter count at least by a factor of 2,
               | even if they were expecting most of the performance gain
               | to come from elsewhere (context size, number of training
               | tokens, data quality).
               | 
                | So in the 0.5-1T range, perhaps?
        
             | collaborative wrote:
             | At this point the letters GPT make more sense than "AI" or
             | "LLM" in many peoples minds
        
               | Uehreka wrote:
               | Hard disagree.
               | 
               | Outside of the brand name ChatGPT, lay members of the
               | general public are way more likely to call these chatbots
               | (like Bard and Bing) "AIs" than "GPTs". And although GPT
               | could technically refer to any model that uses a
               | Generative Pre-trained Transformer approach (although it
               | probably wouldn't be an open-and-shut case), the mark
               | "GPT-4" definitely is associated with OpenAI and their
               | product, and you can't just use it without their
               | permission.
        
               | collaborative wrote:
               | So OpenAI ostensibly owns "GPT4" according to your
               | argument. But does it own "MiniGPT4"? I hope you see the
               | absurdity of this.
               | 
                | Let's not discuss the number of copyrights OpenAI has
                | already infringed, too
        
               | Uehreka wrote:
               | I'll put it this way:
               | 
               | At Brewer's Art in Baltimore, MD they just released a
               | beer called GPT (Green Peppercorn Tripel)[1]. They're
               | likely allowed to do that because a reasonable consumer
               | would probably not actually think they had collaborated
               | with OpenAI, because OpenAI does not make beer.
               | 
               | OP is releasing a model called "MiniGPT-4". A reasonable
               | consumer could look at that name and become confused
               | about the origin of the product, thinking it was from
               | OpenAI. This would be understandable, since OpenAI also
               | makes large language models and has a well known one that
               | they've been promoting whose brand name is "GPT-4". If
               | MiniGPT-4 does not meet that consumer's expectation of
               | quality (which has been built up through using and
               | hearing about GPT-4) it may cause them to think something
               | like "Wow, I guess OpenAI is going downhill".
               | 
               | Trademark cases are generally decided on a "reasonable
               | consumer" basis. So yeah, they can seem a little
               | arbitrary. But it's important for consumers to be able to
               | distinguish the origin of the goods they are consuming
               | and for creators to be able to benefit from their
               | investment in advertising and product development.
               | 
               | [1] https://www.thebrewersart.com/bottles-cans
        
           | nashashmi wrote:
           | They can always say GPT-like. Or miniaturized GPT-like LLM.
        
         | causi wrote:
         | Someone needs to write a buyer's guide for GPUs and LLMs. For
          | example, what's the best course of action if you don't need to
          | train anything but do want to eventually _run_ whatever model
         | becomes the first local-capable equivalent to ChatGPT? Do you
         | go with Nvidia for the CUDA cores or with AMD for more VRAM? Do
         | you do neither and wait another generation?
        
           | thewataccount wrote:
            | Nvidia and the highest amount of VRAM you can get.
            | 
            | Currently the 4090; the rumor is the 4090 Ti will have 48GB
            | of VRAM, idk if it's worth waiting or not.
            | 
            | The more VRAM, the higher parameter count you can run all in
            | memory (fastest by far).
            | 
            | AMD is almost a joke in ML. The lack of CUDA support (which
            | is Nvidia proprietary) is straight-up lethal, and even
            | though ROCm does have much better support these days, from
            | what I've seen it's still a fraction of the performance of
            | what it should be. I'm also not sure if you need projects to
            | support it or not; I know PyTorch has backend support for it
            | but I'm not sure how easy it is to drop in.
        
             | brandall10 wrote:
             | I think the most recent rumors were amended to it having
             | 24, unfortunately.
        
               | thewataccount wrote:
               | Darn.
               | 
                | I mean, in all honesty there's no reason a gaming card
                | would need 48GB at the moment when so few games even use
                | 24GB.
               | 
               | 48GB really only makes sense for workstation cards.
        
             | causi wrote:
             | That's good to know. A lot of people might look at the
             | 7900XTX with its 24GB of ram for a grand and think "ah, a
             | good deal for fitting an LLM in".
        
           | lhl wrote:
           | For a general guide, I recommend:
           | https://timdettmers.com/2023/01/30/which-gpu-for-deep-
           | learni...
           | 
           | There's a subreddit r/LocalLLaMA that seems like the most
           | active community focused on self-hosting LLMs. Here's a
           | recent discussion on hardware: https://www.reddit.com/r/Local
           | LLaMA/comments/12lynw8/is_anyo...
           | 
            | If you're looking just for local inference, your best bet
           | is probably to buy a consumer GPU w/ 24GB of RAM (3090 is
           | fine, 4090 more performance potential), which can fit a 30B
           | parameter 4-bit quantized model that can probably be fine-
           | tuned to ChatGPT (3.5) level quality. If not, then you can
           | probably add a second card later on.
           | 
           | Alternatively, if you have an Apple Silicon Mac, llama.cpp
           | performs surprisingly well, it's easy to try for free:
           | https://github.com/ggerganov/llama.cpp
           | 
            | Current AMD consumer cards have terrible software support
            | and IMO aren't really an option. On Windows you _might_ be
            | able to use SHARK or DirectML ports, but nothing will run
            | out of the box. ROCm still has no RDNA3 support (supposedly
            | coming w/ 5.5 but no release date announced) and it's
            | unclear how well it'll work - basically, unless you would
            | rather be fighting w/ hardware than playing around w/ ML,
            | it's probably best to avoid (the older RDNA cards also don't
            | have tensor cores, so perf would be hobbled even if you
            | could get things running. Lots of software has been written
            | w/ CUDA-only in mind).
        
             | BeefWellington wrote:
             | FWIW I had no real issues getting StableDiffusion to run on
             | a 6800 I have in one of my systems.
             | 
             | I haven't tried with LLaMA at all.
        
               | lhl wrote:
               | 6800 is RDNA2, not RDNA3. The latter is still waiting for
               | ROCm support 4 months post-launch:
               | https://github.com/RadeonOpenCompute/ROCm/issues/1813
        
               | BeefWellington wrote:
               | I'm aware that a 6800 is not RDNA3. You stated broadly:
               | 
               | > Current AMD consumer cards have terrible software
               | support and IMO isn't really an option. On Windows you
               | might be able to use SHARK or DirectML ports, but nothing
               | will run out of the box.
               | 
               | I was merely sharing that I did not have that same
               | experience that current consumer cards have terrible
               | support.
        
               | lhl wrote:
               | Sure, and I was merely clarifying that only last-gen
               | architectures work. While I'm glad that SD works for you,
               | if we are getting into it, I think that having no support
               | on current-gen flagship models _does_ equal broadly
               | terrible software support and is more worth highlighting,
               | since otherwise someone might assume they could just go
               | pick one of those up and get a 24GB GPU on the cheap,
               | especially in the context of LLMs (which is what the OP
               | was asking about).
               | 
               | For RDNA2, you apparently _can_ get LLMs running, but it
                | requires forking/patching both bitsandbytes and GPTQ:
               | https://rentry.org/eq3hg - and this will be true for any
               | library (eg, can you use accelerate? deepspeed? fastgen?
               | who knows, but certainly no one is testing it and AMD
               | doesn't care if you're not on CDNA). It's important to
               | note again, anything that works atm will still only work
               | with last-gen cards, on Linux-only (ROCm does not work
               | through WSL), w/ limited VRAM (no 30Bq4 models), and
               | since RDNA2 tensor support is awful, if the SD benchmarks
               | are anything to go by, performance will still end up
               | worse than an RTX 3050:
               | https://www.tomshardware.com/news/stable-diffusion-gpu-
               | bench...
        
               | AuryGlenz wrote:
               | Running is one thing. Running with any sort of speed is
               | another.
        
           | l33tman wrote:
           | You don't need a GPU, alpaca.cpp runs purely on the CPU if
           | you have 32+ GB of RAM for the 30B Llama model. It's a bit
           | slow, but you can toy with it..
        
           | avereveard wrote:
           | Depends on whether you just want to play around or have
           | actual business needs, but to play around I'm waiting for the
           | next generation of unified memory architectures
        
             | ilteris wrote:
              | Can you point to any source re: the next generation of
              | unified memory architectures? How far away are they?
        
               | avereveard wrote:
                | Well, currently there's just Apple and they are super
                | secretive about their roadmap. The M1/M2 top-line chips
                | already perform quite well, and while the top versions
                | look expensive, a pair of A6000s will definitely eat
                | more money both to purchase and to run.
                | 
                | M3 is around the corner tho, and there's some
                | announcement to come from Intel or ARM following their
                | partnership. There's also the new card coming from Intel
                | that is supposed to be aimed squarely at machine
                | learning workloads, and they don't have to segment their
                | market by memory sizing like Nvidia does. They aren't
                | well supported as device targets, but a pair of these
                | will likely be very cost effective if and only if they
                | get credible compatibility with the libraries and
                | models.
        
           | Tepix wrote:
           | Read Tim Dettmer's guide.
           | 
           | Basically, you want nVidia, and you want lots of VRAM. Buy
           | used for much more bang for the buck.
           | 
           | Depending on your budget, get:
           | 
           | - an RTX 3060 with 12GB or
           | 
           | - 1 used RTX 3090 with 24GB (approx twice as expensive as the
           | 3060 but twice the VRAM and much faster) or
           | 
           | - 2 used RTX 3090 cards if you need more than 24GB.
           | 
           | Everything beyond that gets quite a bit more expensive
           | because then you need a platform with more PCIe lanes, you
           | may need more than one PSU and you will have problems fitting
           | and cooling everything.
           | 
            | With two cards and 2x24GB you can run the largest version of
            | the LLaMA model (the 65B variant) and all its descendants
            | with 4-bit quantization inside your GPUs' VRAM, i.e. with
            | good performance. You can also try some low-resource fine-
            | tuning variants (LoRA etc).
            | 
            | Oh, and while you're at it also get a decent amount of RAM
            | like 64GB or 128GB (it's very cheap right now) and an NVMe
            | SSD. These models are quite large.
        
           | mk_stjames wrote:
            | Ask: how many parameters is the model? At how many bits? And
            | has it been quantized to a smaller number of bits, or if not,
            | can it be without losing quality? Multiply the number of bits
            | by the number of parameters, convert to GB, and then add some
            | amount of overhead for however much else needs to load and/or
            | how horrible/amazing an implementation the inference code
            | is. Done.
           | 
           | Example: A 30B parameter model trained at 16bit FP gets
           | quantized down to 4 bit ints. 4 bits = 0.5 byte. 30 billion *
           | 0.5 byte = 15GB of VRAM (plus a GB or few of other overhead)
           | 
           | For more real world discussion see
           | 
           | https://github.com/ggerganov/llama.cpp/issues/13
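            | 
            | The same arithmetic as a throwaway Python snippet (the 20%
            | overhead factor is just a guess; real overhead depends on
            | the implementation and context length):
            | 
            |   def vram_gb(n_params, bits_per_param, overhead=1.2):
            |       # params * bits -> bytes -> GB, plus a fudge factor
            |       return n_params * bits_per_param / 8 / 1e9 * overhead
            |   
            |   print(vram_gb(30e9, 4))    # 30B at 4-bit -> ~18 GB
            |   print(vram_gb(13e9, 16))   # 13B at fp16  -> ~31 GB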
        
           | redox99 wrote:
           | It's pretty simple actually.
           | 
           | Get a 3090 or 4090. Forget about AMD.
        
             | mewpmewp2 wrote:
             | Should I get a gaming PC with 4090 if I want to get into
             | personal side projects in ML and also games?
             | 
             | Do I need dualboot? Or is Windows good?
        
               | dharma1 wrote:
               | 4090 is amazing, but very large card. 3090 is "good
               | enough" for ML - same 24gb vram - and you can pick them
               | up used for half the price of a new 4090. That's what I
               | did.
               | 
                | WSL on Windows is apparently decent, or native PyTorch;
                | dual-boot Windows/Ubuntu is still prob best tho.
        
               | artificial wrote:
               | Getting CUDA on OpenSUSE was super easy. The Nvidia blob
               | drivers are easy to install and CUDA needs another
               | download and some copy paste. Even Unreal Editor was
               | easier to install than on Windows.
        
               | Regic wrote:
               | I'm running models locally on my 3090 and it's fast
               | enough, although for example building a vector database
               | can take a while. I can run LoRa training but I haven't
               | done anything meaningful with it so far. I chose 3090
               | because of the cable issue of 4090 (also, no nvlink,
               | although I'm not sure that matters) but it's debatable if
               | my fears are justified. I need to leave the gpu running
               | while I'm away and I just don't feel comfortable doing
                | that with a 4090. I'd rather take the lower performance.
               | 
               | One caveat though, my asus b650e-f is barely supported by
               | the currently used ubuntu kernel (e.g. my microphone
               | doesn't work, before upgrading kernel + bios I didn't
               | have lan connection...) so expect some problems if you
               | want to use a relatively new gaming setup for linux.
        
               | thewataccount wrote:
                | Windows generally works but there may be a somewhat
                | small performance hit. IMO Linux is much easier to get
                | to work, judging by all the GitHub issue threads I see
                | about SD/LLaMa stuff on Windows - but I don't use
                | Windows so I don't have personal experience.
                | 
                | The 4090 24GB is 1800USD; the Ada A6000 48GB is like
                | 8000USD and idk where you buy it? So if you want to run
                | games and models locally the 4090 is honestly the best
                | option.
                | 
                | EDIT: I forgot - there is a rumored 4090 Ti with 48GB of
                | VRAM, no idea if that's worth waiting for.
        
               | wincy wrote:
               | If I was going to spend $8000 on a video card I'd hunt on
               | eBay for an A100 80GB rather than settle for the A6000
        
               | thewataccount wrote:
               | Honestly yeah a used A100 80GB sounds like a better idea.
        
               | macrolime wrote:
               | The A6000 is actually the old generation, Ampere. The new
               | Ada generation one is called 6000. Seems many places
               | still sell A6000 (Ampere) for the same price as RTX 6000
               | (Ada) though, even though the new one is twice as fast.
               | 
               | Seems you can get used RTX A6000s for around $3000 on
               | ebay.
        
               | thewataccount wrote:
               | That.... That explains why I can't find it and makes a
               | ton of sense.....
               | 
               | I think that's such a silly name for it, but oh well
               | 
               | Thanks for the correction!
        
               | macrolime wrote:
               | Just to add to the confusion, there's another older RTX
               | 6000 with 24GB ram. This is from an even older
               | generation, same as the GeForce 20 series.
        
               | thewataccount wrote:
               | You're kidding? So they called it the RTX 6000, then
               | called it the RTX A6000 for ampere, then back to RTX 6000
               | for Ada?
               | 
               | Why do they do this? Sometimes consumer products are
               | versioned weirdly to mislead customers (like intel cpus)
               | - but these wouldn't even make sense to do that with as
               | they're enterprise cards?
        
               | macrolime wrote:
               | Actually the first one is called Quadro RTX 6000, while
               | the Ada one is just RTX 6000 without "Quadro" in front.
               | Not that it makes the naming make much more sense.
               | 
               | According to GPT-4 the next generation one will be called
               | Galactic Unicorn RTX 6000 :D
        
               | dannyw wrote:
               | You could also get more detailed help from
               | https://reddit.com/r/buildapc/
        
               | whywhywhywhy wrote:
               | Save some money and go 3090, same vram, speed difference
               | probably isn't worth the premium for the 4090. Then
               | upgrade when the rumored 5090 generational leap happens.
        
               | sampa wrote:
                | You can run Windows in a VM and pass your video card
                | through to it. So it's basically native gaming (3-5%
                | cost) w/o reboots. Look for VFIO.
        
               | turmeric_root wrote:
                | Windows reserves a certain percentage of VRAM for some
               | reason. So I'd recommend Linux. Or find a way to disable
               | the desktop/UI in Windows.
        
             | sureglymop wrote:
             | I have access to an Nvidia A100. But as a layman, what
             | specs does the rest of the system need to use it for some
             | real work? I would assume there needs to be at least as
             | much ram as vram and maybe a few terabytes of disk space.
             | Does anyone have experience with this?
        
               | michaelt wrote:
               | If you have an A100, which in its 80GB variant costs
               | $23,667 [1], you would not generally quibble over the
               | price of a few terabytes of disk space.
               | 
               | [1] https://www.dell.com/en-us/shop/nvidia-
               | ampere-a100-pcie-300w...
        
           | seydor wrote:
           | At least for llama derivatives, reddit has a list https://www
           | .reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_...
        
           | dannyw wrote:
           | Definitely NVIDIA. Good luck running any real ML models on
           | AMD GPUs.
        
         | [deleted]
        
         | iandanforth wrote:
         | Filed an issue: Up or down vote there as you see fit.
         | 
         | https://github.com/Vision-CAIR/MiniGPT-4/issues/9
        
       | cloudking wrote:
        | It doesn't seem to recognize text from a screenshot very well. I
        | gave it a Windows screenshot with detailed error messages:
       | https://filestore.community.support.microsoft.com/api/images...
       | 
       | and prompted "Describe any issues found in this screenshot and
       | steps to resolve them"
       | 
        | While it correctly identified it as a screenshot from a computer,
       | it gave a very generic response and didn't identify the error
       | messages correctly:
       | 
       | "The screenshot shows a computer screen with a warning message
       | that reads, "The system is currently unavailable. Please try
       | again later." This indicates that there is an issue with the
       | computer system that needs to be resolved.
       | 
       | To resolve this issue, you can try the following steps:
       | 
       | 1. Check the system status: Check the system status to see if
       | there are any errors or issues that need to be resolved.
       | 
       | 2. Restart the computer: Try restarting the computer to see if
       | that resolves the issue.
       | 
       | 3. Check for updates: Check for any updates that may have been
       | released for the computer system and apply them if necessary.
       | 
       | 4. Check for viruses: Run a virus scan on the computer to check
       | for any viruses that may be causing the issue.
       | 
       | 5. Check for hardware issues: Check for any hardware issues that
       | may be causing the issue. This could include issues with the
       | power supply, hard drive, or other components.
       | 
       | 6. Contact technical support: If none of the above steps resolve
       | the issue, contact technical support for further assistance. They
       | may be able to provide additional troubleshooting steps or help
       | you resolve the issue."
        
       | HansHamster wrote:
       | Just tried it with a random schematic:
       | 
       | > The image shows a block diagram of a power amplifier circuit
       | with a power supply and a load resistor. The power supply is
       | connected to the input of the amplifier, which is then connected
       | to the output. The load resistor is connected to the output of
       | the amplifier and the power supply. The amplifier is designed to
       | amplify the input signal to a higher level, which is then sent to
       | the load resistor to power the load. The power supply provides
       | the necessary voltage to operate the amplifier.
       | 
       | On the one hand I'm impressed that it recognises an amplifier,
       | but just about everything in the description is (slightly) wrong.
       | It is not a power amplifier, the power supply is not connected to
       | the input (why should it be?), there is no load resistor...
        
         | mftb wrote:
         | I also got middling results. I gave it a novel NASA picture of
         | the moon passing in front of the earth, taken by the Deep Space
         | Climate Observatory Satellite, so an unusual vantage point.
         | First it was just wrong, and said that it was a picture of a
         | crescent moon taken from a mountain top, or something like
          | that. Not terrible, at least it recognized the moon, but wrong.
         | I tried to give it a chance and corrected it with the
         | additional info that the picture was taken from space. It then
         | did better, but on the last line it threw in that it was not a
         | photograph, but a rendering of some sort, which again, is just
         | wrong.
         | 
         | More and more I get the impression that people are so taken
         | with the chat interface that they are grading these models more
         | favorably than is warranted.
        
           | HansHamster wrote:
           | And after pointing out all the mistakes in the description,
           | it took a very ... interesting approach to address my
           | remarks:
           | 
           | > I apologize for the confusion. The image shows a block
           | diagram of a basic amplifier circuit with a power supply, but
           | it does not show any input stages, output stages, or load
           | resistors. [...] The input stage is made up of an operational
           | amplifier, which is used to amplify the input signal. The
           | output of the input stage is then connected to the power
           | supply, which provides the necessary voltage to operate the
           | amplifier. The amplifier is designed to amplify the input
           | signal to a higher level, which is then sent to the load
           | resistor to power the load.
           | 
           | So it acknowledges the remarks, describes everything that is
           | _not_ in the image and then puts the stuff back into the
           | description anyway :)
        
       | aabajian wrote:
       | From a radiology world this is fascinating. I'm not worried about
       | job security as I'm an interventionalist. What I'm wondering is
       | about go-to-market strategies for diagnostics. I do some
       | diagnostic reads and I would love to have something like this
       | pre-draft reports (especially for X-Rays). There are tons of "AI
       | in rads" companies right now, none of which have models that come
       | anywhere close to GPT-4 or even this. Personally, I would not
       | _start_ such a company because the barrier to entry is now so
       | low. I think there needs to be an open-source project that just
       | subtrains these models on rads images and is available to all
       | radiologists.
        
         | euroderf wrote:
          | Your profession and... a few hundred others?
        
       | deeplstm wrote:
       | Here's the video summary https://youtu.be/uPV9Gk3IC-g
        
       | atgctg wrote:
       | It's hard to distinguish non-Google projects with Google Sans in
       | their templates from actual Google Research papers, as the font
       | is meant to be exclusively used by Google[1].
       | 
       | [1]
       | https://developers.google.com/fonts/faq#how_can_i_get_a_lice...
        
         | sebzim4500 wrote:
         | Surely most people would read the authors list to determine
         | provenance rather than the font?
        
           | pxeger1 wrote:
           | I didn't think about it consciously but I think I did
           | implicitly assume it was a Google project because of the font
        
       | tarruda wrote:
       | Looking forward to the next generation of cheap GPUs with enough
       | VRAM to run models like Vicuna-13 locally.
        
         | adam_arthur wrote:
         | A lot of people are running fairly powerful models directly on
         | the CPU these days... seems like inference will not be a GPU
         | exclusive activity going forward. Given that RAM is the main
         | bottleneck at this point, running on CPU seems more practical
         | for most end users
         | 
         | See: https://news.ycombinator.com/item?id=35602234
        
         | valine wrote:
         | Cheap is relative I suppose.
         | 
         | I'm running Vicuna 13b 16f locally and it needs 26GB of VRAM,
         | which won't even fit on a single RTX 4090. The next gen RTX
         | Titan might have enough vram but that won't come cheap. I'm
         | expecting a price point above $2500.
        
           | 22c wrote:
           | I'm not sure if it's the point GP is trying to make, but I
           | would like to see GPUs with extra VRAM that don't have the
           | extra compute. eg. similar performance of a 4070Ti but with
           | 24GB or 32GB of VRAM.
           | 
           | I don't see a really good reason why OEMs couldn't do that
              | now; in the past there have been OEM cards that have more
           | VRAM than the reference design. I'm sure there's an appetite
           | for cards like that for people who don't want to refinance
           | their home loan to get 2 x RTX 4090 cards.
        
             | Tepix wrote:
             | I'm fairly sure that NVIDIA is making sure that consumer
              | cards are not a good alternative to their $10000 80GB VRAM
             | A100 cards.
        
       | ggregoire wrote:
        | Could you use something like this to summarize the content of a
        | video? Or would you first need to extract the images from the
        | video (like 1 fps, with ffmpeg) and tell the AI to analyze all
        | the images and summarize them as if it were a video?
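        | 
        | (For the extraction step, a minimal sketch, assuming ffmpeg is
        | installed and "video.mp4"/"frames/" are placeholder paths:)
        | 
        |   import pathlib, subprocess
        |   
        |   # grab one frame per second; each frame could then be captioned
        |   # by an image-to-text model and the captions summarized
        |   pathlib.Path("frames").mkdir(exist_ok=True)
        |   subprocess.run(
        |       ["ffmpeg", "-i", "video.mp4", "-vf", "fps=1",
        |        "frames/frame_%05d.png"],
        |       check=True,
        |   )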
        
         | all2 wrote:
         | I'd do this using both streams of information: audio and video.
         | 
         | I'd segment the audio semantically based on the topic of
         | discussion, and I'd segment the video based on editing,
         | subjects in scene, etc. We could start simply and just have a
         | "timestamp": [ subjects, in, frame] key-value.
         | 
          | It'd take some fiddling to sort out how to mesh these two streams
         | of data back together. The first thing I'd try is segment by
         | time chunks (the resolution of which would depend on min/max
         | segment lengths in video and audio streams) and then clump the
         | time chunks together based on audio+video content.
        
       | golergka wrote:
       | I don't know what unholy abomination they did to the scrolling,
       | but after I increased the page font size, it just sticks to the
       | bottom and you can't scroll up.
        
         | arjonagelhout wrote:
         | I'm experiencing the same thing on iOS Safari, it rapidly jumps
         | back and forth between two positions.
        
         | [deleted]
        
         | detrites wrote:
         | Right off the bat I lost all scrolling inertia here, it's like
         | being in the dark ages.
        
       | raylad wrote:
       | I gave it a picture of a bottle, which it did a fairly good job
       | of identifying (although it misread some of the text, mistaking a
       | 1 for a 4).
       | 
       | Then I asked it what are the likely ingredients of the product.
       | It still hadn't replied after 2274s so I gave up on it.
        
       | simonw wrote:
       | Hate to be the person complaining about the name, but we already
       | saw how this plays out with DALL-E mini: if you name your project
       | directly after something else like this, no matter how much extra
       | explanatory text you attach to it a large number of people will
       | assume it's an "official" variant of the thing it was named
       | after.
       | 
       | Eventually you'll have to rename it, either to resolve the
       | confusion or because OpenAI pressure you to do so, or both.
       | 
       | So better to pick a less confusing name from the start.
       | 
       | (This one is even more confusing because it's about image inputs,
       | but GPT4 with image inputs had not actually been released to
       | anyone yet - similar in fact to how DALL-E mini got massive
       | attention because DALL-E itself was still in closed preview)
        
         | xiphias2 wrote:
          | > _So better to pick a less confusing name from the start._
          | 
          | I'm not sure if that's better from a marketing standpoint
          | though... it works, you still remember DALL-E mini.
        
       | jiwidi wrote:
        | A bit nasty to call something MiniGPT-4 when it comes from LLaMA.
        
       | wsgeorge wrote:
       | Giving this a quick spin and I like what I'm seeing. I gave it a
       | picture of Paolo Veronese's Lament over the Dead Christ [0] and
       | asked what was going on in the background.
       | 
       | > The painting depicts the scene of the crucifixion of Jesus
       | Christ. In the foreground, there is a group of people standing
       | around the cross, including Mary, the mother of Jesus, and the
       | two thieves who were crucified with him. In the background, there
       | is a hill with a group of people standing on it, watching the
       | crucifixion. The sky is dark and stormy, with lightning
       | illuminating the scene. The overall mood of the painting is one
       | of sadness and tragedy.
       | 
       | Then I asked for a sonnet on it:
       | 
       | "In this painting, we see the scene of the cross
       | 
       | Where Jesus is crucified, with Mary at his side
       | 
       | And the thieves who were crucified with him
       | 
       | In the background, a hill with people standing
       | 
       | Watching the tragic event unfold
       | 
       | The sky is dark and stormy, with lightning flashing
       | 
       | Illuminating the scene of the crucifixion
       | 
       | The mood is one of sadness and tragedy
       | 
       | As we contemplate the sacrifice of Jesus
       | 
       | But even in death, he shows us the way
       | 
       | To love and forgiveness, to live another day
       | 
       | And so we remember the crucifixion
       | 
       | And the love that Jesus showed us that day."
       | 
       | So...not quite what I was expecting, but it's progress!
       | 
       | [0]
       | https://en.wikipedia.org/wiki/Lament_over_the_Dead_Christ_(V...
        
         | omeid2 wrote:
         | What did you expect?
        
           | wsgeorge wrote:
           | A 14-line poem with a consistent rhyme scheme and meter.
           | Perhaps my request should have been more specific.
        
             | pmoriarty wrote:
             | From my experimentation, LLMs tend to kind of suck at rhyme
             | and meter, and all but the simplest types of poetry, so
             | even if you'd specified it probably wouldn't have been able
             | to deliver.
             | 
             | This is definitely something they could be trained to be
              | much better at, but I guess it hasn't been a priority.
        
               | astrange wrote:
               | GPT4 is surprisingly good at it, considering BPE
               | tokenization means it shouldn't be able to do rhyme at
               | all.
        
               | jcims wrote:
               | Has anyone tried using phonetic tokens instead of text?
               | I'm curious if that would help with things like rhyming.
        
           | causi wrote:
           | I never know what to expect anymore. We live in a world where
           | computers can describe paintings and write sonnets about them
           | but a half-trillion dollar car company can't figure out how
           | to parallel park with eight cameras.
        
             | astrange wrote:
             | That's a well known result from the last AI hype era.
             | 
             | https://en.wikipedia.org/wiki/Moravec%27s_paradox
             | 
             | Although, I'm not surprised AI can describe paintings, but
             | I'm still pretty surprised it can generate them. How'd we
             | get that to work!?
        
               | causi wrote:
               | Yeah, and I'm still completely lost as to why resolution
               | is such a limiting factor. If you know you're drawing a
               | soccer ball why is a 512x512 soccer ball so much easier
               | than a 1024x1024 soccer ball?
        
               | gary_0 wrote:
               | There are a few cases where people have used ChatGPT to
               | generate SVG[0], with mostly unimpressive results. I'm
               | sure sooner or later models will be developed
               | specifically for creating vector drawings instead of
               | raster, including with the ability to apply textures and
               | gradients.
               | 
               | Also, the resolution of Stable Diffusion's output isn't
               | much of a limitation if you're willing to use other tools
               | to massage the output into something professional-
               | quality. See [1]
               | 
               | [0] For example:
               | https://praeclarum.org/2023/04/03/chatsvg.html
               | 
               | [1] https://news.ycombinator.com/item?id=35592847
        
             | moffkalast wrote:
             | DriveGPT as it hits 2 parked cars and runs over a dog:
             | "Apologies for the confusion earlier. You are correct that
             | the cars were in the planned local planner path..."
        
               | dceddia wrote:
               | Sorry, yes, that was a dog.
               | 
               | Let me try again and this time I will definitely not hit
               | anything.
               | 
               | Sorry, that was another dog.
        
               | brycedriesenga wrote:
               | "Was that another dog?"
               | 
               | BingDrive: I'm sorry, but I prefer not to continue this
               | conversation.
        
               | gary_0 wrote:
               | It's like something right out of a Douglas Adams book.
        
             | niemandhier wrote:
                | That's because they decided they do not need lidar.
        
               | valine wrote:
               | Near term it's a frustrating decision, but if these gpt4
               | vision LLMs are anything to go by it will prove to be the
               | right decision in the long term.
        
               | rafram wrote:
               | Why wouldn't LIDAR _in addition to_ computer vision with
               | cameras be a strictly better idea?
        
               | bradstewart wrote:
               | It's all trade offs. I'm just spitballing here, but if
               | you have limited resources, you can either spend
               | cash/time on lidar or invest in higher-quality mass-
               | produced optics, or better computer vision software. If
                | you get to a functional camera-only system sooner,
                | everyone might be better off as you can deploy it more
                | rapidly.
               | 
               | Manufacturing capacity of lidar components might be
               | limited.
               | 
               | Another might be reliability/failure modes. If the system
               | relies on lidar, that's another component that can break
               | (or brownout and produce unreliable inputs).
               | 
                | So in a vacuum, yea a lidar+camera system is _probably_
               | better, but who knows with real life trade offs.
               | 
               | (again, I just made these up, I do not work on this
               | stuff, but these are a few scenarios I can imagine)
        
               | astrange wrote:
               | Remember that "cameras" aren't as good as human
               | perception because human eyes interact with the
               | environment instead of being passive sensors. (That is,
               | if you can't see something you can move your head.)
               | 
               | Plus we have ears, are under a roof so can't get rained
               | on, are self cleaning, temperature regulating, have much
               | better dynamic range, wear driving glasses...
        
               | spaceman_2020 wrote:
               | And we still get into millions of accidents every year...
        
               | program_whiz wrote:
               | I keep hearing this argument over and over, but I find it
                | uncompelling. As a relatively young person with good
                | vision, who has never been in an accident after many
                | years of driving, and who doesn't make the kind of
                | simple mistakes self-driving cars do, I've seen the
                | absurd mistakes self-driving cars make and I would not
                | trust my life to a self-driving car.
               | 
               | Asking people to accept a driverless car based on over-
               | arching statistics is papering over some very glaring
               | issues. For example, are most accidents in cars being
               | caused by "average" drivers or are they young / old /
               | intoxicated / distracted / bad vision? Are the statistics
               | randomly distributed (e.g. any driver is just as likely
               | as the next to get in accidents)? Because the driverless
               | cars seem to have accidents at random in unpredictable
               | ways, but human drivers can be excellent (no accidents,
               | no tickets ever), or terrible (drive fast, tickets, high
               | insurance, accidents, etc). The distribution of accidents
               | among humans is not close to uniform, and is usually
               | explainable. I wouldn't trust a poor human driver on a
               | regular basis, nor would I trust an AI because I'm
               | actually a much better driver than both (no tickets, no
               | accidents, can handle complex situations the AI can't).
               | Are the comparisons of human accidents being treated as
               | homogenous (e.g. the chance of ramming full speed into a
                | parked car the same as a fender-bender?). I see 5.8M car
                | crashes annually, but deaths remain fairly low (~40k,
                | .68%), vs 400 driverless accidents with ~20 deaths (5%);
                | I'm not sure we're talking about the same type of
                | accidents.
               | 
               | tl;dr papering over the complexity of driving and how
               | good a portion of drivers might be by mixing non-
               | homogenous groups of drivers and taking global statistics
               | of all accidents and drivers to justify unreliable and
               | relatively dangerous technology would be a strict
               | downgrade for most good drivers (who are most of the
               | population).
        
               | rootusrootus wrote:
               | Which sounds like a lot until you realize 1) we drive
               | over three _trillion_ miles a year in the US, and 2) the
               | majority of those accidents are concentrated to a
               | fraction of all drivers. The median human driver is quite
                | good, and the state of the art AI isn't even in the same
               | galaxy yet.
        
               | Philip-J-Fry wrote:
               | It's not lidar they need. BMW, Mercedes, Porsche, etc.
               | All can park themselves almost perfectly every time.
               | Teslas can't, and will take 5x as long to do it when they
               | can, if the computer can even bother to recognise the
               | space.
               | 
                | It's software. Original Teslas with AP1 park better than
                | Tesla's own in-house software on their latest AP.
        
               | causi wrote:
               | Sure, sure, but would it have killed them to drop in a
               | few five dollar "don't hit this object" ultrasonic
               | proximity sensors?
        
               | michaelt wrote:
               | While ultrasonic sensors would be fine for parking, they
               | don't have very good range so they aren't much help in
               | avoiding, for example, crashing into stationary fire
               | trucks or concrete lane dividers at freeway speeds.
        
               | itsyaboi wrote:
               | Just disable autopilot 0.00001 seconds before impact and
               | it becomes the driver's fault.
        
               | tough wrote:
               | Couldn't afford*
        
             | bugglebeetle wrote:
              | Don't worry. That valuation has always been ZIRP nonsense
             | and will fall with Musk's addiction to publicly
             | demonstrating his incompetence.
        
         | pmoriarty wrote:
         | How about asking it to describe a Robert Williams painting?
         | 
         | Here's a relatively easy one: [1]
         | 
         | [1] -
         | http://www.illustrationartgallery.com/acatalog/WilliamsTimmy...
        
           | wg0 wrote:
           | I tried that and here's what it has to say:
           | 
           | > This painting appears to be a depiction of a child running
           | down a dirt road with a kite in hand. The child is wearing a
           | yellow shirt and blue shorts, and the background is filled
           | with various colors and shapes. The overall mood of the
           | painting is playful and carefree.
           | 
           | What's amazing is that it has correctly described the scene
            | including details such as:
            | 
            |   - A boy
            |   - On a dirt road.
            |   - Having a Kite in hand.
            |   - Wearing Yellow shirt and blue shorts.
        
           | jcims wrote:
           | Used this prompt
           | 
           | Could you describe the image as thoroughly as possible?
           | Please include the content, any noteworthy symbology,
           | techniques or styles, what kind of media, guess at the artist
           | or identify any attribution, then attempt to describe the
           | theme or idea it represents?
           | 
           | Got this
           | 
           | The image is a painting of a child running down a dirt road
           | with a kite in hand. The child is wearing a yellow shirt and
           | blue shorts. The background is a mix of urban and rural
           | elements, with buildings and trees in the background and a
           | field of grass in the foreground. The style of the painting
           | is a mix of realism and abstraction, with a focus on the
           | child and the kite. The artist may be attempting to convey a
           | sense of freedom and playfulness, with the child running with
           | the kite representing the joy and wonder of childhood.
           | 
           | The medium used is oil paint on canvas. The artist may have
           | used brushes and other tools to create the different textures
           | and effects in the painting. The artist may have also used
           | underpainting to establish the composition and values before
           | adding the final layers of color and detail.
           | 
           | It is difficult to identify the artist without more
           | information, but the style and technique of the painting
           | suggest that it may be the work of an artist who is skilled
           | in realism and abstraction, and who is interested in
           | exploring themes of childhood and playfulness.
        
             | pmoriarty wrote:
             | It completely missed what makes this painting distinctive
             | and funny/scary, and what's actually going on.
             | 
             | It also missed the right half of the painting.
             | 
             | Try asking it why the painting is called "Timmy's Last
             | Surprise".
        
       | dom96 wrote:
       | Can this decode CAPTCHAs? Are we finally in a place where
       | CAPTCHAs are useless?
        
         | spyder wrote:
          | Just tried it with a simple 4-character one and it's bad at
          | it: it detects one or two of the four characters correctly,
          | if it outputs anything at all. It's probably better with the
          | "select the traffic lights" kind of captchas, but those are
          | already solvable with other image models too, if I remember
          | correctly.
        
         | alex_duf wrote:
          | Would it be a good thing for captchas to be useless? In my
          | career I've been on the receiving end of spam attacks and
          | various nastiness a few times, and it's nice to be able to
          | integrate such a thing - as unpleasant as it is for the user
          | - and solve the problem at hand.
          | 
          | If this disappears then spammers and the various botnets
          | will have the upper hand again.
        
           | dom96 wrote:
            | It wouldn't be good - did I imply it would? I'm interested
            | to know if we are now at a point where they do become
            | useless.
        
         | dalys wrote:
         | I was also thinking about this the other day.
         | 
          | Idea one: Captchas will soon become pretty useless as an "is
          | this a human" tactic. Maybe they already are, I don't know.
          | What other things could we think of to prove someone is
          | human? I was watching Lex Fridman and Max Tegmark, and they
          | were remarking on how Twitter using payment as a
          | differentiator between human and bot is actually really
          | good, and maybe the only way we can reliably determine
          | whether someone is human right now, simply because having
          | thousands of bots do something that suddenly costs $5 per
          | event will deter most attacks. Integrating online
          | identification systems from various countries could be one
          | tactic (such as
          | https://en.wikipedia.org/wiki/BankID that we use in Sweden to
          | log in to basically any online service). New startup:
          | Un-botable authentication as a service.
         | 
          | Idea two: Once captchas are useless, we'll be able to build
          | bots that can do almost everything on the web. No need for
          | writing automation scripts, headless browsers, regexps etc.
          | Just feed
         | real visual data from browser to GPT-4 (or MiniGPT-4 or
         | similar). Give instructions like "You need to accomplish this
         | task: Go to facebook.com and create a user account and be
         | friends with 100 people and act like a human. Follow the
         | instructions on the website.". Then let the bot figure out
         | where to move the mouse and send click events, keyboard events
         | etc. Obviously much slower than a scripted bot, but would be
         | very hard to detect as a bot. Good for E2E testing also? New
         | startup: WebGPT: Bots that can do any task on any website.
         | TestGPT: E2E testing service. Just write test instructions like
         | "Make sure the login works on /login" and you're done! And you
         | thought flaky tests were a thing of the past... Would be kind
         | of cool for UX testing also. "You are a 43 year old truck
         | driver in Spain, please go to news.ycombinator.com and see how
         | easy it is for you to register an account, log in, go to any
         | news story and make a new comment. Report anything you think is
         | confusing."
        
       | tysam_and wrote:
       | I was not expecting this to be good due to the hype wave, but at
       | least looking at the examples, they seem very impressive and
       | worthy of the (potential) hype! I can't wait to see what new
       | waves of these models keep coming out.
       | 
       | I for one will probably still stay near the fundamental side of
       | things as all of this thrash is simply far too much for me, but
       | it is quite nice and beautiful for me to get to see all of the
       | variety and beauty of what is being built here. <3 <3 <3 <3
       | 
       | Now, to figure out a good way to integrate these seamlessly into
       | my development workflow....I go about as painfully low-tech as
       | one could. I might as well be using an electronic typewriter with
       | only Ctrl+F search, lolzers....
        
       | mdrzn wrote:
       | Interesting, will try it out.
       | 
        | The queue is about 100 people at the moment, with ~700s of
        | estimated waiting.
        
         | mritchie712 wrote:
         | 1200s now...
        
           | punkspider wrote:
            | Also, some links don't seem to open in a new tab, and if
            | you click around while you wait you might lose your
            | progress.
           | 
           | I clicked on the Video button while waiting, assuming that it
           | would open in a new tab, and lost my place in queue.
        
             | zapdrive wrote:
             | Also once you get your turn and ask your question, it puts
             | you back in the queue.
        
       | djangovm wrote:
        | If I want to learn how to read this code and understand what
        | it and its dependencies are doing, where do I start? Is
        | reading their paper an effective strategy?
       | 
       | https://github.com/Vision-CAIR/MiniGPT-4/blob/main/MiniGPT_4...
        
       | lxe wrote:
       | Why Vicuna over something like oasst-pythia? Would love to see a
       | table comparing all the new models side by side.
        
         | GaggiX wrote:
          | All Pythia models were trained on 300B tokens; LLaMA models
          | were trained on 1T/1.4T tokens.
        
         | lhl wrote:
         | Fabrice Bellard has run a standard set of benchmarks w/ lm-eval
         | on a big chunk of open models here:
         | https://bellard.org/ts_server/ - Flan T5 XXL and GPT-NeoX 20B
         | both outperform Pythia 12B on average (LLaMA 13B+ tops the
         | charts).
        
           | [deleted]
        
       | vlovic wrote:
        | Do I understand this correctly: they just took BLIP2 and
        | replaced the LLM with Vicuna, and to do that they just added a
        | single linear layer to translate between the frozen vision
        | encoder and (frozen) Vicuna? Additionally, and importantly,
        | they manually create a high-quality dataset for finetuning
        | their model.
       | 
       | If that is the case, then this is really a very, very simple
       | paper. But I guess simple things can lead to great improvements,
       | and indeed their results seem very impressive. Goes to show how
       | much low hanging fruit there must be in deep learning these days
       | by leveraging the amazing, and amazingly general, capabilities of
       | LLMs.
        
         | visarga wrote:
         | Yes, model composability magic.
        
       | 23459872345 wrote:
       | Has anyone gotten this running in Colab?
        
       | felipesabino wrote:
        | they lost me when the demo bot chose PHP for the website
        | builder demo /s
        
       | 152334H wrote:
       | On a technical level, they're doing something really simple --
       | take BLIP2's ViT-L+Q-former, connect it to Vicuna-13B with a
       | linear layer, and train just the tiny layer on some datasets of
       | image-text pairs.
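        | 
        | In rough pseudo-PyTorch, the whole bridge is something like the
        | sketch below (class name and dimensions are illustrative, not
        | the repo's actual code):
        | 
        |     import torch.nn as nn
        | 
        |     class VisionToLLMBridge(nn.Module):
        |         # maps frozen ViT-L+Q-former features into Vicuna's
        |         # token-embedding space; the only trained parameters
        |         def __init__(self, qformer_dim=768, llm_dim=5120):
        |             super().__init__()
        |             self.proj = nn.Linear(qformer_dim, llm_dim)
        | 
        |         def forward(self, image_feats):
        |             return self.proj(image_feats)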
       | 
        | But the _results_ are pretty amazing. It completely knocks
        | OpenFlamingo && even the original BLIP2 models out of the park.
        | And best of all, it arrived before OpenAI's GPT-4 image
        | modality did. A real win for open-source AI.
       | 
        | The repo's default inference code is kind of bad -- Vicuna is
        | loaded in fp16 so it can't fit on any consumer hardware. I
       | created a PR on the repo to load it with int8, so hopefully by
       | tomorrow it'll be runnable by 3090/4090 users.
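        | 
        | For reference, loading a model in 8-bit with Hugging Face
        | transformers + bitsandbytes looks roughly like this (a sketch,
        | not the exact PR; the weights path is a placeholder):
        | 
        |     from transformers import AutoModelForCausalLM
        | 
        |     model = AutoModelForCausalLM.from_pretrained(
        |         "path/to/vicuna-13b",   # placeholder path
        |         load_in_8bit=True,      # requires bitsandbytes
        |         device_map="auto",      # shard across available devices
        |     )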
       | 
       | I also developed a toy discord bot
       | (https://github.com/152334H/MiniGPT-4-discord-bot) to show the
       | model to some people, but inference is very slow so I doubt I'll
       | be hosting it publicly.
        
         | sho wrote:
         | > they're doing something really simple -- take BLIP2's
         | ViT-L+Q-former, connect it to Vicuna-13B with a linear layer,
         | and train just the tiny layer on some datasets of image-text
         | pairs
         | 
         | Oh yes. Simple! Jesus, this ML stuff makes a humble web dev
         | like myself feel like a dog trying to read Tolstoy.
        
           | selfhoster11 wrote:
           | Seriously, ChatGPT was the thing that gave me a foothold into
           | the AI/machine learning world... because it gave me hope that
           | a mere mortal can achieve something reasonable with this tech
           | without a crazy amount of work and educational background.
        
             | 152334H wrote:
             | I don't have much experience myself. I only started ~10
             | months ago -- just a month or two before Stable Diffusion.
             | 
             | You just have to do it every day. It's fun!
        
               | jack_riminton wrote:
               | Can you recommend what kind of small daily activities
               | would help a web dev get into it?
        
               | 152334H wrote:
               | Regardless of what you want to learn, "small daily
               | activities" is a bit hard. You can learn some stuff by
               | osmosis, following the feeds of AI devs && AI channels,
               | but the bulk of what I learn comes from starting projects
               | & digging into code & reading papers.
               | 
                | If you can hold your attention span over several days
                | (I can't), work on a project bit by bit. Just make sure
                | it uses modern AI stuff, and that you have smart people
                | around to talk things over with.
        
             | anonzzzies wrote:
              | There are really great resources now, from ELI5
              | explanations of all of this tech to books like 'The
              | Little Learner' which any programmer can get into. Yes,
              | it takes effort, but it is a great time for it.
        
           | pansa2 wrote:
           | Web stuff probably makes ML devs feel the same way.
           | 
           | ML is just a different field, using a different set of
           | technologies from those you're familiar with.
        
             | tourgen wrote:
             | [dead]
        
             | spaceman_2020 wrote:
             | The best ML PhDs can't do what frontend devs can:
             | understand CSS :D
        
           | ikurei wrote:
           | > This ML stuff makes a humble web dev like myself feel like
           | a dog trying to read Tolstoy.
           | 
           | Just like any discussion between advanced web devs would make
           | any humble woodworker feel?
           | 
           | And just like any discussion between advanced woodworkers
           | would make a humble web dev feel?
           | 
           |  _" It's really simple, they're just using a No. 7 jointer
           | plane with a high-angle frog and a PM-V11 blade to flatten
           | those curly birch boards, then a No. 4 smoother plane with a
           | Norris-type adjuster and a toothed blade for the final
           | pass."_
           | 
           | Whut?
           | 
           |  _" You could use Webpack to bundle your HTML, CSS and Babel-
           | transpiled TypeScript 5 down to shim-included Ecmascript 4"_,
           | _" They're just using OAuth2 authentication with Passport.js
           | and JWT tokens, which easily gets you CSRF protection"_, _"
           | Our e-learning platform uses LMS.js and xAPI.js, plus SCORM
           | for course packaging and Moodle as the LMS backend."_, ...
           | 
           | There was a time you didn't know what any of that meant.
           | 
            | Just because you don't know what the words mean doesn't
            | make the thing difficult. I'm not saying AI is easy, just
            | that the jargon is not a good indication of difficulty,
            | and we should know better than to be so easily mystified.
        
             | tracker1 wrote:
             | Okay, I won't mention how much is wrong in the webdev
             | statement... :-D
        
             | brycedriesenga wrote:
             | Hey, guys. Hey. Ready to talk plate processing and residue
             | transport plate funneling? Why don't we start with joust
             | jambs? Hey, why not? Plates and jousts. Can we couple them?
             | Hell, yeah, we can. Want to know how? Get this. Proprietary
             | to McMillan. Only us. Ready? We fit Donnely nut spacing
             | grip grids and splay-flexed brace columns against beam-
             | fastened derrick husk nuts and girdle plate Jerries, while
             | plate flex tandems press task apparati of ten vertipin-
             | plated pan traps at every maiden clamp plate packet.
             | Knuckle couplers plate alternating sprams from the t-nut to
             | the SKN to the chim line. Yeah. That is the McMillan way.
             | And it's just another day at the office.
        
               | SanderNL wrote:
               | Talk dirty to me!
        
               | intelVISA wrote:
               | runtime polymorphism
        
               | therein wrote:
               | Just tell me do we need a turbo encabulator or not?
        
               | Rzor wrote:
               | This post is double great and I will never forgive Amazon
               | for canceling that show.
               | 
                | For those that don't know, this is from a show called
                | Patriot.
               | 
               | https://en.wikipedia.org/wiki/Patriot_(TV_series)
               | 
               | Scene: https://youtube.com/watch?v=-F-IHvF5OCA
        
             | SpaceL10n wrote:
             | Web devs have become blue collar!? =P
             | 
             | Great idea, actually. I do hope for a curriculum that
             | enables kids on the trade school path to learn more about
             | programming. Why not Master/Journeyman/Apprentice style
             | learning for web dev??
        
               | wnolens wrote:
               | That's kind of how I think about bootcamps pumping out
               | web devs. They're like trade schools, teaching you just
               | enough fundamentals to know how to use existing tools.
        
               | tracker1 wrote:
               | Mostly agree... though I don't think the bootcamps get
                | enough fundamentals in. Not to mention that it takes
                | the type of person who will go above and beyond what's
                | been assigned to succeed as a productive employee in
                | the space. I'm self-taught, and in the first years of
                | my career I spent countless hours reading, practicing
                | and solving problems. I still spend a good 10-15 hours
                | a week reading and exploring software development and
                | try to at least keep up with what's out there. In the
                | end, the best you can do is be aware of what options
                | are out there, or even just that they exist.
               | 
               | I can't imagine starting out today...
        
               | ikurei wrote:
               | I kind of agree, but I'd add that I don't think it's a
               | bad thing.
        
             | f1codz wrote:
              | You make a good point. Except that a number of these
              | concepts and tools in the ML world have been slingshotted
              | to the forefront in a relatively short time, and it has
              | been hard to play catch-up. E.g. someone said "frozen
              | Vicuna" below - what does that mean?
        
               | idkyall wrote:
                | Vicuna is a specific open-source LLM ("frozen" just
                | means its weights aren't updated during training):
               | https://ai.plainenglish.io/vicuna-the-unparalleled-open-
               | sour...
        
             | birdyrooster wrote:
             | I love your analysis.
        
             | ttul wrote:
             | The thing is, machine learning sorta requires a few math
             | prerequisites: linear algebra, differential equations, and
             | to some degree vector calculus. Most web developers don't
             | have this background.
        
               | craigching wrote:
               | If you want to understand the theory, that's true. If you
               | want to develop an intuitive understanding without having
               | to understand all the nuts and bolts (and I understand
               | that can be a big ask for how some people
               | learn/understand), give this a try:
               | https://karpathy.ai/zero-to-hero.html
        
               | akiselev wrote:
               | The irony is Karpathy presents the limit/epsilon
               | definition of derivatives in the first half hour (quite
               | well IMO and he never actually says "epsilon") which is
               | very much a nuts and bolts kind of explanation in
               | calculus.
               | 
               | That said, when most people say differential equations
               | they're usually thinking of analytical solutions which is
               | very much not necessary for practical ML.
        
               | idiotsecant wrote:
               | I would say the limit epsilon derivative is exactly the
                | sort of thing the grandparent post is talking about.
                | It's quite intuitive and requires hardly any
                | mathematical foundation at all, other than basic geometry
               | and algebra. You can understand topics that build on that
               | simple concept without understanding the more formal
               | derivative definitions.
        
           | mjburgess wrote:
           | In practice, it's a lot more like web dev than you might
           | imagine.
           | 
            | The above means that the approach is web-dev-like gluing,
            | almost literally just:
            | 
            |     from existingliba import someop
            |     from existinglibb import anotherop
            |     from someaifw import glue
            | 
            |     a = someop(X)
            |     b = glue(a)
            |     Y = anotherop(b)
        
             | amelius wrote:
              | It's more like gardening:
              | 
              |     1. plant seed
              |     2. ...wait a very long time...
              |     3. observe completely unexpected but cool result
             | 
             | The unexpected part of step 3 is what makes this very
             | different from any kind of engineering, even webdev.
             | 
             | Of course, there _is_ a lot of engineering involved in good
             | ML, but that is more comparable to agricultural engineering
              | in the sense that it's just a lot of dumb plumbing that
             | any engineer can do without knowledge of the actual
             | application.
        
               | tudorw wrote:
               | count me in :)
        
               | kerkeslager wrote:
               | I mean, for me, the unexpected part of 3 is what got me
               | into programming in general. The first time you type a
               | mysterious incantation into an editor and a few more
               | mysterious incantations into the console and the console
               | prints "Hello, world" like it was supposed to, it's
               | unexpected because it's hard to believe that any of this
               | mysterious incantation stuff actually works at all.
               | 
               | As you get better at programming you have to take on
               | harder problems to create the surprise of something
               | working, because you gain confidence, and as you gain
               | confidence, you start expecting your code to work. It's
               | only when you've compiled the thing 6 times with small
               | corrections and gotten segfaults each time and the 7th
               | time you finally find the place you weren't updating the
               | pointer and you correct it, but this is the 7th error
               | you've corrected without the segfault going away, so you
               | don't really expect it to fix the problem, but then you
               | run it and it's fixed!
               | 
               | And then you get a job and the reality is that most of
               | the jobs you're just writing CRUD apps and for a little
               | while you can get some surprise out of learning the
               | frameworks, but eventually you actually get really,
                | really knowledgeable about the Postgres/Django/React
                | stack and nothing surprises you any more, but _because_
                | nothing surprises you any more, you're really effective
               | and you start being able to bill the big bucks _but only
               | for work on that stack_ because it takes time to struggle
               | enough to get surprised, and the time that takes means
               | your time is worth less to your clients. Money ruins
                | everything. And if you don't do anything non-billable,
               | it's easy to forget what programming felt like when you
               | didn't know how your tools all worked inside and out. Not
               | everyone takes this path but it's certainly the easiest
               | path to take.
               | 
               | I think for a lot of folks who have been doing this for a
               | long time, the reason ML is so exciting is it's getting
               | them back out of their comfort zone, and into a space
               | where they can experience surprise again.
               | 
               | But that surprise has always been available if you
               | continue to find areas of programming that push you out
               | of your comfort zone. For me it's been writing
               | compilers/interpreters for programming languages.
               | _Crafting Interpreters_ was awesome: for the first time I
               | benchmarked a program written in _my_ language against a
               | Python program, and _my_ program was faster: I never
                | expected I'd be able to do that! More recently, I wrote
               | a generational GC. It's... way too memory-intensive to be
               | used in my language which uses one-GC-per-thread for
               | potentially millions of threads, but it certainly was a
               | surprise when that worked.
               | 
               | Personally, I'm keeping track of ML enough to know broad
               | strokes of things but I'm not getting my hands dirty with
               | code until there are some giants to stand on the
               | shoulders of. Those may already exist but it's not clear
               | who they are yet. And I've got very little interest in
               | plugging together opaque API components; I know how to
               | make an API call. I want to write the model code and
               | train it myself.
        
               | incidentnormal wrote:
               | I like how you've expressed this insight, and it is so
               | true.
               | 
               | Becoming great at a particular technology stack means
               | modelling it in great detail in your head, so you can
               | move through it without external assistance. But that
               | leaves an arena without discovery, where you just
               | reinforce the same synapses, leading to rigidity and an
               | absence of awe.
        
             | teruakohatu wrote:
             | There is a little more to it than that. Abstractions in ML
             | are very leaky.
        
             | marban wrote:
             | And repeat that ~4 times to make it look like LangChain
        
             | guax wrote:
             | And just like webdev, each of those were done in a
             | different platform and require arcane incantations and 5h
             | of doc perusing to make it work on your system.
        
               | KaoruAoiShiho wrote:
               | Buddy this ain't 2022 anymore, ask chatgpt (with a plugin
               | that can read docs).
        
               | MattPalmer1086 wrote:
               | You can just ask GPT how to do it. Much like a lot of web
               | dev!
        
               | kerkeslager wrote:
               | And the code GPT gives you won't work, much like a lot of
               | web dev? ;P
        
               | pc86 wrote:
               | Maybe it's because of how I use it, but the code ChatGPT
               | gives me has always been super helpful and 99% correct.
                | But we have a policy at work not to use it for work
                | product, so I have to spend time changing enough of it
                | that it's different, and I'm never copy/pasting
                | anything. I make enough changes to the structure and
                | variables
               | to make it sufficiently different that it can't be
               | considered pasting company data into GPT, ask my
               | question(s), see what comes back out, refactor/type
               | manually into my IDE, test. I'd say one out of every 8-9
               | times I get something objectively wrong - a method that
               | doesn't exist, something not compiling, etc. But it's
               | faster than using google/DDG, especially with some
               | prompting so that it just spits back code and not 5th-
               | grade level explanatory paragraphs before and after. And
               | well over half the time it does exactly what I need or
               | sufficiently close that my initial refactoring step gets
               | me the rest of the way.
        
               | MattPalmer1086 wrote:
               | Would you say that this satisfies the spirit of the
               | company policy? Or is it a bit of a hack to get around
               | it?
               | 
               | I ask because we are about to produce a similar policy at
               | work. We can see the advantages of it, but likewise, we
               | can't have company data held in their systems.
        
               | pc86 wrote:
               | The policy is to not send any "sensitive company data"
               | into ChatGPT, which I 100% agree with. How we implement a
               | given Vue component or a particular API isn't sensitive
               | or particularly novel so if I strip the business logic
               | out I do honestly believe I'm complying with the spirit
               | of the policy.
        
               | SanderNL wrote:
               | If I use it I also make sure it's something completely
               | non-core business, like an arcane piece of sorting or
               | ugly rxjs construction.
               | 
               | I get the IP angst, but some companies think their
               | GetGenericObjectFromDB() REST bs is secret sauce.
        
               | intelVISA wrote:
               | To the average VC a computer switching on is secret sauce
               | enough, the rest is really just an implementation detail.
        
               | Gravityloss wrote:
                | At some point someone will make a service where you can
                | let AI take over your computer directly. Easier that
                | way! Curling straight to shell, taken to the next level.
        
               | wrayjustin wrote:
               | So...AutoGPT? Now with command-line access! Have fun :)
               | 
               | https://github.com/Significant-Gravitas/Auto-GPT/
        
               | vorticalbox wrote:
               | Found my next hobby project
        
               | minimaxir wrote:
               | This is why the Hugging Face transformer ecosystem is so
               | good, as each of those blocks will roughly have the same
               | unified API.
        
           | MrGilbert wrote:
           | Just get rid of all the abbreviations in your mind - they
           | seem to be very intimidating. I really liked the explanation
           | that Stephen Wolfram did on ChatGPT:
           | 
           | https://writings.stephenwolfram.com/2023/02/what-is-
           | chatgpt-...
           | 
           | Maybe someone has resources to understand machine-learning on
           | an ELI5 level.
        
             | wwalexander wrote:
             | Wow, he waits until halfway through the article to mention
             | A New Kind of Science. Usually he works it into the first
             | couple of paragraphs!
        
               | alchemist1e9 wrote:
                | I know it's hard to believe, but I sense LLMs have
                | slightly knocked his ego down and injected a small dose
                | of humility.
               | 
               | https://youtu.be/z5WZhCBRDpU
               | 
                | I pick that up in the above video and also in the post
                | above.
                | 
                | Definitely healthy for him. Just to be clear, I'm a
                | huge Wolfram fan and the ego doesn't really bother me -
                | it's just part of who he is - however I do find it nice
                | that LLMs are having him self-reflect more than usual.
        
               | HarHarVeryFunny wrote:
               | Not a big Wolfram fan myself. I gave him the benefit of
               | the doubt and bought "A New Kind of Science" (freakin'
               | expensive when it first came out), and read the whole
               | 1280 pages cover to cover ... Would have been better
               | presented as a short blog post.
               | 
               | I find it funny how despite being completely uninvolved
               | in ChatGPT he felt the need to inject himself into the
               | conversation and write a book about it. I guess it's the
               | sort of important stuff that he felt an important person
               | like himself should be educating the plebes on.
               | 
               | Predictably he had no insight into it and will have left
               | the plebes thinking it's something related to MNIST and
               | cat-detection.
        
               | MrGilbert wrote:
                | I just happened to read this article of his, which I
                | found easy to understand. I'm neither a huge proponent
                | nor
               | opponent of the likes of his work. Or, bluntly speaking:
               | I don't know much else about his reputation in the
               | community.
        
           | pc86 wrote:
           | Big "a monad is just a monoid in the category of
           | endofunctors" vibes from this one.
        
           | joaogui1 wrote:
           | FWIW I work in LLMs and I consistently fail to do simple
           | webdev stuff
        
           | craigching wrote:
           | I was where you're at about ... oh wow, it's been almost ten
           | years since I jumped into machine learning. Mind you, I've
           | been learning on the side most of this time other than a
           | theoretical class at the University of Minnesota. But, that
           | aside, and depending on where you're at in your
           | understanding, this is a great resource for catching up if
           | you're really interested: https://karpathy.ai/zero-to-
            | hero.html - it was posted on HN a couple of weeks ago, and
            | I have to say it's a really good introduction, and Andrej
            | Karpathy is a passionate and excellent teacher. You may want
           | to brush up on some intro Calculus, but it's very
           | understandable.
        
           | thewarrior wrote:
           | I've only been reading ML stuff for a few months and I kind
            | of understand what it's saying. This stuff isn't as complex
            | as it's made out to be.
           | 
           | It's just a bunch of black boxes AKA "pure functions".
           | 
            | BLIP2's ViT-L+Q-former AKA
            | 
            |     // I give you a picture of a plate of lobster,
            |     // it will say "A plate of lobster".
            |     getTextFromImage(image) -> Text
            | 
            | Vicuna-13B AKA
            | 
            |     // I give you a prompt and you return a completion,
            |     // ChatGPT style.
            |     getCompletionFromPrompt(text) -> Text
           | 
            | We want to take the output of the first one and then feed
            | in a prompt to the LLM (Vicuna) that will help answer a
            | question about the image. However, the datatypes don't
            | match. Let's add in a mapper.
            | 
            |     getAnswerToQuestion(image, question) -> answer
            |         text = getTextFromImage(image)
            |         prompt = mapTextToPrompt(text)
            |         return getCompletionForPrompt(prompt)
           | 
           | Now where did this mapTextToPrompt come from ?
           | 
           | This is the magic of ML. We can just "learn" this function
           | from data. And they plugged in a "simple" layer and learned
            | it from a few examples of (image, question) -> answer. This
            | is what frameworks like Keras and PyTorch allow you to do. You
           | can wire up these black boxes with some intermediate layers
           | and pass in a bunch of data and voila you have a new model.
           | This is called differentiable programming.
           | 
            | The thing is, you don't need to convert to text and then
            | map back into numbers to feed into the LLM. You skip that
            | and use the numbers it outputs, multiplying them directly
            | with an intermediate matrix.
            | 
            |     getAnswerToQuestion(image, question) -> answer
            |         imageEmbedding = getEmbeddingFromImage(image)
            |         embedding = mapEmbeddingToInputEmbeddingForLLM(imageEmbedding)
            |         return getCompletionForEmbedding(embedding)
           | 
            | Congratulations, you now understand that sentence.
        
             | artificial wrote:
             | Thank you for the insightful breakdown. Cheers!
        
             | eternauta3k wrote:
             | Interesting, so the LLM is "just" getting your question
             | plus a normal text description of the image (as vectors)?
        
               | thewarrior wrote:
               | At a high level yes.
               | 
                | More precisely - it gets the question, plus the image
                | representation after it's passed through a matrix that
                | transforms it so the LLM can "understand" it.
               | 
               | It maps from the space of one ML model to the other.
        
           | visarga wrote:
           | > take BLIP2's ViT-L+Q-former
           | 
           | This thing takes an image and creates a representation
           | matrix.
           | 
           | > connect it to Vicuna-13B with a linear layer
           | 
           | Vicuna is an open LLM, pretty good quality, not as good as
           | GPT3.5 though.
           | 
           | This is the beautiful part - a mere multiplication is enough
            | to convert the image tensor to a text tensor. One freaking
            | line
           | of code, and a simple one.
           | 
           | > and train just the tiny layer on some datasets of image-
           | text pairs
           | 
           | You then get a shitload of image-text pairs and train the
           | model to describe the images in text. But keep both the image
           | and text model frozen. Is that hard? No, just flip a flag. So
           | this "linear projection layer" (a matrix multiplication) is
           | the only learned part. That means it takes less time to
           | train, needs fewer examples and requires less memory.
           | 
            | Training the image and text models was much more difficult.
            | But here we don't train those models; they are used as
            | ready-made parts. It's a hack on top of two unrelated
            | models, so it is cheap.
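            | 
            | A minimal sketch of that setup in PyTorch (the encoder and
            | LLM here are stand-ins, not the actual repo classes):
            | 
            |     import torch
            |     import torch.nn as nn
            | 
            |     vision_encoder = nn.Identity()  # stand-in for frozen ViT-L+Q-former
            |     llm = nn.Identity()             # stand-in for frozen Vicuna-13B
            | 
            |     for m in (vision_encoder, llm): # "just flip a flag"
            |         for p in m.parameters():
            |             p.requires_grad = False
            | 
            |     proj = nn.Linear(768, 5120)     # the only learned part
            |     opt = torch.optim.AdamW(proj.parameters())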
           | 
            | In the end, the finishing touches - they label 3500 high-
            | quality image-text pairs and fine-tune on them. Now the
            | model becomes truly amazing. It has broad visual
            | intelligence, and scooped OpenAI, who haven't released
            | GPT-4's image input in the API yet.
           | 
           | The important lesson to take is that unrelated models can be
           | composed together with a bit of extra training for the glue
           | model. And that open AI is just as powerful as "Open"AI
           | sometimes. It's breathing down their necks, just one step
           | behind. This model is also significant for applications - it
           | can power many automations in a flexible way.
        
             | rafaelero wrote:
             | > This is the beautiful part - a mere multiplication is
             | enough to convert the image tensor to text tensor. One
             | freaking line of code, and a simple one.
             | 
             | I thought they were creating image tokens based on the
             | queries during finetuning and appending them to the
             | language model. They are not text tokens.
        
           | relativeadv wrote:
           | > like a dog trying to read Tolstoy
           | 
           | this got a chuckle out loud from me. great visual.
        
             | egeozcan wrote:
             | This could be a great prompt to test the limits of txt2img
             | models. The astronaut riding a horse got boring already :)
        
           | mewpmewp2 wrote:
           | Only because of big complicated sounding terms, that also
           | exist in web dev.
        
           | macawfish wrote:
           | Arf!
        
         | seydor wrote:
         | Then it's an impressive demonstration of how modular neural
         | networks can be. Maybe we don't even need to train monoliths
        
           | artificial wrote:
           | Maybe a distributed trainer? AI@Home?
        
         | kramerger wrote:
         | Can any of this realistically run on CPU at some point?
         | 
         | (Not training obviously)
        
           | tantony wrote:
           | Someone is probably going to port it to llama.cpp soon.
        
           | lhl wrote:
           | Yes, you can run inference at decent speeds on CPU with
            | llama.cpp. A token is about 0.75 words, so you can see lots
           | of people getting 4-8 words/s on their CPUs:
           | https://github.com/ggerganov/llama.cpp/issues/34
           | 
            | There are a lot of optimizations that can be done. Here's
            | one w/
           | potentially a 15X AVX speedup for example:
           | https://github.com/ggerganov/llama.cpp/pull/996
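            | 
            | For a concrete sense of usage, the llama-cpp-python
            | bindings wrap it roughly like this (the model path is a
            | placeholder):
            | 
            |     from llama_cpp import Llama
            | 
            |     llm = Llama(model_path="models/ggml-vicuna-13b-q4_0.bin")
            |     out = llm("Q: Describe a bowl of ramen. A:", max_tokens=128)
            |     print(out["choices"][0]["text"])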
        
           | anonzzzies wrote:
           | It will do, probably quite soon. Many people are trying.
        
           | GaggiX wrote:
           | I've run LLaMa models on my CPU before, ViT-L and the
           | Q-former are two transformer models as well, so I can't see
           | why they wouldn't run on a CPU.
        
           | Ambix wrote:
            | I'm developing a framework [1] in Go with this goal in
            | mind :) It successfully runs relatively big LLMs right now,
            | and diffusion models will be the next step.
           | 
           | [1] https://github.com/gotzmann/llama.go/
        
           | muyuu wrote:
            | Quantized Vicuna runs ok-ish on my 16GB i7 laptop (onboard
            | graphics) and the output is usable.
           | 
           | see this comparison: https://old.reddit.com/r/LocalLLaMA/comm
           | ents/12ezcly/compari...
           | 
            | These models quantized to 4-bit should run on CPU setups
            | with 16GB of RAM + 16GB of swap (Linux), and perhaps other
            | setups will run similarly.
        
         | HopenHeyHi wrote:
         | Thanks for a useful comment.
         | 
         | Do you reckon the 4bit quantized Vicuna just won't do here?
         | https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-1...
         | 
         | I think with this everything OpenAI demonstrated ~5 weeks ago
          | has been recreated by actually-open AI. Even if it runs much,
          | much slower on prosumer hardware and with worse results, at
          | least it is de-magicked.
        
           | p1esk wrote:
           | The magic is in the quality of GPT-4 output. That hasn't been
           | recreated yet.
        
           | kolinko wrote:
            | Open AI still hasn't exactly reached the level of GPT-3.5.
            | GPT-4 is way ahead of anything else.
        
           | 152334H wrote:
           | It'll work! I just haven't touched any of the 4bit stuff
           | myself, so I don't personally know how to add it. Great low-
           | hanging fruit for anyone else to take on.
        
         | chenzhekl wrote:
         | It sounds like a BLIP2 with an extra linear layer for
         | finetuning (or aligning the Q-former with a new LLM?). What
         | makes it more powerful than BLIP2?
        
           | 152334H wrote:
           | It's better because
           | 
           | 1. it's using vicuna as a base.
           | 
           | 2. It has a pretty high quality fine-tuning dataset. I
           | initially missed this, and it's a very important advantage.
           | 
           | 3. (speculatively) it doesn't collapse to extremely short
            | responses (which BLIP2 and other models trained on image-
            | text caption pairs do) because of how small/simple the
            | adapter is.
           | 
           | I was interested in training a BLIP2-LLaMA model before this,
           | and I might still do it just to test (3).
        
         | matthewfcarlson wrote:
         | >so hopefully by tomorrow it'll be runnable by 3090/4090 users.
         | 
         | Taking a step back, this is just a wild statement. I know
         | there's some doom and gloom out there, but in certain aspects,
         | it's an awesome time to be alive.
        
         | cs702 wrote:
         | Indeed, really simple. And yes, the results are shockingly
         | good. But what I find most remarkable about this is that the
         | ViT-L+Q-former's hidden states _are related by only a linear
          | projection_ (plus bias) to Vicuna-13B's token embeddings:
          | 
          |     emb_in_vicuna_space = emb_in_qformer_space @ W + B
         | 
         | These two models are trained independently of each other, on
         | very different data (RGB images vs integer token ids
         | representing subwords), and yet somehow they learn to embed
         | different data in feature vectors that are so... _similar_. WHY
         | should that be the case?
         | 
         | It suggests to me there may be something _universal_ about the
         | embedding layers and hidden states of _all_ trained deep
         | learning models.
        
           | sigmoid10 wrote:
           | >somehow they learn to embed different data in feature
           | vectors that are so... similar
           | 
            | At its core, BLIP2 already projects RGB inputs into text
           | token space and Vicuna (or rather LLaMA) uses such tokens as
           | inputs as well as outputs. The only reason why a linear layer
           | is needed at all is because they are not trained at the same
           | time, so you still have to move text embeddings from one
           | space to another. But it should not be surprising at all that
           | one hidden linear layer suffices to do just that (see the
           | universal approximation theorem [1]). This approach is just
           | an efficient way to combine different models for downstream
           | fine-tuning tasks while keeping their weights frozen, but it
           | is neither new nor particularly surprising.
           | 
           | [1] https://en.wikipedia.org/wiki/Universal_approximation_the
           | ore...
        
             | cs702 wrote:
             | Thanks. Your comment about BLIP2 already projecting RGB
             | inputs into (a different) text token space makes sense to
             | me. See also fpgaminer's comment at
             | https://news.ycombinator.com/item?id=35603246 . However, I
             | don't see how the universal approximation theorem is
             | relevant here. The fact that deep models with sufficient
             | capacity can approximate any function does not imply that
             | two deep models trained independently of each other on
             | different tasks will learn to approximate functions that
             | relate to each other only by a linear transformation.
        
               | sigmoid10 wrote:
               | >I don't see how the universal approximation theorem is
               | relevant here. The fact that deep models
               | 
                | The universal approximation theorem is exactly _not_
                | about deep models. Deep means many layers. But in the
                | simplest (and proven) case, a single-hidden-layer
                | perceptron is all it takes, according to the UAT.
                | Technically it also
               | needs a nonlinear activation function, but you get all
               | sorts of nonlinearities for free downstream anyways in
               | this particular model.
        
           | fpgaminer wrote:
           | BLIP2 is a contrastive Image-Language model. The embeddings
           | from the BLIP2 image model are already both aligned with
           | text, and linear. It should not be a surprise that only a
           | projection is required to translate it to LLaMA's embedding
           | space.
        
             | cs702 wrote:
             | This is the best answer. It makes sense to me. Thank you
             | :-)
        
           | AbrahamParangi wrote:
           | I think it's just that affine transforms in high dimensions
           | are surprisingly expressive. Since the functions are sparsely
           | defined they're much less constrained compared to the low
           | dimensional affine transformations we usually think of.
        
             | cs702 wrote:
             | Good point. Didn't think of that. It's a plausible
             | explanation here, because the dimensionality of the spaces
             | is so different, 5120 vs 768. Not surprisingly, the trained
             | weight matrix has rank 768: it's using every feature in the
             | lower-dimensional space.
             | 
             | Still, it's kind of shocking that it works so well!
             | 
             | I'd be curious to see if the learned weight matrix ends up
             | being full-rank (or close to full-rank) if both spaces have
             | the same dimensionality.
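              | 
              | (Checking that is a one-liner in PyTorch; the matrix
              | below is a random stand-in for the learned weights:)
              | 
              |     import torch
              |     W = torch.randn(5120, 768)  # stand-in for the learned projection
              |     print(torch.linalg.matrix_rank(W))  # prints 768, i.e. full rank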
        
               | visarga wrote:
                | It would be full rank, because all of the embedding
                | space is used. There are no large unused pockets.
        
               | cs702 wrote:
               | The weight matrix's rank would decrease for each feature
                | in the target space that cannot be expressed as a
               | linear combination of features in the input space (plus a
               | bias). For example, if the target space has a feature
               | representing a non-visual quality like "smelliness," it
               | would not be expressible as a linear combination of
               | features representing visual attributes like "redness,"
               | "blueness," and "greenness," etc. in the input space.
               | 
               | If both spaces have the same dimensionality, the learned
               | weight matrix would be full-rank _only_ if every feature
               | in the target space is expressible as a linear
               | combination of features in the input space (plus a bias).
               | Which brings me back to my original question: WHY would
               | that be the case when the two models are trained
               | independently on data that is so different?
        
               | sdenton4 wrote:
               | A random nxn matrix is full rank... So it's kinda the
               | default: any amount of noise in the embedding is going to
               | result in full-rank transformations.
               | 
               | So it's really less-than-full rank which would require an
               | explanation - ie, why does this image representation
               | project into this perfectly isolated subspace of the
               | language representation (or vice versa)?
               | 
               | If that happened I would start looking for things like a
               | vocabulary of smell which is completely distinct and non-
               | overlapping with any visual context. But we use cross-
               | modal analogies in language /constantly/ (many smells are
               | associated with things we can see - 'smells like a rose')
               | so you wouldn't expect any clean separations for
               | different modalities... Maybe there's some branch of
               | analytic philosophy which has managed to completely
               | divorce itself from the physical world...
        
         | eurekin wrote:
         | > I created a PR on the repo to load it with int8, so hopefully
         | by tomorrow it'll be runnable by 3090/4090 users.
         | 
         | How about 2x3090? Can it be run on multiple gpus?
        
         | collaborative wrote:
         | With fp8, would 4GB be enough or is 6GB more like it?
        
         | jxy wrote:
          | This opens up huge possibilities. It's likely we could simply
          | plug in Stable Diffusion using a linear layer, as well as
          | Whisper and some TTS, getting a back-to-back mixed
          | image/sound/text engine running on a laptop.
          | 
          | I wonder if there's a ViT model powerful enough to do OCR.
        
       | m00dy wrote:
       | Hi CHEN,
       | 
        | Invoker here. I would like to have a chat - send me an email at
        | community@invoker.network
        
         | endorphine wrote:
         | DotA?
        
           | m00dy wrote:
           | I'm way better than him ;)
        
       | xnzakg wrote:
       | The ramen example is kind of hilarious. Wonder if it would make
       | more sense with a bigger model.
        
         | m348e912 wrote:
            | Ok I'll bite. What makes the ramen example so funny?
        
           | burkaman wrote:
           | The recipe says to make a broth, cook the noodles in the
           | broth, then throw the broth away and add the noodles to a pan
           | with meat. You'll end up with burned crispy noodles and meat
           | in a dry bowl.
        
           | Thomashuet wrote:
           | The recipe it suggests is completely off. Just search for
           | "ramen recipe" and compare. For instance it tells you to add
           | soy sauce, sesame oil and green onions to the broth (so far
           | so good) but then tells you to throw all of it away ("once
           | the noodles are cooked, drain them"). The problem here is
           | that you should cook the noodles separately in plain water,
           | not a broth with soy sauce.
        
       | thatsadude wrote:
        | How is this a paper? They simply fine-tune the fully-connected
       | layer of BLIP-2 https://arxiv.org/pdf/2301.12597.pdf
        
       | nobrains wrote:
       | I think the real news is where this research is coming from. It
       | is from King Abdullah University of Science and Technology in
        | Saudi Arabia. The Arab world has recently been quite behind in
        | tech research, but this is good news for the region: it will
        | inspire other universities and researchers there to one-up
        | their colleagues, which is good for the entire world.
        
         | sterlind wrote:
         | I would be more thrilled if it were literally any other country
         | in the Arab world, but maybe that will come.
        
       | Imnimo wrote:
       | I'm pretty surprised that the frozen encoder has enough
       | information in its representations to do the "design the website
       | from a hand-written whiteboard mock-up" task, and that that
       | information can be effectively accessed by the language model
       | with just a small projection layer. I really would have guessed
       | that you needed to be training the image encoder simultaneously.
        
       | devinprater wrote:
       | Holy crap! If this has image to text, and can run on a CPU, even
       | with 16 GB RAM, one could hook up an NVDA addon, straight Python
       | so it shouldn't be too hard, and have AI read whatever the user
        | wants. Pictures, video frames, UIs, and if it can break the UI
       | down into something the screen reader can understand, boom!
        
       | m3kw9 wrote:
       | This is basically Image to Text, then from the text, start a GPT
       | type chat. No need to overthink anything
        
       ___________________________________________________________________
       (page generated 2023-04-17 23:00 UTC)