[HN Gopher] MiniGPT-4
___________________________________________________________________
 
MiniGPT-4
 
Author : GaggiX
Score  : 790 points
Date   : 2023-04-17 08:56 UTC (14 hours ago)
 
(HTM) web link (minigpt-4.github.io)
(TXT) w3m dump (minigpt-4.github.io)
 
| macrolime wrote:
| A modified BLIP-2 that works with diffusion models rather than
| vision transformers would be pretty cool. Using Vicuna-13B or
| another large language model as the language model of Stable
| Diffusion, instead of the text encoder of CLIP, would be a game
| changer: it would completely change how you prompt Stable
| Diffusion, from something kinda like a tag list to it being able
| to follow instructions in English instead.
| jokoon wrote:
| Are there more university/research-centric resources or
| communities regarding AI? I mean things on a more epistemic and
| higher level, that disregard current ML techniques and
| implementations.
|
| I have no problem when a lot of money is spent to implement
| things that can be used and tested, but I wish I could read
| about the next step of AI, something that involves science and
| not software engineering.
|
| I've yet to read anything related to the analysis of training
| data or trained neural networks, the structure of the brains of
| small insects, the psychology of general intelligence, AI and
| emotions, the real ability of an AI to think and write
| meaningful code or understand what a problem/solution is, etc.
| I mean anything related to AI that can be treated by computer
| science.
|
| I have no problem with algorithms and implementations, but
| don't universities/professors also try to go beyond, with
| epistemology, to connect the dots with biology and neurology?
|
| Unless I see science cover several fields of research, it's not
| real, worthy research in my view, which is the reason I'm so
| stubbornly skeptical of ChatGPT.
| thewarrior wrote:
| We've tried, but we understand even less about how "real"
| intelligence works. At least in a way that we can replicate
| it.
|
| The people doing so-called alchemy are making far better
| systems than anyone who's trying to actually understand
| things. And we've been trying for decades now.
| visarga wrote:
| If the engineering part is of secondary importance to you,
| then at least remember the dataset. It is in the composition
| and quality of the training data that all the skills find
| their origin.
|
| Most of our discussions are about model size, but few are
| about the dataset. Yet all the scaling laws hint at the great
| usefulness of more data. Sometimes even a little data can
| have a great impact in the fine-tuning phase. In the end it
| is the training data that transforms a random init into the
| model.
| Tepix wrote:
| I think it's poor form that they are taking the GPT-4 name for
| an unrelated project. After all, the underlying Vicuna is merely
| a fine-tuned LLaMA. Plus they use the smaller 13B version.
|
| The results look interesting, however.
|
| Here's hoping that they'll add GPTQ 4-bit quantization so the
| 65B version of the model can be run on 2x 3090.
| seydor wrote:
| The way things are going, very soon GPT-4 will not be
| something to aspire to.
| twelvechairs wrote:
| It's poor form to be calling it 'Open' AI. So I guess it's
| swings and roundabouts. Everyone is leeching where they can.
| iandanforth wrote:
| To be fair, they were open when that name was picked, and
| it looks like they may be trying to transition to just
| 'ai.com'.
| teacpde wrote:
| Wow, that must be an expensive domain name.
| was_a_dev wrote:
| I'm sure they can afford it.
|
| But justai.com would also be apt.
| philosopher1234 wrote:
| They already did afford it. It redirects to ChatGPT.
| xiphias2 wrote:
| Wow, how come that's not widely known?
| sebzim4500 wrote:
| What about this URL redirect is interesting enough to you
| that you think it should be widely known?
| alsodumb wrote:
| It's pretty clear that they're doing it for citation farming.
| I've seen another paper from KAUST that uses a fancy title on
| a trivial paper.
| collaborative wrote:
| Adding GPT-4 to anything now multiplies the marketing by 4. So
| much AI news is coming out lately that not adding it risks
| drowning in a sea of info, even in the case of a good project.
| Uehreka wrote:
| The word for this is "trademark infringement". You are
| specifically not allowed to capitalize on the marketing of
| another entity's product to bolster yours by implying through
| your name that you are somehow related. This is why "DALL-E
| Mini" had to change its name to Craiyon.
| HarHarVeryFunny wrote:
| It's also just (deliberately) misleading. It's based on the
| 13B Vicuna/LLaMA model, not 175B GPT-3 or 1T GPT-4. There
| is zero justification for calling it MiniGPT-4.
|
| A more honest name would be Visual-Vicuna or Son-of-BLIP.
| sebzim4500 wrote:
| I don't see how it's misleading. MiniGPT-4 makes it sound
| like a smaller alternative to GPT-4; if it were based on
| GPT-4 there would be nothing 'mini' about it.
| HarHarVeryFunny wrote:
| It has more in common with GPT-3 than GPT-4 in terms of
| size, but in reality it's based on Vicuna/LLaMA, which is
| 10x smaller than either, so as far as the LLM part of it
| goes it's not mini-anything - it's just straight-up
| Vicuna 13B.
|
| The model as a whole is just BLIP-2 with a larger linear
| layer, and using Vicuna as the LLM. If you look at their
| code it's literally using the entire BLIP-2 encoder
| (Salesforce code).
|
| https://arxiv.org/pdf/2301.12597.pdf
| [deleted]
| Tepix wrote:
| > _1T GPT-4_
|
| The number of parameters used for GPT-4 is unknown.
| HarHarVeryFunny wrote:
| I got the 1T GPT-4 number from here - this is the video
| that goes with the Microsoft "Sparks of AGI" paper, by a
| Microsoft researcher who had early access to GPT-4 as
| part of their relationship with OpenAI.
|
| https://www.youtube.com/watch?v=qbIk7-JPB2c
| sandkoan wrote:
| Bubeck has clarified that the "1 trillion" number he
| was throwing around was just a hypothetical metaphor;
| it was in no way, shape, or form implying that GPT-4
| has 1 trillion parameters [0].
|
| [0] https://twitter.com/SebastienBubeck/status/16441515797238251...
| HarHarVeryFunny wrote:
| OK - thanks!
|
| So we're back to guessing...
|
| A couple of years ago Altman claimed that GPT-4
| wouldn't be much bigger than GPT-3, although it would
| use a lot more compute.
|
| https://news.knowledia.com/US/en/articles/sam-altman-q-and-a...
|
| OTOH, given the massive performance gains scaling from
| GPT-2 to GPT-3, it's hard to imagine them not wanting
| to increase the parameter count at least by a factor of
| 2, even if they were expecting most of the performance
| gain to come from elsewhere (context size, number of
| training tokens, data quality).
|
| So in the 0.5-1T range, perhaps?
| collaborative wrote:
| At this point the letters GPT make more sense than "AI" or
| "LLM" in many people's minds.
| Uehreka wrote:
| Hard disagree.
|
| Outside of the brand name ChatGPT, lay members of the
| general public are way more likely to call these chatbots
| (like Bard and Bing) "AIs" than "GPTs". And while GPT
| could technically refer to any model that uses a
| Generative Pre-trained Transformer approach (though it
| probably wouldn't be an open-and-shut case), the mark
| "GPT-4" definitely is associated with OpenAI and their
| product, and you can't just use it without their
| permission.
| collaborative wrote:
| So OpenAI ostensibly owns "GPT4" according to your
| argument. But does it own "MiniGPT4"? I hope you see the
| absurdity of this.
|
| Let's not discuss the number of copyright licenses OpenAI
| has already infringed, too.
| Uehreka wrote:
| I'll put it this way:
|
| At Brewer's Art in Baltimore, MD they just released a
| beer called GPT (Green Peppercorn Tripel)[1]. They're
| likely allowed to do that because a reasonable consumer
| would probably not actually think they had collaborated
| with OpenAI, because OpenAI does not make beer.
|
| OP is releasing a model called "MiniGPT-4". A reasonable
| consumer could look at that name and become confused
| about the origin of the product, thinking it was from
| OpenAI. This would be understandable, since OpenAI also
| makes large language models and has a well-known one
| that they've been promoting whose brand name is "GPT-4".
| If MiniGPT-4 does not meet that consumer's expectation
| of quality (which has been built up through using and
| hearing about GPT-4) it may cause them to think
| something like "Wow, I guess OpenAI is going downhill".
|
| Trademark cases are generally decided on a "reasonable
| consumer" basis. So yeah, they can seem a little
| arbitrary. But it's important for consumers to be able
| to distinguish the origin of the goods they are
| consuming, and for creators to be able to benefit from
| their investment in advertising and product development.
|
| [1] https://www.thebrewersart.com/bottles-cans
| nashashmi wrote:
| They can always say GPT-like. Or miniaturized GPT-like LLM.
| causi wrote:
| Someone needs to write a buyer's guide for GPUs and LLMs. For
| example, what's the best course of action if you don't need to
| train anything but do want to eventually _run_ whatever model
| becomes the first local-capable equivalent of ChatGPT? Do you
| go with Nvidia for the CUDA cores or with AMD for more VRAM?
| Do you do neither and wait another generation?
| thewataccount wrote:
| Nvidia, with the highest amount of VRAM you can get.
|
| Currently that's the 4090; the rumor is the 4090 Ti will
| have 48GB of VRAM, idk if it's worth waiting or not.
|
| The more VRAM, the higher the parameter count you can run
| all in memory (fastest by far).
|
| AMD is almost a joke in ML. The lack of CUDA support (which
| is Nvidia-proprietary) is straight-up lethal, and even
| though ROCm does have much better support these days, from
| what I've seen it's still a fraction of the performance it
| should be. I'm also not sure if you need projects to
| support it or not; I know PyTorch has backend support for
| it, but I'm not sure how easy it is to drop in.
| brandall10 wrote:
| I think the most recent rumors were amended to it having
| 24, unfortunately.
| thewataccount wrote:
| Darn.
|
| I mean, in all honesty, there's no reason a gaming card
| would need 48GB at the moment when so few games even
| use 24GB.
|
| 48GB really only makes sense for workstation cards.
| causi wrote:
| That's good to know.
| A lot of people might look at the 7900 XTX with its 24GB of
| RAM for a grand and think "ah, a good deal for fitting an
| LLM in".
| lhl wrote:
| For a general guide, I recommend:
| https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...
|
| There's a subreddit, r/LocalLLaMA, that seems like the most
| active community focused on self-hosting LLMs. Here's a
| recent discussion on hardware:
| https://www.reddit.com/r/LocalLLaMA/comments/12lynw8/is_anyo...
|
| If you're looking just for local inference, your best bet is
| probably to buy a consumer GPU w/ 24GB of RAM (a 3090 is
| fine, a 4090 has more performance potential), which can fit
| a 30B-parameter 4-bit quantized model that can probably be
| fine-tuned to ChatGPT (3.5) level quality. If not, you can
| probably add a second card later on.
|
| Alternatively, if you have an Apple Silicon Mac, llama.cpp
| performs surprisingly well, and it's easy to try for free:
| https://github.com/ggerganov/llama.cpp
|
| Current AMD consumer cards have terrible software support
| and IMO aren't really an option. On Windows you _might_ be
| able to use SHARK or DirectML ports, but nothing will run
| out of the box. ROCm still has no RDNA3 support (supposedly
| coming w/ 5.5, but no release date announced) and it's
| unclear how well it'll work - basically, unless you would
| rather be fighting w/ hardware than playing around w/ ML,
| it's probably best to avoid (the older RDNA cards also
| don't have tensor cores, so perf would be hobbled even if
| you could get things running. Lots of software has been
| written w/ CUDA-only in mind).
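|
| If you'd rather drive llama.cpp from Python than from its
| CLI, the llama-cpp-python bindings wrap the same code; a
| minimal sketch (the model path is illustrative - you point
| it at your own converted 4-bit weights):
|
|     # pip install llama-cpp-python
|     from llama_cpp import Llama
|
|     llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")
|     out = llm("Q: What runs well on Apple Silicon? A:", max_tokens=64)
|     print(out["choices"][0]["text"])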
| BeefWellington wrote:
| FWIW I had no real issues getting Stable Diffusion to run
| on a 6800 I have in one of my systems.
|
| I haven't tried with LLaMA at all.
| lhl wrote:
| The 6800 is RDNA2, not RDNA3. The latter is still waiting
| for ROCm support 4 months post-launch:
| https://github.com/RadeonOpenCompute/ROCm/issues/1813
| BeefWellington wrote:
| I'm aware that a 6800 is not RDNA3. You stated broadly:
|
| > Current AMD consumer cards have terrible software
| support and IMO aren't really an option. On Windows you
| might be able to use SHARK or DirectML ports, but
| nothing will run out of the box.
|
| I was merely sharing that I did not have the experience
| that current consumer cards have terrible support.
| lhl wrote:
| Sure, and I was merely clarifying that only last-gen
| architectures work. While I'm glad that SD works for
| you, if we are getting into it, I think that having no
| support on current-gen flagship models _does_ equal
| broadly terrible software support and is more worth
| highlighting, since otherwise someone might assume they
| could just go pick one of those up and get a 24GB GPU
| on the cheap, especially in the context of LLMs (which
| is what the OP was asking about).
|
| For RDNA2, you apparently _can_ get LLMs running, but
| it requires forking/patching both bitsandbytes and
| GPTQ: https://rentry.org/eq3hg - and this will be true
| for any library (e.g., can you use accelerate?
| deepspeed? fastgen? who knows, but certainly no one is
| testing it, and AMD doesn't care if you're not on
| CDNA). It's important to note again that anything that
| works atm will still only work with last-gen cards, on
| Linux only (ROCm does not work through WSL), w/ limited
| VRAM (no 30B q4 models), and since RDNA2 tensor support
| is awful, if the SD benchmarks are anything to go by,
| performance will still end up worse than an RTX 3050:
| https://www.tomshardware.com/news/stable-diffusion-gpu-bench...
| AuryGlenz wrote:
| Running is one thing. Running with any sort of speed is
| another.
| l33tman wrote:
| You don't need a GPU; alpaca.cpp runs purely on the CPU if
| you have 32+ GB of RAM for the 30B LLaMA model. It's a bit
| slow, but you can toy with it.
| avereveard wrote:
| Depends on whether you just want to play around or have
| actual business needs, but to play around I'm waiting for
| the next generation of unified memory architectures.
| ilteris wrote:
| Can you point to any source re: the next generation of
| unified memory architectures? How far away are they?
| avereveard wrote:
| Well, currently there's just Apple, and they are super
| secretive about their roadmap. The top-line M1/M2 chips
| already perform quite well, and while the top versions
| look expensive, a pair of A6000s will definitely eat
| more money, both to purchase and to run.
|
| The M3 is around the corner though, and there's some
| announcement to come from Intel or Arm following their
| partnership. There's also the new card coming from
| Intel that is supposed to be aimed squarely at machine
| learning workloads; Intel doesn't have to segment its
| market by memory sizing like Nvidia does, but its cards
| aren't well supported as device targets. A pair of
| those will likely be very cost-effective if, and only
| if, they get credible compatibility with the libraries
| and models.
| Tepix wrote:
| Read Tim Dettmers' guide.
|
| Basically, you want Nvidia, and you want lots of VRAM. Buy
| used for much more bang for the buck.
|
| Depending on your budget, get:
|
| - an RTX 3060 with 12GB, or
|
| - 1 used RTX 3090 with 24GB (approx. twice as expensive as
| the 3060 but twice the VRAM and much faster), or
|
| - 2 used RTX 3090 cards if you need more than 24GB.
|
| Everything beyond that gets quite a bit more expensive,
| because then you need a platform with more PCIe lanes, you
| may need more than one PSU, and you will have problems
| fitting and cooling everything.
|
| With two cards and 2x24GB you can run the largest version
| of the LLaMA model (the 65B variant) and all its
| descendants with 4-bit quantization inside your GPUs'
| VRAM, i.e. with good performance. You can also try some
| low-resource fine-tuning variants (LoRA etc.).
|
| Oh, and while you're at it, also get a decent amount of
| RAM, like 64GB or 128GB (it's very cheap right now), and
| an NVMe SSD. These models are quite large.
| mk_stjames wrote:
| Ask: how many parameters is the model? At how many bits?
| And has it been quantized to a smaller number of bits - or,
| if not, can it be without losing quality? Multiply the
| number of bits by the number of parameters, convert to GB,
| and then add some amount of overhead for however much else
| needs to load and/or how horrible/amazing an implementation
| the inference code is. Done.
|
| Example: a 30B-parameter model trained at 16-bit FP gets
| quantized down to 4-bit ints. 4 bits = 0.5 bytes. 30
| billion * 0.5 bytes = 15GB of VRAM (plus a GB or a few of
| other overhead).
|
| For more real-world discussion see
| https://github.com/ggerganov/llama.cpp/issues/13
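|
| In Python, that back-of-the-envelope estimate is just the
| following (the 2GB overhead figure is only a rough
| assumption; real overhead depends on the runtime, context
| length, etc.):
|
|     params = 30e9          # parameter count
|     bits_per_param = 4     # 4-bit quantized weights
|     overhead_gb = 2.0      # rough guess: runtime, activations, KV cache
|
|     weights_gb = params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB
|     print(f"~{weights_gb + overhead_gb:.0f} GB VRAM")  # ~17 GB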
| redox99 wrote:
| It's pretty simple actually.
|
| Get a 3090 or 4090. Forget about AMD.
| mewpmewp2 wrote:
| Should I get a gaming PC with a 4090 if I want to get into
| personal side projects in ML and also games?
|
| Do I need to dual-boot? Or is Windows good?
| dharma1 wrote:
| The 4090 is amazing, but it's a very large card. The 3090
| is "good enough" for ML - same 24GB VRAM - and you can
| pick them up used for half the price of a new 4090.
| That's what I did.
|
| WSL on Windows is apparently decent, or native PyTorch;
| dual-booting Windows/Ubuntu is still prob best tho.
| artificial wrote:
| Getting CUDA on openSUSE was super easy. The Nvidia blob
| drivers are easy to install, and CUDA needs another
| download and some copy-paste. Even Unreal Editor was
| easier to install than on Windows.
| Regic wrote:
| I'm running models locally on my 3090 and it's fast
| enough, although, for example, building a vector
| database can take a while. I can run LoRA training but I
| haven't done anything meaningful with it so far. I chose
| the 3090 because of the cable issue of the 4090 (also,
| no NVLink, although I'm not sure that matters), but it's
| debatable whether my fears are justified. I need to
| leave the GPU running while I'm away and I just don't
| feel comfortable doing that with a 4090. I'd rather take
| the lower performance.
|
| One caveat though: my Asus B650E-F is barely supported
| by the currently used Ubuntu kernel (e.g. my microphone
| doesn't work; before upgrading the kernel + BIOS I
| didn't have a LAN connection...), so expect some
| problems if you want to use a relatively new gaming
| setup for Linux.
| thewataccount wrote:
| Windows generally works, but there may be a somewhat
| small performance hit. IMO Linux is much easier to get
| working, judging by all the GitHub issue threads I see
| about SD/LLaMA stuff on Windows - but I don't use
| Windows so I don't have personal experience.
|
| The 4090 24GB is 1800 USD; the Ada A6000 48GB is like
| 8000 USD, and idk where you'd even buy it. So if you
| want to run games and models locally, the 4090 is
| honestly the best option.
|
| EDIT: I forgot - there is a rumored 4090 Ti with 48GB of
| VRAM, no idea if that's worth waiting for.
| wincy wrote:
| If I was going to spend $8000 on a video card I'd hunt
| on eBay for an A100 80GB rather than settle for the
| A6000.
| thewataccount wrote:
| Honestly yeah, a used A100 80GB sounds like a better
| idea.
| macrolime wrote:
| The A6000 is actually the old generation, Ampere. The
| new Ada-generation one is called 6000. Seems many places
| still sell the A6000 (Ampere) for the same price as the
| RTX 6000 (Ada) though, even though the new one is twice
| as fast.
|
| Seems you can get used RTX A6000s for around $3000 on
| eBay.
| thewataccount wrote:
| That... that explains why I can't find it, and makes a
| ton of sense.
|
| I think that's such a silly name for it, but oh well.
|
| Thanks for the correction!
| macrolime wrote:
| Just to add to the confusion, there's another, older
| RTX 6000 with 24GB of RAM. This is from an even older
| generation, the same as the GeForce 20 series.
| thewataccount wrote:
| You're kidding? So they called it the RTX 6000, then
| called it the RTX A6000 for Ampere, then went back to
| RTX 6000 for Ada?
|
| Why do they do this? Sometimes consumer products are
| versioned weirdly to mislead customers (like Intel
| CPUs) - but it wouldn't even make sense to do that
| here, as they're enterprise cards?
| macrolime wrote:
| Actually the first one is called the Quadro RTX 6000,
| while the Ada one is just RTX 6000 without the
| "Quadro" in front. Not that it makes the naming make
| much more sense.
|
| According to GPT-4, the next-generation one will be
| called the Galactic Unicorn RTX 6000 :D
| dannyw wrote:
| You could also get more detailed help from
| https://reddit.com/r/buildapc/
| whywhywhywhy wrote:
| Save some money and go 3090 - same VRAM, and the speed
| difference probably isn't worth the premium for the
| 4090. Then upgrade when the rumored 5090 generational
| leap happens.
| sampa wrote:
| You can run Windows in a VM and pass your video card
| through to it, so it's basically native gaming (3-5%
| cost) w/o reboots. Look for VFIO.
| turmeric_root wrote:
| Windows reserves a certain percentage of VRAM for some
| reason. So I'd recommend Linux. Or find a way to disable
| the desktop/UI in Windows.
| sureglymop wrote:
| I have access to an Nvidia A100. But as a layman, what
| specs does the rest of the system need to use it for some
| real work? I would assume there needs to be at least as
| much RAM as VRAM, and maybe a few terabytes of disk
| space. Does anyone have experience with this?
| michaelt wrote:
| If you have an A100, which in its 80GB variant costs
| $23,667 [1], you would not generally quibble over the
| price of a few terabytes of disk space.
|
| [1] https://www.dell.com/en-us/shop/nvidia-ampere-a100-pcie-300w...
| seydor wrote:
| At least for LLaMA derivatives, reddit has a list:
| https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_...
| dannyw wrote:
| Definitely Nvidia. Good luck running any real ML models on
| AMD GPUs.
| [deleted]
| iandanforth wrote:
| Filed an issue: up- or downvote there as you see fit.
|
| https://github.com/Vision-CAIR/MiniGPT-4/issues/9
| cloudking wrote:
| It doesn't seem to recognize text from a screenshot very
| well. I gave it detailed error messages from a Windows
| screenshot:
| https://filestore.community.support.microsoft.com/api/images...
|
| and prompted "Describe any issues found in this screenshot
| and steps to resolve them".
|
| While it correctly identified it as a screenshot from a
| computer, it gave a very generic response and didn't
| identify the error messages correctly:
|
| "The screenshot shows a computer screen with a warning
| message that reads, "The system is currently unavailable.
| Please try again later." This indicates that there is an
| issue with the computer system that needs to be resolved.
|
| To resolve this issue, you can try the following steps:
|
| 1. Check the system status: Check the system status to see
| if there are any errors or issues that need to be resolved.
|
| 2. Restart the computer: Try restarting the computer to see
| if that resolves the issue.
|
| 3. Check for updates: Check for any updates that may have
| been released for the computer system and apply them if
| necessary.
|
| 4. Check for viruses: Run a virus scan on the computer to
| check for any viruses that may be causing the issue.
|
| 5. Check for hardware issues: Check for any hardware issues
| that may be causing the issue. This could include issues
| with the power supply, hard drive, or other components.
|
| 6. Contact technical support: If none of the above steps
| resolve the issue, contact technical support for further
| assistance. They may be able to provide additional
| troubleshooting steps or help you resolve the issue."
| HansHamster wrote:
| Just tried it with a random schematic:
|
| > The image shows a block diagram of a power amplifier
| circuit with a power supply and a load resistor. The power
| supply is connected to the input of the amplifier, which is
| then connected to the output.
| The load resistor is connected to the output of the
| amplifier and the power supply. The amplifier is designed
| to amplify the input signal to a higher level, which is
| then sent to the load resistor to power the load. The power
| supply provides the necessary voltage to operate the
| amplifier.
|
| On the one hand I'm impressed that it recognises an
| amplifier, but just about everything in the description is
| (slightly) wrong. It is not a power amplifier, the power
| supply is not connected to the input (why should it be?),
| there is no load resistor...
| mftb wrote:
| I also got middling results. I gave it a novel NASA picture
| of the moon passing in front of the earth, taken by the
| Deep Space Climate Observatory satellite - so, an unusual
| vantage point. First it was just wrong, and said that it
| was a picture of a crescent moon taken from a mountain top,
| or something like that. Not terrible - at least it
| recognized the moon - but wrong. I tried to give it a
| chance and corrected it with the additional info that the
| picture was taken from space. It then did better, but on
| the last line it threw in that it was not a photograph but
| a rendering of some sort, which again is just wrong.
|
| More and more I get the impression that people are so taken
| with the chat interface that they are grading these models
| more favorably than is warranted.
| HansHamster wrote:
| And after pointing out all the mistakes in the
| description, it took a very... interesting approach to
| addressing my remarks:
|
| > I apologize for the confusion. The image shows a block
| diagram of a basic amplifier circuit with a power supply,
| but it does not show any input stages, output stages, or
| load resistors. [...] The input stage is made up of an
| operational amplifier, which is used to amplify the input
| signal. The output of the input stage is then connected
| to the power supply, which provides the necessary voltage
| to operate the amplifier. The amplifier is designed to
| amplify the input signal to a higher level, which is then
| sent to the load resistor to power the load.
|
| So it acknowledges the remarks, describes everything that
| is _not_ in the image, and then puts the stuff back into
| the description anyway :)
| aabajian wrote:
| From the radiology world this is fascinating. I'm not
| worried about job security, as I'm an interventionalist.
| What I'm wondering about is go-to-market strategies for
| diagnostics. I do some diagnostic reads and I would love to
| have something like this pre-draft reports (especially for
| X-rays). There are tons of "AI in rads" companies right now,
| none of which have models that come anywhere close to GPT-4
| or even this. Personally, I would not _start_ such a
| company, because the barrier to entry is now so low. I think
| there needs to be an open-source project that just further
| trains these models on rads images and is available to all
| radiologists.
| euroderf wrote:
| Your profession and... a few hundred others?
| deeplstm wrote:
| Here's the video summary: https://youtu.be/uPV9Gk3IC-g
| atgctg wrote:
| It's hard to distinguish non-Google projects with Google Sans
| in their templates from actual Google Research papers, as
| the font is meant to be used exclusively by Google[1].
|
| [1] https://developers.google.com/fonts/faq#how_can_i_get_a_lice...
| sebzim4500 wrote:
| Surely most people would read the authors list to determine
| provenance rather than the font?
| pxeger1 wrote:
| I didn't think about it consciously, but I think I did
| implicitly assume it was a Google project because of the
| font.
| tarruda wrote:
| Looking forward to the next generation of cheap GPUs with
| enough VRAM to run models like Vicuna-13B locally.
| adam_arthur wrote:
| A lot of people are running fairly powerful models directly
| on the CPU these days... it seems like inference will not be
| a GPU-exclusive activity going forward. Given that RAM is
| the main bottleneck at this point, running on CPU seems more
| practical for most end users.
|
| See: https://news.ycombinator.com/item?id=35602234
| valine wrote:
| Cheap is relative, I suppose.
|
| I'm running Vicuna-13B fp16 locally and it needs 26GB of
| VRAM, which won't even fit on a single RTX 4090. The
| next-gen RTX Titan might have enough VRAM, but that won't
| come cheap. I'm expecting a price point above $2500.
| 22c wrote:
| I'm not sure if it's the point GP is trying to make, but I
| would like to see GPUs with extra VRAM that don't have the
| extra compute, e.g. similar performance to a 4070 Ti but
| with 24GB or 32GB of VRAM.
|
| I don't see a really good reason why OEMs couldn't do that
| now; in the past there have been OEM cards with more VRAM
| than the reference design. I'm sure there's an appetite
| for cards like that among people who don't want to
| refinance their home loan to get 2x RTX 4090 cards.
| Tepix wrote:
| I'm fairly sure that NVIDIA is making sure that consumer
| cards are no good alternative to their $10,000 80GB VRAM
| A100 cards.
| ggregoire wrote:
| Could you use something like this to summarize the content of
| a video? Or would you first need to extract the images from
| the video (like 1 fps, with ffmpeg) and tell the AI to
| analyze all the images and summarize them as if they were a
| video?
| all2 wrote:
| I'd do this using both streams of information: audio and
| video.
|
| I'd segment the audio semantically based on the topic of
| discussion, and I'd segment the video based on editing,
| subjects in scene, etc. We could start simply and just have
| a "timestamp": [subjects, in, frame] key-value.
|
| It'd take some fiddling to sort out how to mesh these two
| streams of data back together. The first thing I'd try is
| segmenting by time chunks (the resolution of which would
| depend on min/max segment lengths in the video and audio
| streams) and then clumping the time chunks together based
| on audio+video content.
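|
| The video half of that is easy to prototype; a rough sketch,
| where `caption_image` is a hypothetical stand-in for whatever
| captioning model you plug in (BLIP-2, MiniGPT-4, ...):
|
|     import glob
|     import subprocess
|
|     # dump one frame per second to JPEGs (plain ffmpeg)
|     subprocess.run(["ffmpeg", "-i", "input.mp4", "-vf", "fps=1",
|                     "frames/%05d.jpg"], check=True)
|
|     def caption_image(path: str) -> str:
|         raise NotImplementedError  # your image-captioning model here
|
|     captions = {sec: caption_image(f)
|                 for sec, f in enumerate(sorted(glob.glob("frames/*.jpg")))}
|     # hand the {timestamp: caption} dict to an LLM with a
|     # "summarize this video" prompt; do the same for the audio track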
| golergka wrote:
| I don't know what unholy abomination they did to the
| scrolling, but after I increased the page font size, it just
| sticks to the bottom and you can't scroll up.
| arjonagelhout wrote:
| I'm experiencing the same thing on iOS Safari; it rapidly
| jumps back and forth between two positions.
| [deleted]
| detrites wrote:
| Right off the bat I lost all scrolling inertia here; it's
| like being in the dark ages.
| raylad wrote:
| I gave it a picture of a bottle, which it did a fairly good
| job of identifying (although it misread some of the text,
| mistaking a 1 for a 4).
|
| Then I asked it what the likely ingredients of the product
| are. It still hadn't replied after 2274s, so I gave up on it.
| simonw wrote:
| Hate to be the person complaining about the name, but we
| already saw how this plays out with DALL-E Mini: if you name
| your project directly after something else like this, no
| matter how much extra explanatory text you attach to it, a
| large number of people will assume it's an "official" variant
| of the thing it was named after.
|
| Eventually you'll have to rename it, either to resolve the
| confusion or because OpenAI pressures you to do so, or both.
|
| So better to pick a less confusing name from the start.
|
| (This one is even more confusing because it's about image
| inputs, but GPT-4 with image inputs hadn't actually been
| released to anyone yet - similar, in fact, to how DALL-E Mini
| got massive attention because DALL-E itself was still in
| closed preview.)
| xiphias2 wrote:
| > So better to pick a less confusing name from the start.
|
| I'm not sure that's better from a marketing standpoint
| though... it works: you still remember DALL-E Mini.
| jiwidi wrote:
| A bit nasty to call something MiniGPT-4 when it comes from
| LLaMA.
| wsgeorge wrote:
| Giving this a quick spin, and I like what I'm seeing. I gave
| it a picture of Paolo Veronese's Lament over the Dead Christ
| [0] and asked what was going on in the background.
|
| > The painting depicts the scene of the crucifixion of Jesus
| Christ. In the foreground, there is a group of people
| standing around the cross, including Mary, the mother of
| Jesus, and the two thieves who were crucified with him. In
| the background, there is a hill with a group of people
| standing on it, watching the crucifixion. The sky is dark and
| stormy, with lightning illuminating the scene. The overall
| mood of the painting is one of sadness and tragedy.
|
| Then I asked for a sonnet on it:
|
| "In this painting, we see the scene of the cross
| Where Jesus is crucified, with Mary at his side
| And the thieves who were crucified with him
| In the background, a hill with people standing
| Watching the tragic event unfold
| The sky is dark and stormy, with lightning flashing
| Illuminating the scene of the crucifixion
| The mood is one of sadness and tragedy
| As we contemplate the sacrifice of Jesus
| But even in death, he shows us the way
| To love and forgiveness, to live another day
| And so we remember the crucifixion
| And the love that Jesus showed us that day."
|
| So... not quite what I was expecting, but it's progress!
|
| [0] https://en.wikipedia.org/wiki/Lament_over_the_Dead_Christ_(V...
| omeid2 wrote:
| What did you expect?
| wsgeorge wrote:
| A 14-line poem with a consistent rhyme scheme and meter.
| Perhaps my request should have been more specific.
| pmoriarty wrote:
| From my experimentation, LLMs tend to kind of suck at
| rhyme and meter, and at all but the simplest types of
| poetry, so even if you'd specified, it probably wouldn't
| have been able to deliver.
|
| This is definitely something they could be trained to be
| much better at, but I guess it hasn't been a priority.
| astrange wrote:
| GPT-4 is surprisingly good at it, considering BPE
| tokenization means it shouldn't be able to do rhyme at
| all.
| jcims wrote:
| Has anyone tried using phonetic tokens instead of text?
| I'm curious if that would help with things like rhyming.
| causi wrote:
| I never know what to expect anymore. We live in a world
| where computers can describe paintings and write sonnets
| about them, but a half-trillion-dollar car company can't
| figure out how to parallel park with eight cameras.
| astrange wrote:
| That's a well-known result from the last AI hype era:
|
| https://en.wikipedia.org/wiki/Moravec%27s_paradox
|
| Although, I'm not surprised AI can describe paintings,
| but I'm still pretty surprised it can generate them.
| How'd we get that to work!?
| causi wrote:
| Yeah, and I'm still completely lost as to why resolution
| is such a limiting factor. If you know you're drawing a
| soccer ball, why is a 512x512 soccer ball so much easier
| than a 1024x1024 soccer ball?
| gary_0 wrote:
| There are a few cases where people have used ChatGPT to
| generate SVG[0], with mostly unimpressive results. I'm
| sure sooner or later models will be developed
| specifically for creating vector drawings instead of
| raster, including with the ability to apply textures and
| gradients.
|
| Also, the resolution of Stable Diffusion's output isn't
| much of a limitation if you're willing to use other tools
| to massage the output into something professional-
| quality. See [1].
|
| [0] For example:
| https://praeclarum.org/2023/04/03/chatsvg.html
|
| [1] https://news.ycombinator.com/item?id=35592847
| moffkalast wrote:
| DriveGPT as it hits 2 parked cars and runs over a dog:
| "Apologies for the confusion earlier. You are correct that
| the cars were in the planned local planner path..."
| dceddia wrote:
| Sorry, yes, that was a dog.
|
| Let me try again, and this time I will definitely not hit
| anything.
|
| Sorry, that was another dog.
| brycedriesenga wrote:
| "Was that another dog?"
|
| BingDrive: I'm sorry, but I prefer not to continue this
| conversation.
| gary_0 wrote:
| It's like something right out of a Douglas Adams book.
| niemandhier wrote:
| That's because they decided they do not need lidar.
| valine wrote:
| Near-term it's a frustrating decision, but if these GPT-4
| vision LLMs are anything to go by, it will prove to be
| the right decision in the long term.
| rafram wrote:
| Why wouldn't LIDAR _in addition to_ computer vision with
| cameras be a strictly better idea?
| bradstewart wrote:
| It's all trade-offs. I'm just spitballing here, but if
| you have limited resources, you can either spend
| cash/time on lidar or invest in higher-quality mass-
| produced optics or better computer vision software. If
| you get to a functional camera-only system sooner,
| everyone might be better off, as you can deploy it more
| rapidly.
|
| Manufacturing capacity for lidar components might be
| limited.
|
| Another consideration might be reliability/failure modes.
| If the system relies on lidar, that's another component
| that can break (or brown out and produce unreliable
| inputs).
|
| So in a vacuum, yeah, a lidar+camera system is _probably_
| better, but who knows with real-life trade-offs.
|
| (Again, I just made these up; I do not work on this
| stuff, but these are a few scenarios I can imagine.)
| astrange wrote:
| Remember that "cameras" aren't as good as human
| perception, because human eyes interact with the
| environment instead of being passive sensors. (That is,
| if you can't see something, you can move your head.)
|
| Plus we have ears, are under a roof so can't get rained
| on, are self-cleaning, temperature-regulating, have much
| better dynamic range, wear driving glasses...
| spaceman_2020 wrote:
| And we still get into millions of accidents every year...
| program_whiz wrote:
| I keep hearing this argument over and over, but I find it
| uncompelling. As a relatively young person with good
| vision, who has never been in an accident after many
| years of driving, and who doesn't make the kind of absurd
| mistakes I've seen self-driving cars make, I would not
| trust my life to a self-driving car.
|
| Asking people to accept a driverless car based on
| overarching statistics is papering over some very glaring
| issues.
| For example, are most accidents in cars caused by
| "average" drivers, or by young/old/intoxicated/
| distracted/bad-vision drivers? Are the statistics
| randomly distributed (e.g. any driver is just as likely
| as the next to get into accidents)? Because driverless
| cars seem to have accidents at random, in unpredictable
| ways, while human drivers can be excellent (no accidents,
| no tickets ever) or terrible (drive fast, tickets, high
| insurance, accidents, etc.). The distribution of
| accidents among humans is not close to uniform, and is
| usually explainable. I wouldn't trust a poor human driver
| on a regular basis, nor would I trust an AI, because I'm
| actually a much better driver than both (no tickets, no
| accidents, can handle complex situations the AI can't).
| Are the comparisons of human accidents being treated as
| homogeneous (e.g. is the chance of ramming full speed
| into a parked car the same as a fender-bender's)? I see
| 5.8M car crashes annually, but deaths remain fairly low
| (~40k, 0.68%), vs 400 driverless accidents with ~20
| deaths (5%); I'm not sure we're talking about the same
| type of accidents.
|
| tl;dr: papering over the complexity of driving, and how
| good a portion of drivers might be, by mixing non-
| homogeneous groups of drivers and taking global
| statistics of all accidents and drivers to justify
| unreliable and relatively dangerous technology, would be
| a strict downgrade for most good drivers (who are most
| of the population).
| rootusrootus wrote:
| Which sounds like a lot until you realize 1) we drive
| over three _trillion_ miles a year in the US, and 2) the
| majority of those accidents are concentrated among a
| fraction of all drivers. The median human driver is quite
| good, and the state-of-the-art AI isn't even in the same
| galaxy yet.
| Philip-J-Fry wrote:
| It's not lidar they need. BMW, Mercedes, Porsche, etc.
| can all park themselves almost perfectly every time.
| Teslas can't, and will take 5x as long to do it when they
| can - if the computer can even bother to recognise the
| space.
|
| It's software. Original Teslas with AP1 park better than
| Tesla's own in-house software on their latest AP.
| causi wrote:
| Sure, sure, but would it have killed them to drop in a
| few five-dollar "don't hit this object" ultrasonic
| proximity sensors?
| michaelt wrote:
| While ultrasonic sensors would be fine for parking, they
| don't have very good range, so they aren't much help in
| avoiding, for example, crashing into stationary fire
| trucks or concrete lane dividers at freeway speeds.
| itsyaboi wrote:
| Just disable autopilot 0.00001 seconds before impact and
| it becomes the driver's fault.
| tough wrote:
| Couldn't afford*
| bugglebeetle wrote:
| Don't worry. That valuation has always been ZIRP nonsense
| and will fall with Musk's addiction to publicly
| demonstrating his incompetence.
| pmoriarty wrote:
| How about asking it to describe a Robert Williams painting?
|
| Here's a relatively easy one: [1]
|
| [1] - http://www.illustrationartgallery.com/acatalog/WilliamsTimmy...
| wg0 wrote:
| I tried that, and here's what it had to say:
|
| > This painting appears to be a depiction of a child
| running down a dirt road with a kite in hand. The child is
| wearing a yellow shirt and blue shorts, and the background
| is filled with various colors and shapes. The overall mood
| of the painting is playful and carefree.
|
| What's amazing is that it has correctly described the
| scene, including details such as:
|
| - A boy
| - On a dirt road
| - Having a kite in hand
| - Wearing a yellow shirt and blue shorts
| jcims wrote:
| Used this prompt:
|
| > Could you describe the image as thoroughly as possible?
| Please include the content, any noteworthy symbology,
| techniques or styles, what kind of media, guess at the
| artist or identify any attribution, then attempt to
| describe the theme or idea it represents?
|
| Got this:
|
| > The image is a painting of a child running down a dirt
| road with a kite in hand. The child is wearing a yellow
| shirt and blue shorts. The background is a mix of urban and
| rural elements, with buildings and trees in the background
| and a field of grass in the foreground. The style of the
| painting is a mix of realism and abstraction, with a focus
| on the child and the kite. The artist may be attempting to
| convey a sense of freedom and playfulness, with the child
| running with the kite representing the joy and wonder of
| childhood.
|
| > The medium used is oil paint on canvas. The artist may
| have used brushes and other tools to create the different
| textures and effects in the painting. The artist may have
| also used underpainting to establish the composition and
| values before adding the final layers of color and detail.
|
| > It is difficult to identify the artist without more
| information, but the style and technique of the painting
| suggest that it may be the work of an artist who is skilled
| in realism and abstraction, and who is interested in
| exploring themes of childhood and playfulness.
| pmoriarty wrote:
| It completely missed what makes this painting distinctive
| and funny/scary, and what's actually going on.
|
| It also missed the right half of the painting.
|
| Try asking it why the painting is called "Timmy's Last
| Surprise".
| dom96 wrote:
| Can this decode CAPTCHAs? Are we finally in a place where
| CAPTCHAs are useless?
| spyder wrote:
| Just tried it with a simple 4-character one, and it's bad at
| it: it detects one or two characters correctly out of the
| four, if it outputs anything at all. It's probably better
| with the "select the traffic lights" kind of captchas, but
| those are also already possible to solve with other image
| models, if I remember correctly.
| alex_duf wrote:
| Would it be a good thing for captchas to be useless? In my
| career I've been on the receiving end of spam attacks and
| various nastiness a few times, and it's nice to be able to
| integrate such a thing - as unpleasant as it is for the
| user - and solve the problem at hand.
|
| If this disappears, then spammers and the various botnets
| will have the upper hand again.
| dom96 wrote:
| It wouldn't be good; did I imply it would? I'm interested
| to know if we are now in a place where they do become
| useless.
| dalys wrote:
| I was also thinking about this the other day.
|
| Idea one: Captchas are about to become pretty useless as an
| "is this a human" tactic. Maybe they already are, I don't
| know. What other things could we think of to prove someone
| is human? I was watching Lex Fridman and Max Tegmark, and
| they remarked on how Twitter using payment as a
| differentiator between human and bot is actually really
| good - and maybe the only way we can reliably determine if
| someone is human right now. Just by virtue of the fact that
| when having thousands of bots do something suddenly costs
| $5 per event, most attacks are deterred.
| Integrating online identification systems from various
| countries could be another tactic (such as
| https://en.wikipedia.org/wiki/BankID, which we use in
| Sweden to log in to basically any online service). New
| startup: un-botable authentication as a service.
|
| Idea two: Since captchas are useless, we'll be able to
| build bots that can do almost everything on the web. No
| need for writing automation scripts, headless browsers,
| regexps, etc. Just feed real visual data from the browser
| to GPT-4 (or MiniGPT-4 or similar). Give instructions like
| "You need to accomplish this task: Go to facebook.com,
| create a user account, become friends with 100 people, and
| act like a human. Follow the instructions on the website."
| Then let the bot figure out where to move the mouse and
| send click events, keyboard events, etc. Obviously much
| slower than a scripted bot, but it would be very hard to
| detect as a bot. Good for E2E testing also? New startup:
| WebGPT: bots that can do any task on any website. TestGPT:
| an E2E testing service. Just write test instructions like
| "Make sure the login works on /login" and you're done! And
| you thought flaky tests were a thing of the past... It
| would be kind of cool for UX testing also: "You are a
| 43-year-old truck driver in Spain. Please go to
| news.ycombinator.com and see how easy it is for you to
| register an account, log in, go to any news story and make
| a new comment. Report anything you think is confusing."
| tysam_and wrote:
| I was not expecting this to be good due to the hype wave, but
| at least looking at the examples, they seem very impressive
| and worthy of the (potential) hype! I can't wait to see what
| new waves of these models keep coming out.
|
| I for one will probably still stay near the fundamental side
| of things, as all of this thrash is simply far too much for
| me, but it is quite nice and beautiful for me to get to see
| all of the variety and beauty of what is being built here.
| <3 <3 <3 <3
|
| Now, to figure out a good way to integrate these seamlessly
| into my development workflow... I go about as painfully
| low-tech as one could. I might as well be using an electronic
| typewriter with only Ctrl+F search, lolzers...
| mdrzn wrote:
| Interesting, will try it out.
|
| The queue is about 100 at the moment, with ~700s of waiting.
| mritchie712 wrote:
| 1200s now...
| punkspider wrote:
| Also, some links don't seem to open in a new tab, and if
| you click around while you wait you might lose your
| progress.
|
| I clicked on the Video button while waiting, assuming that
| it would open in a new tab, and lost my place in the queue.
| zapdrive wrote:
| Also, once you get your turn and ask your question, it
| puts you back in the queue.
| djangovm wrote:
| If I want to learn how to read this code and understand what
| it and its dependencies are doing, where do I start? Is
| reading their paper an effective strategy?
|
| https://github.com/Vision-CAIR/MiniGPT-4/blob/main/MiniGPT_4...
| lxe wrote:
| Why Vicuna over something like oasst-pythia? Would love to
| see a table comparing all the new models side by side.
| GaggiX wrote:
| All Pythia models were trained on 300B tokens; LLaMA models
| were trained on 1T/1.4T tokens.
| lhl wrote:
| Fabrice Bellard has run a standard set of benchmarks w/
| lm-eval on a big chunk of open models here:
| https://bellard.org/ts_server/ - Flan-T5 XXL and GPT-NeoX
| 20B both outperform Pythia 12B on average (LLaMA 13B+ tops
| the charts).
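|
| If you want to run that kind of comparison yourself, the
| EleutherAI lm-evaluation-harness those numbers come from has
| a Python entry point; a rough sketch (exact argument names
| vary between harness versions, so treat this as approximate
| and check the repo):
|
|     # pip install lm-eval
|     from lm_eval import evaluator
|
|     results = evaluator.simple_evaluate(
|         model="hf-causal",
|         model_args="pretrained=EleutherAI/pythia-12b",
|         tasks=["lambada_openai", "hellaswag"],
|     )
|     print(results["results"])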
| [deleted]
| vlovic wrote:
| Do I understand this correctly: they just took BLIP-2 and
| replaced the LLM with Vicuna, and to do that they just added a
| single linear layer to translate between the frozen vision
| encoder and the (frozen) Vicuna? Additionally, and
| importantly, they manually created a high-quality dataset for
| fine-tuning their model.
|
| If that is the case, then this is really a very, very simple
| paper. But I guess simple things can lead to great
| improvements, and indeed their results seem very impressive.
| Goes to show how much low-hanging fruit there must be in deep
| learning these days by leveraging the amazing, and amazingly
| general, capabilities of LLMs.
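|
| If so, the trainable part is tiny; in PyTorch terms it would
| look roughly like this sketch (dimensions are illustrative:
| BLIP-2's Q-Former emits 32 query tokens of width 768, and
| LLaMA/Vicuna-13B's hidden size is 5120):
|
|     import torch
|     import torch.nn as nn
|
|     enc_dim, llm_dim = 768, 5120      # Q-Former width, Vicuna-13B hidden size
|
|     proj = nn.Linear(enc_dim, llm_dim)  # the ONLY trainable piece
|
|     # stand-in for frozen BLIP-2 ViT+Q-Former output: (batch, 32, 768)
|     with torch.no_grad():
|         image_feats = torch.randn(1, 32, enc_dim)
|
|     visual_tokens = proj(image_feats)   # (1, 32, 5120)
|     # these get prepended to the prompt embeddings of the frozen
|     # LLM; only proj's weights receive gradients during training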
| visarga wrote:
| Yes, model composability magic.
| 23459872345 wrote:
| Has anyone gotten this running in Colab?
| felipesabino wrote:
| They lost me when the demo bot chose PHP for the website-
| builder demo /s
| 152334H wrote:
| On a technical level, they're doing something really simple --
| take BLIP-2's ViT-L+Q-Former, connect it to Vicuna-13B with a
| linear layer, and train just the tiny layer on some datasets
| of image-text pairs.
|
| But the _results_ are pretty amazing. It completely knocks
| OpenFlamingo && even the original BLIP-2 models out of the
| park. And best of all, it arrived before OpenAI's GPT-4 image
| modality did. A real win for open-source AI.
|
| The repo's default inference code is kind of bad -- Vicuna is
| loaded in fp16, so it can't fit on any consumer hardware. I
| created a PR on the repo to load it with int8, so hopefully by
| tomorrow it'll be runnable by 3090/4090 users.
|
| I also developed a toy Discord bot
| (https://github.com/152334H/MiniGPT-4-discord-bot) to show the
| model to some people, but inference is very slow, so I doubt
| I'll be hosting it publicly.
| sho wrote:
| > they're doing something really simple -- take BLIP-2's
| ViT-L+Q-Former, connect it to Vicuna-13B with a linear
| layer, and train just the tiny layer on some datasets of
| image-text pairs
|
| Oh yes. Simple! Jesus, this ML stuff makes a humble web dev
| like myself feel like a dog trying to read Tolstoy.
| selfhoster11 wrote:
| Seriously, ChatGPT was the thing that gave me a foothold
| into the AI/machine learning world... because it gave me
| hope that a mere mortal can achieve something reasonable
| with this tech without a crazy amount of work and
| educational background.
| 152334H wrote:
| I don't have much experience myself. I only started ~10
| months ago -- just a month or two before Stable Diffusion.
|
| You just have to do it every day. It's fun!
| jack_riminton wrote:
| Can you recommend what kind of small daily activities
| would help a web dev get into it?
| 152334H wrote:
| Regardless of what you want to learn, "small daily
| activities" is a bit hard. You can learn some stuff by
| osmosis, following the feeds of AI devs && AI channels,
| but the bulk of what I learn comes from starting
| projects, digging into code, and reading papers.
|
| If you can hold your attention span over several days (I
| can't), work on a project bit by bit. Just make sure it
| uses modern AI stuff, and that you have smart people to
| talk it over with.
| anonzzzies wrote:
| There are really great resources now, from ELI5 treatments
| of all of this tech to books like 'The Little Learner'
| which any programmer can get into. Yes, it takes effort,
| but it is a great time for it.
| pansa2 wrote:
| Web stuff probably makes ML devs feel the same way.
|
| ML is just a different field, using a different set of
| technologies from those you're familiar with.
| tourgen wrote:
| [dead]
| spaceman_2020 wrote:
| The best ML PhDs can't do what frontend devs can:
| understand CSS :D
| ikurei wrote:
| > This ML stuff makes a humble web dev like myself feel
| like a dog trying to read Tolstoy.
|
| Just like any discussion between advanced web devs would
| make any humble woodworker feel?
|
| And just like any discussion between advanced woodworkers
| would make a humble web dev feel?
|
| _"It's really simple, they're just using a No. 7 jointer
| plane with a high-angle frog and a PM-V11 blade to flatten
| those curly birch boards, then a No. 4 smoother plane with
| a Norris-type adjuster and a toothed blade for the final
| pass."_
|
| Whut?
|
| _"You could use Webpack to bundle your HTML, CSS and Babel-
| transpiled TypeScript 5 down to shim-included Ecmascript
| 4"_, _"They're just using OAuth2 authentication with
| Passport.js and JWT tokens, which easily gets you CSRF
| protection"_, _"Our e-learning platform uses LMS.js and
| xAPI.js, plus SCORM for course packaging and Moodle as the
| LMS backend."_, ...
|
| There was a time you didn't know what any of that meant.
|
| Just because you don't know what the words mean shouldn't
| make it sound difficult. I'm not saying AI is easy, just
| that jargon is not a good indication of difficulty, and we
| should know better than to be so easily mystified.
| tracker1 wrote:
| Okay, I won't mention how much is wrong in the webdev
| statement... :-D
| brycedriesenga wrote:
| Hey, guys. Hey. Ready to talk plate processing and residue
| transport plate funneling? Why don't we start with joust
| jambs? Hey, why not? Plates and jousts. Can we couple
| them? Hell, yeah, we can. Want to know how? Get this.
| Proprietary to McMillan. Only us. Ready? We fit Donnely
| nut spacing grip grids and splay-flexed brace columns
| against beam-fastened derrick husk nuts and girdle plate
| Jerries, while plate flex tandems press task apparati of
| ten vertipin-plated pan traps at every maiden clamp plate
| packet. Knuckle couplers plate alternating sprams from the
| t-nut to the SKN to the chim line. Yeah. That is the
| McMillan way. And it's just another day at the office.
| SanderNL wrote:
| Talk dirty to me!
| intelVISA wrote:
| runtime polymorphism
| therein wrote:
| Just tell me, do we need a turbo encabulator or not?
| Rzor wrote:
| This post is double great, and I will never forgive
| Amazon for canceling that show.
|
| For those who don't know, this is from a show called
| Patriot.
|
| https://en.wikipedia.org/wiki/Patriot_(TV_series)
|
| Scene: https://youtube.com/watch?v=-F-IHvF5OCA
| SpaceL10n wrote:
| Web devs have become blue collar!? =P
|
| Great idea, actually. I do hope for a curriculum that
| enables kids on the trade-school path to learn more about
| programming. Why not Master/Journeyman/Apprentice-style
| learning for web dev??
| wnolens wrote:
| That's kind of how I think about bootcamps pumping out
| web devs. They're like trade schools, teaching you just
| enough fundamentals to know how to use existing tools.
| tracker1 wrote:
| Mostly agree... though I don't think the bootcamps get
| enough fundamentals in. Not to mention that it takes the
| type of person who will go above and beyond what has been
| assigned to succeed in trying to be a productive employee
| in this space. I'm self-taught, and in the first years of
| my career I spent countless hours reading, practicing and
| solving problems.
| I still spend a good 10-15 hours a week reading and
| exploring software development, and try to at least keep
| up with what's out there. In the end, the best you can do
| is be aware of what options are out there - or even that
| options are out there.
|
| I can't imagine starting out today...
| ikurei wrote:
| I kind of agree, but I'd add that I don't think it's a
| bad thing.
| f1codz wrote:
| You make a good point. Except that a number of these
| concepts and tools in the ML world have been slingshotted
| to the forefront in a relatively short time, and it has
| been hard to play catch-up. For example: someone said
| "frozen Vicuna" below - what does that mean?
| idkyall wrote:
| Vicuna is a specific open-source LLM:
| https://ai.plainenglish.io/vicuna-the-unparalleled-open-sour...
| birdyrooster wrote:
| I love your analysis.
| ttul wrote:
| The thing is, machine learning sorta requires a few math
| prerequisites: linear algebra, differential equations, and
| to some degree vector calculus. Most web developers don't
| have this background.
| craigching wrote:
| If you want to understand the theory, that's true. If you
| want to develop an intuitive understanding without having
| to understand all the nuts and bolts (and I understand
| that can be a big ask for how some people
| learn/understand), give this a try:
| https://karpathy.ai/zero-to-hero.html
| akiselev wrote:
| The irony is Karpathy presents the limit/epsilon
| definition of derivatives in the first half hour (quite
| well, IMO, and he never actually says "epsilon"), which
| is very much a nuts-and-bolts kind of explanation in
| calculus.
|
| That said, when most people say differential equations
| they're usually thinking of analytical solutions, which
| are very much not necessary for practical ML.
| idiotsecant wrote:
| I would say the limit-epsilon derivative is exactly the
| sort of thing the grandparent post is talking about.
| It's quite intuitive and requires hardly any
| mathematical foundation at all, other than basic
| geometry and algebra. You can understand topics that
| build on that simple concept without understanding the
| more formal derivative definitions.
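|
| In code, the whole idea is two lines - the slope of a
| secant whose width h you shrink toward zero (1e-6 here is
| just an arbitrary "small enough" choice):
|
|     def derivative(f, x, h=1e-6):
|         # slope of the secant through (x, f(x)) and (x+h, f(x+h))
|         return (f(x + h) - f(x)) / h
|
|     print(derivative(lambda x: x**2, 3.0))  # ~6.0, since d/dx x^2 = 2x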
|
| As you get better at programming you have to take on harder problems to create the surprise of something working, because you gain confidence, and as you gain confidence, you start expecting your code to work. It's only after you've compiled the thing six times with small corrections and gotten segfaults each time, and on the seventh try you finally find the place where you weren't updating the pointer and correct it - but this is the seventh error you've corrected without the segfault going away, so you don't really expect it to fix the problem - and then you run it and it's fixed!
|
| And then you get a job, and the reality is that in most of the jobs you're just writing CRUD apps. For a little while you can get some surprise out of learning the frameworks, but eventually you actually get really, really knowledgeable about the Postgres/Django/React stack and nothing surprises you any more. But _because_ nothing surprises you any more, you're really effective, and you start being able to bill the big bucks _but only for work on that stack_, because it takes time to struggle enough to get surprised, and the time that takes means your time is worth less to your clients. Money ruins everything. And if you don't do anything non-billable, it's easy to forget what programming felt like when you didn't know how your tools all worked inside and out. Not everyone takes this path, but it's certainly the easiest path to take.
|
| I think for a lot of folks who have been doing this for a long time, the reason ML is so exciting is that it's getting them back out of their comfort zone, and into a space where they can experience surprise again.
|
| But that surprise has always been available if you continue to find areas of programming that push you out of your comfort zone. For me it's been writing compilers/interpreters for programming languages. _Crafting Interpreters_ was awesome: for the first time I benchmarked a program written in _my_ language against a Python program, and _my_ program was faster: I never expected I'd be able to do that! More recently, I wrote a generational GC. It's... way too memory-intensive to be used in my language, which uses one GC per thread for potentially millions of threads, but it certainly was a surprise when that worked.
|
| Personally, I'm keeping track of ML enough to know the broad strokes of things, but I'm not getting my hands dirty with code until there are some giants to stand on the shoulders of. Those may already exist, but it's not clear who they are yet. And I've got very little interest in plugging together opaque API components; I know how to make an API call. I want to write the model code and train it myself.
| incidentnormal wrote:
| I like how you've expressed this insight, and it is so true.
|
| Becoming great at a particular technology stack means modelling it in great detail in your head, so you can move through it without external assistance. But that leaves an arena without discovery, where you just reinforce the same synapses, leading to rigidity and an absence of awe.
| teruakohatu wrote:
| There is a little more to it than that. Abstractions in ML are very leaky.
| marban wrote:
| And repeat that ~4 times to make it look like LangChain
| guax wrote:
| And just like webdev, each of those was done on a different platform and requires arcane incantations and 5h of doc perusing to make it work on your system.
| KaoruAoiShiho wrote:
| Buddy, this ain't 2022 anymore - ask ChatGPT (with a plugin that can read docs).
| MattPalmer1086 wrote:
| You can just ask GPT how to do it. Much like a lot of web dev!
| kerkeslager wrote:
| And the code GPT gives you won't work, much like a lot of web dev? ;P
| pc86 wrote:
| Maybe it's because of how I use it, but the code ChatGPT gives me has always been super helpful and 99% correct. But we have a policy at work not to use it for work product, so I have to spend time changing enough of it that it's different, and I'm never copy/pasting anything. Enough changes to the structure and variables to make it sufficiently different that it can't be considered pasting company data into GPT, ask my question(s), see what comes back out, refactor/type manually into my IDE, test. I'd say one out of every 8-9 times I get something objectively wrong - a method that doesn't exist, something not compiling, etc. But it's faster than using Google/DDG, especially with some prompting so that it just spits back code and not 5th-grade-level explanatory paragraphs before and after. And well over half the time it does exactly what I need, or sufficiently close that my initial refactoring step gets me the rest of the way.
| MattPalmer1086 wrote:
| Would you say that this satisfies the spirit of the company policy? Or is it a bit of a hack to get around it?
|
| I ask because we are about to produce a similar policy at work. We can see the advantages of it, but likewise, we can't have company data held in their systems.
| pc86 wrote:
| The policy is to not send any "sensitive company data" into ChatGPT, which I 100% agree with. How we implement a given Vue component or a particular API isn't sensitive or particularly novel, so if I strip the business logic out I do honestly believe I'm complying with the spirit of the policy.
| SanderNL wrote:
| If I use it I also make sure it's something completely non-core business, like an arcane piece of sorting or an ugly rxjs construction.
|
| I get the IP angst, but some companies think their GetGenericObjectFromDB() REST bs is secret sauce.
| intelVISA wrote:
| To the average VC a computer switching on is secret sauce enough; the rest is really just an implementation detail.
| Gravityloss wrote:
| At some point someone will make a service where you can let AI take over your computer directly. Easier that way! Curling straight to shell, taken to the next level.
| wrayjustin wrote:
| So... AutoGPT? Now with command-line access! Have fun :)
|
| https://github.com/Significant-Gravitas/Auto-GPT/
| vorticalbox wrote:
| Found my next hobby project
| minimaxir wrote:
| This is why the Hugging Face transformers ecosystem is so good, as each of those blocks will roughly have the same unified API.
| MrGilbert wrote:
| Just get rid of all the abbreviations in your mind - they seem to be very intimidating. I really liked the explanation of ChatGPT that Stephen Wolfram wrote:
|
| https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...
|
| Maybe someone has resources to understand machine learning on an ELI5 level.
| wwalexander wrote:
| Wow, he waits until halfway through the article to mention A New Kind of Science. Usually he works it into the first couple of paragraphs!
| alchemist1e9 wrote:
| I know it's hard to believe, but I sense LLMs have slightly knocked his ego down and injected a small dose of humility.
|
| https://youtu.be/z5WZhCBRDpU
|
| I pick that up in the above video and also in the post above.
|
| Definitely healthy for him. Just to be clear, I'm a huge Wolfram fan and the ego doesn't really bother me - it's just part of who he is - but I do find it nice that LLMs are having him self-reflect more than is typical for him.
| HarHarVeryFunny wrote:
| Not a big Wolfram fan myself. I gave him the benefit of the doubt and bought "A New Kind of Science" (freakin' expensive when it first came out), and read the whole 1280 pages cover to cover... It would have been better presented as a short blog post.
|
| I find it funny how, despite being completely uninvolved in ChatGPT, he felt the need to inject himself into the conversation and write a book about it. I guess it's the sort of important stuff that he felt an important person like himself should be educating the plebes on.
|
| Predictably he had no insight into it and will have left the plebes thinking it's something related to MNIST and cat detection.
| MrGilbert wrote:
| I just happened to read this article of his, which I found easy to understand. I'm neither a huge proponent nor opponent of the likes of his work. Or, bluntly speaking: I don't know much else about his reputation in the community.
| pc86 wrote:
| Big "a monad is just a monoid in the category of endofunctors" vibes from this one.
| joaogui1 wrote:
| FWIW I work in LLMs and I consistently fail to do simple webdev stuff
| craigching wrote:
| I was where you're at about... oh wow, it's been almost ten years since I jumped into machine learning. Mind you, I've been learning on the side most of this time, other than a theoretical class at the University of Minnesota. But, that aside, and depending on where you're at in your understanding, this is a great resource for catching up if you're really interested: https://karpathy.ai/zero-to-hero.html. It was posted on HN a couple of weeks ago, and I have to say it's a really good introduction, and Andrej Karpathy is a passionate and excellent teacher. You may want to brush up on some intro calculus, but it's very understandable.
| thewarrior wrote:
| I've only been reading ML stuff for a few months and I kind of understand what it's saying. This stuff isn't as complex as it's made out to be.
|
| It's just a bunch of black boxes, AKA "pure functions".
|
| BLIP2's ViT-L+Q-former, AKA:
|
|     // I give you a picture of a plate of lobster, it will say "A plate of lobster".
|     getTextFromImage(image) -> Text
|
| Vicuna-13B, AKA:
|
|     // I give you a prompt and you return a completion, ChatGPT style.
|     getCompletionFromPrompt(text) -> Text
|
| We want to take the output of the first one and then feed in a prompt to the LLM (Vicuna) that will help answer a question about the image. However, the datatypes don't match. Let's add in a mapper:
|
|     getAnswerToQuestion(image, question) -> answer
|         text = getTextFromImage(image)
|         prompt = mapTextToPrompt(text, question)
|         return getCompletionForPrompt(prompt)
|
| Now where did this mapTextToPrompt come from?
|
| This is the magic of ML. We can just "learn" this function from data. And they plugged in a "simple" layer and learned it from a few examples of (image, question) -> answer. This is what frameworks like Keras and PyTorch allow you to do. You can wire up these black boxes with some intermediate layers, pass in a bunch of data, and voila, you have a new model. This is called differentiable programming.
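|
| In PyTorch, that wiring is only a few lines. Here's a minimal sketch of the pattern (not MiniGPT-4's actual code - the modules, dimensions and loss below are invented stand-ins):
|
|     import torch
|     import torch.nn as nn
|
|     # Stand-ins for the frozen black boxes; in MiniGPT-4 these would be
|     # the BLIP-2 image encoder and Vicuna. All sizes here are made up.
|     image_encoder = nn.Linear(1024, 768)   # pretend frozen vision model
|     glue = nn.Linear(768, 5120)            # the only trainable part
|
|     for p in image_encoder.parameters():
|         p.requires_grad = False            # freeze the pretrained piece
|
|     opt = torch.optim.Adam(glue.parameters(), lr=1e-4)
|
|     x = torch.randn(4, 1024)               # pretend batch of image features
|     target = torch.randn(4, 5120)          # pretend "correct" LLM inputs
|
|     loss = nn.functional.mse_loss(glue(image_encoder(x)), target)
|     loss.backward()                        # gradients only reach `glue`
|     opt.step()
|
| The real objective is the LLM's usual next-token loss on the paired caption rather than this stand-in MSE, but the shape of the program is the same: freeze everything, backprop into the glue.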
|
| The thing is, you don't need to convert to text and then map back into numbers to feed into the LLM. You skip that, take the numbers it outputs, and multiply directly with an intermediate matrix:
|
|     getAnswerToQuestion(image, question) -> answer
|         image_emb = getEmbeddingFromImage(image)
|         llm_emb = mapEmbeddingToInputEmbeddingForLLM(image_emb)
|         return getCompletionForEmbedding(llm_emb, question)
|
| Congratulations, you now understand that sentence.
| artificial wrote:
| Thank you for the insightful breakdown. Cheers!
| eternauta3k wrote:
| Interesting, so the LLM is "just" getting your question plus a normal text description of the image (as vectors)?
| thewarrior wrote:
| At a high level, yes.
|
| More precisely, it gets the question plus the description of the image, after that description has been passed through a matrix that transforms it so the LLM can "understand" it.
|
| It maps from the space of one ML model to the other.
| visarga wrote:
| > take BLIP2's ViT-L+Q-former
|
| This thing takes an image and creates a representation matrix.
|
| > connect it to Vicuna-13B with a linear layer
|
| Vicuna is an open LLM, pretty good quality, though not as good as GPT-3.5.
|
| This is the beautiful part - a mere multiplication is enough to convert the image tensor to a text tensor. One freaking line of code, and a simple one.
|
| > and train just the tiny layer on some datasets of image-text pairs
|
| You then get a shitload of image-text pairs and train the model to describe the images in text. But keep both the image and text models frozen. Is that hard? No, just flip a flag. So this "linear projection layer" (a matrix multiplication) is the only learned part. That means it takes less time to train, needs fewer examples and requires less memory.
|
| Training the image and text models was much more difficult. But here they don't train those models; they use them as ready-made parts. It's a hack on top of two unrelated models, so it is cheap.
|
| In the end come the finishing touches: they label 3,500 high-quality image-text pairs and fine-tune on them. Now the model becomes truly amazing. It has broad visual intelligence, and it scooped OpenAI, which hasn't released GPT-4's image input in its APIs yet.
|
| The important lesson to take away is that unrelated models can be composed together with a bit of extra training for the glue model. And that open AI is sometimes just as powerful as "Open"AI. It's breathing down their necks, just one step behind. This model is also significant for applications - it can power many automations in a flexible way.
| rafaelero wrote:
| > This is the beautiful part - a mere multiplication is enough to convert the image tensor to a text tensor. One freaking line of code, and a simple one.
|
| I thought they were creating image tokens based on the queries during finetuning and appending them to the language model. They are not text tokens.
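|
| Roughly, in PyTorch (a sketch - the shapes here are assumed for illustration, not taken from the repo):
|
|     import torch
|     import torch.nn as nn
|
|     B, N_QUERY, QFORMER_DIM, LLM_DIM = 1, 32, 768, 5120
|
|     proj = nn.Linear(QFORMER_DIM, LLM_DIM)   # the learned glue: W plus bias
|
|     qformer_out = torch.randn(B, N_QUERY, QFORMER_DIM)  # frozen Q-former output
|     image_tokens = proj(qformer_out)         # "image tokens" in LLM space
|
|     prompt_embeds = torch.randn(B, 10, LLM_DIM)  # the embedded text prompt
|     llm_input = torch.cat([image_tokens, prompt_embeds], dim=1)
|     # hand `llm_input` to the frozen LLM as its input embeddings
|
| The projected query outputs live in the LLM's input-embedding space, but they never correspond to actual entries in the text vocabulary.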
| kramerger wrote:
| Can any of this realistically run on a CPU at some point?
|
| (Not training, obviously)
| tantony wrote:
| Someone is probably going to port it to llama.cpp soon.
| lhl wrote:
| Yes, you can run inference at decent speeds on a CPU with llama.cpp. A token is about 0.75 words, so you can see lots of people getting 4-8 words/s on their CPUs: https://github.com/ggerganov/llama.cpp/issues/34
|
| There are a lot of optimizations that can be done. Here's one with a potential 15x AVX speedup, for example: https://github.com/ggerganov/llama.cpp/pull/996
| anonzzzies wrote:
| It will do, probably quite soon. Many people are trying.
| GaggiX wrote:
| I've run LLaMA models on my CPU before; ViT-L and the Q-former are two transformer models as well, so I can't see why they wouldn't run on a CPU.
| Ambix wrote:
| I'm developing a framework [1] in Golang with this goal in mind :) It successfully runs relatively big LLMs right now, and diffusion models will be the next step
|
| [1] https://github.com/gotzmann/llama.go/
| muyuu wrote:
| quantized Vicuna runs OK-ish on my 16GB i7 laptop (onboard graphics) and the output is usable
|
| see this comparison: https://old.reddit.com/r/LocalLLaMA/comments/12ezcly/compari...
|
| these models quantized to 4-bit should run in CPU setups with 16GB of RAM + 16GB of swap (Linux), and perhaps other setups run similarly
| HopenHeyHi wrote:
| Thanks for a useful comment.
|
| Do you reckon the 4-bit quantized Vicuna just won't do here? https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-1...
|
| I think with this, everything OpenAI demonstrated ~5 weeks ago has been recreated by actually-open AI. Even if it runs much, much slower on prosumer hardware and with worse results, at least it has been de-magicked.
| p1esk wrote:
| The magic is in the quality of GPT-4's output. That hasn't been recreated yet.
| kolinko wrote:
| Open-source AI still hasn't exactly reached the level of GPT-3.5. GPT-4 is way ahead of anything.
| 152334H wrote:
| It'll work! I just haven't touched any of the 4-bit stuff myself, so I don't personally know how to add it. Great low-hanging fruit for anyone else to take on.
| chenzhekl wrote:
| It sounds like BLIP2 with an extra linear layer for finetuning (or for aligning the Q-former with a new LLM?). What makes it more powerful than BLIP2?
| 152334H wrote:
| It's better because:
|
| 1. It's using Vicuna as a base.
|
| 2. It has a pretty high-quality fine-tuning dataset. I initially missed this, and it's a very important advantage.
|
| 3. (Speculatively) it doesn't collapse to extremely short responses (which BLIP2 and other models trained on image-text caption pairs do), because of how small/simple the adapter is.
|
| I was interested in training a BLIP2-LLaMA model before this, and I might still do it just to test (3).
| matthewfcarlson wrote:
| > so hopefully by tomorrow it'll be runnable by 3090/4090 users.
|
| Taking a step back, this is just a wild statement. I know there's some doom and gloom out there, but in certain aspects, it's an awesome time to be alive.
| cs702 wrote:
| Indeed, really simple. And yes, the results are shockingly good.
| But what I find most remarkable about this is that the ViT-L+Q-former's hidden states _are related by only a linear projection_ (plus bias) to Vicuna-13B's token embeddings:
|
|     emb_in_vicuna_space = emb_in_qformer_space @ W + B
|
| These two models are trained independently of each other, on very different data (RGB images vs integer token ids representing subwords), and yet somehow they learn to embed different data in feature vectors that are so... _similar_. WHY should that be the case?
|
| It suggests to me there may be something _universal_ about the embedding layers and hidden states of _all_ trained deep learning models.
| sigmoid10 wrote:
| > somehow they learn to embed different data in feature vectors that are so... similar
|
| At its core, BLIP2 already projects RGB inputs into text token space, and Vicuna (or rather LLaMA) uses such tokens as inputs as well as outputs. The only reason a linear layer is needed at all is that they are not trained at the same time, so you still have to move text embeddings from one space to another. But it should not be surprising at all that one hidden linear layer suffices to do just that (see the universal approximation theorem [1]). This approach is just an efficient way to combine different models for downstream fine-tuning tasks while keeping their weights frozen, but it is neither new nor particularly surprising.
|
| [1] https://en.wikipedia.org/wiki/Universal_approximation_theore...
| cs702 wrote:
| Thanks. Your comment about BLIP2 already projecting RGB inputs into (a different) text token space makes sense to me. See also fpgaminer's comment at https://news.ycombinator.com/item?id=35603246 . However, I don't see how the universal approximation theorem is relevant here. The fact that deep models with sufficient capacity can approximate any function does not imply that two deep models trained independently of each other on different tasks will learn to approximate functions that relate to each other only by a linear transformation.
| sigmoid10 wrote:
| > I don't see how the universal approximation theorem is relevant here. The fact that deep models
|
| The universal approximation theorem is exactly _not_ about deep models. Deep means many layers. But in the simplest (and proven) case, a single-hidden-layer perceptron is all it takes, according to the UAT. Technically it also needs a nonlinear activation function, but you get all sorts of nonlinearities for free downstream anyway in this particular model.
| fpgaminer wrote:
| BLIP2 is a contrastive image-language model. The embeddings from the BLIP2 image model are already both aligned with text and linear. It should not be a surprise that only a projection is required to translate them to LLaMA's embedding space.
| cs702 wrote:
| This is the best answer. It makes sense to me. Thank you :-)
| AbrahamParangi wrote:
| I think it's just that affine transforms in high dimensions are surprisingly expressive. Since the functions are sparsely defined, they're much less constrained compared to the low-dimensional affine transformations we usually think of.
| cs702 wrote:
| Good point. Didn't think of that. It's a plausible explanation here, because the dimensionality of the spaces is so different, 5120 vs 768. Not surprisingly, the trained weight matrix has rank 768: it's using every feature in the lower-dimensional space.
|
| Still, it's kind of shocking that it works so well!
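|
| (Checking that is a one-liner; here's a sketch with a random stand-in, since the real W would have to be pulled out of the released checkpoint:)
|
|     import torch
|
|     W = torch.randn(768, 5120)          # stand-in for the learned projection
|     print(torch.linalg.matrix_rank(W))  # tensor(768): full rank on the smaller side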
|
| I'd be curious to see if the learned weight matrix ends up being full-rank (or close to full-rank) if both spaces have the same dimensionality.
| visarga wrote:
| It would be full-rank, because all of the embedding space is used. There are no large unused pockets.
| cs702 wrote:
| The weight matrix's rank would decrease for each feature in the target space that cannot be expressed as a linear combination of features in the input space (plus a bias). For example, if the target space has a feature representing a non-visual quality like "smelliness," it would not be expressible as a linear combination of features representing visual attributes like "redness," "blueness," "greenness," etc. in the input space.
|
| If both spaces have the same dimensionality, the learned weight matrix would be full-rank _only_ if every feature in the target space is expressible as a linear combination of features in the input space (plus a bias). Which brings me back to my original question: WHY would that be the case when the two models are trained independently on data that is so different?
| sdenton4 wrote:
| A random nxn matrix is full-rank... So it's kinda the default: any amount of noise in the embedding is going to result in full-rank transformations.
|
| So it's really less-than-full rank which would require an explanation - i.e., why does this image representation project into this perfectly isolated subspace of the language representation (or vice versa)?
|
| If that happened, I would start looking for things like a vocabulary of smell which is completely distinct and non-overlapping with any visual context. But we use cross-modal analogies in language /constantly/ (many smells are associated with things we can see - 'smells like a rose'), so you wouldn't expect any clean separations for different modalities... Maybe there's some branch of analytic philosophy which has managed to completely divorce itself from the physical world...
| eurekin wrote:
| > I created a PR on the repo to load it with int8, so hopefully by tomorrow it'll be runnable by 3090/4090 users.
|
| How about 2x3090? Can it be run on multiple GPUs?
| collaborative wrote:
| With fp8, would 4GB be enough, or is 6GB more like it?
| jxy wrote:
| This opens up huge possibilities. It's likely we could simply plug in Stable Diffusion using a linear layer, as well as Whisper and some TTS, and get a back-to-back mixed image/sound/text engine running on a laptop.
|
| I wonder if there's a powerful enough ViT model that does OCR.
| m00dy wrote:
| Hi CHEN,
|
| Invoker here. I would like to have a chat - send me an email at community@invoker.network
| endorphine wrote:
| DotA?
| m00dy wrote:
| I'm way better than him ;)
| xnzakg wrote:
| The ramen example is kind of hilarious. Wonder if it would make more sense with a bigger model.
| m348e912 wrote:
| Ok, I'll bite. What makes the ramen example so funny?
| burkaman wrote:
| The recipe says to make a broth, cook the noodles in the broth, then throw the broth away and add the noodles to a pan with meat. You'll end up with burned crispy noodles and meat in a dry bowl.
| Thomashuet wrote:
| The recipe it suggests is completely off. Just search for "ramen recipe" and compare. For instance, it tells you to add soy sauce, sesame oil and green onions to the broth (so far so good) but then tells you to throw all of it away ("once the noodles are cooked, drain them").
| The problem here is that you should cook the noodles separately in plain water, not in a broth with soy sauce.
| thatsadude wrote:
| How is this a paper? They simply fine-tune the fully-connected layer of BLIP-2: https://arxiv.org/pdf/2301.12597.pdf
| nobrains wrote:
| I think the real news is where this research is coming from. It is from King Abdullah University of Science and Technology in Saudi Arabia. The Arab world has recently been quite behind in tech research, but this is good news for the Arab world overall; it will inspire other universities and researchers in the region to one-up their colleagues, which is good for the entire world.
| sterlind wrote:
| I would be more thrilled if it were literally any other country in the Arab world, but maybe that will come.
| Imnimo wrote:
| I'm pretty surprised that the frozen encoder has enough information in its representations to do the "design the website from a hand-written whiteboard mock-up" task, and that that information can be effectively accessed by the language model with just a small projection layer. I really would have guessed that you needed to be training the image encoder simultaneously.
| devinprater wrote:
| Holy crap! If this has image-to-text and can run on a CPU, even with 16 GB of RAM, one could hook up an NVDA addon (straight Python, so it shouldn't be too hard) and have AI read whatever the user wants: pictures, video frames, UIs. And if it can break the UI down into something the screen reader can understand, boom!
| m3kw9 wrote:
| This is basically image-to-text, then from the text, starting a GPT-type chat. No need to overthink anything.
___________________________________________________________________
(page generated 2023-04-17 23:00 UTC)