[HN Gopher] Beating OpenAI CLIP with 100x less data and compute
___________________________________________________________________

Beating OpenAI CLIP with 100x less data and compute

Author : vov_or
Score  : 234 points
Date   : 2023-02-28 15:04 UTC (7 hours ago)

(HTM) web link (www.unum.cloud)
(TXT) w3m dump (www.unum.cloud)

| ipsum2 wrote:
| How did you deal with data contamination?
| vov_or wrote:
| The datasets we used are pretty clean compared with LAION. But we
| also filtered out images with captions printed on them, as well as
| by CLIP scores. Btw, huge thanks to the LAION and Open_CLIP
| projects! They inspire us a lot.
| bilater wrote:
| For me, the biggest thing I am looking for is a serverless vector
| data store. Competitors like Pinecone work just fine, but they go
| from $0 to $70 as soon as you upgrade to a pod.
|
| If you can figure out pricing based primarily on usage, you can
| capture a whole segment of this market.
| ashvardanian wrote:
| Great point! I would be happy to get more input and brainstorm a
| good pricing model together, one that is fair both for developers
| and for users.
|
| We have an open-source project, UKV, that partly overlaps with
| vector search: https://github.com/unum-cloud/ukv
|
| Another one, UNSW, is a placeholder for now:
| https://github.com/unum-cloud/unsw
|
| Both will soon be available on cloud marketplaces, but serverless
| options are a bit harder to cook. Our Discord is the best place to
| continue the conversation: https://discord.gg/Bbh2bjNhvz
|
| Thank you for the advice!
| swyx wrote:
| > The original CLIP was trained on 500x A100 Nvidia GPUs. The
| latest Open_CLIP trained on 1024x GPUs.
|
| > We trained on the setup of 3x workstations, with 4x RTX 3090
| consumer-grade GPUs in each, connected over 200 GBit InfiniBand
| HDR.
|
| OK, so that's an 85x improvement on GPU count (I suspect even
| better once you take into account the differences in consumer-
| grade GPUs), but I must still be missing something: where does it
| say it uses 100x less data?
| brookst wrote:
| Look at the "dataset" column: CLIP was trained on 400M images,
| UForm on 4M.
| vov_or wrote:
| There are also dataset sizes listed for ALBEF and ViCHA.
| nl wrote:
| This looks interesting for image retrieval.
|
| I don't love the way their tables[1] report performance, though.
| My understanding is that the "Dataset" column in the table
| represents the size of the training dataset, _not_ the size of the
| dataset they are evaluating on. Note that this undersells their
| performance, though, so it isn't like they are trying to hide
| something here!
|
| Also, I'd love to see someone do a similar benchmark for the
| OpenAI GPT-3 embeddings. I'm pretty unclear how well they compare
| to something like FLAN-T5, because they don't seem to be evaluated
| anywhere in the retrieval setting (unless I've missed it?)
|
| [1] See "Zero-Shot Image Retrieval, English-only" in
| https://www.unum.cloud/blog/2023-02-20-efficient-multimodali...
| alexandargyurov wrote:
| Am I the only one who is very confused about what this is?
| jasonjmcghee wrote:
| This is a good introduction to OpenAI CLIP, which should help
| provide context: https://openai.com/research/clip
| pizzaknife wrote:
| thank you for this primer!
| juxtaposicion wrote:
| It is exciting that you could train a CLIP-style model from
| scratch with only 4M datapoints. But if you've got that data, why
| not fine-tune a pretrained model with your 4M points? It seems
| likely to outperform the from-scratch method.
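(On the fine-tuning question above: a minimal sketch of contrastive
fine-tuning of a pretrained CLIP-style model on a small paired
image-caption set. It assumes the open_clip library mentioned in the
thread as a stand-in; UForm's own fine-tuning recipe is not published
here, so the model name, hyperparameters, and loss are illustrative
only.)

    # Sketch: contrastive fine-tuning of a pretrained CLIP-style model.
    # Assumes the open_clip package; UForm's own training code is not shown here.
    import torch
    import torch.nn.functional as F
    import open_clip

    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    def train_step(images, captions):
        """One contrastive step on a batch of preprocessed images and caption strings."""
        img = model.encode_image(images)                # (B, D)
        txt = model.encode_text(tokenizer(captions))    # (B, D)
        img = img / img.norm(dim=-1, keepdim=True)      # L2-normalize both towers
        txt = txt / txt.norm(dim=-1, keepdim=True)
        logits = model.logit_scale.exp() * img @ txt.T  # (B, B) similarity matrix
        labels = torch.arange(len(captions))            # matching pairs lie on the diagonal
        loss = (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.T, labels)) / 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()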
| vov_or wrote:
| There is a difference not only in the data source but in the pre-
| training tasks as well. But you are right: fine-tuned models on
| human-annotated data are way better than zero-shot (just pre-
| trained) ones on image retrieval. And this holds for CLIP, ALBEF,
| ViCHA, and UForm.
| ttt3ts wrote:
| Any plans to document how to fine-tune your models, then?
| vov_or wrote:
| It will take some time, but yes, we have this in our plans.
| riku_iki wrote:
| Perhaps this approach can lead to better training of foundation
| models?..
| vov_or wrote:
| More efficient - for sure!
| varispeed wrote:
| I read a lot about training models and so on, but very little
| about inference.
|
| Let's say you came up with a custom model that gives good results.
| How do you deploy that model so it can be used in an API?
| binarymax wrote:
| I specialize in this area and build a product for self-hosted
| inference.
|
| The challenge in supporting a new model architecture is coding the
| preprocessing for the inputs (like tokenization, or image resizing
| and color feature extraction) and the post-processing of the
| outputs (for example, entity recognition needs to look up the
| entities and align the text).
|
| Once an architecture is coded for the pre/post-processing, serving
| a new model for inference with that architecture is easy!
| alex_sf wrote:
| There's no one answer to that, since different models are...
| different. Beyond just modalities (text input and image output?
| image input and video output?), there are different common
| underlying tools used to build them. And then, of course, what do
| you mean by API? How do you want to interact with it?
|
| As a general thing, you'd take a request that requires an
| inference step, which would then invoke the model with some
| parameters and input, and return the output. Beyond that, you'd
| need more detail.
| [deleted]
| sashank_1509 wrote:
| They seem to be testing only the image retrieval task, but I don't
| think CLIP is actually used for image retrieval. In most cases, I
| see CLIP being used for semantic segmentation, detection, etc. Do
| these guys have similar results on those tasks?
| vov_or wrote:
| Hi! I am one of the contributors! We were focused on image
| retrieval only. Almost all semantic search engines for images are
| based on CLIP today. We are also building a semantic multimodal
| search engine as a DBMS component. That is why image retrieval is
| so crucial for us, as is inference performance. Also, for semantic
| segmentation and detection, you probably use only the image-
| encoder part of CLIP.
| ilaksh wrote:
| This may be a dumb question, but would it be possible to apply
| these techniques to something like text completion and/or visual
| question answering? If you went ahead and used the optimizations
| but still scaled the model up?
| vov_or wrote:
| Yes, it is possible. The approaches our model is based on are
| capable of solving VQA and other similar tasks with SOTA results.
| ilaksh wrote:
| Do you know anyone working on a large text completion model based
| on it?
| freediver wrote:
| Do you have/plan to have a text embeddings model?
| vov_or wrote:
| Yes, we are training text embedding models right now, and we also
| have plans to open-source some of them! In addition, we train
| encoders for other modalities for retrieval purposes - for
| example, video data.
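(The retrieval use case discussed above boils down to encoding
queries and images into a shared space and ranking by similarity.
Below is a minimal sketch of zero-shot text-to-image retrieval with a
CLIP-style two-tower model, using the open_clip API as a stand-in;
the model name and pretrained tag are illustrative, not UForm's own
interface.)

    # Sketch: zero-shot text-to-image retrieval with a CLIP-style two-tower model.
    # Uses open_clip as a stand-in; any model exposing image/text encoders works similarly.
    import torch
    import open_clip
    from PIL import Image

    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")

    def rank_images(query, image_paths):
        """Return (path, score) pairs sorted by similarity to the text query."""
        with torch.no_grad():
            txt = model.encode_text(tokenizer([query]))
            txt = txt / txt.norm(dim=-1, keepdim=True)      # (1, D)
            imgs = torch.stack([preprocess(Image.open(p).convert("RGB"))
                                for p in image_paths])
            img = model.encode_image(imgs)
            img = img / img.norm(dim=-1, keepdim=True)      # (N, D)
            scores = (img @ txt.T).squeeze(1)               # cosine similarities
        return sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1])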
| margorczynski wrote:
| From what I understand, the basis for their model is the two
| approaches described in these papers:
| https://arxiv.org/abs/2107.07651
| https://arxiv.org/abs/2208.13628
|
| A lot of tricks put together for a great final result, it seems.
| ashvardanian wrote:
| Thank you! Founder here :) You are right, those are the base
| papers, but we have extended the set of objectives quite
| significantly, tapping into modalities that haven't been publicly
| CLIP-ed :)
|
| It is probably worth writing a paper about, but we are just too
| busy building tons of open-source stuff. Check out the GitHub org
| here: https://github.com/unum-cloud
|
| It is not just about the transformers, but also about databases,
| networking, and improving the modern data stack for very large
| scale retrieval-based AI. A lot of the pieces may be pre-
| production, but I believe the amazing HN community may still enjoy
| the ways we use io_uring, SIMD, and a few other less-than-popular
| technologies.
| eternalban wrote:
| Where is the udisk? The repo is just a README on configuration.
| cosmojg wrote:
| Are the pretraining and training pipelines available anywhere
| under a FOSS license? I'd love to take a swing at training a mid-
| fusion model on data other than text and images (e.g., sound,
| neuron spike trains, etc.)
| debdut wrote:
| Man, I just looked at UKV - it looks too good to be true. 30x
| RocksDB, wtf! Hoping it's true.
| mahnerak wrote:
| I could not find a license in the Hugging Face repo, but it seems
| like the codebase is Apache 2.0. Are the pretrained weights /
| checkpoints also covered under this (or another permissive)
| license?
|
| In other words, can we use it for _commercial purposes for free_?
| grammers wrote:
| Good question, was about to ask the same!
| vov_or wrote:
| Hi! Just added Apache 2.0 to the HF model cards. Thanks!
| cosmojg wrote:
| Are the pretraining and training pipelines available anywhere
| under a FOSS license? I'd love to take a swing at training a mid-
| fusion model on data other than text and images (e.g., sound,
| neuron spike trains, etc.)
| sva_ wrote:
| Not sure if I'm blind, but what is the number of parameters?
| vov_or wrote:
| 143M for English, 206M for multilingual.
| [deleted]
| fabbari wrote:
| The sample code has an error in it: it uses `model` before
| initializing it.
| vov_or wrote:
| Thanks! Seems like a typo. It will be fixed soon.
| kimihailv wrote:
| Did the author report metrics of the unimodal model or of the
| multimodal model with re-ranking?
| vov_or wrote:
| The results are reported with the multimodal model.
___________________________________________________________________
(page generated 2023-02-28 23:00 UTC)