[HN Gopher] Beating OpenAI CLIP with 100x less data and compute
       ___________________________________________________________________
        
       Beating OpenAI CLIP with 100x less data and compute
        
       Author : vov_or
       Score  : 234 points
       Date   : 2023-02-28 15:04 UTC (7 hours ago)
        
 (HTM) web link (www.unum.cloud)
 (TXT) w3m dump (www.unum.cloud)
        
       | ipsum2 wrote:
       | How did you deal with data contamination?
        
         | vov_or wrote:
          | The datasets we used are pretty clean compared with LAION.
          | But we also filtered out images that had captions printed on
          | them, and filtered by CLIP scores. Btw, huge thanks to the
          | LAION and Open_CLIP projects! They inspire us a lot.
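          | 
          | For anyone curious, CLIP-score filtering generally looks
          | something like this (a rough sketch with open_clip; the
          | threshold and the `pairs` list are just placeholders, not our
          | exact pipeline):
          | 
          |   import torch, open_clip
          |   from PIL import Image
          | 
          |   model, _, pre = open_clip.create_model_and_transforms(
          |       "ViT-B-32", pretrained="laion2b_s34b_b79k")
          |   tok = open_clip.get_tokenizer("ViT-B-32")
          | 
          |   def clip_score(path, caption):
          |       img = pre(Image.open(path)).unsqueeze(0)
          |       txt = tok([caption])
          |       with torch.no_grad():
          |           a = model.encode_image(img)
          |           b = model.encode_text(txt)
          |       a = a / a.norm(dim=-1, keepdim=True)
          |       b = b / b.norm(dim=-1, keepdim=True)
          |       return (a @ b.T).item()
          | 
          |   # keep only pairs above some similarity threshold
          |   clean = [p for p in pairs if clip_score(*p) > 0.28]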
        
       | bilater wrote:
        | For me, the biggest thing I am looking for is a serverless
        | vector data store. Competitors like Pinecone work just fine,
        | but the price jumps from 0 to 70 as soon as you upgrade to a
        | pod.
       | 
       | If you can figure out pricing primarily based on usage you can
       | capture a whole segment of this market.
        
         | ashvardanian wrote:
          | Great point! I would be happy to get more input and
          | brainstorm a good pricing model together, one that is fair
          | both for developers and for users.
          | 
          | We have an open-source project, UKV, that partly overlaps
          | with vector search: https://github.com/unum-cloud/ukv
          | 
          | Another one, UNSW, is a placeholder for now:
          | https://github.com/unum-cloud/unsw
          | 
          | Both will soon be available on cloud marketplaces, but
          | serverless options are a bit harder to cook. Our Discord is
          | the best place to continue the conversation:
          | https://discord.gg/Bbh2bjNhvz
          | 
          | Thank you for the advice!
        
       | swyx wrote:
       | > The original CLIP was trained on 500x A100 Nvidia GPUs. The
       | latest Open_CLIP trained on 1024x GPUs.
       | 
       | > We trained on the setup of 3x workstations, with 4x RTX 3090
       | consumer-grade GPUs in each, connected over 200 GBit InfiniBand
       | HDR.
       | 
        | ok so ~85x fewer GPUs (and presumably an even bigger gap in
        | effective compute, since consumer-grade 3090s are slower than
        | A100s) - but i must still be missing something - where does it
        | say it uses 100x less data?
        
         | brookst wrote:
         | Look at the "dataset" column: CLIP was trained on 400m images,
         | UForm on 4m.
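          | 
          | Rough arithmetic: 400M / 4M = 100x less data, and 1024 GPUs /
          | 12 consumer GPUs is roughly the 85x above, before even
          | accounting for an A100 being faster than an RTX 3090.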
        
           | vov_or wrote:
            | The table also lists the dataset sizes for ALBEF and ViCHA.
        
       | nl wrote:
       | This looks interesting for image retrieval.
       | 
       | I don't love the way their tables[1] report performance though.
       | My understanding is that the "Dataset" column in the table
       | represents the size of the training dataset, _not_ the size of
       | the dataset they are evaluating on. Note that this undersells
        | their performance though, so it isn't like they are trying to
       | hide something here!
       | 
        | Also I'd love to see someone do a similar benchmark for the
        | OpenAI GPT-3 embeddings. I'm pretty unclear on how well they
        | compare to something like FLAN-T5, because they don't seem to
        | be evaluated anywhere in the retrieval setting (unless I've
        | missed it?)
       | 
       | [1] See "Zero-Shot Image Retrieval, English-only" in
       | https://www.unum.cloud/blog/2023-02-20-efficient-multimodali...
        
       | alexandargyurov wrote:
        | Am I the only one who is very confused about what this is?
        
         | jasonjmcghee wrote:
         | This is a good introduction to OpenAI CLIP, which should help
         | provide context. https://openai.com/research/clip
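          | 
          | The one-paragraph version: CLIP embeds images and text into
          | the same vector space, so you can score a photo against a
          | handful of captions. A tiny sketch with the Hugging Face
          | transformers API (the model name and image file are just
          | examples):
          | 
          |   from PIL import Image
          |   from transformers import CLIPModel, CLIPProcessor
          | 
          |   name = "openai/clip-vit-base-patch32"
          |   model = CLIPModel.from_pretrained(name)
          |   proc = CLIPProcessor.from_pretrained(name)
          | 
          |   image = Image.open("cat.jpg")
          |   texts = ["a photo of a cat", "a photo of a dog"]
          |   inputs = proc(text=texts, images=image,
          |                 padding=True, return_tensors="pt")
          |   logits = model(**inputs).logits_per_image
          |   print(logits.softmax(dim=-1))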
        
           | pizzaknife wrote:
           | thank you for this primer!
        
       | juxtaposicion wrote:
       | It is exciting that you could train a CLIP-style model from
       | scratch with only 4M datapoints. But if you've got that data, why
       | not fine tune a pretrained model with your 4M points? It seems
       | likely to outperform the from-scratch method.
        
         | vov_or wrote:
          | There is a difference not only in the data sources but in
          | the pre-training tasks as well. But you are right: models
          | fine-tuned on human-annotated data are way better at image
          | retrieval than zero-shot (just pre-trained) ones. And that
          | holds for CLIP, ALBEF, ViCHA, and UForm.
        
           | ttt3ts wrote:
           | Any plans to document how to fine tune your models then?
        
             | vov_or wrote:
             | It will take some time, but yes, we have this in our plans.
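              | 
              | In the meantime, the generic CLIP-style
              | contrastive step looks roughly like this
              | (a sketch, not our exact recipe):
              | 
              |   import torch
              |   import torch.nn.functional as F
              | 
              |   def clip_loss(img_emb, txt_emb, t=0.07):
              |       img = F.normalize(img_emb, dim=-1)
              |       txt = F.normalize(txt_emb, dim=-1)
              |       logits = img @ txt.T / t
              |       labels = torch.arange(len(img))
              |       loss_i = F.cross_entropy(logits, labels)
              |       loss_t = F.cross_entropy(logits.T, labels)
              |       return (loss_i + loss_t) / 2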
        
         | riku_iki wrote:
          | perhaps this approach can lead to better training of
          | foundation models?
        
           | vov_or wrote:
           | More efficient - for sure!
        
       | varispeed wrote:
       | I read a lot about training models and so on, but very little
       | about inference.
       | 
        | Let's say you came up with a custom model that gives good
        | results. How do you deploy that model so it can be used behind
        | an API?
        
         | binarymax wrote:
          | I specialize in this area and build a product for self-
          | hosted inference.
         | 
          | The challenge in supporting a new model architecture is
          | coding the preprocessing for inputs (like tokenization, or
          | image resizing and color feature extraction) and the post-
          | processing of outputs (for example, entity recognition needs
          | to look up the entities and align them with the text).
         | 
         | Once an architecture is coded for the pre/post processing, then
         | serving a new model for inference with that architecture is
         | easy!
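          | 
          | Roughly, the wrapper ends up looking like this (names are
          | made up, just to show the shape of it):
          | 
          |   class Pipeline:
          |       def __init__(self, pre, model, post):
          |           self.pre = pre      # tokenize / resize
          |           self.model = model  # forward pass
          |           self.post = post    # e.g. align entities
          | 
          |       def __call__(self, raw):
          |           return self.post(self.model(self.pre(raw)))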
        
         | alex_sf wrote:
          | There's no one answer to that, since different models are,
          | well, different. Beyond just modalities (text input and image
          | output? image input and video output?), there are different
          | common underlying tools used to build them. And then, of
          | course, what do you mean by API? How do you want to interact
          | with it?
         | 
         | As a general thing, you'd take a request that would require an
         | inference step, which would then invoke the model with some
         | parameters and input, and return the output. Beyond that, you'd
         | need more detail.
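          | 
          | As a concrete (if oversimplified) sketch, assuming you
          | already have some embed() function wrapping the model, an
          | HTTP API can be as small as:
          | 
          |   from fastapi import FastAPI
          |   from pydantic import BaseModel
          | 
          |   app = FastAPI()
          | 
          |   class Query(BaseModel):
          |       text: str
          | 
          |   @app.post("/embed")
          |   def embed_endpoint(q: Query):
          |       return {"embedding": embed(q.text)}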
        
         | [deleted]
        
       | sashank_1509 wrote:
        | They seem to be testing only the image retrieval task, but I
        | don't think CLIP is actually used for image retrieval. In most
        | cases, I see CLIP being used for semantic segmentation,
        | detection, etc. Do these guys have similar results on those
        | tasks?
        
         | vov_or wrote:
          | Hi! I am one of the contributors! We were focused on image
          | retrieval only. Almost all semantic search engines for
          | images are based on CLIP today. We are also building a
          | semantic multimodal search engine as a DBMS component. That
          | is why image retrieval is so crucial for us, as is inference
          | performance. Also, for semantic segmentation and detection,
          | you probably use only the image-encoder part of CLIP.
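          | 
          | Schematically, the search loop is just: embed the gallery
          | once, embed the query, take nearest neighbors (encode_image
          | / encode_text and `photos` are placeholders here, not our
          | API):
          | 
          |   import numpy as np
          | 
          |   index = np.stack([encode_image(p) for p in photos])
          |   index /= np.linalg.norm(index, axis=1, keepdims=True)
          | 
          |   q = encode_text("a red bicycle on the beach")
          |   q /= np.linalg.norm(q)
          |   top10 = np.argsort(-(index @ q))[:10]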
        
       | ilaksh wrote:
       | This may be a dumb question, but would it be possible to apply
       | these techniques to something like text completion and/or visual
       | question answering? If you went ahead and used the optimizations
       | but still scaled the model up?
        
         | vov_or wrote:
          | Yes, it is possible. The approaches our model is based on
          | are capable of solving VQA and other similar tasks with SOTA
          | results.
        
           | ilaksh wrote:
           | Do you know anyone working on a large text completion model
           | based on it?
        
           | freediver wrote:
           | Do you have/plan to have a text embeddings model?
        
             | vov_or wrote:
              | Yes, we are training text embedding models right now,
              | and we plan to open-source some of them! In addition, we
              | are training encoders for other modalities for retrieval
              | purposes, for example video.
        
       | margorczynski wrote:
        | From what I understand, the basis for their model is the two
        | approaches described in these papers:
        | https://arxiv.org/abs/2107.07651
        | https://arxiv.org/abs/2208.13628
        | 
        | A lot of tricks put together for a great final result, it seems
        
         | ashvardanian wrote:
         | Thank you! Founder here :) You are right, those are the base
         | papers, but we have extended the set of objectives quite
         | significantly, tapping into modalities that haven't been
         | publicly CLIP-ed :)
         | 
         | It is probably worth writing a paper about, but we are just too
         | busy building tons of open-source stuff. Check out the GitHub
         | org here: https://github.com/unum-cloud
         | 
          | It is not just about the transformers, but also about
          | databases, networking, and improving the modern data stack
          | for very-large-scale retrieval-based AI. A lot of the pieces
          | may be pre-production, but I believe the amazing HN
          | community may still enjoy the ways we use io_uring, SIMD,
          | and a few other less-than-popular technologies.
        
           | eternalban wrote:
            | Where is the udisk? The repo is just a readme on
            | configuration.
        
           | cosmojg wrote:
           | Are the pretraining and training pipelines available anywhere
           | under a FOSS license? I'd love to take a swing at training a
           | mid-fusion model on data other than text and images (e.g.,
           | sound, neuron spike trains, etc.)
        
           | debdut wrote:
            | man, I just looked at UKV, and it looks too good to be
            | true. 30x RocksDB, wtf! Hoping it's true.
        
       | mahnerak wrote:
        | I couldn't find a license in the Hugging Face repo, but it
        | seems like the codebase is Apache 2.0. Are the pretrained
        | weights / checkpoints also covered under this (or another
        | permissive) license?
       | 
       | In other words, can we use it for _commercial purposes for free_?
        
         | grammers wrote:
         | Good question, was about to ask the same!
        
         | vov_or wrote:
          | Hi! Just added Apache 2.0 to the HF model cards. Thanks!
        
           | cosmojg wrote:
           | Are the pretraining and training pipelines available anywhere
           | under a FOSS license? I'd love to take a swing at training a
           | mid-fusion model on data other than text and images (e.g.,
           | sound, neuron spike trains, etc.)
        
       | sva_ wrote:
       | Not sure if I'm blind, but what is the number of parameters?
        
         | vov_or wrote:
          | 143M for the English model, 206M for the multilingual one.
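          | 
          | If you want to double-check on your side (assuming a PyTorch
          | model object):
          | 
          |   n = sum(p.numel() for p in model.parameters())
          |   print(f"{n / 1e6:.0f}M parameters")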
        
       | [deleted]
        
       | fabbari wrote:
        | The sample code has an error in it: it uses `model` before
        | initializing it.
        
         | vov_or wrote:
          | Thanks! Seems like a typo. It will be fixed soon.
        
       | kimihailv wrote:
       | Did the author report metrics of the unimodal model or of the
       | multimodal model with re-ranking?
        
         | vov_or wrote:
         | The results are reported with the multimodal model.
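          | 
          | For context, the usual two-stage pattern is: shortlist
          | candidates with the unimodal embeddings, then re-score the
          | shortlist with the joint multimodal head. Schematically
          | (names are placeholders, not our actual API):
          | 
          |   import numpy as np
          | 
          |   cands = np.argsort(-(img_index @ txt_emb))[:100]
          |   scores = [cross_score(txt, images[i]) for i in cands]
          |   top10 = cands[np.argsort(scores)[::-1][:10]]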
        
       ___________________________________________________________________
       (page generated 2023-02-28 23:00 UTC)