[HN Gopher] Video-LLaVA
       ___________________________________________________________________
        
       Video-LLaVA
        
       Author : tosh
       Score  : 146 points
       Date   : 2023-11-21 17:31 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | bobosha wrote:
        | This is a very cool project! Kudos to the authors for staying on
        | top of things and keeping the features coming. Appears to be
        | feature-competitive with OpenAI's GPT-4V `vision` endpoint.
        
       | whimsicalism wrote:
       | Researchers seem very comfortable sticking "Apache 2.0" licenses
       | all over their foundation model finetunes.
       | 
        | This model is absolutely not Apache 2.0 in reality (it's a Vicuna
        | finetune, never mind the sourcing of the finetune dataset), and
        | you would use it for business at your peril.
        
         | Der_Einzige wrote:
          | Fine-tuning the weights scrambles the original representations
          | (sometimes more than others, depending on training settings;
          | if you train the text encoder, it certainly will). All the
          | authors have to do is not disclose the original model it was
          | fine-tuned from, in a world where lawyers start to come down
          | on this.
          | 
          | I see no issue for businesses using it.
        
           | whimsicalism wrote:
            | I don't know - it sounds like your default assumption is that
            | there is no issue because businesses can commit copyright
            | infringement/fraud and not be caught. I am not a lawyer, so I
            | can't comment on the merits of that approach.
            | 
            | Generally I think it is difficult for businesses to break the
            | law undetected, given that any one of their members might
            | defect on you.
           | 
           | Also I suspect that the logprobs for various sequences would
           | reveal which foundation model you used.
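As an illustration of that fingerprinting idea (toy numbers, not real model outputs): a model assigns each token in a sequence a probability, and the sum of the log-probabilities gives a sequence score. A finetune's characteristic outputs should score far higher under its true base model than under an unrelated foundation model. All names and probabilities below are hypothetical.

```python
import math

def sequence_logprob(token_probs):
    """Sum of log-probabilities a model assigns to each token in a sequence."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities that two candidate foundation models
# assign to the same output text (illustrative numbers only).
suspected_base = [0.30, 0.25, 0.40, 0.10]
unrelated_base = [0.02, 0.01, 0.05, 0.01]

lp_suspected = sequence_logprob(suspected_base)
lp_unrelated = sequence_logprob(unrelated_base)

# A consistently large gap across many sequences would point at the base
# model the finetune was derived from.
print(lp_suspected > lp_unrelated)  # True
```

In practice one would compare logprobs from the actual models' output distributions across many probe sequences, but the comparison logic is the same.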
        
         | yeldarb wrote:
         | Looks like the Vicuna repo is Apache 2.0 also[1].
         | 
          | What's the interpretation of copyright law that would prevent
          | the code from being Apache 2.0 based on the source of the
          | fine-tuning dataset?
         | 
         | [1] https://github.com/lm-sys/FastChat
        
           | whimsicalism wrote:
            | Not quite: FastChat is the inference code, which is Apache
            | 2.0, but it is distinct from the model artifact. If you look
            | at the model [0], it is licensed as non-commercial.
            | 
            | But why?
            | 
            | Well, for one, Vicuna is a Llama finetune, which already
            | excludes it from being Apache 2.0. It's also finetuned on OAI
            | data, which is questionable in terms of licensing (I don't
            | think you can really legally license a model trained on OAI
            | output as Apache 2.0 - although OAI doesn't really play by
            | its own rules, so who knows).
           | 
           | [0]: https://huggingface.co/lmsys/vicuna-13b-v1.3
        
             | yeldarb wrote:
             | Which part of copyright law are model weights governed by?
             | (Or, if not by copyright law, what's the legal basis that
             | would let you choose a "license" for model weights?)
        
         | dartos wrote:
          | Tbf the Llama license allows for small-business usage.
          | 
          | But also these models aren't watermarked or anything (not that
          | watermarking really works), so it's kind of the wild west.
        
       | kyriakos wrote:
       | I honestly have no idea what this project is about. It may be
       | because I'm completely out of the loop regarding LLMs but
       | still...
        
         | fkyoureadthedoc wrote:
         | I had no idea from the name, but the README does a good job of
         | explaining what it's about. Even has a nice video demo.
        
         | abrichr wrote:
         | Open source question answering over videos:
         | 
         | > With the binding of unified visual representations to the
         | language feature space, we enable an LLM to perform visual
         | reasoning capabilities on both images and videos
         | simultaneously.
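A toy sketch of the idea in that quote, with random projections standing in for trained encoders (all dimensions and names are hypothetical): separate image and video encoders map both modalities into one shared feature space with the same width as the LLM's token embeddings, so the language model can consume visual tokens and text tokens as a single sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

LLM_DIM = 64  # hypothetical language-model embedding width

# Stand-ins for trained modality encoders: both project into the SAME
# language feature space - the "unified visual representation" idea.
image_proj = rng.normal(size=(128, LLM_DIM))  # image features -> LLM space
video_proj = rng.normal(size=(256, LLM_DIM))  # video features -> LLM space

image_feats = rng.normal(size=(16, 128))      # 16 image patch features
video_feats = rng.normal(size=(8, 256))       # 8 video frame features
text_embeds = rng.normal(size=(12, LLM_DIM))  # 12 prompt token embeddings

# Align both modalities to the language feature space...
image_tokens = image_feats @ image_proj
video_tokens = video_feats @ video_proj

# ...then hand the LLM one mixed sequence of visual and text tokens.
llm_input = np.concatenate([image_tokens, video_tokens, text_embeds], axis=0)
print(llm_input.shape)  # (36, 64)
```

Because images and videos land in the same space, one model can reason over both modalities at once, which is the claim in the README.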
        
           | kyriakos wrote:
           | Thanks
        
         | btbuildem wrote:
         | The related paper is here: https://arxiv.org/pdf/2311.10122.pdf
         | 
         | I think the TL;DR is "it can tell what's in the video and
         | 'reason' about it"
        
       | astrea wrote:
        | Side note: why does every GitHub readme look like a children's
        | book these days? Emojis, big colorful graphics, gifs, cute
        | project logos, etc. It makes me feel awkward trying to read
        | about a serious topic with the ":o" emoji staring me in the
        | face. I'm just waiting for the air horns to start blaring and a
        | dancing cat to slide across my screen.
        
         | chankstein38 wrote:
          | Because you're dealing with humans, and sometimes humans don't
          | behave the way you apparently expect everyone to? These aren't
          | massive billion-dollar corps; they're an engineer or group of
          | engineers doing something that interests them.
          | 
          | In this case it seems related to a university, so these are
          | students and researchers, some of whom would very likely
          | qualify as kids to us old people.
          | 
          | Not sure why it's such a bother to you. Does a topic need to
          | be cold and black-and-white for it to further our
          | technological research? (That's rhetorical, because this repo,
          | for instance, absolutely furthers our tech abilities while
          | also being in a more friendly, non-academic format.)
        
         | Implicated wrote:
          | The closer a community is to Discord, the more things look
          | this way - at least that's my interpretation.
        
         | devmor wrote:
          | Emojis are part of the common vernacular now, and software
          | development is a mainstream career instead of a siloed-off
          | nerd haven.
        
         | j45 wrote:
          | Because it's more inviting to people beyond just those who
          | like text alone.
         | 
         | https://shuiblue.github.io/forcolab-uoft/paper/IST2022-emoji...
        
           | dartos wrote:
           | I love that this exists
        
             | j45 wrote:
             | Me too.
             | 
             | Not to say a study can't often be found for most
             | viewpoints.
        
         | geysersam wrote:
         | Couldn't agree more!
        
         | dymk wrote:
         | Do you use syntax highlighting?
        
         | dvngnt_ wrote:
          | You could also ask why serious writing often avoids adding
          | big, colorful graphics if they look better.
        
       | rajamaka wrote:
        | The demo just errors out, unfortunately.
        
       ___________________________________________________________________
       (page generated 2023-11-21 23:00 UTC)