[HN Gopher] Llama 2 on ONNX runs locally
       ___________________________________________________________________
        
       Llama 2 on ONNX runs locally
        
       Author : tmoneyy
       Score  : 51 points
       Date   : 2023-08-10 21:37 UTC (1 hour ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | hashtag-til wrote:
        | This is very cool! I really hope the ONNX project gets much
        | more adoption in the coming months and years and helps reduce
        | the fragmentation in the ML ecosystem.
        
         | brucethemoose2 wrote:
         | Eh... I have seen ONNX demos for years, and they tend to stay
         | barebones and slow, kinda like this.
         | 
          | NCNN-, MLIR- and TVM-based ports have been far more
          | impressive.
        
         | claytonjy wrote:
         | I'm not sure there's much chance of that happening. ONNX seems
         | to be the broadest in coverage, but for basically any model
         | ONNX supports, there's a faster alternative.
         | 
          | For the latest generative/transformer stuff (Whisper,
          | Llama, etc.) it's often specialized C(++) code, but torch
          | 2.0 compilation keeps getting better, as do
          | BetterTransformer, TensorRT, etc.
        
       | turnsout wrote:
       | Does anyone know the feasibility of converting the ONNX model to
       | CoreML for accelerated inference on Apple devices?
        
         | kiratp wrote:
         | If you're working with LLMs, just use this -
         | https://github.com/ggerganov/llama.cpp
         | 
         | It has Metal support.
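          | 
          | For example, via the llama-cpp-python bindings (a sketch;
          | the model path is a placeholder, and it assumes a
          | Metal-enabled build):
          | 
          |     from llama_cpp import Llama
          | 
          |     # n_gpu_layers offloads transformer layers to the Metal
          |     # backend on Apple silicon
          |     llm = Llama(model_path="llama-2-7b.ggmlv3.q4_0.bin",
          |                 n_gpu_layers=32)
          |     out = llm("Q: What is ONNX? A:", max_tokens=64)
          |     print(out["choices"][0]["text"])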
        
         | brucethemoose2 wrote:
         | MLC's Apache TVM implementation can also compile to Metal.
         | 
         | Not sure if they made an autotuning profile for it yet.
        
         | mchiang wrote:
         | They used to have this: https://github.com/onnx/onnx-coreml
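          | 
          | Its documented usage was roughly the following (a sketch;
          | the repo is archived now, and whether it copes with a graph
          | this large is another question):
          | 
          |     from onnx_coreml import convert
          | 
          |     # Convert an ONNX graph to a CoreML model and save it
          |     mlmodel = convert(model="model.onnx")
          |     mlmodel.save("model.mlmodel")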
        
       | rrherr wrote:
       | How does this compare to using
       | https://github.com/ggerganov/llama.cpp with
       | https://huggingface.co/models?search=thebloke/llama-2-ggml ?
        
         | version_five wrote:
          | GGML / llama.cpp has a lot of hardware optimizations built
          | in now: CPU, GPU, and specific instruction sets, like the
          | ones for Apple silicon (I'm not familiar with the names). I
          | would want to know how many of those are also present in
          | ONNX and available to this model.
         | 
          | As mentioned, there are currently also more quantization
          | options available, though those incur a quality loss (they
          | make the model faster but worse), so it depends on what
          | you're optimizing for.
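          | 
          | For reference, ONNX Runtime does ship a quantization API;
          | whether it copes well with this particular export is a
          | separate question. A minimal sketch:
          | 
          |     from onnxruntime.quantization import (
          |         QuantType,
          |         quantize_dynamic,
          |     )
          | 
          |     # Rewrites weights to int8; activations stay float and
          |     # are quantized on the fly at inference time
          |     quantize_dynamic("model.onnx", "model.int8.onnx",
          |                      weight_type=QuantType.QInt8)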
        
           | brucethemoose2 wrote:
           | ONNX is a format. There are different runtimes for different
           | devices... But I can't speak for any of them.
           | 
           | > specific instruction sets like for apple silicon
           | 
           | You are thinking of the Accelerate framework support, which
           | is basically Apple's ARM CPU SIMD library.
           | 
            | But llama.cpp also has a Metal GPU backend, which is the
            | de facto backend for Apple devices now.
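            | 
            | Choosing one is mostly a one-line decision in ONNX
            | Runtime, e.g. (the provider list is illustrative, and the
            | input name/shape depend on the exported graph):
            | 
            |     import numpy as np
            |     import onnxruntime as ort
            | 
            |     # Falls through the provider list until one is
            |     # available on this machine
            |     sess = ort.InferenceSession(
            |         "model.onnx",
            |         providers=["CoreMLExecutionProvider",
            |                    "CPUExecutionProvider"])
            |     tokens = np.zeros((1, 8), dtype=np.int64)
            |     out = sess.run(None, {"input_ids": tokens})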
        
         | brucethemoose2 wrote:
         | Very unfavorably. Mostly because the ONNX models are FP32/FP16
         | (so ~3-4x the RAM use), but also because llama.cpp is well
         | optimized with many features (like prompt caching, grammar,
         | device splitting, context extending, cfg...)
         | 
         | MLC's Apache TVM implementation is also excellent. The
         | autotuning in particular is like black magic.
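          | 
          | Back-of-the-envelope weight memory for a 7B model (treating
          | q4_0 as roughly 4.5 bits per weight, so 0.5625 bytes is an
          | approximation):
          | 
          |     params = 7e9
          |     for name, nbytes in [("fp32", 4), ("fp16", 2),
          |                          ("q4_0", 0.5625)]:
          |         print(f"{name}: {params * nbytes / 2**30:.1f} GiB")
          | 
          |     # fp32: 26.1 GiB / fp16: 13.0 GiB / q4_0: 3.7 GiB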
        
           | skeletoncrew wrote:
           | I tried quite a few of these and the ONNX one seems the most
           | elegantly put together of all. I'm impressed.
           | 
           | Speed can be improved. Quick and dirty/hype solutions, not
           | sure.
           | 
            | I really hope ONNX gets the traction it deserves.
        
             | brucethemoose2 wrote:
             | > ONNX one seems the most elegantly put together of all.
             | 
             | What do you mean by this? The demo UI? Code quality?
        
             | version_five wrote:
             | > Quick and dirty/hype solutions, not sure.
             | 
             | Curious what you mean by this
        
         | moffkalast wrote:
         | These are still FP16/32 models, almost certainly a few times
          | slower and larger than the latest N-bit quantized GGMLs.
        
       | glitchc wrote:
       | How was this allowed? I was under the impression that companies
       | the size of Microsoft needed to contact Meta to negotiate a
       | license.
       | 
       | Excerpt from the license:
       | 
       |  _Additional Commercial Terms. If, on the Llama 2 version release
       | date, the monthly active users of the products or services made
        | available by or for Licensee, or Licensee's affiliates, is
       | greater than 700 million monthly active users in the preceding
       | calendar month, you must request a license from Meta, which Meta
       | may grant to you in its sole discretion, and you are not
       | authorized to exercise any of the rights under this Agreement
       | unless or until Meta otherwise expressly grants you such rights._
        
         | amelius wrote:
         | > To get access permissions to the Llama 2 model, please fill
         | out the Llama 2 access request form. If allowable, you will
         | receive GitHub access in the next 48 hours, but usually much
         | sooner.
         | 
         | I guess they send the form to Meta?
         | 
         | Anyway, I hope this is not what Open Source will be like from
         | now on.
        
         | thadk wrote:
         | > Meta and Microsoft have been longtime partners on AI,
         | starting with a collaboration to integrate ONNX Runtime with
         | PyTorch to create a great developer experience for PyTorch on
         | Azure, and Meta's choice of Azure as a strategic cloud
         | provider. (sic)
         | 
         | https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-me...
        
         | stu2b50 wrote:
         | So they negotiated a license? Meta partnered with Azure for the
         | Llama 2 launch, there's no reason to think that they're
         | antagonistic towards each other.
        
       ___________________________________________________________________
       (page generated 2023-08-10 23:00 UTC)