[HN Gopher] Llama 2 on ONNX runs locally
___________________________________________________________________
 
Llama 2 on ONNX runs locally
 
Author : tmoneyy
Score  : 51 points
Date   : 2023-08-10 21:37 UTC (1 hour ago)
 
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
 
| hashtag-til wrote:
| This is very cool! I really hope the ONNX project gets much more
| adoption in the coming months and years and helps reduce the
| fragmentation in the ML ecosystem.
| brucethemoose2 wrote:
| Eh... I have seen ONNX demos for years, and they tend to stay
| barebones and slow, kinda like this.
|
| NCNN, MLIR and TVM based ports have been far more impressive.
| claytonjy wrote:
| I'm not sure there's much chance of that happening. ONNX seems
| to be the broadest in coverage, but for basically any model ONNX
| supports, there's a faster alternative.
|
| For the latest generative/transformer stuff (whisper, llama,
| etc.) it's often specialized C(++) stuff, but torch 2.0
| compilation keeps getting better, BetterTransformers, TensorRT,
| etc.
| turnsout wrote:
| Does anyone know the feasibility of converting the ONNX model to
| CoreML for accelerated inference on Apple devices?
| kiratp wrote:
| If you're working with LLMs, just use this -
| https://github.com/ggerganov/llama.cpp
|
| It has Metal support.
| brucethemoose2 wrote:
| MLC's Apache TVM implementation can also compile to Metal.
|
| Not sure if they made an autotuning profile for it yet.
| mchiang wrote:
| They used to have this: https://github.com/onnx/onnx-coreml
| rrherr wrote:
| How does this compare to using
| https://github.com/ggerganov/llama.cpp with
| https://huggingface.co/models?search=thebloke/llama-2-ggml ?
| version_five wrote:
| GGML / llama.cpp has a lot of hardware optimizations built in
| now: CPU, GPU, and specific instruction sets like for Apple
| silicon (I'm not familiar with the names). I would want to know
| how many of those are also present in ONNX and available to this
| model.
|
| As mentioned, there are also more quantization options
| available, though those come at a quality cost (they make the
| model faster but worse), so it depends on what you're optimizing
| for.
| brucethemoose2 wrote:
| ONNX is a format. There are different runtimes for different
| devices... but I can't speak for any of them.
|
| > specific instruction sets like for Apple silicon
|
| You are thinking of the Accelerate framework support, which is
| basically Apple's ARM CPU SIMD library.
|
| But llama.cpp also has a Metal GPU backend, which is the de
| facto backend for Apple devices now.
| [deleted]
| brucethemoose2 wrote:
| Very unfavorably. Mostly because the ONNX models are FP32/FP16
| (so ~3-4x the RAM use), but also because llama.cpp is well
| optimized, with many features (like prompt caching, grammar,
| device splitting, context extending, CFG...)
|
| MLC's Apache TVM implementation is also excellent. The
| autotuning in particular is like black magic.
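 
The ~3-4x figure above is easy to sanity-check with back-of-the-
envelope arithmetic. A minimal sketch in Python (the 7B parameter
count is rounded, and the GGML bits-per-weight figures are
approximate averages that include the per-block scale factors):
 
    # Approximate weight memory for Llama-2-7B at different precisions
    # (weights only; the KV cache and activations add more on top).
    PARAMS = 7_000_000_000  # ~7B parameters, rounded

    def weights_gib(bits_per_param: float) -> float:
        """Weight memory in GiB at a given average bits-per-parameter."""
        return PARAMS * bits_per_param / 8 / 2**30

    for name, bits in [("FP32", 32), ("FP16", 16),
                       ("GGML q8_0", 8.5), ("GGML q4_0", 4.5)]:
        print(f"{name:>9}: {weights_gib(bits):5.1f} GiB")
 
This prints roughly 26.1, 13.0, 6.9 and 3.7 GiB respectively, so an
FP16 export needs about 3.5x the memory of a 4-bit GGML file, in
line with the comment above.
 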
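For the llama.cpp side of the comparison, running one of TheBloke's
quantized files takes only a few lines through the llama-cpp-python
bindings. A minimal sketch, assuming the package is installed
(built with Metal on Apple silicon) and a q4_0 GGML file has
already been downloaded; the file name here is illustrative:
 
    # Quantized Llama 2 inference via llama.cpp's Python bindings.
    # Requires: pip install llama-cpp-python
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-2-7b.ggmlv3.q4_0.bin",  # illustrative path
        n_gpu_layers=1,  # >0 offloads to the Metal backend on macOS builds
        n_ctx=2048,      # context window
    )

    out = llm("Q: What is ONNX? A:", max_tokens=64, stop=["Q:"])
    print(out["choices"][0]["text"])
 
Setting n_gpu_layers above zero is what engages the Metal support
mentioned in the thread above.
 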
| skeletoncrew wrote:
| I tried quite a few of these and the ONNX one seems the most
| elegantly put together of all. I'm impressed.
|
| Speed can be improved. Quick and dirty/hype solutions, not sure.
|
| I really hope ONNX gets the traction it deserves.
| brucethemoose2 wrote:
| > ONNX one seems the most elegantly put together of all.
|
| What do you mean by this? The demo UI? Code quality?
| version_five wrote:
| > Quick and dirty/hype solutions, not sure.
|
| Curious what you mean by this
| moffkalast wrote:
| These are still FP16/32 models, almost certainly a few times
| slower and larger than the latest N-bit quantized GGMLs.
| glitchc wrote:
| How was this allowed? I was under the impression that companies
| the size of Microsoft needed to contact Meta to negotiate a
| license.
|
| Excerpt from the license:
|
| _Additional Commercial Terms. If, on the Llama 2 version release
| date, the monthly active users of the products or services made
| available by or for Licensee, or Licensee's affiliates, is
| greater than 700 million monthly active users in the preceding
| calendar month, you must request a license from Meta, which Meta
| may grant to you in its sole discretion, and you are not
| authorized to exercise any of the rights under this Agreement
| unless or until Meta otherwise expressly grants you such rights._
| amelius wrote:
| > To get access permissions to the Llama 2 model, please fill
| out the Llama 2 access request form. If allowable, you will
| receive GitHub access in the next 48 hours, but usually much
| sooner.
|
| I guess they send the form to Meta?
|
| Anyway, I hope this is not what open source will be like from
| now on.
| thadk wrote:
| > Meta and Microsoft have been longtime partners on AI, starting
| with a collaboration to integrate ONNX Runtime with PyTorch to
| create a great developer experience for PyTorch on Azure, and
| Meta's choice of Azure as a strategic cloud provider. (sic)
|
| https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-me...
| stu2b50 wrote:
| So they negotiated a license? Meta partnered with Azure for the
| Llama 2 launch; there's no reason to think they're antagonistic
| towards each other.
| flatfuzz wrote:
| [dead]
___________________________________________________________________
(page generated 2023-08-10 23:00 UTC)