[HN Gopher] MK-1
___________________________________________________________________

MK-1

Author : ejz
Score  : 70 points
Date   : 2023-08-05 21:20 UTC (1 hour ago)

(HTM) web link (mkone.ai)
(TXT) w3m dump (mkone.ai)

| xianshou wrote:
| Not a single mention of existing quantization techniques? Ten
| bucks says this is just a wrapper around bitsandbytes or ggml.

| Philpax wrote:
| ...isn't this just quantization?

| bhouston wrote:
| Whatever it is, it will likely be copied into open source
| tooling like llama.cpp soonish, or something similar will
| arrive in llama.cpp. It doesn't seem like a defensive
| advantage. It seems like a feature, and it's fighting against
| fast-moving open source alternatives.

| atlas_hugged wrote:
| Exactly what I was thinking. Everyone already does this.
| Unless they're doing something else, they'll have to show why
| it's better than just quickly quantizing to 8 bits or 4 bits
| or whatever.

| amelius wrote:
| If you look at the demo video, the output is exactly the same
| in both cases, so I doubt it uses quantization.

| metadat wrote:
| Too bad it's not an open source effort.
|
| I'm not a fan of proprietary dependencies in my stack, full
| stop.

| lolinder wrote:
| I seriously doubt this will go anywhere. The open source
| community has already achieved basically the same performance
| improvements via quantization. This feels like someone has
| repackaged those libraries and is going to try to sell them to
| unwary and uninformed AI startups.

| drtournier wrote:
| MKML == abstractions and wrappers for GGML?

| pestatije wrote:
| > Today, we're announcing our first product, MKML. MKML is a
| software package that can reduce LLM inference costs on GPUs
| by 2x with just a few lines of Python code. And it is plug and
| play with popular ecosystems like Hugging Face and PyTorch.

| cududa wrote:
| No judgement, but I'm genuinely curious why you saw the need
| to comment with a random sentence from their post?

| qup wrote:
| It's not a random sentence, it's the main sentence everyone
| wants to read. They posted it to be helpful.
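For context on the "few lines of Python" claim: the open-source route the
commenters keep pointing to is itself only a few lines with Hugging Face
transformers plus bitsandbytes. A minimal sketch, assuming the libraries
are installed (pip install transformers accelerate bitsandbytes); the
model id and prompt are illustrative, not anything from MK-1:

    # A minimal sketch of 8-bit weight quantization at load time via
    # bitsandbytes, exposed through transformers. The model id is
    # illustrative; any causal LM on the Hub loads the same way.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-13b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        load_in_8bit=True,   # int8 weights, roughly halving GPU memory vs fp16
        device_map="auto",   # let accelerate place layers on available GPUs
    )

    inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))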
| Scene_Cast2 wrote:
| I've worked on ML model quantization. The open source 4-bit or
| 8-bit quantization isn't as good as one can get - there are
| much fancier techniques to keep predictive performance while
| squeezing size.
|
| Some techniques (like quantization-aware training) involve
| changes to training.

| lolinder wrote:
| I'm sure there are better methods! But in this case, MKML's
| numbers just don't look impressive when placed alongside the
| prominent quantization techniques already in use. According to
| this chart [0] it's most similar in size to a Q6_K
| quantization, and if anything has slightly worse perplexity.
|
| If their technique _were_ better, I imagine that the company
| would acknowledge the existence of the open source techniques
| and show them in their comparisons, instead of pretending the
| only other option is the raw fp16 model.
|
| [0] https://old.reddit.com/r/LocalLLaMA/comments/142q5k5/updated...

| KRAKRISMOTT wrote:
| What about Unum's quantization methods?
|
| https://github.com/unum-cloud/usearch

| ipsum2 wrote:
| Aren't FasterTransformer (NVidia, OSS) and text-generation-
| inference (Huggingface, not OSS) faster than this?

| lolinder wrote:
| It's weird that not once do they mention or compare their
| results to the already-available quantization methods. I
| normally try to give the benefit of the doubt, but there's
| really no way they're not aware that there are already widely
| used techniques for accomplishing this same thing, so the
| comparison benchmarks _really_ should be there.
|
| To fill in the gap, here's llama.cpp's comparison chart [0]
| for the different quantizations available for Llama 1. We
| can't compare directly with their Llama 2 metrics, but just
| comparing the percent change in speed and perplexity, MK-1
| looks very similar to Q5_1. There's a small but not
| insignificant hit to perplexity, and a just over 2x speedup.
|
| If these numbers are accurate, you can download pre-quantized
| Llama 2 models from Hugging Face that will perform essentially
| the same as what MK-1 is offering, with the Q5 files here:
| https://huggingface.co/TheBloke/Llama-2-13B-GGML/tree/main
|
| [0] https://github.com/ggerganov/llama.cpp#quantization

| andy_xor_andrew wrote:
| Also, using the word "codecs" kind of puts a bad taste in my
| mouth. It's like they're trying to sound like they invented an
| entirely new paradigm, with their own fancy name that reminds
| people of video compression.

| moffkalast wrote:
| Q5_1 is already old news too; K-quants are faster and more
| space-efficient for the same perplexity loss.
|
| https://old.reddit.com/r/LocalLLaMA/comments/142q5k5/updated...

| lolinder wrote:
| For sure, but I couldn't find numbers for the K-quants that
| included inference speeds, so I settled on the older one. If
| MK-1 were trying to be honest they'd definitely want to
| benchmark against the newest methods!
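The pre-quantized route lolinder describes is similarly short. A minimal
sketch using llama-cpp-python (the Python bindings for llama.cpp),
assuming one of the Q5_1 files from the TheBloke repo linked above has
already been downloaded; n_gpu_layers and n_ctx are illustrative tuning
values, not required settings:

    # Sketch: load and run a pre-quantized GGML file with llama-cpp-python.
    # The filename matches TheBloke/Llama-2-13B-GGML; adjust the path and
    # values for your setup.
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-2-13b.ggmlv3.q5_1.bin",
        n_gpu_layers=40,  # offload layers to GPU (needs a CUDA/Metal build)
        n_ctx=2048,       # context window size
    )

    out = llm("Q: Why quantize an LLM? A:", max_tokens=64)
    print(out["choices"][0]["text"])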