[HN Gopher] MK-1
       ___________________________________________________________________
        
       MK-1
        
       Author : ejz
       Score  : 70 points
        Date   : 2023-08-05 21:20 UTC (1 hour ago)
        
 (HTM) web link (mkone.ai)
 (TXT) w3m dump (mkone.ai)
        
       | xianshou wrote:
       | Not a single mention of existing quantization techniques? Ten
       | bucks says this is just a wrapper around bitsandbytes or ggml.
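        | 
        | For context, that path is roughly a one-flag change in
        | transformers (a rough sketch, not MK-1's actual API; the
        | model id is just an example, and it assumes bitsandbytes and
        | accelerate are installed):
        | 
        |   from transformers import AutoModelForCausalLM, AutoTokenizer
        | 
        |   model_id = "meta-llama/Llama-2-13b-hf"  # example model
        |   tok = AutoTokenizer.from_pretrained(model_id)
        |   # load_in_8bit routes the linear layers through
        |   # bitsandbytes' LLM.int8() kernels
        |   model = AutoModelForCausalLM.from_pretrained(
        |       model_id, load_in_8bit=True, device_map="auto")
        |   inputs = tok("Hello", return_tensors="pt").to(model.device)
        |   out = model.generate(**inputs, max_new_tokens=32)
        |   print(tok.decode(out[0], skip_special_tokens=True))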
        
       | Philpax wrote:
       | ...isn't this just quantization?
        
         | bhouston wrote:
          | Whatever it is, it will likely be copied into open source
          | tooling like llama.cpp soonish, or something similar will
          | arrive independently. It doesn't seem like a defensible
          | advantage; it seems like a feature, and one that's fighting
          | against fast-moving open source alternatives.
        
         | atlas_hugged wrote:
         | Exactly what I was thinking. Everyone already does this. Unless
         | they're doing something else, they'll have to show why it's
         | better than just quickly quantizing to 8 bits or 4 bits or
         | whatever.
        
         | amelius wrote:
         | If you look at the demo video, the output is exactly the same
         | for both cases, so I doubt it uses quantization.
        
       | metadat wrote:
       | Too bad it's not an open source effort.
       | 
       | I'm not a fan of proprietary dependencies in my stack, full stop.
        
         | lolinder wrote:
         | I seriously doubt this will go anywhere. The open source
         | community has already achieved basically the same performance
         | improvements via quantization. This feels like someone has
         | repackaged those libraries and is going to try to sell them to
         | unwary and uninformed AI startups.
        
       | drtournier wrote:
       | MKML == abstractions and wrappers for GGML?
        
       | pestatije wrote:
       | > Today, we're announcing our first product, MKML. MKML is a
       | software package that can reduce LLM inference costs on GPUs by
       | 2x with just a few lines of Python code. And it is plug and play
       | with popular ecosystems like Hugging Face and PyTorch
        
         | cududa wrote:
          | No judgement, but I'm genuinely curious why you saw the need
          | to comment with a random sentence from their post?
        
           | qup wrote:
           | It's not a random sentence, it's the main sentence everyone
           | wants to read. They posted it to be helpful.
        
       | Scene_Cast2 wrote:
       | I've worked on ML model quantization. The open source 4-bit or
       | 8-bit quantization isn't as good as one can get - there are much
       | fancier techniques to keep predictive performance while squeezing
       | size.
       | 
       | Some techniques (like quantization-aware training) involve
       | changes to training.
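        | 
        | As a rough illustration of the training-side knob, here's what
        | eager-mode quantization-aware training looks like in stock
        | PyTorch (a toy sketch; the tiny model and random data are made
        | up, and this isn't anyone's production recipe):
        | 
        |   import torch
        |   import torch.nn as nn
        | 
        |   # Quant/DeQuant stubs mark where the int8 region begins/ends
        |   class TinyNet(nn.Module):
        |       def __init__(self):
        |           super().__init__()
        |           self.quant = torch.quantization.QuantStub()
        |           self.fc = nn.Linear(16, 16)
        |           self.relu = nn.ReLU()
        |           self.dequant = torch.quantization.DeQuantStub()
        |       def forward(self, x):
        |           return self.dequant(self.relu(self.fc(self.quant(x))))
        | 
        |   model = TinyNet().train()
        |   model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
        |   torch.quantization.prepare_qat(model, inplace=True)  # fake-quant ops
        | 
        |   opt = torch.optim.SGD(model.parameters(), lr=1e-2)
        |   for _ in range(10):  # stand-in for a real training loop
        |       x = torch.randn(8, 16)
        |       loss = model(x).pow(2).mean()
        |       opt.zero_grad(); loss.backward(); opt.step()
        | 
        |   int8_model = torch.quantization.convert(model.eval())  # int8 weights
        | 
        | The point is that the fake-quant ops are present during
        | training, so the weights learn to live with the rounding
        | instead of being rounded after the fact.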
        
         | lolinder wrote:
         | I'm sure there are better methods! But in this case, MKML's
         | numbers just don't look impressive when placed alongside the
         | prominent quantization techniques already in use. According to
         | this chart [0] it's most similar in size to a Q6_K
         | quantization, and if anything has slightly worse perplexity.
         | 
         | If their technique _were_ better, I imagine that the company
         | would acknowledge the existence of the open source techniques
         | and show them in their comparisons, instead of pretending the
         | only other option is the raw fp16 model.
         | 
         | [0]
         | https://old.reddit.com/r/LocalLLaMA/comments/142q5k5/updated...
        
         | KRAKRISMOTT wrote:
         | What about Unum's quantization methods?
         | 
         | https://github.com/unum-cloud/usearch
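          | 
          | (Note usearch is a vector search index, so its quantization
          | applies to the stored embeddings rather than LLM weights. A
          | rough sketch of the Python API, with random vectors as
          | placeholders:)
          | 
          |   import numpy as np
          |   from usearch.index import Index
          | 
          |   # dtype="i8" stores the vectors scalar-quantized to int8
          |   index = Index(ndim=256, metric="cos", dtype="i8")
          | 
          |   vectors = np.random.rand(100, 256).astype(np.float32)
          |   for key, vec in enumerate(vectors):
          |       index.add(key, vec)
          | 
          |   matches = index.search(vectors[0], 10)  # top-10 neighbours
          |   print(matches.keys)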
        
       | ipsum2 wrote:
        | Aren't FasterTransformer (NVidia, OSS) and text-generation-
        | inference (Huggingface, not OSS) faster than this?
        
       | lolinder wrote:
       | It's weird that not once do they mention or compare their results
       | to the already-available quantization methods. I normally try to
       | give benefit of the doubt, but there's really no way they're not
       | aware that there are already widely used techniques for
       | accomplishing this same thing, so the comparison benchmarks
       | _really_ should be there.
       | 
       | To fill in the gap, here's llama.cpp's comparison chart[0] for
       | the different quantizations available for Llama 1. We can't
       | compare directly with their Llama 2 metrics, but just comparing
       | the percent change in speed and perplexity, MK-1 looks very
        | similar to Q5_1. There's a small but not insignificant hit to
        | perplexity, and a speedup of just over 2x.
       | 
       | If these numbers are accurate, you can download pre-quantized
       | Llama 2 models from Hugging Face that will perform essentially
       | the same as what MK-1 is offering, with the Q5 files here:
       | https://huggingface.co/TheBloke/Llama-2-13B-GGML/tree/main
       | 
       | [0] https://github.com/ggerganov/llama.cpp#quantization
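        | 
        | If anyone wants to try that route, those pre-quantized files
        | run with llama-cpp-python in a few lines (a rough sketch; the
        | filename is whichever Q5_1 file you grab from the repo above):
        | 
        |   from llama_cpp import Llama  # pip install llama-cpp-python
        | 
        |   # point model_path at the downloaded GGML file
        |   llm = Llama(model_path="./llama-2-13b.ggmlv3.q5_1.bin",
        |               n_ctx=2048)
        |   out = llm("Q: What is quantization? A:", max_tokens=64,
        |             stop=["Q:"])
        |   print(out["choices"][0]["text"])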
        
         | andy_xor_andrew wrote:
          | Also, using the word "codecs" kind of leaves a bad taste in my
          | mouth. It's like they're trying to sound like they invented an
         | entirely new paradigm, with their own fancy name that reminds
         | people of video compression.
        
         | moffkalast wrote:
          | Q5_1 is already old news too; K quants are faster and more
          | space-efficient for the same perplexity loss.
         | 
         | https://old.reddit.com/r/LocalLLaMA/comments/142q5k5/updated...
        
           | lolinder wrote:
           | For sure, but I couldn't find numbers for the K quants that
           | included inference speeds, so I settled on the older one. If
           | MK-1 were trying to be honest they'd definitely want to
           | benchmark against the newest methods!
        
       ___________________________________________________________________
       (page generated 2023-08-05 23:00 UTC)