[HN Gopher] Large Transformer Model Inference Optimization
       ___________________________________________________________________
        
       Large Transformer Model Inference Optimization
        
       Author : headalgorithm
       Score  : 56 points
       Date   : 2023-01-20 19:27 UTC (3 hours ago)
        
 (HTM) web link (lilianweng.github.io)
 (TXT) w3m dump (lilianweng.github.io)
        
       | ilaksh wrote:
        | Are any of the companies with the largest, most capable models
        | doing these things? Maybe OpenAI has used some of them for
        | GPT-4.
       | 
       | But also maybe there is another company using a very large
       | dataset and some optimizations. I would love to have an
       | alternative so I wasn't 100% reliant on OpenAI.
        
         | jayalammar wrote:
         | We train and serve large models at cohere.ai. We've shared some
         | optimization techniques here: https://txt.cohere.ai/running-
         | large-language-models-in-produ...
        
           | ilaksh wrote:
           | Awesome! Can your models write code?
        
       | binarymax wrote:
       | I help teams run transformers in their production systems on CPU,
       | using my product based on ONNX Runtime.
       | 
        | This is a great article, but if you're using something based on
        | BERT or RoBERTa, you don't need to do much. Distillation is
        | usually the only step you need to take, and even then only if
        | you're really picky, or if your scale is millions of requests
        | per day and you're not making enough money to support the
        | infrastructure.
       | 
        | I have had mixed results with quantization and sparsification,
        | but IMO they're just not worth it, as they can be unstable.
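        | 
        | For readers unfamiliar with the distillation step mentioned
        | above, here is a rough sketch of the generic Hinton-style
        | distillation objective (train a small student to match the
        | teacher's temperature-softened outputs plus the hard labels).
        | Function names and constants are illustrative, not from any
        | particular library:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft KL term against the teacher's softened outputs with
    a hard cross-entropy term against the gold labels."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student), scaled by T^2 as in the original recipe,
    # so gradients keep a comparable magnitude across temperatures.
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                             - np.log(p_student + 1e-12)), axis=-1)
    # Standard cross-entropy on the gold labels at T=1.
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return np.mean(alpha * (T ** 2) * kl + (1 - alpha) * hard)
```

        | alpha and T are the usual tuning knobs: alpha balances soft
        | versus hard targets, and T controls how much of the teacher's
        | "dark knowledge" over non-target classes the student sees.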
        
       | haldujai wrote:
       | Or you could keep it simple and just not use a 500B parameter
       | model which is unnecessarily large for 99.9999999999999% of use
       | cases.
        
         | drdeca wrote:
         | I think that is likely too many nines (depending on how you are
         | counting and weighting "use cases").
        
           | haldujai wrote:
           | Sure, admittedly I was being overly hyperbolic and a bit
           | snarky.
           | 
            | However, I am genuinely curious: what sort of industrial
            | "real world task" is there that requires edge inference on
            | GPT-3.5- or PaLM-sized models, where you would run into
            | this problem without the infrastructure to handle it, and
            | would therefore need these potentially unstable tricks?
           | 
           | The point I was alluding to is that LLMs of this size are
           | overkill for most commercial use cases (e.g. NER, document
           | classification, semantic search, chat bot).
        
             | usmannk wrote:
             | This post isn't about "edge inference".
        
               | haldujai wrote:
               | Maybe I'm missing the point of the article then. What's
               | the low-resource scenario where inference speed is the
               | bottleneck for transformer adoption at scale?
        
               | bravura wrote:
               | New AI tasks are being unlocked by (large-scale)
               | foundation models (Liang, 2022).
               | 
               | Fine-tuning in low-resource (few-shot) scenarios is now
               | possible for many new applications.
               | 
                | However, these new AI applications relied on a huge
                | pretrained model to get there, because the old approach
                | of training from scratch on 100 labeled examples didn't
                | work well.
               | 
               | Thus, we want to distill the knowledge so that the model
               | can be deployed in low-resource scenarios.
               | 
               | [edit: I see your below comment about the concern about
               | transformer cost. Agreed. This is one of the many
               | concerns around foundation models that must be
               | understood. The happy path is that training the
               | foundation model is a one-time cost that pays dividends
               | in the many tasks it unlocks. However, you are correct
               | that the research to get there is quite spendy. I
               | encourage you to skim this paper. It's long but very
               | accessible: https://arxiv.org/pdf/2108.07258.pdf]
        
               | chipgap98 wrote:
               | I would imagine this would also bring down the cost of
               | running large models, which could increase their
               | adoption.
        
               | haldujai wrote:
               | Fair enough. I guess I'm biased by my working environment
               | and current belief that we're scaling transformer models
               | unnecessarily, but I guess that is also partly influenced
               | by their cost.
        
       | madlag wrote:
        | May I add another method: block fine-pruning of transformers
        | (pruning while fine-tuning)?
       | 
       | https://arxiv.org/abs/2109.04838
       | 
        | Using blocks allows you to keep good performance on GPUs while
        | giving some flexibility in the pruning pattern. And after
        | removing entirely empty rows and columns, the pruned matrices
        | are actually pretty dense, so they are competitive with
        | structured pruning for speedup but less "aggressive" on the
        | network during the pruning process.
       | Disclaimer: I am the main co-author.
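        | 
        | A toy NumPy sketch of the block-pruning idea (simple
        | magnitude-based block masking plus densification; the actual
        | method in the paper prunes with learned block scores during
        | fine-tuning, and the helper names here are hypothetical):

```python
import numpy as np

def block_prune(W, block=(4, 4), sparsity=0.5):
    """Zero out the lowest-magnitude (bh x bw) blocks of W, keeping
    roughly the fraction (1 - sparsity) with the largest L1 norm."""
    bh, bw = block
    h, w = W.shape
    assert h % bh == 0 and w % bw == 0
    # View W as a grid of blocks and score each block by its L1 norm.
    blocks = W.reshape(h // bh, bh, w // bw, bw)
    scores = np.abs(blocks).sum(axis=(1, 3))
    k = int(np.ceil(scores.size * sparsity))
    cutoff = np.sort(scores, axis=None)[k - 1] if k > 0 else -np.inf
    mask = (scores > cutoff).astype(W.dtype)
    # Broadcast the per-block mask back to element granularity.
    full_mask = np.repeat(np.repeat(mask, bh, axis=0), bw, axis=1)
    return W * full_mask

def densify(W):
    """Drop all-zero rows and columns, as described above, so the
    remaining matrix is dense and fast on ordinary GPU kernels."""
    rows = np.abs(W).sum(axis=1) > 0
    cols = np.abs(W).sum(axis=0) > 0
    return W[rows][:, cols]
```

        | The point of the block granularity is exactly what the comment
        | says: the zeros come in contiguous tiles, so once whole rows
        | and columns empty out you can shrink the matrix and run plain
        | dense matmuls instead of needing sparse kernels.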
        
       | bravura wrote:
        | Let us not forget that Lilian Weng was telling us to pay
        | attention to diffusion models before they became cool, and
        | definitely before your dad was using Stable Diffusion to
        | generate logos for his Rotary club.
        
       ___________________________________________________________________
       (page generated 2023-01-20 23:00 UTC)