[HN Gopher] Large Transformer Model Inference Optimization
___________________________________________________________________

Large Transformer Model Inference Optimization

Author : headalgorithm
Score  : 56 points
Date   : 2023-01-20 19:27 UTC (3 hours ago)

(HTM) web link (lilianweng.github.io)
(TXT) w3m dump (lilianweng.github.io)

| ilaksh wrote:
| Are any companies with the largest, most capable models doing
| these things? Maybe OpenAI has used some of them for GPT-4.
|
| But maybe there is also another company using a very large
| dataset and some of these optimizations. I would love to have
| an alternative so I wasn't 100% reliant on OpenAI.

  | jayalammar wrote:
  | We train and serve large models at cohere.ai. We've shared
  | some optimization techniques here:
  | https://txt.cohere.ai/running-large-language-models-in-produ...

    | ilaksh wrote:
    | Awesome! Can your models write code?

| binarymax wrote:
| I help teams run transformers in their production systems on
| CPU, using my product based on ONNX Runtime.
|
| This is a great article, but if you're using something based
| on BERT or RoBERTa, you don't need to do much. Distillation is
| usually the only step you need to take, and only if you're
| really picky or if your scale is millions of requests per day
| and you're not making enough money to support the
| infrastructure.
|
| I have had mixed results with quantization and sparsification,
| but IMO they're just not worth it, as they can be unstable.
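A minimal sketch of the kind of CPU setup binarymax describes,
using the onnxruntime and transformers Python packages. The model
file, checkpoint name, and input names below are illustrative
assumptions rather than details of his product; they depend on how
the model was exported to ONNX.

    import numpy as np
    import onnxruntime as ort
    from transformers import AutoTokenizer

    # Tokenizer for a distilled BERT-class model; the checkpoint
    # name is illustrative. DistilBERT keeps the inputs simple
    # (no token_type_ids).
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    # Run on CPU only, with graph optimizations (operator fusion
    # and the like) enabled.
    opts = ort.SessionOptions()
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    session = ort.InferenceSession("model.onnx", opts,
                                   providers=["CPUExecutionProvider"])

    batch = tokenizer(["Transformers can be fast enough on CPU."],
                      padding=True, truncation=True,
                      return_tensors="np")

    # Input names must match those chosen when the model was
    # exported to ONNX.
    outputs = session.run(None, {
        "input_ids": batch["input_ids"].astype(np.int64),
        "attention_mask": batch["attention_mask"].astype(np.int64),
    })
    print(outputs[0].shape)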
| haldujai wrote:
| Or you could keep it simple and just not use a 500B-parameter
| model, which is unnecessarily large for 99.9999999999999% of
| use cases.

  | drdeca wrote:
  | I think that is likely too many nines (depending on how you
  | are counting and weighting "use cases").

    | haldujai wrote:
    | Sure, admittedly I was being overly hyperbolic and a bit
    | snarky.
    |
    | However, I am genuinely curious what sort of industrial
    | "real world task" there is that requires edge inference on
    | GPT-3.5- or PaLM-sized models, where you would run into
    | this problem without having the infrastructure, and
    | therefore need these potentially unstable tricks.
    |
    | The point I was alluding to is that LLMs of this size are
    | overkill for most commercial use cases (e.g. NER, document
    | classification, semantic search, chat bots).

      | usmannk wrote:
      | This post isn't about "edge inference".

        | haldujai wrote:
        | Maybe I'm missing the point of the article then.
        | What's the low-resource scenario where inference speed
        | is the bottleneck for transformer adoption at scale?

          | bravura wrote:
          | New AI tasks are being unlocked by (large-scale)
          | foundation models (Liang, 2022).
          |
          | Fine-tuning in low-resource (few-shot) scenarios is
          | now possible for many new applications.
          |
          | However, these new AI applications relied upon a
          | huge pretrained model to get there, because the old
          | approach of training from scratch on 100 labeled
          | examples didn't work well.
          |
          | Thus, we want to distill the knowledge so that the
          | model can be deployed in low-resource scenarios.
          |
          | [edit: I see your comment below about the concern
          | over transformer cost. Agreed. This is one of the
          | many concerns around foundation models that must be
          | understood. The happy path is that training the
          | foundation model is a one-time cost that pays
          | dividends in the many tasks it unlocks. However, you
          | are correct that the research to get there is quite
          | spendy. I encourage you to skim this paper. It's
          | long but very accessible:
          | https://arxiv.org/pdf/2108.07258.pdf]

          | chipgap98 wrote:
          | I would imagine this would also bring down the cost
          | of running large models, which could increase their
          | adoption.

          | haldujai wrote:
          | Fair enough. I guess I'm biased by my working
          | environment and my current belief that we're scaling
          | transformer models unnecessarily, but I guess that
          | is also partly influenced by their cost.

| madlag wrote:
| May I add another method: block fine-pruning of transformers
| (pruning while fine-tuning)?
|
| https://arxiv.org/abs/2109.04838
|
| Using blocks makes it possible to keep good performance on
| GPUs while giving some flexibility in the pruning pattern. And
| after removing entirely empty rows and columns, the pruned
| matrices are actually pretty dense, so the result is
| competitive with structured pruning for speedup but less
| "aggressive" on the network during the pruning process. (A
| rough sketch of the idea appears at the end of the page.)
| Disclaimer: I am the main co-author.

| bravura wrote:
| Let us not forget that Lilian Weng was telling us to pay
| attention to diffusion models before they became cool, and
| definitely before your dad was using Stable Diffusion to
| generate logos for his rotary club.
___________________________________________________________________
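To make madlag's block-pruning comment above concrete, here is a
rough sketch of the idea from the paper (arXiv:2109.04838), not the
authors' actual implementation: weights are scored in fixed-size
tiles during fine-tuning, low-scoring tiles are zeroed, and rows
and columns that end up entirely empty are physically removed, so
the remaining matrix is smaller but still dense. The 32x32 block
size, the magnitude-based score, and the 50% keep ratio are
illustrative simplifications; the paper learns per-block scores
(movement pruning) rather than using weight magnitudes.

    import torch

    def prune_blocks(weight: torch.Tensor, block: int = 32,
                     keep: float = 0.5) -> torch.Tensor:
        """Zero out the lowest-scoring (block x block) tiles of `weight`."""
        rows, cols = weight.shape
        assert rows % block == 0 and cols % block == 0
        # View the matrix as a grid of tiles and score each tile by its
        # mean absolute value (a stand-in for learned pruning scores).
        tiles = weight.reshape(rows // block, block, cols // block, block)
        scores = tiles.abs().mean(dim=(1, 3))  # (rows/block, cols/block)
        k = int(keep * scores.numel())         # number of tiles to keep
        cutoff = scores.flatten().kthvalue(scores.numel() - k + 1).values
        mask = (scores >= cutoff).to(weight.dtype)  # 1 = keep, 0 = prune
        return (tiles * mask[:, None, :, None]).reshape(rows, cols)

    def compact(weight: torch.Tensor) -> torch.Tensor:
        """Physically drop all-zero rows/columns, leaving a dense matrix."""
        keep_rows = weight.abs().sum(dim=1) > 0
        keep_cols = weight.abs().sum(dim=0) > 0
        return weight[keep_rows][:, keep_cols]

    w = torch.randn(768, 3072)          # e.g. an FFN weight in BERT-base
    w_dense = compact(prune_blocks(w))  # in real fine-pruning, a regularizer
    print(w.shape, "->", w_dense.shape) # pushes whole rows/columns to zero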