
Large Model Evaluation Troubleshooting Guide | About Inference


This is the first article in the Large Model Evaluation Troubleshooting Guide series. The series covers:

  • About inference
  • About \(\LaTeX\) formula parsing
  • About reproducibility

What should I do if the model runs very slowly?

Adjust batch size

If you want the evaluation results to be fully reproducible (for a given input prompt and hardware), set the batch size to 1. However, increasing the batch size (if the hardware allows) will speed up inference.
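As a concrete illustration, here is a minimal sketch, assuming a Hugging Face causal LM; the "gpt2" checkpoint and the prompts are placeholders. It batches several evaluation prompts into a single generate() call instead of looping over them one at a time:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no pad token
tokenizer.padding_side = "left"             # left padding for decoder-only generation
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompts = ["Question: 2+2=", "Question: What is the capital of France?"]

# batch_size = 1 gives the most reproducible results;
# batching several prompts together trades some of that for throughput.
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```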

Data parallelism

You can load a copy of the model on each of several GPUs, split the dataset into subsets, assign one subset to each GPU, and finally merge all the results.
Each subset is then processed in parallel at the same time, which divides the total execution time roughly by the number of GPUs (with three GPUs, about one-third of the original time). Try to keep the GPUs on a single node to avoid cross-node transfer bottlenecks.
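A minimal sketch of this idea, assuming the transformers and datasets libraries; the "gpt2" checkpoint, the gsm8k dataset, and the generation settings are placeholders. Each process loads a full model copy on its own GPU and evaluates its own shard of the dataset:

```python
import torch
import torch.multiprocessing as mp
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def worker(rank, world_size, results):
    device = f"cuda:{rank}"
    dataset = load_dataset("gsm8k", "main", split="test")      # placeholder dataset
    shard = dataset.shard(num_shards=world_size, index=rank)   # this GPU's subset
    tok = AutoTokenizer.from_pretrained("gpt2")                 # placeholder checkpoint
    model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
    preds = []
    for example in shard:
        inputs = tok(example["question"], return_tensors="pt").to(device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=32)
        preds.append(tok.decode(out[0], skip_special_tokens=True))
    results[rank] = preds                                       # collect per-GPU results

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    with mp.Manager() as manager:
        results = manager.dict()
        mp.spawn(worker, args=(world_size, results), nprocs=world_size, join=True)
        all_preds = [p for r in range(world_size) for p in results[r]]  # merge results
        print(len(all_preds), "predictions")
```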

Adjust the code

Because of differences in code optimization, different inference libraries run at different speeds. You may need to run some comparison experiments to pick the fastest library. If your model is implemented at the PyTorch level, you can refer to this inference optimization checklist.
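When comparing libraries, the simplest approach is to time the same generation workload in each one. A minimal sketch, assuming the transformers library and a CUDA device; the "gpt2" checkpoint, prompt, and token budget are placeholders:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                      # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda").eval()
inputs = tok("The quick brown fox", return_tensors="pt").to("cuda")

torch.cuda.synchronize()                     # make sure timing covers all GPU work
start = time.perf_counter()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} new tokens in {elapsed:.2f}s")
```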

Adjust precision

You can shrink the model by lowering its numerical precision, which speeds up inference. A float32 model (32 bits per number) computes accurately but consumes a lot of memory and compute. Reducing the precision to bfloat16 or float16 (half precision) can double the speed with almost no effect on results. If you need to go further, you can quantize to even lower precision, such as 8-bit or 4-bit (the gptq or bitsandbytes libraries can perform the quantization). Low-bit matrix computations are faster (though some quantization libraries are actually a bit slower, so it is best to benchmark on your own model), and the model also uses less memory.
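A minimal sketch of loading a checkpoint in half precision with transformers; the checkpoint name is an assumption and any causal LM works the same way:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",      # placeholder checkpoint
    torch_dtype=torch.bfloat16,       # or torch.float16 on GPUs without bf16 support
    device_map="auto",
)
```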

What should I do if the memory usage is very large?

Estimate memory requirements

You can use the following formula to estimate the theoretical minimum memory needed to load a model on a given piece of hardware:

<Memory (GB)> = <Number of parameters (in billions)> * <Precision factor>

The total memory required by the model equals the number of parameters multiplied by the number of bytes needed per parameter. One byte is 8 bits, and the precision factor depends on the data type: 4 for float32, 2 for float16 or bfloat16, 1 for 8-bit, and 0.5 for 4-bit.

This is the basic estimation method.

In practice, I suggest computing it like this: <Memory (GB)> = <Number of parameters (in billions)> * (<Precision factor> * 110%), because at inference time the total memory needed is somewhat larger than just loading the model weights.
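A small helper that implements the estimate above; the 10% headroom follows the rule of thumb in this section:

```python
# Bytes per parameter for each precision.
PRECISION_FACTOR = {"float32": 4, "bfloat16": 2, "float16": 2, "8bit": 1, "4bit": 0.5}

def estimate_memory_gb(num_params_billion: float, precision: str = "float16") -> float:
    base = num_params_billion * PRECISION_FACTOR[precision]  # weights only
    return base * 1.10                                        # +10% inference headroom

# Example: a 7B model in float16 needs roughly 7 * 2 * 1.1 ≈ 15.4 GB.
print(estimate_memory_gb(7, "float16"))
```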

What should I do if the model can’t even fit on a GPU?

Quantization

The first obvious method is to reduce the <Precision factor>: going from float32 down to 4-bit cuts memory usage by a factor of 8!
However, precision that is too low can degrade the results. For some models (especially medium-sized ones) float16 or 8-bit is good enough (low precision has a smaller impact on large models, probably because of information redundancy).
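A minimal sketch of 4-bit loading with bitsandbytes through transformers; the checkpoint name and compute dtype are assumptions, and gptq-quantized checkpoints can also be loaded directly with from_pretrained:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # computation still runs in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```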

Model parallelism

Model parallelism covers a family of techniques: splitting a large model into several smaller sub-models, assigning the sub-models to run on different GPUs, and so on. This approach avoids loading the entire model at once, which reduces memory requirements, but it may be slower.

There are two main ways to parallelize a model:

  • Pipeline parallelism. The model is split at the layer level, and different layers are assigned to different GPUs. Since the forward pass is sequential at inference time (for example, the output of layer 1 is the input of layer 2), the GPU holding layer 2 has to wait for layer 1 to finish before it can start (this idle time is called a "bubble"). Data and intermediate results also have to be transferred between GPUs, which slows execution. This can be alleviated by splitting the input into smaller micro-batches; PyTorch's native PiPPy library supports this, and it is also how the accelerate library implements parallelism under the hood.
  • Tensor parallelism. The model is split at the matrix-computation level: each weight matrix is split by rows or columns, the pieces are assigned to different GPUs, and the partial results are merged. This approach can be very efficient when the GPUs are on the same node (avoiding cross-node bottlenecks), but it is somewhat harder to implement. Fortunately the vllm library already implements it, and the speedup is very noticeable (see the sketch after this list).
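A minimal sketch of tensor parallelism with vllm; the checkpoint name and GPU count are placeholders. The weights are sharded across the GPUs of one node:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-v0.1",  # placeholder checkpoint
    tensor_parallel_size=2,             # number of GPUs on the node
)
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=16),
)
print(outputs[0].outputs[0].text)
```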

More parallelism methods (including data parallelism and others) are covered in this document.

Offload to the CPU

CPU offloading moves part of the model and its computation from the GPU to the CPU to reduce GPU memory usage. However, compared with the other methods, CPU offloading is much slower, mainly because data has to be moved back and forth between devices frequently.

For example, DeepSpeed's ZeRO-Offload can place parameters on both the CPU and the GPU (the ZeRO-2 paper has more detailed optimization notes). Gradients, optimizer states, and fp32 model parameters used during optimization live on the CPU, while the fp16 parameters used in the forward and backward passes stay on the GPU. This exploits CPU memory, keeps heavy computation on the GPU, and reduces communication between the two.
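DeepSpeed is one option; another lightweight one, shown in this minimal sketch, is accelerate-style offloading through transformers' device_map (the checkpoint name and memory limits are assumptions). Layers that do not fit within the GPU budget are kept in CPU RAM and swapped in as needed:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",                 # placeholder checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "48GiB"},     # cap GPU 0, spill the rest to CPU
)
```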

What should I do if the model loads onto the GPU but I still get OOM errors?

You may have a problem with context size.

We recommend:

  1. Load the model together with dummy data to test whether the GPU overflows. Make sure the context length of the dummy data is large enough to represent your task (see the sketch after this list).
  2. Reduce the batch size, or turn off the automatic batch-size search feature (when enabled, it can sometimes cause OOM errors).
  3. More generally, sort the input samples by context size from largest to smallest. That way, if a context is too large the model fails right away, instead of running fine at first and only crashing partway through the evaluation.
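A minimal sketch of step 1; the "gpt2" checkpoint and the context length are placeholders. It runs one forward pass on dummy input at the largest context length your task needs, so an out-of-memory error shows up immediately rather than mid-evaluation:

```python
import torch
from transformers import AutoModelForCausalLM

model_name = "gpt2"                               # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda").eval()

max_context = 1024                                # largest context your task uses
dummy = torch.randint(0, model.config.vocab_size, (1, max_context), device="cuda")
with torch.no_grad():
    model(dummy)                                  # OOMs here if the context is too big
print("A context of", max_context, "tokens fits in GPU memory.")
```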

Original English:

Original author: clefourrier

Translator: SuSung-boy

Reviewed: adeenayakup