Overview
NVIDIA Model Optimizer provides state-of-the-art model optimization techniques, including quantization (FP8, INT4, NVFP4), pruning, knowledge distillation, and speculative decoding, to produce models that can be deployed on TensorRT-LLM, SGLang, and vLLM.
Installation
uv pip install nvidia-modelopt
Basic quantization
import modelopt.torch.quantization as mtq
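# FP8 post-training quantization
# `model` is a loaded torch.nn.Module; `calib_loop(model)` runs calibration batches through it.
quant_cfg = mtq.FP8_DEFAULT_CFG
model = mtq.quantize(model, quant_cfg, forward_loop=calib_loop)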
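# Export for TensorRT-LLM
import torch
from modelopt.torch.export import export_tensorrt_llm_checkpoint
# decoder_type names the architecture ("llama" is an assumed example), dtype is the
# unquantized weight dtype, and export_dir is a directory rather than a single file.
export_tensorrt_llm_checkpoint(model, decoder_type="llama", dtype=torch.float16, export_dir="trtllm_ckpt")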
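To make the calibration step in the quantization call above more concrete, here is a minimal pure-Python sketch of max-calibrated, per-tensor FP8 (E4M3) fake quantization. It illustrates the general idea only and is not Model Optimizer's implementation; the helper names (`round_to_e4m3`, `calibrate_amax`, `fake_quantize`) and the sample activations are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest representable E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**exp with 0.5 <= m < 1, so the binade exponent is exp - 1.
    # Exponents below -6 fall in the subnormal range, which shares one step size.
    e = max(math.frexp(a)[1] - 1, -6)
    step = 2.0 ** (e - 3)  # 3 mantissa bits give 8 steps per binade
    return sign * min(round(a / step) * step, E4M3_MAX)


def calibrate_amax(batches):
    """Max calibration: track the largest absolute value seen across all batches."""
    return max(abs(v) for batch in batches for v in batch)


def fake_quantize(values, amax):
    """Scale values into the E4M3 range, round them, and scale back."""
    scale = amax / E4M3_MAX
    return [round_to_e4m3(v / scale) * scale for v in values]


# Hypothetical calibration activations, standing in for what a forward loop would produce.
calib_batches = [[0.02, -1.5, 0.7], [3.2, -0.004, 0.9]]
amax = calibrate_amax(calib_batches)
original = [0.7, -1.5, 3.2]
dequantized = fake_quantize(original, amax)
```

The calibration data only determines `amax`, and therefore the scale. This is why a small, representative calibration set is usually sufficient for post-training quantization.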
Pruning + Distillation
import modelopt.torch.prune as mtp
import modelopt.torch.distill as mtd
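# Prune: mtp.prune takes a mode, constraints, and a dummy input, and returns (model, state).
# The "fastnas" mode, the 70% FLOPs target, and the dummy_input/train_loader/score_func names
# are assumptions; 2:4 structured sparsity is handled separately by modelopt.torch.sparsity.
pruned, _ = mtp.prune(model, mode="fastnas", constraints={"flops": "70%"}, dummy_input=dummy_input,
    config={"data_loader": train_loader, "score_func": score_func, "checkpoint": "prune_search.pth"})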
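# Distill: wrap the pruned student with the teacher and a logits-based KD loss.
teacher = load_teacher_model()  # user-defined helper
kd_config = {"teacher_model": teacher, "criterion": mtd.LogitsDistillationLoss(),
    "loss_balancer": mtd.StaticLossBalancer(kd_loss_weight=0.5)}
student = mtd.convert(pruned, mode=[("kd_loss", kd_config)])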
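For readers unfamiliar with logit distillation, the sketch below shows the commonly used formulation: a KL divergence between temperature-softened teacher and student distributions, blended with the ordinary cross-entropy loss by a weight `alpha`. It is a generic pure-Python illustration; the function names, the temperature, and the exact blending are assumptions and may differ from what Model Optimizer computes internally.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Blend KL(teacher || student) on softened logits with cross-entropy on the label."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    ce = -math.log(softmax(student_logits)[label])
    # The temperature**2 factor keeps the KD gradient scale comparable across temperatures.
    return alpha * (temperature ** 2) * kl + (1 - alpha) * ce


# Hypothetical logits for a 3-class example.
matched = kd_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0], label=0, alpha=1.0)
mismatched = kd_loss([0.0, 1.0, 2.0], [2.0, 1.0, 0.0], label=0, alpha=1.0)
```

With `alpha=1.0` the loss depends only on the teacher, so it vanishes when the student's logits match the teacher's. With `alpha=0.5`, as in the configuration above, the teacher signal and the ground-truth labels are weighted equally.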