
baidu/ernie-4.5-vl-28b-a3b
API Overview
ERNIE-4.5-VL-28B-A3B is an open-source vision-language model from Baidu with 28 billion parameters. It is positioned as a high-performance multimodal understanding engine built on a 3D attention mechanism, and it is reported to significantly outperform similarly sized models on tasks such as visual question answering and cross-modal retrieval.
- Outperforms competitors on multiple benchmarks: On VQA-v2, GLUE, and COCO image captioning, its reported accuracy and scores exceed those of Qwen3-235B, indicating strong multimodal understanding.
- Faster inference: With the A3B architecture, inference is reported to be 47% faster than with traditional Transformer models, reducing compute cost while maintaining performance.
- Optimized memory usage: An FP16+FP32 mixed-precision training strategy reduces memory usage by 40%, and the model can be compressed with PaddleSlim quantization to fit within 12 GB.
- Compatibility with domestic hardware: Supports Chinese-made accelerators such as Ascend and Cambricon, and ships with a complete AI development suite (PaddlePaddle + PaddleNLP + PaddleCV).
- Abundant academic resources: The open-source release is accompanied by more than 10 million cleaned multimodal training samples, making it the first publicly available Chinese multimodal pre-trained model at the ten-billion-parameter scale.
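To put the memory figures above in context, the following is a minimal back-of-the-envelope sketch of weight storage for a 28-billion-parameter model at several numeric precisions. It counts weights only and ignores activations, KV cache, and runtime overhead. The helper names (`weight_gb`, `bits_per_param_for_budget`) are hypothetical and are not part of PaddleSlim or any other toolkit.

```python
# Rough weight-storage arithmetic for a 28B-parameter model.
# Assumption: every parameter is stored at the same precision and
# 1 GB is taken as 1e9 bytes. Weights only, no activations or KV cache.

PARAMS = 28e9  # total parameter count stated in the overview

# Bytes per parameter for common storage formats.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_gb(params, bytes_per_param):
    """Weight storage in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def bits_per_param_for_budget(params, budget_gb):
    """Average bits per parameter that fit in a given memory budget."""
    return budget_gb * 1e9 * 8 / params


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_gb(PARAMS, nbytes):.0f} GB")

print(f"Bits per parameter for a 12 GB budget: "
      f"{bits_per_param_for_budget(PARAMS, 12):.2f}")
```

Under these assumptions, FP16 weights alone occupy about 56 GB and INT4 weights about 14 GB, so a 12 GB footprint corresponds to roughly 3.4 bits per parameter on average. The page does not say how the 12 GB figure is reached, so this is only a sanity check on the order of magnitude.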
───────────────────────────────────────────────────────────────────
Core Capabilities
👁️ 3D Attention-based Visual Fusion: Uses the A3B architecture to learn dynamic weights across the spatial, channel, and temporal dimensions, deeply fusing cross-modal features and improving inference efficiency on vision-language tasks by 23%.
⚡ Mixed Precision and Dynamic Sparsity: Combines FP16+FP32 mixed precision with dynamic sparsity training, so that despite the 28B parameter count the actual compute is equivalent to that of a 21B model, significantly lowering hardware requirements.
🏆 Top-tier Multimodal Benchmark Performance: Leads on key metrics such as VQA-v2 accuracy (82.3%) and COCO caption generation (BLEU-4 of 42.1), reportedly surpassing even the much larger Qwen3-235B.
🚀 Enterprise-level Efficient Inference Deployment: Supports TensorRT acceleration through Paddle Inference as well as Ring-AllReduce distributed inference. In single-machine multi-GPU environments, throughput can increase 1.8-fold, which suits high-concurrency production workloads.
🌐 Strong Support for Domestic Ecosystem: Natively compatible with Chinese-made AI chips (such as Ascend and Cambricon), providing a full-stack, domestically developed solution from training to inference that keeps enterprise applications autonomous, controllable, and secure.
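For readers unfamiliar with the Ring-AllReduce collective mentioned above, the following is a minimal single-process simulation of the generic algorithm (a reduce-scatter phase followed by an all-gather phase). It is an illustrative sketch only: it does not use Paddle Inference or any real communication backend, and the function name `ring_all_reduce` is hypothetical.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over a ring of len(buffers) workers.

    buffers: one equal-length list of numbers per worker; the length must
    be divisible by the number of workers. Returns the per-worker buffers
    after the collective; every worker ends with the elementwise sum.
    """
    n = len(buffers)
    size = len(buffers[0])
    if any(len(b) != size for b in buffers) or size % n != 0:
        raise ValueError("buffers must share a length divisible by the worker count")
    chunk = size // n
    data = [list(b) for b in buffers]  # copy so the inputs stay untouched

    def exchange(chunk_to_send, combine):
        # Snapshot all outgoing chunks first so every send in a step
        # behaves as if it happened simultaneously.
        outgoing = []
        for rank in range(n):
            c = chunk_to_send(rank)
            outgoing.append((c, data[rank][c * chunk:(c + 1) * chunk]))
        # Each worker passes its chunk to the next worker in the ring.
        for rank, (c, payload) in enumerate(outgoing):
            dst = (rank + 1) % n
            for offset, value in enumerate(payload):
                i = c * chunk + offset
                data[dst][i] = combine(data[dst][i], value)

    # Phase 1 (reduce-scatter): after n - 1 steps, worker r holds the
    # fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(lambda rank, s=step: (rank - s) % n,
                 lambda old, new: old + new)

    # Phase 2 (all-gather): the completed chunks circulate around the
    # ring and overwrite the partial values on every worker.
    for step in range(n - 1):
        exchange(lambda rank, s=step: (rank + 1 - s) % n,
                 lambda old, new: new)

    return data


workers = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
for rank, buf in enumerate(ring_all_reduce(workers)):
    print(rank, buf)  # each worker prints [111, 222, 333, 444, 555, 666]
```

In this scheme each worker sends 2(n - 1) chunks of size 1/n of the buffer, so the per-worker traffic stays close to twice the buffer size however many workers join the ring. That property is why the pattern is common in multi-GPU setups.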