baidu/ernie-4.5-vl-28b-a3b

baidu/ernie-4.5-vl-28b-a3b

Baidu’s flagship multimodal large model—a high-performance visual-language understanding engine that supports both thinking and non-thinking modes.
2025-12-10
LLM
Model capability: imageModel capability: function_call
Input:
$0.143/1M tokens
Output:
$0.572/1M tokens
Bulk order? Contact your manager for exclusive deals

API Overview

ERNIE-4.5-VL-28B-A3B is a 28-billion-parameter open-source vision-language model released by Baidu, with a core focus on being a high-performance multimodal understanding engine based on a 3D attention mechanism. It significantly outperforms competitors of similar scale in tasks such as visual question answering and cross-modal retrieval.

  • Outperforms competitors in multiple benchmarks: In benchmark tests including VQA-v2, GLUE, and COCO image captioning, its accuracy and scores both surpass those of Qwen3-235B, demonstrating exceptional multimodal understanding capabilities.
  • Significantly improved inference speed: Thanks to the A3B architecture, inference speed is 47% faster compared to traditional Transformer models, dramatically reducing computational costs while maintaining high performance.
  • Optimized memory usage: By adopting a mixed-precision training strategy of FP16+FP32, memory usage is reduced by 40%, and the model supports PaddleSlim quantization compression down to within 12 GB.
  • Compatibility with domestic hardware: It supports domestically produced hardware such as Ascend and Cambrian, and provides a complete AI development suite (PaddlePaddle + PaddleNLP + PaddleCV).
  • Abundant academic resources: Along with the model’s open-source release comes over 10 million cleaned multimodal training data points, making it the first publicly available Chinese multimodal pre-trained model at the hundred-billion parameter level.

───────────────────────────────────────────────────────────────────

Core Capabilities

👁️ 3D Attention-based Visual Fusion: Innovatively adopts the A3B architecture, establishing dynamic weights across spatial, channel, and temporal dimensions to achieve deep fusion of cross-modal features, improving visual-language task inference efficiency by 23%.

⚡ Mixed Precision and Dynamic Sparsity: Combining FP16+FP32 mixed-precision strategies with dynamic sparsity training techniques, under a large parameter count of 28B, the actual computational workload is equivalent to that of a 21B model, significantly lowering hardware requirements.

🏆 Top-tier Multimodal Benchmark Performance: It comprehensively leads in key metrics such as VQA-v2 accuracy (82.3%) and COCO caption generation (BLEU-4 42.1), even surpassing the larger-parameter Qwen3-235B model.

🚀 Enterprise-level Efficient Inference Deployment: Supports TensorRT acceleration via Paddle Inference and Ring-AllReduce distributed inference. Under single-machine multi-GPU environments, throughput can be increased by 1.8 times, making it well-suited for high-concurrency production environments.

🌐 Strong Support for Domestic Ecosystem: Natively compatible with domestic AI chips (such as Ascend and Cambrian), providing a full-stack, domestically developed solution from training to inference, ensuring enterprise-level applications’ autonomy, controllability, and security.

Playground

Log in to explore more features! Click to Log In

API Analytics

API Reference (1)

API DescriptionAPI EndpointRequest MethodStabilityParameter Description
Chat(PPIO)
POST
Stable
View Details

API Pricing

$
ModelDescriptionContextOfficial Price302.AI Price

baidu/ernie-4.5-vl-28b-a3b

-
30000

Input$0.143 / 1M tokens
Output$0.572 / 1M tokens

Input$0.143/ 1M tokens
Output$0.572/ 1M tokens
Original Price