SenseNova-V6-Pro

SenseNova-V6-Pro

SenseTime’s flagship version of the native multimodal general-purpose large model
2025-04-09
LLM
Model capability: image
Input:
$0.55/1M tokens
Output:
$1.43/1M tokens
Bulk order? Contact your manager for exclusive deals

API Overview

SenseNova-V6-Pro is the flagship version of SenseTime’s natively multimodal general-purpose large model. As the core component of the “Daily New SenseNova V6” large model ecosystem, it is positioned as a next-generation multimodal intelligence foundation featuring “native fusion of text, images, and video + long-chain reasoning driven by global memory + full-frame video understanding at the 10-minute level.”

  • Ultra-large-scale hybrid expert architecture: Adopting an MoE (Mixture of Experts) design with 620 billion parameters, this architecture significantly enhances modeling capabilities for complex multimodal tasks while maintaining efficient inference.
  • Native multimodal fusion: Without relying on intermediate alignment modules, it directly processes text, static images, and long-duration video inputs within a unified architecture, breaking through the traditional limitation that models can only handle short clips (<30 seconds). It supports full-frame-rate understanding and question-answering for videos lasting up to 10 minutes.
  • Multimodal Long Chain of Thought (Long CoT): By introducing a global memory mechanism and reinforcement learning optimization, it enables deep cross-temporal reasoning on event evolution, causal logic, and interpersonal relationships in videos—for example, “analyzing the causes and consequences of abnormal behavior in surveillance videos”.
  • World-leading evaluation performance: In Compass’s May 2025 multimodal large model ranking, it scored 80.4 total points, surpassing international mainstream models such as Gemini 2.5 Pro and GPT-4.5, and ranked first globally in multimodal capabilities. It also tied for first place domestically in Chinese general language proficiency.
  • Targeted at real-world interaction scenarios: Focusing on “everyday use cases” such as educational video analysis, medical image report generation, industrial quality inspection video diagnostics, and short-video content understanding, it promotes the practical application of multimodal AI across thousands of industries.

───────────────────────────────────────────────────────────────────

Core Capabilities

👁️‍🗨️ Video-level semantic understanding: Accurately captures dynamic information in videos, including action sequences, object interactions, and scene transitions, supporting tasks such as “summarizing key knowledge points from instructional videos” and “identifying critical operational steps from surgical recordings”.

🧠 Cross-modal causal reasoning: Combining image/video content with text instructions to perform inferences requiring integration of common sense and logic—for example, “Why did this chemical experiment produce precipitation?” or “Determining the party responsible for a traffic accident based on surveillance footage”.

📊 Professional-domain multimodal generation: Can generate structured reports from medical images or automatically create construction verification checklists based on engineering drawings and video instructions.

🎥 Long-video summarization and question-answering: Compresses content, extracts key events, and enables free-form question-answering for teaching, meeting, and surveillance videos lasting 5–10 minutes—without the need for pre-segmentation.

Playground

Log in to explore more features! Click to Log In

API Analytics

API Reference (1)

API DescriptionAPI EndpointRequest MethodStabilityParameter Description
Chat (SenseTime)
POST
Stable
View Details

API Pricing

$
ModelDescriptionContextOfficial Price302.AI Price

SenseNova-V6-Pro

-
32000

Input$0.5 / 1M tokens
Output$1.3 / 1M tokens

Input$0.55/ 1M tokens
Output$1.43/ 1M tokens
10%