
gemini-3-flash-preview
API Overview
Gemini 3 Flash is a lightweight, flagship AI model launched by Google, primarily positioned as a “high-speed response engine for complex tasks”. It’s designed specifically for high-throughput, low-latency scenarios, and while maintaining powerful multimodal understanding capabilities, it significantly optimizes inference costs and speeds, delivering Pro-level inference performance at one-quarter the cost, thus breaking the impossible triangle of speed, intelligence, and cost. It covers all use cases—from developers and enterprises to ordinary users—enabling ultra-fast responses from high-frequency development to everyday interactions, making it ideal for scenarios such as financial risk control, legal document analysis, and intelligent customer service agents.
- Core breakthrough: No longer needing to compromise among “speed, intelligence, and cost”—it’s 3 times faster than the previous-generation Gemini 2.5 Pro and outperforms it in terms of overall performance; its price is only 1/4 of Gemini 3 Pro, while also supporting cost-optimized solutions like context caching and batch processing.
- Superior coding capabilities: Achieving a 78% score on the SWE-bench Verified coding benchmark, surpassing both the 2.5 series and Gemini 3 Pro, making it the top choice for cost-effective agent coding.
- Enhanced multimodal capabilities: New advanced visual and spatial reasoning abilities have been added, along with code execution functionality (scalable, counting, editing visual inputs). Video analysis speed is 4 times faster than Gemini 2.5 Pro, and OCR accuracy has been significantly improved.
- Cost optimization solutions: Enabling context caching can save up to 90% of repeated token costs; using the Batch API for asynchronous processing further reduces costs by another 50%, while also increasing call rates.
- Large-scale deployment-friendly: Supports an input context of up to 1 million tokens, allowing it to handle large volumes of data such as long documents and video streams while maintaining low latency (in synchronous scenarios, it supports production-level rates).
- Creative and efficiency tools: Supports voice-generated app prototypes (a functional framework can be created in just minutes even with zero coding experience), video content summarization, multilingual Q&A, and generates visual outputs for various needs such as weather checks and travel planning (e.g., schedule tables with calendars).
───────────────────────────────────────────────────────────────────
Core Capabilities
⚡ Ultra-fast response: Delivers high-speed output at 218 tokens per second, ideal for real-time conversations, streaming processing, and high-concurrency API calls, providing “instant results” for tasks such as processing 512-pixel images and generating simple code.
🧠 Flexible inference control: The ‘thinking_level’ parameter supports both “low” (fast thinking) and “high” (deep thinking) modes, precisely balancing speed, cost, and inference depth.
👁️ Comprehensive multimodal processing: Natively supports parsing images, audio, and videos, enabling cross-modal tasks without requiring additional models. Easily handles advanced visual reasoning, video knowledge extraction, high-precision PDF parsing, and complex chart synthesis. 📊 Handling ultra-long contexts: With a 1M-token input window and 64K-token output capability, it can analyze entire financial reports, technical documents, or lengthy novels in one go.
🛠️ Deep integration with toolchains: Built-in Google Search and code execution features automatically verify facts and retrieve the latest data, ensuring that generated content is accurate and reliable.
Playground
Log in to explore more features! Click to Log In