
zai-org/autoglm-phone-9b-multilingual
API Overview
AutoGLM-Phone-9B-Multilingual is an open-source, 9-billion-parameter multimodal model released by Z.ai (zai-org, formerly Zhipu AI). It is a lightweight, on-device vision-language reasoning engine designed for phone agents: it perceives and interprets what is on the screen, then executes actions automatically.
- Built for phone agents: Developed on the AutoGLM framework, the model is optimized for controlling mobile devices. It uses a vision-language model to parse on-screen interface elements in real time, which supports intent understanding and task execution.
- Multimodal interaction: Accepts both text and image inputs, comprehends complex mobile screen content, and automatically generates operation steps (such as taps and swipes) to complete tasks end to end.
- Cost-effective inference: Input costs 0.25 yuan per million tokens and output costs 1 yuan per million tokens, which lowers inference costs for enterprise applications relative to comparable large models.
- Security and control: The model prompts for confirmation before sensitive operations and hands control to a human when it encounters logins or verification codes. It also supports remote ADB debugging over WiFi or another network connection for remote device control.
- Multilingual support: As the multilingual variant, it understands instructions and interacts in multiple languages, making it suitable for global deployments.
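The per-token prices listed above translate into a per-request cost with simple arithmetic. The following is a minimal sketch, assuming "Mt" means one million tokens; the function name and the token counts in the example call are illustrative and do not come from the provider.

```python
from decimal import Decimal

# Prices quoted above, in yuan per million tokens
# (assumption: "Mt" stands for one million tokens).
INPUT_PRICE_PER_MTOK = Decimal("0.25")
OUTPUT_PRICE_PER_MTOK = Decimal("1")
TOKENS_PER_MTOK = Decimal(1_000_000)


def estimate_cost_yuan(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated cost in yuan for a single request."""
    input_cost = Decimal(input_tokens) / TOKENS_PER_MTOK * INPUT_PRICE_PER_MTOK
    output_cost = Decimal(output_tokens) / TOKENS_PER_MTOK * OUTPUT_PRICE_PER_MTOK
    return input_cost + output_cost


# Hypothetical agent step: a screenshot-heavy prompt and a short action reply.
print(estimate_cost_yuan(4_000, 200))
```

Decimal is used instead of float so that small per-request amounts add up without rounding drift when summed over many agent steps.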
───────────────────────────────────────────────────────────────────
Core Capabilities
📱 Deep screen perception and understanding
The model uses a vision-language model to parse mobile screen UI elements in real time. It identifies icons, buttons, and text, and converts pixel information into actionable semantic instructions.
🤖 End-to-end task automation
Based on the AutoGLM framework, it can automatically plan operation paths according to natural language instructions (e.g., “Open Xiaohongshu and search for food”), and use ADB (Android Debug Bridge) to perform screen operations such as taps and swipes.
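The taps and swipes mentioned above map onto the standard `adb shell input` commands. The sketch below only builds the command lines; the helper names, coordinates, and device serial are hypothetical, and how the AutoGLM framework itself issues ADB calls is not described in this overview.

```python
import subprocess
from typing import List, Optional


def adb_prefix(serial: Optional[str] = None) -> List[str]:
    """Base adb invocation, optionally targeting one device with -s."""
    return ["adb", "-s", serial] if serial else ["adb"]


def tap_command(x: int, y: int, serial: Optional[str] = None) -> List[str]:
    """Build `adb shell input tap X Y` for a tap at pixel (x, y)."""
    return adb_prefix(serial) + ["shell", "input", "tap", str(x), str(y)]


def swipe_command(x1: int, y1: int, x2: int, y2: int,
                  duration_ms: int = 300,
                  serial: Optional[str] = None) -> List[str]:
    """Build `adb shell input swipe X1 Y1 X2 Y2 DURATION_MS`."""
    coords = [str(v) for v in (x1, y1, x2, y2, duration_ms)]
    return adb_prefix(serial) + ["shell", "input", "swipe"] + coords


def run(command: List[str]) -> None:
    """Execute a built command; requires adb on PATH and a connected device."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so no device is needed.
    print(tap_command(540, 1200))
    print(swipe_command(540, 1600, 540, 400, duration_ms=500))
```

Keeping command construction separate from execution makes each planned action easy to log or review before `run` sends it to a device.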
🌐 Multimodal input and remote control
Supports mixed text-and-image inputs and integrates WiFi/network-based remote ADB debugging capabilities, making it easy to achieve remote device control and management across networks.
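Remote ADB debugging over a network generally relies on the standard `adb tcpip` and `adb connect` commands. The sketch below builds those commands plus a screenshot capture for a network-attached device; the host address is a placeholder, port 5555 is the conventional ADB TCP port rather than a value stated in this overview, and the helper names are hypothetical.

```python
from typing import List

DEFAULT_ADB_PORT = 5555  # conventional ADB-over-TCP port (assumption)


def _check_port(port: int) -> None:
    """Reject port numbers outside the valid TCP range."""
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid TCP port: {port}")


def enable_tcpip_command(port: int = DEFAULT_ADB_PORT) -> List[str]:
    """Build `adb tcpip PORT`, run once while the device is attached over USB."""
    _check_port(port)
    return ["adb", "tcpip", str(port)]


def connect_command(host: str, port: int = DEFAULT_ADB_PORT) -> List[str]:
    """Build `adb connect HOST:PORT` to attach to the device over the network."""
    _check_port(port)
    return ["adb", "connect", f"{host}:{port}"]


def screenshot_command(host: str, port: int = DEFAULT_ADB_PORT) -> List[str]:
    """Build a command that writes a PNG screenshot of the device to stdout."""
    _check_port(port)
    return ["adb", "-s", f"{host}:{port}", "exec-out", "screencap", "-p"]


if __name__ == "__main__":
    # 192.168.1.50 is a placeholder address for illustration only.
    print(enable_tcpip_command())
    print(connect_command("192.168.1.50"))
    print(screenshot_command("192.168.1.50"))
```

The PNG bytes produced by the screenshot command are the kind of image input that can be paired with a text instruction for the model.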
🛡️ Intelligent security and human handoff
Equipped with built-in security mechanisms, it automatically triggers confirmation prompts for operations involving privacy or critical decisions (such as logins and payments), and seamlessly switches to human intervention when encountering verification codes.
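The confirmation and handoff behavior described above can be pictured as a gate that every planned action passes through before execution. This is only an illustrative sketch: the keyword lists, the `PlannedAction` type, and the `gate` function are hypothetical, and the model's actual detection logic is not documented here.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"  # safe to run automatically
    CONFIRM = "confirm"  # ask the user before running
    HANDOFF = "handoff"  # stop and let a human take over


@dataclass(frozen=True)
class PlannedAction:
    kind: str         # e.g. "tap", "swipe", "type"
    description: str  # natural-language summary of the step


# Hypothetical keyword lists, chosen to mirror the examples in the text.
CONFIRM_KEYWORDS = ("login", "log in", "payment")
HANDOFF_KEYWORDS = ("verification code", "captcha")


def gate(action: PlannedAction) -> Decision:
    """Classify an action; handoff takes priority over confirmation."""
    text = action.description.lower()
    if any(keyword in text for keyword in HANDOFF_KEYWORDS):
        return Decision.HANDOFF
    if any(keyword in text for keyword in CONFIRM_KEYWORDS):
        return Decision.CONFIRM
    return Decision.EXECUTE


if __name__ == "__main__":
    for step in (
        PlannedAction("tap", "Tap the search bar"),
        PlannedAction("tap", "Tap the Login button"),
        PlannedAction("type", "Enter the SMS verification code"),
    ):
        print(step.description, "->", gate(step).value)
```

Checking for handoff before confirmation means a step that mentions both a login and a verification code is routed to a human instead of being confirmed and executed.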