Qwen3.5-Omni-Flash
Qwen3.5-Omni-Flash is a multimodal large language model developed by Alibaba. It supports a 256K-token context window (262,144 tokens), large enough to process long inputs in a single request.
Parameters
Undisclosed
Context
256K
License
https://huggingface.co/Qwen/Qwen2.5-72B/blob/main/LICENSE
Release date
2026-03-30
API pricing
API pricing information for this model has not been published yet
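Strengths
- Advanced multimodal processing
- 256K long context
- Fast response times
Weaknesses
- Closed license system
- No detailed evaluation metrics
- Limited track record as a new model
Use cases
- Large-scale document analysis
- Multimodal information processing
- Real-time applications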
In-depth analysis
Release Date
March 30, 2026
Architecture
Thinker-Talker, Hybrid-Attention MoE
Context Window
262,144 tokens
Max Audio Input
10+ hours continuous
Max Video Input
400+ seconds at 720p/1FPS
Speech Recognition
113 languages
Speech Generation
36 languages
Input Modalities
Text, Image, Audio, Video
Output Modalities
Text, Streaming Speech
API Price
~$0.065 per 1M text input tokens, $0.260 per 1M output tokens
Budget tier of Omni family
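As a rough illustration of what these rates mean per request, the sketch below multiplies token counts by the listed prices. It assumes the figures are USD per one million text tokens and that they remain approximate (the source marks the input price with "~"); `estimate_cost_usd` is a hypothetical helper, not part of any Qwen or Alibaba SDK, and audio, image, and video inputs may be billed differently.

```python
from decimal import Decimal

# Approximate rates taken from the API Price entry above, in USD per 1M text tokens.
# Treat both as estimates; the source marks the input price with "~".
INPUT_USD_PER_MILLION = Decimal("0.065")
OUTPUT_USD_PER_MILLION = Decimal("0.260")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the text-token cost of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = Decimal(input_tokens) * INPUT_USD_PER_MILLION / TOKENS_PER_MILLION
    output_cost = Decimal(output_tokens) * OUTPUT_USD_PER_MILLION / TOKENS_PER_MILLION
    return input_cost + output_cost


# Example: a prompt that fills the whole 262,144-token window, plus a 2,000-token reply.
print(estimate_cost_usd(262_144, 2_000))
```

Using `Decimal` rather than `float` keeps the arithmetic exact at these very small per-token amounts, which matters once the estimate is multiplied across a high request volume.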
Strengths
- Budget-friendly omnimodal model: text, image, audio, and video input with speech output
- Natively end-to-end multimodal: no adapter or separate TTS pipeline needed
- 113 languages for speech recognition, 36 for speech generation
- Low latency for real-time voice chat applications
- Apache 2.0 licensed, available for self-hosting via HuggingFace
Weaknesses
- Lower quality than the Plus variant on audio and vision benchmarks
- Benchmark scores trail Gemini 3.1 Pro on several audio understanding tasks
- Limited documentation on specific parameter count and architecture details
- Voice cloning quality may not match dedicated TTS solutions
- Real-world performance in noisy environments not extensively tested
Competitor comparison
| Model | Arena | SWE | GPQA (MMLU where noted) | Price (USD per 1M input/output tokens) |
|---|---|---|---|---|
| Qwen3.5-Omni-Plus | N/A | N/A | ~94.2 (MMLU) | TBD |
| Gemini 3.1 Pro | ~1480 | N/A | ~91 | Proprietary |
| GPT-Audio | ~1460 | N/A | ~89 | Proprietary |
| Qwen3.5-Omni-Flash | N/A | N/A | ~92 (MMLU) | $0.065/$0.260 |
| ElevenLabs | N/A | N/A | N/A | Proprietary TTS |
Qwen3.5-Omni-Flash is the budget tier of the Qwen3.5-Omni family, released March 30, 2026. It is a natively omnimodal model that accepts text, images, audio, and video as input and produces both text and streaming speech output in a single forward pass. The Flash variant trades some benchmark quality for lower latency and cost, making it suitable for real-time voice chat and high-volume applications.
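For high-volume use, it can help to reject oversized inputs before sending them. The sketch below checks a request against the figures in the specification list above. The `Request` class and `check_limits` function are hypothetical names introduced for illustration, not an official schema. The sketch also assumes each limit applies independently, whereas in practice audio and video probably consume context tokens as well. The audio and video figures are stated in the source as "10+ hours" and "400+ seconds", so they are documented minimums rather than hard caps.

```python
from dataclasses import dataclass

# Figures copied from the specification list above.
CONTEXT_WINDOW_TOKENS = 262_144            # 256 * 1024 tokens
DOCUMENTED_AUDIO_SECONDS = 10 * 60 * 60    # "10+ hours continuous"
DOCUMENTED_VIDEO_SECONDS = 400             # "400+ seconds at 720p/1FPS"


@dataclass
class Request:
    """Hypothetical description of one request; not an official schema."""
    text_tokens: int = 0
    audio_seconds: float = 0.0
    video_seconds: float = 0.0


def check_limits(req: Request) -> list[str]:
    """Return human-readable problems; an empty list means the request fits."""
    problems = []
    if req.text_tokens > CONTEXT_WINDOW_TOKENS:
        problems.append(
            f"text: {req.text_tokens} tokens exceeds {CONTEXT_WINDOW_TOKENS}"
        )
    if req.audio_seconds > DOCUMENTED_AUDIO_SECONDS:
        problems.append(
            f"audio: {req.audio_seconds}s exceeds documented {DOCUMENTED_AUDIO_SECONDS}s"
        )
    if req.video_seconds > DOCUMENTED_VIDEO_SECONDS:
        problems.append(
            f"video: {req.video_seconds}s exceeds documented {DOCUMENTED_VIDEO_SECONDS}s"
        )
    return problems


# Example: too much text and a video longer than the documented figure.
print(check_limits(Request(text_tokens=300_000, video_seconds=500)))
```

Because the audio and video numbers are lower bounds, a flagged request is not necessarily unsupported; the check only marks inputs that go beyond what the listed specifications document.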
Sources
Analysis generated: 2026-05-24