Alibaba · Proprietary

Qwen3.5-Omni-Light

Name: Qwen3.5-Omni-Light
Author: Alibaba

Qwen3.5-Omni-Light is a multimodal foundation model developed by Alibaba. It supports a long 256K context window, which enables advanced multimodal processing.

Parameters

Undisclosed

Context

256K

License

https://huggingface.co/Qwen/Qwen2.5-72B/blob/main/LICENSE

Release date

2026-03-30

API pricing

API pricing for this model has not been published.

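Strengths

・Broad multimodal support
・Long-context processing with a 256K window
・Alibaba's latest design

Weaknesses

・Closed-source license format
・No detailed performance metrics
・Potential restrictions on commercial use

Use cases

・Long-document analysis
・Multimodal data processing
・Advanced context understanding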

In-depth analysis

Arena Elo

N/A

Not officially benchmarked on LMArena for this specific variant

MMAU (Audio Understanding)

82.2

SOTA; outperforms Gemini 3.1 Pro (81.1)

LibriSpeech Clean WER

1.11%

SOTA; ~3x lower error than Gemini 3.1 Pro (3.36%)

Input Price (Flash)

$0.10/1M

Budget tier; ~20x cheaper than GPT-5.2

Context Window

256K tokens

Supports 10+ hours of audio or 400s of 720p video

Speech Recognition Languages

113

Massive jump from 19 in previous generation

Strengths

・State-of-the-art audio understanding and generation, beating Gemini 3.1 Pro on key benchmarks.
・Exceptional multilingual support with 113 languages for speech recognition and 36 for synthesis.
・Cost-effective API pricing, significantly undercutting major Western competitors.
・Unique 'Audio-Visual Vibe Coding' enables code generation from spoken instructions and visual context.
・Advanced real-time interaction with semantic interruption and voice cloning capabilities.

Weaknesses

・Only the 'Light' variant has open weights; 'Plus' and 'Flash' are proprietary API-only models.
・The '215 SOTA results' claim includes many niche, per-language subtasks; broad independent verification is pending.
・High computational cost for processing long video/audio can lead to unpredictable API expenses.
・Audio generation quality is optimized for English and Mandarin; other languages can be less natural.
・Data processing occurs in Chinese data centers, raising potential latency and data sovereignty concerns for some users.

Competitor comparison

| Model | Arena Elo | SWE-bench | GPQA | Price (input / output per 1M tokens) |
| :--- | :--- | :--- | :--- | :--- |
| Gemini 3.1 Pro | N/A | N/A | N/A | $2.00-$4.00 / $12.00-$18.00 (estimated) |
| GPT-4o / GPT-Audio | N/A | N/A | N/A | $2.50 / $10.00 (GPT-4o text-only) |
| ElevenLabs (Multilingual v2) | N/A | N/A | N/A | Not directly comparable; specialized voice API |

Overview

Qwen3.5-Omni-Light is the lightweight variant within Alibaba's Qwen3.5-Omni family, released on March 30, 2026. It is a native omnimodal model, designed to process text, images, audio, and video in a single model pass and to generate both text and real-time speech. The architecture, based on a Thinker-Talker framework with Hybrid-Attention MoE, is optimized for efficiency, making the 'Light' version suitable for edge and resource-constrained deployments. While the parameter count of the Light variant is undisclosed, it shares the family's core capabilities, including a 256K-token context window and support for 113 languages in speech recognition.

Positioned as the most accessible entry point in the series, Qwen3.5-Omni-Light is available as open weights, allowing local deployment and fine-tuning under a Qwen License (free commercial use). This contrasts with the flagship 'Plus' and balanced 'Flash' variants, which are proprietary and accessed via API.

The model's primary innovation is its ability to handle long-form audio (10+ hours) and video (400+ seconds of 720p) natively, which enables applications such as full-podcast analysis, meeting transcription with visual context, and real-time multilingual voice agents. Benchmark claims from the Plus variant (215 SOTA results) position the family as a leader in audio and audio-visual tasks, though the Light variant's own performance tier is less documented.
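
The long-form limits quoted above imply rough ceilings on how many tokens each second of audio or each video frame can consume. The sketch below works out those ceilings. It assumes that "256K" means 256 × 1024 tokens and that video is sampled at 1 frame per second, and it treats each modality as if it had the whole window to itself; the helper name `max_tokens_per_unit` is illustrative and not part of any Qwen API.

```python
# Upper-bound token budgets implied by the stated context and media limits.
# Assumption: "256K" means 256 * 1024 tokens; the source does not say which convention applies.
CONTEXT_TOKENS = 256 * 1024


def max_tokens_per_unit(context_tokens: int, units: float) -> float:
    """Return the largest average token cost per unit (second or frame) that still fits."""
    if units <= 0:
        raise ValueError("units must be positive")
    return context_tokens / units


audio_seconds = 10 * 3600   # "10+ hours" of audio
video_frames = 400 * 1      # "400 seconds" of video, assuming 1 frame per second

audio_budget = max_tokens_per_unit(CONTEXT_TOKENS, audio_seconds)
frame_budget = max_tokens_per_unit(CONTEXT_TOKENS, video_frames)

print(f"audio: at most {audio_budget:.2f} tokens per second")
print(f"video: at most {frame_budget:.1f} tokens per frame")
```

These are ceilings rather than measured tokenizer rates: any prompt text or generated output shares the same window, so the real per-second budget is lower.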

Benchmarks and performance

Qwen3.5-Omni-Plus, the flagship variant, establishes new state-of-the-art (SOTA) results across 215 audio, audio-visual, and interaction subtasks. While specific benchmarks for the 'Light' variant are not detailed in the sources, the core architecture's capabilities are reflected in the Plus variant's results:

### Audio & Speech Performance (Plus Variant)

| Benchmark | Qwen3.5-Omni-Plus | Gemini 3.1 Pro | Notes |
| :--- | :--- | :--- | :--- |
| **MMAU (Audio Understanding)** | **82.2** | 81.1 | SOTA |
| **VoiceBench (Dialogue)** | **93.1** | 88.9 | End-to-end voice interaction |
| **LibriSpeech Clean WER** | **1.11%** | 3.36% | ~3x lower word error rate |
| **LibriSpeech Other WER** | **2.23%** | 4.41% | |
| **CV15 (English) WER** | **4.83%** | 8.73% | |
| **Seed-zh Voice Stability (lower is better)** | **1.07** | 2.42 (Gemini 2.5 Pro) | Superior to ElevenLabs (13.08) |

### Text & Vision Performance (Comparable to Qwen3.5-Plus-Instruct)

| Benchmark | Qwen3.5-Omni-Plus | Qwen3.5-Plus-Instruct | Notes |
| :--- | :--- | :--- | :--- |
| **MMLU-Redux (Knowledge)** | 94.2 | **94.3** | On par with text-only counterpart |
| **GPQA (STEM)** | 83.9 | **85.9** | |
| **VideoMME (w/ audio)** | **81.9** | 81.0 | Stronger on dynamic visual perception |
| **MMMU-Pro (Visual Reasoning)** | **73.9** | 73.8 | |

### Multimodal Understanding Performance

| Benchmark | Qwen3.5-Omni-Plus | Gemini 3.1 Pro | Notes |
| :--- | :--- | :--- | :--- |
| **DailyOmni (Audio-Visual QA)** | **84.6** | 82.7 | SOTA |
| **Qualcomm IVD (Audio-Visual Interactive)** | **68.5** | 66.2 | Real-world interactive scenarios |
| **OmniGAIA (Tool Use)** | 57.2 | **68.9** | Agent capability with tools |

The model's long-context capability (256K tokens) is a critical performance feature, enabling it to process over 10 hours of continuous audio or ~400 seconds of 720p video at 1 FPS in a single session.
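
For readers unfamiliar with the WER figures above, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation harness behind the reported numbers (benchmark pipelines usually also normalize casing and punctuation first). The helper `word_error_rate` and the example sentences are hypothetical. The last lines check the ratios between the quoted LibriSpeech figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one dropped word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Ratios between the quoted Gemini 3.1 Pro and Qwen3.5-Omni-Plus error rates.
ratio_clean = 3.36 / 1.11   # LibriSpeech Clean
ratio_other = 4.41 / 2.23   # LibriSpeech Other

print(f"example WER: {example_wer:.3f}")
print(f"clean ratio: {ratio_clean:.2f}x, other ratio: {ratio_other:.2f}x")
```

On the quoted numbers, the "~3x" note holds for LibriSpeech Clean, while the gap on LibriSpeech Other is closer to 2x.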

Detailed comparison

### Head-to-Head with Gemini 3.1 Pro

- **Strengths:** Qwen3.5-Omni (Plus) outperforms on core audio understanding (MMAU), ASR (LibriSpeech), real-time dialogue (VoiceBench), and voice stability/cloning. Its 256K context window is smaller than Gemini's 1M, but it comes at a fraction of the cost.
- **Weaknesses:** Gemini 3.1 Pro leads on the OmniGAIA tool-use benchmark (68.9 vs. 57.2) and likely offers a more polished, integrated ecosystem within Google's suite. Gemini's 1M context is superior for purely text-based, ultra-long document processing.
- **Pricing:** Qwen3.5-Omni-Flash ($0.10/$0.80 per 1M tokens) is drastically cheaper than Gemini 3.1 Pro (~$2-4/$12-18), making it a compelling cost-optimized alternative for multimodal applications.

### Head-to-Head with GPT-4o / GPT-Audio

- **Strengths:** Qwen3.5-Omni provides a unified end-to-end model with native audio-visual output, whereas GPT-4o-based pipelines are often assembled from separate components (e.g., Whisper for transcription). Qwen's voice cloning stability and language coverage (113 vs. ~50 for GPT-4o) are superior.
- **Weaknesses:** GPT-4o and its underlying models may still lead on certain complex reasoning and coding benchmarks (e.g., SWE-bench). The OpenAI ecosystem is more established for developer tools and documentation.
- **Context:** Qwen's 256K context competes well with GPT-4o's 128K, though both trail Gemini's 1M.

### Head-to-Head with Open-Source Alternatives (e.g., Llama 4)

- **Strengths:** No other major open-source model family (like Llama) offers native, integrated audio-visual understanding and generation at this scale. Qwen3.5-Omni-Light is uniquely positioned as an open-weight omnimodal model.
- **Weaknesses:** For purely text-based tasks, specialized text models like Llama 4 or Qwen 3.5 dense models may be more efficient. The full 'Plus' performance is locked behind an API.
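
To make the pricing comparison concrete, the sketch below estimates the cost of a single request at the per-million-token prices quoted above. The workload (200,000 input tokens, 2,000 output tokens) is a hypothetical example, the `Price` class is illustrative rather than any vendor's SDK, and the code assumes every token is billed at the listed rate, which may not hold if audio or video tokens are priced differently. The Gemini figures are the estimated range from this page, not confirmed list prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per one million input and output tokens."""
    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        total = (input_tokens * self.input_per_million
                 + output_tokens * self.output_per_million)
        return total / 1_000_000


QWEN_FLASH = Price(0.10, 0.80)     # quoted Qwen3.5-Omni-Flash price
GEMINI_LOW = Price(2.00, 12.00)    # low end of the estimated Gemini 3.1 Pro range
GEMINI_HIGH = Price(4.00, 18.00)   # high end of the estimated range

# Hypothetical workload: one long multimodal input with a short summary as output.
input_tokens, output_tokens = 200_000, 2_000

flash_cost = QWEN_FLASH.cost(input_tokens, output_tokens)
gemini_low_cost = GEMINI_LOW.cost(input_tokens, output_tokens)
gemini_high_cost = GEMINI_HIGH.cost(input_tokens, output_tokens)

print(f"Qwen3.5-Omni-Flash: ${flash_cost:.4f}")
print(f"Gemini 3.1 Pro:     ${gemini_low_cost:.4f} to ${gemini_high_cost:.4f}")
print(f"Ratio:              {gemini_low_cost / flash_cost:.1f}x to "
      f"{gemini_high_cost / flash_cost:.1f}x")
```

Because input tokens dominate this workload, the ratio tracks the input-price gap (20x to 40x) more closely than the output-price gap (15x to 22.5x); an output-heavy workload would narrow the difference.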

Community reception

Community reaction has been a mix of excitement and cautious analysis:

- **Developer Enthusiasm:** The release is seen as a significant step for open multimodal AI. Developers on platforms like Hugging Face and r/LocalLLaMA are particularly interested in the **Light variant's open weights** for self-hosting and fine-tuning, despite its lower capability tier compared to Plus/Flash.
- **Benchmark Skepticism:** Many researchers and developers are taking the '215 SOTA results' claim with a grain of salt, noting that such aggregated numbers often include many niche benchmarks. There is a call for more independent, third-party evaluations on standardized, challenging tasks.
- **Use Case Exploration:** The 'Audio-Visual Vibe Coding' feature has captured imagination, with developers prototyping tools that generate code from voice and sketch instructions. The long-audio processing capability is also seen as a game-changer for meeting summarization and podcast analysis workflows.
- **Concerns:** A recurring theme is the **data sovereignty issue** tied to Chinese data centers. Enterprise users in regulated industries (healthcare, finance) are hesitant. The variable audio quality across languages is also a noted shortcoming for truly global applications.
- **Adoption Pattern:** Early adoption is highest among AI researchers and developers building prototyping tools for multimodal interaction, where cost and innovation speed are critical. Production enterprise adoption is slower, awaiting more benchmarks and clarity on compliance.

Use cases

1. **Real-Time Multilingual Voice Assistants & Customer Service:**
   * **Example:** A global e-commerce company deploys a customer service agent that can listen in 113 languages, understand the spoken issue, see a product image the user shares, and respond with a natural-sounding voice in 36 languages. It uses semantic interruption to handle user turn-taking naturally.
   * **Why Choose Qwen3.5-Omni-Light:** It provides an integrated, low-latency (Flash) or locally deployable (Light) solution that avoids the cost and latency of chaining separate ASR, LLM, and TTS services. Its voice cloning can maintain brand-consistent vocal identity.
2. **Long-Form Audio-Visual Content Analysis & Summarization:**
   * **Example:** A media company automatically generates chapter markers, detailed summaries, and searchable transcripts for 3-hour podcast episodes or lecture series, keyed to both the spoken content and the on-screen slides/demonstrations.
   * **Why Choose Qwen3.5-Omni-Light:** The 256K context window and native processing of 10+ hours of audio make it possible to analyze entire sessions in a single API call or model pass, preserving context that would be lost in chunked processing. The 'Light' variant is ideal for batch processing on cost-sensitive servers.
3. **Rapid Prototyping & Design Tool from Sketches and Voice:**
   * **Example:** A developer sketches a mobile app UI on paper, takes a photo, and verbally describes the interactive features. The model generates a functional front-end prototype (HTML/CSS/JS) from this combined input.
   * **Why Choose Qwen3.5-Omni-Light:** This is the flagship 'Audio-Visual Vibe Coding' use case. It leverages the model's native ability to ground code generation in both visual and auditory context, something purely text or image-based coding assistants cannot do. The Light or Flash variant is sufficient for iterative, low-latency prototyping.
4. **Accessibility and Content Creation Tools:**
   * **Example:** A video editor uses the model to automatically generate audio descriptions for visually impaired viewers, describing on-screen action in sync with the video timeline. It can also generate subtitles in multiple languages from the audio track.
   * **Why Choose Qwen3.5-Omni-Light:** It understands the temporal alignment between audio and video natively (using TMRoPE), allowing for precisely timed descriptions. The Light variant can run on a local creative professional's workstation, processing video files without sending sensitive content to the cloud.
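
The single-pass versus chunked trade-off mentioned in the long-form analysis use case can be sketched as a simple planning step run before a recording is sent to the model. Everything here is an assumption for illustration: the function `plan_passes` is hypothetical, "256K" is taken to mean 256 × 1024 tokens, the rate of 7 audio tokens per second is a placeholder rather than a documented tokenizer rate, and 8,192 tokens are reserved for the prompt and the generated output.

```python
import math

CONTEXT_TOKENS = 256 * 1024   # assumption: "256K" means 256 * 1024 tokens


def plan_passes(duration_s: float,
                tokens_per_second: float,
                reserved_tokens: int = 8_192,
                context_tokens: int = CONTEXT_TOKENS) -> int:
    """Return how many model passes a recording needs under the assumed token rate."""
    usable = context_tokens - reserved_tokens
    if usable <= 0:
        raise ValueError("reserved_tokens must be smaller than context_tokens")
    needed = math.ceil(duration_s * tokens_per_second)
    # Even an empty recording takes one pass.
    return max(1, math.ceil(needed / usable))


ASSUMED_RATE = 7.0  # placeholder audio tokens per second, not a published figure

podcast_passes = plan_passes(3 * 3600, ASSUMED_RATE)    # 3-hour podcast episode
archive_passes = plan_passes(40 * 3600, ASSUMED_RATE)   # 40-hour lecture archive

print(f"3-hour podcast: {podcast_passes} pass(es)")
print(f"40-hour archive: {archive_passes} pass(es)")
```

Under these assumptions a 3-hour episode fits in one pass, so chapter markers and summaries can draw on the whole session, while a 40-hour archive still has to be split and its per-chunk results merged afterwards.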