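What are this model's strengths?

Large scale (~5 trillion parameters), a vast 2-million-token context window, and a specialization in advanced reasoning.

What are this model's weaknesses?

Closed-source licensing, potential instability as a beta release, and high compute requirements.

What is it best suited for?

Analysis of very long documents, complex logical reasoning tasks, and in-context processing of large datasets.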

xAI · Proprietary

Grok 4.3 Beta (Early Access)

Name: Grok 4.3 Beta (Early Access)
Author: xAI

Grok 4.3 Beta (Early Access) is a reasoning model developed by xAI. It is characterized by a large configuration of roughly 5 trillion parameters and a very long 2-million-token context window.

Parameters

5,000B (about 5 trillion)

Context

2,000K tokens (2M)

License

Proprietary

Release date

2026-05-17

API Pricing

API pricing for this model has not been published yet.

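Strengths

・Large scale (~5 trillion parameters)
・Vast 2-million-token context window
・Specialization in advanced reasoning

Weaknesses

・Closed-source licensing
・Potential instability as a beta release
・High compute requirements

Use Cases

・Analysis of very long documents
・Complex logical reasoning tasks
・In-context processing of large datasets

In-Depth Analysis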

Artificial Analysis Intelligence Index

#10 overall, +4 vs Grok 4.20

Arena Elo (Text Overall)

1451

9,082 votes; Coding: 1493

GDPval-AA (Agentic Tasks)

1500 ELO

+321 over Grok 4.20; trails GPT-5.5 by 276

Input Price

$1.25/1M tokens

37.5% cheaper than Grok 4.20

Context Window

1M tokens

Grok 4.20 retains 2M for max-context workloads

GPQA Diamond

90.1%

#14 on Easy Benchmarks

Benchmark Run Cost (AA Index)

$395

~20% less than Grok 4.20; vs GPT-5.5: ~$3,959

Strengths

・Best cost-per-intelligence ratio in the frontier tier: $1.25/M input places it on the Pareto frontier for intelligence vs. cost, roughly 12× cheaper than Claude Opus 4.7
・Massive agentic task improvement: +321 ELO on GDPval-AA real-world agentic benchmarks, validated by Starlink's 70% autonomous resolution rate in production
・First xAI model with native video input and document generation (PDF, PPTX, XLSX), breaking Gemini's monopoly on production-grade video understanding

Weaknesses

・No persistent memory at any tier, including the $300/month SuperGrok Heavy plan, so any stateful application requires a custom memory layer
・Documented 'narcolepsy' regression: autonomous agent tasks show prolonged inactivity in sustained-action simulations (Andon Labs Vending-Bench 2), a production risk for agentic workflows
・Coding performance lags Claude Opus 4.7 by ~14 points on SWE-bench (~72% vs ~86%), ruling it out as a primary coding model

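Competitor Comparison

| Model | Arena Elo | SWE-bench Verified | GPQA Diamond | Price (input/output, USD per 1M tokens) |
|---|---|---|---|---|
| Claude Opus 4.7 | ~1500 | ~86% | ~92% | ~$15/$75 |
| GPT-5.5 (xhigh) | ~1510 | ~83% | ~93% | $5/$30 |
| Gemini 3.1 Pro Preview | ~1480 | ~76% | ~91% | ~$1.25/$5.00 |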

Overview

Grok 4.3 (launched April 30, 2026) is xAI's most cost-efficient frontier model to date, scoring 53 on the Artificial Analysis Intelligence Index while dramatically undercutting competitors on price. The model represents a deliberate strategic pivot: rather than chasing raw intelligence leadership (GPT-5.5 scores 60, Claude Opus 4.7 scores ~62–67), xAI optimized for the price-performance frontier. Input costs dropped 37.5% and output costs 58.3% versus the predecessor Grok 4.20, while intelligence scores improved.

The headline metric is GDPval-AA, a real-world agentic task benchmark, where Grok 4.3 jumped 321 ELO points to 1500. That puts it ahead of Gemini 3.1 Pro, GPT-5.4 mini, and Kimi K2.5, though it still trails GPT-5.5 (xhigh) by 276 ELO points.

Feature-wise, Grok 4.3 introduces several production-relevant capabilities: native video input (breaking Gemini's monopoly on commercial video understanding APIs), built-in document generation (PDF, PowerPoint, spreadsheets directly from conversation), and always-on chain-of-thought reasoning. The model runs at ~100 tokens/second with a 1M-token context window, a reduction from Grok 4.20's 2M tokens, though the older model remains available for maximum-context workloads. Prompt caching at $0.20/M tokens further reduces costs for RAG and repeated-context applications. xAI also launched Grok Imagine Agent Mode for creative production workflows and tightened integration with Grok Computer, the autonomous desktop agent.

However, Grok 4.3 arrives with notable gaps. Persistent memory remains absent at every tier, including the $300/month SuperGrok Heavy plan. Independent testing by Andon Labs revealed a 'narcolepsy' regression on sustained autonomous tasks: the model sometimes remains idle instead of taking required actions. Coding performance lags Claude Opus 4.7 significantly (~14 points on SWE-bench), and the AA-Omniscience Non-Hallucination Rate dropped 8 points versus Grok 4.20, trading reliability for higher accuracy scores.

The model is best understood not as a general-purpose frontier leader but as a specialist: the most cost-effective option for long-context agentic workflows, customer support automation, and document-heavy analysis pipelines where intelligence-per-dollar matters more than absolute peak capability.
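One way to read the GDPval-AA gaps quoted above is as expected head-to-head win rates. The sketch below is illustrative only: it assumes GDPval-AA ratings follow the conventional logistic Elo scale (400 points per tenfold change in odds), which the source does not confirm, and `expected_win_rate` is a hypothetical helper name.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under the standard logistic Elo model.

    Assumption: the benchmark uses the conventional 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings reconstructed from the gaps quoted in the text.
GROK_4_3 = 1500
GROK_4_20 = GROK_4_3 - 321   # Grok 4.3 gained 321 points over its predecessor
GPT_5_5 = GROK_4_3 + 276     # Grok 4.3 trails GPT-5.5 (xhigh) by 276 points

vs_predecessor = expected_win_rate(GROK_4_3, GROK_4_20)
vs_gpt = expected_win_rate(GROK_4_3, GPT_5_5)

print(f"Grok 4.3 vs Grok 4.20: {vs_predecessor:.0%} expected win rate")
print(f"Grok 4.3 vs GPT-5.5:   {vs_gpt:.0%} expected win rate")
```

Under that assumption, a 321-point lead corresponds to winning roughly six out of seven pairwise comparisons, while a 276-point deficit corresponds to winning roughly one in six.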

Benchmarks and Performance

## Comprehensive Benchmark Performance

Grok 4.3's benchmark profile reveals a model that excels at agentic and instruction-following tasks while lagging frontier leaders on raw intelligence and coding.

| Benchmark | Grok 4.3 | Grok 4.20 | Claude Opus 4.7 | GPT-5.5 (xhigh) |
|---|---|---|---|---|
| AA Intelligence Index | 53 | 49 | ~62–67 | 60 |
| GDPval-AA (Agentic ELO) | 1,500 | 1,179 | Not published | ~1,620 |
| τ²-Bench Telecom | 98% | 93% | ~86% | ~90% |
| IFBench (Instruction Following) | 81% | 81% | ~79% | ~82% |
| GPQA Diamond | 90.1% | ~88% | ~92% | ~93% |
| Humanity's Last Exam | 35.0% | ~30% | ~40% | ~42% |
| SciCode | 47.3% | ~42% | ~55% | ~53% |
| SWE-bench Verified | ~72% | ~70% | ~86% | ~83% |
| AA-Omniscience Accuracy | +8 pts vs 4.20 | Baseline | Not published | Not published |
| AA-Omniscience Non-Hallucination | -8 pts vs 4.20 | 78% (record) | Not published | Not published |

**Arena Elo Breakdown (BenchLM):**

- Text Overall: 1451 (±6.5, 9,082 votes)
- Coding: 1493 (±12.0, 2,471 votes)
- Math: 1434 (±25.8, 501 votes)
- Instruction Following: 1428 (±10.9, 2,958 votes)
- Creative Writing: 1440 (±15.7, 1,460 votes)
- Multi-turn: 1463 (±14.9, 1,618 votes)
- Hard Prompts: 1463 (±8.1, 5,661 votes)
- Hard Prompts (English): 1461 (±10.9, 2,992 votes)
- Longer Query: 1452 (±10.3, 3,434 votes)

**Runtime Metrics:**

- Output Speed: ~94–115 tok/s (varies by provider/load)
- Time to First Token: 6.5–7.1s (API); some reviews report up to 25.5s under load
- Max Output: 1,000,000 tokens
- Verbosity: ~44% more output tokens than Grok 4.20 on the benchmark suite

**Key Takeaway:** The model's standout metrics are GDPval-AA (1500 ELO, +321 over predecessor) and τ²-Bench Telecom (98%), confirming genuine improvement in agentic and structured-task scenarios. The Intelligence Index gap to leaders is 7–14 points, but at $395 to run the full benchmark suite versus $3,959–$4,811 for frontier competitors, the cost-per-intelligence ratio is best-in-class.
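The cost-per-intelligence claim can be made concrete by dividing each model's benchmark run cost by its Intelligence Index score. This is a rough sketch using only the two models for which the text gives both figures. "Dollars per index point" is an illustrative metric chosen here, not one Artificial Analysis is stated to publish, and `BenchmarkRun` is a hypothetical type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRun:
    model: str
    index_score: float    # AA Intelligence Index score
    run_cost_usd: float   # cost to run the full AA benchmark suite

    @property
    def usd_per_index_point(self) -> float:
        """Illustrative metric: suite cost divided by index score."""
        return self.run_cost_usd / self.index_score


# Figures taken from the text above.
runs = [
    BenchmarkRun("Grok 4.3", 53, 395),
    BenchmarkRun("GPT-5.5 (xhigh)", 60, 3959),
]

for run in sorted(runs, key=lambda r: r.usd_per_index_point):
    print(f"{run.model}: ${run.usd_per_index_point:.2f} per index point")

ratio = runs[1].usd_per_index_point / runs[0].usd_per_index_point
print(f"GPT-5.5 costs about {ratio:.1f}x more per index point")
```

By this measure Grok 4.3 comes out at under $8 per index point against roughly $66 for GPT-5.5, which is the shape of the Pareto-frontier argument even though the metric itself is a simplification.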

Detailed Comparison

## Head-to-Head Comparisons

### Grok 4.3 vs Claude Opus 4.7

Claude Opus 4.7 remains the raw intelligence and coding leader. It scores ~62–67 on the AA Intelligence Index versus Grok 4.3's 53, and dominates SWE-bench (~86% vs ~72%). However, Claude costs ~$15/M input tokens, roughly 12× more than Grok 4.3. Context windows are comparable at 1M tokens. Claude offers persistent memory via Projects; Grok has none. For coding agents, long-horizon reasoning, and tasks requiring maximum accuracy, Claude wins. For cost-sensitive agentic pipelines, document analysis, and customer support at scale, Grok 4.3 is the better economic choice. Claude Opus 4.7 generates output more slowly (~50 tok/s vs ~100 tok/s for Grok 4.3), though its time-to-first-token is lower.

### Grok 4.3 vs GPT-5.5 (xhigh)

GPT-5.5 scores 60 on the AA Intelligence Index (vs Grok 4.3's 53) and holds a ~276 ELO advantage on GDPval-AA. It is broadly more capable across reasoning, coding, and knowledge tasks. However, GPT-5.5 costs ~$5/M input and ~$30/M output, which is 4× and 12× more expensive respectively. Grok 4.3's production-validated agentic deployment (Starlink voice agent: 70% autonomous resolution) demonstrates real-world viability that benchmarks alone don't capture. GPT-5.5 runs at ~80 tok/s with ~3s TTFT, making it more suitable for interactive applications. Choose GPT-5.5 when absolute intelligence matters and budget is secondary; choose Grok 4.3 when cost-per-task is the primary constraint.

### Grok 4.3 vs Gemini 3.1 Pro Preview

Gemini 3.1 Pro is the strongest multimodal competitor with native Google Workspace integration, excellent video understanding, and aggressive pricing (~$1.25/M input). On the AA Intelligence Index, the two are close (Grok 4.3: 53, Gemini 3.1 Pro: ~52). Gemini has deeper ecosystem integration (direct access to Sheets, Docs, and Slides). Grok 4.3 wins on real-time X data access and document generation outputs. For Google-stack enterprises, Gemini is the natural fit. For teams needing live social data analysis or xAI ecosystem tools, Grok 4.3 has a structural advantage.
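To see how the per-token price gaps translate into per-task cost, the sketch below prices one hypothetical task across the four models. The per-1M-token rates are the approximate list prices quoted in this document. The task size (200K input tokens, 5K output tokens) and the `task_cost` helper are assumptions made purely for illustration.

```python
# (input, output) price in USD per 1M tokens, as quoted in this document.
PRICES_PER_M = {
    "Grok 4.3": (1.25, 2.50),
    "Gemini 3.1 Pro Preview": (1.25, 5.00),
    "GPT-5.5 (xhigh)": (5.00, 30.00),
    "Claude Opus 4.7": (15.00, 75.00),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single request at list price (no caching, no tool fees)."""
    input_rate, output_rate = PRICES_PER_M[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical long-document task: 200K tokens in, 5K tokens out.
INPUT_TOKENS, OUTPUT_TOKENS = 200_000, 5_000

costs = {m: task_cost(m, INPUT_TOKENS, OUTPUT_TOKENS) for m in PRICES_PER_M}
baseline = costs["Grok 4.3"]
for model, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{model:<24} ${cost:.4f}  ({cost / baseline:.1f}x Grok 4.3)")
```

Because this example is input-heavy, the overall multiples land close to the input-price ratios. A workload that generates long outputs would shift the comparison toward the output-price ratios instead.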

Community Reception

Community reaction to Grok 4.3 is sharply divided along use-case lines. Developers building agentic systems and long-context pipelines have been broadly positive, with VentureBeat noting the model sits comfortably on the Pareto frontier for intelligence versus cost. The Starlink voice agent deployment, which achieved 70% autonomous resolution across 28 tools and a 20% sales conversion rate, has been cited as the strongest production validation of any xAI model to date.

However, the model has drawn criticism on multiple fronts. Andon Labs, an AI retail automation company, described a 'big regression' on Vending-Bench 2, characterizing the model as having 'narcolepsy problems, preferring to sleep for multiple days in a row over taking actions.' This has become a widely discussed failure mode in agentic AI circles. On Reddit, casual users report minimal differences from Grok 4.20: one comment noted 'It's not much different than 4.20. Better document producing and video understanding. Other than that, no difference.'

The lack of persistent memory at the $300/month SuperGrok Heavy tier has been a recurring sore point. NivaaLabs called it 'genuinely hard to defend,' and multiple reviewers flagged it as the most glaring product gap in the frontier model market. The co-founder departures at xAI (all 11 original co-founders have now left) have also raised questions about institutional continuity, though Grok 4.3's improvements suggest the development pipeline remains functional.

Val's AI rankings placed Grok 4.3 first on CaseLaw and CorpFin benchmarks, suggesting strong adoption signals in legal and financial verticals. The model's native video input and document generation features have been praised by enterprise users evaluating multi-tool pipeline consolidation.

The general consensus among developers: a strong 'second model' alongside GPT-5.5 or Claude Opus 4.7, excellent for cost-optimized agentic workloads, but not yet a replacement for frontier leaders on raw capability.

Use Cases

## Recommended Use Cases

### 1. High-Volume Legal & Financial Document Analysis

A legal tech company processing thousands of contracts monthly can leverage Grok 4.3's 1M context window to ingest entire contracts in a single API call. At $1.25/M input tokens (vs Claude Opus 4.7's ~$15/M), input costs drop by roughly 12×. Prompt caching at $0.20/M tokens further reduces costs for repeated system prompts. The model's strong CaseLaw and CorpFin benchmark rankings (Val's AI: #1 on both) and its instruction following (IFBench: 81%) make it well-suited for extracting structured data from unstructured legal documents.

**Choose Grok 4.3 over alternatives when:** volume is high, budget is constrained, and the task is extraction/summarization rather than complex legal reasoning.

### 2. Customer Support Voice Agents at Scale

Grok 4.3's 98% score on τ²-Bench Telecom and production validation via the Starlink voice agent (70% autonomous resolution, 20% sales conversion, 28 tools) make it the strongest cost-optimized option for agentic customer support. The model can handle hardware troubleshooting, service credits, replacement workflows, and escalation without human intervention for the majority of interactions.

**Choose Grok 4.3 over alternatives when:** deploying customer support agents at scale where cost-per-resolution matters and real-time X/social sentiment data can enhance responses.

### 3. Multimodal Research Pipelines with Video Input

Grok 4.3 is one of only two commercial models (alongside Gemini) offering production-grade native video understanding. For education platforms processing lecture recordings, automotive companies running dashcam analysis, or media teams generating summaries from recorded meetings, Grok 4.3 combines video input with document generation (PDF, PPTX, XLSX) in a single API call.

**Choose Grok 4.3 over alternatives when:** video analysis is required alongside structured document output, and Gemini's Google Workspace integration isn't needed.

### 4. Cost-Sensitive RAG Systems with Large Stable Context

For developers building knowledge-base RAG systems with large, reusable system prompts (100K+ tokens), Grok 4.3's prompt caching at $0.20/M tokens is an 84% discount on the $1.25/M base input rate. At 10,000 daily queries against a 100K-token system prompt, the cost difference versus GPT-5.5 or Claude Opus 4.7 is material.

**Choose Grok 4.3 over alternatives when:** the workload is retrieval-heavy, context is largely stable across queries, and maximum absolute intelligence is not required.
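The RAG scenario above (10,000 daily queries against a 100K-token system prompt) works out as follows at the listed rates. This is a simplified sketch: it counts only the shared system prompt, ignores per-query user tokens and output tokens, and treats the cache hit rate as a free parameter, since the source does not describe how cache hits are determined. `daily_prompt_cost` is a hypothetical helper.

```python
def daily_prompt_cost(queries: int, prompt_tokens: int, base_rate: float,
                      cached_rate: float, cache_hit_rate: float = 1.0) -> float:
    """Daily USD cost of the shared system prompt; rates are USD per 1M tokens."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    total_millions = queries * prompt_tokens / 1_000_000
    cached_millions = total_millions * cache_hit_rate
    uncached_millions = total_millions - cached_millions
    return cached_millions * cached_rate + uncached_millions * base_rate


QUERIES_PER_DAY = 10_000
PROMPT_TOKENS = 100_000
GROK_INPUT, GROK_CACHED = 1.25, 0.20   # USD per 1M tokens, from the text

no_cache = daily_prompt_cost(QUERIES_PER_DAY, PROMPT_TOKENS, GROK_INPUT, GROK_CACHED, 0.0)
full_cache = daily_prompt_cost(QUERIES_PER_DAY, PROMPT_TOKENS, GROK_INPUT, GROK_CACHED, 1.0)
discount = 1 - full_cache / no_cache

print(f"Uncached:   ${no_cache:,.0f}/day")
print(f"All cached: ${full_cache:,.0f}/day")
print(f"Discount:   {discount:.0%}")
```

The prompt alone is one billion tokens per day, so the gap between the cached and uncached rates dominates the bill. Real savings sit somewhere between the two figures depending on the achieved hit rate.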

Latest News

## Recent Developments (as of May 2026)

- **April 17, 2026:** Grok 4.3 beta launched with zero announcement, available only to SuperGrok Heavy subscribers ($300/month). There was initial confusion about the parameter count; Elon Musk later clarified that the beta runs a 0.5T-parameter version, with a 1T version still in training.
- **April 30, 2026:** Full API rollout completed. Model ID: `grok-4.3`. Pricing: $1.25/M input, $2.50/M output, $0.20/M cached. Tool invocation fees introduced: $5.00 per 1,000 web/code execution calls, $10.00 per 1,000 file attachment calls. A novel $0.05 fee per safety-filter-blocked request was also introduced, an industry first.
- **May 2, 2026:** Grok Imagine Agent Mode launched in beta, enabling multi-step creative production workflows (one-minute movies, manga sets, product stories) via the Grok web interface.
- **May 6, 2026:** Grok 4.3 published to llm-stats.com and benchmark tracking sites. Artificial Analysis confirmed the model's Pareto frontier positioning at $395 benchmark run cost.
- **Price reductions vs Grok 4.20:** Input tokens down 37.5% ($2.00 → $1.25), output tokens down 58.3% (~$6.00 → $2.50). Cached tokens: $0.20/M.
- **Model deprecations:** Several older models, including grok-4-0709, are scheduled for deprecation on May 15, 2026. Grok 4.20 itself remains available, retaining its 2M-token context window advantage.
- **xAI corporate changes:** SpaceX acquired xAI in February 2026 in an all-stock deal. All 11 original xAI co-founders have departed. xAI now operates Colossus 2 at 1.5 gigawatts of compute and is training seven models including Grok 5 (targeting 6T and 10T parameter variants).
- **Upcoming:** 1T-parameter version of Grok 4.3 expected to complete training within weeks. 'Skills' feature (reusable instructions for task automation) spotted in iOS testing but not yet publicly available. Grok Computer autonomous desktop agent in private beta.
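The April 30 pricing combines token rates with per-call tool fees and a blocked-request fee, so a bill has several independent components. The sketch below adds them up for a hypothetical month of usage. The `Usage` fields and the sample volumes are invented for illustration, and it assumes cached tokens are metered separately from regular input tokens, which the source does not spell out.

```python
from dataclasses import dataclass

# Rates quoted in the April 30, 2026 rollout notes.
INPUT_PER_M = 1.25
OUTPUT_PER_M = 2.50
CACHED_PER_M = 0.20
WEB_OR_CODE_PER_1K_CALLS = 5.00
FILE_ATTACHMENT_PER_1K_CALLS = 10.00
BLOCKED_REQUEST_FEE = 0.05


@dataclass
class Usage:
    input_tokens: int = 0            # assumed to exclude cached tokens
    cached_tokens: int = 0
    output_tokens: int = 0
    web_or_code_calls: int = 0
    file_attachment_calls: int = 0
    blocked_requests: int = 0


def estimate_bill(usage: Usage) -> dict[str, float]:
    """Break a usage record into USD line items plus a total."""
    items = {
        "input": usage.input_tokens / 1_000_000 * INPUT_PER_M,
        "cached": usage.cached_tokens / 1_000_000 * CACHED_PER_M,
        "output": usage.output_tokens / 1_000_000 * OUTPUT_PER_M,
        "web_or_code": usage.web_or_code_calls / 1_000 * WEB_OR_CODE_PER_1K_CALLS,
        "file_attachment": usage.file_attachment_calls / 1_000 * FILE_ATTACHMENT_PER_1K_CALLS,
        "blocked": usage.blocked_requests * BLOCKED_REQUEST_FEE,
    }
    items["total"] = sum(items.values())
    return items


# Hypothetical month: volumes are made up to show the arithmetic.
bill = estimate_bill(Usage(
    input_tokens=50_000_000, cached_tokens=200_000_000, output_tokens=10_000_000,
    web_or_code_calls=20_000, file_attachment_calls=3_000, blocked_requests=40,
))
for name, usd in bill.items():
    print(f"{name:<16} ${usd:,.2f}")
```

In this made-up example the tool fees exceed the token charges, which illustrates why agentic workloads with many tool calls should be priced per call as well as per token.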

Sources

Analysis generated: 2026-05-23