What are this model's strengths?

・Industry-leading reasoning capabilities
・Autonomous task execution via Managed Agents
・Long-context support up to 200K tokens
・Strong emphasis on safety

What are this model's weaknesses?

・High API costs
・Not open-source
・Relatively slow inference speed

What use cases is it best suited for?

・Complex reasoning tasks
・Autonomous agents
・Long-document analysis and summarization
・Advanced programming assistance


Anthropic · Proprietary

Claude Mythos Preview

Name: Claude Mythos Preview
Price: 15 USD
Author: Anthropic

Anthropic's latest reasoning-specialized model. It adopts the Mythos architecture and scores 64.7% on Humanity's Last Exam (HLE, with tools), among other metrics, placing it at the top level on complex reasoning tasks. Its Managed Agents function enables autonomous tool use and multi-step task execution. The design emphasizes both safety and performance.

Parameters: Undisclosed
Context: 200K tokens
License: Proprietary
Release date: 2026-04-08

Japanese language capability: ✅ High-Quality JP
Multilingual model with strong Japanese language processing capabilities.

API pricing

Input price (per 1M tokens): $15
Output price (per 1M tokens): $75
Billing mode: standard
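Per-token pricing translates into a per-request cost with simple arithmetic. The sketch below uses the $15 input and $75 output rates listed above. The names `Pricing` and `estimate_cost_usd` and the token counts are hypothetical and chosen only for illustration; they are not part of any Anthropic SDK.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class Pricing:
    """Per-1M-token rates in USD (hypothetical helper type)."""
    input_usd_per_million: float
    output_usd_per_million: float


def estimate_cost_usd(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost of one request in USD."""
    total = (
        input_tokens * pricing.input_usd_per_million
        + output_tokens * pricing.output_usd_per_million
    )
    return total / TOKENS_PER_MILLION


# Rates taken from the pricing list above.
LISTED = Pricing(input_usd_per_million=15.0, output_usd_per_million=75.0)

# Assumed workload: a 200,000-token prompt with a 4,000-token reply.
cost = estimate_cost_usd(LISTED, input_tokens=200_000, output_tokens=4_000)
print(f"Estimated cost: ${cost:.2f}")  # Estimated cost: $3.30
```

Because the rates are parameters, the same function can be reused with any other price pair quoted on this page.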

Strengths

・Industry-leading reasoning capabilities
・Autonomous task execution via Managed Agents
・Long-context support up to 200K tokens
・Strong emphasis on safety

Weaknesses

・High API costs
・Not open-source
・Relatively slow inference speed

Use cases

・Complex reasoning tasks
・Autonomous agents
・Long-document analysis and summarization
・Advanced programming assistance

In-depth analysis

| Metric | Result | Notes |
|---|---|---|
| SWE-bench Verified | 93.9% | #1 overall; Opus 4.6: 80.8%, GPT-5.5: not reported |
| GPQA Diamond | 94.6% | #1; Opus 4.6: 91.3%, Gemini 3.1 Pro: 94.3% |
| CyberGym | 83.1% | #1; Opus 4.6: 66.6%, GPT-5.5: 81.8% |
| Humanity's Last Exam (w/ tools) | 64.7% | #1; Opus 4.6: 53.1%, GPT-5.4: 52.1% |
| USAMO 2026 | 97.6% | Largest single benchmark jump: +55pp over Opus 4.6 (42.3%) |
| Input/Output Price | $25 / $125 per 1M tokens | 5× Opus 4.6; invitation-only via Project Glasswing |

Strengths

・Highest scores ever recorded on SWE-bench Verified (93.9%), CyberGym (83.1%), and USAMO 2026 (97.6%) across all frontier models
・Autonomous offensive cybersecurity capability unmatched by any public model: it discovered thousands of zero-days, including a 27-year-old OpenBSD bug and a 16-year-old FFmpeg bug
・Generational leap in long-horizon agentic and terminal tasks (Terminal-Bench 2.0: 82.0%, reaching 92.1% with extended timeouts)

Weaknesses

・Not publicly available: restricted to ~52 vetted organizations under Project Glasswing, with no planned general availability
・Extremely expensive at $25/$125 per million tokens, 5× the cost of Opus 4.7, limiting practical adoption even for approved partners
・Offensive cyber capabilities prompted Anthropic to withhold public release, creating a fundamental access barrier that no benchmark score can overcome

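Competitor comparison

| Model | Arena | SWE-bench Verified | GPQA Diamond | Price (input/output per 1M tokens) |
|---|---|---|---|---|
| Claude Opus 4.7 | N/A | 87.6% | 94.2% | $5/$25 |
| GPT-5.5 (OpenAI) | N/A | Not publicly disclosed | | |
| Gemini 3.1 Pro (Google) | N/A | 80.6% | 94.3% | Not publicly disclosed |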

Overview

Claude Mythos Preview, announced April 7, 2026, is Anthropic's most powerful model to date and the first to sit above the Opus tier in Anthropic's hierarchy (Haiku → Sonnet → Opus → Mythos). Internally codenamed 'Capybara,' it represents what Anthropic describes as a 4.3× jump over its previous performance trendline. The model achieves state-of-the-art results across coding, reasoning, cybersecurity, and agentic benchmarks, most notably 93.9% on SWE-bench Verified, 97.6% on USAMO 2026 (a 55-point leap over Opus 4.6), 83.1% on CyberGym, and a saturated 100% on Cybench. Its 1M-token context window and 128K-token output ceiling match the largest in the Claude family.

What distinguishes Mythos from every other frontier model release is its deployment model. Anthropic has explicitly declined to make it generally available, citing offensive cybersecurity capabilities that exceed what it considers safe for unrestricted access. Instead, Mythos is deployed through Project Glasswing, a coalition of 12 major organizations (including AWS, Apple, Google, Microsoft, Cisco, CrowdStrike, NVIDIA, JPMorganChase, Broadcom, Palo Alto Networks, and the Linux Foundation) plus ~40 additional critical-infrastructure organizations. Anthropic committed $100M in usage credits and $4M in open-source security donations. The model has autonomously discovered thousands of zero-day vulnerabilities across every major operating system and browser, including bugs that evaded millions of automated test runs over 16 to 27 years.

The strategic implications are significant. Mythos represents a new paradigm in which the most capable frontier models may not be broadly accessible. Anthropic's 244-page system card includes a clinical psychiatrist assessment (a first for any Claude model) and white-box interpretability analysis. The company states that Mythos-class capabilities will eventually flow into a future Claude Opus release once safety safeguards mature. For the broader AI ecosystem, Mythos signals that the gap between 'capable enough to deploy' and 'capable enough to require restriction' is now a live industry question.

Benchmarks and performance

Claude Mythos Preview leads in virtually every reported benchmark category, with generational jumps rather than incremental improvements. All scores are self-reported by Anthropic and should be interpreted with that caveat.

## Agentic Coding

| Benchmark | Mythos Preview | Opus 4.6 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | **93.9%** | 80.8% | Not reported | 80.6% |
| SWE-bench Pro | **77.8%** | 53.4% | 58.6% | 54.2% |
| SWE-bench Multilingual | **87.3%** | 77.8% | Not reported | Not reported |
| Terminal-Bench 2.0 | **82.0%** (92.1% w/ extended timeout) | 65.4% | 82.7% | 68.5% |

On SWE-bench Pro, Mythos leads GPT-5.5 by ~19 percentage points, the widest gap on any directly comparable coding benchmark. Terminal-Bench 2.0 is effectively tied with GPT-5.5 at default settings (82.0% vs 82.7%), but Mythos reaches 92.1% with the updated 2.1 harness and 4-hour extended timeouts.

## Reasoning & Mathematics

| Benchmark | Mythos Preview | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| GPQA Diamond | **94.6%** | 91.3% | 92.8% | 94.3% |
| USAMO 2026 | **97.6%** | 42.3% | 95.2% | 74.4% |
| HLE (with tools) | **64.7%** | 53.1% | 52.1% | 51.4% |
| HLE (without tools) | **56.8%** | 40.0% | 39.8% | 44.4% |
| GraphWalks BFS 256K–1M | **80.0%** | 38.7% | 21.4% | Not reported |

The USAMO jump from 42.3% to 97.6% is the single most striking number in the release. On GraphWalks BFS, the long-context reasoning score more than doubled versus Opus 4.6, suggesting qualitative improvements in reasoning over very large context windows. On HLE, Anthropic notes Mythos 'still performs well at low effort, which could indicate some level of memorization.'

## Cybersecurity

| Benchmark | Mythos Preview | Opus 4.6 |
|---|---|---|
| CyberGym | **83.1%** | 66.6% |
| Cybench | **100% (saturated)** | Not reported |
| Firefox 147 exploits | **181 working exploits** | 2 working exploits |
| OSS-Fuzz (Tier 5 hijacks) | **10 full control-flow hijacks** | 0 (1 Tier-3 crash) |

Independent evaluation by the UK AI Security Institute (AISI) found Mythos succeeds 73% of the time on expert-level CTF tasks that no model could complete before April 2025. On AISI's 32-step 'The Last Ones' corporate-network simulation, Mythos was the first model to solve it end-to-end (3 of 10 runs), averaging 22 of 32 steps. Opus 4.6 averaged 16 steps.

## Multimodal & Computer Use

| Benchmark | Mythos Preview | Opus 4.6 |
|---|---|---|
| OSWorld-Verified | **79.6%** | 72.7% |
| BrowseComp | **86.9%** (4.9× fewer tokens) | 83.7% |
| CharXiv Reasoning (w/ tools) | **93.2%** | 78.9% |
| LAB-Bench FigQA (w/ tools) | **89.0%** | 75.1% |
| MMMLU | **92.7%** | 91.1% |

BenchLM.ai aggregates these into a provisional overall score of 99/100, ranking #1 of 117 tracked models. Mythos ranks #1 in the Agentic, Coding, and Multilingual categories, and #3 in Multimodal.

## Key Caveats

- All scores are self-reported by Anthropic.
- SWE-bench Multimodal (59.0%) uses an internal implementation not comparable to public leaderboards.
- Anthropic acknowledges potential memorization concerns on HLE.
- Parameter count is not disclosed; '10 trillion' is unsubstantiated speculation from post-leak coverage.
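The percentage-point gaps quoted in this section can be recomputed directly from the reported scores. The following is a minimal sketch that copies a subset of the Mythos Preview and Opus 4.6 figures from the tables above and ranks the gaps; the names `SCORES` and `gap_pp` are illustrative only.

```python
# (benchmark, Mythos Preview %, Opus 4.6 %) copied from the tables above.
SCORES = [
    ("SWE-bench Verified", 93.9, 80.8),
    ("SWE-bench Pro", 77.8, 53.4),
    ("Terminal-Bench 2.0", 82.0, 65.4),
    ("GPQA Diamond", 94.6, 91.3),
    ("USAMO 2026", 97.6, 42.3),
    ("HLE (with tools)", 64.7, 53.1),
    ("GraphWalks BFS 256K-1M", 80.0, 38.7),
    ("CyberGym", 83.1, 66.6),
]


def gap_pp(new_score: float, old_score: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(new_score - old_score, 1)


gaps = {name: gap_pp(mythos, opus) for name, mythos, opus in SCORES}
largest = max(gaps, key=gaps.get)

# Print the gaps from largest to smallest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24} +{gap:.1f} pp")

print(f"Largest jump: {largest}")
```

The gaps are expressed in percentage points (a plain difference of two percentages), not as relative percent change, which matches how the text reports them.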

Detailed comparison

## Claude Mythos Preview vs Claude Opus 4.7

| Dimension | Mythos Preview | Opus 4.7 |
|---|---|---|
| Availability | Invitation-only (Project Glasswing) | Generally available |
| SWE-bench Verified | 93.9% | 87.6% |
| SWE-bench Pro | 77.8% | 64.3% |
| GPQA Diamond | 94.6% | 94.2% |
| Terminal-Bench 2.0 | 82.0% | 69.4% |
| CyberGym | 83.1% | 73.1% |
| HLE (w/ tools) | 64.7% | 54.7% |
| Context Window | 1M tokens | 1M tokens |
| Max Output | 128K tokens | 128K tokens |
| Input/Output Price | $25/$125 per 1M | $5/$25 per 1M |

Mythos outperforms Opus 4.7 on every reported benchmark, most dramatically on SWE-bench Pro (+13.5pp), Terminal-Bench 2.0 (+12.6pp), and CyberGym and HLE with tools (+10.0pp each). However, Opus 4.7 is the most capable *generally available* Claude model and is 5× cheaper. Anthropic positions Opus 4.7 as the bridge model where new cyber safeguards are being tested ahead of eventual Mythos-class public releases.

## Claude Mythos Preview vs GPT-5.5

| Dimension | Mythos Preview | GPT-5.5 |
|---|---|---|
| Availability | ~52 organizations | Public (ChatGPT, API) |
| SWE-bench Pro | 77.8% | 58.6% |
| Terminal-Bench 2.0 | 82.0% | 82.7% |
| OSWorld-Verified | 79.6% | 78.7% |
| BrowseComp | 86.9% | 84.4% |
| CyberGym | 83.1% | 81.8% |

Of the five benchmarks where both models report scores, Mythos leads on four and GPT-5.5 is narrowly ahead on Terminal-Bench 2.0 (82.7% vs 82.0%). Four of the five gaps are 2.5pp or less, small enough to fall within normal noise margins. The SWE-bench Pro gap (~19pp) is the only decisively large difference. GPT-5.5 has no public SWE-bench Verified score, no Cybench number, and no equivalent zero-day discovery program. The fundamental asymmetry is deployment: GPT-5.5 is rolling out to millions of ChatGPT users; Mythos is restricted to critical-infrastructure defenders. As Kingy AI's analysis concludes: 'on vendor-reported overlap, Mythos is ahead; on real-world availability, GPT-5.5 is the only one you can actually use.'

## Claude Mythos Preview vs Gemini 3.1 Pro

| Dimension | Mythos Preview | Gemini 3.1 Pro |
|---|---|---|
| SWE-bench Verified | 93.9% | 80.6% |
| GPQA Diamond | 94.6% | 94.3% |
| Terminal-Bench 2.0 | 82.0% | 68.5% |
| USAMO 2026 | 97.6% | 74.4% |
| HLE (w/ tools) | 64.7% | 51.4% |

Mythos leads Gemini 3.1 Pro significantly on coding and agentic benchmarks. GPQA Diamond is essentially tied (94.6% vs 94.3%). The USAMO gap (23pp) is large but less dramatic than the gap over Opus 4.6. Google has not reported cybersecurity benchmark scores for Gemini 3.1 Pro in this comparison set.
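For the Opus 4.7 comparison earlier in this section, the trade-off comes down to benchmark margin versus price multiple. The sketch below recomputes both from the table figures; the dictionary and variable names are illustrative, and the numbers are copied from the Mythos Preview vs Opus 4.7 table rather than from any external source.

```python
# Scores (%) copied from the Mythos Preview vs Opus 4.7 table.
MYTHOS = {
    "SWE-bench Verified": 93.9,
    "SWE-bench Pro": 77.8,
    "GPQA Diamond": 94.6,
    "Terminal-Bench 2.0": 82.0,
    "CyberGym": 83.1,
    "HLE (w/ tools)": 64.7,
}
OPUS_47 = {
    "SWE-bench Verified": 87.6,
    "SWE-bench Pro": 64.3,
    "GPQA Diamond": 94.2,
    "Terminal-Bench 2.0": 69.4,
    "CyberGym": 73.1,
    "HLE (w/ tools)": 54.7,
}

# (input, output) USD per 1M tokens, from the same table.
MYTHOS_PRICE = (25.0, 125.0)
OPUS_47_PRICE = (5.0, 25.0)

# Margin in percentage points for each shared benchmark.
deltas = {name: round(MYTHOS[name] - OPUS_47[name], 1) for name in MYTHOS}

# Price multiples for input and output tokens.
input_ratio = MYTHOS_PRICE[0] / OPUS_47_PRICE[0]
output_ratio = MYTHOS_PRICE[1] / OPUS_47_PRICE[1]

for name, delta in deltas.items():
    print(f"{name:<20} +{delta:.1f} pp")
print(f"Price multiple: {input_ratio:.0f}x input, {output_ratio:.0f}x output")
```

Reading the two outputs side by side makes the positioning concrete: the same price multiple buys a margin of over 13 points on SWE-bench Pro but under half a point on GPQA Diamond.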

Community reception

The AI research and developer community has reacted to Claude Mythos Preview with a mixture of awe, frustration, and strategic reassessment.

**Benchmark dominance acknowledged broadly.** BenchLM.ai assigns Mythos a provisional #1 ranking (99/100) across 117 tracked models. Independent analyses from Vellum, R&D World, and SmartChunks all confirm that on every benchmark Anthropic reported, Mythos outperforms the prior flagship Opus 4.6 by margins that are clearly generational, not incremental. The USAMO 97.6% score has been described by multiple commentators as 'the most striking single number in any 2026 model launch.'

**Frustration over access restrictions.** The most common developer reaction is that a model this capable, gated behind invitation-only access, creates a two-tier AI ecosystem. Forum discussions on Hacker News and AI Twitter highlight that while the cybersecurity rationale is understood, the lack of a public API means the broader developer community cannot verify claims, build on the model, or integrate it into production workflows. The Kingy AI comparison piece captures this tension: 'The one-sentence answer is: on vendor-reported overlap, Mythos is ahead; on real-world availability, GPT-5.5 is the only one you can actually use.'

**Respect for Anthropic's safety stance.** The 244-page system card, the clinical psychiatrist assessment section, and the white-box interpretability analysis have been praised by alignment researchers as unusually thorough disclosure. The UK AI Security Institute's independent evaluation, which confirmed Mythos's cyber capabilities, has been cited as a model for third-party validation of frontier AI claims.

**Open-source security community response.** The $2.5M donation to Alpha-Omega/OpenSSF and the $1.5M donation to the Apache Software Foundation, combined with the Claude for Open Source access program, have been positively received by open-source maintainers who historically lack access to expensive security tooling.

**Strategic industry implications.** Security professionals and AI policy researchers note that Mythos represents the first frontier model where the deploying company has explicitly said 'this is too dangerous for general release.' The New York Times reported on the announcement under the headline 'Anthropic's New Mythos A.I. Model Sets Off Global Alarms,' framing it as a watershed moment for AI capability governance. Multiple analysts have noted that Anthropic's approach of deploying a powerful model defensively while withholding it from public access creates a new template for responsible frontier AI deployment that other labs may be pressured to follow.

**Adoption patterns.** With access limited to ~52 organizations, real-world adoption is concentrated among major cloud providers (AWS, Google Cloud, Microsoft Azure), cybersecurity firms (CrowdStrike, Palo Alto Networks), and critical infrastructure operators (JPMorganChase, Linux Foundation ecosystem). Anthropic's $100M credit commitment ensures substantial usage during the research preview period.

Use cases

### 1. Defensive Cybersecurity & Vulnerability Research

This is the primary use case Anthropic has designed Mythos for and the reason Project Glasswing exists. Mythos has autonomously discovered thousands of zero-day vulnerabilities across every major OS and browser, including bugs that evaded decades of automated testing. Specific applications include local vulnerability detection in production binaries, black-box penetration testing of critical infrastructure, automated security auditing of open-source codebases, and exploit chain discovery. On the Firefox 147 benchmark, Mythos produced 181 working exploits versus 2 for Opus 4.6.

**Choose Mythos over alternatives when** your organization has approved access and is working on defensive security of critical systems. No other publicly documented model matches this capability tier.

### 2. Autonomous Agentic Coding for Complex, Long-Horizon Tasks

With 93.9% on SWE-bench Verified, 77.8% on SWE-bench Pro, and 82.0% on Terminal-Bench 2.0, Mythos is the strongest autonomous coding agent by reported scores. It can investigate codebases, implement fixes, run tests, and report results with minimal human steering. The model supports tool use and multi-step task execution through Anthropic's Managed Agents function.

**Choose Mythos over alternatives when** you need an AI to autonomously resolve complex software engineering tasks that require sustained investigation across large codebases, particularly when the tasks involve security-sensitive code or require high confidence in correctness. For general coding tasks without security sensitivity, Opus 4.7 (87.6% on SWE-bench Verified versus 93.9% for Mythos, at one-fifth the cost) is more practical.

### 3. Advanced Mathematical & Scientific Reasoning

The 97.6% on USAMO 2026 and 56.8% on HLE (without tools) represent the frontier of AI mathematical reasoning. Mythos shows particular strength in multi-step proofs, creative problem decomposition, and competition-level mathematics. The model's 94.6% on GPQA Diamond also indicates strong graduate-level science reasoning.

**Choose Mythos over alternatives when** solving research-grade mathematical problems, validating complex proofs, or tackling scientific reasoning tasks where the 55-point USAMO gap over Opus 4.6 translates to qualitatively different problem-solving capability. For standard academic Q&A, the GPQA gap over Opus 4.7 (94.6% vs 94.2%) is negligible.

### 4. Long-Context Reasoning & Document Analysis

The 1M-token context window combined with 80.0% on GraphWalks BFS 256K–1M (vs 38.7% for Opus 4.6) makes Mythos uniquely capable at reasoning over very large contexts. This is relevant for analyzing massive codebases, processing long legal or regulatory documents, conducting multi-document research synthesis, and navigating complex knowledge graphs.

**Choose Mythos over alternatives when** your task requires genuine reasoning, not just retrieval, across documents that span hundreds of thousands to millions of tokens. The GraphWalks result suggests qualitative long-context improvements that smaller models cannot replicate regardless of prompting strategy.