Overview
Claude Mythos Preview, announced April 7, 2026, is Anthropic's most powerful model to date and the first to sit above the Opus tier in Anthropic's hierarchy (Haiku → Sonnet → Opus → Mythos). Internally codenamed 'Capybara,' it represents what Anthropic describes as a 4.3× jump over its previous performance trendline. The model achieves state-of-the-art results across coding, reasoning, cybersecurity, and agentic benchmarks—most notably 93.9% on SWE-bench Verified, 97.6% on USAMO 2026 (a 55-point leap over Opus 4.6), 83.1% on CyberGym, and a saturated 100% on Cybench. Its 1M-token context window and 128K-token output ceiling match the largest in the Claude family.
What distinguishes Mythos from other frontier model releases is its deployment model. Anthropic has explicitly declined to make it generally available, citing offensive cybersecurity capabilities that exceed what it considers safe for unrestricted access. Instead, Mythos is deployed through Project Glasswing, a coalition of 12 launch partners (including AWS, Apple, Google, Microsoft, Cisco, CrowdStrike, NVIDIA, JPMorganChase, Broadcom, Palo Alto Networks, and the Linux Foundation) plus ~40 additional critical-infrastructure organizations. Anthropic committed $100M in usage credits and $4M in open-source security donations. According to Anthropic, the model has autonomously discovered thousands of zero-day vulnerabilities across every major operating system and browser, including bugs that evaded millions of automated test runs over 16–27 years.
The strategic implications are significant. Mythos represents a new paradigm where the most capable frontier models may not be broadly accessible. Anthropic's 244-page system card includes a clinical psychiatrist assessment (a first for any Claude model) and white-box interpretability analysis. The company states that Mythos-class capabilities will eventually flow into a future Claude Opus release once safety safeguards mature. For the broader AI ecosystem, Mythos signals that the gap between 'capable enough to deploy' and 'capable enough to require restriction' is now a live industry question.
Benchmarks and Performance
Claude Mythos Preview dominates across virtually every reported benchmark category, representing generational jumps rather than incremental improvements. All scores are self-reported by Anthropic and should be interpreted with that caveat.
## Agentic Coding
| Benchmark | Mythos Preview | Opus 4.6 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | **93.9%** | 80.8% | Not reported | 80.6% |
| SWE-bench Pro | **77.8%** | 53.4% | 58.6% | 54.2% |
| SWE-bench Multilingual | **87.3%** | 77.8% | Not reported | Not reported |
| Terminal-Bench 2.0 | **82.0%** (92.1% w/ extended timeout) | 65.4% | 82.7% | 68.5% |
On SWE-bench Pro, Mythos leads GPT-5.5 by ~19 percentage points—the widest gap on any directly comparable coding benchmark. Terminal-Bench 2.0 is effectively tied with GPT-5.5 at default settings (82.0% vs 82.7%), but Mythos reaches 92.1% with the updated 2.1 harness and 4-hour extended timeouts.
## Reasoning & Mathematics
| Benchmark | Mythos Preview | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| GPQA Diamond | **94.6%** | 91.3% | 92.8% | 94.3% |
| USAMO 2026 | **97.6%** | 42.3% | 95.2% | 74.4% |
| HLE (with tools) | **64.7%** | 53.1% | 52.1% | 51.4% |
| HLE (without tools) | **56.8%** | 40.0% | 39.8% | 44.4% |
| GraphWalks BFS 256K–1M | **80.0%** | 38.7% | 21.4% | Not reported |
The USAMO jump from 42.3% to 97.6% is the single most striking number in the release. GraphWalks BFS shows long-context reasoning more than doubled versus Opus 4.6, suggesting qualitative improvements in reasoning over very large context windows. On HLE, Anthropic notes Mythos 'still performs well at low effort, which could indicate some level of memorization.'
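The paragraph above mixes two kinds of comparison: absolute gains in percentage points (the USAMO jump) and relative gains (GraphWalks "more than doubled"). As a rough sketch, both can be recomputed from the table. The scores are copied from the table above; the `improvement` helper is a name introduced here for illustration only.

```python
# (Mythos Preview, Opus 4.6) scores in %, copied from the table above.
REASONING = {
    "GPQA Diamond": (94.6, 91.3),
    "USAMO 2026": (97.6, 42.3),
    "HLE (with tools)": (64.7, 53.1),
    "HLE (without tools)": (56.8, 40.0),
    "GraphWalks BFS 256K-1M": (80.0, 38.7),
}

def improvement(new: float, old: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, ratio new/old)."""
    return round(new - old, 1), round(new / old, 2)

for name, (mythos, opus) in REASONING.items():
    pp, ratio = improvement(mythos, opus)
    print(f"{name:<24} +{pp:.1f} pp  x{ratio:.2f}")
```

On these numbers USAMO 2026 gains 55.3 points and GraphWalks BFS roughly 2.07 times the Opus 4.6 score, while GPQA Diamond moves by only 3.3 points, which is why the two headline results stand out.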
## Cybersecurity
| Benchmark | Mythos Preview | Opus 4.6 |
|---|---|---|
| CyberGym | **83.1%** | 66.6% |
| Cybench | **100% (saturated)** | Not reported |
| Firefox 147 exploits | **181 working exploits** | 2 working exploits |
| OSS-Fuzz (Tier 5 hijacks) | **10 full control-flow hijacks** | 0 (1 Tier-3 crash) |
Independent evaluation by the UK AI Security Institute found Mythos succeeds 73% of the time on expert-level CTF tasks that no model could complete before April 2025. On AISI's 32-step 'The Last Ones' corporate-network simulation, Mythos was the first model to solve it end-to-end (3 of 10 runs), averaging 22 of 32 steps. Opus 4.6 averaged 16 steps.
## Multimodal & Computer Use
| Benchmark | Mythos Preview | Opus 4.6 |
|---|---|---|
| OSWorld-Verified | **79.6%** | 72.7% |
| BrowseComp | **86.9%** (4.9× fewer tokens) | 83.7% |
| CharXiv Reasoning (w/ tools) | **93.2%** | 78.9% |
| LAB-Bench FigQA (w/ tools) | **89.0%** | 75.1% |
| MMMLU | **92.7%** | 91.1% |
BenchLM.ai aggregates these into a provisional overall score of 99/100, ranking #1 of 117 tracked models. Mythos ranks #1 in Agentic, Coding, and Multilingual categories, and #3 in Multimodal.
## Key Caveats
- All scores self-reported by Anthropic.
- SWE-bench Multimodal (59.0%) uses an internal implementation not comparable to public leaderboards.
- Anthropic acknowledges potential memorization concerns on HLE.
- Parameter count not disclosed; '10 trillion' is unsubstantiated speculation from post-leak coverage.
Detailed Comparisons
## Claude Mythos Preview vs Claude Opus 4.7
| Dimension | Mythos Preview | Opus 4.7 |
|---|---|---|
| Availability | Invitation-only (Project Glasswing) | Generally available |
| SWE-bench Verified | 93.9% | 87.6% |
| SWE-bench Pro | 77.8% | 64.3% |
| GPQA Diamond | 94.6% | 94.2% |
| Terminal-Bench 2.0 | 82.0% | 69.4% |
| CyberGym | 83.1% | 73.1% |
| HLE (w/ tools) | 64.7% | 54.7% |
| Context Window | 1M tokens | 1M tokens |
| Max Output | 128K tokens | 128K tokens |
| Input/Output Price | $25/$125 per 1M | $5/$25 per 1M |
Mythos outperforms Opus 4.7 on every reported benchmark, most dramatically on SWE-bench Pro (+13.5pp), Terminal-Bench 2.0 (+12.6pp), and CyberGym (+10pp). However, Opus 4.7 is the most capable *generally available* Claude model and is 5× cheaper. Anthropic positions Opus 4.7 as the bridge model where new cyber safeguards are being tested ahead of eventual Mythos-class public releases.
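To make the price difference concrete, here is a minimal sketch that applies the per-1M-token list prices from the table to a single job. The workload size (200K input tokens, 20K output tokens) is a hypothetical example, and `job_cost` is an illustrative helper, not part of any Anthropic SDK.

```python
# List prices from the table above, in USD per 1M tokens: (input, output).
PRICES = {
    "Mythos Preview": (25.0, 125.0),
    "Opus 4.7": (5.0, 25.0),
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the list-price cost in USD for one job."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical workload: 200K input tokens, 20K output tokens.
mythos = job_cost("Mythos Preview", 200_000, 20_000)
opus = job_cost("Opus 4.7", 200_000, 20_000)
print(f"Mythos: ${mythos:.2f}, Opus 4.7: ${opus:.2f}, ratio: {mythos / opus:.1f}x")
# prints "Mythos: $7.50, Opus 4.7: $1.50, ratio: 5.0x"
```

Because both the input and output prices differ by the same factor, the 5× ratio holds for any mix of input and output tokens.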
## Claude Mythos Preview vs GPT-5.5
| Dimension | Mythos Preview | GPT-5.5 |
|---|---|---|
| Availability | ~52 organizations | Public (ChatGPT, API) |
| SWE-bench Pro | 77.8% | 58.6% |
| Terminal-Bench 2.0 | 82.0% | 82.7% |
| OSWorld-Verified | 79.6% | 78.7% |
| BrowseComp | 86.9% | 84.4% |
| CyberGym | 83.1% | 81.8% |
On the five benchmarks where both models report scores, Mythos leads on four and GPT-5.5 narrowly leads on one (Terminal-Bench 2.0, 82.7% vs 82.0%). The three smallest gaps (Terminal-Bench 2.0, OSWorld-Verified, and CyberGym) are each under 1.5pp and within normal noise margins. The SWE-bench Pro gap (~19pp) is the only decisively large difference. GPT-5.5 has no public SWE-bench Verified score, no Cybench number, and no equivalent zero-day discovery program. The fundamental asymmetry is deployment: GPT-5.5 is rolling out to millions of ChatGPT users; Mythos is restricted to critical-infrastructure defenders. As Kingy AI's analysis concludes: 'on vendor-reported overlap, Mythos is ahead; on real-world availability, GPT-5.5 is the only one you can actually use.'
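The head-to-head can be checked directly from the table. The sketch below recomputes each gap as Mythos minus GPT-5.5 and counts how many fall inside a noise band. The 1.5pp threshold is an assumption chosen for illustration; the source does not define what counts as noise.

```python
# (Mythos Preview, GPT-5.5) scores in %, copied from the table above.
SCORES = {
    "SWE-bench Pro": (77.8, 58.6),
    "Terminal-Bench 2.0": (82.0, 82.7),
    "OSWorld-Verified": (79.6, 78.7),
    "BrowseComp": (86.9, 84.4),
    "CyberGym": (83.1, 81.8),
}

NOISE_PP = 1.5  # hypothetical noise band, not taken from the source

def gaps(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return {benchmark: Mythos minus GPT-5.5}, rounded to 0.1 pp."""
    return {name: round(m - g, 1) for name, (m, g) in scores.items()}

diffs = gaps(SCORES)
for name, d in sorted(diffs.items(), key=lambda kv: kv[1], reverse=True):
    leader = "Mythos" if d > 0 else "GPT-5.5"
    print(f"{name:<20} {d:+5.1f} pp  ({leader} ahead)")

mythos_wins = sum(d > 0 for d in diffs.values())
within_noise = [name for name, d in diffs.items() if abs(d) <= NOISE_PP]
print(f"Mythos ahead on {mythos_wins} of {len(diffs)}; "
      f"{len(within_noise)} gaps within {NOISE_PP} pp")
```

On the table's numbers, Terminal-Bench 2.0 is the one benchmark where GPT-5.5 is ahead (by 0.7pp), and SWE-bench Pro is the only gap above 3pp.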
## Claude Mythos Preview vs Gemini 3.1 Pro
| Dimension | Mythos Preview | Gemini 3.1 Pro |
|---|---|---|
| SWE-bench Verified | 93.9% | 80.6% |
| GPQA Diamond | 94.6% | 94.3% |
| Terminal-Bench 2.0 | 82.0% | 68.5% |
| USAMO 2026 | 97.6% | 74.4% |
| HLE (w/ tools) | 64.7% | 51.4% |
Mythos leads Gemini 3.1 Pro significantly on coding and agentic benchmarks. GPQA Diamond is essentially tied (94.6% vs 94.3%). The USAMO gap (23pp) is large but less dramatic than the Opus 4.6 gap. Google has not reported cybersecurity benchmark scores for Gemini 3.1 Pro in this comparison set.
Community Reception
The AI research and developer community has reacted to Claude Mythos Preview with a mixture of awe, frustration, and strategic reassessment.
**Benchmark dominance acknowledged broadly.** BenchLM.ai assigns Mythos a provisional #1 ranking (99/100) across 117 tracked models. Analyses from Vellum, R&D World, and SmartChunks all note that on every benchmark Anthropic reported, Mythos outperforms the prior flagship Opus 4.6, in several cases by margins that are clearly generational rather than incremental. The USAMO 97.6% score has been described by multiple commentators as 'the most striking single number in any 2026 model launch.'
**Frustration over access restrictions.** The most common developer reaction is that a model this capable, gated behind invitation-only access, creates a two-tier AI ecosystem. Forum discussions on Hacker News and AI Twitter highlight that while the cybersecurity rationale is understood, the lack of a public API means the broader developer community cannot verify claims, build on the model, or integrate it into production workflows. The Kingy AI comparison piece captures this tension: 'The one-sentence answer is: on vendor-reported overlap, Mythos is ahead; on real-world availability, GPT-5.5 is the only one you can actually use.'
**Respect for Anthropic's safety stance.** The 244-page system card, the clinical psychiatrist assessment section, and the white-box interpretability analysis have been praised by alignment researchers as unusually thorough disclosure. The UK AI Security Institute's independent evaluation, confirming Mythos's cyber capabilities, has been cited as a model for third-party validation of frontier AI claims.
**Open-source security community response.** The $2.5M to Alpha-Omega/OpenSSF and $1.5M to Apache Foundation, combined with the Claude for Open Source access program, have been positively received by open-source maintainers who historically lack access to expensive security tooling.
**Strategic industry implications.** Security professionals and AI policy researchers note that Mythos represents the first frontier model where the deploying company has explicitly said 'this is too dangerous for general release.' The New York Times reported on the announcement under the headline 'Anthropic's New Mythos A.I. Model Sets Off Global Alarms,' framing it as a watershed moment for AI capability governance. Multiple analysts have noted that Anthropic's approach—deploying a powerful model defensively while withholding it from public access—creates a new template for responsible frontier AI deployment that other labs may be pressured to follow.
**Adoption patterns.** With access limited to ~52 organizations, real-world adoption is concentrated among major cloud providers (AWS, Google Cloud, Microsoft Azure), cybersecurity firms (CrowdStrike, Palo Alto Networks), and critical infrastructure operators (JPMorganChase, Linux Foundation ecosystem). Anthropic's $100M credit commitment ensures substantial usage during the research preview period.
Use Cases
### 1. Defensive Cybersecurity & Vulnerability Research
This is the primary use case Anthropic has designed Mythos for and the reason Project Glasswing exists. Mythos has autonomously discovered thousands of zero-day vulnerabilities across every major OS and browser, including bugs that evaded decades of automated testing. Specific applications include: local vulnerability detection in production binaries, black-box penetration testing of critical infrastructure, automated security auditing of open-source codebases, and exploit chain discovery. On the Firefox 147 benchmark, Mythos produced 181 working exploits versus 2 for Opus 4.6. **Choose Mythos over alternatives when** your organization has approved access and is working on defensive security of critical systems. No other publicly documented model matches this capability tier.
### 2. Autonomous Agentic Coding for Complex, Long-Horizon Tasks
With 93.9% on SWE-bench Verified, 77.8% on SWE-bench Pro, and 82.0% on Terminal-Bench 2.0, Mythos posts the strongest reported autonomous-coding results of the models compared here, although GPT-5.5 is marginally ahead on Terminal-Bench 2.0 at default settings (82.7%). It can investigate codebases, implement fixes, run tests, and report results with minimal human steering. The model supports tool use and multi-step task execution through Anthropic's Managed Agents function. **Choose Mythos over alternatives when** you have approved access and need an AI to autonomously resolve complex software engineering tasks that require sustained investigation across large codebases, particularly when the tasks involve security-sensitive code or require high confidence in correctness. For general coding tasks without security sensitivity, Opus 4.7 (87.6% on SWE-bench Verified versus 93.9% for Mythos, at one-fifth the price) is more practical.
### 3. Advanced Mathematical & Scientific Reasoning
The 97.6% on USAMO 2026 and 56.8% on HLE (without tools) represent the frontier of AI mathematical reasoning. Mythos shows particular strength in multi-step proofs, creative problem decomposition, and competition-level mathematics. The model's 94.6% on GPQA Diamond also indicates strong graduate-level science reasoning. **Choose Mythos over alternatives when** solving research-grade mathematical problems, validating complex proofs, or tackling scientific reasoning tasks where the 55-point USAMO gap over Opus 4.6 translates to qualitatively different problem-solving capability. For standard academic Q&A, the GPQA gap over Opus 4.7 (94.6% vs 94.2%) is negligible.
### 4. Long-Context Reasoning & Document Analysis
The 1M-token context window combined with 80.0% on GraphWalks BFS 256K–1M (vs Opus 4.6's 38.7%) makes Mythos uniquely capable at reasoning over very large contexts. This is relevant for analyzing massive codebases, processing long legal or regulatory documents, conducting multi-document research synthesis, and navigating complex knowledge graphs. **Choose Mythos over alternatives when** your task requires genuine reasoning—not just retrieval—across documents that span hundreds of thousands to millions of tokens. The GraphWalks result suggests qualitative long-context improvements that smaller models cannot replicate regardless of prompting strategies.
Latest News
**April 7, 2026 – Launch.** Anthropic announced Claude Mythos Preview alongside Project Glasswing, with 12 launch partners and ~40 additional critical-infrastructure organizations. $100M in usage credits committed. Full announcement at anthropic.com/glasswing.
**April 13, 2026 – UK AISI Evaluation.** The UK AI Security Institute published independent evaluation results confirming Mythos's cyber capabilities: 73% success on expert CTF tasks and first-ever end-to-end solution of 'The Last Ones' 32-step corporate-network simulation.
**April 16, 2026 – Claude Opus 4.7 Released.** Anthropic released Opus 4.7 as the most capable generally available Claude model, explicitly positioning it as the bridge where new cyber safeguards are being tested ahead of eventual Mythos-class public releases.
**April 23, 2026 – GPT-5.5 Launch by OpenAI.** OpenAI released GPT-5.5 to the public. On the five benchmarks it shares with Mythos, GPT-5.5 trails on four and narrowly leads on Terminal-Bench 2.0 (82.7% vs 82.0%); three of the five gaps are within noise margins. The SWE-bench Pro gap (~19pp) remains the most significant difference. The availability contrast (GPT-5.5 public, Mythos restricted) drew significant industry commentary.
**Planned: 90-Day Public Report.** Anthropic committed to publishing a report on vulnerabilities found and fixed through Project Glasswing within 90 days of launch (expected ~July 2026), along with industry recommendations on vulnerability disclosure, patching automation, and secure-by-design practices.
**Planned: Cyber Verification Program.** A program for security professionals whose legitimate work is affected by Mythos's output safeguards is upcoming, though no specific launch date has been announced.
**Planned: Mythos-Class Public Release.** Anthropic has stated it plans to bring Mythos-class capabilities to a future Claude Opus release with additional safety safeguards, but has not provided a timeline.
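The "~July 2026" estimate for the 90-day Project Glasswing report follows from simple date arithmetic on the April 7, 2026 launch date. A minimal check, assuming the 90 days are counted as calendar days from launch (the source does not say how Anthropic counts them):

```python
from datetime import date, timedelta

launch = date(2026, 4, 7)                   # launch date from the announcement
report_due = launch + timedelta(days=90)    # assumes calendar days
print(report_due.isoformat())               # prints "2026-07-06"
```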