개요
Claude Opus 4.7, released April 16, 2026, is Anthropic's most capable generally available model and the first to ship with production cybersecurity safeguards developed under Project Glasswing. Built on the Mythos architecture, it ties with GPT-5.5 and Gemini 3.1 Pro atop the Artificial Analysis Intelligence Index (score 57) while leading GDPval-AA—a benchmark measuring economically valuable knowledge work across 44 occupations—by 79 Elo points. The model represents a targeted upgrade over Opus 4.6, with improvements concentrated in agentic coding (+10.9pp on SWE-bench Pro), multi-tool orchestration (77.3% MCP-Atlas, #1 among available models), and visual reasoning (+13pp on CharXiv). A new self-verification capability causes the model to check its own work before reporting, reducing confident-but-wrong outputs and enabling more autonomous long-running workflows.
However, the release comes with real trade-offs. Long-context retrieval performance regressed sharply—MRCR v2 8-needle at 1M tokens dropped from 78.3% to 32.2%—and BrowseComp web research fell 4.4 points, trailing both GPT-5.5 and Gemini 3.1 Pro. A new tokenizer inflates token counts by up to 35% on identical inputs, meaning effective per-task costs rise despite unchanged per-token pricing. Anthropic also deliberately reduced Opus 4.7's cybersecurity capabilities during training, making it the first commercially available model intentionally constrained in a specific domain for safety reasons. This positions it as a bridge to the more powerful but restricted Claude Mythos Preview.
The pricing remains at $5/$25 per 1M input/output tokens (with 90% cache discounts and 50% batch discounts available), and the model maintains the 1M-token context window and 128K max output of its predecessor. New features include an 'xhigh' effort level for finer reasoning control, task budgets in public beta for token-guided agentic loops, and vision resolution increased to 3.75 megapixels. Developer reception is strongly positive for coding and agent workflows, though the long-context regression and tokenizer cost increase have drawn sharp criticism in the community.
벤치마크 및 성능
## Comprehensive Benchmark Comparison
### Coding & Software Engineering
| Benchmark | Claude Opus 4.7 | Claude Opus 4.6 | GPT-5.5 | Gemini 3.1 Pro | Mythos Preview |
|---|---|---|---|---|---|
| SWE-bench Verified | **87.6%** | 80.8% | — | 80.6% | 93.9% |
| SWE-bench Pro | **64.3%** | 53.4% | 58.6% | 54.2% | 77.8% |
| Terminal-Bench 2.0 | 69.4% | 65.4% | **82.7%** | 68.5% | 82.0% |
| CursorBench | **70%** | 58% | — | — | — |
| Rakuten-SWE-Bench | 3× Opus 4.6 | baseline | — | — | — |
Opus 4.7 leads all generally available models on SWE-bench Verified (87.6%) and SWE-bench Pro (64.3%). The +10.9pp gain on SWE-bench Pro is the largest single-benchmark improvement in this release. However, it trails GPT-5.5 significantly on Terminal-Bench 2.0 (69.4% vs 82.7%), which tests autonomous shell-driven tasks.
### Agentic & Tool Use
| Benchmark | Claude Opus 4.7 | Claude Opus 4.6 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| MCP-Atlas (Tool Use) | **77.3%** | 75.8% | 75.3% | 73.9% |
| OSWorld-Verified (Computer Use) | **78.0%** | 72.7% | 78.7% | — |
| Finance Agent v1.1 | **64.4%** | 60.1% | 60.0% | 59.7% |
| GDPval-AA (Elo) | **1,753** | 1,619 | 1,674 | — |
| BrowseComp | 79.3% | 83.7% | 84.4% | **85.9%** |
Opus 4.7 leads MCP-Atlas, Finance Agent v1.1, and GDPval-AA. It ties GPT-5.5 on OSWorld-Verified (78.0% vs 78.7%). BrowseComp is the clear regression (-4.4pp), where Opus 4.7 trails both GPT-5.5 and Gemini 3.1 Pro.
### Reasoning & Knowledge
| Benchmark | Claude Opus 4.7 | Claude Opus 4.6 | GPT-5.5/Pro | Gemini 3.1 Pro |
|---|---|---|---|---|
| GPQA Diamond | 94.2% | 91.3% | 93.6% | **94.3%** |
| HLE (no tools) | **46.9%** | 40.0% | 41.4% | 44.4% |
| HLE (with tools) | 54.7% | 53.3% | **58.7%** (Pro) | 51.4% |
| MMMLU (multilingual) | 91.5% | 91.1% | — | **92.6%** |
| Biology Reasoning | **74.0%** | 30.9% | — | — |
| AA-Omniscience | 26 | 14 | — | **33** |
Opus 4.7 leads HLE without tools (+5.5pp over GPT-5.5) and shows a dramatic 43pp jump in biology reasoning. GPQA Diamond is approaching saturation across all frontier models (93.6–94.4%). GPT-5.4/5.5 Pro leads HLE with tools (58.7%).
### Vision & Multimodal
| Benchmark | Claude Opus 4.7 | Claude Opus 4.6 |
|---|---|---|
| CharXiv (no tools) | **82.1%** | 69.1% |
| CharXiv (with tools) | **91.0%** | 84.7% |
| Max Image Resolution | **3.75 MP** (2,576px) | ~1.15 MP (1,568px) |
The 13pp CharXiv jump without tools is the largest relative improvement in the release. The 3.3× resolution increase enables pixel-accurate coordinate mapping for computer-use agents.
### Long-Context & Retrieval
| Benchmark | Claude Opus 4.7 | Claude Opus 4.6 |
|---|---|---|
| MRCR v2 8-needle (1M) | 32.2% | **78.3%** |
| Context Window | 1M tokens | 1M tokens |
The 46pp collapse on long-context multi-needle retrieval is the most significant regression. Anthropic acknowledges this and points to GraphWalks as a better signal for applied long-context reasoning, where Opus 4.7 shows improvement.
### Safety & Alignment
| Metric | Claude Opus 4.7 | Claude Opus 4.6 | Mythos Preview |
|---|---|---|---|
| Misaligned Behavior Score | 2.46 | 2.76 | **1.78** |
| Hallucination Rate | 36% | 61% | — |
Hallucination rate dropped 25 percentage points, driven largely by more frequent abstention on uncertain questions (attempt rate: 70% vs 82%).
상세 비교
## Head-to-Head Comparisons
### Claude Opus 4.7 vs GPT-5.5
| Dimension | Claude Opus 4.7 | GPT-5.5 |
|---|---|---|
| Release Date | April 16, 2026 | April 23, 2026 |
| Context Window | 1M / 128K output | 1M / 128K output |
| Input Price | $5/1M (flat to 200K, $10/1M above) | $5/1M (flat rate all sizes) |
| Output Price | $25/1M (flat to 200K, $37.50 above) | $30/1M (flat rate all sizes) |
| TTFT | ~0.5s | ~3s (GPT-5.4 baseline) |
| Throughput | ~42 tps | ~50 tps |
| SWE-bench Pro | **64.3%** | 58.6% |
| Terminal-Bench 2.0 | 69.4% | **82.7%** |
| BrowseComp | 79.3% | **84.4%** |
| MCP-Atlas | **77.3%** | 75.3% |
| GPQA Diamond | **94.2%** | 93.6% |
| HLE (no tools) | **46.9%** | 41.4% |
| Vision Resolution | **3.75 MP** | ~1.15 MP |
| Reasoning Controls | low/med/high/xhigh/max | xhigh effort tier |
**Summary:** Opus 4.7 wins on 6/10 shared benchmarks; GPT-5.5 wins on 4. Opus 4.7 dominates reasoning-heavy and review-grade tasks; GPT-5.5 excels at long-running tool-use and shell-driven workflows. Opus 4.7 has a significant TTFT advantage (~0.5s vs ~3s) making it better for interactive surfaces. GPT-5.5 uses fewer tokens per completed task on autonomous loops. Opus 4.7 has flat pricing above 200K tokens costing 2× more; GPT-5.5 keeps flat pricing.
### Claude Opus 4.7 vs Gemini 3.1 Pro
| Dimension | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|
| Input Price | $5/1M | $2/1M |
| Output Price | $25/1M | $12/1M |
| SWE-bench Pro | **64.3%** | 54.2% |
| SWE-bench Verified | **87.6%** | 80.6% |
| MCP-Atlas | **77.3%** | 73.9% |
| GPQA Diamond | 94.2% | **94.3%** |
| BrowseComp | 79.3% | **85.9%** |
| MMMLU | 91.5% | **92.6%** |
| Intelligence Index | 57 | 57 (tied) |
| AA-Omniscience | 26 | **33** |
**Summary:** Opus 4.7 leads on coding and tool use; Gemini 3.1 Pro leads on web research, multilingual Q&A, and hallucination reduction (AA-Omniscience 33 vs 26). Gemini is 2.5× cheaper on input and 2× cheaper on output, making it the better cost-per-task option for text-heavy workloads. Both share the 1M context window.
### Claude Opus 4.7 vs Claude Opus 4.6
| Dimension | Claude Opus 4.7 | Claude Opus 4.6 |
|---|---|---|
| SWE-bench Pro | **64.3%** | 53.4% |
| MCP-Atlas | **77.3%** | 75.8% |
| CharXiv (no tools) | **82.1%** | 69.1% |
| BrowseComp | 79.3% | **83.7%** |
| MRCR v2 (1M context) | 32.2% | **78.3%** |
| Hallucination Rate | **36%** | 61% |
| Tokenizer | +1.0–1.35× tokens | baseline |
| Price | $5/$25 (identical) | $5/$25 |
**Summary:** Opus 4.7 is a clear upgrade for coding (+10.9pp SWE-bench Pro), tool use, and vision. However, long-context retrieval regressed severely (-46pp on MRCR v2) and BrowseComp fell 4.4pp. The new tokenizer inflates costs by up to 35% on identical inputs. Teams relying on precise needle-in-a-haystack retrieval from long documents should stay on Opus 4.6.
커뮤니티 평가
Developer and researcher reception is notably mixed, split along workload lines.
**Positive sentiment** dominates among coding-focused teams. Cursor reported Opus 4.7 scoring 70% on CursorBench (vs 58% for Opus 4.6) and called it "a meaningful jump in capabilities." Notion reported 14% higher task success with a third of the tool errors and described it as making their agent "feel like a true teammate." Replit called it "an easy upgrade decision" noting it achieves "the same quality at lower cost." Vercel described it as "phenomenal on one-shot coding tasks" and noted the model "does proofs on systems code before starting work, which is new behavior." XBOW reported a visual acuity jump from 54.5% to 98.5%—described as "a step change"—effectively eliminating their biggest pain point. Hex called it "the strongest model we've evaluated" for resisting dissonant-data traps.
**Critical sentiment** centers on the long-context regression and cost concerns. A Reddit post on r/ClaudeAI titled "Claude Opus 4.7 is a serious regression, not an upgrade" garnered over 2,300 upvotes in 24 hours, driven primarily by the MRCR v2 long-context retrieval collapse. Developers running RAG pipelines and document analysis workflows reported needing to fall back to Opus 4.6. The tokenizer cost increase has caused Pro and Max subscribers to hit rate limits significantly faster, with some reporting 5-hour caps consumed in a fraction of the previous time.
**Behavioral observations** from the community highlight that Opus 4.7 follows instructions much more literally than Opus 4.6, which Anthropic itself flagged in the migration guide. Prompts written for 4.6 that relied on loose interpretation produce different (sometimes less useful) results on 4.7. Some developers praise the "more opinionated perspective" and direct pushback, while others find the model interrupts with unnecessary follow-up questions. The removal of explicit temperature/top_p/top_k controls broke some production integrations that relied on deterministic output settings.
**Adoption pattern:** Major platforms (Cursor, GitHub Copilot, Replit, Vercel) switched their Opus tier to 4.7 at launch. Enterprise customers report strong results for agentic workflows but are running parallel evaluations on the tokenizer cost impact before full migration.
활용 사례
### 1. Agentic Software Engineering
Opus 4.7 excels when given autonomous coding tasks that span multiple files, require planning, and involve iterative debugging. The self-verification capability means it catches its own race conditions and off-by-one errors before reporting. Real-world examples: Cursor reported 70% on CursorBench; Rakuten saw 3× more production task resolution; Notion reported 14% higher success with a third of the tool errors. **Choose Opus 4.7 over alternatives** when the output will be reviewed by a human (e.g., pull requests, code review), the task requires multi-language reasoning (SWE-bench Pro), or the codebase is large enough that tool orchestration matters (MCP-Atlas 77.3%). Choose GPT-5.5 over Opus 4.7 for unattended terminal/DevOps automation (Terminal-Bench 82.7% vs 69.4%).
### 2. Computer-Use and Vision-Heavy Workflows
With 3.75MP vision support (3.3× prior Claude) and 78.0% on OSWorld-Verified, Opus 4.7 is the strongest available model for autonomous GUI interaction—clicking, typing, navigating applications. The vision resolution improvement enables 1:1 coordinate mapping without scale-factor math, critical for screen-based agents. XBOW's penetration testing benchmark jumped from 54.5% to 98.5% on visual acuity. **Choose Opus 4.7** for tasks requiring fine visual detail: reading dense dashboards, extracting data from scanned documents, analyzing technical diagrams, or operating desktop software autonomously. Choose GPT-5.5 for simpler vision tasks where the 3.75MP resolution advantage is unnecessary.
### 3. Multi-Tool Orchestration and Enterprise Agent Workflows
Opus 4.7's 77.3% MCP-Atlas score (best among available models) combined with task budgets makes it ideal for production agent systems that route across multiple tools. Ramp reported "stronger role fidelity, instruction-following, coordination, and complex reasoning." Factory Droids saw 10–15% task success lift with fewer tool errors. The new task budget feature lets developers set a token allowance for an entire agentic loop, preventing runaway costs. **Choose Opus 4.7** when the workflow involves 5+ distinct tool calls, requires self-verification between steps, or spans multiple sessions with file-based memory. Choose Gemini 3.1 Pro for cost-sensitive agent workflows where tool orchestration complexity is lower.
### 4. Financial Analysis and Professional Knowledge Work
Opus 4.7 leads Finance Agent v1.1 at 64.4% (vs GPT-5.5: 60.0%, Gemini: 59.7%) and GDPval-AA at 1,753 Elo. Harvey (legal AI) reported 90.9% on BigLaw Bench with "better reasoning calibration on review tables and noticeably smarter handling of ambiguous document editing tasks." Databricks saw 21% fewer errors on OfficeQA Pro. **Choose Opus 4.7** for structured professional outputs—financial models, legal analysis, enterprise document reasoning—where correctness and self-verification matter more than speed. Choose GPT-5.5 or Gemini 3.1 Pro for higher-volume, lower-stakes knowledge work where cost per task dominates.
최신 뉴스
**April 16, 2026 — Claude Opus 4.7 Launch**
- General availability across Claude API, Amazon Bedrock, Google Vertex AI, Microsoft Foundry
- Same pricing as Opus 4.6: $5/$25 per 1M input/output tokens
- New features: xhigh effort level, task budgets (public beta), 3.75MP vision, adaptive reasoning only
- Extended thinking with explicit budget_tokens fully removed; returns 400 error
- Temperature/top_p/top_k parameters at non-default values return 400 error
- New tokenizer: same input maps to 1.0–1.35× more tokens
**April 2026 — Cybersecurity Safeguards**
- First model with production cybersecurity safeguards from Project Glasswing
- Cyber capabilities deliberately reduced during training compared to Mythos Preview
- Cyber Verification Program launched for legitimate security professionals
- Over-eager malware flagging in Claude Code reported (static HTML/CSS flagged as potential malware)
**April 2026 — Platform Integrations**
- GitHub Copilot Pro+, Business, and Enterprise tiers switched to Opus 4.7
- Claude Code defaults to xhigh effort for all plans
- New /ultrareview slash command in Claude Code (3 free reviews for Pro/Max)
- Auto mode extended to Max plan subscribers
**April 14, 2026 — Enterprise Pricing Restructure**
- Claude Enterprise (150+ users) moved from flat per-seat to usage-based pricing ($20/user/mo base + compute)
- Individual Pro ($20/mo) and Max ($100–200/mo) plans unchanged
- Teams plan (under 150 users) unchanged
**Anthropic Acquired Stainless (SDK and MCP server tooling)**
**KPMG Global Alliance** — Claude integrated into KPMG's Digital Gateway for 276,000+ employees