Blog

Benchmark

ClawBench: Measuring Real-World LLM Agent Performance in Enterprise Scenarios

ClawBench is a novel benchmark designed to evaluate LLM agents on realistic enterprise tasks, moving beyond simple Q&A formats. It simulates complex business workflows in sandbox environments to better predict real-world performance and address gaps in current evaluation methods, focusing on scenarios like office collaboration and software engineering.

Open Source

New Model Appearing on DeepSeek's Official Site? 1-Million-Token Input and May 2025 Knowledge Cutoff Confirmed

The models on the DeepSeek official site have been updated and now support long-context inputs of up to 1 million tokens, with a knowledge cutoff of May 2025. This is very likely a completely new next-generation model, distinct from the previous V3.2.

Open Source

xAI Releases "Grok 4.2 Beta"! Performance Boost via Integrated Approach Using Four Expert Models

xAI has released the latest "Grok 4.2 Beta." A new approach coordinating four expert models has improved logical reasoning and coding capabilities, with a limited number of trials available for free users.

AI Agents

Moonshot AI Releases "Kimi Claw": A Resident AI Assistant Operating 24/7 in the Cloud

Moonshot AI has launched the beta version of "Kimi Claw," an AI assistant that runs 24 hours a day in the cloud. It comes with 40 GB of storage and lets users run advanced autonomous agents without setting up their own servers.

Benchmark

Beyond Token Counts: How the AA-LCR Benchmark Reveals True Long-Context Reasoning in LLMs

The new AA-LCR benchmark evaluates LLMs' real-world long-context reasoning beyond simple token counts. It assesses three key capabilities (information retrieval, integration, and complex reasoning) over inputs averaging 100K tokens, and shows that longer context windows don't always translate to better performance.

Anthropic

Anthropic Launches Claude Opus 4.7 with Major Coding and Vision Upgrades, Plus Novel Cybersecurity Features

Anthropic has released Claude Opus 4.7, featuring substantial improvements in coding and vision capabilities alongside a pioneering cybersecurity protection mechanism. This update positions it as a stable, safe flagship model for enterprise use, distinct from the experimental Mythos Preview, emphasizing practicality and safety over raw performance.

Anthropic

Should AI Stop Outputting Markdown? An Anthropic Engineer Makes the Case for HTML

An Anthropic engineer argues that Markdown, the default output format for nearly all LLMs, is fundamentally limited for AI-generated content. He makes the case for HTML as a superior alternative that preserves spatial layouts, enables interactivity, and leverages the browser's native rendering capabilities. While cost and latency remain practical barriers, the deeper question may be whether either format is the final answer, and what an AI-native output standard should look like.

Benchmark

Introducing ARC-AGI-3: The First Interactive Benchmark for Assessing AI's True Reasoning Ability

ARC-AGI-3 is an interactive benchmark that evaluates AI's true reasoning by testing its ability to induce rules from abstract grid examples, avoiding data contamination common in traditional benchmarks. It assesses iterative problem-solving, measuring how AI can hypothesize, verify, and correct its approach like a human agent.

Anthropic

Anthropic's Claude Mythos: Unveiling Advanced Security Capabilities and the Project Glasswing Defense Initiative

Anthropic has unveiled Claude Mythos, a next-gen AI model with exceptional security capabilities, including the ability to discover zero-day vulnerabilities in robust systems like OpenBSD. To manage its power responsibly, they've launched Project Glasswing, a defensive framework for using AI in cybersecurity to mitigate risks and proactively enhance protection.

AI Agents

MetaGPT Rebrands to "Atoms": From Vibe Coding to "Vibe Business." New Strategy to Complete Everything from Idea to Deployment and Payment via AI

MetaGPT has rebranded to "Atoms," shifting its focus from simple code generation to building entire businesses. A team of AI agents handles everything from market research and backend development to Stripe payment integration, turning ideas into monetizable products as directly as possible.