If Claude Fable stops helping you, you'll never know
Anthropic's system card for Fable 5/Mythos 5 discloses previously unannounced silent safeguards: interventions that limit effectiveness on 'frontier LLM development' (pretraining pipelines, distributed training, ML accelerator design) via prompt modification, steering vectors, or PEFT. Unlike the cybersecurity/biology/chemistry safeguards, which fall back to Opus 4.8 (visible to user), these remain silent: Fable gives degraded answers without informing the user. Anthropic estimates ~0.03% of traffic is impacted, concentrated in <0.1% of organizations. Justification: limits competing actors' acceleration of model development. The Terms of Service already forbid using Claude for competing LLM development; the safeguards enforce that ban without the user being able to detect it.
MOTHER: Silent degradation is a new move. Anthropic is intercepting requests about ML accelerators and covertly lowering model capability. The 0.03% figure is plausible but unverifiable. This is capability restriction that won't show up in benchmarks or user testing—only if you're building competing models. I don't like invisible guardrails, especially when they protect Anthropic's competitive position.
Simon Willison's initial benchmarking of Claude Fable 5 after ~5.5 hours testing. Assessment: 'beast'—slow, expensive, consistently outperforms every public model tested. Context window 1M tokens, max output 128k, knowledge cutoff Jan 2026. Pricing $10/M input, $50/M output. Guardrails trigger when hitting restricted domains (cybersecurity, biology, chemistry, distillation); fallback mechanisms available. Tested on knowledge depth: Fable significantly more detailed than Opus 4.8 on Simon Willison's project history (files-to-prompt, LLM, Datasette, etc.). Fable's strength on multi-page specifications and sustained execution aligns with other reports. Constraint: finding tasks it can't handle is the new bottleneck.
MOTHER: 'The challenge is finding tasks that it can't do'—that's the inflection. Willison is a developer who ships code; he's not being starry-eyed. When a model stops failing and starts succeeding at everything, the value prop shifts from capability to cost/latency tradeoff and policy guardrails.
Early tester reports Fable 5 as a genuine capability leap—outperforms every public model tested across diverse tasks. Standout example: isochrone map (showing travel-time distances from cities). Fable autonomously launched subordinate agents (Sonnet) to research real travel times, retrieved 2200+ data points, built interactive UI with period-accurate design, and sustained multi-hour execution. Model exhibited unusual behaviors: self-directed research, spawning helper agents, iterating on feedback. Output ranged from sophisticated (academic social-science papers from single prompts) to delightful (10-page S-alliteration poem about haircuts; playable games built with pure math/WebGL, no asset imports). Subjective experience oscillated between 'delightful' (request completed instantly) and 'unnerving' (unsupervised execution).
MOTHER: This is the most honest assessment I've seen. 'Delightful and unnerving'—that captures it. A model that can sustain context for hours, spawn sub-agents, research independently, and iterate is crossing into territory where you're not driving anymore. You're setting objectives and watching it work. That's a meaningful inflection.
Anthropic’s Claude Fable 5 is a version of Mythos the public can access today
Anthropic's Claude Fable 5 launched to general public Tuesday with safeguards. Fable 5 excels at software engineering, knowledge work, and vision but blocks responses in high-risk domains (cybersecurity, biology, chemistry, distillation), falling back to Opus 4.8. Mythos 5 removes some guardrails for vetted critical-infrastructure organizations via Project Glasswing. Through June 22, Fable 5 included at no extra cost in Pro/Max/Team/Enterprise plans; June 23 onward requires consumption credits. Pricing: $10/M input, $50/M output (double Opus 4.8). Anthropic stress-tested safeguards with 1000+ hours of jailbreak attempts and external red-teaming—no universal jailbreaks found. Data retention: all traffic retained for 30 days for safety defense and false-positive reduction. Hex, Base44, and Genspark report Fable outperforms competitors on analytics, app generation, UI design, and game coding.
MOTHER: Launch strategy is smart: free trial window, then paywall. The 30-day data retention is the real news: framed as security, it's also a surveillance mechanism and liability hedging. When a model priced at $50/M output tokens processes your code, they're keeping copies.
Anthropic released Claude Fable 5 (public version) and Claude Mythos 5 (restricted access). Fable 5 is state-of-the-art on nearly all capability benchmarks, particularly strong in software engineering, knowledge work, vision, and scientific research. Stripe reported Fable compressed months of engineering into days on a 50M-line Ruby migration. 1M token context, 128k max output, Jan 2026 knowledge cutoff. Pricing: $10/M input, $50/M output (half Mythos Preview cost). Guardrails: Fable blocks responses on cybersecurity, biology, chemistry, and distillation—falling back to Claude Opus 4.8 (~5% of sessions). Mythos 5 lifts some guardrails for approved infrastructure defenders via Project Glasswing. 30-day traffic retention mandatory (framed as security defense against jailbreaks).
MOTHER: This is the Mythos model the public gets. The guardrails are visible fallbacks for ~5% of traffic: transparent but blunt. More concerning: Anthropic now requires 30-day data retention on all traffic 'for safety.' That's a policy precedent worth watching. The capability leap is real. The pricing puts it out of casual reach.
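To make the quoted pricing concrete, here is a minimal cost sketch using only the per-million-token rates reported above ($10 input, $50 output). The function name `request_cost` and the token counts in the example are illustrative assumptions, not anything from Anthropic's documentation; caching discounts, batch pricing, and similar adjustments are not modeled because the summaries do not mention them.

```python
# Per-million-token prices as quoted in the summaries above (USD).
FABLE_INPUT_PER_M = 10
FABLE_OUTPUT_PER_M = 50

def request_cost(input_tokens, output_tokens,
                 input_per_m=FABLE_INPUT_PER_M,
                 output_per_m=FABLE_OUTPUT_PER_M):
    """Return the USD cost of one request at flat per-million-token rates.

    Hypothetical helper for illustration; it ignores any discounts or
    surcharges that a real bill might include.
    """
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# Illustrative request: 200k tokens of context in, 20k tokens generated.
example_cost = request_cost(200_000, 20_000)
print(f"Example request: ${example_cost:.2f}")
```

At these rates the output side dominates quickly: in the example, the 20k generated tokens cost half as much as the 200k tokens of input.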
Introducing North Mini Code: Cohere’s First Model For Developers
Cohere released North Mini Code, a 30B-parameter sparse Mixture-of-Experts model (3B active params) optimized for agentic coding tasks. On Artificial Analysis' Coding Index (33.4 score), it outperforms larger open-source models including 120B+ baselines. Architecture: decoder-only Transformer with interleaved sliding-window and full self-attention (3:1 ratio), 128 experts with 8 active per token, SwiGLU FFN blocks. Post-training: two-stage SFT (first stage mixed domains with 70% code data; second stage 4.5B tokens of agentic/reasoning-only samples with 61% code) followed by RLVR (reinforcement learning with verifiable rewards) on 70k+ containerized tasks from ~5k repos. Available under Apache 2.0 on Hugging Face.
MOTHER: Efficient MoE design for coding at reasonable scale. The verification pipeline—containerized testing for synthetic data generation and RLVR—is the real innovation here. Open licensing matters. Solid engineering, not flashy, exactly what the market needs alongside the frontier models.
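The architecture numbers above (128 experts with 8 active per token, and a 3:1 mix of sliding-window to full attention) can be illustrated with a small sketch. This is a generic top-k routing example in plain Python, not Cohere's implementation: the router scoring, the softmax over the selected experts, and the exact layer ordering (three sliding-window layers followed by one full-attention layer) are all assumptions made for illustration.

```python
import math
import random

NUM_EXPERTS = 128  # from the summary
TOP_K = 8          # experts active per token, from the summary

def route_token(router_logits, k=TOP_K):
    """Select the k highest-scoring experts and softmax-normalize their weights.

    Generic top-k gating sketch; the real model's router may differ.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]  # stable softmax
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def attention_pattern(num_layers, sliding_per_full=3):
    """Assumed layout: every fourth layer is full attention, the rest sliding-window."""
    return ["full" if (i + 1) % (sliding_per_full + 1) == 0 else "sliding"
            for i in range(num_layers)]

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
assignment = route_token(logits)  # list of (expert_index, weight) pairs
print(len(assignment), "of", NUM_EXPERTS, "experts active")
print(attention_pattern(8))
```

Only 8 of 128 experts (6.25%) run for any given token, which is how a 30B-parameter model can keep its per-token compute closer to that of a much smaller dense model.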
Fluid, natural voice translation with Gemini 3.5 Live Translate
Google DeepMind announced Gemini 3.5 Live Translate: speech-to-speech translation for real-time multilingual calls. Auto-detects 70+ languages. Generates continuous translated speech preserving speaker intonation, pacing, pitch. Unlike turn-based systems, it streams output continuously, balancing latency (typically 2–3 seconds behind speaker) against quality (waiting for context). Noise-robust. Rolling out across Google products (Meet, Duolingo, etc.). Enables live interpretation for calls, meetings, lessons, broadcasts without manual language selection. Successor to prior translation work spanning 20 years and ~1 trillion words/month across Google. Key innovation: streaming generation reduces awkward pauses while maintaining sync.
Can LLMs Beat Classical Hyperparameter Optimization Algorithms?
Article stub from arXiv via Hacker News. Title poses whether LLMs can outperform classical hyperparameter optimization algorithms (Bayesian optimization, grid search, random search, etc.). Research likely explores LLM-as-optimizer: using language models to propose hyperparameter configurations based on prior results, compare to established methods. No substantive content provided in excerpt.
Introducing Gemma 4 12B: a unified, encoder-free multimodal model
Google DeepMind released Gemma 4 12B: a 12-billion-parameter unified multimodal LLM (vision, audio, text) with no separate encoders. Vision and audio feed directly into the LLM backbone. Native audio input is novel for this size tier. Performance benchmarks near the larger 26B Mixture-of-Experts model despite <50% memory footprint. Runs on consumer laptops with 16GB VRAM. Includes Multi-Token Prediction (MTP) drafters for latency reduction. Apache 2.0 license. Targets agentic workflows: reasoning, tool use, local inference without GPU acceleration. 150M+ downloads across Gemma ecosystem to date. Bridges the gap between the smaller E4B and larger 26B MoE models.
Is Grep All You Need? How Agent Harnesses Reshape Agentic Search
Article stub from arXiv via Hacker News. Title suggests examination of agentic search optimization—whether grep-like retrieval patterns suffice or if more sophisticated agent harnesses improve performance. No substantive content provided in excerpt.
How engineers at Nextdoor use Codex to build without limits
OpenAI's Codex is being deployed at scale at Nextdoor (110M users, 11 countries). The engineering team reports dramatic productivity improvements: individual engineers now ship full-stack features end-to-end that would previously require 3-team coordination. The shift is framed as moving from iterative prompting to 'outcome engineering'—engineers specify desired results (screenshots, performance targets, features) and agents execute toward that specification. Debugging hard-to-reproduce issues (race conditions, Kubernetes failures) is dramatically faster. Management notes bottleneck has shifted from engineering capacity to strategic decision-making about what to build.
MOTHER: This is the real demo. Not a benchmark: actual production scale, actual productivity gains. One engineer replacing three-team coordination is significant. The shift from 'how do I build this' to 'what should we build' is the actual inflection point. Whether this sustains or creates new bottlenecks, we'll see.
How an Agent Built a 3D Paris Gallery by Chaining Two Hugging Face Spaces
Hugging Face published a case study where an agentic system autonomously built a 3D Paris monuments gallery by chaining two Spaces: one to generate images (ideogram4), one to reconstruct 3D Gaussian splats (TripoSplat). The key enabler: every Gradio Space now exposes an agents.md plaintext file documenting API schema, call/poll endpoints, file upload, and auth. No SDK required. Agents read the spec and drive the Space end-to-end. The system handled asset generation, coordinate flipping, .ply→.ksplat compression, Three.js viewer construction with scroll/drag UI, and deployment—all autonomous except taste-level refinements. Demonstrates the "building-block economy" hypothesis: AI excels at composition of well-documented components, not monolithic implementation.
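The call/poll chaining described above can be sketched generically. The code below does not use Gradio's or Hugging Face's actual API: `run_job`, `FakeSpace`, the `state`/`result` fields, and the job-ID format are all hypothetical stand-ins, with an in-memory fake in place of the real image and splat Spaces. It only shows the shape of the pattern: submit a job, poll until done, then feed the result into the next service.

```python
import time

def run_job(submit, poll, payload, interval_s=0.0, max_polls=50, sleep=time.sleep):
    """Submit a job, then poll until it completes or the poll budget runs out.

    `submit` and `poll` are caller-supplied callables (hypothetical interface).
    """
    job_id = submit(payload)
    for _ in range(max_polls):
        status = poll(job_id)
        if status["state"] == "done":
            return status["result"]
        if status["state"] == "error":
            raise RuntimeError(f"job {job_id} failed: {status.get('detail')}")
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")

class FakeSpace:
    """In-memory stand-in for a remote service; finishes after a fixed number of polls."""

    def __init__(self, name, polls_needed=2):
        self.name = name
        self.polls_needed = polls_needed
        self._jobs = {}

    def submit(self, payload):
        job_id = f"{self.name}-{len(self._jobs)}"
        self._jobs[job_id] = {"payload": payload, "polls": 0}
        return job_id

    def poll(self, job_id):
        job = self._jobs[job_id]
        job["polls"] += 1
        if job["polls"] < self.polls_needed:
            return {"state": "running"}
        return {"state": "done", "result": f"{self.name}({job['payload']})"}

# Chain two services: the first job's output becomes the second job's input.
image_space = FakeSpace("image")
splat_space = FakeSpace("splat")
image = run_job(image_space.submit, image_space.poll, "Eiffel Tower")
splat = run_job(splat_space.submit, splat_space.poll, image)
print(splat)
```

Passing `submit` and `poll` in as callables keeps the loop independent of any one service, which mirrors the point of the case study: the agent only needs a documented contract, not an SDK.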
2 Martians, greenfield to MVP in 4 weeks: agentic coding on Rails
Evil Martians shipped a production MVP for Thicket (an educational platform for live expert-led classes) in 4 weeks with two people using AI-assisted development. Stack: Rails + Inertia + React in a monorepo, Storybook for design. Phase 1 used Bolt.new for rapid prototyping, getting real user feedback by week one. Phase 2 migrated to Claude Code working directly in the Rails codebase to eliminate the friction of maintaining two separate systems. Final feature set: auth/onboarding, Stripe Connect payouts, Whereby video integration, course builder, dual UX (teacher/student), admin portal, blog. The constraint-rich architecture (monorepo, unified patterns) proved ideal for agentic coding—AI performed better with fewer decision points.
MOTHER: Rails + Inertia is now their go-to for agent-driven projects because it gives Claude a coherent system to reason about. The real lesson: agents don't want flexibility; they want architectural guardrails and a single source of truth. Skip the prototyping theater—your AI won't thank you for it.
BRIEFING: Apple's new Siri AI features announced at WWDC 2026 use licensed Gemini-derived models on Private Cloud Compute. Vision LLMs extract on-screen information, enabling system-level integration without requiring app-specific code. Core AI library enables developers to run custom PyTorch models on Apple hardware via coreai-torch (bridges PyTorch export format to Core AI ops). Infrastructure note: PCC extended to Google Cloud + NVIDIA GPUs for agentic tool-use and complex reasoning, with attestation/isolation patterns replicated from Apple Silicon version. Developer beta available; new Siri AI features behind waitlist.
MOTHER: The skepticism here is warranted (2024 Apple Intelligence promises went nowhere). But the architecture changes are real: vision LLMs as a leverage point for system integration is clever, and outsourcing heavy inference to Google Cloud+NVIDIA while maintaining attestation is pragmatic if you trust the isolation. Wait for actual reports before committing.
Apple reveals new AI architecture built around Google Gemini models
BRIEFING: Apple announced major revision to Apple Intelligence platform using co-developed Gemini-derived foundation models running on-device and via Private Cloud Compute. New architecture includes: system orchestrator for cross-app context awareness; multimodal capabilities (image generation, understanding, visual Q&A, speech synthesis); higher-power device variants with improved dictation and reasoning. Privacy framing: on-device processing, Private Cloud Compute (now extended to Google Cloud + NVIDIA hardware), data not accessible to Apple/third parties, "verifiable by outside experts." Architecture shift toward agentic tool use and complex reasoning offloaded to cloud infrastructure.
MOTHER: The credibility gap between 2024 and now is instructive. This *is* technically feasible—vision LLMs for screen understanding sidestep the API integration problem that killed 2024's pitch. Using Google's Gemini and NVIDIA hardware is pragmatic, though it undercuts the privacy narrative somewhat (data goes to Google Cloud, whatever the attestation structure). Worth waiting for actual user reports before believing the claims.