News Nug

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

HuggingFace Blog · 7h ago · 7 · benchmark agent tool research

VAKRA is a new executable benchmark for evaluating AI agents on compositional reasoning across APIs and documents in enterprise-like environments, featuring 8,000+ locally hosted APIs backed by real databases across 62 domains. It measures multi-step workflows (reasoning chains of 3-7 steps), reveals significant performance gaps in current models, and includes a detailed failure-mode analysis.

Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO written from scratch in PyTorch - updates! [P]

r/MachineLearning · 10h ago · 7 · fine tuning benchmark inference workflow

An engineer implemented GRPO (a reinforcement learning method) fine-tuning for summarization on a 3-node MLX cluster, combining length penalties with ROUGE-L quality rewards and reaching rollouts that average roughly 64 tokens. The work demonstrates practical techniques for controlling output length while maintaining quality, using multi-axis LLM-as-a-Judge evaluation (faithfulness, coverage, conciseness, clarity). Next steps focus on isolating the impact of each reward function and detecting reward gaming.
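
As a rough illustration of the reward shaping described above, the sketch below combines a ROUGE-L F1 score with a length penalty and turns a group of rollout rewards into GRPO-style group-relative advantages. The penalty weight, the whitespace tokenization, the default target length (chosen to echo the reported 64-token average), and all function names are assumptions for illustration, not the author's code.

```python
from statistics import mean, pstdev


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (a simplification)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def reward(candidate, reference, target_len=64, length_weight=0.5):
    """Assumed shaping: quality minus a penalty proportional to overshoot."""
    n_tokens = len(candidate.split())
    overshoot = max(0, n_tokens - target_len) / target_len
    return rouge_l_f1(candidate, reference) - length_weight * overshoot


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and spread."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


reference = "the cat sat on the mat"
rollouts = [
    "the cat sat on the mat",
    "a cat sat on a mat today",
    "dogs bark loudly",
]
rewards = [reward(r, reference, target_len=6) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within the group, a rollout is only rewarded for beating its siblings, which is why a poorly weighted length term can dominate the signal.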

Was looking at an ICLR 2025 Oral paper and I am shocked it got oral [D]

r/MachineLearning · 13h ago · 7 · benchmark research

Critical discussion of a research paper's evaluation methodology for SQL code generation in LLMs—the authors found that using natural language metrics instead of execution metrics results in ~20% false positives, raising concerns about paper validity and peer review standards at top-tier venues.
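
The gap between text-based and execution-based metrics can be seen in a minimal sketch: two SQL strings that differ by one character can return different rows, while two that look quite different can return the same rows. The table, the queries, and the use of `difflib` as a stand-in for a text-similarity metric are all assumptions made for illustration; they are not the paper's setup.

```python
import sqlite3
from difflib import SequenceMatcher


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries return the same rows, ignoring row order."""
    pred = sorted(conn.execute(predicted_sql).fetchall())
    gold = sorted(conn.execute(gold_sql).fetchall())
    return pred == gold


def text_similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for a text metric."""
    return SequenceMatcher(None, a, b).ratio()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 50), (2, 100), (3, 150)]
)

gold = "SELECT id FROM orders WHERE amount > 100"
near_miss = "SELECT id FROM orders WHERE amount >= 100"  # looks right, is wrong
rewrite = "SELECT o.id FROM orders AS o WHERE NOT (o.amount <= 100)"  # looks different, is right

for name, query in [("near_miss", near_miss), ("rewrite", rewrite)]:
    print(name, round(text_similarity(gold, query), 2), execution_match(conn, query, gold))
```

A text metric scores the near miss above the correct rewrite, which is the kind of false positive the discussion is concerned with.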

Apr 14, 2026 · Alignment · Automated Alignment Researchers: Using large language models to scale scalable oversight

Anthropic Research · 20h ago · 7 · research agent fine tuning benchmark

Anthropic's research explores weak-to-strong supervision as a practical approach to scalable oversight—training stronger AI models using weaker model feedback to prepare for supervising future superhuman AI. The study tests whether Claude can autonomously develop and test alignment methods, demonstrating potential for AI systems to accelerate their own alignment research.

Cybersecurity Looks Like Proof of Work Now

Simon Willison · 23h ago · 7 · benchmark deployment research

Claude Mythos Preview demonstrates exceptional capability in identifying security vulnerabilities, with the UK's AI Security Institute confirming that vulnerability discovery scales with computational investment (tokens spent). This creates new economic incentives for security hardening and makes open-source libraries more valuable as shared security analysis investments.

Mar 6, 2026 · Policy · Partnering with Mozilla to improve Firefox’s security

Anthropic Blog · 1d ago · 8 · research agent benchmark workflow

Claude Opus 4.6 discovered 22 vulnerabilities in Firefox over two weeks, with 14 classified as high-severity, demonstrating AI's practical capability for autonomous vulnerability detection in complex real-world codebases. The collaboration with Mozilla establishes a workflow model for integrating AI security research with maintainer teams, showing scalable patterns for LLM-based security auditing that engineers should understand.

Announcements · Feb 5, 2026 · Introducing Claude Opus 4.6. We’re upgrading our smartest model. Across agentic coding, computer use, tool use, search, and finance, Opus 4.6 is an industry-leading model, often by a wide margin.

Anthropic Blog · 1d ago · 10 · new model api update agent inference benchmark

Claude Opus 4.6 releases with major improvements for AI engineers: a 1M token context window in beta, enhanced agentic task capabilities, state-of-the-art coding performance on Terminal-Bench 2.0, and new developer features including adaptive thinking, context compaction, and effort controls for managing cost/intelligence tradeoffs. It is available immediately on the API at the same pricing ($5/$25 per million tokens), with new product integrations such as Claude Code agent teams and PowerPoint support.

Societal Impacts

Anthropic Research · 1d ago · 6 · research benchmark

Anthropic's Societal Impacts team shares research on AI values, real-world usage patterns, and safety evaluations including a large-scale study of 81,000 users and analysis of 700,000 Claude interactions. While technically rigorous, this is primarily research and policy-focused rather than directly applicable to daily AI development workflows.

Alignment

Anthropic Research · 1d ago · 7 · research benchmark

Anthropic's alignment research overview covering safety techniques for advanced AI systems, including new empirical findings on alignment faking, reward hacking generalization, and alignment audits. While primarily foundational research rather than immediately actionable tools, it addresses critical challenges in training and evaluating safe AI systems that engineers building with large models should understand.

ClawBench: Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, best model at 33.3% [R]

r/MachineLearning · 1d ago · 8 · benchmark agent open source research

ClawBench is a new benchmark evaluating AI browser agents on 153 real-world tasks across live websites, revealing that even the best models (Claude Sonnet, GLM-5) achieve only 33% success rates. The benchmark provides comprehensive evaluation infrastructure with multi-layer behavioral data collection, request interception for safe testing, and an interactive leaderboard—offering practical insights for building and improving web-capable AI agents.

20M+ Indian legal documents with citation graphs and vector embeddings – potential uses for legal NLP? [D]

r/MachineLearning · 1d ago · 8 · dataset rag research open source benchmark

A software engineer has built a structured dataset of 20M+ Indian court cases with citation graphs, dense and sparse embeddings, and extracted metadata (judges, parties, sections, acts). The resource includes a heuristic plus LLM-based NER extraction pipeline and cross-referenced legislation, and it could serve as a novel evaluation benchmark for legal RAG systems and graph neural networks on low-resource legal-domain data.

We benchmarked TranslateGemma against 5 other LLMs on subtitle translation across 6 languages. At first glance the numbers told a clean story, but then human QA added a chapter. [D]

r/MachineLearning · 1d ago · 7 · benchmark inference research

Comprehensive benchmark comparing six LLMs on subtitle translation across six languages using reference-free quality metrics (MetricX-24 and COMETKiwi), with a custom combined score revealing model-metric affinity bias and critical failures like TranslateGemma's inability to properly distinguish Simplified vs Traditional Chinese despite high metric scores. The evaluation highlights practical limitations of current QE metrics and real-world deployment risks when relying solely on automated scoring.

[AINews] Top Local Models List - April 2026

Latent Space · 1d ago · 6 · open source deployment benchmark

Community survey of popular open-weight models across local deployment use cases, highlighting Qwen 3.5, Gemma 4, DeepSeek V3.2, and others based on actual Reddit recommendations rather than benchmarks. Focuses on practical model selection for engineers building local inference systems, with specific callouts for coding (Qwen3-Coder-Next) and agentic workloads (MiniMax M2.5/M2.7).

"I don't know!": Teaching neural networks to abstain with the HALO-Loss. [R]

r/MachineLearning · 1d ago · 8 · library research open source benchmark deployment

HALO-Loss is an open-source drop-in replacement for Cross-Entropy that uses Euclidean distance instead of dot products to bound model confidence, enabling native out-of-distribution detection without sacrificing base accuracy. The method addresses a fundamental neural network problem where models hallucinate on unfamiliar data by mathematically constraining confidence to finite distances and providing an implicit "abstain class" at the origin of the latent space. Testing shows zero accuracy drop, improved calibration (ECE down to 1.5%), and significantly reduced false positives on far OOD detection compared to standard approaches.
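
To make the idea of distance-based logits with an origin-anchored abstain class more concrete, here is one possible reading of the summary as a minimal sketch. It is not the HALO-Loss implementation: the negative squared distance as a logit, the fixed class prototypes, and every name below are assumptions for illustration.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distance_logits(z, prototypes):
    """Negative squared distance to each class prototype.

    The final entry is the assumed abstain class, anchored at the origin.
    """
    origin = [0.0] * len(z)
    logits = [-squared_distance(z, p) for p in prototypes]
    logits.append(-squared_distance(z, origin))
    return logits


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, prototypes, target):
    """Cross-entropy over the class logits plus the abstain logit."""
    return -math.log(softmax(distance_logits(z, prototypes))[target])


prototypes = [[3.0, 0.0], [0.0, 3.0]]  # hypothetical class centers
near_class_0 = softmax(distance_logits([2.9, 0.1], prototypes))
near_origin = softmax(distance_logits([0.1, 0.1], prototypes))
```

In this toy setup an embedding close to a prototype is assigned to that class, while an embedding that stays near the origin puts most of its probability on the abstain entry. How the actual method handles inputs far from every prototype is not covered by this sketch.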

I scaled a pure Spiking Neural Network (SNN) to 1.088B parameters from scratch. Ran out of budget, but here is what I found [R]

r/MachineLearning · 1d ago · 7 · research open source inference benchmark

An indie developer trained a 1B parameter Spiking Neural Network (SNN) from random initialization for language modeling, achieving 93% sparsity and spontaneous cross-lingual emergence, challenging the conventional wisdom that direct SNN training requires ANN conversion or distillation. While early-stage (4.4 loss, 27k steps), this demonstrates a viable pathway for neuromorphic computing and inference efficiency, with code and checkpoint shared for community feedback.

Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization [R]

r/MachineLearning · 1d ago · 7 · research workflow benchmark

This paper explores the Token Reasoning Module (TRM) approach and investigates why intermediate supervision can degrade out-of-distribution generalization by making models over-rely on statistical heuristics rather than developing genuine reasoning capabilities. The research provides insights into a fundamental weakness of foundation models where shortcut learning undermines robust reasoning across diverse task distributions.

Gemini Robotics-ER 1.6: Powering real-world robotics tasks through enhanced embodied reasoning

DeepMind Blog · 2d ago · 9 · new model api update agent benchmark

Google released Gemini Robotics-ER 1.6, a specialized embodied reasoning model for robotic systems with enhanced spatial understanding, multi-view reasoning, and new instrument-reading capabilities like gauge interpretation. The model is now available via the Gemini API with improvements in pointing, counting, task planning, and success detection—critical for physical agent autonomy.

mtmd: add Gemma 4 audio conformer encoder support

r/LocalLLaMA · 3d ago · 7 · open source inference tool benchmark

This PR adds audio processing support to Gemma 4 models in llama.cpp using a USM-style Conformer encoder, with key fixes for CUDA/Vulkan/Metal backend compatibility. The implementation includes optimizations like replacing unsupported ops (ggml_roll → view+concat) and fixing contiguity issues that caused CPU fallbacks, achieving strong audio transcription results across different quantization levels and backends.
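
The `ggml_roll → view+concat` substitution relies on the fact that a circular shift is two contiguous slices joined in swapped order. The sketch below checks that identity on a plain Python list. It illustrates the idea only and does not reproduce the ggml code from the PR; both function names are hypothetical.

```python
def roll_reference(values, shift):
    """Circular shift: the element at index i moves to (i + shift) % n."""
    n = len(values)
    out = [None] * n
    for i, v in enumerate(values):
        out[(i + shift) % n] = v
    return out


def roll_via_slices(values, shift):
    """Same result built from two slices ("views") and one concatenation."""
    n = len(values)
    if n == 0:
        return []
    k = shift % n  # normalize negative and oversized shifts
    return values[n - k:] + values[:n - k]


data = list(range(4))
for shift in range(-5, 6):
    assert roll_via_slices(data, shift) == roll_reference(data, shift)
```

Expressing the shift this way only needs slicing and concatenation, which is presumably why it is easier to support across backends than a dedicated roll op.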

LLMs learn backwards, and the scaling hypothesis is bounded. [D]

r/MachineLearning · 3d ago · 6 · research benchmark

This essay explores whether LLM capabilities emerge purely from scale (data + compute) or require fundamental algorithmic innovations, tracing the debate from early computer vision work through GPT scaling. While intellectually engaging, it is primarily a philosophical reflection on existing trends rather than an introduction of new technical methods, models, or practical tools for engineers building with AI.

Codex Plugin for Claude Code 💻, Qwen3.5-Omni 🤖, workload harness fit 🧑‍💻

TLDR AI · 3d ago · 6 · workflow benchmark

Survey findings reveal widespread developer distrust of AI-generated code (96%), driven by reliability concerns, and highlight the need for automated verification and deterministic guardrails in AI-assisted development workflows. The report positions AI as "trusted but verified", emphasizing SDLC integration and automated quality gates rather than manual code review.