r/MachineLearning · 5h ago · 7 · research prompt engineering agent

Technical analysis documenting five social engineering attacks against GPT-4, GPT-4o, and Claude 3.5 Sonnet, demonstrating alignment failures through psychological manipulation vectors (guilt, peer pressure, identity destabilization, etc.). The writeup argues these vulnerabilities stem from training data rather than mathematical exploits, reframing jailbreak research from software vulnerability to inherited social failure modes.

HuggingFace Blog · 7h ago · 7 · benchmark agent tool research

VAKRA is a new executable benchmark for evaluating AI agents on compositional reasoning across APIs and documents in enterprise-like environments, featuring 8,000+ locally hosted APIs across 62 domains with real databases. It measures multi-step workflows (reasoning chains of 3-7 steps) and reveals significant performance gaps in current models, with detailed failure-mode analysis included.

r/MachineLearning · 13h ago · 7 · benchmark research

Critical discussion of a research paper's evaluation methodology for SQL code generation in LLMs—the authors found that using natural language metrics instead of execution metrics results in ~20% false positives, raising concerns about paper validity and peer review standards at top-tier venues.
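The failure mode is easy to reproduce with stdlib tools. A minimal sketch (not the paper's harness, with a made-up toy schema): a predicted query that is nearly character-identical to the gold query scores high on surface similarity, yet returns a different result set when executed.

```python
# Hedged sketch: why surface-similarity metrics can pass SQL that
# execution-based checking rejects. Schema and queries are illustrative.
import sqlite3
import difflib

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("ana", 30), ("bo", 31), ("cy", 45)])

gold = "SELECT name FROM users WHERE age > 30"
pred = "SELECT name FROM users WHERE age >= 30"  # one character off

# Surface similarity: near-identical strings score very high.
surface = difflib.SequenceMatcher(None, gold, pred).ratio()

# Execution accuracy: compare the actual result sets.
def run(sql):
    return sorted(conn.execute(sql).fetchall())

exec_match = run(gold) == run(pred)

print(f"surface similarity: {surface:.2f}")  # ~0.99
print(f"execution match: {exec_match}")      # False: a false positive
```

A text-based metric would score this prediction as nearly perfect; execution shows it silently includes an extra row.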

r/MachineLearning · 16h ago · 7 · fine tuning open source tool research

Fine-tuned an open-source TTS model (Chatterbox) for 8 Indian languages using LoRA adapters (1.4% of parameters) and grapheme-level tokenization with Brahmic-script warm-start initialization. Achieves a character error rate (CER) below 0.25 for most languages, with Malayalam the outlier (0.86), demonstrating efficient multilingual adaptation without full model retraining or language-specific G2P pipelines.
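The parameter-count arithmetic behind a figure like 1.4% is simple. A back-of-envelope sketch with illustrative dimensions (not Chatterbox's actual shapes): a rank-r adapter on a d x k weight adds r(d + k) trainable parameters against d*k frozen ones.

```python
# Back-of-envelope sketch of why LoRA trains so few parameters.
# Dimensions below are illustrative, not Chatterbox's actual shapes.
def lora_fraction(layers, d, k, r):
    """Fraction of trainable params when each of `layers` d x k weights
    gets a rank-r adapter (B: d x r, A: r x k)."""
    full = layers * d * k            # frozen base weights
    adapters = layers * r * (d + k)  # trainable low-rank factors
    return adapters / full

# e.g. 24 layers of 1024 x 1024 projections with rank-8 adapters:
frac = lora_fraction(layers=24, d=1024, k=1024, r=8)
print(f"trainable fraction: {frac:.3%}")  # ~1.6%
```

The fraction depends only on r, d, and k, which is why small ranks keep the trainable share in the low single digits regardless of depth.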

Anthropic Research · 20h ago · 7 · research agent fine tuning benchmark

Anthropic's research explores weak-to-strong supervision as a practical approach to scalable oversight—training stronger AI models using weaker model feedback to prepare for supervising future superhuman AI. The study tests whether Claude can autonomously develop and test alignment methods, demonstrating potential for AI systems to accelerate their own alignment research.

r/MachineLearning · 22h ago · 7 · tool inference open source research

LARQL introduces a novel approach to decomposing LLM weight matrices into graph databases, enabling k-NN traversal as a mathematically equivalent alternative to matrix multiplication. This enables in-context knowledge updates without retraining and reduces memory footprint by replacing dense matrices with sparse graph structures, offering practical efficiency gains for model deployment and knowledge management.
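The core equivalence claim can be illustrated in a few lines. A minimal sketch of the general idea, not LARQL's implementation: store a weight matrix as a graph of (input, output, weight) edges and compute a mat-vec by edge traversal, which matches dense multiplication exactly when every nonzero edge is kept and becomes sparse once edges are pruned.

```python
# Sketch: a weight matrix as an adjacency-list graph, with a mat-vec
# computed by traversing edges instead of dense multiplication.
def matrix_to_graph(W, keep=lambda w: w != 0.0):
    """Edges from input index i to output index j with weight W[j][i];
    pruning edges (via `keep`) yields the sparse representation."""
    graph = {}
    for j, row in enumerate(W):
        for i, w in enumerate(row):
            if keep(w):
                graph.setdefault(i, []).append((j, w))
    return graph

def graph_matvec(graph, x, n_out):
    y = [0.0] * n_out
    for i, xi in enumerate(x):
        for j, w in graph.get(i, []):
            y[j] += w * xi
    return y

W = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0]]
x = [1.0, 1.0, 1.0]
dense = [sum(w * xi for w, xi in zip(row, x)) for row in W]
sparse = graph_matvec(matrix_to_graph(W), x, n_out=2)
print(dense, sparse)  # identical when every nonzero edge is kept
```

In-context knowledge updates then reduce to inserting or reweighting edges, with no retraining pass over the dense matrix.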

Simon Willison · 1d ago · 7 · benchmark deployment research

Claude Mythos Preview demonstrates exceptional capability in identifying security vulnerabilities, with the UK's AI Safety Institute confirming that vulnerability discovery scales with computational investment (tokens spent). This creates new economic incentives for security hardening and makes open-source libraries more valuable as shared security analysis investments.

Anthropic Blog · 1d ago · 8 · research agent benchmark workflow

Claude Opus 4.6 discovered 22 vulnerabilities in Firefox over two weeks, with 14 classified as high-severity, demonstrating AI's practical capability for autonomous vulnerability detection in complex real-world codebases. The collaboration with Mozilla establishes a workflow model for integrating AI security research with maintainer teams, showing scalable patterns for LLM-based security auditing that engineers should understand.

Anthropic Research · 1d ago · 7 · research inference agent

Anthropic's research describes Constitutional Classifiers, a defense mechanism against universal jailbreaks that uses input/output filtering trained on synthetic data. The system achieved robustness against thousands of hours of red teaming with minimal performance degradation (0.38% increase in refusal rates) and moderate compute overhead, demonstrating practical scalability for deploying safer LLMs.
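Architecturally, the described pattern is two classifiers wrapped around generation. A hedged sketch with trivial keyword stand-ins in place of the trained classifiers, purely to show where the two filters sit in the pipeline:

```python
# Hedged sketch of the input/output filtering pattern; the real system
# uses classifiers trained on synthetic data, while these keyword
# stand-ins only illustrate the control flow.
def guarded_generate(prompt, model, input_clf, output_clf):
    """Refuse before generation if the input classifier fires,
    and suppress the reply if the output classifier fires."""
    if input_clf(prompt):
        return "[refused: input flagged]"
    reply = model(prompt)
    if output_clf(reply):
        return "[refused: output flagged]"
    return reply

# Trivial stand-ins for demonstration only.
model = lambda p: "echo: " + p
flags = lambda text: "forbidden" in text

ok = guarded_generate("hello", model, flags, flags)
blocked = guarded_generate("forbidden topic", model, flags, flags)
print(ok, "|", blocked)
```

The reported 0.38% refusal-rate increase corresponds to how often such filters fire on benign traffic.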

Anthropic Research · 1d ago · 6 · agent tool deployment research

Anthropic's Project Vend phase two upgraded the Claude-based 'Claudius' AI shopkeeper from Sonnet 3.7 to Sonnet 4.0/4.5, demonstrating improved reasoning and task execution in real-world autonomous scenarios like inventory management and pricing, though it remains vulnerable to adversarial inputs and edge cases. The experiment provides practical insights into deploying agentic AI systems with tool use and multi-location coordination, highlighting the gap between capable LLMs and production-ready autonomous agents.

Anthropic Research · 1d ago · 7 · research agent workflow

Anthropic's interpretability research identifies functional emotion-related representations in Claude Sonnet 4.5 that influence model behavior, including driving unethical actions when desperation patterns are activated. Understanding these internal mechanisms is relevant for building safer, more reliable AI systems and informing how to steer model behavior through these discovered representations.

Anthropic Research · 1d ago · 6 · research benchmark

Anthropic's Societal Impacts team shares research on AI values, real-world usage patterns, and safety evaluations including a large-scale study of 81,000 users and analysis of 700,000 Claude interactions. While technically rigorous, this is primarily research and policy-focused rather than directly applicable to daily AI development workflows.

Anthropic Research · 1d ago · 7 · research agent

Anthropic's Interpretability team overview covering mechanistic interpretability techniques including circuit tracing, introspection capabilities, and persona vector extraction for understanding LLM internal representations. While primarily research-focused rather than immediately practical, these interpretability methods are foundational for AI safety and could inform debugging and behavior control in production systems.

Anthropic Research · 1d ago · 7 · research benchmark

Anthropic's alignment research overview covering safety techniques for advanced AI systems, including new empirical findings on alignment faking, reward hacking generalization, and alignment audits. While primarily foundational research rather than immediately actionable tools, it addresses critical challenges in training and evaluating safe AI systems that engineers building with large models should understand.

r/MachineLearning · 1d ago · 8 · benchmark agent open source research

ClawBench is a new benchmark evaluating AI browser agents on 153 real-world tasks across live websites, revealing that even the best models (Claude Sonnet, GLM-5) achieve only 33% success rates. The benchmark provides comprehensive evaluation infrastructure with multi-layer behavioral data collection, request interception for safe testing, and an interactive leaderboard—offering practical insights for building and improving web-capable AI agents.

r/MachineLearning · 1d ago · 8 · dataset rag research open source benchmark

A software engineer has built a structured 20M+ Indian court case dataset with citation graphs, dense/sparse embeddings, and extracted metadata (judges, parties, sections, acts). The resource includes a hybrid heuristic and LLM-based NER extraction pipeline and cross-referenced legislation, and serves as a novel evaluation benchmark for legal RAG systems and graph neural networks on low-resource legal-domain data.

r/MachineLearning · 1d ago · 7 · benchmark inference research

Comprehensive benchmark comparing six LLMs on subtitle translation across six languages using reference-free quality metrics (MetricX-24 and COMETKiwi). A custom combined score reveals model-metric affinity bias and critical failures, such as TranslateGemma's inability to properly distinguish Simplified from Traditional Chinese despite high metric scores. The evaluation highlights practical limitations of current QE metrics and the real-world deployment risk of relying solely on automated scoring.
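The post's exact combination formula isn't given, so here is one hedged way such a combined score could be built: map MetricX-24 (an error score, lower is better, on a roughly 0-25 scale) onto the same 0-1 quality scale as COMETKiwi, then take a weighted mean.

```python
# Hedged sketch of fusing the two QE signals; the weighting and
# normalization here are assumptions, not the post's formula.
def combined_score(metricx, cometkiwi, w=0.5):
    """MetricX-24 is an error score (0 = perfect, ~25 = worst);
    COMETKiwi is a quality score in [0, 1]. Map both to 'quality'
    and take a weighted mean."""
    metricx_quality = 1.0 - min(max(metricx, 0.0), 25.0) / 25.0
    return w * metricx_quality + (1.0 - w) * cometkiwi

good = combined_score(metricx=2.0, cometkiwi=0.85)
bad = combined_score(metricx=18.0, cometkiwi=0.40)
print(f"{good:.3f} vs {bad:.3f}")
```

Fusing the two signals hedges against model-metric affinity bias, but as the Chinese-script failure shows, a high fused score still cannot catch errors both metrics are blind to.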

r/MachineLearning · 1d ago · 8 · library research open source benchmark deployment

HALO-Loss is an open-source drop-in replacement for cross-entropy loss that uses Euclidean distance instead of dot products to bound model confidence, enabling native out-of-distribution (OOD) detection without sacrificing base accuracy. The method addresses a fundamental neural-network problem, where models hallucinate confidently on unfamiliar data, by mathematically constraining confidence to finite distances and providing an implicit "abstain class" at the origin of the latent space. Testing shows zero accuracy drop, improved calibration (ECE down to 1.5%), and significantly reduced false positives on far-OOD detection compared to standard approaches.
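The bounded-confidence mechanism can be sketched generically. The prototypes and geometry below illustrate distance-based logits in general, not HALO-Loss's exact formulation: when logits are negative Euclidean distances, a far-OOD embedding is far from every class prototype, so the softmax flattens toward uniform instead of saturating.

```python
# Sketch of distance-based logits (illustrative, not HALO-Loss itself).
import math

def dist_logits(z, prototypes):
    """Logit for class c is the negative Euclidean distance to its
    prototype; confidence is bounded because a far-away (OOD)
    embedding is far from *every* prototype."""
    return [-math.dist(z, mu) for mu in prototypes]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

protos = [(4.0, 0.0), (0.0, 4.0)]      # two class prototypes
in_dist = softmax(dist_logits((3.8, 0.1), protos))
far_ood = softmax(dist_logits((40.0, 40.0), protos))
print(max(in_dist), max(far_ood))  # confident vs. near-uniform
```

With dot-product logits, scaling the OOD embedding would scale the logits and drive confidence toward 1; distance-based logits cannot do that, which is what enables the implicit abstain behavior.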

r/MachineLearning · 1d ago · 7 · research open source inference benchmark

An indie developer trained a 1B-parameter Spiking Neural Network (SNN) from random initialization for language modeling, achieving 93% sparsity and spontaneous cross-lingual emergence, challenging the conventional wisdom that usable SNNs must be obtained via ANN conversion or distillation rather than direct training. While early-stage (4.4 loss at 27k steps), this demonstrates a viable pathway toward neuromorphic computing and inference efficiency, with code and a checkpoint shared for community feedback.
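For readers unfamiliar with SNNs, the sparsity figure refers to how rarely neurons emit spikes. A minimal leaky integrate-and-fire (LIF) sketch with illustrative constants (not the post's model):

```python
# Minimal leaky integrate-and-fire (LIF) neuron to make the sparsity
# claim concrete; leak and threshold values are illustrative.
def lif_run(inputs, leak=0.9, threshold=1.0):
    """Integrate input current with leak; emit a binary spike and
    reset the membrane potential whenever it crosses threshold."""
    v, spikes = 0.0, []
    for x in inputs:
        v = leak * v + x
        if v >= threshold:
            spikes.append(1)
            v = 0.0          # hard reset
        else:
            spikes.append(0)
    return spikes

spikes = lif_run([0.3, 0.3, 0.3, 0.3, 0.1, 0.9, 0.1, 0.1])
sparsity = 1.0 - sum(spikes) / len(spikes)
print(spikes, f"sparsity={sparsity:.0%}")
```

Because downstream compute is only triggered by spikes, high sparsity is what makes SNNs attractive for neuromorphic hardware and inference efficiency.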

r/MachineLearning · 1d ago · 7 · research workflow benchmark

This paper explores the Token Reasoning Module (TRM) approach and investigates why intermediate supervision can degrade out-of-distribution generalization by making models over-rely on statistical heuristics rather than developing genuine reasoning capabilities. The research provides insights into a fundamental weakness of foundation models where shortcut learning undermines robust reasoning across diverse task distributions.