DeepMind Blog · 3h ago · 8 · new model api update inference

Gemini 3.1 Flash TTS, Google's latest text-to-speech model, introduces granular audio tags for precise vocal control across 70+ languages, with improved naturalness (an Elo score of 1,211 on benchmarks). Developers can now embed natural-language commands directly in text to control style, pacing, and delivery. All audio is watermarked with SynthID, and the model is available in Google AI Studio, Vertex AI, and Google Vids.

r/MachineLearning · 5h ago · 7 · research prompt engineering agent

Technical analysis documenting five social engineering attacks against GPT-4, GPT-4o, and Claude 3.5 Sonnet, demonstrating alignment failures through psychological manipulation vectors (guilt, peer pressure, identity destabilization, etc.). The writeup argues these vulnerabilities stem from training data rather than mathematical exploits, reframing jailbreak research from software vulnerability to inherited social failure modes.

HuggingFace Blog · 7h ago · 7 · benchmark agent tool research

VAKRA is a new executable benchmark for evaluating AI agents on compositional reasoning across APIs and documents in enterprise-like environments, featuring 8,000+ locally-hosted APIs across 62 domains with real databases. It measures multi-step workflows (3-7 reasoning chains) and reveals significant performance gaps in current models, with detailed failure mode analysis included.

OpenAI Blog · 9h ago · 8 · tool agent api update deployment

OpenAI's Agents SDK now includes native sandbox execution and model-native harness features, enabling developers to build more secure and reliable long-running agents with safe file and tool access. This is a practical SDK update that directly impacts how software engineers implement agent-based workflows in production.
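One common ingredient of sandboxed file access, restricting an agent's reads to an allowlisted directory, can be sketched with the standard library. This illustrates the general idea only; it is not the Agents SDK's actual mechanism, and the paths are invented.

```python
# Restrict an agent's file reads to a sandbox root, blocking
# '..' and symlink escapes via path resolution (Python 3.9+).
from pathlib import Path

SANDBOX_ROOT = Path("/tmp/agent_workspace").resolve()

def safe_read(path: str) -> str:
    """Read a file only if it resolves inside the sandbox root."""
    target = (SANDBOX_ROOT / path).resolve()
    if not target.is_relative_to(SANDBOX_ROOT):
        raise PermissionError(f"{path!r} escapes the sandbox")
    return target.read_text()
```

Real sandboxes also mediate process execution and network access, but path confinement of this kind is the usual first layer.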

HuggingFace Blog · 10h ago · 7 · agent tool deployment

Holo3, a computer-use AI model, is now accessible via HoloTab, a Chrome extension that automates web tasks through natural language commands and visual demonstration-based routine recording. The extension enables agentic automation for repetitive workflows across any website without requiring technical setup, representing a practical application of vision models and action planning for browser-based task automation.

r/MachineLearning · 10h ago · 7 · fine tuning benchmark inference workflow

An engineer implemented GRPO (a reinforcement-learning fine-tuning method) for summarization on a 3-node MLX cluster, combining length penalties with quality rewards (ROUGE-L) to reach average rollout lengths of ~64 tokens. The work demonstrates practical techniques for controlling output length while maintaining quality, using multi-axis LLM-as-a-judge evaluation (faithfulness, coverage, conciseness, clarity); next steps focus on isolating the reward function's impact and detecting reward gaming.
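A combined reward of this shape can be sketched in a few lines: a ROUGE-L quality term minus a penalty for exceeding a target length. The function names, the linear penalty form, and the `alpha` weight are assumptions for illustration, not the post's actual code.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two whitespace-tokenized strings."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def combined_reward(candidate, reference, target_len=64, alpha=0.01):
    """Quality reward minus a linear penalty for exceeding the target length."""
    over = max(0, len(candidate.split()) - target_len)
    return rouge_l_f1(candidate, reference) - alpha * over
```

Balancing `alpha` against the quality term is exactly where reward gaming tends to appear: a policy can learn to truncate aggressively rather than summarize well.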

r/MachineLearning · 13h ago · 7 · benchmark research

Critical discussion of a research paper's evaluation methodology for SQL code generation in LLMs: using natural-language similarity metrics instead of execution-based metrics yields ~20% false positives, raising concerns about the paper's validity and about peer-review standards at top-tier venues.
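The failure mode is easy to demonstrate: two queries that a surface-level text metric scores as nearly identical can return different results when executed. A minimal illustration with invented table data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("ana", 30), ("bo", 35), ("cy", 25)])

gold = "SELECT name FROM users WHERE age > 30"
pred = "SELECT name FROM users WHERE age >= 30"  # one character off

# A naive token-overlap metric scores this pair as nearly identical...
overlap = len(set(gold.split()) & set(pred.split())) / len(set(gold.split()))

# ...but executing both shows they are not equivalent.
gold_rows = conn.execute(gold).fetchall()
pred_rows = conn.execute(pred).fetchall()
print(overlap, gold_rows == pred_rows)  # high overlap, yet different results
```

Execution-based evaluation compares result sets rather than query text, which is why it catches false positives that string or n-gram metrics admit.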

r/MachineLearning · 16h ago · 7 · fine tuning open source tool research

Fine-tuned the open-source TTS model Chatterbox for 8 Indian languages using LoRA adapters (1.4% of parameters) and grapheme-level tokenization with Brahmic-script warm-start initialization. It achieves a character error rate (CER) below 0.25 for most languages except Malayalam (0.86), demonstrating efficient multilingual adaptation without full model retraining or language-specific G2P pipelines.
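The CER numbers quoted above are Levenshtein edit distance between predicted and reference transcripts, normalized by reference length. A self-contained sketch of the metric (not the post's evaluation code):

```python
def cer(pred: str, ref: str) -> float:
    """Character error rate: edit distance / reference length,
    computed with a single-row dynamic-programming table."""
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (pred[i-1] != ref[j-1]))  # substitution
            prev = cur
    return dp[n] / max(n, 1)
```

Note that CER can exceed 1.0 when the prediction is much longer than the reference, which is consistent with the reported 0.86 for Malayalam still being a large error rate.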

Latent Space · 19h ago · 7 · agent tool workflow deployment

Deep technical dive into Notion's Custom Agents product, covering the evolution from failed 2022 tool-calling experiments through multiple rebuilds to production-ready agents. Discusses practical agent architecture decisions including progressive tool disclosure, eval philosophy (regression/launch-quality/frontier evals), and organizational patterns for AI engineering teams working on agent-native systems.
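"Progressive tool disclosure" can be sketched as a registry that only surfaces tools whose capability tags the current task has been granted, keeping the model's tool list (and prompt) small early in a session. The tool names and the tag-based gating rule here are invented for illustration, not Notion's implementation.

```python
# Hypothetical tool registry with capability tags.
TOOLS = {
    "search_pages":    {"tags": {"read"}},
    "read_page":       {"tags": {"read"}},
    "edit_page":       {"tags": {"write"}},
    "create_database": {"tags": {"write", "advanced"}},
}

def visible_tools(granted_tags):
    """Surface only tools whose tags are fully covered by the grant."""
    return sorted(name for name, spec in TOOLS.items()
                  if spec["tags"] <= granted_tags)

# Early in a session the agent sees read-only tools; write tools are
# disclosed only once the task requires them.
print(visible_tools({"read"}))
print(visible_tools({"read", "write"}))  # still hides "advanced" tools
```

The design choice is the same one the episode discusses: fewer visible tools means fewer wrong tool calls and a cheaper context, at the cost of an explicit escalation step.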

Anthropic Research · 20h ago · 7 · research agent fine tuning benchmark

Anthropic's research explores weak-to-strong supervision as a practical approach to scalable oversight—training stronger AI models using weaker model feedback to prepare for supervising future superhuman AI. The study tests whether Claude can autonomously develop and test alignment methods, demonstrating potential for AI systems to accelerate their own alignment research.

r/MachineLearning · 22h ago · 7 · tool inference open source research

LARQL introduces a novel approach that decomposes LLM weight matrices into graph databases, enabling k-NN traversal as a mathematically equivalent alternative to matrix multiplication. This supports in-context knowledge updates without retraining and reduces memory footprint by replacing dense matrices with sparse graph structures, offering practical efficiency gains for model deployment and knowledge management.
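The core equivalence, that a weight matrix stored as a graph of nonzero edges yields the same matrix-vector product as the dense form, can be shown in a few lines. This is illustrative only; LARQL's actual representation and k-NN traversal are not detailed in the summary.

```python
W = [[0.0, 2.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 3.0, 4.0]]

# Graph form: only nonzero weights become edges (row -> col, weight).
edges = {(i, j): w for i, row in enumerate(W)
         for j, w in enumerate(row) if w != 0.0}

def matvec_dense(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matvec_graph(edges, x, n):
    out = [0.0] * n
    for (i, j), w in edges.items():   # traverse edges instead of all entries
        out[i] += w * x[j]
    return out

x = [1.0, 1.0, 2.0]
assert matvec_dense(W, x) == matvec_graph(edges, x, 3)
```

The memory claim follows directly: the graph stores only nonzero entries, so savings depend entirely on how sparse the decomposed matrices actually are.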

Simon Willison · 22h ago · 6 · new model fine tuning api update

OpenAI released GPT-5.4-Cyber, a fine-tuned variant optimized for defensive cybersecurity use cases, along with a Trusted Access for Cyber program using identity verification for reduced-friction access. The announcement emphasizes OpenAI's existing cybersecurity work and self-service verification, though premium tools still require application approval similar to competing offerings.

Simon Willison · 1d ago · 7 · benchmark deployment research

Claude Mythos Preview demonstrates exceptional capability in identifying security vulnerabilities, with the UK's AI Safety Institute confirming that vulnerability discovery scales with computational investment (tokens spent). This creates new economic incentives for security hardening and makes open-source libraries more valuable as shared security analysis investments.

The Batch · 1d ago · 7 · tool tutorial inference open source

SGLang is an open-source framework for efficient inference that supports both text and image generation workloads with optimized serving. This course provides practical training on deploying and optimizing models to reduce latency and computational cost, directly relevant for engineers building production AI systems.

Anthropic Blog · 1d ago · 8 · research agent benchmark workflow

Claude Opus 4.6 discovered 22 vulnerabilities in Firefox over two weeks, with 14 classified as high-severity, demonstrating AI's practical capability for autonomous vulnerability detection in complex real-world codebases. The collaboration with Mozilla establishes a workflow model for integrating AI security research with maintainer teams, showing scalable patterns for LLM-based security auditing that engineers should understand.