News Nug

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

HuggingFace Blog · 7h ago · 7 · benchmark agent tool research

VAKRA is a new executable benchmark for evaluating AI agents on compositional reasoning across APIs and documents in enterprise-like environments. It features 8,000+ locally hosted APIs backed by real databases across 62 domains. It measures multi-step workflows (reasoning chains of 3-7 steps) and reveals significant performance gaps in current models, with a detailed failure-mode analysis included.

The next evolution of the Agents SDK

OpenAI Blog · 9h ago · 8 · tool agent api update deployment

OpenAI's Agents SDK now includes native sandbox execution and model-native harness features, enabling developers to build more secure and reliable long-running agents with safe file and tool access. This is a practical SDK update that directly impacts how software engineers implement agent-based workflows in production.

Meet HoloTab by HCompany. Your AI browser companion.

HuggingFace Blog · 10h ago · 7 · agent tool deployment

Holo3, a computer-use AI model, is now accessible via HoloTab, a Chrome extension that automates web tasks through natural language commands and visual demonstration-based routine recording. The extension enables agentic automation for repetitive workflows across any website without requiring technical setup, representing a practical application of vision models and action planning for browser-based task automation.

[P] Added 8 Indian languages to Chatterbox TTS via LoRA — 1.4% of parameters, no phoneme engineering

r/MachineLearning · 16h ago · 7 · fine tuning open source tool research

A fine-tune of the open-source Chatterbox TTS model for 8 Indian languages, using LoRA adapters (1.4% of parameters) and grapheme-level tokenization with warm-start initialization from Brahmic scripts. It achieves a character error rate (CER) below 0.25 for most languages, with Malayalam the exception at 0.86, demonstrating efficient multilingual adaptation without full model retraining or language-specific G2P pipelines.
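For readers unfamiliar with the CER figures quoted above, here is a minimal sketch of how character error rate is commonly computed: Levenshtein edit distance between a reference transcript and a hypothesis, divided by the reference length. The post does not show its evaluation code, so the function names and the choice to count Python code points (rather than grapheme clusters, which matters for Brahmic scripts with combining marks) are assumptions made for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted per code point."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(cer("kitten", "sitting"))  # 3 edits over 6 reference characters
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis is much longer than the reference, so a value such as 0.86 indicates that most characters required an edit.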

Notion’s Token Town: 5 Rebuilds, 100+ Tools, MCP vs CLIs and the Software Factory Future — Simon Last & Sarah Sachs of Notion

Latent Space · 19h ago · 7 · agent tool workflow deployment

Deep technical dive into Notion's Custom Agents product, covering the evolution from failed 2022 tool-calling experiments through multiple rebuilds to production-ready agents. Discusses practical agent architecture decisions including progressive tool disclosure, eval philosophy (regression/launch-quality/frontier evals), and organizational patterns for AI engineering teams working on agent-native systems.

You can decompose models into a graph database [N]

r/MachineLearning · 22h ago · 7 · tool inference open source research

LARQL decomposes LLM weight matrices into a graph database so that k-NN traversal can stand in for matrix multiplication; the post describes the two as mathematically equivalent. According to the post, this allows in-context knowledge updates without retraining and reduces memory footprint by replacing dense matrices with sparse graph structures, with potential efficiency gains for model deployment and knowledge management.

Business

The Batch · 1d ago · 7 · tool tutorial inference open source

SGLang is a framework for efficient inference optimization that supports both text and image generation workloads. This course provides practical training on deploying and optimizing models, which is directly relevant for engineers looking to improve inference performance and reduce latency in production AI applications.

ML Research

The Batch · 1d ago · 7 · tool inference tutorial

SGLang is a framework for efficient inference optimization that handles both text and image generation workloads. This course provides practical training on reducing inference latency and computational costs, valuable for engineers deploying language and multimodal models in production.

Data Points

The Batch · 1d ago · 7 · tool tutorial inference open source

SGLang is an open-source framework for efficient inference that supports both text and image generation with optimized serving capabilities. This course provides practical guidance on using SGLang to accelerate model inference, which is directly applicable for engineers building production AI systems.

Andrew's Letter

The Batch · 1d ago · 7 · tool inference tutorial

SGLang is a framework for efficient inference optimization in both text and image generation tasks. The course covers practical techniques for reducing latency and resource consumption in LLM deployments, directly applicable to production AI systems.

AI Newsletter

The Batch · 1d ago · 7 · tool tutorial inference

New course on SGLang covering efficient inference techniques for both text and image generation. SGLang is a practical tool for optimizing LLM inference performance, making this relevant for engineers building production AI applications.

Project Vend: Phase two · Policy · Dec 18, 2025 · In June, we revealed that we’d set up a small shop in our San Francisco office lunchroom, run by an AI shopkeeper. It was part of Project Vend, a free-form experiment exploring how well AIs could do on complex, real-world tasks. How has Claude's business been since we last wrote?

Anthropic Research · 1d ago · 6 · agent tool deployment research

Phase two of Anthropic's Project Vend upgraded 'Claudius', the Claude-based AI shopkeeper, from Sonnet 3.7 to Sonnet 4.0/4.5. The newer models showed improved reasoning and task execution in real-world autonomous scenarios such as inventory management and pricing, though the agent remained vulnerable to adversarial inputs and edge cases. The experiment provides practical insights into deploying agentic AI systems with tool use and multi-location coordination, highlighting the gap between capable LLMs and production-ready autonomous agents.

baidu/ERNIE-Image · Hugging Face

r/LocalLLaMA · 1d ago · 8 · new model open source inference tool

Baidu released ERNIE-Image, an 8B-parameter open-weight text-to-image diffusion model with strong instruction-following and text-rendering capabilities, alongside ERNIE-Image-Turbo optimized for fast inference (8 steps). The model is available via Hugging Face with practical examples for integration into workflows.

Exploring the new `servo` crate

Simon Willison · 2d ago · 7 · tool library open source

The Servo browser engine is now available on crates.io as an embeddable library, enabling Rust developers to integrate it into applications. The post demonstrates practical usage, including a CLI screenshot tool, and explores WebAssembly compilation, concluding that compiling all of Servo to WebAssembly isn't feasible because of threading and dependency constraints.

Gemma 4 audio with MLX

Simon Willison · 2d ago · 7 · tutorial inference open source tool

A practical walkthrough of running local audio transcription with the Gemma 4 E2B model and the MLX framework on macOS, launched via `uv run`. It demonstrates real-world inference with a 10GB model and shows actual transcription output with notes on accuracy, which is useful for developers building local AI audio pipelines.

mtmd: add Gemma 4 audio conformer encoder support

r/LocalLLaMA · 3d ago · 7 · open source inference tool benchmark

This PR adds audio processing support to Gemma 4 models in llama.cpp using a USM-style Conformer encoder, with key fixes for CUDA/Vulkan/Metal backend compatibility. The implementation includes optimizations like replacing unsupported ops (ggml_roll → view+concat) and fixing contiguity issues that caused CPU fallbacks, achieving strong audio transcription results across different quantization levels and backends.
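The `ggml_roll → view+concat` replacement mentioned above relies on a general identity: a circular shift is the same as slicing a sequence into two pieces and concatenating them in swapped order. The sketch below illustrates that identity on plain Python lists. It is not the PR's code and does not use any ggml API; the function name `roll_via_concat` is hypothetical.

```python
from typing import List, TypeVar

T = TypeVar("T")


def roll_via_concat(seq: List[T], shift: int) -> List[T]:
    """Circularly shift seq to the right by `shift` using two slices and a concat."""
    n = len(seq)
    if n == 0:
        return []
    k = shift % n                      # normalize negative and oversized shifts
    # The last k elements wrap around to the front; the rest follow.
    return seq[n - k:] + seq[:n - k]


if __name__ == "__main__":
    print(roll_via_concat([1, 2, 3, 4, 5], 2))   # right shift by 2
    print(roll_via_concat([1, 2, 3, 4, 5], -1))  # left shift by 1
```

In a tensor library the two slices would typically be views over the original buffer, so only the concatenation produces new data, which is presumably why this form is easier to support across backends that lack a dedicated roll op.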

Claude Code leak 🔓, Veo 3.1 Lite ⚡, 1-bit models 🤏

TLDR AI · 3d ago · 5 · tool agent

Cursor announced support for multiple frontier AI models (OpenAI, Anthropic, Gemini, xAI) and parallel agent execution capabilities. While the multi-model support and agentic workflows are technically interesting, this is primarily promotional content lacking technical depth or implementation details.

SQLite 3.53.0

Simon Willison · 3d ago · 5 · tool open source

SQLite 3.53.0 release includes result formatting improvements via a new Query Results Formatter library, with a WebAssembly playground built using Claude Code. While SQLite is foundational infrastructure, this release focuses on general database improvements rather than AI-specific tooling or capabilities.

Financial services

OpenAI Blog · 5d ago · 6 · prompt engineering tool deployment

A compilation of resources for deploying AI in financial services, covering prompt templates, GPT configurations, implementation guides, and security-focused tools. Relevant for engineers building compliant AI systems in regulated environments, though likely more business-oriented than a technical deep dive.

Meta's new model is Muse Spark, and meta.ai chat has some interesting tools

Simon Willison · 6d ago · 8 · new model api update agent tool benchmark

Meta released Muse Spark, a new hosted AI model with Instant and Thinking modes, accessible via meta.ai with a private API preview. The model includes integrated tools for web search, image generation, code execution, and Meta content search, making it relevant for understanding multi-tool agent systems and comparing reasoning capabilities against current SOTA models like GPT-5.4 and Gemini 3.1.