Gemini 3.1 Flash TTS, Google's latest text-to-speech model, introduces granular audio tags for precise vocal control across 70+ languages with improved naturalness (Elo score of 1,211 on benchmarks). Developers can now embed natural-language commands directly in text to control style, pacing, and delivery. All audio is watermarked with SynthID, and the model is available in Google AI Studio, Vertex AI, and Google Vids.
An engineer implemented GRPO (reinforcement learning) fine-tuning for summarization on a 3-node MLX cluster, combining length penalties with quality rewards (ROUGE-L) to reach rollouts averaging ~64 tokens. The work demonstrates practical techniques for controlling output length while maintaining quality, using a multi-axis LLM-as-a-Judge evaluation (faithfulness, coverage, conciseness, clarity); next steps focus on isolating the reward function's impact and detecting reward gaming.
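The post's exact reward shape isn't reproduced here; a minimal sketch of one plausible way to blend a length penalty with a ROUGE-L quality reward, where `target_len` and the mixing weight `alpha` are illustrative values rather than the author's actual settings, could look like:

```python
def combined_reward(rouge_l: float, n_tokens: int,
                    target_len: int = 64, alpha: float = 0.5) -> float:
    """Blend a quality reward (ROUGE-L, in [0, 1]) with a length penalty.

    The penalty is 1.0 at the target length and decays linearly to 0.0
    as the rollout drifts a full target_len away. target_len and alpha
    are hypothetical, not the settings from the original experiment.
    """
    length_score = max(0.0, 1.0 - abs(n_tokens - target_len) / target_len)
    return (1.0 - alpha) * rouge_l + alpha * length_score
```

Under this shape, a rollout exactly at the target length keeps its full quality reward, while a rollout twice as long loses the entire length component, which is one way to steer GRPO toward ~64-token outputs.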
LARQL introduces a novel approach to decomposing LLM weight matrices into graph databases, enabling k-NN traversal as a mathematically equivalent alternative to matrix multiplication. This enables in-context knowledge updates without retraining and reduces memory footprint by replacing dense matrices with sparse graph structures, offering practical efficiency gains for model deployment and knowledge management.
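LARQL's actual decomposition isn't shown in this summary, but the core equivalence it relies on (a dense matrix-vector product re-expressed as accumulation along sparse graph edges) can be sketched as follows; the dict-of-edge-lists layout and the pruning threshold are assumptions for illustration, not the paper's data model:

```python
import numpy as np

def matrix_to_graph(W: np.ndarray, eps: float = 1e-9):
    """Re-express a dense weight matrix W (out_dim x in_dim) as a sparse
    edge structure: graph[i] lists (j, w) edges from input neuron i to
    output neuron j, dropping near-zero weights."""
    out_dim, in_dim = W.shape
    graph = {}
    for i in range(in_dim):
        edges = [(j, W[j, i]) for j in range(out_dim) if abs(W[j, i]) > eps]
        if edges:
            graph[i] = edges
    return graph, out_dim

def graph_forward(graph: dict, out_dim: int, x: np.ndarray) -> np.ndarray:
    """Equivalent to y = W @ x, computed by traversing stored edges
    instead of multiplying a dense matrix."""
    y = np.zeros(out_dim)
    for i, edges in graph.items():
        for j, w in edges:
            y[j] += w * x[i]
    return y
```

Knowledge updates then become edge insertions or deletions on the graph rather than a retraining pass, and sparsity in W translates directly into fewer stored edges.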
SGLang is an open-source framework for efficient inference that supports both text and image generation workloads with optimized serving capabilities. This new course provides practical training on deploying and optimizing models with SGLang to reduce latency and computational cost, directly applicable for engineers building production AI systems.
Claude Opus 4.6 launches with major improvements for AI engineers: a 1M-token context window in beta, enhanced agentic task capabilities, state-of-the-art coding performance on Terminal-Bench 2.0, and new developer features including adaptive thinking, context compaction, and effort controls for managing the cost/intelligence tradeoff. It is available immediately via the API at the same pricing ($5/$25 per million tokens), alongside new product integrations such as Claude Code agent teams and PowerPoint support.
Claude Sonnet 4.6 is now available with significantly improved coding, reasoning, and computer-use capabilities (including 1M token context window in beta), matching or exceeding Opus 4.5 performance while maintaining Sonnet's pricing. The model shows major improvements in consistency, instruction following, and real-world task automation—particularly for computer vision/interaction tasks across legacy software without APIs.
Anthropic's research describes Constitutional Classifiers, a defense mechanism against universal jailbreaks that uses input/output filtering trained on synthetic data. The system achieved robustness against thousands of hours of red teaming with minimal performance degradation (0.38% increase in refusal rates) and moderate compute overhead, demonstrating practical scalability for deploying safer LLMs.
Baidu released ERNIE-Image, an 8B-parameter open-weight text-to-image diffusion model with strong instruction-following and text-rendering capabilities, alongside ERNIE-Image-Turbo optimized for fast inference (8 steps). The model is available via Hugging Face with practical examples for integration into workflows.
Comprehensive benchmark comparing six LLMs on subtitle translation across six languages using reference-free quality-estimation metrics (MetricX-24 and COMETKiwi). A custom combined score reveals model-metric affinity bias as well as critical failures, such as TranslateGemma's inability to distinguish Simplified from Traditional Chinese despite high metric scores. The evaluation highlights practical limitations of current QE metrics and real-world deployment risks of relying solely on automated scoring.
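The article's exact combined-score formula isn't reproduced in this summary; one plausible sketch that puts both metrics on a shared higher-is-better scale (the 0–25 MetricX error range and the equal weighting are assumptions for illustration) would be:

```python
def combined_score(metricx: float, cometkiwi: float, w: float = 0.5) -> float:
    """Fuse MetricX-24 (an error score, lower is better, assumed to lie
    in [0, 25]) with COMETKiwi (a quality score in [0, 1], higher is
    better) into a single [0, 1] quality number.

    The normalization range and weight w are hypothetical choices."""
    metricx_quality = 1.0 - min(max(metricx, 0.0), 25.0) / 25.0
    return w * metricx_quality + (1.0 - w) * cometkiwi
```

Any fusion like this inherits each metric's blind spots, which is exactly the model-metric affinity risk the benchmark calls out: a high combined score can still hide a script-level failure such as Simplified/Traditional Chinese confusion.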
An indie developer trained a 1B-parameter Spiking Neural Network (SNN) from random initialization for language modeling, achieving 93% sparsity and spontaneous cross-lingual emergence, challenging the conventional wisdom that direct SNN training requires ANN conversion or distillation. While still early-stage (loss of 4.4 at 27k steps), this demonstrates a viable pathway for neuromorphic computing and inference efficiency, with code and a checkpoint shared for community feedback.
Practical walkthrough of running local audio transcription using Gemma 4 E2B model with MLX framework on macOS via uv run. Demonstrates real-world inference with a 10GB model and shows actual transcription output with accuracy notes, useful for developers building local AI audio pipelines.
This PR adds audio processing support to Gemma 4 models in llama.cpp using a USM-style Conformer encoder, with key fixes for CUDA/Vulkan/Metal backend compatibility. The implementation includes optimizations like replacing unsupported ops (ggml_roll → view+concat) and fixing contiguity issues that caused CPU fallbacks, achieving strong audio transcription results across different quantization levels and backends.
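The `ggml_roll → view+concat` rewrite rests on a simple identity: a circular shift along an axis equals concatenating the tail view with the head view. A NumPy sketch of that equivalence (not the actual ggml code from the PR):

```python
import numpy as np

def roll_via_concat(x: np.ndarray, shift: int, axis: int = -1) -> np.ndarray:
    """Reproduce np.roll(x, shift, axis) by concatenating two views:
    the last `shift` elements move to the front, mirroring the
    view+concat rewrite used to avoid an unsupported roll op."""
    shift %= x.shape[axis]
    n = x.shape[axis]
    tail = np.take(x, range(n - shift, n), axis=axis)
    head = np.take(x, range(0, n - shift), axis=axis)
    return np.concatenate([tail, head], axis=axis)
```

Expressing the shift this way lets a backend that already supports views and concatenation run the op without a dedicated kernel, at the cost of one extra copy for the concatenation.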
GLM-5.1 reaches top-tier coding performance (#3 on Code Arena), while the 'cheap executor + expensive advisor' pattern emerges as a standard orchestration approach for reducing inference costs. Key implementations include Anthropic's API-level advisor tools, Berkeley's research, and new features in Qwen Code (v0.14.x) with agent engineering primitives like model routing and sub-agent selection.
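The 'cheap executor + expensive advisor' loop can be sketched generically; the callables and the escalation test below are placeholders for illustration, not any vendor's actual API:

```python
from typing import Callable

def run_with_advisor(task: str,
                     executor: Callable[[str], str],
                     advisor: Callable[[str, str], str],
                     needs_help: Callable[[str], bool]) -> str:
    """One round of the cheap-executor/expensive-advisor pattern.

    The cheap model produces a draft; only when a heuristic flags the
    draft as uncertain do we pay for a single call to the expensive
    advisor, then let the cheap model retry with the guidance inlined.
    """
    draft = executor(task)
    if needs_help(draft):
        guidance = advisor(task, draft)
        draft = executor(f"{task}\n\nAdvisor guidance: {guidance}")
    return draft
```

The cost saving comes from the asymmetry: most tasks never trigger the advisor, so the expensive model's price is amortized over only the hard cases.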
Waypoint-1.5 is Overworld's improved real-time video world model now optimized for consumer hardware, running up to 720p/60fps on RTX 3090+ and 360p on broader gaming laptops/Apple Silicon. The model was trained on 100x more data than v1 with more efficient video modeling techniques, prioritizing interactive responsiveness and local deployment over pure visual fidelity.
Practical guide to multimodal embedding and reranker models that extend traditional RAG pipelines to handle text, images, and other modalities in a shared embedding space. Covers model loading, encoding mixed-modality inputs, and computing cross-modal similarities with concrete code examples and performance considerations.
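The guide's own code isn't reproduced in this summary; once a multimodal encoder has produced text and image vectors in the shared space, the cross-modal similarity step reduces to normalized dot products. A minimal sketch over made-up embeddings (the encoder itself is out of scope here):

```python
import numpy as np

def cross_modal_similarity(text_embs: np.ndarray,
                           image_embs: np.ndarray) -> np.ndarray:
    """Cosine-similarity matrix: rows index texts, columns index images.

    Assumes both inputs already come from a shared multimodal embedding
    space, shape [n_items, dim]; which model produced them is irrelevant
    to this step."""
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return t @ v.T
```

A reranker would then rescore only the top rows of this matrix with a heavier cross-encoder pass, which is the usual two-stage tradeoff between recall and precision.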
Gemma 4 is gaining traction as a practical edge-inference model with strong on-device performance (40 tok/s on iPhone 17 Pro via MLX), achieving 2M downloads in its first week and becoming the top trending model on Hugging Face. The release demonstrates mature ecosystem support across llama.cpp, Ollama, vLLM, and other deployment tools, positioning it as a reference point for local-first development and reducing reliance on paid cloud APIs.