r/MachineLearning · 1d ago · 7 · research open source inference benchmark

An indie developer trained a 1B-parameter Spiking Neural Network (SNN) from random initialization for language modeling, achieving 93% sparsity and spontaneous cross-lingual emergence, which challenges the conventional wisdom that SNNs must be obtained through ANN conversion or distillation rather than trained directly. Although the run is early-stage (4.4 loss, 27k steps), it demonstrates a viable pathway for neuromorphic computing and inference efficiency, with code and a checkpoint shared for community feedback.
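
For readers unfamiliar with the sparsity figure, here is a minimal sketch of how spike sparsity can be measured, assuming it means the fraction of neuron/timestep slots in which no spike fires. It uses a generic leaky integrate-and-fire (LIF) layer; the threshold, decay, and input distribution are illustrative assumptions and are not taken from the post's model or code.

```python
import numpy as np

def lif_forward(inputs, threshold=1.0, decay=0.9):
    """Run a layer of leaky integrate-and-fire neurons over time.

    inputs: array of shape (timesteps, neurons) holding input current.
    Returns a 0/1 spike array of the same shape.
    threshold and decay are illustrative values, not from the post.
    """
    membrane = np.zeros(inputs.shape[1])
    spikes = np.zeros_like(inputs, dtype=float)
    for t, current in enumerate(inputs):
        membrane = decay * membrane + current        # leak, then integrate
        fired = membrane >= threshold                # neurons crossing threshold
        spikes[t] = fired
        membrane = np.where(fired, 0.0, membrane)    # hard reset after a spike
    return spikes

def sparsity(spikes):
    """Fraction of (timestep, neuron) slots with no spike."""
    return 1.0 - spikes.mean()

# Hypothetical random input currents, only to exercise the functions.
rng = np.random.default_rng(0)
currents = rng.normal(0.0, 0.3, size=(100, 64))
spikes = lif_forward(currents)
print(f"sparsity: {sparsity(spikes):.2%}")
```

Higher sparsity means fewer neurons fire per step, which is where the claimed inference efficiency on event-driven hardware would come from.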

r/MachineLearning · 2d ago · 7 · research workflow benchmark

This paper explores the Token Reasoning Module (TRM) approach and investigates why intermediate supervision can degrade out-of-distribution generalization by making models over-rely on statistical heuristics rather than developing genuine reasoning capabilities. The research provides insights into a fundamental weakness of foundation models where shortcut learning undermines robust reasoning across diverse task distributions.

DeepMind Blog · 2d ago · 9 · new model api update agent benchmark

Google released Gemini Robotics-ER 1.6, a specialized embodied reasoning model for robotic systems with enhanced spatial understanding, multi-view reasoning, and new instrument-reading capabilities like gauge interpretation. The model is now available via the Gemini API with improvements in pointing, counting, task planning, and success detection, all critical for physical agent autonomy.

Simon Willison · 2d ago · 7 · tool library open source

Servo browser engine is now available on crates.io as an embeddable library, enabling Rust developers to integrate it into applications. The post demonstrates practical usage including a CLI screenshot tool and explores WebAssembly compilation possibilities, though full Servo WebAssembly compilation isn't feasible due to threading and dependency constraints.

Simon Willison · 2d ago · 6 · prompt engineering workflow

Bryan Cantrill argues that LLMs lack the optimization pressure that human laziness (finite time) creates, leading to bloated systems and poor abstractions if left unchecked. The piece emphasizes how human constraints force better engineering practices, a useful perspective for AI engineers who rely on LLM-generated code or architectures in production systems.

Simon Willison · 2d ago · 7 · tutorial inference open source tool

A practical walkthrough of running local audio transcription with the Gemma 4 E2B model and the MLX framework on macOS via uv run. It demonstrates real-world inference with a 10GB model and shows actual transcription output with accuracy notes, which is useful for developers building local AI audio pipelines.

r/LocalLLaMA · 3d ago · 7 · open source inference tool benchmark

This PR adds audio processing support to Gemma 4 models in llama.cpp using a USM-style Conformer encoder, with key fixes for CUDA/Vulkan/Metal backend compatibility. The implementation includes optimizations like replacing unsupported ops (ggml_roll → view+concat) and fixing contiguity issues that caused CPU fallbacks, achieving strong audio transcription results across different quantization levels and backends.
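
The roll replacement mentioned above rests on a simple identity: a circular shift equals two slices concatenated in swapped order. The following is a NumPy analogy of that idea, not ggml code; `roll_via_concat` is a hypothetical helper name, and ggml's actual view and concat calls are not reproduced here.

```python
import numpy as np

def roll_via_concat(x, shift, axis=0):
    """Circularly shift x along axis using only slicing and concatenation."""
    n = x.shape[axis]
    shift %= n                      # normalize negative and oversized shifts
    if shift == 0:
        return x.copy()
    index = [slice(None)] * x.ndim
    index[axis] = slice(n - shift, n)
    tail = x[tuple(index)]          # last `shift` entries (a view, no copy)
    index[axis] = slice(0, n - shift)
    head = x[tuple(index)]          # everything before the tail (also a view)
    return np.concatenate([tail, head], axis=axis)

x = np.arange(12).reshape(3, 4)
print(roll_via_concat(x, 1, axis=1))
print(np.array_equal(roll_via_concat(x, 1, axis=1), np.roll(x, 1, axis=1)))
```

Because slicing and concatenation are widely supported primitives, rewriting a roll this way is a plausible route to avoiding a fallback on backends that lack a dedicated roll kernel.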

r/MachineLearning · 3d ago · 6 · research benchmark

This essay explores whether LLM capabilities emerge purely from scale (data + compute) or require fundamental algorithmic innovations, tracing this debate from early computer vision work through GPT scaling. While intellectually engaging, it is primarily a philosophical reflection on existing trends rather than an introduction of new technical methods, models, or practical tools for engineers building with AI.

TLDR AI · 3d ago · 6 · workflow benchmark

A survey reports widespread developer distrust of AI-generated code (96%) and concerns about its reliability, highlighting the need for automated verification and deterministic guardrails in AI-assisted development workflows. The report positions AI as "trusted but verified", emphasizing SDLC integration and automated quality gates rather than manual code review.

TLDR AI · 3d ago · 5 · tool agent

Cursor announced support for multiple frontier AI models (OpenAI, Anthropic, Gemini, xAI) and parallel agent execution capabilities. While the multi-model support and agentic workflows are technically interesting, this is primarily promotional content lacking technical depth or implementation details.

TLDR AI · 3d ago · 6 · benchmark workflow

Benchmark study reveals significant accuracy gaps (25 percentage points) in AI approaches for data integration workflows, with cascading failures across multi-step processes. CData Connect AI demonstrates 98.5% accuracy, highlighting the importance of reliable schema interpretation and filter handling in production AI systems.
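
To see why per-step accuracy gaps cascade across multi-step workflows, here is a small arithmetic sketch. It assumes each step succeeds independently, which is a simplification, and the per-step accuracies used (0.95 and 0.70) are hypothetical illustrations, not figures from the benchmark.

```python
def chain_accuracy(step_accuracy, steps):
    """Probability that every step succeeds, assuming independent steps."""
    return step_accuracy ** steps

# Hypothetical per-step accuracies, chosen only to show the compounding effect.
for steps in (1, 3, 5):
    strong = chain_accuracy(0.95, steps)
    weak = chain_accuracy(0.70, steps)
    print(f"{steps} step(s): strong={strong:.3f} weak={weak:.3f}")
```

Under this assumption, a gap that looks moderate for a single step grows with every step added, because one wrong schema interpretation or filter invalidates everything downstream.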

r/LocalLLaMA · 3d ago · 9 · new model open source agent deployment benchmark

MiniMax-M2.7 is a new open-source model with strong programming and agent capabilities, featuring self-evolving optimization during training and native multi-agent collaboration support. The model demonstrates exceptional performance on code tasks (SWE-Pro 56.22%, Terminal Bench 57.0%) and on system-level reasoning for SRE work, benchmarks competitively against GPT-5.3 and Claude variants, and supports deployment via SGLang, vLLM, and Transformers.

Simon Willison · 4d ago · 5 · tool open source

SQLite 3.53.0 release includes result formatting improvements via a new Query Results Formatter library, with a WebAssembly playground built using Claude Code. While SQLite is foundational infrastructure, this release focuses on general database improvements rather than AI-specific tooling or capabilities.

Latent Space · 4d ago · 7 · new model agent workflow inference

GLM-5.1 reaches top-tier coding performance (#3 on Code Arena), while the 'cheap executor + expensive advisor' pattern emerges as a standard orchestration approach for reducing inference costs. Key implementations include Anthropic's API-level advisor tools, Berkeley's research, and new features in Qwen Code (v0.14.x) with agent engineering primitives like model routing and sub-agent selection.
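
The 'cheap executor + expensive advisor' pattern can be sketched as a small control loop. This is a hypothetical illustration with stub functions standing in for model calls; it does not reflect Anthropic's API, Qwen Code, or any other real implementation, and the confidence-threshold trigger is an assumption about how escalation might be decided.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Draft:
    text: str
    confidence: float  # executor's self-reported confidence in [0, 1] (assumed)

Executor = Callable[[str, Optional[str]], Draft]
Advisor = Callable[[str, str], str]

def run_with_advisor(task: str, executor: Executor, advisor: Advisor,
                     threshold: float = 0.7) -> Tuple[str, bool]:
    """Try the cheap executor first; consult the advisor only when unsure.

    Returns (final_text, escalated).
    """
    draft = executor(task, None)
    if draft.confidence >= threshold:
        return draft.text, False          # cheap path: advisor never called
    advice = advisor(task, draft.text)    # expensive call, made only on demand
    revised = executor(task, advice)      # executor retries with the advice
    return revised.text, True

# Stub "models" for demonstration only.
def cheap_executor(task: str, advice: Optional[str]) -> Draft:
    if advice is not None:
        return Draft(f"{task} (revised using: {advice})", 0.9)
    confidence = 0.4 if "hard" in task else 0.9
    return Draft(f"{task} (first attempt)", confidence)

def expensive_advisor(task: str, draft: str) -> str:
    return "split the problem into smaller steps"

print(run_with_advisor("easy task", cheap_executor, expensive_advisor))
print(run_with_advisor("hard task", cheap_executor, expensive_advisor))
```

The cost saving comes from the advisor being invoked only on the low-confidence fraction of tasks, while the cheap model still produces every final answer.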