News Nug

20M+ Indian legal documents with citation graphs and vector embeddings – potential uses for legal NLP? [D]

r/MachineLearning · 1d ago · 8 · dataset rag research open source benchmark

A software engineer has built a structured 20M+ Indian court case dataset with citation graphs, dense/sparse embeddings, and extracted metadata (judges, parties, sections, acts). The resource includes heuristic + LLM-based NER extraction pipeline, cross-referenced legislation, and serves as a novel evaluation benchmark for legal RAG systems and graph neural networks on low-resource legal domain data.

Multimodal Embedding & Reranker Models with Sentence Transformers

HuggingFace Blog · 6d ago · 8 · tutorial rag library inference

Practical guide to multimodal embedding and reranker models that extend traditional RAG pipelines to handle text, images, and other modalities in a shared embedding space. Covers model loading, encoding mixed-modality inputs, and computing cross-modal similarities with concrete code examples and performance considerations.

awesome-opensource-ai — Curated list of the best truly open-source AI projects, models, tools, and infrastructure.

GitHub Trending AI · 22d ago · 7 · tool open source library agent rag deployment

A curated directory of production-ready open-source AI tools and libraries organized by category (core frameworks, models, inference, agents, RAG, training, deployment, benchmarks, safety). Highlights practical CLI tools like PR-Agent, Gemini CLI, LLM, and Repomix that directly integrate AI into developer workflows.