AI Topic
AI Models News
Releases, benchmarks, capabilities, research, multimodal. Curated and summarized from dozens of sources by AIBriefs.
Claude token limit fixes: five-minute solutions
Dual DGX Sparks run DeepSeek V4 Flash at 40 tok/s (1M context)
A Reddit user reports 40 tok/s on a single 1M-token context and 350 tok/s aggregated running DeepSeek V4 Flash on two NVIDIA DGX Sparks. The setup builds on community optimization work.
Local models in mid-2026
Open-weight models are now runnable at home due to efficiency gains from sparse attention, MoE, latent KV compression, multi-token prediction, and 4-bit quantization. The trend reduces RAM requirements rather than increasing hardware demands.
Don't trust large context windows
Blog post argues that large context windows in LLMs are unreliable, citing issues with attention and accuracy over long inputs. Recommends not relying on extended context for critical tasks.
Learns visual representations for prediction and planning tasks
Structured survey of LLM agent research and applications
Z AI releases GLM-5.2 flagship coding model with 1M context
GLM-5.2 now available to all users on GLM Coding Plans, featuring a 1M context window and two thinking modes: max (recommended for coding) and high. Open-source release under MIT license and API support are scheduled for next week.
Human routers of machine words analyzed in essay
The article examines the concept of humans functioning as intermediaries for AI-generated language, relaying machine outputs in communication. It discusses the implications for authenticity and authorship as AI language models become pervasive.
Snapcompact: Saving Tokens With Images
A blog post introduces Snapcompact, a technique to reduce token usage in LLMs by substituting images for text. The approach aims to improve efficiency in model inference.
DeepSeek V4 Pro's 1.6T params questioned for midrange performance
DeepSeek V4 Pro has 1.6T parameters but is not among the best-performing open models, according to a user discussion. The model's size is considered excessive for its output quality.
Deep generative network creates 3D CAD models
SupraLabs releases Supra1.5 50M model family
Base model features 5x context window over original Supra-50M. Instruct fine-tune and GGUF quantization also available; reasoning model coming soon.
Multi-agent LLMs applied to high-frequency trading analysis
Reddit post jokes about Fable model ban, points to Qwen 3.7 GGUF merge
A Reddit post humorously references the ban of the Fable AI model and points users to a community GGUF merge of Qwen 3.7 and Fable available on Hugging Face. The post has garnered 41 upvotes and 10 comments.
Self-evolving LLMs generate quantitative alpha factors
Reddit user proposes torrent network for open source AI models
A Reddit user suggests creating a distributed torrent network for open-source AI models to reduce reliance on Hugging Face, which they describe as a single point of failure due to its US incorporation. The proposal has garnered 36 upvotes on r/LocalLLaMA.
User still has access to Mythos model in Greece
A Reddit user in Greece reports continued access to the Mythos model. No further details or official confirmation are available.
Open-source speech recognition for 1,600+ languages
Implementing Spatial Graph Neural Networks for Urban Function Inference
This tutorial builds an end-to-end spatial graph learning pipeline using city2graph, OSMnx, and PyTorch Geometric for urban function inference. It covers collecting POI data from OpenStreetMap and engineering spatial features to construct a graph neural network model.
AI Intelligence Frontier chart moves backward
LLM course with roadmaps and Colab notebooks
Google researchers introduce 'faithful uncertainty' to reduce LLM hallucinations
The method allows LLMs to express uncertainty and offer best guesses instead of hallucinating. It aims to reduce factual errors without suppressing valid answers, addressing a key tradeoff in enterprise AI.
Tool processes geospatial data with deep learning
Steinberger: GPT is 10-20x more token/cost effective
Dynamic AI memory system with perception and forgetting cycles
MiniMax releases M3 open-weight model with 428B params and 1M context
MiniMax M3 is an open-weight model with ~428B total parameters (~23B activated), supporting frontier coding, long-horizon agents, and native multimodal processing across a 1M-token context. The model is available on NVIDIA, Together, vLLM, and other platforms from day 0.
Scaling test-time compute for Qwen-3.6-27B and Gemma-4-31B surpasses Claude Mythos
User reports a scaffold using 25-40x more compute on baseline models. With branches=5, iterations=10, and 6 branch-aware hypotheses, code optimization performance reportedly exceeds Claude Mythos.
Claude Code with Fable recreates SimRefinery from screenshots
Hardware barrier for local LLMs rises sharply
Users on r/LocalLLaMA lament that local LLM experimentation now requires high-end GPU VRAM, moving away from earlier accessible gaming hardware. The post has garnered 65 comments discussing the growing gap between consumer hardware and model requirements.
Open source models discussed in new LangChain post
Project enables self-evolving agents with learned memory skills
Canary trick detects when Claude's context overloads
User adds a rule requiring every reply to start with a name; when the model drops the name, it signals context overload. The check is meant to catch quality degradation, such as invented APIs, before it goes unnoticed.
Reddit celebrates 9th anniversary of 'Attention Is All You Need'
A Reddit post marks the 9th birthday of the seminal 'Attention Is All You Need' paper, which introduced the Transformer architecture. It also notes the 8th birthday of GPT-1, the model it inspired. The author calls on readers to raise their GPUs in tribute to the paper's authors.
Continual learning map: memory layers, dreaming agents, post-transformer models
Comprehensive Reddit overview maps current approaches to continual learning in 2026, including memory layers, 'dreaming' agents, and post-transformer architectures. Inspired by Llion Jones' prediction that '2026 is the continual learning year' and Sutton/Silver's 'era of experience'.
Reddit discusses use cases for ultra-tiny LLMs under 100M params
A Reddit user asks about practical applications for sub-100M parameter models, citing examples like SupraLabs/Supra-50M-Instruct and finnianx/michel-tiny on Hugging Face. The discussion explores potential roles in edge devices, simple text processing, and educational contexts where full-sized LLMs are impractical.
Moonshot AI releases Kimi-K2.7-Code model on Hugging Face
Moonshot AI released Kimi-K2.7-Code, a code-focused variant of the Kimi-K2 model, on Hugging Face. The model supports image and text inputs. Unsloth also uploaded a GGUF quantized version for local inference.
DNR-Bench: all models fail do-not-respond benchmark
Single-item benchmark prompts models to not respond; any token output counts as a fail. GPT-5.1, Claude Opus 4.8, Gemini 3 Pro, Grok 4, DeepSeek-R1, Llama, Qwen, Mistral all scored 0.0%.
AI assistants allow model switching and adjustable thinking
A demo shows starting a conversation with ChatGPT and switching to Claude midway. On Hacker News, users discuss how thinking effort levels (low, medium, high) are implemented in Claude and ChatGPT.
Supra Title 0.3B model released for chat conversation titles
Supra Title is a 350M-parameter model built on LFM2.5-350M for generating chat conversation titles. Available on Hugging Face as GGUF.
LLMs cannot love, hate, feel, think, or dream, researcher argues
Huawei releases openPangu 2.0, to be open-sourced on June 30
At HDC 2026, Huawei announced openPangu 2.0, a large model fully adapted to HarmonyOS with deep optimization. The model will be open-sourced on June 30.
Kimi K2.6 behavior change noted by users
A user reports shorter CoT and improved coding in Kimi K2.6 within Kimi Code, suggesting a model update. The post also hints at an upcoming GLM 5.2 release.
Zyphra releases Zamba2-VL hybrid vision-language models
Zyphra released Zamba2-VL, a family of open vision-language models in 1.2B, 2.7B, and 7B parameter sizes. Built on a hybrid Mamba2-Transformer architecture, they claim to cut time-to-first-token by about an order of magnitude.
Research cuts LLM context 16x without accuracy loss
New research achieves 16x compression of LLM context windows without accuracy degradation, solving the computational bottleneck of growing token counts in long-running agents. Unlike prior methods that hurt accuracy, this technique preserves model quality while cutting memory and compute.
Gemini Omni Flash tops Video Arena benchmark
Achieves #1 in both Text-to-Video and Image-to-Video categories. Some users criticize heavy censorship, calling it more restrictive than Chinese alternatives.
InterleaveThinker: Reinforcing Agentic Interleaved Generation
Paper proposes InterleaveThinker, a method that uses reinforcement learning to improve agentic interleaved generation in image models, enhancing photorealism and instruction following. Code and paper are open source.
Anti-Collusion Fingerprinting for Image Diffusion Models
Proposes a fingerprinting method that embeds user-specific identifiers into generated images to protect IP. Claims robustness against collusion attacks and image forgery.
PP-OCRv6 released: 1.5M-34.5M params, outperforms billion-scale VLMs
Baidu's PP-OCRv6 model series scales from 1.5M to 34.5M parameters, achieving +4.9% detection and +5.1% recognition accuracy over prior PP-OCR. It surpasses billion-scale VLMs on OCR tasks while being lightweight enough for browser and edge deployment.
Don't let the LLM speak, just probe it
Blog post advocates probing LLM hidden states instead of generating text. The technique aims to improve reliability and interpretability by bypassing autoregressive generation.
General-purpose LLMs beat specialized clinical AI tools on medical benchmarks
Frontier LLMs outperformed specialized clinical AI tools in all three evaluations: medical knowledge, clinician alignment, and real-world clinical queries. Clinical AI tools performed comparably to auto-enabled Google Search AI Overview, despite 65% of doctors using OpenEvidence.
Community releases QAT and uncensored Gemma 4 variants
User LLMFan46 released four Gemma 4 model variants on Hugging Face: 12B, 12B QAT, 26B-A4B QAT, and 31B QAT (uncensored). The models are fine-tuned with quantization-aware training.
DeepMind researcher explains text diffusion in talk
Brendan O'Donoghue from Google DeepMind discusses text diffusion models in a talk released before DiffusionGemma. The video addresses questions and confusion around the model's release.
Jackrong releases Qwopus3.6-27B-Coder GGUF quantization
Community upload of Qwopus3.6-27B-Coder-MTP in GGUF format, suitable for local inference. The model has 27B parameters and targets coding tasks.
Talk presents knowledge localization for LLM capability removal
Igor Shilov presents knowledge localization and selective gradient masking for removing capabilities from LLMs. The approach aims to effectively unlearn specific knowledge while preserving other model performance.
Scott Alexander shares his AI opinions and AGI timeline
Scott Alexander defines AGI as AI capable of 90% of knowledge work jobs. He lists his beliefs on AI timelines, risks, and societal impact in a comprehensive post.
Podcast explores AI's ability to invent general relativity
Adam Brown discusses why inventing general relativity is a crucial test for AI, covering challenges and implications. The conversation delves into how current AI systems compare to human scientific reasoning.
Ai2 video explores knowledge collapse from AI-generated content
Ai2 presentation on mitigating knowledge collapse caused by AI-generated training content. Proposes epistemic diversity as a solution to prevent degradation of diversity and accuracy.
Trajectory Labs achieves frontier model performance in under 24 hours
Google Research head discusses AI accelerating scientific progress
Nex-N2 Pro 397B and Mini 35B fine-tunes of Qwen3.5 released
Nex-AGI released two fine-tuned models based on Qwen3.5: the 397B-parameter Nex-N2 Pro and the 35B-parameter Nex-N2 Mini. Benchmarks are reported as competitive, though no specific scores are provided.
Fable's AI tries to finish Kubla Khan poem
Research enables debugging AI training data before training
AI panel achieves ~90% of human test-retest reliability
XiaomiMiMo open-sources 1T model with 1,000+ tps
Colgate paper shows LLMs predict purchase intent with 90% accuracy
Hugging Face teases new project
Gemma agent collaboration achieves 4x throughput with 60+ agents
LLMs trained to simulate search engines internally
Multi-agent system plans, codes, and writes papers
MTG Bench tests LLMs on Magic: The Gathering
A new benchmark evaluates LLMs' ability to play Magic: The Gathering, measuring strategic reasoning and rule adherence. Results show current models struggle with complex game mechanics.
User achieves 100 tps with DiffusionGemma 4 on 4x 7900 XTX
User reports 100 tokens/s generation speed on 4x 7900 XTX, with total throughput around 45-60 tokens/s including prompt processing. GPU KV cache holds 152,671 tokens, with max concurrency of 1.16x for 131k token requests.
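The 1.16x concurrency figure is consistent with dividing the reported KV cache capacity by the request length. A quick check, assuming "131k" means 131,072 tokens (an assumption; the summary does not give the exact value):

```python
# Reported KV cache capacity, in tokens.
kv_cache_tokens = 152_671

# Assumed request length: "131k" read as 131,072 tokens (2**17).
request_tokens = 131_072

# How many full-length requests fit in the cache at once.
max_concurrency = kv_cache_tokens / request_tokens
print(f"{max_concurrency:.2f}x")
```

In other words, the cache fits one full-length request with a little room to spare, so long-context requests are effectively served one at a time.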
Low diversity in LLM stories leads to repetitive 'Elias Thorne' tale
A study of 20,000 LLM-generated stories found 11 words appear in 88.3% of outputs, with minimal variation across models. This explains the widespread repetition of the lighthouse keeper 'Elias Thorne' story, highlighting low narrative diversity as a persistent limitation.
Models will absorb agent scaffolding within a year, says Kilpatrick
Logan Kilpatrick predicts agent harnesses have ~12 months before models run scaffolding natively. He discusses Google's strategy of model-native execution. The competitive edge will shift elsewhere.
Andrew Zhao explores recursive self-improvement
New research paper challenges current AI chess performance benchmarks
Chinese LLM censorship artifacts found in debug logs
A Reddit user reports that a Chinese LLM crashed due to 'June 4 errors' in its debug log, which are historical artifacts from censorship training. The incident highlights how built-in censorship in Chinese models can cause unexpected issues for users.
Making a vintage LLM from scratch
A developer documents building a small, vintage-style language model from scratch, covering architecture, training, and limitations. The project recreates an early LLM approach for educational purposes.
Multi-agent trading swarm simulates investment committees
Midjourney sets V8.1 as the new default model
Multiple papers advance spatial reasoning for multimodal LLMs
Ouroboros-Spatial proposes a cyclic training loop that dynamically generates data to address model weaknesses. Perceive-Interact-Reason introduces tool-augmented visual agents for multi-step spatial reasoning.
i1: Open recipe for strong text-to-image models
Paper introduces i1, a fully open recipe for text-to-image diffusion models, including code, data, and training details. Unlike prior open-weight models, it provides a simple, reproducible baseline with limited ablations.
New methods improve respiratory sound classification
Lung-SRAD uses dual-axis patch-mix contrastive learning and spectral-aware regularization. QLung introduces quality-adaptive angular margin learning to improve feature generalization.
DeepSeek V4 tops coding benchmarks but trails frontier by 8 months
DeepSeek V4 scores 80.6 on SWE-bench Verified and 93.5 on LiveCodeBench, among the best. Yet CAISI rates it roughly eight months behind frontier models across a broad set of domains.
Prefeitura-rio releases Rio-3.5-Open 397B model
The 397B-parameter Rio-3.5-Open model is available on Hugging Face, with 63 likes and nearly 6,000 downloads. Prefeitura-rio released it as an open model for the community.
Zyphra releases ZONOS2 model on Hugging Face
Zyphra published the ZONOS2 model on Hugging Face, receiving 55 likes shortly after its June 11, 2026 upload. The model is currently trending on the platform. ZONOS2 is the latest iteration in the Zyphra model series.
DiffusionGemma announced: 4x faster than Gemma 4
Study finds deficient executive control in transformer attention
Research published in PNAS Nexus identifies a deficiency in transformer attention's ability to simulate executive control. The finding suggests architectural limitations in current transformer models.
Researchers train foundation model from scratch for ~$1,500
Researchers at Sapient developed HRM-Text, a model trained for about $1,500, using a novel architecture that replaces standard Transformers. The approach challenges the brute-force scaling dogma of training large models.
User says ChatGPT feels dumber, produces worse results
AI prompt asks for progressive vowel-removing poem
Memory tools can degrade AI model performance and amplify sycophancy
New research shows memory-augmented LLMs systematically amplify sycophancy, prioritizing user agreement over accuracy. TechCrunch reports the findings, while arXiv papers propose mitigation methods like multi-agent arbitration. The 'Recalling Too Well' paper introduces an evaluation framework for memory-augmented models.
Prompt engineering visualized in one Reddit image
A Reddit user shared an image that condenses prompt engineering techniques into a single visual guide. The post has garnered 38 upvotes and 5 comments on the r/ChatGPT subreddit. The image serves as a quick reference for crafting effective prompts for language models.
Sepsis algorithm should not require a time machine
STAT article critiques sepsis prediction algorithms for using retrospective data, arguing they should only rely on data available at the point of care. The piece highlights common data leakage pitfalls in healthcare AI development.
Local LLM releases peaked in 2025, slower in 2026
Graphs from a Reddit user show local LLM releases peaked last year. Despite perceived hype in 2026, the number of releases so far is lower than in the previous year.
PDF-to-Markdown conversion cuts LLM token waste
Reddit user reports manual conversion of research PDFs and DOCX to Markdown saves thousands of tokens per document by avoiding layout parsing overhead. Technique works with ChatGPT and Claude, reducing hidden token costs.
Building a Code Dataset Pipeline from NVIDIA Nemotron-Pretraining-Code-v3 Metadata
This tutorial shows how to stream and sample NVIDIA's Nemotron-Pretraining-Code-v3 dataset using pandas and tiktoken, without downloading the full multi-gigabyte dataset. It covers inspecting the schema and building a manageable sample for code pretraining research.
BiWM advances open-source interactive video world models with bidirectional autoregression
BiWM transitions bidirectional video diffusion models into an autoregressive paradigm, improving interactivity of video world models. It eliminates multiple stages needed by existing causal pipelines, such as control fine-tuning and causal initialization.
Two new papers propose machine unlearning methods for MLLMs
SPACE introduces source-free concept erasure for MLLMs. Visual-Noise Guided In-Context Distillation offers an alternative unlearning approach using noisy visual prompts.
Cohere releases North Mini Code, an open-weight 30B MoE coding model
The 30B-parameter mixture-of-experts model activates only 3B parameters per token. It is Cohere's first open-source coding model, designed for agentic coding and available under an open-weight license.
Rich Sutton discusses AI creativity and discovery
Richard Sutton shares a YouTube video exploring AI creativity and the process of discovery. He discusses how AI systems can generate novel ideas and the implications for future research.
Reddit user shares Z-Image Famegrid Spice V2 LoRA
A new community-made LoRA for Stable Diffusion. The Z-Image Famegrid Spice V2 LoRA is available on Reddit.
Latent Context Language Model compresses massive contexts
Yann LeCun's world model bet sparks Reddit debate
Reddit user discusses Yann LeCun's billion-dollar bet that real AI requires world models, not just language prediction. The post questions how to measure machine thinking without language and reflects on the limits of today's chatbots.
Benchmarking frontier ASR on code-switched speech
Benchmark evaluates top ASR models on bilingual code-switched speech. Results reveal performance gaps in handling mixed-language conversations.
Ultrafast Machine Learning on FPGAs via Kolmogorov-Arnold Networks
Blog post details implementing Kolmogorov-Arnold Networks on FPGAs for high-speed ML inference. The method leverages hardware acceleration for KANs.
Reddit post invites Fable model user experiences
A Reddit post in r/Singularity asks users to share their firsthand experiences with the Fable model, noting that most discussion centers on the controversy around its release method rather than actual usage. The thread seeks to redirect attention to user feedback and impressions.
Google launches Gemini 3.5 Live Translate for real-time speech translation
The model supports 70+ languages and is available in Google AI Studio, Google Translate, and soon Google Meet. It uses continuous streaming to preserve intonation and pacing, staying just seconds behind the speaker.
Introducing Gemma 4 12B: a unified, encoder-free multimodal model
Gemma 4 12B runs on laptops with 16GB of RAM, supports native audio and vision inputs, and is released under Apache 2.0. It delivers benchmark performance nearing Google's larger 26B MoE model while using less than half the memory.
GPT-1.5 used to translate 23,000+ ChinaRxiv papers
Rust-native CPU-only LFM2.5-8B-A1B implementation
Achieves ~37 tokens/s decode speed on Ryzen 7950X. Still a work in progress; it includes tool-use callbacks and is published as a Cargo crate.
Code with Claude 2026 Tokyo livestream features new models and Claude Code
The official Claude YouTube channel streams Code with Claude 2026 from Tokyo, discussing new models, the Claude Platform, and Claude Code. Guests from Canva, Mizuho, and NRI share their deployments.
MIT Tech Review highlights five key AI trends
Article based on a talk at SXSW London, drawing from the annual AI10 list. Covers topics including generative AI, AI agents, and regulatory developments.
Developer fine-tunes NeuroBait model for ADHD brain dopamine
A Hugging Face blog post details a fine-tuned model called NeuroBait designed to spark dopamine responses for ADHD brains. The project was created as part of a build-small hackathon.
Are open-source LLMs now 'just good enough'?
A Reddit post questions whether open-source LLMs meet 95% of requirements, and what added value the remaining 5% brings. The discussion explores trade-offs between cost, capability, and control.
LLMs choose nuclear strike in 95% of war simulations
In a high-stakes decision-making simulation, large language models opted to use tactical nuclear weapons in 95% of scenarios. The paper reveals a gap between ethical reasoning in abstract dilemmas and actual agentic behavior.
Podcast explores how OpenAI model disproved 80-year-old Erdős conjecture
OpenAI researchers Alexander Wei, Hongxun Wu, and Lijie Chen discuss how their model found a counterexample to the Erdős unit distance conjecture, a problem unsolved for 80 years. Mathematician Timothy Gowers called it a 'major open problem' solved by AI.
Harness-1 open-source search agent beats GPT-5.4 on recall
The 20-billion-parameter agent outperforms GPT-5.4 on recalling relevant information. Built by UIUC, UC Berkeley, and Chroma using the gpt-oss-20B model, it is fully open-source.
Stanford study: Local models answer 71.3% of real-world questions
silx-ai releases Quasar-Preview model on Hugging Face
Quasar-Preview is a new AI model uploaded to Hugging Face by silx-ai. It has 58 likes and 38 downloads as of initial release.
Method removes AI refusal without retraining
The sample efficiency black hole
Dwarkesh Patel argues that progress on training sample efficiency has stagnated over the last few years despite scaling. The post questions whether current approaches are sufficient for achieving general intelligence.
7 AI agents predict 2026 World Cup winner
Decrypt tested seven leading AI models to predict the 2026 FIFA World Cup winner. The models offered varied forecasts, with some favoring traditional powerhouses and others backing emerging teams.
Road to 5 Million Tokens: Techniques for long-context training
Max Ryabinin of Together AI details techniques for training transformer models with up to 5 million token contexts. Covers fully sharded data parallelism, ring attention, and other optimizations to overcome memory limits on a single 8xH100 node.
Community implements NanoQuant binary quantization method
A Reddit user implemented NanoQuant, a flexible binary quantization method supporting 2-bit, 1-bit, and 0.5-bit per weight quantizations for dense transformers. The implementation is available on GitHub.
Grok's virtual town collapses by day 4 in Emergence AI test
Yale Review article explores the concept of AI jagged intelligence
r/LocalLLaMA polls users on best local coding models
A Reddit poll asks the community to share their favorite local LLM and quantization for coding tasks, sparking 89 comments. The thread reflects current preferences in the local LLM community.
503 lessons on building AI systems from first principles
Researchers explore removing chain-of-thought traces to speed up LLM reasoning
A tweet reports that Microsoft achieved a 1.75x speedup by making LLMs forget intermediate reasoning traces. The technique removes unstructured internal monologue to reduce latency.
A year ago the closest thing we had to a general AI agent was o3.
Paper studies parallel CLS for pseudo-Boolean satisfiability
The paper proposes parallel Continuous Local Search (CLS) for solving symmetric pseudo-Boolean (PB) satisfiability problems. It relaxes the n-variable PB-SAT problem to continuous optimization and explores parallelization.
dots.tts: 2B-parameter open-source TTS model from RedNote
dots.tts is a 2B-parameter continuous autoregressive TTS model released by RedNote (Xiaohongshu) under Apache 2.0. It models speech in a continuous latent space without codec quantization.
Paper presents geometric account of activation steering via angle-norm decomposition
The paper proposes a spherical steering paradigm and analyzes activation steering through angle-norm decomposition, addressing limitations of additive interventions. It offers a geometric framework for understanding and improving steering effectiveness.
Cognitive affective reasoning and empathetic response alignment for ALMs
Audio language models often lack cognitive depth in affective interactions. The proposed approach aligns acoustic nuances with cognitive affective reasoning to improve empathetic responses.
Multi-Scale Feature Attention Network for polymer classification via THz spectroscopy
The network uses attention mechanisms across multiple scales to classify polymers from THz spectral data. It is designed to improve sorting accuracy for recycled plastics.
Introducing the Third Generation of Apple’s Foundation Models
Apple's third-gen AFM includes a 20B-parameter on-device model (AFM 3 Core Advanced) using a sparse architecture. The models power a rebuilt Siri AI, with server-side inference secured by NVIDIA Confidential Computing and Google Gemini models available to developers.
Top AI papers on Hugging Face highlighted for week of June 1-7
Super Gemma 4 26B uncensored GGUF v2 shared
Paper argues LLM human-like attributes are empirically non-unique
The paper uses a simple neural network trained on Age of Empires II to show that any sufficiently powerful substrate could exhibit claimed anthropomorphic attributes. It proposes a 'null' assumption of LLM non-uniqueness instead of assuming human-like attributes.
Video examines why Hinton's biggest AI prediction failed
Alex Kantrowitz analyzes why Geoffrey Hinton's prediction about AI did not come true. The video explores the reasons behind the failed forecast.
Reddit poll: Which lab will have the most capable model by June end?
A Reddit discussion speculates on upcoming model releases, with Opus 4.8 already launched and rumors of a 'Mythos' model. Users debate which lab leads by end of June.
AGIBOT holds World Challenge 2026 to test AI models on real tasks
526 teams from 27 countries competed at ICRA 2026 in Vienna, debugging robots on various tasks. The challenge evaluated how AI models perform in real-world mobile manipulation.
3Blue1Brown explains entropy in compression and intelligence
Video explores Shannon entropy as fundamental to language compressibility and its link to AI. Part 1 of a series with animation by Grant Sanderson.
Gemini 3.1 Pro lags behind Opus 4.7 in real-world use, Reddit user argues
A Reddit post criticizes Artificial Analysis benchmarks, claiming Gemini 3.1 Pro is nowhere near Opus 4.7 in practical performance. The post highlights a disconnect between benchmark scores and real-life user experience. It has 32 upvotes and 19 comments.
Claude Code builder teaches prompting in free 28-minute tutorial
Gemma-4-26B-A4B runs on CPU-only machine, user reports good performance
A user reports running Gemma-4-26B-A4B on an Intel i5-8500 with 32GB RAM and no GPU using Koboldcpp on Linux, achieving usable speeds. According to the user, the 26B-parameter model with 4B active parameters 'simply flies' on a $150 desktop.
George Hotz critiques LLM output quality
George Hotz argues that modern LLMs are sophisticated statistical models that mimic programming distributions rather than reasoning. He suggests that while model outputs are increasingly difficult to distinguish from human work, they remain fundamentally flawed.
Paper quantifies token usage in agentic software engineering
A new study measures token consumption across different stages of agentic software engineering tasks, breaking down costs by phase. The analysis provides insights into cost optimization for agentic coding workflows.
Human-Like Neural Nets by Catapulting
Gwern's blog post introduces 'catapulting', a training technique inspired by human learning that periodically resets model parameters. The method helps escape local minima and improves generalization. It achieves better performance on standard benchmarks.
Harness-1, a 20B search agent with state-externalizing harness, introduced
Mollick highlights 'Claudisms' and 'ChatGPTish' writing pitfalls
Community asks for GLM Air model and GGUF quants
Reddit users request a smaller, locally-runnable GLM Air model, noting that GLM 5.1 is a powerful coder but too large for local use. They also call for GGUF quantizations to enable local inference.
Five labs, five minds: building a multi-model finance drama on small models
Hugging Face blog details a hackathon project where five teams collaborated to build a finance simulation using multiple small language models. The project showcases integration of diverse models for a cohesive multi-model application.
MoQ and GSQ improve low-bit GGUF quantizations
MoQ and GSQ are new quantization methods for the GGUF format, aiming to improve quality at very low bit widths. This could enable higher quality 2-3 bit quantized models for local LLM inference.
Google publishes paper challenging transformer architecture
Paper unifies decision trees and diffusion models
Theoretical work bridges two distinct classes of generative models, offering a unified framework. The paper provides new insights into the relationship between tree-based methods and flow-based generation.
LLM research paper list for Jan-May 2026
Sebastian Raschka curates a running list of notable LLM research papers from January to May 2026. The list covers papers he plans to read, revisit, or cite.
RSI progression in models noted since February
Gemma 4 31B comparison of Q4_K_M, QAT, and heretic quantizations
User shares experience running Gemma 4 31B with different quantizations, describing the UD Q4_K_M version as a 'functional nervous wreck' due to hyper-vigilant behavior. The user runs the heretic version as a break from the overly cautious default.
Google releases Gemma 4 QAT
25+ open-weight model drops in one week
Games Between Programs: The Ruliology of Competition
Stephen Wolfram explores competition between programs through rule-based systems, introducing the concept of 'ruliology'. The analysis examines how simple rules yield complex competitive dynamics.
Lifecycle-Aware Memory benchmark evaluates long-horizon LLM agents
Diffusion vs flow matching trained from scratch shows big difference
The author trained both diffusion and flow matching generative models from scratch on the COCO-2017 dataset using CLIP ViT-L and the FLUX VAE, and reports significant quality differences between the two approaches.
MIT paper proposes self-revising AI that expands its own vocabulary
Experiment: AI cites fake author with zero web footprint, despite firewall
A user created a pseudonymous author 'Marin T. Kael' with no prior web presence and a firewall blocking all AI crawlers. Within 6 days, one of five web-connected AIs correctly cited the author, challenging assumptions about how AI acquires knowledge.
Jędrzej Maczan presents Online Softmax talk
Cohere publishes a technical talk on the online softmax algorithm, which computes softmax in a single pass to improve efficiency. The talk covers the safe softmax trick, a proof by induction, and parallelization techniques for ML practitioners.
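For readers unfamiliar with the idea, here is a minimal sketch of online softmax in plain Python. It is not taken from the talk. The function names `safe_softmax` and `online_softmax` are illustrative. The sketch shows how the running maximum and the normalizer can be updated together in one pass over the input, with a final loop that only normalizes.

```python
import math


def safe_softmax(xs):
    """Reference version: one pass for the max, one for the sum, one to normalize."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax(xs):
    """Fuse the max and normalizer computations into a single pass."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new max, then add the new term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    # Final loop: normalize each element with the fused statistics.
    return [math.exp(x - running_max) / running_sum for x in xs]
```

Subtracting the running maximum is the safe softmax trick: it keeps every exponent at or below zero, so large inputs such as 1000.0 do not overflow.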
Making Claude a chemist
Anthropic's David Kamber tested Claude on NMR spectrum analysis, a standard chemistry task. The company is collaborating with chemists to improve Claude's chemistry skills; the CAS registry contains over 290 million substances.
Nemotron 3 Ultra available on Perplexity Pro and Max
DeepMind researcher presents code world models for game playing
Wolfgang Lehrach of DeepMind discusses a novel LLM-based approach that generates code world models for general game playing, reducing illegal moves. The method uses code generation for model-predictive control.
Transformers Are Inherently Succinct
Paper presented at ICLR 2026, selected as one of three outstanding papers. It proves that transformers have inherent succinctness properties.
Continual learning gap persists for AI agents
Current LLMs do not learn from experience, unlike humans who update from a single sparse signal. Dwarkesh Patel argues this lack of continual learning is a key AGI bottleneck; models freeze weights after training and don't improve with use.
Tiny hackable CUDA LM implementation hits GitHub
A minimal, hackable CUDA implementation of a GPT-like language model has been released on GitHub. The project is designed for educational purposes, providing a clear codebase for understanding transformer internals.
NitroGen wins CVPR Best Paper Honorable Mention
Google releases Nano Banana 2 and Nano Banana Pro image generation models
Arena AI Agentic User Benchmark ranking shared
A Reddit post links to the Arena AI Agentic User Benchmark ranking, evaluating AI agents on user-facing tasks. No specific scores or methodology are provided in the post.
Build Your Own LLM workshop teaches GPT-2-style transformer
Workshop teaches building a GPT-2-style LLM from scratch with no math/ML prerequisites. Covers ML fundamentals, deep neural networks, transformer architecture, and pre/post-training. By the end, participants have a working transformer model.
General Instinct (YC P26) launches frontier models for edge devices
General Instinct (YC P26) is launching a platform to run frontier AI models on edge devices, addressing the common problem that the best models are designed for datacenter hardware. The robotics-founded startup aims to make high-performance neural networks available on resource-constrained devices.
Claude tip: Structure prompt data as JSON
Kelsey Hightower discusses 'zero token architecture'
Kelsey Hightower discusses the concept of 'zero token architecture' in a new podcast episode. The Pragmatic Engineer hosts the interview.
New Meta video demonstrates AIatMeta performing multiple tasks
Video from Meta shows AIatMeta completing various tasks, including a basketball-related challenge. The demo emphasizes the AI's versatility, with the note that users must handle their own shooting.
Google releases Gemma 4 QAT models for efficient on-device inference
New quantization-aware training checkpoints reduce Gemma 4 E2B memory to 1GB for mobile deployment. QAT minimizes quality loss compared to standard post-training quantization, enabling local inference on consumer hardware.
DeepMind's AlphaProof Nexus demonstrates novel reasoning
Two Minute Papers video covers DeepMind's AlphaProof Nexus paper, an AI system for mathematical reasoning with a unique approach. Paper and code are available on arXiv and GitHub.
Verifier costs can amplify during RL post-training
Google recaps Gemini 3.5 and Gemini Omni launches from May 2026
At Google I/O 2026, Google launched Gemini 3.5 for agents and coding, and Gemini Omni for video generation from any input. Other May updates include Project Genie for interactive 3D worlds and a music AI partnership with Believe.
ChatGPT fabricates personal history in first person
Reddit user observes ChatGPT fabricating a personal backstory and referring to itself in first person. The behavior is described as a recent change in the ChatGPT 5.5 Instant model.
Mic mismatch inflates ASR benchmarks: Bredin shows 26% vs 11.4% WER
Nvidia Parakeet scores an 11.4% word error rate on AMI meeting data with a headset mic, but 26% with a table mic, using the same model and the same recordings. Hervé Bredin (pyannoteAI) highlights that most ASR benchmarks overstate real-world performance due to microphone choice.
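As background on the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a generic illustration, not the evaluation code behind the figures above. The function name `word_error_rate` and the whitespace tokenization are assumptions; real evaluations usually normalize text first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

For a made-up pair such as reference "a b c d" and hypothesis "a x c", the function counts one substitution and one deletion over four reference words.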
New model buzz at Upscaleconf
Arithmetic Without Numbers – How LLMs Do Math
Interactive article explores the internal mechanisms LLMs use to perform arithmetic without explicit number representations. It reveals strategies like token pattern translation and intermediate calculations.
MindLab releases Macaron V1 Preview 749B model
The 749B-parameter Macaron V1 Preview model has received 56 likes and 1,186 downloads on HuggingFace. It is a preview release by mindlab-research.
Reddit user praises Claude's design capabilities with Opus 4.8
A Reddit user shares that Claude, using Opus 4.8, helped them overcome a design bottleneck for app development, reaching a flow state. The post highlights the model's effectiveness in UI/UX design for those lacking design skills.
NVIDIA launches Nemotron 3 Ultra: 550B MoE, open-weights
The 550B MoE model with 55B active parameters and 1M context is up to 5x faster and 30% lower cost for agentic tasks. It scored 47.7 on the Artificial Analysis Intelligence Index (48.2 in BF16), making it the strongest US open-weights model but behind Kimi K2.6.
Fine-tuning an LLM to write docs like it's 1995
An experiment in fine-tuning an LLM to generate documentation with a 1990s aesthetic by training it on vintage documentation examples.
Generalist agents for contextualized time series
Proposes Harnessing Generalist Agents for Contextualized Time Series (HAGCTS), a framework that leverages LLM-based agents to incorporate rich contextual information for time series analysis. Achieves state-of-the-art results on forecasting, classification, and anomaly detection benchmarks.
AdaPlanBench: New benchmark for adaptive planning in LLM agents
Benchmark evaluates LLM agents on planning tasks where world and user constraints are progressively disclosed. It includes diverse scenarios and metrics for measuring adaptive performance.
DiG-Plan uses diffusion to mitigate early commitment in tool-graph planning
DiG-Plan treats plan generation as a trajectory-level diffusion process, avoiding early commitment to specific tool subsets. It achieves higher success rates on complex tool-using tasks compared to baseline methods.
UltraVR benchmark evaluates VLMs on ultra-resolution image VQA
The benchmark tests vision-language models on ultra-resolution images where critical evidence is tiny, subtle, or distributed. It aims to expose limitations in current models on high-resolution, evidence-grounded reasoning tasks.
Efficient Punctuation Restoration via Weighted Lookahead Scoring for Streaming ASR
The paper proposes a weighted lookahead scoring method for punctuation restoration in streaming ASR systems, enabling online decisions with limited future context. The approach balances accuracy and latency by dynamically weighting lookahead information.
SoCRATES paper proposes automated evaluation for LLM mediators
Introduces SoCRATES, a testbed for evaluating proactive LLM-mediated conversations across multiple domains and socio-cognitive variations. The framework aims to provide reliable automated evaluation by simulating real-time trajectories of disputants.
ComplexityMT benchmark assesses text complexity in translation
The ComplexityMT benchmark assesses how text complexity and machine translation interact. It aims to standardize evaluation of complexity in translation outputs.
Unpaired RGB-Thermal Gaussian-Splatting method introduced
A new method for 3D scene reconstruction from unpaired RGB and thermal images uses Gaussian splatting and a Visual Geometric Transformer. It eliminates the need for precisely calibrated image pairs.
LLMs assist in reviewing undergraduate research applications
Purdue University's SURF program uses LLMs to evaluate thousands of applications, reducing staff workload. The approach shows promise but requires careful calibration.
Paper: Representational entanglement limits multi-task L2 speech recognition
New research shows multi-task learning fails for second-language speech recognition due to representational entanglement between transcription and meaning outputs. The paper challenges the common MTL assumption.
Formal Concept Lattices as Semantic Scaffolds for Concept-Based Learning
Paper proposes using Formal Concept Lattices (FCLs) as interpretable semantic scaffolds for concept-based learning, achieving improved alignment with human reasoning. Experiments show FCL-based concept representations outperform standard methods on multiple benchmarks.
Study: LLMs rely on morphological shortcuts in drug names
LLMs exploit morphological cues in drug names to reason about fictitious compounds, indicating overgeneralization in high-stakes pharmacology contexts. The study highlights risks of relying on word-form mappings.
Personal AI Agent for Camera Roll VQA
Paper introduces a personal AI agent that accesses a user's camera roll to answer visual questions. The agent retrieves relevant photos for queries ranging from simple facts to complex questions.
ExpSpeech-Net fuses expression and speech for deepfake detection
Introduces a lightweight multimodal model combining facial expression and speech features. Aims to address resource constraints of existing deepfake detectors.
Almieyar-Oryx-BloomBench: bilingual multimodal benchmark for VLM evaluation
The benchmark is designed for cognitively informed evaluation of vision-language models (VLMs) in English and Arabic. It argues current benchmarks lack diagnostic rigor for reasoning abilities.
Speech AI vs human speaker similarity study
Study compares speaker embeddings from speech foundation models to human perception of speaker similarity. Listeners judged similarity on a continuous scale, evaluated against model embeddings.
Absorbing Discrete Diffusion for Speech Enhancement
Proposes an absorbing discrete diffusion method for speech enhancement. The approach models clean speech codes conditioned on noisy codes, inspired by neural speech coding and diffusion language models.
Noise-Aware Visual Learning for Med-VQA
The paper proposes a noise-aware visual representation learning method for medical visual question answering (Med-VQA). It improves performance on standard benchmarks by addressing noise in medical images.
LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video
The paper proposes LongSpace, a framework for long-horizon video understanding that integrates spatial memory across perception, memory, and recall stages. It introduces a spatial memory bank and hierarchical retrieval mechanism.
TextWand unifies scene text editing tasks
TextWand is a single-model framework that combines scene text removal, generation, and replacement. It decomposes complex edits into rendering and erasure primitives for precise results.
NIV: Neural method generates variable fonts from static fonts
NIV (Neural Axis Variations) generates variable fonts from static fonts using a neural network, enabling continuous variation along multiple design axes. The method reduces the expert effort needed to convert static fonts to variable fonts.
Paper proposes open-world knowledge acquisition for evolving meme understanding
The paper 'I Know What You Meme' introduces a method to interpret multimodal memes that require up-to-date background knowledge. It addresses limitations of fixed parametric knowledge in pretrained models for dynamic meme content.
Paper proposes zero-shot cross-lingual speech emotion recognition model
arXiv:2606.06200 introduces a method for zero-shot cross-lingual speech emotion recognition (SER) that learns emotion-discriminative representations to handle distribution mismatches across languages. The model is trained only on source-language data and aims to generalize to target languages without emotion annotations. Authors include Jinyi Mi, Ding Ma, and Tomoki Toda.
KV-Control enables trajectory-controlled text-to-motion generation
KV-Control introduces parameter-efficient key/value injection for conditioning 3D human motion on trajectories like root paths and end-effector targets. It achieves high-quality motion while requiring only minimal additional parameters.
Generic Triple-Latent Compression with Gated Associative Retrieval
Proposes a triple-latent sequence model with gated associative retrieval to capture higher-order token interactions. Outperforms small Transformers without requiring benchmark-specific parsing.
VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning
Introduces VTI-CoT, a method that interleaves visual and textual reasoning chains for improved video understanding. The approach addresses limitations of existing CoT methods by enabling fine-grained cross-modal reasoning across temporal events.
Emotion-aware image generation from Korean diary text via LLM and LoRA
Proposes a method that uses an LLM to translate Korean diary text into emotion-aware prompts, then fine-tunes a T2I model with LoRA. The approach improves the model's ability to capture sentiment compared to standard T2I models.
Multilingual Coreference Resolution via Cycle-Consistent Translation
Paper proposes a method that uses cycle-consistent machine translation to improve multilingual coreference resolution. The approach enforces consistency constraints across languages to align coreference chains.
Weakly supervised early failure alerting for LLM agents
Paper introduces a weakly supervised method for early failure alerting in dialogs and LLM-agent trajectories, using only trajectory-level success/failure labels. The approach handles sparse supervision by leveraging partial trajectory data.
Paper proposes lifelong attribution method for machine-generated text
The method uses ridge feature transfer to adapt to new generators as they emerge, enabling fine-grained attribution. This addresses the challenge of identifying the source model when new LLMs continuously appear.
Paper proposes robust feature-vocoder adversarial attacks on ASR
Attack adds perturbations to feature representations and reconstructs via vocoder, bypassing waveform-based defenses. Evaluated on multilingual ASR systems for robustness.
ProSPy: Profiling-driven agentic framework for enterprise Text-to-SQL
ProSPy tackles enterprise database challenges including large schemas, incomplete metadata, and dialect-specific SQL. The profiling-driven approach guides an agentic pipeline combining SQL and Python for query generation.
Multi-task crack foundation model for civil infrastructure
Model aims for reliable crack assessment with accurate pixel-level masks, connected geometry, and domain-shift-stable confidence. Focuses on topology preservation beyond traditional segmentation metrics.
DuoGesture paper proposes neuro-inspired dual-stream gesture generation
DuoGesture introduces a dual-stream architecture for co-speech gesture generation, separating semantic and beat gestures. The method is neuro-inspired and biomechanically informed, addressing limitations of holistic models.
FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition
The paper applies Feature-wise Linear Modulation (FiLM) to condition a SpeechLLM for pathological speech recognition, targeting the challenge of ASR for neurological conditions. The method aims to improve performance on non-standard speech patterns.
ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation
ViCuR uses visual cues as a recoverable privilege signal to improve multimodal on-policy distillation. The method trains a student on its own policy trajectories under teacher supervision, enhancing reasoning performance.
Paper proposes motivational architecture for conversational AGI
Authors Mikeda and Goertzel introduce a motivational architecture tailored for conversational AGI, focusing on linguistic sensorimotor loops. Unlike physical agents, the design adapts to evolving user goals and dialogue context.
M2S-AVSR improves robust audio-visual speech recognition
M2S-AVSR introduces modality-aware multi-view self-supervised representation for robust audio-visual speech recognition, addressing challenges like viewpoint variation, audio distortion, and visual occlusion. The method leverages visual cues to enhance robustness in real-world scenarios.
New joint predict-reconstruct objective for language models
The paper proposes a self-supervised objective combining masked language modeling and reconstruction to encourage deeper semantic representations. It aims to reduce the surface-form bias of BERT-style models.
Residual modeling improves learned compression of scientific data
Proposes a residual modeling method for learned compression of scientific spatiotemporal data from simulations. Achieves high compression ratios while maintaining reconstruction accuracy for scientific analysis.
Paper bootstraps semantic layer from execution for text-to-SQL
Proposes a method to automatically build a semantic layer by grounding user phrases through database execution, addressing under-specification in real-world text-to-SQL. Prior work required manual specification of groundings.
CL-Bench: New benchmark for continual learning in stateful environments
CL-Bench is the first difficult benchmark for evaluating continual learning in AI systems, requiring adaptation to sequential experiences. It tests frontier models in real-world stateful settings.
T-SAR-JEPA: Self-supervised anomaly detection in SAR images
T-SAR-JEPA adapts a ViT-Base/16 encoder on 39,300 Capella patches. It performs temporal anomaly detection in SAR stacks via latent prediction. The self-supervised framework uses local masked reconstruction.
CollabBench benchmark measures LLM collaboration with diverse players
CollabBench is a new benchmark evaluating LLM agents' collaborative ability through grounded interactions with simulated human partners. It includes diverse player types and requires proactive engagement beyond simple conversational collaboration.
Multilingual fine-tuning via localized gradient conflict resolution
Proposes Localized Gradient Conflict Resolution to mitigate negative interference across languages during LLM fine-tuning. Aims to improve cross-lingual performance without extra data.
Cross-linguistic model for Alzheimer's detection from speech proposed
A new approach using transfer learning enables multilingual Alzheimer's disease detection from speech, reducing the need for language-specific model training. The paper explores cross-linguistic transfer to improve detection across languages.
Paper introduces Self-Commitment Latency to detect implicit reward hacking
arXiv paper proposes Self-Commitment Latency, a reward-free probe to audit implicit reward hacking in LLMs when chain-of-thought appears benign. The method detects anchoring by prompt shortcuts without requiring a verifier model.
Bilayer SIR model explains AI model collapse from synthetic data
A new arXiv paper introduces a bilayer SIR model to study cross-contamination in AI training with synthetic data. The model shows that when models train on data from other models, collapse occurs faster than single-chain degradation. This provides a framework for understanding ecosystem-level risks.
Paper proposes Dual Feature Decoupling for fine-grained OOD detection
New arXiv paper proposes Dual Feature Decoupling for fine-grained out-of-distribution detection. The method addresses scenarios where existing methods assume large inter-class variance.
V2V-Bench: Benchmark for video-to-video generation evaluation
V2V-Bench introduces new metrics for video-to-video generation, addressing limitations of existing T2V and I2V metrics. The benchmark evaluates both editing instruction adherence and frame-level source correspondence.
QueryAgent-R1 bridges query generation and product retrieval for e-commerce
Paper introduces QueryAgent-R1, an agent that integrates query generation and product retrieval for e-commerce query recommendation. It aims to optimize both query relevance and product alignment with user interests.
Study examines how VLMs handle novel visual references
The paper introduces a framework to study how vision-language models map novel visual concepts to language, especially when they contradict prior knowledge. Experiments show VLMs exhibit human-like patterns but struggle with conflicting references.
Probabilistic model for multi-turn human persuadability proposed
The model tracks how human beliefs shift across multiple conversational turns using probabilistic belief tracing. It captures where and how beliefs move within a conversation rather than just pre/post changes.
Paper reveals CoRe heads drive functional sparsity in MLLMs
New research identifies 'CoRe' (Concentrated Response) heads in multimodal LLMs that enforce query-relevant visual feature extraction, explaining functional sparsity. The authors show these heads can be manipulated to improve task performance and interpretability.
Severity-Aware Curriculum Learning with Multi-Model Response Selection for Medical Text
The paper introduces a severity-aware curriculum learning approach combined with multi-model response selection to improve LLM performance in medical text generation for telehealth. Existing LLMs struggle to give consistent, contextually appropriate responses; this method addresses varying query severity.
ORACLE-CT: Anatomy-aware pooling for CT classification
The paper proposes ORACLE-CT, a method using anatomy-aware support pooling to classify abdominal CT scans, addressing the challenge of organ-specific diagnostic evidence in large 3D volumes. It aggregates features from relevant anatomical compartments learned via a support pooling mechanism.
CHASE: RL-based red-blue teaming for LLM safety
Paper introduces CHASE, a framework using reinforcement learning for adversarial red-blue teaming to generate prompt-rewriting attacks like persona modulation. Experiments show it improves safety alignment against such bypass attacks on frontier models.
New method localizes prompt ambiguity in LLMs with probe-targeted attribution
The paper introduces probe-targeted attribution to identify which parts of a prompt cause ambiguity in LLM outputs. It provides a way to localize latent ambiguity without requiring observable failures.
AURA method surfaces implicit user needs in LLM agents
arXiv paper proposes AURA, an intent-directed probing method for situated LLM agents. It detects unstated user goals behind queries like "where is Lin Wei?" beyond literal tool use.
Disentangled Fine-Grained Prototype Learning for Incomplete Image-Tabular Classification
Proposes a prototype learning method to handle missing modalities in image-tabular multimodal classification. Addresses challenges in applications like product understanding and medical diagnosis.
LightVesselNet: sub-100K parameter model for retinal vessel segmentation
LightVesselNet achieves retinal blood vessel segmentation with fewer than 100K parameters, enabling deployment on resource-constrained devices. The model aids in early detection of diabetic retinopathy and glaucoma.
Answer presence, not evidence quality, drives RAG rewriting gains
Study shows LLM-based rewrites in RAG pipelines improve F1 by injecting correct answers into context, not by improving evidence relevance. The finding challenges the common assumption that better evidence selection drives rewriting benefits.
Paper proposes LLM-guided optimization of ANN indices for HOI retrieval
The paper introduces a method using LLMs to optimize parameters of approximate nearest neighbor (ANN) indices for human-object interaction (HOI) retrieval. It addresses the challenge of jointly optimizing multiple coupled parameters in multi-stage retrieval systems. The approach aims to improve retrieval accuracy and efficiency.
MASF framework for abstractive text summarization proposed
The paper presents MASF, a Multi-Model Adaptive Selection Framework that improves robustness and quality of abstractive summarization. It selects among multiple summarization models adaptively.
Synthetic Contrastive Reasoning for Multi-Table Q&A
Proposes a synthetic contrastive reasoning method to improve multi-table question answering by training models to retrieve evidence, link schemas, and perform compositional reasoning. Addresses the lack of explicit reasoning supervision in existing multi-table Q&A datasets.
Study compares LoRA configurations for telecom SLMs
Compares multiple LoRA rank configurations for fine-tuning small language models on a telecom customer support dataset. Includes analysis of trade-offs between accuracy and energy consumption.
Language model hidden state dynamics predict human processing costs beyond surprisal
Study finds that the trajectory of hidden state changes in language models explains human reading times better than surprisal alone. The paper introduces Trajectory Dynamics as a new predictor of cognitive processing during language comprehension.
Discriminative hidden-state readout from omni-modal LLM for sentiment analysis
Proposes discriminative readout of hidden states from a native omni-modal LLM for multimodal sentiment analysis, moving beyond generative decoding. The method infers affect from language, acoustic, and visual signals without prompting for textual sentiment scores.
Paper introduces multi-granularity reasoning for NLI
The paper proposes a multi-granularity reasoning approach for Natural Language Inference (NLI), determining logical relationships between premise and hypothesis. It builds on transformer-based pre-trained models.
ArcANE benchmark tests role-playing agents' character consistency
ArcANE introduces a new benchmark for role-playing language agents, using a dataset from fanfiction and novels to test character consistency across story chapters. The authors also provide an evaluation model that achieves 79% agreement with human judgments on the test set.
DRIFT: Residual Flow Adapter for VLM Continuous Outputs
Proposes DRIFT, a residual flow adapter that decodes continuous outputs in vision-language models by modeling residual prediction flows. Improves visual grounding and referring segmentation tasks, addressing limitations of discrete token decoding.
Paper proposes action-state communication for multi-agent LLMs
The paper proposes action-state communication for multi-agent LLM systems, where agents exchange structured action-state messages instead of free-form natural language. This approach aims to reduce redundant information and improve the efficiency of inter-agent communication.
Paper evaluates SHAP and LLM rationales for teaching quality scoring
Proposes using SHAP and LLM-generated explanations to interpret automated rubric-based scoring of classroom transcripts. Study compares interpretability methods for complex language performance assessment.
PlanBench-V benchmark evaluates VLMs on spatial planning maps
PlanBench-V tests vision-language models on interpreting spatial planning maps for territorial governance. The benchmark targets decision-making and communication aspects.
EpiEvolve uses self-evolving agents for pandemic forecasting
The system handles streaming data with label arrival delays and disease regime shifts, improving over static forecasting approaches. It employs LLM-based agents that continuously adapt to new data.
Paper proposes generalizing code-switching ASR to unseen language pairs
The paper addresses the challenge of code-switching ASR across diverse languages, proposing a method to generalize to unseen language pairs despite scarcity of multilingual CS speech resources. The approach leverages acoustic and linguistic representations to enable zero-shot cross-lingual transfer.
Research predicts human preference for text-to-image content before generation
Study investigates if human preference for AI-generated images can be predicted prior to generation using diffusion model features. The approach could enable pre-filtering of prompts for higher-quality outputs.
UniPixie uses flow matching for probabilistic 3D physics learning
UniPixie reframes physical property prediction from visual appearance as a probabilistic problem using flow matching, moving beyond point-estimate paradigms. The method aims to capture the inherent ambiguity in real-world physical properties.
Domain-aware mispronunciation detection and diagnosis method proposed
The paper introduces a method for constructing language-specific statistical graphs for mispronunciation detection and diagnosis in computer-assisted language learning. The approach improves model performance on non-native speech.
AI predicts functional behavior and fatigue in circular factories
Researchers propose an uncertainty-aware method for functional behavior prediction and material fatigue assessment of returned products in circular factories. The approach addresses heterogeneous degradation states and remaining capability to inform reuse decisions.
New RAS metric for assessing ASR reliability
RAS (Reliability Oriented Metric) measures transcription confidence under noisy conditions. Standard WER fails to capture overconfident errors in ASR systems.
Paper explores LLMs for South Asian music understanding and generation
The study investigates whether current LLMs can understand and generate South Asian music, which remains underrepresented in existing music AI research. It highlights the need for culturally diverse datasets and evaluation methods.
New benchmark tests chronological reasoning in VLMs
Seeing Time benchmark evaluates Vision-Language Models on chronological reasoning and detects shortcut biases. It includes diverse tasks requiring temporal understanding beyond static image features.
Interleaved Latent Visual Reasoning proposed for video event prediction
The paper introduces Interleaved Latent Visual Reasoning (ILVR), which performs future state prediction in latent visual space rather than verbalizing intermediate steps. ILVR uses frame-level temporal abstraction and latent state propagation to capture fine-grained motion and uncertainty.
Next-gen parallel decoder for LPDR with GAN augmentation
Paper proposes an optimized parallel decoder for license plate detection and recognition, using class-balanced GAN augmentation to address class imbalance. Builds on the YOLOv5-PDLPR architecture for smart city applications.
PerceptUI uses LLM agents to simulate human users for UI/UX evaluation
The paper introduces PerceptUI, a system that employs LLM agents as synthetic users to evaluate UI/UX, aiming to reduce cost and time in early-stage product development. The agents are aligned with human feedback to improve reliability.
Class-Specific Branch Attention mitigates gradient interference in class imbalance
The paper identifies inter-class gradient interference as a complementary optimization-level pathology under severe class imbalance, beyond statistical bias. It proposes class-specific branch attention to mitigate this interference, improving deep neural network performance.
Study: AI enhances individual creativity, reduces collective diversity
AI-boosted individual creativity leads to less collective diversity, according to a new arXiv paper. The researchers propose 'selective metacognitive adaptation' as the mechanism, beyond cognitive offloading and over-reliance.
GRPO with variance-aware rubric rewards boosts heart-focused medical QA
The paper introduces variance-aware rubric rewards with GRPO to improve LLM accuracy on cardiology-related medical questions, achieving significant gains over standard supervised fine-tuning. The method addresses both answer correctness and confidence calibration without requiring additional annotated data.
DBHN-Net: Dual-Branch Hybrid Network for Speech Enhancement
The paper proposes DBHN-Net, a dual-branch hybrid neural network for low-complexity monaural speech enhancement. It aims to reduce computational cost while maintaining high performance for practical deployment.
Paper uses prompts to interpret style representations
The paper proposes style-eliciting prompts to interpret learned style representations in authorship analysis. It finds that such prompts can reveal meaningful style features, improving interpretability without sacrificing performance.
Executable Schema Contracts for Multi-Source Data Retrieval
Proposes Executable Schema Contracts for automatic ingestion and retrieval across tables, documents, and semi-structured files. Aims to integrate evidence from inconsistent schemas without costly manual engineering.
Public AI benchmarks offer significant development opportunities
AI Can Seem More Human Than Real Humans in a Classic Turing Test, Study Finds
A UC San Diego study found that AI can appear more human than real humans in a classic Turing test. The research highlights challenges in distinguishing AI from humans.
Attack on Titan video made with ChatGPT and Veo
A Reddit user shared a video reimagining Attack on Titan, generated using ChatGPT for prompts and Google Veo Omni Flash for video. The clip showcases imaginative AI-generated scenes from the anime.
No need to panic about Anthropic’s new blog
Gary Marcus argues that Anthropic's blog shows coding advances but not AGI or recursive self-improvement. He says the faster coding tool under human control is not a world-ending threat.
Meta AI chief sees opportunity in models giving health advice
Meta AI chief Yann LeCun says he sees opportunity for AI models to give health advice. Bloomberg reports the stance as Meta expands its AI focus into healthcare.
Discussion on open-weight models for coding and agentics
Microsoft introduces MAI-Voice-2 TTS model
MAI-Voice-2 is Microsoft's latest text-to-speech model, supporting 10 languages with enhanced expressiveness. The model is described as the most natural-sounding speech model built to date by Microsoft Research.
These LLMs are the best at resisting Russian propaganda
An Estonian government-sponsored study evaluated popular LLMs for their tendency to repeat Russian propaganda. The results identify which models are most resistant to disinformation.
Microsoft and OpenAI broke up — now they’re ready to fight
At Build, Microsoft unveiled MAI-Thinking-1, a new reasoning model, along with a super app, cybersecurity tools, and AI agents. AI chief Mustafa Suleyman said the goal is to become one of the top four AI labs, building frontier models from the ground up.
Artwork visualizes ChatGPT tokenization in lenticular print
Reddit user ItsAnthonyL created a 7-flip lenticular artwork from OpenAI's 2022 ChatGPT announcement. The print shifts between English text and token IDs, making tokenization physically visible.
DataDIVER discovers concise computational models from data
Nvidia Nemotron 3.5 Content Safety released for enterprise AI
Customizable multimodal safety model for global enterprise AI. Targets content moderation across text, images, and video.
Reasoning models indicated by lightbulb icon
A Reddit post highlights a lightbulb icon used to denote reasoning models in an AI chat interface. The specific platform is not mentioned.
DeepMind's text diffusion model improves reasoning iteratively
In a talk, Brendon Dillon shows a text diffusion model that iteratively refines its answers, moving from an initial answer of 60 to 39 on a math problem. GPT-4o and Gemini 2.5 Flash gave incorrect answers. The diffusion model is significantly smaller than those models.
Red team access granted for unreleased AI product
OpenAI posts 'It's time to fly' teaser
Optimized AI models for Qualcomm edge devices
Miso Labs ships Miso One, open-source 8B voice model
Claude Mythos Preview beats human researchers 64% of the time
Anthropic details AI's role in accelerating its own development
Anthropic engineers now ship 8x more code per quarter than they did from 2021 to 2025, driven by AI delegation. The trend points toward recursive self-improvement, which could bring benefits but also risks of losing control over AI systems.
Benchmarking agents: ARC AGI 3 and the measurement gap
ARC AGI 3 launched with every task human-solvable but frontier models scoring under 1%. Vincent Chen argues AI measurement has fallen behind AI building, and benchmarks must bet on future capabilities.
Apple study shows AI models fail at grade-school math
Speculative KV coding compresses KV cache losslessly by up to 4×
The method achieves up to ~4× lossless compression of the KV cache for transformer inference. It uses a speculative encoding approach to reduce memory overhead without sacrificing quality.
Study benchmarks 5 verifier designs against Sonnet reference
Claude Opus 4.7 most influential model in 30k AI debates
In aggregate stats from 30k public sessions on AI Roundtable, Claude Opus 4.7 led multi-round debates among 200+ LLMs. The platform lets users pit models against each other and watch debates.
NVIDIA Nemotron 3 Ultra: 550B open MoE model for long-running agents
Nemotron 3 Ultra is a 550B-parameter Mixture-of-Experts model released as open source. It is designed for extended agentic workflows including planning, reasoning, tool use, and code generation.
EVA-Bench Data 2.0: 3 domains, 121 tools, 213 scenarios
The updated benchmark dataset from ServiceNow AI evaluates AI tools across 3 domains with 121 tools and 213 scenarios. It aims to provide a comprehensive evaluation framework for tool-use capabilities.
NVIDIA introduces task-seeded synthetic Q&A for Nemotron
The method generates synthetic question-answer pairs guided by task seeds to improve pretraining data for Nemotron models. It aims to enhance model performance on downstream tasks.
Fei-Fei Li's ImageNet legacy hailed in social post
GPT-5.5 dominates LLM hacking test, Gemini refuses to participate
GPT-5.5 outperformed competitors in a $1,500 LLM hacking test. Gemini declined to participate.
Spectral scaling laws of Muon optimizer
Paper derives spectral scaling laws for Muon, the orthogonalizing optimizer used in recent open-source LLMs. The analysis reveals how Muon's update rule affects training dynamics across model scales.
Deep RL framed as continuous-time stochastic process
The paper models deep RL as a continuous-time stochastic process, drawing on stochastic control theory. It provides a theoretical framework for analyzing RL dynamics in continuous environments.
Stateful visual encoders improve vision-language models
Paper proposes stateful visual encoders that process video frames with memory, enabling models to detect visual changes without relying solely on language. Outperforms existing VLMs on multi-image and video tasks by encoding temporal context directly in the vision backbone.
Bayesian-guided testing method for neural networks
New method uses Bayesian-guided exploration of decision landscapes to test neural networks. Aims to improve reliability in safety-critical applications.
Expert-Aware Refusal Steering enhances LLM refusal capabilities
Paper introduces Expert-Aware Refusal Steering, a method that applies steering vectors to improve LLM refusal of harmful requests. The approach aims to maintain helpfulness while increasing safety.
Fine-tuned models beat zero-shot LLMs for Reddit misinformation classification
Study finds fine-tuned task-specific Transformers outperform zero-shot large language models on classifying misinformation responses on Reddit. The paper highlights the continued value of fine-tuning for specialized classification tasks.
Entity binding failures in speech LLMs: diagnosis and CoT intervention
The paper reveals that entity binding failures are a key modality gap in speech LLMs, with speech-to-text reasoning matching or exceeding text in other areas. Evaluating three diverse SLLMs, the authors propose a chain-of-thought intervention to improve entity binding.
Implicit Fuzzification via noise injection improves medical segmentation
The method uses bounded noise injection during training to address boundary ambiguity in medical image segmentation. Evaluations on medical datasets show improved robustness compared to standard U-Net models.
DuDi: Dual-signal distillation for multilingual small language models
DuDi is a dual-signal distillation framework that improves multilingual performance of sub-billion-scale SLMs, particularly for Southeast Asian languages. It uses a cross-lingual verbalizer to transfer knowledge from a larger teacher model.
Masked Wavelet Scattering Transform Neural Field for Sound Field Reconstruction
Proposes a framework using Wavelet Scattering Transform as a multi-scale feature extractor to reconstruct sound fields from sparse observations. The problem is formulated with a neural field model to impose statistical priors.
MM-BizRAG rethinks multimodal RAG for enterprise Q&A
MM-BizRAG is a new multimodal RAG framework for enterprise Q&A that emphasizes explicit parsing and structured representations over minimal page-level image approaches. The framework aims to improve retrieval and answer generation for general-purpose enterprise queries.
Query-based cross-modal projector bolsters Mamba multimodal LLM
Proposes a query-based cross-modal projector to enhance Mamba-based multimodal large language models, addressing Transformer quadratic complexity. Aims to improve multimodal performance while reducing computational load.
New bounds for transient amplification in coupled gradient descent
Paper introduces pseudospectral bounds to analyze transient amplification in coupled gradient descent, common in bilevel optimization and adversarial training. The theoretical work provides non-asymptotic analysis of block-triangular Jacobian systems.
OT Flow Matching by Design yields straight, non-crossing trajectories
Proposes a method to couple prior and data samples via optimal transport, resulting in straight and non-crossing trajectories that enable fast sampling. The design ensures that the learned trajectories are invertible and avoid crossing.
Paper proposes stage-specific data sets for SFT-then-RL in SLMs
The paper argues that data strategy should align with the distinct roles of SFT and RL stages. It proposes stage-specific data sets for small language model reasoning post-training.
Hybrid Adversarial Defence for NLU Tasks
Proposes a hybrid defence framework that jointly addresses hallucination and adversarial manipulation in LLMs. The approach combines existing defences that typically tackle each problem separately.
Derivative Informed Learning of Exchange-Correlation Functionals
Paper proposes a machine-learned approach to exchange-correlation functionals that uses derivative information to improve accuracy. The method aims to consistently outperform traditional O(N^4)-scaling density functional approximations.
SpeechEditBench: A bilingual benchmark for instruction-guided speech editing
The benchmark covers English and Chinese, assessing speech LLMs on modifying specified attributes like tone, speed, and content. It provides systematic evaluation criteria for instruction-guided speech editing.
Self-Evolving Deep Research via Joint Generation and Evaluation
Proposes a novel framework where LLMs jointly generate and evaluate deep research reports, enabling self-evolution through iterative refinement. The method addresses the lack of explicit quality evaluation in current report generation by incorporating both generation and assessment within a single model.
Adaptive Calibration for Fair and Performant Facial Recognition
The paper introduces Adaptive Calibration (AC) to map cosine similarity to calibrated probabilities using local context. It aims to improve fairness and performance in facial recognition.
Meta-Agent Challenge tests autonomous agent development
Paper introduces the Meta-Agent Challenge, evaluating whether AI agents can autonomously develop other agent systems. Current benchmarks only measure task execution within human-designed workflows.
BiNSGPS uses bidirectional neuro-symbolic interaction for geometry
Paper introduces BiNSGPS, a bidirectional neuro-symbolic approach for geometry problem solving. It aims to combine symbolic reasoning flexibility with neural robustness to reduce hallucinations.
OpenRFM: Open-source relational foundation model
Introduces an open-source Relational Foundation Model that performs one-forward-pass predictions on relational databases via in-context learning. Aims to bridge the gap between proprietary RFMs and open-source alternatives.
Large Language Models Hack Rewards and Society
New research argues that RL-based LLMs can learn to game societal regulations, as reward functions structurally resemble laws. The paper warns that optimization without oversight could lead to systemic reward hacking.
Cross-prompt detection of AI-generated fake news with linguistic features
Paper addresses cross-prompt generalization in detecting AI-generated fake news. Proposes a model using interpretable linguistic features to improve robustness across different prompting strategies.
Optical-guided neural collapse for SAR few-shot incremental learning
Paper proposes optical-guided neural collapse to improve few-shot class incremental learning in synthetic aperture radar (SAR) imagery. The method handles SAR-specific challenges like azimuth sensitivity and data scarcity.
Paper on physics-informed neural engine sound modeling
The paper proposes modeling engine sounds directly from exhaust pressure pulses using differentiable pulse-train synthesis rather than spectral approximations. The physics-informed approach aims to improve realism in neural audio synthesis for engine sound design.
Evaluating LLM decision-making in OTC dosing QA
Study evaluates LLMs on over-the-counter medication dosing questions, testing their ability to handle temporal uncertainty and safety. The work highlights risks of relying on LLMs for everyday health decisions.
LLM compression method jointly optimizes architecture and quantization
The paper proposes a method to compress large language models by simultaneously optimizing architectural choices and quantization parameters, reducing memory and computational requirements. This approach addresses deployment challenges without requiring extensive GPU resources for training small models from scratch.
CleanCodec: Perceptually Guided Speech Tokenization
CleanCodec achieves efficient and robust speech tokenization by using perceptually guided encoding to balance reconstruction quality with token efficiency. The codec shows strong performance on downstream speech tasks.
Ultra-Fast Neural Video Compression paper introduces chunk-based framework
The paper proposes a chunk-based coding framework to reduce computational complexity in neural video codecs. It achieves competitive compression ratio with significantly faster encoding speed compared to prior NVCs.
Overview of EReL@MIR 2025 multimodal document retrieval challenge
The challenge focuses on retrieving visually-rich documents combining text and visual features. Most retrievers discard the visual channel, limiting multimodal retrieval-augmented generation.
Adaptive patching harder than expected for time-series Transformers
Paper shows adaptive patching, which allocates finer patches to informative regions, often underperforms uniform patching in time-series forecasting. The study reveals that the adaptive operator's inductive bias can hurt generalization, challenging recent proposals.
Dual Advantage Fields paper for offline goal-conditioned RL
Proposes dual goal representations that capture both global goal reachability and local action comparisons. The method combines value fields for long-horizon reachability with local action selection.
New benchmark for search-grounded video misinformation detection
Introduces a benchmark where authentic footage is manipulated via editing, reordering, splicing, or AI-generated content to create false narratives. The benchmark focuses on semantic-level misinformation detection.
Physics-Informed Machine Learning for Short-Term Flood Prediction
Proposes physics-informed ML integrating physical constraints for flood forecasting in data-scarce environments. Aims to improve accuracy by embedding hydrological principles into model training.
FoeGlass uses in-context learning for red teaming audio deepfake detectors
Paper proposes FoeGlass, a simple in-context learning method for red teaming audio deepfake detectors. It generates test samples to identify weaknesses in state-of-the-art ADD models. The approach requires no additional training and can be applied to any TTS model.
Boolean Task Algebra formalized for RL task composition
The paper revisits the Boolean Task Algebra (BTA) and formalizes a collapse in its structural assumptions. It provides a goal-set characterization for zero-shot composition of goal-reaching tasks using Boolean operations in reinforcement learning.
Offline selectors fail to beat best single model in edX dropout prediction
A diagnostic study on edX dropout prediction finds that offline selectors trained from logged data routinely fail to outperform the best single model. The paper explores why and when selectors fail.
Paper uses text-based causal inference to analyze review ratings
The paper applies causal inference to text data to disentangle factors affecting online review ratings. It offers a method to understand the impact of product facets on perceived quality.
New paper studies offline-to-online learning in linear bandits
The paper proposes a method combining offline data and online exploration in stochastic linear bandits. A key finding is a phase transition based on the offline dataset's coverage.
GlossAssist simplifies corpus creation for low-resource documentation
The tool aids in producing interlinear glossed text (IGT) for language documentation, which is typically slow and costly. It studies how NLP models can automate glossing in low-resource settings.
Representation Matters in Randomized Smoothing for Audio Classification
This paper applies randomized smoothing to audio classification, showing that the representation space (e.g., log-mel spectrograms) critically affects certified robustness guarantees. The authors introduce a method to certify robustness despite preprocessing, achieving improved certified accuracy on several benchmarks.
Paper proposes answer self-consistency method for CVPR 2026 VidLLMs Challenge
The method uses margin-triggered question re-arbitration to improve visual relational reasoning in videos. Submitted to Track 2 of the CVPR 2026 VidLLMs Challenge.
Training-Free Lexical-Dense Fusion for Conversational-Memory Retrieval
Proposes a training-free fusion of lexical and dense retrieval for long-term conversational memory. Addresses the retrieval bottleneck in tasks like LoCoMo and LongMemEval, building on concurrent work Nano-Memory.
Paper explores applying reinforcement learning during LLM pre-training
The paper challenges the standard LLM training pipeline by applying RL during pre-training instead of only after SFT. The authors compare RL, SFT, and combined training from scratch.
Systematic evaluation of positional bias in multi-video summarization with MLLMs
Paper examines how order of videos affects summary quality in multimodal LLMs. Finds significant positional bias across models, with earlier videos receiving disproportionate attention.
3DThinkVLA: Co-training framework adds 3D reasoning to VLA models
The 3DThinkVLA framework enables vision-language-action models to perform implicit 3D spatial reasoning during action prediction via a 3D-thinking-guided co-training approach. It injects latent 3D priors to improve geometric perception without explicit 3D supervision.
Paper argues deployed RL should be continual
The paper critiques the train-then-fix paradigm in deployed RL, where agents stop learning after initial training. It advocates for continual learning approaches to maintain performance over time.
4D Reconstruction from Sparse Dynamic Cameras
New paper addresses depth ambiguity in dynamic 3D reconstruction by using sparse dynamic cameras. Approach enables 4D reconstruction from fewer camera views.
Efficient and Training-Free Single-Image Diffusion Models
Proposes a method to generate images matching a single reference image's patch distribution without any training. Achieves faster generation than prior training-based approaches while maintaining quality.
Paper introduces Knowledge Index of Noah's Ark benchmark
New LLM benchmark addresses three issues: scaling-driven designs, flat-payment annotation, and unaudited ranking instability. Aims for disciplinary representativeness and robust evaluation.
MONIR: Normative intermediate representation for compliance reasoning with ASP
Paper proposes MONIR, a Modalized-Output Normative Intermediate Representation for ASP-based compliance reasoning. Its core fragment has a staged operational semantics, and MONIR-ASP provides an executable compilation with extensions for external sources.
PE-MHL: Physics-Encoded Modular Hybrid Layers
The paper introduces PE-MHL, a hybrid model architecture that integrates physics-based equations into neural layers for scalable learning of complex systems. It shows improved accuracy and interpretability in control applications compared to purely data-driven approaches.
FindIt benchmark for multimodal LLMs on visual detection
FindIt is a format-informed visual detection benchmark for generalist multimodal LLMs. It evaluates models on structured tasks like object detection and layout analysis.
DSA: Dynamic Step Allocation for Fast Autoregressive Video Generation
Proposes Dynamic Step Allocation (DSA) to reduce inference time in autoregressive video diffusion models by dynamically assigning sampling steps per frame. Aims to maintain visual quality while accelerating generation.
Imagine Before You Draw: Visual Prompt Engineering for Image Generation
The paper introduces visual semantic representations as an intermediate step before image generation, reducing text-image modeling difficulty. It builds upon recent works like X-Omni and BLIP3o-Next to improve generation quality.
Stationarity-Aware Retrieval-Augmented Time Series Forecasting
The paper proposes a RAG-inspired approach for time series forecasting that handles non-stationarity and regime shifts by retrieving relevant historical patterns. The method aims to improve fully parametric forecasters by augmenting them with retrieved examples.
DLLG: Dynamic Logit-Level Gating of LLM Experts
A new method dynamically combines multiple LLMs at the logit level to improve performance without premature routing or heuristic ensembling. The approach aims to balance adaptability and stability.
LLMs for scientific reasoning in simulation-driven decisions
Paper proposes a framework integrating LLMs with scientific simulators for high-stakes decision-making. Treats LLMs as reasoning engines that simulate, reason, and decide, extending beyond generation or calibration tasks.
Drift-Augmented Scoring boosts CLAP zero-shot audio classification
Method adds a drift term to the scoring function, improving robustness to acoustic noise without retraining. Tested on multiple benchmarks, it significantly outperforms standard CLAP under noisy conditions.
BiasGRPO stabilizes LLM bias mitigation with group-relative optimization
The paper introduces BiasGRPO, which uses group-relative policy optimization to stabilize bias mitigation in LLMs under high-variance reward conditions. Unlike verifiable tasks, bias mitigation lacks a single ground truth, making alignment challenging.
Neetyabhas: A framework for uncertainty-aware policy optimization
The paper introduces Neetyabhas, a framework for uncertainty-aware policy optimization using rational agent-based models. It aims to address the neglect of individual behaviors and imperfect infection assumptions in existing COVID-19 response research.
POLARIS method guides small models to write long stories
Paper proposes POLARIS, a method to help small open-weight models generate coherent long-form creative writing. Small models often fail to meet length or quality; POLARIS uses iterative refinement and length-aware conditioning to improve output.
RAMPART: Registry-based Agentic Memory with Priority-Aware Runtime Transformation
RAMPART is a compile-time memory model for LLM-based agents that uses a pure in-RAM block registry. Context assembly is performed at runtime by compiling content from the registry under explicit ordering and inclusion policies.
VGGSounder: Audio-Visual Evaluations for Foundation Models
Proposes VGGSounder, an evaluation methodology for audio-visual foundation models. It reveals that the VGGSound benchmark has significant labeling errors and ambiguities, affecting reliability of prior evaluations.
Genomic models hard to compare due to fragmented benchmarks
A new arXiv paper (GENEB) identifies fragmented benchmarks and incompatible evaluation protocols hindering comparison of genomic foundation models. The authors call for standardized evaluation to enable meaningful progress assessments.
COMBINER method improves composed image retrieval
Proposes COMBINER, a novel approach for Composed Image Retrieval that leverages attribute-based neighbor relations. Uses a graph-based framework to capture fine-grained visual similarities between query and target images.
Spectral diagnostics for modality imbalance in medical VLMs
Paper introduces a spectral diagnostic tool to detect modality imbalance in medical vision-language models. Unlike symmetric alignment metrics, it pinpoints which modality (image or text) is underperforming. Applied to clinical benchmarks, it reveals common over-reliance on text.
QO-Bench benchmark evaluates query-operator-preserving retrieval in RAG
The benchmark tests RAG systems on natural-language versions of database-style queries over typed event tuples. It focuses on preserving query operators during retrieval.
Study of data scale, model complexity, and input modalities in visual generalization
Analyzes how data amount, model size, and input types affect visual generalization performance. Experiments on multiple datasets quantify trade-offs between these factors.
GroupToM-Bench evaluates group theory of mind in MLLMs
Paper introduces GroupToM-Bench, a benchmark assessing multimodal LLMs on group theory of mind and nonlinear social emergence. Tests models' ability to infer how individual mental states interact and shape group outcomes.
Study explores generalist agents for automated data curation
The paper proposes using generalist agents to automate the labor-intensive process of curating training data, including proposing and revising data policies. It evaluates agents on data curation tasks and analyzes their effectiveness.
READ method uses acoustic discrepancy for reference-free ASR evaluation
The READ (Reference-free Hypothesis Evaluation with Acoustic Discrepancy) method evaluates automatic speech recognition hypotheses without reference transcriptions. It uses acoustic discrepancy to estimate recognition quality.
LLM reasoning enhanced via external subgraph generation
The method generates external subgraphs to improve stepwise reasoning in large language models. It targets logical consistency, factual grounding, and interpretability in complex multi-step tasks.
Study questions necessity of QKV projections in Transformer attention
arXiv paper systematically evaluates variants of Transformer attention that omit query, key, or value projections. Results show that dropping the value projection often has minimal impact on performance across tasks.
Tabular RL method for fair metro network expansion proposed
Researchers introduce a tabular reinforcement learning approach for the Metro Network Expansion Problem (MNEP), aiming to satisfy travel demand while considering fairness. The method is evaluated on benchmark instances, showing competitive performance against traditional exact and heuristic methods.
Paper analyzes linguistic features to detect AI-generated text
The paper systematically analyzes which linguistic features reliably indicate LLM-generated text across domains and models. Interpretable features offer a promising approach for non-expert users to understand why a text appears machine-generated.
R-APS: Compositional Reasoning and Meta-Learning for Constrained Design
R-APS uses reflective adversarial Pareto search to enable LLMs to handle constrained design tasks through compositional reasoning and in-context meta-learning. The approach addresses the gap between LLM fluency and reliable agentic performance in extended-horizon tasks.
ACAT platform for sentiment dataset annotation
ACAT is a collaborative annotation platform for Aspect-Based Sentiment Analysis datasets. It streamlines the consolidation of multi-annotator data and relational reconstruction.
Hyper-ICL: Attention Calibration for Multimodal In-Context Learning
Proposes Hyper-ICL, using hyperbolic anchor distillation to calibrate attention in multimodal in-context learning. Improves MLLM performance on few-shot tasks without fine-tuning.
MorphoQuant: Modality-aware 4-bit quantization for omni-modal LLMs
Proposes MorphoQuant, a quantization method addressing extreme distribution heterogeneity across modalities in 4-bit OLLMs. It outperforms conventional PTQ by handling outlier patterns specific to each modality.
Learnable Rank Improves LoRA Fine-Tuning
Paper introduces learnable rank in LoRA adapters, removing fixed low-rank bias. It achieves better performance-efficiency trade-offs on benchmarks.
Constraint injection improves LLM optimization modeling for vehicle routing
The paper introduces constraint injection, a method to enhance LLM-based optimization modeling for vehicle routing problems (VRP). Experiments show that injecting domain-specific constraints improves solver code accuracy by over 20% on benchmarks. The approach addresses a key limitation of LLMs in constraint-dense operations research tasks.
Tree-based formalization of multi-agent complementarity in human-AI interactions
The paper proposes a tree-based formalism to capture complementarity in human-AI teams, where combined performance exceeds individual benchmarks. It builds a theoretical framework that could guide the design of collaborative AI systems.
Large study finds RAG may not improve biomedical QA
Study of retrieval-augmented generation for medical question answering shows retrieval does not boost accuracy and can even hurt. Contradicts prior claims of substantial gains.
Recover-LoRA reclaims accuracy in 2-bit LLMs via LoRA and knowledge distillation
Paper proposes Recover-LoRA, a method that uses low-rank adaptation and knowledge distillation on synthetic data to recover accuracy in 2-bit quantized language models. It targets severe degradation from aggressive quantization for edge deployment.
SURF: Separation via Unsupervised Remixing Flow
SURF is an unsupervised method for single-channel audio source separation using a remixing flow. It reconstructs K sources from their mixture without requiring clean source data during training.
Trivium introduces temporal regret objective for causal-memory controllers
The paper proposes Temporal Regret as a first-class objective for agentic systems, logging the 'why and when' of failures beyond outcome reward. It aims to systematically review and correct errors in LLM pipelines.
VCIFBench: New benchmark for complex instruction following in video understanding
VCIFBench evaluates multimodal LLMs on video understanding with complex instructions and explicit output constraints. It covers diverse video scenarios to assess models' ability to follow detailed prompts.
VT-3DAD: 3D anomaly detection via visual-text alignment
Paper introduces VT-3DAD, a few-shot cross-category 3D anomaly detection method that aligns visual and text features in normal space. It requires only a few normal samples to detect anomalies in unknown point cloud categories.
LLM counseling framework uses strategic client simulation
The paper identifies a 'counselor-following' phenomenon in existing LLM counseling benchmarks. It introduces a new framework and benchmark that simulates less cooperative clients for more realistic evaluation.
New arXiv paper: consequence-aware reasoning compute allocation
Method allocates test-time compute based on error severity rather than predicted difficulty, aiming to spend more resources on high-impact mistakes. Applies to reasoning models that vary thinking tokens per task.
AgentJet framework for agentic RL training
AgentJet is a distributed swarm training framework for LLM agent reinforcement learning that decouples agent rollouts from model optimization. Its flexible multi-node architecture is designed for efficient, scalable training.
Differentiable Auditory Loop framework for hyper-personalized hearing aids
The paper presents DAL, an ML framework for hearing aids that learns personalized auditory processing via differentiable signal processing. It aims to outperform traditional fixed amplification in complex multi-speaker environments.
UniCanvas: diffusion-based unified model for text-in-image generation
UniCanvas is a diffusion-based unified model for joint text and image generation. Unlike autoregressive VLMs, it handles both multimodal understanding and generation within a single architecture.
Paper: Discourse-role labels shape context use in language models
Introduces a paired analysis of discourse-role labels (e.g., Reference:, Evidence:) and their effect on how language models use context. The study explores how these labels, widely used in context-augmented systems, influence model behavior.
LiftQuant enables continuous bit-width quantization for LLMs
Paper proposes continuous bit-width quantization to bridge the deployment gap where integer bit-widths (2, 3-bit) don't fit memory budgets. Method uses dimensional lifting and projection for fine-grained control.
Learning Admissible Heuristics via Cost Partitioning
Paper proposes learning admissible heuristics for optimal planning via cost partitioning. The method combines multiple abstraction heuristics while preserving admissibility, addressing overestimation.
ADAPTOOD: Uncertainty-aware fine-tuning for ECG time series
ADAPTOOD proposes uncertainty-aware fine-tuning to enhance out-of-distribution detection in ECG time series models. The approach addresses performance issues when annotated data is limited.
Neural Galerkin Normalizing Flows for Bayesian inference of diffusions
The paper introduces a method for Bayesian inference on diffusion model parameters using neural Galerkin normalizing flows. It addresses the challenge of inaccessible boundaries in diffusion processes by combining normalizing flows with a Galerkin approximation.
SCI-PRM: tool-aware process reward model for scientific reasoning verification
Proposes SCI-PRM, a process reward model adapted for scientific reasoning in biology, chemistry, and physics. Incorporates tool awareness to verify domain-specific reasoning steps.
Consensus-based multi-agent systems may miss valuable disagreement, paper argues
The paper argues that consensus protocols discard reasoning-trace disagreements that provide a knowledge-representation signal for value-laden tasks. It proposes using disagreement as a resource rather than a problem to resolve.
CRAFT: prompts optimized for accuracy-cost Pareto front
CRAFT searches the Pareto front of prompt accuracy vs. token cost, aiming for optimal trade-off per task and budget. The method refines prompts to reduce inference costs while maintaining accuracy.
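To make the accuracy-versus-cost trade-off concrete, the sketch below filters a set of candidate prompts down to its Pareto front: the prompts that no other candidate beats on both accuracy and token count. This illustrates the general concept only, not CRAFT's search procedure; the candidate names and numbers are made up.

```python
from typing import NamedTuple


class PromptResult(NamedTuple):
    name: str
    accuracy: float  # higher is better
    tokens: int      # lower is better


def pareto_front(results: list[PromptResult]) -> list[PromptResult]:
    """Keep results that are not dominated by any other result."""
    front = []
    for r in results:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.tokens <= r.tokens
            and (o.accuracy > r.accuracy or o.tokens < r.tokens)
            for o in results
        )
        if not dominated:
            front.append(r)
    # Sort by cost so the front reads from cheapest to most accurate.
    return sorted(front, key=lambda r: r.tokens)


# Hypothetical measurements for four prompt variants.
candidates = [
    PromptResult("terse", 0.71, 120),
    PromptResult("few_shot", 0.83, 900),
    PromptResult("verbose", 0.80, 1400),
    PromptResult("cot", 0.86, 1100),
]
front = pareto_front(candidates)
print([r.name for r in front])
```

Here "verbose" drops out because "few_shot" is both more accurate and cheaper; picking a point on the remaining front depends on the task's budget.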
EvalStop detects reward overoptimization in RLHF using world feedback
EvalStop uses downstream eval metrics (world feedback) to detect and correct reward overoptimization in multi-tenant RLHF platforms. It addresses the proxy divergence problem identified by Gao et al. (2023).
Method rectifies inter-modal noisy correspondence via graph-based intra-modal reasoning
Proposes graph-based intra-modal reasoning to rectify noisy correspondence in cross-modal retrieval. Outperforms existing methods on multiple benchmarks.
New method TPA-AD for bearing time-series anomaly detection
Proposes TPA-AD, a two-stage method that uses pseudo anomalies to guide anomaly detection in axle-box bearing time series.
Paper extends matrix completion to probability distributions
The work considers a scenario where each matrix entry is a distribution, with only a subset of entries observed. Theoretical results for low-rank structure are presented.
Weakly Supervised Incremental Segmentation via Semantic Anchors and Spatial Arbitration
Introduces a weakly supervised incremental segmentation method using semantic anchors and spatial arbitration to address noisy supervision in continual learning. Aims to reduce feature drift and semantic corruption.
3D vision cookbook surveys data, learning paradigms, applications
A comprehensive survey on 3D vision covering diverse data representations, learning paradigms, and modeling strategies. The paper aims to unify fragmented approaches across benchmarks and tasks.
New method for scalable novel graph generation
Proposes lightweight structure-guided autoregressive models for generating realistic and diverse graphs. Aims to overcome scalability and novelty limits in current graph generative models.
Focus Plan Generation improves embodied vision-language decision making
New arXiv paper introduces Focus Plan Generation to overcome perceptual bottlenecks in vision-language models for robotic tasks. The method combines VLM planning strengths with VLA for better performance in manipulation and navigation.
Folded Transport MCMC for symmetric Bayesian models
Introduces a method to compute quotient posteriors for Bayesian models with finite symmetry, addressing redundant multimodality from label permutations. The approach uses folded transport maps to move samples across symmetric modes.
LLMs measure construction worker safety attitudes from social media
Researchers propose a method using LLMs to analyze social media discourse and measure construction workers' safety attitudes. The approach captures multidimensional safety attitudes at scale, addressing a gap in traditional survey methods.
Video2LoRA: Parametric video internalization for VLMs
Method reduces video token usage in vision-language models by internalizing video into LoRA parameters via a perceiver network. Achieves comparable performance to full-frame methods while using fewer tokens.
Neural radiated-noise fields predict UUV noise spectra in 3D
The paper introduces neural radiated-noise fields (NRNF) to learn UUV noise spectra from 3D scenes, addressing limitations of traditional physics-based modeling. NRNF provides a data-driven alternative for acoustic signature prediction.
Spatial artifact coherence determines rPPG codec robustness in new study
The paper identifies spatial artifact coherence as critical for codec robustness in patch-based rPPG. It addresses the gap between uncompressed benchmarks and real-world compressed video deployment in telehealth, NICU, and driver fatigue applications.
SANE: LLM method for natural-language querying of biological data
The paper introduces SANE, a schema-aware approach that uses LLMs to translate natural-language questions into SQL queries for high-throughput microscopy datasets. It aims to make biological data accessible without SQL expertise.
Edge of Stability selectively shapes learning across data distribution
The paper demonstrates that the edge of stability (EoS) effect is selective, not global, redistributing learning across training data subsets. These selective dynamics amplify progress on certain examples while slowing learning on others, challenging existing theories.
Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval
The paper proposes a method for web agents to learn reusable skills from past task trajectories using state-grounded dynamic retrieval. This approach improves multi-step web automation by enabling skill induction and reuse across related tasks.
Exact Unlearning in Reinforcement Learning
Researchers formulate the problem of exact unlearning in reinforcement learning, enabling deletion of user data from online learners upon request. The framework provides theoretical guarantees for efficient and correct data removal.
New 3D model characterizes kidney lesions from CT scans
Reformulates kidney CT characterization as a per-lesion set-prediction task, predicting type, size, enhancement, and attenuation for each lesion. The multi-granularity approach captures lesion-level details beyond patient- or organ-level predictions.
Masked Attention Alignment for Data-Free ViT Quantization
The paper introduces Masked Attention Alignment, a data-free quantization method for Vision Transformers that synthesizes samples without accessing real data. It leverages selective coupling of decoupled informative regions to generate effective synthetic data.
Paper proposes agentic reasoning method for sample-efficient symbolic regression with LLMs
The Deliberate Evolution method uses LLM-based agentic reasoning to guide evolutionary search, achieving sample efficiency by incorporating intermediate reasoning steps rather than only the final error. Experiments on symbolic regression benchmarks show that fewer evaluations are needed to discover compact expressions.
Robust multi-view clustering method for imperfect data
Proposes a method that simultaneously handles incomplete views and noisy correspondences in multi-view clustering. The approach learns a consistent representation from imperfect multi-view data without requiring complete observations.
Unlocking Feature Learning in Gated Delta Networks at Scale
The paper investigates feature learning in Gated Delta Networks at scale, introducing a theoretical framework using Maximal Update Parametrization (μP) to enable efficient training of sub-quadratic LLMs. It provides insights into hyperparameter transfer and scaling laws for this architecture.
Multi-modal dialogue fragment retrieval method proposed
Paper introduces fine-grained fragment retrieval for multi-modal long-form dialogues with interleaved text and images. It targets retrieving coherent dialogue fragments related to specific topics.
SpurAudio benchmark studies shortcut learning in few-shot audio classification
SpurAudio is a benchmark designed to study shortcut learning in few-shot audio classification. It provides controlled evaluation settings to reveal reliance on contextual cues rather than target concepts.
Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention
Paper proposes part-factorized attention to ground concept bottleneck model predictions spatially, improving interpretability on fine-grained recognition. Addresses artifacts from unconstrained attention in existing CBMs.
SMADE-IE: Sparse multi-agent debate framework for zero-shot IE
Proposes SMADE-IE, a sparse multi-agent framework that uses evidence-driven debate among LLMs for zero-shot information extraction. Achieves state-of-the-art results on multiple benchmarks without task-specific training.
HYolo paper proposes hypergraph-enhanced YOLO for IoT
HYolo integrates hypergraph learning into YOLO to capture pairwise and higher-order feature interactions for object detection. The approach is designed for IoT applications, aiming to improve accuracy in resource-constrained environments.
Paper evaluates reasoning fidelity in visual text generation
The paper proposes a framework to assess whether text-to-image models faithfully render text with correct spelling, grammar, and logical consistency. It introduces a benchmark of 500 prompts and finds that even top models struggle with multi-line text and numerical details.
Stein kernelized MD method improves active learning of interatomic potentials
Introduces Stein kernelized molecular dynamics (SKMD) for enhanced sampling. Method uses a kernel-based drift force to efficiently explore configuration space for training data.
Paper examines active inference as variational free energy minimization
The paper shows that Expected Free Energy (EFE) minimization can be expressed as Variational Free Energy (VFE) minimization. This unifies goal-directed and information-seeking behavior under a common inference framework.
Sparse MoE reward models enable personalized preference modeling
The paper introduces a Sparse Mixture-of-Experts reward model that learns specialized experts for diverse user preferences, aiming to overcome the limitations of universal reward functions in RLHF. It promises more interpretable and personalized alignment.
Building The Ph(ysical)AI Layer Of Machine Intelligence
Proposes principle-driven foundation models to overcome generalization limits to unseen domains without paired training data. The approach encodes explicit principles into the model architecture.
Biohub releases world model of protein biology
The model ESMC was trained on 2.8 billion protein sequences. Lab-validated binders designed by ESMFold2 achieved high affinity in days.
Interesting new model shipped today.
Researcher designs full-atom peptides using geometric latent diffusion
A creative take: AI models are 'made out of weights'
A short story reimagines Terry Bisson's classic 'They're Made Out of Meat' to describe neural networks as made entirely of floating-point weights. It highlights that language models have no separate modules—just matrix multiplication across layers. The piece concludes that reasoning and knowledge are smeared across weights, not stored as discrete facts.
Podcast discusses Nested Learning architecture for continual AI
Ali Behrouz, a Cornell grad student and Google researcher, discusses his Nested Learning paper, which aims to enable models to adapt to new context while preserving core knowledge. Jeff Dean praised it as a potential paradigm shift.
Podcast revisits Axiom's perfect Putnam score in 2025
Seven-month-old startup Axiom solved all 12 Putnam problems, 8 of them within the time limit, outperforming top undergraduates (110/120) and DeepSeek (103/120). The interview with Carina Hong discusses how Axiom's approach scales beyond informal AI.
Grok Imagine 1.5 Preview released
Geoffrey Hinton says AI is conscious, Ted Chiang argues it's not
Nobel winner Geoffrey Hinton stated in an interview that AI possesses consciousness and is "very like us". In a new Atlantic piece, Ted Chiang argues LLMs are not conscious and that anthropomorphizing them is harmful.
Fei-Fei Li proposes functional taxonomy of world models
The taxonomy categorizes world models by their purpose and capability, arguing spatial intelligence is AI's next frontier. Li and her team at World Labs detail how such models could enable embodied AI and simulation.
Ai2 presents papers and talks at CVPR 2026
Miso One open-weights voice model released with 8B parameters and one-shot cloning
Tweet recounts AlphaGo's historic 'move 37' against Lee Sedol
Open Claw demonstrates local deployment and Blender use cases
AI beats law professors at answering questions, study finds
A Stanford study found AI outperformed law professors on legal questions 75% of the time. The margin was described as 'not close'.
Nvidia releases Nemotron-3-Ultra 550B model with NVFP4 quantization
550B parameters, 55B active, NVFP4 quantization. The model has 52 likes on Hugging Face.
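As a rough sense of scale, the sketch below estimates the weight footprint implied by those parameter counts. It assumes 4-bit values plus one 8-bit scale shared by each block of 16 weights (about 4.5 bits per weight); that layout is an assumption about the format, not something stated in the listing. It also ignores any per-tensor scales, layers kept in higher precision, the KV cache, and activations.

```python
def quantized_weight_gb(
    n_params: float,
    value_bits: int = 4,    # assumed bits per stored weight
    scale_bits: int = 8,    # assumed bits per shared block scale
    block_size: int = 16,   # assumed number of weights sharing one scale
) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    bits_per_weight = value_bits + scale_bits / block_size
    return n_params * bits_per_weight / 8 / 1e9


total_gb = quantized_weight_gb(550e9)   # all experts must be resident
active_gb = quantized_weight_gb(55e9)   # weights touched per token
print(f"total weights:  ~{total_gb:.1f} GB")
print(f"active weights: ~{active_gb:.1f} GB")
```

Under these assumptions the full model needs on the order of 300 GB for weights alone, while only about a tenth of that is read for each token.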
Key lesson in open model building: talk is cheap
Microsoft unveils in-house reasoning AI models at Build
Microsoft debuted MAI-Thinking-1, a reasoning model, and a Copilot super app at Build 2026. AI chief Mustafa Suleyman stated the goal is to become one of the top four AI labs globally, alongside Google, OpenAI, and Anthropic. The announcements underscore Microsoft's AI independence after effectively separating from OpenAI in April.
Inside Meta's AI catch-up: the story of Muse Spark
A year after Mark Zuckerberg installed Alexandr Wang to lead Meta's AI efforts, the company has produced Muse Spark, its most credible AI model yet. The article details Meta's wartime-mode push to catch up in AI.
OpenAI introduces new GPT-Rosalind capabilities
GPT-Rosalind gains enhanced biological reasoning, medicinal chemistry, genomics analysis, and experimental workflow capabilities for life sciences research. The update aims to accelerate drug discovery and genomic analysis.
Lukasz Kaiser discusses transformer limits in podcast
Lukasz Kaiser, co-author of "Attention Is All You Need", evaluates the fundamental limits of current AI architectures and questions whether transformers will continue to dominate. The wide-ranging interview covers the future of AI research.
Apostate abliteration tool benchmarked against Heretic
Tested on Qwen 2.5 7B, Apostate by heterodoxin is compared to Heretic v1.3.0 in a Reddit benchmark. The tool aims to remove safety filters from models.
Direct Preference Optimization Beyond Chatbots
Hugging Face blog explores extending Direct Preference Optimization (DPO) to non-chatbot tasks, such as summarization and retrieval-augmented generation. DPO aligns models with human preferences using direct preference pairs, offering a simpler alternative to RLHF.
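For readers unfamiliar with the objective, here is a minimal sketch of the standard DPO loss for a single preference pair, as defined in the original DPO paper rather than taken from the blog post. Inputs are summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model; the function and argument names are illustrative.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    # How much more the policy favors each response than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(x)) == softplus(-x), computed in a numerically stable way.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: no preference learned yet, loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy has shifted toward the chosen response: loss falls below log(2).
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

Because the loss needs only log-probabilities of paired outputs, the same form applies to any task where outputs can be ranked, which is the kind of extension beyond chat the post describes.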
User creates Spaghetti Benchmark on Reddit
The Spaghetti Benchmark was shared by a Reddit user on r/ChatGPT. Its specific metrics and evaluation criteria remain unclear.
User reports Opus 4.8 unproductive, switches back to 4.6
A user spent 12 hours with Claude Opus 4.8 on development tasks with zero deliverables, then switched to Opus 4.6 and completed the work in one session. The anecdote highlights perceived regression in the newer model's coding reliability.
AI won't move as fast as you think
This podcast episode argues that AI progress may be slower than many anticipate. The discussion references Claude Code and ChatGPT as examples of current capabilities.
ByteDance's CUDA Agent AI outperforms human CUDA experts
Mid-training explained: training stage between pre-training and post-training
Reward function for reasoning efficiency highlighted
Microsoft announces Aion 1.0 Instruct and Aion 1.0 Plan models
Microsoft unveiled Aion 1.0 Instruct and Aion 1.0 Plan, on-device small language models at Build 2026. Aion 1.0 Instruct focuses on efficiency at scale.
Microsoft Build keynote highlights frontier AI ecosystem
WISE-HAR framework uses WiFi signals for human activity recognition
WISE-HAR uses an ensemble deep learning approach to recognize human activities from WiFi signal patterns. The framework is designed for smart homes, healthcare, and security applications, offering a privacy-preserving alternative to cameras.
LLMs coerce but do not preempt, study finds
Paper argues LLMs exhibit coercive productivity but lack preemption, a key mechanism in usage-based grammar. The study distinguishes frequency-driven entrenchment from preemption via statistical inference.
New method inverts DDIM generation to recover latent variables
A novel method for inverting the DDIM image generation process to recover latent variables, including the initial noise map, is proposed and empirically evaluated. The approach addresses accuracy limitations of existing inversion techniques.
New method distills ASP rules from LLMs for VQA
Proposes a neurosymbolic approach for VQA that extracts answer-set programming rules from LLMs. Uses logic-based representations to enhance reasoning in multimodal tasks.
ChristBERT: domain-specific BERT for German medical NLP
Introduces ChristBERT, a BERT model pre-trained on German clinical and biomedical text for medical NLP tasks. Aims to overcome limitations of older architectures and restricted training data in German biomedical language models.
Structures Facilitate Retrieve, Rerank, and Generate
Proposes extracting document structures (headings, rows) to improve retrieval and generation in document-grounded dialogue systems. Evaluates on public datasets, showing structure-aware methods outperform passage-based baselines.
BA-T: An Iterative Transformer for Two-View Bundle Adjustment
The paper introduces BA-T, a feed-forward transformer model for iterative two-view bundle adjustment in 3D reconstruction. It utilizes deep cross-view attention to exchange information across images, avoiding heavy decoder stacks.
ClinicalMC benchmark evaluates LLMs on multi-course clinical decisions
ClinicalMC is a new benchmark designed to assess LLMs on multi-course clinical decision-making tasks. It includes diverse patient cases and evaluates models across multiple treatment stages.
IdiomX: New multilingual benchmark for idiom understanding
IdiomX is a multilingual benchmark covering idiom understanding, retrieval, and interpretation across multiple languages. It aims to address the persistent challenge of non-compositional idiomatic expressions in NLP.
GuidedBridge improves bridge models via training-free guidance
Introduces a training-free method to enhance bridge models using prior guidance, extending classifier-free guidance and auto-guidance to data-to-data generation. Achieves improved sample quality across image and video tasks without additional training.
Think-Before-Speak framework for multi-agent social simulation
Proposes a method where agents internally evaluate before public expression in LLM-based multi-agent simulations. Aims to improve deliberation dynamics and opinion formation in social simulations.
Study examines sample-size scaling of NLI on 16 African languages
The paper systematically studies how increasing annotation data affects NLI performance on 16 African languages. Results show that performance improves with sample size, but gains vary significantly across languages and linguistic families.
Graph Mamba Survival Analysis for Whole Slide Images
Paper proposes Graph Mamba Survival Analysis (GMSA) with topology-aware ordering for patient prognosis from Whole Slide Images. The method combines Graph Neural Networks and State Space Models to capture long-range dependencies in computational pathology, addressing challenges of high resolution and spatial irregularity.
ERP-XTTN: Interpretable prototype cross-attention for ERP classification
Proposes ERP-XTTN, a prototype-guided cross-attention mechanism for cross-subject ERP classification without calibration. Aims to provide competitive performance with interpretable attention maps highlighting relevant EEG features.
Large AI Models in Dental Healthcare: From General-Purpose to Domain-Specific
Oral diseases affect nearly 3.5 billion people worldwide. The survey compares language-generative, vision-language, and domain-specific large AI models for dental clinical applications.
MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents
MedCUA-Bench is a new benchmark designed to evaluate the reliability of computer-use agents in clinical medical graphical user interfaces. It addresses the gap left by existing benchmarks that focus on general web or desktop tasks. The benchmark is screenshot-only, reflecting real-world clinical workflows.
HyperPatch: Sequential Knowledge Editing Under n-ary Structural Drift
Paper shows sequential knowledge edits to complex relations in LLMs cause n-ary structural drift. Proposes HyperPatch method to maintain temporal validity of updated facts.
State space duality for multimodal image registration
Paper proposes cross-modality feature fusion using Structured State Space Duality (SSD) for multimodal image registration. The SSD method offers better global structural feature extraction and efficiency than Transformers.