This paper discusses how to design evaluation-efficient, self-improving AI systems for societal and business problems, such as ad optimization, where generating new content is cheap but evaluating it is expensive. It argues that traditional human-driven optimization is slow and bottlenecked by content generation, whereas generative AI has shifted the bottleneck to efficient evaluation and prompt refinement. The proposed method, T-BoN BO, addresses two key challenges (the lack of numerical gradients in language space and the need to balance exploration and exploitation) by adapting classic **Bayesian Optimization (BO)** principles. Theoretically, the paper proves that T-BoN BO, which uses textual gradients and a Best-of-N selection rule, **emulates gradient-based Upper Confidence Bound (UCB) BO** and inherits its theoretical guarantees for evaluation efficiency. Empirical results in digital marketing scenarios demonstrate that T-BoN BO significantly **outperforms state-of-the-art baselines** in achieving performance gains under a fixed evaluation budget.
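To make the selection rule concrete, here is a minimal sketch of one T-BoN BO round. The function names (`propose_candidates`, `proxy_score`, `expensive_eval`) and the UCB-style acquisition are assumptions for illustration, with random placeholders standing in for the LLM and the costly evaluator; this is not the paper's exact algorithm.

```python
import math, random

def propose_candidates(prompt, feedback, n):
    """Textual-gradient step: an LLM would rewrite the prompt n ways from feedback."""
    return [f"{prompt} [edit {i} guided by: {feedback}]" for i in range(n)]

def proxy_score(candidate, k=3):
    """Cheap surrogate: mean and spread of k noisy judge calls (placeholder)."""
    samples = [random.random() for _ in range(k)]
    mu = sum(samples) / k
    sd = (sum((s - mu) ** 2 for s in samples) / k) ** 0.5
    return mu, sd

def expensive_eval(candidate):
    """Costly ground-truth metric, e.g. a live ad experiment (placeholder)."""
    return random.random()

def tbon_bo_round(prompt, feedback, history, n=8, beta=1.0):
    """One Best-of-N round with a UCB-style acquisition over candidates."""
    cands = propose_candidates(prompt, feedback, n)
    ucb = []
    for c in cands:
        mu, sd = proxy_score(c)
        ucb.append(mu + beta * sd)          # optimistic score: mean + exploration bonus
    best = cands[max(range(n), key=lambda i: ucb[i])]
    reward = expensive_eval(best)           # spend exactly one expensive evaluation
    history.append((best, reward))
    return best, reward
```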
This whitepaper from Google, titled **"Context Engineering: Sessions & Memory"** and authored by Kimberly Milam and Antonio Gulli in November 2025, provides a detailed guide to building stateful, intelligent Large Language Model (LLM) agents. The document defines **Context Engineering** as the process of dynamically managing information within an LLM's context window, emphasizing two core, interconnected components: **Sessions** and **Memory**. **Sessions** manage the immediate, chronological dialogue and working state of a single conversation, while **Memory** is a decoupled system for long-term persistence, capturing and consolidating key information across multiple sessions to enable personalization. The paper extensively covers architectural considerations for both sessions (e.g., compaction strategies for managing long context) and memory (e.g., types of memory, storage architectures, and the LLM-driven process of extraction and consolidation), contrasting the dynamic, user-specific role of memory managers with the static, factual role of Retrieval-Augmented Generation (RAG) engines. Finally, it outlines critical production requirements, including **privacy**, **security**, and **asynchronous processing**, to ensure robust and efficient deployment of these state-aware agents.
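A toy data-structure sketch of the session/memory split described above; the class names, the keyword-based salience filter, and the compaction rule are illustrative assumptions, with the LLM-driven consolidation step reduced to a placeholder.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Session:
    """Immediate, chronological working state of a single conversation."""
    turns: List[str] = field(default_factory=list)

    def append(self, role: str, text: str) -> None:
        self.turns.append(f"{role}: {text}")

@dataclass
class MemoryStore:
    """Decoupled long-term store that persists across sessions."""
    facts: List[str] = field(default_factory=list)

    def consolidate(self, session: Session) -> None:
        # Placeholder for the LLM-driven extraction/consolidation pass.
        for turn in session.turns:
            if "prefers" in turn:                    # toy salience rule
                self.facts.append(turn)

    def build_context(self, session: Session, max_turns: int = 10) -> str:
        # Compaction: persistent facts plus only the most recent turns go into the prompt.
        recent = session.turns[-max_turns:]
        return "\n".join(self.facts + recent)
```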
This paper introduces **Asynchronous Thinking (AsyncThink)**, a novel paradigm for large language model (LLM) reasoning designed to enable **agentic organization** and collaborative problem-solving. AsyncThink employs an **organizer-worker thinking protocol** where an LLM acts as an organizer that dynamically structures concurrent processes using **Fork and Join actions**, while workers execute sub-queries. The authors compare AsyncThink favorably to traditional sequential and parallel thinking approaches, demonstrating that it achieves **higher accuracy and reduced critical-path latency** across complex tasks like multi-solution countdown and mathematical reasoning. Training is accomplished through a two-stage process involving **cold-start format fine-tuning** followed by **reinforcement learning (RL)**, which optimizes the model for correctness, format compliance, and thinking concurrency. Furthermore, the results show that AsyncThink's capability for organizing thought processes **generalizes well** to previously unseen domains and problem types.
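The organizer-worker protocol maps naturally onto concurrent execution. Below is a minimal sketch using `asyncio`, with placeholder functions standing in for the LLM calls; the fixed three-way Fork is an assumption, since in AsyncThink the organizer decides the structure dynamically.

```python
import asyncio

async def worker(sub_query: str) -> str:
    """Worker: would normally be an LLM call answering one sub-query."""
    await asyncio.sleep(0.1)                 # stand-in for generation latency
    return f"answer({sub_query})"

async def organizer(query: str) -> str:
    """Organizer: Fork concurrent sub-queries, Join their results, then conclude."""
    sub_queries = [f"{query} :: part {i}" for i in range(3)]            # Fork
    results = await asyncio.gather(*(worker(q) for q in sub_queries))   # Join
    return " | ".join(results)               # final synthesis step

print(asyncio.run(organizer("reach 24 using 3, 4, 5, 6")))
```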
This paper by OpenAI discusses a new approach to **neural network interpretability** through the use of **sparse circuits**. The authors explain that understanding the behavior of complex, hard-to-decipher neural networks is critical for safety and oversight as AI systems become more capable. They distinguish their work on **mechanistic interpretability**, which seeks to fully reverse-engineer computations, from other methods like chain-of-thought interpretability. The core of their research involves training **sparse models**—models with far fewer internal connections—to create simpler, **disentangled circuits** that are easier to analyze and understand, offering a promising path toward making even larger AI systems transparent.
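For intuition about what "far fewer internal connections" means in practice, here is a generic weight-sparsity illustration (magnitude-based masking in NumPy). This is not OpenAI's training procedure, which learns sparse models rather than pruning dense ones; it only shows the kind of connectivity constraint involved.

```python
import numpy as np

def sparsify(weight: np.ndarray, keep_frac: float = 0.05) -> np.ndarray:
    """Keep only the largest-magnitude fraction of connections (generic masking)."""
    k = max(1, int(keep_frac * weight.size))
    threshold = np.partition(np.abs(weight).ravel(), -k)[-k]
    mask = np.abs(weight) >= threshold
    return weight * mask

W = np.random.randn(64, 64)
W_sparse = sparsify(W)
print(f"{(W_sparse != 0).mean():.1%} of connections remain")
```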
The academic paper introduces Supervised Reinforcement Learning (SRL), a novel training framework for Large Language Models (LLMs) developed by researchers from Google Cloud AI Research and UCLA to address the difficulty of multi-step reasoning. SRL reformulates problem-solving as a sequence of logical actions, providing dense, step-wise rewards based on the similarity between the model's generated actions and expert trajectories, which contrasts with the sparser, final-outcome rewards used in Reinforcement Learning with Verifiable Rewards (RLVR). The framework trains models to generate an internal reasoning monologue before committing to an action, encouraging flexible and sophisticated reasoning patterns like interleaved planning and verification. Extensive experiments on challenging mathematical reasoning and agentic software engineering benchmarks demonstrate that SRL significantly outperforms baseline methods like Supervised Fine-Tuning (SFT) and RLVR, especially when used to initialize training before subsequent RLVR refinement.
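A minimal sketch of the dense, step-wise reward idea: score each generated action by its similarity to the corresponding expert action. The string-similarity measure (`difflib.SequenceMatcher`) and the example trajectories are stand-ins, not the paper's exact reward function.

```python
from difflib import SequenceMatcher

def step_rewards(generated_actions, expert_actions):
    """Dense per-step rewards from similarity to an expert trajectory (simplified)."""
    rewards = []
    for gen, exp in zip(generated_actions, expert_actions):
        rewards.append(SequenceMatcher(None, gen, exp).ratio())
    return rewards

expert = ["factor the quadratic", "apply the zero-product rule", "report x = 2, 3"]
model  = ["factor the polynomial", "set each factor to zero", "answer x = 2, 3"]
print(step_rewards(model, expert))   # three similarity scores in [0, 1]
```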
This research paper introduces Multi-Agent Evolve (MAE), a novel reinforcement learning framework designed to enable large language models (LLMs) to self-improve their general reasoning abilities without relying on human-curated datasets or verifiable external rewards. MAE accomplishes this through a system where a single LLM is instantiated into three interacting roles—a Proposer that creates challenging questions, a Solver that attempts to answer them, and a Judge that evaluates both the questions and answers. This triad operates in a closed-loop co-evolution process, driven by domain-agnostic self-rewarding mechanisms like difficulty-aware and quality rewards, which allows the model to continuously generate better training material and enhance its capabilities across diverse benchmarks like mathematics, coding, and general knowledge. The experiments demonstrate that this multi-agent, self-play approach outperforms traditional Supervised Fine-Tuning (SFT), particularly highlighting its stability and effectiveness in generating a self-improving training signal.
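A compact sketch of one Proposer/Solver/Judge round with the same LLM in all three roles. The `llm` callable, the string-matching verdict parsing, and the binary rewards are simplifying assumptions; the paper uses structured, difficulty-aware and quality rewards.

```python
def multi_agent_evolve_round(llm, task_pool):
    """One closed-loop round: a single LLM plays Proposer, Solver, and Judge."""
    question = llm("Proposer: write a challenging but solvable question.")
    answer = llm(f"Solver: answer the question.\n{question}")
    verdict = llm("Judge: rate question difficulty/quality and answer correctness.\n"
                  f"Q: {question}\nA: {answer}")
    # Domain-agnostic self-rewards (toy parsing of the Judge's verdict).
    solver_reward = 1.0 if "correct" in verdict.lower() else 0.0
    proposer_reward = 1.0 if "challenging" in verdict.lower() else 0.0
    task_pool.append((question, answer, solver_reward))
    return proposer_reward, solver_reward

# Toy usage with a stub "LLM".
pool = []
print(multi_agent_evolve_round(lambda prompt: "challenging ... correct", pool))
```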
This paper introduces a novel self-supervised learning framework designed to resolve the pervasive issue of representation collapse in existing Joint-Embedding Predictive Architectures (JEPAs). It establishes a theoretical foundation by proving that an isotropic Gaussian distribution is the optimal embedding distribution for minimizing the worst-case risk across various downstream tasks. To enforce this optimal distribution, the paper proposes SIGReg (Sketched Isotropic Gaussian Regularization), a scalable method that uses directional statistical tests, specifically recommending the Epps-Pulley test, to match the empirical feature distribution to the target Gaussian. The core contribution is the resulting LeJEPA loss function, which combines the standard JEPA prediction objective with SIGReg, effectively eliminating the need for complex anti-collapse heuristics like stop-gradients or teacher-student networks, and demonstrating robust, state-of-the-art performance with significantly reduced training complexity.
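To illustrate the directional-test idea behind SIGReg, the sketch below projects embeddings onto random unit directions and compares the empirical characteristic function of each projection to that of a standard normal, in the spirit of the Epps-Pulley test. The discretization, weighting, and scale are simplifications, not the paper's exact statistic.

```python
import numpy as np

def sigreg_penalty(z, num_dirs=16, ts=np.linspace(-3, 3, 31)):
    """Simplified directional isotropic-Gaussian penalty (Epps-Pulley-style)."""
    n, d = z.shape
    dirs = np.random.randn(d, num_dirs)
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = z @ dirs                                    # (n, num_dirs) 1-D projections
    penalty = 0.0
    for j in range(num_dirs):
        ecf = np.exp(1j * np.outer(ts, proj[:, j])).mean(axis=1)   # empirical CF
        target = np.exp(-0.5 * ts ** 2)                            # standard normal CF
        penalty += np.mean(np.abs(ecf - target) ** 2)
    return penalty / num_dirs

z = np.random.randn(512, 64)       # embeddings already close to isotropic Gaussian
print(sigreg_penalty(z))           # small value; grows if the features collapse
```

In LeJEPA this penalty would be added to the standard JEPA prediction loss, replacing stop-gradient or teacher-student tricks as the anti-collapse mechanism.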
This paper introduces a new meta-benchmark designed to evaluate large language models' (LLMs) ability to perform **interactive preference discovery** and response personalization through conversation. The framework converts existing benchmarks into interactive tasks by assigning **psychologically-grounded personas** with hidden preferences to be discovered by the AI. Evaluation of numerous frontier models showed that simply attempting personalization often **degraded performance** compared to generic responses (42.6% of cases), indicating systematic failures in current architectures. The research established a strong positive correlation between **question-asking volume** and preference alignment, but noted that models tend not to ask enough questions, and that personalization often imposes a **cognitive cost** that reduces task accuracy, particularly in mathematical reasoning. Ultimately, the paper argues that interactive preference discovery is a **distinct capability** requiring dedicated architectural innovations rather than relying on emergent general language understanding.
The academic paper investigates the efficiency of Large Language Model (LLM) pre-training by quantifying the amount of knowledge left unextracted from training datasets. The authors demonstrate that employing retrieval-augmented generation (RAG) at test time, which involves reusing the pre-training data, leads to significant accuracy improvements across benchmarks like MMLU, Math-500, and SimpleQA, even after decontamination efforts. The study establishes that retrieval acts as a compute multiplier, with performance gains for MMLU sometimes equivalent to about a 5x increase in pre-training compute alone. Furthermore, the researchers show that combining RAG with additional test-time compute techniques, such as self-consistency and reranking, yields even greater gains, suggesting substantial room for improvement in both dataset quality and current pre-training methodologies. Overall, the findings indicate that LLMs are not fully utilizing the information present in existing datasets and that retrieval offers a powerful, additive way to enhance performance.
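As a rough sketch of the test-time recipe being combined here (retrieval over the pre-training data plus self-consistency voting), the snippet below uses a toy word-overlap retriever and a hypothetical `llm` sampler; it illustrates the pipeline shape, not the paper's implementation.

```python
from collections import Counter

def retrieve(query, corpus, k=3):
    """Toy retriever: rank pre-training passages by word overlap with the query."""
    scored = sorted(corpus, key=lambda p: -len(set(p.split()) & set(query.split())))
    return scored[:k]

def answer_with_rag_and_self_consistency(llm, query, corpus, n_samples=5):
    """Retrieve context, sample several answers, return the majority vote."""
    context = "\n".join(retrieve(query, corpus))
    answers = [llm(f"Context:\n{context}\n\nQuestion: {query}") for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

corpus = ["The capital of Australia is Canberra.", "Sydney is Australia's largest city."]
print(answer_with_rag_and_self_consistency(lambda p: "Canberra",
                                            "What is the capital of Australia?", corpus))
```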
The academic paper proposes **DreamGym**, a novel, unified framework for scaling agent learning using reinforcement learning (RL) by synthesizing diverse experiences instead of relying on costly real-environment rollouts. The core of this system is a **reasoning-based experience model** that abstracts environment dynamics into a textual space, enabling the generation of consistent state transitions and reward signals through explicit reasoning. DreamGym integrates an **experience replay buffer** to enrich synthetic data and a **curriculum task generator** that creates progressively challenging problems based on reward entropy, thereby addressing common RL challenges like sparse rewards and task scarcity. Experimental results across diverse environments, including those not traditionally "RL-ready" like WebArena, demonstrate that DreamGym substantially **improves RL training efficiency** and yields significant performance gains in both purely synthetic settings and sim-to-real transfer scenarios.
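A minimal sketch of the reward-entropy curriculum signal: tasks the agent neither always solves nor always fails carry the most information, so they are prioritized. This is a simplification of DreamGym's curriculum generator, which also synthesizes new variations of such tasks rather than only re-selecting them.

```python
import math

def reward_entropy(rewards):
    """Binary entropy of a task's recent success rate (0/1 rewards)."""
    if not rewards:
        return 1.0                       # unseen tasks treated as maximally uncertain
    p = sum(rewards) / len(rewards)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_curriculum(task_history, batch_size=4):
    """Pick the tasks whose reward entropy is highest."""
    ranked = sorted(task_history, key=lambda kv: -reward_entropy(kv[1]))
    return [task for task, _ in ranked[:batch_size]]

history = {"book a flight": [1, 1, 1],
           "compare two laptops": [0, 1, 0, 1],
           "file a refund": [0, 0]}
print(select_curriculum(history.items(), batch_size=2))
```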
This paper introduces **Continuous Autoregressive Language Models (CALM)**, a new paradigm designed to overcome the efficiency limitations of conventional, token-by-token generation in Large Language Models (LLMs). CALM achieves significant computational savings by employing a robust **autoencoder** to compress a chunk of $K$ discrete tokens into a single, high-fidelity continuous vector, thereby reducing the number of sequential generation steps by a factor of $K$. This shift necessitates a comprehensive **likelihood-free framework**, including an **energy loss** for generative modeling and a new evaluation metric called **BrierLM**, which offers a reliable alternative to Perplexity for implicit models. Furthermore, the paper details a provably exact, but computationally expensive, **likelihood-free temperature sampling algorithm**, along with a highly efficient batch approximation that demonstrates an equivalent trade-off between accuracy and diversity as traditional LLMs. The empirical results confirm that increasing the **semantic bandwidth** $K$ provides a powerful new axis for achieving a superior performance-compute balance in language modeling.
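To show why a likelihood-free metric is even possible, here is one simple, unbiased, sample-only estimator of the Brier score: it needs two independent draws from the model per context and never a probability. BrierLM is built on this kind of estimator, but the paper's exact construction may differ, so treat this as an illustration.

```python
def brier_estimate(sample_pairs, reference):
    """Unbiased sample-only Brier estimate: E[1(x1=x2) - 2*1(x1=y) + 1] = sum_x p_x^2 - 2 p_y + 1."""
    total = 0.0
    for (x1, x2), y in zip(sample_pairs, reference):
        total += (x1 == x2) - 2 * (x1 == y) + 1
    return total / len(reference)

# Toy usage: two continuations drawn per context, compared against the data.
pairs = [("the cat", "the cat"), ("a dog", "the dog")]
refs = ["the cat", "the dog"]
print(brier_estimate(pairs, refs))   # lower is better
```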
This position paper argues for a new epistemic theory of agents that views internal reasoning and external actions as equivalent epistemic tools for acquiring knowledge. The core argument is that for an agent to achieve optimal and efficient behavior, its tool use decision boundary must be aligned with its knowledge boundary, meaning it should only resort to external tools when necessary knowledge is unavailable internally. The paper formalizes this concept by defining tools, agents, and optimal behavior, and introduces three principles of knowledge: foundation, uniqueness/diversity, and dynamic conservation, which provide a theoretical basis for designing next-generation knowledge-driven intelligence systems capable of adaptive, goal-directed behavior with minimal unnecessary action. Finally, the authors propose paths toward achieving agent optimality through enhanced training paradigms like next-tool prediction and reinforcement learning that rewards both correctness and efficiency.
This paper introduces Nested Learning (NL), a new paradigm that addresses fundamental challenges in AI self-improvement, continual learning, and memory for models like Large Language Models (LLMs). NL suggests that existing deep learning methods compress their "context flow" and explains how in-context learning emerges in large models. The authors propose the HOPE architecture, a self-referential learning module with a Continuum Memory System (CMS), which is built on the NL insight that traditional optimizers are fundamentally associative memory modules. Experiments demonstrate that HOPE, using this novel framework, shows promising results across language modeling and common-sense reasoning tasks, often outperforming modern recurrent neural networks and Transformers.
This paper introduces the GST-UNet (G-computation Spatio-Temporal UNet), a novel neural framework designed for causal inference using spatiotemporal observational data, particularly when analyzing a single observed trajectory. This framework integrates a U-Net encoder with ConvLSTM and attention mechanisms to learn spatiotemporal dependencies and explicitly address challenges like interference, spatial confounding, and time-varying confounding. The core contribution is coupling this architecture with iterative G-computation to provide theoretically grounded identification and consistency guarantees for estimating location-specific potential outcomes. Empirical results, including synthetic experiments and a real-world analysis of wildfire smoke exposure and respiratory hospitalizations during the 2018 California Camp Fire, validate the method's ability to produce stable and accurate counterfactual estimates compared to existing baselines.
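For readers unfamiliar with iterative G-computation, here is a simplified, non-spatial, two-period version of the estimand using plain linear regressions in place of the neural Q-models; the data-generating process is synthetic and the code is only meant to show the backward, regress-then-intervene recursion.

```python
import numpy as np

def fit_predict(features, target, features_interv):
    """Linear outcome regression via least squares (stand-in for neural Q-models)."""
    X = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    Xi = np.column_stack([np.ones(len(features_interv)), features_interv])
    return Xi @ coef

def iterative_g_computation(X1, A1, X2, A2, Y, a1, a2):
    """Two-period iterative G-computation for E[Y(a1, a2)] under sequential ignorability."""
    # Step 1: regress Y on the full history, then predict under A2 := a2.
    feats2 = np.column_stack([X1, A1, X2, A2])
    feats2_int = np.column_stack([X1, A1, X2, np.full_like(A2, a2)])
    Q2 = fit_predict(feats2, Y, feats2_int)
    # Step 2: regress that pseudo-outcome on earlier history, predict under A1 := a1.
    feats1 = np.column_stack([X1, A1])
    feats1_int = np.column_stack([X1, np.full_like(A1, a1)])
    Q1 = fit_predict(feats1, Q2, feats1_int)
    return Q1.mean()

rng = np.random.default_rng(0)
n = 1000
X1 = rng.normal(size=n); A1 = (X1 + rng.normal(size=n) > 0).astype(float)
X2 = X1 + A1 + rng.normal(size=n); A2 = (X2 > 0).astype(float)
Y = X2 + 2 * A1 + 3 * A2 + rng.normal(size=n)
print(iterative_g_computation(X1, A1, X2, A2, Y, a1=1.0, a2=1.0))
```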
This paper focuses on improving **Large Language Model (LLM) performance on tasks requiring long-term conversational memory**. The authors address limitations in existing evaluation methods by presenting a new framework that automatically generates **long, coherent conversations up to 10 million tokens** and **BEAM**, a benchmark dataset with 100 dialogues and 2,000 probing questions designed to test ten distinct memory abilities, including contradiction resolution and temporal reasoning. To enhance LLMs, the authors propose **LIGHT**, a human-cognition-inspired framework that integrates three complementary memory systems: episodic, working, and a scratchpad for salient facts. Experimental results demonstrate that even state-of-the-art LLMs struggle as dialogues lengthen, while the LIGHT framework **consistently improves performance** across various models.
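A toy sketch of the three complementary stores LIGHT combines; the interfaces, the keyword-based salience rule, and the overlap retriever are assumptions for illustration, not the paper's implementation.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class LightMemory:
    """Toy version of LIGHT's three memory systems."""
    episodic: list = field(default_factory=list)                       # long-term event log
    working: deque = field(default_factory=lambda: deque(maxlen=20))   # recent turns
    scratchpad: set = field(default_factory=set)                       # salient, durable facts

    def observe(self, turn: str) -> None:
        self.episodic.append(turn)
        self.working.append(turn)
        if "allergic" in turn or "birthday" in turn:    # toy salience rule
            self.scratchpad.add(turn)

    def build_prompt(self, question: str, k: int = 3) -> str:
        # Retrieve a few episodic turns by keyword overlap, then add working memory.
        hits = sorted(self.episodic,
                      key=lambda t: -len(set(t.split()) & set(question.split())))[:k]
        return "\n".join(list(self.scratchpad) + hits + list(self.working) + [question])
```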
This paper introduces Agentic Economic Modeling (AEM), a rigorous framework proposed by superstar social scientists that leverages Large Language Models (LLMs) to reliably simulate economic decisions and generate counterfactual data for econometric inference. The core innovation is a three-stage pipeline—Generation, Correction, and Inference—designed to overcome the systematic biases found in raw LLM outputs by anchoring them to small samples of real-world human data. Specifically, AEM employs a bias-correction mapping and a mixture-of-personas approach to align synthetic choices with empirical evidence, enabling accurate estimation of economic quantities like demand elasticities and treatment effects. The authors validate AEM's effectiveness in two settings: a large-scale conjoint study and a regional field experiment, demonstrating that the method significantly improves estimation accuracy and can reduce the scale and duration required for expensive Randomized Control Trials (RCTs). The results show that the bias-correction mixture model is particularly effective, demonstrating its ability to generalize across regions and time periods.
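As a rough illustration of the Correction stage, the snippet below fits a simple affine map from synthetic (LLM) choice shares to human choice shares on a small set of anchor conditions, then applies it to conditions where no human data exist. This is a deliberately crude stand-in for AEM's bias-correction mapping and mixture-of-personas model.

```python
import numpy as np

def fit_bias_correction(synthetic_shares, human_shares):
    """Fit an affine map from LLM choice shares to human shares on an anchor sample."""
    X = np.column_stack([np.ones_like(synthetic_shares), synthetic_shares])
    coef, *_ = np.linalg.lstsq(X, human_shares, rcond=None)
    return coef

def correct(shares, coef):
    return np.clip(coef[0] + coef[1] * shares, 0.0, 1.0)

# Anchor conditions observed for both humans and the LLM ...
synth_anchor = np.array([0.62, 0.55, 0.70]); human_anchor = np.array([0.48, 0.44, 0.58])
coef = fit_bias_correction(synth_anchor, human_anchor)
# ... then de-bias synthetic choices for counterfactual, human-free conditions.
print(correct(np.array([0.66, 0.51]), coef))
```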
This research by Anthropic investigates the existence of **functional introspective awareness** in large language models (LLMs), specifically focusing on Anthropic's Claude models. The core methodology involves using **concept injection**, where researchers manipulate a model's internal activations with representations of specific concepts to see if the model can accurately **report on these altered internal states**. Experiments demonstrate that models can, at times, notice injected "thoughts," distinguish these internal representations from text inputs, detect when pre-filled outputs were unintentional by referring to prior intentions, and even **modulate their internal states** when instructed to "think about" a concept. The findings indicate that while this introspective capacity is often **unreliable and context-dependent**, the most capable models, such as Claude Opus 4 and 4.1, exhibit the strongest signs of this ability, suggesting it may emerge with increased model sophistication.
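Mechanically, concept injection amounts to adding a concept direction to a layer's activations during a forward pass. The minimal PyTorch sketch below uses a single `nn.Linear` as a stand-in layer and a random vector as the "concept"; in the actual experiments the target is a transformer's residual stream and the direction is extracted from contrastive prompts.

```python
import torch
import torch.nn as nn

layer = nn.Linear(16, 16)          # stand-in for a transformer layer
concept_vector = torch.randn(16)   # stand-in for an extracted concept direction

def inject_concept(module, inputs, output, strength=4.0):
    """Forward hook that adds a scaled concept direction to the layer's activations."""
    return output + strength * concept_vector

handle = layer.register_forward_hook(inject_concept)
steered = layer(torch.randn(2, 16))   # activations now carry the injected concept
handle.remove()                       # restore normal behaviour
```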
This paper investigates whether large reasoning models can sustain self-training using Reinforcement Learning (RL), specifically employing majority voting as a self-feedback mechanism, termed Self-Rewarded Training (SRT). The research demonstrates that this basic approach initially improves the model's reasoning performance and enhances the quality of its self-generated feedback, achieving performance comparable to RL with ground-truth supervision. However, a critical limitation is identified: prolonged self-training consistently leads to reward hacking and a sudden, complete performance collapse as models learn to maximize the training pseudo-reward by outputting simplistic, template answers. The authors conclude that designing robust feedback mechanisms is the central challenge for enabling sustained self-improvement in large language models.
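The pseudo-reward at the heart of SRT is simple to state: sample several answers, take the majority vote, and reward agreement with it. The sketch below shows that rule; it is also the quantity models eventually learn to hack by collapsing onto trivial template answers.

```python
from collections import Counter

def self_rewards(sampled_answers):
    """Majority-vote pseudo-rewards: no ground-truth label is involved."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return majority, [1.0 if a == majority else 0.0 for a in sampled_answers]

answers = ["42", "42", "41", "42", "36"]
majority, rewards = self_rewards(answers)
print(majority, rewards)   # "42", [1.0, 1.0, 0.0, 1.0, 0.0]
```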
This paper proposes ALITA-G, a method for transforming a general-purpose large language model agent into a domain-specific expert. The system achieves specialization by systematically generating, abstracting, and curating reusable Model Context Protocol (MCP) tools from successful task executions, which are then stored in an MCP Box. At inference time, a Retrieval-Augmented Generation (RAG) mechanism selects the most contextually relevant tools from the box, thereby enhancing the agent's problem-solving accuracy and computational efficiency. Experimental results on challenging benchmarks like GAIA, PathVQA, and Humanity's Last Exam demonstrate that ALITA-G attains new state-of-the-art performance while simultaneously achieving a significant reduction in average token consumption compared to generalist baselines. The overall process converts transient solutions into reusable competence, offering a new paradigm for automated agent generation focused on capability expansion.
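A toy sketch of the inference-time retrieval step: rank tools in the MCP Box by relevance to the incoming task and hand only the top few to the agent. Word overlap stands in for the embedding similarity a real RAG retriever would use, and the tool names are invented examples.

```python
def retrieve_tools(task: str, mcp_box: dict, k: int = 3) -> list:
    """Pick the k most relevant tools from the MCP Box by description overlap with the task."""
    def overlap(desc: str) -> int:
        return len(set(desc.lower().split()) & set(task.lower().split()))
    ranked = sorted(mcp_box.items(), key=lambda kv: -overlap(kv[1]))
    return [name for name, _ in ranked[:k]]

mcp_box = {
    "parse_pathology_slide": "extract findings from a pathology image report",
    "unit_convert": "convert physical units in a question",
    "web_table_lookup": "look up a fact from a web table",
}
print(retrieve_tools("What does the pathology report of this slide show?", mcp_box, k=1))
```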
The academic paper proposes a novel framework called Test-Time Self-Improvement (TT-SI) for training Large Language Model (LLM) agents more efficiently by adapting them on-the-fly during inference. This new paradigm is motivated by the high cost and inefficiency of traditional large-scale fine-tuning, which often involves redundant data. TT-SI operates in three steps: Self-Awareness identifies uncertain test instances, Self-Augmentation generates tailored training samples for those instances, and Self-Improvement uses these samples for lightweight, temporary fine-tuning. Empirical results across several agent benchmarks demonstrate that TT-SI significantly improves model accuracy (e.g., +5.48% on average) while utilizing dramatically fewer training samples compared to standard supervised fine-tuning. The findings support the potential of uncertainty-guided, instance-specific learning as a more effective and cost-efficient approach for building capable, self-evolving LLM agents.
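The three-step loop can be sketched as control flow. All of the helpers below (`uncertainty`, `make_variants`, `finetune_copy`, `generate`) are hypothetical stand-ins wired with placeholders, so the snippet only shows the shape of the TT-SI procedure, not the paper's implementation.

```python
import random

def uncertainty(instance):            # self-awareness signal, e.g. answer disagreement
    return random.random()

def make_variants(instance, n=4):     # self-augmentation: tailored training samples
    return [f"{instance} (variant {i})" for i in range(n)]

def finetune_copy(variants):          # lightweight, temporary adaptation
    return lambda x: f"answer to {x} after adapting on {len(variants)} samples"

def generate(x):
    return f"answer to {x}"

def tt_si(instance, threshold=0.5):
    """Self-Awareness -> Self-Augmentation -> Self-Improvement, applied per instance."""
    if uncertainty(instance) < threshold:       # confident: answer directly
        return generate(instance)
    adapted = finetune_copy(make_variants(instance))
    return adapted(instance)                    # answer with the temporarily adapted model

print(tt_si("Book the cheapest flight from LAX to SFO next Tuesday."))
```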