
Large language models often struggle with long-context tasks because the attention mechanism suffers from **score dilution**, where relevant information is overwhelmed by surrounding "distractor" tokens. The authors show that common **inference-time scaling strategies**, such as generating additional "thinking tokens," fail to address this problem as context length grows. To counter it, they propose **query-only test-time training (qTTT)**, a computationally efficient method that updates only the model's **query projection matrices** for a specific input. By performing a single prefill to cache **keys and values** and then applying targeted gradient updates to the query projections alone, the model learns to better distinguish the "needle" of relevant information from the "haystack" of noise. Experiments on the **LongBench-v2** and **ZeroSCROLLS** benchmarks show that qTTT consistently outperforms both standard inference and thinking-token baselines. These results suggest that **adapting model parameters** at inference time is a more effective use of compute than simply lengthening the generated output.
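
To make the caching trick concrete, here is a minimal single-head PyTorch sketch (a toy illustration, not the authors' implementation): keys and values are computed once in a single no-grad prefill with frozen weights, and a few gradient steps then touch only the query projection. The next-hidden-state objective and the frozen `readout` head are illustrative assumptions; the source does not specify the test-time loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d = 64                     # hidden size (toy scale)
T = 128                    # context length
x = torch.randn(T, d)      # hidden states of the long input context

W_q = nn.Linear(d, d, bias=False)  # the only trainable weights
W_k = nn.Linear(d, d, bias=False)
W_v = nn.Linear(d, d, bias=False)
readout = nn.Linear(d, d, bias=False)  # hypothetical frozen prediction head

# Freeze everything except the query projection.
for p in (*W_k.parameters(), *W_v.parameters(), *readout.parameters()):
    p.requires_grad = False

# Single prefill: keys and values depend only on frozen weights,
# so they are computed once and reused across all gradient steps.
with torch.no_grad():
    K = W_k(x)
    V = W_v(x)

opt = torch.optim.SGD(W_q.parameters(), lr=1e-2)
mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)  # causal mask

for step in range(10):
    Q = W_q(x)  # only the queries are recomputed each step
    attn = F.softmax(Q @ K.T / d**0.5 + mask, dim=-1)
    h = attn @ V
    # Illustrative toy objective (assumption, not from the source):
    # predict each position's next hidden state from the attended output.
    pred = readout(h[:-1])
    loss = F.mse_loss(pred, x[1:])
    opt.zero_grad()
    loss.backward()  # gradients flow only into W_q
    opt.step()
```

Because the frozen keys and values never change, the expensive prefill happens exactly once; each adaptation step recomputes only the queries and the attention itself, which is what keeps qTTT cheap relative to test-time training of the full model.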