
The four papers reviewed here, spanning 1967 to 2025, collectively discuss the mathematical properties and deep learning applications of **doubly stochastic matrices**: nonnegative matrices whose rows and columns each sum to one. The 1967 paper, "Concerning Nonnegative Matrices and Doubly Stochastic Matrices," provides the **foundational mathematical theory**: iterative row and column scaling (the Sinkhorn algorithm) converges to a unique doubly stochastic matrix provided the original matrix has "total support." Two of the later papers focus on **Transformer architecture enhancements**: "Sinkformers" replaces the standard row-wise softmax attention with Sinkhorn iterations, enforcing **doubly stochastic attention matrices** for improved performance and theoretical properties, such as a connection to the Wasserstein metric, while "ESPFormer" achieves doubly stochastic attention via expected sliced transport plans instead of Sinkhorn iterations. Finally, the "Gradient Multi-Normalization" paper introduces a **stateless optimizer** built on a multi-normalization procedure, including a "Square-Root Sinkhorn" variant, and demonstrates its efficacy and efficiency in training large language models.
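
To make the shared mechanism concrete, below is a minimal NumPy sketch (not taken from any of the papers' reference code) of Sinkhorn row/column scaling and of its Sinkformer-style use as a drop-in replacement for row-wise softmax in self-attention. The function names, iteration count, and numerical tolerance are illustrative assumptions.

```python
import numpy as np

def sinkhorn_normalize(K, n_iters=50, eps=1e-9):
    """Alternately rescale rows and columns of a positive matrix K.

    By the Sinkhorn-Knopp result, if K has total support this iteration
    converges to a unique doubly stochastic matrix of the form D1 @ K @ D2.
    """
    P = K.copy()
    for _ in range(n_iters):
        P = P / (P.sum(axis=1, keepdims=True) + eps)  # rows sum to 1
        P = P / (P.sum(axis=0, keepdims=True) + eps)  # columns sum to 1
    return P

def doubly_stochastic_attention(Q, K, V, n_iters=50):
    """Sinkformer-style self-attention: replace the row-wise softmax with
    Sinkhorn iterations so the attention matrix is (approximately)
    doubly stochastic rather than merely row-stochastic."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # scaled dot-product scores
    A = sinkhorn_normalize(np.exp(scores), n_iters)   # doubly stochastic plan
    return A @ V

# Tiny usage example: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = doubly_stochastic_attention(Q, K, V)
```

Because `np.exp(scores)` is strictly positive, it automatically has total support, so the scaling loop converges; after a few dozen iterations both the row and column sums of the intermediate attention matrix are close to one.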
Sources:
1967:
Concerning Nonnegative Matrices and Doubly Stochastic Matrices
https://projecteuclid.org/journalArticle/Download?urlId=pjm%2F1102992505
June 24, 2022:
Sinkformers: Transformers with Doubly Stochastic Attention
https://arxiv.org/pdf/2110.11773
February 10, 2025:
Gradient Multi-Normalization for Stateless and Scalable LLM Training
https://arxiv.org/pdf/2502.06742
July 12, 2025:
ESPFormer: Doubly-Stochastic Attention with Expected Sliced Transport Plans
https://arxiv.org/pdf/2502.07962