
This research investigates the theoretical and practical differences between the reconstruction-based and joint-embedding paradigms in self-supervised learning (SSL). By deriving the first closed-form solutions for both paradigms, the authors demonstrate that joint-embedding approaches are more robust when datasets contain high-magnitude irrelevant noise, such as complex backgrounds in images. Conversely, reconstruction is more effective for data with low-magnitude noise, which explains its success in natural language processing, where tokens are semantically dense. A critical finding is that, unlike supervised learning, SSL requires a precise alignment between the data augmentations and the noise structure in order to eliminate uninformative features. Ultimately, the work justifies the empirical dominance of latent-space prediction on challenging real-world datasets, where identifying and ignoring noise is essential for performance.
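To make the contrast concrete, below is a minimal NumPy sketch, not the paper's derivation, of a two-dimensional toy problem. It uses standard linear proxies for the two paradigms: a rank-1 linear autoencoder trained with MSE (reconstruction) recovers the top principal component and therefore chases whichever direction has the most variance, while a tied linear encoder that maximizes agreement between two augmented views (joint embedding) keeps the direction that is stable under augmentation. The variable names, the diagonal-covariance setup, and the choice of noise magnitude are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Toy data: a low-variance informative signal plus high-magnitude irrelevant noise.
signal = rng.normal(0.0, 1.0, size=n_samples)    # "content", shared across views
noise_a = rng.normal(0.0, 5.0, size=n_samples)   # high-magnitude nuisance, view A
noise_b = rng.normal(0.0, 5.0, size=n_samples)   # independently resampled nuisance, view B

view_a = np.stack([signal, noise_a], axis=1)     # x  = [signal, noise]
view_b = np.stack([signal, noise_b], axis=1)     # x' = [signal, noise'] (augmented view)

# Reconstruction proxy: the optimal rank-1 linear autoencoder under MSE keeps the
# top principal component of the data covariance, i.e. the highest-variance direction.
cov = view_a.T @ view_a / n_samples
_, eigvecs = np.linalg.eigh(cov)                 # eigenvalues in ascending order
recon_direction = eigvecs[:, -1]

# Joint-embedding proxy: a tied linear encoder maximizing cross-view agreement keeps
# the top direction of the (symmetrized) cross-covariance between the two views,
# i.e. the direction that survives the augmentation.
cross_cov = view_a.T @ view_b / n_samples
sym_cross = 0.5 * (cross_cov + cross_cov.T)
_, eigvecs_je = np.linalg.eigh(sym_cross)
je_direction = eigvecs_je[:, -1]

print("reconstruction direction  (|signal|, |noise| weight):", np.abs(recon_direction))
print("joint-embedding direction (|signal|, |noise| weight):", np.abs(je_direction))
# Expected: reconstruction latches onto the noise axis (~[0, 1]) because it has the
# larger magnitude, while joint embedding recovers the signal axis (~[1, 0]) because
# only the signal is shared across the two augmented views.
```

Note that the toy augmentation (independently resampling the noise coordinate) is exactly aligned with the nuisance variable; if the augmentation left the noise untouched, the joint-embedding proxy would no longer filter it out, which mirrors the paper's point about augmentation-noise alignment.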