Self-Supervised Learning for Strong Gravitational Lensing

Jul 28, 2023

Image by Yashwardhan Deshmukh using DALL-E and Canva

Greetings! I am excited to share my progress in Google Summer of Code ’23. I’ve spent the last few months working with ML4SCI on an interesting project:

DeepLense: Self-Supervised Learning for Strong Gravitational Lensing

My code and all results can be found on GitHub.

Special thanks to my mentors Sergei Gleyzer, Anna Parul, and Yurii Halychanskyi.

What is DeepLense?

DeepLense is a deep learning pipeline for particle dark matter searches with strong gravitational lensing.

What is strong gravitational lensing?

Strong gravitational lensing can be compared to a cosmic magnifying glass. It is a phenomenon that occurs when a massive object with strong gravity, like a big galaxy, bends light from a distant object. This makes the distant object appear distorted, magnified, or even duplicated into multiple images.

Joe: How can strong gravitational lensing help find this dark matter?

Good question, Joe! Dark matter is mysterious stuff. We can’t see it directly because it doesn’t emit or absorb light. However, gravity affects everything, even light. So if we observe more lensing than the visible matter alone can explain, we can infer that there must be additional matter there that we cannot see!

The project

Deep learning is the electricity of the 21st century. With the rise of self-attention mechanisms, we are able to produce sophisticated artificial neural network models capable of learning key morphological features as a part of their latent representation.

Using neural networks to learn representations of strong gravitational lensing images could lead to models that learn an embedding space with possibilities far beyond existing theories, which could speed up astronomical research and contribute significantly to our understanding of the universe!

Side Note: Supervised, unsupervised and domain adaptation methods have previously been tested by the DeepLense team and have shown good results.

1. Contrastive Learning

In short, contrastive learning is a self-supervised learning method that learns representations by contrasting positive and negative examples: similar samples are pulled together in the embedding space, and dissimilar ones are pushed apart.

As an intuition, the network should learn that two images of class ‘cold dark matter’ are very similar, while an image of class ‘axion’ is different from both. During self-supervised pretraining, however, no class labels are used: two augmented views of the same image form a positive pair, and views of different images in the batch serve as negative samples.

Contrastive Learning: Diagram by Yashwardhan Deshmukh using Canva

Each sample is augmented by applying either a random rotation or random Gaussian noise. This is called a pretext task.

(a) Each augmentation further splits into a positive and negative pair. (b) Visualization of the augmented dataset
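As a rough illustration of the two augmentations, here is a minimal NumPy sketch. It is not the project’s actual pipeline: the function names, the restriction to 90-degree rotations, the noise level `sigma=0.05`, and the random stand-in image are all assumptions made for brevity.

```python
import numpy as np

def random_rotation(image, rng):
    """Rotate an (H, W, C) image by a random multiple of 90 degrees.

    Assumption: quarter turns only, to keep the sketch dependency-free.
    """
    k = int(rng.integers(0, 4))  # 0, 1, 2 or 3 quarter turns
    return np.rot90(image, k=k, axes=(0, 1))

def random_gaussian_noise(image, rng, sigma=0.05):
    """Add zero-mean Gaussian noise; sigma is an illustrative value."""
    return image + rng.normal(0.0, sigma, size=image.shape)

def two_views(image, augment, rng):
    """Return two independently augmented views of the same image."""
    return augment(image, rng), augment(image, rng)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 1))  # hypothetical stand-in for a lensing image

noisy_1, noisy_2 = two_views(image, random_gaussian_noise, rng)
rotated_1, rotated_2 = two_views(image, random_rotation, rng)
```

The two views returned by `two_views` are what a loss like the one below would treat as a positive pair.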

"""
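Loss function used to train the contrastive network:
"""
import tensorflow as tf
from tensorflow import keras

# Method of the contrastive model class: it relies on self.temperature and
# on the self.contrastive_accuracy metric defined on that class.
def contrastive_loss(self, projections_1, projections_2):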
    # temperature = 0.1
    # InfoNCE loss (information noise-contrastive estimation)
    # NT-Xent loss (normalized temperature-scaled cross entropy)
    # Cosine similarity: the dot product of the l2-normalized feature vectors
    projections_1 = tf.math.l2_normalize(projections_1, axis=1)
    projections_2 = tf.math.l2_normalize(projections_2, axis=1)
    similarities = (tf.matmul(projections_1, projections_2, transpose_b=True) / self.temperature)
    
    # The similarity between the representations of two augmented views of the
    # same image should be higher than their similarity with other views
    batch_size = tf.shape(projections_1)[0]
    contrastive_labels = tf.range(batch_size)
    self.contrastive_accuracy.update_state(contrastive_labels, similarities)
    self.contrastive_accuracy.update_state(contrastive_labels, tf.transpose(similarities))
    
    # The temperature-scaled similarities are used as logits for cross-entropy
    # a symmetrized version of the loss is used here
    loss_1_2 = keras.losses.sparse_categorical_crossentropy(
        contrastive_labels, similarities, from_logits=True)
    loss_2_1 = keras.losses.sparse_categorical_crossentropy(
        contrastive_labels, tf.transpose(similarities), from_logits=True)
    return (loss_1_2 + loss_2_1) / 2

Once the model has learned useful representations through pretraining, we can reuse the same encoder with its learned weights and add a custom head to fine-tune it for classification or regression.

Shown below are the results of fine-tuning the contrastive learning model pretrained with the rotation pretext and with the Gaussian noise pretext. Notice how much better the fine-tuned model performs compared to its supervised baseline!
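To make the “same encoder plus custom head” idea concrete, here is a minimal Keras sketch. Everything specific in it is an assumption for illustration: the tiny stand-in encoder, the helper name `build_finetune_model`, the 128-unit hidden layer, and the choice of three classes. In practice the encoder would be the pretrained one with its learned weights.

```python
import tensorflow as tf
from tensorflow import keras

def build_finetune_model(encoder, input_shape, num_outputs, activation=None):
    """Attach a small task head to an existing (pretrained) encoder."""
    inputs = keras.Input(shape=input_shape)
    features = encoder(inputs)  # reuses the encoder and its weights
    x = keras.layers.Dense(128, activation="relu")(features)
    outputs = keras.layers.Dense(num_outputs, activation=activation)(x)
    return keras.Model(inputs, outputs)

input_shape = (64, 64, 1)  # hypothetical image size

# Hypothetical stand-in for the pretrained encoder.
encoder = keras.Sequential([
    keras.Input(shape=input_shape),
    keras.layers.Conv2D(16, 3, strides=2, activation="relu"),
    keras.layers.GlobalAveragePooling2D(),
])

num_classes = 3  # illustrative value
classifier = build_finetune_model(encoder, input_shape, num_classes, "softmax")
regressor = build_finetune_model(encoder, input_shape, 1)  # linear output
```

Setting `encoder.trainable = False` before compiling would train only the head; leaving it trainable fine-tunes the whole network.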

2. Bootstrap Your Own Latent (BYOL) Learning

A phrase that succinctly and metaphorically describes BYOL: “It is like two artists sketching the same landscape from slightly different perspectives, but blindfolded. One artist (online network) continuously updates their sketch (parameters) based on feedback (backpropagation) they receive, while the other artist (target network) periodically peeks at the first artist’s sketch and subtly adjusts their own. Over time, both artists develop their unique yet closely aligned interpretations of the landscape (data representations).”

BYOL Learning: Diagram by Yashwardhan Deshmukh using Canva

Essentially, BYOL trains two networks in parallel: the online network and the target network. Unlike contrastive learning, it uses no negative pairs. Two different augmented views of the same image are passed through the networks, and the online network learns to predict the target network’s representation of the other view. The target network is not trained by backpropagation; its weights are a moving average of the online network’s weights, so they update more slowly. This gives some stability during training.
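The moving-average update can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the project’s code: the function name `ema_update` and the decay values are made up, and the weights are plain lists of arrays (the format Keras returns from `get_weights()`).

```python
import numpy as np

def ema_update(target_weights, online_weights, tau=0.99):
    """Move each target weight a small step toward the matching online weight.

    tau close to 1 means the target network changes slowly.
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_weights, online_weights)]

# Toy weights: the online network sits at 1, the target starts at 0.
online = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]

for _ in range(3):
    target = ema_update(target, online, tau=0.9)

# After 3 steps every target entry equals 1 - 0.9**3 = 0.271
print(target[1])
```

With Keras models, the same helper could be applied as `target_net.set_weights(ema_update(target_net.get_weights(), online_net.get_weights()))` after each training step, where `target_net` and `online_net` are hypothetical model names.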

"""
Loss function used to train the BYOL network:
"""
import tensorflow as tf

def byol_loss(p, z):
    # Normalize the prediction p and the target z to unit length, so that
    # their row-wise dot product is the cosine similarity.
    # The loss 2 - 2 * mean_similarity = 2 * (1 - mean_similarity) is
    # minimized when the cosine similarity between p and z is maximized.
    p = tf.math.l2_normalize(p, axis=1)
    z = tf.math.l2_normalize(z, axis=1)

    similarities = tf.reduce_sum(tf.multiply(p, z), axis=1)
    return 2 - 2 * tf.reduce_mean(similarities)

Through iterative training, the two networks develop shared representations that capture the underlying structure of the data. This learned latent space can then be used for downstream tasks or fine-tuned with labeled data to achieve better performance.

Vision Transformers (ViT) as an encoder

A Vision Transformer is a type of neural network used to capture representations from images for computer vision tasks. Recently, transformer architectures (not the ones we learned about in school that transfer electrical energy) have solved many language processing problems thanks to their non-sequential processing and self-attention mechanisms. More about them can be read in the paper here.

Vision Transformers divide an image into a series of patches, converting each patch into a vector. These vectors are input into a Transformer encoder, which consists of multiple self-attention layers. The self-attention mechanism enables the model to understand the long-range interactions between the patches, which is crucial for image classification. The model learns how various segments of the image contribute to the final classification label. The Transformer encoder outputs a sequence of vectors that serve as the image’s features. These features are subsequently utilized for image classification.
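The first step, cutting an image into patches and turning each patch into a vector, can be sketched with NumPy alone. The image size, patch size, embedding width, and the random projection matrix below are illustrative assumptions; in a real ViT the projection is a learned layer and position embeddings are added before the encoder.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    rows, cols = h // patch_size, w // patch_size
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (rows, cols, p, p, c)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 1))  # hypothetical single-channel image

patches = image_to_patches(image, patch_size=16)  # 16 patches, 256 values each

# Linear projection of each patch to an embedding vector (random stand-in
# for the learned projection).
embed_dim = 32
projection = rng.normal(0.0, 0.02, size=(patches.shape[1], embed_dim))
tokens = patches @ projection  # shape (16, 32): one token per patch
```

The `tokens` array is the sequence that would be fed to the Transformer encoder.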

Two main reasons to implement a vision transformer as an encoder:

Scalability: ViTs are highly scalable, meaning they can be effectively trained on larger datasets and high-resolution images. Their performance often improves with more data and larger model sizes.
Long-Range Dependencies: The self-attention mechanism enables the model to capture long-range dependencies between pixels or patches, which is often difficult for convolutional neural networks (CNNs) to achieve without deeper architectures.

Regression!

The goal here is to explore the properties of dark matter. We want to approximate the mass density of the vortex substructure of dark matter condensates on simulated strong lensing images. For this, I used the above two methods, contrastive learning and BYOL, to capture representations present in the axion images, which could be beneficial for our regression task.

First, a log10 transformation is applied to the labels, for two major reasons:

Skewed Distribution: A log transformation can make the distribution more normal-like.
High-Value Outliers: As the labels contain a few high and low values that are exerting undue influence on the regression model, a log transformation can mitigate their influence.
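A small sketch of the label transformation and its inverse is shown below. The synthetic labels are an assumption (positive values drawn log-uniformly over six orders of magnitude), used only to show the effect; they are not the project’s data.

```python
import numpy as np

def to_log10(mass):
    """Transform positive labels to log10 space for training."""
    return np.log10(mass)

def from_log10(log_mass):
    """Map model predictions in log10 space back to the original scale."""
    return np.power(10.0, log_mass)

rng = np.random.default_rng(0)
# Hypothetical, strongly right-skewed labels.
mass = 10.0 ** rng.uniform(-3.0, 3.0, size=1000)

log_mass = to_log10(mass)
recovered = from_log10(log_mass)

# In the original scale the largest label dwarfs the smallest one;
# in log space all labels fall inside a range of width 6.
print(mass.max() / mass.min(), log_mass.max() - log_mass.min())
```

The model is then trained on `log_mass`, and predictions are mapped back with `from_log10` when the original units are needed.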

Here is the result of the predicted mass vs. the actual mass on a test set. This shows that self-supervised learning can be extended to regression tasks as well!