Introduction

In my previous research, I focused on creating an image captioning model for high-resolution histopathology whole-slide images. Histology slides routinely reach sizes of 150,000x150,000 pixels, and modern deep learning architectures need to be adapted to handle such large resolutions. To this end, I created a novel encoder-decoder architecture that uses pre-trained Vision Transformers as encoders to process smaller patches of a large image and a BERT-based decoder to generate a description for the whole image.

These automatic report-generation systems can be helpful for image retrieval tasks and for better understanding of the images themselves. Please refer to this Paper and the associated GitHub code for a more detailed explanation of my work.

A previous version that used an LSTM-based decoder was accepted as an Extended Abstract at the Machine Learning for Health Conference 2023 in New Orleans (ML4H, co-located with NeurIPS). You can find that publication here: paper.

One of the drawbacks of these methods is that the models are not inherently interpretable: when we attention-pool the patch representations before feeding them into the language decoder, we lose the patch-level location information.
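To make the problem concrete, here is a minimal PyTorch sketch of the attention pooling step described above; the number of patches (4,096) and embedding dimension (384) are illustrative assumptions, not the actual model configuration. All patch embeddings are collapsed into a single slide-level vector, so the decoder never sees which patch contributed what.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Collapse (num_patches, dim) patch embeddings into one slide vector."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # scalar attention score per patch

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (num_patches, dim)
        weights = torch.softmax(self.score(patch_embeddings), dim=0)  # (num_patches, 1)
        # Weighted sum over patches -> (dim,); patch identity/location is gone.
        return (weights * patch_embeddings).sum(dim=0)

pooled = AttentionPool(dim=384)(torch.randn(4096, 384))
print(pooled.shape)  # torch.Size([384]) -- one vector for the whole slide
```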

Here we detail a new method that builds a more interpretable model by preserving this localization information, which lets us associate generated tokens with image regions. This is previewed here:

Dataset

The dataset consists of 25,120 WSI-text pairs from the GTEx Portal, covering 40 different tissue types. It is split into 23,517 training, 603 validation, and 1,000 testing samples, stratified by tissue, following the same split as Zhang et al.
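For illustration, a tissue-stratified split like the one above could be produced roughly as sketched below. The DataFrame `pairs` and its `tissue_type` column are hypothetical names; the actual split files from Zhang et al. are not reproduced here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_split(pairs: pd.DataFrame, seed: int = 42):
    # Hold out 1,603 samples (603 val + 1,000 test), stratified by tissue type.
    train, holdout = train_test_split(
        pairs, test_size=1603, stratify=pairs["tissue_type"], random_state=seed
    )
    val, test = train_test_split(
        holdout, test_size=1000, stratify=holdout["tissue_type"], random_state=seed
    )
    return train, val, test  # ~23,517 / 603 / 1,000 rows
```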

Here is what the tissue distribution looks like:

Caption Pre-processing

Image Pre-processing

Method Overview

Our method is inspired by neural machine translation, as illustrated here:

Following the same encoder-decoder architecture, we constructed the following:

A: We pre-extract embeddings using HIPT, which encodes the image in a hierarchical manner. We are grateful to Chen et al. for the pre-trained weights.
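A simplified sketch of this pre-extraction step is shown below. `tile_fn` and `region_encoder` are stand-ins for the actual WSI tiling code and the frozen pre-trained HIPT weights from Chen et al.; the region size, file names, and shapes are assumptions for illustration only.

```python
import torch

@torch.no_grad()
def extract_patch_embeddings(slide_path: str, region_encoder, tile_fn) -> torch.Tensor:
    """Tile a WSI into large regions, encode each with the frozen hierarchical
    encoder, and return a (num_regions, dim) tensor that can be cached to disk
    and reused during caption training."""
    embeddings = []
    for region in tile_fn(slide_path):      # e.g. 4096x4096 tensors of shape (3, H, W)
        region = region.unsqueeze(0)        # (1, 3, H, W)
        embeddings.append(region_encoder(region).squeeze(0))
    return torch.stack(embeddings)          # (num_regions, dim)

# Hypothetical usage:
# emb = extract_patch_embeddings("GTEX-XXXX.svs", hipt_region_encoder, load_wsi_regions)
# torch.save(emb, "GTEX-XXXX.pt")
```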

B: We add a learnable [CLS] token in front of all the patch tokens and feed them to the patch encoder. We separately feed the thumbnail image of the WSI to the thumbnail encoder. We then concatenate the two representations so that each patch token carries the added image-level context vector from the thumbnail. This concatenated WSI representation is fed to our transformer decoder for caption generation.
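A condensed PyTorch sketch of step B follows. Module names and dimensions are assumptions for illustration, and a generic `nn.TransformerDecoder` stands in for the BERT-based decoder; this is not the exact training code.

```python
import torch
import torch.nn as nn

class WSICaptioner(nn.Module):
    def __init__(self, patch_dim=384, d_model=768, vocab_size=30522):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, patch_dim))  # learnable [CLS]
        enc_layer = nn.TransformerEncoderLayer(patch_dim, nhead=6, batch_first=True)
        self.patch_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.thumb_encoder = nn.Sequential(nn.Linear(patch_dim, patch_dim), nn.GELU())
        self.proj = nn.Linear(2 * patch_dim, d_model)  # patch token ++ thumbnail context
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_emb, thumb_emb, caption_ids):
        # patch_emb: (B, N, patch_dim) pre-extracted HIPT embeddings
        # thumb_emb: (B, patch_dim) thumbnail features; caption_ids: (B, T)
        cls = self.cls_token.expand(patch_emb.size(0), -1, -1)
        tokens = self.patch_encoder(torch.cat([cls, patch_emb], dim=1))   # (B, N+1, D)
        context = self.thumb_encoder(thumb_emb).unsqueeze(1)              # (B, 1, D)
        context = context.expand(-1, tokens.size(1), -1)
        memory = self.proj(torch.cat([tokens, context], dim=-1))          # (B, N+1, d_model)
        tgt = self.token_emb(caption_ids)
        T = caption_ids.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(out)                                          # (B, T, vocab)
```

Because the decoder cross-attends over individual patch positions rather than a single pooled vector, its cross-attention weights can in principle be mapped back to slide regions, which is the interpretability property described above.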

Evaluation

We evaluate our model using the following metrics:

We see that the model described here performs well on these metrics:

Comparative Evaluation

The models chosen for comparison:

Here are the results:

Here is a qualitative comparison between different models:

Summary