Our final submission for the TU Berlin course "Computer Vision for Remote Sensing" (Winter 22/23). The goal of our project was to compare different sampling criteria for active learning applied to a remote sensing image captioning system. We used an encoder-decoder pipeline consisting of a Vision Transformer and the GPT-2 language model (https://huggingface.co/nlpconnect/vit-gpt2-image-captioning) and trained it on NWPU-Captions, the largest remote sensing image captioning dataset, consisting of 31,500 RGB images with five human-written captions each (https://github.com/HaiyanHuang98/NWPU-Captions). To the best of our knowledge, this model achieves the best image captioning results published to date on this dataset. The focus of this report, however, lies on our analysis of active learning in the context of remote sensing image captioning.
As an example, here is an image from the NWPU-Captions dataset together with its five reference sentences:

- The stadium is shaped like a ring.
- The circular stadium is next to the lawn and parking lot.
- The stadium is next to the grass and parking lot.
- Many buildings are next to the stadium.
- There are some green lawns near the stadium.
We set up an active learning framework based on the idea of training in cycles. In each of nine cycles, the model is trained for ten epochs and validated on the validation set, and we then label a number of samples corresponding to 5% of the full training set size (that is, of the labeled and unlabeled training data combined). We experimentally tested each of the sampling queries described below.
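The cycle logic can be sketched as follows; `train_fn` and `sample_fn` are hypothetical stand-ins for the actual training and query steps of our pipeline:

```python
import random

def active_learning_loop(train_fn, sample_fn, labeled, unlabeled,
                         cycles=9, epochs_per_cycle=10, step_fraction=0.05):
    """Sketch of the cycle-based training loop described above."""
    full_size = len(labeled) + len(unlabeled)
    step = int(full_size * step_fraction)  # 5% of the full training set
    for cycle in range(cycles):
        train_fn(labeled, epochs=epochs_per_cycle)
        # validation would happen here, then the criterion picks new samples
        chosen = sample_fn(unlabeled, k=min(step, len(unlabeled)))
        for sample in chosen:
            unlabeled.remove(sample)
            labeled.append(sample)  # "labeling" = moving into the train pool
    return labeled

# toy usage with a random-sampling criterion
labeled = list(range(5))           # 5% initial labeled pool (toy numbers)
unlabeled = list(range(5, 100))
result = active_learning_loop(lambda data, epochs: None,
                              lambda pool, k: random.sample(pool, k),
                              labeled, unlabeled)
```

After nine cycles of 5% each, the labeled pool has grown from 5% to 50% of the data in this toy setup.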
Simulating an active learning process by randomly sampling from the unlabeled dataset serves as a baseline for the sampling criteria.
Training on the full dataset without sampling any data gives an upper bound on the performance that can be achieved.
Uncertainty criteria aim to find datapoints located close to the decision boundary, as those might contain features the model has not learned yet and are thus highly informative for the training process.
We run inference on the unlabeled data and assign each datapoint an uncertainty value derived from the probabilities with which its prediction was emitted. Since each word in a generated sentence is emitted with its own probability, we only consider the word with the lowest confidence. In recurrent sentence generation, later words are emitted with high probability and low variance because they strongly depend on the preceding words; the word of lowest confidence is therefore usually located near the beginning of a sentence.
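A minimal sketch of this criterion, assuming we already have per-word emission probabilities for each generated caption (the helper names are illustrative, not taken from the project code):

```python
def least_confidence_score(word_probs):
    """Uncertainty of a caption: 1 minus the probability of its least
    confident word (hypothetical helper)."""
    return 1.0 - min(word_probs)

def select_least_confident(pool, k):
    """pool: list of (sample_id, word_probs); return the k most uncertain ids."""
    ranked = sorted(pool, key=lambda item: least_confidence_score(item[1]),
                    reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]

# two generated captions: the first has a very uncertain opening word
pool = [("img_a", [0.35, 0.90, 0.95]),
        ("img_b", [0.80, 0.85, 0.99])]
print(select_least_confident(pool, 1))  # → ['img_a']
```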
This sampling criterion works similarly to least confidence sampling, but we measure the confidence of a word as the margin between the probabilities of the most likely and second most likely predictions.
Diversity criteria aim to sample a subset which is as diverse as possible. In practice, this is achieved by extracting features and clustering them using k-means to find structures within the dataset. Then, one sample per cluster is selected to cover a wide range of features.
We enforce diversity in terms of image features by clustering the latent vectors produced by the image encoder.
We enforce diversity in terms of text features by using the pretrained BERT model (https://huggingface.co/docs/transformers/model_doc/bert) to extract features from the predicted sentences. Those features are then processed as described above.
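A sketch of this selection using scikit-learn's `KMeans`. The function name and the choice of one random member per cluster follow the description above; the real implementation may differ in its details:

```python
import numpy as np
from sklearn.cluster import KMeans

def diversity_sample(features, k, seed=42):
    """Cluster feature vectors (image-encoder latents or BERT sentence
    embeddings) into k groups, then pick one random member per cluster."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
    rng = np.random.default_rng(seed)
    return [int(rng.choice(np.where(km.labels_ == c)[0])) for c in range(k)]

# toy usage: two well-separated blobs yield one pick per blob
feats = np.array([[0.0, 0.0], [0.1, 0.0],
                  [5.0, 5.0], [5.1, 5.0]])
picks = diversity_sample(feats, k=2)
```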
We combined the most successful uncertainty and diversity criteria by first finding the most uncertain samples, clustering them, and selecting one random sample per cluster. Restricting clustering to the most uncertain samples has the advantage of significantly reducing the cost of the computationally intensive clustering step.
Similarly, we combined the criteria in the reverse order, which does not offer the advantage of reduced complexity but can still offer insights into the active learning process. Here, we clustered the image features, randomly selected one sample per cluster, and labeled the most uncertain of those samples.
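Both fusion orders reduce to pre-filtering with one criterion and selecting with the other. A sketch of the uncertainty-first variant; the pre-filter factor and all names here are illustrative assumptions, not values from the project:

```python
import numpy as np

def fuse_uncertainty_then_diversity(scores, features, k, cluster_fn,
                                    prefilter=4):
    """Keep the prefilter*k most uncertain samples, then let the diversity
    criterion pick k of them. cluster_fn(features, k) returns one local
    index per cluster. The prefilter factor is an illustrative choice."""
    order = np.argsort(scores)[::-1]          # most uncertain first
    candidates = order[: prefilter * k]
    local = cluster_fn(features[candidates], k)
    return [int(candidates[i]) for i in local]

# toy usage with a stand-in cluster_fn that just takes the first k indices
scores = np.array([0.1, 0.9, 0.5, 0.7, 0.2, 0.8])
features = np.zeros((6, 2))
picks = fuse_uncertainty_then_diversity(scores, features, k=1,
                                        cluster_fn=lambda f, k: list(range(k)),
                                        prefilter=2)
print(picks)  # → [1], the single most uncertain sample survives both stages
```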
None of our selected criteria were able to beat the random baseline on any of our three measured metrics: the BLEU-4, ROUGE-L, and METEOR scores. While the single criteria achieved results similar to the baseline, the fusion of criteria performed significantly worse. We have identified a problem in our pipeline that leads to this unexpected behavior but have yet to fix it. In our paper, we propose adjustments to the single criteria as well as new methods that could be implemented in the future and have the potential to improve the active learning process. Our pipeline proved to be well suited for remote sensing image captioning, as we achieved test results better than the state of the art after training on the full dataset without active learning.
Our codebase was developed to run stably under python==3.9.15. With other releases, unexpected behaviour might occur.
```shell
# clone project
git clone https://git.tu-berlin.de/wallburg/activelearning_ic.git

# install requirements
pip install -r requirements.txt
```

Install git lfs, then fetch and unpack the dataset:

```shell
git lfs clone https://github.com/lowlorenz/NWPU-Captions.git
tar -xzf NWPU-Captions/NWPU_images.tar.gz -C NWPU-Captions
```

```shell
# module folder
cd activelearning_ic

# run training
python src/main.py --sample_method [SAMPLE_METHOD] --cluster_mode [CLUSTER_MODE] --conf_mode [CONF_MODE] --args
```

The slurm scripts for all experiments are provided. Inside the script,

```shell
#SBATCH -o /path/to/output/log
#SBATCH -D /home/dir/activelearning_ic/
```

need to be set. Start training:
```shell
# module folder
cd activelearning_ic

# run module via slurm
sbatch [SCRIPT_NAME].sh

# monitor output
tail -f [OUT].log
```

| Name | Default | Type | Description |
|---|---|---|---|
| epochs | 10 | int | Number of epochs to train per cycle. |
| max_cycles | 5 | int | Number of active learning cycles to train. |
| init_set_size | 0.05 | float | Initial train set size as a fraction of the full training set. |
| new_data_size | 0.05 | float | Fraction of new labels added per cycle. |
| learning_rate | 0.0001 | float | Learning rate of the optimizer. |
| batch_size | 4 | int | Batch size. |
| sample_method | "cluster" | str | Sampling method to retrieve more labels. |
| device_type | "cuda" | str | Device to train on. |
| run_name | "test" | str | Name of the run. |
| data_path | "NWPU-Captions/" | str | Path to the NWPU-Captions dataset. |
| debug | False | bool | Debug mode. |
| val_check_interval | 1.0 | float | Validation check interval. |
| num_devices | 1 | int | Number of devices to train on. |
| num_nodes | 1 | int | Number of nodes to train on. |
| ckpt_path | "checkpoints" | str | Path to store or load checkpoint. |
| mode | "train" | str | Choose between train and test mode. |
| seed | 42 | int | Global random seed. |
| conf_mode | "least" | str | Whether to sample based on "least" confidence or "margin" of confidence. |
| conf_average | "sentence" | str | Whether to sample based on average "sentence" confidence or minimum "word" confidence. |
| cluster_mode | "image" | str | Whether to use the image or text embeddings for clustering. |
| mutliple_sentence_loss | False | bool | Whether to compute the loss against multiple reference sentences. |
This automatically selects the validation set and predicts on it using a pretrained model:
```shell
# module folder
cd activelearning_ic

# predict
python validation_generator.py --ckpt [[NAME]_[SEED]-[TIME]-[CYCLE].ckpt]
```

Please find the model weights in the tubCloud.
- project
- src
- main.py # main Python script for running the project
- utils.py # dataset logic for masking and loading the data
- evaluation.py # logic for generating meteor, bleu and rouge-L scores
- model.py # logic on how to infer / train / predict / validate on the different data loaders
- prediction_writer.py # defines how the data is logged during training
- rouge.py # logic for calculating the rouge-L score
- strategies.py # logic for the different selection strategies
- validation_generator.py # used to generate predictions on the validation set (easily adjustable for the test set)
- _evaluation_... # scripts for generating the plots in this project
- metrics.iypnb # the script that is used for general analysis of our runs
- confidence_per_word_length.ipynb # plots the confidence per word position
- full_model_on_testset.iypnb # extensive analysis of the full model
    - generation_on_testset.ipynb # a notebook in which a model can be run on the test set
    - latent_space_visualization.iypnb # visualization of the BERT latent space using UMAP
    - sample_selection.ipnb # used for analyzing the selected samples over the course of the strategy


