Linear Alignment of Vision-language Models for Image Captioning

Research output: Working paper and reportsPreprint

Abstract

Recently, vision-language models like CLIP have advanced the state of the art in a variety of multi-modal tasks including image captioning and caption evaluation. Many approaches adapt CLIP-style models to a downstream task by training a mapping network between CLIP and a language model. This is costly as it usually involves calculating gradients for large models. We propose a more efficient training protocol that fits a linear mapping between image and text embeddings of CLIP via a closed-form solution. This bypasses the need for gradient computation and results in a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods. Moreover, we propose two new learning-based image-captioning metrics that build on CLIP score along with our linear mapping. Furthermore, we combine ReCap with our new metrics to design an iterative datastore-augmentation loop (DAL) based on synthetic captions. We evaluate ReCap on MS-COCO, Flickr30k, VizWiz, and MSRVTT. ReCap achieves performance comparable to state-of-the-art lightweight methods on established metrics while outperforming them on our new metrics, which are better aligned with human ratings on Flickr8k-Expert and Flickr8k-Crowdflower. Finally, we demonstrate that ReCap transfers well to other domains and that our DAL leads to a performance boost.
Original languageEnglish
Number of pages26
Publication statusPublished - 2024

Publication series

NamearXiv.org

Fields of science

  • 305907 Medical statistics
  • 202017 Embedded systems
  • 202036 Sensor systems
  • 101004 Biomathematics
  • 101014 Numerical mathematics
  • 101015 Operations research
  • 101016 Optimisation
  • 101017 Game theory
  • 101018 Statistics
  • 101019 Stochastics
  • 101024 Probability theory
  • 101026 Time series analysis
  • 101027 Dynamical systems
  • 101028 Mathematical modelling
  • 101029 Mathematical statistics
  • 101031 Approximation theory
  • 102 Computer Sciences
  • 102001 Artificial intelligence
  • 102003 Image processing
  • 102004 Bioinformatics
  • 102013 Human-computer interaction
  • 102018 Artificial neural networks
  • 102019 Machine learning
  • 102032 Computational intelligence
  • 102033 Data mining
  • 305901 Computer-aided diagnosis and therapy
  • 305905 Medical informatics
  • 202035 Robotics
  • 202037 Signal processing
  • 103029 Statistical physics
  • 106005 Bioinformatics
  • 106007 Biostatistics

JKU Focus areas

  • Digital Transformation

Cite this