The Training Strategies Behind Vision Language Models (VLMs)
Ever wondered how Vision Language Models actually learn to understand both images and text?
Below are the four main training paradigms, each with distinct strengths:
1️⃣ Contrastive Learning: These models learn by matching similar image-text pairs while pushing apart dissimilar ones. Think CLIP and its variants. They're relatively inexpensive to train and excel at tasks like image search and zero-shot classification (see sketch 1 after this list).
2️⃣ Masking-Based Learning: These models learn by filling in the blanks, reconstructing masked image patches given the text description or predicting masked words given the image. This bidirectional objective develops a deeper understanding of the relationship between visual and textual information (sketch 2 below).
3️⃣ Generative Models: These can generate complete images from text or produce detailed captions from images. They are the most expensive to train but offer the most flexibility for creative tasks. DALL-E and modern image captioning systems fall into this category (sketch 3 below).
4️⃣ Pretrained Backbone Approaches: The practical integrators. These leverage existing pretrained models, such as open-source LLMs (Llama, for example), and learn a small mapping network that connects a vision encoder to the language model, reducing training costs by building on proven foundations (sketch 4 below).
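
Sketch 1, contrastive: a minimal PyTorch take on a CLIP-style symmetric InfoNCE loss, assuming paired image and text embeddings are already computed by two encoders. The batch size, embedding width, and temperature below are illustrative choices, not values from the paper or from CLIP's released code.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor,
                    text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched image-text pairs."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity matrix: entry (i, j) scores image i against text j.
    logits = image_emb @ text_emb.t() / temperature
    # Matched pairs sit on the diagonal, so the target is the row index:
    # pull the diagonal up, push every other pairing down.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage: random vectors standing in for encoder outputs.
imgs = torch.randn(8, 512)   # 8 image embeddings
txts = torch.randn(8, 512)   # their 8 matching text embeddings
print(clip_style_loss(imgs, txts).item())
```

The supervision here is just "match the diagonal", which is part of why this family trains cheaply on large noisy web-scraped pairs.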
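Sketch 2, masking: a toy masked-patch reconstruction objective where hidden image patches are predicted from the visible patches plus the text, in the spirit of MAE/FLAVA-style objectives. The tiny transformer and all sizes are assumptions for illustration, not a specific published architecture.

```python
import torch
import torch.nn as nn

class MaskedPatchModel(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        # Learned placeholder inserted wherever a patch is hidden.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.reconstruct = nn.Linear(dim, dim)  # predict the original patch embedding

    def forward(self, patches, text_tokens, mask):
        # mask: bool tensor (batch, n_patches); True = patch is hidden.
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand_as(patches), patches)
        # Masked patches attend to both visible patches and the text.
        h = self.encoder(torch.cat([text_tokens, x], dim=1))
        pred = self.reconstruct(h[:, text_tokens.size(1):])
        # Reconstruction loss only on the masked positions.
        return ((pred - patches) ** 2)[mask].mean()

# Toy usage: random features standing in for real encoder outputs.
model = MaskedPatchModel()
patches = torch.randn(2, 16, 256)   # 2 images, 16 patch embeddings each
text = torch.randn(2, 8, 256)       # 8 text-token embeddings each
mask = torch.rand(2, 16) < 0.5      # hide roughly half the patches
print(model(patches, text, mask).item())
```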
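Sketch 3, generative: a toy autoregressive captioner trained with next-token prediction on a sequence that starts with image features. Vocabulary size, sequence lengths, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCaptioner(nn.Module):
    def __init__(self, vocab: int = 1000, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, 2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, image_feats, caption_ids):
        # Prepend image features as a prefix the text can attend to.
        x = torch.cat([image_feats, self.embed(caption_ids)], dim=1)
        # Causal mask: each position only sees earlier positions.
        n = x.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.decoder(x, mask=causal)
        # Score only the text positions; position t predicts token t+1.
        logits = self.lm_head(h[:, image_feats.size(1):])
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            caption_ids[:, 1:].reshape(-1),
        )

# Toy usage: random image features and random caption token ids.
model = TinyCaptioner()
image_feats = torch.randn(2, 4, 256)           # 4 image prefix tokens
caption_ids = torch.randint(0, 1000, (2, 12))  # toy caption token ids
print(model(image_feats, caption_ids).item())
```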
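Sketch 4, pretrained backbones: the recipe of freezing both backbones and training only a small projector that maps vision features into the LLM's token-embedding space (LLaVA popularized this pattern). The modules below are random stand-ins for real pretrained encoders and LLMs, and the sizes are toy assumptions.

```python
import torch
import torch.nn as nn

vision_dim, llm_dim, vocab = 768, 512, 1000  # toy sizes; a real Llama uses ~4096 dim, ~32k vocab

# Random stand-ins for pretrained, frozen backbones.
vision_encoder = nn.Linear(1024, vision_dim)   # placeholder for a frozen ViT
llm_embeddings = nn.Embedding(vocab, llm_dim)  # placeholder LLM embedding table
for module in (vision_encoder, llm_embeddings):
    for p in module.parameters():
        p.requires_grad = False                # backbones stay frozen

# The only trainable piece: a small MLP mapping network.
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

image_feats = torch.randn(1, 1024)             # toy pooled image features
prompt_ids = torch.randint(0, vocab, (1, 5))   # toy prompt token ids

visual_token = projector(vision_encoder(image_feats)).unsqueeze(1)  # (1, 1, llm_dim)
text_tokens = llm_embeddings(prompt_ids)                            # (1, 5, llm_dim)
# The frozen LLM would now consume this as one embedded sequence.
llm_input = torch.cat([visual_token, text_tokens], dim=1)
print(llm_input.shape)  # torch.Size([1, 6, 512])
```

Only the projector's parameters would go to the optimizer, which is what makes this recipe so much cheaper than training end to end.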
Understanding these training paradigms helps you choose the right model for your application. Need fast zero-shot classification? Go contrastive. Building a captioning system? Consider generative or masking approaches. Working with limited compute? Pretrained backbones might be your answer.
Paper link: https://arxiv.org/html/2405.17247v1#S1


