Multi-Modal Pre-Training for Multimedia Understanding

ACM International Conference on Multimedia Retrieval (ICMR) Workshop

12 - 15 July 2021, Taipei, Taiwan, China


Representation learning has always been a key challenge in many multi-modal tasks, such as image-text retrieval, visual question answering, video localization, and speech recognition. Throughout the history of multi-modal learning, model initialization has been one of the most important factors. For example, research on weight initialization laid a foundation for neural-network-based methods, and image features pre-trained on the Visual Genome dataset established a new standard setting for many vision-language models. Recently, multi-modal pre-training has emerged as a new paradigm of model initialization that achieves state-of-the-art performance on many multimedia tasks. Pre-trained models outperform traditional methods by providing stronger representations of different modalities, learned in an unsupervised manner. Multi-modal pre-training has attracted rapidly growing interest in many fields and at their intersections, including computer vision, natural language processing, and speech recognition. Through the continuous effort of many works, training cost has also dropped: in very recent work, the training time of a vision-language pre-training model has been reduced to as little as 10 hours on 2 Titan RTX GPUs.

Despite this emerging trend, multi-modal pre-training remains unexplored in many aspects. For example, studies of standard settings for fair comparison of different multi-modal pre-training models would benefit the research community. More discussion of the efficiency of sub-modules and pre-training tasks would also give us a more thorough understanding of the pre-training mechanism. Improving training efficiency is also worth exploring.

Call for Papers

The goals of this workshop are to (1) investigate research opportunities in multi-modal model initialization, especially multi-modal pre-training, (2) solicit novel methodologies for multi-modal pre-training, and (3) explore and discuss the advantages and possibilities of pre-training for more multimedia tasks. We welcome contributions on multi-modal model initialization and multi-modal pre-training involving image, language, video, speech, and other modalities.

Topics of interest include (but are not limited to):

  • Multi-modal self-supervised learning
  • Multi-modal pre-training tasks
  • Multi-modal pre-training optimization
  • Multi-modal representation learning
  • Multi-modal model optimization
  • Multi-modal model initialization
  • Cross-modality retrieval
  • Lightweight multi-modal pre-training
  • Multi-modality alignment and parsing
  • Advanced multi-modal applications
  • Benchmark datasets and novel evaluation methods

Important Dates

  • April 20, 2021

    Deadline for Workshop Paper Submission.

  • May 20, 2021

    Acceptance Notification of Workshop Papers.


To be announced.


Bei Liu

Microsoft Research Asia

Jianlong Fu

Microsoft Research Asia

Shizhe Chen


Qin Jin

Renmin University of China

Alexander Hauptmann

Carnegie Mellon University

Yong Rui

Lenovo Group