Multi-Modal Pre-Training for Multimedia Understanding

ACM International Conference on Multimedia Retrieval (ICMR) Workshop

16 - 19 November 2021, Taipei, Taiwan, China


Representation learning has always been a key challenge in many multi-modal tasks, such as image-text retrieval, visual question answering, video localization, and speech recognition. Throughout the history of multi-modal learning, model initialization has been one of the most important factors. For example, research on weight initialization laid a foundation for neural-network-based methods, and image features pre-trained on the Visual Genome dataset set a new standard for many vision-language models. Recently, multi-modal pre-training has emerged as a new paradigm of model initialization that establishes state-of-the-art performance on many multimedia tasks. Pre-trained models outperform traditional methods by providing stronger representations of different modalities, learned in an unsupervised manner. Multi-modal pre-training has attracted rapidly growing interest in many fields and at their intersections, including computer vision, natural language processing, and speech recognition. With the continuous effort of many works, the training time of a vision-language pre-training model has been reduced to as little as 10 hours on 2 Titan RTX GPUs in very recent work. Despite this emerging trend, multi-modal pre-training remains unexplored in many aspects. For example, studies of standard settings for fair comparison of different multi-modal pre-training models would benefit the research community. More discussion of the efficiency of sub-modules and pre-training tasks would also give us a more thorough understanding of pre-training mechanisms. Improving training efficiency is also worth exploring.

Call for Papers

The goals of this workshop are to (1) investigate research opportunities in multi-modal model initialization, especially multi-modal pre-training, (2) solicit novel methodologies for multi-modal pre-training, and (3) explore and discuss the advantages and possibilities of pre-training for more multimedia tasks. We expect contributions concerning multi-modal model initialization and multi-modal pre-training, involving image, language, video, speech, etc.

The topics of interest include (but are not limited to):

  • Multi-modal self-supervised learning
  • Multi-modal pre-training task
  • Multi-modal pre-training optimization
  • Multi-modal representation learning
  • Multi-modal model optimization
  • Multi-modal model initialization
  • Cross-modality retrieval
  • Lightweight multi-modal pre-training
  • Multi-modality alignment and parsing
  • Advanced multi-modal applications
  • Benchmark datasets and novel evaluation methods

Important Dates

  • April 25, 2021 (originally April 20, 2021)

    Deadline for Workshop Paper Submission.

  • May 15, 2021 (originally May 20, 2021)

    Acceptance Notification of Workshop Papers.

  • June 20, 2021 (originally May 30, 2021)

    Camera-ready date for Workshop Papers.

  • November 16, 2021

    Workshop Day.

Paper Submission

Paper Format

All papers must be formatted according to the ACM proceedings style, for which LaTeX and Word templates are available. Please use "sample-sigconf.tex" as the LaTeX template or "ACM_SigConf.doc" as the Word template.

Length of the Paper

We invite the following two types of papers:

Full Paper: limited to 8 pages, including all text, figures, and references. Full papers should describe original content with evaluations. They will be reviewed by more than two experts based on:
  1. Originality of the content
  2. Quality of the content based on evaluation
  3. Relevance to the theme
  4. Clarity of the written presentation

Short Paper: limited to 4 pages, including all text, figures, and references. Short papers should describe work in progress as position papers. They will be reviewed by two experts based on:
  1. Originality of the content
  2. Relevance to the theme
  3. Clarity of the written presentation

Submission Website

Submissions should be made through here.

Camera-ready Instruction

Preparation instructions are available for authors' final submissions. Please review them for authors' and speakers' submission types and specific deadlines.

The MMPT'21 recorded presentation video instructions:


  • 9:00-9:05


  • 9:05-9:40

        Ruihua Song

    Dr. Ruihua Song is a tenured Associate Professor at the Gaoling School of AI, Renmin University. She was previously Lead Researcher at Microsoft Research Asia and Chief Scientist of Microsoft XiaoIce. In May 2017, the first AI-authored poem collection, titled “The Sunshine Lost Windows”, was published; she contributed the generation algorithm behind it. She has published more than 90 papers and holds 25 patents. She has served SIGIR, ACL, KDD, WWW, EMNLP, CIKM, etc. as Short Paper Chair, Area Chair, Senior PC member, or PC member. Her recent research interests include multi-modal understanding of natural language, multi-modal dialogue systems, and AI-based creation.

    Title: WenLan: Efficient Large-Scale Multi-Modal Pre-Training on Real World Data

    Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs by assuming that there exists a strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project 'WenLan' led by our team. Specifically, under the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. We construct a large Chinese multi-source dataset of 650 million image-text pairs for pre-training our model. Extensive experiments demonstrate that WenLan is effective on various downstream tasks and makes it easy to build efficient applications based on search between images and texts.

  • 9:45-10:30

    Session 1 (Session Chair: Bei Liu)

       9:45-10:00:    Be Specific, Be Clear: Bridging Machine and Human Captions by Scene-Guided Transformer

       10:00-10:15:   Language-Conditioned Region Proposal and Retrieval Network for Referring Expression Comprehension

       10:15-10:30:   Residual Recurrent CRNN for End-to-End Optical Music Recognition on Monophonic Scores

  • 10:35-11:10

        Limin Wang

    Dr. Limin Wang received the B.Sc. degree from Nanjing University, Nanjing, China, in 2011, and the Ph.D. degree from the Chinese University of Hong Kong, Hong Kong, in 2015. From 2015 to 2018, he was a Post-Doctoral Researcher with the Computer Vision Laboratory, ETH Zurich. He is currently a Professor with the Department of Computer Science and Technology, Nanjing University. His research interests include computer vision and deep learning. He was the first runner-up at the ImageNet Large Scale Visual Recognition Challenge 2015 in scene recognition, and the winner at the ActivityNet Large Scale Activity Recognition Challenge 2016 in video classification. He has served as a Senior PC member or Area Chair for AAAI 2021 and IJCAI 2021. He is a member of the IEEE and ACM.

    Title: Cross-modal Pretraining and Matching for Video Understanding

    Videos are generally accompanied by multi-modal information such as audio, text, and motion. This multi-modal information is becoming an important cue for understanding video content. How to model the correlation between multiple modalities in videos is still an unsolved problem in video understanding tasks such as video action recognition, video temporal grounding, and video description. In this talk, we focus on two specific video understanding tasks (i.e., cross-modal self-supervised pretraining and temporal grounding) by exploiting video-text cross-modal information. In particular, we notice that videos are naturally accompanied by abundant text information such as YouTube titles, Instagram captions, and movie scripts. This textual information can guide the training of a multi-modal network, which can be used as a general video representation to be fine-tuned on downstream tasks, or as a cross-modal matching similarity for video segment retrieval.

  • 11:10-11:55

    Session 2 (Session Chair: Jianlong Fu)

       11:10-11:25:   Style-Guided Image-to-Image Translation for Multiple Domains

       11:25-11:40:   A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods

       11:40-11:55:   Unsupervised Training Data Generation of Handwritten Formulas using Generative Adversarial Networks with Self-Attention

  • 11:55-12:00



Bei Liu

Microsoft Research Asia

Jianlong Fu

Microsoft Research Asia

Shizhe Chen


Qin Jin

Renmin University of China

Alexander Hauptmann

Carnegie Mellon University

Yong Rui

Lenovo Group