Workshop on Multimodal Learning

In conjunction with CVPR 2020.

Seattle, WA - June 19th 2020 (Full Day)

The exploitation of big data in recent years has led to major advances in many applications of Computer Vision. However, most of the tasks tackled so far involve mainly the visual modality, due to the unbalanced number of labelled samples available across modalities (e.g., there are many large labelled datasets for images, but far fewer for audio or IMU-based classification), resulting in a large performance gap when algorithms are trained separately on each modality.

This workshop aims to bring together the machine learning and multimodal data fusion communities. We expect contributions involving video, audio, depth, IR, IMU, laser, text, drawings, synthetic data, etc. Position papers, feasibility studies, and work on cross-modality issues with a strongly applicative flavour are also encouraged; we therefore expect a positive response from both academic and industrial communities.

This is an open call for papers, soliciting original contributions considering recent findings in theory, methodologies, and applications in the field of multimodal machine learning. Potential topics include, but are not limited to:

  • Multimodal learning
  • Cross-modal learning
  • Self-supervised learning for multimodal data
  • Multimodal data generation and sensors
  • Unsupervised learning on multimodal data
  • Cross-modal adaptation
  • Multimodal data fusion
  • Multimodal transfer learning
  • Multimodal applications (e.g. drone vision, autonomous driving, industrial inspection, etc.)
  • Machine Learning studies of unusual modalities


Papers are limited to 8 pages in the CVPR format (cf. the main conference author guidelines). All papers will be reviewed by at least two reviewers under a double-blind policy. Papers will be selected based on relevance, significance and novelty of results, technical merit, and clarity of presentation. Papers can be published in the CVPR 2020 proceedings if desired (Track 1); authors will be asked to flag this option on the submission site. Please note that the deadlines for Track 1 (with IEEE/CVF proceedings) and Track 2 (no proceedings) are different.

All Track 1 papers should be submitted via the CMT website.
For Track 2, please send your submission by email, including author names and affiliations in the email.

Important Dates

    Track 1 (with IEEE/CVF proceedings)

    • Deadline for submission: March 10th, 2020 - 23:59 Pacific Standard Time
    • Notification of acceptance: April 10th, 2020
    • Camera Ready submission deadline: April 17th, 2020 (extended to April 19th, 2020)

    Track 2 (no proceedings)

    Please send your submission by email.

    • Deadline for submission: April 20th, 2020 (extended to April 25th, 2020) - 23:59 Pacific Standard Time
    • Notification of acceptance: May 15th, 2020
    • Camera Ready submission deadline: May 31st, 2020

    Workshop date: June 19th (Full Day)


Conference virtual platform at this link


[NEW] Invited talks have been recorded with the permission of the speakers. YouTube links are provided below.

[NEW] Proceedings are available at this link


Friday, June 19th (2nd-time events run over into June 20th; all times PDT)

Session 1 - Session chairs: Vittorio Murino, Pietro Morerio

[08:30 AM - 08:45 AM] - Welcome by organizers
[08:30 PM - 08:45 PM] (2nd-time)

[08:45 AM - 09:35 AM] - Keynote 1 – Label Efficient Visual Abstractions for Autonomous Driving - Prof. Andreas Geiger
[08:45 PM - 09:35 PM] (2nd-time)
Abstract It is well known that semantic segmentation can be used as an effective intermediate representation for learning driving policies. However, the task of street scene semantic segmentation requires expensive annotations. Furthermore, segmentation algorithms are often trained irrespective of the actual driving task, using auxiliary image-space loss functions which are not guaranteed to maximize driving metrics such as safety or distance traveled per intervention. In this talk, I will quantify the impact of reducing annotation costs on learned behavior cloning agents. I will analyze several different segmentation-based modalities for the task of self-driving in the CARLA simulator, and discuss the trade-off between annotation efficiency and driving performance. I will present several practical insights into how segmentation-based visual abstractions can be exploited in a more label efficient manner, resulting in reduced variance of the learned policies.

[09:35 AM - 09:55 AM] - Oral 1 – W70: CPARR: Category-based Proposal Analysis for Referring Relationships - Chuanzi He (USC); Haidong Zhu (University of Southern California)*; Jiyang Gao (Waymo); Kan Chen (University of Southern California); Ram Nevatia (U of Southern California)
[09:35 PM - 09:55 PM] (2nd-time)

[09:55 AM - 10:15 AM] - Oral 2 – W70: Improved Active Speaker Detection based on Optical Flow - Chong Huang (UC Santa Barbara)*; Kazuhito Koishida (Microsoft)
[09:55 PM - 10:15 PM] (2nd-time)

Session 2 - Session chairs: Bodo Rosenhahn, Paolo Rota

[10:30 AM - 11:20 AM] - Keynote 2 – W70 TBA - Dr. Andrew Fitzgibbon
[10:30 PM - 11:20 PM] (2nd-time)

[11:20 AM - 11:40 AM] - Oral 3 – W70: Interactive Video Retrieval with Dialog - Sho Maeoki (The University of Tokyo)*; Kohei Uehara (The University of Tokyo); Tatsuya Harada (The University of Tokyo / RIKEN)
[11:20 PM - 11:40 PM] (2nd-time)

[11:40 AM - 12:00 PM] - Oral 4 – W70: Self-Supervised Object Detection and Retrieval Using Unlabeled Videos - Elad Amrani (IBM / Technion)*; Rami Ben-Ari (IBM-Research); Inbar Shapira (IBM); Tal Hakim (University of Haifa); Alex Bronstein (Technion)
[11:40 PM - 00:00 AM] (2nd-time)

[12:00 PM - 12:20 PM] - Oral 5 – W70: Quality and Relevance Metrics for Selection of Multimodal Pretraining Data - Roshan M Rao (UC Berkeley)*; Sudha Rao (Microsoft Research); Elnaz Nouri (Microsoft); Debadeepta Dey (Microsoft); Asli Celikyilmaz (Microsoft Research); Bill Dolan (Microsoft)
[00:00 AM - 00:20 AM] (June 20th) (2nd-time)

[12:30 PM - 01:30 PM] - Keynote 3 – W70 TBA - Prof. Jitendra Malik
[00:30 AM - 01:30 AM] (June 20th) (2nd-time)

Session 3 - Session chairs: Yan Huang, Michael Yang

[02:50 PM - 03:10 PM] - Oral 6 – W70: Multi-modal Dense Video Captioning - Vladimir Iashin (Tampere University)*; Esa Rahtu (Tampere University)
[02:50 AM - 03:10 AM] (June 20th) (2nd-time)

[03:10 PM - 03:30 PM] - Oral 7 – W70: Cross-modal variational alignment of latent spaces - Thomas Theodoridis (Centre for Research and Technology-Hellas); Theocharis Chatzis (Centre for Research and Technology-Hellas); Vassilis Solachidis (Centre for Research and Technology-Hellas)*; Kosmas Dimitropoulos (Centre for Research and Technology-Hellas); Petros Daras (ITI-CERTH, Greece)
[03:10 AM - 03:30 AM] (June 20th) (2nd-time)

Session 4 - Session chairs: Amir Zadeh, Michael Yang

[04:00 PM - 04:50 PM] - Keynote 4 – W70 TBA - Prof. Dhruv Batra
[04:00 AM - 04:50 AM] (June 20th) (2nd-time)

[04:50 PM - 05:10 PM] - Oral 8 – W70: Exploring Phrase Grounding without Training: Contextualisation and Extension to Text-Based Image Retrieval - Letitia E Parcalabescu (Heidelberg University)*; Anette Frank (Heidelberg University)
[04:50 AM - 05:10 AM] (June 20th) (2nd-time)

[05:10 PM - 05:30 PM] - Oral 9 – W70: Classification-aware Semi-supervised Domain Adaptation - Gewen He (Florida State University)*; Xiaofeng Liu (CMU); Fangfang Fan (Harvard); Jane You (HK)
[05:10 AM - 05:30 AM] (June 20th) (2nd-time)

[05:30 PM - 05:50 PM] - Oral 10 – W70: A Dataset and Benchmarks for Multimedia Social Analysis - Bofan Xue (University of California, Berkeley)*; David Chan (University of California, Berkeley); John Canny (University of California, Berkeley)
[05:30 AM - 05:50 AM] (June 20th) (2nd-time)

[05:50 PM - 06:00 PM] - Final Remarks -
[05:50 AM - 06:00 AM] (June 20th) (2nd-time)

Invited Speakers

Jitendra Malik was born in Mathura, India in 1960. He received the B.Tech degree in Electrical Engineering from the Indian Institute of Technology, Kanpur in 1980 and the PhD degree in Computer Science from Stanford University in 1985. In January 1986, he joined the University of California at Berkeley, where he is currently the Arthur J. Chick Professor in the Department of Electrical Engineering and Computer Sciences. He is also on the faculty of the Department of Bioengineering, and the Cognitive Science and Vision Science groups. During 2002-2004 he served as the Chair of the Computer Science Division, and as the Department Chair of EECS during 2004-2006 as well as 2016-2017. Since January 2018, he has also been Research Director and Site Lead of Facebook AI Research in Menlo Park.

Dhruv Batra is an Associate Professor in the School of Interactive Computing at Georgia Tech and a Research Scientist at Facebook AI Research (FAIR). His research interests lie at the intersection of machine learning, computer vision, natural language processing, and AI. The long-term goal of his research is to develop agents that 'see' (or more generally perceive their environment through vision, audition, or other senses), 'talk' (i.e. hold a natural language dialog grounded in their environment), 'act' (e.g. navigate their environment and interact with it to accomplish goals), and 'reason' (i.e., consider the long-term consequences of their actions).

Andrew Fitzgibbon leads the “All Data AI” (ADA) research group at Microsoft in Cambridge, UK. He is best known for his work on 3D vision, having been a core contributor to the Emmy-award-winning 3D camera tracker “boujou“, to body tracking for Kinect for Xbox 360, and for the articulated hand-tracking interface to Microsoft’s HoloLens. His research interests are broad, spanning computer vision, machine learning, programming languages, computer graphics and occasionally a little neuroscience. He is a fellow of the Royal Academy of Engineering, the British Computer Society, and the International Association for Pattern Recognition, and is a Distinguished Fellow of the British Machine Vision Association. Before joining Microsoft in 2005, he was a Royal Society University Research Fellow at Oxford University, having previously studied at Edinburgh University, Heriot-Watt University, and University College, Cork.

Andreas Geiger is a Full Professor at the University of Tübingen and a Group Leader at MPI-IS Tübingen. His research focuses on computer vision and machine learning, with an emphasis on 3D scene understanding, parsing, reconstruction, and material and motion estimation for autonomous intelligent systems such as self-driving cars or household robots. In particular, he and his group investigate how complex prior knowledge can be incorporated into computer vision algorithms to make them robust to variations in our complex 3D world.


Organizers

Yan Huang


Li Liu

NUDT & University of Oulu

Louis-Philippe Morency

Carnegie Mellon University

Pietro Morerio

Istituto Italiano di Tecnologia

Vittorio Murino

Università degli Studi di Verona & Huawei

Matti Pietikäinen

University of Oulu

Bodo Rosenhahn

Leibniz-Universität Hannover

Paolo Rota

Università di Trento

Liang Wang


Qi Wu

University of Adelaide

Michael Ying Yang

University of Twente

Amir Zadeh

Carnegie Mellon University


For additional info please contact us here