Universal Representations for Computer Vision
Thursday, 24 November 2022
In recent years, deep learning has achieved impressive results on many computer vision tasks. However, the best performance on each task is achieved by designing and training an independent network per task and per domain or modality, e.g. image classification, depth estimation, audio classification, or optical flow estimation. By contrast, humans form powerful internal visual representations in their early years of development, and these are subject to only small refinements in response to later visual experience. Once formed, these representations are universal and are later employed in many diverse vision tasks, from reading text and recognizing faces to interpreting visual art forms and anticipating the movement of the car in front of us.
The presence of universal representations in computer vision has important implications. First, it means that vision has limited complexity: a growing number of visual domains and tasks can be modelled with a bounded number of representations. As a result, one can use a compact set of representations to learn multiple modalities, domains and tasks, and efficiently share features and computations across them, which is crucial on platforms with limited computational resources such as mobile devices and autonomous cars. Second, as we obtain more complete universal representations, learning new domains and tasks becomes easier and can be done more efficiently from only a few samples by transfer learning. Third, universal representations enable computer vision models with increased capabilities for scene understanding, including semantics, geometry, motion, and audio.
Learning universal representations requires addressing several challenges. These include improved architecture design for modelling diverse visual data, interfaces that allow effective interaction between them, and methods for tackling interference and dominance during optimization. Although architecture designs and training algorithms for universal representation learning have been explored, key difficulties (such as task interference) in learning compact and well-generalised universal representations over modalities, domains and tasks remain. For instance, we currently have no established models, comparable to ResNet or Vision Transformers, that can solve multiple problems across various modalities and domains. Our aim is to raise the computer vision community's awareness of these challenges, as new and effective solutions are likely to be needed for a breakthrough in learning universal representations.
Multimodal Video Search by Examples
Thursday, 24 November 2022
Videos are being generated in large numbers and volumes by the public, video conferencing tools (e.g. Teams, Zoom, Webex), and TV broadcasters such as the BBC. The videos may be stored in public archives such as YouTube or proprietary archives such as the BBC Rewind Archive. Videos in these archives are typically indexed by pre-defined metadata such as titles, tags, and viewer notes, and are searchable by keywords. Commercial video search engines such as YouTube, Vidrovr, Panopto, and IronYun are traditionally keyword-based. Using these engines, we can search by any word spoken or shown on screen, or by traditional metadata. However, it is difficult to use keywords to search for specific moments in a video where a particular speaker talks about a specific topic at a particular location. Furthermore, most videos have little or no metadata, and automatic metadata extraction is not yet sufficiently reliable. Video search by keywords therefore has limitations.
Video search by examples is a desirable alternative, as it allows searching for any content by providing examples of it, but it is notoriously hard. To improve search performance, multiple modalities could be considered, e.g. face, voice, context (or setting) and topic: each modality provides a separate search cue, and multiple cues together should identify relevant content more accurately than any individual cue alone.
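As a rough illustration of how several cues might be combined, the following minimal sketch ranks archive clips by a weighted average of per-modality cosine similarities (a simple late-fusion scheme). It is not a method proposed by the workshop: the function names, the equal weights, and the two-dimensional toy embeddings are all hypothetical, and a real system would obtain embeddings from trained face, voice, context and topic models.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def fused_score(query, candidate, weights):
    """Weighted average of cosine similarities over the modalities
    that both the query example and the candidate clip provide."""
    shared = [m for m in weights if m in query and m in candidate]
    total = sum(weights[m] for m in shared)
    if total == 0:
        return 0.0
    return sum(weights[m] * cosine(query[m], candidate[m]) for m in shared) / total

def search(query, archive, weights):
    """Return (clip_id, score) pairs sorted from most to least similar."""
    scored = [(clip_id, fused_score(query, emb, weights))
              for clip_id, emb in archive.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical equal weights for the four cues named in the text.
WEIGHTS = {"face": 1.0, "voice": 1.0, "context": 1.0, "topic": 1.0}

# Toy example: the query supplies only a face and a voice embedding.
query = {"face": [1.0, 0.0], "voice": [0.0, 1.0]}
archive = {
    "clip_a": {"face": [1.0, 0.0], "voice": [0.0, 1.0]},  # matches face and voice
    "clip_b": {"face": [1.0, 0.0], "voice": [1.0, 0.0]},  # matches face only
    "clip_c": {"face": [0.0, 1.0], "voice": [1.0, 0.0]},  # matches neither
}

ranking = search(query, archive, WEIGHTS)
for clip_id, score in ranking:
    print(clip_id, round(score, 2))
```

In this toy archive, clip_a and clip_b are indistinguishable on the face cue alone; adding the voice cue is what separates them, which is the intuition behind combining modalities.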
The Multimodal Video Search by Examples Workshop, which is part of the BMVC 2022 Workshop Programme, aims to provide a forum on content-based video search and to encourage in-depth discussion of technical and application issues.
Workshop topics include, but are not limited to:
- Content-based image and video search
- Multimodal image and video search
- Image and video segmentation
- Image and video embedding
- Re-ranking for image and video search
- Rank aggregation for image and video search
- Human-centred AI for image and video search