Polarity Loss for Zero-shot Object Detection

Shafin Rahman
Salman Khan
Nick Barnes

Australian National University, Data61-CSIRO, Inception Institute of AI
[GitHub]
[Paper + Supplementary Material]


(Top Left) Traditional ZSD approaches align visual features (solid dots) to their corresponding semantics (boat/airplane/bicycle) without considering related semantic concepts (black text). This results in a fragile description of an unseen class (train) and causes confusion with the background and seen classes (bottom left). (Top Right) Our approach automatically attends to related semantics from an external vocabulary and reshapes the semantic embedding such that visual features are well-aligned with seen word vectors and related semantics. Moreover, it maximizes the inter-class separation, which avoids confusion between unseen classes and background (bottom right).


Zero-shot object detection is an emerging research topic that aims to recognize and localize previously ‘unseen’ objects. This setting gives rise to several unique challenges, e.g., a highly imbalanced positive vs. negative instance ratio, ambiguity between background and unseen classes, and the proper alignment between visual and semantic concepts. Here, we propose an end-to-end deep learning framework underpinned by a novel loss function that puts more emphasis on difficult examples to counter class imbalance. We call our objective the ‘Polarity loss’ because it explicitly maximizes the gap between positive and negative predictions. Such a margin-maximizing formulation is important as it improves the visual-semantic alignment while resolving the ambiguity between background and unseen classes. Our approach is inspired by embodiment theories in cognitive science, which claim that human semantic understanding is grounded in past experiences (seen objects), related linguistic concepts (word dictionary) and the perception of the physical world (visual imagery). To this end, we learn to attend to a dictionary of related semantic concepts that eventually refines the noisy semantic embeddings and helps establish a better synergy between the visual and semantic domains. Our extensive results on the MS-COCO and Pascal VOC datasets show as high as 14x mAP improvement over the state of the art.
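The margin-maximizing idea behind the Polarity loss can be sketched as follows. This is a simplified, illustrative re-implementation (not the released code): a focal-style term down-weights easy examples, and a hypothetical sigmoid penalty term grows when a negative class score approaches the positive (ground-truth) class score, which pushes positive and negative predictions apart.

```python
import numpy as np

def polarity_loss(preds, target_idx, gamma=2.0, beta=5.0):
    """Illustrative sketch of a polarity-style loss for a single anchor.

    preds:      per-class sigmoid scores, shape (num_classes,)
    target_idx: index of the ground-truth class
    gamma:      focal-loss focusing parameter
    beta:       steepness of the (hypothetical) margin penalty

    Assumptions (not from the released code): the penalty is applied
    only to negative classes, via a sigmoid of the score gap.
    """
    preds = np.clip(np.asarray(preds, dtype=float), 1e-7, 1 - 1e-7)
    p_pos = preds[target_idx]  # score of the ground-truth class
    loss = 0.0
    for i, p in enumerate(preds):
        is_positive = (i == target_idx)
        # focal-loss convention: p_t is the probability of the correct label
        p_t = p if is_positive else 1.0 - p
        focal = -((1.0 - p_t) ** gamma) * np.log(p_t)
        if is_positive:
            penalty = 1.0
        else:
            # grows toward 1 as the negative score approaches/exceeds p_pos,
            # explicitly penalizing a small positive-negative margin
            penalty = 1.0 / (1.0 + np.exp(-beta * (p - p_pos)))
        loss += penalty * focal
    return loss
```

A quick sanity check: well-separated predictions (e.g. `[0.9, 0.1, 0.1]` with target class 0) incur a much smaller loss than ambiguous ones (e.g. `[0.5, 0.5, 0.5]`), since the latter are both hard under the focal term and margin-violating under the penalty term.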


Code


 [GitHub]


Paper

S. Rahman, S. H. Khan, N. Barnes.
Polarity Loss for Zero-Shot Object Detection.
arXiv preprint, 2020. [Paper]

S. Rahman, S. H. Khan, N. Barnes.
Improved Visual-Semantic Alignment for Zero-Shot Object Detection.
AAAI, 2020. [Paper]


Results

Our framework allows us to detect both seen and unseen object categories in natural images during inference. We achieve significant improvement over the state of the art.



To test our model in the wild, we apply it to some example videos from the YouTube-8M dataset released by Google AI. The results are demonstrated in the videos below.

Link to Video 1
Link to Video 2

The above results are for the Generalized Zero-Shot Detection setting. Seen/unseen objects are enclosed in yellow/red bounding boxes.


Citation

S. Rahman, S. H. Khan and N. Barnes, "Polarity Loss for Zero-shot Object Detection," arXiv preprint arXiv:1811.08982, 2020.
S. Rahman, S. H. Khan and N. Barnes, "Improved Visual-Semantic Alignment for Zero-Shot Object Detection," 34th AAAI Conference on Artificial Intelligence (AAAI), New York, US, 2020.

@article{rahman2020polarity,
title={Polarity Loss for Zero-shot Object Detection},
author={Rahman, Shafin and Khan, Salman and Barnes, Nick},
journal={arXiv preprint arXiv:1811.08982},
year={2020}}

@inproceedings{rahman2020improved,
title={Improved Visual-Semantic Alignment for Zero-Shot Object Detection},
author={Rahman, Shafin and Khan, Salman and Barnes, Nick},
booktitle={34th AAAI Conference on Artificial Intelligence},
publisher = {AAAI},
year={2020}}



Previous Work

This work is a follow-up to our previous contribution: Zero-shot Detection: Learning to Simultaneously Recognize and Localize Novel Concepts.



Acknowledgement

This project page template was modified from this webpage.