
Salman Khan

I'm a computer vision researcher affiliated with MBZUAI and ANU.


Research Projects

Life-long Learning


The ability to incrementally learn novel tasks and acquire new knowledge is necessary for life-long machine learning. Deep neural networks suffer from ‘catastrophic forgetting’, a phenomenon that occurs when a network is trained sequentially on a series of tasks and the learning acquired on new tasks interferes with previously learned concepts. As an example, in a typical transfer learning scenario, when a model pre-trained on a source task is adapted to another task by fine-tuning its weights, its performance on the source task degrades significantly because the weights learned for that task are overwritten by the newly learned parameters. It is therefore necessary to develop continual learning models that can incrementally add newly available classes without retraining from scratch on all previous class sets (a cumulative setting).
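
As a toy illustration of the forgetting effect described above (a sketch only, not one of the models developed in this project), the following fits a single-weight linear model to a hypothetical task A and then fine-tunes it on a conflicting hypothetical task B. The tasks, the learning rate, and the helper names `mse` and `sgd` are all assumptions made for the example.

```python
def mse(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def sgd(w, data, lr=0.05, epochs=200):
    """Plain stochastic gradient descent on the squared error."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return w


xs = [-1.0, -0.5, 0.5, 1.0]
task_a = [(x, 2.0 * x) for x in xs]    # hypothetical source task: y = 2x
task_b = [(x, -1.0 * x) for x in xs]   # hypothetical new task: y = -x

w_a = sgd(0.0, task_a)                 # train on task A only
loss_a_before = mse(w_a, task_a)

w_b = sgd(w_a, task_b)                 # fine-tune the same weight on task B
loss_a_after = mse(w_b, task_a)        # task A error after fine-tuning
loss_b_after = mse(w_b, task_b)

print(f"task A loss before fine-tuning: {loss_a_before:.6f}")
print(f"task A loss after fine-tuning:  {loss_a_after:.6f}")
print(f"task B loss after fine-tuning:  {loss_b_after:.6f}")
```

Because the single weight is shared, fitting task B necessarily overwrites what was learned for task A. Continual learning methods aim to avoid this without retraining on all earlier data.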

Multi-modal and Multi-task Foundational Models


Pre-trained vision-language (V-L) foundational models such as CLIP have shown excellent generalization ability to downstream tasks. However, their quick adaptation to downstream tasks and novel modalities poses challenges. As an example, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal, since it does not allow the flexibility to dynamically adjust both representation spaces on a downstream task. We have been working on multi-modal prompt learning as a quick adaptation strategy for foundational models. In parallel, the development and adaptation of foundational models for diverse modalities, e.g., videos and remote sensing, is also a main focus of this research stream.

Zero/Few-shot Object Detection


Since its inception, zero/few-shot learning research has been dominated by the object classification problem. Although classification remains a challenging task, the recognition setting has a number of limitations that render it less suitable for real-life scenarios. First, the recognition setting works only for simpler cases where a single dominant object is present. Second, the attributes and semantic descriptions are relevant to individual objects instead of the entire scene composition. Third, zero/few-shot recognition provides an answer to unseen or rare categories in elementary tasks, e.g., classification and retrieval, yet it is unable to scale to advanced tasks such as scene interpretation and contextual modeling, which require fundamental reasoning about all salient objects. This research theme seeks to address these challenges in an object detection framework.

Adversarial Attacks and Defenses


Adversarial examples contain small, human-imperceptible perturbations specifically designed by an adversary to fool a learned model. These examples pose a serious threat to security-critical applications, e.g., autonomous cars, biometric identification and surveillance systems. Furthermore, if a slight perturbation added to a benign input drastically changes the deep network’s output with high confidence, this suggests that our current models are not distinctively learning the fundamental visual concepts. Therefore, the study of various adversarial attacks and the design of robust deep networks go a long way towards developing reliable and trustworthy artificial intelligence systems.
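
To make the idea of a small but damaging perturbation concrete, here is a minimal sketch of the fast gradient sign method (FGSM) applied to a hand-built logistic classifier. FGSM is used only as a well-known example attack, not as the specific attack studied in this project. The weights `w`, the input `x`, and the budget `eps` are hypothetical values chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of class 1 under a linear logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One FGSM step: move each input coordinate by eps in the
    direction that increases the cross-entropy loss for label y."""
    p = predict(w, b, x)
    grad_x = [(p - y) * wi for wi in w]  # d(loss)/dx for the logistic loss
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]


# Hypothetical 100-dimensional model and a benign input of class 1.
w = [1.0 if i % 2 == 0 else -1.0 for i in range(100)]
b = 0.0
x = [0.02 * wi for wi in w]
y = 1

x_adv = fgsm(w, b, x, y, eps=0.05)

print(f"clean confidence for class 1:       {predict(w, b, x):.3f}")      # about 0.88
print(f"adversarial confidence for class 1: {predict(w, b, x_adv):.3f}")  # about 0.05
print(f"largest per-coordinate change:      {max(abs(a - c) for a, c in zip(x_adv, x)):.3f}")
```

No coordinate changes by more than `eps`, yet the many small changes add up across dimensions and flip the prediction. This accumulation is one reason high-dimensional models can be fragile.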

Spectral Representations


Spectral signatures of natural scenes were earlier found to be distinctive for different scene types with varying spatial envelope properties such as openness, naturalness, ruggedness, and symmetry. Recently, such handcrafted features have been outclassed by deep-learning-based representations. This project focuses on the development of novel spectral descriptions of convolution features, implemented efficiently as a unitary transformation within deep network architectures. We show that the spectral transformation decorrelates convolutional activations, which reduces co-adaptation between feature detections and thus acts as an effective regularizer.
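
The text above does not name the specific transform, so the sketch below assumes an orthonormal DCT-II purely as a familiar example of a real unitary transformation. It checks two properties that make such a transform convenient inside a network: it preserves the energy of the activations, and it is inverted exactly by its transpose. The activation values are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (a real unitary matrix)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def apply(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


# Hypothetical, smoothly varying (hence highly correlated) activations.
activations = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7]

T = dct_matrix(len(activations))
coeffs = apply(T, activations)
reconstructed = apply(transpose(T), coeffs)

energy_in = sum(a * a for a in activations)
energy_out = sum(c * c for c in coeffs)

print(f"energy before / after: {energy_in:.6f} / {energy_out:.6f}")
print(f"share of energy in the first coefficient: {coeffs[0] ** 2 / energy_out:.3f}")
```

For correlated inputs such as these, most of the energy lands in a few spectral coefficients. This is only an informal hint at the decorrelation effect, not a reproduction of the project's results.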

Change Detection


This project relates to the automatic detection of changes in visual imagery. First, we address the problem of weakly supervised change detection in a pair of color images. Our structured deep learning based approach requires only image-level labels to simultaneously detect and localize changes. Second, we address the forest change detection problem using satellite imagery. Our proposed approach is capable of performing change analysis at a much finer temporal resolution by recovering the missing information and automatically learns strong features from the raw surface reflectance data.

Scene Categorization


This project addresses the challenging problem of scene categorization, where diverse intra-class variations and confusing inter-class similarities exist. We develop two novel solutions to this problem. First, a new approach is developed which exploits rich mid-level convolutional features in a transformed domain to categorize indoor scenes. The feature representation not only incorporates the discriminative aspects of the target dataset, but also encodes the features in terms of the general object categories that are present in indoor scenes. Second, we propose a new learnable feature descriptor to handle large-scale deformations caused by spatial layout and scale variations in indoor scenes.

Geometry Estimation


Objects' spatial layout estimation and clutter identification are two important tasks for understanding indoor scenes. We propose to solve both of these problems in a joint framework using RGBD images of indoor scenes. In contrast to recent approaches which focus on either one of these two problems, we perform 'fine-grained structure categorization' by predicting all the major objects and simultaneously labeling the cluttered regions. A conditional random field model is proposed to incorporate a rich set of local appearance, geometric features, and interactions between the scene elements.

Semantic Labelling


Inexpensive structured light sensors can capture rich information from indoor scenes, and scene labeling problems provide a compelling opportunity to make use of this information. In this work, we present a novel Conditional Random Field (CRF) model to effectively utilize depth information for semantic labeling of indoor scenes. At the core of the model, we propose a novel and efficient plane detection algorithm which is robust to erroneous depth maps. Our CRF formulation defines local, pairwise and higher order interactions between image pixels.

Shadow Detection and Removal


We present a framework to automatically detect and remove shadows in real-world scenes from a single image. Previous works on shadow detection put a lot of effort into designing shadow-variant and shadow-invariant hand-crafted features. In contrast, our framework automatically learns the most relevant features in a supervised manner using multiple convolutional deep neural networks (ConvNets). Using the detected shadow masks, we propose a Bayesian formulation to accurately extract the shadow matte and subsequently remove shadows. The Bayesian formulation is based on a novel model which accurately models the shadow generation process in the umbra and penumbra regions.

Signature Verification


Handwritten signatures are one of the most socially acceptable and traditionally used person identification and authentication metrics. Although a number of authentication systems based on handwritten signatures have been proposed, little attention has been paid to employing signatures for person identification. In this work, we address both the identification and verification problems related to the analysis of dynamic handwritten signatures. In this way, the need to present a user name before biometric verification can be eliminated in current signature-based biometric authentication systems. A compressed sensing approach is used for user identification and to reject a query signature that does not belong to any user in the database.

Template Security


Exposure of unprotected authentication data is a high-risk threat for organizations with an online presence. The challenge is: how to ensure the security of multi-factor authentication data without deteriorating the performance of an identity verification system? To solve this problem, we present a novel framework that applies random projections to biometric data (inherence factor), using secure keys derived from passwords (knowledge factor), to generate inherently secure, efficient and revocable/renewable biometric templates for user verification. We evaluate the security strength of the framework against possible attacks by adversaries.
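
The following is a minimal sketch of the general idea of password-keyed random projection, not the framework's actual construction. It assumes PBKDF2 for key derivation and a Gaussian projection matrix, and the feature vector, salt, passwords and helper names are hypothetical. Python's `random` module is used only to keep the example dependency-free; it is not a cryptographically secure generator and would need replacing in any real system.

```python
import hashlib
import math
import random


def derive_seed(password, salt, iterations=100_000):
    """Derive a projection seed from the password (knowledge factor)."""
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return int.from_bytes(key, "big")


def projection_matrix(seed, out_dim, in_dim):
    """Gaussian random projection matrix determined entirely by the seed.
    NOTE: random.Random is NOT cryptographically secure (illustration only)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def make_template(features, password, salt, out_dim=16):
    """Project biometric features (inherence factor) with the keyed matrix."""
    mat = projection_matrix(derive_seed(password, salt), out_dim, len(features))
    return [sum(m * f for m, f in zip(row, features)) for row in mat]


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


salt = b"per-user-salt"                           # hypothetical salt
features = [0.1 * i for i in range(32)]           # hypothetical enrolled features
noisy_features = [f + 0.01 for f in features]     # same user, slightly noisy capture

enrolled = make_template(features, "correct horse", salt)
query = make_template(noisy_features, "correct horse", salt)
reissued = make_template(features, "new password", salt)  # template after revocation

print(f"distance, genuine query:     {distance(enrolled, query):.4f}")
print(f"distance, reissued template: {distance(enrolled, reissued):.4f}")
```

A genuine query with the right password stays close to the enrolled template. Changing the password yields an unrelated template from the same biometric, which is what makes a compromised template revocable.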