Publications

♠ indicates equal technical contribution

A novel Reversible Vision Transformer that allows training arbitrarily deep transformers with the per-image memory of a single-layer ViT model while matching accuracy, FLOPs & #parameters.
Computer Vision and Pattern Recognition 2022 (Oral)

An online video transformer architecture with memory caching for efficient long-term video recognition that achieves state-of-the-art results on action detection (AVA) & anticipation.
Under Review (Code and paper coming soon)

An improved state-of-the-art (2022) MViTv2 architecture for recognition and detection tasks on both images (ImageNet and COCO) and videos (Kinetics and AVA).
Under Review

We present Ynet -- a scene-aware trajectory prediction model that factorizes uncertainty into its epistemic (goal-related) and aleatoric (path-related) factors and achieves SOTA on both short-term & long-term time horizons on the Stanford Drone, ETH/UCY and InD datasets.
International Conference on Computer Vision 2021

A new Multiscale Vision Transformer architecture that achieves SOTA performance-complexity tradeoff (2021) across image recognition (ImageNet), action recognition (Kinetics) & detection tasks.
International Conference on Computer Vision 2021

A multi-stream convolutional-deconvolutional framework for predicting future positions of pedestrians in egocentric videos using pose, location and egomotion features.
International Conference on Computer Vision 2021

We propose a relationship between catastrophic forgetting in the discriminator and mode collapse in the generator, and introduce an adaptive multi adversarial training (AMAT) solution to tackle this in GANs.
British Machine Vision Conference 2021

We present ORViT -- an object-centric video transformer model that explicitly models the appearance & dynamics of objects and improves on action recognition (SSV2, EK100) & detection (AVA) tasks.
Under Review

We present PECNet -- a Predicted Endpoint Conditioned trajectory prediction network for forecasting multimodal human trajectories that respect social norms in multi-agent scenarios.
European Conference on Computer Vision 2020 (Oral)

A multi-stream convolutional-deconvolutional framework for predicting future positions of pedestrians in egocentric videos using pose, location and egomotion features.
European Conference on Computer Vision 2020 (Oral)

A robust encoder-recurrent-decoder framework for egocentric human locomotion forecasting based on disentangling concurrent human motion.
Winter Conference on Applications of Computer Vision 2020 (Oral)

A multi-stream convolutional-deconvolutional framework for predicting future positions of pedestrians in egocentric videos using pose, location and egomotion features.
Computer Vision and Pattern Recognition 2018 (Spotlight)

A multitask framework that utilizes the spontaneity information present in speech to improve performance on emotion recognition tasks.
Interspeech 2018 (Oral)

We show that learning in deep networks is a two-stage process. First, the network rapidly learns 'shallow learnable' ('easier') examples, and then it slowly learns to generalize to other 'harder' examples.
Workshop on Identifying and Understanding Deep Learning Phenomena (ICML 2020) (Oral)

An algorithm to extend the use of Cellular Automata from the confined space of binary images to grayscale images with minimal time and space overheads.
28th Irish Signals and Systems Conference 2017

Contact

Please feel free to reach out to me with any queries.