I am a researcher/engineer with a strong background in computer vision and machine learning.
I am interested in developing algorithms for real-world applications.
Since October 2023, I have been working on object detection and tracking at Xovis AG, Switzerland.
Earlier, I was a research engineer for two years in the Imaging and Computer Vision group at Siemens Research India, directed by Amit Kale.
In 2013, I graduated from the Master's in Informatics (MOSIG) program at
Institut National Polytechnique de Grenoble-INPG (School ENSIMAG),
with a specialization in Graphics, Vision and Robotics (GVR).
I completed my master's thesis under the supervision of Dr. Georgios Evangelidis
and Dr. Radu Horaud at INRIA, Grenoble.
I received a Bachelor of Technology degree in Electronics and Instrumentation Engineering from VIT University, Vellore,
during which I had the chance to do an internship at the University of Edinburgh under the supervision of Dr. Bob Fisher.
ChaLearn Looking at People Challenge, 2014, Gesture detection, Rank: 7/17.
ChaLearn Looking at People Challenge, 2013, Gesture detection, Rank: 17/54.
Selected Publications
Spatio-Temporal Action Detection Under Large Motion
We aim to study the performance of cuboid-aware feature aggregation in action detection under large motion. Further, we propose to enhance actor feature representation under large motion by tracking actors and performing temporal feature aggregation along the respective tracks. We define actor motion via the intersection-over-union (IoU) between the boxes of action tubes/tracks at various fixed time scales: an action with large motion results in lower IoU over time, while slower actions maintain higher IoU. We find that track-aware feature aggregation consistently achieves a large improvement in action detection performance, especially for actions under large motion, compared to the cuboid-aware baseline. As a result, we also report state-of-the-art results on the large-scale MultiSports dataset.
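As a rough illustration, this motion measure can be computed as below; a minimal sketch assuming axis-aligned boxes in (x1, y1, x2, y2) format, with illustrative function names and time scales rather than the paper's actual code.

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-8)

def track_motion(track, time_scales=(4, 8, 16)):
    """Motion of an action track (a list of per-frame boxes): the mean IoU between
    boxes separated by fixed time gaps. Lower values indicate larger motion."""
    scores = []
    for dt in time_scales:
        pairs = [iou(track[t], track[t + dt]) for t in range(len(track) - dt)]
        if pairs:
            scores.append(float(np.mean(pairs)))
    return float(np.mean(scores)) if scores else 1.0
```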
ROAD: The ROad event Awareness Dataset for Autonomous Driving
We introduce the ROad event Awareness Dataset (ROAD) for Autonomous Driving, to our knowledge the first of its kind. ROAD is designed to test an autonomous vehicle's ability to detect road events, defined as triplets composed of an active agent, the action(s) it performs and the corresponding scene locations. ROAD comprises videos, originally from the Oxford RobotCar Dataset, annotated with bounding boxes showing the location in the image plane of each road event. We benchmark various detection tasks, proposing as a baseline a new incremental algorithm for online road event awareness termed 3D-RetinaNet. We also report the performance on the ROAD tasks of SlowFast and YOLOv5 detectors, as well as that of the winners of the ICCV 2021 ROAD challenge.
Gurkirt Singh, Stephen Akrigg, ..., & Fabio Cuzzolin
We propose a novel Recurrent Convolutional Network (RCN), which relies on recurrence to capture the temporal context across frames at each network level.
Our network decomposes 3D convolutions into (1) a 2D spatial convolution component, and (2) an additional hidden-state 1 × 1 convolution applied across time.
The hidden state at any time t is assumed to depend on the hidden state at t − 1 and on the current output of the spatial convolution component.
As a result, the proposed network: (i) produces causal outputs, (ii) provides flexible temporal reasoning, and (iii) preserves temporal resolution.
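A minimal sketch of this decomposition in PyTorch is given below; module and parameter names are illustrative assumptions, not the released RCN implementation.

```python
import torch
import torch.nn as nn

class RecurrentConvUnit(nn.Module):
    """Replaces a 3D convolution with a 2D spatial convolution plus a 1x1
    convolution that mixes the current spatial output with the previous
    hidden state, applied frame by frame."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.hidden = nn.Conv2d(out_channels, out_channels, kernel_size=1)

    def forward(self, clip):
        # clip: (batch, channels, time, height, width)
        state, outputs = None, []
        for t in range(clip.shape[2]):
            x = self.spatial(clip[:, :, t])        # 2D spatial convolution
            if state is not None:
                x = x + self.hidden(state)         # 1x1 conv over the hidden state at t-1
            state = x                              # hidden state at time t
            outputs.append(x)
        return torch.stack(outputs, dim=2)         # causal: output at t sees frames <= t
```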
We propose to optimise both encoder and decoder simultaneously in an end-to-end fashion.
In a two-stage training setting, we first initialise our architecture using pre-trained encoders and decoders; then,
the entire network is trained end-to-end in a fine-tuning stage to learn the most relevant features for video caption generation.
In our experiments, we use GoogLeNet and Inception-ResNet-v2 as encoders and an original Soft-Attention (SA-) LSTM as the decoder.
Analogously to gains observed in other computer vision problems, we show that end-to-end training significantly improves over the traditional, disjoint training process.
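The sketch below illustrates only the idea of the second, end-to-end stage, using tiny stand-in modules and synthetic data; the real encoders (GoogLeNet, Inception-ResNet-v2) and the SA-LSTM decoder are not reproduced here.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):                    # stand-in for the CNN encoder
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim))
    def forward(self, frames):                    # frames: (batch * T, 3, H, W)
        return self.net(frames)

class CaptionDecoder(nn.Module):                  # stand-in for the SA-LSTM decoder
    def __init__(self, feat_dim=64, vocab=100):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, 64, batch_first=True)
        self.out = nn.Linear(64, vocab)
    def forward(self, feats):                     # feats: (batch, T, feat_dim)
        h, _ = self.lstm(feats)
        return self.out(h)                        # per-step word logits

encoder, decoder = FrameEncoder(), CaptionDecoder()
# Stage 1 would load pre-trained weights for both parts here.
# Stage 2: one optimiser over all parameters, so the caption loss also updates the encoder.
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-5)
frames = torch.randn(2 * 8, 3, 32, 32)            # 2 synthetic clips of 8 frames
captions = torch.randint(0, 100, (2, 8))          # synthetic token targets
logits = decoder(encoder(frames).view(2, 8, -1))
loss = nn.functional.cross_entropy(logits.reshape(-1, 100), captions.reshape(-1))
optimizer.zero_grad(); loss.backward(); optimizer.step()   # gradients reach the encoder
```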
We present a method to predict an entire 'action tube' in a trimmed video by observing only a smaller subset of the video. We propose a Tube Prediction network (TPnet) which jointly predicts the past, present and future bounding boxes along with their action classification scores. At test time, TPnet is used in a (temporal) sliding-window setting, and its predictions are fed into a tube estimation framework to construct/predict video-long action tubes not only for the observed part of the video but also for the unobserved part.
TraMNet - Transition Matrix Network for Efficient Action Tube Proposals
Current state-of-the-art methods solve spatio-temporal action localisation by extending 2D anchors to 3D-cuboid proposals on stacks of frames, to generate sets of temporally connected bounding boxes called action micro-tubes. Such cuboid proposals, however, struggle when actors move significantly across frames. To avoid this problem we introduce a Transition-Matrix-based Network (TraMNet) which relies on computing transition probabilities between anchor proposals while maximising their overlap with ground-truth bounding boxes across frames, and enforcing sparsity via a transition threshold. As the resulting transition matrix is sparse and stochastic, this reduces the proposal hypothesis search space from O(n^f) to the cardinality of the thresholded matrix.
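A minimal sketch of how such a transition matrix could be estimated from ground-truth tracks is shown below; the anchor layout, frame gap and threshold value are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-8)

def transition_matrix(tracks, anchors, gap=1, threshold=0.05):
    """tracks: list of per-frame box sequences (ground-truth tubes);
    anchors: list of anchor boxes. Returns a sparse, row-stochastic matrix of
    anchor-to-anchor transition probabilities across a gap of `gap` frames."""
    n = len(anchors)
    counts = np.zeros((n, n))
    for track in tracks:
        for t in range(len(track) - gap):
            i = int(np.argmax([iou(track[t], a) for a in anchors]))        # best anchor now
            j = int(np.argmax([iou(track[t + gap], a) for a in anchors]))  # best anchor later
            counts[i, j] += 1
    probs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    probs[probs < threshold] = 0.0   # transition threshold enforces sparsity
    return probs
```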
Online Real-time Multiple Spatiotemporal Action Localisation and Prediction
We present a method for multiple spatiotemporal action localisation,
classification, and early prediction based on a single deep learning framework,
which is able to operate online and under real-time constraints.
Gurkirt Singh, Suman Saha, Michael Sapienza, Philip Torr, Fabio Cuzzolin.
AMTnet: Action-Micro-Tube regression by end-to-end trainable deep architecture.
Dominant approaches provide sub-optimal solutions to the action detection problem, as they rely on seeking frame-level detections and constructing tubes from them. In this paper we radically depart from current practice, and take a first step towards the design and implementation of a deep network architecture able to classify and regress video-level micro-tubes.
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos
In this work we propose a new approach to the spatio-temporal
localisation (detection) and classification of multiple concurrent actions within
temporally untrimmed videos. We demonstrate the performance of our algorithm on the challenging
UCF101, J-HMDB-21 and LIRIS-HARL datasets, achieving new state-of-the-art results
across the board and significantly lower detection latency at test time.
Suman Saha, Gurkirt Singh, Michael Sapienza, Philip Torr, Fabio Cuzzolin.
Untrimmed Video Classification for Activity Detection: submission to ActivityNet Challenge
In this work we propose a simple, yet effective, method for the temporal detection
of activities in temporally untrimmed videos with the help of untrimmed video classification.
This method secured 2nd place in the activity detection task at the ActivityNet
Challenge 2016 [Results]
Gurkirt Singh and Fabio Cuzzolin.
CVPR 2016 ActivityNet workshop, 2nd place in detection task.
Continuous gesture recognition from articulated poses
This paper addresses the problem of continuous gesture recognition from articulated poses.
Unlike the common isolated recognition scenario, the gesture boundaries are here unknown,
and one has to solve two problems: segmentation and recognition.
This is cast into a labeling framework, namely every site (frame) must be assigned a label (gesture ID).
The inherent constraint for a piece-wise constant labeling is satisfied by solving a
global optimization problem with a smoothness term.
This method secured 7th place in the gesture
detection task of the ChaLearn LaP Challenge using only skeleton data.
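A minimal sketch of such a piecewise-constant labeling with a smoothness term, written as Viterbi-style dynamic programming over per-frame label costs; the cost definitions and the smoothness weight are illustrative, not the paper's exact model.

```python
import numpy as np

def label_frames(unary, smooth_weight=2.0):
    """unary: (T, K) array, cost of assigning each of K gesture labels to each frame.
    Minimises the summed per-frame costs plus a fixed penalty for every label
    change, which favours a piecewise-constant labeling."""
    T, K = unary.shape
    cost = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        switch = cost.min() + smooth_weight                       # cost of changing label
        back[t] = np.where(cost <= switch, np.arange(K), cost.argmin())
        cost = np.minimum(cost, switch) + unary[t]
    labels = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):
        labels.append(int(back[t, labels[-1]]))
    return labels[::-1]                                           # one gesture ID per frame
```

For example, label_frames(np.random.rand(100, 5)) returns 100 gesture IDs, with the smoothness weight controlling how reluctantly the labeling switches between gestures.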
Georgios Evangelidis, Gurkirt Singh, Radu Patrice Horaud.
Skeletal Quads: Human action recognition using joint quadruples
In this context, we propose a local skeleton descriptor that encodes
the relative position of joint quadruples. Such a coding implies a
similarity normalisation transform that leads to a compact (6D or 5D)
view-invariant skeletal feature, referred to as a skeletal quad.
We use this descriptor in conjunction with the Fisher kernel
in order to encode gesture or action (sub)sequences.
The short length of the descriptor compensates for the large inherent
dimensionality associated with Fisher vectors.
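One possible way to realise the described normalisation is sketched below for 3D joints; the exact canonical mapping used in the paper may differ, so treat this as an illustrative similarity transform that sends the first two joints to fixed positions.

```python
import numpy as np

def skeletal_quad(j1, j2, j3, j4):
    """Encode the joint quadruple (j1, j2, j3, j4) as a 6D descriptor:
    translate j1 to the origin, rotate and scale so that j2 maps to (1, 1, 1),
    then keep the transformed coordinates of j3 and j4."""
    j1 = np.asarray(j1, float)
    v = np.asarray(j2, float) - j1
    target = np.array([1.0, 1.0, 1.0])
    scale = np.linalg.norm(target) / (np.linalg.norm(v) + 1e-8)
    # Rodrigues rotation aligning the direction of v with the target direction.
    a = v / (np.linalg.norm(v) + 1e-8)
    b = target / np.linalg.norm(target)
    axis = np.cross(a, b)
    s, c = np.linalg.norm(axis), float(np.dot(a, b))
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    R = np.eye(3) + K + K @ K * ((1 - c) / (s ** 2 + 1e-12))
    transform = lambda x: scale * (R @ (np.asarray(x, float) - j1))
    return np.concatenate([transform(j3), transform(j4)])   # 6D skeletal quad
```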
Georgios Evangelidis, Gurkirt Singh, Radu Patrice Horaud.
Frame-wise representations of depth videos for action recognition
We present three types of depth data representations computed from depth frames, referred to as the single-reference representation, the multiple-reference representation and the Quad representation.
Gurkirt Singh
Master thesis, INRIA and Grenoble Institute of Technology, France, 2013
Supervisors: Dr. Radu Horaud and Dr. Georgios Evangelidis