I am a researcher/engineer with a strong background in computer vision and machine learning.
I am interested in developing algorithms for real-world applications.
Since October 2023, I have been working on object detection and tracking at Xovis AG, Switzerland.
Earlier, I was a research engineer for two years in the Imaging and Computer Vision group at Siemens Research India, directed by Amit Kale.
In 2013, I graduated from the Master's in Informatics (MOSIG) program at
Institut National Polytechnique de Grenoble-INPG (School ENSIMAG),
with a specialization in Graphics, Vision and Robotics (GVR).
I completed my master's thesis under the supervision of Dr. Georgios Evangelidis
and Dr. Radu Horaud at INRIA, Grenoble.
I received a Bachelor of Technology degree in Electronics and Instrumentation Engineering from VIT University, Vellore,
during which I had the chance to do an internship at the University of Edinburgh under the supervision of Dr. Bob Fisher.
ChaLearn Looking at People Challenge, 2014, Gesture detection, Rank: 7/17.
ChaLearn Looking at People Challenge, 2013, Gesture detection, Rank: 17/54.
Selected Publications
Spatio-Temporal Action Detection Under Large Motion
We aim to study the performance of cuboid-aware feature aggregation in action detection under large motion. Further, we propose to enhance actor feature representation under large motion by tracking actors and performing temporal feature aggregation along the respective tracks. We define actor motion via the intersection-over-union (IoU) between the boxes of action tubes/tracks at various fixed time scales: an action with large motion results in lower IoU over time, while slower actions maintain higher IoU. We find that track-aware feature aggregation consistently achieves a large improvement in action detection performance, especially for actions under large motion, compared to the cuboid-aware baseline. As a result, we also report state-of-the-art results on the large-scale MultiSports dataset.
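As a rough illustration, this motion measure can be computed as below; a minimal sketch assuming axis-aligned boxes in (x1, y1, x2, y2) format, with illustrative function names and time scales rather than the paper's actual code.

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-8)

def track_motion(track, time_scales=(4, 8, 16)):
    """Motion of an action track (a list of per-frame boxes): the mean IoU between
    boxes separated by fixed time gaps. Lower values indicate larger motion."""
    scores = []
    for dt in time_scales:
        pairs = [iou(track[t], track[t + dt]) for t in range(len(track) - dt)]
        if pairs:
            scores.append(float(np.mean(pairs)))
    return float(np.mean(scores)) if scores else 1.0
```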
ROAD: The ROad event Awareness Dataset for Autonomous Driving
We introduce the ROad event Awareness Dataset (ROAD) for Autonomous Driving, to our knowledge the first of its kind. ROAD is designed to test an autonomous vehicle's ability to detect road events, defined as triplets composed of an active agent, the action(s) it performs and the corresponding scene locations. ROAD comprises videos, originally from the Oxford RobotCar Dataset, annotated with bounding boxes showing the location in the image plane of each road event. We benchmark various detection tasks, proposing as a baseline a new incremental algorithm for online road event awareness termed 3D-RetinaNet. We also report the performance on the ROAD tasks of SlowFast and YOLOv5 detectors, as well as that of the winners of the ICCV 2021 ROAD challenge.
Gurkirt Singh, Stephen Akrigg, ..., & Fabio Cuzzolin
We propose a novel Recurrent Convolutional Network (RCN), which relies on recurrence to capture the temporal context across frames at each network level.
Our network decomposes 3D convolutions into (1) a 2D spatial convolution component, and (2) an additional hidden-state 1 × 1 convolution applied across time.
The hidden state at any time t is assumed to depend on the hidden state at t − 1 and on the current output of the spatial convolution component.
As a result, the proposed network: (i) produces causal outputs, (ii) provides flexible temporal reasoning, and (iii) preserves temporal resolution.
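A minimal sketch of this decomposition in PyTorch is given below; module and parameter names are illustrative assumptions, not the released RCN implementation.

```python
import torch
import torch.nn as nn

class RecurrentConvUnit(nn.Module):
    """Replaces a 3D convolution with a 2D spatial convolution plus a 1x1
    convolution that mixes the current spatial output with the previous
    hidden state, applied frame by frame."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.hidden = nn.Conv2d(out_channels, out_channels, kernel_size=1)

    def forward(self, clip):
        # clip: (batch, channels, time, height, width)
        state, outputs = None, []
        for t in range(clip.shape[2]):
            x = self.spatial(clip[:, :, t])        # 2D spatial convolution
            if state is not None:
                x = x + self.hidden(state)         # 1x1 conv over the hidden state at t-1
            state = x                              # hidden state at time t
            outputs.append(x)
        return torch.stack(outputs, dim=2)         # causal: output at t sees frames <= t
```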
We propose to optimise both encoder and decoder simultaneously in an end-to-end fashion.
In a two-stage training setting, we first initialise our architecture using pre-trained encoders and decoders; then,
the entire network is trained end-to-end in a fine-tuning stage to learn the most relevant features for video caption generation.
In our experiments, we use GoogLeNet and Inception-ResNet-v2 as encoders and an original Soft-Attention (SA-) LSTM as the decoder.
Analogously to gains observed in other computer vision problems, we show that end-to-end training significantly improves over the traditional, disjoint training process.
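The sketch below illustrates only the idea of the second, end-to-end stage, using tiny stand-in modules and synthetic data; the real encoders (GoogLeNet, Inception-ResNet-v2) and the SA-LSTM decoder are not reproduced here.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):                    # stand-in for the CNN encoder
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim))
    def forward(self, frames):                    # frames: (batch * T, 3, H, W)
        return self.net(frames)

class CaptionDecoder(nn.Module):                  # stand-in for the SA-LSTM decoder
    def __init__(self, feat_dim=64, vocab=100):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, 64, batch_first=True)
        self.out = nn.Linear(64, vocab)
    def forward(self, feats):                     # feats: (batch, T, feat_dim)
        h, _ = self.lstm(feats)
        return self.out(h)                        # per-step word logits

encoder, decoder = FrameEncoder(), CaptionDecoder()
# Stage 1 would load pre-trained weights for both parts here.
# Stage 2: one optimiser over all parameters, so the caption loss also updates the encoder.
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-5)
frames = torch.randn(2 * 8, 3, 32, 32)            # 2 synthetic clips of 8 frames
captions = torch.randint(0, 100, (2, 8))          # synthetic token targets
logits = decoder(encoder(frames).view(2, 8, -1))
loss = nn.functional.cross_entropy(logits.reshape(-1, 100), captions.reshape(-1))
optimizer.zero_grad(); loss.backward(); optimizer.step()   # gradients reach the encoder
```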
We present a method to predict an entire 'action tube' in a trimmed video by observing only a smaller subset of the video. We propose a Tube Prediction network (TPnet) which jointly predicts the past, present and future bounding boxes along with their action classification scores. At test time, TPnet is used in a (temporal) sliding-window setting, and its predictions are fed into a tube estimation framework to construct/predict video-long action tubes not only for the observed part of the video but also for the unobserved part.
TraMNet - Transition Matrix Network for Efficient Action Tube Proposals
Current state-of-the-art methods solve spatio-temporal action localisation by extending 2D anchors to 3D-cuboid proposals on stacks of frames, to generate sets of temporally connected bounding boxes called action micro-tubes. Such cuboid proposals, however, struggle when actors move significantly across frames. To avoid this problem we introduce a Transition-Matrix-based Network (TraMNet) which relies on computing transition probabilities between anchor proposals while maximising their overlap with ground-truth bounding boxes across frames, and enforcing sparsity via a transition threshold. As the resulting transition matrix is sparse and stochastic, this reduces the proposal hypothesis search space from O(n^f) to the cardinality of the thresholded matrix.
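A minimal sketch of how such a transition matrix could be estimated from ground-truth tracks is shown below; the anchor layout, frame gap and threshold value are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-8)

def transition_matrix(tracks, anchors, gap=1, threshold=0.05):
    """tracks: list of per-frame box sequences (ground-truth tubes);
    anchors: list of anchor boxes. Returns a sparse, row-stochastic matrix of
    anchor-to-anchor transition probabilities across a gap of `gap` frames."""
    n = len(anchors)
    counts = np.zeros((n, n))
    for track in tracks:
        for t in range(len(track) - gap):
            i = int(np.argmax([iou(track[t], a) for a in anchors]))        # best anchor now
            j = int(np.argmax([iou(track[t + gap], a) for a in anchors]))  # best anchor later
            counts[i, j] += 1
    probs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    probs[probs < threshold] = 0.0   # transition threshold enforces sparsity
    return probs
```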
Online Real-time Multiple Spatiotemporal Action Localisation and Prediction
We present a method for multiple spatiotemporal action localisation,
classification, and early prediction based on a single deep learning framework,
which is able to operate online and under real-time constraints.
Gurkirt Singh, Suman Saha, Michael Sapienza, Philip Torr, Fabio Cuzzolin.
AMTnet: Action-Micro-Tube regression by end-to-end trainable deep architecture.
Dominant approaches provide sub-optimal solutions to the action detection problem, as they rely on seeking frame-level detections and constructing tubes from them. In this paper we radically depart from current practice, and take a first step towards the design and implementation of a deep network architecture able to classify and regress video-level micro-tubes.
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos
In this work we propose a new approach to the spatio-temporal
localisation (detection) and classification of multiple concurrent actions within
temporally untrimmed videos. We demonstrate the performance of our algorithm on the challenging
UCF101, J-HMDB-21 and LIRIS-HARL datasets, achieving new state-of-the-art results
across the board and significantly lower detection latency at test time.
Suman Saha, Gurkirt Singh, Michael Sapienza, Philip Torr, Fabio Cuzzolin.
Untrimmed Video Classification for Activity Detection: submission to ActivityNet Challenge
In this work we propose a simple, yet effective, method for the temporal detection
of activities in temporally untrimmed videos with the help of untrimmed video classification.
This method secured 2nd place in the activity detection task at the ActivityNet
Challenge 2016 [Results]
Gurkirt Singh and Fabio Cuzzolin.
CVPR 2016 ActivityNet workshop, 2nd place in detection task.
Continuous gesture recognition from articulated poses
This paper addresses the problem of continuous gesture recognition from articulated poses.
Unlike the common isolated recognition scenario, the gesture boundaries are here unknown,
and one has to solve two problems: segmentation and recognition.
This is cast into a labeling framework, namely every site (frame) must be assigned a label (gesture ID).
The inherent constraint for a piece-wise constant labeling is satisfied by solving a
global optimization problem with a smoothness term.
This method secured 7th place in the gesture
detection task of the ChaLearn LaP Challenge using only skeleton data.
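A minimal sketch of such a piecewise-constant labeling with a smoothness term, written as Viterbi-style dynamic programming over per-frame label costs; the cost definitions and the smoothness weight are illustrative, not the paper's exact model.

```python
import numpy as np

def label_frames(unary, smooth_weight=2.0):
    """unary: (T, K) array, cost of assigning each of K gesture labels to each frame.
    Minimises the summed per-frame costs plus a fixed penalty for every label
    change, which favours a piecewise-constant labeling."""
    T, K = unary.shape
    cost = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        switch = cost.min() + smooth_weight                       # cost of changing label
        back[t] = np.where(cost <= switch, np.arange(K), cost.argmin())
        cost = np.minimum(cost, switch) + unary[t]
    labels = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):
        labels.append(int(back[t, labels[-1]]))
    return labels[::-1]                                           # one gesture ID per frame
```

For example, label_frames(np.random.rand(100, 5)) returns 100 gesture IDs, with the smoothness weight controlling how reluctantly the labeling switches between gestures.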
Georgios Evangelidis, Gurkirt Singh, Radu Patrice Horaud.
Skeletal Quads: Human action recognition using joint quadruples
In this context, we propose a local skeleton descriptor that encodes
the relative position of joint quadruples. Such a coding implies a
similarity normalisation transform that leads to a compact (6D or 5D)
view-invariant skeletal feature, referred to as a skeletal quad.
We use this descriptor in conjunction with the Fisher kernel
in order to encode gesture or action (sub)sequences.
The short length of the descriptor compensates for the large inherent
dimensionality associated with Fisher vectors.
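One possible way to realise the described normalisation is sketched below for 3D joints; the exact canonical mapping used in the paper may differ, so treat this as an illustrative similarity transform that sends the first two joints to fixed positions.

```python
import numpy as np

def skeletal_quad(j1, j2, j3, j4):
    """Encode the joint quadruple (j1, j2, j3, j4) as a 6D descriptor:
    translate j1 to the origin, rotate and scale so that j2 maps to (1, 1, 1),
    then keep the transformed coordinates of j3 and j4."""
    j1 = np.asarray(j1, float)
    v = np.asarray(j2, float) - j1
    target = np.array([1.0, 1.0, 1.0])
    scale = np.linalg.norm(target) / (np.linalg.norm(v) + 1e-8)
    # Rodrigues rotation aligning the direction of v with the target direction.
    a = v / (np.linalg.norm(v) + 1e-8)
    b = target / np.linalg.norm(target)
    axis = np.cross(a, b)
    s, c = np.linalg.norm(axis), float(np.dot(a, b))
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    R = np.eye(3) + K + K @ K * ((1 - c) / (s ** 2 + 1e-12))
    transform = lambda x: scale * (R @ (np.asarray(x, float) - j1))
    return np.concatenate([transform(j3), transform(j4)])   # 6D skeletal quad
```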
Georgios Evangelidis, Gurkirt Singh, Radu Patrice Horaud.
Frame-wise representations of depth videos for action recognition
We present three types of depth data representations computed from depth frames, referred to as the single-reference representation, the multiple-reference representation and the Quad representation.
Gurkirt Singh
Master thesis, INRIA and Grenoble Institute of Technology, France, 2013
Supervisors: Dr. Radu Horaud and Dr. Georgios Evangelidis