Durham Research Online
Re-ID-AR: Improved Person Re-identification in Video via Joint Weakly Supervised Action Recognition

Alsehaim, A. and Breckon, T.P. (2021) 'Re-ID-AR: Improved Person Re-identification in Video via Joint Weakly Supervised Action Recognition.', BMVC 2021 Online, 22-25 Nov 2021.


We uniquely consider the task of joint person re-identification (Re-ID) and action recognition in video as a multi-task problem. In addition to the broader potential of joint Re-ID and action recognition within the context of automated multi-camera surveillance, we show that considering action recognition alongside Re-ID results in a model that learns discriminative feature representations that both improve Re-ID performance and are capable of providing viable per-view (clip-wise) action recognition. Our approach uses a single 2D Convolutional Neural Network (CNN) architecture comprising a common ResNet50-IBN backbone to extract frame-level features, with subsequent temporal attention for clip-level feature extraction, followed by two sub-branches: the IDentification (sub-)Network (IDN) for person Re-ID and the Action Recognition (sub-)Network (ARN) for per-view action recognition. The IDN comprises a single fully connected layer, while the ARN comprises multiple attention blocks on a one-to-one ratio with the number of actions to be recognised. This is subsequently trained as a joint Re-ID and action recognition task using a combination of two task-specific, multi-loss terms via weakly labelled actions obtained over two leading benchmark Re-ID datasets (MARS, LPW). Our consideration of Re-ID and action recognition as a multi-task problem results in a multi-branch 2D CNN architecture that outperforms prior work in the field (rank-1 (mAP): MARS 93.21% (87.23%); LPW 79.60%) without any reliance on 3D convolutions or multi-stream network architectures as found in other contemporary work. Our work represents the first benchmark performance for such a joint Re-ID and action recognition video understanding task, hitherto unapproached in the literature, and is accompanied by a new public dataset of supplementary action labels for the seminal MARS and LPW Re-ID datasets.
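
As a rough illustration of the data flow described in the abstract (frame-level features, temporal attention pooling to a clip-level feature, then an IDN branch and an ARN branch trained jointly), the following is a minimal pure-Python sketch. It is not the authors' implementation: the ResNet50-IBN backbone is omitted and frame features are supplied directly, the per-action "attention block" is reduced to a hypothetical sigmoid channel gate followed by a scoring vector, and the joint objective is assumed to be a weighted sum of two cross-entropy terms. All function and parameter names (`temporal_attention_pool`, `arn_logits`, `act_weight`, and so on) are invented for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def temporal_attention_pool(frames, w_att):
    """Collapse T frame-level feature vectors into one clip-level vector.

    Each frame is scored against w_att; the softmax of the scores gives
    the per-frame weights used for the weighted average.
    """
    weights = softmax([dot(f, w_att) for f in frames])
    return [sum(w * f[d] for w, f in zip(weights, frames))
            for d in range(len(frames[0]))]


def idn_logits(clip, fc_rows):
    """IDN branch: a single fully connected layer (one weight row per identity)."""
    return [dot(clip, row) for row in fc_rows]


def arn_logits(clip, blocks):
    """ARN branch: one block per action, so len(blocks) == number of actions.

    Assumption: each block is a (gate, scorer) pair; the gate applies a
    sigmoid channel weighting and the scorer maps the result to one logit.
    """
    logits = []
    for gate, scorer in blocks:
        attended = [c / (1.0 + math.exp(-g)) for c, g in zip(clip, gate)]
        logits.append(dot(attended, scorer))
    return logits


def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])


def joint_loss(id_logits, id_label, act_logits, act_label, act_weight=1.0):
    """Assumed joint objective: identity loss plus weighted action loss."""
    return (cross_entropy(id_logits, id_label)
            + act_weight * cross_entropy(act_logits, act_label))


if __name__ == "__main__":
    # Two frames of 3-dimensional features standing in for backbone output.
    frames = [[0.2, 0.1, 0.4], [0.3, 0.0, 0.5]]
    clip = temporal_attention_pool(frames, w_att=[1.0, 0.0, -1.0])
    ids = idn_logits(clip, fc_rows=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
    acts = arn_logits(clip, blocks=[([0.0, 0.0, 0.0], [1.0, 1.0, 1.0]),
                                    ([1.0, -1.0, 0.0], [0.5, 0.5, 0.5])])
    print(joint_loss(ids, 0, acts, 1, act_weight=0.5))
```

Sharing the clip-level feature between the two branches is what makes this a multi-task setup: gradients from both loss terms would flow back into the same pooled representation, whereas the branches themselves stay task-specific.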

Item Type: Conference item (Paper)
Full text: (AM) Accepted Manuscript
Date accepted: No date available
Date deposited: 26 October 2021
Date of first online publication: November 2021
Date first made open access: 27 October 2021