Durham Research Online

Using compressed audio-visual words for multi-modal scene classification.

Kurcius, J.J. and Breckon, T.P. (2014) 'Using compressed audio-visual words for multi-modal scene classification.', in Computational Intelligence for Multimedia Understanding (IWCIM), 2014 International Workshop on, 1-2 November 2014, Paris, France; proceedings. Paris: IEEE, pp. 99-103.


We present a novel approach to scene classification using combined audio signal and video image features and compare this methodology to scene classification results using each modality in isolation. Each modality is represented using summary features, namely Mel-frequency Cepstral Coefficients (audio) and Scale Invariant Feature Transform (SIFT) (video) within a multi-resolution bag-of-features model. Uniquely, we extend the classical bag-of-words approach over both audio and video feature spaces, whereby we introduce the concept of compressive sensing as a novel methodology for multi-modal fusion via audio-visual feature dimensionality reduction. We perform evaluation over a range of environments showing performance that is both comparable to the state of the art (86%, over ten scene classes) and invariant to a ten-fold dimensionality reduction within the audio-visual feature space using our compressive representation approach.
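The fusion idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the codebook sizes, the random toy descriptors, the Gaussian measurement matrix, and the helper names (`quantize`, `random_projection_matrix`, `compress`) are all assumptions made for illustration, and the multi-resolution aspect is omitted. The sketch builds one bag-of-words histogram per modality, concatenates them into a joint audio-visual vector, and applies a ten-fold random projection of the kind commonly used in compressive sensing.

```python
import math
import random


def quantize(descriptors, codebook):
    """Normalised bag-of-words histogram: each descriptor votes for its nearest codeword."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)), key=lambda k: math.dist(d, codebook[k]))
        hist[nearest] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]


def random_projection_matrix(m, n, seed=0):
    """m x n Gaussian measurement matrix (an assumed choice, not taken from the paper)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(m)]


def compress(x, phi):
    """Project the n-dimensional vector x down to m dimensions: y = phi @ x."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]


# Toy stand-ins for real features: 13-D MFCC-like and 128-D SIFT-like descriptors.
rng = random.Random(1)
audio_codebook = [[rng.random() for _ in range(13)] for _ in range(50)]
visual_codebook = [[rng.random() for _ in range(128)] for _ in range(50)]
audio_desc = [[rng.random() for _ in range(13)] for _ in range(200)]
visual_desc = [[rng.random() for _ in range(128)] for _ in range(200)]

# Joint audio-visual word histogram, then a ten-fold dimensionality reduction.
joint = quantize(audio_desc, audio_codebook) + quantize(visual_desc, visual_codebook)
phi = random_projection_matrix(len(joint) // 10, len(joint))
compressed = compress(joint, phi)
print(len(joint), len(compressed))
```

In a full pipeline the compressed vector would be passed to a classifier in place of the joint histogram; the choice of classifier is not specified in the abstract and is left out here.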

Item Type: Book chapter
Keywords: Multi-resolution, Bag of words, MFCC, Compressed sensing, Audio-visual, Multi-modal.
Full text: (AM) Accepted Manuscript
Publisher statement: © 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Date accepted: No date available
Date deposited: 04 February 2015
Date of first online publication: November 2014
Date first made open access: No date available
