
Durham Research Online
Extracting coarse body movements from video in music performance : a comparison of automated computer vision techniques with motion capture data.

Jakubowski, Kelly and Eerola, Tuomas and Alborno, Paolo and Volpe, Gualtiero and Camurri, Antonio and Clayton, Martin (2017) 'Extracting coarse body movements from video in music performance : a comparison of automated computer vision techniques with motion capture data.', Frontiers in Digital Humanities, 4, p. 9.


The measurement and tracking of body movement within musical performances can provide valuable sources of data for studying interpersonal interaction and coordination between musicians. The continued development of tools to extract such data from video recordings will offer new opportunities to research musical movement across a diverse range of settings, including field research and other ecological contexts in which the implementation of complex motion capture systems is not feasible or affordable. Such work might also make use of the multitude of video recordings of musical performances that are already available to researchers. The present study made use of such existing data, specifically, three video datasets of ensemble performances from different genres, settings, and instrumentation (a pop piano duo, three jazz duos, and a string quartet). Three different computer vision techniques were applied to these video datasets, namely frame differencing, optical flow, and kernelized correlation filters (KCF), with the aim of quantifying and tracking movements of the individual performers. All three computer vision techniques exhibited high correlations with motion capture data collected from the same musical performances, with median correlation (Pearson’s r) values of .75 to .94. The techniques that track movement in two dimensions (optical flow and KCF) provided more accurate measures of movement than the technique that provides a single estimate of overall movement change by frame for each performer (frame differencing). Measurements of performers’ movements were also more accurate when the computer vision techniques were applied to more narrowly defined regions of interest (head) than when the same techniques were applied to larger regions (entire upper body, above the chest or waist).
Some differences in movement tracking accuracy emerged between the three video datasets, which may have been due to instrument-specific motions that resulted in occlusions of the body part of interest (e.g. a violinist’s right hand occluding the head while head movement was being tracked). These results indicate that computer vision techniques can be effective in quantifying body movement from videos of musical performances, while also highlighting constraints that must be dealt with when applying such techniques in ensemble coordination research.
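
As an illustration of the simplest of the three techniques, the following minimal sketch computes a frame-differencing movement estimate inside a region of interest and compares it with a reference movement signal using Pearson's r. It is not the authors' implementation: the function names `frame_difference_motion` and `pearson_r` are hypothetical, the video is a synthetic moving square rather than a real performance, and the known per-frame displacement stands in for motion capture data.

```python
import numpy as np

def frame_difference_motion(frames, roi):
    """Mean absolute pixel change per frame pair inside a region of interest.

    frames: array of shape (T, H, W) holding grayscale frames.
    roi: (top, bottom, left, right) pixel bounds (hypothetical convention).
    Returns an array of length T - 1, one movement estimate per frame pair.
    """
    top, bottom, left, right = roi
    # Convert to float first so uint8 subtraction cannot wrap around.
    region = frames[:, top:bottom, left:right].astype(np.float64)
    return np.abs(np.diff(region, axis=0)).mean(axis=(1, 2))

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Synthetic stand-in for a video: a bright 10x10 square that moves right
# by 0 to 3 pixels per frame (assumed data, not from the study).
rng = np.random.default_rng(0)
steps = rng.integers(0, 4, size=49)
positions = 5 + np.concatenate(([0], np.cumsum(steps)))
frames = np.zeros((50, 40, 200), dtype=np.uint8)
for t, x in enumerate(positions):
    frames[t, 15:25, x:x + 10] = 255

# Movement estimate over the whole frame, compared with the known displacement.
motion = frame_difference_motion(frames, roi=(0, 40, 0, 200))
r = pearson_r(motion, steps)
print(round(r, 3))
```

Frame differencing yields only a single magnitude of change per frame, with no direction, which is consistent with the abstract's description of it as a coarser measure than the two-dimensional trackers. Narrowing `roi` to a smaller region corresponds to the head-only analysis described above.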

Item Type: Article
Additional Information: Published in the Digital Musicology Speciality Section.
Full text: (AM) Accepted Manuscript
Available under License - Creative Commons Attribution.
Full text: (VoR) Version of Record
Available under License - Creative Commons Attribution.
Publisher statement: Copyright: © 2017 Jakubowski, Eerola, Alborno, Volpe, Camurri and Clayton. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Date accepted: 21 March 2017
Date deposited: 30 March 2017
Date of first online publication: 06 April 2017
Date first made open access: No date available
