ObjectTracking

From Psych 221 Image Systems Engineering
Revision as of 22:38, 10 December 2015 by imported>Projects221 (Background - Optical Motion Tracking)

Applying a computer vision object tracking algorithm to track musicians’ ancillary gestures

Introduction

The motivating question: Musicians' gestures at a sub-phrase level

When watching a musical performance, it is possible to notice that musicians' body movements seem closely related to their expressive intentions and to their conceptualization of musical structure, beyond the movements necessary to play the instrument. In recent years, many studies have noted the correspondence of body motion at the level of the phrase (Krumhansl & Schenck, 1997; MacRitchie, Buck, & Bailey, 2013; Vines, Krumhansl, Wanderley, & Levitin, 2006; Wanderley, Vines, Middleton, McKay, & Hatch, 2005). Participants in Vines et al. (2006) tracked the phrasing of clarinetists under audio-only, video-only, and audiovisual conditions; participants with only video cues could follow phrasing very well, indicating how communicative the visual channel can be for audience members. Indeed, performers used body motion to modulate audience perception of phrase length: in the video-only condition, they extended the perceived length of the last phrase by continuing to move their bodies. A more recent study, MacRitchie et al. (2013), used PCA to analyze motion-capture recordings of nine pianists playing two excerpts of Chopin preludes. Their findings suggested that overall motion in a performance arises from the performers' representation of the musical structure, and that repeating body-motion patterns at the level of the phrase can be observed.

These inquiries have largely focused on how musicians physically express musical structure in ancillary gestures on the timescale of the phrase. However, it is possible that ancillary gestures at a sub-phrase level also reflect musicians' conceptions of musical structure. As noted in Godoy et al. (2010), the motion of a pianist's head was cyclical at shorter timescales, corresponding to melodic figures. Vines et al. (2006) also mention changes in performers' facial expressions at important structural moments in the score, though what these facial expressions signified (whether they reflected some sort of emotional change, or were more related to the depiction of emphasis) is not clear. These kinds of gestures are the interest of the current study: those that do not reflect task demands, or the capabilities and physical constraints of subjects, but rather planned conceptual grouping and the possibility of physical expression beyond the means necessary for playing, and whether they correspond to musical structures on a level smaller than the phrase.

A perhaps complementary line of research exists in the field of linguistics, in which it was found that body motions not directly related to speech-producing actions aid in listener understanding of speech. For instance, rhythmic head motion conveys linguistic information, with head movement correlating strongly with the pitch and amplitude of the talker's voice (Munhall, Jones, Callan, Kuratate, & Vatikiotis-Bateson, 2004). Further, when animations of these "talking heads" were presented in a perception task without sound, participants correctly identified more syllables when natural head motion was presented in the animation than when it was eliminated or distorted. This result suggests that nonverbal gestures such as head movements play a more direct role in the perception of speech than previously known. While one might extrapolate that, from a listener's perspective, musical structure might be more 'understandable' when a visual corollary plausibly matches it, the takeaway from this study is that natural head motions arise in conjunction with sub-sentence speech, and these tendencies seem natural to both the speaker and the listener.

Motivation/Goals

With the goal of exploring how musicians express structural information on a sub-phrase level, an ideal avenue is to investigate gestural cues when performers conceive of the music as containing mostly small melodic groupings rather than long phrases. While there are multiple motion capture systems on the Stanford campus, there is not a sufficient population of performers at the level of students aiming to become professional musicians, so I felt it necessary to travel off campus to recruit enough participants, perhaps at a slight expense to data quality. Evaluating the quality of the ad hoc 'optical motion capture' system I used to record performers' motions was the purpose of this project. In the following sections, I describe the input data (collected directly prior to the start of term), as well as the setup I used to collect it. I then describe my evaluation of the object-tracking algorithm I used to track colored markers attached to the performers' bodies while they played.

Background - Optical Motion Tracking

Low-cost 'motion capture' systems have been implemented in the past using computer vision. The umbrella term for these systems is 'optical motion capture' (OMC). Specifically, OMC systems use a number of special cameras that view a scene from a variety of angles. For the majority of professional applications, such as movie production, where the goal is to track actors' motion for realistic avatar generation, these systems aim to capture 3D motion. Thus, they use a set of cameras recording the same subject to later reconstruct a 3D representation of the markers, similarly to stereoscopic vision in humans. In "marker-based" OMC, reflective markers are placed on the performer's body. These markers can easily be recognized by software, typically built into the camera, usually due to their reflectance. Recording the positions of these markers throughout the performer's actions allows the experimenter to determine the position of the performer's body at any given time. Alternatively, "marker-less" OMC tries to accomplish the same task without the use of special tracking devices. Instead, the actor's silhouette is viewed from different angles and used to reconstruct the entire 3D body of the performer. For the purposes of tracking musicians, the primary use of motion capture has been marker-based, though some research has been run successfully using silhouettes; these examples use only 2D silhouettes (cite Dahl).
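The core of marker-based detection in a single camera view reduces to finding high-intensity blobs in each frame and reporting their centroids. The following is a minimal Python sketch of that idea (the function name, threshold value, and flood-fill approach are my own illustrative choices, not part of any particular OMC product):

```python
import numpy as np

def find_marker_centroids(frame, threshold=200):
    """Detect bright (high-reflectance) markers in a grayscale frame.

    Pixels above `threshold` are grouped into 4-connected blobs via a
    simple flood fill, and the (x, y) centroid of each blob is returned.
    """
    mask = frame > threshold
    visited = np.zeros_like(mask, dtype=bool)
    centroids = []
    h, w = mask.shape
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not visited[sy, sx]:
                # Flood-fill one blob starting from this seed pixel.
                stack = [(sy, sx)]
                pixels = []
                visited[sy, sx] = True
                while stack:
                    y, x = stack.pop()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            stack.append((ny, nx))
                ys, xs = zip(*pixels)
                centroids.append((float(np.mean(xs)), float(np.mean(ys))))
    return centroids
```

In a real system the same role is usually played by hardware thresholding in the camera or by a library blob detector; the sketch only shows why reflective markers make the detection step nearly trivial.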

Cost

While the costs of professional and lab-quality motion capture systems are not readily available, it is reasonable to presume that they range from the tens of thousands (for a limited number of cameras) to hundreds of thousands of dollars (for more complicated setups). Thus, part of the motivation for this project was to evaluate a low-cost alternative to such costly systems that are certainly out of the range of a student budget, and out of the range of most music departments that may be interested in research into music and gesture.

Motivation

This project is an exploration of the data quality that can be obtained using a low-cost 2D motion capture system. I focus mainly on the evaluation of the chosen object-tracking algorithm and its suitability for analyzing the collected data. I leave out the results of the research question itself, as I hope to publish them separately. Thus, I split the Methods roughly into three stages:

   Pre-Capture Setup - Preparing the stage and the subject for use
   Capture & Data Acquisition - Tracking subject movement and building a 2D representation
   Gesture-Tracking Algorithm and Post-Processing editing - Evaluation of the runtime and accuracy of the chosen algorithm

I have implemented a small one-dimensional sample of the marker detection and tracking algorithm that I will use as an example later on. The sample was written in Matlab and is available for download.
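The Matlab sample itself is not reproduced here; as an illustration of the same idea, a one-dimensional marker detector and tracker might look like the following Python sketch (function names, the threshold, and the `max_jump` gating value are my own illustrative assumptions):

```python
import numpy as np

def detect_marker_1d(signal, threshold=0.5):
    """Return the intensity-weighted centroid of the above-threshold
    samples in a 1-D signal, or None if nothing exceeds the threshold."""
    idx = np.where(signal > threshold)[0]
    if idx.size == 0:
        return None
    weights = signal[idx]
    return float(np.sum(idx * weights) / np.sum(weights))

def track_marker_1d(frames, threshold=0.5, max_jump=5.0):
    """Track one marker across a sequence of 1-D 'frames'.

    Each frame is detected independently; if detection fails or the
    position jumps implausibly far, the last known position is held.
    """
    positions = []
    prev = None
    for signal in frames:
        pos = detect_marker_1d(signal, threshold)
        if pos is None or (prev is not None and abs(pos - prev) > max_jump):
            pos = prev  # hold last known position
        positions.append(pos)
        prev = pos
    return positions
```

For example, feeding in three frames containing a Gaussian intensity bump centered at samples 10, 12, and 14 recovers those three positions; the hold-last-position rule is a crude stand-in for the gap-filling that real tracking software performs when a marker is occluded.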

Methods

Results

Conclusion

References