Revision as of 23:08, 10 December 2015
Applying a computer vision object tracking algorithm to track musicians’ ancillary gestures
Introduction
The motivating question: Musicians' gestures at a sub-phrase level
When watching a musical performance, one may notice that musicians' body movements seem closely related to their expressive intentions and to their conceptualization of musical structure, beyond the movements strictly necessary to play the instrument. In recent years, many studies have noted the correspondence of body motion with structure at the level of the phrase (Krumhansl & Schenck, 1997; MacRitchie, Buck, & Bailey, 2013; Vines, Krumhansl, Wanderley, & Levitin, 2006; Wanderley, Vines, Middleton, McKay, & Hatch, 2005). In Vines et al. (2006), participants tracked the phrasing of clarinetists under audio-only, video-only, and audiovisual conditions; participants with only video cues could follow phrasing very well, indicating how communicative the visual channel can be for audience members. Indeed, performers used body motion to modulate the audience's perception of phrase length: in the video-only condition, they extended the perceived length of the last phrase by continuing to move their bodies. More recently, MacRitchie et al. (2013) used PCA to analyze motion-capture recordings of 9 pianists playing two excerpts of Chopin preludes. Their findings suggested that overall motion in a performance arises from the performer's representation of the musical structure, and that repeating body-motion patterns at the level of the phrase can be observed.
These inquiries have largely focused on how musicians physically express musical structure through ancillary gestures on the timescale of the phrase. However, it is possible that ancillary gestures at a sub-phrase level also reflect musicians' conceptions of musical structure. As noted in Godoy et al. (2010), the motion of a pianist's head was cyclical at shorter timescales, corresponding to melodic figures. Vines et al. (2006) also mention changes in performers' facial expressions at important structural moments in the score, though what these expressions signified (whether they reflected some sort of emotional change, or were more related to a depiction of emphasis) is not clear. These kinds of gestures are the interest of the current study: those that reflect not task demands or the capabilities and physical constraints of subjects, but rather planned conceptual grouping and the possibility of physical expression beyond the means necessary for playing, and whether they correspond to musical structures on a level smaller than the phrase.
Motivation/Goals
With the goal of exploring how musicians express structural information on a sub-phrase level, an ideal approach is to investigate gestural cues when performers conceive of the music as containing mostly small melodic groupings rather than long phrases. While there are multiple motion capture systems on the Stanford campus, the campus lacks a sufficient population of performers at the level of students aiming to become professional musicians; I therefore felt it necessary to travel off campus to obtain a sufficient number of participants, perhaps at a slight expense in data quality.
Thus, this project is an exploration of the data quality that can be obtained using a low-cost, easily portable 2D 'motion capture system'. Since the data were collected prior to the start of this quarter, for the scope of this project I focus mainly on the evaluation of the chosen object-tracking algorithm and its suitability for analyzing the collected data. I leave out the results of the research question itself, as I hope to publish them separately. Roughly, then, I split the Methods into two stages, the first of which establishes the quality of the data available to the second:
Data Collection - Pre-data collection setup (preparing the stage and subject), data acquisition
Gesture-Tracking Algorithm Evaluation - Evaluation of the runtime and accuracy of the chosen algorithm, as well as some post-processing applied to ensure better-quality tracking
I have implemented a small one-dimensional sample of the marker detection and tracking algorithm, which I will use as an example later on. The sample was written in Matlab and is available for download.
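The downloadable sample itself is in Matlab. As a rough illustration of the same idea (not the actual sample), the following is a minimal Python sketch of one-dimensional marker detection by brightness thresholding, with nearest-neighbor association between frames; the function names, threshold value, and synthetic signals are all my own choices:

```python
import numpy as np

def detect_markers_1d(signal, threshold=0.8):
    """Return indices of local brightness peaks at or above the threshold."""
    peaks = []
    for i in range(1, len(signal) - 1):
        if (signal[i] >= threshold
                and signal[i] >= signal[i - 1]
                and signal[i] > signal[i + 1]):
            peaks.append(i)
    return peaks

def track_nearest(prev_positions, detections):
    """Associate each previously tracked marker with its nearest new detection."""
    return [min(detections, key=lambda d: abs(d - p)) for p in prev_positions]

# Two frames of a synthetic 1D "brightness" signal, each with two bright markers.
frame1 = np.zeros(20); frame1[[5, 14]] = 1.0
frame2 = np.zeros(20); frame2[[6, 13]] = 1.0

markers = detect_markers_1d(frame1)                           # [5, 14]
tracked = track_nearest(markers, detect_markers_1d(frame2))   # [6, 13]
```

Even this toy version shows the two distinct problems the full algorithm must solve: finding the markers in each frame, and deciding which detection corresponds to which marker over time.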
Background - Optical Motion Tracking
Low-cost 'motion capture' systems have been implemented in the past using computer vision. The umbrella term for these systems is 'optical motion capture' (OMC). Specifically, OMC systems use a number of special cameras that view a scene from a variety of angles. For the majority of professional applications, such as movie production, in which the goal is to track actors' motion for realistic avatar generation, these systems aim to capture 3D motion. Thus, they use a selection of cameras recording the same subject to later reconstruct a 3D representation of the markers, similarly to stereoscopic vision in humans. In "marker-based" OMC, reflective markers are placed on the performer's body. These markers can easily be recognized by software, typically built into the camera, usually due to their reflectance. Recording the positions of these markers throughout the performer's actions allows the experimenter to determine the position of the performer's body at any given time. Alternatively, "marker-less" OMC tries to accomplish the same task without the use of special tracking devices. Instead, the actor's silhouette is viewed from different angles and used to reconstruct the entire 3D body of the performer. For the purposes of tracking musicians, the primary approach has been marker-based, though some research has been run successfully using silhouettes, albeit only 2D ones (cite Dahl).
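In a marker-based setup like the one just described, the core detection step can be sketched as a brightness threshold followed by grouping bright pixels into blobs and taking each blob's centroid. The following is a minimal, illustrative NumPy sketch under my own assumptions (the threshold value and synthetic frame are invented; real camera software is far more sophisticated):

```python
import numpy as np
from collections import deque

def marker_centroids(frame, threshold=200):
    """Threshold a grayscale frame, group bright pixels into 4-connected
    blobs via BFS flood fill, and return each blob's centroid (row, col)."""
    bright = frame > threshold
    seen = np.zeros_like(bright, dtype=bool)
    centroids = []
    for r, c in zip(*np.nonzero(bright)):
        if seen[r, c]:
            continue
        queue, blob = deque([(r, c)]), []
        seen[r, c] = True
        while queue:
            y, x = queue.popleft()
            blob.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < frame.shape[0] and 0 <= nx < frame.shape[1]
                        and bright[ny, nx] and not seen[ny, nx]):
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        ys, xs = zip(*blob)
        centroids.append((sum(ys) / len(blob), sum(xs) / len(blob)))
    return centroids

# Synthetic 8-bit frame with two 2x2 "retroreflective" patches.
frame = np.zeros((10, 10), dtype=np.uint8)
frame[2:4, 2:4] = 255
frame[6:8, 7:9] = 255
```

This sketch relies on the markers being far brighter than anything else in the frame, which is exactly why retroreflective tape and controlled lighting matter in the recording setup described below.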
While the costs of professional and lab-quality motion capture systems are not readily available, it is reasonable to presume that they range from tens of thousands of dollars (for a limited number of cameras) to hundreds of thousands (for more complicated setups). Thus, part of the motivation for this project was to evaluate a low-cost alternative to such costly systems, which are certainly out of the range of a student budget, and out of the range of most music departments that may be interested in research into music and gesture.
Methods
Data Collection
Participants
13 cello performance majors (7 female) were recruited via email over departmental listservs. The email notified them that they would be filmed.
Stimuli
The unfamiliar excerpt was the opening 8 bars of the 3rd Ricercar by Domenico Gabrielli. The cellists were instructed to prepare two versions of it: Version A, with long marked phrases, and Version B, with shorter marked melodic groupings. The familiar excerpt was the first 18 bars of the Courante from the G major Suite by J.S. Bach, which is noted to contain motivic implied polyphony (Winold, 2007). Tempo, bowings, and fingerings were marked and required for all excerpts.
Recording Setup
Recording took place in a quiet, secure room. Two Canon Vixia HF R600 camcorders were used for video recording: one to the front, at a distance of around 8 ft 10 in, and one to the side at 6 ft 3 in, both filming at 60 fps, the highest frame rate available on this model. The camera was selected after discussions with the department's videographer, who recommended it as a quality camcorder under $300. Audio was recorded using a Zoom H6, at a 96 kHz sampling rate and 24-bit depth. Cellists were lit using a Pro LED 1000 12x12 LED panel light. Cellists wore 1-in squares of retroreflective tape, placed at the center of the forehead, the nose, the top of each hand, the center point of the clavicle (below the neck), both shoulders, the lower rib, and the right cheek. Participants were asked to wear dark clothing that was not too loose-fitting.
Procedure
Performers were asked to memorize all excerpts prior to the filming session. Upon arrival, they were given a few minutes to warm up, and markers were attached. They were then asked to play the two versions of the ‘unfamiliar’ excerpt, in two ways. In the first way, they were asked to “play as if all of the notes are part of one line”. Performers played each version of the excerpt up to 5 times, or as many times as needed to get two acceptable takes. Once two takes of each excerpt/version were obtained with markers on, the markers were removed and one further take of each excerpt/version was recorded.
Gesture-Tracking Algorithm
As a starting point, I chose the algorithm described in Deshmukh, P. K., & Gholap, Y. (2012), "Efficient object tracking using K-means and radial basis function." I include here the abstract from this paper, as well as a flowchart of the algorithm's process, as it most succinctly describes their method.
"In the present article, an efficient method for object tracking is proposed using Radial Basis Function Neural Networks and K-means. This proposed method starts with K-means algorithm to do the segmentation of the object and background in the frame. The Pixel-based color features are used for identifying and tracking the object. The remaining background is also considered. These classified features of object and extended background are used to train the Radial Basis Function Neural Network. The trained network will track the object in next subsequent frames. This method is tested for the video sequences and is suitable for real-time tracking due to its low complexity. The objective of this experiment is to minimize the computational cost of the tracking method with required accuracy."

[Figure: Algorithm flowchart.] Pixel-based color segmentation of an object from its (manually selected) background; the input to the RBFNN is the RGB values of the object and background.