ObjectTracking

From Psych 221 Image Systems Engineering

Applying a computer vision object tracking algorithm to track musicians’ ancillary gestures

Introduction

The Motivating Research Question

When watching a musical performance, one may notice that musicians’ body movements seem closely related to their expressive intentions and to their conceptualization of musical structure, beyond the movements strictly necessary to play the instrument. In recent years, many studies have noted a correspondence between body motion and musical structure at the level of the phrase [1-4]. Vines et al. (2006), for example, asked participants to track clarinetists’ phrasing under audio-only, video-only, and audiovisual conditions; participants with only video cues could follow the phrasing very well, indicating how communicative the visual channel can be for audience members. Indeed, performers used body motion to modulate audience perception of phrase length: in the video-only condition, they extended the perceived length of the last phrase by continuing to move their bodies. A more recent study, MacRitchie et al. (2013), used PCA to analyze motion-capture recordings of nine pianists playing two excerpts of Chopin preludes. Their findings suggested that the overall motion in a performance arises from the performer’s representation of the musical structure, and that repeating body-motion patterns at the level of the phrase can be observed.

These inquiries have largely focused on how musicians physically express musical structure in ancillary gestures on the timescale of the phrase. However, it is possible that ancillary gestures at a sub-phrase level also reflect musicians’ conceptions of musical structure. Vines et al. (2006) mention changes in performers’ facial expressions at important structural moments in the score, though what these facial expressions signified – whether they reflected some sort of emotional change, or were more a depiction of emphasis – is not clear. These kinds of gestures are the interest of the current study: gestures that do not reflect task demands or the capabilities and physical constraints of the subjects, but rather planned conceptual grouping and the possibility of physical expression beyond the means necessary for playing – and whether they correspond to musical structures on a level smaller than the phrase.

Class Project Motivation/Goals

With the goal of exploring how musicians express structural information at a sub-phrase level, an ideal avenue is to investigate gestural cues when performers conceive of the music as containing mostly small melodic groupings rather than long phrases. While there are multiple motion capture systems on the Stanford campus, the campus lacks a sufficiently large population of performers at the level of students aiming to become professional musicians. I therefore felt it necessary to travel off campus to recruit enough participants, perhaps at a slight expense to the quality of the data.

Thus, this project is an exploration of the data quality that can be obtained using a low-cost, easily portable 2D 'motion capture system'. Since the data were collected prior to the start of this quarter, for the scope of this project I focus mainly on the evaluation of the chosen object-tracking algorithm and its suitability for analyzing the collected data. I leave out the results of the research question itself, as I hope to publish those separately. The Methods are therefore split roughly into two stages, the first of which establishes the quality of the data available to the second:

   Data Collection - Pre-data collection setup (preparing the stage and subject), data acquisition
   Gesture-Tracking Algorithm Evaluation - Evaluation of the runtime and accuracy of the chosen algorithm, 
        as well as some post-processing applied to ensure better quality tracking

Background - Optical Motion Tracking

Low-cost 'motion capture' systems have been implemented in the past using computer vision. The umbrella term for these systems is 'optical motion capture' (OMC). Specifically, OMC systems use a number of special cameras that view a scene from a variety of angles. For the majority of professional applications, such as movie production in which the goal is to track actors' motion for realistic avatar generation, these systems aim to capture 3D motion. Thus, they use a set of cameras recording the same subject to later reconstruct a 3D representation of the markers, similar to how stereoscopic vision works in humans. In "marker-based" OMC, reflective markers are placed on the performer's body. These markers can easily be recognized by software typically built into the camera, usually due to their reflectance. Recording the positions of these markers throughout the performer's actions allows the experimenter to determine the position of the performer's body at any given time. Another type, "marker-less" OMC, tries to accomplish 3D reconstruction without the use of special tracking devices. Instead, the actor's silhouette is viewed from different angles and used to reconstruct the performer's entire 3D body. For the purposes of tracking musicians, the primary use of motion capture has been marker-based, though some research has been run successfully using silhouettes; those examples use only 2D silhouettes.

While the costs of professional and lab-quality motion capture systems are not readily published, it is reasonable to presume that they range from the tens of thousands of dollars (for a limited number of cameras) to hundreds of thousands (for more complicated setups). Thus, part of the motivation for this project was to evaluate a low-cost alternative to such costly systems, which are certainly out of the range of a student budget, and out of the range of most music departments that may be interested in research into music and gesture.

Methods

Data Collection

Participants

Fig. 1 - Recording studio setup. The side camera is visible at the left of the picture.

Thirteen cello performance majors (7 female) participated. They were recruited via email over departmental listservs; the email notified them that they would be filmed.

Stimuli

The opening 8 bars of the 3rd Ricercar by Domenico Gabrielli. The cellists were instructed to prepare two versions of the unfamiliar excerpt, one of which had long marked phrases, and the other of which had shorter marked melodic groupings. Tempo, bowings, and fingerings were marked and required for all excerpts.

Recording Setup

Recording took place in a quiet, secure room (Fig. 1). Two Canon Vixia HF R600 camcorders were used for video recording: one to the front at a distance of around 8 ft 10 in, and one to the side at 6 ft 3 in, both filming at 60 fps, the highest frame rate available for this model. This camera was selected after discussions with our department's videographer, who recommended it as a quality camcorder under $300. Audio was recorded using a Zoom H6 at a 96 kHz sampling rate and 24-bit depth. Cellists were lit using a Pro LED 1000 12x12 LED panel light. Cellists wore 1-in squares of retroreflective tape, placed at the center of the forehead, the nose, the tops of the hands, the center point of the clavicle (below the neck), both shoulders, the lower rib, and the right cheek. Participants were asked to wear dark clothing that was not too loose-fitting.

Procedure

Performers were asked to memorize all excerpts prior to the filming session. Upon arrival, they were given a few minutes to warm up, and the markers were attached. They were then asked to play the two versions of the ‘unfamiliar’ excerpt in two ways; in the first way, they were asked to “play as if all of the notes are part of one line”. Performers played each version of the excerpt up to 5 times, or as many times as needed to obtain two acceptable takes. Once two takes for each excerpt/version were obtained with markers on, the markers were removed and one further take was obtained of each excerpt/version.

Gesture-Tracking Algorithm

Fig. 2 - Object tracking algorithm (adapted from Deshmukh, P. K. & Gholap, Y., 2012)

As a starting point, I chose the algorithm described in Deshmukh, P. K. & Gholap, Y. (2012), "Efficient object tracking using K-means and radial basis function" [5]. I include here the abstract from this paper, as well as a flowchart of the algorithm (Fig. 2), as it most succinctly describes their approach.

"In the present article, an efficient method for object tracking is proposed using Radial Basis Function Neural Networks and K-means. This proposed method starts with K-means algorithm to do the segmentation of the object and background in the frame. The Pixel-based color features are used for identifying and tracking the object. The remaining background is also considered. These classified features of object and extended background are used to train the Radial Basis Function Neural Network. The trained network will track the object in next subsequent frames. This method is tested for the video sequences and is suitable for real-time tracking due to its low complexity. The objective of this experiment is to minimize the computational cost of the tracking method with required accuracy."

The reason this algorithm was chosen was twofold.

  • First, it allows the user to segment/track any object they want; it is flexible in not needing a predefined marker shape or color. In my case, I used only two colors of markers (yellow against the performers' dark clothing and green against their skin) to provide maximal contrast, and both were square, but the algorithm is robust to future changes in the setup.
  • Second, it was designed to be robust to potential changes in the shape of the object. While the markers do not change shape as such, they may go partially out of view and back, and this algorithm should in theory handle that.

Results

In this section, I describe the results of the evaluation of the algorithm, as well as a critical analysis of its suitability as a motion-capture data-extraction tool.

K-means Evaluation

Fig. 3 - Size of bounding box vs. runtime of K-means portion of algorithm.

The original paper proposed that the K-means segmentation of the object color from its background would not be time-consuming. Fig. 3 shows the runtime of this portion of the algorithm as a function of the side length of the square bounding box within which it segments the colors into two clusters. As expected, runtime appeared to be a roughly linear function of bounding box size.
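
For reference, a short sketch of how timings like those in Fig. 3 can be gathered, timing only the K-means call for square boxes of increasing side length; the file name and box center below are placeholder values.

   % Time the K-means segmentation for square crops of increasing side length.
   % File name and box center (cx, cy) are placeholders.
   v     = VideoReader('cellist_front.mp4');
   frame = readFrame(v);
   cx = 540; cy = 160;                           % hypothetical marker position
   sides = 20:20:200;                            % box side lengths to test (px)
   t     = zeros(size(sides));
   for k = 1:numel(sides)
       s     = sides(k);
       patch = imcrop(frame, [cx - s/2, cy - s/2, s, s]);
       X     = double(reshape(patch, [], 3));
       tic;
       kmeans(X, 2, 'Replicates', 3);
       t(k) = toc;
   end
   plot(sides, t, 'o-');
   xlabel('Bounding box side (px)'); ylabel('K-means runtime (s)');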

RBFNN Speed Evaluation

Fig. 4 - Suggested modifications to algorithm to improve speed without sacrificing accuracy.

Completion rate for one marker in one video: MacBook Air, 1.098 s/frame; MacBook Pro, 0.746 s/frame. The mean length of the videos tested was 1642 frames (SD = 182 frames), so the average time to process one marker for one 60 fps video was 30.05 min on the MacBook Air.

Due to the sheer runtime of this section of the algorithm, I do not have a plot evaluating it. However, it became clear through this process that this section of the algorithm would be too slow for the amount of data that I have. So, I propose the following changes to the algorithm (Fig. 4): the object is found in every 10th frame and a potential trajectory of the marker is interpolated. Then, on a second pass, the object location is found at the middle frames between those of the first pass (the 5th, 15th, etc.). If the marker location is within an acceptable margin of error of the interpolated trajectory, no change is made to the interpolation. If the found location is outside an acceptable margin of error, the interpolation is refit both immediately prior to and after that frame, and the algorithm iteratively checks the middle frames again (2nd/3rd, 7th/8th). This would significantly reduce runtime, and could even first be tested with a larger frame jump (e.g., the first pass tests every 20 frames). The figure to the right shows this alternative algorithm. Note also that, on further reflection, I found the background-extension step of their algorithm unnecessary for these purposes, so it is removed in the accompanying figure.
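
A sketch of how this proposed two-pass scheme could be structured is below. It is not yet implemented, and detectMarker() is a hypothetical wrapper around the K-means/RBFNN detector that returns one [x y] marker centroid for a given frame; for brevity, the sketch falls back to dense detection within a bad segment rather than the full iterative midpoint refinement described above. The file name is a placeholder.

   % Proposed coarse-to-fine tracking (not yet implemented).
   % detectMarker(video, idx) is a hypothetical helper returning an [x y] centroid.
   video   = VideoReader('cellist_front.mp4');          % placeholder file name
   nFrames = floor(video.Duration * video.FrameRate);
   step    = 10;                                         % first-pass spacing (could also try 20)
   tol     = 5;                                          % acceptable error in pixels

   % First pass: detect the marker on a coarse grid of frames
   keyIdx = unique([1:step:nFrames, nFrames]);
   keyPos = zeros(numel(keyIdx), 2);
   for k = 1:numel(keyIdx)
       keyPos(k, :) = detectMarker(video, keyIdx(k));
   end

   % Interpolate a candidate trajectory for every frame
   traj = interp1(keyIdx, keyPos, 1:nFrames, 'pchip');

   % Second pass: verify midpoints, re-detect densely where the error is too large
   mids = round((keyIdx(1:end-1) + keyIdx(2:end)) / 2);
   for m = mids
       found = detectMarker(video, m);
       if norm(found - traj(m, :)) > tol
           seg = max(m - floor(step/2), 1) : min(m + floor(step/2), nFrames);
           for j = seg
               traj(j, :) = detectMarker(video, j);
           end
       end
   end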

Algorithm Accuracy

Overall, accuracy was satisfactory when testing both 60 fps and 30 fps (tracking every other frame). Tracking failed in 9 of the 104 videos, usually due to facial hair near a marker or head hair slipping onto the face; in these cases, a color correction was applied using commercial video-editing software to highlight the marker of interest. This change resulted in accurate tracking. (See Fig. 5 for examples.)

Fig. 5 - Bounding box at frames 1 and 1201 for two participants. Green lines indicate the centroid trajectory. Color adjustments improved poor tracking.
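
For visual checks like those in Fig. 5, the tracked centroids can be overlaid on a frame. The snippet below is a small sketch, assuming traj is the nFrames x 2 centroid array produced by the tracker and using a placeholder file name.

   % Overlay a tracked centroid trajectory on the first frame (cf. Fig. 5).
   % Assumes traj (nFrames x 2) from the tracking step; file name is a placeholder.
   v = VideoReader('cellist_front.mp4');
   f = read(v, 1);                                          % first frame
   imshow(f); hold on;
   plot(traj(:, 1), traj(:, 2), 'g-', 'LineWidth', 1.5);    % green trajectory line
   hold off;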

Conclusion

Overall, I find this algorithm to be a suitable solution for marker detection in an inexpensive, mobile 2D motion capture system. The neural-network stage, however, is too slow for my videos (approximately 30 s long on average), especially since I only tested the tracking of one marker per video, while the goal is to track all of the markers placed on the performer. I therefore proposed a modification in which only a fraction of the frames are tracked on a first pass, and the potential path of the markers is interpolated between those tracked frames; if the interpolation is inaccurate, the algorithm iteratively corrects the poor estimates. Due to time constraints, the implementation and testing of this proposed solution is left as future work.

References

[1] Krumhansl & Schenck (1997). Can Dance Reflect the Structural and Expressive Qualities of Music? A Perceptual Experiment on Balanchine's Choreography of Mozart's Divertimento No. 15. Musicae Scientiae, 1(1), 63-85.
[2] MacRitchie, Buck, & Bailey (2013). Inferring musical structure through bodily gestures. Musicae Scientiae, 17(1), 86-106.
[3] Vines, Krumhansl, Wanderley, & Levitin (2006). Cross-modal interactions in the perception of musical performance. Cognition, 101, 80-113.
[4] Wanderley, Vines, Middleton, McKay, & Hatch (2005). The Musical Significance of Clarinetists' Ancillary Gestures: An Exploration of the Field. Journal of New Music Research, 34(1), 97-113.
[5] Deshmukh, P. K. & Gholap, Y. (2012). Efficient object tracking using K-means and radial basis function. International Journal of Advanced Research in Computer and Communication Engineering, 1(1).

Appendix

For easy reference, I include here the link to the MATLAB code provided in their paper, for those interested. [1]