Michael, Alissa
Introduction

Systems neuroscience is concerned with understanding how the brain works at the scale of systems (neural circuits, cortical regions, etc.). With the emergence of multichannel recordings over recent decades, the field has been revolutionized by access to hundreds of simultaneously recorded neurons, and awake, behaving animal experiments have become even more critical to advancing our understanding (1). Classical nonhuman primate (NHP) in-rig experiments constrain all bodily movements except the movement of interest, reducing confounding variables so that tight correlations can be drawn between the desired behavior and neural population dynamics (2). However, it is unclear whether these results generalize to ambulatory behavior. Further, some evidence suggests that the complexity and variability of neural recordings are constrained by the complexity of the task being performed, artificially and unintentionally limiting the observed neural data (3). Addressing this requires experiments with higher task complexity.
Our lab will conduct freely moving experiments to directly ask whether increasing task complexity yields greater neural variance and how that extra neural variance correlates with various limb kinematics. To do this, we aim to simultaneously record neural data from motor regions of cortex in a freely moving rhesus macaque using a commercial wireless electrophysiology system, while capturing video of limb kinematics with multiple stereo depth cameras surrounding a large, transparent observation rig (Figure 1a). These 3D data will yield a point cloud that will be fit to a skeleton to extract the monkey's kinematics. In this project, we attempt to determine whether the Intel RealSense D435 depth camera is a suitable device for capturing this point cloud data.
Background


The Intel RealSense D400 Series depth cameras are low-cost, lightweight, and powerful commercial stereo tracking solutions capable of recording both depth and RGB data. The series includes two cameras, the D415 and the D435. Both have the same maximum depth resolution of 1280x720 and provide RGB-D data over USB 3. However, there are important differences between the two with regard to field of view and shutter type, which factored into our choice of camera.
The Intel RealSense D415 camera has a relatively narrow field of view of about 70 degrees, which provides higher depth quality per degree. It is useful for imaging smaller objects and taking precise measurements because of this high depth resolution. However, it uses a rolling shutter, which scans the image sequentially from one side of the sensor to the other. Images captured by rolling shutter sensors exhibit spatial distortion because pixels across the scene are not captured at exactly the same time. This is especially apparent in scenes with motion, since different parts of the movement are captured at different times. For our research purposes, the D415 is therefore unsuitable: it would spatially distort the limb and body movements we image and reconstruct.
The Intel RealSense D435 (Figure 2) depth camera provides a wide field of view and global shutter sensors. Its depth stream employs a global shutter, which scans the entire area of the image simultaneously, has a diagonal field of view of approximately 95 degrees, and at a resolution of 848x480 pixels can achieve frame rates of up to 90 frames per second. Able to detect depths from 0.1 to 10 meters, the camera calculates depth using two monochrome sensors, as shown in Figure 2. These monochrome sensors, which detect both visible and infrared light, perform stereoscopic matching to calculate the frame's depth values. As depicted in Figure 3, stereoscopic matching consists of first calculating the disparity (i.e., the shift along the horizontal axis) between the images formed by the two sensors. Given the focal length f and the baseline distance b (i.e., the distance between the imaging sensors), the depth z can then be calculated with the following equation:

    z = f * b / (x1 - x2)

where x1 and x2 are the horizontal positions of the object as captured by cameras 1 and 2, respectively.
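The disparity-to-depth relation can be sketched in a few lines. The focal length and baseline values below are illustrative placeholders, not the D435's actual calibration parameters:

```python
# Depth from stereo disparity: z = f * b / (x1 - x2).
# focal_px and baseline_m are example values, not D435 calibration data.

def depth_from_disparity(x1, x2, focal_px=640.0, baseline_m=0.05):
    """Return depth in meters given the horizontal pixel positions x1, x2
    of the same object in the left and right images."""
    disparity = x1 - x2
    if disparity <= 0:
        raise ValueError("disparity must be positive for a valid depth")
    return focal_px * baseline_m / disparity

# An object whose image shifts 32 px between the two sensors:
z = depth_from_disparity(400.0, 368.0)  # 640 * 0.05 / 32 = 1.0 m
```

Note how depth is inversely proportional to disparity: distant objects produce small disparities, so a fixed one-pixel matching error translates into a depth error that grows with distance.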
Along with these depth sensors, the RealSense D435 is equipped with an infrared projector which illuminates the scene with infrared dots, providing additional markers for image alignment and helping to reduce measurement error.
Accuracy and Temporal Noise Characterization
Methods
To establish a baseline for the error present in the camera's depth measurements, which we later use in point cloud reconstruction with a multi-camera setup, we first characterized the depth measurement accuracy and temporal noise of a single camera. To accomplish this, we designed an experimental rig, depicted in Figures 4a and 4b, which oriented the imaging plane of the camera parallel to the surface of the wall. The camera could then be adjusted to different distances from the wall while maintaining this orientation. The depth value reported by the camera is the distance from the object (in this experiment, the wall) to the plane of the camera sensors. With our setup, we measured exactly how far the front of the camera was from the wall, used this measurement as ground truth, and compared it against the reported depth values to calculate the measurement error. We performed recordings at distances ranging from 0.2 to 2 meters in increments of 0.2 meters. Five 1-minute recordings with a sampling rate of 90 frames per second were taken at each distance, and analysis for all recordings was confined to the same section of wall, which had a matte, textured finish.

Results
With the recordings described in the Methods section, two metrics were calculated: (1) depth error and (2) temporal noise. Depth error, which quantifies the camera's ability to accurately determine an object's distance, can be described as

    depth error = (1/N) * sum_i (d_i - d_true)

which is simply the average of the differences between the N wall depth values d_i reported by the camera and the true distance d_true to the wall. Temporal noise, which characterizes the spread of depth values on an object over the duration of a recording, is calculated as

    temporal noise = (1/P) * sum_p sigma_p

where sigma_p is the standard deviation of pixel p's depth over a recording, averaged across the P pixels analyzed.
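The two metrics above can be sketched as follows, assuming a recording is a (frames, height, width) array of depth values in meters; the array and function names are ours, not part of the camera SDK:

```python
import numpy as np

def depth_error(frames, d_true):
    """Mean difference between reported depths and the ground-truth distance."""
    return float(np.mean(frames - d_true))

def temporal_noise(frames):
    """Per-pixel standard deviation over time, averaged across pixels."""
    return float(np.mean(np.std(frames, axis=0)))

# Synthetic example: a wall at 1.0 m with a 2 mm bias and 3 mm Gaussian noise.
rng = np.random.default_rng(0)
frames = 1.0 + 0.002 + rng.normal(0.0, 0.003, size=(90, 48, 64))
err = depth_error(frames, 1.0)     # close to the 2 mm bias
noise = temporal_noise(frames)     # close to the 3 mm noise level
```

On real recordings, both values were computed over the same matte wall region at each camera distance.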
Plots of these calculations at different distances can be viewed in Figure 5. As seen in these figures, both depth error and temporal noise increase as the square of the distance between the wall and the camera, which—as described below—impacts the ability of these cameras to accurately reconstruct 3D point clouds from this data.

While performing these calculations, we also analyzed the effect that different surfaces had on camera error. Examining a painted section of wall, shown in Figure 6, we evaluated the temporal noise at 2 meters. A heat map of the pixel standard deviations over the minute-long recording can be viewed in Figure 6a. This heat map shows that glossier areas (such as the brown trees in Figure 6b) produce significantly more temporal noise than areas with a matte finish. This highlights the D435's susceptibility to increased error in highly reflective environments, which must be considered whenever recordings are performed.

Camera Calibration

Methods
With the characterization of a single camera complete, we moved to a multi-camera setup, depicted in Figure 7. This setup was composed of two cameras slightly offset from one another. This arrangement not only provided a larger field of view, but the collection of multiple perspectives also allowed us to capture a more complete representation of the object. To combine the data from these cameras, however, we needed to relate the camera coordinate systems to each other. While methods such as checkerboard calibration exist to calibrate cameras using RGB data, we avoided this approach for two reasons: (1) aligning the depth and color camera sensors could introduce additional error into our measurements; and (2) our project does not seek to capture RGB data. We therefore developed a calibration technique that relies solely on depth data.
The first step in this process was to fabricate a calibration tool, shown in Figure 8, which gives us an automated way to calibrate the cameras using only depth data. The tool is an asymmetrical body consisting of four spheres, providing four locations where corresponding points between the two cameras can easily be extracted from depth data alone. Spheres were chosen because they present clear circular edges from any viewpoint and are easily found in a scene reconstructed from depth data alone, making them well suited to the template matching algorithm we employed to automatically locate their positions in the image. There is also a simple relationship between a sphere's depth and its apparent radius: closer spheres appear larger in the image and farther spheres appear smaller. We exploited this property in our automated algorithm for locating the spheres.
Template matching was performed by taking a cropped image of the sphere, with all values outside the sphere set to zero. This cropped image, or template, was then swept across the image, and the location that most closely matched the template was returned. This process is illustrated in Figure 8, with numbered templates lining the left-hand side of the image and numbered squares in the depth image marking the regions that best aligned with each template. With the location of the sphere determined, our algorithm returns the midpoint of the sphere, which provides a common point between the two cameras that can be used for alignment.
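A minimal sketch of this sweep, using a brute-force sum-of-squared-differences score on a depth image (our production code's masking and scale handling are omitted here):

```python
import numpy as np

def match_template(depth_img, template):
    """Slide the template over the depth image; return the top-left corner
    of the window with the smallest sum of squared differences."""
    th, tw = template.shape
    H, W = depth_img.shape
    best, best_pos = np.inf, (0, 0)
    for r in range(H - th + 1):
        for c in range(W - tw + 1):
            patch = depth_img[r:r + th, c:c + tw]
            score = np.sum((patch - template) ** 2)
            if score < best:
                best, best_pos = score, (r, c)
    return best_pos

def sphere_midpoint(match_pos, template_shape):
    """Pixel coordinate at the center of the matched bounding box."""
    r, c = match_pos
    th, tw = template_shape
    return (r + th // 2, c + tw // 2)

# Toy example: plant a 5x5 near-depth patch in a flat 2.0 m background.
img = np.full((40, 40), 2.0)
tmpl = np.full((5, 5), 1.2)
img[10:15, 20:25] = tmpl
pos = match_template(img, tmpl)         # (10, 20)
mid = sphere_midpoint(pos, tmpl.shape)  # (12, 22)
```

A production implementation would use a vectorized correlation rather than explicit loops, but the logic is the same.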
Results
With this algorithm, we can successfully locate the four sphere positions, providing four common points with which to align our camera system. Templates 3 and 4 in Figure 8 provide examples of successful template matches. Once a template is matched to a sphere on the calibration tool, the midpoint of the sphere is automatically found by taking the pixel coordinate at the center of the bounding box. For our two-camera setup, the cameras were positioned close enough together that the midpoints of the spheres in both images provided good estimates of corresponding points. We then used these corresponding points to calculate the rotation and translation matrices of one camera with respect to the other.
In its current state, however, our algorithm can produce false positive matches. For example, templates 1 and 2 in Figure 8 correspond to templates that do not match the spheres on our calibration tool, yet find high-confidence matching locations outside the tool. Currently, these false positives must be manually excluded when running the algorithm, but future iterations of this calibration technique will look to remedy this issue. Future iterations will also account for different orientations of camera placement: instead of using the midpoints as corresponding points between the two cameras, we will template match to the curvature of the sphere to find the exact location of the full sphere, allowing corresponding points to be found for any camera orientation.

3D Point Cloud Construction
Methods

With two sets of equivalent points described in each camera's xyz coordinate system, we used the approach outlined in (4) to calculate the rotation matrix and translation vector, allowing us to transform points in one camera's coordinate system to the coordinate system of the second camera. For 3 x n matrices A and B, which represent corresponding sets of 3D points for camera 1 and camera 2, respectively, this approach can be outlined as follows:
- Calculate the centroid of each set of points:

      centroid_A = (1/n) * sum_i a_i,   centroid_B = (1/n) * sum_i b_i

  Shown in Figure 9, this is simply the average of the x, y, and z values for all points in a set.
- Center both sets of points at the origin (Figure 9). This is accomplished by subtracting the centroid of each set from all the points in the corresponding set:

      A' = A - centroid_A,   B' = B - centroid_B

- To find the rotation from A to B in Figure 9, we take the singular value decomposition (SVD) of the cross-covariance matrix of A and B. This cross-covariance matrix can be calculated as

      H = A' * B'^T

- We can then take the SVD of this cross-covariance matrix:

      H = U * S * V^T

- Since we are assuming a rigid transformation (no scaling or shearing), the diagonal matrix S can be neglected. The rotation matrix is then given as

      R = V * U^T

- This rotation can be applied to A so that it possesses the same orientation as B, as shown in Figure 9.
- To determine the translation vector, we find the vector joining the centroid of B and the centroid of the rotated set A, seen in Figure 9. This can be calculated as

      t = centroid_B - R * centroid_A
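The steps above can be condensed into a short routine; the synthetic rotation and translation at the bottom are for verification only:

```python
import numpy as np

def rigid_transform(A, B):
    """Given 3 x n point sets A and B with B ~= R @ A + t,
    recover the rotation R and translation t (Arun et al.)."""
    centroid_A = A.mean(axis=1, keepdims=True)
    centroid_B = B.mean(axis=1, keepdims=True)
    Ac, Bc = A - centroid_A, B - centroid_B   # center both sets at the origin
    H = Ac @ Bc.T                             # cross-covariance matrix
    U, S, Vt = np.linalg.svd(H)               # H = U S V^T
    R = Vt.T @ U.T                            # candidate rotation
    if np.linalg.det(R) < 0:                  # guard against a reflection
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = centroid_B - R @ centroid_A           # translation between centroids
    return R, t

# Verify on a 30-degree rotation about z plus a shift.
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([[0.1], [0.2], [0.3]])
A = np.random.default_rng(1).random((3, 4))
B = R_true @ A + t_true
R, t = rigid_transform(A, B)
```

The determinant check handles the degenerate case where the SVD yields an improper rotation (a reflection), which rigid motion rules out.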
Results
Using the method above to calculate the rotation matrix and translation vector, we performed aligned point cloud reconstruction with the two-camera setup, imaging the calibration tool at 0.2-meter increments from 0.2 to 1 meter. As shown in Figure 10, by applying the appropriate rotation and translation, data from the first camera can be transformed into the coordinate space of camera 2, allowing the data to be combined. While each individual camera image (Figures 10a, 10b) contains only a partial depiction of the calibration tool, parts of the object occluded in camera one are visible in camera two, so the combined point cloud provides a more complete representation of the object. We repeated this process at greater distances to relate our multi-camera point cloud reconstruction to the quantitative depth error analysis above. Figure 11 shows depth heatmaps of the object from both cameras at distances of 0.4, 0.6, 0.8, and 1 meter. The object is sharper and has higher resolution at depths of less than 0.6 meters, but as the depth increases past 0.8 meters the resolution dramatically decreases, and the depth data show more noise and spread than at closer distances. As depicted in Figure 12, the point cloud reconstruction quality degrades accordingly: the data become noisier, the alignment between the cameras worsens, and finer details of the calibration tool become difficult to distinguish. This rapid increase in measurement noise with distance, a phenomenon characterized above for a single camera, is a significant drawback of the RealSense D435, limiting its ability to perform 3D reconstruction outside of close-range measurements.
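The merging step itself is a one-line transform followed by concatenation; R and t below stand in for the calibration output:

```python
import numpy as np

def merge_clouds(points_cam1, points_cam2, R, t):
    """Map camera 1's (3, n) point cloud into camera 2's frame and
    concatenate the two clouds into one (3, n1 + n2) array."""
    aligned = R @ points_cam1 + t   # camera 1 -> camera 2 coordinates
    return np.hstack([aligned, points_cam2])

# Identity example: cameras already aligned, so the clouds simply concatenate.
p1 = np.zeros((3, 5))
p2 = np.ones((3, 7))
merged = merge_clouds(p1, p2, np.eye(3), np.zeros((3, 1)))
```

With real calibration output, the same call places both partial views of the calibration tool in a single coordinate frame.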
Conclusions
For this project, we wanted to determine whether the Intel RealSense D435 depth cameras could accurately reconstruct 3D images of freely moving non-human primates. To do this, we quantified the depth measurement error at different distances and the variance of the depth measurement across time. We then created our own calibration tool and algorithm to calibrate two cameras using only depth data, calculating the rotation and translation matrices that relate the two cameras' coordinate systems. After the two images were aligned to one coordinate frame, we reconstructed the point clouds from each camera and evaluated their quality at different depths.
After our analysis of the depth data and point cloud reconstruction, we believe that the Intel RealSense cameras could work to recreate point clouds of animal motion, but with limitations. The biggest limitation we face is that the depth error per pixel grows very quickly with distance (as the square of the distance). In our planned setup for imaging non-human primates, the cameras will be more than 1 meter away from the animal, meaning that the depth error will be large and the resolution of fine features will be low. For our purposes, we should be able to accurately reconstruct large-scale limb movements, such as arm reaching or locomotion, but not fine movements such as individual finger movements.
To increase the accuracy and viability of these cameras for our purpose, we will continue to refine the calibration process to limit the error introduced when transforming between camera coordinate systems, and continue to fine-tune our point cloud reconstruction. We will fully automate our template matching algorithm to rule out false positives and return only points on the spheres of the calibration tool. We will also compare our depth-based calibration technique to the well-known checkerboard algorithm using RGB data, to determine whether we have improved upon calibration using depth data alone. Using our depth error analysis, we will construct point clouds from weighted averages to create a more accurate representation: points farther from the camera will be weighted less than points closer to the camera, because we know quantitatively that error grows with depth. Finally, we will create a volumetric point cloud using four or more cameras, capturing a 360-degree view of the object for a more complete image.
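One way the planned depth-weighted averaging could look, given our finding that error grows roughly with the square of distance: weight each camera's estimate of a shared point by 1/z^2. The weighting constant and the point-correspondence step are assumptions here, not a finalized design.

```python
import numpy as np

def weighted_fuse(p_cam1, p_cam2):
    """Fuse two (3,) estimates of the same physical point (both already in a
    shared frame), weighting each by the inverse square of its depth z,
    since measured depth error grew roughly as z^2."""
    w1 = 1.0 / max(p_cam1[2], 1e-6) ** 2
    w2 = 1.0 / max(p_cam2[2], 1e-6) ** 2
    return (w1 * np.asarray(p_cam1) + w2 * np.asarray(p_cam2)) / (w1 + w2)

# A nearby (0.5 m) estimate dominates a farther (1.0 m) one.
fused = weighted_fuse(np.array([0.0, 0.0, 0.5]),
                      np.array([0.1, 0.0, 1.0]))
```

Here the 0.5 m point carries four times the weight of the 1.0 m point, pulling the fused estimate toward the more reliable measurement.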
References
1. Cunningham, John P., and Byron M. Yu. "Dimensionality reduction for large-scale neural recordings." Nature Neuroscience 17.11 (2014): 1500-1509.
2. Georgopoulos, Apostolos P., et al. "Spatial coding of movement: a hypothesis concerning the coding of movement direction by motor cortical populations." Experimental Brain Research 49.Suppl. 7 (1983): 327-336.
3. Gao, Peiran, and Surya Ganguli. "On simplicity and complexity in the brave new world of large-scale neuroscience." Current opinion in neurobiology 32 (2015): 148-155.
4. Arun, K. S., Huang, T. S., and Blostein, S. D. "Least-squares fitting of two 3-D point sets." IEEE Transactions on Pattern Analysis and Machine Intelligence 9.5 (1987): 698-700.
Appendix
Our code can be found at [code.stanford.edu]