Ray Tracing with Neural Networks
Introduction
Accurate simulation of imaging systems is crucial for enabling cost-effective, rapid design processes and for training machine learning algorithms. Such simulation tools typically incorporate the full imaging pipeline, including the scene, camera optics, and image sensor. However, the proprietary nature of lens designs often leads manufacturers to withhold specific details, posing a challenge for accurate simulation. To address this challenge, Goossens et al. (2022) propose a solution that leverages Zemax black-box lens models. At the core of this method is the Ray Transfer Function (RTF), which characterizes how the position and angle of a ray entering a lens determine its exit position and angle, providing a practical approach to lens modeling without exposing proprietary design information. Goossens et al. modeled the RTF using a polynomial fit and showed promising results [1]. The goal of this project is to explore the potential of neural networks to approximate the ray transfer function and enable efficient, high-fidelity ray tracing without the need for detailed lens specifications. Neural networks have demonstrated remarkable capabilities in learning complex nonlinear functions, making them a promising candidate for modeling the ray transfer function [2].
Background
Dataset Generation using Zemax
The lens is reversed in Zemax to enable tracing rays from the sensor toward the object space. A dataset of rays is generated using a Zemax macro that samples the pupil at various field heights. The macro ensures uniform sampling in the polar coordinate system by increasing the number of angular samples linearly with radial distance. The ray-pass function, defined on the input and output planes, determines whether an input ray will pass through the lens or be vignetted. The function is based on the shape of the pupil at different field heights and can be represented using ellipses or circle intersections.
Additionally, to simplify data collection and reduce the amount of data needed, we exploit the rotational symmetry of most lenses. We sample rays radially along one axis and then rotate these rays around the optical axis to create a complete dataset; this means we only need to store data for rays with an x-coordinate of 0. All in all, the Zemax macro traces the sampled rays through the reversed lens and records the input and output ray parameters. The output of the macro is a text file that contains the position and unit direction vectors of the input and output rays for each sample. Any input ray that fails to exit the lens is marked with 'NaN' values in the output, which are discarded before training. Refer to this example Zemax macro script for generating training data for a wide-angle lens: https://github.com/ISET/mlp-tiny/blob/20241117/interface%20scripts/zemax_opticalsystemdata.zpl. A screenshot of the Zemax macro script is shown below.
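The rotational-symmetry expansion described above can be sketched in a few lines of NumPy. The `[x, y, z, dx, dy, dz]` column order and the function names here are illustrative assumptions, not the macro's actual output format:

```python
import numpy as np

def rotate_rays(rays, theta):
    """Rotate rays (sampled on the x = 0 meridian) around the optical (z) axis.

    rays:  (N, 6) array of [x, y, z, dx, dy, dz] (assumed column order).
    theta: rotation angle in radians.
    """
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    out = rays.copy()
    out[:, 0:2] = rays[:, 0:2] @ R.T   # rotate (x, y) position
    out[:, 3:5] = rays[:, 3:5] @ R.T   # rotate (dx, dy) direction
    return out

def expand_meridian_samples(meridian_rays, n_angles):
    """Replicate the stored meridian samples around the optical axis."""
    angles = np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False)
    return np.concatenate([rotate_rays(meridian_rays, t) for t in angles])
```

The z-coordinates and dz components are untouched because rotation about the optical axis leaves them invariant, which is exactly why only the meridian samples need to be stored.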

RTM vs. RTF
The RTF extends the classic ABCD ray-transfer matrices (RTM) used for paraxial calculations. There are several advantages of using RTFs over RTMs.
- While RTMs are suitable for paraxial ray tracing, they are limited in modeling aberrations and other non-linear optical phenomena. RTFs, with their polynomial or neural network representation, are better at handling these complexities.
- RTFs can account for 3D effects like depth and occlusion, which are crucial for accurately simulating real-world scenes. Also, we can integrate the RTF into ray tracing software like PBRT to simulate complex 3D spectral scenes.
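For contrast, a paraxial RTM calculation is just a matrix product. The sketch below traces a collimated ray through a thin lens followed by free-space propagation; the focal length and distance are illustrative values, not taken from any lens in this project:

```python
import numpy as np

# Paraxial ABCD ray-transfer matrices: a thin lens of focal length f,
# followed by free-space propagation over distance d.
f, d = 50.0, 100.0
thin_lens = np.array([[1.0, 0.0], [-1.0 / f, 1.0]])
propagate = np.array([[1.0, d], [0.0, 1.0]])
system = propagate @ thin_lens      # lens acts first, then propagation

# A paraxial ray is (height y, angle u); the RTM acts linearly on it.
ray_in = np.array([10.0, 0.0])      # collimated ray at height 10
ray_out = system @ ray_in
```

The linearity that makes this so compact is also the limitation: an RTM can never capture the aberrations and vignetting that the RTF models.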
Polynomial Estimation of Ray-Transfer Functions
In a general form, each input ray can be represented by six variables, $(x, y, z, d_x, d_y, d_z)$, i.e., its position and unit direction vector. Similarly, each output ray is represented as $(x', y', z', d_x', d_y', d_z')$. The polynomial-based approach aims to model this mapping between input and output rays using a set of polynomial equations:
$$\mathbf{x}' = \Phi(\mathbf{x}, \mathbf{y}, \mathbf{z}, \mathbf{d}_x, \mathbf{d}_y, \mathbf{d}_z)\,\mathbf{c}_x,$$
where $\mathbf{x}'$ is a column vector storing the $x'$ positions of all rays, $\Phi$ is the matrix of multivariate monomials of the input variables up to degree $d$, and $\mathbf{c}_x$ holds the fitted coefficients. The overall polynomial is multivariate, but the logic is the same for $\mathbf{y}'$, $\mathbf{z}'$, $\mathbf{d}_x'$, $\mathbf{d}_y'$, and $\mathbf{d}_z'$. The user must specify the order of the polynomial (denoted by $d$); as reported in [1], the fitting error varies with the polynomial degree (see the figure below).
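A minimal version of such a polynomial fit can be written with NumPy least squares. The monomial feature construction and the synthetic data below are illustrative sketches, not the exact procedure of [1]:

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(X, degree):
    """Multivariate monomial features up to `degree` for input rays X of shape (N, 6)."""
    cols = [np.ones(len(X))]                      # constant term
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(X.shape[1]), deg):
            cols.append(np.prod(X[:, idx], axis=1))
    return np.stack(cols, axis=1)

# Fit one polynomial per output coordinate by least squares (sketch with
# synthetic stand-in data; real data would come from the Zemax macro).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))             # stand-in input rays
y = X[:, 0] + 0.1 * X[:, 1] ** 2          # stand-in output x' coordinate
A = poly_features(X, degree=2)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
```

One such linear system is solved per output variable; increasing the degree grows the feature matrix combinatorially, which is why the fitting error in [1] depends so strongly on the chosen order.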
Methods
Network Architecture
As a starting point, we are provided with the following GitHub repo: https://github.com/ISET/mlp-tiny, where a multi-layer perceptron (MLP) was implemented but the complete training/testing code was missing. We implemented the complete training/testing code and put all free parameters in a configuration file: https://github.com/ISET/mlp-tiny/blob/ray_tracing/mlp%20scripts/train_prediction.sh
The provided MLP has 16 fully connected layers (each followed by batch normalization and a ReLU activation function), as shown in the following figure.
However, one issue we found (see the next section) was that the network struggled to predict values spanning very different ranges, i.e., the coordinates can have significantly different maximums and minimums. To address the issue, we used positional encoding [3], which remapped the coordinates into $[-1, 1]$ with a series of sinusoid functions at different frequencies, e.g., $\gamma(p) = \big(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\big)$.
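A minimal sketch of this encoding, assuming the NeRF-style frequency schedule; the number of frequencies is a free parameter:

```python
import math
import torch

def positional_encoding(x, num_freqs=6):
    """NeRF-style positional encoding: map each coordinate (assumed to be
    pre-scaled to roughly [-1, 1]) to sin/cos features at exponentially
    increasing frequencies. x: tensor of shape (N, D)."""
    freqs = (2.0 ** torch.arange(num_freqs, dtype=torch.float32)) * math.pi
    scaled = x.unsqueeze(-1) * freqs                     # (N, D, num_freqs)
    enc = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)
    return enc.flatten(start_dim=1)                      # (N, D * 2 * num_freqs)
```

Each of the D raw coordinates expands into 2 × num_freqs bounded features, so the first linear layer never sees the mismatched raw ranges directly.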
We integrated the positional embedding technique into the MLP and the resulting architecture is shown below:
where we found that using separate branches to pre-process spatial and directional information, later merging them, was the most effective design. The modified architecture even has fewer trainable parameters than the provided MLP.
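The two-branch idea can be sketched as follows; the layer sizes and the 36-dimensional encoded inputs are illustrative, and the actual architecture lives in the linked repository:

```python
import torch
import torch.nn as nn

class TwoBranchMLP(nn.Module):
    """Sketch of the two-branch design: spatial and directional encodings
    are pre-processed separately, then merged. Sizes are illustrative."""
    def __init__(self, enc_dim=36, hidden=128):
        super().__init__()
        self.spatial = nn.Sequential(nn.Linear(enc_dim, hidden), nn.ReLU())
        self.direction = nn.Sequential(nn.Linear(enc_dim, hidden), nn.ReLU())
        self.merge = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 6),   # output ray: position + direction
        )

    def forward(self, pos_enc, dir_enc):
        h = torch.cat([self.spatial(pos_enc), self.direction(dir_enc)], dim=-1)
        return self.merge(h)
```

Splitting the branches lets each use a narrower first layer than one monolithic input layer would need, which is one way the parameter count can come out lower than the provided MLP's.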
Loss Function and Training Details
We used binary cross-entropy loss (BCE, see https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html) for ray classification and mean squared error (MSE, see https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html) for ray tracing.
We set the hidden layer sizes in the MLP to be 128 units. The learning rate was set to 0.005 and warmed up over 40 epochs. A cosine decay schedule was employed to update the learning rate over training after the warm up. The Adam optimizer was used with a weight decay of 0.00001 [4]. We trained the MLP for 500 epochs with early stopping triggered if the training loss stopped decreasing. The dataset was partitioned into 60% training and 40% testing subsets.
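The warmup-plus-cosine schedule can be sketched as follows. The exact schedule shape is an assumption; only the base learning rate, warmup length, epoch count, and weight decay come from the text above:

```python
import math
import torch

def lr_at_epoch(epoch, base_lr=0.005, warmup=40, total=500):
    """Linear warmup to base_lr over `warmup` epochs, then cosine decay."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    progress = (epoch - warmup) / (total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(6, 6)  # stand-in for the actual MLP
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: lr_at_epoch(e) / 0.005)
```

Calling `scheduler.step()` once per epoch then walks the learning rate up to 0.005 over the first 40 epochs and cosine-decays it toward zero afterwards.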
Results
Ray Tracing
Visually, we found the training loss converged faster when using the "NeRF"-style MLP [5] with positional encoding, compared to the provided "LargerMLP". This observation was consistent across different wavelengths.
On the test dataset, the mean squared errors ranked the models as follows, from highest (worst) to lowest (best): PolyFit (d=2) > LargerMLP > PolyFit (d=5) > NeRF > PolyFit (d=8). That is, the NeRF-style model outperformed the LargerMLP and the second- and fifth-degree polynomial fits, but did not match the eighth-degree polynomial fit.
As mentioned in the previous section, due to the different scales of the coordinates (y ranges from -7 to 0 while x ranges from -0.2 to 0.2), the provided MLP (named "LargerMLP") does not predict positions accurately. The figure below shows an example where adding a positional encoding layer addresses the issue.
Ray Classification
Initially, we believed this to be a straightforward task, but we encountered challenges in pushing performance beyond a classification accuracy of 94%. We later realized that the difficulty stemmed from the imbalanced dataset: most rays pass through the lens, whereas only a minority do not. We therefore tried a weighted BCE loss (placing higher weights on the minority class) and resampled the dataset so that the classes were more evenly distributed. However, while these techniques altered the precision and recall metrics, the overall classification accuracy remained at approximately 94%. We then hypothesized that the multi-layer perceptron may have been overfitting to the majority class, and consequently shifted our focus to simpler machine learning methods, including support vector machines, AdaBoost, XGBoost, and logistic regression. Regrettably, these alternative models also yielded classification accuracies of around 94%.
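A weighted BCE setup like the one we tried can be sketched as follows. The labels, logits, and class ratio are illustrative; we use `BCEWithLogitsLoss` here because it exposes a `pos_weight` argument, whereas the training code linked above uses `BCELoss`:

```python
import torch
import torch.nn as nn

labels = torch.tensor([1.0, 1.0, 1.0, 0.0])   # 1 = ray passes (majority class)
logits = torch.tensor([2.0, 1.5, 0.8, -0.3])  # raw (pre-sigmoid) model outputs

# Down-weight the majority positive class relative to the minority
# negative class: pos_weight = n_negative / n_positive < 1.
pos_frac = labels.mean()
pos_weight = (1.0 - pos_frac) / pos_frac
weighted_loss = nn.BCEWithLogitsLoss(pos_weight=pos_weight)(logits, labels)
```

Reweighting shifts the precision/recall trade-off but, as we observed, cannot by itself raise overall accuracy when the decision boundary is the limiting factor.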
We visualized the data distribution (x vs. y, dx vs. dy, dy vs. dz) using heat maps. As shown in the figures below, rays that successfully pass through the lens are shown in blue, while those that fail to do so are rendered in orange.
Remaining Issue: Imbalanced Dataset
Given the imbalanced dataset, the best way to address this issue would be to collect more data and create a more evenly distributed dataset. However, within the limited time available, we were unable to do so. Nevertheless, we believe a simple method (such as a kernel SVM) could work well, given the decision boundary shown in the previous section. We will continue to investigate this problem.
Conclusions
We implemented and evaluated various machine learning approaches to model the ray transfer function and to classify whether a ray successfully passes through the lens. All data were simulated using Zemax at different wavelengths and for different optical designs (Double Gauss and Wide Angle). For ray tracing, we found that incorporating positional encoding improved the performance of the neural network models. These models outperformed polynomial fits of second and fifth degree, but were unable to match the accuracy of the eighth-degree polynomial fit. For ray classification, our best model achieved an accuracy of approximately 94%, limited by the highly imbalanced data distribution. We will explore strategies to address this class imbalance and further improve classification performance in future work.
References
[1] Thomas Goossens, Zheng Lyu, Jamyuen Ko, Gordon C. Wan, Joyce Farrell, and Brian Wandell, "Ray-transfer functions for camera simulation of 3D scenes with hidden lens design," Opt. Express 30, 24031-24047 (2022).
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, "Deep Learning," MIT Press (2016), http://www.deeplearningbook.org.
[3] Position Embeddings. https://paperswithcode.com/methods/category/position-embeddings.
[4] Diederik P. Kingma, Jimmy Ba, "Adam: A Method for Stochastic Optimization," in ICLR, 2015.
[5] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng, "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis," in ECCV, 2020.