m8ta
You are not authenticated, login. 

{1341} 
ref: 0
tags: image registration optimization camera calibration sewing machine
date: 07152016 05:04 gmt
revision:20
[19] [18] [17] [16] [15] [14] [head]


Recently I was tasked with converting from image coordinates to real world coordinates from stereoscopic cameras mounted to the endeffector of a robot. The end goal was to let the user (me!) click on points in the image, and have the robot record that position & ultimately move to it. The overall strategy is to get a set of points in both image and RW coordinates, then fit some sort of model to the measured data. I began by printing out a grid of (hopefully evenlyspaced and perpendicular) lines via a laserprinter; spacing was ~1.1 mm. This grid was manually aligned to the axes of robot motion by moving the robot along one axis & checking that the lines did not jog. The images were modeled as a grating with quadratic phase in $u,v$ texture coordinates: $p_h(u,v) = sin((a_h u/1000 + b_h v/1000 + c_h)v + d_h u + e_h v + f_h) + 0.97$ (1) $p_v(u,v) = sin((a_v u/1000 + b_v v/1000 + c_v)u + d_v u + e_v v + f_v) + 0.97$ (2) $I(u,v) = 16 p_h p_v / ( \sqrt{ 2 + 16 p_h^2 + 16 p_v^2})$ (3) The 1000 was used to make the parameter search distribution more spherical; $c_h,c_v$ were bias terms to seed the solver; 0.97 was a dutycycle term fit by inspection to the image data; (3) is a modified sigmoid. $I$ was then optimized over the parameters using a GPUaccelerated (CUDA) nonlinear stochastic optimization: $(a_h,b_h,d_h,e_h,f_h  a_v,b_v,d_v,e_v,f_v) = Argmin \sum_u \sum_v (I(u,v)  Img(u,v))^2$ (4) Optimization was carried out by drawing parameters from a normal distribution with a diagonal covariance matrix, set by inspection, and mean iteratively set to the best solution; horizontal and vertical optimization steps were separable and carried out independently. The equation (4) was sampled 18k times, and equation (3) 34 billion times per frame. Hence the need for GPU acceleration. This yielded a set of 10 parameters (again, $c_h$ and $c_v$ were bias terms and kept constant) which modeled the data (e.g. grid lines) for each of the two cameras. This process was repeated every 0.1 mm from 0  20 mm height (z) from the target grid, resulting in a sampled function for each of the parameters, e.g. $a_h(z)$ . This required 13 trillion evaluations of equation (3). Now, the task was to use this model to generate the forward and reverse transform from image to world coordinates; I approached this by generating a data set of the grid intersections in both image and world coordinates. To start this process, the known image origin $u_{origin}_{z=0},v_{origin}_{z=0}$ was used to find the corresponding roots of the periodic axillary functions $p_h,p_v$ : $\frac{3 \pi}{ 2} + 2 \pi n_h = a_h u v/1000 + b_h v^2/1000 + (c_h + e_h)v + d_h u + f_h$ (5) $\frac{3 \pi}{ 2} + 2 \pi n_h = a_v u^2/1000 + b_v u v/1000 + (c_v + d_v)u + e_v v + f_v$ (6) Or .. $n_h = round( (a_h u v/1000 + b_h v^2/1000 + (c_h + e_h)v + d_h u + f_h  \frac{3 \pi}{ 2} ) / (2 \pi )$ (7) $n_v = round( (a_v u^2/1000 + b_v u v/1000 + (c_v + d_v)u + e_v v + f_v  \frac{3 \pi}{ 2} ) / (2 \pi)$ (8) From this, we get variables $n_{h,origin}_{z=0} and n_{v,origin}_{z=0}$ which are the offsets to align the sine functions $p_h,p_v$ with the physical origin. Now, the reverse (world to image) transform was needed, for which a twostage newton scheme was used to solve equations (7) and (8) for $u,v$ . Note that this is an equation of phase, not image intensity  otherwise this direct method would not work! First, the equations were linearized with three steps of (911) to get in the right ballpark: $u_0 = 640, v_0 = 360$ $n_h = n_{h,origin}_{z} + [30 .. 30] , n_v = n_{v,origin}_{z} + [20 .. 20]$ (9) $B_i = {\left[ \begin{matrix} \frac{3 \pi}{ 2} + 2 \pi n_h  a_h u_i v_i / 1000  b_h v_i^2  f_h \\ \frac{3 \pi}{ 2} + 2 \pi n_v  a_v u_i^2 / 1000  b_v u_i v_i  f_v \end{matrix} \right]}$ (10) $A_i = {\left[ \begin{matrix} d_h && c_h + e_h \\ c_v + d_v && e_v \end{matrix} \right]}$ and ${\left[ \begin{matrix} u_{i+1} \\ v_{i+1} \end{matrix} \right]} = mldivide(A_i,B_i)$ (11) where mldivide is the Matlab operator. Then three steps with the full Jacobian were made to attain accuracy: $J_i = {\left[ \begin{matrix} a_h v_i / 1000 + d_h && a_h u_i / 1000 + 2 b_h v_i / 1000 + c_h + e_h \\ 2 a_v u_i / 1000 + b_v v_i / 1000 + c_v + d_v && b_v u_i / 1000 + e_v \end{matrix} \right]}$ (12) $K_i = {\left[ \begin{matrix} a_h u_i v_i/1000 + b_h v_i^2/1000 + (c_h+e_h) v_i + d_h u_i + f_h  \frac{3 \pi}{ 2}  2 \pi n_h \\ a_v u_i^2/1000 + b_v u_i v_i/1000 + (c_v+d_v) u_i + e_v v + f_v  \frac{3 \pi}{ 2}  2 \pi n_v \end{matrix} \right]}$ (13) ${\left[ \begin{matrix} u_{i+1} \\ v_{i+1} \end{matrix} \right]} = {\left[ \begin{matrix} u_i \\ v_i \end{matrix} \right]}  J^{1}_i K_i$ (14) Solutions $(u,v)$ were verified by plugging back into equations (7) and (8) & verifying $n_h, n_v$ were the same. Inconsistent solutions were discarded; solutions outside the image space $[0, 1280),[0, 720)$ were also discarded. The process (10)  (14) was repeated to tile the image space with gird intersections, as indicated in (9), and this was repeated for all $z$ in $(0 .. 0.1 .. 20)$ , resulting in a large (74k points) dataset of $(u,v,n_h,n_v,z)$ , which was converted to full realworld coordinates based on the measured spacing of the grid lines, $(u,v,x,y,z)$ . Between individual z steps, $n_{h,origin} n_{v,origin}$ was reestimated to minimize (for a current $z'$ ): $(u_{origin}_{z' + 0.1}  u_{origin}_{z' + 0.1})^2 + (v_{origin}_{z' + 0.1} + v_{origin}_{z'})^2$ (15) with gridsearch, and the method of equations (914). This was required as the stochastic method used to find original image model parameters was agnostic to phase, and so phase (via parameter $f_{}$ ) could jump between individual $z$ measurements (the origin did not move much between successive measurements, hence (15) fixed the jumps.) To this dataset, a model was fit: ${\left[ \begin{matrix} u \\ v \end{matrix} \right]} = A {\left[ \begin{matrix} 1 && x && y && z && x'^2 && y'^2 && \prime z'^2 && w^2 && x' y' && x' z' && y' z' && x' w && y' w && z' w \end{matrix} \right]}$ (16) Where $x' = \frac{x}{ 10}$ , $y' = \frac{y}{ 10}$ , $z' = \frac{z}{ 10}$ , and $w = \frac{ 20}{20  z}$ . $w$ was introduced as an axillary variable to assist in perspective mapping, ala computer graphics. Likewise, $x,y,z$ were scaled so the quadratic nonlinearity better matched the data. The model (16) was fit using regular linear regression over all rows of the validated dataset. This resulted in a second set of coefficients $A$ for a model of world coordinates to image coordinates; again, the model was inverted using Newton's method (Jacobian omitted here!). These coefficients, one set per camera, were then integrated into the C++ program for displaying video, and the inverse mapping (using closedform matrix inversion) was used to convert mouse clicks to realworld coordinates for robot motor control. Even with the relatively poor wideFOV cameras employed, the method is accurate to $\pm 50\mu m$ , and precise to $\pm 120\mu m$ . 