The perspective projection matrix in Vulkan
(Edited on: )
If you have any questions or remarks, you can leave a comment here.
The perspective projection matrix is crucial in computer graphics to display 3d points on a screen.
In most of the computer graphics/opengl/vulkan tutorials online there is only a brief mention of the glm::perspective
function and its parameters, and quick “hacks” to make it work on Vulkan (Hello negative viewport and correction matrix).
In these notes I will try to explain the maths behind the perspective projection and give a matrix that works with Vulkan.
Don’t worry you can still follow if you don’t use Vulkan and adapt the formulas with your settings.
The projection matrix is a transformation of the camera (or eye) space into clip space. The clip space is a homogeneous space that is used to remove (or clip) primitives outside the viewport. After clipping, the hardware will perform a “perspective division” that will transform the clip coordinates into normalized device coordinates by dividing each component by the 4th component w.
Classic perspective with a near and far plane
Projecting onto the near plane
The first step consists of projecting points in eye space on the near plane of the frustum. Here is the eye space is a righthanded coordinate system and the camera is facing towards the Z axis. Let’s start by finding the values of \(x\) and \(y\), the depth will be derived later.
The frustum is a pyramid with the camera as its apex: it forms similar triangles on the XZ and YZ planes.
Using the Intercept theorem (also known as Thales’s theorem) it’s easy to find the value of \(x_p\) and \(y_p\).
The intercept theorem tells us that the ratio between a point projected on the near plane and the coordinate in eye space is the same: \[ \frac{x_p}{x_e} = \frac{y_p}{y_e} = \frac{z_p}{z_e} = \frac{n}{z_e} \]
\(z_p\) will always be \(n\) because we are projecting points on the near plane.
In the end we have:
\begin{aligned}
& x_p = \frac{n x_e}{z_e} = \frac{1}{z_e} n x_e\\
& y_p = \frac{n y_e}{z_e} = \frac{1}{z_e} n y_e\\
& z_p = n = \frac{1}{z_e} n z_e
\end{aligned}
One thing to note here is that \(x_p\) and \(y_p\) are inversely proportional to \(z_e\). But the result of a multiplication between a matrix and a vector is a linear combination of its components, it’s not possible to divide by a component. To solve this, the hardware performs what’s called a “perspective divide”. Every components of the clip coordinate gets divided by its 4th component (\(w_c\)), so we will set it to \(z_e\) to divide \(x_p\), \(y_p\) and \(z_p\).
We just derived \(w_c\) so we can fill one row in our projection matrix.
\[ w_c = 1 \times z_e \]
\begin{equation}
\begin{pmatrix}
. & . & . & .\\
. & . & . & .\\
. & . & . & .\\
0 & 0 & 1 & 0
\end{pmatrix}
\begin{pmatrix}
x_e \\
y_e \\
z_e \\
1
\end{pmatrix}=
\begin{pmatrix}
x_c \\
y_c \\
z_c \\
w_c
\end{pmatrix}
\end{equation}
Going from near plane to near clip plane
Now that we have expressed \(x_p\) and \(y_p\) in terms of \(x_e\) and \(y_e\), let’s try to express their normalized device coordinates \(x_n\) and \(y_n\). The normalized device coordinates are the clip coordinates divided by their 4th component:
\begin{aligned}
\begin{pmatrix}
x_n \\
y_n \\
z_n \\
w_n
\end{pmatrix}=
\begin{pmatrix}
\frac{x_c}{w_c} \\
\frac{y_c}{w_c} \\
\frac{z_c}{w_c} \\
\frac{w_c}{w_c}
\end{pmatrix}
\end{aligned}
The near plane of our frustum is defined by 4 corners \((l, t)\), \((r, t)\), \((r, b)\) and \((l, b)\). We want to match these with \((1, 1)\), \((1, 1)\), \((1, 1)\) and \((1, 1)\) respectively.
Note: If you have a different clip space, you will have to adjust the corners of the near clip plane! Vulkan is different from other graphics APIs and uses a downward Y axis, and uses the same clip depth as DirectX (0 to 1).
If you use a legacy OpenGL clip space, your clip space corners will be different: Y is upward (\(t = 1\), \(b = 1\))
The mapping of the near frustum plane to the near clip plane is a linear function of the form \(f(x) = \alpha x + \beta\). We can use the formula to find the slope of the function and then, by replacing the known values in the function, we can find the constant term:
Starting with the \(x\) coordinate:
\begin{aligned}
& f(l) = 1 \quad \text{and} \quad f( r) = 1\\
\Rightarrow \quad & \alpha = \frac{1  (1)}{r  l} = \frac{2}{rl} \\
\\
& f( r) = 1\\
\Rightarrow \quad & f( r) = 1 = \frac{2}{r  l} r + \beta \\
\Leftrightarrow \quad &\beta = 1  \frac{2r}{rl} = \frac{r+l}{rl}\\
\\
& f(x_p) = x_n\\
\Leftrightarrow \quad & x_n = \frac{2}{rl} x_p  \frac{r+l}{rl}
\end{aligned}
The same goes for finding \(y\):
\begin{aligned}
& f(t) = 1 \quad \text{and} \quad f(b) = 1 \\
\Rightarrow \quad & \alpha = \frac{1  (1)}{b  t} = \frac{2}{bt} \\
\\
& f(b) = 1\\
\Rightarrow \quad & f(b) = 1 = \frac{2}{b  t} b + \beta \\
\Leftrightarrow \quad &\beta = 1  \frac{2b}{bt} = \frac{b+t}{bt}\\
\\
& f(y_p) = y_n\\
\Leftrightarrow \quad & y_n = \frac{2}{bt} y_p  \frac{b+t}{bt}
\end{aligned}
Now we just have to replace \(x_p\) and \(y_p\) by the expressions we found earlier in terms of \(x_e\) and \(y_e\).
\begin{aligned}
x_n &= \frac{2}{rl} x_p  \frac{r+l}{rl}\\
&= \frac{2}{rl} \left(\frac{1}{z_e} n x_e\right)  \frac{r+l}{rl}\\
&= \frac{1}{z_e} \left( \frac{2n}{rl} x_e + \frac{r+l}{rl} z_e \right)\\
\\
y_n &= \frac{2}{bt} y_p  \frac{b+t}{bt}\\
&= \frac{2}{bt} \left(\frac{1}{z_e} n y_e\right)  \frac{b+t}{bt}\\
&= \frac{1}{z_e} \left( \frac{2n}{bt} y_e + \frac{b+t}{bt} z_e \right)
\end{aligned}
Remember that \(x_n = x_c / w_c\) and \(y_n = y_c / w_c\). By factoring by \(\frac{1}{z_e}\), we can read the coefficients for \(x_c\) and \(y_c\).
We now have:
\begin{equation}
\begin{pmatrix}
\frac{2n}{rl} & 0 & \frac{r+l}{rl} & 0\\
0 & \frac{2n}{bt} & \frac{b+t}{bt} & 0\\
. & . & . & .\\
0 & 0 & 1 & 0
\end{pmatrix}
\begin{pmatrix}
x_e \\
y_e \\
z_e \\
1
\end{pmatrix}=
\begin{pmatrix}
x_c \\
y_c \\
z_c \\
w_c
\end{pmatrix}
\end{equation}
Deriving the depth projection
Unfortunately, we cannot use the same method to find the coefficients for z, because z will always be on the near plane after projecting a point onto the near plane. We know that the \(z\) coordinate does not depend on \(x\) and \(y\), so let’s fill the remaining row with 0 for x and y, and A and B for the coefficients we need to find.
\begin{equation}
\begin{pmatrix}
\frac{2n}{rl} & 0 & \frac{r+l}{rl} & 0\\
0 & \frac{2n}{bt} & \frac{b+t}{bt} & 0\\
0 & 0 & A & B\\
0 & 0 & 1 & 0
\end{pmatrix}
\begin{pmatrix}
x_e \\
y_e \\
z_e \\
1
\end{pmatrix}=
\begin{pmatrix}
x_c \\
y_c \\
z_c \\
w_c
\end{pmatrix}
\end{equation}
By definition \(z_n\) is :
\[ z_n = \frac{z_c}{w_c} = \frac{A \times z_e + B \times w_e}{z_e} \]
\(w_e\) is always going to be 1.
\[ z_n = \frac{z_c}{w_c} = \frac{A \times z_e + B}{z_e} \]
To find A and B, we will replace \(z_n\) and \(z_e\) with known values: the near plane maps to \(1\) in NDC and the far plane maps to \(0\).
Note: The setup shown here is using \(1\) for the near plane and \(0\) for the far plane. It is called “Reverse Depth” and results in a better distribution of the floating points values than using \(1\) and \(1\) or \(0\) and \(1\) so I highly recommend to use it.
The default DirectX convention is to use \(0\) for near plane and \(1\) for far plane, legacy OpenGL uses \(1\) for near and \(1\) for far (this is the worst for floating point precision).
\begin{aligned}
& z_n = 1 \Rightarrow z_e = n\\
& z_n = 0 \Rightarrow z_e = f
\end{aligned}
\begin{aligned}
& \left\{
\begin{array}{lr}
\frac{A \times \left(n\right) + B}{(n)} = 1\\
\frac{A \times \left(f\right) + B}{(f)} = 0
\end{array}
\right.\\
\\
\Leftrightarrow \quad &
\left\{
\begin{array}{lr}
A \times (n) + B = n\\
A \times (f) + B = 0
\end{array}
\right.\\
\\
\Leftrightarrow \quad &
\left\{
\begin{array}{lr}
A \times (n) + Af = n\\
B = A f
\end{array}
\right.\\
\\
\Leftrightarrow \quad &
\left\{
\begin{array}{lr}
A = \frac{n}{fn}\\
B = \frac{nf}{fn}
\end{array}
\right.
\end{aligned}
Here is our final expression for \(z_n\):
\[ z_n = \frac{1}{z_e} \left({\frac{n}{fn} \times z_e + \frac{nf}{fn}}\right) \]
Our matrix is now complete!
\begin{equation}
\begin{pmatrix}
\frac{2n}{rl} & 0 & \frac{r+l}{rl} & 0\\
0 & \frac{2n}{bt} & \frac{b+t}{bt} & 0\\
0 & 0 & \frac{n}{fn} & \frac{nf}{fn}\\
0 & 0 & 1 & 0
\end{pmatrix}
\begin{pmatrix}
x_e \\
y_e \\
z_e \\
1
\end{pmatrix}=
\begin{pmatrix}
x_c \\
y_c \\
z_c \\
w_c
\end{pmatrix}
\end{equation}
Simplifying the matrix for the usual case
Most of the time we are working with symmetric frustum, that is \(l = r\) and \(b = t\).
The matrix becomes a bit simpler:
\begin{aligned}
& l = r \Rightarrow l + r = 0 \quad \text{and} \quad r  l = 2r = width\\
& b = t \Rightarrow b + t = 0 \quad \text{and} \quad b  t = 2t = height
\end{aligned}
\begin{equation}
\begin{pmatrix}
\frac{2n}{width} & 0 & 0 & 0\\
0 & \frac{2n}{height} & 0 & 0\\
0 & 0 & \frac{n}{fn} & \frac{nf}{fn}\\
0 & 0 & 1 & 0
\end{pmatrix}
\begin{pmatrix}
x_e \\
y_e \\
z_e \\
1
\end{pmatrix}=
\begin{pmatrix}
x_c \\
y_c \\
z_c \\
w_c
\end{pmatrix}
\end{equation}
It’s also easier to reason on the field of view and aspect ratios rather than on the width and the height of the near plane, so let’s replace them:
Let’s try to express \(\frac{2n}{width}\) and \(\frac{2n}{height}\) in terms of field of view and aspect ratio:
\begin{aligned}
& \tan\left(\frac{fov_y}{2}\right) = \frac{\frac{height}{2}}{n} \\
\Leftrightarrow \quad & 2n \tan\left(\frac{fov_y}{2}\right) = height \\
\Leftrightarrow \quad & \frac{2n}{height} = \frac{1}{\tan\left(\frac{fov_y}{2}\right)}
\end{aligned}
\begin{aligned}
\frac{2n}{width} & = \frac{2n}{width} \times \frac{height}{height} \\
& = \frac{2n}{height} \times \frac{height}{width} \\
& = \frac{2n}{height} \left(\frac{width}{height}\right)^{1} \\
& = \frac{1}{\tan\left(\frac{fov_y}{2}\right)} \left(\frac{width}{height}\right)^{1}
\end{aligned}
And finally when replacing in the matrix:
\begin{aligned}
& \text{focal length} = \frac{1}{\tan\left(\frac{fov_y}{2}\right)}\\
\\
& \text{aspect ratio} = \frac{width}{height}\\
\\
&
\begin{pmatrix}
\frac{\text{focal length}}{\text{aspect ratio}} & 0 & 0 & 0\\
0 & {\text{focal length}} & 0 & 0\\
0 & 0 & \frac{n}{fn} & \frac{nf}{fn}\\
0 & 0 & 1 & 0
\end{pmatrix}
\begin{pmatrix}
x_e \\
y_e \\
z_e \\
1
\end{pmatrix}=
\begin{pmatrix}
x_c \\
y_c \\
z_c \\
w_c
\end{pmatrix}
\end{aligned}
Implementation
Here is a possible implementation in C++. Note that the matrix is trivial to invert using your favourite method (I like to use rowreduction for this). So we can compute it at the same time as the perspective projection matrix.


A lot of people just come here to find a function to copy paste in their projects when things are not working for them, if it is not working for you I am going to try to give a few advices.
This particular implementation is using all the formulas explained in previous sections. As mentionned in the different notes, the matrix will change depending on your setup. You could also be using row vectors and leftmultiplication for vectors and matrix. In this case the correct matrix is the transpose of the one in this post. If you are indeed using row vectors, I advise you to take a deep breath and make sure you understand the differences with using row vectors vs column vectors.
If you are not using the exact same setup as I do (Vulkan, Reverse depth) you need to derive the formulas yourself, follow what I have done with your own known values.
Infinite perspective
Having to set scenespecific values for both the near and far plane can be a pain the ass. If you want to display large openworld scenes you will almost always use an absurdly high value for the far plane anyway.
With the increased precision of using a reverse depth buffer, it is possible to set an infinitely distant far plane.
\begin{aligned}
& \lim_{f \to +\infty} A = \frac{n}{fn} = 0 \\
& \lim_{f \to +\infty} B = n \times \frac{f}{fn} = n
\end{aligned}
We just have to replace A with \(0\) and B with \(n\)!
External references
 Reverse depth
 https://developer.nvidia.com/content/depthprecisionvisualized