Model, View, and Projection Transforms
The Model Matrix
A model is defined by a set of vertices. The $X,Y,Z$ coordinates of these vertices are defined relative to the object’s center: that is, if a vertex is at $(0,0,0)$, it is at the center of the object.
We’d like to be able to move this model, and you just learned how to do so: compute $translation \times rotation \times scale$, and done. You apply this matrix to all your vertices at each frame and everything moves. Something that doesn’t move will be at the center of the world.
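As a concrete sketch of building such a matrix, here is how it could look with the popular glMatrix library (the library choice and the specific transform values are illustrative assumptions, not something prescribed by this text):

```js
import { mat4, vec3 } from 'gl-matrix';

// Model matrix = translation * rotation * scale.
const modelMatrix = mat4.create();                                    // start from identity
mat4.translate(modelMatrix, modelMatrix, vec3.fromValues(2, 0, 0));   // move 2 units along +X
mat4.rotateY(modelMatrix, modelMatrix, Math.PI / 4);                  // rotate 45° around Y
mat4.scale(modelMatrix, modelMatrix, vec3.fromValues(0.5, 0.5, 0.5)); // shrink to half size
// Applied to every vertex each frame, this takes the model from Model Space to World Space.
```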
Your vertices are now in World Space. We went from Model Space (all vertices defined relative to the center of the model) to World Space (all vertices defined relative to the center of the world). See the figure below:
The View Matrix
If you want to view a mountain from another angle, you can either move the camera… or move the mountain.
So initially your camera is at the origin of the World Space. In order to move the world, you simply introduce another matrix. Let’s say you want to move your camera $3$ units to the right ($+X$). This is equivalent to moving your whole world $3$ units to the left ($-X$).
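A minimal sketch of this equivalence, again assuming glMatrix: translating the world by $-3$ along $X$ produces the same view matrix as placing the camera at $+3$.

```js
import { mat4, vec3 } from 'gl-matrix';

// Option 1: move the whole world 3 units to the left.
const viewMatrix = mat4.create();
mat4.translate(viewMatrix, viewMatrix, vec3.fromValues(-3, 0, 0));

// Option 2 (equivalent): describe the camera itself and let lookAt build the inverse.
const eye    = vec3.fromValues(3, 0, 0);  // camera 3 units to the right
const center = vec3.fromValues(3, 0, -1); // looking down the negative z-axis
const up     = vec3.fromValues(0, 1, 0);
mat4.lookAt(viewMatrix, eye, center, up); // same result as Option 1
```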
We went from World Space (all vertices defined relative to the center of the world, as established in the previous section) to Camera Space (all vertices defined relative to the camera). The figure below shows how we go from model/object coordinates to world coordinates and finally to camera coordinates.
The Model-View Matrix
The Model-View matrix allows us to perform affine transformations in our scene. Affine is the mathematical name for transformations that do not change the structure of the object they are applied to: straight lines remain straight and parallel lines remain parallel. In our 3D world scene, such transformations are rotation, scaling, reflection, shearing, and translation. Let’s take a look at how the Model-View matrix is constructed.
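In code, combining the two is a single multiplication; a sketch with glMatrix, reusing the modelMatrix and viewMatrix from the earlier sketches:

```js
import { mat4 } from 'gl-matrix';

// modelView = view * model: Model Space -> World Space -> Camera Space in one matrix.
const modelViewMatrix = mat4.create();
mat4.multiply(modelViewMatrix, viewMatrix, modelMatrix);
```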
Spatial Encoding of the World
By default, when you render a scene, you are looking at it from the origin of the world in the negative direction of the z-axis. As shown in the following diagram, the z-axis is coming out of the screen:
Rotation Matrix
The intersection of the first three rows with the first three columns defines the $3\times 3$ Rotation matrix. This matrix contains information about rotations around the standard axes.
$$ \begin{aligned} \begin{bmatrix} m_1 & m_2 & m_3 \\ m_5 & m_6 & m_7 \\ m_9 & m_{10} & m_{11} \end{bmatrix} \end{aligned} $$
Translation Vector
The intersection of the first three rows with the last column defines a three-component Translation vector.
$$ \begin{aligned} \begin{bmatrix} m_{13} \\ m_{14} \\ m_{15} \end{bmatrix} \end{aligned} $$
The Mysterious Fourth Row
The fourth row does not have any special meaning:
- The $m_4$, $m_8$, and $m_{12}$ elements are always $0$.
- The $m_{16}$ element (the Homogeneous coordinate) will always be $1$.
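To make this layout concrete, the following sketch (plain JavaScript; the zero-based indices follow WebGL’s column-major storage of $m_1$–$m_{16}$) reads those components back out of the Model-View matrix from the earlier sketch:

```js
const m = modelViewMatrix; // a Float32Array of 16 floats, column-major

// Translation vector: m13, m14, m15 (zero-based indices 12, 13, 14).
const translation = [m[12], m[13], m[14]];

// Fourth row: m4, m8, m12 are always 0 and m16 is always 1 for affine transforms.
console.assert(m[3] === 0 && m[7] === 0 && m[11] === 0 && m[15] === 1);
```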
The Projection Matrix
Projection matrices are specialized $4\times 4$ matrices designed to transform a 3D point in camera space into its projected counterpart on the canvas. Essentially, when you multiply a 3D point by a projection matrix, you determine its 2D coordinates on the canvas within NDC (Normalized Device Coordinates) space (we’ll see what these are later). Points in NDC space fall within the range $[-1, 1]$.
It’s crucial to remember that projection matrices are intended for transforming vertices or 3D points, not vectors. The workaround involves treating points as $1\times 4$ vectors, enabling their multiplication by a $4\times 4$ matrix. The result is another $1\times 4$ vector: a 4D point with homogeneous coordinates. These coordinates are only directly usable as a 3D point when the fourth component is $1$, allowing the first three components to represent a standard 3D Cartesian point.
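As an illustrative sketch (glMatrix assumed, with projectionMatrix built as in the sections below), this is what treating a point as a four-component vector looks like:

```js
import { vec4 } from 'gl-matrix';

// Append w = 1 to make a 3D point homogeneous (a direction vector would use w = 0).
const point = vec4.fromValues(1, 2, -5, 1);

// The result is a 4D point whose w component is generally no longer 1.
const clip = vec4.create();
vec4.transformMat4(clip, point, projectionMatrix);
```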
This operation determines how much of the view space will be rendered and how it will be mapped onto the computer screen. This region is known as the frustum and it is defined by six planes (near, far, top, bottom, right, and left planes), as shown in the following diagram:
These six planes are encoded in the Projection matrix. Any vertices lying outside the frustum after applying the transformation are clipped out and discarded from further processing. Therefore, the frustum defines clipping coordinates, and the Projection matrix that encodes the frustum produces clipping coordinates.
If the far and near planes have the same dimensions, the frustum will then determine an orthographic projection. Otherwise, it will be a perspective projection.
We went from Camera Space (all vertices defined relative to the camera) to Homogeneous Space (all vertices defined in a small cube; everything inside the cube is onscreen).
Before projection, we’ve got our blue objects, in Camera Space, and the red shape represents the frustum of the camera: the part of the scene that the camera is actually able to see.
Multiplying everything by the Projection Matrix has the following effect:
Perspective or Orthographic Projection
A perspective projection assigns more space to details that are closer to the camera than details that are farther away. In other words, the geometry that is close to the camera will appear larger than the geometry that is farther from it.
In contrast, an orthographic projection uses parallel projection lines; this means that geometry will appear to be the same size, regardless of its distance from the camera.
Perspective Matrix
The Projection matrix determines the field of view (FOV) of the camera, which is how much of the 3D space will be captured by the camera. It is a measure given in degrees, and the term is used interchangeably with the term angle of view.
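A minimal perspective setup with glMatrix might look as follows (the 45° FOV and the clip-plane distances are arbitrary example values; note that mat4.perspective expects the FOV in radians):

```js
import { mat4 } from 'gl-matrix';

const projectionMatrix = mat4.create();
const fovy   = Math.PI / 4;                  // 45° vertical field of view, in radians
const aspect = canvas.width / canvas.height; // assumes an existing canvas element
mat4.perspective(projectionMatrix, fovy, aspect, 0.1, 100.0); // near = 0.1, far = 100
```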
Orthographic Matrix
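The orthographic matrix encodes a box-shaped frustum whose near and far planes have the same dimensions, as described above. A sketch with glMatrix (the six plane values are arbitrary examples):

```js
import { mat4 } from 'gl-matrix';

// left, right, bottom, top, near, far: the six planes of the box-shaped frustum.
const orthoMatrix = mat4.create();
mat4.ortho(orthoMatrix, -2, 2, -2, 2, 0.1, 100.0);
```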
Clipping
Up to this point, we are still working with Homogeneous coordinates. Projection matrices actually transform points from the camera space to the homogeneous clip space, not to NDC (Normalized Device Coordinate) space.
Because WebGL doesn’t know anything about our coordinate spaces, it requires that, once all of the transformations are done, vertices be expressed in normalized device coordinates. Normalized device coordinates are obtained by dividing the clipping coordinates by the $w$ component, which is why this step is known as perspective division. In NDC space, the $x$ and $y$ coordinates represent the location of your vertices on a normalized 2D screen, while the $z$-coordinate encodes depth information: the relative location of the objects with respect to the near and far planes.
Basically, the homogeneous coordinates have four components: $x$, $y$, $z$, and $w$. The clipping is done by comparing the $x$, $y$, and $z$ components against the Homogeneous coordinate, $w$. If any of them is greater than $+w$ or less than $-w$, then that vertex lies outside the frustum and is discarded.
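Conceptually, the clip test and the subsequent perspective division look roughly like this in plain JavaScript (a sketch only; WebGL performs both steps internally):

```js
// clip: [x, y, z, w], a vertex already multiplied by the projection matrix.
function clipTestAndDivide([x, y, z, w]) {
  // Clipping: compare each component against the homogeneous coordinate w.
  const inside = -w <= x && x <= w &&
                 -w <= y && y <= w &&
                 -w <= z && z <= w;
  if (!inside) return null; // outside the frustum: discarded

  // Perspective division: dividing by w yields normalized device coordinates.
  return [x / w, y / w, z / w];
}
```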
After perspective division, the coordinates range from $-1$ to $+1$ on each axis, regardless of the shape or size of the actual screen. The bottom-left corner will be at $(-1, -1)$, and the top-right corner will be at $(1, 1)$. WebGL will then map these coordinates onto the viewport that was configured with gl.viewport.
Recap
The following diagram shows the theory we have learned so far, along with the relationships between the steps in the theory and the implementation in WebGL.
The five transformations that we apply to object coordinates to obtain viewport coordinates are:
- The Model-View matrix, which groups the model and view transforms into one single matrix. When we multiply our vertices by this matrix, we end up in camera space with homogeneous coordinates.
- The Projection matrix: after applying it, we end up in homogeneous clip space.
- Clipping: leaves out all vertices outside of the range $[-w, w]$. This keeps us in clip space.
- Perspective Division: after we divide by $w$, our coordinates are in NDC space.
- GL Viewport: an internal transform that moves us to raster space.
An extra transformation matrix is defined specifically for the normals. This is the Normal matrix, which is obtained by inverting and transposing the Model-View matrix. This matrix is applied to normal vectors to ensure that they remain perpendicular to the surface after the transformation.
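With glMatrix this can be a single call, or spelled out as an explicit invert-then-transpose; a sketch reusing the modelViewMatrix from earlier:

```js
import { mat3, mat4 } from 'gl-matrix';

// Normal matrix as a 3x3: transpose(inverse(modelView)).
const normalMatrix = mat3.create();
mat3.normalFromMat4(normalMatrix, modelViewMatrix);

// The same thing, step by step, as a 4x4:
const n = mat4.create();
mat4.invert(n, modelViewMatrix);
mat4.transpose(n, n);
```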