Face Alignment

Most modern face recognition methods work on image crops that contain just the face of a person, like those we can create with face detection. Feeding in crops produced solely from the bounding boxes we get from face detection works to some extent. We can improve accuracy considerably, though, if we align the faces first. In real-world scenarios, people may have their heads turned slightly, and the photo may be taken from a different perspective, for example from above or below. The face is then shifted in perspective.

Just like other AI methods, face recognition works best if we normalize the input data first. The result then depends more on the differences and similarities we actually care about, namely which person is shown, and less on incidental differences that normalization reduces. A simple approach is to rotate the face so that the two eyes lie on a horizontal line. To do this, we need to know the position of the eyes in addition to the rectangles in which the faces lie. The task of finding such defined features in a face is called Facial Landmark Detection.
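To illustrate this simple rotation-based normalization, here is a minimal C# sketch using the SixLabors.ImageSharp library. It assumes the two eye centers are already known, e.g. from a landmark detector; the coordinates and file names are purely illustrative.

    using SixLabors.ImageSharp;
    using SixLabors.ImageSharp.Processing;
    
    // Illustrative eye positions in pixel coordinates (y grows downwards).
    var leftEye = (X: 120.0, Y: 148.0);
    var rightEye = (X: 186.0, Y: 140.0);
    
    // Angle of the line through both eyes relative to the horizontal axis.
    var angle = Math.Atan2(rightEye.Y - leftEye.Y, rightEye.X - leftEye.X);
    var degrees = (float)(angle * 180.0 / Math.PI);
    
    using var face = Image.Load("face_crop.jpg");
    // Counter-rotate so the eyes end up on a horizontal line. Depending on a
    // library's rotation direction convention, the sign may need to be flipped.
    face.Mutate(x => x.Rotate(-degrees));
    face.Save("face_rotated.jpg");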

Nowadays, there are several frameworks for Facial Landmark Detection. Simpler ones approximate the position of five points: both eyes, both corners of the mouth, and the tip of the nose. The dlib approach, which locates 68 more detailed points, is also common. Modern models can even produce a three-dimensional point cloud that reflects the face. Such information is used, for example, in face-swapping apps or in masking filters on Instagram, Snapchat and co.

For the simple approach of just rotating the face, the five points are sufficient. With this kind of normalization, however, we would disregard perspective. Modern face recognition methods often define fixed target points where they expect the eyes, the corners of the mouth, and the nose in their input data. If we know the actual positions of these landmarks in the image of the face, we can estimate an affine transformation that moves them as close as possible to the target points. An affine transformation can be expressed as a matrix and can include, for example, shear, scale, and rotation. We obtain the matrix by solving a system of equations built from the source and target points. Applying this transformation to the face image then yields the aligned image.

Figure: Expected eye, mouth and nose positions for ArcFace
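To make the equation-solving step concrete, here is a minimal C# sketch. With exactly three point correspondences (e.g. both eyes and the tip of the nose), the six coefficients of the 2x3 affine matrix are determined exactly, so Cramer's rule on two 3x3 systems suffices; real alignment code typically uses all five landmarks and solves a least-squares problem instead. The coordinates below are illustrative; the target values are rounded versions of reference positions commonly used with ArcFace.

    // Illustrative source points: eyes and nose tip as detected in the image.
    var src = new[] { (X: 112.0, Y: 131.0), (X: 189.0, Y: 127.0), (X: 152.0, Y: 174.0) };
    // Rounded ArcFace-style target positions for a 112x112 input.
    var dst = new[] { (X: 38.3, Y: 51.7), (X: 73.5, Y: 51.5), (X: 56.0, Y: 71.7) };
    var affine = EstimateAffine(src, dst); // 2x3 matrix [a b c; d e f]
    
    // Maps three source points exactly onto three target points. Each target
    // coordinate yields one linear system whose rows are (x_i, y_i, 1).
    static double[,] EstimateAffine((double X, double Y)[] src, (double X, double Y)[] dst)
    {
        var A = new double[3, 3];
        for (var i = 0; i < 3; i++) { A[i, 0] = src[i].X; A[i, 1] = src[i].Y; A[i, 2] = 1; }
    
        static double Det(double[,] q) =>
              q[0, 0] * (q[1, 1] * q[2, 2] - q[1, 2] * q[2, 1])
            - q[0, 1] * (q[1, 0] * q[2, 2] - q[1, 2] * q[2, 0])
            + q[0, 2] * (q[1, 0] * q[2, 1] - q[1, 1] * q[2, 0]);
    
        var detA = Det(A);
    
        // Cramer's rule: replace one column with the right-hand side at a time.
        double[] Solve(double[] rhs)
        {
            var x = new double[3];
            for (var col = 0; col < 3; col++)
            {
                var mod = (double[,])A.Clone();
                for (var row = 0; row < 3; row++) mod[row, col] = rhs[row];
                x[col] = Det(mod) / detA;
            }
            return x;
        }
    
        var abc = Solve(new[] { dst[0].X, dst[1].X, dst[2].X }); // maps to target x
        var def = Solve(new[] { dst[0].Y, dst[1].Y, dst[2].Y }); // maps to target y
        return new[,] { { abc[0], abc[1], abc[2] }, { def[0], def[1], def[2] } };
    }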

We have multiple options for obtaining these landmark points. There are specialized AI models that focus on this task, like Google's Face Mesh. A simpler and often more efficient option is to use a face detection model that not only finds faces but also determines the landmark points in the same step. Popular models that achieve this include SCRFD and MTCNN.

Live Demo: Face Alignment

(Interactive demo: choose an image that contains one or more human faces. The demo shows the determined landmarks, the expected landmark positions, and the aligned result.)

As you might have noticed, the aligned images are not simply transformed versions of the crops we obtained by cutting the rectangular face regions out of the input image. Rather, the aligned version sometimes uses surrounding pixels from the input image that lie outside the cropped rectangle. Because we know the facial landmark points in the original input image's coordinates, we can estimate the affine transform for a given face with respect to the whole input image. Since the transform then includes a translation, it does not matter where the face is located in the input image. Thus, we can use pixels from the input image that the simpler cropped version would have discarded.
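As a sketch of this whole-image variant, the following C# snippet applies an already estimated affine matrix (e.g. from the sketch above) to the full input image using SixLabors.ImageSharp; the coefficient values and file names are illustrative placeholders.

    using System.Numerics;
    using SixLabors.ImageSharp;
    using SixLabors.ImageSharp.Processing;
    
    // Illustrative coefficients of an estimated 2x3 affine matrix [a b c; d e f].
    double a = 0.62, b = 0.03, c = -41.0, d = -0.03, e = 0.62, f = -58.0;
    
    // System.Numerics.Matrix3x2 uses row vectors: x' = x*M11 + y*M21 + M31, etc.
    var matrix = new Matrix3x2((float)a, (float)d, (float)b, (float)e, (float)c, (float)f);
    var builder = new AffineTransformBuilder().AppendMatrix(matrix);
    
    using var img = Image.Load("input.jpg");
    // Transform the whole input image, so pixels outside the detected face
    // rectangle can contribute to the aligned result.
    img.Mutate(x => x.Transform(builder));
    // Cut out the fixed-size region the recognition model expects; 112x112
    // pixels is the input size ArcFace-style models typically use. This
    // assumes the transformed image covers that region.
    img.Mutate(x => x.Crop(new Rectangle(0, 0, 112, 112)));
    img.Save("aligned.jpg");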

Usage examples

Beyond alignment for face recognition, there are more use cases for algorithms that find facial landmarks. For example:

  • Face-modifying filters for social media, see e.g. Google MediaPipe Face Mesh
  • Retouching skin blemishes on one photo and automatically transferring the edits to other photos of the same person
  • Virtually trying on glasses

Try this yourself

If you're writing your own application, you can use the exact same face detection, facial landmark detection, and face alignment mechanisms that power the demo above from your own code. The following example uses C# as the programming language and leverages the FaceAiSharp library.

  1. Install a recent version of .NET.
  2. Create a new console app project by running this command in your favorite shell in an empty folder:
    
    dotnet new console
  3. Install two packages providing the relevant code and models:
    
    dotnet add package Microsoft.ML.OnnxRuntime
    dotnet add package FaceAiSharp.Bundle
  4. Replace the content of the Program.cs file with the following code:
    
    using FaceAiSharp;
    using SixLabors.ImageSharp;
    using SixLabors.ImageSharp.PixelFormats;
    
    // Download a sample photo that contains multiple faces.
    using var hc = new HttpClient();
    var groupPhoto = await hc.GetByteArrayAsync(
        "https://raw.githubusercontent.com/georg-jung/FaceAiSharp/master/examples/obama_family.jpg");
    var img = Image.Load<Rgb24>(groupPhoto);
    
    // The detector locates faces and their landmark points in one step.
    var det = FaceAiSharpBundleFactory.CreateFaceDetectorWithLandmarks();
    var rec = FaceAiSharpBundleFactory.CreateFaceEmbeddingsGenerator();
    
    // Align the first detected face in place using its landmarks.
    var face = det.DetectFaces(img).First();
    rec.AlignFaceUsingLandmarks(img, face.Landmarks!);
    img.Save("aligned.jpg");
  5. Run the program you just created:
    
    dotnet run
    You will now find an aligned.jpg file next to the Program.cs file; it contains an aligned version of one of the faces in the photo.