When using dlib/face_recognition, how is the 128D vector extracted from an aligned face? Is there a reference paper?

阿华AIGC实验室

2026-5-29

Understanding dlib's 128D Face Embedding: Reference Papers & Network Breakdown

Good question. Digging through dlib's source code to work out how the 128D face embedding is produced can be frustrating, so here is a breakdown:

Core Foundational Paper: The idea of mapping faces to a compact 128-dimensional space in which distance reflects identity comes from FaceNet: A Unified Embedding for Face Recognition and Clustering (Schroff et al., 2015). That paper popularized training with a triplet loss, which makes embeddings of the same person lie close together and embeddings of different people lie at least a margin farther apart. dlib follows the same metric-learning idea, although its own loss (loss_metric) works on pairs within a mini-batch rather than on FaceNet-style triplets.
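To make the triplet idea concrete, here is a minimal NumPy sketch of the FaceNet-style loss. The function name `triplet_loss` is made up for illustration, the margin of 0.2 is the value reported in the FaceNet paper, and averaging over the batch is my own choice. This is not dlib's training code.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss for batches of embeddings, shape (n, d).

    anchor/positive are embeddings of the same person,
    anchor/negative are embeddings of different people.
    """
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    # Squared Euclidean distances for each triplet in the batch.
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)
    # Hinge: a triplet contributes only while the negative is not yet
    # at least `margin` farther from the anchor than the positive.
    return float(np.mean(np.maximum(pos_dist - neg_dist + margin, 0.0)))
```

A triplet that already satisfies the margin contributes zero, so training effort concentrates on the hard cases.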
dlib's Specific Implementation: dlib's author, Davis King, used a ResNet backbone. According to the comments in dlib's face recognition example, the network is essentially ResNet-34 from Deep Residual Learning for Image Recognition (He et al.) with a few layers removed and the number of filters per layer halved. There is no standalone academic paper on dlib's face embedding model; King describes the design and training in dlib's documentation and in the dlib blog post "High Quality Face Recognition with Deep Metric Learning". For formal academic context, cite the FaceNet paper together with the ResNet paper, which explains the residual block structure the network is built from.
Quick Network Structure Breakdown:
- Input: An aligned 150x150 RGB face crop, produced by dlib's face detector followed by landmark-based alignment (the 5-point or 68-point shape predictor).
- Backbone: A stack of residual blocks, derived from ResNet-34 with fewer layers and half as many filters per layer, followed by global average pooling.
- Output: A 128-unit fully connected layer (no bias) that produces the descriptor. The network definition contains no explicit L2-normalization layer, so do not assume the vectors have unit length. Descriptors are compared directly by Euclidean distance; dlib's documentation uses a threshold of 0.6, below which two faces are treated as the same person.
- Training: The model is trained with dlib's loss_metric, a metric-learning loss that pulls embeddings of the same person together and pushes embeddings of different people apart. It operates on pairs within each mini-batch rather than on explicit FaceNet-style triplets. King reports training on roughly 3 million face images.
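The steps above map onto dlib's Python bindings roughly as follows. This is a sketch, not the only way to do it: `extract_descriptors` is a hypothetical helper, and the two model paths are assumed to point at the pretrained files distributed for dlib's examples (for instance shape_predictor_5_face_landmarks.dat and dlib_face_recognition_resnet_model_v1.dat), which you need to download yourself.

```python
def extract_descriptors(image_path, landmark_model_path, recognition_model_path):
    """Return one 128-element NumPy array per face detected in the image."""
    # Imported inside the function so the module loads even without dlib.
    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    shape_predictor = dlib.shape_predictor(landmark_model_path)
    face_encoder = dlib.face_recognition_model_v1(recognition_model_path)

    image = dlib.load_rgb_image(image_path)
    descriptors = []
    # The second argument upsamples the image once to find smaller faces.
    for rect in detector(image, 1):
        landmarks = shape_predictor(image, rect)
        # Given the landmarks, dlib aligns and crops the face to 150x150
        # internally before running the network.
        descriptor = face_encoder.compute_face_descriptor(image, landmarks)
        descriptors.append(np.array(descriptor))
    return descriptors
```

A call would look like `extract_descriptors("photo.jpg", "shape_predictor_5_face_landmarks.dat", "dlib_face_recognition_resnet_model_v1.dat")`, where photo.jpg is a placeholder file name.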

If you are still reading through dlib's source code, the network definition is in examples/dnn_face_recognition_ex.cpp, the loss is loss_metric in dlib/dnn/loss.h, and a training setup is shown in examples/dnn_metric_learning_on_images_ex.cpp. The face_recognition package is a thin Python wrapper around the same pretrained model (dlib_face_recognition_resnet_model_v1.dat), which is why its face_encodings function returns these 128D vectors directly.
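Once you have descriptors, comparing faces is plain Euclidean distance. The sketch below uses only NumPy. The helper names `descriptor_distances` and `same_person_mask` are made up for illustration, and the default of 0.6 is the threshold dlib's documentation suggests for this model; you may need a different value for your own data.

```python
import numpy as np

def descriptor_distances(known_descriptors, candidate):
    """Euclidean distance from `candidate` to each row of `known_descriptors`."""
    known = np.asarray(known_descriptors, dtype=float)
    if known.size == 0:
        return np.empty(0)
    return np.linalg.norm(known - np.asarray(candidate, dtype=float), axis=1)

def same_person_mask(known_descriptors, candidate, threshold=0.6):
    """Boolean array: True where the distance is below the threshold."""
    return descriptor_distances(known_descriptors, candidate) < threshold
```

The face_recognition package exposes the same idea through its face_distance and compare_faces functions, so in practice you rarely need to write this yourself.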

The question in this post originally comes from Stack Exchange, where it was asked by Jayhello.