Face Datasets

February 14, 2021, Datagen

Most people can recognize about 5,000 faces, and it takes a human 0.2 seconds to recognize a specific one. We also interpret facial expressions and detect emotions automatically. In other words, we’re naturally good at facial recognition and analysis. In recent years, however, Computer Vision (CV) has been catching up and, in some cases, outperforming humans in facial recognition. Advances in CV and Machine Learning have produced systems that can handle some of these tasks more efficiently and accurately than humans.

The applications of this technology are wide-ranging and exciting. One example is in marketing and retail. Saks Fifth Avenue uses facial recognition technology in its stores not only to check against criminal databases and prevent theft, but also to identify which displays attract attention and to analyze in-store traffic patterns. 7-Eleven in Japan has started trials of payments enabled by facial recognition. Amazon Go is another example of facial recognition and detection tools deployed in a retail context.

Powering all these advances are numerous large datasets of faces, with different features and focuses. Sifting through the datasets to find the best fit for a given project can take time and effort.

To help teams find the best datasets for their needs, we provide a quick guide to some popular, high-quality public datasets focused on human faces. For each one, we list key characteristics along with strengths and weaknesses.

 


 

CelebFaces Attributes Dataset (CelebA)

  • Affiliation – The Chinese University of Hong Kong
  • Publication – International Conference on Computer Vision (ICCV)
  • Released – 2015 
  • Description – CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. 
  • Main Use – 2D face recognition
  • Face Images  – 202,599
  • Identities – 10,177
  • Annotations – 5 face landmark locations and 40 binary attributes. The binary attributes are a list of facial features and characteristics, for example Bushy Eyebrows, Mustache, Gray Hair, Pointy Nose, Wearing Necklace, and Heavy Makeup, among others.
  • Data Gathering Method – Celebrity Images from the Internet. 
  • Licensing – The CelebA dataset is available for non-commercial research purposes only.
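
To make the idea of binary attributes concrete, here is a minimal Python sketch of how one image's attribute flags could be represented and queried. The attribute names are the examples listed above; the `AttributeRecord` class, the `filter_by_attribute` helper, the file names, and the sample values are all hypothetical and are not meant to mirror the dataset's actual file format.

```python
from dataclasses import dataclass, field

# Six of the 40 attribute names, taken from the examples in this guide.
ATTRIBUTE_NAMES = [
    "Bushy Eyebrows", "Mustache", "Gray Hair",
    "Pointy Nose", "Wearing Necklace", "Heavy Makeup",
]


@dataclass
class AttributeRecord:
    """Hypothetical in-memory record of one image's binary attributes."""
    image_name: str
    flags: dict = field(default_factory=dict)  # attribute name -> bool

    def has(self, attribute: str) -> bool:
        # Attributes that were never set are treated as absent.
        return self.flags.get(attribute, False)


def filter_by_attribute(records, attribute):
    """Return the names of images whose given attribute is set."""
    if attribute not in ATTRIBUTE_NAMES:
        raise ValueError(f"Unknown attribute: {attribute}")
    return [r.image_name for r in records if r.has(attribute)]


# Made-up sample records, for illustration only.
records = [
    AttributeRecord("img_a.jpg", {"Mustache": True, "Gray Hair": True}),
    AttributeRecord("img_b.jpg", {"Heavy Makeup": True}),
    AttributeRecord("img_c.jpg", {"Mustache": True}),
]

mustache_images = filter_by_attribute(records, "Mustache")
print(mustache_images)
```

A representation like this makes it easy to select subsets of images for attribute-specific training or evaluation.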

 


 

VGG Face2

  • Affiliation – University of Oxford
  • Publication – IEEE International Conference on Automatic Face & Gesture Recognition
  • Released – 2018
  • Description – The dataset contains 3.31 million images with large variations in pose, age, illumination, ethnicity and professions.
  • Main Use – 2D face recognition
  • Face Images  – 3,310,000
  • Identities – 9,131
  • Annotations – Human-verified bounding boxes around faces and five face landmarks, similar to the CelebA dataset. In addition, pose (yaw, pitch and roll) and apparent age are estimated by pre-trained pose and age classifiers rather than taken from ground-truth annotations.
  • Data Gathering Method – Celebrity images from the Internet, downloaded from Google Image Search.
  • Notes – This dataset is well suited to training face recognition models to identify people in “noisy” images that contain many other objects.
  • Licensing – The VGG Face2 dataset is available for non-commercial research purposes only.

 

UMDFaces

  • Affiliation – University of Maryland
  • Publication – IEEE International Joint Conference on Biometrics
  • Released – 2017 
  • Description – UMDFaces has 367,888 annotated faces of 8,277 subjects. The authors discuss how a large dataset can be collected and annotated using human annotators and deep networks.
  • Main Use – 2D face recognition, video face recognition. 
  • Face Images  – 22,000 videos + 367,888 images
  • Identities – 8,277 in images + 3,100 in video
  • Annotations – Human-curated bounding boxes for faces, estimated pose (roll, pitch and yaw), locations of twenty-one key-points and gender information generated by a pre-trained neural network.
  • Data Gathering Method – Celebrity Images from the Internet. 
  • Notes – The authors “compare the quality of the dataset with other publicly available face datasets at similar scales.”
  • Licensing – The UMDFaces dataset is available for non-commercial research purposes only.

 

MS-Celeb-1M

  • Affiliation – Microsoft Research
  • Publication – European Conference on Computer Vision
  • Released – 2016 
  • Description – “This training dataset was prepared in two main steps. First, we select the top 100K entities from our one-million celebrity list in terms of their web appearance frequency. Then, we leverage popular search engines to provide approximately 100 images per celebrity.”
  • Main Use – 2D face recognition
  • Face Images  – 10,000,000
  • Identities – 100,000
  • Annotations –  Not Annotated.
  • Data Gathering Method – Celebrity Images from the Internet. 
  • Licensing – This dataset is released under the Open Data Commons Public Domain Dedication and License.
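
The “approximately 100 images per celebrity” figure can be checked against the image and identity counts listed in this guide, and the same ratio gives a rough sense of how many images each recognition dataset offers per person. Below is a minimal Python sketch that uses only the numbers quoted in this article; the `images_per_identity` helper is an illustrative name, not part of any dataset's tooling.

```python
# (face images, identities) as listed in this guide.
DATASETS = {
    "CelebA": (202_599, 10_177),
    "VGG Face2": (3_310_000, 9_131),
    "UMDFaces (still images)": (367_888, 8_277),
    "MS-Celeb-1M": (10_000_000, 100_000),
}


def images_per_identity(images: int, identities: int) -> float:
    """Average number of images available per identity."""
    return images / identities


ratios = {name: images_per_identity(*counts) for name, counts in DATASETS.items()}

# Print the datasets from the deepest to the shallowest per-identity coverage.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.1f} images per identity")
```

By these numbers, VGG Face2 averages a few hundred images per identity, while CelebA averages about 20, which matters when a model needs many views of the same person.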

 

YouTube Faces

  • Affiliation – Tel Aviv University
  • Publication – CVPR
  • Released – 2011 
  • Description – A database of face videos designed for studying unconstrained face recognition in videos. It contains 3,425 videos of 1,595 different people, all downloaded from YouTube.
  • Main Use – Video Face Recognition 
  • Face Videos  – 3,425 videos
  • Identities – 1,595
  • Annotations – Bounding boxes, plus the following descriptors: Local Binary Patterns (LBP), Center-Symmetric LBP (CSLBP), and Four-Patch LBP.
  • Data Gathering Method – Celebrity Images from the Internet. 
  • Licensing – YouTubeFacesDB is licensed under the MIT License, a simple and permissive license that enables commercial use, with conditions only requiring preservation of copyright and license notices.

PaSC

  • Affiliation – NIST, SAIC, Colorado State University, Notre Dame University
  • Publication – IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS)
  • Released – 2013
  • Description – The challenge includes 9,376 still images and 2,802 videos of 293 people. The images are balanced with respect to distance to the camera, alternative sensors, frontal versus not-frontal views, and different locations. Verification results are presented for public baseline algorithms and a commercial algorithm for three cases: comparing still images to still images, videos to videos, and still images to videos.
  • Main Use – Video Face Recognition 
  • Face Videos  – 2,802
  • Identities – 293
  • Annotations – PaSC was designed to be a dataset for benchmark testing, and as such was not annotated by the developers. 
  • Data Gathering Method – Manual Data Collection 
  • Licensing – This dataset is strictly licensed, so check the license terms before use.

iQIYI-VID

  • Affiliation – iQIYI, Inc.
  • Released – 2018
  • Description – iQIYI-VID is described by its authors as the largest video dataset for multi-modal person identification. It is composed of 600K video clips of 5,000 celebrities. These video clips are extracted from 400K hours of online videos of various types, ranging from movies, variety shows, and TV series to news broadcasting. All video clips pass through a careful human annotation process, and the error rate of labels is lower than 0.2%.
  • Main Use – Video Face Recognition 
  • Face Videos  – 600,000
  • Identities – 5,000
  • Annotations – iQIYI-VID was annotated using a two-step process. In the first stage, algorithms localized faces and identities. The second stage was a manual annotation process, repeated twice by different labelers to ensure accuracy. The resulting annotations cover face localization and identity labels.
  • Data Gathering Method – Celebrity Images from the Internet.  
  • Licensing – A simple and permissive license that enables commercial use, with conditions only requiring preservation of copyright and license notices.

 

Wider Face

  • Affiliation – The Chinese University of Hong Kong
  • Released – 2016
  • Description – The WIDER FACE dataset is described by its authors as 10 times larger than previously existing datasets. It contains rich annotations, including occlusions, poses, event categories, and face bounding boxes. Faces in the dataset are extremely challenging due to large variations in scale, pose and occlusion.
  • Main Use – Face Detection 
  • Images  – 32,203 images
  • Faces – 393,703 labeled faces
  • Annotations – Occlusion, pose, and event categories plus bounding boxes for all the recognizable faces. This dataset also labels faces that are occluded or need to be ignored due to low quality or resolution. Each annotation is labeled by one annotator and cross-checked by two different people.
  • Data Gathering Method – Internet search engines. 
  • Notes – Wider Face enables teams to focus on some inherent challenges of face detection: small scale, occlusion, and extreme poses. These factors are common in many real-world applications. For instance, faces captured by surveillance cameras in public spaces or at events are typically small, occluded, and in atypical poses.
  • Licensing – The Wider Face dataset is available for non-commercial research purposes only.

 

MALF

  • Affiliation – Baidu, NLPR
  • Publication – IEEE International Conference and Workshops on Automatic Face and Gesture Recognition 
  • Released – 2015
  • Description – MALF is the first face detection dataset that supports fine-grained evaluation.
  • Main Use – Face Detection 
  • Images – 5,250
  • Faces – 11,931
  • Annotations – Square bounding boxes; pose deformation level of yaw, pitch and roll (small, medium, large); facial attributes such as gender (female, male, unknown).
  • Data Gathering Method – Internet, specifically Flickr and Baidu. 
  • Notes – MALF could serve as a helpful face detection benchmark that offers deep and all-around diagnosis and improvement advice on evaluated algorithms.
  • Licensing – The MALF dataset is available for non-commercial research purposes only.

 

IMDB-Wiki

  • Affiliation – ETH Zurich
  • Publication – IEEE
  • Released – 2015
  • Description – “We crawled 0.5 million images of celebrities from IMDb and Wikipedia that we make public on this website. This is the largest public dataset for age prediction to date.”
  • Main Use – Face Attributes
  • Images  – 524,230
  • Identities – 100,000
  • Data Gathering Method – Celebrity Images from the Internet.  
  • Notes – This dataset was collected to train an age prediction model, which makes it a good fit for similar projects.
  • Licensing – This dataset is made available for academic research purposes only.
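
Choosing among these datasets usually comes down to filtering on main use and licensing. The sketch below encodes a few entries from this guide in a small Python catalog and filters them. The `DatasetEntry` class, the `commercial_use` flag (derived only from the licensing notes in this guide), and the `find_datasets` helper are illustrative assumptions, and license terms should always be verified at the source before use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    main_use: str
    commercial_use: bool  # as stated in this guide; verify before relying on it


CATALOG = [
    DatasetEntry("CelebA", "2D face recognition", False),
    DatasetEntry("VGG Face2", "2D face recognition", False),
    DatasetEntry("YouTubeFacesDB", "Video face recognition", True),
    DatasetEntry("iQIYI-VID", "Video face recognition", True),
    DatasetEntry("Wider Face", "Face detection", False),
    DatasetEntry("MALF", "Face detection", False),
    DatasetEntry("IMDB-Wiki", "Face attributes", False),
]


def find_datasets(catalog, main_use=None, commercial_only=False):
    """Return names of datasets matching the main use and licensing filter."""
    results = []
    for entry in catalog:
        # Compare main use case-insensitively; None means "any use".
        if main_use is not None and entry.main_use.lower() != main_use.lower():
            continue
        if commercial_only and not entry.commercial_use:
            continue
        results.append(entry.name)
    return results


video_commercial = find_datasets(
    CATALOG, main_use="video face recognition", commercial_only=True
)
print(video_commercial)
```

The same helper can list every face detection dataset by calling `find_datasets(CATALOG, main_use="face detection")`.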