Cascaded Iterative Transformer for Jointly Predicting Facial Landmark, Occlusion Probability and Head Pose
Landmark detection under large pose with occlusion has been one of the challenging problems in the field of facial analysis. Recently, many works have predicted pose or occlusion together in the multi-task learning (MTL) paradigm, trying to tap into their dependencies and thus alleviate this issue. However, such implicit dependencies are weakly interpretable and inconsistent with the way humans exploit inter-task coupling relations, i.e., accommodating the induced explicit effects. This is one of the essentials that hinders their performance. To this end, in this paper, we propose a Cascaded Iterative Transformer (CIT) to jointly predict facial landmark, occlusion probability, and pose. The proposed CIT, besides implicitly mining task dependencies in a shared encoder, innovatively employs a cost-effective and portability-friendly strategy to pass the decoders’ predictions as prior knowledge to human-like exploit the coupling-induced effects. Moreover, to the best of our knowledge, no dataset contains all these task annotations simultaneously, so we introduce a new dataset termed MERL-RAV-FLOP based on the MERL-RAV dataset. We conduct extensive experiments on several challenging datasets (300W-LP, AFLW2000-3D, BIWI, COFW, and MERL-RAV-FLOP) and achieve remarkable results.
If you find this work useful, please consider citing:
@article{li2023cascaded,
title={Cascaded Iterative Transformer for Jointly Predicting Facial Landmark, Occlusion Probability and Head Pose},
author={Li, Yaokun and Tan, Guang and Gou, Chao},
journal={International Journal of Computer Vision},
pages={1--16},
year={2023},
publisher={Springer}
In addition, we highly recommend citing the great MERL-RAV, AFLW dataset, and face_alignment, who have been very supportive of our work.
We have already prepared our dataset. If you need it, please send an email to [email protected] to obtain the download link.
We propose the MERL-RAV-FLOP dataset based on the MERL-RAV dataset. From the perspective of efficiency, we adopt a simple and efficient semi-automated annotation process, i.e., automatic annotation followed by manual annotation, which is as follows:
First, we follow the MERL-RAV instruction to download the AFLW dataset and prepare the original MERL-RAV Dataset. The directory structure of the prepared MERL-RAV dataset is shown below:
|--MERL_RAV_Dataset |--merl_rav_organized | |-- frontal | | |--testset | | | |--image00019.jpg | | | |--image00019.pts | | | |--... | | |--trainset | | | |-- left | | |--testset | | |--trainset | | | |-- lefthalf | | |--testset | | |--trainset | | | |-- right | | |--testset | | |--trainset | | | |-- righthalf | |--testset | |--trainset | |--merl_rav_labels |--aflw |--common_functions.py |--organize_merl_rav_using_aflw_and_our_labels.py
Next, we use face_alignment and the AFLW dataset to acquire pose and landmarks automatically. The concept behind this is to get the coarse-grained annotations automatically, and then get the fine-grained annotations manually, which saves our time in such a semi-automated way.
Specifically, we match the file_id of the MERL-RAV dataset with the faces in the AFLW dataset to select the correct faces with pose, and then use face_alignment to predict the same 68 keypoints as MERL-RAV for the selected samples, which is done to coarse-grain fill in the missing landmarks in MERL-RAV dataset. This process is detailed in make_files.py
(see the downloaded file) and is achieved by executing the following command. Note that before executing this command, you need to put the three pre-processed files Faces
, FacePose
, and FaceRect
about the AFLW dataset into the aflw folder (again, these files are in the downloaded file).
python make_files.py
After above, the merl_rav_organized
folder will be in the following form. The train_files.txt
and test_files.txt
are the list of training and testing files for the MERL-RAV-FLOP dataset, and the extra xxx.npy
files for each sample in each folder are the corresponding annotation files for the samples (xxx.pts
files are no longer used).
|--merl_rav_organized |-- frontal | |--testset | | |--image00019.jpg | | |--image00019.npy | | |--image00019.pts | | |--... | |--trainset | |-- left |-- lefthalf |-- right |-- righthalf |-- train_files.txt |-- test_files.txt
Finally, after performing the above automated pre-processing, we cleaned the data manually, by checking and correcting the landmark and visibility annotations that are incorrect in the above process. After our manual modifications, the final train_files.txt
, test_files.txt
, and .npy
files are shown in the download link above. The directories are shown below.
|--MERL_RAV_FLOP | |-- frontal | | |--testset | | | |--image00019.jpg | | | |--image00019.npy | | | |--... | | |--trainset | | | |-- left | |-- lefthalf | |-- right | |-- righthalf | |-- train_files.txt | |-- test_files.txt |--make_files.py |--FacePose |--FaceRect |--Faces
Each .npy
file is an array with size of (209,). Taking image00019.npy
as an illustration, the meaning represented by each dimension of this array is shown below:
import numpy as np annotation = np.load("image00019.npy") x_coordinates = annotation[:68] y_coordinates = annotation[68:136] bbox_w, bbox_h = annotation[-2], annotation[-1] pose = annotation[-5:-2] visibility = annotation[-73:-5]
Feel free to contact [email protected] if you have any doubts or questions.