Head- and eye-based features for continuous core affect prediction
O'Dwyer, Jonathan (Jonny)
Feelings, or affect, are a fundamental part of human experience. Arousal and valence make up core affect and have received intense study in affective computing. Speech and facial features have been extensively studied as predictors of core affect. Other indicators of affect include head- and eye-based gestures, yet these are underexplored for affect prediction. In this dissertation, handcrafted feature sets from head and eye modalities are proposed and evaluated in two audiovisual continuous (core) affect prediction experiments on the RECOLA and SEMAINE affective corpora.

In the first experiment, head- and eye-based features were input to a deep feed-forward neural network (DNN), along with speech and face features, for unimodal continuous affect prediction. Two proposed head feature sets and one eye feature set outperformed minimum performance benchmarks (estimated human prediction performances) for arousal prediction on both corpora. The more complex of the proposed head feature sets performed second-best overall, after speech, and best among the visual modalities, for arousal prediction. This feature set obtained validation set concordance correlation coefficient (CCC) scores of 0.572 on RECOLA and 0.671 on SEMAINE. For valence, head feature sets performed best among those proposed, and best overall for valence prediction on SEMAINE (CCC = 0.289); however, these sets were unable to match or exceed human performance estimates. From this experiment, it was concluded that head-based features are suitable for unimodal arousal prediction, and that arousal prediction performance within 15.82% of speech (relative CCC) can be obtained from head-based features.

In the second experiment, the proposed feature sets were evaluated with speech and face features for multimodal continuous affect prediction using DNNs.
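All model scores above are reported as concordance correlation coefficients. As a reference for readers unfamiliar with the metric, Lin's CCC can be computed as follows; this is a standalone sketch of the standard definition, not the thesis's exact evaluation code:

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient (Lin, 1989).

    Measures agreement between gold labels x and predictions y,
    penalising mean and scale shifts as well as decorrelation,
    unlike the plain Pearson correlation.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()           # population variances
    cov = ((x - mx) * (y - my)).mean()  # population covariance
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

A perfect prediction gives CCC = 1, while a prediction that is merely offset from the labels is penalised even though its Pearson correlation is still 1.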
The experimentation included a fusion study, a cross-modal interaction feature investigation, and the proposal and evaluation of teacher-forced learning with multi-stage regression (TFL-MSR). TFL-MSR is a method for leveraging correlations between affect dimensions to improve affect prediction. An algorithm screening-based sensitivity analysis was also performed to highlight important feature groups for prediction in the different corpora. Model fusion performed better than feature fusion in the experiment. Relative CCC performance increases of 4.91% and 18.23% on RECOLA, and 13.18% and 74.17% on SEMAINE, above model fusion of speech and face were observed for arousal and valence respectively for multimodal systems that used all modalities. One eye-and-face cross-modal interaction feature was discovered for valence prediction on RECOLA, and it improved CCC prediction performance by 2.66%. TFL-MSR improved valence prediction on RECOLA but not on SEMAINE, where only a small arousal-valence correlation was present. Interesting cross-corpus similarities and differences were found in the sensitivity analysis, indicating that some feature groups have similar importances, while other feature groups' importances were inverted across the social situations in the corpora. The final models of this work produced test set CCC results of 0.812 for arousal and 0.463 for valence on RECOLA, and 0.616 for arousal and 0.436 for valence on SEMAINE.

The usefulness of the proposed head and eye features has been shown in this research, and they can also facilitate model interpretability efforts, as the handcrafted features are themselves interpretable. This work provides researchers with new affective feature sets from video, along with methods that can improve affect prediction and potentially other social and affective computing efforts.
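The fusion study contrasts two standard strategies. Feature fusion concatenates the per-modality feature vectors and trains a single model; model fusion trains one model per modality and combines their predictions. The minimal sketch below illustrates the mechanical difference only, using random toy data and least-squares linear models as stand-ins for the thesis's features and DNNs (all names and shapes here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality feature matrices (illustrative, not the thesis's data)
speech = rng.normal(size=(200, 8))
head = rng.normal(size=(200, 4))
target = speech @ rng.normal(size=8) + head @ rng.normal(size=4)

def fit_predict(X, y):
    # Least-squares linear model as a stand-in for a trained DNN regressor
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ w

# Feature fusion: concatenate modality features, train one model on all of them
feat_fused = fit_predict(np.hstack([speech, head]), target)

# Model fusion: train one model per modality, then combine their predictions
# (equal weights here; fusion weights can also be learned)
model_fused = 0.5 * fit_predict(speech, target) + 0.5 * fit_predict(head, target)
```

Which strategy wins is an empirical question; in the experiments summarised above, model fusion outperformed feature fusion on both corpora.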