The evaluation of systems that claim to recognize emotions expressed by human beings is a contested and complex task: the early pioneers in this field gave the impression that these systems would eventually recognize a flash of anger, suppressed glee/happiness, momentary disgust or contempt, lurking fear, or sadness in someone's face or voice (Picard and Klein, 2002; Schuller et al., 2011). Emotion recognition systems are trained on 'labelled' databases: collections of video/audio recordings comprising images and voices of humans enacting a single emotional state. Machine learning programmes then regress the pixel distributions or waveforms against the labels. The system is then said to have learnt how to recognize and interpret human emotions, and is rated using information science metrics. These systems have been adopted at large for applications ranging from autistic spectrum communication to teaching and learning, and onwards to covert surveillance. The training databases depend upon human emotions recorded in ideal conditions: faces centrally located and looking at the camera, voices articulated through noise-cancelling microphones. Yet there are reports that such posed training data, racially skewed and gender-imbalanced, do not prepare these systems to cope with data-in-the-wild, and that expression-unrelated variations such as illumination, head pose, and identity bias (Li and Deng, 2020) can also impair their performance. Deployments of these systems tend to adopt one system or another, apply it to data collected outside laboratory conditions, and use the resulting classifications in subsequent processing. We have devised a testing method that helps to quantify the similarities and differences of facial emotion recognition (FER) and speech emotion recognition (SER) systems. We report on the development of a database comprising videos and soundtracks of 64 politicians and 7 government spokespersons (25 F, 46 M; 34 White Europeans, 19 East Asians, and 18 South Asians), ranging in age from 32 to 85 years; each of the 71 has on average three 180 s videos, for a total of 16.66 h of data. We have compared the performance of two FERs (Emotient and Affectiva) and two SERs (OpenSmile and Vokaturi) on our data by analysing the emotions reported by these systems on a frame-by-frame basis. We have analysed the directly observable head movements and the indirectly observable muscle movements of the face and of the vocal tract. There was marked disagreement in the emotions recognized, and the differences were exacerbated more for women than for men, and more for South and East Asians than for White Europeans. Levels of agreement and disagreement on both high-level features (i.e. emotion labels) and lower-level features (e.g. Euler angles of head movement) are shown. We show that inter-system disagreement may also be used as an effective response variable in reasoning about the data features that influence disagreement. We argue that the reliability of subsequent processing in approaches that adopt these systems may be enhanced by restricting action to cases where the systems agree within a given tolerance level. This paper may be considered a foray into the greater debate about so-called algorithmic (un)fairness and data bias in the development and deployment of machine learning systems, of which FERs and SERs are a good exemplar.
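The abstract describes comparing per-frame emotion labels across systems and acting only where systems agree within a tolerance. The sketch below is not the authors' code; it is a minimal illustration, under assumed data structures (a hypothetical `Frame` record holding a label and a confidence score per video frame), of how frame-level agreement and an "act only on agreement" filter might be computed.

```python
# Minimal illustrative sketch (not the paper's implementation): frame-by-frame
# agreement between two emotion recognition systems, plus a tolerance filter
# that keeps only frames where both systems agree with sufficient confidence.
# The Frame structure, field names, and the 0.6 threshold are assumptions.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    label: str         # emotion label reported by one system for this frame
    confidence: float  # that system's score for the label, assumed in [0, 1]


def agreement_rate(a: List[Frame], b: List[Frame]) -> float:
    """Fraction of aligned frames on which the two systems report the same label."""
    assert len(a) == len(b), "streams must be aligned frame-by-frame"
    if not a:
        return 0.0
    same = sum(fa.label == fb.label for fa, fb in zip(a, b))
    return same / len(a)


def consensus_frames(a: List[Frame], b: List[Frame],
                     min_conf: float = 0.6) -> List[Tuple[int, str]]:
    """Indices and labels of frames where both systems agree and both report
    at least min_conf confidence: a simple 'restrict action to agreement' policy."""
    out = []
    for i, (fa, fb) in enumerate(zip(a, b)):
        if fa.label == fb.label and min(fa.confidence, fb.confidence) >= min_conf:
            out.append((i, fa.label))
    return out


if __name__ == "__main__":
    # Toy per-frame outputs standing in for two systems' label streams.
    sys_a = [Frame("anger", 0.9), Frame("neutral", 0.7), Frame("joy", 0.4)]
    sys_b = [Frame("anger", 0.8), Frame("sadness", 0.6), Frame("joy", 0.9)]
    print(f"agreement rate: {agreement_rate(sys_a, sys_b):.2f}")  # 0.67
    print(f"consensus frames: {consensus_frames(sys_a, sys_b)}")  # [(0, 'anger')]
```

Frame-level agreement computed this way can also serve as the response variable the abstract mentions, e.g. regressed against demographic or signal-quality features to reason about what drives inter-system disagreement.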