Computational Modeling of Face-to-Face Social Interaction Using Nonverbal Behavioral Cues
Supervisor(s) and Committee member(s): Daniel Gatica-Perez (supervisor, thesis director), Pearl Pu Faltings (president of the jury), Anton Nijholt (jury member), Fabio Pianesi (jury member), Jean-Philippe Thiran (jury member)
The computational modeling of face-to-face interactions using nonverbal behavioral cues is an emerging and relevant problem in social computing. Studying face-to-face interactions in small groups helps in understanding the basic processes of individual and group behavior, and in improving team productivity and satisfaction in the modern workplace. Apart from the verbal channel, nonverbal behavioral cues form a rich communication channel through which people infer – often automatically and unconsciously – emotions, relationships, and traits of fellow members.
There exists a solid body of knowledge about small groups and the multimodal nature of the nonverbal phenomenon in social psychology and nonverbal communication. However, the problem has only recently begun to be studied in the multimodal processing community. A recent trend is to analyze these interactions in the context of face-to-face group conversations, using multiple sensors and making inferences automatically, without the need for a human expert. These problems can be formulated in a machine learning framework involving the extraction of relevant audio and video features and the design of supervised or unsupervised learning models.
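For intuition, this formulation can be illustrated with a minimal sketch: each participant is represented by a vector of nonverbal cues, and a supervised model maps cue vectors to a social label. The specific features, labels, and classifier below are invented for illustration, not the pipeline used in the thesis.

```python
import numpy as np

# Hypothetical per-participant nonverbal cue vectors:
# [total speaking time (s), speaking-turn count, visual activity score]
X = np.array([
    [310.0, 42, 0.71],   # participant labeled "dominant"
    [285.0, 39, 0.65],   # "dominant"
    [ 95.0, 12, 0.30],   # "non-dominant"
    [110.0, 15, 0.28],   # "non-dominant"
])
y = np.array([1, 1, 0, 0])  # 1 = perceived dominant, 0 = not

# Minimal supervised model: nearest class centroid in standardized cue space.
mu, sigma = X.mean(axis=0), X.std(axis=0)
Z = (X - mu) / sigma
centroids = {c: Z[y == c].mean(axis=0) for c in np.unique(y)}

def predict(x):
    """Assign the class whose centroid is closest to the standardized cues."""
    z = (np.asarray(x, dtype=float) - mu) / sigma
    return min(centroids, key=lambda c: np.linalg.norm(z - centroids[c]))

print(predict([300.0, 40, 0.70]))  # falls near the "dominant" centroid
```

In practice the cue extraction itself (from raw audio and video) and the choice of model are the hard parts; the sketch only shows the shape of the learning problem.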
While attempting to bridge social psychology, perception, and machine learning, certain factors have to be considered. Firstly, various group conversation patterns emerge at different time scales. For example, turn-taking patterns evolve over shorter time scales, whereas dominance or group-interest trends become established over longer time scales. Secondly, a set of audio and visual cues that are not only relevant but also robustly computable needs to be chosen. Thirdly, unlike typical machine learning problems where ground truth is well defined, interaction modeling involves data annotation that needs to factor in inter-annotator variability. Finally, principled ways of integrating the multimodal cues have to be investigated.
In this thesis, we have investigated individual social constructs in small groups, such as dominance and status (two facets of the so-called vertical dimension of social relations). In the first part of this work, we investigated how dominance perceived by external observers can be estimated from different nonverbal audio and video cues, and how the estimates are affected by annotator variability, the estimation method, and the exact task involved. We then jointly studied perceived dominance and role-based status to understand whether dominant people are the ones with high status, and whether dominance and status in small-group conversations can be explained automatically by the same nonverbal cues. We employ speaking activity, visual activity, and visual attention cues in both studies.
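One of the simplest speaking-activity cues in this line of work is accumulated speaking time per participant. The toy sketch below (with invented binary speaking-status frames and an assumed frame duration, not the thesis data) estimates the most-dominant participant as the one who speaks the most:

```python
# Hypothetical binary speaking-status streams (1 = speaking) sampled per frame,
# one list per participant in a four-person meeting.
speaking_status = {
    "A": [1, 1, 0, 1, 1, 1, 0, 1],
    "B": [0, 0, 1, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 1, 0, 1, 0],
    "D": [0, 1, 0, 0, 0, 0, 0, 0],
}

frame_duration = 0.5  # seconds per frame (assumed sampling rate)

# Individual cue: total speaking time per participant.
speaking_time = {p: sum(s) * frame_duration for p, s in speaking_status.items()}

# Estimate: the participant with the most accumulated speaking time.
most_dominant = max(speaking_time, key=speaking_time.get)
print(most_dominant)  # prints "A" for this toy data
```

Real estimation combines several such cues (speaking turns, visual activity, received visual attention) and must contend with annotator disagreement about who is dominant in the first place.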
In the second part of the thesis, we investigated group social constructs using both supervised and unsupervised approaches. We first propose a novel framework to characterize groups. This two-layer framework consists of an individual layer and a group layer. At the individual layer, the floor-occupation patterns of the individuals are captured; at the group layer, the identity information of the individuals is not used. We define group cues by aggregating individual cues over time and across persons, and use them to classify group conversational contexts: cooperative vs. competitive, and brainstorming vs. decision-making. We then propose a framework to discover group interaction patterns using probabilistic topic models. An objective evaluation of our methodology, involving human judgment and multiple annotators, showed that the learned topics are indeed meaningful, and that the discovered patterns resemble prototypical leadership styles – autocratic, participative, and free-rein – proposed in social psychology.
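The group-layer idea of aggregating individual floor-occupation cues into identity-free group cues can be sketched as follows. The toy data and the two example cues (speaking-time entropy and speech-overlap fraction) are illustrative assumptions, not the thesis feature set:

```python
import math

# Hypothetical per-frame binary speaking status, one row per participant.
status = [
    [1, 1, 0, 1, 1, 1, 0, 1],  # participant 1
    [0, 1, 1, 0, 0, 0, 0, 0],  # participant 2
    [0, 0, 0, 0, 1, 0, 1, 0],  # participant 3
]

# Individual layer: floor occupation (fraction of frames each person speaks).
n_frames = len(status[0])
occupation = [sum(row) / n_frames for row in status]

# Group layer: aggregate over persons, discarding identity.
total = sum(occupation)
shares = [o / total for o in occupation]

# Cue 1: entropy of the speaking-time distribution
# (high = balanced floor sharing, low = one person monopolizes the floor).
entropy = -sum(p * math.log(p, 2) for p in shares if p > 0)

# Cue 2: fraction of frames with overlapping speech
# (a rough proxy for competitive, interruption-heavy interaction).
overlap = sum(
    1 for t in range(n_frames) if sum(row[t] for row in status) > 1
) / n_frames

print(round(entropy, 3), overlap)
```

Vectors of such group cues, computed over sliding windows, could then feed a classifier of conversational context (cooperative vs. competitive) or serve as the "words" of a probabilistic topic model over meetings.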
Social Computing group
Social computing is an emerging research domain focused on the automatic sensing, analysis, and interpretation of human and social behavior from sensor data. Through microphones and cameras in multi-sensor spaces, mobile phones, and the web, sensor data depicting human behavior can increasingly be obtained at large scale, both longitudinally and across populations. The research group integrates models and methods from multimedia signal processing and information systems, statistical machine learning, and ubiquitous computing, and applies knowledge from the social sciences to address questions related to the discovery, recognition, and prediction of short-term and long-term behavior of individuals, groups, and communities in real life. This ranges from people holding meetings at work, to users of social media sites, to people with mobile phones in urban environments. The group's research methods aim to create ethical, personally and socially meaningful applications that support social interaction and communication in the contexts of work, leisure, healthcare, and creative expression.