Read my position paper on "Cognitive multimodal processing: from signal to behavior" (invited paper at the Workshop on Roadmapping the Future of Multimodal Interaction Research).

Representation Models, Cognition and Learning

For too many decades the emphasis in our community has been on task-specific decoding performance rather than on creating models with good generalization power and, especially, good induction properties, i.e., models that can learn from one to five examples, just as humans do. My vision is to create cognitively-motivated representations (models) that radically depart from the unified metric space fallacy (the real-world bias) and respect macroscopic cognitive principles such as low dimensionality, hierarchy, abstraction, and a two-tier architecture (system 1 vs. system 2). Rather than following the popular path in representation modeling of adding these constraints as training tricks in deep neural nets or as regularization terms in autoencoder training, we propose a top-down hierarchical manifold representation that explicitly (by design) respects cognitive principles. In our recent work, we show that by creating and reasoning over an ensemble of sparse, low-dimensional subspaces we achieve human-like performance not only in decoding but also in the induction (learning) of lexical semantics. The framework is currently being applied to a variety of other classification/learning tasks.
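As a rough illustration of reasoning over an ensemble of sparse, low-dimensional subspaces (the code and names below are illustrative, not the exact model from our papers), consider classifying a query vector by the class subspace that reconstructs it best; because each subspace has only a handful of free directions, it can be fitted from very few examples:

    import numpy as np

    def fit_subspace(X, dim=5):
        """Mean and top principal directions of the class samples in X."""
        mu = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        return mu, Vt[:dim]

    def reconstruction_error(x, mu, basis):
        """Distance of x from the low-dimensional affine subspace (mu, basis)."""
        r = x - mu
        return np.linalg.norm(r - basis.T @ (basis @ r))

    def classify(x, subspaces):
        """Pick the class whose subspace explains x best."""
        return min(subspaces, key=lambda c: reconstruction_error(x, *subspaces[c]))

    # Usage: subspaces = {label: fit_subspace(Xc) for label, Xc in data.items()}
    #        label = classify(query_vector, subspaces)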

Network-based DSMs and Conceptual Spaces

We work on language-agnostic, fully unsupervised algorithms for the construction of distributed semantic models (DSMs) using web-harvested corpora. Unlike traditional DSMs, where the emphasis is on creating a unified metric semantic space, we take a cognitively-motivated approach and construct a union of semantic neighborhoods defined using co-occurrence or contextual similarity features. On top of these neighborhoods, semantic similarity metrics can be defined, achieving state-of-the-art results (a minimal sketch follows the list below). I am working with collaborators to extend these network-based models to full-blown multimodal conceptual models that include other modalities, such as images, audio snippets, and emotions. In addition to the multimodal dimension, we are also working on compositional and mapping algorithms that seamlessly extend network DSMs from the lexical/concept level to the phrase/sentence level. Applications include:
  • Historically, our work on DSMs started from experiments on automatic grammar induction (as part of the DARPA Communicator project) and has now come full circle back to this important application. In the PortDial and SpeDial projects we investigate, among other things, how the proposed network DSM technologies can improve grammar induction and paraphrasing performance.
  • Putting semantics back into NLP: Once a strong semantic model is constructed, it can be used for a variety of NLP applications, e.g., language modeling, machine translation, and paraphrasing. We have proposed using network-based DSMs for semantics-aware morphological analysis and stemming, e.g., selecting stemming rules that minimize semantic distortion while also minimizing the total number of wordforms (see [Zervanou et al., LREC 2014]).
  • Lexical acquisition in children on the autism spectrum and typically developing children: In the BabyAffect project we investigate and model lexical acquisition using concept networks and show that augmenting these networks with affect and other "multimodal" cues improves the learning rate.
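The following is a minimal sketch of the neighborhood idea referenced above (illustrative code, not the exact metrics from our papers): context vectors are built from co-occurrence counts, each word's semantic neighborhood is its set of most contextually similar words, and word-pair similarity is computed over the neighborhoods rather than in a single unified space.

    from collections import Counter, defaultdict
    import math

    def cooccurrence(sentences, window=2):
        """Count word co-occurrences within a +/- window of tokens."""
        counts = defaultdict(Counter)
        for tokens in sentences:
            for i, w in enumerate(tokens):
                for c in tokens[max(0, i - window): i + window + 1]:
                    if c != w:
                        counts[w][c] += 1
        return counts

    def cosine(u, v):
        """Cosine similarity of two sparse context vectors (Counters)."""
        num = sum(u[k] * v[k] for k in set(u) & set(v))
        den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
        return num / den if den else 0.0

    def neighborhood(word, counts, n=10):
        """The n words whose context vectors are closest to `word`'s."""
        others = (w for w in counts if w != word)
        return sorted(others, key=lambda w: cosine(counts[word], counts[w]), reverse=True)[:n]

    def neighborhood_similarity(w1, w2, counts, n=10):
        """Similarity of w1 and w2 via their semantic neighborhoods: the best
        match between one word and the members of the other's neighborhood."""
        best1 = max(cosine(counts[x], counts[w2]) for x in neighborhood(w1, counts, n))
        best2 = max(cosine(counts[x], counts[w1]) for x in neighborhood(w2, counts, n))
        return max(best1, best2)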

Semantic-Affective Models and Beyond

The basic idea behind semantic-affective models is that emotion can be modeled as a mapping from a (lexical) semantic space to an affective space. Our semantic model is a union of semantic neighborhoods (see above), and the semantic-affective map is a weighted linear combination of the affective scores of each semantic neighborhood. The model is readily extendable to other types of labels that are related to semantics, e.g., politeness markers, sentiment, and cognitive state, and it has proven very successful in recent SemEval evaluation campaigns. The recent emphasis of our work is on adapting semantic-affective models to new domains and labels, as well as on cognitively-motivated compositional models that integrate information over time.
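A minimal sketch of the semantic-affective mapping (the notation and training details here are illustrative, not the exact formulation from our papers): the affective score of a word is estimated as a weighted linear combination of the affective ratings of its semantic neighbors, assuming the neighbors are drawn from a seed lexicon of rated words and each word has a fixed number of neighbors; the combination weights are fitted by least squares on words that already carry affective annotations.

    import numpy as np

    def affective_features(word, neighbors, ratings, similarity):
        """One feature per neighbor of `word`: the neighbor's affective
        rating scaled by its semantic similarity to `word`."""
        return np.array([ratings[t] * similarity(word, t) for t in neighbors[word]])

    def train_map(train_words, neighbors, ratings, similarity):
        """Least-squares fit of the combination weights on annotated words."""
        X = np.stack([affective_features(w, neighbors, ratings, similarity)
                      for w in train_words])
        y = np.array([ratings[w] for w in train_words])
        weights, *_ = np.linalg.lstsq(X, y, rcond=None)
        return weights

    def predict_affect(word, weights, neighbors, ratings, similarity):
        """Affect (e.g., valence) of an unseen word as a weighted linear
        combination of its neighbors' affective ratings."""
        return float(affective_features(word, neighbors, ratings, similarity) @ weights)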

Multimodal Dialogue Interaction and Communication

The crowning achievement of human communication is our unique ability to share intentionality and to create and execute joint plans. Recently, the experimental analysis of the emergence of a shared communication code in human children and primates has provided significant new insights. Human interaction via gestures and speech can be represented as a three-step process: sharing attention, establishing common ground and forming shared goals [Tomasello 2008]. Two prerequisites for successful human-human communication via joint intentionality are: 1) our ability to form an accurate model of the cognitive state of the people around us, i.e., decoding not only overt but also covert communication signals, also referred to as recursive mind-reading, and 2) establishing and building trust, a truly human trait. I am interested in applying such basic communication principles to human-machine and human-robot communication, especially as they pertain to negotiating semantics and intent, i.e., establishing common ground and forming joint goals.

Attention, Saliency and Affect in Multimedia

Saliency- and attention-based modeling has played a significant role in image and video processing over the past decade. However, saliency and attention are less researched in audio, speech and natural language processing. Recently, there have been important findings from neurocognition and cognitive science unraveling the mechanisms of audio/speech saliency and attention, e.g., spectro-temporal attentional maps and the role of low-level features such as periodicity and spectral change. I am interested in investigating the role of bottom-up attentional mechanisms in speech, audio and music perception, with applications to background-foreground audio classification, audio scene analysis and speech recognition. For an example of fused audio, visual and text saliency for event detection and movie summarization, see here.
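As a simple illustration of the bottom-up cues mentioned above (a sketch, not our full saliency model), an audio saliency curve can be built by fusing frame energy with spectral change (spectral flux):

    import numpy as np

    def audio_saliency(x, sr, frame=0.025, hop=0.010):
        """Per-frame bottom-up saliency for a mono signal x sampled at sr:
        z-normalized frame log-energy plus spectral flux, additively fused."""
        n, h = int(frame * sr), int(hop * sr)
        frames = np.stack([x[i:i + n] * np.hanning(n)
                           for i in range(0, len(x) - n, h)])
        mag = np.abs(np.fft.rfft(frames, axis=1))        # magnitude spectra
        energy = np.log1p((frames ** 2).sum(axis=1))     # frame log-energy
        flux = np.r_[0.0, np.sqrt((np.diff(mag, axis=0) ** 2).sum(axis=1))]
        z = lambda v: (v - v.mean()) / (v.std() + 1e-8)  # normalize each cue
        return z(energy) + z(flux)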

Speech Analysis and Robust Speech Recognition

Although research in speech analysis and feature extraction has become less glamorous given the recent success of deep neural nets, the signal processing aspects of speech processing remain fundamentally important and have consistently provided good intuition and performance improvements in robust speech recognition and various other speech processing tasks. I am especially interested in analyzing and modeling the fine structure of speech: micro-modulations in amplitude and frequency that occur within a pitch period and are due to 1) non-linear interaction between the source and the vocal tract, 2) transitional phenomena at phonemic boundaries and 3) deviations from modal voicing due to lack of fine motor control. These phenomena are especially important as they often indicate the cognitive and affective state of the speaker, making the corresponding features very successful, e.g., for emotion recognition tasks. For relevant publications on the AM-FM model, speech analysis and robust recognition, see here.
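As a minimal sketch of this type of analysis (parameter handling is illustrative, not a reference implementation), the Teager-Kaiser energy operator together with a DESA-style energy separation scheme can track the amplitude envelope and instantaneous frequency of a band-passed speech resonance:

    import numpy as np

    def tkeo(x):
        """Teager-Kaiser energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
        return x[1:-1] ** 2 - x[:-2] * x[2:]

    def desa1(x):
        """DESA-1-style energy separation on a band-passed AM-FM signal x:
        returns instantaneous frequency (rad/sample) and amplitude envelope."""
        y = np.diff(x)                        # backward difference x[n] - x[n-1]
        px = tkeo(x)[1:-1]                    # psi[x], aligned with the output
        py = tkeo(y)
        avg = 0.5 * (py[:-1] + py[1:])        # (psi[y_n] + psi[y_{n+1}]) / 2
        cos_w = np.clip(1.0 - avg / (2.0 * px + 1e-12), -1.0, 1.0)
        freq = np.arccos(cos_w)               # instantaneous frequency
        amp = np.sqrt(px / (1.0 - cos_w ** 2 + 1e-12))   # amplitude envelope
        return freq, amp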