Read my position paper on "Cognitive multimodal processing: from signal to behavior" (invited paper at the Workshop on Roadmapping the Future of Multimodal Interaction Research).
Representation Models, Cognition and Learning
For too many decades the emphasis in our community has been on task-specific decoding performance
rather than on creating models with good generalization power and, especially, good induction properties,
i.e., the ability to learn from one to five examples, much as humans do.
My vision is to create cognitively motivated representations (aka models) that
radically depart from the unified metric space fallacy (aka the real-world bias)
and respect macroscopic cognitive principles such as low dimensionality,
hierarchy, abstraction, and a two-tier architecture (System 1 vs. System 2).
Instead of following the popular path in representation modeling of
adding these constraints as training tricks in deep neural nets or as regularization
terms in autoencoder training, we propose a top-down hierarchical manifold
that explicitly (by design) respects cognitive principles.
In our recent work, we show that by creating and reasoning over an ensemble of
sparse, low-dimensional subspaces we achieve human-like performance not only
for decoding but also for inducing (learning) lexical semantics. The framework
is currently being applied to a variety of other classification/learning tasks.
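As an illustration of the general idea (not the actual model), a few-shot classifier built from an ensemble of sparse, low-dimensional subspaces might look as follows; the function names and the random coordinate-subspace construction are hypothetical choices for this sketch:

```python
import numpy as np

def sparse_subspaces(dim, n_subspaces=10, k=3, seed=0):
    # Each "subspace" here is a small random set of coordinate axes,
    # i.e., a sparse, low-dimensional projection of the input space.
    rng = np.random.default_rng(seed)
    return [rng.choice(dim, size=k, replace=False) for _ in range(n_subspaces)]

def induce_and_classify(train_x, train_y, test_x, subspaces):
    """Few-shot induction sketch: learn a class centroid in each sparse
    subspace from a handful of examples, then classify test points by
    majority vote of nearest-centroid decisions across the ensemble."""
    labels = sorted(set(train_y))
    votes = np.zeros((len(test_x), len(labels)))
    y = np.array(train_y)
    for axes in subspaces:
        # Centroid per class, restricted to this subspace's coordinates.
        cents = {c: train_x[y == c][:, axes].mean(axis=0) for c in labels}
        for i, x in enumerate(test_x):
            dists = [np.linalg.norm(x[axes] - cents[c]) for c in labels]
            votes[i, int(np.argmin(dists))] += 1
    return [labels[j] for j in votes.argmax(axis=1)]
```

Even with three examples per class, the vote over many low-dimensional views can be robust, which is the intuition the ensemble exploits.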
Network-based DSMs and Conceptual Spaces
We work on language-agnostic, fully unsupervised algorithms for the construction of
distributed semantic models (DSMs) using web-harvested corpora.
Unlike traditional DSMs, where the emphasis is on creating a unified metric semantic space,
we take a cognitively motivated approach of constructing a union of semantic neighborhoods
that are defined using co-occurrence or contextual similarity features. On top of these
neighborhoods, semantic similarity metrics can be defined that achieve state-of-the-art results.
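To make the neighborhood construction concrete, here is a toy, standard-library-only sketch; the function names are mine, and Jaccard overlap of neighborhoods stands in for the actual similarity metrics used in our work:

```python
from collections import Counter, defaultdict

def cooccurrence(sentences, window=2):
    # Count word co-occurrences within a fixed window around each word.
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    counts[w][sent[j]] += 1
    return counts

def context_similarity(counts, a, b):
    # Cosine similarity of the two words' co-occurrence (context) vectors.
    ca, cb = counts[a], counts[b]
    dot = sum(ca[u] * cb[u] for u in ca)
    na = sum(v * v for v in ca.values()) ** 0.5
    nb = sum(v * v for v in cb.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def neighborhood(counts, w, k=3):
    # A word's semantic neighborhood: its k most contextually similar words.
    others = [u for u in counts if u != w]
    others.sort(key=lambda u: context_similarity(counts, w, u), reverse=True)
    return set(others[:k])

def neighborhood_similarity(counts, a, b, k=3):
    # One illustrative neighborhood-level metric: Jaccard overlap of the
    # two words' neighborhoods (each including the word itself).
    na = neighborhood(counts, a, k) | {a}
    nb = neighborhood(counts, b, k) | {b}
    return len(na & nb) / len(na | nb)
```

Similarity is thus computed over local neighborhoods rather than a single unified metric space, which is the point of the construction.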
I am working with collaborators to extend these network-based models to full-blown multimodal
networks that include other modalities, such as images, audio snippets, and emotions.
In addition to the multimodal dimension, we are also working on compositional models
that seamlessly extend network DSMs from the lexical/concept to the phrase/sentence level.
Historically, our work on DSMs started from experiments on automatic grammar induction (part of the
DARPA Communicator project)
and has now come full circle back to this important application. In
the PortDial and
SpeDial projects we investigate (among other things)
how the proposed network DSM technologies can help improve grammar induction.
- Putting semantics back into NLP: Once a strong semantic model is constructed, it is possible
to use it for a variety of NLP applications, e.g., language modeling, machine translation, and paraphrasing.
We have proposed using network-based DSMs for semantics-aware morphological analysis and stemming,
e.g., selecting stemming rules that minimize semantic distortion while also minimizing the total number of
wordforms (see [Zervanou et al., LREC 2014]).
- Lexical acquisition in autistic spectrum and typically developing children: In the
BabyAffect project we investigate and model lexical
acquisition using concept networks and show how augmenting these networks with
affect and other "multimodal" cues improves the learning rate.
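The semantics-aware stemming idea above can be sketched as a greedy rule selector; the cost function, the `alpha` reuse bonus, and all names are hypothetical, and a real system would use the network DSM similarity rather than the toy character-overlap measure shown in the test below:

```python
def select_stems(candidates, similarity, alpha=0.1):
    """Greedily pick, for each word, the candidate stem (produced by
    competing stemming rules) that minimizes semantic distortion
    (1 - similarity), with a small bonus (alpha) for reusing a stem
    already in the lexicon, so the total number of wordforms stays low."""
    chosen, lexicon = {}, set()
    for word, stems in candidates.items():
        def cost(stem):
            reuse_bonus = alpha if stem in lexicon else 0.0
            return (1.0 - similarity(word, stem)) - reuse_bonus
        best = min(stems, key=cost)
        chosen[word] = best
        lexicon.add(best)
    return chosen, lexicon
```

A global (rather than greedy) optimization over all words would match the stated objective more closely; the greedy pass is kept for brevity.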
Semantic-Affective Models and Beyond
The basic idea behind semantic-affective models is that emotion is a mapping from a
(lexical) semantic space to an affective space. Our semantic model
is a union of semantic neighborhoods (see above) and the semantic-affective
map is a weighted linear combination of the affective scores of each
semantic neighborhood. The model is readily extendable to other types
of labels that are related to semantics, e.g., politeness markers, sentiment, and cognitive state,
and it has been shown to be very successful in recent SemEval evaluation campaigns.
The recent emphasis of our work is on adapting the semantic-affective models to new
domains or labels, as well as on cognitively motivated compositional models
that integrate information over time.
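A rough sketch of such a semantic-affective map follows; the neighborhood size, the use of raw similarities as combination weights, and all names are illustrative (the actual model learns the weights from words with known affective ratings):

```python
def affective_score(word, seeds, similarity, k=5):
    """Estimate the affective rating (e.g., valence) of `word` as a
    weighted linear combination of the ratings of its semantic
    neighborhood, here approximated by the k seed words most similar
    to it. `seeds` maps words to known affective ratings; in this
    sketch the similarities themselves serve as the weights."""
    neigh = sorted(seeds, key=lambda s: similarity(word, s), reverse=True)[:k]
    num = sum(similarity(word, s) * seeds[s] for s in neigh)
    den = sum(similarity(word, s) for s in neigh)
    return num / den if den else 0.0
```

Swapping the label set (e.g., politeness or sentiment ratings for `seeds`) extends the same map to other semantics-related labels, as noted above.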
Multimodal Dialogue Interaction and Communication
The crowning achievement of human communication is our unique ability to share intentionality, and to create and execute joint plans. Recently, the experimental analysis of the emergence of a shared communication code in human children and primates has provided significant new insights.
Human interaction via gestures and speech can be represented as a three-step process: sharing attention, establishing common ground, and forming shared goals [Tomasello 2008]. Two prerequisites for successful human-human communication via joint intentionality are: 1) our ability to form an accurate model of the cognitive state of the people around us, i.e., decoding not only overt but also covert communication signals, a capacity also referred to as recursive mind-reading, and 2) establishing and building trust, a truly human trait.
I am interested in applying such basic communication principles in
human-machine and human-robot communication especially as it pertains to negotiating
semantics and intent, i.e., establishing common ground and forming joint goals.
Attention, Saliency and Affect in Multimedia
Saliency- and attention-based modeling has played a significant role in image and video processing
in the past decade. However, saliency and attention are less researched in audio, speech and natural language
processing. Recently, there have been important findings from neurocognition and cognitive
science unraveling the mechanisms of audio/speech saliency and attention, e.g., spectro-temporal attentional
maps and the role of low-level features such as periodicity and spectral change.
I am interested in investigating the role of bottom-up attentional mechanism in speech, audio and music perception
with application to background-foreground audio classification, audio scene analysis and speech recognition.
For an example of fused audio, visual and text saliency
for event detection and movie summarization, see here.
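A toy bottom-up saliency curve along these lines can be computed from two of the low-level cues mentioned above, frame energy and spectral flux; the features, equal weights and min-max normalization are illustrative choices for this sketch, not the actual model:

```python
import numpy as np

def audio_saliency(signal, frame=256, hop=128):
    """Toy bottom-up audio saliency: per-frame energy plus spectral flux
    (positive spectral change between consecutive frames), each min-max
    normalized to [0, 1] and averaged into a single saliency curve."""
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame + 1, hop)]
    window = np.hanning(frame)
    spectra = [np.abs(np.fft.rfft(f * window)) for f in frames]
    energy = np.array([np.sum(f ** 2) for f in frames])
    flux = np.array([0.0] + [np.sum((spectra[i] - spectra[i - 1]).clip(min=0))
                             for i in range(1, len(spectra))])
    def norm(x):
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)
    return 0.5 * norm(energy) + 0.5 * norm(flux)
```

Thresholding such a curve gives a crude background/foreground segmentation, the simplest of the applications listed above.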
Speech Analysis and Robust Speech Recognition
Although research in speech analysis and feature extraction has become less glamorous
due to the recent success of deep neural nets, the signal processing aspects of speech
processing remain fundamentally important and have consistently provided over the years
good intuition and performance improvements in robust speech recognition and various speech
processing tasks. I am especially interested in analyzing and modeling the fine structure of speech:
micro-modulations that occur in amplitude and frequency within a pitch-period and are due to
1) non-linear interaction between the source and the vocal tract, 2) transitional phenomena
at phonemic boundaries and 3) deviations from modal voicing due to lack of fine motor control.
These phenomena are especially important as they often reveal the cognitive and affective state of the speaker,
making the associated features very successful, e.g., for emotion recognition tasks. For relevant
publications on the AM-FM model, speech analysis and robust recognition, see here.
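As a flavor of the AM-FM signal processing involved, here is a minimal sketch of the Teager-Kaiser energy operator and a DESA-style demodulator recovering instantaneous amplitude and frequency; the index alignment is deliberately crude and the helper names are mine:

```python
import numpy as np

def teager(x):
    # Teager-Kaiser energy operator: Psi[x](n) = x(n)^2 - x(n-1) * x(n+1).
    # For x(n) = A*cos(W*n + phi) it equals A^2 * sin^2(W) exactly.
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa(x):
    """DESA-style demodulation: with y(n) = x(n) - x(n-1),
    Psi[y] / Psi[x] = 4 * sin^2(W/2) for a discrete sinusoid,
    which yields the frequency W and then the amplitude A."""
    psi_x = teager(x)
    psi_y = teager(np.diff(x))
    n = min(len(psi_x), len(psi_y))
    px, py = psi_x[:n], psi_y[:n]                      # crude alignment
    ratio = np.clip(py / np.maximum(px, 1e-12), 0.0, 4.0)
    omega = 2.0 * np.arcsin(np.sqrt(ratio / 4.0))      # rad/sample
    amp = np.sqrt(np.maximum(px, 0.0)) / np.maximum(np.sin(omega), 1e-12)
    return amp, omega
```

Applied frame-by-frame to bandpass-filtered speech, estimators of this family expose exactly the within-pitch-period amplitude and frequency micro-modulations discussed above.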