Statement of Future Research

I have been studying Linguistics for over ten years and Computational Linguistics for over seven years.  During this time, I have been particularly interested in the nature of spoken language because of its richness and eloquence in communicating meaning.  Not only do we communicate our feelings, thoughts, and intentions by the words we say, but also by how we say them.  Of late, my research has focused on the role of suprasegmental, prosodic information in conveying emotion and other cognitive states, both as a matter of theoretical interest and for use in Spoken Dialog Systems.  My future interests relate to this line of investigation insofar as they involve the computational modeling of speech for the improvement of both linguistic theory and speech technology applications.

My interests lie at the intersection of the fields of Artificial Intelligence, Human-Computer Interaction, and Natural Language Processing.  Computers already outperform most humans at certain tasks, such as arithmetic and even chess.  However, we are still years away from enabling computers to do things as humans do.  I see my research as fostering better social interaction between humans and computers by improving the spoken conversational ability of automated computer agents.

Research has shown that we react to computers as we would to other humans, even though we may not be consciously aware of it (Reeves & Nass, 1998).  We are susceptible to flattery by computers and even compare the intelligence of different computers.  I conjecture that we also judge computers based on their conversational behavior.  As anyone who has ever interacted with a Spoken Dialog System can tell you, it is often a frustrating experience.  Users of such systems report that frustration arises from performance errors, such as misrecognitions.  Though the displeasure of using such systems is surely related to performance issues, I believe that there are other reasons as well, including the unnatural behavior of the automated agent.  Even when the dialog is ‘successful’ in terms of task-completion, the interaction often leaves users negatively affected.  Conversation is not only a means to an end, a way to transmit information between two parties; it also constitutes social interaction whereby feelings and emotional states are constantly monitored and reciprocated.  Good conversation partners are expected to follow certain rules of the language, and those who do not are perceived to be not only incompetent, but also uncooperative, rude, and even hostile.  I believe that this is one of the main factors that leave users of Spoken Dialog Systems with a feeling of dissatisfaction.  In the future, I would like to make computers more cooperative communication partners by engineering them to display certain behaviors found in all felicitous spoken interactions.  A good conversation partner provides supportive feedback in the form of back-channeling, is responsive to turn-taking strategies, respects face-saving devices, and is capable of error-handling.

The idea of enabling Spoken Dialog Systems to be more conversationally sophisticated is not a novel one.  Currently, a few researchers are working to this end (e.g., Carlson et al., 2006).  However, I think that there are two reasons why such functionality is rarely implemented in deployed systems.  The first is the belief that sophisticated world knowledge and reasoning capabilities are needed in order to accomplish these tasks.  The second is that such functionality is not deemed critical to the success of systems in terms of task-completion.  I would argue that both views are misguided.  As for the first, it is true that state-of-the-art Artificial Intelligence is currently incapable of modeling the cognitive knowledge required for full communicative competency on the part of computers.  However, many techniques can be implemented at a much lower level while still successfully enhancing the communicative experience.  For example, back-channeling in response to simple acoustic cues such as signal intensity, or inserting disfluencies for attentional focus or naturalness, can be done quite well without sophisticated reasoning capabilities.  As for the second, though Spoken Dialog Systems may be able to achieve tasks without this functionality, ignorance of human conversational behavior actually creates antipathy towards the use of such systems by members of society at large.  For a system to be truly successful, users must enjoy the experience, rather than simply endure the frustration and awkwardness to achieve their goal.  In fact, one of the goals of communication is simply to engage in social interaction.  Part of my future research therefore revolves around other ways of evaluating such systems that take these issues into account.  In addition to traditional goal-based metrics, I plan to explore qualitative measures, such as treating negative emotions displayed by users as an indicator of a lack of user enjoyment.
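
To illustrate how little machinery a low-level technique of this kind might require, the following is a minimal sketch of back-channeling triggered by signal intensity alone.  It is an illustration under stated assumptions, not a description of any deployed system or of the work cited above: the function names, the frame-based intensity measure, and all threshold values are hypothetical and would need to be tuned against real human-human dialog data.

```python
import math
from typing import List, Sequence


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """Root-mean-square intensity of each non-overlapping frame of samples."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def backchannel_frames(
    rms: Sequence[float],
    speech_threshold: float = 0.1,  # hypothetical intensity cutoff
    min_speech: int = 3,            # hypothetical: frames of speech required
    min_pause: int = 2,             # hypothetical: frames of pause required
) -> List[int]:
    """Return frame indices at which a back-channel (e.g. "uh-huh") could be emitted.

    A cue fires when at least `min_speech` frames of speech have accumulated
    and are followed by exactly `min_pause` consecutive low-intensity frames.
    Dips shorter than `min_pause` are ignored.
    """
    cues: List[int] = []
    speech_run = 0
    pause_run = 0
    for i, level in enumerate(rms):
        if level >= speech_threshold:
            speech_run += 1
            pause_run = 0
        else:
            pause_run += 1
            if pause_run == min_pause:
                if speech_run >= min_speech:
                    cues.append(i)
                speech_run = 0  # a full pause ends the current stretch of speech
    return cues
```

No world knowledge or reasoning is involved here: the decision depends only on how long the user has been speaking and how long the signal has been quiet.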

Most current approaches to human-computer dialog construction are more art than science.  In other words, engineers often imagine how dialog should flow and construct dialog grammars heuristically.  This is analogous to formal models of language in which linguists imagine felicitous and non-felicitous language productions and extrapolate theories based on them.  Instead, I would like to apply a functional model to dialog management in which actions taken on the part of the automated agent are modeled directly on human behavior as observed in, and automatically learned from, corpora of human-human interaction.  This is similar to the approach taken by functional linguists.  In particular, I am interested in studying types of human conversational behavior that are used to structure dialog flow, such as back-channeling and turn-taking cues, as well as error-recovery strategies.  This approach differs from most direct-modeling approaches, which focus on modeling underlying dialog act sequences (Levin et al., 2000).

Automatic speech recognition (the process of converting a speech signal to a sequence of words) has been studied extensively, and performance, in terms of word error rate, is quite robust.  Such systems adopt the Noisy Channel Model, which searches for the most likely word sequence given an acoustic signal.  Much of the power of this approach relies on statistical language modeling, which assigns probabilities to word sequences in order to favor the most likely sequence of words.  Often, this leads to very un-human-like recognition errors.  I plan to explore ways of incorporating knowledge about human misrecognition to inform system misrecognition (after all, humans make recognition errors as well).  Furthermore, I would like to more fully develop recognition error-recovery by exploring the cues humans use to indicate that they have made a recognition error and what steps they take to correct such errors.  I do not see perfect speech recognition as a realistic goal.  Instead, I feel that proper error-recovery modeling will produce systems that adhere more closely to the behavior of human interaction, and thus produce Spoken Dialog Systems that are more natural to use and better liked by their users.
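
For reference, the Noisy Channel Model mentioned above is usually stated as follows.  The symbols are notational assumptions introduced here: A denotes the observed acoustic signal, W a candidate word sequence, P(A | W) the acoustic model, and P(W) the statistical language model.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\, P(W)
```

The denominator P(A) can be dropped because it does not depend on W.  The language model term P(W) is the component that rewards probable word sequences, and it is therefore the natural place to look when asking why system errors differ in kind from human ones.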