In an effort to extend traditional human-computer interfaces, research has introduced embodied agents to utilize the modalities of everyday human-human communication, such as facial expressions, gestures and body postures. However, giving computer agents a human-like body introduces new challenges. Since human users are very sensitive and critical concerning bodily behavior, the agents must act naturally and individually in order to be believable. This dissertation focuses on conversational gestures. It shows how to generate conversational gestures for an animated embodied agent based on annotated text input. The central idea is to imitate the gestural behavior of a human individual. Using TV show recordings as empirical data, gestural key parameters are extracted for the generation of natural and individual gestures. The gesture generation task is solved in three stages: observation, modeling and generation. For each stage, a software module was developed. For observation, the video annotation research tool ANVIL was created. It allows the efficient transcription of gesture, speech and other modalities on multiple layers.
ANVIL is application-independent because it allows users to define their own annotation schemes; it also provides various import/export facilities and is extensible via its plug-in interface. Therefore, the tool is suitable for a wide variety of research fields. For this work, selected clips of the TV talk show "Das Literarische Quartett" were transcribed and analyzed, yielding a total of 1,056 gestures. For the modeling stage, the NOVALIS module was created to compute individual gesture profiles from these transcriptions with statistical methods. A gesture profile models the handedness, timing and function of gestures for a single human individual using estimated conditional probabilities. The profiles are based on a shared lexicon of 68 gestures, assembled from the data. Finally, for generation, the NOVA generator was devised to create gestures based on gesture profiles in an overgenerate-and-filter approach. Annotated text input is processed in a graph-based representation in multiple stages: semantic data is added, the locations of potential gestures are determined by heuristic rules, and gestures are added and filtered based on a gesture profile.
NOVA outputs a linear, player-independent action script in XML.
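
The modeling and generation stages described above can be illustrated with a minimal sketch. It is not the NOVALIS or NOVA implementation: it assumes a profile is simply a table of relative-frequency estimates of P(gesture | context), and that filtering keeps the most probable candidate per text position above a threshold. The function names, the context tags ("negation", "emphasis") and the gesture labels ("wipe", "dismiss", "beat") are hypothetical and are not taken from the dissertation's lexicon. Handedness and timing are left out for brevity.

```python
from collections import Counter, defaultdict


def estimate_profile(observations):
    """Estimate P(gesture | context) by relative frequency.

    observations: iterable of (context, gesture) pairs, e.g. taken from
    an annotated transcription (hypothetical data format).
    """
    counts = defaultdict(Counter)
    for context, gesture in observations:
        counts[context][gesture] += 1

    profile = {}
    for context, gesture_counts in counts.items():
        total = sum(gesture_counts.values())
        profile[context] = {g: c / total for g, c in gesture_counts.items()}
    return profile


def overgenerate(slots, profile):
    """Propose every gesture the profile knows for each slot's context.

    slots: list of (position, context) pairs marking potential gesture
    locations in the input text.
    """
    candidates = []
    for position, context in slots:
        for gesture, prob in profile.get(context, {}).items():
            candidates.append((position, gesture, prob))
    return candidates


def filter_candidates(candidates, threshold):
    """Keep at most one gesture per position: the most probable candidate
    whose estimated probability reaches the threshold."""
    best = {}
    for position, gesture, prob in candidates:
        if prob < threshold:
            continue
        if position not in best or prob > best[position][1]:
            best[position] = (gesture, prob)
    return [(pos, gesture) for pos, (gesture, _) in sorted(best.items())]


# Hypothetical annotated observations for one speaker.
observations = [
    ("negation", "wipe"),
    ("negation", "wipe"),
    ("negation", "dismiss"),
    ("emphasis", "beat"),
]
profile = estimate_profile(observations)

# Hypothetical gesture slots found in the input text.
slots = [(0, "emphasis"), (3, "negation"), (5, "unknown")]
selected = filter_candidates(overgenerate(slots, profile), threshold=0.5)
print(selected)
```

Because the profile is estimated per speaker, running the same slots through a profile built from a different speaker's observations would select different gestures, which is the sense in which the generated behavior is individual.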