Multimodal Interactions with Dynamically Autonomous Robots



D.J. Perzanowski,1 A.C. Schultz,1 W.L. Adams,1 M. Bugajska,1 M.S. Skubic,2 G. Trafton,1 D.P. Brock,1 E. Marsh,1 and M. Abramson3
1Information Technology Division
2University of Missouri-Columbia
3ITT Industries


Introduction: For humans and robots to work together, they must interact naturally, intelligently, and cooperatively to accomplish goals. How they interact depends on their roles, so a sense of teamwork needs to be built into the interface. This kind of interaction, together with the ability to act as either cooperative or independent agents, is known as dynamic autonomy.

Natural modes of interaction, such as natural language and gestures, facilitate dynamic autonomy. They make communication easy, allowing the participants to concentrate on the task and not on the means of communicating. Awareness of the environment is also important. Thus, our interface incorporates spoken utterances, natural gestures, and a cognitive model of spatial relations that indicates the location of objects and other spatial information about the environment. Given this model of the environment, humans and robots have a common ground for interacting with each other and with the environment.

As robots and autonomous vehicles become more prevalent, human-robot interaction is becoming increasingly important. The current state of the art requires many humans to control a single, seemingly autonomous vehicle. For example, Global Hawk, a high-altitude, long-endurance unmanned aerial vehicle, currently requires a team of 10 operators to control it, while Predator, a medium-altitude, long-endurance unmanned aerial vehicle, requires three. Future autonomous systems must work closely and cooperatively with humans, sometimes exhibiting full autonomy while at other times collaborating with varying numbers of humans in close proximity. To facilitate collaboration and cooperation in such systems, we have designed a multimodal interface [2] that incorporates natural language, gestures, touch screen modalities, and a cognitive model of spatial relations (Fig. 1).

Robot Platforms: We are using several robots—Nomad 200s, a B21r, and several ATRV-Jrs (Fig. 1). They are equipped with range sensors (sonars, structured light or LIDAR rangefinders, etc.) to enable environment mapping, crude object detection, and gesture detection. The robots are also equipped with a wireless microphone for speech input and an optional camera to provide the user with a real-time video of the environment.

FIGURE 1
System architecture.

Multimodal Interface: When using the interface, human users need not conform to predetermined methods of interaction to complete a task. Speaking a command and gesturing may seem appropriate and natural at times (Fig. 2). Alternatively, the user can work through graphical devices, such as a hand-held personal digital assistant (PDA) (shown in Fig. 1) or an end user terminal (EUT) (Fig. 3). Menu buttons on the PDA and EUT (top right-hand screen in Fig. 3) replace spoken commands and queries. An EUT satellite image (bottom right-hand screen in Fig. 3) provides an aerial view of the robot's environment. The lower left-hand screen (Fig. 3) shows a live robot's-eye view of the immediate environment, and a map of that environment appears on both the PDA and EUT (middle left-hand screen in Fig. 3). A text window (upper left-hand screen in Fig. 3) displays the human-robot dialog. Users can combine any of the modalities to interact with the robot, for example, by speaking while clicking on a location on the robot's map.

FIGURE 2
A researcher interacts with a mobile robot using natural language and gestures.



FIGURE 3
Multi-screen display of the end user terminal (EUT).

Commands or queries are linguistically parsed [4], and the resulting representation is correlated with gesture data, with knowledge of other participating agents, and with spatial information from the robot sensors. The result is then mapped to a robot command, which either produces the requested action or invokes a further interchange of information. Thus, humans and robots become cooperative and collaborative agents in completing a task.
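The mapping from a parsed utterance and a gesture to a robot command can be illustrated with a small sketch. It is not the interface's actual implementation: the names ParsedCommand, Gesture, and fuse, the dictionary layout of the result, and the follow-up question "Where?" are all assumptions made for illustration. The sketch shows only the basic idea that an utterance lacking a location is completed by a gesture when one is available and otherwise triggers a further exchange.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ParsedCommand:
    """Hypothetical output of the language parser."""
    action: str                          # e.g. "go"
    needs_location: bool                 # True for deictic utterances such as "go over there"
    named_target: Optional[str] = None   # e.g. "door" when the goal is named outright


@dataclass
class Gesture:
    """Hypothetical gesture reading: the pointed-at location (x, y) in metres."""
    location: Tuple[float, float]


def fuse(command: ParsedCommand, gesture: Optional[Gesture]) -> dict:
    """Combine a parsed utterance with an optional gesture.

    Returns either a robot command or a follow-up query (both as plain dicts).
    """
    if command.named_target is not None:
        # The utterance is complete on its own; no gesture is required.
        return {"type": "command", "action": command.action,
                "target": command.named_target}
    if command.needs_location:
        if gesture is None:
            # Incomplete utterance and no gesture: ask for more information.
            return {"type": "query", "text": "Where?"}
        return {"type": "command", "action": command.action,
                "target": gesture.location}
    # Commands such as "stop" need no target at all.
    return {"type": "command", "action": command.action, "target": None}


# "Go over there" accompanied by a pointing gesture.
print(fuse(ParsedCommand("go", needs_location=True), Gesture((2.0, 1.5))))
# "Go over there" with no gesture produces a follow-up query instead.
print(fuse(ParsedCommand("go", needs_location=True), None))
```

A fuller version would also consult knowledge of other agents and the spatial model, as described above; those inputs are omitted here to keep the sketch short.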

The spatial reasoning component [3] clusters the sonar data to define discrete objects. Objects can be named for easy reference, and spatial relations, such as left of and behind, are derived for use in further interactions.
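As a rough illustration of these two steps, the sketch below groups range points into objects and labels each object with a coarse spatial relation relative to the robot. It is a simplified stand-in, not the method of the spatial reasoning component: the gap-based clustering, the 0.5 m threshold, the four 90-degree sectors, and all function names are assumptions chosen for brevity.

```python
import math
from typing import List, Tuple

# (x, y) in the robot frame, in metres: x points forward, y points left.
Point = Tuple[float, float]


def cluster_points(points: List[Point], max_gap: float = 0.5) -> List[List[Point]]:
    """Group consecutive range points into objects.

    A point farther than max_gap from the previous point starts a new object.
    Points are assumed to arrive in scan order.
    """
    clusters: List[List[Point]] = []
    for p in points:
        if clusters and math.dist(clusters[-1][-1], p) <= max_gap:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


def centroid(cluster: List[Point]) -> Point:
    """Mean position of the points in one cluster."""
    xs, ys = zip(*cluster)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def relation_to_robot(obj: Point) -> str:
    """Coarse four-way label based on the bearing of the object's centroid."""
    bearing = math.degrees(math.atan2(obj[1], obj[0]))  # 0 = ahead, +90 = left
    if -45 <= bearing <= 45:
        return "in front of"
    if 45 < bearing < 135:
        return "left of"
    if -135 < bearing < -45:
        return "right of"
    return "behind"


# Two groups of readings: one ahead and to the left, one directly behind.
readings = [(1.0, 2.0), (1.2, 2.1), (-2.0, 0.0), (-2.1, 0.2)]
for i, cluster in enumerate(cluster_points(readings), start=1):
    print(f"object {i} is {relation_to_robot(centroid(cluster))} the robot")
# prints:
# object 1 is left of the robot
# object 2 is behind the robot
```

Labels such as "object 1" play the role of the names mentioned above; a user could replace them with more meaningful names for later reference.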

Finally, human-robot interaction is facilitated by shared cognitive models of behavior. Humans communicate, cooperate, and collaborate because they share these models. Using ACT-R, a cognitive architecture for simulating and understanding human cognition and behavior [1], the robots can reason about spatial relations and objects, and behave in ways analogous to humans. With a similar model of behavior, humans and robots can interact and communicate more effectively and efficiently.

Thus, the robot can understand complex navigational commands, such as "Go between the two buildings on your left and hide on the northwest corner behind the storage container." Not only must the robot understand what the various objects in these utterances are, but it must also be able to identify significant locations on or near those objects. With this information, it can then perform an action, such as hiding, which involves a complex set of heuristics.

Conclusions: We are concentrating on two research areas to facilitate cooperation and collaboration in human-robot interaction. The first area is the design and implementation of a multimodal interface. A natural, intuitive multimodal interface lets users concentrate on the task, not on the modes of interaction. The second research area is the use of computational models of human cognition to facilitate spatial reasoning in robots that share information about the environment, objects, and locations with humans and with each other. By incorporating human cognitive models, we enable collaborative and cooperative interactions that enhance dynamic autonomy in robots.

[Sponsored by ONR and DARPA]

References

[1] J.R. Anderson and C. Lebiere, The Atomic Components of Thought (Lawrence Erlbaum, Mahwah, NJ, 1998).
[2] D. Perzanowski et al., "Communicating with Teams of Cooperative Robots," in Multi-Robot Systems: From Swarms to Intelligent Automata, A.C. Schultz and L.E. Parker, eds. (Kluwer, The Netherlands, 2002), pp. 185-193.
[3] M. Skubic, D. Perzanowski, A. Schultz, and W. Adams, "Using Spatial Language in a Human-Robot Dialog," in Proceedings of the IEEE 2002 International Conference on Robotics and Automation, Washington, DC, 2002, pp. 4143-4148.
[4] K. Wauchope, "Eucalyptus: Integrating Natural Language Input with a Graphical User Interface," NRL/FR/5510-94-9711, Naval Research Laboratory, Washington, DC.