Bradley A. Singletary and Thad E. Starner
College Of Computing, Georgia Institute of Technology, Atlanta, GA 30332
{bas,thad}@cc.gatech.edu
Our experimental system proved approximately 90% accurate when tested on wearable video data captured at a professional conference. Over 300 individuals were captured during social engagement, and the data were separated into independent training and test sets. A metric for balancing the performance of face detection, localization, and recognition in the context of a wearable interface is discussed.
Recognizing social engagement with a user's wearable computer provides context data that can be useful in determining when the user is interruptible. In addition, social engagement detection may be incorporated into a user interface to improve the quality of mobile face recognition software. For example, the user may cue the face recognition system in a socially graceful way by turning slightly away and then toward a speaker when conditions for recognition are favorable.
One way to implement such a system would be to place cameras in every environment in which a user may meet new acquaintances. This method is untenable from a cost perspective. Since wearables are self-contained, a face recognizer implemented on a wearable could function in environments devoid of specialized infrastructure. In this manner, the face recognition resource would always be available to the wearer. Intelligent interface agents implemented on top of this system can then provide the face-name associations suggested earlier [32,20,6]. Rhodes accurately labels such systems just-in-time information retrieval agents [24].
Effective wearable interfaces apply what they understand about the wearer's context (directly or indirectly provided by the wearer) to some problem to be solved for the user [31]. These systems must balance computation against human burden. For example, if the wearable computer interrupts its wearer during a social interaction (e.g. to alert him to a wireless telephone call), the conversation may be disrupted by the intrusion. Detection of social engagement allows for blocking or delaying interruptions appropriately during a conversation.
Hall [8] defines ``near social interaction'' as occurring at four to seven feet of separation between the participants. To segment casual social interaction visually, we identify social engagement - the first stage of social intercourse, in which one or both parties signal a desire to communicate through verbal or non-verbal behaviors - as the start of conversation. Proxemics, non-verbal communication, and other social interplay are outside the scope of this paper, but we refer the reader to Hall [8] and Harrison [9] for such topics. To identify social engagement visually from the first-person perspective, we wish to use features endemic to engagement. For example, eye fixation, patterns of change in head orientation, social conversational distance, and change in visual spatial content may be relevant [30,23]. We are as yet uncertain which features are required for recognition, so we induce a set of behaviors that assist the computer with face recognition. Specifically, the wearer aligns x's on a head-up display with the eyes of the subject to be recognized. As we learn more about the applicability of our method from our sample data set, we will extend our recognition algorithms to include non-induced behaviors.
When a conversant is socially engaged with the user, a weak constraint may be exploited for face recognition and detection. Specifically, search over scale and orientation may be limited to that typical of near social interaction distances. Thus, a method for determining these types of social interactions may be profitable. Example works on visual modeling of human interaction include hidden Markov models (HMMs), coupled HMMs (CHMMs) [19], and stochastic grammars [12]. These works were primarily conducted from the third-person perspective of surveillance but serve as a model for our first-person work. HMMs with stochastic grammars were used by Moore [17] to model complex actions. Furthermore, HMMs have been successful in recognition of American Sign Language gestures [33] and location recovery [34].
A similar problem exists in ethologically inspired robotics. In Breazeal et al. [1,4], the authors code high-level knowledge of social behavior into robots as part of a four-level hybrid architecture. The knowledge of social constraints enhances situational awareness. Pre-attentive, visually-attentive, and post-attentive processing of video obtained from the robot's 'eyes' are applied in succession, each stage constraining or refining its successor to a smaller, more salient subset of visual information.
Mann [14] and Starner [32] describe manual alignment of target faces with calibration marks overlaid by a head-mounted display. Users must explicitly request recognition after aligning the face, and a likelihood-sorted list of candidates is then presented to the wearer for selection. Neither paper quantifies the performance of either human or computer at detection or recognition. If detection can be done automatically, there is less load on the human; conversely, if the processor or detection algorithm is weak, hand-aligned recognition may be desirable. Brzezowski, Dunn, and Vetter [5] present a mobile system for military and police use in identifying criminals, combining commercial face recognition software with mobile-wireless variable-bandwidth infrastructure. Its limitations are numerous, but it notably failed in the varied lighting conditions common to mobile usage. Finally, Iordanoglou et al. [11] describe a method for wearable face recognition, but it was not prototyped or tested on data acquired from a wearable computer. While their system does not address detection, they discuss algorithm performance under bandwidth limitations, a problem central to mobile computing.
While there are many face detection, localization, and recognition algorithms in the literature that were considered as potential solutions to our problem, our task is to recognize social engagement in the context of human behavior and the environment. Face presence may be one of the most important features, but it is not the only feature useful for segmenting engagement. Generally, face detection consists of a search across scale and within some tolerance of in-plane or out-of-plane rotations [7,25,29,35,13]. Prior work by Nefian [18] modeled both face recognition and face detection using embedded HMMs, demonstrating the feasibility of HMMs for these tasks. However, search across scale was performed and no background or noise models were used. Unfortunately, classic detection is usually under-constrained and over-optimistic about background content. In an examination of 10 standard face databases ( images) [21,2,29,28,25,16,10,35,27,15,22], we found that background contents had little variation. By comparison, scenes obtained from a body-worn camera in everyday life contained highly varied backgrounds. Furthermore, current general-purpose detectors are very compute-intensive due to searching across scale. Though a low false positive rate is relatively important for interface reasons, the computational price without specialized hardware is generally unacceptable. Expensive algorithms should only be computed if there is a reasonable chance a face exists or if fine-grained localization is required. We detail comparison metrics below for deciding which methods better satisfy real-time interface requirements.
We assembled a prototype wearable camera system to acquire the necessary preliminary test data (see Figure 2). The apparatus consists of: a color camera, an infrared (IR) sensitive black-and-white camera, a low-power IR illuminator, two digital video (DV) recorder decks, one video character generator, one audio tone generator, and four lithium-ion camcorder batteries. The DV deck, character generator, tone generator, camera DSP unit, and battery/power system are housed in a camera vest. The cameras, head-mounted display, and infrared illuminator are mounted on a plastic helmet for increased stability and precision of capture. Using infrared-sensitive cameras with infrared-emitting illuminators allows for nighttime capture or semi-covert operation. However, data captured at the conference did not make use of the illuminator, as sunlight and incandescent lighting provided more than enough IR radiation for capture.
The output of one camera is split with a low-power video distribution amplifier and displayed in one eye of the head-mounted display. The signal is annotated with two 'x' characters, spaced and centered horizontally and placed one third of the way from the top of the video frame (Figure 2). The other copy of the signal is saved to DV tape. To capture face data, the wearer of the vest approaches a subject and aligns the person's eyes with the two 'x' characters, which represent known locations for a subject's eyes to appear in the video feed. The marks and lens focus are ideally calibrated to be appropriate for footage taken at normal conversational distances from the subject. Once the marks are aligned, the wearer pushes a button that injects an easily detected tone into the DV deck's audio channel for later recovery. The audio tones serve as ground-truth markers for training purposes.
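As a rough illustration of this overlay geometry, the following Python sketch draws two 'x' marks centered horizontally, one third of the way down a 720x480 frame. The use of OpenCV and the horizontal spacing between the marks are our assumptions; the paper does not specify pixel coordinates or an implementation.

import cv2
import numpy as np

FRAME_W, FRAME_H = 720, 480      # DV frame dimensions used in the paper
MARK_Y = FRAME_H // 3            # one third of the way down the frame
MARK_SPACING = 120               # gap between the two marks (assumed, not from the paper)

def annotate_frame(frame):
    """Return a copy of the frame with the two 'x' eye-alignment marks drawn on it."""
    out = frame.copy()
    cx = FRAME_W // 2
    for x in (cx - MARK_SPACING // 2, cx + MARK_SPACING // 2):
        cv2.putText(out, "x", (x, MARK_Y), cv2.FONT_HERSHEY_SIMPLEX,
                    1.0, (255, 255, 255), 2)
    return out

if __name__ == "__main__":
    # Demonstrate on a blank frame; in the real system the overlay is applied to live video.
    blank = np.zeros((FRAME_H, FRAME_W, 3), dtype=np.uint8)
    cv2.imwrite("annotated.png", annotate_frame(blank))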
Since the wearer lined up the two x's with the eyes of a viewed subject, any face present could safely be assumed to lie within a 360x360 subregion of the 720x480 DV frame at the annotated locations in the video. Faces present at engagement were large with respect to this subregion. We first convert the subregion to grey-scale, deinterlace it, and correct for the non-squareness of the image pixels. We then use Gaussian sub-sampling to reduce the images to 22x22 pixels. Each feature vector therefore consists of 484 elements.
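A minimal sketch of this preprocessing chain, assuming OpenCV and NumPy; the particular deinterlacing (keeping one field) and pixel-aspect handling shown here are our assumptions, since the paper names the operations but not their implementation.

import cv2
import numpy as np

def preprocess(frame_bgr, roi_x, roi_y):
    """Reduce a 360x360 subregion of a 720x480 DV frame to a 484-element feature vector."""
    # Crop the 360x360 subregion surrounding the annotated eye locations.
    roi = frame_bgr[roi_y:roi_y + 360, roi_x:roi_x + 360]
    # Convert to grey-scale.
    grey = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    # Crude deinterlace: keep one field (every other scan line), then resize back to
    # 360x360; any correction for DV's non-square pixels is folded into this resize
    # for simplicity (the authors' exact handling is not specified).
    grey = cv2.resize(grey[::2, :], (360, 360), interpolation=cv2.INTER_LINEAR)
    # Gaussian sub-sampling to 22x22: low-pass filter, then down-sample.
    small = cv2.GaussianBlur(grey, (15, 15), 4.0)
    small = cv2.resize(small, (22, 22), interpolation=cv2.INTER_AREA)
    # Flatten into the 484-element observation vector fed to the HMMs.
    return small.astype(np.float64).ravel()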
We model the face class with a 3-state left-right HMM, as shown in Figure 3. The other class was much more complex to model and required a 6-state ergodic model to capture the interplay of 'garbage' scene types, as shown in Figure 4. We plot the mean values of the state output probabilities. The presence of a face seems important for acceptance by the engagement model. The first state contains a rough face-like blob and is followed by a confused state that likely represents the alignment portion of our gesture. The final state is clearly face-like, with much sharper features than the first state, and would be consistent with conversational engagement. Looking at the other-class model, we see images that look like horizons and very dark or light scenes. The complexity of the model allowed wider variations in scene without loss in accuracy. It is clear that different environments and viewpoints would yield different model structures. Thus, user- and location-specific models can likely be derived or adapted to improve the general detection strategy. For single-user wearables, learning only location-dependent models may be sufficient. For a reference on visual modeling of location, see Rungsarityotin [26].
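The two topologies can be sketched as follows; hmmlearn and the Gaussian output densities are our choices for illustration, and only the transition structure (left-right versus ergodic) is taken from the description above.

import numpy as np
from hmmlearn.hmm import GaussianHMM

def engagement_hmm():
    """3-state left-right HMM for the face/engagement class."""
    model = GaussianHMM(n_components=3, covariance_type="diag",
                        init_params="mc", params="stmc")
    model.startprob_ = np.array([1.0, 0.0, 0.0])
    # Left-right: each state either stays or advances one state to the right.
    model.transmat_ = np.array([[0.5, 0.5, 0.0],
                                [0.0, 0.5, 0.5],
                                [0.0, 0.0, 1.0]])
    return model

def other_hmm():
    """6-state ergodic HMM for the 'other' (non-engagement) class."""
    model = GaussianHMM(n_components=6, covariance_type="diag",
                        init_params="mc", params="stmc")
    model.startprob_ = np.full(6, 1.0 / 6.0)
    # Ergodic: every state may transition to every other state.
    model.transmat_ = np.full((6, 6), 1.0 / 6.0)
    return model

# After fitting each model on sequences of 484-element vectors from its class,
# a test sequence is labeled by whichever model assigns the higher log-likelihood:
#   label = "engagement" if face_model.score(seq) > other_model.score(seq) else "other"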
Accuracy results are shown in Table 1. Confusion matrices are given in Table 2 and Table 3.
Table 1: classification accuracy
experiment         | training set | independent test
22x22 video stream | 89.71%       | 90.10%

Table 2: training-set confusion matrix (N=843)
           | engagement  | other
engagement | 82.1% (128) | 17.9% (28)
other      | 8.6% (63)   | 91.3% (665)

Table 3: independent-test confusion matrix (N=411)
           | engagement  | other
engagement | 83.3% (50)  | 16.7% (10)
other      | 8.7% (30)   | 91.3% (314)
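For example, the raw counts in Table 3 reproduce the overall independent-test accuracy of Table 1: (50 + 314) / (50 + 10 + 30 + 314) = 364/404, or about 90.1%.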
Given the detector's false positive rate, we can solve for the maximum allowable time for the localization and recognition process as compared to the detection process.
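One plausible form of this budget (the notation here is ours): assume detection runs on every frame at cost $t_d$, that a fraction $f_p$ of frames falsely trigger the localization and recognition stage of cost $t_r$, and that the pipeline must keep up with the frame period $T$. Then

\[ t_d + f_p\,t_r \le T \quad\Longrightarrow\quad t_r \le \frac{T - t_d}{f_p}. \]

Under these assumptions, the 8.7% false positive rate of Table 3 would let the localization and recognition stage cost roughly ten times the per-frame slack that remains after detection.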
We are beginning to model other modalities of engagement behavior. A detection failure in one modality may be compensated for by additional sensors on the user. For example, Selker [30] discusses an eye fixation detector; eye fixation may help indicate social engagement, since two parties meeting for the first time will usually look to see whom they are meeting. Sound may provide another modality with which to detect social engagement; for instance, personal utterances like ``hello, my name is ...'' are common during social engagement. A simple range sensor using sonar or pulsed IR could be mounted on the camera to determine the presence of objects within near social interaction distances and used as a trigger for activating body-worn cameras. Finally, we have constructed, but not yet integrated, a vision-based walking/not-walking classifier. Detection of head stillness and other interest indicators will likely reduce false positives in our system [23].
We are considering several applications for this technology. Face recognition on a wearable platform could aid and protect military and law enforcement officers in the field by giving personnel the ability to compare viewed subjects against records of wanted criminals. Such a tool would ideally augment wanted posters and visual comparison, reduce human error, reduce time to capture, and reduce legal costs. To directly aid border guards, sentries, and patrol officers, such a system should be configured to function in daylight or at night. People who suffer from prosopagnosia (face blindness) could realize a medical benefit, regaining the ability to learn and recognize faces with the system's assistance. More generally, such a system may be useful to anyone who needs to associate large numbers of names with faces; for example, salespersons and politicians could readily recall a person's name and any previous salient interactions. As a final application, we are considering the creation of an attention manager to protect the wearer from stimulus overload. Detecting conversational context is key to handling distractors, such as cellular phone calls, in an intelligent fashion.