Object and Event Recognition for Aerial Surveillance
Yi Li, Indriyati Atmosukarto, Masaharu Kobashi, Jenny Yuen and Linda G. Shapiro
University of Washington, Department of Computer Science and Engineering, Box 352350,
Seattle, WA 98195-2350, U.S.A.
ABSTRACT
Unmanned aerial vehicles with high-quality video cameras are able to provide videos from 50,000 feet up that show a surprising amount of detail on the ground. These videos are difficult to analyze, because the airplane moves, the camera zooms in and out and vibrates, and the moving objects of interest can be in the scene, out of the scene, or partly occluded. Recognizing both the moving and static objects is important in order to find events of interest to human analysts. In this paper, we describe our approach to object and event recognition using multiple stages of classification.
Keywords: object recognition, event recognition, machine learning, aerial surveillance
1. INTRODUCTION
Unmanned aerial vehicles (UAVs) are able to provide large amounts of video data over terrain of interest to defense and intelligence agencies. These agencies are looking for significant events that may be of importance for their missions. However, most of this footage will be eventless and therefore of no interest to the analysts responsible for checking it. If a computer system could scan the videos for potential events of interest, it would greatly lessen the work of the analysts, allowing them to focus on the events of possible importance.
Several different processes are needed for the computer analysis of aerial videos. First, the static objects in the video frames must be recognized to determine the context of the events. Static objects might include forests, fields, roads, runways, and buildings, among others. Next, the moving objects in the video must be detected, tracked, and identified. Moving objects include vehicles (cars, trucks, tanks, and buses) and people. Given the static objects and moving objects in a set of frames, events are defined by the actions of the moving objects and their interactions with the static objects. For example, two cars might pull off a road and stop together in a field.
People might get out of the cars and approach each other for a meeting. A caravan of trucks might travel in one direction on a dirt road for a period of time and then make a U-turn and proceed in the opposite direction.
A vehicle might pull up to a building and disappear into an underground garage or tunnel, then reappear some time later. In all of these cases, both the moving objects and the static objects must be recognized and their interactions noted.
We are developing a system for object and event recognition for this purpose. In this paper we describe the structure of our system and give brief overviews of the underlying algorithms.
2. SYSTEM OVERVIEW
In order to recognize events in a video, the moving objects across a sequence of frames and the static objects in each of these frames must be detected and recognized. Then, using these moving and static objects as primitives, simple events can be defined in terms of the relationships among the moving objects and between the moving objects and the static ones. Finally, more complex events can be defined as sequences of simple events. For example, simple events such as a vehicle appearing on a road, moving forward for a short distance, and disappearing behind a tree would together constitute a complex event, as shown in the first row of Figure 1. Three other complex events (a convoy of cars making a U-turn, a car overtaking another car, and a truck making a turn and passing by a line of cars in the opposite direction) are also shown in Figure 1.
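To make the simple-event/complex-event distinction concrete, the following is a minimal Python sketch of one possible encoding; the predicate names and the subsequence-matching scheme are our own illustrative assumptions, not the system's actual event language.

```python
from typing import Callable, List

# A simple event is a predicate over one frame's analysis (recognized objects,
# tracks, and relationships); a complex event is an ordered list of simple
# events that must fire in temporal order, with gaps allowed between them.
SimpleEvent = Callable[[object], bool]

def matches_complex_event(frame_analyses: List[object],
                          sequence: List[SimpleEvent]) -> bool:
    """Return True if the simple events occur in order across the frames."""
    i = 0
    for analysis in frame_analyses:
        if i < len(sequence) and sequence[i](analysis):
            i += 1
    return i == len(sequence)

# Hypothetical usage for the first row of Figure 1:
# matches_complex_event(analyses,
#     [appears_on_road, moves_forward, disappears_behind_tree])
```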
Further author information: Send correspondence to Linda Shapiro, E-mail: firstname.lastname@example.org, Telephone:
1 206 543 2196; Yi Li is now with Vidient Systems, Inc., Sunnyvale, CA, U.S.A.
Figure 1. Examples of events. (Row 1) Vehicle disappears behind a tree, (Row 2) Cars making a U-turn, (Row 3) Car overtaking another car, (Row 4) Truck makes a turn and passes by cars moving in the opposite direction.
Figure 2 shows the architecture of our system. The system receives a video sequence as its input. The static feature extraction module extracts region features, such as color regions, texture regions, and structure regions, from the static objects in the video, while the dynamic feature extraction module extracts features from the objects that are moving in the video. The object recognition module uses the features extracted by both the static and dynamic feature extraction modules to label the objects in the frames, while the object tracking module tracks the moving objects from frame to frame. Relationships between objects over time, such as the relative positions of objects within a frame, are computed by the object relationship extraction module. The results of all three modules (object recognition, object tracking, and object relationship extraction) are used by the event recognition module to recognize the events happening in the video and output the results.
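As a rough illustration of how these modules fit together, the following Python sketch wires the stages of Figure 2 into a pipeline; all names are our own placeholders, and the per-frame control flow is a simplification.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FrameAnalysis:
    static_objects: List[object] = field(default_factory=list)
    moving_objects: List[object] = field(default_factory=list)
    relationships: List[object] = field(default_factory=list)

def process_video(frames, static_extractor, dynamic_extractor,
                  recognizer, tracker, relation_extractor, event_recognizer):
    """Mirror of the Figure 2 architecture: static and dynamic feature
    extraction feed object recognition and tracking; their outputs, plus the
    extracted relationships, drive event recognition."""
    analyses = []
    for frame in frames:
        result = FrameAnalysis()
        result.static_objects = recognizer(static_extractor(frame))
        result.moving_objects = tracker(recognizer(dynamic_extractor(frame)))
        result.relationships = relation_extractor(result.static_objects,
                                                  result.moving_objects)
        analyses.append(result)
    return event_recognizer(analyses)
```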
3. STATIC OBJECT CLASS RECOGNITION
Our methodology for object recognition has three main parts:
1. Select a set of features that have multiple attributes for recognition and design a unified representation for them.
2. Develop methods for encoding complex features into feature vectors that can be used by general-purpose classifiers.
3. Design a learning procedure for automating the development of classifiers for new objects.
3.1. Abstract Region Representation
The unified representation we have designed is called the abstract region representation. The idea is that all features will be regions, each with its own set of attributes, but with a common representation. The regions we are using in our work are color regions, texture regions, and structure regions, defined as follows.
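As one possible realization of the common representation, an abstract region could be modeled as a small record; the field names below are our own assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AbstractRegion:
    feature_type: str               # "color", "texture", or "structure"
    pixels: List[Tuple[int, int]]   # image coordinates belonging to the region
    attributes: List[float]         # type-specific attribute vector, e.g. mean
                                    # CIELab color, Gabor responses, or
                                    # line-cluster statistics
```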
Color regions are produced by a two-step procedure. The first step is color clustering using a variant of the K-means algorithm on the original color images represented in the CIELab color space.1 The second step is an iterative merging procedure that merges multiple tiny regions into larger ones. Figure 3 illustrates this process on a football image in which the K-means algorithm produced hundreds of tiny regions for the multi-colored crowd, and the merging process merged them into a single region. Our texture regions come from a color-guided texture segmentation process. Color segmentation is first performed using the K-means algorithm. Next, pairs of regions are merged if, after a dilation, they overlap by more than 50%. Each of the merged regions is then segmented using the same clustering algorithm on the Gabor texture coefficients. Figure 4 illustrates the texture segmentation process.
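The following is a minimal sketch of the two-step color segmentation, assuming OpenCV and scikit-learn; the merging step here absorbs small connected regions into the large region with the closest mean color, which simplifies the paper's iterative merging procedure (in particular, it ignores region adjacency).

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def color_regions(bgr_image, k=8, min_size=200):
    """Two-step color segmentation sketch: K-means clustering in CIELab,
    then merging of tiny connected regions into larger ones."""
    h, w = bgr_image.shape[:2]
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    flat = lab.reshape(-1, 3).astype(np.float32)
    cluster = KMeans(n_clusters=k, n_init=4).fit_predict(flat).reshape(h, w)

    # Break each color cluster into spatially connected regions.
    region_map = np.full((h, w), -1, dtype=np.int32)
    means, next_id = [], 0
    for c in range(k):
        n, comp = cv2.connectedComponents((cluster == c).astype(np.uint8))
        for i in range(1, n):
            mask = comp == i
            region_map[mask] = next_id
            means.append(flat[mask.ravel()].mean(axis=0))
            next_id += 1

    # Merge regions smaller than min_size into the closest-colored big region.
    sizes = np.bincount(region_map.ravel(), minlength=next_id)
    big = [r for r in range(next_id) if sizes[r] >= min_size]
    for r in range(next_id):
        if sizes[r] < min_size and big:
            target = min(big, key=lambda b: np.linalg.norm(means[r] - means[b]))
            region_map[region_map == r] = target
    return region_map
```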
The features we use for recognizing man-made structures are called structure features and are obtained using the concept of a consistent line cluster.2 They are computed as follows (a simplified sketch appears after the list):
Figure 3. Color segmentation and merging (panels: Original, Color, Merged).
Figure 4. The texture segmentation is color-guided: it is performed on regions of the initial color segmentation.
1. Apply the Canny edge detector3 and ORT line detector4 to extract line segments from the image.
2. For each line segment, compute its orientation and its color pairs (pairs of colors for which the first is on one side and the second on the other side of the line segment).
3. Cluster the line segments according to their color pairs to obtain a set of color-consistent line clusters.
4. Within the color-consistent clusters, cluster the line segments according to their orientations to obtain a set of color-consistent, orientation-consistent line clusters.
5. Within the orientation-consistent clusters, cluster the line segments according to their positions in the image to obtain a final set of consistent line clusters.
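Assuming line segments and their color pairs have already been extracted (steps 1 and 2), a greatly simplified sketch of the three nested clustering steps might look like this; the greedy 1-D gap clustering, the tolerances, the use of x-midpoints for position, and the neglect of angle wraparound at 0/180 degrees are all our own simplifications.

```python
import numpy as np

def cluster_1d(values, idxs, tol):
    """Greedy 1-D clustering: sort the indices by value, then split wherever
    consecutive values differ by more than tol."""
    order = sorted(idxs, key=lambda i: values[i])
    groups, cur = [], [order[0]]
    for prev, nxt in zip(order, order[1:]):
        if values[nxt] - values[prev] > tol:
            groups.append(cur)
            cur = []
        cur.append(nxt)
    groups.append(cur)
    return groups

def consistent_line_clusters(segments, color_pairs, ang_tol=10.0, pos_tol=50.0):
    """segments: (N, 4) array of (x1, y1, x2, y2); color_pairs: quantized
    (left_color, right_color) pair per segment. Returns lists of segment
    indices, clustered by color pair, then orientation, then position."""
    ang = np.degrees(np.arctan2(segments[:, 3] - segments[:, 1],
                                segments[:, 2] - segments[:, 0])) % 180.0
    mid_x = (segments[:, 0] + segments[:, 2]) / 2.0
    by_color = {}
    for i, cp in enumerate(color_pairs):               # step 3: color pairs
        by_color.setdefault(cp, []).append(i)
    clusters = []
    for idxs in by_color.values():
        for og in cluster_1d(ang, idxs, ang_tol):      # step 4: orientation
            for pg in cluster_1d(mid_x, og, pos_tol):  # step 5: position
                clusters.append(pg)
    return clusters
```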
Figure 5 illustrates the abstract regions for several representative images. The first image is of a large campus building at the University of Washington. Regions such as the sky, the concrete, and the large brick section of the building show up as large homogeneous regions in both the color segmentation and the texture segmentation. The windowed part of the building breaks up into many regions in both the color and the texture segmentations, but it becomes a single region in the structure image. The structure-finder also captures a small amount of structure at the left side of the image. The second image, of a park, is segmented into several large regions in both color and texture. The green trees merge into the green grass on the right side in the color image, but the texture image separates them. No structure was found. In the last image, of a sailboat, both the color and texture segmentations provide some useful regions that will help to identify the sky, water, trees, and sailboat. The sailboat is captured in the structure region. It is clear that no one feature type alone is sufficient to identify the objects.
In our framework for object and concept class recognition, each image is represented by sets of abstract regions, and each set is related to a particular feature type. To learn the properties of a specific object, we must know which abstract regions correspond to it. Once we have the abstract regions from an object, we extract the common characteristics of those regions as the model of that object. Then, given a new region, we can compare it to the object models in our database to decide which object it belongs to. We designed the algorithm that learns correspondences between regions and objects in the training images to require only the list of objects in each training image. With such a solution, not only is the burden of constructing the training data largely relieved, but the principle of keeping the system open to new image features is also upheld.
Figure 5. The abstract regions constructed from a set of representative images using color clustering, color-guided texture clustering, and consistent-line-segment clustering (panels: Original, Color, Texture, Structure).
3.2. EM-Variant Approach to Object Classification
Our object recognition methodology uses whole images of abstract regions, rather than single regions, for classification. A key part of our approach is that we do not need to know where in each image the objects lie. We only utilize the fact that objects exist in an image, not where they are located. We have designed an EM-like procedure that learns multivariate Gaussian models for object classes based on the attributes of abstract regions from multiple segmentations of color photographic images.5 The objective of this algorithm is to produce a probability distribution for each of the object classes being learned. It uses the label information from training images to supervise the EM-like iterations.

In the initialization phase of the EM-variant approach, each object is modeled as a Gaussian component, and the weight of each component is set to the frequency of the corresponding object class in the training set. Each object model is initialized using the feature vectors of all the regions in all the training images that contain the particular object, even though there may be regions in those images that do not correspond to that object. From these initial estimates, which are full of errors, the procedure iteratively re-estimates the parameters to be learned. The iteration procedure is also supervised by the label information, so that a feature vector only contributes to those Gaussian components representing objects present in its training image. The resultant components represent the learned object classes and one background class that accumulates the information from feature vectors of other objects or noise. With the Gaussian components, the probability that an object class appears in a test image can be computed. The EM-variant algorithm is trained on a set of training images, each of which is labeled by the set of objects it contains. For each test image, it computes the probability of each of the object classes appearing in that image. Figure 6 shows some sample classifications using abstract regions with both color and texture properties.
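A compact sketch of such an EM-variant loop is given below, with diagonal covariances instead of the paper's full multivariate Gaussians and with every other numerical detail our own assumption; the supervision appears in the `active` list, which restricts each image's regions to the components of its labeled classes plus the background.

```python
import numpy as np

def gaussian_logpdf(X, mean, var):
    # Log-density of a diagonal-covariance Gaussian for each row of X.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var, axis=1)

def em_variant(images, labels, n_classes, n_iter=20):
    """images: list of (n_regions_i, d) arrays of region feature vectors.
    labels: list of sets of class indices present in each image.
    Component n_classes is the shared background class."""
    d = images[0].shape[1]
    K = n_classes + 1
    means, vars_ = np.zeros((K, d)), np.ones((K, d))

    # Initialization: each class model uses all regions from all images that
    # contain the class, even though some of those regions show other objects.
    for k in range(n_classes):
        feats = np.vstack([X for X, lab in zip(images, labels) if k in lab])
        means[k], vars_[k] = feats.mean(0), feats.var(0) + 1e-6
    all_feats = np.vstack(images)
    means[-1], vars_[-1] = all_feats.mean(0), all_feats.var(0) + 1e-6
    counts = np.array([sum(k in lab for lab in labels)
                       for k in range(n_classes)] + [len(labels)], dtype=float)
    weights = counts / counts.sum()    # component weight = class frequency

    for _ in range(n_iter):
        resp = np.zeros(K)
        mean_acc, sq_acc = np.zeros((K, d)), np.zeros((K, d))
        for X, lab in zip(images, labels):
            active = sorted(lab) + [n_classes]   # label supervision
            logp = np.stack([np.log(weights[k]) +
                             gaussian_logpdf(X, means[k], vars_[k])
                             for k in active], axis=1)
            r = np.exp(logp - logp.max(axis=1, keepdims=True))
            r /= r.sum(axis=1, keepdims=True)
            for j, k in enumerate(active):
                resp[k] += r[:, j].sum()
                mean_acc[k] += r[:, j] @ X
                sq_acc[k] += r[:, j] @ (X ** 2)
        resp = np.maximum(resp, 1e-9)
        means = mean_acc / resp[:, None]
        vars_ = sq_acc / resp[:, None] - means ** 2 + 1e-6
        weights = resp / resp.sum()
    return means, vars_, weights
```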
Figure 6. Classification results for grass and tree using the EM-variant approach with regions having both color and texture attributes.
3.3. Generative/Discriminative Approach to Object Classification
Our two-phase generative/discriminative learning approach addresses three goals6:
1. We want to handle object classes with more variance in appearance.
2. We want to be able to handle multiple features in a completely general way.
3. We wish to investigate the use of a discriminative classifier to add more power.
Phase 1, the generative phase, is a clustering step that can be implemented with the classical EM algorithm (unsupervised) or the EM variant (partially supervised). The clusters are represented by a multivariate Gaussian mixture model, and each Gaussian component represents a cluster of feature vectors that are likely to be found in the images containing a particular object class. Phase 1 also includes an aggregation step that has the effect of normalizing the description length of images, which can have an arbitrary number of regions. The aggregation step produces a fixed-length feature vector for each training image whose elements represent that image's contribution to each Gaussian component of each feature type.
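One plausible reading of the aggregation step is sketched below: each image's regions are softly assigned to the Gaussian components of their feature type, and the per-component responsibility mass, averaged over the regions, becomes one slice of the fixed-length vector. The normalization and the use of SciPy's Gaussian density are our own assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def aggregate_image(regions_by_type, models_by_type):
    """regions_by_type: {feature_type: (n_regions, d) array of feature vectors}.
    models_by_type: {feature_type: (means, vars_, weights)} from Phase 1.
    Returns one fixed-length vector regardless of the number of regions."""
    parts = []
    for ftype, X in regions_by_type.items():
        means, vars_, weights = models_by_type[ftype]
        K = len(weights)
        logp = np.stack([np.log(weights[k]) + np.atleast_1d(
                             multivariate_normal(means[k],
                                                 np.diag(vars_[k])).logpdf(X))
                         for k in range(K)], axis=1)
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        parts.append(r.sum(axis=0) / len(X))  # image's contribution per component
    return np.concatenate(parts)
```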
Phase 2, the discriminative phase, is a classification step that uses the feature vectors of Phase 1 to train a classifier to determine the probability that a given image contains a particular object class. It also generalizes to any number of different feature types in a seamless manner, making it both simple and powerful. We currently use neural net classifiers (multi-layer perceptrons) in Phase 2.
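In the same spirit, a minimal stand-in for the Phase 2 classifier could be an off-the-shelf multi-layer perceptron trained one-per-class on the aggregated vectors; the scikit-learn model, the synthetic data, and all variable names below are our own substitutions for the authors' neural nets.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-ins for the fixed-length Phase 1 vectors and the 0/1 labels
# ("does this image contain the object class?"); real data would replace these.
rng = np.random.default_rng(0)
train_vectors = rng.random((100, 24))
train_contains_class = (train_vectors[:, 0] > 0.5).astype(int)
test_vectors = rng.random((10, 24))

# One binary MLP per object class, trained on the aggregated vectors.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
clf.fit(train_vectors, train_contains_class)
p_present = clf.predict_proba(test_vectors)[:, 1]  # P(class appears in image)
```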