Note: this paper was not part of the FRAME workshop, but was included because there was space on the CDs and we felt it might be of interest to anyone who attended the workshop. The system is implemented in Linux and uses MMX for some of its low-level processing. You may find our Linux MMX example helpful in learning a little about how that is done.


Frame-Rate Omnidirectional Surveillance & Tracking 1

Terry Boult              tboult@eecs.lehigh.edu
R. Micheals, X. Gao, P. Lewis, C. Power, C. Qian, W. Yin and A. Erkan
VAST Lab, EECS Department Lehigh University

This web paper is loosely based on our paper in the 1999 IEEE workshop on Visual Surveillance.

Video surveillance is watching an area for significant events. Perimeter security generally requires watching areas that afford trespassers reasonable cover and concealment. Almost by definition such ``interesting'' areas have limited visibility distance. These situations call for a wide field of view, and are a natural application for omni-directional VSAM.

This paper describes an omni-directional tracking system. We first discuss some domain application constraints and background on the paracamera. We then go through the basic components of the frame-rate Lehigh Omni-directional Tracking System (LOTS) and describe some of its unique features. We end with a summary of an external evaluation of the system.

1  Background

There has been considerable work on tracking systems, e.g. [,,,,]. Our system draws ideas from these and other works. Our area of application, however, is a bit different from that of all but the last of the aforementioned papers. Our goal is to track targets in a perimeter-security setting, i.e. outdoor operation in moderate- to high-cover areas. This domain of application restricts the techniques that can be applied. Some of the constraints, and their implications for our system, include: Note that, except for the last, these are all generic problem constraints and are not dependent on the geometry of the paraimage. If a system can track under these constraints it can be used in many situations, not just omni-directional tracking in outdoor settings.

2  Paraimages

While other techniques might generate video in all directions, we consider the single-viewpoint constraint to be important. The Paracamera2 captures omni-directional video that has a single viewpoint. This allows us to generate geometrically correct perspective images in any viewing direction. Figure 1 shows an example of a raw paraimage and the generated perspective view.


 


urban.gif

Figure 1: Tracking system with a single perspective ``target'' window. (Left-right reversal because of the mirror.) Note that the figure is also a link to an MPEG movie of the tracker in action. (Quality was lost because of the need to keep the MPEG size down; this one is 18 MB and shows 4 perspective windows.)


While it would be possible to unwarp in multiple directions and then do tracking, it would add considerable expense. Therefore, we are working directly in the complex geometry of the paraimage.

Note that the ``spatial resolution'' of the paraimage is not uniform. While it may seem counterintuitive, the spatial resolution of the paraimage is greatest along the horizon, just where objects are most distant. While the process scales to any size imager, the current systems use 640x480 NTSC (or 756x568 PAL) cameras. If we image the whole hemisphere, the spatial resolution along the horizon is (240 pixels × 2π) / 360° ≈ 4.2 pixels/degree (5.1 PAL), i.e. 14.3 arc-minutes per pixel (11.8 PAL). If we zoom in on the mirror, cutting off a small part of it, to increase the captured mirror diameter to 640 pixels (756 PAL), we can achieve 5.6 pixels/degree (10.7 arc-minutes per pixel), or 6.6 pixels/degree (9.1 arc-minutes per pixel) for PAL.

As a point of comparison, let us consider a traditional ``wide-angle'' perspective camera. Allowing for a small overlap to continually track objects, it would take 3 cameras, each with about a 150° horizontal field-of-view (FOV), to watch the horizon. Note that each of these would have 640 pixels / 150° ≈ 4.3 pixels/degree, i.e. about the same as the Paracamera. Clearly, the traditional cameras would need more hardware and computation.

Every surveillance system must consider the tradeoff between resolution and FOV. The paracamera's unique design yields what may be a new Pareto-optimal design choice in the resolution/FOV trade-off. We have the horizontal resolution of a 150° camera but cover the full 360° of the horizon.

With a wide field of view, objects to be tracked will cover only a small number of pixels. With 4.2 pixels per degree, a target of dimension 0.5m by 2.0m at 50m is approximately 2 pixels by 8 pixels, i.e. 16 pixels per person. At 30m, this yields approximately 32 pixels per person, presuming ideal imaging. Realistic tracking in such a wide field of view requires processing the full-resolution image with a sensitive yet robust algorithm.
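As a rough check on these figures, the short C sketch below works through the arithmetic for the NTSC case. The 240-pixel mirror radius, the 0.5m by 2.0m target and the 50m range come from the text above; using the horizontal (horizon) resolution for both target dimensions is a simplification, and the program is purely illustrative.

    /* Back-of-the-envelope paracamera resolution figures (NTSC case). */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double radius_pix     = 240.0;                            /* mirror radius in pixels, full hemisphere */
        double pix_per_deg    = 2.0 * M_PI * radius_pix / 360.0;  /* ~4.2 pixels/degree along the horizon     */
        double arcmin_per_pix = 60.0 / pix_per_deg;               /* ~14.3 arc-minutes per pixel              */

        /* Angular size of a 0.5m x 2.0m person at 50m (small-angle approximation),
           using the horizontal resolution for both dimensions as the text does. */
        double range_m = 50.0;
        double w_pix = (0.5 / range_m) * (180.0 / M_PI) * pix_per_deg;  /* ~2 pixels wide                     */
        double h_pix = (2.0 / range_m) * (180.0 / M_PI) * pix_per_deg;  /* ~10 pixels tall (text rounds to 8) */

        printf("%.1f pix/deg, %.1f arcmin/pix, target ~%.0f x %.0f pixels at %gm\n",
               pix_per_deg, arcmin_per_pix, w_pix, h_pix, range_m);
        return 0;
    }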

3  LOTS: The Lehigh Omnidirectional Tracking System

We discuss the main components of the LOTS system, interjecting the rationale for many of the design decisions. Most of the design choices were informally tested empirically using a mixture of data sets and were often compared to alternatives that will not be discussed. We briefly cover some of the unique aspects of the algorithms and the techniques that allow full-resolution processing at 30fps on standard PC hardware. The algorithms could be applied (with some minor changes) to regular perspective images. A system diagram is shown in figure 2.

Our tracker runs under Linux using MMX-enabled processors. The code described herein runs at 30fps on a 266 MHz K6 with 32MB of memory and a PCI frame-grabber. We have demonstrated a smaller system based on a 166 MHz MMX processor in a Compact-PCI housing (12''x5''x5'') that tracks at 15fps. We are also porting the tracker to our augmented Remote Reality ``wearable'' (a low-power 133 MHz MMX-based system), see [].

While the system captures 16-bit YUV data, we currently use only the Y channel, to allow the same system to be used with monochrome night-vision devices. Color would enhance tracking in urban scenes, but adds little for camouflaged targets in the woods and fields.


 


tracker-flow.gif

Figure 2: Main modules and data flow for LOTS System.

3.1  Background Modeling

Like many systems, our processing starts with change detection based on subtraction of a ``background'' image, B. Because a stationary omnidirectional camera does not need to pan and tilt to cover a large viewing area, it has opportunities for developing strong background models. Our ``background subtraction'' has three distinctive features: its adaptation speed, its background modeling, and its thresholding method.

Most background-subtraction-based systems use temporal integration to adapt to changing lighting, i.e. B_{t+1} = α I_t + (1-α) B_t, where α is the blending parameter, B_t is the background image at time t and I_t is the image at time t. Some systems also utilize it to provide a streaking effect or ``motion history'', which can increase connectivity and apparent size for fast-moving targets.

However, because of the very gradual image change inherent in our targets' slow speed and small size, we use a very slow temporal integration. The system supports pixel updates with effective integration (blending) factors from α = 0.25 (our fastest integration) down to α = 0.0000610351, with α = 0.0078125 as the default value. Using such small fractions is, of course, numerically unstable, especially when using byte images. For the sake of both speed and numerical accuracy the system does not update the background images every frame; rather, it reduces the rate at which the background is updated such that the multiplicative blending factor is at least 1/32. For example, an effective integration factor of 0.000061 is achieved by adding in 1/32 of the new frame every 512 frames.

Given a target that differs from a mid-gray background by 32, and a ``threshold'' of 16, it takes between 2 and 4096 frames (1/15 of a second to 2+ minutes) for the target to become part of the background. To further reduce target pixels blending into the background, such pixels are updated even less frequently, generally 0.25 times as often. The result is that moderate-contrast targets, once detected, are ``tracked'' for between 1 second and 8 minutes, and low-contrast targets generally last a minute or two.
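The following is a minimal C sketch of how such slow effective blending factors can be realized with byte images: the blend step is held at 1/32 (the minimum multiplicative factor mentioned above) and the update is simply skipped on most frames. The function and variable names are ours, not the LOTS source, and the mapping of the default α to ``1/32 every 4 frames'' is our inference from the rule stated above.

    #include <stddef.h>
    #include <stdint.h>

    /* Blend 1/32 of the current frame into the background, but only once every
     * `period' frames, giving an effective per-frame blending factor of
     * (1/32)/period.  E.g. period = 512 gives alpha ~= 0.000061, and period = 4
     * would give the default alpha = 0.0078125. */
    void update_background(uint8_t *bg, const uint8_t *img, size_t npix,
                           unsigned long frame, unsigned long period)
    {
        if (period == 0 || frame % period != 0)
            return;                                    /* skip most frames entirely */

        for (size_t i = 0; i < npix; ++i) {
            int diff = (int)img[i] - (int)bg[i];
            /* bg += round(diff / 32); byte arithmetic stays well behaved at this step size */
            int step = (diff >= 0) ? (diff + 16) / 32 : -((16 - diff) / 32);
            bg[i] = (uint8_t)(bg[i] + step);
        }
    }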

Our approach to adaptation gives the detection system an important asymmetric behavior: it is very quick to acquire targets, but once they are detected, it is very slow to give them up. This is very important for low-contrast, slow-moving targets, e.g. the sniper in figure 3. The downside of this approach is that some false alarms tend to persist, and when objects that have been stationary for a long time depart, they leave behind long-lasting ghosts. How these issues are handled will be discussed later.


 


sniper.gif

Figure 3: Tracking a sniper moving in the grass. The camouflage is quite good, but careful background differencing shows the location. Frame-to-frame motion is small; a good sniper may crawl at under 0.5 meters per minute and be motionless for minutes at a time. Note that the figure is also a link to an MPEG movie of the tracker in action. (Quality was lost because of the need to keep the MPEG size down; this one is 4 MB.)


Those with experience in these types of systems might be wondering how this can really work. While even the slowest of the above update rates is sufficiently fast for gradual lighting changes such as normal sun motion, it will not adapt fast enough to handle rapid lighting changes, e.g. the sun going behind a cloud. This observation is correct; to handle these other situations, explicit ``lighting change heuristics'' are applied at a later stage (after connected-component labeling). These heuristics will be described later.

The second significant feature of our background technique is that there is not a single background model, but 3 different background models, i.e. pixels can have 3 different ``backgrounds''. This is a significant advantage for ignoring real but contextually insignificant motion such as moving trees and brush. When the trees move they occlude/disocclude the scene behind them, and the system ends up building models for both backgrounds. The test against the secondary background is done on a per-pixel basis but adds very little cost because it is only consulted when the pixel is significantly above the primary background, which is infrequent. The disadvantages of these two background models are the additional memory requirement, and the loss of sensitivity if the secondary background is updated poorly.
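A sketch of the per-pixel test just described is given below, assuming one byte per background model and a per-pixel variation value; the structure layout and names are illustrative, not the actual LOTS data structures. The key point is that the secondary comparison only happens in the rare case where the pixel already fails the primary test.

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        uint8_t primary;    /* slowly blended primary background           */
        uint8_t secondary;  /* alternate background (e.g. behind foliage)  */
        uint8_t variation;  /* per-pixel variability estimate              */
    } BgPixel;

    /* Returns nonzero if the pixel should be treated as "changed". */
    static int pixel_changed(uint8_t y, const BgPixel *bg, int global_thresh)
    {
        int low = global_thresh + bg->variation;      /* low threshold */
        if (abs((int)y - (int)bg->primary) <= low)
            return 0;                                 /* matches primary: cheap common case */
        if (abs((int)y - (int)bg->secondary) <= low)
            return 0;                                 /* matches the alternate background   */
        return 1;                                     /* differs from both models           */
    }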

Currently, we acquire the second background model by an initial batch learning step, with interactive supervised learning when false alarms occur. We are investigating more automatic methods, which involve feedback from the high-level tracking components of the system. If false alarms occur during processing, the user may request that particular regions update their secondary background model to prevent further false alarms from that target. We also point out that the secondary background generally does not update frequently enough to allow good ``variation'' estimates, and we do not maintain a secondary variation image. However, whenever the update value for a pixel is very different from its current value, we increase the pixel's variation and replace, rather than blend, the values. The latter helps when there are really 3 backgrounds that need to be modeled; at least we get 2 of them rather than one plus a mixture of the other two.

Both the primary and secondary backgrounds are updated via the blending model to allow them to track slow lighting changes. Unfortunately, the blending also means that targets leave ``ghosts'' and false alarms that are quite persistent. To help handle these problems the system has a third background image (called the old-image) that is not updated via a blending model. Rather, it is a pure copy of an old image, except that targets within that image are tagged with a value of zero.3 This old image is always between 9000 and 18000 frames (5-10 min) old, and for efficiency the implementation switches between two images. As described in section 3.4, comparison against this old-image is not done pixel by pixel, but rather is incorporated into the cleaning phase.

3.2  Thresholding

In addition to having multiple backgrounds, the system has multiple backgrounding thresholds. The first, a global threshold, handles camera gain noise and can be dynamically adjusted. The other is a per-pixel threshold which attempts to account for the inherent spatial variability of the scene intensity at the pixel, e.g. points near edges will have higher variability because small changes in the imaging situation can cause large intensity changes.


 


woods.gif

Figure 4: Tracking soldiers moving in the woods at Ft. Benning GA. Each box is on a moving target. The multi-background modeling and thresholding with hysteresis are important for ignoring the many moving trees in this scene and help connect the soldiers even when their camouflage is nearly the same intensity as the background. Note that the figure is also a link to an MPEG movie of the tracker in action. (Quality was lost because of the need to keep the MPEG size down; this one is 16 MB and shows 1 perspective window.)


The background subtraction uses a two-level thresholding-with-hysteresis, i.e., all pixels in a region must be above the low threshold and must be connected to at least one pixel above the high threshold. The low threshold is the sum of the global threshold and the per-pixel variation. The high threshold is currently set at 4 times the low threshold; if the variances were true variances and the noise were Gaussian, this would assure us that at least part of each target was, with > 99.9% confidence, not noise. The two-level thresholds are applied by initially computing the difference against the low threshold. If that is exceeded we test the high threshold, and set some high-order bits in the ``low-resolution'' image to be described in the next section.

To keep the subsequent processing fast, the thresholding process keeps pointers to the initial and final pixels, per row, that are above the low threshold. Rows with nothing above threshold (usually 80% or more of the image) are skipped in subsequent processing.
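A sketch of one row of this pass is given below: pixels are compared against the low threshold, the row extents are recorded so that empty rows can be skipped later, high-threshold hits set a marker bit, and counts are accumulated in the reduced-resolution ``parent'' image described in the next section. The names, the exact bit layout, and the precomputed difference and threshold arrays are illustrative assumptions, not the LOTS source.

    #include <stdint.h>

    #define HIGH_BIT 0x80   /* set in a parent pixel if any child exceeded the high threshold */

    void threshold_row(const uint8_t *diff,      /* |image - background| for this row              */
                       const uint8_t *low_thr,   /* per-pixel low threshold (global + variation)   */
                       int width, int row,
                       uint8_t *parent, int parent_stride,
                       int *row_first, int *row_last)
    {
        *row_first = width;  /* sentinel: nothing above threshold yet */
        *row_last  = -1;
        for (int x = 0; x < width; ++x) {
            int low = low_thr[x];
            if (diff[x] <= low)
                continue;                        /* below the low threshold: ignore          */
            if (*row_first == width) *row_first = x;
            *row_last = x;
            uint8_t *p = &parent[(row / 4) * parent_stride + (x / 4)];
            if ((*p & ~HIGH_BIT) < 16)           /* child count lives in the low bits (0..16) */
                ++*p;
            if (diff[x] > 4 * low)               /* high threshold = 4 x low threshold        */
                *p |= HIGH_BIT;
        }
    }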

Because we expect only a small growth in the number of pixels above threshold, the thresholding counts the number of pixels above threshold and checks this assumption. If it is violated, it is probably a rapid lighting change or a radical camera motion.4 The system attempts to handle radical growth by presuming it to be a rapid lighting change and temporarily increases both the global threshold and the allowed growth rate. This attempt to increase the thresholds is tried twice. If it is successful in reducing the growth, we proceed as usual except we force an update of non-target regions with a blending factor much larger than usual.

If after raising the threshold a few times we still detect that the number of pixels differing from the backgrounds is growing too fast then we presume the growth must be because of camera motion.5 To handle camera motion we:

The proper fix, doing image stabilization, is in progress.

3.3  Connected components & region definition

After thresholding, the system needs to find connected components. Several techniques keep this process fast. First, we only process the part of each row between the leftmost and rightmost pixels above threshold. The connectivity code also has special cases for when the entire previous row was empty. As is common, the connectivity analysis uses a union-find data structure to allow very fast association of regions. The usual union-find was modified to also support the area and temporal associations used by our approach. The final, and probably most interesting, speedup comes from a reduction in resolution.

While thresholding, we also build a lower resolution image of those pixels above threshold. Each pixel in this parent (low-resolution) image maintains a count of how many of its associated children (high-resolution pixels) were above threshold. Since resolution is reduced by a factor of 4 in each direction, the parent image contains values between 0 and 16, which allows us to have accurate low-level area counts for thresholding. For example, a region which connects 4 ``parent-level'' pixels might have a total area of support of only 12 pixels in the high-resolution image (and would not be connected at that level). The high-order bits of the low-resolution image are also set when a pixel is above the high threshold. As we add the areas we naturally find out if any pixel was also above the high threshold (we AND with a mask to make sure the counts do not overflow). In this way the thresholding-with-hysteresis is computed along the way in the connected-component labeling process, without the need for a second thresholding pass.

The connected component phase is only applied to the parent image. In addition to the speedup due to data reduction, this resolution reduction also has the effect of filling in many small gaps. The gap filling is spatially varying; the maximum distance between ``neighbors'' varies from 4 to 8 pixels. While not as ``uniform'' as morphological processing, it is considerably faster and has shown itself to be very effective.
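The labeling itself can be sketched with a small union-find operating on the parent image built during thresholding. Area (the accumulated child-pixel counts) and the high-threshold flag are carried along with each region, so the hysteresis and area tests fall out of the labeling pass. This sketch uses plain 4-connectivity and omits the spatially varying gap filling and the temporal labels of the real system; all names are illustrative.

    #include <stdint.h>

    #define MAXLAB   (1 << 16)   /* ample for a ~160x120 parent image */
    #define HIGH_BIT 0x80

    static int uf_parent[MAXLAB];
    static int area[MAXLAB];      /* accumulated child-pixel counts per region  */
    static int has_high[MAXLAB];  /* did any member exceed the high threshold?  */

    static int find(int a) {
        while (uf_parent[a] != a)
            a = uf_parent[a] = uf_parent[uf_parent[a]];   /* path halving */
        return a;
    }

    static void unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return;
        uf_parent[b] = a;
        area[a]     += area[b];
        has_high[a] |= has_high[b];
    }

    /* Label the parent image; returns the number of provisional labels.
       Callers resolve a pixel's final label with find(labels[i]). */
    int label_parent_image(const uint8_t *par, int w, int h, int *labels)
    {
        int next = 1;
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                int i = y * w + x;
                uint8_t v = par[i];
                if ((v & ~HIGH_BIT) == 0) { labels[i] = 0; continue; }
                int lab = next++;
                uf_parent[lab] = lab;
                area[lab]      = v & ~HIGH_BIT;
                has_high[lab]  = (v & HIGH_BIT) != 0;
                labels[i] = lab;
                if (x > 0 && labels[i - 1]) unite(labels[i - 1], lab);   /* left neighbor */
                if (y > 0 && labels[i - w]) unite(labels[i - w], lab);   /* upper neighbor */
            }
        return next - 1;
    }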

3.4  Cleaning Regions

After connected-component labeling, we have a collection of regions that differ from the background. The next phase is to clean up noise regions and some expected, but insignificant, differences. There are three different cleaning tests: area, lighting-normalized differencing, and disoccluded-region checking.

The goal of area thresholding is to remove noise regions. The area thresholds, which are applied per region, use the accumulated pixel counts from the parent image. The area is not the number of pixels in the low resolution connected components image, but rather the total count of high-resolution pixels above the ``low'' threshold that are in the region. This allows the system to detect (and retain) a human target at 50m, i.e., a 2 pixels by 8 pixels region in the full resolution paraimage. There is also a maximum threshold to avoid large slow-moving shadows of clouds.

The second and third cleanings use normalized intensity data. This is done by computing the average pixel intensity within the target region in three images: the input, the primary background, and the old-image. Computations are then scaled by ratios of these values. We compute the normalization factor across the whole region rather than doing a more accurate, but more expensive, local normalization. In our experience to date, this has been sufficient for small and moderate-size regions, while large regions are usually thresholded away by the area test.

The second cleaning looks for local lighting variations, e.g. small cloud patches, sunlight filtered through the trees, etc. We do this by redoing the ``thresholding-with-hysteresis'' for the region, this time using the normalized background intensity.

The third cleaning phase checks whether the region is one where a moving target has disoccluded the background. This is done with a thresholding-with-hysteresis comparison against the normalized old-image.
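A hedged sketch of the normalized re-test for a single region follows: the region-wide intensity ratio between the input and a reference image (the primary background for the second cleaning, the old-image for the third) rescales the reference before the difference test is repeated. A real implementation would redo the full thresholding-with-hysteresis; here a simple low-threshold count stands in for it, and all names are illustrative.

    #include <math.h>
    #include <stdint.h>

    /* Returns nonzero if the region survives, i.e. still differs from the
       reference after region-wide intensity normalization. */
    int region_survives_normalized(const uint8_t *img, const uint8_t *ref,
                                   const int *idx, int npix,  /* pixel indices of the region */
                                   int low_thresh)
    {
        double sum_img = 0.0, sum_ref = 0.0;
        for (int i = 0; i < npix; ++i) {
            sum_img += img[idx[i]];
            sum_ref += ref[idx[i]];
        }
        double ratio = (sum_ref > 0.0) ? sum_img / sum_ref : 1.0;  /* region-wide gain estimate */

        int survivors = 0;
        for (int i = 0; i < npix; ++i) {
            double scaled_ref = ratio * ref[idx[i]];
            if (fabs(img[idx[i]] - scaled_ref) > low_thresh)
                ++survivors;   /* a full implementation would redo the hysteresis test */
        }
        return survivors > 0;
    }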

3.5  Tracking Regions

The tracking phase attempts to connect these regions to those from previous frames. The simplest, and most common, aspect of this association occurs when the current region spatially overlaps a region from the previous time frame. The system actually checks for this association while it is doing its connected components labeling. The labeling looks at both the current parent image as well as the parent image from the past frame. Objects that are connected in space-time are labeled with the same label they had in the past frame, which simplifies the search for potential matches.6

After handling the spatio-temporal connected regions, only a small number of regions remain unlabeled. Therefore, the system can spend considerably more time trying to match these regions. It looks to merge new regions with nearby regions that have strong temporal associations. It also looks to connect new regions with regions that were not in the previous frame but had been tracked in earlier frames and then disappeared. Both of these more complex matchings use a mixture of spatial proximity and feature similarities.
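The kind of fallback matching used for regions that were not already connected in space-time can be sketched as a spatial proximity gate combined with a simple feature (here, area) similarity. The actual LOTS feature set and scoring are richer; the structure, weights, and names below are illustrative assumptions.

    #include <math.h>

    typedef struct { double cx, cy; double area; int id; } Region;

    /* Return the id of the best matching prior track, or -1 if none is plausible. */
    int match_region(const Region *r, const Region *tracks, int ntracks,
                     double max_dist)
    {
        int best = -1;
        double best_score = 1e30;
        for (int i = 0; i < ntracks; ++i) {
            double dx = r->cx - tracks[i].cx, dy = r->cy - tracks[i].cy;
            double dist = sqrt(dx * dx + dy * dy);
            if (dist > max_dist)
                continue;                                   /* spatial proximity gate */
            double asim = fabs(r->area - tracks[i].area) /
                          (r->area + tracks[i].area + 1.0); /* area similarity in [0,1) */
            double score = dist / max_dist + asim;          /* lower is better */
            if (score < best_score) { best_score = score; best = tracks[i].id; }
        }
        return best;
    }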

3.6  The display and user ``feedback''

While it is acceptable to run tracking algorithms directly on the paraimage, it is not the best way to show the targets to human users. The system provides the user a collection of windows that contain perspectively corrected views. While any number of windows is allowed, we generally use between 1 and 6, depending on the anticipated number of moving objects. The viewing direction within these windows can be controlled via the mouse. More importantly, the system has a process that automatically chooses views such that the perspective windows track the N most ``significant'' targets. See figure 5 for an example.


 
 

cmuspu.gif

 
Figure 5: Example from the VSAM IFD demo with 3 perspective windows tracking the top 3 targets. In the paraimage, targets have trails showing their recent ``path''.


Another feature of the user interface (UI) is the ability to select both regions of interest and regions of disinterest (``always'' ignored). The latter can be used to exclude targets in areas where motion is expected, but insignificant.

The UI can show a ``density'' of targets over the recent past, can limit the display to targets with a particular certainty, and can toggle whether targets have ``trails''. These help in isolating potential problem regions, and in understanding the nature of a target.

If a target perspective window contains a ``false alarm'', the user can provide runtime feedback and have it ``ignore'' that target and update the multi-background model to attempt to ignore it in the future.

The LOTS system has parameters allowing it to be tuned to different conditions. A number of predefined scenarios can be used to set these system parameters.

3.7  Geolocation

Up to now, all of the targets have been regions in the coordinate system of the paraimage, or video in a perspective window. While this is often enough for human viewing, it is not sufficient for automated processing or for multi-camera coordination.

The single-viewpoint constraint of the paracamera makes it straightforward to ``back-project'' a ray into the world, and also allows a simple calibration process. Given three 3D world-coordinate points on the ground plane, we can solve for the transform from paraimage coordinates to the world coordinates of the point on the ground plane that projects to a given image coordinate. The detailed equations for the relation are beyond the scope of this paper.

Given the calibration points, we precompute a ``map'' that gives the 3D coordinates of the back-projection of each pixel in the paraimage, thus runtime computation is a trivial table lookup. When the calibration points are chosen near the ``horizon'', the resulting 3D projection is usually quite good. Localization of the calibration points, the presumption of a ground plane (rather than a complex surface) and the ability to localize a target are the limiting factors.
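The precomputed map amounts to a per-pixel lookup table of ground-plane coordinates, roughly as sketched below. The back-projection itself depends on the paracamera calibration, whose equations the paper deliberately omits, so it appears here only as a stub; all names are illustrative.

    #include <stddef.h>

    typedef struct { float X, Y, Z; } World;

    /* Placeholder for the calibrated back-projection: map paraimage pixel (u,v)
       through the single viewpoint onto the ground plane.  The real function
       comes from the calibration; this stub just returns the origin. */
    static World backproject_to_ground(int u, int v)
    {
        (void)u; (void)v;
        World w = {0.0f, 0.0f, 0.0f};
        return w;
    }

    /* Fill the per-pixel geolocation map once, after calibration. */
    void build_geolocation_map(World *map, int width, int height)
    {
        for (int v = 0; v < height; ++v)
            for (int u = 0; u < width; ++u)
                map[(size_t)v * width + u] = backproject_to_ground(u, v);
    }

    /* At run time, geolocating a target pixel is a single table read. */
    World geolocate(const World *map, int width, int u, int v)
    {
        return map[(size_t)v * width + u];
    }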

3.8  A network of para-sensors

For each tracked object, the system computes and displays, via color encoding, a heuristic confidence measure based on many contributing factors including the object's size, contrast, how long it has been tracked, and how fast/far it has moved. This provides an easy way for users to crudely adjust their probability of detection versus false-alarm (FA) rate by demanding only higher-confidence targets.
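As a loose illustration only: the paper does not give the form of the confidence measure, so the sketch below simply combines saturating contributions from the listed factors. All of the weights and scales are placeholders, not values used by LOTS.

    /* Hypothetical confidence heuristic combining the factors listed in the text. */
    typedef struct {
        double area;        /* supporting pixel count               */
        double contrast;    /* mean |image - background| in region  */
        double age_frames;  /* how long the target has been tracked */
        double travel;      /* total distance moved (pixels)        */
    } TargetStats;

    double target_confidence(const TargetStats *t)
    {
        /* Saturating contributions in [0,1]; the scale constants are placeholders. */
        double s_area     = t->area       / (t->area       + 32.0);
        double s_contrast = t->contrast   / (t->contrast   + 16.0);
        double s_age      = t->age_frames / (t->age_frames + 30.0);
        double s_travel   = t->travel     / (t->travel     + 8.0);
        return 0.25 * (s_area + s_contrast + s_age + s_travel);   /* result in [0,1] */
    }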

A final component of our ongoing efforts is multi-camera coordination and a fully networked system. With this extension, targets are tracked in a local sensor processing unit (SPU), a computer/camera pair. As mentioned before, each paracamera SPU has a local ``scheduler'' for the most significant targets. Target information and significant video clips are integrated and displayed by a networked display controller (NDC). The NDC is in charge of network bandwidth allocation and chooses among the potential video streams from the various SPUs. The goal is to have one networked computer connected to 5-20 paracameras with all of the ``significant events'' being viewed on the NDC.

One of the design constraints in our development was the ability of the protocol to scale to a large number of sensors, each with a large number of targets, without saturating the network. It has been demonstrated with wireless communication connecting the paracamera SPU to a mobile display unit. When running in network mode the code can be slowed down, as network ``targets'' are sent only 4-5 times per second. This may allow slower, lower-power processors to be used, or multiple paracameras on each processor.

The network protocol design underwent a number of iterations, and in mid-1998 we coordinated with CMU on a compromise protocol which incorporated key ideas from both the original CMU design and the new Lehigh design. The result was less bandwidth-efficient but considerably easier to implement and use, and something we could share among researchers in the DARPA VSAM program; see [] for details.

A paracamera SPU was a part of the VSAM integrated feasibility demonstration [] where its targets were coordinated with a number of CMU pan-tilt-zoom type SPUs. The ``mini-OCU'' of the paracamera controlled 3 perspective views, but 3D target locations of the top 7 targets were sent to the coordinating CMU processor.

4  Evaluation

To support the evaluation of our system, we collected omni-directional image data at Ft. Benning in scenarios of interest to the DARPA SUO-SAS program. Approximately 70 hours of omni-directional video was collected, including both significant amounts of ``targets'' and empty scenes for false-alarm evaluation. Atmospheric conditions ranged from light rain and wind, through partly sunny and windy, to sunny with a light breeze. Limited amounts of data are available from the author.

Researchers at the Institute for Defense Analyses performed a preliminary analysis of LOTS as of Aug. 1998, and we report on their evaluation here.

The analysis showed approximately 95% detection rates (ranging from 100% down to 87%) and false-alarm rates ranging from 0.15 FA per minute to 1.7 FA per minute. Almost all detections were considered ``immediate'', with only the most difficult cases taking longer than a second. The scenarios evaluated included: short indoor segments, two urban/street scenes, two different wooded settings, a town edge and tree-line, and a sniper in a grass field. These evaluations did not include the use of any confidence measures, nor did they allow for incremental learning or adaptive feedback on false alarms.

At the time of this initial external evaluation, the only cleaning phase was area based. A large fraction of the detected false alarms were small- to moderate-sized regions with lighting-related changes, e.g. small sun patches filtering through the trees or shadows. In a wide field of view, many of these appear very much like a person emerging from occlusion. The ``ghosting'' of targets was also noted in their report. This feedback led to the additional cleaning phases. The confidence measures, feedback, and new cleaning processes will be part of a reevaluation scheduled for Jan. or Feb. 1999.

5  Conclusion & Future work

This paper presented the LOTS system and described some of its unique design choices. Initial evaluation, by an external group using realistic data, has shown the collection of techniques to be both efficient and effective. The main limitations (and hence subjects of ongoing/future work) are issues related to target fragmentation, loss of identity during occlusion, and the grouping of multiple individual targets into one. Decreasing the false-alarm rate is also an ongoing activity.


Footnotes:

1 This work was supported in part by the DARPA VSAM program.

2 Cyclovision, Inc has an exclusive license on this patented design, see www.cyclovision.com.

3 The capture card produces a minimum valid gray value of 16.

4 In urban settings with nearby buildings, it is possible for disocclusion of a large object to cause a moderate growth. Therefore the growth percentage is a system parameter.

5 In most of our data collection scenarios the camera is mounted in a tree and is subject to tree motion.

6 While generally correct, this may incorrectly connect/label individuals within a closely packed, fast-moving group. While not a problem for targets at walking speeds, it can occur for vehicles. Corrections for this problem are being investigated.

Acknowledgements

This work was supported in part by the DARPA VSAM program and the ONR MURI Program. The opinions herein are those of the authors and do not necessarily represent those of our funding agents. A variant of this paper appeared in the IEEE Workshop on Visual Surveillance, 1999.

