Developing Real-Time Computer Vision Applications for Intel Pentium III based Windows NT Workstations
Ross Cutler and Larry Davis
September 4, 1999
http://www.cs.umd.edu/~rgc/pub/frame99
1.1 What is a real-time application?
1.2 Why Intel Pentium III PCs?
2 Overview of a Real-Time Computer Vision Application
3.1 Capturing video for offline use
4.1 The Pentium III PC architecture
4.2 Benchmarking the memory system
4.3 How to detect memory bottlenecks
4.4 What is the upper limit on how fast my algorithm will run?
4.5 Using MMX and Streaming SIMD
4.8.3 Cache issues with SMP systems
4.9 Benchmarking your application
4.10 Bypassing Windows NT’s virtual memory system
4.12 Hardware acceleration of image processing operations
6 Hard real-time extensions to Windows NT
Abstract
In this paper, we describe our experiences in developing real-time computer vision applications for Intel Pentium III based Windows NT workstations. Specifically, we discuss how to optimize your code, efficiently utilize memory and the file system, utilize multiple CPUs, get video input, and benchmark your code. Intrinsic soft real-time features of Windows NT are discussed, as well as hard real-time extensions. An optimized real-time optical flow application is given. Empirical results of memory subsystems and cache scheduling issues are also reported.
Intel processor-based PCs running the Microsoft Windows 98/NT operating systems are the clear market leader in both home and business computer use. Using the ubiquitous PC for real-time computer vision applications is appealing for both the user and the developer. For the user, using existing inexpensive hardware and an existing operating system is attractive for both economic and integration reasons. For the developer, the success of the PC has produced many useful development tools and hardware peripherals that can be used to build a real-time computer vision application. However, effectively exploiting the power of this hardware requires a detailed understanding of the hardware and of low-level programming, which this paper attempts to address.
A real-time application is one that can respond in a predictable, timely way to external events. Real-time system requirements are typically classified as hard or soft real-time. For a hard real-time system, events must be handled predictably in all cases; a late response can cause a catastrophic failure. For a soft real-time system, not all events must be handled predictably; some late responses are tolerated. For many real-time computer vision applications, a soft real-time system is sufficient. For example, in a real-time gesture recognition system (e.g., [1]), it may be tolerable to occasionally or systematically drop video frames, as long as the system is designed to robustly handle frame drops.
The recent Intel Pentium III processor now provides sufficient processing power for many real-time computer vision applications. In particular, the Pentium III includes integer and floating point Single Instruction Multiple Data (SIMD) instructions, which can greatly increase the speed of computer vision algorithms. Multiprocessor Pentium III systems are relatively inexpensive, and provide memory bandwidth and computational power similar to those of non-Intel workstations costing significantly more.
The reference PC used in all of the benchmarks given in this paper is a Dell 610 dual 550 MHz Pentium III Xeon, with 512 MB of ECC SDRAM.
We use the Microsoft Windows NT operating system for our real-time computer vision application development. The reasons for choosing Windows NT over the more popular Windows 98 are: (1) Windows NT supports multiple processors; (2) Windows NT has a better architecture for real-time applications than Windows 98; (3) Windows NT is more robust than Windows 98. The reason for choosing Windows NT over Linux is that Windows NT provides many software development tools that aren’t available on Linux. In addition, Windows NT has a larger number of hardware peripherals available (with supported drivers), particularly frame grabbers, which are essential for most real-time vision applications.
Windows NT is a general-purpose operating system, and was designed to maximize aggregate throughput and achieve fair sharing of resources. It was not designed to provide low-latency responses to events, predictable time-based scheduling, or explicit resource allocation mechanisms. Real-time OS features not found in Windows NT include deadline-based scheduling, explicit CPU or resource management, priority inheritance, fine-granularity clock and timer services, and bounded response times for essential system services. However, Windows NT does provide some real-time OS features: elevated fixed real-time thread priorities, interrupt routines that typically re-enable interrupts very quickly, and periodic callback routines [2-4].
This paper serves three purposes. First, it summarizes many of the important issues in developing real-time applications on a Windows NT Intel Pentium III platform, including many references for further details. In addition, we provide empirical results of memory and cache scheduling issues, which are particularly important for real-time computer vision applications. Finally, we give working examples of computer vision algorithms, which can be further used for building real-time computer vision applications.
A typical real-time computer vision system includes the following components:

Figure 1: Real-time computer vision system components
The video input is typically a CCD-based video camera, connected to a frame-grabber installed in the PC (see Section 3). The image and computer vision processing can be done by the main CPU(s), by the graphics hardware (see Section 4.12), and by optional digital signal processors (DSPs) (e.g., a C80 integrated on the video capture card). The processing of results depends completely on the application. For example, a real-time person tracker may send tracking events to another process running on the system, which displays the results in real-time using a graphical user interface (GUI).
Capturing live video is an essential part of most real-time computer vision applications. Windows NT 4.0 provides the Video For Windows (VFW) [5] interface as a standard for video input. While VFW may suffice for some applications, it has some efficiency problems. Specifically, VFW drivers perform memory copies on the captured images, instead of transferring images directly to DMA image buffers and making these buffers available to the user. The result of this inefficiency is wasted CPU cycles and dropped frames during video capture.
In Windows 2000 (and Windows 98), VFW is succeeded by WDM Video Capture [5], which alleviates many of the problems of VFW. In WDM Video Capture, images are transferred directly to a circular DMA buffer, and user interrupts are triggered when an image capture completes. WDM Video Capture uses the DirectShow interface to provide compatibility with many third-party applications.
There are numerous (over 30 at last count) PCI-based machine vision frame grabbers available for Intel-based Windows NT workstations. Almost all of these frame grabbers provide a custom SDK, which the programmer can use for direct control of the frame grabber. Many of the machine vision (not consumer) frame grabbers do not support VFW, due to the inadequacies previously discussed.
A common paradigm used for providing live video is to capture the video in a circular image buffer, and trigger the consumer of the video images when a new frame is ready (typically via a callback function). The image buffers are DMA buffers, which the frame grabber can write directly to, and which the user can process directly without memory copies. The size of the circular buffer depends on how much image history is required for the application, and how quickly the images can be processed. For example, if only the current frame is required, and the processing can always finish in less than the frame-sampling interval (e.g., 33 ms for NTSC video), then a circular buffer length of 2 suffices (this is called double-buffering). However, since Windows NT is not hard real-time, one should not assume that the processing can always finish on time. Instead, a larger circular buffer length should be used to account for the variability in the processing time.
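The circular-buffer paradigm can be sketched as follows. This is a minimal single-threaded illustration under stated assumptions, not code from any frame grabber SDK: the class name FrameRing and its methods are hypothetical, the slots are ordinary heap memory rather than DMA buffers, and a real system would need synchronization between the grabber callback and the consumer. When every slot is full, the sketch drops the incoming frame and counts it, which is the kind of late response a soft real-time system tolerates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-size ring of frame buffers (illustrative names only).
class FrameRing {
public:
    FrameRing(std::size_t slots, std::size_t frameBytes)
        : frames_(slots, std::vector<std::uint8_t>(frameBytes)) {
        assert(slots >= 2);  // a length of 2 is double-buffering
    }

    // Producer side: returns the slot the next frame should be written to,
    // or nullptr if the consumer has fallen behind and the frame is dropped.
    std::uint8_t* beginWrite() {
        if (count_ == frames_.size()) {
            ++dropped_;
            return nullptr;
        }
        return frames_[head_].data();
    }

    // Producer side: marks the slot returned by beginWrite() as ready.
    void endWrite() {
        head_ = (head_ + 1) % frames_.size();
        ++count_;
    }

    // Consumer side: oldest unprocessed frame, or nullptr if none is pending.
    const std::uint8_t* beginRead() const {
        return count_ == 0 ? nullptr : frames_[tail_].data();
    }

    // Consumer side: releases the oldest frame so its slot can be reused.
    void endRead() {
        assert(count_ > 0);
        tail_ = (tail_ + 1) % frames_.size();
        --count_;
    }

    std::size_t pending() const { return count_; }
    std::size_t dropped() const { return dropped_; }

private:
    std::vector<std::vector<std::uint8_t>> frames_;
    std::size_t head_ = 0;     // next slot to write
    std::size_t tail_ = 0;     // oldest unread slot
    std::size_t count_ = 0;    // frames written but not yet released
    std::size_t dropped_ = 0;  // frames discarded because the ring was full
};
```

With two slots the ring behaves as a double buffer: the producer can fill one slot while the consumer processes the other. Increasing the slot count absorbs longer processing delays at the cost of memory and latency.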
The Microsoft Vision SDK (http://www.research.microsoft.com/projects/VisSDK) provides a useful abstraction for device-independent video capture, which we use in our own development.
Capturing video for offline use is an essential task in developing a computer vision application. Computer vision algorithms can be deterministically tested and optimized using the captured video. Video can be captured to memory or disk.
When capturing to disk, a RAID is typically used to provide sufficient throughput. For example, a 640x480x24 image sequence at 30 fps requires 26.4 MB/s sustained disk throughput, which can be achieved using two fast SCSI disks (e.g., Seagate Ultra 2 Wide SCSI Cheetah disks). A RAID controller is not required, as Windows NT can efficiently simulate a level-0 RAID in software. To maximize throughput, we bypass the Windows NT file cache, which would otherwise copy the streaming image data to virtual memory; even worse, the default settings for the file cache cause the file cache to grow until the system starts to swap! Multiple file DMA buffers (typically 4) and overlapped writes are used to account for the disk throughput variations. The image file is pre-allocated to allow overlapped writes to execute optimally. See [6] for details and source code. See [7] for a useful disk benchmarking utility. See http://www.cs.umd.edu/~rgc/software for other video capture applications.
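The throughput figure quoted above follows from simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes uncompressed video with a bit depth that is a multiple of 8. The 26.4 MB/s figure is reproduced when 1 MB is taken as 2^20 bytes.

```cpp
#include <cassert>
#include <cstdint>

// Sustained throughput, in bytes per second, needed to store uncompressed
// video. Assumes bitsPerPixel is a multiple of 8.
inline std::uint64_t videoBytesPerSecond(std::uint64_t width,
                                         std::uint64_t height,
                                         std::uint64_t bitsPerPixel,
                                         std::uint64_t framesPerSecond)
{
    assert(bitsPerPixel % 8 == 0);
    return width * height * bitsPerPixel * framesPerSecond / 8;
}

// Converts a byte count to megabytes, with 1 MB = 2^20 bytes.
inline double toMegabytes(std::uint64_t bytes)
{
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

For a 640x480x24 sequence at 30 fps this gives 27,648,000 bytes/s, or about 26.4 MB/s, so each disk in a two-disk level-0 RAID has to sustain roughly half of that.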
When capturing to memory, it is most efficient (space-wise) to capture to memory above what Windows NT uses (see Section 4.10). Otherwise, Windows NT 4.0 can only utilize 40% of the DMA buffers actually requested. Some of the available frame grabber drivers (e.g., from Matrox and EPIX) provide direct support for image buffers above the Windows NT memory space, so that 100% of the memory can be utilized for image buffers (less the amount used by Windows NT).
Because of the massive amounts of data involved, computer vision algorithms can be extremely computationally expensive. Making this processing run in real-time often requires a great deal of optimization. The payoff can be significant: we have increased the speed of many computer vision algorithms by more than a factor of 10 using the techniques described in this section.
The Pentium III is a general purpose 32-bit CPU, with some DSP-like features added to accelerate multimedia applications. The DSP-like features include integer and floating point SIMD instructions, as well as cache control. A block diagram of a dual Pentium III Xeon system is given in Figure 2 [8]. A block diagram of the Pentium III is given in Figure 3 [9]. The Pentium III consists of three major units: fetch/decode, dispatch/execute, and retirement. The dispatch/execute (DE) unit is detailed in Figure 4. On each clock cycle, the DE unit can send a micro-op to each of the 5 ports. With careful coding, the Pentium III can execute instructions on multiple ports simultaneously.

Figure 2: Dual Pentium III system block diagram [8]

Figure 3: Pentium III Architecture [9]

Figure 4: Execution Units and Ports in the Out-Of-Order Core [9]
Many computer vision algorithms can be memory-bound if not properly designed and coded (i.e., the processor spends excessive time waiting on the memory to read/write data). To avoid these situations, we need to fully understand the design and actual throughput of the memory system. The memory system of a Pentium III Xeon PC consists of a 32 KB L1 cache (16 KB data, 16 KB code, four-way set associative with a cache line length of 32 bytes, pseudo-least recently used replacement algorithm), a 512KB L2 cache (larger L2 caches are available), and ECC SDRAM main memory. The theoretical and actual speeds of these memory subsystems are given in Table 1. The theoretical speeds are for either memory reads or writes. The memcpy() and fastcopy() speeds are for read plus writes. Note that the Pentium III (non-Xeon) has an L2 cache that is half the speed of the Pentium III Xeon.
| Memory subsystem | Theoretical | memcpy | fastcopy | readmem128 | readmem64 | writemem128 | writemem64_movq |
|---|---|---|---|---|---|---|---|
| L1 cache | 4400 | 2196 | 513 | 3653 | 1779 | 723 | 1338 |
| L2 cache | 4400 | 400 | 385 | 2177 | 1755 | 727 | 492 |
| Main | 800 | 164 | 393 | 590 | 547 | 730 | 264 |
Table 1: Theoretical and actual memory subsystem speeds (in MB/s) of a 550 MHz Pentium III Xeon
The main memory uses SDRAM (PC-100) with a 100 MHz 64-bit wide simplex bus, with a theoretical limit of 800 MB/s (or 400 MB/s for copies). Our fastcopy() benchmark gave an actual speed of 393 MB/s (read plus write). Note this is over twice the main memory performance of comparable workstations like the SGI Octane (see the STREAM benchmark, http://www.cs.virginia.edu/stream). (Note that the STREAM memory copy benchmarks report twice the throughput actually achieved; our 393 MB/s result for fastcopy() would be reported as 786 MB/s by STREAM).
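The bandwidth figures in the preceding paragraph can be checked with a short worked calculation. It assumes one 8-byte transfer per bus clock and takes 1 MB as 10^6 bytes. A copy moves every byte across the bus twice (one read, one write), which halves the usable copy rate, while STREAM counts both the bytes read and the bytes written.

```latex
\begin{align*}
B_{\text{bus}} &= 100 \times 10^{6}\ \tfrac{\text{transfers}}{\text{s}}
                  \times 8\ \tfrac{\text{bytes}}{\text{transfer}}
                = 800\ \text{MB/s} \\
B_{\text{copy}} &= \tfrac{1}{2}\, B_{\text{bus}} = 400\ \text{MB/s} \\
\frac{B_{\texttt{fastcopy}}}{B_{\text{copy}}} &= \frac{393}{400} \approx 0.98 \\
B_{\text{STREAM}} &= 2 \times 393 = 786\ \text{MB/s}
\end{align*}
```

By this measure, fastcopy() reaches about 98% of the theoretical copy rate of main memory.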
To benchmark the memory subsystem, we measure the time it takes to copy (both read and write) a source buffer to a destination buffer, for various buffer sizes. We repeat each copy many times so that, if the buffer is small enough, the data resides entirely within the cache. The main memory speed can be measured using a buffer that cannot fit within the cache (thus ensuring cache misses for every byte copied). The source code used in benchmarking the memory subsystems is at http://www.cs.umd.edu/~rgc/pub/frame99/BenchMemory.zip.
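The measurement procedure can be sketched as follows. This is a minimal illustration rather than the BenchMemory code itself: the function name copyThroughput is hypothetical, timing uses the standard C++ steady clock, and 1 MB is taken as 10^6 bytes.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Measures memcpy() throughput for one buffer size.
// Returns MB/s (1 MB = 10^6 bytes), counting each copied byte once,
// i.e. the "read plus write" convention.
// copyThroughput is a hypothetical name, not part of BenchMemory.
double copyThroughput(std::size_t bufferBytes, int repeats)
{
    assert(bufferBytes > 0 && repeats > 0);
    std::vector<char> src(bufferBytes, 1);
    std::vector<char> dst(bufferBytes, 0);
    volatile char sink = 0;  // keeps the compiler from discarding the copies

    // Warm-up pass: faults in the pages and, for small buffers, loads the cache.
    std::memcpy(dst.data(), src.data(), bufferBytes);

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        src[0] = static_cast<char>(r);  // vary the data between passes
        std::memcpy(dst.data(), src.data(), bufferBytes);
        sink = dst[0];                  // observe the result of each pass
    }
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    const double bytesCopied = static_cast<double>(bufferBytes) * repeats;
    return bytesCopied / seconds / 1.0e6;
}
```

Calling the function with buffer sizes such as 4 KB, 128 KB, and 8 MB should exercise the L1 cache, the L2 cache, and main memory respectively on a machine like the reference PC, since the source and destination buffers together must fit in the cache level being measured.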
The standard memcpy() function uses the rep movsb instruction to copy a block of memory from a source to a destination. This method can be improved on the Pentium III by using the 128-bit Streaming SIMD registers and prefetching into the cache [9]. Specifically, we first read a byte from the source buffer to ensure that the memory page in which the buffer resides is in the translation lookaside buffer (TLB), since prefetches only work for memory pages within the TLB. We then prefetch 4 KB of memory into the L1 cache, and copy that 4 KB using 128-bit registers. The actual copy is unrolled twice to improve performance. The code is given below in the function fastcopy(). Note that fastcopy() is significantly faster than memcpy() for main memory copies (see Table 1). One caveat of fastcopy() is that the copied memory will not reside in the L2 cache, since the L2 cache is bypassed during the copy. Therefore, subsequent operations on the source or destination data may not generate an L2 cache hit, as they might with memcpy().
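#include <xmmintrin.h>  // SSE intrinsics: _mm_prefetch, _mm_load_ps, _mm_stream_ps

#define CACHESIZE 4096
#define CACHELINESIZE 32

// Copy memory from source to dest using 128-bit Streaming SIMD registers.
// Assumes: n is a nonzero multiple of CACHESIZE, and source and dest are
// 16-byte aligned (required by _mm_load_ps and _mm_stream_ps).
void fastcopy(char *dest, char *source, int n)
{
    for (int i = 0; i < n; i += CACHESIZE) {
        // Read one byte of the block so that its page is in the TLB before
        // prefetching. Reading source[i] keeps the access inside the buffer.
        volatile char temp = source[i];
        (void)temp;
        // Prefetch the rest of the block, one cache line at a time.
        for (int j = i + CACHELINESIZE; j < i + CACHESIZE; j += CACHELINESIZE) {
            _mm_prefetch(source + j, _MM_HINT_NTA);
        }
        // Copy the block with non-temporal stores, 32 bytes per iteration.
        for (int j = i; j < i + CACHESIZE; j += 32) {
            _mm_stream_ps((float*)(dest + j), _mm_load_ps((float*)(source + j)));
            _mm_stream_ps((float*)(dest + j + 16), _mm_load_ps((float*)(source + j + 16)));
        }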