MMX processing is an important tool in speeding up vision and image processing. This is a very brief introduction to using MMX with the GNU tools (i.e. for Linux or FreeBSD), and it presumes you already know why and how to do the basic MMX coding. (There are good introductions to why/how MMX works, but this is intended as an example for vision/image-processing types who already know what they want.) For general background you should also check out the Linux Parallel Processing HOWTO by Hank Dietz (ppLinux@ecn.purdue.edu).
The routine computes two outputs. The first, the output image, will
contain the absolute value of the difference of the pixel from a
background reference if that difference is above a per-pixel threshold.
If the absolute value of the difference is below threshold, the
associated pixel will be 0. Logically we can view it as:
if ( abs(inim[i] - refim[i]) <= (thresh + varim[i]) )
    outim[i] = 0;
else
    outim[i] = abs(inim[i] - refim[i]) - (thresh + varim[i]);
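Before the MMX version, it may help to see the per-pixel logic as a plain C helper (the name absdiff_pixel is mine, this is just a scalar sketch of the pseudocode above, not the tuned routine; note that the MMX version uses saturating arithmetic, so a combined threshold above 255 would clamp there):

```c
#include <stdlib.h>   /* abs */

/* Scalar sketch of the per-pixel logic; the MMX routine below
 * computes the same thing for 8 pixels at a time. */
static unsigned char absdiff_pixel(unsigned char in, unsigned char ref,
                                   unsigned char var, unsigned char thresh)
{
    int diff = abs((int)in - (int)ref);   /* abs(inim[i] - refim[i]) */
    int t = (int)thresh + (int)var;       /* combined threshold      */
    return (diff <= t) ? 0 : (unsigned char)(diff - t);
}
```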
We also compute a per-row flag, such that if (row_used[i] != 0), then we know that at least one pixel in the row is "above" the threshold. This type of "cached" information is quite important in speeding up further processing. (In our real system the code is more complex: we handle different data formats, e.g. yuv422packed; because we only process "cords" in each row, we also make a second "pass" over each row and test against a secondary background; and we keep track of the first and last locations on each row that are above threshold with respect to both backgrounds. But these would make the code much harder to follow.)
I am not a big fan of doing everything in assembly; I do inline assembly when I need it, so most of the setup/calling/looping code is straight C (or C++).
void absdiffmmx(
        // the outputs
        unsigned char* outim,   // where we store output
        int* row_used,          // a per-row flag if "used"
        // inputs
        unsigned char* inim,    // the input image
        unsigned char* refim,   // the reference image (input)
        unsigned char* varim,   // the per pixel threshold (variance)
        unsigned char thresh,   // the global threshold
        int rows, int cols)
{

The next part of the code sets up part of the threshold. As mentioned, the thresholding process combines a per-pixel threshold (varim) and the global threshold (thresh). We want to copy the global threshold into an MMX register, and need 8 "copies", one for each byte in an MMX register. To do that we include the following:
double mmtemp_d;   // gets it double aligned and allocates storage
unsigned char* mmtemp = (unsigned char*) &mmtemp_d;  // ease of access
// setup: the global threshold in an MMX register
for (int i=0; i<8; i++) {
    mmtemp[i] = thresh;
}
Before we show how to load that into the register, let's look at the general GNU/assembly interface. Remember GNU syntax is the opposite of Microsoft's; it reads

op source, dest

and when we use inline assembly we have the syntax:

__asm__("formatted asm instructions" : output_vars : input_vars : clobbered-registers);

For example,

__asm__ volatile("\n\t pxor %%mm3, %%mm3 \t # clear row-used flag" : : );

uses the parallel XOR operation to "clear" register mm3. Note that the register's name is %mm3, but to get a % through the "formatter" you need to prefix it with %%. You can see we have included formatting (\n and \t) and comments (everything in the string after the #). Note that while the "clobbered-registers" parameter is discussed in the GNU tools documentation, it is currently not supported for MMX because the compiler does not include the MMX registers in its register allocation scheme. Use of the volatile keyword is recommended for inline asm (I've never seen it make a difference, but YMMV). While it is not in our code, a simple example to show the "loading/storing" of data is in order. It is very straightforward:
__asm__ volatile("\n\t movl %0,%%esi" : : "m" (i) );would move the contents of the "C" variable i into the %esi register and
__asm__ volatile("\n\t movl %%ecx,%0" : "=m" (i) " );would move the contents of the %ecx register into the "C" variable i. Note that outputs between the first set of colons (:) and have an "=m" while the inputs are in the second set of colons and have just an "m".
In our code we define a macro for loading an MMX register (mostly because we use it in many places, and partly because we change the definition when we compile our code under VC6.0 on Win32). Here is our GNU macro:

// load the threshold in mm7.. example macro for loading just 1 register
// note we use newline and tab so we get nice looking assembly...
#define MMX_load8byte_mm7(data) __asm__("\n\t movq %0,%%mm7\n" : : "m" (data))
MMX_load8byte_mm7(mmtemp_d);
Now we need to loop over the image. We could do all the looping in assembly, but that is harder to visualize and debug, so I keep as much as reasonably possible in C. As we shall see, the generated assembly is quite good, so not much is lost by this choice. To make it easier to "index and load", I abuse the data types and use double* to access the images. (A double on x86 is 8 bytes, so each "index" step covers 8 bytes, which is just right for MMX processing.)
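The pointer abuse can be seen in isolation (the names row and mmx_chunk are mine; we only do pointer arithmetic here, never dereference the double*):

```c
/* The double* indexing trick: each index step on a double pointer
 * advances 8 bytes, exactly one MMX register's worth of 8-bit pixels. */
static unsigned char row[32];

static unsigned char *mmx_chunk(int index)
{
    double *dpt = (double *)row;          /* 8 bytes per index */
    return (unsigned char *)&dpt[index];  /* == &row[index*8]  */
}
```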
Now, before showing the start of the loop, we need to get into the real MMX part.
The algorithm computes abs(i-r) by computing both i-r and r-i using saturating subtraction. Thus, per pixel, one of the two values will be 0 and the other will be the "positive" difference. (Testing the insides of an MMX register is not possible without "unpacking" it back into a normal register.) With 8 pixels in each MMX register, we don't really want to test each entry anyway. But since the non-positive result is all 0s, we can do a parallel bit-wise OR of the two subtraction results to get the final absolute value of the difference.
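The trick can be modeled one byte at a time in portable C (helper names subus and absdiff_via_or are mine; subus mimics what psubusb does per byte, and the OR mimics por):

```c
/* Saturating unsigned byte subtract: clamps at 0, like psubusb. */
static unsigned char subus(unsigned char a, unsigned char b)
{
    return (a > b) ? (unsigned char)(a - b) : 0;
}

/* One of the two saturating differences is always 0, so OR-ing
 * them (like por) yields |a - b|. */
static unsigned char absdiff_via_or(unsigned char a, unsigned char b)
{
    return subus(a, b) | subus(b, a);
}
```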
In the code we also add the global and per-pixel thresholds (via a saturating add) before the final saturating subtraction takes place. With that in mind, here is the start of the loop:
for (int i=0; i < rows; i++)
{
    // Make them double pointers so they index easily
    double *ipt = (double*) &inim[i*cols];
    double *opt = (double*) &outim[i*cols];
    double *rpt = (double*) &refim[i*cols];
    double *vpt = (double*) &varim[i*cols];
    int k = cols/8;   // we will be using 8 chars at a time
    int index=0;
    // mm3 is used as a flag tracking if any pixels in the row are above
    // threshold. Not really needed, but it can make further processing
    // faster by skipping empty rows.
    __asm__("\n\tpxor %%mm3, %%mm3 \t # clear row-used flag" : :);
In the code we allocate the MMX registers as follows: mm0 holds the reference data (and later the result), mm1 the input data, mm2 a copy of the reference, mm3 the row-used flag, mm4 the per-pixel variance (plus global threshold), and mm7 the global threshold.
You can see that the inline code includes "formatting and comments", which are not required, but I find they make the generated assembly much easier to read. The comments in the code include the expected pipe position (u or v) on a regular Pentium MMX. The actual pipeline allocation might be different on a K6 or PII, where the underlying hardware is a tad different. (It definitely is different on K6/K7, as their MMX pipes are more flexible.) Note that two consecutive lines with u mean different "cycles", but a u/v pair is issued at the same time. We could hand-optimize for better pipe usage on a particular architecture and gain somewhere between 0-7%.
Here is the main processing loop (done per row):

    while(index < k){ // run across the row
        __asm__( // instruction          #pipe comment
            "\n\t movq %1,%%mm0 \t# u load ref "
            "\n\t movq %2,%%mm1 \t# u load input "
            "\n\t movq %%mm0,%%mm2 \t# v copy ref "
            "\n\t movq %3,%%mm4 \t# u load var image data "
            "\n\t psubusb %%mm1, %%mm0\t# v ref - inp "
            "\n\t psubusb %%mm2, %%mm1\t# u inp - ref ?subtract stall? "
            "\n\t paddusb %%mm7,%%mm4 \t# v add base threshold to variance"
            "\n\t por %%mm1, %%mm0 \t# u get abs diff (via or) "
            "\n\t psubusb %%mm4,%%mm0 \t# u subtract with saturate thresh"
            "\n\t movq %%mm0,%0 \t# u store result "
            "\n\t por %%mm0, %%mm3 \t# v mark row used if it was "
            : "=m" (opt[index])   // this is %0, the output
            : "m" (rpt[index]),   // and %1, the reference
              "m" (ipt[index]),   // and %2, the input
              "m" (vpt[index])    // and %3, the per-pixel threshold
            // the "=m" implies it is output; just "m" is input
            // see the gcc info pages for more details...
        );
        index++;
    }
    // get the per-row flag back into an ordinary variable
    __asm__("\n\tmovq %%mm3,%0" : "=m" (mmtemp_d));
    row_used[i]=0; // could be faster by playing int games, but why bother
    for (int ii=0; ii<8; ii++) row_used[i] |= mmtemp[ii];
} // end for loop over all rows
// ok, done with MMX; reset it to normal floating point usage
__asm__("emms" : : );
} // end of function absdiffmmx

We could include loop-unrolling and gain a little more in performance (3-5%), though then I would need to handle the index registers myself. The primary gain of unrolling here is not from the loop overhead, but because by unrolling we would better use the "v" pipes by starting the second set of "loads" at the instruction where we get the abs diff via the parallel or (i.e. after the line

"\n\t por %%mm1, %%mm0 \t# u get abs diff (via or) "

). This concludes the example.

Some hints for working with GNU and MMX

Note that egcs-2.91.66, which ships with RedHat6.0 (and a few other egcs/g++ versions), does not properly handle MMX register names in the assembly pass-through. It should replace the %%mm5 with %mm5 as it processes, but it does not. (If you get error messages about "bad register name ('%%mm3')" or something like it, you know you have the broken version.) With a broken version, compiling to assembly leaves the inline block (which appears between the #APP and #NO_APP markers) still containing the %%:

#APP
pxor %%mm3, %%mm3 # clear row-used flag
#NO_APP

The workaround is to compile to assembly and then use an editor or sed to replace %%mm with %mm, e.g.:

g++ -O3 -S -o absdiffmmx-tmp.s absdiffmmx.cc
sed -e 's/%%mm/%mm/g' < absdiffmmx-tmp.s > absdiffmmx.s
g++ -c -o absdiffmmx.o absdiffmmx.s

Also note that this "bug" also breaks all lines which try to use the "clobber-register" approach (for which I know no workaround).

Note that some level of optimization will probably be needed for inline assembly with MMX. (Without it you may get error messages such as "fixed or forbidden register 7 (sp) was spilled for class GENERAL_REGS".) Since GNU supports both optimization and debugging at the same time, it's not a real issue.

Looking at your assembly, and code tuning...

You will notice that I did not spend time "optimizing" my loops or writing the whole loop in assembly. I prefer to keep most of it in "C" and then occasionally check the "generated" code to see if it is efficient: I compile to assembly (-S) and look at it; the inline blocks are easy to find between the #APP and #NO_APP markers. You can also add assembly labels (.labelname) to make it easier to find your code and also to help with profiling. Note that the GNU compiler WILL optimize around your MMX block, including the data loading. For example, if we compile the above code with -O we get assembly code for the main loop that looks like:

#APP
pxor %mm3, %mm3 # clear row-used flag
#NO_APP
cmpl %edx,%eax
jge .L13
.align 4
.L14:
movl -28(%ebp),%esi
movl -24(%ebp),%edi
movl -16(%ebp),%ecx
movl -20(%ebp),%ebx
#APP
movq (%edi,%eax,8),%mm0 # u load ref
movq (%ecx,%eax,8),%mm1 # u load input
movq %mm0,%mm2 # v copy ref
movq (%ebx,%eax,8),%mm4 # u load var image data
psubusb %mm1, %mm0 # v ref - inp
psubusb %mm2, %mm1 # u inp - ref ?subtract stall?
paddusb %mm7,%mm4 # v add base threshold to variance
por %mm1, %mm0 # u get abs diff (via or)
psubusb %mm4,%mm0 # u subtract with saturate thresh
movq %mm0,(%esi,%eax,8) # u store result
por %mm0, %mm3 # v mark row used if it was
#NO_APP
incl %eax
cmpl %edx,%eax
jl .L14
.L13:

but if we use -O2 we get assembly code that looks like:

#APP
pxor %mm3, %mm3 # clear row-used flag
#NO_APP
cmpl %eax,%esi
jge .L13
movl -32(%ebp),%ebx
movl %ebx,-40(%ebp)
movl -24(%ebp),%ecx
movl -36(%ebp),%edx
movl -28(%ebp),%eax
.align 4
.L14:
movl -40(%ebp),%ebx
#APP
movq (%eax),%mm0 # u load ref
movq (%edx),%mm1 # u load input
movq %mm0,%mm2 # v copy ref
movq (%ecx),%mm4 # u load var image data
psubusb %mm1, %mm0 # v ref - inp
psubusb %mm2, %mm1 # u inp - ref ?subtract stall?
paddusb %mm7,%mm4 # v add base threshold to variance
por %mm1, %mm0 # u get abs diff (via or)
psubusb %mm4,%mm0 # u subtract with saturate thresh
movq %mm0,(%ebx) # u store result
por %mm0, %mm3 # v mark row used if it was
#NO_APP
addl $8,%ebx
movl %ebx,-40(%ebp)
addl $8,%ecx
addl $8,%edx
addl $8,%eax
incl %esi
cmpl %esi,-20(%ebp)
jg .L14
.L13:

The second version is faster (5-15%, depending on system cache issues). If there is something you don't like, you can always take the generated code, make it all "inline", and then clean things up by hand (like: why is ebx being moved in and out of the stack?). Of course, such minor optimizations are warranted only if this is still the system bottleneck...

There are also multi-media extensions for HP, Sun, and DEC workstations, which support (and have for many years) many similar multi-media instructions.
Data Alignment issues

MMX moves 8 bytes at a time, and the movq loads/stores are fastest when the data is 8-byte aligned. Since plain malloc makes no such promise, you can over-allocate and round the pointer up:

unsigned char* real_inimg = malloc(spaceneeded + 16);
inimg = (unsigned char *) ((unsigned long)(real_inimg + 8) &
                           (unsigned long) 0xFFFFFFF8);
The above works by advancing the pointer into the allocated data and then ensuring that the last 3 bits of the "image" address are 0. There are other ways to accomplish this too, including doing things in assembly. When you use this trick, remember to free real_inimg, not inimg. If you are using real C++, making the underlying data double or unsigned long long will align it properly, but "C" malloc is type independent. (Some operating systems provide memalign, a function to better address the alignment issue for C.)
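The same trick can be written portably (the helper name align8 is mine; uintptr_t replaces the 32-bit-only 0xFFFFFFF8 mask; POSIX systems also offer posix_memalign, which does the bookkeeping for you):

```c
#include <stdint.h>

/* Advance past the start of the block, then clear the low 3 bits so
 * the returned pointer is 8-byte aligned.  The block must have been
 * over-allocated by at least 8 bytes, and the caller must free() the
 * ORIGINAL pointer, not the aligned one. */
static unsigned char *align8(unsigned char *raw)
{
    return (unsigned char *)(((uintptr_t)(raw + 8)) & ~(uintptr_t)7);
}
```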
Some MMX related links (on the web):