Using MMX with the GNU tools (e.g. under Linux or FreeBSD)

Terry Boult
Vision and Software Technology Lab (VAST)
EECS Department, Lehigh University

MMX processing is an important tool for speeding up vision and image processing. This is a very brief introduction to using MMX with the GNU tools (i.e. on Linux or FreeBSD), and it presumes you already know why and how to do basic MMX coding. (There are good introductions to why and how MMX works; this is intended as an example for vision/image-processing types who already know what they want.) For general background you should also check out the Linux Parallel Processing HOWTO by Hank Dietz (ppLinux@ecn.purdue.edu).

An image-differencing with MMX example

We will discuss a simple code example. While simplified for this discussion, it is still a realistic example of using MMX under GNU tools (e.g. for Linux, FreeBSD, NetBSD or even Windows). It is based on some of the code we use in the Lehigh Omni-directional Tracking System (LOTS); see www.eecs.lehigh.edu/~tboult/TRACK.

The routine computes two outputs. The first, the output image, holds for each pixel the amount by which the absolute difference from a background reference exceeds a per-pixel threshold. If the absolute difference is at or below the threshold, the output pixel is 0. Logically we can view it as:

    if ( abs( inim[i] - refim[i]) <= (thresh + varim[i])  )   
        outim[i] = 0
    else outim[i] = abs( inim[i] - refim[i]) - (thresh + varim[i])

We also compute a per-row flag: if row_used[i] != 0, then at least one pixel in row i is above the threshold. This type of cached information is quite important in speeding up further processing. (In our real system the code is more complex: we handle different data formats, e.g. yuv422packed; we only process "cords" in each row; we make a second pass over each row to test against a secondary background; and we keep track of the first and last locations in each row that are above threshold with respect to both backgrounds. But these make the code much harder to follow.)
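To make the two outputs concrete, here is a plain C++ sketch of the same logic with no MMX at all. It is not part of the LOTS code; the name absdiff_reference is made up for illustration, and the clamp of the combined threshold at 255 is an assumption chosen to mirror the saturating byte add used in the MMX version later on. A slow routine like this is mainly useful for checking the output of the fast one.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical scalar reference (not from the original code): computes the
// thresholded absolute difference and the per-row "used" flag one pixel at
// a time.
void absdiff_reference(unsigned char* outim, int* row_used,
                       const unsigned char* inim, const unsigned char* refim,
                       const unsigned char* varim, unsigned char thresh,
                       int rows, int cols)
{
    for (int r = 0; r < rows; r++) {
        row_used[r] = 0;
        for (int c = 0; c < cols; c++) {
            int i = r * cols + c;
            int diff  = std::abs(int(inim[i]) - int(refim[i]));
            int limit = int(thresh) + int(varim[i]);
            if (limit > 255) limit = 255;   // assumption: saturate like a byte add
            int out = diff - limit;
            if (out < 0) out = 0;           // at or below threshold gives 0
            outim[i] = (unsigned char) out;
            row_used[r] |= out;             // nonzero if any pixel in the row is above
        }
    }
}
```

Because every step is done in int, nothing can wrap around, which is what makes it a trustworthy baseline.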

I am not a big fan of doing everything in assembly. I use inline assembly when I need it, so most of the setup/calling/looping code is straight C (or C++).

void absdiffmmx( // outputs
                 unsigned char* outim, // where we store output
                 int*        row_used, // a per-row flag if "used"
                 // inputs
                 unsigned char* inim,  // the input image
                 unsigned char* refim, // the reference image (input)
                 unsigned char* varim, // the per-pixel threshold (variance)
                 unsigned char thresh, // the global threshold
                 int rows, int cols)
{

The next part of the code sets up the global part of the threshold. As mentioned, the thresholding process combines a per-pixel threshold (varim) and the global threshold (thresh). We want to copy the global threshold into an MMX register and need 8 copies, one for each byte in the register. To do that we include the following:

  
  double mmtemp_d;  // gets it double aligned and allocates storage
  unsigned char* mmtemp= (unsigned char*) & mmtemp_d; //ease of access

  // set up 8 copies of the global threshold, ready to load into an MMX register
  for (int i=0;i<8;i++) {
    mmtemp[i] = thresh;
  }
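As an aside, the same 8-copy pattern can be built without a loop. The sketch below is an alternative, not what our code does; the helper name replicate_byte is made up, and it assumes a compiler that provides a 64-bit integer type through <cstdint>, which the older GNU compilers discussed here may not.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the original code): copy one byte into all
// 8 byte positions of a 64-bit word.  Multiplying by 0x0101...01 places the
// value in each byte; since value <= 255, no byte carries into the next.
std::uint64_t replicate_byte(unsigned char value)
{
    return std::uint64_t(value) * 0x0101010101010101ULL;
}
```

Since all 8 bytes are the same, byte order does not matter, so the result matches what the loop above stores in mmtemp.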

Before we show how to load that into the register, let's look at the general GNU/assembly interface. Remember that GNU (AT&T) syntax puts the operands in the opposite order from Microsoft (Intel) syntax; it reads

         op  source, dest

and when we use inline assembly we have the syntax:

    __asm__("formatted asm instructions"   : output_vars : input_vars : clobbered-registers);

for example

      __asm__ volatile("\n\t pxor %%mm3, %%mm3 \t # clear row-used flag"   : :
      );

This uses the parallel xor operation to clear register mm3. Note that the register's name is %mm3, but to get a % through the formatter you need to write it as %%. You can see we have included formatting (\n and \t) and comments (everything in the string after the #). Note that while the "clobbered-registers" parameter is discussed in the GNU tools documentation, it is currently not supported for MMX registers because the compiler does not include them in its register allocation scheme. Use of the volatile keyword is recommended for inline asm (I've never seen it make a difference, but YMMV).

While it is not in our code, a simple example showing the loading and storing of data is in order. It is very straightforward:

      __asm__ volatile("\n\t movl %0,%%esi" : : "m" (i) );

would move the contents of the "C" variable i into the %esi register and

      __asm__ volatile("\n\t movl %%ecx,%0" : "=m" (i) );

would move the contents of the %ecx register into the C variable i. Note that outputs go after the first colon (:) and are marked "=m", while inputs go after the second colon and are marked with just "m".
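Putting the load and the store together gives a small function you can actually compile and run. This is only a sketch: the name copy_through_register is made up, the choice of %eax is arbitrary, and the #if guard with a plain C fallback is my addition so the file still builds on non-x86 machines. Unlike the MMX registers, a general-purpose register can be listed as clobbered, so it is listed here.

```cpp
#include <cassert>

// Hypothetical example (not from the original code): move a C variable into
// a register and back out using "m" (input) and "=m" (output) operands.
int copy_through_register(int value)
{
    int result = 0;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ volatile(
        "movl %1, %%eax\n\t"   // load the input operand into eax
        "movl %%eax, %0\n\t"   // store eax into the output operand
        : "=m" (result)        // %0, output
        : "m" (value)          // %1, input
        : "eax");              // eax is overwritten, so tell the compiler
#else
    result = value;            // fallback when x86 inline asm is unavailable
#endif
    return result;
}
```

Note that the operands are numbered across both lists, outputs first, so the single output is %0 and the input is %1.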

In our code we define a macro for loading an MMX register (mostly because we use it in many places, and partly because we change definitions when we compile our code under VC6.0 on WIN32). Here is our GNU macro:

  // load the threshold in mm7.. example macro for loading just 1 register
  // note we use newline and tab so we get nice looking assembly...
#define MMX_load8byte_mm7(data) __asm__("\n\t movq %0,%%mm7\n" : : "m" (data))
  MMX_load8byte_mm7(mmtemp_d);

Now we need to loop over the image. We could do all the looping in assembly, but that is harder to visualize and debug, so I keep as much as reasonably possible in C. As we shall see, the generated assembly is quite good, so not much is lost by this choice. To make it easier to index and load, I abuse the data types and use double* to access the images. (A double on an x86 is 8 bytes per index, which is just right for MMX processing.) With that in mind, here is the start of the loop:

 
  for (int i=0;i < rows;i++) 
    {
      // Make them double pointers so they index easily 
      double *ipt = (double*) &inim[i*cols];
      double *opt = (double*) &outim[i*cols];
      double *rpt = (double*) &refim [i*cols];
      double *vpt = (double*) &varim[i*cols];

      int k = cols/8; // we will be using 8 chars at a time.
      int index=0; 


      // mm3 is used for a flag tracking if any pixels in the row are above 
      // threshold.  not really needed but it can make further  processing
      // faster by skipping empty rows.
      __asm__("\n\tpxor %%mm3, %%mm3 \t # clear row-used flag"   : :);

Now we get to the real MMX part. With 8 pixels packed in each MMX register, we don't want to test each entry individually (testing the insides of an MMX register is not possible without unpacking it back into a normal register). Instead, the algorithm computes abs(i-r) by computing both i-r and r-i using saturating subtraction. Thus, per pixel, one of the two values will be 0 and the other will be non-negative. Since the non-positive one is all 0 bits, a parallel bit-wise OR of the two subtraction results gives the absolute value of the difference. The code also adds the global and per-pixel thresholds together (with a saturating add) before subtracting the combined threshold from that absolute difference.
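If the OR trick looks suspicious, it can be checked one byte at a time in plain C++. The sketch below stands in for a single byte of the psubusb, paddusb and por instructions; the function names sat_sub, sat_add and thresholded_absdiff are made up for illustration and are not part of our code.

```cpp
#include <cassert>

// Scalar stand-in for one byte of psubusb: unsigned subtract, clamped at 0.
unsigned char sat_sub(unsigned char a, unsigned char b)
{
    return (a > b) ? (unsigned char)(a - b) : (unsigned char) 0;
}

// Scalar stand-in for one byte of paddusb: unsigned add, clamped at 255.
unsigned char sat_add(unsigned char a, unsigned char b)
{
    int s = int(a) + int(b);
    return (unsigned char)(s > 255 ? 255 : s);
}

// One pixel of the whole sequence: abs diff via OR, then subtract threshold.
unsigned char thresholded_absdiff(unsigned char inp, unsigned char ref,
                                  unsigned char var, unsigned char thresh)
{
    // One of the two differences is always 0, so OR yields the other one.
    unsigned char absdiff = (unsigned char)(sat_sub(ref, inp) | sat_sub(inp, ref));
    return sat_sub(absdiff, sat_add(var, thresh));
}
```

For example, thresholded_absdiff(200, 100, 10, 5) and thresholded_absdiff(100, 200, 10, 5) both give 85, and any pixel whose difference is at or below the combined threshold gives 0.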

In the code we allocate the MMX registers as follows:

mm0 is the ref image,
mm1 is the input
mm2 is a copy of reference (so we can do i-r and r-i destructively)
mm3 is the row used "flag"
mm4 is the per-pixel threshold (variance)
mm7 is the global threshold (8 copies)

Note I choose not to use low-level registers in my code; I let the g++ optimizer choose what variables to store in what general-purpose registers. I only control the MMX registers. We will come back to this choice later.

You can see that the inline code includes formatting and comments, which are not required but which I find make the generated assembly much easier to read. The comments include the expected pipe position (u or v) on a regular MMX processor. The actual pipeline allocation might be different on a K6 or PII, where the underlying hardware is a tad different. (It definitely is different on K6/K7, as their MMX pipes are more flexible.) Note that two consecutive lines marked u issue in different cycles, but a u/v pair is issued at the same time. We could hand optimize for better pipe usage on a particular architecture and gain somewhere between 0 and 7%.

Here is the main processing loop (run once per row):

      while(index < k){ // run across the row
        __asm__( // instruction             #pipe    comment                  
                "\n\t movq %1,%%mm0       \t# u  load ref  "
                "\n\t movq %2,%%mm1       \t# u  load input "
                "\n\t movq %%mm0,%%mm2    \t# v copy ref "
                "\n\t movq %3,%%mm4       \t# u  load var image data "
                "\n\t psubusb %%mm1, %%mm0\t# v  ref - inp "
                "\n\t psubusb %%mm2, %%mm1\t# u inp - ref  ?subtract stall? "
                "\n\t paddusb %%mm7,%%mm4 \t# v add base threshold to variance"
                "\n\t por %%mm1, %%mm0    \t# u get abs diff (via or) "
                "\n\t psubusb %%mm4,%%mm0 \t# u subtract with saturate thresh"
                "\n\t movq %%mm0,%0       \t# u store result, "
                "\n\t por %%mm0, %%mm3    \t# v mark row used if it was "
		: "=m" (opt[index])  // this is %0,  output
		: "m" (rpt[index]),     // and %1   reference 
		  "m" (ipt[index]),     // and %2   input
		  "m" (vpt[index])      // and %3   per-pixel-threshold
		// the "=m" implies it is output.. just "m" is input
		// see the gcc info pages for more details...
		);  
	index++ ;
      }
      // get the per-row flag back into memory (mmtemp_d)
      __asm__("\n\tmovq %%mm3,%0" : "=m" (mmtemp_d));
      
      row_used [i]=0; // could be faster by playing int games, but why bother
      for (int ii=0;ii < 8;ii++) row_used [i] |= mmtemp[ii];

    } // end  for loop over all rows

  // ok, done with MMX,reset it to normal floating point usage
   __asm__("emms" : : );

} // end of function absdiffmmx
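The comment in the listing about "playing int games" refers to collapsing the 8 flag bytes in one step instead of OR-ing them one at a time. A minimal sketch of that idea follows; the helper name any_byte_set is made up, and it assumes a compiler with <cstdint> and an 8-byte double.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "this sketch assumes an 8-byte double");

// Hypothetical helper (not from the original code): returns 1 if any of the
// 8 bytes stored in *mmtemp_d is nonzero, 0 otherwise.
int any_byte_set(const double* mmtemp_d)
{
    std::uint64_t bits;
    std::memcpy(&bits, mmtemp_d, sizeof bits);  // copy the raw bytes out
    return bits != 0;
}
```

The bytes are copied into an integer rather than compared as a double because some nonzero bit patterns (negative zero, for one) compare equal to 0.0 as floating point. Inside the function above this would replace the byte loop with row_used[i] = any_byte_set(&mmtemp_d).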



We could include loop unrolling and gain a little more in performance (3-5%),
but then I would need to manage the index registers myself.  The primary gain
of unrolling here is not from reduced loop overhead; rather, by unrolling we
would make better use of the "v" pipes by starting the second set of loads
at the instruction where we get the abs diff via parallel or, i.e. after the
line
 "\n\t por %%mm1, %%mm0    \t# u get abs diff (via or) "


This concludes the example.    




Some hints for working with GNU and MMX

  First of all, as this is not commonly used stuff, some of the
  compilers/binutils may have problems.  With binutils 2.9 or later and a
  recent version of g++ (2.8.1 or better) you may be OK.  You can test it
  with the above example.
  Note that egcs-2.91.66, which ships with RedHat 6.0 (and a few other
  egcs/g++ versions), does not properly handle MMX in the assembly
  pass-through.  It should replace the %%mm5 with %mm5 as it processes,
  but it does not.  (If you get error messages about "bad register name
  ('%%mm3')" or something like it, you know you have the broken version.)
  The workaround is to compile to assembly, use an editor or sed to replace
  %%mm with %mm, and then assemble the result, e.g.

         g++ -O3 -S -o absdiffmmx-tmp.s absdiffmmx.cc
         sed -e 's/%%mm/%mm/g' < absdiffmmx-tmp.s > absdiffmmx.s
         g++ -O3 -c -o absdiffmmx.o absdiffmmx.s

  Also note that this "bug" breaks all lines which try to use the
  "clobber-register" approach (for which I know no workaround).

  
  Note that some level of optimization will probably be needed for inline
  assembly with MMX.  (Without it you may get error messages such as
  "fixed or forbidden register 7 (sp) was spilled for class GENERAL_REGS.")
  Since GNU supports optimization and debugging together, it's not a real
  issue.



Looking at your assembly, and code tuning

  Also note the above shows how to generate assembly so you can check the
  quality of your code and decide if there is more to do for optimization.
  Note the asm code passed through will show up in the .s file between
  #APP and #NO_APP markers.  For example, our row-flag clear shows up as
#APP
	pxor %mm3, %mm3 	 # clear row-used flag
#NO_APP

You can also add assembly labels (.labelname) to make it easier to find your
code and also to help with profiling.
You will notice that I did not spend time optimizing my loops and writing
the whole loop in assembly.  I prefer to keep most of it in C and then
occasionally check the generated code to see if it is efficient: I compile
to assembly and look at it.  Note that the GNU compiler WILL optimize
around your MMX block, including the data loading.  For example, if we
compile the above code with -O we get assembly code for the main loop that
looks like:
#APP
	pxor %mm3, %mm3 	 # clear row-used flag
#NO_APP
	cmpl %edx,%eax
	jge .L13
	.align 4
.L14:
	movl -28(%ebp),%esi
	movl -24(%ebp),%edi
	movl -16(%ebp),%ecx
	movl -20(%ebp),%ebx
#APP
	 movq (%edi,%eax,8),%mm0       	# u  load ref  
	 movq (%ecx,%eax,8),%mm1       	# u  load input 
	 movq %mm0,%mm2    	# v copy ref 
	 movq (%ebx,%eax,8),%mm4       	# u  load var image data 
	 psubusb %mm1, %mm0	# v  ref - inp 
	 psubusb %mm2, %mm1	# u inp - ref  ?subtract stall? 
	 paddusb %mm7,%mm4 	# v add base threshold to variance
	 por %mm1, %mm0    	# u get abs diff (via or) 
	 psubusb %mm4,%mm0 	# u subtract with saturate thresh
	 movq %mm0,(%esi,%eax,8)       	# u store result, 
	 por %mm0, %mm3    	# v mark row used if it was 
#NO_APP
	incl %eax
	cmpl %edx,%eax
	jl .L14
.L13:

but if we use -O2 we get assembly code that looks like:
#APP
	
	pxor %mm3, %mm3 	 # clear row-used flag
#NO_APP
	cmpl %eax,%esi
	jge .L13
	movl -32(%ebp),%ebx
	movl %ebx,-40(%ebp)
	movl -24(%ebp),%ecx
	movl -36(%ebp),%edx
	movl -28(%ebp),%eax
	.align 4
.L14:
	movl -40(%ebp),%ebx
#APP
	 movq (%eax),%mm0       	# u  load ref  
	 movq (%edx),%mm1       	# u  load input 
	 movq %mm0,%mm2    	# v copy ref 
	 movq (%ecx),%mm4       	# u  load var image data 
	 psubusb %mm1, %mm0	# v  ref - inp 
	 psubusb %mm2, %mm1	# u inp - ref  ?subtract stall? 
	 paddusb %mm7,%mm4 	# v add base threshold to variance
	 por %mm1, %mm0    	# u get abs diff (via or) 
	 psubusb %mm4,%mm0 	# u subtract with saturate thresh
	 movq %mm0,(%ebx)       	# u store result, 
	 por %mm0, %mm3    	# v mark row used if it was 
#NO_APP
	addl $8,%ebx
	movl %ebx,-40(%ebp)
	addl $8,%ecx
	addl $8,%edx
	addl $8,%eax
	incl %esi
	cmpl %esi,-20(%ebp)
	jg .L14
.L13:

The second version is faster (5-15%, depending on system cache issues).
If there is something you don't like, you can always take the generated code,
make it all inline, and then clean things up by hand (like, why is ebx
being moved in and out of the stack?).  Of course, such minor optimizations
are warranted only if this is still the system bottleneck...



Data alignment issues

 As mentioned, getting alignment of data is important.  For MMX and PII, only
 double (8-byte) alignment is needed.  To achieve this I do something like

  unsigned char* real_inimg = (unsigned char*) malloc(spaceneeded + 16);
  inimg = (unsigned char *) ((unsigned long)(real_inimg + 8) &
				~(unsigned long) 7);

The above works by advancing the pointer into the allocated data and then
ensuring that the last 3 bits of the image address are 0.  (The mask is
written as ~7 rather than 0xFFFFFFF8 so that it also works where pointers
are wider than 32 bits.)  There are other ways to accomplish this too,
including doing things in assembly.
When you use this trick remember to free real_inimg, not inimg.  If you are
using real C++ and you make the underlying data double or unsigned long long,
it will align itself properly, but C malloc is type independent.  (Some
operating systems provide memalign, a function to better address the
alignment issue for C.)
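Since it is easy to free the wrong pointer with this trick, one option is to keep the two pointers together. The following is a sketch only: AlignedBuffer, alloc_aligned8 and free_aligned8 are made-up names, and it assumes a compiler that provides std::uintptr_t in <cstdint>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper type (not from the original code).
struct AlignedBuffer {
    unsigned char* raw;    // what malloc returned; this is what gets freed
    unsigned char* data;   // 8-byte-aligned pointer to use for the image
};

// Allocate at least `bytes` usable bytes starting on an 8-byte boundary.
AlignedBuffer alloc_aligned8(std::size_t bytes)
{
    AlignedBuffer buf;
    buf.raw  = static_cast<unsigned char*>(std::malloc(bytes + 8));
    buf.data = 0;
    if (buf.raw) {
        std::uintptr_t addr    = reinterpret_cast<std::uintptr_t>(buf.raw);
        std::uintptr_t aligned = (addr + 7) & ~static_cast<std::uintptr_t>(7);
        buf.data = buf.raw + (aligned - addr);   // skip 0 to 7 bytes
    }
    return buf;
}

// Free through the original pointer and clear both fields.
void free_aligned8(AlignedBuffer& buf)
{
    std::free(buf.raw);
    buf.raw  = 0;
    buf.data = 0;
}
```

This version rounds up to the next multiple of 8 rather than adding 8 and rounding down, so it needs only 8 bytes of padding instead of 16; either approach leaves the low 3 address bits at 0.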

Some MMX related links (on the web):

Intel Pentium II & Pentium with MMX Application notes (MultiMedia eXtensions). This has many code examples.
Intel MMX Manuals
AMD K7 and MMX & 3D Now documentation
Cyrix M2 Application notes, including their MMX (MultiMedia eXtensions)

There are also multimedia extensions for HP, SUN, and DEC workstations, which
support (and have for many years) many similar multimedia instructions.