Guidelines for developing MMX code

The following guidelines will help you develop fast and efficient
MMX code that scales well across all processors with MMX technology.

Section numbers refer to
<URL:http://www.intel.com/drg/manuals/mmx/dg/>


Rules
- Use a current-generation compiler that produces optimized code.
This will help you generate good code from the start.

- Avoid partial register stalls. See Section 3.2.4.
- Pay attention to the branch prediction algorithm. See Section
3.2.5.
This is the most important optimization for dynamic execution
(P6-family) processors. By improving branch predictability, your
code will spend fewer cycles fetching instructions.
- Schedule your code to maximize pairing. See Section 3.3.

- Make sure all data are aligned. See Section 4.6.
- Arrange code to minimize instruction cache misses and optimize
prefetch. See Section 3.5.

- Do not intermix MMX instructions and floating-point instructions.
See Section 4.3.1.
- Avoid prefixed opcodes other than 0F. See Section 3.2.3.
- Avoid small loads after large stores to the same area of memory.
Avoid large loads after small stores to the same area of memory.
Load and store data to the same area of memory using the same data
sizes and address alignments. See Section 3.6.1.

- Use the OP REG, MEM format whenever possible. This format helps
to free registers and reduce cycles without generating unnecessary
loads. See Section 3.4.1.

- Always put an EMMS instruction at the end of every block of MMX
instructions. See Section 4.4.
- Optimize cache data bandwidth to MMX registers. See Section 3.6.
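
As an illustration of the EMMS rule and the rule against intermixing
MMX and floating-point instructions, here is a minimal C sketch. It
assumes a compiler that provides <mmintrin.h> and defines __MMX__ on
MMX-capable targets (GCC and Clang do); on other targets it falls back
to scalar code. The names add_samples_sat and sat_add16 are
hypothetical and do not come from the Intel guide.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Hypothetical scalar helper: saturating 16-bit signed add. */
static int16_t sat_add16(int16_t x, int16_t y)
{
    int32_t s = (int32_t)x + (int32_t)y;
    if (s > INT16_MAX) s = INT16_MAX;
    if (s < INT16_MIN) s = INT16_MIN;
    return (int16_t)s;
}

/* Hypothetical example: dst[i] = saturate(a[i] + b[i]) for n samples. */
void add_samples_sat(int16_t *dst, const int16_t *a, const int16_t *b,
                     size_t n)
{
    size_t i = 0;
#if defined(__MMX__)
    /* Keep all MMX work together in one block: four words per step. */
    for (; i + 4 <= n; i += 4) {
        __m64 va, vb, vr;
        memcpy(&va, a + i, sizeof va);   /* 64-bit load  */
        memcpy(&vb, b + i, sizeof vb);
        vr = _mm_adds_pi16(va, vb);      /* PADDSW       */
        memcpy(dst + i, &vr, sizeof vr); /* 64-bit store */
    }
    /* EMMS: end the MMX block before any floating-point code runs. */
    _mm_empty();
#endif
    /* Scalar tail (or the whole array when MMX is unavailable). */
    for (; i < n; ++i)
        dst[i] = sat_add16(a[i], b[i]);
}
```

Because the function executes EMMS before it returns, a caller can use
floating-point arithmetic on the results right away. Issuing EMMS once
after the loop, rather than once per iteration, keeps the cost of the
MMX-to-floating-point transition out of the inner loop.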




Suggestions
- Arrange code so that forward conditional branches are usually not
taken, and backward conditional branches are usually taken.
- Align frequently executed branch targets on 16-byte boundaries.

- Unroll loops to schedule instructions.
- Use software pipelining to schedule latencies and functional units.
- Always pair CALL and RET (return) instructions.

- Avoid self-modifying code.
- Avoid placing data in the code segment.
- Calculate store addresses as soon as possible.
- Avoid instructions that contain three or more micro-ops or
instructions that are more than 7 bytes long. If possible, use
instructions that require one micro-op.
- Avoid using two 8-bit loads to produce a 16-bit load.

- Cleanse partial registers before calling callee-save procedures.
- Resolve blocking conditions, such as store addresses, as far away
as possible from the loads they may block.

- In general, an N-byte quantity that is directly supported by the
processor (8-bit bytes, 16-bit words, 32-bit doublewords, and
32-bit, 64-bit, and 80-bit floating-point numbers) should be
aligned on the next higher power-of-two boundary. Avoid
misaligned data.
-- Align 8-bit data on any boundary.
-- Align 16-bit data to be contained within an aligned 4-byte word.
-- Align 32-bit data on any boundary which is a multiple of four.
-- Align 64-bit data on any boundary which is a multiple of eight.
-- Align 80-bit data on a 128-bit boundary (that is, any boundary
which is a multiple of 16 bytes).
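
The alignment rules above can be written down as a small address
check. This is a sketch only: the function name is_well_aligned is
hypothetical, sizes are given in bytes (10 stands for the 80-bit
format), and the code covers just the sizes listed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if an object of `size` bytes placed
 * at address `addr` satisfies the alignment rules listed above, and
 * 0 otherwise (including for sizes the list does not cover). */
int is_well_aligned(uintptr_t addr, size_t size)
{
    switch (size) {
    case 1:  /* 8-bit data: any boundary */
        return 1;
    case 2:  /* 16-bit data: both bytes inside one aligned 4-byte word */
        return (addr & 3u) != 3u;
    case 4:  /* 32-bit data: multiple of four */
        return (addr & 3u) == 0;
    case 8:  /* 64-bit data: multiple of eight */
        return (addr & 7u) == 0;
    case 10: /* 80-bit data: multiple of 16 bytes (128 bits) */
        return (addr & 15u) == 0;
    default:
        return 0;
    }
}
```

For example, a 16-bit value at offset 1 within an aligned 4-byte word
passes the check, while one at offset 3 fails because its second byte
falls in the next word.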