[LC++]optimizing array access: x86 instruction timings

Thu Jan 29 14:29:02 UTC 2004

Unfortunately my book on Intel instructions is at home, but I did notice 
a couple of things...

Dr Mark H Phillips wrote:

> noticed that with indexing you get assembler
> commands like:
> 
>         movl    $0, -808(%ebp,%eax,4)
> 
> whereas with pointer incrementing you get commands like:
> 
>         movl    $1, (%edx)
> 
> although later you have to update the pointer with:
> 
>         addl    $4, %edx
 >
> But the question is: do these instructions have different
> timings, or are they all the same?  Ie is the addressing mode
> used for indexing expensive enough that pointer incrementing
> is worth the effort, or not?

I think you're pretty safe assuming that the pointer increment makes the 
later method more expensive.  Both of the movl operations shown above 
put a value into a memory address calculated from a register.  This is 
*probably* going to come out at the same speed even though the first 
movl has to do arithmetic to get the address and the second one doesn't. 
  I believe that this takes the same time because of the superscalar 
design of Intel CPUs.

That would make the addl the differentiating factor.

> But I haven't been able to find much in the way of x86 timings
> for instructions such as these, and what information I have
> found is conflicting.  Ie, one place it said that from around
> the 80386 and later all addressing modes took the same amount
> of time, meaning pointer incrementing is a bad solution.  But
> elsewhere seemed to indicate this was not true.  Does anyone
> know / is anyone able to point me to a good source for this
> kind of information?

All I can offer is hearsay, but I'm pretty sure I've seen docs on 486s 
which said that all addressing modes take the same period of time.

Either way, it's very difficult to quantify these time periods due to 
the nature of pipelined, superscalar architectures.  Depending on what 
else you're doing in your loop you may also find that out-of-order 
execution and pipelining could make any efficiencies like this 
completely redundant.  If the addl were followed by an unrelated 
long-running operation then the addl might even occur in 0 effective 
time!  :-)

In other words, the CPU is designed in such a way as to make questions 
like this purely hypothetical.  If it really *is* an issue then the best 
solution is to try both methods and profile them.  However, I'd really 
be surprised to discover that this was a bottleneck in any program.

Now that I think about it, all you're doing is initialising a block of 
memory.  Isn't there an MMX instruction for this?

-- 
Regards,
Paul Gearon

Software Engineer                Telephone:   +61 7 3876 2188
Plugged In Software              Fax:         +61 7 3876 4899
http://www.PIsoftware.com        PGP Key available via finger

Catapultam habeo. Nisi pecuniam omnem mihi dabis, ad caput tuum saxum
immane mittam.
(Translation from latin: "I have a catapult. Give me all the money,
or I will fling an enormous rock at your head.")