[LC++]optimizing array access: x86 instruction timings

Sun Feb 15 15:51:59 UTC 2004

On Thu, 29 Jan 2004, Dr Mark H Phillips wrote:

> Hi,
> 
> I am trying to work out efficiency issues with using
> loops.
> 
[...]

If you want to look at what GCC is actually going to do, use g++ -O2 -S;
with no optimization GCC is going to do the stupid straightforward
translation from C to asm (which is typically what you want with -O0). That
won't be helpful for comparing the two methods.

My guess is that modern compilers (recent g++ being kind-of-but-not-really
in that category) will have an easier time optimizing the indexed version.
For example, Intel C++ will probably be able to convert the indexed version
to use SSE2 (though I don't know the state of integer-based vectorization
on recent ICC), while the pointer version will confuse it and it will use
the normal stuff. And the docs for KAI always recommended writing code that
was as straghtforward and easy to read as possible because the optimizer
would have the easiest time with that (that probably apples to all
EDG-based compilers).

> But I haven't been able to find much in the way of x86 timings
> for instructions such as these, and what information I have
> found is conflicting.  Ie, one place it said that from around
> the 80386 and later all addressing modes took the same amount
> of time, meaning pointer incrementing is a bad solution.  But
> elsewhere seemed to indicate this was not true.  Does anyone
> know / is anyone able to point me to a good source for this
> kind of information?

The only semi-reliable source is the docs put out by Intel/AMD; these have
stuff like instruction timing info and so on. However, there have been
errors in the past (such as, IIRC, the i586 was supposed to do a constant
time multiply, when in fact it was not). Keep in mind that the instruction
timing and scheduling of x86 chips can vary a lot, even if you just
consider the current high-end of P4, Athlon-XP, and Athlon64 (ignoring
P-III, K6, VIA, P-Pro, etc). The P4 in particular looks very different than
the P-III or K7, so anything discussing earlier chips really doesn't apply
to it.

> As an alternative approach I tried to write some test programs 
> and time them.  Unfortunately I've been having problems with 
> this as well.  To get test programs that take a reasonable 
> time to run one needs to write redundant code, ie code that 
> does a useless thing over and over.  Running the compiler 
> with optimization can get rid of a lot of this redundancy, 
> hence stuffing up the timings.  But we want to compare the 
> timings of different methodologies _with_ optimization on.  
> There are probably ways of doing this, but it seems non 
> trivial.

Often compilers are too smart for their own good. Instead of initializing
everything to 1, use rand() [and srand() at the start of main, of course].  
Compilers won't inline that (or at least I've never seen one that did), so
it will pretty much have to compute everything. It will run slower, of
course, but rand() should take basically constant time run-to-run, so
whichever is faster with rand() should also be faster with 1 (or whatever).

At the end sum up all the values (or something like that), and printf the 
results. If a program produces no output (more precisely, no side-effects), 
then an optimizer can basically reduce your program down to 'int main() {}'.
At least in theory. That would have to be one smart optimizer, though.

-Jack