Intel C++ Compiler


t0t3r
05-14-2004, 06:14 AM
Does anyone have experience with the official Intel C++ compiler for Linux? On the site they claim something like a 30% performance gain.

JayMan8081
05-14-2004, 06:24 AM
We use it at my work for some of our larger programs. It does give a performance boost, but I wouldn't say that it's 30%. We've found that using good optimizations with g++ also increases performance to within a couple percent of the Intel compiler's performance.

t0t3r
05-14-2004, 06:44 AM
So is it worth a try? For compiling the kernel, etc.?

JayMan8081
05-14-2004, 08:30 AM
I would wait and see what some other people have to say. It might give a better performance boost than what I saw at work if it's something where the libraries being used can be recompiled with icc as well. The application we use it for uses OpenGL and wxWidgets, so part of the lack of performance gain might be in those libraries. I know we did recompile wxWidgets with icc. If I remember correctly, the compiler had a pretty hefty cost too, so I don't know that it would be worth it for a home user.

t0t3r
05-14-2004, 03:25 PM
For home users it's free; there's a version for non-commercial use. So if you also compile most of the needed libraries with icc, there should be some performance gain.

JayMan8081
05-14-2004, 04:31 PM
I would give it a try the next time you set up a system and see if it really makes a difference. I wonder if there would be any way to do something like a Gentoo stage1 install using the Intel compiler, so that everything would be extremely optimized.

Strogian
05-14-2004, 04:37 PM
I wonder if there would be any way to do something like a Gentoo stage1 install using the Intel compiler, so that everything would be extremely optimized.

This is Linux ... of course there is a way!

t0t3r
05-14-2004, 05:06 PM
I thought of something like that too: a "complete" system compiled with icc! In fact I just downloaded it, and I will try to get it installed and work with it.

GaryJones32
05-14-2004, 11:30 PM
This is only worth the trouble if you are running a P4.
If you are running a P4, 30% is a very low-end estimate of the performance increase.
I did my kernel with it and it's like a different machine.
The kernel throughput is amazing -- everything is zing bling.
It's like night and day -- for some programs I would say into the 100% increase range and beyond.
I saw on the web a P4 benchmark with some common benchmark program that came out at a 600% performance gain when they began ramping up icc optimization flags (they had to put a *possible error* footnote on the results, it was so unbelievable) -- like always, in real life high levels of optimization break things.
Although I think a better understanding of the flags might help: -ipo seems to create ld errors, and when I try to use -O3 with the kernel it automatically scales it back to -O2.
The compiler uses vectorization and the P4 SSE2 registers and, most importantly of all, the multithreaded P4 is recognized as two different processors by the kernel, so if you write, say, a simple (for) loop, icc will fork out each aspect of the loop and do the entire thing as parallel threads all at once -- YEA !!!!!!!!! That's what I'm talking about. Needless to say I'm impressed. And no, I don't think it's up to the programmer to do that kind of crap -- if that were the case we could just all write machine code and forget higher-level languages altogether.

As for compiling the system with it.
It's real picky about stuff gcc will let you get away with, like funky memory allocation and the like, so the Linux core will not compile straight off without major hacking troubles, I bet. I have observed the Gentoo people getting pretty far but ending up with a segfaulting mess.

There is a utility on the net called gccicc that will allow both compilers to do stuff and can accept flags for both in a --- divided flag line ---
(failover to gcc when icc gets stuck, either for files you choose or you can set it to do it automagically)
(the binary code is compatible). Google for the icc kernel patches and you will find gccicc.
Major hacking of Makefiles ahead to use it a lot,
and possibly still object linking troubles with previous gcc-produced files
or just plain runtime troubles.
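
For reference, the kind of simple loop being talked about might look like the sketch below. This is only an illustration: the function name saxpy and its signature are made up for this example, and whether a given compiler vectorizes it with SSE2 or splits it across threads depends on the compiler version and on the flags used, none of which is verified here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example loop: y[i] = a * x[i] + y[i].
 * Every iteration is independent of the others, which is the property
 * an optimizing compiler needs before it can process several elements
 * at once. restrict (C99) promises that x and y do not overlap. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Calling saxpy(4, 2.0f, x, y) with x = {1, 2, 3, 4} and y = {10, 20, 30, 40} leaves y holding {12, 24, 36, 48}, whichever way the compiler chooses to schedule the loop.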

t0t3r
05-15-2004, 07:54 AM
Uhm, very interesting... sounds incredible. As far as the kernel is concerned, are there any problems compiling it? Any hacking or patching to do beforehand? Can you post a little howto on how to do it? I've only worked with gcc so far,
but after your post I'm really keen on testing icc.

t0t3r
05-15-2004, 08:52 AM
OK, I now have icc for non-commercial use, with a license file sent by mail from Intel. I installed icc, then I tried to install this FlexLM license server... all is done. But when I try to start ./iccbin it says no license file could be obtained.

Dunno if I actually need a license. How do I get that thing working, and how can I test it with a simple C prog?

Moroni
05-15-2004, 09:08 AM
When you refer to LFS running a lot better than Fedora, what do you mean?

I just installed Mandrake on my laptop, and was thinking of going to Red Hat so I can use the Ximian Desktop, but your post made me think about trying something else to get better performance. That said, I will use my Linux box at work not for development but for my usual duties (monitoring systems through VNC, email, office documents, browsing several company sites, etc.).

Do you think it is worth the pain to get icc in place in my distribution for these purposes only?

Awaiting your comments and, if possible, a detailed How-To :)

t0t3r
05-15-2004, 09:18 AM
I think compiling a whole system is a really hard struggle... but compiling the most-used libraries, the kernel and progs should already give a performance boost.

lol, but in fact the icc installation sucks hard if you do not have an rpm-based system.

maccorin
05-15-2004, 09:41 AM
Originally posted by GaryJones32
I saw on the web a P4 benchmark with some common benchmark program that came out at a 600% performance gain when they began ramping up icc optimization flags.

I would _love_ to see these supposed benchmarks, please link.

GaryJones32
05-15-2004, 10:37 PM
Questions -- that's fair -- i started it !

As far as the kernel is concerned

need kernel patch -- works for kernel 2.6.3 and some older
http://www.pyrillion.org/index.html?showframe=linuxkernelpatch.html
comes with gccicc and instructions -- works fine
to compile nvidia for icc kernel
./NVIDIA-Linux-x86-1_0-5336-pkg1.run -x
cd ./NVIDIA-Linux-x86-1_0-5336-pkg1
export IGNORE_CC_MISMATCH=1
export CC=gccicc
export ICC2GCCFILES="DELEGATE"
make install

When you refer to LFS running a lot better than Fedora, what do you mean?

LFS compiled on the machine runs lots faster and everything works perfectly.
It takes a long time to build and you need a blank partition.
Lots more simple and to the point.
Start off with building your own kernel -- that will help a lot.
The Fedora kernel even has filesystem debugging turned on -- are they crazy???

I would _love_ to see these supposed benchmarks, please link.

OK, first, by the people that supposedly write gcc (allegedly; it is rumored but
unsubstantiated as of yet):
http://gcc.gnu.org/ml/gcc/2004-05/msg00021.html
shows icc wins by 550% on the mole test,
200% on alma,
about 50% overall.
I don't really remember the exact site I mentioned, but I can google:
http://www.coyotegulch.com/reviews/intel_comp/intel_gcc_bench2.html
This one shows the older Intel 7; sometimes it's a small amount behind gcc -- sometimes it's as much as 400% faster.
http://www.mcsr.olemiss.edu/parallelogram/01_03/icc.html
I quote from this one, on the Whetstone test with -wp_ipo:

This is a truly amazing performance increase, approximately 675%! It is
certainly not typical, however, so your mileage may vary.

I think olemiss is a supposed university.

GaryJones32
05-15-2004, 10:42 PM
Originally posted by t0t3r
OK, I now have icc for non-commercial use, with a license file sent by mail from Intel. I installed icc, then I tried to install this FlexLM license server... all is done. But when I try to start ./iccbin it says no license file could be obtained.

Dunno if I actually need a license. How do I get that thing working, and how can I test it with a simple C prog?

Just put a copy of that license in /opt/intel_cc_80/licenses.
If that doesn't work, set the environment variable
INTEL_LICENSE_FILE=/opt/intel_cc_80/licenses

maccorin
05-16-2004, 03:35 AM
Ok, interesting.

I would like to point out that most of those benchmarks are by people that simply don't understand optimization on gcc. While I do expect icc is better by a bit, I doubt it's that much.

Example: one of the benchmarks uses -O9... is he stupid? Or did he just not RTFM?

All of them used at least -O3 and -funroll-loops (which is pointless if you're using -O3 anyway...).

This makes for large amounts of cache thrashing, and it has been shown that -Os or sometimes even -O2 is faster (in some applications; it really depends on your code).

another interesting note is _none_ of them used -mfpmath=sse, which would probably give you the closest results to what icc does, as it uses the sse floating point instruction set.

now, that said, icc probably _is_ gonna put out a bit faster code for an intel cpu, but the 600% statement and the like is just unrealistic.
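
To make the -funroll-loops argument concrete, here is a hand-written sketch of what unrolling does to a loop. It is an illustration only: the function names and the unroll factor of 4 are arbitrary choices for this example, and the compiler performs its own version of this transformation internally rather than anything you would write by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one add plus one compare-and-branch per element. */
long sum_rolled(const int *a, size_t n)
{
    assert(a != NULL);
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += a[i];
    return total;
}

/* Unrolled by 4: fewer branches per element, but a larger loop body
 * (more instruction bytes), plus a cleanup loop for the leftovers. */
long sum_unrolled4(const int *a, size_t n)
{
    assert(a != NULL);
    long total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }
    for (; i < n; ++i)   /* remaining 0 to 3 elements */
        total += a[i];
    return total;
}
```

Both functions return the same sum. The trade-off being argued about is the bigger body of the unrolled version against the branches it saves.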

t0t3r
05-16-2004, 03:59 AM
OK... as I can't get icc running on my Debian system, I will check out icc on another machine. rpm really sucks. Could someone recompile a kernel with icc for me if I send the config? I just want to see if it's worth the trouble, do some performance benchmarks, etc.

Would be great...

t0t3r
05-16-2004, 08:14 AM
With a workaround I got it running...

When I want to compile the kernel after patching it (which went through without errors), I get the following message:

CHK include/linux/compile.h
UPD include/linux/compile.h
CC init/version.o
CC init/do_mounts.o
LD init/mounts.o
/bin/sh: line 1: xild: command not found
make[1]: *** [init/mounts.o] Error 127
make: *** [init] Error 2



Do I have to set some variables before doing "make bzImage"?


In fact I tried with a 2.4.20 kernel and I also get errors:

a /usr/src/linux-2.4.20/arch/i386/lib/lib.a \
--end-group \
-o vmlinux
net/network.o(.text+0x981f): In function `br_write_unlock':
: undefined reference to `__br_lock_usage_bug'
net/network.o(.text+0x9837): In function `br_write_lock':
: undefined reference to `__br_lock_usage_bug'
net/network.o(.text+0x16c13): In function `br_write_unlock':
: undefined reference to `__br_lock_usage_bug'
net/network.o(.text+0x16c2b): In function `br_write_lock':
: undefined reference to `__br_lock_usage_bug'
net/network.o(.text+0x3fb4b): In function `br_write_unlock':
: undefined reference to `__br_lock_usage_bug'
net/network.o(.text+0x3fb63): more undefined references to `__br_lock_usage_bug' follow
make: *** [vmlinux] Error 1


..... any ideas?

Strogian
05-16-2004, 11:05 AM
So what you are saying is, if you don't know much about gcc optimizations, (which I'll bet describes most people ;) including me ) then icc will give you a 400% speed boost?

t0t3r
05-16-2004, 12:33 PM
lol... very realistic values. I spoke with a friend who works on Intel machines and has some experience with different compilers. icc will give you a speed boost; it produces better/faster optimized code compared to gcc. In fact the performance gain is approximately between 10% and 20%... on some apps more than 500%.

He had a prog compiled with g++ that took 12 secs to do the job; compiled with icc it took 0.2 secs. But such examples are very rare.

GaryJones32
05-16-2004, 04:42 PM
Originally posted by maccorin
Ok, interesting.

I would like to point out that most of those benchmarks are by people that simply don't understand optimization on gcc

Is this to say they DO know a lot about optimizing with icc but DON'T know a lot about optimizing with gcc ???????

That seems a little biased and presumptive, yes.


_none_ of them used -mfpmath=sse

This is true and a valid point -- I noticed that too.
P4 uses SSE2, not SSE,
so the flag should be -mfpmath=sse2.
However, to simply say that would make the two equal, without data, is invalid.
-O9 harms nothing and is exactly the same as -O3.
-funroll-loops is not a part of -O3 and is a good flag to use if the loop isn't too big for the cache.
Cache thrashing of course is an issue, especially for the P4 with its shared cache plus parallelism. It also has to do with SMP kernel scheduling, or the choice of processors during scheduling.
Obviously, by the benchmarks, icc is using parallelism (if that's what -ipo does ???) and not causing thrashing.
It's fair to assume that better flags for icc can also yield faster code than in the tests.
gcc does not use parallelizing of loops, so I think I fail to see how -funroll-loops can cause cache thrashing except in multi-threaded apps, and these simple artificial benchmarks are not multi-threaded, so I don't get how that flag is harming the results.
I don't think anybody is trying to say the flags used for either compiler are perfect for all situations or even perfect for the tests.

GaryJones32
05-16-2004, 06:25 PM
Originally posted by t0t3r
/bin/sh: line 1: xild: command not found
make[1]: *** [init/mounts.o] Error 127

In fact I tried with a 2.4.20 kernel and I also get errors:

net/network.o(.text+0x981f): In function `br_write_unlock':
: undefined reference to `__br_lock_usage_bug'
make: *** [vmlinux] Error 1

On the first one:
/opt/intel_cc_80/bin/xild
is a part of the Intel install, so you have to add /opt/intel_cc_80/bin
to your PATH variable and try that one again.

On the second one:
I get these problems with the object files not being compatible between the two compilers a lot, especially with optimizations.
You can try taking the -unroll flag out of the top-level Makefile in the +OPTFLAGS setting ??????
I don't even see -unroll in the list of flags for icc version 8, so I'm stumped on that one -- perhaps those patches were written for version 7 ????
Anyway, I use kernel 2.6.3, gcc 3.3.1 and icc version 8,
and it worked.
It says to use gcc 2.95 for the earlier kernel versions; perhaps that's why.

maccorin
05-16-2004, 07:10 PM
Originally posted by GaryJones32
Is this to say they DO know a lot about optimizing with icc but DON'T know a lot about optimizing with gcc ???????

That seems a little biased and presumptive, yes.

I don't know how much they know about icc, because I know nil about its flags.

_none_ of them used -mfpmath=sse

This is true and a valid point -- I noticed that too.
P4 uses SSE2, not SSE,
so the flag should be -mfpmath=sse2.

There is no -mfpmath=sse2 on gcc; this is one area where icc has it licked. But then again... sse2 is only useful if you have an Intel CPU ;p sse actually works on some of the later Athlons.

However, to simply say that would make the two equal, without data, is invalid.

That is true, just as "benchmarking" w/ dumb flags doesn't make anything valid.

-O9 harms nothing and is exactly the same as -O3.

I know that, but it just shows incompetence; not the type of person I would want to trust.

-funroll-loops is not a part of -O3

You got me there; I just checked the docs and it seems I was remembering incorrectly.

and is a good flag to use if the loop isn't too big for the cache.

It is too big 99% of the time. Remember we are talking x86 here; last I checked, my shiny new XP2800 only had a 512K cache.

Cache thrashing of course is an issue, especially for the P4 with its shared cache plus parallelism. It also has to do with SMP kernel scheduling, or the choice of processors during scheduling.

I am assuming you mean using multiple pipelines in the FPU by parallelism; if I'm wrong, correct me. But anyway, that has nothing to do w/ cache thrashing AFAIK (although there may be some way that it does that I'm missing).

Obviously, by the benchmarks, icc is using parallelism (if that's what -ipo does ???) and not causing thrashing.

See above.

It's fair to assume that better flags for icc can also yield faster code than in the tests.

That is true, but one of the things I have read about icc is that its defaults are much better chosen (for speed purposes) than gcc's. That would lead me to believe that you could optimize it a bit more, but not as drastically as you could gcc.

gcc does not use parallelizing of loops, so I think I fail to see how -funroll-loops can cause cache thrashing except in multi-threaded apps, and these simple artificial benchmarks are not multi-threaded

Let's see... 10 instructions looped, or 10000s of instructions. Think about that.

so I don't get how that flag is harming the results. I don't think anybody is trying to say the flags used for either compiler are perfect for all situations or even perfect for the tests.

That's true, but I have read reports online of a 20% increase, a 10% increase and so on, _not_ of 600%. And then when you give me a benchmark "proving" it by someone that uses -O9, that means nothing.

GaryJones32
05-17-2004, 01:59 AM
OK, this isn't any fun, so I'm not going to do this anymore.
The numbers are the numbers... that's the useful thing about numbers.
they are not a belief system or a personal character issue.
they just are.

from gcc 3.3.3 changelogs

The following changes have been made to the IA-32/x86-64 port:
SSE2 and 3dNOW! intrinsics are now supported.


you wrote
sse2 is only useful if you have an intel cpu

hello !!!!!!
We are, or were, trying to discuss the INTEL compiler made specifically for INTEL CPUs.
This stuff is not as simple as we make out, and
this ain't your grandma's CPU and compiler.
I am far from competent in these matters, but:
no, I am not talking about pipelining, though that is the point of unrolling loops.
Loops are unrolled specifically so they can be piped, and yes, I guess that is an older form of parallelism. Which is why it is so fast: there is better and more predictable scheduling of memory access, which allows parallelism.
That is, the entire loop (or close to the entire loop) is loaded and used before it is evicted from the cache. Even if the loop is huge, the first part is just evicted and more (the spilled part) is loaded and dealt with.
This is the opposite of cache thrashing.
As a matter of fact, this by definition precludes cache thrashing.
Thrashing (or data cache missing) is the swapping in and out of different data elements mapped over and over to the same cache location.
Nested loops that are NOT unrolled can cause data cache misses or thrashing, with each call to the other array wiping out the first array or a part of it in cache and vice versa, to varying degrees. This is more true, not less, for the P4 -- its L1 cache is only 8K for speed -- the P4 L2 has huge bandwidth and is 256K.
I still fail to see how unrolling loops -- even really huge ones -- can cause thrashing.
I do see how it could cause instruction thrashing, BUT the Pentium 4 replaces the
conventional L1 instruction cache with an execution trace cache that can hold
12,000 micro-ops.
Perhaps someone else can enlighten the discussion.
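
The "mapped over and over to the same cache location" part can be sketched in a few lines. The model below is a deliberately simplified assumption: a direct-mapped cache with 64-byte lines and 128 lines (8K in total). A real P4 L1 is set-associative, so the actual behaviour is more forgiving than this. The function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Slot an address falls into in a direct-mapped cache:
 * (address / line size) mod number of lines. */
size_t cache_set(uintptr_t addr, size_t line_size, size_t num_lines)
{
    assert(line_size > 0 && num_lines > 0);
    return (size_t)((addr / line_size) % num_lines);
}

/* Two addresses conflict when they sit on different memory lines
 * but compete for the same cache slot. */
int conflicts(uintptr_t a, uintptr_t b, size_t line_size, size_t num_lines)
{
    return a / line_size != b / line_size
        && cache_set(a, line_size, num_lines) == cache_set(b, line_size, num_lines);
}
```

Under these assumptions, addresses 0 and 8192 (exactly one cache size apart) conflict, while addresses 0 and 64 do not. Two arrays whose elements are touched alternately and lie a multiple of the cache size apart would keep evicting each other in this model, which is the thrashing pattern described above.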

What I was trying to refer to was OpenMP and Hyper-Threading (thread-level parallelism)
and the SMP kernel and their relationship to potential cache thrashing.
This is also something that gcc doesn't begin to try to deal with, nor are our
benchmark examples for icc using -openmp or -parallel or -par_threshold[n].
So our examples have not touched on auto-parallelization of loops at all, which is where one real potential for speed lies in icc. Also, I think using -xN instead of -xW turns off vectorization, where a loop using SSE2 is stripped into just one single instruction by icc, but I'm not sure.

FWIW:
since it was said sse would make the results equal
i'm looking now at the results of an almabench (floating point math)
on a P4 2.8
gcc w/ -O3 -march=pentium4 -mcpu=pentium4 -msse -msse2 -mmmx -mfpmath=sse -ffast-math
30.6 seconds
gcc w/ -O3 -march=pentium4 -mcpu=pentium4 -msse -msse2 -mmmx -mfpmath=sse -ffast-math -funroll-loops
30.4 seconds
note to self: unroll loops improved not hurt !
gcc w/ -O3 -march=pentium4 -mcpu=pentium4 -msse -msse2 -mmmx -funroll-loops -ffast-math
28.8 seconds
note to self: better without -mfpmath=sse

icc w/ -xW -tpp7 -O3 -ipo -i_dynamic -openmp
8.9 seconds
a 220% win for icc over gcc even with your flag suggestion which turned out to be quite wrong
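
As a check on the arithmetic, the percentage can be recomputed from the two timings quoted above (28.8 seconds for the best gcc run, 8.9 seconds for icc). The helper name percent_faster is just an illustrative choice.

```c
#include <assert.h>

/* Percent by which the new run is faster than the old one:
 * (old_time / new_time - 1) * 100. */
double percent_faster(double old_seconds, double new_seconds)
{
    assert(old_seconds > 0.0 && new_seconds > 0.0);
    return (old_seconds / new_seconds - 1.0) * 100.0;
}
```

percent_faster(28.8, 8.9) comes out at roughly 224, in line with the "220% win" figure; put another way, the icc binary ran about 3.2 times as fast on this one benchmark.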

So don't argue just for the sake of arguing for some ideological reason -- that's boring!

maccorin
05-17-2004, 09:14 AM
220 is a far cry from 600, but in any case, yes, you win... ok? gonna quit taking this personally now?

side note: all those tests used -O3 which i specifically recommended against (for most cases), but it won't make anywhere near the difference that it would need to catch up anyway

t0t3r
05-17-2004, 10:46 AM
Anyway... I got 2.6.3 with nvidia working just as you said. Kinda stupid, that xild issue. And the 2.4 patch was made for version 7.

The kernel just runs with very good performance. In fact let's just say icc for Intel CPUs is better than gcc. :)

GaryJones32
05-19-2004, 12:55 AM
Originally posted by maccorin
yes, you win... ok? gonna quit taking this personally now?


Yea, I wasn't really trying to say I'm right and you were wrong -- benchmarking is weird.
It's more about interpreting the data correctly than the data itself,
and interpreting the data is really hard 'cause it's all over the place.
Right now I'm trying to figure out an issue where a certain gcc flag is increasing math overhead speed by over 3000% -- large number, yes, but what the heck does it mean in terms of overall performance?? I don't have a clue.
Certainly floating point math like we were talking about with icc amounts to like
.00001% of the overall performance, so it really ain't important.
More like a curiosity for us who follow stupid crap like that.
I ran some tests on unrolling the loops like what you were saying, and -funroll-loops
seems to work fine, but I did get some extra data cache misses with -funroll-all-loops,
so what you were saying had some validity. Here are the results if you are interested. I used the recursion test Tower of Hanoi for some (no) reason ??????
The effect might be more pronounced with another test.

with
-march=pentium4 -O3 -s -ftracer -momit-leaf-frame-pointer
Usage: ./hanoi duration [disks]
==25427==
==25427== I refs: 12,475
==25427== I1 misses: 368
==25427== L2i misses: 219
==25427== I1 miss rate: 2.94%
==25427== L2i miss rate: 1.75%
==25427==
==25427== D refs: 6,483 (4,663 rd + 1,820 wr)
==25427== D1 misses: 381 ( 343 rd + 38 wr)
==25427== L2d misses: 272 ( 242 rd + 30 wr)
==25427== D1 miss rate: 5.8% ( 7.3% + 2.0% )
==25427== L2d miss rate: 4.1% ( 5.1% + 1.6% )
==25427==
==25427== L2 refs: 749 ( 711 rd + 38 wr)
==25427== L2 misses: 491 ( 461 rd + 30 wr)
==25427== L2 miss rate: 2.5% ( 2.6% + 1.6% )

with
-march=pentium4 -O3 -s -funroll-loops -ftracer -momit-leaf-frame-pointer
same thing
Usage: ./hanoi duration [disks]
==1233==
==1233== I refs: 12,473
==1233== I1 misses: 367
==1233== L2i misses: 219
==1233== I1 miss rate: 2.94%
==1233== L2i miss rate: 1.75%
==1233==
==1233== D refs: 6,479 (4,659 rd + 1,820 wr)
==1233== D1 misses: 379 ( 344 rd + 35 wr)
==1233== L2d misses: 272 ( 242 rd + 30 wr)
==1233== D1 miss rate: 5.8% ( 7.3% + 1.9% )
==1233== L2d miss rate: 4.1% ( 5.1% + 1.6% )
==1233==
==1233== L2 refs: 746 ( 711 rd + 35 wr)
==1233== L2 misses: 491 ( 461 rd + 30 wr)
==1233== L2 miss rate: 2.5% ( 2.6% + 1.6% )

BUT with this next one, data misses go up to 6.4%, enough to degrade performance as you said:
-march=pentium4 -O3 -s -funroll-all-loops -ftracer -momit-leaf-frame-pointer
Usage: ./hanoi duration [disks]
==25414==
==25414== I refs: 12,475
==25414== I1 misses: 369
==25414== L2i misses: 218
==25414== I1 miss rate: 2.95%
==25414== L2i miss rate: 1.74%
==25414==
==25414== D refs: 6,483 (4,663 rd + 1,820 wr)
==25414== D1 misses: 418 ( 375 rd + 43 wr)
==25414== L2d misses: 273 ( 243 rd + 30 wr)
==25414== D1 miss rate: 6.4% ( 8.0% + 2.3% )
==25414== L2d miss rate: 4.2% ( 5.2% + 1.6% )
==25414==
==25414== L2 refs: 787 ( 744 rd + 43 wr)
==25414== L2 misses: 491 ( 461 rd + 30 wr)
==25414== L2 miss rate: 2.5% ( 2.6% + 1.6% )
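
The D1 miss rates in these listings follow directly from the raw counts (D1 misses divided by D refs). A small sketch of that calculation, with a helper name chosen only for this example:

```c
#include <assert.h>

/* Miss rate in percent from raw counts: 100 * misses / references. */
double miss_rate_percent(unsigned long misses, unsigned long refs)
{
    assert(refs > 0);
    return 100.0 * (double)misses / (double)refs;
}
```

With the counts above, 381 misses out of 6,483 references gives roughly 5.88, and 418 out of 6,483 gives roughly 6.45, matching the 5.8% and 6.4% lines in the output. In absolute terms the -funroll-all-loops run differs by 37 extra D1 misses.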

thanks for the insight
BTW
I'm also not saying gcc ain't cool, 'cause it's a world-class compiler that is way cross-platform and open source !!!

maccorin
05-19-2004, 01:33 PM
Originally posted by GaryJones32
BTW
I'm also not saying gcc ain't cool, 'cause it's a world-class compiler that is way cross-platform and open source !!!

:) Those are the 2 best things about gcc IMHO, but I'm one of those "free software zealots"...

in fact i did go to d/l icc for my laptop (the only intel cpu i have in the house), but as soon as i hit a user agreement i stopped.

Well, I've got a U60 (2 x 300MHz) coming in the mail soon, so I'll get to have fun figuring out what CFLAGS are the best for it as soon as I've got it, I'm guessing it will be quite a bit different. Esp considering the bigger cache and massive amount of registers available. I did some sys admin stuff on sparc for an old job a lot, but never really had time to experiment much at all :(

You're right about the benchmarking being mostly about interpretation. It's one of those cases where numbers simply don't speak for themselves. But yea, it's safe to say icc can beat out gcc _easily_ on a P4.

Since I'm so damn religious about Free Software I probably won't test this myself, but I wonder how icc would hold up on another x86 (I know you would have to disable things like sse2). It would be interesting to find out at least. I would be surprised if it _didn't_ do well, because that would make icc kinda pointless for closed-source software (how are you gonna know, down to the brand, what proc everyone is running?), which seems to be what they are catering to.