Ah, but incompatible with that old chip, I'm afraid! ;)
Thanks for the analysis. One step forwards and one step backwards. Overall still pretty poor :(
With records it seems to do the juggling on the result variable, but not on the parameters, so that's an improvement. But the extra float juggling is still there of course. Here's from a simple 2D vector record with an inlined Add operator:
XE3.dpr.40: vr := v1 + v2;
0041C427 8B05D83E4200 mov eax,[$00423ed8]
0041C42D 8945E8 mov [ebp-$18],eax
0041C430 8B05DC3E4200 mov eax,[$00423edc]
0041C436 8945EC mov [ebp-$14],eax
0041C439 8B05E03E4200 mov eax,[$00423ee0]
0041C43F 8945E0 mov [ebp-$20],eax
0041C442 8B05E43E4200 mov eax,[$00423ee4]
0041C448 8945E4 mov [ebp-$1c],eax
0041C44B D945E8 fld dword ptr [ebp-$18]
0041C44E D845E0 fadd dword ptr [ebp-$20]
0041C451 D95DD8 fstp dword ptr [ebp-$28]
0041C454 9B wait
0041C455 D945EC fld dword ptr [ebp-$14]
0041C458 D845E4 fadd dword ptr [ebp-$1c]
0041C45B D95DDC fstp dword ptr [ebp-$24]
0041C45E 9B wait
0041C45F 8B45D8 mov eax,[ebp-$28]
0041C462 8905E83E4200 mov [$00423ee8],eax
0041C468 8B45DC mov eax,[ebp-$24]
0041C46B 8905EC3E4200 mov [$00423eec],eax
XE6.dpr.40: vr := v1 + v2;
0041C4AF D905BC3E4200 fld dword ptr [$00423ebc]
0041C4B5 D805C43E4200 fadd dword ptr [$00423ec4]
0041C4BB D95DDC fstp dword ptr [ebp-$24]
0041C4BE 9B wait
0041C4BF D945DC fld dword ptr [ebp-$24]
0041C4C2 D95DE8 fstp dword ptr [ebp-$18]
0041C4C5 9B wait
0041C4C6 D905C03E4200 fld dword ptr [$00423ec0]
0041C4CC D805C83E4200 fadd dword ptr [$00423ec8]
0041C4D2 D95DDC fstp dword ptr [ebp-$24]
0041C4D5 9B wait
0041C4D6 D945DC fld dword ptr [ebp-$24]
0041C4D9 D95DEC fstp dword ptr [ebp-$14]
0041C4DC 9B wait
0041C4DD 8B45E8 mov eax,[ebp-$18]
0041C4E0 8905CC3E4200 mov [$00423ecc],eax
0041C4E6 8B45EC mov eax,[ebp-$14]
0041C4E9 8905D03E4200 mov [$00423ed0],eax
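For anyone who wants to reproduce the two listings above, a test program of roughly this shape should do. It is only a guess at the code behind the listings: the type name TVec2, the field names and the sample values are hypothetical, and only the globals v1, v2 and vr and the statement on line 40 come from the disassembly. The fields are assumed to be Single because the listings load and store dword operands.

```delphi
program VecAddSketch;
{$APPTYPE CONSOLE}

type
  // Hypothetical 2D vector record; Single matches the dword loads in the listings.
  TVec2 = record
    X, Y: Single;
    class operator Add(const A, B: TVec2): TVec2; inline;
  end;

class operator TVec2.Add(const A, B: TVec2): TVec2;
begin
  Result.X := A.X + B.X;
  Result.Y := A.Y + B.Y;
end;

var
  // Globals, so the operands live at fixed addresses as in the disassembly.
  v1, v2, vr: TVec2;

begin
  v1.X := 1; v1.Y := 2;
  v2.X := 3; v2.Y := 4;
  vr := v1 + v2;  // the statement shown as line 40 in both listings
  Writeln(vr.X:0:1, ' ', vr.Y:0:1);
end.
```

Build it for Win32 with optimization on and inspect the statement in the CPU view; exact addresses and stack offsets will differ from the ones posted here.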
I just ran a comparison between XE5 and XE6 with the SciMark2 test here:
https://code.google.com/p/scimark-delphi/
XE6 Win32 Results:
Mininum running time = 2,00 seconds
Composite Score MFlops: 632,06
FFT Mflops: 297,35 (N=1024)
SOR Mflops: 895,01 (100 x 100)
MonteCarlo: Mflops: 184,05
Sparse matmult Mflops: 360,58 (N=1000, nz=5000)
LU Mflops: 1423,33 (M=100, N=100)
XE5 Win32 Results:
Mininum running time = 2,00 seconds
Composite Score MFlops: 859,98
FFT Mflops: 390,91 (N=1024)
SOR Mflops: 1193,53 (100 x 100)
MonteCarlo: Mflops: 198,91
Sparse matmult Mflops: 538,50 (N=1000, nz=5000)
LU Mflops: 1978,03 (M=100, N=100)
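To put the two result sets side by side, the per-kernel ratios can be worked out with a few lines like the following. This is just a sketch that hard-codes the scores posted above (with decimal points instead of commas); the program and constant names are made up for illustration.

```delphi
program SciMarkRatio;
{$APPTYPE CONSOLE}

const
  // Scores copied from the XE5 and XE6 Win32 results above.
  Names: array[0..5] of string =
    ('Composite', 'FFT', 'SOR', 'MonteCarlo', 'Sparse matmult', 'LU');
  XE5: array[0..5] of Double =
    (859.98, 390.91, 1193.53, 198.91, 538.50, 1978.03);
  XE6: array[0..5] of Double =
    (632.06, 297.35, 895.01, 184.05, 360.58, 1423.33);

var
  I: Integer;

begin
  for I := Low(Names) to High(Names) do
    // Print both views: XE6 as a share of XE5, and how much faster XE5 is.
    Writeln(Names[I]:15, ': XE6 reaches ', 100 * XE6[I] / XE5[I]:5:1,
      '% of XE5; XE5 is ', 100 * (XE5[I] / XE6[I] - 1):5:1, '% faster');
end.
```

Note that "XE6 is x% slower" and "XE5 is y% faster" are different numbers for the same pair of scores, so it is worth stating which one a quoted percentage refers to.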
Bill Meyer Well, not that incompatible actually, XE6 still generates x87 FPU opcodes... at least they're 32bit x87 opcodes, but hey, SSE was only introduced 13 years ago!
Leif Uneus I can confirm that between XE & XE6, with 480 MFlops for XE vs 350 for XE6 (on my old AMD CPU), and a similar ratio on a more recent Xeon E5.
A decline in performance?
Ok, I know why XE6 is slower: it is now stack juggling even for simple expressions...
Woah...
Lars Fosdal Oh, come on, speed is overrated. ;)
Bill Meyer - I prefer my mistakes to be fast ;)
Slowdown ranges from 7% (integer-heavy Monte Carlo) to 50% (sparse matrix multiplication), with 30% on average.
I mean, the CPU speed will have caught up within a year or five, who cares?
Eric Grange So you need new hardware just to have recompiled apps run as fast as they did before. Talk about progress...
Well, at least they did some work on optimization... so there's a faint hope they'll fix this and the result juggling and such.
Though, they clearly need to run the output past a few more eyeballs, as well as running more regression tests before release.
Dalija Prasnikar That's consumerism for you ;-)
Dalija Prasnikar Except that CPU speed is not increasing, and multi-core only helps when you can improve performance through threading.
Asbjørn Heid It's a bit surprising they didn't use all the open-source benchmarks and/or call for sample code before doing it (à la FastCode B&V).
I hope it's not the old Borland Ivory Tower culture striking back.
Eric Grange Is it? Seems like NIH...
Eric Grange Given the job postings for compiler jobs in Romania, I hope it's just inexperience...
Marco Cantù Attention please.
Lars Fosdal Rapidly in error? ;)
http://www.delphitools.info/2014/05/08/delphi-xe6-32bits-and-scimark/