A Look at Improved Inlining in Delphi XE6 - DelphiTools


http://www.delphitools.info/2014/05/07/a-look-at-improved-inlining-in-delphi-xe6

Comments

  1. Ah, but incompatible with that old chip, I'm afraid! ;)

  2. Thanks for the analysis. One step forwards and one step backwards. Overall still pretty poor :(

    With records it seems to do the juggling on the result variable, but not on the parameters, so that's an improvement. The extra float juggling is still there, of course. Here's the output for a simple 2D vector record with an inlined Add operator:

    XE3.dpr.40: vr := v1 + v2;
    0041C427 8B05D83E4200     mov eax,[$00423ed8]
    0041C42D 8945E8           mov [ebp-$18],eax
    0041C430 8B05DC3E4200     mov eax,[$00423edc]
    0041C436 8945EC           mov [ebp-$14],eax
    0041C439 8B05E03E4200     mov eax,[$00423ee0]
    0041C43F 8945E0           mov [ebp-$20],eax
    0041C442 8B05E43E4200     mov eax,[$00423ee4]
    0041C448 8945E4           mov [ebp-$1c],eax
    0041C44B D945E8           fld dword ptr [ebp-$18]
    0041C44E D845E0           fadd dword ptr [ebp-$20]
    0041C451 D95DD8           fstp dword ptr [ebp-$28]
    0041C454 9B               wait
    0041C455 D945EC           fld dword ptr [ebp-$14]
    0041C458 D845E4           fadd dword ptr [ebp-$1c]
    0041C45B D95DDC           fstp dword ptr [ebp-$24]
    0041C45E 9B               wait
    0041C45F 8B45D8           mov eax,[ebp-$28]
    0041C462 8905E83E4200     mov [$00423ee8],eax
    0041C468 8B45DC           mov eax,[ebp-$24]
    0041C46B 8905EC3E4200     mov [$00423eec],eax

    XE6.dpr.40: vr := v1 + v2;
    0041C4AF D905BC3E4200     fld dword ptr [$00423ebc]
    0041C4B5 D805C43E4200     fadd dword ptr [$00423ec4]
    0041C4BB D95DDC           fstp dword ptr [ebp-$24]
    0041C4BE 9B               wait
    0041C4BF D945DC           fld dword ptr [ebp-$24]
    0041C4C2 D95DE8           fstp dword ptr [ebp-$18]
    0041C4C5 9B               wait
    0041C4C6 D905C03E4200     fld dword ptr [$00423ec0]
    0041C4CC D805C83E4200     fadd dword ptr [$00423ec8]
    0041C4D2 D95DDC           fstp dword ptr [ebp-$24]
    0041C4D5 9B               wait
    0041C4D6 D945DC           fld dword ptr [ebp-$24]
    0041C4D9 D95DEC           fstp dword ptr [ebp-$14]
    0041C4DC 9B               wait
    0041C4DD 8B45E8           mov eax,[ebp-$18]
    0041C4E0 8905CC3E4200     mov [$00423ecc],eax
    0041C4E6 8B45EC           mov eax,[ebp-$14]
    0041C4E9 8905D03E4200     mov [$00423ed0],eax
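
    For anyone who wants to reproduce listings like the two above, here is a minimal sketch of the kind of record being described. The names TVec2, X and Y and the const parameters are assumptions for illustration, not the commenter's actual code. The Single fields are inferred from the dword ptr operands, and the global variables from the absolute addresses in the disassembly.

    ```delphi
    program VecInline;

    {$APPTYPE CONSOLE}

    type
      // Hypothetical 2D vector record; field and type names are assumed.
      TVec2 = record
        X, Y: Single;
        // The inline directive asks the compiler to expand the operator
        // body at each call site instead of emitting a call.
        class operator Add(const A, B: TVec2): TVec2; inline;
      end;

    class operator TVec2.Add(const A, B: TVec2): TVec2;
    begin
      Result.X := A.X + B.X;
      Result.Y := A.Y + B.Y;
    end;

    var
      // Globals, to match the absolute addresses seen in the listings.
      v1, v2, vr: TVec2;

    begin
      v1.X := 1; v1.Y := 2;
      v2.X := 3; v2.Y := 4;
      vr := v1 + v2;   // the statement to inspect in the CPU view
      Writeln(vr.X:0:1, ' ', vr.Y:0:1);
    end.
    ```

    Building this with optimization on and stepping to the addition in the debugger's CPU view should show whether the compiler copies the operands or the result through stack temporaries.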

  3. I just ran a comparison between XE5 and XE6 with the SciMark2 test here:
    https://code.google.com/p/scimark-delphi/

    XE6 Win32 Results:

    Mininum running time = 2,00 seconds
    Composite Score MFlops:   632,06
    FFT             Mflops:   297,35    (N=1024)
    SOR             Mflops:   895,01    (100 x 100)
    MonteCarlo:     Mflops:   184,05
    Sparse matmult  Mflops:   360,58    (N=1000, nz=5000)
    LU              Mflops:  1423,33    (M=100, N=100)

    XE5 Win32 Results:

    Mininum running time = 2,00 seconds
    Composite Score MFlops:   859,98
    FFT             Mflops:   390,91    (N=1024)
    SOR             Mflops:  1193,53    (100 x 100)
    MonteCarlo:     Mflops:   198,91
    Sparse matmult  Mflops:   538,50    (N=1000, nz=5000)
    LU              Mflops:  1978,03    (M=100, N=100)

  4. Bill Meyer Well, not that incompatible actually, XE6 still generates x87 FPU opcodes... at least they're 32-bit x87 opcodes, but hey, SSE was only introduced 13 years ago!

  5. Leif Uneus I can confirm that between XE & XE6, with 480 MFlops for XE vs 350 for XE6 (on my old AMD CPU), and a similar ratio on a more recent Xeon E5.

  6. Ok, I know why XE6 is slower: it now does stack juggling even for simple expressions...

  7. Lars Fosdal Oh, come on, speed is overrated. ;)

  8. Bill Meyer - I prefer my mistakes to be fast ;)

  9. Slowdown ranges from 7% (integer-heavy Monte Carlo) to 50% (sparse matrix multiplication), with 30% on average.

    I mean, the CPU speed will have caught up within a year or five, who cares?

  10. Eric Grange So you need new hardware just to make recompiled apps run as fast as they did before. Talk about progress...

  11. Well, at least they did some work on optimization... so there's a faint hope they'll fix this and the result juggling and such.

    Though, they clearly need to run the output past a few more eyeballs, as well as running more regression tests before release.

  12. Dalija Prasnikar That's consumerism for you ;-)

  13. Dalija Prasnikar Except that CPU speed is not increasing, and multi-core only helps when you can improve performance through threading.

  14. Asbjørn Heid It's a bit surprising they didn't use all the open-source benchmarks and/or call for sample code before doing it (à la FastCode B&V).
    I hope it's not the old Borland Ivory Tower culture striking back.

  15. Eric Grange Is it? Seems like NIH...

  16. Eric Grange Given the job postings for compiler jobs in Romania, I hope it's just inexperience...

  17. Marco Cantù Attention please.

  18. Lars Fosdal Rapidly in error? ;)
