Delphi XE3 on an i7 2700K, Win7 x64, summing an array of 100 million elements of Vector2 = record x, y: single; end; on:
X86: 1.1 seconds
X64: 4.8 seconds

Changing singles to doubles:
X86: 1.4 seconds
X64: 1.0 seconds

Alrighty then.
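
For reference, a minimal sketch of the kind of summation being timed; the actual test harness isn't shown here, and (as comes up in the comments) the real test uses Kahan summation, so treat this as illustrative only:

    program SumBench;
    {$APPTYPE CONSOLE}

    type
      Vector2 = record
        x, y: Single;
      end;

    var
      Data: array of Vector2;
      Sum: Vector2;
      i: Integer;
    begin
      SetLength(Data, 100000000);  // 100 million elements, ~800 MB of singles
      // ... fill Data with test values, start timer ...
      Sum.x := 0;
      Sum.y := 0;
      for i := 0 to High(Data) do
      begin
        Sum.x := Sum.x + Data[i].x;
        Sum.y := Sum.y + Data[i].y;
      end;
      // ... stop timer; print the sum so the loop isn't optimized away ...
      Writeln(Sum.x + Sum.y);
    end.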

Comments

  1. I guess the single-vs-double ratio is the opposite on Android / iOS (single much faster). Maybe we need a "TFastfloat" type ifdef'd depending on platform (see the sketch below).

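A sketch of what such a platform-conditional alias might look like; TFastfloat is the commenter's hypothetical name, and the ANDROID / IOS defines assume the mobile compilers:

    type
      {$IF Defined(ANDROID) or Defined(IOS)}
      TFastfloat = Single;   // singles are the fast path on the ARM targets
      {$ELSE}
      TFastfloat = Double;   // doubles avoid the Win64 single-precision penalty above
      {$IFEND}
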
  2. Seems to be suboptimal code gen. The compiler emits CVTSS2SD to perform the calculations in double precision, but CVTSS2SD does not touch the upper 64 bits of the register, i.e. a partial register update, which is about as good as it sounds. No reason for it to do so either, as far as I can see; it doesn't try to autovectorize at all.

  3. David Berneda I guess it depends on the chip? Some ARMs come with decent FPUs, don't they?

  4. Adding {$EXCESSPRECISION OFF} only reduces the time to 2.95 seconds for singles on X64.

  5. Delphi has never auto-vectorized, so it would be strange to expect it now. If you really care about application speed for math calculations, you cannot really use Delphi; it is way behind the times. However, if you use mtxVec, you will get to within a few percent of highly optimized C++ (http://www.dewresearch.com/products).

  6. The problem with the Delphi x64 compiler: it copies your 32-bit singles into 64-bit doubles, does the math, and copies back. That's what causes the slowdown, and that's why 64-bit doubles are faster. The copy is needless (that's an odd design choice in the x64 compiler; IIRC I read somewhere that they did this so users get higher precision with singles too... eh, yeah -.-).

  7. David Novo Of course I didn't expect it to autovectorize; as you say, it's an ancient beast. I'm just saying I think it's weird to use those conversion instructions when it's trivial to use much better options, i.e. the scalar single-precision instructions.

    And while we won't get C++ speeds any time soon, 3x slower than 32-bit/doubles isn't exactly fun.

  8. As Asbjørn mentioned, add {$EXCESSPRECISION OFF} to stop that (see the sketch below).

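For reference, a minimal sketch of where the directive goes; it affects only the Win64 compiler, and the function shown is just an illustrative summation:

    {$EXCESSPRECISION OFF}  // keep Single math in single precision on Win64

    function SumSingles(const Data: array of Single): Single;
    var
      i: Integer;
    begin
      Result := 0;
      for i := 0 to High(Data) do
        Result := Result + Data[i];
    end;
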
  9. Asbjørn Heid I have never been impressed by the Delphi compiler in terms of math efficiency. I have found 15x improvements using mtxVec in some basic math stuff compared to Delphi, and I have basically switched over to using it for all calculations that are even remotely math intensive. Plus the code ends up looking much simpler, because mtxVec is inherently vectorized and you can eliminate all the stupid boilerplate looping code.

    For example,

    for i := 0 to Length(arr) - 1 do
      if arr[i] < 0 then
        arr[i] := 0;

    becomes

    arr.ThreshBottom(0)

    Cleaner code and a huge speedup in performance. What more can you want?

  10. Asbjørn Heid Yes, it seems NEON is the good pipelined FPU. Found this link; mul and div are faster with singles vs doubles: https://pixhawk.ethz.ch/omap/optimization/arm_cortex_a8

  11. David Novo Yeah, I presume mtxVec is a DLL, hence compiled Fortran or C++? Simple integer tests I did showed VC++ 11 (2012) being 4-5x faster than Delphi, so I don't doubt 15x for float stuff. I was hoping an LLVM backend would make it in soon, but I guess that's out of the picture for the immediate future. Oh well. At least we have "native code".

  12. Alexander B. The problem is also that it uses a very poor way of doing the single-to-double promotion. It uses CVTSS2SD (convert scalar single to scalar double), which does not touch the upper 64 bits of the register, which may cause CPU stalls.

    Instead it should use CVTPS2PD (convert packed single to packed double), which writes the entire register. We don't care what the upper 64 bits contain, as we'll only move the lower 64 bits back into memory, so we can use the packed instruction safely. Since we overwrite the entire register, the CPU can more easily rearrange the instructions (see the sketch below).

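A minimal Win64 inline-asm sketch of the two forms, assuming a pointer argument in RCX and a Double result in XMM0 per the Win64 ABI; this illustrates the instruction semantics only and is not the compiler's actual output:

    function PromoteScalar(P: PSingle): Double;
    asm
      // CVTSS2SD writes only the low 64 bits of XMM0; the upper half keeps
      // its previous contents, so this is a partial register update.
      CVTSS2SD XMM0, [RCX]
    end;

    function PromotePacked(P: PSingle): Double;
    asm
      // CVTPS2PD converts two packed singles (it reads 8 bytes) and
      // overwrites all of XMM0; only the low double, i.e. P^, is returned.
      CVTPS2PD XMM0, [RCX]
    end;
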
  13. Asbjørn Heid
     MtxVec primarily wraps the Intel Performance Primitives and the Intel Math Kernel Library, plus his own C++ for stuff not included in those. You are not going to get much faster than that. Supposedly it has different code paths for different processors, optimized for each one.
    I know he uses the Intel compiler, which is apparently much better than the VC++ compiler.

  14. David Novo Alrighty, yeah, if you can get correct code out of it, Intel's compiler sure is fast. I had some issues with a C++ codebase I worked on some years ago and never tried it again after that (the resulting program would generate garbage). I suspect it's because the codebase relied on IEEE-spec handling of infinities.

  15. Single-precision floating point seems outdated these days. Try rerunning with double-precision floats.

  16. This is a quirk with code-generation precision; it can be turned off, and Visual Studio has the same quirk:

    http://www.delphitools.info/2011/09/05/xe2-single-precision-floating-point-disappointment/

  17. Lars Dybdahl They're not outdated; it's mostly a code-generation quirk. If you don't need the precision, Singles are going to be faster, thanks to reduced memory bandwidth needs.

    Also, with GPU-based UIs becoming the norm, your CPU is likely processing more singles than doubles these days (if only to feed the GPU).

  18. Lars Dybdahl The results from using doubles are in my original post. As Eric mentions, memory bandwidth can play a significant role, at least with a decent compiler :-P

    Eric Grange Yeah, I saw that post and tried with the flag off, as mentioned above, with disappointing results. I'll have to take a closer look today to see what the compiler is doing.

  19. Asbjørn Heid FWIW, using your test case, optimal perf seems to be around 0.3 seconds for single precision on this machine (not a Core i7, but an older and slower Opteron) and 0.6 seconds for double precision (both memory-bandwidth limited; scalar or SIMD SSE doesn't matter, and even the FPU achieves very close results).

  20. Eric Grange I should have specified that I use Kahan summation (which you really should for 100 million singles), so four flops per iteration (see the sketch below). But if you take your numbers and multiply them by four, you get pretty close to mine, give or take some cache and instruction-level parallelism effects.

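A sketch of Kahan (compensated) summation for a single array of Singles, as referenced above; the actual benchmark code isn't shown, so names and structure are illustrative:

    function KahanSum(const Data: array of Single): Single;
    var
      Sum, Comp, Y, T: Single;
      i: Integer;
    begin
      Sum := 0;
      Comp := 0;                 // running compensation for lost low-order bits
      for i := 0 to High(Data) do
      begin
        Y := Data[i] - Comp;     // corrected next term
        T := Sum + Y;            // accumulate (low-order bits of Y may be lost here)
        Comp := (T - Sum) - Y;   // recover what was lost
        Sum := T;
      end;
      Result := Sum;             // four flops per element, as noted above
    end;
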
  21. Asbjørn Heid Given that memory bandwidth is the limiting factor, I don't think it'll be 4 times slower, even discounting the extra juggling for Kahan... OK, I'm seeing 0.7 sec for Kahan in scalar single precision. This means that two threads should be able to do it in 0.3-0.4 sec (but no faster, as memory will then be the bottleneck).

