Delphi XE3 on an i7 2700K, Win7 x64, summing an array of 100 million elements of Vector2 = record x, y: single; end;, compiled for:
X86: 1.1 seconds
X64: 4.8 seconds
Changing singles to doubles:
X86: 1.4 seconds
X64: 1.0 seconds
Alrighty then.
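For context, a minimal sketch of the shape of the test (a hedged reconstruction with illustrative names; as noted further down the thread, the actual benchmark uses Kahan summation rather than this naive loop):

type
  TVector2 = record
    x, y: Single;
  end;
var
  Data: array of TVector2;
  Sum: TVector2;
  i: Integer;
begin
  SetLength(Data, 100000000);
  // ... fill Data ...
  Sum.x := 0;
  Sum.y := 0;
  for i := 0 to High(Data) do
  begin
    Sum.x := Sum.x + Data[i].x;  // on x64 each add becomes widen-add-narrow
    Sum.y := Sum.y + Data[i].y;
  end;
end;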
I guess the single-to-double ratio is the opposite on Android / iOS (single much faster). Maybe we need a "TFastfloat" type, ifdef'd depending on platform.
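A minimal sketch of such an alias (TFastFloat is the commenter's hypothetical name; the ANDROID/IOS conditional defines assume the mobile compilers):

type
{$IF Defined(ANDROID) or Defined(IOS)}
  TFastFloat = Single;   // singles are much faster on these targets
{$ELSE}
  TFastFloat = Double;   // doubles dodge the x64 conversion penalty
{$IFEND}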
Seems to be suboptimal code gen. The compiler emits CVTSS2SD to perform the calculations in double precision, but CVTSS2SD does not touch the upper 64 bits of the register, i.e. a partial register update, which is about as good as it sounds. No reason for it to do so either that I can see; it doesn't try to autovectorize at all.
David Berneda I guess it depends on the chip? Some ARMs come with decent FPUs, don't they?
Adding {$EXCESSPRECISION OFF} only reduces the time to 2.95 seconds for singles on X64.
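For anyone trying to reproduce, a minimal sketch of where the directive goes (guarding it with CPUX64 is my assumption, since the directive targets the x64 compiler):

{$IFDEF CPUX64}
  {$EXCESSPRECISION OFF} // keep Single math in single precision on x64
{$ENDIF}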
Delphi has never auto-vectorized, so it would be strange to expect it now. If you really care about application speed for math calculations, you cannot really use Delphi. It is way behind the times. However, if you use mtxVec, you will get to within a few percent of highly optimized C++ (http://www.dewresearch.com/products).
The problem with the Delphi x64 compiler: it copies your 32-bit singles into 64-bit doubles, does the math, and copies the result back. That's what causes the slowdown, and that's why 64-bit doubles are faster. The needless copy is not required (that's an odd design choice of the x64 compiler; IIRC I read somewhere that they did this so users get high precision with singles too... yeah, right).
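A hedged sketch, as Delphi x64 inline assembly, of the per-element pattern being described (illustrative, not a verbatim disassembly; in Win64 the Single arguments arrive in XMM0 and XMM1 and the result is returned in XMM0):

function AddSingles(a, b: Single): Single;
asm
  CVTSS2SD XMM0, XMM0   // widen a to double (partial register write)
  CVTSS2SD XMM1, XMM1   // widen b to double
  ADDSD    XMM0, XMM1   // add in double precision
  CVTSD2SS XMM0, XMM0   // narrow the result back to single
  // versus the direct single-precision form: ADDSS XMM0, XMM1
end;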
David Novo Of course I didn't expect it to autovectorize; as you say, it's an ancient beast. I'm just saying I think it's weird to use those conversion instructions when it's trivial to use much better options, i.e. the scalar single-precision instructions.
And while we won't get C++ speeds any time soon, 3x slower than 32-bit/doubles isn't exactly fun.
As Asbjørn mentioned, turn it off with {$EXCESSPRECISION OFF} to stop that.
Asbjørn Heid I have never been impressed by the Delphi compiler in terms of math efficiency. I have found 15x improvements using mtxVec in some basic math stuff compared to Delphi. I have basically switched over to using it for all calculations that are even remotely math intensive. Plus the code ends up looking much simpler, because mtxVec is inherently vectorized and you can eliminate all the stupid boilerplate looping code.
For example,

for i := 0 to Length(arr) - 1 do
  if arr[i] < 0 then
    arr[i] := 0;

becomes

arr.ThreshBottom(0);

Cleaner code and a huge speedup in performance. What more can you want?
Asbjørn Heid Yes, it seems NEON is the good pipelined FPU. Found this link; mul and div are faster with singles vs doubles: https://pixhawk.ethz.ch/omap/optimization/arm_cortex_a8
David Novo Yeah, I presume mtxVec is a DLL, hence compiled Fortran or C++? Simple integer tests I did showed VC++ 11 (2012) being 4-5x faster than Delphi, so I don't doubt 15x for float stuff. Was hoping an LLVM backend would make it in soon, but I guess that's out of the picture for the immediate future. Oh well. At least we have "native code".
Alexander B. The problem is also that it uses a very poor way of doing the single-to-double promotion. It uses CVTSS2SD (convert scalar single to scalar double), which does not touch the upper 64 bits of the register, which may cause CPU stalls.
Instead it should use CVTPS2PD (convert packed single to packed double), which touches the entire register. We don't care what the upper 64 bits contain, as we'll only move the lower 64 bits back into memory, so we can use the packed instructions safely. Since we overwrite the entire register, the CPU can more easily rearrange the instructions.
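A minimal Delphi x64 asm sketch of the suggested packed conversion (a hypothetical helper, not the compiler's actual output; in Win64 a Single argument arrives in XMM0 and a Double result is returned in XMM0):

function SingleToDouble(x: Single): Double;
asm
  // CVTPS2PD writes the whole destination register, so there is no
  // dependency on XMM0's previous contents (no partial update).
  // The upper converted double is garbage, but it is never read.
  CVTPS2PD XMM0, XMM0
end;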
Asbjørn Heid MtxVec primarily wraps the Intel Performance Primitives and the Intel Math Kernel Library, as well as the author's own C++ for stuff not included in those. You are not going to get much faster than that. Supposedly it has different code paths for different processors, optimized for each one.
I know he uses the Intel compiler, which is apparently much better than the VC++ compiler.
David Novo Alrighty, yeah, if you can get correct code out of it, Intel's compiler sure is fast. Had some issues with a C++ codebase I worked on some years ago; never tried it again after that (the resulting program would generate garbage). I suspect it's because the codebase relied on IEEE-spec handling of infinities.
Single-precision floating point seems outdated these days. Try rerunning with double-precision floats.
This is a quirk with code-generation precision; it can be turned off, and Visual Studio has the same quirk:
http://www.delphitools.info/2011/09/05/xe2-single-precision-floating-point-disappointment/
Lars Dybdahl They're not outdated; it's mostly a code-generation quirk. If you don't need the precision, Singles are going to be faster, thanks to reduced memory bandwidth needs.
Also, with the generalization of GPU-based UIs, your CPU is likely processing more singles than doubles these days (if only to feed the GPU).
Lars Dybdahl The results from using doubles are in my original post. As Eric mentions, memory bandwidth can play a significant role, at least with a decent compiler :-P
Eric Grange Yeah, saw that post and tried with the flag off, as mentioned above, with disappointing results. I'll have to take a closer look today to see what the compiler is doing.
Asbjørn Heid FWIW, using your test case, optimal perf seems to be around 0.3 seconds for single precision on this machine (not a Core i7, an older and slower Opteron), and 0.6 seconds for double precision (both memory-bandwidth limited; scalar or SIMD SSE doesn't matter, and even the FPU achieves very close results).
Eric Grange I should have specified that I use Kahan summation (which you really should for 100 million singles), so four flops per iteration. But if you take your numbers and multiply them by four, you get pretty close to mine, give or take some cache and instruction-level parallelism effects.
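For reference, a minimal sketch of Kahan summation for one component (illustrative names; this is the standard compensated loop, giving the four flops per iteration mentioned above):

function KahanSum(const Data: array of Single): Single;
var
  Sum, C, Y, T: Single;
  i: Integer;
begin
  Sum := 0;
  C := 0;               // running compensation for lost low-order bits
  for i := 0 to High(Data) do
  begin
    Y := Data[i] - C;   // corrected next value
    T := Sum + Y;       // low-order bits of Y may be lost here
    C := (T - Sum) - Y; // recover what was just lost
    Sum := T;
  end;
  Result := Sum;
end;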
Asbjørn Heid Given that memory bandwidth is the limiting factor, I don't think it'll be 4 times slower, even discounting the extra juggling for Kahan... OK, I'm seeing 0.7 sec for Kahan in scalar single precision. This means that two threads should be able to do it in 0.3-0.4 sec (but no faster, as memory will then be the bottleneck).