Originally shared by David Berneda
Playing with SSE2: a simple "sum of array of Double" in asm is about 6x faster on my machine (32-bit only) vs. the normal FPU code (pure Pascal).
I wish asm had a stack-align directive so the aligned SSE2 instructions could be used!
function DoubleSum(const d:Array of Double):Double;
var t: Integer;
begin
  result:=0;
  for t:=0 to Length(d)-1 do
    result:=result+d[t];
end;
vs:
function DoubleSum(const s:Array of Double):Double;
type
  DoubleVector = array [0..1] of Double;
var
  p : Pointer;
  r : DoubleVector;
  num, l : Integer;
begin
  p:=@s[0];
  num:=Length(s);
  l:=num div 2;
  asm
    mov ecx, p
    movupd xmm0, [DoubleVector(ecx)]
    mov edx, 1
  @loop:
    add ecx, 16 // 2*SizeOf(Double)
    movupd xmm1, [DoubleVector(ecx)]
    addpd xmm0, xmm1
    inc edx
    cmp edx, l
    jnz @loop
    movupd r, xmm0
  end;
  result:=r[0]+r[1];
  if num mod 2 = 1 then
    result:=result+s[num-1];
end;
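The same packed-add scheme can be sketched with C SSE2 intrinsics (a minimal sketch, not the author's Delphi code; the function name is mine): one vector accumulator, unaligned loads, a horizontal sum of the two lanes at the end, and a scalar fix-up for an odd trailing element.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Sum an array of doubles two at a time with addpd, mirroring the
   asm above: movupd = _mm_loadu_pd, addpd = _mm_add_pd. */
static double double_sum_sse2(const double *d, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;
    for (; i + 2 <= n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(d + i)); /* packed add, 2 lanes */
    double r[2];
    _mm_storeu_pd(r, acc);
    double sum = r[0] + r[1];  /* horizontal sum of the two lanes */
    if (i < n)
        sum += d[i];           /* odd element left over */
    return sum;
}
```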
The difference in that case isn't really FPU vs SSE2, it's the "old Delphi compiler's FPU codegen" vs ASM SSE2.
For a simple sum of an array, an FPU-based asm version will be just as fast as SSE2, and will even have higher precision (80 bits vs 64 bits). For SSE2 to pull ahead on a simple sum, you need to leverage the SIMD aspect (i.e. add two items simultaneously).
The old Delphi FPU codegen has a significant amount of unnecessary register juggling, and doesn't do FPU register allocation.
Yep! The x64 codegen is much better; I'll benchmark once I figure out how to convert the asm.
Done with x64. The speed improvement is 5x. The compiler is not vectorizing; it uses the SSE2 scalar "addsd" instead of "addpd". Huge difference!
I read a doc somewhere about llvm/clang auto-vectorizing loops; maybe it's optional.
Did you try FPU asm?
Does the NextGen compiler have SIMD intrinsics?
Eric Grange The existing System.Math Sum is faster than my SSE2 in 32-bit (it's FPU-only, an unrolled loop using 4 ST registers to exploit the dual FPU pipelines in modern CPUs). The 64-bit asm version is not available, and there it's 2x slower than a simple Pascal loop because it does a Kahan sum.
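For reference, Kahan (compensated) summation looks roughly like this (a sketch in C, not the System.Math source); the extra arithmetic per element is why it benchmarks slower than a plain loop while giving a tighter error bound:

```c
#include <stddef.h>

/* Kahan compensated summation: carries a running error term so that
   low-order bits lost in each addition are fed back into the next one. */
static double kahan_sum(const double *d, size_t n)
{
    double sum = 0.0, c = 0.0;   /* c accumulates the lost low-order bits */
    for (size_t i = 0; i < n; i++) {
        double y = d[i] - c;     /* compensated input */
        double t = sum + y;      /* high-order partial result */
        c = (t - sum) - y;       /* what got rounded away in the addition */
        sum = t;
    }
    return sum;
}
```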
My fault: in 64-bit, custom asm vs. Pascal is only 2x faster, not 6x (as expected: 1 scalar vs. 2 packed).
Corrected functions, if anyone is interested:
function DoubleSumPascal(const d:Array of Double):Double;
var t: Integer;
begin
  result:=0;
  for t:=0 to Length(d)-1 do
    result:=result+d[t];
end;

function DoubleSum(const s:Array of Double):Double;
type
  DoubleVector = array [0..1] of Double;

  procedure Loop(p,l:NativeInt; out r:DoubleVector); assembler;
  asm
  {$IFDEF CPUX64}
    // Assuming RAX=l
    movupd xmm0, [DoubleVector(p)]
  @loop:
    add p, 16 // 2*SizeOf(Double)
    addpd xmm0, [DoubleVector(p)] // <-- inlined, no problem
    dec rax
  {$ELSE}
    // EAX=p, EDX=l, ECX=r
    movupd xmm0, [DoubleVector(p)]
  @loop:
    add p, 16 // 2*SizeOf(Double)
    movupd xmm1, [DoubleVector(p)] // <-- necessary ! (AV if not)
    addpd xmm0, xmm1
    dec edx
  {$ENDIF}
    jnz @loop
    movupd [r], xmm0
  end;

var
  num : NativeInt;
  r : DoubleVector;
begin
  num:=Length(s);
  case num of
    0: result:=0;
    1: result:=s[0];
    2: result:=s[0]+s[1];
    3: result:=s[0]+s[1]+s[2];
  else
    begin
      Loop(NativeInt(@s[0]),num div 2,r);
      if num mod 2=1 then
        result:=r[0]+r[1]+s[num-1]
      else
        result:=r[0]+r[1];
    end;
  end;
end;
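On the stack-align wish at the top of the thread: the aligned SSE2 loads (movapd / _mm_load_pd) fault on addresses that are not 16-byte aligned, which is why all the asm above sticks to movupd. When you control the allocation you can guarantee the alignment yourself and use the aligned form. A minimal C sketch (my own function name; assumes the caller passes a 16-byte-aligned pointer and an even count):

```c
#include <emmintrin.h>
#include <stddef.h>

/* Sum with aligned loads; d must be 16-byte aligned and n even,
   otherwise _mm_load_pd (movapd) raises a fault. */
static double double_sum_aligned(const double *d, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    for (size_t i = 0; i < n; i += 2)
        acc = _mm_add_pd(acc, _mm_load_pd(d + i)); /* aligned load */
    double r[2];
    _mm_storeu_pd(r, acc);
    return r[0] + r[1]; /* horizontal sum */
}
```

In C the alignment can come from `_Alignas(16)` on the array or an aligned allocator; a Delphi caller would have to over-allocate and round the pointer up by hand, which is exactly the inconvenience the stack-align wish is about.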