Originally shared by David Berneda
Playing with SSE2: a simple "sum of array of Double" in asm is about 6x faster on my machine (32-bit only) vs. the normal FPU code (pure Pascal).
I wish asm had a stack-align directive so the aligned SSE2 instructions could be used!
function DoubleSum(const d:Array of Double):Double;
var t: Integer;
begin
  result:=0;
  for t:=0 to Length(d)-1 do
    result:=result+d[t];
end;
vs:
function DoubleSum(const s:Array of Double):Double;
type
  DoubleVector = array [0..1] of Double;
var
  p : Pointer;
  r : DoubleVector;
  num, l : Integer;
begin
  p:=@s[0];
  num:=Length(s);
  l:=num div 2;
  asm
    mov ecx, p
    movupd xmm0, [DoubleVector(ecx)]
    mov edx, 1
  @loop:
    add ecx, 16 // 2*SizeOf(Double)
    movupd xmm1, [DoubleVector(ecx)]
    addpd xmm0, xmm1
    inc edx
    cmp edx, l
    jnz @loop
    movupd r, xmm0
  end;
  result:=r[0]+r[1];
  if num mod 2 = 1 then
    result:=result+s[num-1];
end;
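The same packed-add scheme can be sketched with C SSE2 intrinsics (a minimal sketch, not the author's Delphi code; the function name is mine): one vector accumulator, unaligned loads, a horizontal sum of the two lanes at the end, and a scalar fix-up for an odd trailing element.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Sum an array of doubles two at a time with addpd, mirroring the
   asm above: movupd = _mm_loadu_pd, addpd = _mm_add_pd. */
static double double_sum_sse2(const double *d, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;
    for (; i + 2 <= n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(d + i)); /* packed add, 2 lanes */
    double r[2];
    _mm_storeu_pd(r, acc);
    double sum = r[0] + r[1];  /* horizontal sum of the two lanes */
    if (i < n)
        sum += d[i];           /* odd element left over */
    return sum;
}
```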
The difference in that case isn't really FPU vs SSE2, it's the "old Delphi compiler's FPU codegen" vs ASM SSE2.
For a simple sum of an array, an FPU-based asm version will be just as fast as SSE2, and will even have higher precision (80 bits vs 64 bits). For SSE2 to pull ahead on a simple sum, you need to leverage the SIMD aspect (i.e. add two items simultaneously).
The old Delphi FPU codegen has a significant amount of unnecessary register juggling, and doesn't do FPU register allocation.
Yep! The x64 codegen is much better; I'll benchmark once I figure out how to convert the asm.
Done with x64. The speed improvement is 5x. The compiler is not vectorizing; it uses the SSE2 scalar "addsd" instead of "addpd". Huge difference!
I read a doc somewhere about llvm/clang auto-vectorizing loops; maybe it's optional.
Did you try FPU asm?
Does the NextGen compiler have SIMD intrinsics?
Eric Grange The existing System.Math Sum is faster than my SSE2 in 32-bit (it's FPU-only, an unrolled loop using 4 ST registers to exploit the dual FPU pipelines in modern CPUs). The 64-bit asm version is not available, and there it's 2x slower than a simple Pascal loop because it does a Kahan sum.
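For reference, Kahan (compensated) summation looks roughly like this (a sketch in C, not the System.Math source); the extra arithmetic per element is why it benchmarks slower than a plain loop while giving a tighter error bound:

```c
#include <stddef.h>

/* Kahan compensated summation: carries a running error term so that
   low-order bits lost in each addition are fed back into the next one. */
static double kahan_sum(const double *d, size_t n)
{
    double sum = 0.0, c = 0.0;   /* c accumulates the lost low-order bits */
    for (size_t i = 0; i < n; i++) {
        double y = d[i] - c;     /* compensated input */
        double t = sum + y;      /* high-order partial result */
        c = (t - sum) - y;       /* what got rounded away in the addition */
        sum = t;
    }
    return sum;
}
```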
My fault: in 64-bit, custom asm vs. Pascal is only 2x faster, not 6x (as expected: 1 scalar vs. 2 packed).
Corrected functions, if anyone is interested:
function DoubleSumPascal(const d:Array of Double):Double;
var t: Integer;
begin
  result:=0;
  for t:=0 to Length(d)-1 do
    result:=result+d[t];
end;

function DoubleSum(const s:Array of Double):Double;
type
  DoubleVector = array [0..1] of Double;

  procedure Loop(p,l:NativeInt; out r:DoubleVector); assembler;
  asm
  {$IFDEF CPUX64}
    // Assuming RAX=l
    movupd xmm0, [DoubleVector(p)]
  @loop:
    add p, 16 // 2*SizeOf(Double)
    addpd xmm0, [DoubleVector(p)] // <-- inlined, no problem
    dec rax
  {$ELSE}
    // EAX=p, EDX=l, ECX=r
    movupd xmm0, [DoubleVector(p)]
  @loop:
    add p, 16 // 2*SizeOf(Double)
    movupd xmm1, [DoubleVector(p)] // <-- necessary ! (AV if not)
    addpd xmm0, xmm1
    dec edx
  {$ENDIF}
    jnz @loop
    movupd [r], xmm0
  end;

var
  num : NativeInt;
  r : DoubleVector;
begin
  num:=Length(s);
  case num of
    0: result:=0;
    1: result:=s[0];
    2: result:=s[0]+s[1];
    3: result:=s[0]+s[1]+s[2];
  else
    begin
      Loop(NativeInt(@s[0]),num div 2,r);
      if num mod 2=1 then
        result:=r[0]+r[1]+s[num-1]
      else
        result:=r[0]+r[1];
    end;
  end;
end;
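On the stack-align wish at the top of the thread: the aligned SSE2 loads (movapd / _mm_load_pd) fault on addresses that are not 16-byte aligned, which is why all the asm above sticks to movupd. When you control the allocation you can guarantee the alignment yourself and use the aligned form. A minimal C sketch (my own function name; assumes the caller passes a 16-byte-aligned pointer and an even count):

```c
#include <emmintrin.h>
#include <stddef.h>

/* Sum with aligned loads; d must be 16-byte aligned and n even,
   otherwise _mm_load_pd (movapd) raises a fault. */
static double double_sum_aligned(const double *d, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    for (size_t i = 0; i < n; i += 2)
        acc = _mm_add_pd(acc, _mm_load_pd(d + i)); /* aligned load */
    double r[2];
    _mm_storeu_pd(r, acc);
    return r[0] + r[1]; /* horizontal sum */
}
```

In C the alignment can come from `_Alignas(16)` on the array or an aligned allocator; a Delphi caller would have to over-allocate and round the pointer up by hand, which is exactly the inconvenience the stack-align wish is about.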