Originally shared by David Berneda

Playing with SSE2: a simple "sum of an array of Double" in asm is 6x faster on my machine (32-bit only) than the normal FPU version (pure Pascal).

I wish asm had a stack-align directive, so that the aligned SSE2 instructions could be used!

function DoubleSum(const d:Array of Double):Double;
var t: Integer;
begin
  result:=0;
  for t:=0 to Length(d)-1 do
      result:=result+d[t];
end;

vs:

function DoubleSum(const s:Array of Double):Double;
type
  DoubleVector = array [0..1] of Double;

var p : Pointer;
    r : DoubleVector;
    num,
    l : Integer;
begin
  p:=@s[0];
  num:=Length(s);
  l:=num div 2; // number of pairs; the loop below assumes l >= 2, i.e. Length(s) >= 4

  asm
    mov ecx, p
    movupd xmm0, [DoubleVector(ecx)]
    mov edx, 1
 @loop:
    add ecx,16 // 2*SizeOf(Double)
    movupd xmm1, [DoubleVector(ecx)]
    addpd xmm0, xmm1

    inc edx
    cmp edx, l
    jnz @loop
    movupd r, xmm0
  end;

  result:=r[0]+r[1];

  if num mod 2 = 1 then
     result:=result+s[num-1];
end;
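
To check the asm against something portable, the packed loop can be modelled in plain Pascal. This is only a sketch of what the two lanes of xmm0 compute, not a faster routine: the name DoubleSumLanes and the variables lane0/lane1 are made up for illustration, and the compiler will still emit scalar code for it. Unlike the asm above, it also handles arrays shorter than four elements.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

// Reference model of the packed (addpd-style) summation.
// lane0 and lane1 are hypothetical names for the low and high half of xmm0.
function DoubleSumLanes(const s: array of Double): Double;
var
  i, pairs: Integer;
  lane0, lane1: Double;
begin
  lane0 := 0;
  lane1 := 0;
  pairs := Length(s) div 2;

  // Each iteration mirrors one addpd: both lanes advance together
  for i := 0 to pairs - 1 do
  begin
    lane0 := lane0 + s[2*i];
    lane1 := lane1 + s[2*i + 1];
  end;

  // Horizontal add of the two lanes, like r[0]+r[1] in the asm version
  Result := lane0 + lane1;

  // Odd length: fold in the last element, which did not fit into a pair
  if Odd(Length(s)) then
    Result := Result + s[High(s)];
end;
```

Because the additions happen in a different order than in the simple loop, the last bits of the result can differ from DoubleSum for inputs that are not exactly representable; compare with a small tolerance when testing.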

Comments

  1. The difference in that case isn't really FPU vs SSE2, it's the "old Delphi compiler's FPU codegen" vs ASM SSE2.

    For a simple sum of an array, an FPU-based asm version will be just as fast as SSE2, and will even have higher precision (80 bits vs. 64 bits). For SSE2 to pull ahead on a simple sum, you need to leverage the SIMD aspects (i.e. add two items simultaneously).
    The old Delphi FPU codegen has a significant amount of unnecessary register juggling, and doesn't do FPU register allocation.

  2. Yep! The x64 codegen is much better. I'll benchmark once I figure out how to convert the asm.

  3. Done with x64. The speed improvement is 5x. The compiler is not vectorizing: it uses the scalar SSE2 instruction "addsd" instead of the packed "addpd". Huge difference!
    I read a doc somewhere about LLVM/Clang auto-vectorizing loops; maybe it's optional.

  4. Does the NextGen compiler have SIMD intrinsics?

  5. Eric Grange: the existing System.Math Sum is faster than my SSE2 version in 32-bit (it's FPU-only, an unrolled loop using 4 ST registers to exploit the double FPU in modern CPUs). In 64-bit the asm version is not available, and Sum is 2x slower than a simple Pascal loop because it does a Kahan sum.

  6. My fault. In 64-bit, the custom asm is only 2x faster than Pascal, not 6x (as expected: 1 scalar vs. 2 packed).

  7. Corrected functions, if anyone is interested:

    function DoubleSumPascal(const d:Array of Double):Double;
    var t: Integer;
    begin
      result:=0;
      for t:=0 to Length(d)-1 do
          result:=result+d[t];
    end;

    function DoubleSum(const s:Array of Double):Double;
    type
      DoubleVector = array [0..1] of Double;

      procedure Loop(p,l:NativeInt; out r:DoubleVector); assembler;
      asm
        {$IFDEF CPUX64}
        // p and l are referenced by name, so no particular register is assumed
        movupd xmm0, [DoubleVector(p)] // first pair
        dec l                          // l-1 pairs remain (caller guarantees l>=2)
     @loop:
        add p,16 // 2*SizeOf(Double)
        addpd xmm0, [DoubleVector(p)]  // <-- inlined, no problem
        dec l
        {$ELSE}

        // EAX=p, EDX=l, ECX=r
        movupd xmm0, [DoubleVector(p)] // first pair
        dec edx                        // l-1 pairs remain (caller guarantees l>=2)
     @loop:
        add p,16 // 2*SizeOf(Double)
        movupd xmm1, [DoubleVector(p)] // <-- necessary ! (AV if not)
        addpd xmm0, xmm1
        dec edx
        {$ENDIF}

        jnz @loop
        movupd [r], xmm0
      end;

    var num : NativeInt;
        r : DoubleVector;
    begin
      num:=Length(s);

      case num of
       0: result:=0;
       1: result:=s[0];
       2: result:=s[0]+s[1];
       3: result:=s[0]+s[1]+s[2];
      else
      begin
        Loop(NativeInt(@s[0]),num div 2,r);

        if num mod 2=1 then
           result:=r[0]+r[1]+s[num-1]
        else
           result:=r[0]+r[1];
      end;
      end;
    end;
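
For reference, the Kahan (compensated) summation that comment 5 says System.Math Sum performs in 64-bit can be sketched in plain Pascal. This is the textbook algorithm under the hypothetical name KahanSum, not the RTL source; it shows why that version does more work per element than a plain loop (four dependent floating-point operations instead of one).

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

// Textbook Kahan summation; KahanSum is an illustrative name, not an RTL routine.
function KahanSum(const s: array of Double): Double;
var
  i: Integer;
  sum, c, y, t: Double;
begin
  sum := 0;
  c := 0; // running compensation for lost low-order bits
  for i := 0 to High(s) do
  begin
    y := s[i] - c;      // apply the correction carried over from the last step
    t := sum + y;       // low-order bits of y may be lost here
    c := (t - sum) - y; // recover what was lost (algebraically zero)
    sum := t;
  end;
  Result := sum;
end;
```

The extra precision only survives if the compiler does not simplify (t - sum) - y away, so this kind of loop should not be built with aggressive floating-point optimizations that reassociate operations.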

