Currently "fighting" trying to obtain a good codegen (for speed) on swapping dynamic array items (Int32, Int64, Single etc) in both cpu x86 and 64bit.

Currently "fighting" trying to obtain a good codegen (for speed) on swapping dynamic array items (Int32, Int64, Single etc) in both cpu x86 and 64bit.
The typical asm xchg trick to swap two integers does not help too much.

var X : Array of Integer; // Int64
...
ExchangeItems(X,23,45) // X[23] <--> X[45]

Comments

  1. Code it in a good C++ compiler. Look at code gen. Use that code as inline asm.

    ReplyDelete
  2. How do you know the codegen is the issue, rather than say cache misses?

    ReplyDelete
  3. David Heffernan yes I'm comparing fpc, clang etc. Looking first for the best pure pascal way, then ifdef x86/x64 for better asm

    ReplyDelete
  4. Asbjørn Heid The problem with arrays is no registers are used (array elements are volatile), so the typical "tmp:=A A:=B B:=tmp" calculates memory position of the index element all the time. Inlining is no better. Probably the best way is to pass address of A and B elements (@X[23] ,@X[45]) but I preferred to avoid using pointers

    ReplyDelete
  5. David Berneda Well yea, but you didn't really answer the question. You can execute several dozen instructions in the time it takes for a single L2 miss.

    Have you measured any significant change when you alter your ExchangeItems?

    My second questions is have you tried to have your ExchangeItems call an inline ExchangeValues(var a, b: integer) procedure?

    ReplyDelete
  6. Yeah ok, so the compiler gets quite silly when you pass array elements to a var parameter for some reason. Seems you do indeed need to hit the compiler with a hammer and revert to pointers (at least you can have the pointer stuff inside ExchangeItems only).

    ReplyDelete
  7. Asbjørn Heid I've added these test (call an inline proc) and it runs better (minor improvement, but better). There are differences between Delphi and FreePascal, and between x86 and x64, depending on using arrays of Int32 and Int64

    ReplyDelete
  8. I've uploaded a small test project that displays timings for all combinations:

    https://github.com/davidberneda/exchange

    ReplyDelete
  9. David Berneda I added the following, which is fastest on my machine:

    procedure InlineExchangePtr(const X:TTestArray; const A,B:Integer); overload; inline;
    begin
      InlineExchange(PTestType(@X[A])^, PTestType(@X[B])^);
    end;

    Array A B: 188msec
    @A @B: 159msec
    Inline Array A B: 173msec
    Inline @A @B: 181msec
    Call Inline @A @B: 158msec
    Inline Array A B with ptr: 144msec <--- my version

    ReplyDelete
  10. Asbjørn Heid Thanks ! I've added this new test to github repo. Similar timings on my i7 4770 machine ( 152msec ) in 32bit. For x64, the "@A @B" version is faster

    ReplyDelete
  11. XCHG instruction is slow because it has a build-in lock mechanism ... Each call to XCHG REG,MEM or XCHG MEM,REG will result entering the lock process (performed by the CPU). So things go slowly. better alternative is to use stack way or reg to reg way. I wrote two functions showing how to use both way (but it works just for x86 ... and I think it's very easy to port to x64 :) ) and now AsmStackExchange is almost 3 time faster than the old AsmExchange and AsmExchange2 is almost 5 five time faster !
    Things would be better if Delphi supports inline asm code.

    //---------------------------------------------------

    procedure AsmStackExchange(var A, B: Int32);
    asm
    push [eax]
    push [edx]
    pop [eax]
    pop [edx]
    end;

    procedure AsmExchange2(var A, B: Int32);
    asm
    push ebx
    push ecx
    mov ebx,[eax]
    mov ecx,[edx]
    mov [eax],ecx
    mov [edx],ebx
    pop ecx
    pop ebx
    end;

    ReplyDelete
  12. Mahdi Safsafi thanks for these functions !, I'll include them in the github test project to compare

    ReplyDelete
  13. Attila Kovacs no problem to ifdef and add asm if there is a significant improvement, there will always be the pascal way just in case things change in the future

    ReplyDelete
  14. Attila Kovacs I did a quick test but couldn't see any difference, maybe I'm not doing it correctly, just the InlineExchange test, running without debugging release mode 32bit

    ReplyDelete

Post a Comment