Currently "fighting" trying to obtain a good codegen (for speed) on swapping dynamic array items (Int32, Int64, Single etc) in both cpu x86 and 64bit.
Currently "fighting" trying to obtain a good codegen (for speed) on swapping dynamic array items (Int32, Int64, Single etc) in both cpu x86 and 64bit.
The typical asm xchg trick to swap two integers does not help too much.
var X : Array of Integer; // Int64
...
ExchangeItems(X,23,45) // X[23] <--> X[45]
The typical asm xchg trick to swap two integers does not help too much.
var X : Array of Integer; // Int64
...
ExchangeItems(X,23,45) // X[23] <--> X[45]
/sub
ReplyDelete/sub ☺
ReplyDeleteCode it in a good C++ compiler. Look at code gen. Use that code as inline asm.
ReplyDelete/sub
ReplyDeleteHow do you know the codegen is the issue, rather than say cache misses?
ReplyDeleteDavid Heffernan yes I'm comparing fpc, clang etc. Looking first for the best pure pascal way, then ifdef x86/x64 for better asm
ReplyDeleteAsbjørn Heid The problem with arrays is no registers are used (array elements are volatile), so the typical "tmp:=A A:=B B:=tmp" calculates memory position of the index element all the time. Inlining is no better. Probably the best way is to pass address of A and B elements (@X[23] ,@X[45]) but I preferred to avoid using pointers
ReplyDeleteDavid Berneda Well yea, but you didn't really answer the question. You can execute several dozen instructions in the time it takes for a single L2 miss.
ReplyDeleteHave you measured any significant change when you alter your ExchangeItems?
My second questions is have you tried to have your ExchangeItems call an inline ExchangeValues(var a, b: integer) procedure?
Yeah ok, so the compiler gets quite silly when you pass array elements to a var parameter for some reason. Seems you do indeed need to hit the compiler with a hammer and revert to pointers (at least you can have the pointer stuff inside ExchangeItems only).
ReplyDeleteAsbjørn Heid I've added these test (call an inline proc) and it runs better (minor improvement, but better). There are differences between Delphi and FreePascal, and between x86 and x64, depending on using arrays of Int32 and Int64
ReplyDeleteI've uploaded a small test project that displays timings for all combinations:
ReplyDeletehttps://github.com/davidberneda/exchange
David Berneda I added the following, which is fastest on my machine:
ReplyDeleteprocedure InlineExchangePtr(const X:TTestArray; const A,B:Integer); overload; inline;
begin
InlineExchange(PTestType(@X[A])^, PTestType(@X[B])^);
end;
Array A B: 188msec
@A @B: 159msec
Inline Array A B: 173msec
Inline @A @B: 181msec
Call Inline @A @B: 158msec
Inline Array A B with ptr: 144msec <--- my version
Asbjørn Heid Thanks ! I've added this new test to github repo. Similar timings on my i7 4770 machine ( 152msec ) in 32bit. For x64, the "@A @B" version is faster
ReplyDeleteXCHG instruction is slow because it has a build-in lock mechanism ... Each call to XCHG REG,MEM or XCHG MEM,REG will result entering the lock process (performed by the CPU). So things go slowly. better alternative is to use stack way or reg to reg way. I wrote two functions showing how to use both way (but it works just for x86 ... and I think it's very easy to port to x64 :) ) and now AsmStackExchange is almost 3 time faster than the old AsmExchange and AsmExchange2 is almost 5 five time faster !
ReplyDeleteThings would be better if Delphi supports inline asm code.
//---------------------------------------------------
procedure AsmStackExchange(var A, B: Int32);
asm
push [eax]
push [edx]
pop [eax]
pop [edx]
end;
procedure AsmExchange2(var A, B: Int32);
asm
push ebx
push ecx
mov ebx,[eax]
mov ecx,[edx]
mov [eax],ecx
mov [edx],ebx
pop ecx
pop ebx
end;
Mahdi Safsafi thanks for these functions !, I'll include them in the github test project to compare
ReplyDeleteAttila Kovacs no problem to ifdef and add asm if there is a significant improvement, there will always be the pascal way just in case things change in the future
ReplyDeleteAttila Kovacs I did a quick test but couldn't see any difference, maybe I'm not doing it correctly, just the InlineExchange test, running without debugging release mode 32bit
ReplyDelete