New experimental FastMM version was committed to https://github.com/gabr42/FastMM4/tree/Locking

New experimental FastMM version was committed to https://github.com/gabr42/FastMM4/tree/Locking_Improvements.

- March 18, 2016

New experimental FastMM version was committed to https://github.com/gabr42/FastMM4/tree/Locking_Improvements.

New in this version:
- Slightly simplified code in FastFreeMem path for small blocks.
- Added debugging function LogReleaseStackUsage (has to be enabled by defining DebugReleaseStack).
- Replaced GetStackSlot with a better-hashing version (turned out to make a difference in the field).
- Added a low priority release stack cleanup thread (frees memory that could get "stuck" in release stacks when a thread exits).

All changes were also merged into https://github.com/pleriche/FastMM4/tree/Locking_Improvements.

Please test & report!
https://github.com/gabr42/FastMM4/tree/Locking_Improvements

Comments

Tommi PramiMarch 18, 2016 at 1:59 AM
/stub
ReplyDelete
Replies
Tommi PramiMarch 18, 2016 at 2:21 AM
Just my 0.02€ : http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed Pretty interesting stuff on Hashes...
ReplyDelete
Replies
Tommi PramiMarch 18, 2016 at 2:23 AM
For 32bit MurMurHash2 seems to be good candidate... Some had very poor performance on numbers...
ReplyDelete
Replies
Primož GabrijelčičMarch 18, 2016 at 2:46 AM
Got mine from http://burtleburtle.net/bob/hash/integer.html - a nice overview of different integer hashing algorithms.
ReplyDelete
Replies
Tommi PramiMarch 18, 2016 at 2:47 AM
Yes, I saw that, just provided further reading ;)
ReplyDelete
Replies
David MillingtonMarch 18, 2016 at 5:31 AM
/sub. I notice that SO hash comparison was written in Delphi (yay)... Delphi 7 (oh.)
ReplyDelete
Replies
Johan BontesMarch 18, 2016 at 8:42 AM
Tommi Prami
MurMurhash2 has been superseded by much faster alternatives. crc32 in hardware and variations on FNV come to mind.
ReplyDelete
Replies
Johan BontesMarch 18, 2016 at 9:29 AM
Primož Gabrijelčič
I wanted to have a look at GetStackSlot, but I cannot find a method by that name anywhere in the source.
ReplyDelete
Replies
Eric GrangeMarch 18, 2016 at 9:40 AM
Primož Gabrijelčič CRC32 in hardware is the best, even in software it is fast enough.

Bob's hashes are quite obsolete by this point, it is what Delphi uses for its generics, and it is rather poor (both in performance and collisions).

MurmurHash and similar hashes are intended for rather largish blobs of data, they do not perform very well on just a few bytes.

Beyond that, SHA extensions if your CPU supports them are a no brainer: they will just wipe the floor with everything else (you do not need all the rounds if you do not want cryptographic strength, but just a few rounds will provide better hashes than other non-cryptographic hashes).
ReplyDelete
Replies
Johan BontesMarch 18, 2016 at 9:55 AM
Eric Grange
The performance of crc32 in software is poor for small samples because of the need for a lookup table and the cache issues that entails. For large data it is fine. For short samples best use a modern FNV variant. The FNV variant in https://github.com/JBontes/FastCode/blob/master/FastCompare.pas outperforms crc32 for all code sizes whilst having simular collision stats.
ReplyDelete
Replies
Dany MarmurMarch 18, 2016 at 2:54 PM
Just to clarify; i have nothing to do with murmurhash. And my family name is spelled correctly.
ReplyDelete
Replies
Primož GabrijelčičMarch 20, 2016 at 1:29 AM
If somebody has a concrete suggestion for a different hashing function that hashes thread ID into 0..63, post it here and I'll test it. Current version hashes better than (GetCurrentThreadID shr 4 AND 63) and that's all I can say about it :)
ReplyDelete
Replies
A. BouchezMarch 22, 2016 at 8:20 AM
Johan Bontes I doubt it outperforms hardware crc32c
ReplyDelete
Replies
A. BouchezMarch 22, 2016 at 8:22 AM
Primož Gabrijelčič Your version is just fine. A thread ID has a very small entropy, around 16 bits I guess.
Perhaps a TLS slot may be a faster alternative, in some cases....
ReplyDelete
Replies
David MillingtonMarch 22, 2016 at 8:28 AM
A. Bouchez TLS is unlikely be faster. A while ago I wrote a memory manager (never released) which kept separate pools per thread, as well as a bunch of other stuff. At the core of every GetMem call was a lookup into the data for the thread. (Multiple threads shared one data instance; they were allocated round-robin, I think with # data instances = # cores.) Using thread-local storage for a pointer to the data, vs a lookup of thread ID into a once-allocated container for the data, was about 6 times slower.
ReplyDelete
Replies
Johan BontesMarch 22, 2016 at 8:30 AM
Primož Gabrijelčič
Where is this hash function? I can't find it in the code.
ReplyDelete
Replies
Primož GabrijelčičMarch 22, 2016 at 8:58 AM
Johan Bontes https://github.com/gabr42/FastMM4/blob/Locking_Improvements/FastMM4.pas#L3291
ReplyDelete
Replies
Johan BontesMarch 22, 2016 at 11:06 AM
Primož Gabrijelčič
Aha, I was looking in the Master, not the Locking_improvements branch.
ReplyDelete
Replies
Johan BontesMarch 22, 2016 at 11:26 AM
2 cycles faster:
function GetStackSlot2: DWORD;
asm
.NoFrame
Call GetCurrentThreadID
//Unit40.pas.32: Result := (Result XOR 61) XOR (Result SHR 16);
mov ecx,eax
xor ecx,61
shr eax,16
xor eax,ecx
//Unit40.pas.33: Result := Result + (Result SHL 3);
lea eax,[eax+eax*8]
//Unit40.pas.34: Result := Result XOR (Result SHR 4);
mov ecx,eax
shr ecx,4
xor eax,ecx
//Unit40.pas.35: Result := Result * $27d4eb2d;
imul eax,eax,$27d4eb2d
//Unit40.pas.36: Result := Result XOR (Result SHR 15);
mov ecx,eax
shr ecx,15
xor eax,ecx
//Unit40.pas.37: Result := Result AND (NumStacksPerBlock-1);
and eax,(NumStacksPerBlock-1)
end;

Another single cycle faster:

function GetStackSlot3: DWORD;
asm
.NoFrame
Call GetCurrentThreadID
mov ecx,eax
imul ecx,$27d4eb2d
imul eax,16777619 //32-bit FNV prime
ror ecx,23
xor eax,ecx
and eax,$3f
end;

The collision stats for all functions are the same. I tested all collisions for 0..$FFFFFF.
On the latest Intel processors the difference should be a little greater.
ReplyDelete
Replies
A. BouchezMarch 22, 2016 at 11:57 AM
David Millington A plain threadvar access using Delphi RTL would be slower, you are right. But I was not thinking about this.
In fact, GetThreadID is a function call, which already reads an OS-level TLS/threadvar slot. So direct access to a TLS slot would be definitively faster than GetThreadID + lookup.
See e.g. in SynScaleMM or ScaleMM2 memory managers, how the TLS may be accessed more efficiently using raw asm than a plain threadvar. See https://github.com/andremussche/AndrewsDelphiStuff/blob/master/ProfilingTechniques/Demo/ScaleMM2.2/ScaleMM2.pas#L246
ReplyDelete
Replies
Primož GabrijelčičMarch 23, 2016 at 1:09 AM
Johan Bontes Three cycles is really not important in this case and it's hard to maintain assembler so I'll keep the current pascal version.
ReplyDelete
Replies
David MillingtonMarch 23, 2016 at 3:16 AM
A. Bouchez That code is hard for me to understand. Does Delphi use or access TLS differently to Windows/VC++?
ReplyDelete
Replies
A. BouchezMarch 23, 2016 at 6:43 AM
David Millington In Delphi, accessing a threadvar would call low-level SysInit.GetTLS stub, AFAIR. By using asm, you can avoid this call, and retrieve the information in a few cycles directly from the FS:[$2C] memory, which point to the SysInit.tlsArray, i.e. the Thread specific buffer of all threadvars. You access the threadvar directly by its offset in this buffer.
ReplyDelete
Replies
Johan BontesMarch 23, 2016 at 5:55 PM
A. Bouchez
No outperforming crc32 in hardware is not possible, But that requires a rather new CPU. Outperforming software crc32 is rather trivial.
ReplyDelete
Replies
A. BouchezMarch 24, 2016 at 12:40 AM
Johan Bontes Yes, but all servers I rent do have SSE42/CRC32C hardware support, since years. Even the cheap atoms at 9€ from https://www.online.net/fr/serveur-dedie/dedibox-scg2 do support them. All Intel CPUs do support those instructions now. I would not expect support for an old CPU in the context of this FastMM4 fork - which is about performance.
Idea is to have a fallback: use crc32c if available, or an optimized FNV as you propose if not available. The hash won't be stored, just used in memory.
I doubt Primož Gabrijelčič current hash is a bottleneck.
ReplyDelete
Replies
A. BouchezMarch 24, 2016 at 12:44 AM
Primož Gabrijelčič Here is what I used in SynLog:
fThreadLastHash := NativeUInt(fThreadID xor (fThreadID shr MAXLOGTHREADBITS)
xor (fThreadID shr (MAXLOGTHREADBITS*2))) and (MAXLOGTHREAD-1);
See https://github.com/synopse/mORMot/blob/master/SynLog.pas#L2745
Here the hash table has 2^14 slots.
No problem on production, on heavily-multithreaded server apps running for months.
ReplyDelete
Replies

Add comment

Search This Blog

Delphi Developers Archive

New experimental FastMM version was committed to https://github.com/gabr42/FastMM4/tree/Locking_Improvements.

Comments

Post a Comment