New experimental FastMM version was committed to https://github.com/gabr42/FastMM4/tree/Locking_Improvements.

New experimental FastMM version was committed to https://github.com/gabr42/FastMM4/tree/Locking_Improvements.

New in this version:
- Slightly simplified code in FastFreeMem path for small blocks.
- Added debugging function LogReleaseStackUsage (has to be enabled by defining DebugReleaseStack).
- Replaced GetStackSlot with a better-hashing version (turned out to make a difference in the field).
- Added a low priority release stack cleanup thread (frees memory that could get "stuck" in release stacks when a thread exits).

All changes were also merged into https://github.com/pleriche/FastMM4/tree/Locking_Improvements.

Please test & report!
https://github.com/gabr42/FastMM4/tree/Locking_Improvements

Comments

  1. For 32bit MurMurHash2 seems to be good candidate... Some had very poor performance on numbers...

    ReplyDelete
  2. Got mine from http://burtleburtle.net/bob/hash/integer.html - a nice overview of different integer hashing algorithms.

    ReplyDelete
  3. Yes, I saw that, just provided further reading ;)

    ReplyDelete
  4. /sub. I notice that SO hash comparison was written in Delphi (yay)... Delphi 7 (oh.)

    ReplyDelete
  5. Tommi Prami
    MurMurhash2 has been superseded by much faster alternatives. crc32 in hardware and variations on FNV come to mind.

    ReplyDelete
  6. Primož Gabrijelčič
    I wanted to have a look at GetStackSlot, but I cannot find a method by that name anywhere in the source.

    ReplyDelete
  7. Primož Gabrijelčič CRC32 in hardware is the best, even in software it is fast enough.

    Bob's hashes are quite obsolete by this point, it is what Delphi uses for its generics, and it is rather poor (both in performance and collisions).

    MurmurHash and similar hashes are intended for rather largish blobs of data, they do not perform very well on just a few bytes.

    Beyond that, SHA extensions if your CPU supports them are a no brainer: they will just wipe the floor with everything else (you do not need all the rounds if you do not want cryptographic strength, but just a few rounds will provide better hashes than other non-cryptographic hashes).

    ReplyDelete
  8. Eric Grange
    The performance of crc32 in software is poor for small samples because of the need for a lookup table and the cache issues that entails. For large data it is fine. For short samples best use a modern FNV variant. The FNV variant in https://github.com/JBontes/FastCode/blob/master/FastCompare.pas outperforms crc32 for all code sizes whilst having simular collision stats.

    ReplyDelete
  9. Just to clarify; i have nothing to do with murmurhash. And my family name is spelled correctly.

    ReplyDelete
  10. If somebody has a concrete suggestion for a different hashing function that hashes thread ID into 0..63, post it here and I'll test it. Current version hashes better than (GetCurrentThreadID shr 4 AND 63) and that's all I can say about it :)

    ReplyDelete
  11. Johan Bontes​ I doubt it outperforms hardware crc32c

    ReplyDelete
  12. Primož Gabrijelčič Your version is just fine. A thread ID has a very small entropy, around 16 bits I guess.
    Perhaps a TLS slot may be a faster alternative, in some cases....

    ReplyDelete
  13. A. Bouchez TLS is unlikely be faster. A while ago I wrote a memory manager (never released) which kept separate pools per thread, as well as a bunch of other stuff. At the core of every GetMem call was a lookup into the data for the thread. (Multiple threads shared one data instance; they were allocated round-robin, I think with # data instances = # cores.) Using thread-local storage for a pointer to the data, vs a lookup of thread ID into a once-allocated container for the data, was about 6 times slower.

    ReplyDelete
  14. Primož Gabrijelčič
    Where is this hash function? I can't find it in the code.

    ReplyDelete
  15. Primož Gabrijelčič
    Aha, I was looking in the Master, not the Locking_improvements branch.

    ReplyDelete
  16. 2 cycles faster:
    function GetStackSlot2: DWORD;
    asm
      .NoFrame
      Call GetCurrentThreadID
    //Unit40.pas.32: Result := (Result XOR 61) XOR (Result SHR 16);
      mov ecx,eax
      xor ecx,61
      shr eax,16
      xor eax,ecx
      //Unit40.pas.33: Result := Result + (Result SHL 3);
      lea eax,[eax+eax*8]
      //Unit40.pas.34: Result := Result XOR (Result SHR 4);
      mov ecx,eax
      shr ecx,4
      xor eax,ecx
      //Unit40.pas.35: Result := Result * $27d4eb2d;
      imul eax,eax,$27d4eb2d
      //Unit40.pas.36: Result := Result XOR (Result SHR 15);
      mov ecx,eax
      shr ecx,15
      xor eax,ecx
      //Unit40.pas.37: Result := Result AND (NumStacksPerBlock-1);
      and eax,(NumStacksPerBlock-1)
    end;

    Another single cycle faster:

    function GetStackSlot3: DWORD;
    asm
      .NoFrame
      Call GetCurrentThreadID
      mov ecx,eax
      imul ecx,$27d4eb2d
      imul eax,16777619  //32-bit FNV prime
      ror ecx,23
      xor eax,ecx
      and eax,$3f
    end;

    The collision stats for all functions are the same. I tested all collisions for 0..$FFFFFF.
    On the latest Intel processors the difference should be a little greater.

    ReplyDelete
  17. David Millington A plain threadvar access using Delphi RTL would be slower, you are right. But I was not thinking about this.
    In fact, GetThreadID is a function call, which already reads an OS-level TLS/threadvar slot. So direct access to a TLS slot would be definitively faster than GetThreadID + lookup.
    See e.g. in SynScaleMM or ScaleMM2 memory managers, how the TLS may be accessed more efficiently using raw asm than a plain threadvar. See https://github.com/andremussche/AndrewsDelphiStuff/blob/master/ProfilingTechniques/Demo/ScaleMM2.2/ScaleMM2.pas#L246

    ReplyDelete
  18. Johan Bontes Three cycles is really not important in this case and it's hard to maintain assembler so I'll keep the current pascal version.

    ReplyDelete
  19. A. Bouchez That code is hard for me to understand. Does Delphi use or access TLS differently to Windows/VC++?

    ReplyDelete
  20. David Millington In Delphi, accessing a threadvar would call low-level SysInit.GetTLS stub, AFAIR. By using asm, you can avoid this call, and retrieve the information in a few cycles directly from the FS:[$2C] memory, which point to the SysInit.tlsArray, i.e. the Thread specific buffer of all threadvars. You access the threadvar directly by its offset in this buffer.

    ReplyDelete
  21. A. Bouchez
    No outperforming crc32 in hardware is not possible, But that requires a rather new CPU. Outperforming software crc32 is rather trivial.

    ReplyDelete
  22. Johan Bontes Yes, but all servers I rent do have SSE42/CRC32C hardware support, since years. Even the cheap atoms at 9€ from https://www.online.net/fr/serveur-dedie/dedibox-scg2 do support them. All Intel CPUs do support those instructions now. I would not expect support for an old CPU in the context of this FastMM4 fork - which is about performance.
    Idea is to have a fallback: use crc32c if available, or an optimized FNV as you propose if not available. The hash won't be stored, just used in memory.
    I doubt Primož Gabrijelčič current hash is a bottleneck.

    ReplyDelete
  23. Primož Gabrijelčič Here is what I used in SynLog:
      fThreadLastHash := NativeUInt(fThreadID xor (fThreadID shr MAXLOGTHREADBITS)
        xor (fThreadID shr (MAXLOGTHREADBITS*2))) and (MAXLOGTHREAD-1);
    See https://github.com/synopse/mORMot/blob/master/SynLog.pas#L2745
    Here the hash table has 2^14 slots.
    No problem on production, on heavily-multithreaded server apps running for months.

    ReplyDelete

Post a Comment