Interesting alternative to TCriticalSection, not just TMREWS

http://www.delphitools.info/2013/12/18/slim-readerwriter-locks-rock/
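
For context, the slim reader/writer lock discussed in the linked article is simply the Win32 SRW API available from Windows Vista onwards. As a rough sketch of how it can be imported and used from Delphi (the TSRWLock record and the Read/WriteData procedures below are illustrative; recent Delphi versions may already declare these imports in Winapi.Windows):

    type
      // An SRW lock is a single pointer-sized field; an all-zero value is a
      // valid, unlocked lock, so a global or zero-initialised field needs no setup.
      TSRWLock = record
        Ptr: Pointer;
      end;

    // The SRW functions are exported by kernel32 (Windows Vista and later).
    procedure InitializeSRWLock(var SRWLock: TSRWLock); stdcall; external 'kernel32.dll';
    procedure AcquireSRWLockShared(var SRWLock: TSRWLock); stdcall; external 'kernel32.dll';
    procedure ReleaseSRWLockShared(var SRWLock: TSRWLock); stdcall; external 'kernel32.dll';
    procedure AcquireSRWLockExclusive(var SRWLock: TSRWLock); stdcall; external 'kernel32.dll';
    procedure ReleaseSRWLockExclusive(var SRWLock: TSRWLock); stdcall; external 'kernel32.dll';

    var
      DataLock: TSRWLock;

    // Readers take the shared lock, writers the exclusive one.
    procedure ReadData;
    begin
      AcquireSRWLockShared(DataLock);
      try
        // ... read the protected data ...
      finally
        ReleaseSRWLockShared(DataLock);
      end;
    end;

    procedure WriteData;
    begin
      AcquireSRWLockExclusive(DataLock);
      try
        // ... modify the protected data ...
      finally
        ReleaseSRWLockExclusive(DataLock);
      end;
    end;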

Comments

  1. Nice find. I've been using Boost.Thread for the C++ code I've worked on, and its locks aren't reentrant either. While that initially felt a bit weird, it does indeed highlight design issues, and reentrant locking is best avoided anyway. So IMHO that's not really a downside. Plus, POSIX thread locks are not reentrant by default either, so it helps with writing portable code.
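
    To make the non-reentrancy point concrete, here is a minimal hypothetical sketch (DataLock and the procedure names are made up for illustration) of the pattern that is legal with a TCriticalSection but self-deadlocks with an SRW lock:

      procedure Recalculate;
      begin
        AcquireSRWLockExclusive(DataLock);   // second acquire by the same thread...
        try
          // ...
        finally
          ReleaseSRWLockExclusive(DataLock);
        end;
      end;

      procedure UpdateItem;
      begin
        AcquireSRWLockExclusive(DataLock);
        try
          Recalculate;                       // ...typically blocks forever here: SRW locks
                                             // keep no per-owner recursion count
        finally
          ReleaseSRWLockExclusive(DataLock);
        end;
      end;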

  2. I've looked a bit at the benchmark code from Alexandre Machado, checking my own TkbmMWLock against the other 4 (TCriticalSection, SRW lock, TMREW, TMonitor). TMonitor is actually impressively fast... and so are the SRW lock and most of the others in that particular benchmark.

    The problem with it is that it assumes a static workload (no delay, or exactly the same sleep for both reader and writer, and the same number of iterations for both), which makes for a benchmark that is quite skewed compared to real-life performance.

    If I modify the benchmark to allow for, say, 1 writer thread with 200 passes, each taking a random number of ms (max 100 ms), and 50 reader threads with 200,000 passes each with no additional delay (only the simple static lookup the benchmark code makes), then even the SRW lock and TMonitor come out slower (about 10-12%) than my own lock, and the rest a good amount slower than that (I was starting to worry a bit at first :) ). A sketch of such an asymmetric workload follows after this comment.

    It can be debated what makes a fair comparison when setting up any synthetic benchmark, but IMO locks are often used to protect resources where writing requires varying access times, while reading is often significantly faster and more predictable.

    Just my 2 cents.

    best regards
    Kim Madsen
    C4D
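
    A minimal sketch of the kind of asymmetric workload described above, i.e. one slow writer with varying delays against many fast readers (the constants, the shared TStringList and the TCriticalSection baseline are illustrative assumptions, not the actual benchmark code):

      uses
        System.Classes, System.SysUtils, System.SyncObjs;

      const
        ReaderThreads = 50;
        ReaderPasses  = 200000;
        WriterPasses  = 200;

      var
        Lock: TCriticalSection;        // swap in SRW lock, TMonitor, TMREWSync, ... to compare
        Shared: TStringList;

      // Fast, predictable reads: only a simple lookup under the lock.
      procedure ReaderWork;
      var
        i: Integer;
      begin
        for i := 1 to ReaderPasses do
        begin
          Lock.Acquire;
          try
            Shared.IndexOf('somekey');
          finally
            Lock.Release;
          end;
        end;
      end;

      // Few, slow writes of varying duration.
      procedure WriterWork;
      var
        i: Integer;
      begin
        for i := 1 to WriterPasses do
        begin
          Lock.Acquire;
          try
            Shared.Values['somekey'] := IntToStr(i);
            Sleep(Random(100));        // each pass takes a random number of ms (max ~100 ms)
          finally
            Lock.Release;
          end;
        end;
      end;

    Each procedure would be driven from its own TThread (50 readers and 1 writer in the scenario above), and the wall-clock time for all threads to finish gives the figure being compared.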

  3. Kim Madsen Interesting. For locking in general I usually have one of two scenarios: an even workload (pushing/popping from a circular buffer, say), or one heavily skewed towards slow writes / fast reads (database-like stuff).

  4. Kim Madsen did you align the loops? If not, 10% is well within the loop alignment variability.

    Benchmark-wise, my original version involved reads/writes to a global hash table; Alexandre's was a degenerate case that only considered the locks.

  5. But that alignment/misalignment effect should be the same for all test cases, since the loops are exactly the same code, regardless of the chosen locking mechanism.

  6. Kim Madsen Not really; the Delphi compiler doesn't align code, so the same code will end up aligned or not depending on the rest of the code that is compiled, the order it is compiled in, etc.

    In other words, the exact same code may or may not be aligned if anything else is changed in the build. If you copy-paste the same loop into ten functions, it'll sometimes be aligned, sometimes not, and which ones are aligned may vary from build to build depending on other minor changes.

    To eliminate effects from alignment, you can check the FastCode B&V code: it basically had the same code copy-pasted several times, or manually aligned/misaligned, to take alignment out of the measurement equation, as some snippets would behave very well when aligned and poorly when not. There were some cases in which the winning implementation was picked because it was more resilient to misalignment, and thus better on average under the Delphi compiler.

  7. But it's the exact same loop, not recompiled for each different scenario. The loop itself is aligned the same way. The loop calls another procedure which is always binary-identical regardless of the chosen locking mechanism, and that procedure then executes one lock or another depending on which mechanism was selected.
    So the loop itself does not change in any binary way, and neither does the procedure it calls, which is why I don't think there is any difference in loop alignment between the different locking mechanisms.

  8. That just moves the alignment issue to the address of the procedure you jump to.

  9. I have just optimized the TkbmMWLock class further.
    Not only does it support lock escalation now, but it's also consistently 25% faster than slim reader/writer locks, 18% faster than TMonitor, 61% faster than TCriticalSection, and 46% faster than TMREWSync.
    Common to all tests: 50 reader threads each spinning 2,000,000 times and 50 writer threads each spinning 200,000 times, with default stringlist lookups and alterations, followed by a verification that the alteration was changed to the expected value.

    All locking mechanisms are reader-biased, but TkbmMWLock can also be set to be writer-biased, which results in TkbmMWLock running at the same speed as slim reader/writer locks.

    An additional side effect is that it's now possible to track the status of all currently held read locks, including the owning thread, application-wide and very fast.

    I'm quite excited about it, actually.

    best regards
    Kim Madsen
    Components4Developers

  10. Kim Madsen How is the performance when using one thread per logical core?

  11. I made some more measurements and observations.

    On an i7 4-core HT setup I ran a test where I locked the VCL main app to core 1, and created 3 reader and 3 writer threads, each locked to cores 3, 5 and 7 (one reader and one writer per physical core). Each reader and writer was set to spin 20 million times. (A core-pinning sketch follows after this comment.)

    TkbmMWLock, TMonitor and slim RW locks came out neck and neck, about 40% faster than TCriticalSection. TMREWS came out more than 280% slower than TkbmMWLock (all testing on XE5).

    best regards
    Kim Madsen
    C4D
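
    For anyone wanting to reproduce the per-core pinning described above, a minimal sketch using the Win32 affinity API (the helper names are made up; bit N of the mask corresponds to logical core N, so cores 1, 3, 5 and 7 map to masks 2, 8, 32 and 128):

      uses
        Winapi.Windows, System.Classes;

      // Pin a worker thread to a single logical core so scheduler migration
      // and hyper-threaded siblings do not distort the measurement.
      procedure PinThreadToCore(AThread: TThread; ACore: Integer);
      begin
        SetThreadAffinityMask(AThread.Handle, DWORD_PTR(1) shl ACore);
      end;

      // Pin the calling thread (e.g. the VCL main thread) the same way.
      procedure PinCurrentThreadToCore(ACore: Integer);
      begin
        SetThreadAffinityMask(GetCurrentThread, DWORD_PTR(1) shl ACore);
      end;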

