Comments

  1. Yes, very innovative! Thanks for the enhancement! I have improved my application with FastMM4 but hit a limit with many large memory allocations. I'll test the new define.

  2. Very interesting. I ran into a similar problem with FastMM last week: after parsing a large (35 MB) XML file with over 6000 nodes, it took over 15 minutes to release the memory (TXMLDocument with the default Microsoft engine). That was with FullDebugMode on, running in the IDE with debugging. Running the same application standalone, it took "only" 10 seconds.
    I'll try your changes tomorrow to see if there are any runtime improvements.

  3. Achim Kalwa That is to be expected; FullDebugMode can be very slow.

  4. Achim Kalwa I don't think you've understood what Primož Gabrijelčič has said. He's not attempting to improve performance with full debug options. He's interested in lock contention when multi-threaded, which appears to be unrelated to what you are discussing.

  5. David Heffernan You are right. Thanks for the clarification.

  6. The LogLock feature could also be adapted to rank/profile normal single-threaded allocations, with finer detail than a profiler. BTW, I can't find a way to increase the medium block size (i.e., to n MB) to compare against VirtualAlloc for big sizes. I've been messing with the constants but can't find the right combination.

  7. David Berneda Can you explain a bit more? I'm not sure I understand where you're going.

  8. Primož Gabrijelčič Oops, I wrote too fast! I've mixed two ideas. One is using something similar to LogLock, but for a single thread: just gathering allocation stats (counts) per block size class (small, medium, large) to output the top stack traces for each. The other question is how the medium size threshold can be changed (increased) in an attempt to reduce calls to VirtualAlloc for very large allocation sizes.

  9. David Berneda The former is an interesting idea. The modifications to my changes would probably be quite small - see the sketch below.

    For the latter - do you just want to change the threshold between the medium and large memory allocator?
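
    Re the former, something along these lines should be enough for the counting part - a rough, untested sketch that chains the installed memory manager via SetMemoryManagerEx (the size thresholds are made up; FastMM4's real size-class limits are internal constants):

    var
      OldMM: TMemoryManagerEx;
      AllocCount: array[0..2] of Int64; // small, medium, large

    function CountingGetMem(Size: NativeInt): Pointer;
    begin
      {Classify by the requested size, then delegate the actual
       allocation to the previously installed MM (FastMM4)}
      if Size <= 2608 then
        AtomicIncrement(AllocCount[0])
      else if Size <= 262144 then
        AtomicIncrement(AllocCount[1])
      else
        AtomicIncrement(AllocCount[2]);
      Result := OldMM.GetMem(Size);
    end;

    procedure InstallCounter;
    var
      NewMM: TMemoryManagerEx;
    begin
      GetMemoryManagerEx(OldMM);
      NewMM := OldMM; // keep FreeMem/ReallocMem/AllocMem unchanged
      NewMM.GetMem := CountingGetMem;
      SetMemoryManagerEx(NewMM);
    end;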

  10. Primož Gabrijelčič Yes, the threshold between medium and large. There are several constants, but any change I make produces AVs.

  11. I made a quick fix for the problem: the pending free queue wasn't cleared properly, so it kept sleeping. With my fix I get the following results (when running this test: http://www.stevemaughan.com/delphi/delphi-parallel-programming-library-memory-managers/)

                      single   multi
    Delphi Seattle X  24.643  11.695
    Primoz's FastMM   25.468   7.711

    So almost perfect scaling and 95% CPU usage on my quad core! Kudos for the great work!

    https://github.com/andremussche/FastMM4/commit/589c87ab55997837156a3bbe9637a691d3be03fb

  12. André Mussche I actually don't understand what your change does - except that it doesn't free the small block pool when it becomes free.

  13. Primož Gabrijelčič Yes, that's another change (because of AVs when freeing the small block pool), but the real change is this line (which is executed in a repeat loop):
    if not LPSmallBlockType.ReleaseStack.IsEmpty then LPSmallBlockType.ReleaseStack.Pop(APointer);

  14. I believe my code also executes that:

    {$ifdef UseReleaseStack}
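        {Keep popping pending frees from the release stack while this block
         type is still locked; only unlock below once the stack is empty, a
         Pop fails, or ReleaseStackSize div 2 entries have been processed.}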
            if (count = (ReleaseStackSize div 2)) or
               LPSmallBlockType.ReleaseStack.IsEmpty or
               (not LPSmallBlockType.ReleaseStack.Pop(APointer)) then
            begin
    {$endif}
              APointer := nil;
              {Unlock this block type}
              LPSmallBlockType.BlockTypeLocked := False;
    {$ifdef UseReleaseStack}
            end;
            Inc(count);
    {$endif}

  15. A nasty AV in FastFreeMem with UseReleaseStack was just fixed in https://github.com/gabr42/FastMM4/tree/Locking_Improvements. A pull request has been submitted to pleriche/FastMM4 (https://github.com/pleriche/FastMM4/pull/9).

  16. Steve's speed test now gives 21.000 msec for single-threaded and 5.600 msec for multithreaded, so it seems to scale nicely (95% CPU)!

    However, in the FastCode MM challenge it is only slightly faster (overall) than the built-in Delphi MM (single-threaded code is a bit slower, multithreaded a bit faster). Probably because of lock contention in medium memory too?

  17. André Mussche Try out the new https://github.com/gabr42/FastMM4/tree/Locking_Improvements (just committed) with /dUseReleaseStack and /dPerCPUReleaseStack. I'll add release stacks for medium/large blocks too, now that I'm sure the concept is working.
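
    If you'd rather not pass the defines on the command line, they can also be enabled with conditional defines in the source (e.g. in your copy of FastMM4Options.inc) or in the project options:

    {$define UseReleaseStack}
    {$define PerCPUReleaseStack}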

  18. Great, I will re-test when the medium blocks are MT too.

  19. Added release stack for medium blocks. I don't think there's much sense in doing the same for large blocks, though.

  20. Some numbers (D10S, 2 CPUs (6 HT cores each), average of three runs, lower is better):

    Built-in MM single core: 34,1 sec
    Built-in MM multi core: 11,0 sec (3,1x faster)
    FastMM4 4.991 MC: 9,0 sec (3,8x faster)
    + UseReleaseStack: 8,6 sec (4,0x faster)
    + PerCPUReleaseStack: 7,1 sec (4,8x faster)

  21. Not the version - 'MC' refers to the 'Multicore' checkbox in the test. Sorry for the confusion.

  22. BTW, Pierre just merged everything into the main Locking_Improvement branch.

  23. Another major speed improvement - with the current gabr42/FastMM4:Locking_Improvement branch I'm getting a benchmark result of 5,2 sec, which is 1,7x the speed of FastMM 4.991 and more than twice the speed of the Delphi 10 Seattle built-in memory manager!

    UseReleaseStack still has to be defined; PerCPUReleaseStack was removed, as it is now always enforced.

  24. In the FastCode MM Challenge it is only slightly faster, because it often waits in LockLargeBlocks...

  25. I don't think large blocks are important for most multithreaded applications, but most probably the MediumReleaseStack approach could easily be adapted to large blocks too.

  26. André Mussche Release stack for large blocks is now implemented in my fork (with pull request sent to pleriche).

  27. Thanks, it gets better, but there are still a lot of medium and large block locks when requesting new blocks (nil param):

    :7762460d KERNELBASE.Sleep + 0xf
    FastMM4.LockMediumBlocks(nil,nil)
    FastMM4.FastGetMem(???)

    :7762460d KERNELBASE.Sleep + 0xf
    FastMM4.LockLargeBlocks(nil,nil)
    FastMM4.AllocateLargeBlock(420112)
    FastMM4.FastGetMem(420112)

  28. The release stack mechanism definitely won't fix that. This problem could only be circumvented by implementing multiple allocators for medium and large blocks.
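
    Just to sketch that idea (the locking part only - everything else is handwaved and all names are made up; this is not FastMM4 code):

    const
      MediumArenaCount = 4; // assumption; ideally one arena per CPU core

    type
      TMediumArena = record
        Locked: Integer; // 0 = free, 1 = taken
        {per-arena medium block pool state would live here}
      end;

    var
      MediumArenas: array[0..MediumArenaCount - 1] of TMediumArena;

    function LockSomeMediumArena: Integer;
    var
      i: Integer;
    begin
      {Try the arenas round-robin; a thread only yields when *all* arenas
       are busy, instead of everybody sleeping on one global lock.}
      repeat
        for i := 0 to MediumArenaCount - 1 do
          if AtomicCmpExchange(MediumArenas[i].Locked, 1, 0) = 0 then
            Exit(i); // acquired arena i
        Sleep(0); // yield (Winapi.Windows) and retry
      until False;
    end;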

  29. That's a pity, because these are the biggest bottlenecks right now.

  30. André Mussche They are bottlenecks in the benchmark code. I don't believe they are bottlenecks in most real applications. (And if they are, you should adapt the algorithm.)

    If you can find a real application where GetMem on medium/large blocks causes problems, let me know and I'll see what can be done. I don't think improving FastMM just so that a benchmark runs faster can bring any good.

  31. True :) But it is the only extensive benchmark we have for comparison right now?

