You have very innovative ideas!
Yes, very innovative! Thanks for the enhancement! I have improved my application with FastMM4 but hit a wall with many large memory allocations. I'll test the new define.
Very interesting. I ran into a similar problem with FastMM last week: after parsing a large (35 MB) XML file with over 6000 nodes, it took over 15 minutes to release the memory (TXMLDocument with the default Microsoft engine). That was with FullDebugMode on, running in the IDE with debugging. Running the same application standalone, it took "only" 10 seconds.
I'll try your changes tomorrow to see if there are any runtime improvements.
/sub
Achim Kalwa That is to be expected; FullDebugMode can be very slow.
Achim Kalwa I don't think you've understood what Primož Gabrijelčič has said. He's not attempting to improve performance with full debug options. He's interested in lock contention when multi-threaded, which appears to be unrelated to what you are discussing.
David Heffernan You are right. Thanks for the clarification.
The LogLock feature could also be adapted to rank/profile normal single-threaded allocations, with finer detail than profiling. BTW, I can't find a way to increase the medium block size (i.e. to n MB) to compare against VirtualAlloc for big sizes. I've been messing with the constants but can't find the right combination.
David Berneda Can you explain a bit more? I'm not sure I understand where you're going.
Primož Gabrijelčič Oops, I wrote too fast! I've mixed two ideas. One is using something similar to LogLock but for a single thread: just gathering allocation stats (counting) per block size (small, medium, large) to output the top stack traces for each. The other question is how the medium-size threshold can be changed (increased) in an attempt to reduce calls to VirtualAlloc for very large allocation sizes.
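(Aside for readers: the first idea, counting allocations per size class, can be prototyped without touching FastMM at all by wrapping the installed memory manager. Below is a minimal sketch, assuming a recent Delphi where the memory manager's size parameter is NativeInt (XE2 and later); the 2608-byte and 256 KB thresholds are only illustrative stand-ins for FastMM4's real small/medium limits, and capturing the top stack traces per class would need extra work on top of this.)

program AllocStatsSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  OldMM: TMemoryManagerEx;
  // Rough, illustrative size classes; FastMM4's real thresholds live in FastMM4.pas.
  SmallCount, MediumCount, LargeCount: Integer;

function CountingGetMem(Size: NativeInt): Pointer;
begin
  if Size <= 2608 then              // "small" (illustrative threshold)
    AtomicIncrement(SmallCount)
  else if Size <= 256 * 1024 then   // "medium" (illustrative threshold)
    AtomicIncrement(MediumCount)
  else
    AtomicIncrement(LargeCount);    // big allocations (VirtualAlloc territory)
  Result := OldMM.GetMem(Size);     // delegate the actual allocation
end;

var
  NewMM: TMemoryManagerEx;
  p: Pointer;
begin
  GetMemoryManager(OldMM);
  NewMM := OldMM;                   // keep FreeMem/ReallocMem/AllocMem as they are
  NewMM.GetMem := CountingGetMem;
  SetMemoryManager(NewMM);
  try
    GetMem(p, 100);    FreeMem(p);
    GetMem(p, 100000); FreeMem(p);
  finally
    SetMemoryManager(OldMM);
  end;
  Writeln(Format('small=%d medium=%d large=%d',
    [SmallCount, MediumCount, LargeCount]));
end.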
David Berneda The former is an interesting idea. The modifications to my changes would probably be quite small.
For the latter: do you just want to change the threshold between the medium and large memory allocators?
Primož Gabrijelčič Yes, the threshold between medium and large. There are several constants, but any change I make produces AVs.
I made a quick fix for the problem: the pending free queue wasn't cleared properly, so it kept sleeping. With my fix I get the following results (when running this test: http://www.stevemaughan.com/delphi/delphi-parallel-programming-library-memory-managers/):
                  single   multi
Delphi Seattle X  24.643   11.695
Primoz's FastMM   25.468    7.711
So almost perfect scaling and 95% CPU usage on my quad core! Kudos for the great work!
https://github.com/andremussche/FastMM4/commit/589c87ab55997837156a3bbe9637a691d3be03fb
André Mussche I actually don't understand what your change does - except that it doesn't free the small block pool when it becomes free.
Primož Gabrijelčič Yes, that's another change (because of AVs when freeing the small block pool), but the real change is this line (which is executed in a repeat loop):
if not LPSmallBlockType.ReleaseStack.IsEmpty then LPSmallBlockType.ReleaseStack.Pop(APointer);
I believe my code also executes that:
{$ifdef UseReleaseStack}
  if (count = (ReleaseStackSize div 2)) or
     LPSmallBlockType.ReleaseStack.IsEmpty or
     (not LPSmallBlockType.ReleaseStack.Pop(APointer)) then
  begin
{$endif}
    APointer := nil;
    {Unlock this block type}
    LPSmallBlockType.BlockTypeLocked := False;
{$ifdef UseReleaseStack}
  end;
  Inc(count);
{$endif}
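(Aside: for anyone trying to follow the exchange above, here is a rough, self-contained sketch of the release-stack idea, not FastMM's actual code, which uses a lock-free stack per small block type. The point is that a thread which cannot get the block-type lock can park the pointer on a small bounded stack instead of spinning, and whichever thread does hold the lock drains that stack while it is already inside the critical section.)

program ReleaseStackSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.SyncObjs;

const
  ReleaseStackSize = 16; // illustrative capacity, not FastMM's value

type
  // Bounded stack of pointers waiting to be freed by whoever holds the
  // block-type lock. A spinlock keeps the sketch short; FastMM's version
  // is lock-free.
  TReleaseStack = record
  private
    FLock: Integer; // 0 = free, 1 = taken
    FCount: Integer;
    FItems: array[0..ReleaseStackSize - 1] of Pointer;
    procedure Acquire;
    procedure Release;
  public
    function TryPush(APointer: Pointer): Boolean;
    function TryPop(out APointer: Pointer): Boolean;
  end;

procedure TReleaseStack.Acquire;
begin
  while TInterlocked.CompareExchange(FLock, 1, 0) <> 0 do
    ; // spin
end;

procedure TReleaseStack.Release;
begin
  TInterlocked.Exchange(FLock, 0);
end;

function TReleaseStack.TryPush(APointer: Pointer): Boolean;
begin
  Acquire;
  try
    Result := FCount < ReleaseStackSize;
    if Result then
    begin
      FItems[FCount] := APointer;
      Inc(FCount);
    end;
  finally
    Release;
  end;
end;

function TReleaseStack.TryPop(out APointer: Pointer): Boolean;
begin
  Acquire;
  try
    Result := FCount > 0;
    if Result then
    begin
      Dec(FCount);
      APointer := FItems[FCount];
    end
    else
      APointer := nil;
  finally
    Release;
  end;
end;

var
  Stack: TReleaseStack; // globals are zero-initialised, so the stack starts empty
  p, q: Pointer;
begin
  GetMem(p, 64);
  // Freeing thread: the block type is locked elsewhere, so defer the free.
  if not Stack.TryPush(p) then
    FreeMem(p); // stack full -> fall back to the normal (blocking) path
  // Lock holder: drain deferred frees while it still owns the lock.
  while Stack.TryPop(q) do
    FreeMem(q);
  Writeln('done');
end.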
Nasty AV in FastFreeMem w/UseReleaseStack was just fixed in https://github.com/gabr42/FastMM4/tree/Locking_Improvements. Pull request committed to pleriche/FastMM4 (https://github.com/pleriche/FastMM4/pull/9).
Steve's speed test now gives 21.000 msec for single-threaded and 5.600 msec for multithreaded, so it seems to scale nicely! (95% CPU)
However, in the FastCode MM challenge it is only slightly faster (overall) than the built-in Delphi MM (single-threaded is a bit slower, multithreaded a bit faster). Probably because of lock contention in medium memory too?
André Mussche Try out the new https://github.com/gabr42/FastMM4/tree/Locking_Improvements (just committed) with /dUseReleaseStack and /dPerCPUReleaseStack. I'll add release stacks for medium/large blocks too, now that I'm sure the concept is working.
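(If you want to try this yourself: the symbols have to reach FastMM4.pas, so a project-wide conditional define or the compiler command line is the usual route. Treat the exact invocation below as an assumption about your build setup.)

// Project Options -> Delphi Compiler -> Conditional defines: UseReleaseStack;PerCPUReleaseStack
// or on the command line, e.g.:
//   dcc32 -DUseReleaseStack;PerCPUReleaseStack MyProject.dpr
// or add the symbols where FastMM4 picks up its options (e.g. FastMM4Options.inc):
{$define UseReleaseStack}
{$define PerCPUReleaseStack}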
Great, I will re-test when the medium blocks are MT too.
Added release stack for medium blocks. I don't think there's much sense in doing the same for large blocks, though.
Some numbers (D10S, 2 CPUs (6 HT cores each), average of three runs, lower is better):
Built-in MM single core: 34,1 sec
Built-in MM multi core: 11,0 sec (3,1x faster)
FastMM4 4.991 MC: 9,0 sec (3,8x faster)
+ UseReleaseStack: 8,6 sec (4,0x faster)
+ PerCPUReleaseStack: 7,1 sec (4,8x faster)
What is the MC version?
Not a version; the 'Multicore' checkbox in the test. Sorry for the confusion.
BTW, Pierre just merged everything into the main Locking_Improvement branch.
Another major speed improvement: with the current gabr42/FastMM4:Locking_Improvement branch I'm getting a benchmark result of 5,2 sec, which is 1,7x the speed of FastMM 4.991 and more than twice the speed of the Delphi 10 Seattle built-in memory manager!
UseReleaseStack still has to be defined; PerCPUReleaseStack was removed as it is now always enforced.
In the FastCode MM Challenge it is only slightly faster, because it often waits in LockLargeBlocks...
I don't think large blocks are important for most multithreaded applications, but most probably the MediumReleaseStack approach could easily be adapted to large blocks too.
/sub
André Mussche Release stack for large blocks is now implemented in my fork (with pull request sent to pleriche).
Thanks, it gets better, but there are still a lot of medium and large locks when requesting new blocks (nil param):
:7762460d KERNELBASE.Sleep + 0xf
FastMM4.LockMediumBlocks(nil,nil)
FastMM4.FastGetMem(???)
:7762460d KERNELBASE.Sleep + 0xf
FastMM4.LockLargeBlocks(nil,nil)
FastMM4.AllocateLargeBlock(420112)
FastMM4.FastGetMem(420112)
The release stack mechanism definitely won't fix that. This problem could only be circumvented by implementing multiple allocators for medium and large blocks.
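(Aside: purely to illustrate what 'multiple allocators' could mean, and not how FastMM is actually structured, the sketch below spreads requests across several independently locked arenas chosen from the calling thread's ID, so two threads only contend when they land on the same arena. A real design would also have to route FreeMem back to the owning arena, which is where most of the complexity hides.)

program ShardedArenasSketch;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SyncObjs;

const
  ArenaCount = 8; // illustrative

type
  TArena = record
    Lock: TCriticalSection; // stands in for one allocator's medium/large lock
  end;

var
  Arenas: array[0..ArenaCount - 1] of TArena;

// Pick an arena from the calling thread's ID so different threads usually
// hit different locks.
function ArenaIndex: Integer;
begin
  Result := Integer(GetCurrentThreadId mod ArenaCount);
end;

function ShardedGetMem(Size: NativeInt): Pointer;
var
  Idx: Integer;
begin
  Idx := ArenaIndex;
  Arenas[Idx].Lock.Acquire;
  try
    // A real implementation would carve the block out of this arena's own
    // pools; the sketch just delegates to the installed memory manager.
    GetMem(Result, Size);
  finally
    Arenas[Idx].Lock.Release;
  end;
end;

var
  i: Integer;
  p: Pointer;
begin
  for i := 0 to ArenaCount - 1 do
    Arenas[i].Lock := TCriticalSection.Create;
  try
    p := ShardedGetMem(420112); // the size from the stack trace above
    FreeMem(p);
    Writeln('ok');
  finally
    for i := 0 to ArenaCount - 1 do
      Arenas[i].Lock.Free;
  end;
end.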
That's a pity, because these are the biggest bottlenecks right now.
André Mussche They are bottlenecks in the benchmark code. I don't believe they are bottlenecks in most real applications. (And if they are, you should adapt the algorithm.)
If you can find a real application where GetMem on medium/large blocks causes problems, let me know and I'll see what can be done. I don't think improving FastMM just so that a benchmark runs faster does any good.
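(Aside: one concrete way to 'adapt the algorithm' when medium/large GetMem dominates a profile is to stop allocating in the hot path at all, for example by giving each worker thread one reusable buffer. A minimal sketch; the buffer size, iteration count and worker body are made up for illustration.)

program ReuseBufferSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

const
  BufferSize = 420 * 1024; // roughly the large-block size from the trace above

// One large allocation per thread, reused on every pass, keeps the hot loop
// away from the large-block lock entirely.
procedure Worker;
var
  Buffer: TBytes;
  Iteration: Integer;
begin
  SetLength(Buffer, BufferSize);
  for Iteration := 1 to 1000 do
    FillChar(Buffer[0], Length(Buffer), 0); // ... fill/process in place ...
end;

var
  Threads: array[0..3] of TThread;
  i: Integer;
begin
  for i := Low(Threads) to High(Threads) do
  begin
    Threads[i] := TThread.CreateAnonymousThread(Worker);
    Threads[i].FreeOnTerminate := False; // we want to WaitFor below
    Threads[i].Start;
  end;
  for i := Low(Threads) to High(Threads) do
  begin
    Threads[i].WaitFor;
    Threads[i].Free;
  end;
  Writeln('done');
end.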
True :) But it is the only extensive benchmark for comparison we have right now?
As far as I know.