You have very innovative ideas!
Yes, very innovative! Thanks for the enhancement! I have improved my application with FastMM4 but hit a wall with many large memory allocations. I'll test the new define.
Very interesting. I ran into a similar problem with FastMM last week: after parsing a large (35 MB) XML file with over 6000 nodes, it took over 15 minutes to release the memory (TXMLDocument with the default Microsoft engine). That was with FullDebugMode on, running in the IDE with debugging. Running the same application standalone, it took "only" 10 seconds.
I'll try your changes tomorrow to see if there are any runtime improvements.
/sub
Achim Kalwa That is to be expected; FullDebugMode can be very slow.
Achim Kalwa I don't think you've understood what Primož Gabrijelčič has said. He's not attempting to improve performance with full debug options. He's interested in lock contention when multi-threaded, which appears to be unrelated to what you are discussing.
David Heffernan You are right. Thanks for the clarification.
The LogLock feature could also be adapted to rank/profile normal single-threaded allocations, with finer detail than profiling. BTW, I can't find a way to increase the medium block size (i.e. to n MB) to compare against VirtualAlloc for big sizes. I've been messing with the constants but can't find the right combination.
David Berneda Can you explain a bit more? I'm not sure I understand where you're going.
Primož Gabrijelčič Oops, I wrote too fast! I've mixed two ideas. One is using something similar to LogLock but for a single thread: just gathering allocation stats (counting) per block size (small, medium, large) to output the top stack traces for each. The other question is how the medium-size threshold can be changed (increased) in an attempt to reduce calls to VirtualAlloc for very large allocation sizes.
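(Aside for readers: the first idea, counting allocations per size class, can be prototyped without touching FastMM at all by wrapping the installed memory manager. Below is a minimal sketch, assuming a recent Delphi where the memory manager's size parameter is NativeInt (XE2 and later); the 2608-byte and 256 KB thresholds are only illustrative stand-ins for FastMM4's real small/medium limits, and capturing the top stack traces per class would need extra work on top of this.)

program AllocStatsSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  OldMM: TMemoryManagerEx;
  // Rough, illustrative size classes; FastMM4's real thresholds live in FastMM4.pas.
  SmallCount, MediumCount, LargeCount: Integer;

function CountingGetMem(Size: NativeInt): Pointer;
begin
  if Size <= 2608 then              // "small" (illustrative threshold)
    AtomicIncrement(SmallCount)
  else if Size <= 256 * 1024 then   // "medium" (illustrative threshold)
    AtomicIncrement(MediumCount)
  else
    AtomicIncrement(LargeCount);    // big allocations (VirtualAlloc territory)
  Result := OldMM.GetMem(Size);     // delegate the actual allocation
end;

var
  NewMM: TMemoryManagerEx;
  p: Pointer;
begin
  GetMemoryManager(OldMM);
  NewMM := OldMM;                   // keep FreeMem/ReallocMem/AllocMem as they are
  NewMM.GetMem := CountingGetMem;
  SetMemoryManager(NewMM);
  try
    GetMem(p, 100);    FreeMem(p);
    GetMem(p, 100000); FreeMem(p);
  finally
    SetMemoryManager(OldMM);
  end;
  Writeln(Format('small=%d medium=%d large=%d',
    [SmallCount, MediumCount, LargeCount]));
end.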
David Berneda The former is an interesting idea. The modifications to my changes would probably be quite small.
For the latter: do you just want to change the threshold between the medium and large memory allocators?
Primož Gabrijelčič Yes, the threshold between medium and large. There are several constants, but any change I make produces AVs.
I made a quick fix for the problem: the pending free queue wasn't cleared properly, so it kept sleeping. With my fix I get the following results (when running this test: http://www.stevemaughan.com/delphi/delphi-parallel-programming-library-memory-managers/):
                  single   multi
Delphi Seattle X  24.643   11.695
Primoz's FastMM   25.468    7.711
So almost perfect scaling and 95% CPU usage on my quad core! Kudos for the great work!
https://github.com/andremussche/FastMM4/commit/589c87ab55997837156a3bbe9637a691d3be03fb
André Mussche I actually don't understand what your change does - except that it doesn't free the small block pool when it becomes free.
Primož Gabrijelčič Yes, that's another change (because of AVs when freeing the small block pool), but the real change is this line (which is executed in a repeat loop):
if not LPSmallBlockType.ReleaseStack.IsEmpty then LPSmallBlockType.ReleaseStack.Pop(APointer);
I believe my code also executes that:
{$ifdef UseReleaseStack}
  if (count = (ReleaseStackSize div 2)) or
     LPSmallBlockType.ReleaseStack.IsEmpty or
     (not LPSmallBlockType.ReleaseStack.Pop(APointer)) then
  begin
{$endif}
    APointer := nil;
    {Unlock this block type}
    LPSmallBlockType.BlockTypeLocked := False;
{$ifdef UseReleaseStack}
  end;
  Inc(count);
{$endif}
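(Aside: for anyone trying to follow the exchange above, here is a rough, self-contained sketch of the release-stack idea, not FastMM's actual code, which uses a lock-free stack per small block type. The point is that a thread which cannot get the block-type lock can park the pointer on a small bounded stack instead of spinning, and whichever thread does hold the lock drains that stack while it is already inside the critical section.)

program ReleaseStackSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.SyncObjs;

const
  ReleaseStackSize = 16; // illustrative capacity, not FastMM's value

type
  // Bounded stack of pointers waiting to be freed by whoever holds the
  // block-type lock. A spinlock keeps the sketch short; FastMM's version
  // is lock-free.
  TReleaseStack = record
  private
    FLock: Integer; // 0 = free, 1 = taken
    FCount: Integer;
    FItems: array[0..ReleaseStackSize - 1] of Pointer;
    procedure Acquire;
    procedure Release;
  public
    function TryPush(APointer: Pointer): Boolean;
    function TryPop(out APointer: Pointer): Boolean;
  end;

procedure TReleaseStack.Acquire;
begin
  while TInterlocked.CompareExchange(FLock, 1, 0) <> 0 do
    ; // spin
end;

procedure TReleaseStack.Release;
begin
  TInterlocked.Exchange(FLock, 0);
end;

function TReleaseStack.TryPush(APointer: Pointer): Boolean;
begin
  Acquire;
  try
    Result := FCount < ReleaseStackSize;
    if Result then
    begin
      FItems[FCount] := APointer;
      Inc(FCount);
    end;
  finally
    Release;
  end;
end;

function TReleaseStack.TryPop(out APointer: Pointer): Boolean;
begin
  Acquire;
  try
    Result := FCount > 0;
    if Result then
    begin
      Dec(FCount);
      APointer := FItems[FCount];
    end
    else
      APointer := nil;
  finally
    Release;
  end;
end;

var
  Stack: TReleaseStack; // globals are zero-initialised, so the stack starts empty
  p, q: Pointer;
begin
  GetMem(p, 64);
  // Freeing thread: the block type is locked elsewhere, so defer the free.
  if not Stack.TryPush(p) then
    FreeMem(p); // stack full -> fall back to the normal (blocking) path
  // Lock holder: drain deferred frees while it still owns the lock.
  while Stack.TryPop(q) do
    FreeMem(q);
  Writeln('done');
end.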
Nasty AV in FastFreeMem w/UseReleaseStack was just fixed in https://github.com/gabr42/FastMM4/tree/Locking_Improvements. Pull request committed to pleriche/FastMM4 (https://github.com/pleriche/FastMM4/pull/9).
Steve's speed test now gives 21.000 msec for single-threaded and 5.600 msec for multithreaded, so it seems to scale nicely! (95% CPU)
However, in the FastCode MM challenge it is only slightly faster (overall) than the built-in Delphi MM (single-threaded is a bit slower, multithreaded a bit faster). Probably because of lock contention in medium memory too?
André Mussche Try out the new https://github.com/gabr42/FastMM4/tree/Locking_Improvements (just committed) with /dUseReleaseStack and /dPerCPUReleaseStack. I'll add release stacks for medium/large blocks too, now that I'm sure the concept is working.
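(If you want to try this yourself: the symbols have to reach FastMM4.pas, so a project-wide conditional define or the compiler command line is the usual route. Treat the exact invocation below as an assumption about your build setup.)

// Project Options -> Delphi Compiler -> Conditional defines: UseReleaseStack;PerCPUReleaseStack
// or on the command line, e.g.:
//   dcc32 -DUseReleaseStack;PerCPUReleaseStack MyProject.dpr
// or add the symbols where FastMM4 picks up its options (e.g. FastMM4Options.inc):
{$define UseReleaseStack}
{$define PerCPUReleaseStack}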
Great, I will re-test when the medium blocks are MT too.
Added release stack for medium blocks. I don't think there's much sense in doing the same for large blocks, though.
Some numbers (D10S, 2 CPUs (6 HT cores each), average of three runs, lower is better):
Built-in MM single core: 34,1 sec
Built-in MM multi core: 11,0 sec (3,1x faster)
FastMM4 4.991 MC: 9,0 sec (3,8x faster)
+ UseReleaseStack: 8,6 sec (4,0x faster)
+ PerCPUReleaseStack: 7,1 sec (4,8x faster)
What is the MC version?
Not a version; the 'Multicore' checkbox in the test. Sorry for the confusion.
BTW, Pierre just merged everything into the main Locking_Improvement branch.
Another major speed improvement: with the current gabr42/FastMM4:Locking_Improvement branch I'm getting a benchmark result of 5,2 sec, which is 1,7x the speed of FastMM 4.991 and more than twice the speed of the Delphi 10 Seattle built-in memory manager!
UseReleaseStack still has to be defined; PerCPUReleaseStack was removed as it is now always enforced.
In the FastCode MM Challenge it is only slightly faster, because it often waits in LockLargeBlocks...
I don't think large blocks are important for most multithreaded applications, but most probably the MediumReleaseStack approach could easily be adapted to large blocks too.
/sub
André Mussche Release stack for large blocks is now implemented in my fork (with pull request sent to pleriche).
Thanks, it gets better, but there are still a lot of medium and large locks when requesting new blocks (nil param):
:7762460d KERNELBASE.Sleep + 0xf
FastMM4.LockMediumBlocks(nil,nil)
FastMM4.FastGetMem(???)
:7762460d KERNELBASE.Sleep + 0xf
FastMM4.LockLargeBlocks(nil,nil)
FastMM4.AllocateLargeBlock(420112)
FastMM4.FastGetMem(420112)
The release stack mechanism definitely won't fix that. This problem could only be circumvented by implementing multiple allocators for medium and large blocks.
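(Aside: purely to illustrate what 'multiple allocators' could mean, and not how FastMM is actually structured, the sketch below spreads requests across several independently locked arenas chosen from the calling thread's ID, so two threads only contend when they land on the same arena. A real design would also have to route FreeMem back to the owning arena, which is where most of the complexity hides.)

program ShardedArenasSketch;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SyncObjs;

const
  ArenaCount = 8; // illustrative

type
  TArena = record
    Lock: TCriticalSection; // stands in for one allocator's medium/large lock
  end;

var
  Arenas: array[0..ArenaCount - 1] of TArena;

// Pick an arena from the calling thread's ID so different threads usually
// hit different locks.
function ArenaIndex: Integer;
begin
  Result := Integer(GetCurrentThreadId mod ArenaCount);
end;

function ShardedGetMem(Size: NativeInt): Pointer;
var
  Idx: Integer;
begin
  Idx := ArenaIndex;
  Arenas[Idx].Lock.Acquire;
  try
    // A real implementation would carve the block out of this arena's own
    // pools; the sketch just delegates to the installed memory manager.
    GetMem(Result, Size);
  finally
    Arenas[Idx].Lock.Release;
  end;
end;

var
  i: Integer;
  p: Pointer;
begin
  for i := 0 to ArenaCount - 1 do
    Arenas[i].Lock := TCriticalSection.Create;
  try
    p := ShardedGetMem(420112); // the size from the stack trace above
    FreeMem(p);
    Writeln('ok');
  finally
    for i := 0 to ArenaCount - 1 do
      Arenas[i].Lock.Free;
  end;
end.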
That's a pity, because these are the biggest bottlenecks right now.
André Mussche They are bottlenecks in the benchmark code. I don't believe they are bottlenecks in most real applications. (And if they are, you should adapt the algorithm.)
If you can find a real application where GetMem on medium/large blocks causes problems, let me know and I'll see what can be done. I don't think improving FastMM just so that a benchmark runs faster does any good.
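(Aside: one concrete way to 'adapt the algorithm' when medium/large GetMem dominates a profile is to stop allocating in the hot path at all, for example by giving each worker thread one reusable buffer. A minimal sketch; the buffer size, iteration count and worker body are made up for illustration.)

program ReuseBufferSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

const
  BufferSize = 420 * 1024; // roughly the large-block size from the trace above

// One large allocation per thread, reused on every pass, keeps the hot loop
// away from the large-block lock entirely.
procedure Worker;
var
  Buffer: TBytes;
  Iteration: Integer;
begin
  SetLength(Buffer, BufferSize);
  for Iteration := 1 to 1000 do
    FillChar(Buffer[0], Length(Buffer), 0); // ... fill/process in place ...
end;

var
  Threads: array[0..3] of TThread;
  i: Integer;
begin
  for i := Low(Threads) to High(Threads) do
  begin
    Threads[i] := TThread.CreateAnonymousThread(Worker);
    Threads[i].FreeOnTerminate := False; // we want to WaitFor below
    Threads[i].Start;
  end;
  for i := Low(Threads) to High(Threads) do
  begin
    Threads[i].WaitFor;
    Threads[i].Free;
  end;
  Writeln('done');
end.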
True :) But it is the only extensive benchmark for comparison we have right now?
As far as I know.