Ooouuuh! Me like!
Does the increase in the "Private Bytes" part indicate anything, or is it just visual noise?
Very interesting...
Does this improvement work without degrading single-thread performance?
Very nice!
/sub
Martin Wienold 5.0 vs 5.4? That increase is because of slightly less optimal memory usage. In a real-life app memory usage went from 153 MB to 157 MB, which is completely acceptable to me.
/sub
Alexander Benikowski It should not affect the single-thread performance at all.
Nice improvement! It's not just in the RAID world where having "hot spares" is a good thing :)
Note to everybody: Please test & report findings!
So is there any reason why this shouldn't be the default configuration setting?
a) Not yet implemented in the assembler version. b) Testing. c) Testing. d) Testing.
Does this have 64-bit support?
David Heffernan Yes.
/sub
/GoingToTestToday! :-)
So this adds a small memory overhead per thread? How much per thread for a contended application?
Hard to say. Really hard to say. In most cases, about nothing. The trick is really just in the delayed FreeMem, which can cause some memory to be freed later than expected.
We have to add a mechanism which will prevent some memory from never getting released. (At the moment I could probably create this problem with a very, very tricky program, but I don't think this will ever occur in practice. [famous last words])
Oh, and it is not per thread. There's an array of lock-free stacks in each small block allocator, in the medium block allocator, and soon in the large block allocator. Each array contains 64 stacks and each stack can hold up to 16 memory buffers that are waiting to be released.
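To make that layout easier to picture, here is a minimal sketch of the idea (illustration only, not the actual FastMM4 code: the real branch uses a genuinely lock-free stack, while this sketch guards each stack with a plain spinlock, and all names are invented):

// ReleaseStackSketch.pas - illustration of 64 stacks x 16 pending pointers per allocator
unit ReleaseStackSketch;

interface

const
  NumReleaseStacks = 64;  // stacks per block allocator, as described above
  ReleaseStackSize = 16;  // pending pointers per stack, as described above

type
  TReleaseStack = record
    Lock: Integer;   // 0 = free, 1 = taken (simple spinlock, sketch only)
    Count: Integer;
    Items: array[0..ReleaseStackSize - 1] of Pointer;  // blocks awaiting FreeMem
  end;
  TReleaseStackArray = array[0..NumReleaseStacks - 1] of TReleaseStack;

// Returns False when the stack is busy or full; the caller then frees the block inline.
function TryPush(var Stack: TReleaseStack; ABlock: Pointer): Boolean;
// Returns False when the stack is busy or empty.
function TryPop(var Stack: TReleaseStack; out ABlock: Pointer): Boolean;

implementation

function TryPush(var Stack: TReleaseStack; ABlock: Pointer): Boolean;
begin
  if AtomicCmpExchange(Stack.Lock, 1, 0) <> 0 then
    Exit(False);
  try
    Result := Stack.Count < ReleaseStackSize;
    if Result then
    begin
      Stack.Items[Stack.Count] := ABlock;
      Inc(Stack.Count);
    end;
  finally
    Stack.Lock := 0;
  end;
end;

function TryPop(var Stack: TReleaseStack; out ABlock: Pointer): Boolean;
begin
  if AtomicCmpExchange(Stack.Lock, 1, 0) <> 0 then
    Exit(False);
  try
    Result := Stack.Count > 0;
    if Result then
    begin
      Dec(Stack.Count);
      ABlock := Stack.Items[Stack.Count];
    end;
  finally
    Stack.Lock := 0;
  end;
end;

end.

With 64 such stacks of 16 slots per allocator, the number of blocks whose FreeMem can be deferred at any moment is strictly bounded, which helps explain why the observed overhead above was only a few MB.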
Primož Gabrijelčič Oh right, sorry I was a bit sleepy when posting, forgot what you had written :(
So I'm not familiar with the deep internals of FastMM (or rather, I forgot). Does FastMM keep a small block page (or a few) per thread? Or are all memory allocations serialized?
Did quick tests with the latest Locking_improvements branch, on an old i5 with Seattle. Single-cpu is about the same (Seattle default, Pierre's FastMM, and yours). Multi-cpu is great, for example 3.8 sec (Seattle), 2.5 sec (Pierre), 2.35 sec (yours).
I'm doing lots of string assignments in the test code, which hurts threads especially on this poor old i5 (in perfmon the CPUs can't pass 50% or 60% usage). Forcing large block allocs shows a bigger speed diff in single-cpu as well.
David Berneda Thanks for the feedback!
/sub
/sub
/sub the really fancy stuff :)
/sub
Primož Gabrijelčič oops, sorry, I forgot to add the UseReleaseStack define in my test. Results are even better with UseReleaseStack. Now tested on an i7-4770 with a longer test.
Seattle:
single-cpu: 2.8 sec
multi-cpu: 11.9 sec
Latest FastMM:
single-cpu: 2.8 sec
multi-cpu without UseReleaseStack: 2.5 sec
multi-cpu with UseReleaseStack: 1.7 sec
David Berneda Great improvement! BTW, did you switch values for single and multi cpu?
Primož Gabrijelčič switch values? (sorry, I don't understand). The test is to create and execute 23K diverse SQL queries using a pure-Pascal hand-made SQL engine against in-memory datasets.
David Berneda You wrote that single-cpu needed 2.8 sec and multi-cpu 11.9 sec. That seems strange to me.
Primož Gabrijelčič yep, no change at all; tested multiple times, same results. I blame the heavy usage of strings and variants, copying big arrays of strings with System.Copy etc. I don't use any explicit critical section or lock.
David Berneda So you don't use multiple CPUs to split the work, but you run the full set of queries on each CPU?
Primož Gabrijelčič The test repeats a TParallel.For(0,22...) 1000 times.
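For readers who haven't used System.Threading yet, the shape of that test is roughly the following; RunQuery is a hypothetical stand-in for creating and executing one of the 23 SQL queries and is not TeeBI code:

program ParallelQueryBench;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics, System.Threading;

procedure RunQuery(QueryIndex: Integer);
begin
  // hypothetical placeholder: create and execute query #QueryIndex
end;

var
  Run: Integer;
  SW: TStopwatch;
begin
  SW := TStopwatch.StartNew;
  for Run := 1 to 1000 do
    TParallel.For(0, 22,
      procedure(I: Integer)
      begin
        RunQuery(I);  // the 23 iterations are spread over the default thread pool
      end);
  Writeln(Format('elapsed: %d ms', [SW.ElapsedMilliseconds]));
end.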
What is more puzzling: after adding NeverSleepOnMMThreadContention := True I get the same results for both default Seattle and UseReleaseStack (1.7 sec).
Interesting. NeverSleepOnMMThreadContention is a weird beast - it helps with some programs on some machines and completely kills the performance with other programs.
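For reference, that flag is a plain global Boolean exposed by the System unit (at least in the Delphi versions discussed here), so trying it costs one line early in the project file:

program NeverSleepDemo;
{$APPTYPE CONSOLE}
begin
  // Set as early as possible, before worker threads start allocating.
  NeverSleepOnMMThreadContention := True;  // spin on MM contention instead of calling Sleep
  Writeln('NeverSleepOnMMThreadContention = ', NeverSleepOnMMThreadContention);
end.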
Primož Gabrijelčič How stable do you feel it is for 24/7 usage? I could do a live test by having beginend.net use it. It's not exactly CPU intensive, but it is 24/7 :)
Primož Gabrijelčič Do you have much experience with NUMA machines? Have you thought about modifying the memory manager so that it uses NUMA-aware memory allocation routines?
My current feeling is that there's no Delphi memory manager around that can do that, and that multi-processing is really the only viable route for large NUMA machines.
David Heffernan No, absolutely no experience.
Eric Grange I'm using it in our applications when compiled for debug (not in release yet) and haven't found any problems, even in 24/7 services.
Primož Gabrijelčič NUMA and processor groups are a huge issue. I have a machine that I cannot take advantage of without writing the code specifically to be aware of processor groups. And even if I do that, the heap allocator is not NUMA aware.
To the very best of my knowledge there is no parallel library for Delphi that can scale on a machine with processor groups.
Anyway, I am currently exploring this in my app with a view to using a bespoke multi-processing approach.
David Heffernan I'm pretty sure there's no support for Delphi applications to fully use such a machine. At all. On any level.
ReplyDeleteYour best chance at the moment is using multiple processes assigned to different processor groups.
I haven't even managed to see such a powerful machine yet, so I won't be of much help here :(
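As a starting point for anyone experimenting with this, the sketch below only enumerates the processor groups on the machine (illustration only; the two kernel32 functions exist on Windows 7 / Server 2008 R2 and later and are declared locally in case your Winapi.Windows doesn't already expose them):

program ListProcessorGroups;
{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

function GetActiveProcessorGroupCount: WORD; stdcall;
  external kernel32 name 'GetActiveProcessorGroupCount';
function GetActiveProcessorCount(GroupNumber: WORD): DWORD; stdcall;
  external kernel32 name 'GetActiveProcessorCount';

var
  Group: WORD;
begin
  // A group-unaware process only ever schedules its threads inside the single
  // group it was assigned to at startup (at most 64 logical processors).
  for Group := 0 to GetActiveProcessorGroupCount - 1 do
    Writeln(Format('group %d: %d logical processors',
      [Group, GetActiveProcessorCount(Group)]));
end.

Actually pinning worker threads or child processes to specific groups takes SetThreadGroupAffinity or process-creation attributes on top of this, and the heap allocator still won't be NUMA aware, as David points out.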
David Berneda Can you please rerun your multiprocessor test with UseReleaseStack and LogLockContention and post the _EventLog? I'm very interested in where the memory manager is locking now in your app.
You can strip away the call stack below the FastMM level if you don't want to expose the internals of your application; I'm not really interested in it anyway.
Primož Gabrijelčič Yes, that's my analysis of the situation too. I have such a machine. Many of my clients have such a machine. As you might imagine, these clients get a little upset when they realise that they've spent all the money on the machine and cannot take full advantage of it...
David Heffernan Maybe we can cooperate and find some solution? Contact me on Hangouts or something else.
Primož Gabrijelčič Done! No problem about the call stack, all TeeBI source code is publicly available:
https://drive.google.com/open?id=0BymV3q6di65nTUdrVTVoMm5Mams
If these changes are accepted by Pierre, can I suggest the new version is called FastMM 5? I think the improved parallel support warrants more than a 0.001 increment.
All changes have already been accepted by Pierre into his own Locking_Improvements branch. The plan is to test them fully (and add the ASM version of the improvements) and then push them out as a new FastMM. I don't know what version number he'll pick.
FWIW it passed the test suite here, and is now up on https://www.beginend.net/ & https://mandelbrot.dwscript.net/
Great, thanks for testing!
David Berneda Thanks!
Looks nice - not much blocking, and all of it occurs in FreeMem, mostly in string handling.
Can you try increasing the ReleaseStackSize constant (FastMM4.pas) from 16 to, say, 32 or 64 and running the test again? It would be interesting to see if a bigger stack causes a noticeable drop in lock contention.
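The change being asked for here is just bumping one constant and rebuilding; roughly this, although the surrounding code in FastMM4.pas will of course look different:

// in FastMM4.pas (Locking_Improvements branch)
const
  ReleaseStackSize = 16;  // try 32 or 64 here, rebuild, and re-run the benchmark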
Primož Gabrijelčič no visible difference in either 32-bit or 64-bit; I've even tried an exaggerated value of 4096 and the timings are almost exactly the same. (I cannot use LogLockContention in conjunction with UseReleaseStack, so I can't post a log.)
Thanks. Then it will stay at 16 - a small enough number that works OK.
/sub
I don't understand how you manage not to blow your computer up.
Sometimes it sounds like an aeroplane taking off :)