Ooouuuh! Me like!
Does the increase in the "Private Bytes" part indicate anything, or is it just visual noise?
Very interesting...
Does this improvement work without degrading single-thread performance?
Very nice!
/sub
Martin Wienold 5.0 vs 5.4? That increase is because of slightly less optimal memory usage. In a real-life app memory usage went from 153 MB to 157 MB, which is completely acceptable to me.
/sub
Alexander Benikowski It should not affect the single-thread performance at all.
Nice improvement! It's not just in the RAID world where having "hot spares" is a good thing :)
Note to everybody: Please test & report findings!
So is there any reason why this shouldn't be the default configuration setting?
a) Not yet implemented in the assembler version. b) Testing. c) Testing. d) Testing.
Does this have 64-bit support?
David Heffernan Yes.
/sub
/GoingToTestToday! :-)
So this adds a small memory overhead per thread? How much per thread for a contended application?
Hard to say. Really hard to say. In most cases, about nothing. The trick is really just in the delayed FreeMem, which can cause some memory to be freed later than expected.
We have to add a mechanism which will prevent some memory from never getting released. (At the moment I could probably create this problem with a very, very tricky program, but I don't think this will ever occur in practice. [famous last words])
Oh, and it is not per thread. There's an array of lock-free stacks in each small block allocator, in the medium block allocator, and soon in the large block allocator. Each array contains 64 stacks and each stack can hold up to 16 memory buffers that are waiting to be released.
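To make that layout easier to picture, here is a minimal sketch of the idea (illustration only, not the actual FastMM4 code: the real branch uses a genuinely lock-free stack, while this sketch guards each stack with a plain spinlock, and all names are invented):

// ReleaseStackSketch.pas - illustration of 64 stacks x 16 pending pointers per allocator
unit ReleaseStackSketch;

interface

const
  NumReleaseStacks = 64;  // stacks per block allocator, as described above
  ReleaseStackSize = 16;  // pending pointers per stack, as described above

type
  TReleaseStack = record
    Lock: Integer;   // 0 = free, 1 = taken (simple spinlock, sketch only)
    Count: Integer;
    Items: array[0..ReleaseStackSize - 1] of Pointer;  // blocks awaiting FreeMem
  end;
  TReleaseStackArray = array[0..NumReleaseStacks - 1] of TReleaseStack;

// Returns False when the stack is busy or full; the caller then frees the block inline.
function TryPush(var Stack: TReleaseStack; ABlock: Pointer): Boolean;
// Returns False when the stack is busy or empty.
function TryPop(var Stack: TReleaseStack; out ABlock: Pointer): Boolean;

implementation

function TryPush(var Stack: TReleaseStack; ABlock: Pointer): Boolean;
begin
  if AtomicCmpExchange(Stack.Lock, 1, 0) <> 0 then
    Exit(False);
  try
    Result := Stack.Count < ReleaseStackSize;
    if Result then
    begin
      Stack.Items[Stack.Count] := ABlock;
      Inc(Stack.Count);
    end;
  finally
    Stack.Lock := 0;
  end;
end;

function TryPop(var Stack: TReleaseStack; out ABlock: Pointer): Boolean;
begin
  if AtomicCmpExchange(Stack.Lock, 1, 0) <> 0 then
    Exit(False);
  try
    Result := Stack.Count > 0;
    if Result then
    begin
      Dec(Stack.Count);
      ABlock := Stack.Items[Stack.Count];
    end;
  finally
    Stack.Lock := 0;
  end;
end;

end.

With 64 such stacks of 16 slots per allocator, the number of blocks whose FreeMem can be deferred at any moment is strictly bounded, which helps explain why the observed overhead above was only a few MB.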
Primož Gabrijelčič Oh right, sorry I was a bit sleepy when posting, forgot what you had written :(
So I'm not familiar with the deep internals of FastMM (or rather, I forgot). Does FastMM keep a small block page (or a few) per thread? Or are all memory allocations serialized?
Did quick tests with the latest Locking_improvements branch, on an old i5 with Seattle. Single-cpu is about the same (Seattle default, Pierre's FastMM, and yours). Multi-cpu is great, for example 3.8 sec (Seattle), 2.5 sec (Pierre), 2.35 sec (yours).
I'm doing lots of string assignments in the test code, which hurts threads especially on this poor old i5 (in perfmon the CPUs can't pass 50% or 60% usage). Forcing large block allocs shows a bigger speed diff in single-cpu as well.
David Berneda Thanks for the feedback!
/sub
/sub
/sub the really fancy stuff :)
/sub
Primož Gabrijelčič oops, sorry, I forgot to add the UseReleaseStack define in my test. Results are even better with UseReleaseStack. Now tested on an i7-4770 with a longer test.
Seattle:
single-cpu: 2.8 sec
multi-cpu: 11.9 sec
Latest FastMM:
single-cpu: 2.8 sec
multi-cpu without UseReleaseStack: 2.5 sec
multi-cpu with UseReleaseStack: 1.7 sec
David Berneda Great improvement! BTW, did you switch values for single and multi cpu?
Primož Gabrijelčič switch values? (sorry, I don't understand). The test is to create and execute 23K diverse SQL queries using a pure-Pascal hand-made SQL engine against in-memory datasets.
David Berneda You wrote that single-cpu needed 2.8 sec and multi-cpu 11.9 sec. That seems strange to me.
Primož Gabrijelčič yep, no change at all; tested multiple times, same results. I blame the heavy usage of strings and variants, copying big arrays of strings with System.Copy etc. I don't use any explicit critical section or lock.
David Berneda So you don't use multiple CPUs to split the work, but you run the full set of queries on each CPU?
Primož Gabrijelčič The test repeats a TParallel.For(0,22...) 1000 times.
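For readers who haven't used System.Threading yet, the shape of that test is roughly the following; RunQuery is a hypothetical stand-in for creating and executing one of the 23 SQL queries and is not TeeBI code:

program ParallelQueryBench;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics, System.Threading;

procedure RunQuery(QueryIndex: Integer);
begin
  // hypothetical placeholder: create and execute query #QueryIndex
end;

var
  Run: Integer;
  SW: TStopwatch;
begin
  SW := TStopwatch.StartNew;
  for Run := 1 to 1000 do
    TParallel.For(0, 22,
      procedure(I: Integer)
      begin
        RunQuery(I);  // the 23 iterations are spread over the default thread pool
      end);
  Writeln(Format('elapsed: %d ms', [SW.ElapsedMilliseconds]));
end.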
What is more puzzling: after adding NeverSleepOnMMThreadContention := True I get the same results for both default Seattle and UseReleaseStack (1.7 sec).
Interesting. NeverSleepOnMMThreadContention is a weird beast - it helps with some programs on some machines and completely kills the performance with other programs.
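For reference, that flag is a plain global Boolean exposed by the System unit (at least in the Delphi versions discussed here), so trying it costs one line early in the project file:

program NeverSleepDemo;
{$APPTYPE CONSOLE}
begin
  // Set as early as possible, before worker threads start allocating.
  NeverSleepOnMMThreadContention := True;  // spin on MM contention instead of calling Sleep
  Writeln('NeverSleepOnMMThreadContention = ', NeverSleepOnMMThreadContention);
end.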
Primož Gabrijelčič How stable do you feel it is for 24/7 usage? I could do a live test by having beginend.net use it. It's not exactly CPU intensive, but it is 24/7 :)
Primož Gabrijelčič Do you have much experience with NUMA machines? Have you thought about modifying the memory manager so that it uses NUMA-aware memory allocation routines?
My current feeling is that there's no Delphi memory manager around that can do that, and that multi-processing is really the only viable route for large NUMA machines.
David Heffernan No, absolutely no experience.
Eric Grange I'm using it in our applications when compiled for debug (not in release yet) and haven't found any problems, even in 24/7 services.
Primož Gabrijelčič NUMA and processor groups are a huge issue. I have a machine that I cannot take advantage of without writing the code specifically to be aware of processor groups. And even if I do that, the heap allocator is not NUMA aware.
To the very best of my knowledge there is no parallel library for Delphi that can scale on a machine with processor groups.
Anyway, I am currently exploring this in my app with a view to using a bespoke multi-processing approach.
David Heffernan I'm pretty sure there's no support for Delphi applications to fully use such a machine. At all. On any level.
ReplyDeleteYour best chance at the moment is using multiple processes assigned to different processor groups.
I haven't even managed to see such a powerful machine yet, so I won't be of much help here :(
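As a starting point for anyone experimenting with this, the sketch below only enumerates the processor groups on the machine (illustration only; the two kernel32 functions exist on Windows 7 / Server 2008 R2 and later and are declared locally in case your Winapi.Windows doesn't already expose them):

program ListProcessorGroups;
{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

function GetActiveProcessorGroupCount: WORD; stdcall;
  external kernel32 name 'GetActiveProcessorGroupCount';
function GetActiveProcessorCount(GroupNumber: WORD): DWORD; stdcall;
  external kernel32 name 'GetActiveProcessorCount';

var
  Group: WORD;
begin
  // A group-unaware process only ever schedules its threads inside the single
  // group it was assigned to at startup (at most 64 logical processors).
  for Group := 0 to GetActiveProcessorGroupCount - 1 do
    Writeln(Format('group %d: %d logical processors',
      [Group, GetActiveProcessorCount(Group)]));
end.

Actually pinning worker threads or child processes to specific groups takes SetThreadGroupAffinity or process-creation attributes on top of this, and the heap allocator still won't be NUMA aware, as David points out.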
David Berneda Can you please rerun your multiprocessor test with UseReleaseStack and LogLockContention and post the _EventLog? I'm very interested in where the memory manager is locking now in your app.
You can strip away the call stack below the FastMM level if you don't want to expose the internals of your application; I'm not really interested in it anyway.
Primož Gabrijelčič Yes, that's my analysis of the situation too. I have such a machine. Many of my clients have such a machine. As you might imagine, these clients get a little upset when they realise that they've spent all the money on the machine and cannot take full advantage of it...
David Heffernan Maybe we can cooperate and find some solution? Contact me on Hangouts or something else.
Primož Gabrijelčič Done! No problem about the call stack, all TeeBI source code is publicly available:
https://drive.google.com/open?id=0BymV3q6di65nTUdrVTVoMm5Mams
If these changes are accepted by Pierre, can I suggest the new version is called FastMM 5? I think the improved parallel support warrants more than a 0.001 increment.
All changes have already been accepted by Pierre into his own Locking_Improvements branch. The plan is to test them fully (and add the ASM version of the improvements) and then push them out as a new FastMM. I don't know what version number he'll pick.
FWIW it passed the test suite here, and is now up on https://www.beginend.net/ & https://mandelbrot.dwscript.net/
Great, thanks for testing!
David Berneda Thanks!
Looks nice - not much blocking, and all of it occurs in FreeMem, mostly in string handling.
Can you try increasing the ReleaseStackSize constant (FastMM4.pas) from 16 to, say, 32 or 64 and running the test again? It would be interesting to see if a bigger stack causes a noticeable drop in lock contention.
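The change being asked for here is just bumping one constant and rebuilding; roughly this, although the surrounding code in FastMM4.pas will of course look different:

// in FastMM4.pas (Locking_Improvements branch)
const
  ReleaseStackSize = 16;  // try 32 or 64 here, rebuild, and re-run the benchmark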
Primož Gabrijelčič no visible difference in either 32-bit or 64-bit; I've even tried an exaggerated value of 4096 and the timings are almost exactly the same. (I cannot use LogLockContention in conjunction with UseReleaseStack, so I can't post a log.)
Thanks. Then it will stay at 16 - a small enough number that works OK.
/sub
I don't understand how you manage not to blow your computer up.
Sometimes it sounds like an aeroplane taking off :)