On the left: bench app compiled with FastMM 4.991.

On the right: same app, compiled with https://github.com/pleriche/FastMM4/tree/Locking_Improvements and /dUseReleaseStack.

Comments

  1. Does the increase in the "Private Bytes" part indicate anything, or is it just visual noise?

  2. Does this improvement work without degrading single-thread performance?

  3. Martin Wienold 5.0 vs 5.4? That's an increase caused by slightly less optimal memory usage. In a real-life app, memory usage went from 153 to 157 MB, which is completely acceptable to me.

  4. Alexander Benikowski It should not affect the single-thread performance at all.

  5. Nice improvement! It's not just in the RAID world where having "hot spares" is a good thing :)

  6. Note to everybody: Please test & report findings!

  7. So is there any reason why this shouldn't be the default configuration?

  8. a) Not yet implemented in the assembler version. b) Testing. c) Testing. d) Testing.

  9. So this adds a small memory overhead per thread? How much per thread for a contended application?

  10. Hard to say. Really hard to say. In most cases, next to nothing. The trick is really just in the delayed FreeMem, which can cause some memory to be freed later than expected.

    We have to add a mechanism which will prevent some memory from never getting released. (At the moment I could probably create this problem with a very, very tricky program, but I don't think it will ever occur in practice. [famous last words])

  11. Oh, and it is not per thread. There's an array of lock-free stacks in each small block allocator, in the medium block allocator, and soon in the large block allocator. Each array contains 64 stacks and each stack can hold up to 16 memory buffers that are waiting to be released.
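
    For illustration only: a conceptual Delphi sketch of such a release-stack array. All names here are invented and this is not the actual FastMM4 code; a simple try-lock stands in for the real lock-free stack, and a caller that cannot push would fall back to releasing the block immediately.

    unit ReleaseStackSketch;

    interface

    uses
      System.SyncObjs;

    const
      CReleaseStackSize = 16; // pending blocks per stack (cf. ReleaseStackSize)
      CNumReleaseStacks = 64; // stacks per allocator

    type
      TReleaseStack = record
      private
        FLock: Integer;  // 0 = free, 1 = taken
        FCount: Integer;
        FItems: array[0..CReleaseStackSize - 1] of Pointer;
      public
        // Returns False when the stack is busy or full; the caller would
        // then release the memory block immediately instead of deferring it.
        function TryPush(ABlock: Pointer): Boolean;
        // Called by the allocator to drain deferred blocks.
        function TryPop(out ABlock: Pointer): Boolean;
      end;

      // One such array per small block allocator / medium block allocator.
      TAllocatorReleaseStacks = array[0..CNumReleaseStacks - 1] of TReleaseStack;

    implementation

    function TReleaseStack.TryPush(ABlock: Pointer): Boolean;
    begin
      Result := False;
      if TInterlocked.CompareExchange(FLock, 1, 0) = 0 then
      try
        if FCount < CReleaseStackSize then
        begin
          FItems[FCount] := ABlock;
          Inc(FCount);
          Result := True;
        end;
      finally
        TInterlocked.Exchange(FLock, 0);
      end;
    end;

    function TReleaseStack.TryPop(out ABlock: Pointer): Boolean;
    begin
      Result := False;
      ABlock := nil;
      if TInterlocked.CompareExchange(FLock, 1, 0) = 0 then
      try
        if FCount > 0 then
        begin
          Dec(FCount);
          ABlock := FItems[FCount];
          Result := True;
        end;
      finally
        TInterlocked.Exchange(FLock, 0);
      end;
    end;

    end.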

  12. Primož Gabrijelčič Oh right, sorry I was a bit sleepy when posting, forgot what you had written :(

    So I'm not familiar with the deep internals of FastMM (or rather, I forgot). Does FastMM keep a small block page (or a few) per-thread? Or are all memory allocations serialized?

  13. Did quick tests with the latest Locking_Improvements branch, on an old i5 with Seattle. Single-cpu is about the same (Seattle default, Pierre's FastMM, and yours). Multi-cpu is great, for example 3.8 seconds (Seattle), 2.5 sec (Pierre), 2.35 sec (yours).

    I'm doing lots of string assignments in the test code, which hurts threads especially on this poor old i5 (in perfmon the CPUs can't pass 50% or 60% usage). Forcing large block allocations shows a bigger speed difference, also in single-cpu.

  14. Primož Gabrijelčič Oops, sorry, I forgot to add the UseReleaseStack define in my test. Results are even better with UseReleaseStack. Now tested on an i7-4770 with a longer test.

    Seattle:
    single-cpu: 2.8 sec
    multi-cpu: 11.9 sec

    Latest FastMM:
    single-cpu: 2.8 sec
    multi-cpu without UseReleaseStack: 2.5 sec
    multi-cpu with UseReleaseStack: 1.7 sec

  15. David Berneda Great improvement! BTW, did you switch values for single and multi cpu?

  16. Primož Gabrijelčič Switch values? (Sorry, I don't understand.) The test is to create and execute 23K diverse SQL queries using a pure-Pascal hand-made SQL engine against in-memory datasets.

  17. David Berneda You wrote that single-cpu needed 2.8 sec and multi-cpu 11.9 sec. That seems strange to me.

  18. Primož Gabrijelčič Yep, no change at all; tested multiple times with the same results. I blame the heavy usage of strings and variants, copying big arrays of strings with System.Copy, etc. I don't use any explicit critical section or lock.

  19. David Berneda So you don't use multiple CPUs to split the work, but you run the full set of queries on each CPU?

  20. Primož Gabrijelčič The test repeats a TParallel.For(0, 22, ...) loop 1000 times.

    What is more puzzling now: after adding NeverSleepOnMMThreadContention := True, I get the same results for both default Seattle and UseReleaseStack (1.7 sec). A minimal sketch of such a test loop follows below.

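    For context, a minimal, self-contained sketch of a test loop shaped like the one described above. RunQuery is a hypothetical placeholder, the timing is only approximate, and this is not the actual TeeBI test code; to exercise the FastMM4 branch instead of the default memory manager, FastMM4 would additionally be listed first in the program's uses clause.

    program MMBenchSketch;

    {$APPTYPE CONSOLE}

    uses
      System.SysUtils, System.Threading;

    procedure RunQuery(AIndex: Integer);
    begin
      // Hypothetical placeholder: create and execute one of the 23 SQL queries.
    end;

    var
      LIteration: Integer;
      LStart: TDateTime;
    begin
      // Global switch discussed above: spin instead of sleeping when the
      // memory manager is contended.
      NeverSleepOnMMThreadContention := True;

      LStart := Now;
      for LIteration := 1 to 1000 do
        TParallel.&For(0, 22,
          procedure(AIndex: Integer)
          begin
            RunQuery(AIndex);
          end);
      Writeln(Format('Elapsed: %.1f sec', [(Now - LStart) * SecsPerDay]));
    end.
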
  21. Interesting. NeverSleepOnMMThreadContention is a weird beast - it helps with some programs on some machines and completely kills the performance of other programs.

  22. Primož Gabrijelčič  How stable do you feel it is for 24/7 usage? I could do a live test by having beginend.net use it. It's not exactly CPU intensive, but it is 24/7 :)

  23. Primož Gabrijelčič Do you have much experience with NUMA machines? Have you thought about modifying the memory manager so that it uses NUMA aware memory allocation routines?

    My current feeling is that there's no Delphi memory manager around that can do that, and that multi-processing is really the only viable route for large NUMA machines.
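
    For reference, the kind of OS routine involved - a minimal sketch of allocating a block on a preferred NUMA node with the Windows API VirtualAllocExNuma. This only illustrates the call itself, not a proposed FastMM modification; the import is declared explicitly here (as a fragment for a unit's implementation section) in case the Winapi.Windows unit in use does not already expose it.

    uses
      Winapi.Windows;

    // Explicit import of the Windows NUMA allocation routine.
    function VirtualAllocExNuma(hProcess: THandle; lpAddress: Pointer;
      dwSize: SIZE_T; flAllocationType, flProtect, nndPreferred: DWORD): Pointer;
      stdcall; external kernel32 name 'VirtualAllocExNuma';

    // Reserve and commit ASize bytes, preferring physical pages on node ANode.
    // Release later with VirtualFree(Result, 0, MEM_RELEASE).
    function AllocOnNumaNode(ASize: SIZE_T; ANode: DWORD): Pointer;
    begin
      Result := VirtualAllocExNuma(GetCurrentProcess, nil, ASize,
        MEM_RESERVE or MEM_COMMIT, PAGE_READWRITE, ANode);
    end;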

  24. David Heffernan No, absolutely no experience.

  25. Eric Grange I'm using it in our applications when compiled for debug (not in release yet) and haven't found any problems, even in 24/7 services.

  26. Primož Gabrijelčič NUMA and processor groups are a huge issue. I have a machine that I cannot take advantage of without writing the code specifically to be aware of processor groups. And even if I do that, the heap allocator is not NUMA aware.

    To the very best of my knowledge there is no parallel library for Delphi that can scale on a machine with processor groups.

    Anyway, I am currently exploring this in my app with a view to using a bespoke multi-processing approach.

  27. David Heffernan I'm pretty sure there's no support for Delphi applications to fully use such a machine. At all. On any level.

    Your best chance at the moment is using multiple processes assigned to different processor groups.

    I haven't yet managed to even see such a powerful machine, so I won't be of much help here :(

  28. David Berneda Can you please rerun your multiprocessor test with UseReleaseStack and LogLockContention and post the _EventLog? I'm very interested in where the memory manager is locking now in your app.

    You can strip away the call stack below the FastMM level if you don't want to expose the internals of your application; I'm not really interested in that part anyway.

  29. Primož Gabrijelčič  Yes, that's my analysis of the situation too. I have such a machine. Many of my clients have such a machine. As you might imagine, these clients get a little upset when they realise that they've spent all the money on the machine and cannot take full advantage of it.......

  30. David Heffernan Maybe we can cooperate and find some solution? Contact me on Hangouts or something else.

  31. Primož Gabrijelčič Done! No problem about the call stack; all TeeBI source code is publicly available:

    https://drive.google.com/open?id=0BymV3q6di65nTUdrVTVoMm5Mams

  32. If these changes are accepted by Pierre, can I suggest the new version be called FastMM 5? I think the improved parallel support warrants more than a 0.001 increment.

  33. All changes have already been accepted by Pierre into his own Locking_Improvements branch. The plan is to test them fully (and add the ASM version of the improvements) and then push them out as a new FastMM. I don't know what version number he'll pick.

  34. David Berneda Thanks!

    Looks nice - not much blocking, and all blocking occurs in FreeMem, mostly in string handling.

    Can you try increasing the ReleaseStackSize constant (FastMM4.pas) from 16 to, say, 32 or 64 and running the test again? It would be interesting to see whether a larger stack causes a noticeable drop in lock contention.
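
    For clarity, the suggested experiment amounts to editing one constant in FastMM4.pas of the Locking_Improvements branch and rebuilding, roughly:

    const
      // Number of memory blocks each release stack can hold while they wait
      // to be released (see comment 11 above).
      ReleaseStackSize = 32; // default is 16; try 32 or 64 for this test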

  35. Primož Gabrijelčič No visible difference in either 32-bit or 64-bit. I've even tried an exaggerated value of 4096 and the timings are almost identical. (I cannot use LogLockContention in conjunction with UseReleaseStack, so I can't post a log.)

  36. Thanks. Then it will stay at 16 - a small enough number that works OK.

  37. I don't understand how you manage not to blow up your computer.

  38. Sometimes it sounds like an aeroplane taking off :)

