Found an interesting conclusion on the performance comparison of "string + string" vs "TStringBuilder" by Andrei Aleksandrov: "I think that PUREPASCAL implementation is used in all “new” compilers, so it means that under x86 str+str is faster than StringBuilder, but in all other cases StringBuilder is faster."

https://medium.com/@Zawuza/stringbuilder-vs-for-string-string-d1c82e14f990
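The comparison in question boils down to code along these lines (a minimal sketch of the two approaches; actual timings will vary by compiler, platform and memory manager, and this is not the linked article's exact benchmark):

```pascal
program ConcatBench;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

const
  N = 100000;
var
  i: Integer;
  s: string;
  sb: TStringBuilder;
  sw: TStopwatch;
begin
  // Naive concatenation: each "+" may reallocate and copy the result
  sw := TStopwatch.StartNew;
  s := '';
  for i := 1 to N do
    s := s + 'x';
  Writeln('str + str:      ', sw.ElapsedMilliseconds, ' ms');

  // TStringBuilder: grows an internal buffer, amortizing reallocations
  sw := TStopwatch.StartNew;
  sb := TStringBuilder.Create;
  try
    for i := 1 to N do
      sb.Append('x');
    s := sb.ToString;
  finally
    sb.Free;
  end;
  Writeln('TStringBuilder: ', sw.ElapsedMilliseconds, ' ms');
end.
```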

Comments

  1. Eric Grange Both are well-written articles with good tests, thanks. By the way, didn't Andrei also test the so-called nextgen compilers?

    ReplyDelete
  2. No, I have not tested the nextgen compilers, I expect they'll just punish short-lived TStringBuilder a little more with extra reference counting overhead.

    TStringBuilder was already bad back in the day, and while string concatenation got a lot worse with the PUREPASCAL implementation, TStringBuilder became only slightly worse, but it's a snail race.

    If you have a lot of concatenations to the same string, TStringBuilder is still a drag, and for a small number of concatenations across many different strings, it has a high overhead.

    There are basically no good options "out of the RTL box" now, just a choice between bad and worse :/

    ReplyDelete
  3. To see how FastMM4-AVX does wonders, run the string concatenation benchmark by Eric Grange from https://www.delphitools.info/2013/11/06/source-code-for-the-string-concatenationbuilding-benchmark/ first with the default Delphi memory manager and then with FastMM4-AVX. FastMM4-AVX takes half the time of the standard memory manager, and CPU utilization goes from 40-50% to 95-100%. That benefit comes solely from FastMM4-AVX replacing Sleep(0) and Sleep(1) with a proper lock acquisition in case of contention.
    PS: I ran the multithreaded tests with the MeasureThreaded function (4 threads) from the benchmark on an i3-6100T.

    ReplyDelete
  4. Does anybody know if FastMM4-AVX *is stable* and can be used as a drop-in replacement for FastMM4?

    ReplyDelete
  5. Since the author of FastMM4-AVX is using it in production code with thousands of deployments (https://www.ritlabs.com/de/products/thebat/revision-history/7027/), I think it's "stable" enough :) and that's a pretty strong statement backing up his work.

    ReplyDelete
  6. Edwin Yip according to the author, this version of fastmm is used in the email client The Bat

    ReplyDelete
  7. That's what I'm thinking about; I'm just wondering who else is using it.

    ReplyDelete
  8. This is one of the reasons why we rewrote a string builder for our mORMot framework, with native UTF-8 support and performance in mind, especially for multi-threading (avoiding memory allocations as much as possible).

    ReplyDelete
  9. For the issue of contention, you can replace memory managers all you like, but what you really need to do is write code that minimises heap allocation. Once you do that, it doesn't matter what heap allocator you use.

    ReplyDelete
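The "minimise heap allocation" advice above usually means computing the final size up front and allocating once. A hypothetical helper (names and signature are my own, purely illustrative) might look like this:

```pascal
// Hypothetical helper: build a delimited string with a single allocation,
// instead of growing the result one concatenation at a time.
function JoinFixed(const Parts: array of string; Sep: Char): string;
var
  i, Len, P: Integer;
begin
  // First pass: compute the final length so we allocate exactly once
  Len := 0;
  for i := 0 to High(Parts) do
    Inc(Len, Length(Parts[i]));
  if Length(Parts) > 1 then
    Inc(Len, Length(Parts) - 1); // room for the separators
  SetLength(Result, Len);
  // Second pass: copy each part into the preallocated buffer
  P := 1;
  for i := 0 to High(Parts) do
  begin
    if Parts[i] <> '' then
      Move(Pointer(Parts[i])^, Result[P], Length(Parts[i]) * SizeOf(Char));
    Inc(P, Length(Parts[i]));
    if i < High(Parts) then
    begin
      Result[P] := Sep;
      Inc(P);
    end;
  end;
end;
```

Whatever the memory manager, this does one allocation per call instead of one per part, so there is almost nothing left for allocator contention to slow down.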
  10. David Heffernan I agree with you: the best memory manager is the one that doesn't need to allocate memory :)
    But, to my surprise, when I ran the test with FastMM4-AVX, the biggest improvement in execution time was in the already optimized (preallocated memory) cases: Eric Grange's TWriteOnlyBlockStream and A. Bouchez's TTextWriter. So I think eliminating Sleep(0) and Sleep(1) gets rid of many unnecessary, expensive context switches and leaves a lot of CPU power for real work. That's why processor usage goes from 40-50% with standard FastMM to 100% with FastMM4-AVX.

    ReplyDelete
  11. Emil Mustea How could a change of memory manager lead to a speed-up in code that never calls memory allocation?

    ReplyDelete
  12. When I say preallocated, I mean chunks of 8 KB for TWriteOnlyBlockStream and 4 KB for TTextWriter (enough for a typical JSON object). But the tests run through much more memory than the initial chunk - there are still a lot of allocations. So a better algorithm (preallocation in chunks) plus a better lock acquisition (minimizing context switches) makes a winning combination.
    As always, there is no one-size-fits-all solution: if you know you have to allocate a lot, you make a bigger chunk and minimize the allocations - adapting the algorithm to the situation is the first step. But comparing apples to apples, in the same situation, FastMM4-AVX is better than the standard memory manager.

    ReplyDelete
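The chunked-preallocation idea described above can be sketched like this (illustrative only - TWriteOnlyBlockStream and TTextWriter use their own, more refined schemes; the class and names here are made up):

```pascal
type
  // Illustrative chunked text buffer: grows in fixed-size steps so that
  // most Append calls write into memory that is already allocated.
  TChunkedBuffer = class
  private
    FData: string;
    FUsed: Integer;
    FChunk: Integer;
  public
    constructor Create(ChunkSize: Integer = 8192);
    procedure Append(const S: string);
    function ToString: string; override;
  end;

constructor TChunkedBuffer.Create(ChunkSize: Integer);
begin
  FChunk := ChunkSize;
  SetLength(FData, FChunk); // preallocate the first chunk up front
  FUsed := 0;
end;

procedure TChunkedBuffer.Append(const S: string);
begin
  if S = '' then
    Exit;
  // Grow by whole chunks, and only when the preallocated space runs out
  while FUsed + Length(S) > Length(FData) do
    SetLength(FData, Length(FData) + FChunk);
  Move(Pointer(S)^, FData[FUsed + 1], Length(S) * SizeOf(Char));
  Inc(FUsed, Length(S));
end;

function TChunkedBuffer.ToString: string;
begin
  Result := Copy(FData, 1, FUsed);
end;
```

With an 8 KB chunk, appending many small strings costs one allocator call per 8 KB written rather than one per append, which is exactly where both the algorithmic change and the faster lock acquisition pay off.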
  13. Emil Mustea You can specify the internal buffer size in TTextWriter - your 4 KB value is the default of one constructor; others use 8 KB (potentially allocated from the stack), but for bigger content generation the framework uses a 64 KB internal buffer - and in such a case the FastMM4-AVX benefit will be less noticeable, I guess. But anyway, such micro-benchmarks are IMHO pointless.

    ReplyDelete
  14. A. Bouchez I know the default value can be changed.
    Like I said, not needing to allocate is the best solution, but in the end you'll have to allocate some memory. So a little benefit here plus a little benefit there IMHO helps overall.

    ReplyDelete
  15. As I wrote, such micro-benchmarks are most of the time pointless, and can easily lead to premature optimization. Text concatenation is just one part of the process. For instance, when generating JSON, working directly with UTF-8 may be better than working in UTF-16 followed by a conversion. Or retrieving data from the DB will be a much bigger bottleneck than JSON serialization...
    It will always depend on the actual application, and on the other parts of the libraries involved.

    ReplyDelete
  16. I think it is a good idea to apply both a better memory manager and preallocated memory. For example, in our Delphi application "The Bat!" we use both a better memory manager (FastMM4-AVX, which is publicly available on GitHub) and a custom manager that keeps data in fixed blocks, tailored to the specifics of our application.

    So I think we should run 4 benchmarks (a 2x2 matrix): with and without the better memory manager, and with and without preallocated blocks.

    ReplyDelete
  17. ... everything after actual profiling of the real application, since the MM is only a small part of the bottlenecks. ;)

    ReplyDelete
  18. A. Bouchez Not every piece of software is as good as your mORMot :) You knew the bottleneck from the beginning and avoided it.
    As Maxim Masiutin said, the single-threaded difference is negligible, but a contended multithreaded app that allocates memory will benefit - so it's good to have it.

    ReplyDelete
  19. Emil Mustea You are missing the point. The point is that micro-benchmarks that only measure contended allocation aren't representative of real-world scenarios. You always do something with the memory. Real-world benchmarks are what count.

    ReplyDelete
  20. David Heffernan I really do get it, and I know the MM is a small part of any app, but over time I've seen many complaints about FastMM4 handling multi-threaded apps badly (if there weren't any problems, there would be no complaints). So if somebody works hard to improve that, I'm giving applause ;)

    ReplyDelete
  21. Emil Mustea I have my own scalable MM that handles NUMA memory for my app.

    ReplyDelete

Post a Comment