Found an interesting conclusion on the performance comparison of "string + string" vs "TStringBuilder" by Andrei Aleksandrov: "I think that PUREPASCAL implementation is used in all “new” compilers, so it means that under x86 str+str is faster than StringBuilder, but in all other cases StringBuilder is faster."

https://medium.com/@Zawuza/stringbuilder-vs-for-string-string-d1c82e14f990
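The comparison in question boils down to code along these lines (a minimal sketch of the two approaches; actual timings will vary by compiler, platform and memory manager, and this is not the linked article's exact benchmark):

```pascal
program ConcatBench;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

const
  N = 100000;
var
  i: Integer;
  s: string;
  sb: TStringBuilder;
  sw: TStopwatch;
begin
  // Naive concatenation: each "+" may reallocate and copy the result
  sw := TStopwatch.StartNew;
  s := '';
  for i := 1 to N do
    s := s + 'x';
  Writeln('str + str:      ', sw.ElapsedMilliseconds, ' ms');

  // TStringBuilder: grows an internal buffer, amortizing reallocations
  sw := TStopwatch.StartNew;
  sb := TStringBuilder.Create;
  try
    for i := 1 to N do
      sb.Append('x');
    s := sb.ToString;
  finally
    sb.Free;
  end;
  Writeln('TStringBuilder: ', sw.ElapsedMilliseconds, ' ms');
end.
```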

Comments

  1. Eric Grange Both are well-written articles with good tests, thanks. By the way, didn't Andrei also test the so-called nextgen compilers?

    ReplyDelete
  2. No, I have not tested the nextgen compilers, I expect they'll just punish short-lived TStringBuilder a little more with extra reference counting overhead.

    TStringBuilder was already bad back in the day, and while string concatenation got a lot worse with the PUREPASCAL implementation, TStringBuilder became only slightly worse, but it's a snail race.

    If you have a lot of concatenations to the same string, TStringBuilder is still a drag, and for a small number of concatenations across many different strings, it has a high overhead.

    There are basically no good options "out of the RTL box" now, just a choice between bad and worse :/

    ReplyDelete
  3. To see how FastMM4-AVX does wonders, run the string concatenation benchmark by Eric Grange from https://www.delphitools.info/2013/11/06/source-code-for-the-string-concatenationbuilding-benchmark/ first with the default Delphi memory manager and then with FastMM4-AVX. FastMM4-AVX takes half the time of the standard memory manager, and CPU utilization goes from 40-50% to 95-100%. That benefit comes solely from FastMM4-AVX replacing Sleep(0) and Sleep(1) with a proper lock acquisition in case of contention.
    PS: I ran the multithreaded tests with the MeasureThreaded function (4 threads) from the benchmark on an i3-6100T.

    ReplyDelete
  4. Does anybody know if FastMM4-AVX *is stable* and can be used as a drop-in replacement for FastMM4?

    ReplyDelete
  5. Since the author of FastMM4-AVX is using it in production code with thousands of deployments (https://www.ritlabs.com/de/products/thebat/revision-history/7027/), I think it's "stable" enough :) and that's a pretty strong statement backing up his work.

    ReplyDelete
  6. Edwin Yip according to the author, this version of fastmm is used in the email client The Bat

    ReplyDelete
  7. That's what I'm thinking about; I'm just wondering who else is using it.

    ReplyDelete
  8. This is one of the reasons why we rewrote a string builder for our mORMot framework, with native UTF-8 support and performance in mind, especially for multi-threading (avoiding memory allocations as much as possible).

    ReplyDelete
  9. For the issue of contention, you can replace memory managers all you like, but what you really need to do is write code that minimises heap allocation. Once you do that, it doesn't matter what heap allocator you use.

    ReplyDelete
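The "minimise heap allocation" advice above usually means computing the final size up front and allocating once. A hypothetical helper (names and signature are my own, purely illustrative) might look like this:

```pascal
// Hypothetical helper: build a delimited string with a single allocation,
// instead of growing the result one concatenation at a time.
function JoinFixed(const Parts: array of string; Sep: Char): string;
var
  i, Len, P: Integer;
begin
  // First pass: compute the final length so we allocate exactly once
  Len := 0;
  for i := 0 to High(Parts) do
    Inc(Len, Length(Parts[i]));
  if Length(Parts) > 1 then
    Inc(Len, Length(Parts) - 1); // room for the separators
  SetLength(Result, Len);
  // Second pass: copy each part into the preallocated buffer
  P := 1;
  for i := 0 to High(Parts) do
  begin
    if Parts[i] <> '' then
      Move(Pointer(Parts[i])^, Result[P], Length(Parts[i]) * SizeOf(Char));
    Inc(P, Length(Parts[i]));
    if i < High(Parts) then
    begin
      Result[P] := Sep;
      Inc(P);
    end;
  end;
end;
```

Whatever the memory manager, this does one allocation per call instead of one per part, so there is almost nothing left for allocator contention to slow down.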
  10. David Heffernan I agree with you: the best memory manager is the one that doesn't need to allocate memory :)
    But, to my surprise, when I ran the test with FastMM4-AVX, the biggest improvement in execution time was in the already optimized (preallocated memory) cases: Eric Grange's TWriteOnlyBlockStream and A. Bouchez's TTextWriter. So I think eliminating Sleep(0) and Sleep(1) gets rid of many unnecessary, expensive context switches and leaves a lot of CPU power for real work. That's why processor usage goes from 40-50% with standard FastMM to 100% with FastMM4-AVX.

    ReplyDelete
  11. Emil Mustea How could a change of memory manager lead to a speed-up in code that never calls memory allocation?

    ReplyDelete
  12. When I say preallocated, I mean chunks of 8 KB for TWriteOnlyBlockStream and 4 KB for TTextWriter (enough for a typical JSON object). But the tests run through much more memory than the initial chunk - there are still a lot of allocations. So a better algorithm (preallocation in chunks) plus a better lock acquisition (minimizing context switches) makes a winning combination.
    As always, there is no one-size-fits-all solution: if you know you have to allocate a lot, you make a bigger chunk and minimize the allocations - adapting the algorithm to the situation is the first step. But comparing apples to apples, in the same situation, FastMM4-AVX is better than the standard memory manager.

    ReplyDelete
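The chunked-preallocation idea described above can be sketched like this (illustrative only - TWriteOnlyBlockStream and TTextWriter use their own, more refined schemes; the class and names here are made up):

```pascal
type
  // Illustrative chunked text buffer: grows in fixed-size steps so that
  // most Append calls write into memory that is already allocated.
  TChunkedBuffer = class
  private
    FData: string;
    FUsed: Integer;
    FChunk: Integer;
  public
    constructor Create(ChunkSize: Integer = 8192);
    procedure Append(const S: string);
    function ToString: string; override;
  end;

constructor TChunkedBuffer.Create(ChunkSize: Integer);
begin
  FChunk := ChunkSize;
  SetLength(FData, FChunk); // preallocate the first chunk up front
  FUsed := 0;
end;

procedure TChunkedBuffer.Append(const S: string);
begin
  if S = '' then
    Exit;
  // Grow by whole chunks, and only when the preallocated space runs out
  while FUsed + Length(S) > Length(FData) do
    SetLength(FData, Length(FData) + FChunk);
  Move(Pointer(S)^, FData[FUsed + 1], Length(S) * SizeOf(Char));
  Inc(FUsed, Length(S));
end;

function TChunkedBuffer.ToString: string;
begin
  Result := Copy(FData, 1, FUsed);
end;
```

With an 8 KB chunk, appending many small strings costs one allocator call per 8 KB written rather than one per append, which is exactly where both the algorithmic change and the faster lock acquisition pay off.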
  13. Emil Mustea You can specify the internal buffer size in TTextWriter - your 4 KB value is the default of one constructor; others use 8 KB (potentially allocated from the stack), but for bigger content generation the framework uses a 64 KB internal buffer - and in such a case the FastMM4-AVX benefit will be less noticeable, I guess. But anyway, such micro-benchmarks are IMHO pointless.

    ReplyDelete
  14. A. Bouchez I know the default value can be changed.
    Like I said, not needing to allocate is the best solution, but in the end you'll have to allocate some memory. So a little benefit here plus a little benefit there IMHO helps overall.

    ReplyDelete
  15. As I wrote, such micro-benchmarks are most of the time pointless, and can easily lead to premature optimization. Text concatenation is just one part of the process. For instance, when generating JSON, working directly with UTF-8 may be better than working in UTF-16 followed by a conversion. Or retrieving data from the DB will be a much bigger bottleneck than JSON serialization...
    It will always depend on the actual application, and on the other parts of the libraries involved.

    ReplyDelete
  16. I think it is a good idea to apply both a better memory manager and preallocated memory. For example, in our Delphi application "The Bat!" we use both a better memory manager (FastMM4-AVX, which is publicly available on GitHub) and a custom manager that keeps data in fixed blocks, tailored to the specifics of our application.

    So I think we should run 4 benchmarks (a 2x2 matrix): with and without the better memory manager, and with and without preallocated blocks.

    ReplyDelete
  17. ... everything after actual profiling of the real application, since the MM is only a small part of the bottlenecks. ;)

    ReplyDelete
  18. A. Bouchez Not every piece of software is as good as your mORMot :) You knew the bottleneck from the beginning and avoided it.
    As Maxim Masiutin said, the single-threaded difference is negligible, but a contended multithreaded app that allocates memory will benefit - so it's good to have it.

    ReplyDelete
  19. Emil Mustea You are missing the point. The point is that micro-benchmarks that only measure contended allocation aren't representative of real-world scenarios. You always do something with the memory. Real-world benchmarks are what count.

    ReplyDelete
  20. David Heffernan I really do get it, and I know the MM is a small part of any app, but over time I've seen many complaints about FastMM4 handling multi-threaded apps badly (if there weren't any problems, there would be no complaints). So if somebody works hard to improve that, I'm giving applause ;)

    ReplyDelete
  21. Emil Mustea I have my own scalable MM that handles NUMA memory for my app.

    ReplyDelete

Post a Comment