Given I have two char pointers that point to the start and the end+1 of a substring inside a large string, what's the quickest way to turn that into a string?
I used these 2 ways that work and was wondering if there's an even faster/better one (I used 2 first and then changed it to 1 which seemed to be a bit faster).
1.
len := fBufferPos - fStart;
SetLength(s, len);
Move(fStart^, s[1], len * SizeOf(Char));
1a) modified a bit to avoid UniqueStringU call caused by s[1]
Move(fStart^, Pointer(s)^, len * SizeOf(Char));
2.
SetString(s, fStart, fBufferPos - fStart);
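For reference, the two variants from the question can be put side by side in a small console program. This is only a sketch: the function names `CopyWithMove` and `CopyWithSetString` and the sample text are made up for illustration, and the `{$MODE DELPHI}` line is only there so the same source also builds with FPC.

```pascal
program SubstringDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Variant 1a: SetLength + Move over the range [AStart, AEnd).
function CopyWithMove(AStart, AEnd: PChar): string;
var
  Len: Integer;
begin
  Len := AEnd - AStart;                  // typed pointers: a count of Chars
  SetLength(Result, Len);
  if Len > 0 then
    Move(AStart^, Pointer(Result)^, Len * SizeOf(Char));
end;

// Variant 2: SetString over the same range.
function CopyWithSetString(AStart, AEnd: PChar): string;
begin
  SetString(Result, AStart, AEnd - AStart);
end;

var
  Source: string;
  StartPtr, EndPtr: PChar;
begin
  Source := 'the quick brown fox';
  StartPtr := PChar(Source) + 4;         // points at the 'q' of 'quick'
  EndPtr := StartPtr + 5;                // one past the 'k' of 'quick'
  Writeln(CopyWithMove(StartPtr, EndPtr));       // quick
  Writeln(CopyWithSetString(StartPtr, EndPtr));  // quick
end.
```

Both functions return the same string; they differ only in how the new buffer is allocated and filled.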
As usual, the first one would be faster, but I would prefer to avoid the nitty-gritty of pointers if I can.
After all, the second one is far more comprehensible and the speed difference is negligible.
For me, 1 is more obvious about what it's doing; 2 seems more like voodoo.
What character set? No multibyte chars?
Lars Fosdal Why does it matter in this context? They're standard PWideChars.
Len*SizeOf(Char)?
Ah nvm - the pos/start are indexes, not pointers.
They are pointers but pointer math kicks in when you subtract 2 typed pointers. ;)
It's late here - I'm confusing myself. My first thought was that the len difference from fbufferend - fstart was # of bytes.
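The point about typed-pointer subtraction can be checked with a few lines. The sketch below uses made-up variable names and simply prints both differences; on a Unicode Delphi the byte difference should come out as twice the character count, while on a compiler where Char is one byte the two numbers are equal.

```pascal
program PointerDiffDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

var
  Buffer: string;
  StartPtr, EndPtr: PChar;
  CharCount, ByteCount: NativeInt;
begin
  Buffer := 'abcdef';
  StartPtr := PChar(Buffer);
  EndPtr := StartPtr + 4;

  // Subtracting two PChars yields a number of elements (Chars).
  CharCount := EndPtr - StartPtr;

  // Subtracting the raw addresses yields a number of bytes.
  ByteCount := NativeInt(EndPtr) - NativeInt(StartPtr);

  Writeln('chars: ', CharCount);
  Writeln('bytes: ', ByteCount);
  Writeln('bytes = chars * SizeOf(Char): ', ByteCount = CharCount * SizeOf(Char));
end.
```

That is why the Move call in variant 1 multiplies len by SizeOf(Char), while SetString takes the element count directly.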
IIRC they're about equal, but feel free to benchmark. Besides that the only thing that is faster is to not do it at all.
Asbjørn Heid I already did that - if you read the question again, you'll see I was asking whether there is another alternative that might be even faster :)
For a new string, with a variable size buffer, the fastest is always SetString.
The main difference is that if s already points to some string, SetLength(s,len) will reallocate the existing memory, and so performs an unnecessary data move.
1a) would not be faster since SetLength() would already have called UniqueString.
1b) may be more or less the same as SetString:
s := '';
setlength(s,len);
move(fstart^,pointer(s)^,len*sizeof(char));
So IMHO SetString() is the preferred way.
It is the fastest, and also the easiest/safest to write.
The SetLength & Move one is faster, if the string s (the output buffer) can be reused to avoid memory allocation.
Otherwise SetString is faster, at least for me in my synthetic benchmark.
What am I missing?
http://pastebin.com/4ZeuKWF7
Martin Wienold You were missing the fact that the length changes most of the time. In your benchmark it stays the same.
A. Bouchez The SetLength/Move version is about 10-15% faster than the SetString version (in Martin's benchmark and in my real test case).
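To compare the two variants with a length that changes on every pass, a loop along the following lines could be used. This is a rough sketch, not the benchmark from the pastebin link: the iteration count, the length pattern and the coarse Now/MilliSecondsBetween timing are all arbitrary choices made for illustration, and on newer Delphi versions the unit names may need the System. prefix. No timings are quoted here because they depend on compiler, memory manager and hardware.

```pascal
program VaryingLengthBench;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, DateUtils;

const
  Iterations = 5000000;  // arbitrary count, chosen only for illustration

var
  Source, S: string;
  StartPtr: PChar;
  I, Len: Integer;
  Total: Int64;
  T0: TDateTime;
begin
  Source := StringOfChar('x', 256);
  StartPtr := PChar(Source);

  // Variant 1a: SetLength + Move, reusing S while the length keeps changing.
  S := '';
  Total := 0;
  T0 := Now;
  for I := 1 to Iterations do
  begin
    Len := 1 + (I mod 200);
    SetLength(S, Len);
    Move(StartPtr^, Pointer(S)^, Len * SizeOf(Char));
    Inc(Total, Length(S));  // use the result so the work is observable
  end;
  Writeln('SetLength/Move: ', MilliSecondsBetween(Now, T0), ' ms, checksum ', Total);

  // Variant 2: SetString over the same sequence of lengths.
  S := '';
  Total := 0;
  T0 := Now;
  for I := 1 to Iterations do
  begin
    Len := 1 + (I mod 200);
    SetString(S, StartPtr, Len);
    Inc(Total, Length(S));
  end;
  Writeln('SetString:      ', MilliSecondsBetween(Now, T0), ' ms, checksum ', Total);
end.
```

Both loops should print the same checksum; only the elapsed times are of interest, and they are worth trusting only after being confirmed in a real workload.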
Martin Wienold If the size is always the same, there won't be a reallocation or any move, so SetLength/Move "may" be slightly faster, since SetString would allocate a new memory buffer for the string. You are perfectly right. But your test case is a typical meaningless microbenchmark, showing only a single, utopian use case.
Stefan Glienke SetLength() will move the existing data after resizing the memory block. SetString() won't do that. I do not know what your "real test case" is, but in mine (the 20,000,000 regression tests of mORMot, with JSON and data marshalling, SQL process, business process, and client/server work over several communication protocols) SetString is always slightly faster. I never trust micro benchmarks. Just use the right function depending on the exact context, after profiling. Sometimes, I use SetLength, sometimes SetString, and sometimes a dedicated FastNewRawUTF8() or SetRawUTF8() function - see http://pastebin.com/Mz6G8dBe and http://pastebin.com/7GTxSfLJ
Just look at how SetString is coded in System.pas (_UStrFromPCharLen/_LStrFromPCharLen AFAIR), and pick the best, if it is worth it after real profiling.
In most cases, SetString is fastest, and safest to use, since you have a new string from a memory buffer in a single function call, with no potential error about sizeof(char) and so on. You can create UnicodeString or AnsiString safely with the same function.
Also note that FPC does not have the same behavior about @s[1]: it does not make the string unique! So I had to write a UniqueRawUTF8() function for consistency.
See http://pastebin.com/pMptcncx
With SetString, you won't suffer from such a compatibility issue. And if strings in Delphi are about to become immutable, SetString would definitely be the way to go (and I would leave this compiler for sure, BTW, in favor of FPC).
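The uniqueness concern can be illustrated with the standard UniqueString procedure, which exists in both Delphi and FPC. The sketch below is not the UniqueRawUTF8() helper from the pastebin link; it only shows the general idea of making a shared string unique before writing through a raw pointer, with made-up variable names.

```pascal
program UniqueBeforeWrite;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

var
  A, B, Src: string;
begin
  Src := 'XY';
  A := StringOfChar('a', 4);  // heap-allocated string
  B := A;                     // B now shares A's buffer

  // Pointer(B)^ bypasses the compiler's copy-on-write logic, so request
  // a private copy explicitly before overwriting the characters.
  UniqueString(B);
  Move(PChar(Src)^, Pointer(B)^, 2 * SizeOf(Char));

  Writeln(A);  // aaaa
  Writeln(B);  // XYaa
end.
```

Without the UniqueString call, the Move would write into the buffer that A and B share, and A would change as well.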
If you're really looking for every cycle you can optimize by rewriting the first line to generate better code and rewriting the second line to call system.pas's _New*String.
See the following code:
function Test2(const fStart, fBufferPos: PChar): string;
var
  len: integer;
  TempPointer: pointer;
begin
  len := (NativeInt(fBufferPos) - NativeInt(fStart)) shr (SizeOf(Char)-1); //better code generation
  case SizeOf(Char) of
    1: begin
      pointer(Result):= _NewAnsiString(len,0);
    end;
    2: begin
      Pointer(Result):= _NewUnicodeString(len);
    end;
  end;
  //SetLength(Result, len);
  Move(fStart^, Pointer(Result)^, len * SizeOf(Char));
end;
You'll have to copy/paste the _NewXString routines from the system unit.
Why are you checking for SizeOf(Char) to call a different function?
{$IFDEF UNICODE} should do the trick just fine.
Martin Wienold Personal preference.
The compiler evaluates the case statement at compile-time and eliminates the dead code.
So (potato/potahto) it works out exactly the same.
It also allows me to test that optimizations are actually enabled and working when looking at the assembly.
The difference in running time is:
Stefan's option 1a: 21793
Optimized code: 15216
All this is nonsense. You are still optimizing for a single use case, which is just for one unrealistic micro-benchmark. You are speaking about a loop allocation of the same string variable, with the same length! Which software algorithm is actually doing this?
If you have this pattern, you would NOT allocate the string, but reuse the same fixed-size buffer, perhaps allocated on the stack:
var tmp: array[0..sizeofstring] of char;
and you would never be able to beat the speed of a single asm opcode generated by the compiler:
add esp,-sizeofstring*2
Or... just reuse the same variable!
Why on earth would you not use SetString(), which
- is safe (i.e. not error-prone),
- is fast in almost all cases (except perhaps Martin's unrealistic microbenchmark),
- supports both UnicodeString and AnsiString with compiler-generated overloaded functions,
- is a compiler intrinsic function,
- and is built into the RTL for the exact purpose Stefan was asking about, i.e. creating a string from a text memory buffer?
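The fixed-size stack buffer mentioned earlier in this comment could look roughly like this. It is a sketch under assumptions: MaxTokenLen is a hypothetical upper bound, ProcessToken is a made-up name, and over-long input is simply truncated.

```pascal
program StackBufferDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  MaxTokenLen = 63;  // hypothetical upper bound on the substring length

// Copies [AStart, AEnd) into a local buffer and prints it, without any
// heap allocation. Returns the number of Chars actually copied.
function ProcessToken(AStart, AEnd: PChar): Integer;
var
  Tmp: array[0..MaxTokenLen] of Char;  // lives on the stack
begin
  Result := AEnd - AStart;
  if Result < 0 then
    Result := 0;
  if Result > MaxTokenLen then
    Result := MaxTokenLen;             // truncate rather than overflow
  Move(AStart^, Tmp[0], Result * SizeOf(Char));
  Tmp[Result] := #0;                   // terminate so it can be read as a PChar
  Writeln(PChar(@Tmp[0]));
end;

var
  Source: string;
  P: PChar;
begin
  Source := 'alpha,beta';
  P := PChar(Source);
  ProcessToken(P, P + 5);       // alpha
  ProcessToken(P + 6, P + 10);  // beta
end.
```

This only helps when the data does not need to outlive the call; as soon as the result must be stored or returned as a string, an allocation is needed again.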
A. Bouchez Still, I wish we had direct access to System.NewString; that way we'd have the best of both options.
Johan Bontes But you can! SetString IS actually System.NewString (which does not exist as such in Delphi). Just check the generated assembler.
In fact, there is no such System.NewString - just _UStrFromPCharLen/_LStrFromPCharLen.
This is what I meant: SetString() is indeed a compiler intrinsic pseudo function.
A. Bouchez There is a `system._NewUnicodeString`. Yes, `SetString` is almost a NewString, but SetString does a little bit extra. If the first parameter is not an empty string, it clears the old string/adjusts the refcount. This extra work causes it to be slightly slower than the NewString + Move pair. Never mind, missed your comment above.
Having said that, SetString (code 2) is definitely faster than code 1a. However, care needs to be taken to pass an empty string every time when testing in a loop.
Results after stripping out all calls to UniqueString.
Optimized use of copied _NewUnicodeString: 6018
Code 1a: 10233
Optimized use of SetString : o8336o 6589
(Had to tweak the code to stop the compiler from cleaning up. Strings are hard.)
SetString wins, I'd say.