Given I have two char pointers that point to the start and the end+1 of a substring inside of a large string, what's the quickest way to turn that into a string?

- May 23, 2015

I used these 2 ways that work and was wondering if there's an even faster/better one (I used 2 first and then changed it to 1 which seemed to be a bit faster).

1.
len := fBufferPos - fStart;
SetLength(s, len);
Move(fStart^, s[1], len * SizeOf(Char));

1a) modified a bit to avoid UniqueStringU call caused by s[1]
Move(fStart^, Pointer(s)^, len * SizeOf(Char));

2.
SetString(s, fStart, fBufferPos - fStart);
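For reference, here is a minimal, self-contained sketch of variants 1a and 2 wrapped in functions so they can be compared side by side. The function names, the sample string and the offsets are made up for illustration; only the statements inside the functions come from the question.

```pascal
program SubstrDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Variant 2: a single call builds the string from the pointer range.
function CopyWithSetString(fStart, fBufferPos: PChar): string;
begin
  SetString(Result, fStart, fBufferPos - fStart);
end;

// Variant 1a: allocate first, then copy the raw characters.
function CopyWithMove(fStart, fBufferPos: PChar): string;
var
  len: Integer;
begin
  len := fBufferPos - fStart;            // number of characters, not bytes
  SetLength(Result, len);
  if len > 0 then
    Move(fStart^, Pointer(Result)^, len * SizeOf(Char));
end;

var
  source: string;
  fStart, fBufferPos: PChar;
begin
  source := 'key=value;other=1';         // hypothetical input
  fStart := PChar(source) + 4;           // points at the 'v' of 'value'
  fBufferPos := fStart + 5;              // one past the final 'e'
  Writeln(CopyWithSetString(fStart, fBufferPos));  // value
  Writeln(CopyWithMove(fStart, fBufferPos));       // value
end.
```

Both functions return the same string; which one is faster depends on the allocation pattern discussed in the comments.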

Comments

Ugochukwu Mmaduekwe, May 23, 2015 at 4:01 PM
As usual, the first one would be faster, but I would prefer to avoid the nitty-gritty of pointers if I can.
After all, the second one is far more comprehensible and the speed difference is negligible.

Dorin Duminica, May 23, 2015 at 4:07 PM
For me, 1 makes it more obvious what it's doing; 2 seems more like voodoo.

Lars Fosdal, May 23, 2015 at 4:24 PM
What character set? No multibyte chars?

Stefan Glienke, May 23, 2015 at 4:28 PM
Lars Fosdal, why does it matter in this context? It's standard PWideChars.

Lars Fosdal, May 23, 2015 at 4:30 PM
Len*SizeOf(Char)?

Lars Fosdal, May 23, 2015 at 4:31 PM
Ah nvm - the pos/start are indexes, not pointers.

Stefan Glienke, May 23, 2015 at 4:33 PM
They are pointers but pointer math kicks in when you subtract 2 typed pointers. ;)
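A small sketch of that pointer arithmetic, with a made-up buffer: subtracting two PChar values yields a count of characters, while subtracting the raw addresses yields bytes. The variable names besides fStart and fBufferPos are illustrative.

```pascal
program PointerDiffDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

var
  buffer: string;
  fStart, fBufferPos: PChar;
  charCount, byteCount: NativeInt;
begin
  buffer := 'hello world';               // hypothetical input
  fStart := PChar(buffer);
  fBufferPos := fStart + 5;              // advances by 5 characters

  // Typed pointer subtraction is scaled by SizeOf(Char).
  charCount := fBufferPos - fStart;
  // Raw address subtraction is not scaled and counts bytes.
  byteCount := NativeInt(fBufferPos) - NativeInt(fStart);

  Writeln('characters: ', charCount);
  Writeln('bytes:      ', byteCount);
  Writeln(byteCount = charCount * SizeOf(Char));
end.
```

This is why variant 1 multiplies len by SizeOf(Char) for Move, which takes a byte count, but passes the unscaled difference to SetLength and SetString, which take a character count.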

Lars Fosdal, May 23, 2015 at 4:35 PM
It's late here - I'm confusing myself. My first thought was that the len difference from fbufferend - fstart was # of bytes.

Asbjørn Heid, May 23, 2015 at 5:54 PM
IIRC they're about equal, but feel free to benchmark. Besides that the only thing that is faster is to not do it at all.

Stefan Glienke, May 23, 2015 at 6:12 PM
Asbjørn Heid, I already did that - if you read the question again, you'll see I was asking whether there is another alternative that might be even faster :)

A. Bouchez, May 23, 2015 at 10:52 PM
For a new string with a variable-size buffer, the fastest is always SetString.
The main difference is that if s already points to some string, SetLength(s,len) will reallocate the existing memory and so does an unnecessary data move.
1a) would not be faster, since SetLength() would already have called UniqueString.
1b) may be more or less the same as SetString:
s := '';
setlength(s,len);
move(fstart^,pointer(s)^,len*sizeof(char));

So IMHO SetString() is the preferred way.
It is the fastest, and also the easiest/safest to write.
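The behavioral difference described above can be sketched as follows. This measures nothing; it only shows that SetLength keeps the previous content (which is why a resize may have to copy it), while clearing first or using SetString leaves nothing to preserve. The sample strings are made up.

```pascal
program SetLengthVsSetString;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

var
  s, source: string;
  fStart: PChar;
begin
  source := 'substring source';          // hypothetical input
  fStart := PChar(source);

  // SetLength on a non-empty string preserves the old characters.
  s := 'previous content';
  SetLength(s, 20);
  Writeln(Copy(s, 1, 16));               // previous content

  // Variant 1b: clear first, so there is no old data to carry over.
  s := '';
  SetLength(s, 9);
  Move(fStart^, Pointer(s)^, 9 * SizeOf(Char));
  Writeln(s);                            // substring

  // SetString replaces the old value outright.
  s := 'previous content';
  SetString(s, fStart, 9);
  Writeln(s);                            // substring
end.
```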

Martin Wienold, May 23, 2015 at 11:58 PM
The SetLength & Move one is faster, if the string s (the output buffer) can be reused to avoid memory allocation.
Otherwise SetString is faster, at least for me in my synthetic benchmark.

What am I missing?
http://pastebin.com/4ZeuKWF7

Stefan Glienke, May 24, 2015 at 2:44 AM
Martin Wienold, you were missing the fact that the length changes most of the time. In your benchmark it stays the same.
A. Bouchez, the SetLength/Move version is about 10-15% faster than the SetString version (in Martin's benchmark and in my real test case).

A. Bouchez, May 24, 2015 at 3:50 AM
Martin Wienold, only if the size is the same will there be no reallocation and no move, so SetLength/Move "may" be slightly faster, since SetString would allocate a new memory buffer for the string. You are perfectly right. But your test case is a typical meaningless microbenchmark, showing only a single, utopian use case.
Stefan Glienke, SetLength() will move the existing data after resizing the memory block. SetString() won't do that. I do not know what your "real test case" is, but in mine (the 20,000,000 regression tests of mORMot, with JSON and data marshalling, SQL process, business process, and client/server work over several communication protocols), SetString is always slightly faster. I never trust microbenchmarks. Just use the right function depending on the exact context, after profiling. Sometimes I use SetLength, sometimes SetString, and sometimes a dedicated FastNewRawUTF8() or SetRawUTF8() function - see http://pastebin.com/Mz6G8dBe and http://pastebin.com/7GTxSfLJ
Just look at how SetString is coded in System.pas (_UStrFromPCharLen/_LStrFromPCharLen AFAIR), and pick the best, if it is worth it after real profiling.
In most cases, SetString is the fastest and safest to use, since you get a new string from a memory buffer in a single function call, with no potential error about sizeof(char) and so on. You can create a UnicodeString or an AnsiString safely with the same function.

A. Bouchez, May 24, 2015 at 3:53 AM
Also note that FPC does not have the same behavior for @s[1]: it does not make the string unique! So I had to write a UniqueRawUTF8() function for consistency.
See http://pastebin.com/pMptcncx
With SetString, you won't suffer from such a compatibility issue. And if strings in Delphi are about to become immutable, SetString would definitely be the way to go - and I would leave this compiler for sure, BTW, in favor of FPC.

Johan Bontes, May 24, 2015 at 5:04 AM
If you're really looking for every cycle you can optimize by rewriting the first line to generate better code and rewriting the second line to call system.pas's _New*String.

See the following code:

function Test2(const fStart, fBufferPos: PChar): string;
var
  len: Integer;
begin
  // Byte distance shifted down to a character count (better code generation).
  len := (NativeInt(fBufferPos) - NativeInt(fStart)) shr (SizeOf(Char) - 1);
  // The case is resolved at compile time; the unused branch is eliminated.
  // Note: Result must be empty on entry, because its old pointer is
  // overwritten here without being released.
  case SizeOf(Char) of
    1: Pointer(Result) := _NewAnsiString(len, 0);
    2: Pointer(Result) := _NewUnicodeString(len);
  end;
  // SetLength(Result, len);
  Move(fStart^, Pointer(Result)^, len * SizeOf(Char));
end;

You'll have to copy/paste the _NewXString routines from the system unit.

Martin Wienold, May 24, 2015 at 5:11 AM
Why are you checking SizeOf(Char) to call a different function?
{$IFDEF UNICODE} should do the trick just fine.

Johan Bontes, May 24, 2015 at 5:33 AM
Martin Wienold, personal preference.
The compiler evaluates the case statement at compile time and eliminates the dead code.
So (potato/potahto) it works out exactly the same.
It also allows me to test that optimizations are actually enabled and working when looking at the assembly.

The difference in running time is:
Stefan's option 1a: 21793
Optimized code: 15216

A. Bouchez, May 24, 2015 at 6:18 AM
All this is nonsense. You are still optimizing for a single use case, which is just one unrealistic microbenchmark. You are talking about allocating the same string variable in a loop, with the same length! Which software algorithm actually does this?
If you have this pattern, you would NOT allocate the string, but reuse the same fixed-size buffer, perhaps allocated on the stack:
var tmp: array[0..sizeofstring] of char;
and you would never be able to beat the speed of a single asm opcode generated by the compiler:
add esp,-sizeofstring*2
Or... just reuse the same variable!
Why on earth would you not use SetString(), which
- is safe (i.e. not error-prone),
- is fast in almost all cases (except perhaps Martin's unrealistic microbenchmark),
- supports both UnicodeString and AnsiString with compiler-generated overloaded functions,
- is a compiler intrinsic function,
- and is built into the RTL for the exact purpose Stefan was asking about, i.e. creating a string from a text memory buffer?

Johan Bontes, May 24, 2015 at 6:38 AM
A. Bouchez, I still wish we had direct access to System.NewString; that way we'd have the best of both options.

A. Bouchez, May 24, 2015 at 7:08 AM
Johan Bontes, but you can! SetString IS actually System.NewString (which does not exist as such in Delphi). Just check the generated assembler.
In fact, there is no such System.NewString - just _UStrFromPCharLen/_LStrFromPCharLen.
This is what I meant: SetString() is indeed a compiler intrinsic pseudo function.

Johan Bontes, May 24, 2015 at 9:15 AM
A. Bouchez, there is a `system._NewUnicodeString`. Yes, `SetString` is almost a NewString, but SetString does a little bit extra: if the first parameter is not an empty string, it clears the old string and adjusts the refcount. This extra work makes it slightly slower than the NewString + Move pair. Never mind, I missed your comment above.
Having said that, SetString (code 2) is definitely faster than code 1a. However, care needs to be taken to pass an empty string every time when testing in a loop.

Results after stripping out all calls to UniqueString.

Optimized use of copied _NewUnicodeString: 6018
Code 1a: 10233
Optimized use of SetString: 6589 (corrected from 8336)
(Had to tweak the code to stop the compiler from cleaning up. Strings are hard.)
SetString wins, I'd say.