Given I have two char pointers that point to the start and the end+1 of a substring inside of a large string, what's the quickest way to turn that into a string?

I have used the following two approaches, both of which work, and was wondering whether there is an even faster or better one. (I used 2 first and then switched to 1, which seemed to be a bit faster.)

1.
  len := fBufferPos - fStart;
  SetLength(s, len);
  Move(fStart^, s[1], len * SizeOf(Char));

1a) Modified slightly to avoid the UniqueStringU call caused by s[1]:
  Move(fStart^, Pointer(s)^, len * SizeOf(Char));

2. 
  SetString(s, fStart, fBufferPos - fStart);

Comments

  1. As usual, the first one would be faster, but I would prefer to avoid the nitty-gritty of pointers if I can.
    After all, the second one is far more comprehensible and the speed difference is negligible.

  2. For me, 1 makes it more obvious what it's doing; 2 seems more like voodoo.

  3. What character set? No multibyte chars?

  4. Lars Fosdal Why does it matter in this context? It's standard PWideChars.

  5. Ah nvm - the pos/start are indexes, not pointers.

  6. They are pointers but pointer math kicks in when you subtract 2 typed pointers. ;)

  7. It's late here - I'm confusing myself. My first thought was that the length computed from fBufferPos - fStart was the number of bytes.

  8. IIRC they're about equal, but feel free to benchmark. Besides that the only thing that is faster is to not do it at all.

  9. Asbjørn Heid I already did that - if you read the question again you see I was asking if there is another alternative that might be even faster :)

  10. For a new string, with a variable size buffer, the fastest is always SetString.
    The main difference is that if s already points to some string, SetLength(s,len) will reallocate the existing memory, and so performs an unnecessary data move.
    1a) would not be faster, since SetLength() would already have called UniqueString.
    1b) may be more or less the same as SetString:
    s := '';
    setlength(s,len);
    move(fstart^,pointer(s)^,len*sizeof(char));

    So IMHO SetString() is the preferred way.
    It is the fastest, and also the easiest/safest to write.

  11. The SetLength & Move one is faster, if the string s (the output buffer) can be reused to avoid memory allocation.
    Otherwise SetString is faster, at least for me in my synthetic benchmark.

    What am I missing?
    http://pastebin.com/4ZeuKWF7

  12. Martin Wienold You were missing the fact that the length changes most of the time. In your benchmark it stays the same.
    A. Bouchez The SetLength/Move version is about 10-15% faster than the SetString version (in Martin's benchmark and in my real test case).

  13. Martin Wienold  Only if the size is the same will there be no reallocation and no move, so SetLength/Move "may" be slightly faster, since SetString would allocate a new memory buffer for the string. You are perfectly right. But your test case is a typical meaningless microbenchmark, showing only a single, utopian use case.
    Stefan Glienke  SetLength() will move the existing data after resizing the memory block. SetString() won't do that. I do not know what your "real test case" is, but in mine (the 20,000,000 regression tests of mORMot with JSON and data marshalling, SQL process, business process, client/server work over several communication protocols) SetString is always slightly faster. I never trust microbenchmarks. Just use the right function depending on the exact context, after profiling. Sometimes I use SetLength, sometimes SetString, and sometimes a dedicated FastNewRawUTF8() or SetRawUTF8() function - see http://pastebin.com/Mz6G8dBe and http://pastebin.com/7GTxSfLJ
    Just look at how SetString is coded in System.pas (_UStrFromPCharLen/_LStrFromPCharLen AFAIR), and pick the best, if it is worth it after real profiling.
    In most cases, SetString is the fastest and safest to use, since you get a new string from a memory buffer in a single function call, with no potential error about sizeof(char) and so on. You can create a UnicodeString or an AnsiString safely with the same function.

  14. Also note that FPC does not have the same behavior for @s[1]: it does not make the string unique! So I had to write a UniqueRawUTF8() function for consistency.
    See http://pastebin.com/pMptcncx
    With SetString, you won't suffer from such a compatibility issue. And if strings in Delphi are about to become immutable, SetString would definitely be the way to go - and I would leave this compiler for sure, BTW, in favor of FPC.

  15. If you're really looking for every cycle you can optimize by rewriting the first line to generate better code and rewriting the second line to call system.pas's _New*String. 

    See the following code:

    function Test2(const fStart, fBufferPos: PChar): string;
    var
      len: integer;
      TempPointer: pointer;
    begin
      len := (NativeInt(fBufferPos) - NativeInt(fStart)) shr (SizeOf(Char)-1);  //better code generation
      case SizeOf(Char) of
        1: begin
          pointer(Result):= _NewAnsiString(len,0);
        end;
        2: begin
          Pointer(Result):= _NewUnicodeString(len);
        end;
      end;
      //SetLength(Result, len);
      Move(fStart^, Pointer(Result)^, len * SizeOf(Char));
    end;

    You'll have to copy/paste the _NewXString routines from the system unit.

  16. Why are you checking for SizeOf(Char) to call a different function?
    {$IFDEF UNICODE} should do the trick just fine.

  17. Martin Wienold
    Personal preference.
    The compiler evaluates the case statement at compile-time and eliminates the dead code. 
    So (potato, potahto) it works out exactly the same.
    It also allows me to test that optimizations are actually enabled and working when looking at the assembly. 

    The difference in running time is:
    Stefan's option 1a: 21793
    Optimized code: 15216

  18. All this is nonsense. You are still optimizing for a single use case, which is just for one unrealistic micro-benchmark. You are speaking about a loop allocation of the same string variable, with the same length! Which software algorithm is actually doing this?
    If you have this pattern, you would NOT allocate the string, but instead reuse the same fixed-size buffer, perhaps allocated on the stack:
    var tmp: array[0..sizeofstring] of char;
    and you would never be able to beat the speed of a single asm opcode generated by the compiler:
    add esp,-sizeofstring*2
    Or... just reuse the same variable!
    Why on earth would you not use SetString() which is
    - safe (i.e. not error prone),
    - fast in almost all cases (except perhaps in Martin's unrealistic microbenchmark),
    - supports both UnicodeString and AnsiString with compiler-generated overloaded functions,
    - is a compiler intrinsic function,
    - and built in the RTL for the exact purpose Stefan was asking for, i.e. creating a string from a text memory buffer.

  19. A. Bouchez
    Still I wish we had direct access to System.NewString, that way we'd have the best of both options.

  20. Johan Bontes But you can! SetString IS actually System.NewString (which does not exist as such in Delphi). Just check the generated assembler.
    In fact, there is no such System.NewString - just _UStrFromPCharLen/_LStrFromPCharLen.
    This is what I meant: SetString() is indeed a compiler intrinsic pseudo function.

  21. A. Bouchez There is a `system._NewUnicodeString`. Yes, `SetString` is almost a NewString, but SetString does a little bit extra. If the first parameter is not an empty string, it clears the old string/adjusts the refcount. This extra work causes it to be slightly slower than the NewString + Move pair. Never mind, missed your comment above.
    Having said that, SetString (code 2) is definitely faster than code 1a. However, care needs to be taken to pass an empty string every time when testing in a loop.

    Results after stripping out all calls to UniqueString.

    Optimized use of copied _NewUnicodeString: 6018
    Code 1a: 10233
    Optimized use of SetString: 6589 (revised from 8336)
    (Had to tweak the code to stop the compiler from cleaning up; strings are hard.)
    SetString wins, I'd say.

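For reference, the two variants discussed in this thread can be put side by side as a small console program. This is only a minimal sketch, not code from any of the commenters: the helper names RangeViaSetString and RangeViaMove and the sample string are made up for illustration, and it says nothing about which variant is faster in a given workload.

```pascal
program SubstringFromPointers;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper (name invented for this sketch): variant 2, SetString.
function RangeViaSetString(AStart, AEnd: PChar): string;
begin
  // PChar - PChar yields a character count, not a byte count,
  // which is what SetString expects as its length argument.
  SetString(Result, AStart, AEnd - AStart);
end;

// Hypothetical helper (name invented for this sketch): variant 1a, SetLength + Move.
function RangeViaMove(AStart, AEnd: PChar): string;
var
  Len: Integer;
begin
  Len := AEnd - AStart;
  SetLength(Result, Len);
  // Move counts bytes, so the character count is scaled by SizeOf(Char).
  // Pointer(Result)^ skips the uniqueness check that Result[1] triggers;
  // the pointer is nil for an empty string, hence the guard.
  if Len > 0 then
    Move(AStart^, Pointer(Result)^, Len * SizeOf(Char));
end;

var
  Source: string;
  P: PChar;
begin
  Source := 'hello world';
  P := PChar(Source);
  // P + 6 points at 'w', P + 11 is one past the final 'd'.
  Writeln(RangeViaSetString(P + 6, P + 11)); // prints "world"
  Writeln(RangeViaMove(P + 6, P + 11));      // prints "world"
end.
```

Both helpers return a fresh string because Result starts out empty inside the function, so the "s already holds a string" case raised in the comments above does not come up here. When reusing an existing variable in a loop, that is exactly the case worth profiling.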
