I hate the way Delphi handles UTF8 String


Why does the IDE not show the difference between "é" and "é" in two UTF8String values?

DN and Info.DN are of the same type, UTF8String:

type
  TInfoPS = record
    DN: UTF8String;
    function Equals(const Info: TInfoPS): Boolean;
  end;

function TInfoPS.Equals(const Info: TInfoPS): Boolean;
begin
  Result := Info.DN = DN;
end;
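
One way to see what the IDE's string view hides is to print each value's stored code page and raw bytes. The sketch below is not from the original post: `DumpRaw` is a hypothetical helper, and it assumes a desktop Delphi compiler with unit scope names (XE2 or later; on older versions the unit is plain `SysUtils`), where `StringCodePage` and 1-based indexing of 8-bit strings are available.

```pascal
program DumpUtf8Demo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: returns the code page stored in the string header,
// followed by the payload bytes in hexadecimal.
// The parameter is a RawByteString so that the call itself converts nothing.
function DumpRaw(const S: RawByteString): string;
var
  I: Integer;
begin
  Result := Format('CP=%d, bytes:', [StringCodePage(S)]);
  for I := 1 to Length(S) do
    Result := Result + ' ' + IntToHex(Ord(S[I]), 2);
end;

var
  U: string;
  Good: UTF8String;
begin
  U := #$00E9;            // "é" as a single UTF-16 code unit
  Good := UTF8String(U);  // conversion from UnicodeString to UTF-8
  // A correctly encoded value should report code page 65001 and bytes C3 A9.
  Writeln(DumpRaw(Good));
end.
```

Logging `DumpRaw(DN)` and `DumpRaw(Info.DN)` from inside `Equals` would show whether the two fields differ in code page, in bytes, or in both.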

Comments

  1. Info.DN is clearly not a valid UTF8String (in UTF-8, é is always encoded as two bytes).

  2. In fact, I know where the problem comes from.

    My TInfoPS record was saved to and reloaded from a JSON file with my JSON unit. This unit uses "string" to build the JSON representation of the record, and the UTF-8 conversion is done just before saving the file and just after reloading it, with TEncoding.UTF8.GetBytes() and TEncoding.UTF8.GetString().

    So the UTF8String was added to the JSON "string" with
    AppendString(string(PAnsiString(Instance)^));
    and reloaded with
    AnsiString(Instance^) := AnsiString(GetValue);

    I'm not sure exactly how all this puts an AnsiString in my UTF8String, but it happens somewhere in there.

    But the IDE does not show the difference! That is my real problem!

  3. I think the IDE knows they are in two different encodings, so it figures out how to display them correctly.

  4. The IDE does not show the difference because it converts both values to UnicodeString, and their string representations are then the same.

    The point is that you are doing it wrong. A UTF8String is an AnsiString with code page 65001 (UTF-8). The RTL deals with that properly internally, but if you do some string casting you might end up with an AnsiString that does NOT have code page 65001 but your system code page.

    What you are looking for is a debugger visualizer for UTF8String that displays the values in their raw format, which you can probably write yourself.

  5. Stefan Glienke are you a friend of Rudy? There was a time when you didn't have to write everything yourself :(

  6. I do write many things myself, but the way the compiler handles strings is not in my hands.

  7. Paul TOTH I think there is some extensive documentation about that by Marco Cantù.

  8. One of my mistakes is to use UTF8String() or AnsiString() like string(), but they are not the same.

    string() acts as a conversion function, whereas the two others are just type casts.

    EDIT: Worse, it does the conversion FROM a string, but only a type cast from an 8-bit string... so UTF8String(string) is UTF-8 encoded, but UTF8String(AnsiString) is still an AnsiString.

  9. Shouldn't you just fix your code? Fix those erroneous casts?

  10. David Heffernan I will, but now I know why I made this error. When you assign an AnsiString to a UTF8String, or a string to a UTF8String, you get warning W1057 or W1058. And I fight against warnings: they are precious information, provided you don't let the program raise thousands of them. When you type cast the source string, the warning is gone. From a string it works as expected, but from an AnsiString the cast changes the UTF-8 code page, and the UTF8String variable in fact holds an AnsiString. You have to use UTF8Encode() instead (or no cast at all, and live with the warning). My mistake was to think that Delphi would do the same thing for both types, and I'm pretty sure many developers will make the same mistake.

  11. Paul TOTH it's time to blog about this ;), with some nice examples!

  12. Stéphane Wierzbicki sure, but until I can make a living from feeding a blog, I have to finish an import task first ;)

    leetchi.com - Contribution à l'OpenSource - Leetchi.com

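
The conclusion of the thread (a cast from `string` converts, a cast from `AnsiString` reportedly does not, and `UTF8Encode` is the safe route) can be sketched as follows. This is an illustration, not code from the post: the variable names are hypothetical, and it assumes a desktop Delphi compiler (2009 or later) on a system whose ANSI code page can represent "é", for example 1252. It prints the code page and length of each result so the three paths can be compared; the behaviour of the plain cast is as reported in comments 8 and 10, not something this sketch guarantees.

```pascal
program CastVersusConvert;

{$APPTYPE CONSOLE}

var
  U: string;
  A: AnsiString;
  FromString, FromAnsiCast, FromAnsiEncoded: UTF8String;
begin
  U := #$00E9;          // "é" as a single UTF-16 code unit
  A := AnsiString(U);   // converted to the system ANSI code page

  // Path 1: cast from a UnicodeString, described in the thread as a real conversion.
  FromString := UTF8String(U);

  // Path 2: cast from an AnsiString, described in the thread as a plain type cast
  // that silences the warning and leaves the ANSI payload in place.
  FromAnsiCast := UTF8String(A);

  // Path 3: go through UnicodeString explicitly, then encode as UTF-8.
  FromAnsiEncoded := UTF8Encode(string(A));

  Writeln('UTF8String(string):      CP=', StringCodePage(FromString),
    ' Len=', Length(FromString));
  Writeln('UTF8String(AnsiString):  CP=', StringCodePage(FromAnsiCast),
    ' Len=', Length(FromAnsiCast));
  Writeln('UTF8Encode(string(Ansi)): CP=', StringCodePage(FromAnsiEncoded),
    ' Len=', Length(FromAnsiEncoded));
end.
```

If the second line reports a code page other than 65001, or a length of 1 instead of 2, the value stored in the UTF8String variable is not valid UTF-8, which is the situation described for Info.DN.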
