I think I found a compiler issue, could you please vote for it?
For "if AChar <= #255 then" the compiler (XE6) generates the following code:
005D734A 66817DFA4F04 cmp word ptr [ebp-$06],$044f
$044f does not equal 255!
But it works fine when the Ord function is used:
if Ord(AChar) <= 255 then
005D7352 668B45FA mov ax,[ebp-$06]
005D7356 663DFF00 cmp ax,$00ff
http://qc.embarcadero.com/wc/qcmain.aspx?d=124402
Actually it's the old Char horror introduced with Delphi 2009.
#255 is interpreted as character 255 in your system codepage and then converted to Unicode (so the outcome depends on your system codepage). The problem has many variants, and not all of them involve the # notation.
You have to use #$00FF if you want to specify code point 255.
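For anyone landing here later, here is a minimal sketch of two comparisons that should not depend on the system codepage, based on the advice above. The program name and the helpers `IsLatin1ByOrd` and `IsLatin1ByLiteral` are made up for illustration, and this assumes a Unicode Delphi (2009 or later) console project.

```delphi
program CharCompareDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper (not from the thread): compares the ordinal value,
// so no character literal has to be parsed at all.
function IsLatin1ByOrd(AChar: Char): Boolean;
begin
  Result := Ord(AChar) <= $FF;
end;

// Hypothetical helper: uses a four-digit hex literal, which is parsed
// as a WideChar code point (U+00FF) rather than as an AnsiChar.
function IsLatin1ByLiteral(AChar: Char): Boolean;
begin
  Result := AChar <= #$00FF;
end;

var
  Ya: Char;
begin
  Ya := #$044F; // CYRILLIC SMALL LETTER YA, the value seen in the disassembly

  // U+044F is above U+00FF, so both calls should report FALSE.
  Writeln(IsLatin1ByOrd(Ya));
  Writeln(IsLatin1ByLiteral(Ya));

  // 'A' is U+0041, so both calls should report TRUE.
  Writeln(IsLatin1ByOrd('A'));
  Writeln(IsLatin1ByLiteral('A'));
end.
```

The Ord variant matches the code from the original post that compiled to `cmp ax,$00ff`.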
It looks like a bug to me, and I do that sort of comparison all the time (not using XE6 though). Did you test with other values (like #13 or #10 which are common tests)? Voted.
Eric Grange Oh, thanks. This makes sense. But it works just fine for code like "if AChar in [#1, #255] then": in this code #255 = #$ff
Please attach a compiling unit test (preferably a DUnit unit containing the unit test) with regressions for #255, #$FF, #$00FF, Chr(255), Chr($FF), Chr($00FF) and similar permutations for Char(...) and I'll upvote and tag it for promotion to the internal bug system.
For more background on the why of these permutations: http://wiert.me/2010/01/18/delphi-highcharunicode-directive-delphi-rad-studio/
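A rough sketch of what such a DUnit unit could look like is below. The unit name `TestCharLiterals` and the class `TCharLiteralTests` are placeholders, not anything from the QC report, and only a few of the requested permutations are covered. Going by the original post, the #255 case would be expected to fail on a system with a Cyrillic codepage when HIGHCHARUNICODE is left at its default; the other outcomes are what the tests are meant to find out.

```delphi
unit TestCharLiterals;

interface

uses
  TestFramework; // DUnit

type
  // Hypothetical test case: each method assigns code 255 to a Char in a
  // different way and checks that the ordinal value is still 255.
  TCharLiteralTests = class(TTestCase)
  published
    procedure TestDecimalLiteral;
    procedure TestTwoDigitHexLiteral;
    procedure TestFourDigitHexLiteral;
    procedure TestChrFunction;
    procedure TestCharCast;
  end;

implementation

procedure TCharLiteralTests.TestDecimalLiteral;
var
  C: Char;
begin
  C := #255;
  CheckTrue(Ord(C) = 255, '#255');
end;

procedure TCharLiteralTests.TestTwoDigitHexLiteral;
var
  C: Char;
begin
  C := #$FF;
  CheckTrue(Ord(C) = 255, '#$FF');
end;

procedure TCharLiteralTests.TestFourDigitHexLiteral;
var
  C: Char;
begin
  C := #$00FF;
  CheckTrue(Ord(C) = 255, '#$00FF');
end;

procedure TCharLiteralTests.TestChrFunction;
var
  C: Char;
begin
  C := Chr(255);
  CheckTrue(Ord(C) = 255, 'Chr(255)');
end;

procedure TCharLiteralTests.TestCharCast;
var
  C: Char;
begin
  C := Char(255);
  CheckTrue(Ord(C) = 255, 'Char(255)');
end;

initialization
  RegisterTest(TCharLiteralTests.Suite);

end.
```

Running the same unit on machines with different system codepages, and with `{$HIGHCHARUNICODE ON}` added at the top, would show which forms are stable.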
Roman Yankovsky Actually no, it doesn't work, it just looks like it works, but that's contextual
http://www.delphitools.info/2013/11/18/unicode-leftover-bug-from-hell/
This is yet another rule to check for SourceOddity (SpaceOddity theme for Delphi), I guess :)
On your system, which probably has a Cyrillic codepage, #255 maps to U+044F: http://www.charbase.com/044f-unicode-cyrillic-small-letter-ya
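The mapping itself can be checked at run time, independent of the compiler's literal handling. This is only a sketch: `DecodeByte` is a made-up helper, and it assumes a Windows target where codepages 1251 and 1252 are available and strings are 1-based.

```delphi
program CodePageDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: decodes a single byte with the given Windows
// codepage and returns the resulting UTF-16 code unit.
function DecodeByte(ACodePage: Integer; AByte: Byte): Word;
var
  Enc: TEncoding;
  S: string;
begin
  Enc := TEncoding.GetEncoding(ACodePage); // caller owns this instance
  try
    S := Enc.GetString(TBytes.Create(AByte));
    Result := Ord(S[1]);
  finally
    Enc.Free;
  end;
end;

begin
  // Byte $FF in Windows-1251 (Cyrillic) is U+044F.
  Writeln(IntToHex(DecodeByte(1251, $FF), 4));
  // Byte $FF in Windows-1252 (Western European) is U+00FF.
  Writeln(IntToHex(DecodeByte(1252, $FF), 4));
  // Byte $80 in Windows-1252 is the Euro sign, U+20AC.
  Writeln(IntToHex(DecodeByte(1252, $80), 4));
end.
```

The first line is the conversion the compiler appears to have applied to #255 in the original report.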
As Jeroen hinted, there is a HIGHCHARUNICODE directive for compatibility. Do you see a difference when you turn it on and off? If so, I'd say this is expected. The idea is that if you are using #$80 you get the Euro symbol no matter what (that is not its Unicode number, so it gets converted for you), which helps existing code move over.
This error has existed since Delphi 2009. :-(
Take
if AChar <= Char(255)
#$FF -> AnsiChar <> Unicode = #$00FF
Marco Cantù the behavior introduced with D2009 did not provide backward compatibility. All explicit character codes here broke during our migration because of implicit conversion based on system codepage (which of course depends on the system).
Pre-D2009 the compiler took a "hands off" approach to numeric charcodes, so the code compiled in the same predictable way regardless of the system codepage it was compiled on.
Marco Cantù As an illustration, from the doc http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/devcommon/compdirshighcharunicode_xml.html
The backward compatible behavior would be to have a way to have both A & W be $80, but none of the HIGHCHARUNICODE modes correspond to that.
No, to me the backward compatible behavior is that if you assigned a char you get the same char. This isn't a byte, but a char! Anyway, I tried the following:
{$HIGHCHARUNICODE ON}
procedure TForm3.Button1Click(Sender: TObject);
var
  ch: Char;
begin
  ch := #128;
  ShowMessage(ch);
  ShowMessage(Ord(ch).ToString);
end;
And it does work as you'd expect (in XE6, at least): the ordinal value is indeed 128. With that OFF, you get the Euro symbol (at least on a Western European code page), which is 8364 ($20AC).
Marco Cantù I disagree with this. If you specify a char by number in two different ways in any Delphi version it should come out the same.
Marco Cantù What you describe is what the 2009 doc says, but not what previous Delphi versions did, hence it's not backward compatible, by definition : )
Also, in the grand scheme of things, the only two purposes of explicit numeric charcodes are control characters (below 32) and codepage-agnosticity; otherwise you might as well type the literal character in the source code directly.
Jeroen Wiert Pluimers Please specify what you mean by "the same": if you display a char on the screen, should it be the same character? Or if you do some processing, should it be the same numeric value? You cannot have both, because some of the numeric values between 127 and 255 differ between the Windows code page (not ISO) and Unicode. In any case, Delphi has a compiler option that lets you decide what "the same" means for you, so I'm not sure I understand what the suggestion is. Is it changing the default value for that compiler option? Removing it?
Eric Grange It tries to be backward compatible with previous code, trying to "read" the programmer's intent, which is not always possible. But there is a compiler flag to adjust the behavior. If the compiler flag works as documented, it might be a nuisance, but you cannot qualify it as a bug.
I fully agree with you that explicit numeric char codes are quite a bad idea in the first place. But they are used by developers. Given this applies to the range 127 to 255, where there is only one "real" control char on Windows (namely 255) and a bunch of other literals Microsoft pushed into some "unused space" between 80 and 96 (hex), the literal character is more likely the developer's intent. And there is a reason developers don't type those literals: many of them don't show up on keyboards, with the Euro symbol added only more recently...
So I do understand your point, and it is true things did change, but I still think CodeGear (back then) implemented the best migration strategy in regard to this issue.
Marco Cantù And why is char(128) <> #128?
Marco Cantù You misunderstood: for numeric charcodes, the compiler shouldn't try to be "smart" IMHO. Pagecoding regular characters is okay, pagecoding numeric charcodes isn't.
Numeric characters are not a "bad idea", they are a requirement for control characters and special cases, such as detecting a range (ASCII vs. not ASCII) or for non-visible codes (no-break space, Unicode diacritics, etc.), and in all those cases you just don't want the compiler to interpret anything.
WRT the issue, they actually implemented the worst possible strategy: not only do you not have any backward-compatible option, but you get two hard-to-predict options, both with side effects. The result is that you have to work around the compiler when you need to specify special Unicode chars, and you have to be extra wary of any single-character string/char constants (as there is another terrible design choice that was made there).
Eric Grange I totally agree with you. Did you write unit tests for this? If not: want to chat on how to set this up?
Eric Grange "for numeric charcodes, the compiler shouldn't try to be smart". OK, so we should make the code less backward compatible? As I wrote, numeric characters are/were used in practice in other cases, whether you like it or not.
Having said this, I agree that for AnsiChar you should have the same behavior no matter what HIGHCHARUNICODE is set to, and I see that as a bug. For Char (i.e. WideChar) you have two different options, providing backward compatibility for different scenarios (and the option is indeed local and specific to a code fragment). For AnsiChar, it seems it is messed up!
Jeroen Wiert Pluimers Yes, my system has a Cyrillic codepage. I will write unit tests, no problem :)
How did you set your code page? And what is the number of your code page? (I know of `chcp`, but maybe you use another way)
http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using
Marco Cantù I suppose the situation has become very messy, given the D2009 choice wasn't sanitized earlier. But backward-compatibility-wise, the hands off approach is the one that matches the pre-D2009 behavior.
Neither of the HIGHCHARUNICODE options is backward-compatible, we experienced that firsthand :/
Jeroen Wiert Pluimers There are some unit tests as part of the DWScript unit tests that stress this (though not enough, as the tokenizer had a bug in some edge cases when compiled from Delphi running on a Cyrillic system...). I think I also saw some in the mORMot Framework (but I'm not 100% positive).
Eric Grange If you have names of source files, please let me know and I will try to integrate them into a bigger suite of tests. Those should at least serve as knowledge for people on which coding patterns to avoid, and hopefully help some of the Embarcadero guys improve things.
Eric Grange and Roman Yankovsky Any more input on unit tests yet?
Jeroen Wiert Pluimers There are a number of unit tests: http://yankovsky.me/TestChars.zip
Roman Yankovsky Thanks. I'm adapting my code generator so it can generate my tests and yours. Will keep you posted.
Managed to adapt the code generator so it can generate my own unit tests. See https://bitbucket.org/jeroenp/besharp.net/commits/1f9a3ef28f63da07e7786fa17a874bb66eb8c1c9
Will start working on your unit tests later this week (problem: you mixed unit tests that do/don't compile with Delphi 2007 in one unit. I need to split that out first.)