Why not stick with UTF-8 everywhere? Me no likey forky.
UTF-8 everywhere would mean a conversion overhead in Delphi, and a very significant overhead for Smart/JS.
I want FreePascal, but I don't want to sacrifice Windows Delphi or JS for it ;-)
UTF-8 is a variable-length encoding (UTF-16 as well, but we can safely ignore the variable part), so it's not very handy. UTF-8 is to be used when the "backend" is ASCII-based and cannot work with UTF-16.
Eugene Mayevski, there is nothing safe about ignoring the variable part of UTF-16 if you deal with Chinese or Apple text (which can use combining characters for accented letters).
What's wrong with Apple? Did they steal the alphabet as well?
Apple had different conventions and practices with respect to Unicode normalization and equivalence (in simple terms, in Unicode "é" can be precomposed as a single character or written as "e" + a combining diacritic). When dealing with Unicode text, be it Chinese or a non-precomposed character, this means that a character, even a Latin one like "é", can be made of two WideChars, which shouldn't be separated (they can be normalized, though normalization cannot always merge them, as not all combinations have a precomposed character).
For more details, see http://en.wikipedia.org/wiki/Precomposed_character and http://en.wikipedia.org/wiki/Combining_character
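To make the precomposed versus combining distinction concrete, here is a minimal sketch. It uses Python and its standard `unicodedata` module purely for illustration (the thread itself is about Delphi and FreePascal), and the variable names are made up for this example.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as one code point (U+00E9)
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Same rendering, different code point sequences.
print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 1 2

# NFC composes when a precomposed form exists, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)  # True True

# "q" + combining acute has no precomposed code point,
# so even after NFC it still takes two code points.
q_acute = unicodedata.normalize("NFC", "q\u0301")
print(len(q_acute))                          # 2
```

The last case is the one that matters for fixed-width assumptions: normalizing to NFC reduces the number of multi-unit characters but cannot eliminate them.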
Eugene Mayevski "we can safely ignore the variable part"
That’s like saying "what the heck, let's just stick to ANSI".
Read this: http://www.utf8everywhere.org/#faq.almostfw
When I was offering Unicode support for non-Unicode Delphi with ElPack, it was very popular among Chinese users. Nobody ever complained about missing composite characters or other problems with Chinese. So let's separate theory from practice. Well, if one wants to fight over which end of the egg to break first, I quit.
Interestingly enough, non-BMP characters have become quite common thanks to Twitter ;-)
http://stackoverflow.com/questions/5567249/what-are-the-most-common-non-bmp-unicode-characters-in-actual-use
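For reference, this is what "non-BMP" costs in UTF-16: any code point above U+FFFF takes two 16-bit code units (a surrogate pair). The sketch below is in Python for illustration only; `utf16_units` and `surrogate_pair` are hypothetical helper names, not functions from any library discussed here.

```python
def utf16_units(text: str) -> int:
    """Count the 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(ch: str) -> tuple[int, int]:
    """Return the (high, low) surrogates for a single non-BMP character."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("character is inside the BMP, no surrogates needed")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


bmp_cjk = "\u4e2d"          # common CJK ideograph, inside the BMP
ext_b_cjk = "\U00020000"    # CJK Extension B ideograph, outside the BMP
emoji = "\U0001F600"        # emoji, outside the BMP

print(utf16_units(bmp_cjk))     # 1
print(utf16_units(ext_b_cjk))   # 2
print(utf16_units(emoji))       # 2

high, low = surrogate_pair(emoji)
print(hex(high), hex(low))      # 0xd83d 0xde00
```

Code that indexes or truncates a UTF-16 string by code unit can cut between the two halves of such a pair.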
Eugene Mayevski, if your components were targeted at Wintel, they didn't have to care about UTF-16 endianness or normalization (as for nobody ever complaining, Google says otherwise).
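The endianness point can be shown in a few lines. This is a small Python sketch for illustration only: the same text produces different byte sequences in little-endian and big-endian UTF-16, and decoding with the wrong byte order yields different characters without raising an error.

```python
text = "A\u00e9"  # "A" followed by precomposed "é"

little = text.encode("utf-16-le")   # byte order used on x86 Windows
big = text.encode("utf-16-be")

print(little)   # b'A\x00\xe9\x00'
print(big)      # b'\x00A\x00\xe9'

# Decoding big-endian bytes as little-endian succeeds,
# but gives unrelated characters instead of the original text.
wrong = big.decode("utf-16-le")
print(wrong == text)   # False
```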
Would a fork just manifest itself as having to download the appropriate version of DWS for each platform? i.e. No source code changes for a developer?
If yes, then it seems the way to go from the standpoint of performance.
Kevin Powick, it would manifest itself as the same source code on the Delphi and FreePascal side, but with slight incompatibilities on the scripting side. The script would match the encoding of the platform that compiled it, rather than having an encoding independent of the platform.
Eric Grange I guess Z̫̜̬̘̩̬͎͙̻͋ͦ̓͗͝͠a̟ͩ̂͆̀l͙̟͓͓͆͢͞g̸͍̙̻͍͔͌ͬ̇ͦͤ̈́̽͘o̱̮̫̟̪̥͙͚ͯͤ̑ͨ̋ ̝̤͈͕̱̭̞͖ͫͫ͢͞͞t̶͓̤͈̣͖̓̎̔̎̓̋̓ͮ̈́͜ͅh̛̥͎̦̖̭̋͂̊ͮ͑́͋͡e̹̲̩̬̮͔̗̦̿̿̋ͨͪ ̷̥̫̰ͦ̋̀G̵̬̜̲̈́̈́ͤ̅͜͜r̾ͦͦͧ̐ͣͬ̒͘͏̴̰͙ȩ̼̭̥͔̳̿̽̿̉́ͅa̶̶̳̲̼͕ͧͩ͘t͕̯̫͇̩̳̤͖ͫ͋̎͋̾ͩ͒̈́͒͢͜͝ͅ ̬͈̜͉͔̺͕͂̆͞O̲̙͖̻̲̠̙̔͂͟͝ͅn̞̈́͐͠e̛̤͖ͨ͐̌ͯͩ͋̌ͤ͞ has been a bit responsible as well...
(interesting; on my Win7 machine, Chrome 28 completely messes up Zalgo the Diacritical One, whereas Firefox 24, Opera 12.16, Safari 5.1.7 and IE 9.0 render it without a hitch).