Which file format do you use for your Delphi (code) source?
As far as I know, Delphi up to 2007 only used ANSI. Since Delphi 2009 it can use UTF-8 instead. The current XE8 even allows for UCS-2 and UCS-4 in both big and little endian format (and a few more which I am going to ignore because I have no idea what they actually mean: "Text Form", "Binary Form", "Binary").
In case you are wondering: This is about reading files for various functionality in GExperts. Currently (in the current source code that is, not the released DLL), it works fine if the file is already open in the IDE, but if it has to read a file itself, it tries to convert it from UTF-8 to the native string format, regardless of the encoding the file is actually in. On my computer all source files seem to be ANSI; I had to explicitly tell Delphi to use other formats to actually get some test files.
(If you use ANSI + any of the other formats, please vote for the "other format".)
The poll only lets you select a single answer. Most of our code files are encoded in ANSI, for the obvious historical reasons. Whenever a non-ANSI character shows up, Delphi now suggests saving the file in UTF-8, which we then do. Other than that, most of my new units are UTF8-encoded.
I use whatever the IDE uses by default. Sometimes it asks about UTF8 and if so, I click Yes. In other words, while I can't answer "What are you talking about" since I know what you're talking about, this is absolutely something I let the IDE take care of for me.
Btw, so far as I can tell, in modern versions the editor always uses UTF8 internally, even if the file is saved as ANSI. (E.g., for an IOTAEditReader - it gives you UTF8, even though it writes to a PAnsiChar buffer. That caused me a nasty bug once :p) Perhaps you can always run a conversion to UTF8 from file, i.e. assume ANSI unless there is a BOM?
Please add 7-bit ASCII... I use it for compatibility on all targets and rely on external UTF-8 files for translation.
A. Bouchez: Delphi does not distinguish between 7-bit ASCII and ANSI because there is no difference unless the file contains characters with the high bit set.
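A small illustration of that point, as a Python sketch rather than Delphi code: bytes below 0x80 decode to the same text under Windows-1252 and UTF-8, and the encodings only diverge once a byte has the high bit set. The helper name `decodes_identically` is made up for this example, and Windows-1252 is assumed as the ANSI code page.

```python
def decodes_identically(data: bytes) -> bool:
    """Return True if data yields the same text as Windows-1252 and as UTF-8."""
    try:
        return data.decode("cp1252") == data.decode("utf-8")
    except UnicodeDecodeError:
        # At least one of the two codecs rejected the bytes outright.
        return False


# Pure 7-bit ASCII source: the encoding question does not matter.
print(decodes_identically(b"procedure Foo; begin end;"))

# 0xE9 is a single accented letter in Windows-1252 but malformed as UTF-8.
print(decodes_identically(b"caf\xe9"))
```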
Martijn Coppoolse: I can't change the vote options any more. If you use UTF8 and ANSI, please vote UTF8.
The editor uses UTF-8 internally? Why wouldn't it use UTF-16?
Anthony Frazier: I don't know, but I suspect it was to make it easier to transition its code from ANSI strings, since (I think) it supported UTF8 before Delphi itself did. (Also I vaguely remember hearing that the editor is written in C, so it doesn't benefit from Delphi support for Unicode anyway.)
Plus, UTF8 is best ;) http://utf8everywhere.org/
The problem is that sometimes I encounter Delphi files without BOM that are nevertheless coded in UTF8. The IDE seems to be able to detect this, although I have no idea how it is done.
David Millington: the IDE started to use UTF8 with Delphi 8, at least that's what the conditional defines in GExperts suggest.
Uwe Raabe: in theory you could detect UTF8 by reading a file as ANSI and checking whether it contains any UTF8-specific byte sequences. I'm not going to implement that though.
Uwe Raabe: UTF8 detectors work by decoding the string as UTF8 and seeing if there are any errors. There are valid ANSI byte sequences that are not valid UTF8 sequences. To have an ANSI file that incorrectly appears to be valid UTF8, you have to have some very weird multi-character sequences, which are unusual, and so decoding is a fairly good check.
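To make the decode-and-check idea concrete, here is a minimal Python sketch (Python rather than Delphi, purely for illustration). The name `looks_like_utf8` is made up for this example and is not a GExperts or RTL function; Windows-1252 is assumed as the ANSI code page for the sample bytes.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# "blåbær" saved as UTF-8: every multi-byte sequence is well formed.
print(looks_like_utf8(b"bl\xc3\xa5b\xc3\xa6r"))

# The same word saved as Windows-1252: 0xE5 announces a three-byte
# sequence, but the next byte is a plain 'b', so strict decoding fails.
print(looks_like_utf8(b"bl\xe5b\xe6r"))

# The rare false positive: the Windows-1252 text "Ã©" is the byte pair
# C3 A9, which also happens to be valid UTF-8 for "é".
print(looks_like_utf8(b"\xc3\xa9"))
```

A file containing only 7-bit ASCII passes this check too, which is harmless because it decodes to the same text either way.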
I can't remember a good source for this, and I know there is plenty of bad code online. One useful-looking Stack Exchange question was here: http://codereview.stackexchange.com/questions/59428/validating-utf-8-byte-array
Edit: You could try wrapping MS's MLang too: http://www.codeproject.com/Articles/17201/Detect-Encoding-for-In-and-Outgoing-Text
Edit 2: A native Delphi solution! Based on Mozilla's i18n. http://stackoverflow.com/questions/373081/how-can-i-best-guess-the-encoding-when-the-bom-byte-order-mark-is-missing linking to http://chsdet.sourceforge.net/
Perhaps reading the file into TBytes and checking if TEncoding.UTF8.GetCharCount > 0 can be a simple solution. Assuming the file is not empty, of course. That would put the burden on the OS.
Ours are a mix of ANSI (Windows 1252) and UTF-8. We have a fair number of comments in Norwegian, hence the files often contain non-ASCII characters.
C'mon. Aren't we all using Delphi 7? Unicode does not exist.
I have been using UTF-8 in my apps since Delphi 7. It would be nice if Delphi would save source code in UTF-8 as default (or it already does, but I have never noticed), but that is not crucial for my code because my code is pure ASCII. All UTF-8 strings are kept separately and loaded on app startup from other resources and not directly from coded string constants.
To sum it up, after 108 votes:
36% ANSI
58% UTF-8
6% WTF?
There was one vote for UCS-2 that I saw but apparently the person changed his vote afterwards.
So, nobody seems to use UCS-2 or UCS-4 which is fine for me, because it gives me a reason to limit GExperts to reading ANSI and UTF-8, using the BOM to detect UTF-8.
Thanks to everybody who voted!
> "using the BOM to detect UTF-8."
That might be a mistake? UTF8 files are usually not saved with a BOM. What about one of the detection methods in the comments above?
At least Delphi XE8 saves them with a BOM.
Thomas Mueller: For BOM-less files, it would be strongly preferable if it assumed UTF-8 until "proven wrong" (i.e. decode failure).
Asbjørn Heid: that's how it used to work; unfortunately, when decoding failed, it tried to work with an empty string.
Thomas Mueller: So it's only missing the fallback logic then.
Asbjørn Heid: actually I had already added the fallback logic but took it out after I added BOM detection. Now it's back: Assume everything to be UTF-8, try to convert it to native string and if that fails, use the original string. That solved one problem, but I found a few more. It's like a can of worms: Once you have opened it, you will never get all the worms back into it. I found quite a few GExperts functions that I didn't even know existed. Damn you Stefan Glienke for getting me to start looking for broken GExperts functionality. ;-)
Thomas Mueller: Just to make things more complex... :) Do you still have BOM detection there? Because if not, IMO the best path is:
1. BOM detection. If UCS2/UTF16/UTF8, it's known and can be handled directly.
2. No BOM - go into what you described above; assume UTF8 and only fall back to ANSI if UTF8 decoding fails.
I'm not clear if you're doing the first step or not, sorry, thus asking - but I suspect it's worth doing.
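The two steps could be sketched roughly like this. It is a Python illustration of the logic only, not the GExperts implementation: `decode_source` is a hypothetical name, the UTF-32 entries are an addition covering the UCS-4 options mentioned in the post, and Windows-1252 stands in for whatever the current ANSI code page would be.

```python
import codecs
from typing import Tuple

# Longer marks come first so a UTF-32 LE BOM (FF FE 00 00) is not
# mistaken for a UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_source(data: bytes, ansi_codec: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used) for the raw bytes of a source file."""
    # Step 1: a BOM settles the question; strip it and decode accordingly.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(name), name
    # Step 2: no BOM, so assume UTF-8 and fall back to ANSI only when
    # strict UTF-8 decoding fails.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: bytes undefined in the ANSI code page are replaced
        # rather than raising, so the caller always gets some text back.
        return data.decode(ansi_codec, errors="replace"), ansi_codec


print(decode_source(codecs.BOM_UTF8 + b"bl\xc3\xa5b\xc3\xa6r"))  # BOM present
print(decode_source(b"bl\xc3\xa5b\xc3\xa6r"))                     # BOM-less UTF-8
print(decode_source(b"bl\xe5b\xe6r"))                             # ANSI fallback
```

Returning the encoding alongside the text means the caller never ends up with an empty string after a failed conversion, and it can write the file back in the encoding it was read in.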
I just found some interesting info about this. From Danny Thorpe's blog, "A Byte Order Mark at the start of the source file is mandatory for the compiler to recognize your character encoding. Source without a BOM is assumed to be in the current locale charset."
So it looks like you can rely on a UTF8 BOM being present if the code is UTF8. (Maybe test that to be sure it's still the case with one sample UTF8 file - I haven't.)
Source: http://dannythorpe.com/2004/09/03/unicode-identifiers/