Which file format do you use for your delphi (code) source?

As far as I know, Delphi up to 2007 only used ANSI. Since Delphi 2009 it can also use UTF-8. The current XE8 even allows UCS-2 and UCS-4 in both big and little endian format (and a few more which I am going to ignore because I have no idea what they actually mean: "Text Form", "Binary Form", "Binary").

In case you are wondering: This is about reading files for various functionality in GExperts. Currently (in the current source code, that is, not the released DLL) it works fine if the file is already open in the IDE, but if it has to read a file itself, it tries to convert it from UTF-8 to the native string format, regardless of what encoding it is actually in. On my computer all source files seem to be ANSI; I had to explicitly tell Delphi to use other formats to get some test files.

(If you use ANSI + any of the other formats, please vote for the "other format".)
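
For illustration, here is how the encodings mentioned above announce themselves through a byte order mark (BOM). This is a Python sketch of the byte-level logic only, not the GExperts code; longer BOMs must be checked first, because the UTF-32 little endian BOM begins with the UTF-16 little endian one.

```python
# Map the byte order marks of the encodings discussed above to their names.
# Order matters: the UTF-32 LE BOM (FF FE 00 00) starts with the
# UTF-16 LE BOM (FF FE), so longer prefixes are tested first.
BOMS = [
    (b"\x00\x00\xfe\xff", "UCS-4 / UTF-32 big endian"),
    (b"\xff\xfe\x00\x00", "UCS-4 / UTF-32 little endian"),
    (b"\xef\xbb\xbf",     "UTF-8"),
    (b"\xfe\xff",         "UCS-2 / UTF-16 big endian"),
    (b"\xff\xfe",         "UCS-2 / UTF-16 little endian"),
]

def sniff_bom(data: bytes) -> str:
    """Return the encoding indicated by the BOM, if any."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return "no BOM (assume ANSI)"
```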

Comments

  1. The poll only lets you select a single answer. Most of our code files are encoded in ANSI, for obvious historical reasons. Whenever a non-ANSI character shows up, Delphi now suggests saving the file in UTF-8, which we then do. Other than that, most of my new units are UTF-8 encoded.

  2. I use whatever the IDE uses by default.  Sometimes it asks about UTF8 and if so, I click Yes.  In other words, while I can't answer "What are you talking about" since I know what you're talking about, this is absolutely something I let the IDE take care of for me.

  3. Btw, as far as I can tell, in modern versions the editor always uses UTF8 internally, even if the file is saved as ANSI. (E.g. for an IOTAEditReader - it gives you UTF8, even though it writes to a PAnsiChar buffer. That caused me a nasty bug once :p) Perhaps you can always run a conversion to UTF8 from file, i.e. assume ANSI unless there is a BOM?

  4. Please add 7-bit ASCII... I use it for compatibility on all targets and rely on external UTF8 files for translation.

  5. A. Bouchez Delphi does not distinguish between 7-bit ASCII and ANSI because there is no difference unless you have characters with the high bit set.

  6. Martijn Coppoolse I can't change the vote options any more. If you use UTF8 and ANSI, please vote UTF8.

  7. The editor uses UTF-8 internally? Why wouldn't it use UTF-16?

  8. Anthony Frazier I don't know, but I suspect it was to make it easier to transition its code from ANSI strings, since (I think) it supported UTF8 before Delphi itself did. (Also I vaguely remember hearing that the editor is written in C, so it doesn't benefit from Delphi's Unicode support anyway.)

    Plus, UTF8 is best ;) http://utf8everywhere.org/

  9. The problem is that sometimes I encounter Delphi files without a BOM that are nevertheless encoded in UTF8. The IDE seems to be able to detect this, although I have no idea how it is done.

  10. David Millington the IDE started to use UTF8 with Delphi 8, at least that's what the conditional defines in GExperts suggest.

  11. Uwe Raabe in theory you could detect UTF8 by reading a file as ANSI and checking whether it contains any UTF8-specific byte sequences. I'm not going to implement that, though.

  12. Uwe Raabe UTF8 detectors work by decoding the string as UTF8, and seeing if there are any errors.  There are valid ANSI byte sequences that are not valid UTF8 sequences. To have an ANSI file that incorrectly appears to be valid UTF8, you have to have some very weird multi-character sequences, which are unusual, and so decoding is a fairly good check.

    I can't remember a good source for this, and I know there is plenty of bad code online. One useful-looking Code Review Stack Exchange question is here: http://codereview.stackexchange.com/questions/59428/validating-utf-8-byte-array

    Edit: You could try wrapping MS's MLang too: http://www.codeproject.com/Articles/17201/Detect-Encoding-for-In-and-Outgoing-Text

    Edit 2: A native Delphi solution!  Based on Mozilla's i18n. http://stackoverflow.com/questions/373081/how-can-i-best-guess-the-encoding-when-the-bom-byte-order-mark-is-missing linking to http://chsdet.sourceforge.net/

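
The decode-and-check idea from the comment above can be sketched as follows (Python for brevity; this is an illustration of the technique, not the actual GExperts code):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a valid UTF-8 byte sequence.

    Pure ASCII passes too, which is harmless: ASCII is a subset of
    UTF-8. Typical ANSI text with accented characters fails, because
    bytes >= 0x80 must form specific multi-byte sequences in UTF-8.
    """
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False
```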
  13. Perhaps reading the file into TBytes and checking if TEncoding.UTF8.GetCharCount > 0 can be a simple solution. Assuming the file is not empty, of course. That would put the burden on the OS.

  14. Ours are a mix of ANSI (Windows 1252) and UTF-8. We have a fair number of comments in Norwegian, hence the files often contain non-ASCII characters.

  15. C'mon. Aren't we all using Delphi 7? Unicode does not exist.

  16. I have been using UTF-8 in my apps since Delphi 7. It would be nice if Delphi saved source code as UTF-8 by default (or maybe it already does and I have never noticed), but that is not crucial for my code because my code is pure ASCII. All UTF-8 strings are kept separately and loaded on app startup from other resources, not directly from string constants in the code.

  17. To sum it up, after 108 votes:
    36% ANSI
    58% UTF-8
    6% WTF?

    There was one vote for UCS-2 that I saw but apparently the person changed his vote afterwards.

    So, nobody seems to use UCS-2 or UCS-4 which is fine for me, because it gives me a reason to limit GExperts to reading ANSI and UTF-8, using the BOM to detect UTF-8.

    Thanks to everybody who voted!

  18. > "using the BOM to detect UTF-8."

    That might be a mistake?  UTF8 files are usually not saved with a BOM.  What about one of the detection methods in the comments above?

  19. At least Delphi XE8 saves them with a BOM.

  20. Thomas Mueller For BOM-less files, it would be strongly preferable if it assumed UTF-8 until "proven wrong" (i.e. a decode failure).

  21. Asbjørn Heid that's how it used to work; unfortunately, when decoding failed, it tried to work with an empty string.

  22. Thomas Mueller So it's only missing the fallback logic then.

  23. Asbjørn Heid actually I had already added the fallback logic but took it out after I added BOM detection. Now it's back: assume everything to be UTF-8, try to convert it to a native string, and if that fails, use the original string. That solved one problem, but I found a few more. It's like a can of worms: once you open it, you will never get all the worms back in. I found quite a few GExperts functions that I didn't even know existed. Damn you Stefan Glienke for getting me to start looking for broken GExperts functionality. ;-)

  24. Thomas Mueller Just to make things more complex... :) Do you still have BOM detection in there? Because if not, IMO the best path is:
    1. BOM detection. If it is UCS2/UTF16/UTF8, the encoding is known and can be handled directly.
    2. No BOM - go with what you described above; assume UTF8 and only fall back to ANSI if UTF8 decoding fails.

    I'm not clear whether you're doing the first step or not, sorry, hence the question - but I suspect it's worth doing.

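
The two steps above can be combined into one small loader. The following is a Python sketch only (decode_source and the cp1252 default are made up for illustration; the real GExperts code is Delphi):

```python
import codecs

def decode_source(data: bytes, ansi_codec: str = "cp1252") -> str:
    """Decode a source file's raw bytes with the two-step scheme above:
    1. if a BOM is present, it decides the encoding;
    2. otherwise assume UTF-8 and fall back to the ANSI codepage
       only if UTF-8 decoding fails.
    """
    for bom, enc in ((codecs.BOM_UTF8, "utf-8-sig"),
                     (codecs.BOM_UTF16_LE, "utf-16"),
                     (codecs.BOM_UTF16_BE, "utf-16")):
        if data.startswith(bom):
            return data.decode(enc)  # these codecs consume the BOM
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(ansi_codec)
```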
  25. I just found some interesting info about this. From Danny Thorpe's blog, "A Byte Order Mark at the start of the source file is mandatory for the compiler to recognize your character encoding. Source without a BOM is assumed to be in the current locale charset."

    So it looks like you can rely on a UTF8 BOM being present if the code is UTF8. (Maybe test that to be sure it's still the case with one sample UTF8 file - I haven't.)

    Source: http://dannythorpe.com/2004/09/03/unicode-identifiers/

