Having been a little underwhelmed by the performance of TStreamReader when reading huge text files line by line, I attempted to roll my own. I managed to implement something that was five times faster on my test files (800 MB of UTF-16LE).

This simple Python script

    with open(filename, 'r', encoding='utf-16-le') as f:
        for line in f:
            pass

was twice as fast again, so clearly there is more that could be done.

The Delphi equivalent was

    for Line in TLineReader.FromFile(filename, TEncoding.Unicode) do
      ;

So it is at least quite elegant to use.
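
For anyone who wants to reproduce the Python side of the comparison, a minimal timing sketch might look like the following. It is only an illustration, not the harness used for the numbers above: the helper names (count_lines, time_line_read, make_sample_file) are made up for this example, and the tiny generated sample file is a stand-in for the real 800 MB test files.

```python
import os
import tempfile
import time


def count_lines(filename, encoding='utf-16-le'):
    """Iterate over the file line by line, as in the script above, counting lines."""
    count = 0
    with open(filename, 'r', encoding=encoding) as f:
        for _ in f:
            count += 1
    return count


def time_line_read(filename, encoding='utf-16-le'):
    """Return (line_count, elapsed_seconds) for one pass over the file."""
    start = time.perf_counter()
    lines = count_lines(filename, encoding)
    return lines, time.perf_counter() - start


def make_sample_file(num_lines=1000):
    """Write a small UTF-16LE sample file (hypothetical stand-in for real test data)."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(fd, 'w', encoding='utf-16-le', newline='\n') as f:
        for i in range(num_lines):
            f.write(f'line {i}\n')
    return path


if __name__ == '__main__':
    sample = make_sample_file()
    try:
        lines, elapsed = time_line_read(sample)
        print(f'{lines} lines in {elapsed:.6f} s')
    finally:
        os.remove(sample)
```

On a file this small the timing is dominated by noise, so any real comparison should point time_line_read at a file of realistic size and repeat the measurement a few times.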

A part of me wonders whether there already exists Delphi code that performs a similar task. Does anybody know if there is anything out there?

Comments

  1. A. Bouchez, I'm with David Heffernan here:

    At its release in 1993, Windows NT was very early in supporting Unicode. Development of Windows NT started in 1990 where they opted for UCS-2 having 2 bytes per character and had a non-required annex on UTF-1.

    UTF-1 - that later evolved into UTF-8 - did not even exist at that time. Even UCS-2 was still young: it was designed in 1989. UTF-8 was outlined in late 1992 and became a standard in 1993.

    Some references:
    - http://www.itprotoday.com/management-mobility/windows-nt-and-vms-rest-story
    - https://en.wikipedia.org/wiki/Windows_NT
    - https://en.wikipedia.org/wiki/UTF-8#History
    - https://en.wikipedia.org/wiki/Universal_Coded_Character_Set#History
    - https://en.wikipedia.org/wiki/UTF-1

  2. David Heffernan, Jeroen Wiert Pluimers: I know the history very well, but my point was that a UTF-16LE file in 2017 doesn't make any sense, just like a CP1252 file, an EBCDIC file, or a file using https://en.wikipedia.org/wiki/ZX81_character_set, even if UTF-16 is used internally by Windows, Java or DotNet.

  3. A. Bouchez, but you said "MS blindness". Nobody forces application developers to use UTF-16.

