Having been a little underwhelmed by the performance of TStreamReader when reading huge text files line by line, I attempted to roll my own. I managed to implement something that was 5 times faster on my test files (800MB of UTF-16LE).

- December 21, 2017

This simple Python script

    with open(filename, 'r', encoding='utf-16-le') as f:
        for line in f:
            pass

was twice as fast again, so clearly there is more that could be done.
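For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of how the Python loop could be timed. The function name `time_line_iteration` is made up for illustration, and the numbers it reports will depend heavily on disk caching and the machine, so it is only a rough yardstick.

```python
import time


def time_line_iteration(filename, encoding="utf-16-le"):
    """Iterate over every line of a text file and report (line_count, seconds).

    Hypothetical helper for illustration; not part of the original post.
    """
    start = time.perf_counter()
    count = 0
    with open(filename, "r", encoding=encoding) as f:
        for _ in f:
            count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed
```

Running it a few times and discarding the first run helps separate file-cache effects from decoding and line-splitting cost.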

The Delphi equivalent was

    for Line in TLineReader.FromFile(filename, TEncoding.Unicode) do
      ;

So it is at least quite elegant to use.
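The post does not show how TLineReader works internally, so the following is not its implementation. It is only a sketch, in Python, of one common approach to fast line reading: pull large binary blocks from the stream, decode them incrementally, and split on newlines while carrying the unfinished tail over to the next block. The name `iter_lines_chunked` and the 1 MiB default block size are assumptions chosen for illustration.

```python
import codecs
import io


def iter_lines_chunked(stream, encoding="utf-16-le", chunk_size=1 << 20):
    """Yield lines (without line endings) from a binary stream.

    Illustrative sketch only: reads big blocks, decodes incrementally so
    that code units split across block boundaries are handled, and keeps
    the text after the last newline for the next round.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = ""  # text seen after the most recent newline
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        text = pending + decoder.decode(block)
        parts = text.split("\n")
        pending = parts.pop()  # last piece has no newline yet
        for part in parts:
            # Strip the CR of a CRLF pair, if present.
            yield part[:-1] if part.endswith("\r") else part
    # Flush the decoder; raises if the stream ends mid-character.
    pending += decoder.decode(b"", final=True)
    if pending:
        yield pending[:-1] if pending.endswith("\r") else pending


if __name__ == "__main__":
    sample = "first\r\nsecond\nthird".encode("utf-16-le")
    for line in iter_lines_chunked(io.BytesIO(sample), chunk_size=4):
        print(line)
```

Decoding before splitting avoids the alignment trap of searching raw UTF-16LE bytes for the newline byte pair, which can match across two adjacent characters.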

A part of me wonders whether there already exists Delphi code that performs a similar task. Does anybody know if there is anything out there?

Comments

Jeroen Wiert Pluimers, December 26, 2017 at 3:22 AM
A. Bouchez, I'm with David Heffernan here:

At its release in 1993, Windows NT was very early in supporting Unicode. Development of Windows NT started in 1990, when they opted for UCS-2, which used 2 bytes per character and had a non-required annex on UTF-1.

UTF-1, which later evolved into UTF-8, did not even exist at that time. Even UCS-2 was still young: it was designed in 1989. UTF-8 was outlined in late 1992 and became a standard in 1993.

Some references:
- http://www.itprotoday.com/management-mobility/windows-nt-and-vms-rest-story
- https://en.wikipedia.org/wiki/Windows_NT
- https://en.wikipedia.org/wiki/UTF-8#History
- https://en.wikipedia.org/wiki/Universal_Coded_Character_Set#History
- https://en.wikipedia.org/wiki/UTF-1
A. Bouchez, December 26, 2017 at 6:54 AM
David Heffernan, Jeroen Wiert Pluimers: I know the history very well, but my point was that a UTF-16LE file in 2017 doesn't make any sense, just like a CP1252 file, an EBCDIC file, or using https://en.wikipedia.org/wiki/ZX81_character_set, even if UTF-16 is used internally by Windows, Java or DotNet.
David Heffernan, December 26, 2017 at 8:17 AM
A. Bouchez, but you said "MS blindness". Nobody forces application developers to use UTF-16.

Delphi Developers Archive
