Having been a little underwhelmed by the performance of TStreamReader when reading huge text files line by line, I attempted to roll my own. I managed to implement something that was 5 times faster on my test files (800MB of UTF-16LE).
This simple Python script
    with open(filename, 'r', encoding='utf-16-le') as f:
        for line in f:
            pass
was twice as fast again, so clearly there is more that could be done.
The Delphi equivalent was
    for Line in TLineReader.FromFile(filename, TEncoding.Unicode) do
      ;
So it is at least quite elegant to use.
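For anyone wanting to reproduce the Python side of the comparison, a timing harness might look something like the sketch below. It is only an illustration: the helper names (count_lines, time_line_read, make_sample_file) are made up for this example, and the small generated sample file is a stand-in for the 800MB test files, so the numbers it prints say nothing about the figures quoted above.

```python
import os
import tempfile
import time


def count_lines(filename, encoding="utf-16-le"):
    """Iterate over the file line by line, like the script above, counting lines."""
    count = 0
    with open(filename, "r", encoding=encoding) as f:
        for _ in f:
            count += 1
    return count


def time_line_read(filename, encoding="utf-16-le"):
    """Return (line count, elapsed seconds) for one full pass over the file."""
    start = time.perf_counter()
    lines = count_lines(filename, encoding)
    elapsed = time.perf_counter() - start
    return lines, elapsed


def make_sample_file(line_count):
    """Write a UTF-16LE sample file (no BOM, CRLF line ends) and return its path.

    Hypothetical stand-in for a real test file.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    with open(path, "w", encoding="utf-16-le", newline="\r\n") as f:
        for i in range(line_count):
            f.write(f"line {i}\n")
    return path


if __name__ == "__main__":
    path = make_sample_file(100_000)
    try:
        lines, elapsed = time_line_read(path)
        size_mb = os.path.getsize(path) / (1024 * 1024)
        print(f"{lines} lines, {size_mb:.1f} MB in {elapsed:.3f} s")
    finally:
        os.remove(path)
```

Counting the lines rather than using a bare pass makes it easy to check that both implementations see the same number of lines before comparing their timings.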
A part of me wonders whether there already exists Delphi code that performs a similar task. Does anybody know if there is anything out there?
A. Bouchez I'm with David Heffernan here:
At its release in 1993, Windows NT was very early in supporting Unicode. Development of Windows NT started in 1990, when they opted for UCS-2, which has 2 bytes per character and had a non-required annex on UTF-1.
UTF-1, which later evolved into UTF-8, did not even exist at that time. Even UCS-2 was still young: it was designed in 1989. UTF-8 was outlined in late 1992 and became a standard in 1993.
Some references:
- http://www.itprotoday.com/management-mobility/windows-nt-and-vms-rest-story
- https://en.wikipedia.org/wiki/Windows_NT
- https://en.wikipedia.org/wiki/UTF-8#History
- https://en.wikipedia.org/wiki/Universal_Coded_Character_Set#History
- https://en.wikipedia.org/wiki/UTF-1
David Heffernan Jeroen Wiert Pluimers I know the history very well, but my point was that a UTF-16LE file in 2017 doesn't make any sense, just like a CP1252 file, an EBCDIC file, or one using https://en.wikipedia.org/wiki/ZX81_character_set - even if UTF-16 is used internally by Windows, Java or DotNet.
A. Bouchez But you said "MS blindness". Nobody forces application developers to use UTF-16.