So I've got this file format where the header is pure text and the content is binary, separated by a couple of line breaks. Quite similar to how HTTP works.

- January 01, 2015

So I thought great, I'll just use TStreamReader to read the header lines, and then process the binary part afterwards.

Then I realised that TStreamReader isn't made to cooperate with anything else, as it doesn't keep the stream position in sync with how much it has actually consumed, only with how much it has buffered. So after you've read one line of 50 bytes, the stream position will be over at 4096, because that's TStreamReader's default buffer size.
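To make the problem concrete, here is a minimal sketch in Python rather than Delphi. `NaiveLineReader` is a hypothetical stand-in that only mimics the chunked buffering described above; it is not TStreamReader's actual implementation, and the 4096-byte chunk size is simply the figure quoted in the post.

```python
import io


class NaiveLineReader:
    """Toy buffered line reader (hypothetical, not Delphi's TStreamReader)."""

    def __init__(self, stream, buffer_size=4096):
        self.stream = stream
        self.buffer_size = buffer_size
        self.buffer = b""

    def read_line(self):
        # Refill in whole buffer-sized chunks until a line feed shows up.
        while b"\n" not in self.buffer:
            chunk = self.stream.read(self.buffer_size)
            if not chunk:
                break
            self.buffer += chunk
        line, _, self.buffer = self.buffer.partition(b"\n")
        return line.decode("ascii")


header = b"Content-Length: 10000\n"
stream = io.BytesIO(header + b"\n" + bytes(10000))
reader = NaiveLineReader(stream)

line = reader.read_line()
# The line only accounts for 22 bytes, but the stream has moved a full chunk ahead.
print(len(header), stream.tell())  # prints "22 4096"
```

Anything that now reads from `stream` directly starts at offset 4096 and silently skips the bytes still sitting in the reader's buffer.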

I guess that is so it can work with forwards-only streams (like decompression streams), but I'm working on file and memory streams.

So, any easy, existing alternative here? I mean it's not hard to roll my own alternative, just wondering if I'm missing some existing stuff.

Comments

Nicholas Ring - January 1, 2015 at 2:38 PM
If possible, change the buffer size (1 byte?) - but that defeats the purpose..

Just throwing ideas out there :)

Asbjørn Heid - January 1, 2015 at 2:41 PM
@Nicholas Ring Heh yeah, that'd be pretty pointless :)

Not hard to write my own, but I'd prefer spending my time writing more fun stuff :)

Nicholas Ring - January 1, 2015 at 2:46 PM
It could be fun writing a buffer that can be reset... It would be weird and have limited use, but it could come in handy (like in your case).

Sam Shaw (蕭沖) - January 1, 2015 at 2:54 PM
You can try TBinaryReader, I think it may meet your need, but I'm not sure.

If you still have to use TStreamReader, you should take a look at the example here:

http://msdn.microsoft.com/en-us/library/system.io.streamreader.discardbuffereddata

Asbjørn Heid - January 1, 2015 at 3:06 PM
@Sam Shaw DiscardBufferedData does not help me, because I don't know how many bytes it has "scanned" before finding the line break, and at least in Delphi's implementation it doesn't actually remember this information. Without it, I don't know the correct stream position after the ReadLine call.

Asbjørn Heid - January 1, 2015 at 3:09 PM
Also note I cannot just count the bytes of the string it returns, as the underlying data may (theoretically) contain byte sequences that are invalid in the encoding. In that case the returned string will end up with the "unknown char" character, which does not take up the same number of bytes when converted back to the source encoding.
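A small illustration of that corner case, sketched in Python rather than Delphi. The header line and the choice of UTF-8 are made-up assumptions for the example; the point is only that a non-throwing decode followed by a re-encode does not give back the original byte count.

```python
# A header line containing a byte (0xEF) that is not valid UTF-8 in this position.
raw = b"Title: na\xefve"

# Non-throwing decode: the bad byte becomes U+FFFD, the "unknown char" character.
text = raw.decode("utf-8", errors="replace")

# Converting back yields three bytes for U+FFFD instead of the single original byte.
round_trip = text.encode("utf-8")

print(len(raw), len(round_trip))  # prints "12 14"
```

A stream position computed from the re-encoded length would therefore land two bytes too far into the file.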

With my file format this would definitely be a corner case, but I prefer doing it properly.

Sam Shaw (蕭沖) - January 1, 2015 at 3:21 PM
Um... you can calculate the length of the string returned by ReadLine, set the stream position according to that value, and then discard the internal buffer. It's trivial. Oops! You told me after I posted.

I hope TBinaryReader keeps the position in sync, but I'm only guessing that it does because its constructor has no buffer size parameter.

Nicholas Ring - January 1, 2015 at 3:26 PM
@Asbjørn Heid So what you need is something that can read both text and binary, using the same buffer.

Sam Shaw (蕭沖) - January 1, 2015 at 3:32 PM
I found there's no ReadLine in TBinaryReader :(

Asbjørn Heid - January 1, 2015 at 4:04 PM
@Sam Shaw How do you tell if the line was delimited by a single LF or a CRLF pair? Without knowing that, you don't know the correct stream position. Unfortunately the file format is not explicit about which exact line break to use, so there are files with all three common ones (Unix/Mac/DOS).

Also, if your encoding is non-throwing (i.e. it doesn't raise an error but replaces invalid chars with the "unknown char" character), you have the additional issue with invalid chars as mentioned.

Asbjørn Heid - January 1, 2015 at 4:09 PM
@Nicholas Ring Essentially.

I'm leaning towards making a TBufferedStream which itself is a TStream descendant, where you can access the buffered data, request more bytes in the buffer, and "consume" parts of the buffered data. The latter would move the internal position pointer and remove data from the buffer.

Then subsequent stream read operations would be served from the remaining buffered data before going to the underlying stream.
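A rough sketch of that design, in Python rather than Delphi. The class and method names (`BufferedStream`, `fill`, `consume`, `buffered`) are hypothetical and chosen only for illustration; a Delphi version would descend from TStream as described above, and this is not the code that was eventually published.

```python
import io


class BufferedStream:
    """Hypothetical sketch: a stream wrapper whose buffer can be inspected and consumed."""

    def __init__(self, stream, chunk_size=4096):
        self._stream = stream
        self._chunk_size = chunk_size
        self._buffer = bytearray()

    @property
    def buffered(self):
        """Bytes fetched from the underlying stream but not yet consumed."""
        return bytes(self._buffer)

    def fill(self):
        """Request one more chunk in the buffer; returns False at end of stream."""
        chunk = self._stream.read(self._chunk_size)
        self._buffer += chunk
        return bool(chunk)

    def consume(self, count):
        """Drop count bytes from the front of the buffer, advancing the logical position."""
        del self._buffer[:count]

    def read(self, count):
        """Serve leftover buffered data first, then go to the underlying stream."""
        data = bytes(self._buffer[:count])
        del self._buffer[:count]
        if len(data) < count:
            data += self._stream.read(count - len(data))
        return data

    def tell(self):
        """Logical position: underlying position minus what is still buffered."""
        return self._stream.tell() - len(self._buffer)


# Usage: parse a text header from the buffer, then read the binary part normally.
source = io.BytesIO(b"Width: 2\n\n" + b"\x01\x02\x03\x04")
stream = BufferedStream(source)
stream.fill()
end = stream.buffered.index(b"\n\n")
header = stream.buffered[:end].decode("ascii")
stream.consume(end + 2)          # only the header and its delimiter are used up
position = stream.tell()         # 10, although the underlying stream is at 14
payload = stream.read(4)         # served from the remaining buffered bytes
```

The key difference from the naive reader is that the parser decides how many bytes it actually used, so the logical position stays correct for whatever reads the binary part next.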

I'll then have to make a stream reader which takes one of these beasts to do the parsing. Once the header is done I can then pass the TBufferedStream to the routine handling the binary data as any old TStream.

Sam Shaw (蕭沖) - January 1, 2015 at 4:47 PM
@Asbjørn Heid After reconsidering the string returned by ReadLine, I think it is only reasonable to assume a single encoding for all the TEXT parts. If some characters or lines are encoded differently from the rest, then they are not pure "text". In that case there should be some indication (maybe binary) of which encoding each character or line belongs to.

So, if it really is text, there is only one encoding for it: the one you used to write the strings into the file, and which you can use to decode them again.

Finally, calling TEncoding.GetBytes on the returned string gives the byte count, which can be used to set the stream position.
For the binary part, just switch to the TStream methods from that position.
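A minimal sketch of this re-encoding idea, in Python rather than Delphi. `position_after_line` is a hypothetical helper, and it assumes the line decoded cleanly so that re-encoding reproduces the original bytes. Because the returned string carries no terminator, the sketch also peeks at the bytes following the line to decide whether one or two bytes must be skipped.

```python
import io


def position_after_line(stream, line_start, line, encoding):
    """Return the stream offset just past `line` and its terminator.

    Only valid if `line` round-trips through `encoding` without loss.
    """
    offset = line_start + len(line.encode(encoding))
    stream.seek(offset)
    terminator = stream.read(2)
    if terminator == b"\r\n":
        return offset + 2
    if terminator[:1] in (b"\r", b"\n"):
        return offset + 1
    return offset  # last line of the stream, no terminator present


data = b"Name: test\r\nSize: 4\n\xde\xad\xbe\xef"
stream = io.BytesIO(data)
first = position_after_line(stream, 0, "Name: test", "utf-8")
second = position_after_line(stream, first, "Size: 4", "utf-8")
print(first, second)  # prints "12 20"
```

This works for well-formed text, but it still breaks down when the line contained bytes that were replaced during decoding.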

PS. You mentioned LF versus CRLF; this is critical! However, there is a solution:
Just assume it is an LF, then go to that position and check whether it is an LF. If it is an LF, you advance one step. If it is a CR, you advance two steps. The returned string does NOT include the LF or CRLF.

Nicholas Ring - January 1, 2015 at 4:59 PM
@Asbjørn Heid A TBufferedStream would be a good idea... but hasn't it already been done before (surely someone has)? Just thinking it might save you some time re-inventing the stream, so to speak.

If you do find one, please let me know - I would be interested in one.

Asbjørn Heid - January 1, 2015 at 5:18 PM
@Sam Shaw The format in question only states that two empty lines are the "delimiter" between header and content. It could be two CRs, it could be two LFs, and it could be two CRLF pairs... sad but true.
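A sketch of a scanner that copes with all three conventions by working on bytes and keeping track of how far it has scanned, written in Python rather than Delphi. The function names are hypothetical, and the sketch assumes the delimiter amounts to two consecutive line breaks (one empty line, as in HTTP), which is an interpretation rather than something the format is known to guarantee.

```python
def read_line_end(data, start):
    """Return (line_end, next_start) for the line beginning at `start`."""
    pos = start
    while pos < len(data) and data[pos] not in (0x0D, 0x0A):
        pos += 1
    if data[pos:pos + 2] == b"\r\n":
        return pos, pos + 2      # DOS line break
    if pos < len(data):
        return pos, pos + 1      # lone CR or lone LF
    return pos, pos              # end of data, no terminator


def find_content_start(data):
    """Offset of the first byte after the first empty line, or None if there is none."""
    start = 0
    while start < len(data):
        line_end, next_start = read_line_end(data, start)
        if line_end == start:    # empty line: header is finished
            return next_start
        start = next_start
    return None


for sep in (b"\n", b"\r", b"\r\n"):
    sample = b"Name: x" + sep + b"Size: 4" + sep + sep + b"\x00\x01\x02\x03"
    print(sample[find_content_start(sample):])  # the four content bytes each time
```

One caveat with CR-only files: if the binary content happens to begin with an LF byte, the final CR plus that byte is indistinguishable from a CRLF pair, so the format itself leaves that case ambiguous.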

Yes, I could go back and second-guess, but it would be much faster to just keep count of how many bytes have been scanned.

Asbjørn Heid - January 1, 2015 at 5:19 PM
@Nicholas Ring Yeah that's kinda why I asked here :P

Nicholas Ring - January 1, 2015 at 5:23 PM
So we have gone full circle :-D Well, no more replies required, a reader can just go back to the top and start again.

Nicholas Ring - January 1, 2015 at 5:38 PM
@Asbjørn Heid Checking through my external third party source, I did find TJclBufferedStream (https://github.com/project-jedi/jcl/blob/master/jcl/source/common/JclStreams.pas)

Asbjørn Heid - January 1, 2015 at 9:58 PM
@Nicholas Ring Thanks, but it would require calling Read for each byte though :(

Lübbe Onken - January 1, 2015 at 11:01 PM
Didn't Simon Stuart have some stream stuff in his LKSL? Admittedly I haven't looked, but if memory serves me, he had mentioned something like this a while ago.

Nicholas Ring - January 1, 2015 at 11:11 PM
There are some stream helpers there but no buffering streams :(

Asbjørn Heid - January 1, 2015 at 11:45 PM
Well then, my own it is :)

Vitali Burkov - January 1, 2015 at 11:57 PM
@Asbjørn Heid Try ours, https://dl.dropboxusercontent.com/u/45498379/STStream.pas

Asbjørn Heid - January 2, 2015 at 12:16 AM
@Vitali Burkov Thanks, I'll check it out when I get home.

Sam Shaw (蕭沖) - January 2, 2015 at 5:44 AM
@Vitali Burkov It looks great, although there is an evil AnsiString in the helper methods, which isn't supported on the mobile platforms. Can you, or can I, distribute a version with TBytes overloads?

Thanks.

Vitali Burkov - January 2, 2015 at 10:52 AM
Do whatever you want, Sam Shaw.

Asbjørn Heid - January 3, 2015 at 2:13 AM
Whipped up my own couple of classes, a TBufferedStream and a TBufferedStreamReader which takes (or wraps a TStream in) a TBufferedStream. I'll share my results later tonight.

Asbjørn Heid - January 3, 2015 at 9:46 PM
As promised: https://github.com/lordcrc/BufferedStreamReader

Needs more testing but it's a start.

Delphi Developers Archive
