So I got this file format where the header is pure text and the contents are binary, separated by a couple of line breaks. Quite similar to how HTTP works.
So I thought great, I'll just use TStreamReader to read the header lines, and then process the binary part afterwards.
Then I realised that TStreamReader isn't made to cooperate with anything else, as it doesn't sync the amount actually read from the stream versus how much it has buffered. Meaning after you've read one line worth 50 bytes, the stream position will be over at 4096 because that's TStreamReader's default buffer size.
I guess that is so it can work with forwards-only streams (like decompression streams), but I'm working on file and memory streams.
So, any easy, existing alternative here? I mean it's not hard to roll my own alternative, just wondering if I'm missing some existing stuff.
If possible, change the buffer size (1 byte?) - but that defeats the purpose..
Just throwing ideas out there :)
Nicholas Ring Heh yeah that'd be pretty pointless :)
Not hard to write my own, but I'd prefer spending my time writing more fun stuff :)
It could be fun writing a buffer that can be reset... It would be weird and have limited use, but it could come in handy (as in your case).
You can try TBinaryReader, I think it may meet your need, but I'm not sure.
If you still have to use TStreamReader, you should take a look at the example here:
http://msdn.microsoft.com/en-us/library/system.io.streamreader.discardbuffereddata
Sam Shaw DiscardBufferedData does not help me, because I don't know how many bytes it has "scanned" before finding the line break, and at least in Delphi's implementation it doesn't actually remember this information. Without it, I don't know the correct stream position after the ReadLine call.
Also note I cannot just count the bytes of the string it returns, as the underlying data may (theoretically) contain invalid characters per the encoding, in which case my returned string will end up with the "unknown char" character, which does not consume the same storage when converted back to the source encoding.
With my file format this would definitely be a corner case, but I prefer doing it properly.
Um... you can calculate the length of the returned string after ReadLine, set the stream position according to that value, and then discard the internal buffer. It's trivial. Oops! You told me after I posted.
I hope TBinaryReader keeps the stream position in sync, but that's just a guess based on its constructor not taking a buffer-size parameter.
Asbjørn Heid So what you need is something that can read both text and binary, using the same buffer.
I found there's no ReadLine in TBinaryReader :(
Sam Shaw How do you tell if the line was delimited by a single LF or a CRLF pair? Without it, you don't know the correct stream position. Unfortunately the file format is not explicit about which exact line break to use, so there are files with all three common ones (nix/mac/dos).
Also, if your encoding is non-throwing (i.e. it doesn't raise an error but replaces invalid chars with the "unknown char" character), you have the additional issue with the invalid char as mentioned.
Nicholas Ring Essentially.
I'm leaning towards making a TBufferedStream which itself is a TStream descendant, where you can access the buffered data, request more bytes in the buffer, and "consume" parts of the buffered data. The latter would move the internal position pointer and remove data from the buffer.
Then subsequent stream read operations would be served from the remaining buffered data before going to the underlying stream.
I'll then have to make a streamreader which takes one of these beasts to do the parsing. Once the header is done I can then pass the TBufferedStream to the routine handling the binary data as any old TStream.
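To make the idea concrete, here is a minimal sketch of such a stream. It is only an illustration under assumptions: the class name TPeekStream and its Fill/Consume methods are made up for this example, it is read-only, and it assumes an XE2-or-later RTL (System.* unit names). It is not the code that was eventually published.

```pascal
program PeekStreamSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

type
  // Hypothetical class, for illustration only.
  TPeekStream = class(TStream)
  private
    FSource: TStream;
    FBuf: TBytes; // bytes already read from FSource but not yet consumed
  public
    constructor Create(Source: TStream);
    function Fill(Count: Integer): Integer;   // try to buffer at least Count bytes
    procedure Consume(Count: Integer);        // drop Count bytes from the buffer
    function Read(var Buffer; Count: Longint): Longint; override;
    function Write(const Buffer; Count: Longint): Longint; override;
    function Seek(const Offset: Int64; Origin: TSeekOrigin): Int64; override;
    property Buffered: TBytes read FBuf;
  end;

constructor TPeekStream.Create(Source: TStream);
begin
  inherited Create;
  FSource := Source;
end;

function TPeekStream.Fill(Count: Integer): Integer;
var
  Have, Got: Integer;
begin
  Have := Length(FBuf);
  if Count > Have then
  begin
    SetLength(FBuf, Count);
    Got := FSource.Read(FBuf[Have], Count - Have);
    SetLength(FBuf, Have + Got); // the source may have had fewer bytes left
  end;
  Result := Length(FBuf);
end;

procedure TPeekStream.Consume(Count: Integer);
begin
  if Count > Length(FBuf) then
    Count := Length(FBuf);
  FBuf := Copy(FBuf, Count, Length(FBuf) - Count);
end;

function TPeekStream.Read(var Buffer; Count: Longint): Longint;
begin
  // Serve from the leftover buffer first, then from the underlying stream.
  Result := Length(FBuf);
  if Result > Count then
    Result := Count;
  if Result > 0 then
  begin
    Move(FBuf[0], Buffer, Result);
    Consume(Result);
  end;
  if Count > Result then
    Inc(Result, FSource.Read((PByte(@Buffer) + Result)^, Count - Result));
end;

function TPeekStream.Write(const Buffer; Count: Longint): Longint;
begin
  raise EStreamError.Create('TPeekStream is read-only in this sketch');
end;

function TPeekStream.Seek(const Offset: Int64; Origin: TSeekOrigin): Int64;
begin
  // The logical position lags the source position by the buffered byte count.
  if (Origin = soCurrent) and (Offset = 0) then
    Exit(FSource.Position - Length(FBuf));
  if Origin = soCurrent then
    Result := FSource.Seek(Offset - Length(FBuf), soCurrent)
  else
    Result := FSource.Seek(Offset, Origin);
  FBuf := nil; // any real seek invalidates the buffer
end;

var
  Source: TBytesStream;
  S: TPeekStream;
  Body: TBytes;
begin
  Source := TBytesStream.Create(TEncoding.ASCII.GetBytes('Key: v'#10#10'BINARY'));
  S := TPeekStream.Create(Source);
  try
    S.Fill(16);         // a header parser would now scan S.Buffered
    S.Consume(8);       // 'Key: v' plus the two LFs that end the header
    SetLength(Body, 6);
    S.Read(Body[0], 6); // served from what is left in the buffer
    Writeln(TEncoding.ASCII.GetString(Body));
  finally
    S.Free;
    Source.Free;
  end;
end.
```

The point of the design is that the header parser only ever consumes the bytes it has actually scanned, so whatever code handles the binary part sees an ordinary TStream positioned right after the header.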
Asbjørn Heid After reconsidering the string returned by ReadLine, I think this is only reasonable when all the TEXT parts use a single encoding. If some characters or lines are encoded differently from the rest, then it isn't pure "text". In that case there should be some indication (maybe binary) of which encoding each character or line uses.
So, if it really is text, it has exactly one encoding: the one you used to write the strings to the file, and with which you can decode them again.
Finally, calling TEncoding.GetBytes on the returned string gives the byte count, which you can use to set the stream position.
Regarding the binary part, just switch to using TStream methods from the position where the binary data starts.
PS. You mentioned LF versus CRLF, and this is critical! However, there is a solution:
Just assume the line break starts right after the line's bytes, go to that position, and check the byte there. If it is an LF, advance one step. If it is a CR, advance two steps. The returned string does NOT include the LF or CRLF.
Asbjørn Heid A TBufferedStream would be a good idea... but has it already been done before (surely someone has)? Just thinking it might save you some time re-inventing the stream, so to speak.
If you do find one, please let me know - I would be interested in one.
Sam Shaw The format in question only states that two empty lines is the "delimiter" between header and content. It could be two CRs, it could be two LFs, and it could be two CRLF pairs... sad but true.
Yes I could go back and second guess, but it would be much faster to just keep count of how many bytes have been scanned.
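A rough sketch of what keeping count could look like on a seekable stream (file or memory). The ReadRawLine helper is hypothetical, not an RTL routine; it assumes an XE2-or-later RTL and that the header ends at the first empty line. It works on raw bytes, so neither the line-break style nor invalid characters in the encoding can throw the position off.

```pascal
program RawLineSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

// Hypothetical helper (not part of the RTL). Reads one line of raw bytes from a
// seekable stream and leaves Position just past its line break (CR, LF or CRLF).
// Returns False only when the stream was already at its end.
function ReadRawLine(Stream: TStream; out Line: TBytes): Boolean;
const
  ChunkSize = 4096;
var
  Start: Int64;
  Chunk: array[0..ChunkSize - 1] of Byte;
  Got, I, Len: Integer;
  Next: Byte;
begin
  Start := Stream.Position;
  Line := nil;
  Len := 0;
  repeat
    Got := Stream.Read(Chunk[0], ChunkSize);
    for I := 0 to Got - 1 do
      if (Chunk[I] = 13) or (Chunk[I] = 10) then
      begin
        // Keep the bytes before the line break.
        SetLength(Line, Len + I);
        if I > 0 then
          Move(Chunk[0], Line[Len], I);
        // Rewind to just after the first line-break byte.
        Stream.Position := Start + Len + I + 1;
        // A CR may be followed by an LF belonging to the same line break.
        if (Chunk[I] = 13) and (Stream.Read(Next, 1) = 1) and (Next <> 10) then
          Stream.Position := Stream.Position - 1;
        Exit(True);
      end;
    // No line break in this chunk: keep all of it and read on.
    SetLength(Line, Len + Got);
    if Got > 0 then
      Move(Chunk[0], Line[Len], Got);
    Inc(Len, Got);
  until Got = 0;
  Result := Len > 0;
end;

var
  Data: TBytesStream;
  Line: TBytes;
begin
  Data := TBytesStream.Create(
    TEncoding.ASCII.GetBytes('Name: x'#13#10'Size: 3'#13#10#13#10'abc'));
  try
    // Read header lines until the first empty one.
    while ReadRawLine(Data, Line) and (Length(Line) > 0) do
      Writeln(TEncoding.ASCII.GetString(Line));
    // The stream now sits at the first byte of the binary part.
    Writeln('Binary part starts at offset ', Data.Position);
  finally
    Data.Free;
  end;
end.
```

Each header line can then be decoded with whatever TEncoding the format calls for, since decoding happens after the position has already been settled.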
Nicholas Ring Yeah that's kinda why I asked here :P
So we have gone full circle :-D Well, no more replies required, a reader can just go back to the top and start again.
Asbjørn Heid Checking through my external third party source, I did find TJclBufferedStream (https://github.com/project-jedi/jcl/blob/master/jcl/source/common/JclStreams.pas)
Nicholas Ring Thanks, would require calling Read for each byte though :(
Didn't Simon Stuart have some stream stuff in his LKSL? Admittedly I haven't looked, but if memory serves me, he had mentioned something like this a while ago.
There are some stream helpers there but no buffering streams :(
Well then, my own it is :)
Asbjørn Heid Try ours, https://dl.dropboxusercontent.com/u/45498379/STStream.pas
Vitali Burkov Thanks, I'll check it out when I get home.
Vitali Burkov It looks great, but there is an evil AnsiString in the helper methods, which isn't supported on mobile platforms. Can you, or may I, distribute a version with TBytes overloads?
Thanks.
Do whatever you want, Sam Shaw.
Whipped up my own couple of classes, a TBufferedStream and a TBufferedStreamReader which takes (or wraps a TStream in) a TBufferedStream. I'll share my results later tonight.
As promised: https://github.com/lordcrc/BufferedStreamReader
Needs more testing but it's a start.