So I've got this file format where the header is pure text and the content is binary, separated by a couple of line breaks. Quite similar to how HTTP works.

So I thought great, I'll just use TStreamReader to read the header lines, and then process the binary part afterwards.

Then I realised that TStreamReader isn't made to cooperate with anything else, as it doesn't keep the stream position in sync with what it has actually handed out versus what it has merely buffered. That means after you've read one line worth 50 bytes, the stream position will be over at 4096, because that's TStreamReader's default buffer size.
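
The effect is not specific to Delphi. As a rough illustration, here is a small Python sketch (Python rather than Delphi, purely so it can be run as-is) in which `io.BufferedReader` stands in for TStreamReader. The 50-byte header line, the 10,000-byte payload and the 4096-byte buffer are assumptions chosen to mirror the numbers above; none of this is Delphi's actual implementation.

```python
import io

# Hypothetical file: one 50-byte text line followed by a binary payload.
header = b"H" * 49 + b"\n"
payload = bytes(10_000)
raw = io.BytesIO(header + payload)

# Buffered reader with a 4096-byte buffer, standing in for TStreamReader.
reader = io.BufferedReader(raw, buffer_size=4096)
line = reader.readline()

line_len = len(line)         # bytes the caller actually asked for
raw_pos = raw.tell()         # where the underlying stream now stands
logical_pos = reader.tell()  # the reader's own idea of "bytes consumed"

print(line_len, raw_pos, logical_pos)
```

The underlying stream ends up well past the 50 bytes that were requested, because the reader filled its buffer in one go. Python's reader does expose the logical position via `tell()`, which is exactly the bookkeeping that seems to be missing on the TStreamReader side.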

I guess that is so it can work with forward-only streams (like decompression streams), but I'm working on file and memory streams.

So, any easy, existing alternative here? I mean it's not hard to roll my own alternative, just wondering if I'm missing some existing stuff.

Comments

  1. If possible, change the buffer size (to 1 byte?), but that defeats the purpose...

    Just throwing ideas out there :)

  2. Nicholas Ring, heh, yeah, that'd be pretty pointless :)

    Not hard to write my own, but I'd prefer spending my time writing more fun stuff :)

  3. It could be fun writing a buffer that can be reset... It would be weird and have limited use, but it could come in handy (like in your case).

  4. You could try TBinaryReader; I think it may meet your needs, but I'm not sure.

    If you still have to use TStreamReader, you should take a look at the example here:

    http://msdn.microsoft.com/en-us/library/system.io.streamreader.discardbuffereddata

  5. Sam Shaw, DiscardBufferedData does not help me, because I don't know how many bytes it has "scanned" before finding the line break, and at least in Delphi's implementation it doesn't actually remember this information. Without it, I don't know the correct stream position after the ReadLine call.

  6. Also note I cannot just count the bytes of the string it returns, as the underlying data may (theoretically) contain characters that are invalid in the encoding, in which case the returned string will end up with the "unknown char" character, which does not take up the same number of bytes when converted back to the source encoding.

    With my file format this would definitely be a corner case, but I prefer doing it properly.

  7. Um... you can calculate the length of the returned string after ReadLine, set the stream position according to that value, and then discard the internal buffer. It's trivial. Oops! You posted your follow-up just after I posted this.

    I hope TBinaryReader keeps the position in sync, but that's only a guess, based on its constructor not taking a buffer size parameter.

  8. Asbjørn Heid, so what you need is something that can read both text and binary, using the same buffer?

  9. I found that there's no ReadLine in TBinaryReader :(

  10. Sam Shaw, how do you tell whether the line was delimited by a single LF or a CRLF pair? Without that, you don't know the correct stream position. Unfortunately the file format is not explicit about which exact line break to use, so there are files with all three common ones (nix/mac/dos).

    Also, if your encoding is non-throwing (i.e. it doesn't raise an error but replaces invalid chars with the "unknown char" character), you have the additional issue with invalid chars mentioned earlier.

  11. Nicholas Ring, essentially.

    I'm leaning towards making a TBufferedStream which itself is a TStream descendant, where you can access the buffered data, request more bytes in the buffer, and "consume" parts of the buffered data. The latter would move the internal position pointer and remove data from the buffer.

    Then subsequent stream read operations would be served from the remaining buffered data before going to the underlying stream.

    I'll then have to make a streamreader which takes one of these beasts to do the parsing. Once the header is done I can then pass the TBufferedStream to the routine handling the binary data as any old TStream.

  12. Asbjørn Heid, after reconsidering the string returned by ReadLine, I think this is only reasonable when there is a single encoding for all the text parts. If some characters or lines are encoded differently from the others, then it isn't pure "text". In that case there should be some indication (maybe binary) of which encoding each character or line uses.

    So if it really is text, it has exactly one encoding, the one you used when writing it to the file, and you can decode the strings with it.

    Finally, calling TEncoding.GetBytes on the returned string tells us the byte count, which we can use to set the stream position.
    For the binary part, just switch to the TStream methods from that position.

    PS. You mentioned LF vs. CRLF; this is critical! However, there is a solution:
    Just assume it is LF, then go to that position and check whether it is an LF. If it is an LF, advance one step. If it is a CR, advance two steps. The returned string does NOT include the LF or CRLF.

  13. Asbjørn Heid, a TBufferedStream would be a good idea... but has it already been done before? (Surely someone has.) Just thinking it might save you some time re-inventing the stream, so to speak.

    If you do find one, please let me know - I would be interested in one.

  14. Sam Shaw, the format in question only states that two empty lines are the "delimiter" between header and content. It could be two CRs, it could be two LFs and it could be two CRLF pairs... sad but true.

    Yes, I could go back and second-guess, but it would be much faster to just keep count of how many bytes have been scanned.

  15. Nicholas Ring, yeah, that's kinda why I asked here :P

  16. So we have gone full circle :-D Well, no more replies required; a reader can just go back to the top and start again.

  17. Asbjørn Heid, checking through my external third-party source, I did find TJclBufferedStream (https://github.com/project-jedi/jcl/blob/master/jcl/source/common/JclStreams.pas)

  18. Nicholas Ring, thanks, but it would require calling Read for each byte :(

  19. Didn't Simon Stuart have some stream stuff in his LKSL? Admittedly I haven't looked, but if memory serves me, he had mentioned something like this a while ago.

  20. There are some stream helpers there but no buffering streams :(

  21. Vitali Burkov, thanks, I'll check it out when I get home.

  22. Vitali Burkov, it looks great, but there is an evil AnsiString in the helper methods, which isn't supported on the mobile platforms. Could you, or may I, distribute a TBytes overload version?

    Thanks.

  23. Whipped up my own couple of classes, a TBufferedStream and a TBufferedStreamReader which takes (or wraps a TStream in) a TBufferedStream. I'll share my results later tonight.

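
For readers who want something concrete, the TBufferedStream idea from the comments can be sketched roughly as follows. This is Python rather than Delphi (so it can be run as-is) and it is not the author's code: the names `BufferedStream`, `fill`, `byte_at`, `consume` and `read_line` are all hypothetical, and the sample header layout (lines ended by a blank line) is an assumption. The point it tries to show is that the reader scans raw bytes in a shared buffer and only "consumes" what it has used, so the logical position stays byte-exact without re-encoding the returned string, and LF, CR and CRLF are all accepted.

```python
import io

CR, LF = 0x0D, 0x0A

class BufferedStream:
    """Hypothetical stand-in for the TBufferedStream idea: buffers reads
    from an underlying stream, but only advances its logical position
    when the caller explicitly consumes bytes."""

    def __init__(self, stream, chunk_size=4096):
        self._stream = stream
        self._chunk_size = chunk_size
        self._buf = bytearray()
        self.position = 0  # bytes consumed so far

    def fill(self, min_bytes):
        """Try to buffer at least min_bytes; return how many are buffered."""
        while len(self._buf) < min_bytes:
            chunk = self._stream.read(self._chunk_size)
            if not chunk:
                break  # underlying stream is exhausted
            self._buf += chunk
        return len(self._buf)

    def byte_at(self, index):
        """Peek at a buffered byte, or None past the end of the stream."""
        if self.fill(index + 1) <= index:
            return None
        return self._buf[index]

    def consume(self, count):
        """Remove up to count bytes from the front of the buffer."""
        data = bytes(self._buf[:count])
        del self._buf[:count]
        self.position += len(data)
        return data

    def read(self, count):
        """Plain stream read: served from the buffer first."""
        self.fill(count)
        return self.consume(count)

def read_line(bs, encoding="utf-8"):
    """Read one line ended by LF, CR or CRLF; None at end of stream."""
    n = 0
    while bs.byte_at(n) not in (CR, LF, None):
        n += 1
    if n == 0 and bs.byte_at(0) is None:
        return None
    # Invalid bytes become U+FFFD, but the position is still byte-exact.
    text = bs.consume(n).decode(encoding, errors="replace")
    if bs.byte_at(0) == CR:
        bs.consume(2 if bs.byte_at(1) == LF else 1)  # CRLF is one break
    elif bs.byte_at(0) == LF:
        bs.consume(1)
    return text

# Usage: parse the text header, then hand the same object to binary code.
data = b"Name: demo\r\nSize: 4\r\n\r\n\x00\x01\x02\x03"
bs = BufferedStream(io.BytesIO(data))
headers = []
while True:
    line = read_line(bs)
    if not line:  # blank line (or end of stream) ends the header
        break
    headers.append(line)
header_end = bs.position
body = bs.read(4)
```

One caveat with this sketch: with CR-only line breaks, a binary body that happens to start with byte 0x0A would be mistaken for the second half of a CRLF pair. The format as described appears to leave that ambiguity open, so a real implementation would need some rule for it.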
