I need to build a small app to repair malformed RTF content, as automagically as possible. RTF-formatted comments are being imported from an old DB, and some of them have been damaged in some way. The goal is to convert them either to plain text or to minimally formatted RTF. Any suggestions? Has anyone found any open source RTF parser/fixer tools?

Comments

  1. One thing I have not figured out is why the TRichEdit appears to truncate some of the RTF I give it. I'm guessing that the parser is missing a closing element, so drops that level of content. But it's only a guess, and I am NOT very interested just now in trying to seriously parse RTF.

    The repair logic is not complicated. If I find RTF delimiters in the TMemo, or in the Lines.Text of the RichEdit, then something is wrong. That triggers the need for repairs. Alternatively, finding a difference between the Memo and RichEdit in Lines.Text (plain text in the latter) is a trigger.

    Either of those cases causes me to replace Lines.Text in the RichEdit with Lines.Text from the Memo. If RTF delimiters are visible in the RichEdit, then the RTF is broken, and often, it seems, truncated.

    I have been exploring for simple-minded tools, to deal with a finite data set (ca. 48,000 records) on a one-time basis.

    I may post more on a blog, where it will fit better than here, and is easier for me to format well.
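A minimal sketch of the two triggers described in this comment, written in Python rather than Delphi so it can run outside the app. The function names and the two string arguments (standing in for the Memo's `Lines.Text` and the RichEdit's `Lines.Text`) are hypothetical, and the line-ending normalization before comparing is an assumption, not something stated above.

```python
# Characters that should never be visible in correctly rendered plain text.
RTF_DELIMITERS = ("{", "}", "\\")


def has_rtf_delimiters(text: str) -> bool:
    """Return True if any RTF delimiter is visible in the given text."""
    return any(ch in text for ch in RTF_DELIMITERS)


def _normalize(text: str) -> str:
    # Assumption: differences in line endings or trailing whitespace
    # should not count as a mismatch between the two controls.
    return text.replace("\r\n", "\n").replace("\r", "\n").rstrip()


def needs_repair(memo_text: str, richedit_text: str) -> bool:
    """Trigger 1: delimiters visible in either text.
    Trigger 2: the two texts differ after normalization."""
    if has_rtf_delimiters(memo_text) or has_rtf_delimiters(richedit_text):
        return True
    return _normalize(memo_text) != _normalize(richedit_text)
```

A check like this only flags a record for attention: a perfectly healthy comment that happens to contain a brace or a backslash (a file path, for instance) would also trip it.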

  2. I have the same problem. My fix has been to manually tweak each field. Luckily, most of the records have null values in the RTF fields so it's not that much work. An algorithm to fix them would be nice.

  3. Kevin McCoy, an update. I completely disabled my routine which attempted fixes in the RTF-as-plain-text, because it had some issues, and also because I found in the DB some records which I can only describe as pathological.

    So now my test is quite simple: I load the RTF into a TRichEdit, then scan the Lines.Text for the presence of any of '{', '}', or '\'. If any are found, I flag the record for further attention. I did implement a relatively simple routine to clean it up: I assigned the RTF as a string to a StringList.CommaText, then deleted each item which looked like RTF.

    In a DB with 48,799 records needing to be checked, I found 83 with issues, and those 83 show a wide range of different problems. I therefore suggested to my client that I simply repair them by hand. He declined: as the percentage is so small, and some of those comments may be so old they will never again be examined, he decided we will not fix them.

    One thing I observed, though, which I assume comes from people composing in Word and then pasting into our app: many of the records had long runs of red, green, blue triplets in the RTF. I would be very happy to know whether some tool exists for stripping that content from an otherwise well-formed RTF packet.
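On the last question, a rough sketch in Python, assuming the triplets are the RTF color table (the `{\colortbl ...}` group, whose entries look like `\red255\green0\blue0;`). The name `strip_color_table` is hypothetical, and this is a regex pass, not a real RTF parser: it assumes the color table contains no nested groups and that the packet is otherwise well-formed.

```python
import re

# Matches one {\colortbl ...} group with no nested braces inside it.
# The negated class also matches newlines, so a table that Word has
# wrapped across several lines is still caught.
COLOR_TABLE = re.compile(r"\{\\colortbl\b[^{}]*\}")


def strip_color_table(rtf: str) -> str:
    """Remove the color table group and leave everything else untouched."""
    return COLOR_TABLE.sub("", rtf)


sample = (
    r"{\rtf1\ansi{\fonttbl{\f0 Arial;}}"
    r"{\colortbl;\red255\green0\blue0;\red0\green0\blue255;}"
    r"\f0 Hello}"
)
print(strip_color_table(sample))
```

This leaves any `\cfN` or `\highlightN` references in the body pointing at entries that no longer exist. Whether a given reader, TRichEdit included, tolerates that is something to test on a copy of the data before running it over all the records.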
