Did you know that the Jedi Code Library has got a Unicode aware search engine using Boyer-Moore? It's called TUTBMSearch and located in JclUnicode.pas

Comments

  1. Given that modern CPU architectures have really high memory bandwidth and really dislikes unpredicable branching, in which cases does BM and similar "clever" search algorithms still pay off? Anyone got any experiences to share?

    edit: nice reminder BTW, I really should check out the entire JCL, so I know what I don't have to reinvent ;)

    ReplyDelete
  2. Hm, good question. There are two options to find out: Theoretical or practical, the later meaning write a test and let it run on many modern computers.

    ReplyDelete
  3. Btw. the JCL engine does not work very well for word search: Single letters are found, even if they are inside a word, and punctuation characters are treated as part of words rather than separators.

    ReplyDelete
  4. Reason I ask is that when I wrote my own BM implementation some 10 years ago, it struggled to outpace Pos() in several cases, and CPUs haven't gotten more fond of branching since then.

    ReplyDelete
  5. So it's been ages since I looked into this, decided to freshen up a little. If you use the modified version of BM (using Galil's rule) it's worst case is O(n) and not O(n*m). That's certainly good compared to the regular Pos(). I haven't had time to check out the implementation in JCL yet though to see if it uses this version or not.

    ReplyDelete

Post a Comment