Hi!

Hi!
I have a database ' people ' that may have duplicate records.
I found this function (StringSimilarityRatio) to find out how similar two strings are.
Is there a way I can use it inside an ADOQuery?
http://www.delphiarea.com/articles/how-to-match-two-strings-approximately/

Comments

  1. I would probably use a query to extract the strings (and keys), then run some Delphi code in memory to build a table of string similarity ratios.

    ReplyDelete
  2. If you want to compare names and addresses I think you would want something more like Soundex or better still a Metaphone

    ReplyDelete
  3. If you are using ADOQuery, locate SQL field and enter the following SQL.  

    SELECT DUPLICATE_FIELD, 
     COUNT(DUPLICATE_FIELD) AS NUMBER_OF_TIMES
    FROM people
    GROUP BY DUPLICATE_FIELD
    HAVING ( COUNT(DUPLICATE_FIELD) > 1 )


    Replace the DUPLICATE_FIELD with the actual field where you believe the duplicates reside and the return data set will contain actual data along with the number of times it occurs, either 1 or 2 or etc.

    Syntax is dependent on the database you are using this works under MS SQL Server and MS ACCESS, it should be good with MYSQL, not sure on ORACLE.

    ReplyDelete
  4. those duplicates are a bit difficult like:
    Lars Fosdal, Lers Fodsal, Lares Fosdall and they all have the same birthday the same parents, or the father is like : Russell, Rusell, Russel ....
    that's a bit difficult.
    so I tried to use that function 
    http://www.delphiarea.com/articles/how-to-match-two-strings-approximately/
    but it's hard to use it with SQL

    ReplyDelete
  5. As there are several distinct parts to the name you will probably need to add a field and write some code to store the dedupe string there. this could be a combination such as metaphone of last name,first name and city. You will probably need several iterations to check he best combinations in your database - usually a visual check go start with. Once you have created your metaphone keys you can use the D-L distance calculation on those.

    ReplyDelete
  6. One alternative is to use a UDF in your database. What database are you using?

    ReplyDelete
  7. One reason to create a sp is so that the sorting (ranking) can be done on the sql server. The proc would need a name to compare the results with. If you want to find similar entries, the sp approach may be bad because of the amount of iterations needed. In that case a service as a db-client might be the best approach. Something similar to real fts engines like lucene, sphinx or the components in IBO.

    Soundex IMHO is extremely simplistic. And it's based on Anglo-style orthography to phonetic "mapping". There's a German version out there too.

    Calculating the distance is not a bad idea, methinks.

    ReplyDelete
  8. Its just a question of time and money. Your customer need this function? Take soundex on p.e. three fields. It is not the best solution but for shure the fastest to realize, especially if you already use stored procedures.
    And its bloody quick. If you need it more accurate, then you have to go for another solution of course.

    ReplyDelete

Post a Comment