Hi!
Hi!
I have a database ' people ' that may have duplicate records.
I found this function (StringSimilarityRatio) to find out how similar two strings are.
Is there a way I can use it inside an ADOQuery?
http://www.delphiarea.com/articles/how-to-match-two-strings-approximately/
I have a database ' people ' that may have duplicate records.
I found this function (StringSimilarityRatio) to find out how similar two strings are.
Is there a way I can use it inside an ADOQuery?
http://www.delphiarea.com/articles/how-to-match-two-strings-approximately/
I would probably use a query to extract the strings (and keys), then run some Delphi code in memory to build a table of string similarity ratios.
ReplyDeleteIf you want to compare names and addresses I think you would want something more like Soundex or better still a Metaphone
ReplyDeletethis looks interesting - http://www.namethesaurus.com/
ReplyDeleteIf you are using ADOQuery, locate SQL field and enter the following SQL.
ReplyDeleteSELECT DUPLICATE_FIELD,
COUNT(DUPLICATE_FIELD) AS NUMBER_OF_TIMES
FROM people
GROUP BY DUPLICATE_FIELD
HAVING ( COUNT(DUPLICATE_FIELD) > 1 )
Replace the DUPLICATE_FIELD with the actual field where you believe the duplicates reside and the return data set will contain actual data along with the number of times it occurs, either 1 or 2 or etc.
Syntax is dependent on the database you are using this works under MS SQL Server and MS ACCESS, it should be good with MYSQL, not sure on ORACLE.
those duplicates are a bit difficult like:
ReplyDeleteLars Fosdal, Lers Fodsal, Lares Fosdall and they all have the same birthday the same parents, or the father is like : Russell, Rusell, Russel ....
that's a bit difficult.
so I tried to use that function
http://www.delphiarea.com/articles/how-to-match-two-strings-approximately/
but it's hard to use it with SQL
i you use a SQL Server, then Soundex is your friend:
ReplyDeletehttps://msdn.microsoft.com/en-us/library/ms187384.aspx
http://anastasiosyal.com/POST/2009/01/11/18.ASPX?
As there are several distinct parts to the name you will probably need to add a field and write some code to store the dedupe string there. this could be a combination such as metaphone of last name,first name and city. You will probably need several iterations to check he best combinations in your database - usually a visual check go start with. Once you have created your metaphone keys you can use the D-L distance calculation on those.
ReplyDeleteOne alternative is to use a UDF in your database. What database are you using?
ReplyDeleteOne reason to create a sp is so that the sorting (ranking) can be done on the sql server. The proc would need a name to compare the results with. If you want to find similar entries, the sp approach may be bad because of the amount of iterations needed. In that case a service as a db-client might be the best approach. Something similar to real fts engines like lucene, sphinx or the components in IBO.
ReplyDeleteSoundex IMHO is extremely simplistic. And it's based on Anglo-style orthography to phonetic "mapping". There's a German version out there too.
Calculating the distance is not a bad idea, methinks.
Its just a question of time and money. Your customer need this function? Take soundex on p.e. three fields. It is not the best solution but for shure the fastest to realize, especially if you already use stored procedures.
ReplyDeleteAnd its bloody quick. If you need it more accurate, then you have to go for another solution of course.