If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below. |
|
|
Thread Tools | Display Modes |
#1
|
|||
|
|||
Duplicate detection (looking for code)
I would like to add code to detect for duplicates or records with similar
information. Looking to do this check on (firstname and lastname) and on address (street1, street2, city, state, zip) But I want something a little more advanced than just checking for exact matches. Wondering if anyone has some code they would care to share that might make my job of writing it a little easier? Example: Bob Smith and Bobby Smith would be detected as duplicates Rob Jones and Robert Jones would be detected as duplicates 123 main street pittsburgh, pa 15126 123 main st pittsburgh pa 15230 might be detected as duplicates Thanks in advance, Mark |
#2
|
|||
|
|||
Duplicate detection (looking for code)
You may not have received any responses yet because what you are proposing
is not particularly simple. .... and there are some potentially serious flaws with your analysis! How do you expect Access to be able to correctly categorize "Bob Smith" as duplicating "Bobby Smith" when your database could legitimately contain two separate individuals with those names? And what happens when you have two unique individuals, both named Lynne Johnson? (there are two in my state, and they were both born on the same date!) I suspect you'll have to create your own code that tells Access exactly when and how to consider two records to be close enough to be a match ... and you might want to consider them only as "potential" matches. After all, can YOU be sure that all of the following are duplicates?: John Smith 12345 Elm St J. J. Smith 12345 Elm Street John J. Smith 12345 Elm St NW Johnny Smith 12354 Elm St. J. Smith 12345 Elm St Regards Jeff Boyce Microsoft Access MVP -- Disclaimer: This author may have received products and services mentioned in this post. Mention and/or description of a product or service herein does not constitute endorsement thereof. Any code or pseudocode included in this post is offered "as is", with no guarantee as to suitability. You can thank the FTC of the USA for making this disclaimer possible/necessary. "Mark Andrews" wrote in message ... I would like to add code to detect for duplicates or records with similar information. Looking to do this check on (firstname and lastname) and on address (street1, street2, city, state, zip) But I want something a little more advanced than just checking for exact matches. Wondering if anyone has some code they would care to share that might make my job of writing it a little easier? Example: Bob Smith and Bobby Smith would be detected as duplicates Rob Jones and Robert Jones would be detected as duplicates 123 main street pittsburgh, pa 15126 123 main st pittsburgh pa 15230 might be detected as duplicates Thanks in advance, Mark |
#3
|
|||
|
|||
Duplicate detection (looking for code)
The idea is:
- the user enters "Bobby Smith" and the program pops up a screen saying here are some similar contacts "just to make sure this is not a duplicate contact being entered". Example: Bob Smith 123 Main Street Pittsburgh PA 15230 If they knew that the "Bobby Smith" they were entering lived on Main Street in Pittsburgh they would choose to quit entering data for this new contact because they know that they are entering a duplicate or alternately they could continue entering the contact - similar type logic for addresses Yes I should of used the wording "potential duplicates" in my post and yes it's not extremely simple thus the newsgroup post. Don't you MVPs like a challenge once in a while? First thoughts: I could do a simple comparison with 'like' or some sort of character by character comparison (if 90% of the characters match consider it a "potential duplicate"). I need some sort of "Similar" function. I don't think there are any potentially serious flaws with my analysis, my analysis at this point is "hey this might be a little work, I wonder if anyone else has attempted this and would let me see their code". Now is your chance to post the code you wrote to do these types of checks, I'm sure others have tackled duplicate issues in various ways (some approaches better than others), Thanks, Mark "Jeff Boyce" wrote in message ... You may not have received any responses yet because what you are proposing is not particularly simple. ... and there are some potentially serious flaws with your analysis! How do you expect Access to be able to correctly categorize "Bob Smith" as duplicating "Bobby Smith" when your database could legitimately contain two separate individuals with those names? And what happens when you have two unique individuals, both named Lynne Johnson? (there are two in my state, and they were both born on the same date!) I suspect you'll have to create your own code that tells Access exactly when and how to consider two records to be close enough to be a match ... and you might want to consider them only as "potential" matches. After all, can YOU be sure that all of the following are duplicates?: John Smith 12345 Elm St J. J. Smith 12345 Elm Street John J. Smith 12345 Elm St NW Johnny Smith 12354 Elm St. J. Smith 12345 Elm St Regards Jeff Boyce Microsoft Access MVP -- Disclaimer: This author may have received products and services mentioned in this post. Mention and/or description of a product or service herein does not constitute endorsement thereof. Any code or pseudocode included in this post is offered "as is", with no guarantee as to suitability. You can thank the FTC of the USA for making this disclaimer possible/necessary. "Mark Andrews" wrote in message ... I would like to add code to detect for duplicates or records with similar information. Looking to do this check on (firstname and lastname) and on address (street1, street2, city, state, zip) But I want something a little more advanced than just checking for exact matches. Wondering if anyone has some code they would care to share that might make my job of writing it a little easier? Example: Bob Smith and Bobby Smith would be detected as duplicates Rob Jones and Robert Jones would be detected as duplicates 123 main street pittsburgh, pa 15126 123 main st pittsburgh pa 15230 might be detected as duplicates Thanks in advance, Mark |
#4
|
|||
|
|||
Duplicate detection (looking for code)
Mark
Just for the record, the folks who read and write here in the newsgroups are not all MVPs ... and some of the best answers I've seen come from folks who aren't. Don't limit your audience... Most of the approaches I've seen that work for this involve USB (using someone's brain). It sounds like that's part of your approach, too. Do a search on "Soundex". This is an algorithm that uses how words (e.g., names) sound to compare them. Words with similar soundex scores sounds similar. This could help with last names and street names, but I don't see it helping with Bobby vs. Robert, or with all the embellishments that addresses have. Again, you'd need to tell Access that Bobby and Robert are (sometimes) synonymous. One approach might be to sort all entries by "lastname, firstname - delivery address" (as a concatenated field) and USB to break the ties. Another approach might be to use the built-in autocomplete feature in Access comboboxes on a form. Have your user start typing a lastname and have Access jump to the "lastname, firstname - delivery address"es that start that way. You don't mention whether you're working with a couple hundreds entries, a couple thousand, or a couple hundred thousand. The approach you take may need to differ, depending on volume. Good luck! Regards Jeff Boyce Microsoft Access MVP "Mark Andrews" wrote in message ... The idea is: - the user enters "Bobby Smith" and the program pops up a screen saying here are some similar contacts "just to make sure this is not a duplicate contact being entered". Example: Bob Smith 123 Main Street Pittsburgh PA 15230 If they knew that the "Bobby Smith" they were entering lived on Main Street in Pittsburgh they would choose to quit entering data for this new contact because they know that they are entering a duplicate or alternately they could continue entering the contact - similar type logic for addresses Yes I should of used the wording "potential duplicates" in my post and yes it's not extremely simple thus the newsgroup post. Don't you MVPs like a challenge once in a while? First thoughts: I could do a simple comparison with 'like' or some sort of character by character comparison (if 90% of the characters match consider it a "potential duplicate"). I need some sort of "Similar" function. I don't think there are any potentially serious flaws with my analysis, my analysis at this point is "hey this might be a little work, I wonder if anyone else has attempted this and would let me see their code". Now is your chance to post the code you wrote to do these types of checks, I'm sure others have tackled duplicate issues in various ways (some approaches better than others), Thanks, Mark "Jeff Boyce" wrote in message ... You may not have received any responses yet because what you are proposing is not particularly simple. ... and there are some potentially serious flaws with your analysis! How do you expect Access to be able to correctly categorize "Bob Smith" as duplicating "Bobby Smith" when your database could legitimately contain two separate individuals with those names? And what happens when you have two unique individuals, both named Lynne Johnson? (there are two in my state, and they were both born on the same date!) I suspect you'll have to create your own code that tells Access exactly when and how to consider two records to be close enough to be a match ... and you might want to consider them only as "potential" matches. After all, can YOU be sure that all of the following are duplicates?: John Smith 12345 Elm St J. J. Smith 12345 Elm Street John J. Smith 12345 Elm St NW Johnny Smith 12354 Elm St. J. Smith 12345 Elm St Regards Jeff Boyce Microsoft Access MVP -- Disclaimer: This author may have received products and services mentioned in this post. Mention and/or description of a product or service herein does not constitute endorsement thereof. Any code or pseudocode included in this post is offered "as is", with no guarantee as to suitability. You can thank the FTC of the USA for making this disclaimer possible/necessary. "Mark Andrews" wrote in message ... I would like to add code to detect for duplicates or records with similar information. Looking to do this check on (firstname and lastname) and on address (street1, street2, city, state, zip) But I want something a little more advanced than just checking for exact matches. Wondering if anyone has some code they would care to share that might make my job of writing it a little easier? Example: Bob Smith and Bobby Smith would be detected as duplicates Rob Jones and Robert Jones would be detected as duplicates 123 main street pittsburgh, pa 15126 123 main st pittsburgh pa 15230 might be detected as duplicates Thanks in advance, Mark |
#5
|
|||
|
|||
Duplicate detection (looking for code)
Jeff,
Ok searching on Soundex I think will help me dig up some code or flush out the best approach a little better. If anyone has done this would love to see your approach! Code I'm writing will be installed at multiple companies, most will have 5000 or less contact records. The method should be designed to work well for 10,000 contacts. Thanks, Mark "Jeff Boyce" wrote in message ... Mark Just for the record, the folks who read and write here in the newsgroups are not all MVPs ... and some of the best answers I've seen come from folks who aren't. Don't limit your audience... Most of the approaches I've seen that work for this involve USB (using someone's brain). It sounds like that's part of your approach, too. Do a search on "Soundex". This is an algorithm that uses how words (e.g., names) sound to compare them. Words with similar soundex scores sounds similar. This could help with last names and street names, but I don't see it helping with Bobby vs. Robert, or with all the embellishments that addresses have. Again, you'd need to tell Access that Bobby and Robert are (sometimes) synonymous. One approach might be to sort all entries by "lastname, firstname - delivery address" (as a concatenated field) and USB to break the ties. Another approach might be to use the built-in autocomplete feature in Access comboboxes on a form. Have your user start typing a lastname and have Access jump to the "lastname, firstname - delivery address"es that start that way. You don't mention whether you're working with a couple hundreds entries, a couple thousand, or a couple hundred thousand. The approach you take may need to differ, depending on volume. Good luck! Regards Jeff Boyce Microsoft Access MVP "Mark Andrews" wrote in message ... The idea is: - the user enters "Bobby Smith" and the program pops up a screen saying here are some similar contacts "just to make sure this is not a duplicate contact being entered". Example: Bob Smith 123 Main Street Pittsburgh PA 15230 If they knew that the "Bobby Smith" they were entering lived on Main Street in Pittsburgh they would choose to quit entering data for this new contact because they know that they are entering a duplicate or alternately they could continue entering the contact - similar type logic for addresses Yes I should of used the wording "potential duplicates" in my post and yes it's not extremely simple thus the newsgroup post. Don't you MVPs like a challenge once in a while? First thoughts: I could do a simple comparison with 'like' or some sort of character by character comparison (if 90% of the characters match consider it a "potential duplicate"). I need some sort of "Similar" function. I don't think there are any potentially serious flaws with my analysis, my analysis at this point is "hey this might be a little work, I wonder if anyone else has attempted this and would let me see their code". Now is your chance to post the code you wrote to do these types of checks, I'm sure others have tackled duplicate issues in various ways (some approaches better than others), Thanks, Mark "Jeff Boyce" wrote in message ... You may not have received any responses yet because what you are proposing is not particularly simple. ... and there are some potentially serious flaws with your analysis! How do you expect Access to be able to correctly categorize "Bob Smith" as duplicating "Bobby Smith" when your database could legitimately contain two separate individuals with those names? And what happens when you have two unique individuals, both named Lynne Johnson? (there are two in my state, and they were both born on the same date!) I suspect you'll have to create your own code that tells Access exactly when and how to consider two records to be close enough to be a match ... and you might want to consider them only as "potential" matches. After all, can YOU be sure that all of the following are duplicates?: John Smith 12345 Elm St J. J. Smith 12345 Elm Street John J. Smith 12345 Elm St NW Johnny Smith 12354 Elm St. J. Smith 12345 Elm St Regards Jeff Boyce Microsoft Access MVP -- Disclaimer: This author may have received products and services mentioned in this post. Mention and/or description of a product or service herein does not constitute endorsement thereof. Any code or pseudocode included in this post is offered "as is", with no guarantee as to suitability. You can thank the FTC of the USA for making this disclaimer possible/necessary. "Mark Andrews" wrote in message ... I would like to add code to detect for duplicates or records with similar information. Looking to do this check on (firstname and lastname) and on address (street1, street2, city, state, zip) But I want something a little more advanced than just checking for exact matches. Wondering if anyone has some code they would care to share that might make my job of writing it a little easier? Example: Bob Smith and Bobby Smith would be detected as duplicates Rob Jones and Robert Jones would be detected as duplicates 123 main street pittsburgh, pa 15126 123 main st pittsburgh pa 15230 might be detected as duplicates Thanks in advance, Mark |
#6
|
|||
|
|||
Duplicate detection (looking for code)
Mark
Check on-line for Allen Browne's website. He has a routine that helps limit the number of records a combobox has to pull down by "waiting" for the first "n" letters to get entered. This would be useful if you followed the combobox and concatenation route. Good luck! Regards Jeff Boyce Microsoft Access MVP -- Disclaimer: This author may have received products and services mentioned in this post. Mention and/or description of a product or service herein does not constitute endorsement thereof. Any code or pseudocode included in this post is offered "as is", with no guarantee as to suitability. You can thank the FTC of the USA for making this disclaimer possible/necessary. "Mark Andrews" wrote in message ... Jeff, Ok searching on Soundex I think will help me dig up some code or flush out the best approach a little better. If anyone has done this would love to see your approach! Code I'm writing will be installed at multiple companies, most will have 5000 or less contact records. The method should be designed to work well for 10,000 contacts. Thanks, Mark "Jeff Boyce" wrote in message ... Mark Just for the record, the folks who read and write here in the newsgroups are not all MVPs ... and some of the best answers I've seen come from folks who aren't. Don't limit your audience... Most of the approaches I've seen that work for this involve USB (using someone's brain). It sounds like that's part of your approach, too. Do a search on "Soundex". This is an algorithm that uses how words (e.g., names) sound to compare them. Words with similar soundex scores sounds similar. This could help with last names and street names, but I don't see it helping with Bobby vs. Robert, or with all the embellishments that addresses have. Again, you'd need to tell Access that Bobby and Robert are (sometimes) synonymous. One approach might be to sort all entries by "lastname, firstname - delivery address" (as a concatenated field) and USB to break the ties. Another approach might be to use the built-in autocomplete feature in Access comboboxes on a form. Have your user start typing a lastname and have Access jump to the "lastname, firstname - delivery address"es that start that way. You don't mention whether you're working with a couple hundreds entries, a couple thousand, or a couple hundred thousand. The approach you take may need to differ, depending on volume. Good luck! Regards Jeff Boyce Microsoft Access MVP "Mark Andrews" wrote in message ... The idea is: - the user enters "Bobby Smith" and the program pops up a screen saying here are some similar contacts "just to make sure this is not a duplicate contact being entered". Example: Bob Smith 123 Main Street Pittsburgh PA 15230 If they knew that the "Bobby Smith" they were entering lived on Main Street in Pittsburgh they would choose to quit entering data for this new contact because they know that they are entering a duplicate or alternately they could continue entering the contact - similar type logic for addresses Yes I should of used the wording "potential duplicates" in my post and yes it's not extremely simple thus the newsgroup post. Don't you MVPs like a challenge once in a while? First thoughts: I could do a simple comparison with 'like' or some sort of character by character comparison (if 90% of the characters match consider it a "potential duplicate"). I need some sort of "Similar" function. I don't think there are any potentially serious flaws with my analysis, my analysis at this point is "hey this might be a little work, I wonder if anyone else has attempted this and would let me see their code". Now is your chance to post the code you wrote to do these types of checks, I'm sure others have tackled duplicate issues in various ways (some approaches better than others), Thanks, Mark "Jeff Boyce" wrote in message ... You may not have received any responses yet because what you are proposing is not particularly simple. ... and there are some potentially serious flaws with your analysis! How do you expect Access to be able to correctly categorize "Bob Smith" as duplicating "Bobby Smith" when your database could legitimately contain two separate individuals with those names? And what happens when you have two unique individuals, both named Lynne Johnson? (there are two in my state, and they were both born on the same date!) I suspect you'll have to create your own code that tells Access exactly when and how to consider two records to be close enough to be a match ... and you might want to consider them only as "potential" matches. After all, can YOU be sure that all of the following are duplicates?: John Smith 12345 Elm St J. J. Smith 12345 Elm Street John J. Smith 12345 Elm St NW Johnny Smith 12354 Elm St. J. Smith 12345 Elm St Regards Jeff Boyce Microsoft Access MVP -- Disclaimer: This author may have received products and services mentioned in this post. Mention and/or description of a product or service herein does not constitute endorsement thereof. Any code or pseudocode included in this post is offered "as is", with no guarantee as to suitability. You can thank the FTC of the USA for making this disclaimer possible/necessary. "Mark Andrews" wrote in message ... I would like to add code to detect for duplicates or records with similar information. Looking to do this check on (firstname and lastname) and on address (street1, street2, city, state, zip) But I want something a little more advanced than just checking for exact matches. Wondering if anyone has some code they would care to share that might make my job of writing it a little easier? Example: Bob Smith and Bobby Smith would be detected as duplicates Rob Jones and Robert Jones would be detected as duplicates 123 main street pittsburgh, pa 15126 123 main st pittsburgh pa 15230 might be detected as duplicates Thanks in advance, Mark |
#7
|
|||
|
|||
Duplicate detection (looking for code)
I searched a little, my best lead right now is the code I found at:
http://www.kdkeys.net/forums/thread/6450.aspx You do need to sign up to the forum to download it. It's an MDE but it looks like he included the source code for most of the fuzzy logic search algorithms. algorithms included - Levenshtein Edit Distance - Dice Coefficient - Longest Common Subsequence - Double Metaphone I also read soundex2 is good (soundex is a little too general). I am by no means an expert but I did do a little searching and if you want to do fuzzy matching of some sort I guess you need to jump into this stuff. Perhaps even store some algorithm results so at runtime you can compare faster to the thousands of records you have in the db. Maybe this will help someone else out? Mark "Jeff Boyce" wrote in message ... Mark Check on-line for Allen Browne's website. He has a routine that helps limit the number of records a combobox has to pull down by "waiting" for the first "n" letters to get entered. This would be useful if you followed the combobox and concatenation route. Good luck! Regards Jeff Boyce Microsoft Access MVP -- Disclaimer: This author may have received products and services mentioned in this post. Mention and/or description of a product or service herein does not constitute endorsement thereof. Any code or pseudocode included in this post is offered "as is", with no guarantee as to suitability. You can thank the FTC of the USA for making this disclaimer possible/necessary. "Mark Andrews" wrote in message ... Jeff, Ok searching on Soundex I think will help me dig up some code or flush out the best approach a little better. If anyone has done this would love to see your approach! Code I'm writing will be installed at multiple companies, most will have 5000 or less contact records. The method should be designed to work well for 10,000 contacts. Thanks, Mark "Jeff Boyce" wrote in message ... Mark Just for the record, the folks who read and write here in the newsgroups are not all MVPs ... and some of the best answers I've seen come from folks who aren't. Don't limit your audience... Most of the approaches I've seen that work for this involve USB (using someone's brain). It sounds like that's part of your approach, too. Do a search on "Soundex". This is an algorithm that uses how words (e.g., names) sound to compare them. Words with similar soundex scores sounds similar. This could help with last names and street names, but I don't see it helping with Bobby vs. Robert, or with all the embellishments that addresses have. Again, you'd need to tell Access that Bobby and Robert are (sometimes) synonymous. One approach might be to sort all entries by "lastname, firstname - delivery address" (as a concatenated field) and USB to break the ties. Another approach might be to use the built-in autocomplete feature in Access comboboxes on a form. Have your user start typing a lastname and have Access jump to the "lastname, firstname - delivery address"es that start that way. You don't mention whether you're working with a couple hundreds entries, a couple thousand, or a couple hundred thousand. The approach you take may need to differ, depending on volume. Good luck! Regards Jeff Boyce Microsoft Access MVP "Mark Andrews" wrote in message ... The idea is: - the user enters "Bobby Smith" and the program pops up a screen saying here are some similar contacts "just to make sure this is not a duplicate contact being entered". Example: Bob Smith 123 Main Street Pittsburgh PA 15230 If they knew that the "Bobby Smith" they were entering lived on Main Street in Pittsburgh they would choose to quit entering data for this new contact because they know that they are entering a duplicate or alternately they could continue entering the contact - similar type logic for addresses Yes I should of used the wording "potential duplicates" in my post and yes it's not extremely simple thus the newsgroup post. Don't you MVPs like a challenge once in a while? First thoughts: I could do a simple comparison with 'like' or some sort of character by character comparison (if 90% of the characters match consider it a "potential duplicate"). I need some sort of "Similar" function. I don't think there are any potentially serious flaws with my analysis, my analysis at this point is "hey this might be a little work, I wonder if anyone else has attempted this and would let me see their code". Now is your chance to post the code you wrote to do these types of checks, I'm sure others have tackled duplicate issues in various ways (some approaches better than others), Thanks, Mark "Jeff Boyce" wrote in message ... You may not have received any responses yet because what you are proposing is not particularly simple. ... and there are some potentially serious flaws with your analysis! How do you expect Access to be able to correctly categorize "Bob Smith" as duplicating "Bobby Smith" when your database could legitimately contain two separate individuals with those names? And what happens when you have two unique individuals, both named Lynne Johnson? (there are two in my state, and they were both born on the same date!) I suspect you'll have to create your own code that tells Access exactly when and how to consider two records to be close enough to be a match ... and you might want to consider them only as "potential" matches. After all, can YOU be sure that all of the following are duplicates?: John Smith 12345 Elm St J. J. Smith 12345 Elm Street John J. Smith 12345 Elm St NW Johnny Smith 12354 Elm St. J. Smith 12345 Elm St Regards Jeff Boyce Microsoft Access MVP -- Disclaimer: This author may have received products and services mentioned in this post. Mention and/or description of a product or service herein does not constitute endorsement thereof. Any code or pseudocode included in this post is offered "as is", with no guarantee as to suitability. You can thank the FTC of the USA for making this disclaimer possible/necessary. "Mark Andrews" wrote in message ... I would like to add code to detect for duplicates or records with similar information. Looking to do this check on (firstname and lastname) and on address (street1, street2, city, state, zip) But I want something a little more advanced than just checking for exact matches. Wondering if anyone has some code they would care to share that might make my job of writing it a little easier? Example: Bob Smith and Bobby Smith would be detected as duplicates Rob Jones and Robert Jones would be detected as duplicates 123 main street pittsburgh, pa 15126 123 main st pittsburgh pa 15230 might be detected as duplicates Thanks in advance, Mark |
#8
|
|||
|
|||
Duplicate detection (looking for code)
Thanks for posting back what you found. That will undoubtedly help someone
in their (future) search. Be aware that "thousands" of records in a combobox leads to poor response time. Allen B's approach speeds that up considerably. Good luck! Regards Jeff Boyce Microsoft Access MVP -- Disclaimer: This author may have received products and services mentioned in this post. Mention and/or description of a product or service herein does not constitute endorsement thereof. Any code or pseudocode included in this post is offered "as is", with no guarantee as to suitability. You can thank the FTC of the USA for making this disclaimer possible/necessary. "Mark Andrews" wrote in message ... I searched a little, my best lead right now is the code I found at: http://www.kdkeys.net/forums/thread/6450.aspx You do need to sign up to the forum to download it. It's an MDE but it looks like he included the source code for most of the fuzzy logic search algorithms. algorithms included - Levenshtein Edit Distance - Dice Coefficient - Longest Common Subsequence - Double Metaphone I also read soundex2 is good (soundex is a little too general). I am by no means an expert but I did do a little searching and if you want to do fuzzy matching of some sort I guess you need to jump into this stuff. Perhaps even store some algorithm results so at runtime you can compare faster to the thousands of records you have in the db. Maybe this will help someone else out? Mark "Jeff Boyce" wrote in message ... Mark Check on-line for Allen Browne's website. He has a routine that helps limit the number of records a combobox has to pull down by "waiting" for the first "n" letters to get entered. This would be useful if you followed the combobox and concatenation route. Good luck! Regards Jeff Boyce Microsoft Access MVP -- Disclaimer: This author may have received products and services mentioned in this post. Mention and/or description of a product or service herein does not constitute endorsement thereof. Any code or pseudocode included in this post is offered "as is", with no guarantee as to suitability. You can thank the FTC of the USA for making this disclaimer possible/necessary. "Mark Andrews" wrote in message ... Jeff, Ok searching on Soundex I think will help me dig up some code or flush out the best approach a little better. If anyone has done this would love to see your approach! Code I'm writing will be installed at multiple companies, most will have 5000 or less contact records. The method should be designed to work well for 10,000 contacts. Thanks, Mark "Jeff Boyce" wrote in message ... Mark Just for the record, the folks who read and write here in the newsgroups are not all MVPs ... and some of the best answers I've seen come from folks who aren't. Don't limit your audience... Most of the approaches I've seen that work for this involve USB (using someone's brain). It sounds like that's part of your approach, too. Do a search on "Soundex". This is an algorithm that uses how words (e.g., names) sound to compare them. Words with similar soundex scores sounds similar. This could help with last names and street names, but I don't see it helping with Bobby vs. Robert, or with all the embellishments that addresses have. Again, you'd need to tell Access that Bobby and Robert are (sometimes) synonymous. One approach might be to sort all entries by "lastname, firstname - delivery address" (as a concatenated field) and USB to break the ties. Another approach might be to use the built-in autocomplete feature in Access comboboxes on a form. Have your user start typing a lastname and have Access jump to the "lastname, firstname - delivery address"es that start that way. You don't mention whether you're working with a couple hundreds entries, a couple thousand, or a couple hundred thousand. The approach you take may need to differ, depending on volume. Good luck! Regards Jeff Boyce Microsoft Access MVP "Mark Andrews" wrote in message ... The idea is: - the user enters "Bobby Smith" and the program pops up a screen saying here are some similar contacts "just to make sure this is not a duplicate contact being entered". Example: Bob Smith 123 Main Street Pittsburgh PA 15230 If they knew that the "Bobby Smith" they were entering lived on Main Street in Pittsburgh they would choose to quit entering data for this new contact because they know that they are entering a duplicate or alternately they could continue entering the contact - similar type logic for addresses Yes I should of used the wording "potential duplicates" in my post and yes it's not extremely simple thus the newsgroup post. Don't you MVPs like a challenge once in a while? First thoughts: I could do a simple comparison with 'like' or some sort of character by character comparison (if 90% of the characters match consider it a "potential duplicate"). I need some sort of "Similar" function. I don't think there are any potentially serious flaws with my analysis, my analysis at this point is "hey this might be a little work, I wonder if anyone else has attempted this and would let me see their code". Now is your chance to post the code you wrote to do these types of checks, I'm sure others have tackled duplicate issues in various ways (some approaches better than others), Thanks, Mark "Jeff Boyce" wrote in message ... You may not have received any responses yet because what you are proposing is not particularly simple. ... and there are some potentially serious flaws with your analysis! How do you expect Access to be able to correctly categorize "Bob Smith" as duplicating "Bobby Smith" when your database could legitimately contain two separate individuals with those names? And what happens when you have two unique individuals, both named Lynne Johnson? (there are two in my state, and they were both born on the same date!) I suspect you'll have to create your own code that tells Access exactly when and how to consider two records to be close enough to be a match ... and you might want to consider them only as "potential" matches. After all, can YOU be sure that all of the following are duplicates?: John Smith 12345 Elm St J. J. Smith 12345 Elm Street John J. Smith 12345 Elm St NW Johnny Smith 12354 Elm St. J. Smith 12345 Elm St Regards Jeff Boyce Microsoft Access MVP -- Disclaimer: This author may have received products and services mentioned in this post. Mention and/or description of a product or service herein does not constitute endorsement thereof. Any code or pseudocode included in this post is offered "as is", with no guarantee as to suitability. You can thank the FTC of the USA for making this disclaimer possible/necessary. "Mark Andrews" wrote in message ... I would like to add code to detect for duplicates or records with similar information. Looking to do this check on (firstname and lastname) and on address (street1, street2, city, state, zip) But I want something a little more advanced than just checking for exact matches. Wondering if anyone has some code they would care to share that might make my job of writing it a little easier? Example: Bob Smith and Bobby Smith would be detected as duplicates Rob Jones and Robert Jones would be detected as duplicates 123 main street pittsburgh, pa 15126 123 main st pittsburgh pa 15230 might be detected as duplicates Thanks in advance, Mark |
#9
|
|||
|
|||
Duplicate detection (looking for code)
No problem. Hope it ends up working. The word Soundex got me started down
the right path. "Fuzzy logic vba" is also a good search keyword. "Jeff Boyce" wrote in message ... Thanks for posting back what you found. That will undoubtedly help someone in their (future) search. Be aware that "thousands" of records in a combobox leads to poor response time. Allen B's approach speeds that up considerably. Good luck! Regards Jeff Boyce Microsoft Access MVP -- Disclaimer: This author may have received products and services mentioned in this post. Mention and/or description of a product or service herein does not constitute endorsement thereof. Any code or pseudocode included in this post is offered "as is", with no guarantee as to suitability. You can thank the FTC of the USA for making this disclaimer possible/necessary. "Mark Andrews" wrote in message ... I searched a little, my best lead right now is the code I found at: http://www.kdkeys.net/forums/thread/6450.aspx You do need to sign up to the forum to download it. It's an MDE but it looks like he included the source code for most of the fuzzy logic search algorithms. algorithms included - Levenshtein Edit Distance - Dice Coefficient - Longest Common Subsequence - Double Metaphone I also read soundex2 is good (soundex is a little too general). I am by no means an expert but I did do a little searching and if you want to do fuzzy matching of some sort I guess you need to jump into this stuff. Perhaps even store some algorithm results so at runtime you can compare faster to the thousands of records you have in the db. Maybe this will help someone else out? Mark "Jeff Boyce" wrote in message ... Mark Check on-line for Allen Browne's website. He has a routine that helps limit the number of records a combobox has to pull down by "waiting" for the first "n" letters to get entered. This would be useful if you followed the combobox and concatenation route. Good luck! Regards Jeff Boyce Microsoft Access MVP -- Disclaimer: This author may have received products and services mentioned in this post. Mention and/or description of a product or service herein does not constitute endorsement thereof. Any code or pseudocode included in this post is offered "as is", with no guarantee as to suitability. You can thank the FTC of the USA for making this disclaimer possible/necessary. "Mark Andrews" wrote in message ... Jeff, Ok searching on Soundex I think will help me dig up some code or flush out the best approach a little better. If anyone has done this would love to see your approach! Code I'm writing will be installed at multiple companies, most will have 5000 or less contact records. The method should be designed to work well for 10,000 contacts. Thanks, Mark "Jeff Boyce" wrote in message ... Mark Just for the record, the folks who read and write here in the newsgroups are not all MVPs ... and some of the best answers I've seen come from folks who aren't. Don't limit your audience... Most of the approaches I've seen that work for this involve USB (using someone's brain). It sounds like that's part of your approach, too. Do a search on "Soundex". This is an algorithm that uses how words (e.g., names) sound to compare them. Words with similar soundex scores sounds similar. This could help with last names and street names, but I don't see it helping with Bobby vs. Robert, or with all the embellishments that addresses have. Again, you'd need to tell Access that Bobby and Robert are (sometimes) synonymous. One approach might be to sort all entries by "lastname, firstname - delivery address" (as a concatenated field) and USB to break the ties. Another approach might be to use the built-in autocomplete feature in Access comboboxes on a form. Have your user start typing a lastname and have Access jump to the "lastname, firstname - delivery address"es that start that way. You don't mention whether you're working with a couple hundreds entries, a couple thousand, or a couple hundred thousand. The approach you take may need to differ, depending on volume. Good luck! Regards Jeff Boyce Microsoft Access MVP "Mark Andrews" wrote in message ... The idea is: - the user enters "Bobby Smith" and the program pops up a screen saying here are some similar contacts "just to make sure this is not a duplicate contact being entered". Example: Bob Smith 123 Main Street Pittsburgh PA 15230 If they knew that the "Bobby Smith" they were entering lived on Main Street in Pittsburgh they would choose to quit entering data for this new contact because they know that they are entering a duplicate or alternately they could continue entering the contact - similar type logic for addresses Yes I should of used the wording "potential duplicates" in my post and yes it's not extremely simple thus the newsgroup post. Don't you MVPs like a challenge once in a while? First thoughts: I could do a simple comparison with 'like' or some sort of character by character comparison (if 90% of the characters match consider it a "potential duplicate"). I need some sort of "Similar" function. I don't think there are any potentially serious flaws with my analysis, my analysis at this point is "hey this might be a little work, I wonder if anyone else has attempted this and would let me see their code". Now is your chance to post the code you wrote to do these types of checks, I'm sure others have tackled duplicate issues in various ways (some approaches better than others), Thanks, Mark "Jeff Boyce" wrote in message ... You may not have received any responses yet because what you are proposing is not particularly simple. ... and there are some potentially serious flaws with your analysis! How do you expect Access to be able to correctly categorize "Bob Smith" as duplicating "Bobby Smith" when your database could legitimately contain two separate individuals with those names? And what happens when you have two unique individuals, both named Lynne Johnson? (there are two in my state, and they were both born on the same date!) I suspect you'll have to create your own code that tells Access exactly when and how to consider two records to be close enough to be a match ... and you might want to consider them only as "potential" matches. After all, can YOU be sure that all of the following are duplicates?: John Smith 12345 Elm St J. J. Smith 12345 Elm Street John J. Smith 12345 Elm St NW Johnny Smith 12354 Elm St. J. Smith 12345 Elm St Regards Jeff Boyce Microsoft Access MVP -- Disclaimer: This author may have received products and services mentioned in this post. Mention and/or description of a product or service herein does not constitute endorsement thereof. Any code or pseudocode included in this post is offered "as is", with no guarantee as to suitability. You can thank the FTC of the USA for making this disclaimer possible/necessary. "Mark Andrews" wrote in message ... I would like to add code to detect for duplicates or records with similar information. Looking to do this check on (firstname and lastname) and on address (street1, street2, city, state, zip) But I want something a little more advanced than just checking for exact matches. Wondering if anyone has some code they would care to share that might make my job of writing it a little easier? Example: Bob Smith and Bobby Smith would be detected as duplicates Rob Jones and Robert Jones would be detected as duplicates 123 main street pittsburgh, pa 15126 123 main st pittsburgh pa 15230 might be detected as duplicates Thanks in advance, Mark |
#10
|
|||
|
|||
Duplicate detection (looking for code)
On Thu, 18 Mar 2010 12:24:01 -0700, "Jeff Boyce"
wrote: After all, can YOU be sure that all of the following are duplicates?: And what about my friends, Fred Brown and Fred Brown? Young Fred is no longer living at home, but he was when I first met him... -- John W. Vinson [MVP] |
Thread Tools | |
Display Modes | |
|
|