A Microsoft Office (Excel, Word) forum. OfficeFrustration

If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

Go Back   Home » OfficeFrustration forum » Microsoft Access » General Discussion
Site Map Home Register Authors List Search Today's Posts Mark Forums Read  

Duplicate detection (looking for code)



 
 
Thread Tools Display Modes
  #1  
Old March 18th, 2010, 05:40 PM posted to microsoft.public.access
Mark Andrews[_4_]
external usenet poster
 
Posts: 169
Default Duplicate detection (looking for code)

I would like to add code to detect for duplicates or records with similar
information.
Looking to do this check on (firstname and lastname) and on address
(street1, street2, city, state, zip)

But I want something a little more advanced than just checking for exact
matches.

Wondering if anyone has some code they would care to share that might make
my job of writing it a little easier?

Example:
Bob Smith and Bobby Smith would be detected as duplicates
Rob Jones and Robert Jones would be detected as duplicates
123 main street pittsburgh, pa 15126
123 main st pittsburgh pa 15230 might be detected as duplicates

Thanks in advance,
Mark

  #2  
Old March 18th, 2010, 07:24 PM posted to microsoft.public.access
Jeff Boyce
external usenet poster
 
Posts: 8,621
Default Duplicate detection (looking for code)

You may not have received any responses yet because what you are proposing
is not particularly simple.

.... and there are some potentially serious flaws with your analysis!

How do you expect Access to be able to correctly categorize "Bob Smith" as
duplicating "Bobby Smith" when your database could legitimately contain two
separate individuals with those names?

And what happens when you have two unique individuals, both named Lynne
Johnson? (there are two in my state, and they were both born on the same
date!)

I suspect you'll have to create your own code that tells Access exactly when
and how to consider two records to be close enough to be a match ... and you
might want to consider them only as "potential" matches.

After all, can YOU be sure that all of the following are duplicates?:

John Smith 12345 Elm St
J. J. Smith 12345 Elm Street
John J. Smith 12345 Elm St NW
Johnny Smith 12354 Elm St.
J. Smith 12345 Elm St

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services mentioned
in this post. Mention and/or description of a product or service herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.


"Mark Andrews" wrote in message
...
I would like to add code to detect for duplicates or records with similar
information.
Looking to do this check on (firstname and lastname) and on address
(street1, street2, city, state, zip)

But I want something a little more advanced than just checking for exact
matches.

Wondering if anyone has some code they would care to share that might make
my job of writing it a little easier?

Example:
Bob Smith and Bobby Smith would be detected as duplicates
Rob Jones and Robert Jones would be detected as duplicates
123 main street pittsburgh, pa 15126
123 main st pittsburgh pa 15230 might be detected as duplicates

Thanks in advance,
Mark



  #3  
Old March 18th, 2010, 07:55 PM posted to microsoft.public.access
Mark Andrews[_4_]
external usenet poster
 
Posts: 169
Default Duplicate detection (looking for code)

The idea is:
- the user enters "Bobby Smith" and the program pops up a screen saying here
are some similar contacts "just to make sure this is not a duplicate contact
being entered". Example: Bob Smith 123 Main Street Pittsburgh PA 15230
If they knew that the "Bobby Smith" they were entering lived on Main Street
in Pittsburgh they would choose to quit entering data for this new contact
because they know that they are entering a duplicate or alternately they
could continue entering the contact

- similar type logic for addresses

Yes I should of used the wording "potential duplicates" in my post and yes
it's not extremely simple thus the newsgroup post.
Don't you MVPs like a challenge once in a while?

First thoughts: I could do a simple comparison with 'like' or some sort of
character by character comparison (if 90% of the characters match consider
it a "potential duplicate"). I need some sort of "Similar" function.

I don't think there are any potentially serious flaws with my analysis, my
analysis at this point is "hey this might be a little work, I wonder if
anyone else has attempted this and would let me see their code".

Now is your chance to post the code you wrote to do these types of checks,
I'm sure others have tackled duplicate issues in various ways (some
approaches better than others),
Thanks,
Mark

"Jeff Boyce" wrote in message
...
You may not have received any responses yet because what you are proposing
is not particularly simple.

... and there are some potentially serious flaws with your analysis!

How do you expect Access to be able to correctly categorize "Bob Smith" as
duplicating "Bobby Smith" when your database could legitimately contain
two separate individuals with those names?

And what happens when you have two unique individuals, both named Lynne
Johnson? (there are two in my state, and they were both born on the same
date!)

I suspect you'll have to create your own code that tells Access exactly
when and how to consider two records to be close enough to be a match ...
and you might want to consider them only as "potential" matches.

After all, can YOU be sure that all of the following are duplicates?:

John Smith 12345 Elm St
J. J. Smith 12345 Elm Street
John J. Smith 12345 Elm St NW
Johnny Smith 12354 Elm St.
J. Smith 12345 Elm St

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services mentioned
in this post. Mention and/or description of a product or service herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.


"Mark Andrews" wrote in message
...
I would like to add code to detect for duplicates or records with similar
information.
Looking to do this check on (firstname and lastname) and on address
(street1, street2, city, state, zip)

But I want something a little more advanced than just checking for exact
matches.

Wondering if anyone has some code they would care to share that might
make my job of writing it a little easier?

Example:
Bob Smith and Bobby Smith would be detected as duplicates
Rob Jones and Robert Jones would be detected as duplicates
123 main street pittsburgh, pa 15126
123 main st pittsburgh pa 15230 might be detected as duplicates

Thanks in advance,
Mark



  #4  
Old March 18th, 2010, 08:11 PM posted to microsoft.public.access
Jeff Boyce
external usenet poster
 
Posts: 8,621
Default Duplicate detection (looking for code)

Mark

Just for the record, the folks who read and write here in the newsgroups are
not all MVPs ... and some of the best answers I've seen come from folks who
aren't. Don't limit your audience...

Most of the approaches I've seen that work for this involve USB (using
someone's brain). It sounds like that's part of your approach, too.

Do a search on "Soundex". This is an algorithm that uses how words (e.g.,
names) sound to compare them. Words with similar soundex scores sounds
similar. This could help with last names and street names, but I don't see
it helping with Bobby vs. Robert, or with all the embellishments that
addresses have. Again, you'd need to tell Access that Bobby and Robert are
(sometimes) synonymous.

One approach might be to sort all entries by "lastname, firstname - delivery
address" (as a concatenated field) and USB to break the ties.

Another approach might be to use the built-in autocomplete feature in Access
comboboxes on a form.

Have your user start typing a lastname and have Access jump to the
"lastname, firstname - delivery address"es that start that way.

You don't mention whether you're working with a couple hundreds entries, a
couple thousand, or a couple hundred thousand. The approach you take may
need to differ, depending on volume.

Good luck!

Regards

Jeff Boyce
Microsoft Access MVP

"Mark Andrews" wrote in message
...
The idea is:
- the user enters "Bobby Smith" and the program pops up a screen saying
here are some similar contacts "just to make sure this is not a duplicate
contact being entered". Example: Bob Smith 123 Main Street Pittsburgh PA
15230
If they knew that the "Bobby Smith" they were entering lived on Main
Street in Pittsburgh they would choose to quit entering data for this new
contact because they know that they are entering a duplicate or
alternately they could continue entering the contact

- similar type logic for addresses

Yes I should of used the wording "potential duplicates" in my post and yes
it's not extremely simple thus the newsgroup post.
Don't you MVPs like a challenge once in a while?

First thoughts: I could do a simple comparison with 'like' or some sort of
character by character comparison (if 90% of the characters match consider
it a "potential duplicate"). I need some sort of "Similar" function.

I don't think there are any potentially serious flaws with my analysis, my
analysis at this point is "hey this might be a little work, I wonder if
anyone else has attempted this and would let me see their code".

Now is your chance to post the code you wrote to do these types of checks,
I'm sure others have tackled duplicate issues in various ways (some
approaches better than others),
Thanks,
Mark

"Jeff Boyce" wrote in message
...
You may not have received any responses yet because what you are
proposing is not particularly simple.

... and there are some potentially serious flaws with your analysis!

How do you expect Access to be able to correctly categorize "Bob Smith"
as duplicating "Bobby Smith" when your database could legitimately
contain two separate individuals with those names?

And what happens when you have two unique individuals, both named Lynne
Johnson? (there are two in my state, and they were both born on the same
date!)

I suspect you'll have to create your own code that tells Access exactly
when and how to consider two records to be close enough to be a match ...
and you might want to consider them only as "potential" matches.

After all, can YOU be sure that all of the following are duplicates?:

John Smith 12345 Elm St
J. J. Smith 12345 Elm Street
John J. Smith 12345 Elm St NW
Johnny Smith 12354 Elm St.
J. Smith 12345 Elm St

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services mentioned
in this post. Mention and/or description of a product or service herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.


"Mark Andrews" wrote in message
...
I would like to add code to detect for duplicates or records with similar
information.
Looking to do this check on (firstname and lastname) and on address
(street1, street2, city, state, zip)

But I want something a little more advanced than just checking for exact
matches.

Wondering if anyone has some code they would care to share that might
make my job of writing it a little easier?

Example:
Bob Smith and Bobby Smith would be detected as duplicates
Rob Jones and Robert Jones would be detected as duplicates
123 main street pittsburgh, pa 15126
123 main st pittsburgh pa 15230 might be detected as duplicates

Thanks in advance,
Mark





  #5  
Old March 18th, 2010, 08:38 PM posted to microsoft.public.access
Mark Andrews[_4_]
external usenet poster
 
Posts: 169
Default Duplicate detection (looking for code)

Jeff,

Ok searching on Soundex I think will help me dig up some code or flush out
the best approach a little better.

If anyone has done this would love to see your approach!

Code I'm writing will be installed at multiple companies, most will have
5000 or less contact records.
The method should be designed to work well for 10,000 contacts.

Thanks,
Mark

"Jeff Boyce" wrote in message
...
Mark

Just for the record, the folks who read and write here in the newsgroups
are not all MVPs ... and some of the best answers I've seen come from
folks who aren't. Don't limit your audience...

Most of the approaches I've seen that work for this involve USB (using
someone's brain). It sounds like that's part of your approach, too.

Do a search on "Soundex". This is an algorithm that uses how words (e.g.,
names) sound to compare them. Words with similar soundex scores sounds
similar. This could help with last names and street names, but I don't
see it helping with Bobby vs. Robert, or with all the embellishments that
addresses have. Again, you'd need to tell Access that Bobby and Robert
are (sometimes) synonymous.

One approach might be to sort all entries by "lastname, firstname -
delivery address" (as a concatenated field) and USB to break the ties.

Another approach might be to use the built-in autocomplete feature in
Access comboboxes on a form.

Have your user start typing a lastname and have Access jump to the
"lastname, firstname - delivery address"es that start that way.

You don't mention whether you're working with a couple hundreds entries, a
couple thousand, or a couple hundred thousand. The approach you take may
need to differ, depending on volume.

Good luck!

Regards

Jeff Boyce
Microsoft Access MVP

"Mark Andrews" wrote in message
...
The idea is:
- the user enters "Bobby Smith" and the program pops up a screen saying
here are some similar contacts "just to make sure this is not a duplicate
contact being entered". Example: Bob Smith 123 Main Street Pittsburgh PA
15230
If they knew that the "Bobby Smith" they were entering lived on Main
Street in Pittsburgh they would choose to quit entering data for this new
contact because they know that they are entering a duplicate or
alternately they could continue entering the contact

- similar type logic for addresses

Yes I should of used the wording "potential duplicates" in my post and
yes it's not extremely simple thus the newsgroup post.
Don't you MVPs like a challenge once in a while?

First thoughts: I could do a simple comparison with 'like' or some sort
of character by character comparison (if 90% of the characters match
consider it a "potential duplicate"). I need some sort of "Similar"
function.

I don't think there are any potentially serious flaws with my analysis,
my analysis at this point is "hey this might be a little work, I wonder
if anyone else has attempted this and would let me see their code".

Now is your chance to post the code you wrote to do these types of
checks,
I'm sure others have tackled duplicate issues in various ways (some
approaches better than others),
Thanks,
Mark

"Jeff Boyce" wrote in message
...
You may not have received any responses yet because what you are
proposing is not particularly simple.

... and there are some potentially serious flaws with your analysis!

How do you expect Access to be able to correctly categorize "Bob Smith"
as duplicating "Bobby Smith" when your database could legitimately
contain two separate individuals with those names?

And what happens when you have two unique individuals, both named Lynne
Johnson? (there are two in my state, and they were both born on the
same date!)

I suspect you'll have to create your own code that tells Access exactly
when and how to consider two records to be close enough to be a match
... and you might want to consider them only as "potential" matches.

After all, can YOU be sure that all of the following are duplicates?:

John Smith 12345 Elm St
J. J. Smith 12345 Elm Street
John J. Smith 12345 Elm St NW
Johnny Smith 12354 Elm St.
J. Smith 12345 Elm St

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services
mentioned
in this post. Mention and/or description of a product or service herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.


"Mark Andrews" wrote in message
...
I would like to add code to detect for duplicates or records with
similar information.
Looking to do this check on (firstname and lastname) and on address
(street1, street2, city, state, zip)

But I want something a little more advanced than just checking for
exact matches.

Wondering if anyone has some code they would care to share that might
make my job of writing it a little easier?

Example:
Bob Smith and Bobby Smith would be detected as duplicates
Rob Jones and Robert Jones would be detected as duplicates
123 main street pittsburgh, pa 15126
123 main st pittsburgh pa 15230 might be detected as duplicates

Thanks in advance,
Mark




  #6  
Old March 18th, 2010, 08:49 PM posted to microsoft.public.access
Jeff Boyce
external usenet poster
 
Posts: 8,621
Default Duplicate detection (looking for code)

Mark

Check on-line for Allen Browne's website. He has a routine that helps limit
the number of records a combobox has to pull down by "waiting" for the first
"n" letters to get entered. This would be useful if you followed the
combobox and concatenation route.

Good luck!

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services mentioned
in this post. Mention and/or description of a product or service herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.

"Mark Andrews" wrote in message
...
Jeff,

Ok searching on Soundex I think will help me dig up some code or flush out
the best approach a little better.

If anyone has done this would love to see your approach!

Code I'm writing will be installed at multiple companies, most will have
5000 or less contact records.
The method should be designed to work well for 10,000 contacts.

Thanks,
Mark

"Jeff Boyce" wrote in message
...
Mark

Just for the record, the folks who read and write here in the newsgroups
are not all MVPs ... and some of the best answers I've seen come from
folks who aren't. Don't limit your audience...

Most of the approaches I've seen that work for this involve USB (using
someone's brain). It sounds like that's part of your approach, too.

Do a search on "Soundex". This is an algorithm that uses how words
(e.g., names) sound to compare them. Words with similar soundex scores
sounds similar. This could help with last names and street names, but I
don't see it helping with Bobby vs. Robert, or with all the
embellishments that addresses have. Again, you'd need to tell Access
that Bobby and Robert are (sometimes) synonymous.

One approach might be to sort all entries by "lastname, firstname -
delivery address" (as a concatenated field) and USB to break the ties.

Another approach might be to use the built-in autocomplete feature in
Access comboboxes on a form.

Have your user start typing a lastname and have Access jump to the
"lastname, firstname - delivery address"es that start that way.

You don't mention whether you're working with a couple hundreds entries,
a couple thousand, or a couple hundred thousand. The approach you take
may need to differ, depending on volume.

Good luck!

Regards

Jeff Boyce
Microsoft Access MVP

"Mark Andrews" wrote in message
...
The idea is:
- the user enters "Bobby Smith" and the program pops up a screen saying
here are some similar contacts "just to make sure this is not a
duplicate contact being entered". Example: Bob Smith 123 Main Street
Pittsburgh PA 15230
If they knew that the "Bobby Smith" they were entering lived on Main
Street in Pittsburgh they would choose to quit entering data for this
new contact because they know that they are entering a duplicate or
alternately they could continue entering the contact

- similar type logic for addresses

Yes I should of used the wording "potential duplicates" in my post and
yes it's not extremely simple thus the newsgroup post.
Don't you MVPs like a challenge once in a while?

First thoughts: I could do a simple comparison with 'like' or some sort
of character by character comparison (if 90% of the characters match
consider it a "potential duplicate"). I need some sort of "Similar"
function.

I don't think there are any potentially serious flaws with my analysis,
my analysis at this point is "hey this might be a little work, I wonder
if anyone else has attempted this and would let me see their code".

Now is your chance to post the code you wrote to do these types of
checks,
I'm sure others have tackled duplicate issues in various ways (some
approaches better than others),
Thanks,
Mark

"Jeff Boyce" wrote in message
...
You may not have received any responses yet because what you are
proposing is not particularly simple.

... and there are some potentially serious flaws with your analysis!

How do you expect Access to be able to correctly categorize "Bob Smith"
as duplicating "Bobby Smith" when your database could legitimately
contain two separate individuals with those names?

And what happens when you have two unique individuals, both named Lynne
Johnson? (there are two in my state, and they were both born on the
same date!)

I suspect you'll have to create your own code that tells Access exactly
when and how to consider two records to be close enough to be a match
... and you might want to consider them only as "potential" matches.

After all, can YOU be sure that all of the following are duplicates?:

John Smith 12345 Elm St
J. J. Smith 12345 Elm Street
John J. Smith 12345 Elm St NW
Johnny Smith 12354 Elm St.
J. Smith 12345 Elm St

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services
mentioned
in this post. Mention and/or description of a product or service herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with
no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.


"Mark Andrews" wrote in message
...
I would like to add code to detect for duplicates or records with
similar information.
Looking to do this check on (firstname and lastname) and on address
(street1, street2, city, state, zip)

But I want something a little more advanced than just checking for
exact matches.

Wondering if anyone has some code they would care to share that might
make my job of writing it a little easier?

Example:
Bob Smith and Bobby Smith would be detected as duplicates
Rob Jones and Robert Jones would be detected as duplicates
123 main street pittsburgh, pa 15126
123 main st pittsburgh pa 15230 might be detected as duplicates

Thanks in advance,
Mark






  #7  
Old March 18th, 2010, 09:49 PM posted to microsoft.public.access
Mark Andrews[_4_]
external usenet poster
 
Posts: 169
Default Duplicate detection (looking for code)

I searched a little, my best lead right now is the code I found at:
http://www.kdkeys.net/forums/thread/6450.aspx

You do need to sign up to the forum to download it.

It's an MDE but it looks like he included the source code for most of the
fuzzy logic search algorithms.

algorithms included
- Levenshtein Edit Distance
- Dice Coefficient
- Longest Common Subsequence
- Double Metaphone

I also read soundex2 is good (soundex is a little too general).

I am by no means an expert but I did do a little searching and if you want
to do fuzzy matching of some
sort I guess you need to jump into this stuff. Perhaps even store some
algorithm results so at runtime
you can compare faster to the thousands of records you have in the db.

Maybe this will help someone else out?
Mark



"Jeff Boyce" wrote in message
...
Mark

Check on-line for Allen Browne's website. He has a routine that helps
limit the number of records a combobox has to pull down by "waiting" for
the first "n" letters to get entered. This would be useful if you
followed the combobox and concatenation route.

Good luck!

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services mentioned
in this post. Mention and/or description of a product or service herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.

"Mark Andrews" wrote in message
...
Jeff,

Ok searching on Soundex I think will help me dig up some code or flush
out the best approach a little better.

If anyone has done this would love to see your approach!

Code I'm writing will be installed at multiple companies, most will have
5000 or less contact records.
The method should be designed to work well for 10,000 contacts.

Thanks,
Mark

"Jeff Boyce" wrote in message
...
Mark

Just for the record, the folks who read and write here in the newsgroups
are not all MVPs ... and some of the best answers I've seen come from
folks who aren't. Don't limit your audience...

Most of the approaches I've seen that work for this involve USB (using
someone's brain). It sounds like that's part of your approach, too.

Do a search on "Soundex". This is an algorithm that uses how words
(e.g., names) sound to compare them. Words with similar soundex scores
sounds similar. This could help with last names and street names, but I
don't see it helping with Bobby vs. Robert, or with all the
embellishments that addresses have. Again, you'd need to tell Access
that Bobby and Robert are (sometimes) synonymous.

One approach might be to sort all entries by "lastname, firstname -
delivery address" (as a concatenated field) and USB to break the ties.

Another approach might be to use the built-in autocomplete feature in
Access comboboxes on a form.

Have your user start typing a lastname and have Access jump to the
"lastname, firstname - delivery address"es that start that way.

You don't mention whether you're working with a couple hundreds entries,
a couple thousand, or a couple hundred thousand. The approach you take
may need to differ, depending on volume.

Good luck!

Regards

Jeff Boyce
Microsoft Access MVP

"Mark Andrews" wrote in message
...
The idea is:
- the user enters "Bobby Smith" and the program pops up a screen saying
here are some similar contacts "just to make sure this is not a
duplicate contact being entered". Example: Bob Smith 123 Main Street
Pittsburgh PA 15230
If they knew that the "Bobby Smith" they were entering lived on Main
Street in Pittsburgh they would choose to quit entering data for this
new contact because they know that they are entering a duplicate or
alternately they could continue entering the contact

- similar type logic for addresses

Yes I should of used the wording "potential duplicates" in my post and
yes it's not extremely simple thus the newsgroup post.
Don't you MVPs like a challenge once in a while?

First thoughts: I could do a simple comparison with 'like' or some sort
of character by character comparison (if 90% of the characters match
consider it a "potential duplicate"). I need some sort of "Similar"
function.

I don't think there are any potentially serious flaws with my analysis,
my analysis at this point is "hey this might be a little work, I wonder
if anyone else has attempted this and would let me see their code".

Now is your chance to post the code you wrote to do these types of
checks,
I'm sure others have tackled duplicate issues in various ways (some
approaches better than others),
Thanks,
Mark

"Jeff Boyce" wrote in message
...
You may not have received any responses yet because what you are
proposing is not particularly simple.

... and there are some potentially serious flaws with your analysis!

How do you expect Access to be able to correctly categorize "Bob
Smith" as duplicating "Bobby Smith" when your database could
legitimately contain two separate individuals with those names?

And what happens when you have two unique individuals, both named
Lynne Johnson? (there are two in my state, and they were both born on
the same date!)

I suspect you'll have to create your own code that tells Access
exactly when and how to consider two records to be close enough to be
a match ... and you might want to consider them only as "potential"
matches.

After all, can YOU be sure that all of the following are duplicates?:

John Smith 12345 Elm St
J. J. Smith 12345 Elm Street
John J. Smith 12345 Elm St NW
Johnny Smith 12354 Elm St.
J. Smith 12345 Elm St

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services
mentioned
in this post. Mention and/or description of a product or service
herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with
no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.


"Mark Andrews" wrote in message
...
I would like to add code to detect for duplicates or records with
similar information.
Looking to do this check on (firstname and lastname) and on address
(street1, street2, city, state, zip)

But I want something a little more advanced than just checking for
exact matches.

Wondering if anyone has some code they would care to share that might
make my job of writing it a little easier?

Example:
Bob Smith and Bobby Smith would be detected as duplicates
Rob Jones and Robert Jones would be detected as duplicates
123 main street pittsburgh, pa 15126
123 main st pittsburgh pa 15230 might be detected as duplicates

Thanks in advance,
Mark






  #8  
Old March 18th, 2010, 10:20 PM posted to microsoft.public.access
Jeff Boyce
external usenet poster
 
Posts: 8,621
Default Duplicate detection (looking for code)

Thanks for posting back what you found. That will undoubtedly help someone
in their (future) search.

Be aware that "thousands" of records in a combobox leads to poor response
time. Allen B's approach speeds that up considerably.

Good luck!

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services mentioned
in this post. Mention and/or description of a product or service herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.

"Mark Andrews" wrote in message
...
I searched a little, my best lead right now is the code I found at:
http://www.kdkeys.net/forums/thread/6450.aspx

You do need to sign up to the forum to download it.

It's an MDE but it looks like he included the source code for most of the
fuzzy logic search algorithms.

algorithms included
- Levenshtein Edit Distance
- Dice Coefficient
- Longest Common Subsequence
- Double Metaphone

I also read soundex2 is good (soundex is a little too general).

I am by no means an expert but I did do a little searching and if you want
to do fuzzy matching of some
sort I guess you need to jump into this stuff. Perhaps even store some
algorithm results so at runtime
you can compare faster to the thousands of records you have in the db.

Maybe this will help someone else out?
Mark



"Jeff Boyce" wrote in message
...
Mark

Check on-line for Allen Browne's website. He has a routine that helps
limit the number of records a combobox has to pull down by "waiting" for
the first "n" letters to get entered. This would be useful if you
followed the combobox and concatenation route.

Good luck!

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services mentioned
in this post. Mention and/or description of a product or service herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.

"Mark Andrews" wrote in message
...
Jeff,

Ok searching on Soundex I think will help me dig up some code or flush
out the best approach a little better.

If anyone has done this would love to see your approach!

Code I'm writing will be installed at multiple companies, most will have
5000 or less contact records.
The method should be designed to work well for 10,000 contacts.

Thanks,
Mark

"Jeff Boyce" wrote in message
...
Mark

Just for the record, the folks who read and write here in the
newsgroups are not all MVPs ... and some of the best answers I've seen
come from folks who aren't. Don't limit your audience...

Most of the approaches I've seen that work for this involve USB (using
someone's brain). It sounds like that's part of your approach, too.

Do a search on "Soundex". This is an algorithm that uses how words
(e.g., names) sound to compare them. Words with similar soundex scores
sounds similar. This could help with last names and street names, but
I don't see it helping with Bobby vs. Robert, or with all the
embellishments that addresses have. Again, you'd need to tell Access
that Bobby and Robert are (sometimes) synonymous.

One approach might be to sort all entries by "lastname, firstname -
delivery address" (as a concatenated field) and USB to break the ties.

Another approach might be to use the built-in autocomplete feature in
Access comboboxes on a form.

Have your user start typing a lastname and have Access jump to the
"lastname, firstname - delivery address"es that start that way.

You don't mention whether you're working with a couple hundreds
entries, a couple thousand, or a couple hundred thousand. The approach
you take may need to differ, depending on volume.

Good luck!

Regards

Jeff Boyce
Microsoft Access MVP

"Mark Andrews" wrote in message
...
The idea is:
- the user enters "Bobby Smith" and the program pops up a screen
saying here are some similar contacts "just to make sure this is not a
duplicate contact being entered". Example: Bob Smith 123 Main Street
Pittsburgh PA 15230
If they knew that the "Bobby Smith" they were entering lived on Main
Street in Pittsburgh they would choose to quit entering data for this
new contact because they know that they are entering a duplicate or
alternately they could continue entering the contact

- similar type logic for addresses

Yes I should of used the wording "potential duplicates" in my post and
yes it's not extremely simple thus the newsgroup post.
Don't you MVPs like a challenge once in a while?

First thoughts: I could do a simple comparison with 'like' or some
sort of character by character comparison (if 90% of the characters
match consider it a "potential duplicate"). I need some sort of
"Similar" function.

I don't think there are any potentially serious flaws with my
analysis, my analysis at this point is "hey this might be a little
work, I wonder if anyone else has attempted this and would let me see
their code".

Now is your chance to post the code you wrote to do these types of
checks,
I'm sure others have tackled duplicate issues in various ways (some
approaches better than others),
Thanks,
Mark

"Jeff Boyce" wrote in message
...
You may not have received any responses yet because what you are
proposing is not particularly simple.

... and there are some potentially serious flaws with your analysis!

How do you expect Access to be able to correctly categorize "Bob
Smith" as duplicating "Bobby Smith" when your database could
legitimately contain two separate individuals with those names?

And what happens when you have two unique individuals, both named
Lynne Johnson? (there are two in my state, and they were both born
on the same date!)

I suspect you'll have to create your own code that tells Access
exactly when and how to consider two records to be close enough to be
a match ... and you might want to consider them only as "potential"
matches.

After all, can YOU be sure that all of the following are duplicates?:

John Smith 12345 Elm St
J. J. Smith 12345 Elm Street
John J. Smith 12345 Elm St NW
Johnny Smith 12354 Elm St.
J. Smith 12345 Elm St

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services
mentioned
in this post. Mention and/or description of a product or service
herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with
no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.


"Mark Andrews" wrote in
message ...
I would like to add code to detect for duplicates or records with
similar information.
Looking to do this check on (firstname and lastname) and on address
(street1, street2, city, state, zip)

But I want something a little more advanced than just checking for
exact matches.

Wondering if anyone has some code they would care to share that
might make my job of writing it a little easier?

Example:
Bob Smith and Bobby Smith would be detected as duplicates
Rob Jones and Robert Jones would be detected as duplicates
123 main street pittsburgh, pa 15126
123 main st pittsburgh pa 15230 might be detected as duplicates

Thanks in advance,
Mark








  #9  
Old March 18th, 2010, 11:22 PM posted to microsoft.public.access
Mark Andrews[_4_]
external usenet poster
 
Posts: 169
Default Duplicate detection (looking for code)

No problem. Hope it ends up working. The word Soundex got me started down
the right path.
"Fuzzy logic vba" is also a good search keyword.

"Jeff Boyce" wrote in message
...
Thanks for posting back what you found. That will undoubtedly help
someone in their (future) search.

Be aware that "thousands" of records in a combobox leads to poor response
time. Allen B's approach speeds that up considerably.

Good luck!

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services mentioned
in this post. Mention and/or description of a product or service herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.

"Mark Andrews" wrote in message
...
I searched a little, my best lead right now is the code I found at:
http://www.kdkeys.net/forums/thread/6450.aspx

You do need to sign up to the forum to download it.

It's an MDE but it looks like he included the source code for most of the
fuzzy logic search algorithms.

algorithms included
- Levenshtein Edit Distance
- Dice Coefficient
- Longest Common Subsequence
- Double Metaphone

I also read soundex2 is good (soundex is a little too general).

I am by no means an expert but I did do a little searching and if you
want to do fuzzy matching of some
sort I guess you need to jump into this stuff. Perhaps even store some
algorithm results so at runtime
you can compare faster to the thousands of records you have in the db.

Maybe this will help someone else out?
Mark



"Jeff Boyce" wrote in message
...
Mark

Check on-line for Allen Browne's website. He has a routine that helps
limit the number of records a combobox has to pull down by "waiting" for
the first "n" letters to get entered. This would be useful if you
followed the combobox and concatenation route.

Good luck!

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services
mentioned
in this post. Mention and/or description of a product or service herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is", with no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.

"Mark Andrews" wrote in message
...
Jeff,

Ok searching on Soundex I think will help me dig up some code or flush
out the best approach a little better.

If anyone has done this would love to see your approach!

Code I'm writing will be installed at multiple companies, most will
have 5000 or less contact records.
The method should be designed to work well for 10,000 contacts.

Thanks,
Mark

"Jeff Boyce" wrote in message
...
Mark

Just for the record, the folks who read and write here in the
newsgroups are not all MVPs ... and some of the best answers I've seen
come from folks who aren't. Don't limit your audience...

Most of the approaches I've seen that work for this involve USB (using
someone's brain). It sounds like that's part of your approach, too.

Do a search on "Soundex". This is an algorithm that uses how words
(e.g., names) sound to compare them. Words with similar soundex
scores sounds similar. This could help with last names and street
names, but I don't see it helping with Bobby vs. Robert, or with all
the embellishments that addresses have. Again, you'd need to tell
Access that Bobby and Robert are (sometimes) synonymous.

One approach might be to sort all entries by "lastname, firstname -
delivery address" (as a concatenated field) and USB to break the ties.

Another approach might be to use the built-in autocomplete feature in
Access comboboxes on a form.

Have your user start typing a lastname and have Access jump to the
"lastname, firstname - delivery address"es that start that way.

You don't mention whether you're working with a couple hundreds
entries, a couple thousand, or a couple hundred thousand. The
approach you take may need to differ, depending on volume.

Good luck!

Regards

Jeff Boyce
Microsoft Access MVP

"Mark Andrews" wrote in message
...
The idea is:
- the user enters "Bobby Smith" and the program pops up a screen
saying here are some similar contacts "just to make sure this is not
a duplicate contact being entered". Example: Bob Smith 123 Main
Street Pittsburgh PA 15230
If they knew that the "Bobby Smith" they were entering lived on Main
Street in Pittsburgh they would choose to quit entering data for this
new contact because they know that they are entering a duplicate or
alternately they could continue entering the contact

- similar type logic for addresses

Yes I should of used the wording "potential duplicates" in my post
and yes it's not extremely simple thus the newsgroup post.
Don't you MVPs like a challenge once in a while?

First thoughts: I could do a simple comparison with 'like' or some
sort of character by character comparison (if 90% of the characters
match consider it a "potential duplicate"). I need some sort of
"Similar" function.

I don't think there are any potentially serious flaws with my
analysis, my analysis at this point is "hey this might be a little
work, I wonder if anyone else has attempted this and would let me see
their code".

Now is your chance to post the code you wrote to do these types of
checks,
I'm sure others have tackled duplicate issues in various ways (some
approaches better than others),
Thanks,
Mark

"Jeff Boyce" wrote in message
...
You may not have received any responses yet because what you are
proposing is not particularly simple.

... and there are some potentially serious flaws with your analysis!

How do you expect Access to be able to correctly categorize "Bob
Smith" as duplicating "Bobby Smith" when your database could
legitimately contain two separate individuals with those names?

And what happens when you have two unique individuals, both named
Lynne Johnson? (there are two in my state, and they were both born
on the same date!)

I suspect you'll have to create your own code that tells Access
exactly when and how to consider two records to be close enough to
be a match ... and you might want to consider them only as
"potential" matches.

After all, can YOU be sure that all of the following are
duplicates?:

John Smith 12345 Elm St
J. J. Smith 12345 Elm Street
John J. Smith 12345 Elm St NW
Johnny Smith 12354 Elm St.
J. Smith 12345 Elm St

Regards

Jeff Boyce
Microsoft Access MVP

--
Disclaimer: This author may have received products and services
mentioned
in this post. Mention and/or description of a product or service
herein
does not constitute endorsement thereof.

Any code or pseudocode included in this post is offered "as is",
with no
guarantee as to suitability.

You can thank the FTC of the USA for making this disclaimer
possible/necessary.


"Mark Andrews" wrote in
message ...
I would like to add code to detect for duplicates or records with
similar information.
Looking to do this check on (firstname and lastname) and on address
(street1, street2, city, state, zip)

But I want something a little more advanced than just checking for
exact matches.

Wondering if anyone has some code they would care to share that
might make my job of writing it a little easier?

Example:
Bob Smith and Bobby Smith would be detected as duplicates
Rob Jones and Robert Jones would be detected as duplicates
123 main street pittsburgh, pa 15126
123 main st pittsburgh pa 15230 might be detected as duplicates

Thanks in advance,
Mark








  #10  
Old March 19th, 2010, 01:51 AM posted to microsoft.public.access
John W. Vinson
external usenet poster
 
Posts: 18,261
Default Duplicate detection (looking for code)

On Thu, 18 Mar 2010 12:24:01 -0700, "Jeff Boyce"
wrote:

After all, can YOU be sure that all of the following are duplicates?:


And what about my friends, Fred Brown and Fred Brown? Young Fred is no longer
living at home, but he was when I first met him...
--

John W. Vinson [MVP]
 




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Forum Jump


All times are GMT +1. The time now is 01:10 PM.


Powered by vBulletin® Version 3.6.4
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright ©2004-2024 OfficeFrustration.
The comments are property of their posters.