If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below. |
|
|
Thread Tools | Display Modes |
#1
|
|||
|
|||
Evaluating similarity of text strings
Hi All, I have a data list - roughly 2000 items in all, which are names (people, companies, organisations etc) and an amount they have paid. The data is collected from four separate sources and trimmed already to avoid extra spaces. The upshot is that there are instances where the same name is entered more than one way. For example: Alan B Chadd Bob Charles Chadd Alan B Now a human eye can easily spot that the first and third are (probably) the same person and it is worth investigating further. This is fine when they are close to each other, but not practical when they are two pages apart. Therefore, I would like to write a formula that could give a score that ranks the likely similar entries. I am thinking that something like this would be good: 1) Take each letter of the target name, and count how many times it appears in every other of the 2000 entries ignoring capitals (it is possible someone will have typed in a name with or without any proper capitalisation). 2) Add up the totals for each of the 2000 entries 3) Show any items with a score over X (to be picked by trial) or just sort them by the score. Example From Above: The score for "Alan B Chadd" against "Bob Charles" would be: A = 0 l = 1 a = 0 n = 0 Space = 1 B = 1 Space = 1 C = 1 h = 1 a = 0 d = 0 d = 0 Total = 5 The score for "Alan B Chadd" against "Chadd Alan B" would be: A = 3 l = 1 a = 3 n = 1 Space = 2 B = 1 Space = 2 C = 1 h = 1 a = 3 d = 2 d = 2 Total = 22 Obviously it is not perfect, but it should be good enough to point the human in the right direction! Can anyone suggest a way to do this? Thanks in advance, Alan. |
#2
|
|||
|
|||
Not perfect either, but...with names in column A, place
this formula in row 1 of an open column and fill down: =SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1)) ),1))) Then sort the data based on this column. Similar strings should be closely grouped. HTH Jason Atlanta, GA -----Original Message----- Hi All, I have a data list - roughly 2000 items in all, which are names (people, companies, organisations etc) and an amount they have paid. The data is collected from four separate sources and trimmed already to avoid extra spaces. The upshot is that there are instances where the same name is entered more than one way. For example: Alan B Chadd Bob Charles Chadd Alan B Now a human eye can easily spot that the first and third are (probably) the same person and it is worth investigating further. This is fine when they are close to each other, but not practical when they are two pages apart. Therefore, I would like to write a formula that could give a score that ranks the likely similar entries. I am thinking that something like this would be good: 1) Take each letter of the target name, and count how many times it appears in every other of the 2000 entries ignoring capitals (it is possible someone will have typed in a name with or without any proper capitalisation). 2) Add up the totals for each of the 2000 entries 3) Show any items with a score over X (to be picked by trial) or just sort them by the score. Example From Above: The score for "Alan B Chadd" against "Bob Charles" would be: A = 0 l = 1 a = 0 n = 0 Space = 1 B = 1 Space = 1 C = 1 h = 1 a = 0 d = 0 d = 0 Total = 5 The score for "Alan B Chadd" against "Chadd Alan B" would be: A = 3 l = 1 a = 3 n = 1 Space = 2 B = 1 Space = 2 C = 1 h = 1 a = 3 d = 2 d = 2 Total = 22 Obviously it is not perfect, but it should be good enough to point the human in the right direction! Can anyone suggest a way to do this? Thanks in advance, Alan. . |
#3
|
|||
|
|||
"Jason Morin" wrote
in message ... Not perfect either, but...with names in column A, place this formula in row 1 of an open column and fill down: =SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1)) ),1))) Then sort the data based on this column. Similar strings should be closely grouped. HTH Jason Atlanta, GA Hi Jason, I like the simplicity of your approach - just adding the code values together. However, if I change my example slightly and apply the formula I get: Alan B Chadd................978 Bob Charles...................1013 Chadd A B.....................663 I think that in a larger sample we would still find that the first and third lines are too far apart for a human to spot them together. The problem arises because it is very sensitive to the number of characters. Whether someone has, for example, entered a middle name or not, will affect the score too much. Similarly the scores for the following two versions of a company name are too far apart: ABC LTD...............458 abc limited...............1070 I have tried thinking about a way to modify your approach to make it work better under this circumstance. I could apply an UPPER function to each string first giving the improved result of: ABC LTD...............458 ABC LIMITED........750 It can be further improved by subtracting 31 from the code values (a space is char(32) so let's just avoid negative results): =SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1)) ),1))-31) This gives a further improvement: ABC LTD........................241 ABC LIMITED...............409 But this is still likely to put them a long way apart in a sort and it *feels* to me that we are just compressing the distribution of results rather than improving the ability to identify similarities. Do you have any other ideas? Thanks for your help, Alan. |
#4
|
|||
|
|||
Well, with the examples you gave, my example worked
pretty well. It's obvious that you data is much more inconsistent in format than your original post lead me to believe. What you're looking for is an extremely complex algorithem to handle such wide disparities in the data. Even then, you won't get close on some of them. I think the solution lies upstream in the process, before receiving the data. If you have any influence on your sources, you would try to insert some type of consistency in the way they report their data, so that you don't end up in the mess you're in now. I'd suggest applying what I've given you, enhance it as much as you can, then suck it up and do a manual check on the rest. Jason -----Original Message----- "Jason Morin" wrote in message ... Not perfect either, but...with names in column A, place this formula in row 1 of an open column and fill down: =SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN (A1))),1))) Then sort the data based on this column. Similar strings should be closely grouped. HTH Jason Atlanta, GA Hi Jason, I like the simplicity of your approach - just adding the code values together. However, if I change my example slightly and apply the formula I get: Alan B Chadd................978 Bob Charles...................1013 Chadd A B.....................663 I think that in a larger sample we would still find that the first and third lines are too far apart for a human to spot them together. The problem arises because it is very sensitive to the number of characters. Whether someone has, for example, entered a middle name or not, will affect the score too much. Similarly the scores for the following two versions of a company name are too far apart: ABC LTD...............458 abc limited...............1070 I have tried thinking about a way to modify your approach to make it work better under this circumstance. I could apply an UPPER function to each string first giving the improved result of: ABC LTD...............458 ABC LIMITED........750 It can be further improved by subtracting 31 from the code values (a space is char(32) so let's just avoid negative results): =SUMPRODUCT(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1) )),1))- 31) This gives a further improvement: ABC LTD........................241 ABC LIMITED...............409 But this is still likely to put them a long way apart in a sort and it *feels* to me that we are just compressing the distribution of results rather than improving the ability to identify similarities. Do you have any other ideas? Thanks for your help, Alan. . |
#5
|
|||
|
|||
"Jason Morin" wrote in message ... I'd suggest applying what I've given you, enhance it as much as you can, then suck it up and do a manual check on the rest. or use it as supporting material as to how difficult the problem is, and how it is better to fix the source of the problem, not the manifestation. |
#6
|
|||
|
|||
"Jason Morin"
wrote in message ... Well, with the examples you gave, my example worked pretty well. It's obvious that you data is much more inconsistent in format than your original post lead me to believe. Agreed - I do appreciate your assistance. What you're looking for is an extremely complex algorithem to handle such wide disparities in the data. Even then, you won't get close on some of them. I think the solution lies upstream in the process, before receiving the data. If you have any influence on your sources, you would try to insert some type of consistency in the way they report their data, so that you don't end up in the mess you're in now. Unfortunately the data is already in existence for many periods and that is what we have to work with. The systems going forwards are fully integrated so the issue does exist in the same way. I'd suggest applying what I've given you, enhance it as much as you can, then suck it up and do a manual check on the rest. Jason I wish we could afford the manual time, it would be easier from a management perspective at least - I will keep on looking for an algorithm, but thank you again for your assistance. Alan. |
#7
|
|||
|
|||
"Bob Phillips" wrote in message ... "Jason Morin" wrote in message ... I'd suggest applying what I've given you, enhance it as much as you can, then suck it up and do a manual check on the rest. or use it as supporting material as to how difficult the problem is, and how it is better to fix the source of the problem, not the manifestation. Hi Bob, Unfortunately the data is already in existence for many periods and that is what we have to work with for the statutory returns. The systems going forwards are fully integrated so the issue does exist in the same way - the source of the problem is already fixed, now I just have to find a way to report the history accurately enough for the authorities. I will keep looking for an algorithm - if you have any other ideas, please do post back. Thanks, Alan. |
Thread Tools | |
Display Modes | |
|
|
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
Compare cells/columns and highlight matching text strings | luxbelle | Worksheet Functions | 1 | February 25th, 2005 06:34 PM |
Outline | Renee Hendershott | Page Layout | 2 | December 25th, 2004 02:49 PM |
How does the "auto" setting work in Paragraph Spacing? | Joey | General Discussion | 9 | October 11th, 2004 08:44 PM |
find a date on sheet 2 and count text in that column | jtinne | General Discussion | 4 | October 4th, 2004 09:06 PM |
SUMPRODUCT with calculated text strings returning Zero | jbwhite99 | Worksheet Functions | 0 | June 15th, 2004 09:12 PM |