Query to group similar number combinations?

**esn** · May 14th, 2010, 05:06 PM posted to microsoft.public.access.queries

I'm trying to create a make-table query that will group records by two
fields. The fields, named "Left" and "Right," both contain 4-digit
identification numbers, and each combination of the two fields should
identify a unique individual. Unfortunately, data entry/recording
errors are pretty common in these fields.

The point of the query is to assign a unique, single-field ID to each
individual, using the primary key from the table created by this
query. In a perfect world, I could just group records by "Left" and
"Right," add the results to a temporary table, and use the PK from
that table as an ID. But as I mentioned above, the data is not
exactly perfect.

I have created queries to find individuals with similar Left and Right
codes, but I can't think of a good way to group all these records
together. Here's and input example:

RecordID Left Right
1 1001 1002
2 1001 1002
3 1002 1001
4 1003 1004
5 1005 1006
6 1005 1009
7 1005 1006

What I'm trying to do is create some field, based on "Left" and
"Right," that I could use in a group by statement to return three rows
from the table above, since there are probably 3 individuals
represented there. (The first three records are probably all the same
individual, as are the last three records). Not even sure if this
would be possible, since in different cases things are grouped based
on different criteria, but I thought I would put it out there and see
what people think.

**KARL DEWEY** · May 14th, 2010, 07:12 PM posted to microsoft.public.access.queries

The first three records are probably all the same individual, as are the
last three records
I do not see your logic as all three have 1005, two have 1006, but another
has 1009 that does not fit.

--
Build a little, test a little.

"esn" wrote:

I'm trying to create a make-table query that will group records by two
fields. The fields, named "Left" and "Right," both contain 4-digit
identification numbers, and each combination of the two fields should
identify a unique individual. Unfortunately, data entry/recording
errors are pretty common in these fields.

The point of the query is to assign a unique, single-field ID to each
individual, using the primary key from the table created by this
query. In a perfect world, I could just group records by "Left" and
"Right," add the results to a temporary table, and use the PK from
that table as an ID. But as I mentioned above, the data is not
exactly perfect.

I have created queries to find individuals with similar Left and Right
codes, but I can't think of a good way to group all these records
together. Here's and input example:

RecordID Left Right
1 1001 1002
2 1001 1002
3 1002 1001
4 1003 1004
5 1005 1006
6 1005 1009
7 1005 1006

What I'm trying to do is create some field, based on "Left" and
"Right," that I could use in a group by statement to return three rows
from the table above, since there are probably 3 individuals
represented there. (The first three records are probably all the same
individual, as are the last three records). Not even sure if this
would be possible, since in different cases things are grouped based
on different criteria, but I thought I would put it out there and see
what people think.
.

**esn** · May 14th, 2010, 08:13 PM posted to microsoft.public.access.queries

The example was chosen to represent some of the more common errors in
recording/entering data - one common error is to replace a 6 with a 9
(I also see a lot of 3s replaced with 8s, etc). From poring through
this data for a few months, I know (based largely on other data stored
with these records and checking entered data against recorded data)
that it's much more likely that those last three records represent the
same individual and that the "1009" is an error, than that there are
two distinct individuals with numbers that are so similar. Had this
particular project been designed a little more carefully, I could
actually guarantee that, as each four-digit code would only be
available once, so there would be no way for two different individuals
to share a code in one column but have different codes in the other.
In reality, there have been a few codes used more than once. Still,
the odds of two different individuals ending up with 7 of the 8 digits
matching is miniscule compared to the probability of error.

The same goes for the first three records - technically, it may be
possible for two different individuals to have the same codes but in
opposite columns. However, the probability that the codes were
accidentally transposed somewhere in the path from the real world to
the database is orders of magnitude higher than the probability that
those codes are correct and actually represent two different
individuals. Sounds like a nasty assumption, I know, but I've looked
into it quite a bit and in every case I've found (somewhere in the
hundreds, out of around 10000 records) the evidence from other fields
in the records overwhelmingly pointed to errors in the data, as
opposed to those records actually representing unique individuals.

I should also have mentioned that it is possible for these codes to be
lost and replaced (they refer to physical markers) - in which case an
single individual can be represented by more than one combination of
left and right codes. It's a very messy situation, and I have a
feeling that automating the process of identifying which records
represent the same individual may not be possible, but was hoping
someone might have some ideas for me. I've built a form where users
can select from a group of records with "similar" codes to identify
the individual, but that doesn't do any good until I've assigned an
individual ID to most of the records.

**KARL DEWEY** · May 14th, 2010, 08:56 PM posted to microsoft.public.access.queries

one common error is to replace a 6 with a 9 (I also see a lot of 3s
replaced with 8s, etc).
You can not create a query to find those that are the typos you mentioned
unless either you build a table of possible error combinations and let it
suggest possible errors. The other way that might be possible is if there is
supposed to be a minimum number of records in a set and it fill a set with
something that is an half match.

What do you want the output to look like for those that are easy to find
such as the reversed field?

--
Build a little, test a little.

"esn" wrote:

The example was chosen to represent some of the more common errors in
recording/entering data - one common error is to replace a 6 with a 9
(I also see a lot of 3s replaced with 8s, etc). From poring through
this data for a few months, I know (based largely on other data stored
with these records and checking entered data against recorded data)
that it's much more likely that those last three records represent the
same individual and that the "1009" is an error, than that there are
two distinct individuals with numbers that are so similar. Had this
particular project been designed a little more carefully, I could
actually guarantee that, as each four-digit code would only be
available once, so there would be no way for two different individuals
to share a code in one column but have different codes in the other.
In reality, there have been a few codes used more than once. Still,
the odds of two different individuals ending up with 7 of the 8 digits
matching is miniscule compared to the probability of error.

The same goes for the first three records - technically, it may be
possible for two different individuals to have the same codes but in
opposite columns. However, the probability that the codes were
accidentally transposed somewhere in the path from the real world to
the database is orders of magnitude higher than the probability that
those codes are correct and actually represent two different
individuals. Sounds like a nasty assumption, I know, but I've looked
into it quite a bit and in every case I've found (somewhere in the
hundreds, out of around 10000 records) the evidence from other fields
in the records overwhelmingly pointed to errors in the data, as
opposed to those records actually representing unique individuals.

I should also have mentioned that it is possible for these codes to be
lost and replaced (they refer to physical markers) - in which case an
single individual can be represented by more than one combination of
left and right codes. It's a very messy situation, and I have a
feeling that automating the process of identifying which records
represent the same individual may not be possible, but was hoping
someone might have some ideas for me. I've built a form where users
can select from a group of records with "similar" codes to identify
the individual, but that doesn't do any good until I've assigned an
individual ID to most of the records.
.

**esn** · May 14th, 2010, 11:55 PM posted to microsoft.public.access.queries

What do you want the output to look like for those that are easy to find
such as the reversed field?

That's the next hurdle - once I have those records grouped correctly,
the idea is to make a temporary table from them, then use the
autonumber ID from the temporary table as an "IndividualID," and
insert that ID back into the original records. The eventual result
would be, using the example above,

RecordID Left Right IndividualID
1 1001 1002 1
2 1001 1002 1
3 1002 1001 1
4 1003 1004 2
5 1005 1006 3
6 1005 1009 3
7 1005 1006 3

There are more elegant ways to do that as well, as any search for
"sequential numbering group" will turn up. That would be fine too -
the initial problem is getting access to understand that those records
comprise a group, so that I can tackle the problem of numbering them.

Thread Tools
Show Printable Version Email this Page
Display Modes
Linear Mode Switch to Hybrid Mode Switch to Threaded Mode