If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below. |
|
|
Thread Tools | Display Modes |
#1
|
|||
|
|||
Merging duplicate entries in one table
Hello,
I work in a theatre and we use access for our mailing list and a FoxPro program for our ticketing software. The FoxPro program is not versatile when it comes to database maintenance (Choice ticketing). It is limited to deleting and changing customers one by one. Unfortunately, someone added several mailing lists to our ticketing software without deleting duplicates in access or excel first. I have 27,000 entries in that program which I would guess about 10,000 are duplicates. My goal is to have one mailing list in the office which has any additional data extracted from Choice but without duplicates. So far I have made some progress. My [Main] list and my [Choice] list are individually de-duped and uniform. By uniform I mean they all have consistent formatting (i.e. ave. is avenue and st. is street or saint, all mr. or mrs. are put in a prefix field, etc.) I deduped everything manually using different duplicate queries like first, last/last, address/first, address/and phone number. I then merged main with choice, by making sure all the columns were uniform and appending data from one to the other. I am using checkboxes for fields that determine what type of shows someone is interested in, what shows they attended, if they have donated money, etc. I have an extremely daunting task. I have several thousand duplicates - but all the checkboxes do not match up. I want to merge my entries like so: When [Last] & [Address] is the same Choose [Phone] with number over [Phone] with no number And choose yes over no for [Film] [Cabaret] [Blues] or [Theatre] If [First] has different values then keep [First] with phone number Is that possible? I'm not experienced in sql or the union function but can follow instructions. Thanks in advance for any help Charles |
#2
|
|||
|
|||
Merging duplicate entries in one table
Charles:
Firstly treat the following as guidance only as its difficult to be categorical without seeing exactly how the data is currently represented. Start by selecting all distinct Last and Address values from your combined Main and Choice tables, which I've called YourTable below: SELECT DISTINCT Last, Address FROM YourTable ; Insert these rows into a new table which you've already designed with columns CustomerID (autonumber), First, Last, Address and Phone. I've called this table NewTable below. Select all rows from YourTable where the phone column has a number: SELECT Last, Address, First, Phone FROM YourTable WHERE Phone IS NOT NULL; Join this to NewTable on the Last and Address columns and UPDATE the Fisrt and Phone columns in NewTable with the values of the First and Phone columns from YourTable. Select all rows from YourTable with empty (NULL) Phone columns where a row does not yet exist in NewTable with a First value for the current Last and Address values: SELECT Last, Address, First FROM YourTable WHERE Phone IS NULL AND NOT EXISTS (SELECT * FROM NewTable WHERE NewTable.Last= YourTable.Last AND NewTable.Address = Yourtable.Address AND NewTable.First IS NULL); Join this to NewTable on the Last and Address columns and UPDATE the First column in NewTable with the value of the First column from YourTable. That should create a list of all customers. As regards the type of shows someone is interested in, what shows they attended, if they have donated money, etc, having separate Boolean (Yes/No) columns for each of these is completely the wrong design in a relational databse. You should have separate tables of Shows etc. You then need to model the many-to-many relationships between them and NewTable (lets now call it Customers) with other tables which each contain foreign key columns referencing the primary keys of the other two, so a ShowsAttended table would have columns CustomerID and ShowID. Its difficult to give precise guidance how to populate these tables without knowing exactly how this data is currently represented, but essentially you will need to execute a series of 'append' queries which join the new Customers table to the original YourTable table on the Last and Address columns and inserts rows into tables such as ShowsAttended. Ken Sheridan Stafford, England "CharlesZ" wrote: Hello, I work in a theatre and we use access for our mailing list and a FoxPro program for our ticketing software. The FoxPro program is not versatile when it comes to database maintenance (Choice ticketing). It is limited to deleting and changing customers one by one. Unfortunately, someone added several mailing lists to our ticketing software without deleting duplicates in access or excel first. I have 27,000 entries in that program which I would guess about 10,000 are duplicates. My goal is to have one mailing list in the office which has any additional data extracted from Choice but without duplicates. So far I have made some progress. My [Main] list and my [Choice] list are individually de-duped and uniform. By uniform I mean they all have consistent formatting (i.e. ave. is avenue and st. is street or saint, all mr. or mrs. are put in a prefix field, etc.) I deduped everything manually using different duplicate queries like first, last/last, address/first, address/and phone number. I then merged main with choice, by making sure all the columns were uniform and appending data from one to the other. I am using checkboxes for fields that determine what type of shows someone is interested in, what shows they attended, if they have donated money, etc. I have an extremely daunting task. I have several thousand duplicates - but all the checkboxes do not match up. I want to merge my entries like so: When [Last] & [Address] is the same Choose [Phone] with number over [Phone] with no number And choose yes over no for [Film] [Cabaret] [Blues] or [Theatre] If [First] has different values then keep [First] with phone number Is that possible? I'm not experienced in sql or the union function but can follow instructions. Thanks in advance for any help Charles |
#3
|
|||
|
|||
Merging duplicate entries in one table
Ken,
I am not sure if I really understood your answer. I am truly a beginner with Access. Let me clarify my situation a little. I am not the Marketing Director at my theatre but I think he is pretty satisfied with the layout of the master list. We have about 100 shows every year which all fall under about 7 different categories. It would be annoying to have to do a query for 30 different shows over a period of 4 years if we were just looking for all of our customers who were interested in bluegrass. Currently our mail goes to customers who have expressed interest in a certain type of show. I would like to add to that list people who have bought tickets to a certain type of show as well as new columns for the best attended shows. That information is contained in the "Choice List". The mailing list that we use is the "Mainlist". I guess I was hoping for a step by step procedure for implementing a sql code that would merge the duplicates as I said originally: When [Last] & [Address] is the same Choose [Phone] with number over [Phone] with no number And choose yes over no for [Film] [Cabaret] [Blues] or [Theatre] If [First] has different values then keep [First] with phone number I recognize that this may be difficult, so I got to thinking about another way that I could do this. First I would create a find duplicates query which would show one yes/no column (FILM for instance). Then I would append code to the original query to find duplicates which have different Yes/No values for that column. Since one would be yes and one would be no and I would prefer a yes value over a no value I would simply copy and paste all the yes values on to the no values and then do a similar query for the next column. That way all of my data would be uniform across the board and I could start deleting duplicate values without worrying about lost data. Hopefully you could help me out with that? Thank you, Charles |
#4
|
|||
|
|||
Merging duplicate entries in one table
Charles:
It might help if we go back to first principles and quickly outline how relational databases model the part of the real world they are concerned with and how this applies in your case. The database relational model was originally developed around 1970 by Codd and has become the dominant model for business databases.. Over the years the model has been developed by a vast amount of published research, and the SQL language has similarly undergone ongoing development. The important thing, however, is that the model is based on firm scientific principles, so when the 'rules' of the model are respected applications can be relied upon to work correctly. A well designed relational database identifies entity types in the real world and represents each of these as tables. The attributes of an entity type are represented by columns in the tables. Its important that there is no redundancy (or at least as little as is possible without losing information) in the tables and this is ensured by means of an analytical processknown as normalization. Tables are said to be in 'normal forms' which range from First Normal Form (1NF) to Fifth Normal Form (5NF). Other normal forms beyond this have been identified, but these are rather bizarre in nature and we don't need to concern ourselves with them here. In fact for the momemnt we need only concern ourselves with the first three. Normalization is defined in terms of the functional dependencies within a table (and in the case of 5NF across tables). A column in a table is sid to be functionally dependent on another column (or columns) if the value in the column is determined by the value in the other column. If I had CustomerID 42 in your database then that determines my FirstName value as 'Ken' and nothing else and my LastName value as 'Sheridan' and nothing else because the FirstName and LastName columns in a Cutomers table are functionally dependent on the CustomerID column. To be in Third Normal Form all non-key field in the table must be functionally dependent on the whole of the primary key and nothing but the primary key. So a Customers table with these three columns is fine as the two non-key columns FirstName and LastName are dependent on the primary key CustomerID and nothing else. Say I go to some shows and you have a ShowsAttended table with columns ShowName, CustomerID, FirstName and LastName. This table is not correctly normalized because FirstName and LastName are not functionally dependent on the whole of the table's primary key (The combination of Showname and CutsomerID in this case), but on part of it (CustomerID). It would be perfectly possible to have CustomerID 42 and completely different FirstName and LastName values to mine in the same row. If we take out the FirstName and LastName columns from this shows table then the CustomerID maps back to my row in Customers so my names are always correct. But the table is still not good enough. Its theoretically possible there might be two different shows with the same name (this certainly happens in the cinema), so we don't know which one I actually attended. Also it would be very easy for the Showname to be entered inconsistently in different rows (I once found three versions of myself amongst the authors of technical articles in a database I worked on!). What we need is a Shows table with ShowID (a unique number) and ShowName columns. In ShowsAttended we the have CustomerID and ShowID columns, plus columns such as DateAttended. This ShowsAttended table is actually modelling a special kind of entity type because it is a relationship type between Customers and Shows modelling the many-to-many relationship between those entity types. When it comes to types of shows, this is another entity type so we need a Categories table with columns CategoryID, Category. A show might be categorised as more than one type, however, Jazz and Blues say, so we need a table, ShowCategories, to model the many-to-many relationship between Shows and Categories, with columns ShowID and CategoryID. With his set of tables we are able to extract whatever information we need from the data because it is a solid model of the part of the real world we are interested in. To find out which customers attended shows categorised as Blues or Country say we'd join the Customers, ShowsAttended, Shows, ShowCategories and Categories in a query and restrict that query's result set to rows where the Category was Blues or Country. Customer interests regardless of whether they've attended any shows of those categories is another entity type, so is modelled by another table CustomerInterests with columns CustomerID and CategoryID. So if Jazz is category 6 and Blues is category 9 and I' have interests in each there would be two rows in this table for me with values 42;6 and 42:9. This table can then also be incorporated in the query so it returns not only customers who’ve attended shows of particular categories, but also who've expressed an interest in them. I hope this has given you an idea of how the sort of real world scenario you are dealing with can be robustly modelled in a relational database. The way of modelling it by means of a table with many different Boolean (Yes/No)columns to represent customer interests is not only fundamentally wrong in a relational database, but would make it very difficult to extract useful information. The beauty of a solidly designed logical model is that, because it is a scientifically sound representation of the underlying realities, it enables you to extract whatever information those realities can provide. The tools for doing this, particularly the SQL language, provide a very powerful and flexible armoury for doing so. If I were in your shoes and faced with the task of producing a solid database which will not only fulfil your immediate requirements but provide a firm basis for adding other functionality, the first thing I would do would be to design the logical model in terms of the entity types and relationships between them. I'd do this, not on my machine, but with a good supply of pencils and paper, drawing out the model diagrammatically, and testing it by asking whether it can not only give me the right answers, but also whether there is anything in the design which might lead it to give me the wrong answers, which is what results from a badly normalized model. Only then would I build the 'back end' database as a set of empty tables and the relationships between them along with the 'front end' interface for entering, editing and querying the database. Finally I would transfer the data from its present sources using various 'append' and 'update' queries. The exact nature of these would depend on the final design of the logical model and the nature of the existing source data, but the SQL statements I posted in my earlier post show the general principles involved in the first stage, producing the Customers table itself. Inserting the Shows and Categories data should not be difficult. The trickier work is in populating the other tables which represent the relationships between these main entity types. The process tends to be tedious rather than difficult, however. Ken Sheridan Stafford, England "CharlesZ" wrote: Ken, I am not sure if I really understood your answer. I am truly a beginner with Access. Let me clarify my situation a little. I am not the Marketing Director at my theatre but I think he is pretty satisfied with the layout of the master list. We have about 100 shows every year which all fall under about 7 different categories. It would be annoying to have to do a query for 30 different shows over a period of 4 years if we were just looking for all of our customers who were interested in bluegrass. Currently our mail goes to customers who have expressed interest in a certain type of show. I would like to add to that list people who have bought tickets to a certain type of show as well as new columns for the best attended shows. That information is contained in the "Choice List". The mailing list that we use is the "Mainlist". I guess I was hoping for a step by step procedure for implementing a sql code that would merge the duplicates as I said originally: When [Last] & [Address] is the same Choose [Phone] with number over [Phone] with no number And choose yes over no for [Film] [Cabaret] [Blues] or [Theatre] If [First] has different values then keep [First] with phone number I recognize that this may be difficult, so I got to thinking about another way that I could do this. First I would create a find duplicates query which would show one yes/no column (FILM for instance). Then I would append code to the original query to find duplicates which have different Yes/No values for that column. Since one would be yes and one would be no and I would prefer a yes value over a no value I would simply copy and paste all the yes values on to the no values and then do a similar query for the next column. That way all of my data would be uniform across the board and I could start deleting duplicate values without worrying about lost data. Hopefully you could help me out with that? Thank you, Charles |
#5
|
|||
|
|||
Merging duplicate entries in one table
Ken,
As helpful and interesting as that was it didn't really answer my question which was: Can you run a find duplicates query which only returns results in which the duplicates have different values in one specific yes/no field? At my theatre we run Choice ticketing software which does have a relational database in it. I am able to look up any customer and find out what shows they attended and when you enter a new event in the system you do in fact have to come up with a unique id for that code and you can select a price code or create a new one to associate it with. You can also say what interest codes you want associated with that event. When a customer buys a ticket it will automatically put down an interest code or two for them. You can also manually type in interest codes for a specific customer if they call you and say they suddenly have an interest in dance, where they never did before. You can run queries that rely on all this data. It is a very complete program - but you have to change, merge, and delete customers one by one. My problem is that I am dealing with 10,000 duplicates and some missing information where non-customers called the administrative office to be put on the mailing list. My other problem is that I need to come up with a comprehensive mailing list in just a couple months. My plan is this: I want a mailing list by interest code (forget shows and accounting.. our ticketing software will take care of that) I simply want to fill the 275 seats per show by sending out advertisements to those with interest in certain types of shows. I am going to delete all non-ticket buyers from our ticketing software, eventually, and eventually I am going to put back names of non-ticket buyers once i have taken out all the duplicates in access. Once I have reached that final goal - we can safely use the ticketing software for a more complete relational database. My first task at hand is to clean up the mailing list before I begin to work on a relational database in our ticketing software or in access. Thanks, Charles |
#6
|
|||
|
|||
Merging duplicate entries in one table
Charles:
I think I have answered your question to a greater degree than you realize, but I'll try and spell it out in more detail below. The way to eliminate duplicates is not to get rid of the redundant rows from a badly designed table, which just gives you a badly designed table without duplicates, but to insert the data into a properly designed set of tables. That way the system will do the work for you. For customers' interests you need a simple schema of the following tables: Customers, CustomerInterests, Categories. The Customers table should have columns for the attributes of Customers, i.e. Name and Address and phone. If the name and address are stored as just two columns in your existing tables rather than as FirstName, LastName, and several 'Address Line' columns you should create a unique index on the two columns. However they are stored at present, though, you should initially have the same columns in your new Customers table and index on the set of name and address columns uniquely. Also give the table a CustomerID autonumber column as its primary key. All you then need to do is insert rows into the Customers table with an 'append' query based on your existing table (or several append queries if the existing data is still in more than one table. The indexing in the new Customers table will allow only one row per customer to be inserted no matter how many times they might be duplicated in the existing table(s). To ensure that the existing records with phone numbers are inserted insert those first by testing for Phone IS NOT NULL, e.g. INSERT INTO Customers (Name, Address, Phone) SELECT Name, Address, Phone FROM ExistingTable WHERE Phone IS NOT NULL; Repeat this for whatever 'ExistingTables' you have. Then insert rows without phone numbers: INSERT INTO Customers (Name, Address, Phone) SELECT Name, Address, Phone FROM ExistingTable WHERE Phone IS NULL; Because of the unique indexing on Name and Address only one row will be inserted by the first query for each name and address with a phone number. Similarly the second query will insert only one row for each name and address without a phone number where that name and address hasn't already been inserted by the first query. I imagine you can create the Categories table easily enough by hand as the number of categories is presumably going to be relatively limited. This should have a CategoryID autonumber primary key and a Category text column with values like Blues, Jazz, Rock etc. The CustomerInterests table will have columns CustomerID and CategoryID, both of long integer number data type (not autonumber). Make both columns the composite primary key of the table. To insert rows into this table you need one 'append' query for each Category (which is why I said the process was tedious but not difficult). So for Blues say the query would be: INSERT INTO CustomerInterests(CustomerID, CategoryID) SELECT CustomerID, CategoryID FROM ExistingTable, Customers, Categories WHERE Customers.CustomerName = ExistingTable.CustomerName AND Customers.Address = ExistingTable.Address AND Blues = TRUE AND Category = "Blues" In this query Blues is the name of the Boolean (Yes/No) column in ExistingTable and is the value in the Category column of one row in the categories table. Repeat this for each ExistingTable if you have more than one. Only one row for each CustomerID/CategoryID combination will be inserted by virtue of the unique indexing of the two columns in CustomerInterests which is implicit in their being defined as the primary key. Run variations on this query for each category, simply changing the two references to Blues to the category in question, in one case as the name of the Boolean column in the ExistingTable and in the other as the value in the Categories table. Once you grasp the principles involved in this sort of thing its very easy to implement it for a more complex schema involving a larger number of tables. Fundamentally all it involves is first filling the tables which represent the main entity types, Customers and Categories here, with distinct values, then by joining them to the original table inserting rows to represent the relationships between the main entity types. In case you are wondering how the tables are joined in the last query without it having any JOIN clause the answer is that the joins are done by 'join criteria' in the WHERE clause. This was the only way of doing it in SQL until the JOIN operator was introduced in the SQL-92 standard. It was introduced then to cater for OUTER JOINs, which can't be done in the WHERE clause. You'll notice that there is no join criterion in relation to the categories table. This is because only one row is being returned from that table. Without any join criteria every row in each table is joined to every row in the other(s), what is known as the Cartesian product of the tables. As only one row is returned from categories here it returns rows in which the CategoryID value for 'Blues' is returned along with each of CustomerID values from Customers where the name and address match the name and address in those rows of ExistingTable where the Boolean Blues column is True. Ken Sheridan Stafford, England "CharlesZ" wrote: Ken, As helpful and interesting as that was it didn't really answer my question which was: Can you run a find duplicates query which only returns results in which the duplicates have different values in one specific yes/no field? At my theatre we run Choice ticketing software which does have a relational database in it. I am able to look up any customer and find out what shows they attended and when you enter a new event in the system you do in fact have to come up with a unique id for that code and you can select a price code or create a new one to associate it with. You can also say what interest codes you want associated with that event. When a customer buys a ticket it will automatically put down an interest code or two for them. You can also manually type in interest codes for a specific customer if they call you and say they suddenly have an interest in dance, where they never did before. You can run queries that rely on all this data. It is a very complete program - but you have to change, merge, and delete customers one by one. My problem is that I am dealing with 10,000 duplicates and some missing information where non-customers called the administrative office to be put on the mailing list. My other problem is that I need to come up with a comprehensive mailing list in just a couple months. My plan is this: I want a mailing list by interest code (forget shows and accounting.. our ticketing software will take care of that) I simply want to fill the 275 seats per show by sending out advertisements to those with interest in certain types of shows. I am going to delete all non-ticket buyers from our ticketing software, eventually, and eventually I am going to put back names of non-ticket buyers once i have taken out all the duplicates in access. Once I have reached that final goal - we can safely use the ticketing software for a more complete relational database. My first task at hand is to clean up the mailing list before I begin to work on a relational database in our ticketing software or in access. Thanks, Charles |
Thread Tools | |
Display Modes | |
|
|
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
Add New Field to DB | Karen | Database Design | 7 | October 19th, 2005 08:03 PM |
duplicate entries in table of contents | Rob | Tables | 3 | September 29th, 2005 12:06 PM |
unable to repair inobox | Sudheer Mumbai | General Discussion | 1 | February 20th, 2005 11:55 AM |
Table of Contents contains duplicate entries | Zimran Douglas | Tables | 1 | December 8th, 2004 11:09 PM |
Manual line break spaces on TOC or Table of tables | Eric | Page Layout | 9 | October 29th, 2004 04:42 PM |