From: Bill Mudry on 16 May 2010 01:21

I thank Tyrael and Neil for their prior efforts a few days ago. Unfortunately their suggestions did not work. I have searched the web and tried many different combinations, but I am surprised that it is this hard to add records from one table of about 11,300 records into a master table I use with about 6,550 relevant records, while stopping records from the larger table that would duplicate what is already in the smaller one. This has made me think that, first, perhaps I have to get back to understanding some of the basics of how various SQL commands and statements work, and also how they do NOT work. Please help me confirm as true what I say below, or else correct my understanding.

ISSUE 1

DISTINCT is commonly used to get rid of duplicates within ONE specific table. The records found to be duplicates are always in that ONE table; duplicates are never defined as duplicate against records in another table. I have never seen any example or reference in which DISTINCT can be used to prevent new records coming into a table from being duplicates of records already in that table. Is this not true? Correct me if I am wrong. (There is a small sketch of how I understand DISTINCT behaving, just after ISSUE 2 below.)

This massive addition of records to the master table is exactly the scenario I am trying for, while not introducing new duplicates common to both tables. The table accepting new records must not have any records erased in an effort to avoid duplicates; it simply has too much additional data worth preserving. The table to be appended to (called species_master) has the same four column names as the first table with the 11,300 records (which I named tervuren_target), but with many more fields that would tag along for the ride during a merge of the two. The tervuren_target table has already been cleaned of its own duplicate records.

Therefore, if my comment on DISTINCT is true, and it cannot be used to prevent new duplicates from forming when records are transferred from one table to another, then in the append or merge I am trying for, using DISTINCT to stop new duplicates would be a waste of time. True or false?

ISSUE 2

When using a JOIN, we usually equate a foreign key in one table to the primary key of the table being joined to. In my case, however, I am not talking about a relational connection between the two tables. I first had to merge the genus and epithet columns in the tervuren_target table to create one-column species names in a column called species_name. Once that was done, all the columns I want to copy over into the master species table were deliberately given the same names as in species_master. There is no hierarchy between them, only more of the same kind of data coming in from a HUGE source.

So, on a join (or any other effort to bring over the data WITHOUT duplicates), the condition is normally written as table1.column = table2.column. In this case, the comparison would be tervuren_target.species_name = species_master.species_name. BUT wait! That would allow every record with a species_name equal to one already in the master table to come flying in, right? Wow, that is the very OPPOSITE of what I want! Unless there is something I do not know yet, that seems like it would let in nothing BUT duplicates, which is totally and absolutely not acceptable. Therefore, with joins, is it only the equals operator that can be used? I am leaning toward thinking that the following is not allowed, even though it spells out the condition I really need: tervuren_target.species_name <> species_master.species_name.
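To make ISSUE 1 concrete, this is how I understand DISTINCT behaving in an INSERT ... SELECT against my two tables. It is only a sketch of my understanding, not something I am claiming is the right way to do this:

-- As I understand it, DISTINCT here only removes duplicate rows within the result of
-- this one SELECT on tervuren_target; it never compares anything against what
-- species_master already holds, so duplicates between the two tables would still go in.
INSERT INTO species_master (genera, name, species_name, authorities, species_source)
SELECT DISTINCT genera, name, species_name, authorities, species_source
FROM tervuren_target;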
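And to make ISSUE 2 concrete, this is roughly the join I was picturing when I wrote that <> condition. Again, only a sketch of what I mean, not a query I am claiming works:

-- This spells out the condition I actually need (names NOT already in species_master),
-- but as far as I can tell an inner join on <> would pair each tervuren_target row with
-- every master row whose name merely differs, which is not the filter I am after.
SELECT tervuren_target.species_name
FROM tervuren_target
JOIN species_master
  ON tervuren_target.species_name <> species_master.species_name;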
So please do correct me if my understanding is false or inadequate. For now, assuming I am correct, just how does one add further records into a table from another table with parallel fields, while stopping any incoming records that would create duplicate data (in this case, duplicates in the species_name field)? Because of the close similarity of the two table structures, and because the data merging into the same key column is guaranteed to create duplicates (over 3,000 of them!), I am now wondering whether a JOIN can be used, should be used, or is relevant at all. I still need to find how to merge over all those tervuren records without creating NEW duplicates in the process. First, though, there is one more issue.

ISSUE 3

The script that was suggested to me is:

INSERT INTO species_master (genera, name, species_name, authorities, species_source)
SELECT DISTINCT tervuren_target.genera, tervuren_target.name, tervuren_target.species_name,
                tervuren_target.authorities, tervuren_target.species_source
FROM tervuren_target
LEFT JOIN species_master
  ON tervuren_target.species_name = species_master.species_name
WHERE species_master.species_name IS NULL;

My next issue is about the last line:

WHERE species_master.species_name IS NULL

I believe I was told this was to help prevent duplicates, but I cannot see how. Here is my understanding of why; perhaps I do not know enough about how that WHERE clause is parsed and made to work. Before the tables are merged, all records in species_master.species_name have species names in them. That column just does not have ANY NULL entries! Am I not correct, then, to say that there is absolutely nothing to satisfy the WHERE condition? Nothing can happen! (I have pulled the SELECT part of that script out on its own in the P.S. at the end of this post, so it is easier to talk about.)

One other thing just came to mind. In all my efforts, as I ran different script attempts, I would frequently realize I had to state fields as "table.column", only to find that the system would say it does not know that column. Strange, thought I. Why would it do that? Did I still do something wrong or incomplete?

One last word before I am told I have been off topic. My entire project is done in PHP with storage in MySQL. I was hoping that doing the above merge of data in PHP might be overkill if SQL can handle it. I am still quite open to a PHP script solution if that is better or if SQL just cannot cut it. So ...... <mild sigh> ...... what really will do the job?

Always thankful for help,
Bill Mudry
Mississauga, Ontario
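P.S. To make the last line of the suggested script easier to talk about, here is its SELECT part pulled out on its own, with table aliases added so every field is stated as table.column (tt, sm, and the two output labels are just shorthand of my own):

-- The same LEFT JOIN as in the suggested INSERT, run by itself so the result of the
-- join can be looked at before any rows are actually inserted.
SELECT tt.species_name AS incoming_name,
       sm.species_name AS master_name
FROM tervuren_target AS tt
LEFT JOIN species_master AS sm
  ON tt.species_name = sm.species_name;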