Hi, this one has me stumped. Any help would be appreciated.
Client 1 has the same postal address as client 3
Client 2 has the same email address as client 3
So even though clients 1 and 2 have different postal and email addresses we
know they are dupes of each other.
So in my table clients_dupes I have
client_id dupe_id
1 3
2 3
and I need to add in (1,2)
insert into clients_dupes (client_id, dupe_id)
select distinct cd1.client_id, cd2.client_id
from clients_dupes cd1
inner join clients_dupes cd2 on cd1.dupe_id = cd2.dupe_id
where not exists (select * from clients_dupes cd3 -- skip pairs already in there
                  where cd3.client_id = cd1.client_id
                  and cd3.dupe_id = cd2.client_id)
Then, if @@rowcount > 0, I rerun it, as the insert could have created more links; usually 2 or 3 passes will match all dupes with all others.
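As an aside, the repeated re-run-until-no-rows approach can be replaced by computing the transitive closure once with a union-find (disjoint-set) structure and writing all pairs back in a single insert. This is a standard technique, not something from the thread; a minimal Python sketch, using the thread's (1,3) and (2,3) pairs as input:

```python
# Sketch: compute the full dupe closure in one pass with union-find,
# instead of re-running the self-join insert until @@rowcount = 0.
# Input: the (client_id, dupe_id) pairs already in clients_dupes.

def find(parent, x):
    # Path-compressing find: walk up to the root, flattening as we go.
    while parent.setdefault(x, x) != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def dupe_closure(pairs):
    parent = {}
    for a, b in pairs:                      # union every known dupe pair
        parent[find(parent, a)] = find(parent, b)
    groups = {}
    for x in parent:                        # bucket clients by their root
        groups.setdefault(find(parent, x), []).append(x)
    # Emit every (client_id, dupe_id) combination within each group,
    # including the (x, x) self-pairs.
    return sorted((a, b) for g in groups.values() for a in g for b in g)

print(dupe_closure([(1, 3), (2, 3)]))
```

One pass over 700,000 pairs is near-linear work, so nothing like the log growth the repeated self-join insert causes.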
Easy so far, eh? However, clients_dupes contains 700,000 records, so this takes an absolute age and fills up my 15 GB log file before it finishes. It works in theory but not in practice. Any ideas? Thanks in advance.

"Sylvain Lafontaine" replied:

I don't understand why you don't consider that Client 3 is also a dupe. The simple fact that Client 1 has the same address as Client 3 should tell us that both 1 and 3 are dupes of each other. In the same way, Client 2 has the same email address as Client 3, so 2 and 3 should be considered dupes too; the final solution is not (1, 2) but (1, 2, 3).
You can argue that two different clients can share the same postal address and still be different people, but you could argue the same thing about the email address: two people can have the same postal address and the same email address and still be two different persons.
In your case, memory usage goes through the roof because you probably have a combinatorial explosion stemming from the fact that you are inner-joining the same table with itself twice. You should take a look at the query plan to see what happens.
Possible first solution: build an inner join between 1 and 3, UNION it with an inner join between 2 and 3, then GROUP BY on 3 and extract everything with a HAVING clause on COUNT >= 2.
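To make the UNION-of-self-joins part of that suggestion concrete, here is one possible reading of it, sketched with Python's sqlite3 standard-library module so it runs anywhere. The clients table layout and sample rows are invented for the example, not taken from the thread:

```python
import sqlite3

# Sketch: generate dupe pairs by self-joining clients on the postal
# address, then UNION with a self-join on the email address.  The
# result can seed clients_dupes (it still needs a closure step to
# derive indirect pairs such as (1, 2)).
con = sqlite3.connect(":memory:")
con.executescript("""
    create table clients (id integer primary key, postal text, email text);
    insert into clients values
        (1, '1 High St', 'one@example.com'),
        (2, '2 Low Rd',  'three@example.com'),
        (3, '1 High St', 'three@example.com');
""")
pairs = con.execute("""
    select a.id, b.id
    from clients a join clients b
      on a.postal = b.postal and a.id <> b.id   -- same postal address
    union
    select a.id, b.id
    from clients a join clients b
      on a.email = b.email and a.id <> b.id     -- same email address
    order by 1, 2
""").fetchall()
print(pairs)
```

With the sample data this yields the direct pairs (1,3), (2,3) and their mirrors, matching the thread's starting point.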
Possible second solution: sort by the postal address and use a cursor to retrieve all clients that have the same address. Repeat the process with the email address. This should keep memory usage at the lower end.
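That cursor approach amounts to sorting on one key, scanning the sorted rows once, and emitting pairs within each run of equal keys. A rough Python equivalent of that loop (itertools.groupby over sorted rows; the sample rows are invented to mirror the thread's example):

```python
from itertools import groupby

# Sketch of the sort-then-scan idea: sort clients on one key (postal or
# email), walk the sorted rows once, and emit dupe pairs inside each run
# of equal keys -- the same work the suggested cursor would do.
def pairs_by_key(rows, key):
    out = []
    for _, grp in groupby(sorted(rows, key=key), key=key):
        group = list(grp)
        out += [(a["id"], b["id"])
                for a in group for b in group if a["id"] != b["id"]]
    return out

clients = [  # invented sample rows matching the thread's example
    {"id": 1, "postal": "1 High St", "email": "one@example.com"},
    {"id": 2, "postal": "2 Low Rd",  "email": "three@example.com"},
    {"id": 3, "postal": "1 High St", "email": "three@example.com"},
]
dupes = set(pairs_by_key(clients, lambda r: r["postal"]))
dupes |= set(pairs_by_key(clients, lambda r: r["email"]))
print(sorted(dupes))
```

Only one group of equal keys is ever held in memory at a time, which is why this stays cheap even on large tables.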
(BTW, in the case of the postal address, fuzzy matching will catch addresses that are not exactly identical but are close enough to be considered the same. DTS 2005 has features for that.)
Sylvain Lafontaine, ing.
MVP - Technologies Virtual-PC
E-mail: http://cerbermail.com/?QugbLEWINF
Hi, thanks for responding, comments inline.
"Sylvain Lafontaine" wrote:
> I don't understand why you don't consider that Client 3 is also a dupe. The
> simple fact that Client 1 has the same address as Client 3 should tell us
> that both 1 and 3 are dupes of each other. In the same way, Client 2 has the
> same email address as Client 3, so 2 and 3 should be considered dupes too;
> the final solution is not (1, 2) but (1, 2, 3).
I was just providing (1,2) as an example; the final table should look like:
1,1
1,2
1,3
2,1
2,2
2,3
3,1
3,2
3,3
so I can take the mailing list and say
delete mailing_list
from mailing_list
inner join clients_dupes cd on cd.client_id = mailing_list.client_id
inner join clients on clients.id = cd.dupe_id
where clients.opted_out = 1
so it will remove anyone who has a dupe that has opted out
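To illustrate that purge end to end, a sqlite3 sketch (schemas and sample data are invented for the example; sqlite has no DELETE ... FROM ... JOIN, so the T-SQL delete above is rewritten as an equivalent IN subquery):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table clients (id integer primary key, opted_out integer);
    create table clients_dupes (client_id integer, dupe_id integer);
    create table mailing_list (client_id integer primary key);
    -- invented sample data: 1, 2, 3 are mutual dupes, 3 opted out,
    -- client 4 is unrelated
    insert into clients values (1, 0), (2, 0), (3, 1), (4, 0);
    insert into clients_dupes values
        (1,1),(1,2),(1,3),(2,1),(2,2),(2,3),(3,1),(3,2),(3,3),(4,4);
    insert into mailing_list values (1), (2), (3), (4);
""")
# Remove anyone whose dupe group contains an opted-out client.
con.execute("""
    delete from mailing_list
    where client_id in (select cd.client_id
                        from clients_dupes cd
                        join clients c on c.id = cd.dupe_id
                        where c.opted_out = 1)
""")
remaining = con.execute("select client_id from mailing_list").fetchall()
print(remaining)   # client 4 has no opted-out dupe, so only 4 survives
```

Note this relies on the self-pairs (x, x) being present in clients_dupes, so that the opted-out client itself is also removed.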
> You can argue that two different clients can share the same postal address
> and still be different people, but you could argue the same thing about the
> email address: two people can have the same postal address and the same
> email address and still be two different persons.
Agreed, but it's a "business decision".
> In your case, memory usage goes through the roof because you probably have a
> combinatorial explosion stemming from the fact that you are inner-joining the
> same table with itself twice. You should take a look at the query plan
> to see what happens.
> Possible first solution: build an inner join between 1 and 3, UNION it with
> an inner join between 2 and 3, then GROUP BY on 3 and extract everything
> with a HAVING clause on COUNT >= 2.
> Possible second solution: sort by the postal address and use a cursor to
> retrieve all clients that have the same address. Repeat the process with
> the email address. This should keep memory usage at the lower end.
Both seem like good ideas, I'll have a go. Many thanks!

Hi Sylvain,
Couldn't get the first way to work but the cursor worked a treat. Many
Thanks!