From: Mathematisch on
Hi,

I hope somebody understands the problem here and can come out with
some good ideas/suggestions.

What would be a good binary relationship definition in this following
scenario:

I do a pairwise comparison in "a set of items" to find out how many
"features" a given pair of items share. The number of items in the set
is about ~55000. So the number of computations goes into > 1 billion.


After the computation I get a result table which shows how many
features each pair of items share with each other.

On the basis of this table I would like to create clusters using the
transitive closure (If you have suggestions for any other better
suited method, please do tell).

At the moment I do not know what would be the best choice for the
criterion on: "the minimum number of features a pair of items need to
share, so that this pair is considered as related". Is there any
mathematical method to specify an optimal number or a sound
approximation for that minimum number? Does it make sense to create/
guess that minimum number on the basis of e.g. average number of items
shared between all the item pairs?


Thanks a lot.
Regards, M.
From: William Elliot on
On Fri, 18 Jun 2010, Mathematisch wrote:

> What would be a good binary relationship definition in this following
> scenario:
>
> I do a pairwise comparison in "a set of items" to find out how many
> "features" a given pair of items share.
>
> After the computation I get a result table which shows how many
> features each pair of items share with each other.
>
f(a,b) = f(b,a) = n

> On the basis of this table I would like to create clusters using the
> transitive closure (If you have suggestions for any other better
> suited method, please do tell).
>
What binary relation?
The result table is a symmetric two place function.

> At the moment I do not know what would be the best choice for the
> criterion on: "the minimum number of features a pair of items need to
> share, so that this pair is considered as related". Is there any
> mathematical method to specify an optimal number or a sound
> approximation for that minimum number?

No, those are heuristic decisions.

> Does it make sense to create/guess that minimum number on the basis of
> e.g. average number of items shared between all the item pairs?
>
Why not? That's the usual, average rule of thumb.

What are you trying to do? Order the items by similarity?
For that you'll need a base line.

How much are you offering for technical consultation?

From: Chip Eastham on
On Jun 18, 4:28 am, Mathematisch <mathemati...(a)gmail.com> wrote:
> Hi,
>
> I hope somebody understands the problem here and can come out with
> some good ideas/suggestions.
>
> What would be a good binary relationship definition in this following
> scenario:
>
> I do a pairwise comparison in "a set of items" to find out how many
> "features" a given pair of items share. The number of items in the set
> is about ~55000. So the number of computations goes into > 1 billion.
>
> After the computation I get a result table which shows how many
> features each pair of items share with each other.
>
> On the basis of this table I would like to create clusters using the
> transitive closure (If you have suggestions for any other better
> suited method, please do tell).
>
> At the moment I do not know what would be the best choice for the
> criterion on: "the minimum number of features  a pair of items need to
> share, so that this pair is considered as related". Is there any
> mathematical method to specify an optimal number or a sound
> approximation for that minimum number? Does it make sense to create/
> guess that minimum number on the basis of e.g. average number of items
> shared between all the item pairs?
>
> Thanks a lot.
> Regards, M.

To say that you will "create clusters using the transitive closure"
after forming the table counting shared features doesn't give much
information about why you want to do this & whether another approach
would be better suited.

Are there a finite number of "features"? (If so this obviously
bounds the number that any pair of items can share.) Should three
items that share the _same_ ten features be treated the same as
three items that share pairwise ten features but have no features
in common to all three?

However the method you suggest can certainly be tried to see if
it suits your purpose. It seems to me that while no criterion
has been proposed that makes one threshold "optimal" or "sound",
it is clearly the case that starting with a higher threshold &
reducing it will decrease monotonically the number of equivalence
classes.

You might be interested in this approach to clustering:

[Hierarchical clustering]
http://www.ics.uci.edu/~eppstein/280/tree.html

regards, chip