Grouping by pattern discovery
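Given: D, a set of domains
K := {} % start with an empty set of (pattern, match set) pairs
Repeat for i = 1, 2, ... until D is empty or no “good” pattern is found:
    find a “good” pattern Pi for an acceptable subset Si of the examples D
    D := D - Si % remove match set Si from D
    K := K ∪ {(Pi,Si)} % add (pattern Pi and match set Si) to K
Output: K, the set of (Pattern,MatchSet) pairs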
Note that it is not guaranteed that any Pi matches only domains from Si and none from another Sj (j ≠ i). That is, the grouping is not a partition, and Pi is therefore characteristic of Si, not a classifier function.
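The loop above can be sketched in Python as follows. This is a minimal illustration, not the actual implementation: `group_by_patterns` and `toy_find_pattern` are hypothetical names, and the toy finder (which proposes `*.<tld>` for the most common top-level domain) merely stands in for the real pattern discovery step, which is not specified here. Stopping when no pattern is found is also an assumption.

```python
from collections import Counter
from fnmatch import fnmatchcase


def toy_find_pattern(domains):
    """Hypothetical stand-in for pattern discovery.

    Proposes '*.<tld>' for the most common top-level domain and returns
    (pattern, match_set), or None if no TLD is shared by two or more domains.
    """
    if not domains:
        return None
    counts = Counter(d.rsplit(".", 1)[-1] for d in domains)
    # Sort first so that ties are broken alphabetically (deterministic).
    tld, n = max(sorted(counts.items()), key=lambda kv: kv[1])
    if n < 2:
        return None
    pattern = "*." + tld
    matches = {d for d in domains if fnmatchcase(d, pattern)}
    return pattern, matches


def group_by_patterns(domains, find_pattern):
    """Greedy grouping: repeatedly find a pattern and remove its match set."""
    remaining = set(domains)   # D
    groups = []                # K, kept as an ordered list of (Pi, Si)
    while remaining:
        found = find_pattern(remaining)
        if found is None:      # assumed stopping rule: no "good" pattern left
            break
        pattern, match_set = found
        if not match_set:      # guard: never loop without making progress
            break
        remaining -= match_set               # D := D - Si
        groups.append((pattern, match_set))  # K := K ∪ {(Pi, Si)}
    return groups, remaining


domains = ["a.com", "b.com", "c.org", "d.org", "e.org", "x.net"]
groups, leftover = group_by_patterns(domains, toy_find_pattern)
for pattern, match_set in groups:
    print(pattern, sorted(match_set))
print("unmatched:", sorted(leftover))
```

Because each Pi is searched for only in what remains of D, nothing in this loop checks Pi against match sets removed earlier, which is why a pattern is not guaranteed to be exclusive to its own Si.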
- A “good” pattern P matching an acceptable subset S of examples D is one for which the function F(G(P), C(S,D)) is below a given threshold (“pruneval”).
- At present F(G(P), C(S,D)) = log(G(P)) * C(S,D)
- G(P) is the goodness of pattern P, where “goodness” is given by a measure of compression
- C(S,D) is the cover value |S|/|D| where |X| is the number of items in set X
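The acceptance test described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the goodness value G(P) is passed in as a positive number because the compression measure itself is not defined here, and the natural logarithm is assumed since the base is not given.

```python
import math


def cover(match_set, domains):
    """C(S,D) = |S| / |D|."""
    if not domains:
        raise ValueError("D must be non-empty")
    return len(match_set) / len(domains)


def score(goodness, cover_value):
    """F(G(P), C(S,D)) = log(G(P)) * C(S,D); natural log is an assumption."""
    if goodness <= 0:
        raise ValueError("G(P) must be positive for the logarithm to exist")
    return math.log(goodness) * cover_value


def is_good(goodness, match_set, domains, pruneval):
    """A pattern is accepted when F is below the pruneval threshold."""
    return score(goodness, cover(match_set, domains)) < pruneval


D = {"a.com", "b.com", "c.org", "d.org"}
S = {"a.com", "b.com"}
print(cover(S, D))               # 0.5
print(is_good(0.5, S, D, -0.1))  # True: log(0.5) * 0.5 is about -0.35
```

If G(P) happens to be a ratio below 1, log(G(P)) is negative, so a larger cover pushes F further below the threshold. Whether G is actually scaled that way is not stated in these notes.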