Linear time Closed item set Miner
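LCM looks for closed itemsets with respect to an input minimum support: itemsets that occur in at least that many transactions and have no proper superset with the same support.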
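To make the notion of a closed itemset concrete, here is a minimal brute-force sketch on a toy dataset. It is not how LCM works internally (LCM avoids this exponential enumeration); the function names `support` and `closed_frequent_itemsets` and the toy transactions are hypothetical and only serve the illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def closed_frequent_itemsets(transactions, min_supp):
    """Brute-force enumeration, exponential in the number of items.

    Illustration only: returns a dict mapping each closed frequent
    itemset (as a frozenset) to its support.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))

    # Step 1: keep every itemset whose support reaches min_supp.
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            candidate = frozenset(candidate)
            supp = support(candidate, transactions)
            if supp >= min_supp:
                frequent[candidate] = supp

    # Step 2: an itemset is closed if no proper superset has the same support.
    # A superset with equal support is itself frequent, so checking against
    # `frequent` is sufficient.
    return {
        itemset: supp
        for itemset, supp in frequent.items()
        if not any(
            itemset < other and supp == other_supp
            for other, other_supp in frequent.items()
        )
    }


toy = [[1, 2, 3], [1, 2], [1, 3], [1, 2, 3]]
closed = closed_frequent_itemsets(toy, min_supp=2)
for itemset, supp in sorted(closed.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), supp)
```

In this toy example all seven non-empty itemsets are frequent, but only four are closed: `[2]`, for instance, is dropped because `[1, 2]` has the same support.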
[1]:
import skmine
print("This tutorial was tested with the following version of skmine :", skmine.__version__)
This tutorial was tested with the following version of skmine : 1.0.0
Load the chess dataset
[2]:
from skmine.datasets.fimi import fetch_chess
chess = fetch_chess()
chess.head()
[2]:
0 [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25...
1 [1, 3, 5, 7, 9, 12, 13, 15, 17, 19, 21, 23, 25...
2 [1, 3, 5, 7, 9, 12, 13, 16, 17, 19, 21, 23, 25...
3 [1, 3, 5, 7, 9, 11, 13, 15, 17, 20, 21, 23, 25...
4 [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25...
Name: chess, dtype: object
[3]:
chess.shape
[3]:
(3196,)
fit_discover()
fit_discover makes pattern discovery more user-friendly by outputting nicely formatted patterns instead of the traditional tabular format used in the scikit community.
[4]:
from skmine.itemsets import LCM
# minimum support of 2000, running on 4 processes
lcm = LCM(min_supp=2000, n_jobs=4)
%time patterns = lcm.fit_transform(chess)
CPU times: user 74.8 ms, sys: 30.1 ms, total: 105 ms
Wall time: 1.75 s
[5]:
patterns.shape
[5]:
(68967, 2)
The format in which patterns are rendered makes post hoc analysis easier.
Here we keep only the patterns whose length is strictly greater than 3:
[6]:
patterns[patterns.itemset.map(len) > 3]
[6]:
| | itemset | support |
---|---|---|
14 | [29, 40, 52, 58] | 3143 |
22 | [29, 52, 58, 60] | 3124 |
26 | [40, 52, 58, 60] | 3112 |
28 | [29, 40, 58, 60] | 3110 |
29 | [29, 40, 52, 60] | 3100 |
... | ... | ... |
68960 | [15, 52, 58, 60] | 2003 |
68962 | [15, 29, 58, 60] | 2002 |
68964 | [29, 40, 58, 70] | 2006 |
68965 | [29, 40, 52, 70] | 2001 |
68966 | [29, 40, 52, 58, 70] | 2000 |
67716 rows × 2 columns
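The same kind of post hoc analysis can be pushed a little further. The sketch below assumes only that the result is a pandas DataFrame with `itemset` and `support` columns, as shown above; it rebuilds a tiny stand-in from a few rows of this tutorial rather than calling skmine, and the column names `length` and `rel_support` are hypothetical additions.

```python
import pandas as pd

# Stand-in for the mined patterns: a handful of rows copied from the outputs
# of this tutorial (not the full result).
patterns = pd.DataFrame({
    "itemset": [[58], [52], [52, 58], [29, 52, 58], [29, 40, 52, 58]],
    "support": [3195, 3185, 3184, 3169, 3143],
})
n_transactions = 3196  # chess.shape[0]

# Pattern length, and support expressed as a fraction of all transactions.
patterns["length"] = patterns["itemset"].map(len)
patterns["rel_support"] = patterns["support"] / n_transactions

# Same filter as above, written against the precomputed column.
long_patterns = patterns[patterns["length"] > 3]

# Number of patterns per length.
counts_by_length = patterns.groupby("length").size()

print(long_patterns)
print(counts_by_length)
```

Storing the length as a column avoids recomputing `map(len)` for every filter and makes grouping by length straightforward.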
Note
Even with a very high minimum support threshold, we discovered more than 60K patterns from only 3196 original transactions. This is a good illustration of the so-called pattern explosion problem.
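A rough way to see why patterns explode: if a group of `n` items co-occurs in enough transactions, every non-empty subset of that group is frequent as well, which gives `2**n - 1` itemsets. The sketch below checks this count by enumeration on a hypothetical toy case; the helper name `count_nonempty_subsets` is an assumption made for the illustration.

```python
from itertools import combinations


def count_nonempty_subsets(items):
    """Enumerate all non-empty subsets of `items` and count them."""
    items = list(items)
    return sum(
        1
        for size in range(1, len(items) + 1)
        for _ in combinations(items, size)
    )


# Enumeration agrees with the closed form 2**n - 1 for small n.
for n in (3, 5, 10):
    assert count_nonempty_subsets(range(n)) == 2**n - 1

# The closed form alone shows how quickly the count grows.
growth = {n: 2**n - 1 for n in (10, 15, 20)}
print(growth)
```

Restricting the output to closed itemsets, as LCM does, removes the subsets that carry no extra information, but the tutorial's result shows the remaining count can still be far larger than the number of transactions.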
We can also get the top-k patterns in terms of support with a single line of code:
[7]:
patterns.nlargest(10, columns=['support']) # top 10 patterns
[7]:
| | itemset | support |
---|---|---|
0 | [58] | 3195 |
1 | [52] | 3185 |
2 | [52, 58] | 3184 |
3 | [29] | 3181 |
4 | [29, 58] | 3180 |
5 | [29, 52] | 3170 |
7 | [40] | 3170 |
6 | [29, 52, 58] | 3169 |
8 | [40, 58] | 3169 |
9 | [40, 52] | 3159 |