Itemsets¶
Models and functions to mine mere itemsets
SLIM¶
class skmine.itemsets.SLIM(pruning=True, max_time=-1, items=None)[source]¶
SLIM: Directly Mining Descriptive Patterns
SLIM looks for a compressed representation of transactional data. This compressed representation is a set of descriptive patterns, and can be used to:
provide a natively interpretable modeling of this data
make predictions on new data, using this condensed representation as an encoding scheme
Note: The reconstruction of the initial dataset from the codetable itemsets is lossless.
- Parameters
pruning (bool, default=True) – Whether to activate pruning. Pruned itemsets may be useful at prediction time, so it is usually recommended to set it to False when building a classifier. The model will be less concise, but will lead to more accurate predictions on average.
max_time (int, default=-1) – Maximum computation time, in seconds, before SLIM returns the itemsets. This is useful for large datasets, where SLIM may take a long time to return a code table. Moreover, the compression gain on the database weakens over time, so bounding the computation can be worthwhile.
Examples
>>> from skmine.itemsets import SLIM
>>> D = [['bananas', 'milk'], ['milk', 'bananas', 'cookies'], ['cookies', 'butter', 'tea']]
>>> SLIM().fit(D).transform(D, singletons=True, return_tids=True)
           itemset    tids
0  [bananas, milk]  (0, 1)
1    [butter, tea]     (2)
2        [cookies]  (1, 2)
References
1. Smets, K. & Vreeken, J. "Slim: Directly Mining Descriptive Patterns", 2012
2. Gandhi, M. & Vreeken, J. "Slimmer, outsmarting Slim", 2014
fit(X, y=None)[source]¶
Fit SLIM on a transactional dataset.
This generates new candidate patterns and adds those which improve compression, iteratively refining self.codetable_.
- Parameters
X (iterable of iterables or array-like) – Transactional dataset, either as an iterable of iterables or encoded as tabular binary data
y (Ignored) – Not used, present here for API consistency by convention.
get_code_length(D)[source]¶
Compute covers on new data, and return the code lengths.
Setting pruning to False when creating the model is recommended to cover unseen data, especially when building a classifier.
- Parameters
D (list or np.ndarray or pd.DataFrame) – new data to make predictions on, in tabular format.
Note
The type of input D should be the same as the one ingested by the fit method.
- Returns
codes_length – code length for each transaction in D (float32)
- Return type
pandas.Series
Example
>>> from skmine.itemsets import SLIM
>>> import pandas as pd
>>> def to_tabular(D):
...     return pd.Series(D).str.join('|').str.get_dummies(sep="|")
>>> D = [['bananas', 'milk'], ['milk', 'bananas', 'cookies'], ['cookies', 'butter', 'tea']]
>>> new_D = to_tabular([['cookies', 'butter']])
>>> slim = SLIM().fit(to_tabular(D))
>>> slim.get_code_length(new_D)
0    5.584963
dtype: float32
See also
cover, discover
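The numbers returned by get_code_length follow the classic MDL recipe: each codetable itemset receives a Shannon-optimal code of length -log2(usage / total usage), and a transaction's code length is the sum over the itemsets that cover it. A minimal sketch of that recipe, with hypothetical usages (not the codetable SLIM actually builds in the example above):

```python
import math

def code_lengths(usages):
    """Shannon-optimal code length (in bits) of each itemset,
    derived from its usage in the cover, as in the MDL principle."""
    total = sum(usages.values())
    return {iset: -math.log2(u / total) for iset, u in usages.items() if u > 0}

# Hypothetical usages -- NOT the codetable SLIM builds above
usages = {frozenset({"bananas", "milk"}): 2,
          frozenset({"butter", "tea"}): 1,
          frozenset({"cookies"}): 2}
cl = code_lengths(usages)

# Encoding a transaction = summing the code lengths of the itemsets covering it
tx_code_length = cl[frozenset({"cookies"})] + cl[frozenset({"butter", "tea"})]
```

Rarely used itemsets get long codes, which is why unseen or poorly compressed data ends up with a high code length.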
decision_function(D)[source]¶
Function used to predict the class of D (which can be a list of iterables of items).
It is used by a classifier's predict method, as in sklearn.multiclass.OneVsRestClassifier, where the class with the highest decision_function value among all classes is chosen.
The shorter the code length cl ∈ ]0, +∞[ of D, the higher the probability that D belongs to the class. Therefore, the mapping function _fonc(.) should be monotonically decreasing on ]0, +∞[. By default we choose _fonc(x) = exp(-0.2x) to get a probability in ]0, 1], but others can be used, e.g. _fonc(x) = -x.
Unlike the scikit-learn convention (https://scikit-learn.org/stable/glossary.html#term-decision_function), this decision function does not return a signed distance from a boundary, because such a distance is not easy to compute here.
- Parameters
D (list or np.ndarray or pd.DataFrame) – new data to make predictions on, in tabular format
Note
The type of input D should be the same as the one ingested by the fit method.
- Returns
prob – probability for D to belong to the same class
- Return type
pandas.Series
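The default mapping described above can be sketched directly: exp(-0.2·x) is monotonically decreasing on ]0, +∞[ and maps code lengths into ]0, 1[. This is an illustration of the formula, not the library's internal code:

```python
import math

def decision_value(code_length, func=lambda x: math.exp(-0.2 * x)):
    """Map a code length in ]0, +inf[ to a score in ]0, 1]:
    the shorter the code, the higher the score."""
    return func(code_length)

well_compressed = decision_value(1.0)     # short code -> likely same class
poorly_compressed = decision_value(10.0)  # long code -> unlikely same class
```

Any monotonically decreasing function would preserve the ranking of classes; the exponential merely squashes the scores into a probability-like range.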
generate_candidates(stack=None)[source]¶
Generate candidates from a copy of the codetable and return them sorted in descending order of estimated gain. For each element of the codetable, the union with each following element is computed and an estimated gain is calculated; if this gain is positive, the union is kept.
The result is an in-memory list.
- Parameters
stack (set[frozenset], default=set()) – A stack of already-seen itemsets, which will not be considered in the output. Note that this function updates the stack, which is passed by reference.
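The pairwise-union scheme can be sketched as follows. The gain proxy below (the smaller of the two usages) is a hypothetical stand-in for SLIM's actual estimated compression gain, which is derived from the MDL objective:

```python
from itertools import combinations

def generate_candidates(codetable, stack=None):
    """Union every pair of codetable itemsets, skip those already seen,
    and rank the rest by a crude gain proxy (descending)."""
    stack = set() if stack is None else stack
    scored = []
    for (a, usage_a), (b, usage_b) in combinations(codetable.items(), 2):
        union = a | b
        if union in stack or union in codetable:
            continue
        stack.add(union)  # the stack is updated in place, as in the real method
        gain_proxy = min(usage_a, usage_b)  # stand-in for the estimated gain
        if gain_proxy > 0:
            scored.append((union, gain_proxy))
    return [iset for iset, _ in sorted(scored, key=lambda t: -t[1])]

codetable = {frozenset("AB"): 3, frozenset("C"): 2, frozenset("D"): 0}
candidates = generate_candidates(codetable)
```

With these toy usages, only the union of {A, B} and {C} has a positive proxy gain, so a single candidate is returned.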
evaluate(candidate) → tuple[source]¶
Evaluate candidate, considering the current codetable and a dataset D.
- Parameters
candidate (frozenset) – a new candidate to be evaluated
- Returns
updated (data size, model size, codetable)
- Return type
(float, float, dict)
update(candidate=None, model_size=None, data_size=None, usages=None) → None[source]¶
Update the current codetable.
If candidate is None, the given model_size, data_size and usages will be used. If candidate is not None, model_size, data_size and usages will be computed by calling .evaluate.
- Parameters
candidate (frozenset, default=None) – candidate to be inserted
model_size (float, default=None) – new model size (in bits) to be set
data_size (float) – new data size (in bits) to be set
usages (dict, default=None) – optional, for usage outside this class, e.g. if one simply needs to include an itemset in the current codetable, as in interactive data mining.
- Raises
AssertionError –
cover(D) → pandas.core.frame.DataFrame[source]¶
Cover unseen data.
Items never seen before are dropped.
Examples
>>> from skmine.itemsets import SLIM
>>> D = ["ABC", "AB", "BCD"]
>>> s = SLIM().fit(D)
>>> s.cover(["BC", "AB"])
  (A, B)   (B,)   (A,)   (C,)
0  False   True  False   True
1   True  False  False  False
- Returns
- Return type
pd.DataFrame
transform(D, singletons=True, return_tids=False, lexicographic_order=True, drop_null_usage=True, return_dl=False, out=None)[source]¶
Get a user-friendly copy of the codetable.
- Parameters
D (iterable of iterables or array-like) – Transactional dataset, either as an iterable of iterables or encoded as tabular binary data. Does not influence the results. Present for compatibility with scikit-learn.
singletons (bool, default=True) – Whether to include itemsets of length 1 in the result.
return_tids (bool, default=False) – Whether to return the tids of each itemset associated with the coverage; if set to False, simply return the usage, i.e. the number of times the itemset is used.
lexicographic_order (bool, default=True) – Whether the items in each itemset are ordered lexicographically or left unordered.
drop_null_usage (bool, default=True) – Whether to drop itemsets with no usage in the training data (i.e. itemsets under cover of other itemsets).
return_dl (bool, default=False) – Whether to display the total size of the model L(CT, D) according to the MDL (Minimum Description Length) principle, which is the sum of the encoded database size L(D | CT) and the encoded code table size L(CT | D).
out (str, default=None) – File where results are written; the method then returns None. Each line of the file contains the usage of an itemset in parentheses, followed by the itemset itself. If the return_tids option is enabled, the line under each itemset lists the ids of the transactions containing it.
Example
>>> from skmine.itemsets import SLIM
>>> D = ["ABC", "AB", "BCD"]
>>> SLIM().fit_transform(D, singletons=True, return_tids=False, lexicographic_order=True, drop_null_usage=True)
  itemset  usage
0  [A, B]      2
1  [B, D]      1
2     [C]      2
- Returns
codetable containing patterns and their usages or ids of transactions in which they are used
- Return type
pd.DataFrame
reconstruct() → pandas.core.series.Series[source]¶
Reconstruct the original data from the current self.codetable_. This is possible because SLIM is a lossless algorithm.
Example
>>> from skmine.itemsets import SLIM
>>> D = ["ABC", "AB", "BCD"]
>>> slim = SLIM()
>>> slim.fit(D)
>>> slim.reconstruct()
0    [A, B, C]
1       [A, B]
2    [B, C, D]
dtype: object
- Returns
original database containing a list of transactions
- Return type
pd.Series
get_support(*items)[source]¶
Get the support of an itemset.
Note
Items in an itemset must be passed as positional arguments
Unseen items will raise an error.
prefit(D, y=None)[source]¶
- Parameters
D (iterable of iterables or array-like) – Transactional dataset, either as an iterable of iterables or encoded as tabular binary data
Note
Works in 3 steps:
1. ingest the data D
2. track bitmaps for the top self.n_items frequent items from D
3. set self.data_size_ and self.model_size given the standard codetable
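The outcome of these steps can be sketched in plain Python: the standard codetable holds one singleton per item with usage equal to its support, and the initial data size follows from Shannon code lengths. This is a simplified illustration, not the library's internals:

```python
import math
from collections import Counter

def standard_codetable(D):
    """Singleton codetable (item -> support) plus the data size, in bits,
    obtained when covering every transaction with singletons only."""
    supports = Counter(item for transaction in D for item in set(transaction))
    total_usage = sum(supports.values())
    code_len = {it: -math.log2(s / total_usage) for it, s in supports.items()}
    data_size = sum(code_len[it] for tx in D for it in set(tx))
    return supports, data_size

D = [["bananas", "milk"], ["milk", "bananas", "cookies"], ["cookies", "butter", "tea"]]
supports, data_size = standard_codetable(D)
```

This singletons-only encoding is the baseline that fit then improves upon by inserting multi-item patterns.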
SLIM Classifier¶
class skmine.itemsets.slim_classifier.SlimClassifier(items=None, pruning=False)[source]¶
Classifier using the SLIM compression algorithm. Works for binary and multi-class problems.
This classifier uses one SLIM instance per class in the database, resulting in a code table per class. To classify a transaction, we simply assign the class belonging to the code table that provides the minimal encoded length for the transaction.
- Parameters
items (set, default=None) – The set of items in the complete dataset, not only the training set. This improves the accuracy of the model. Without it, the classifier still works but performs worse, particularly on small datasets.
pruning (bool, default=False) – Indicates whether each SLIM classifier enables pruning
- Variables
classes_ (array-like) – All the unique classes
models_ (list) – A list of SLIM instances corresponding to classes_
classes_X_ (list) – A list where each element is a subset of X and each element contains the transactions of X associated with the class from classes_ of the same index
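The decision rule described above boils down to one comparison: encode the transaction with each class's code table and pick the class with the minimal encoded length. A sketch with hypothetical code lengths (the class names and numbers are invented for illustration):

```python
def predict_by_code_length(code_length_per_class):
    """SlimClassifier's decision rule, sketched: the class whose code
    table compresses the transaction best (fewest bits) wins."""
    return min(code_length_per_class, key=code_length_per_class.get)

# Hypothetical encoded lengths (in bits) of one transaction, per class
lengths = {"spam": 12.4, "ham": 5.6}
predicted = predict_by_code_length(lengths)
```

The intuition is that a code table trained on one class compresses transactions from that class well, and everything else poorly.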
LCM¶
class skmine.itemsets.LCM(*, min_supp=0.2, n_jobs=1, verbose=False)[source]¶
Linear time Closed item set Miner.
LCM can be used as a general-purpose miner, yielding patterns that will later be submitted to a custom acceptance criterion.
It can also be used to simply discover the set of closed itemsets from a transactional dataset.
- Parameters
min_supp (int or float, default=0.2) – The minimum support for itemsets to be rendered in the output: either an int representing absolute support, or a float for relative support. Defaults to 0.2 (20%).
n_jobs (int, default=1) – The number of jobs to use for the computation. Each single item is attributed a job to discover potential itemsets, considering this item as a root in the search space. Processes are preferred over threads. Carefully adjust the number of jobs, otherwise the results may be corrupted, especially if you get the following warning: UserWarning: A worker stopped while some jobs were given to the executor.
References
1. Takeaki Uno, Masashi Kiyomi, Hiroki Arimura "LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets", 2004
2. Alexandre Termier "Pattern mining rock: more, faster, better"
Examples
>>> from skmine.itemsets import LCM
>>> from skmine.datasets.fimi import fetch_chess
>>> chess = fetch_chess()
>>> lcm = LCM(min_supp=2000)
>>> patterns = lcm.fit_transform(chess)
>>> patterns.head()
    itemset  support
0      [58]     3195
1      [52]     3185
2  [52, 58]     3184
3      [29]     3181
4  [29, 58]     3180
>>> patterns[patterns.itemset.map(len) > 3]
fit(D, y=None)[source]¶
Fit LCM on the transactional database, by keeping records of singular items and their transaction ids.
- Parameters
D (pd.Series or iterable) – a transactional database. All entries in this D should be lists. If D is a pandas.Series, then (D.map(type) == list).all() should return True
y (Ignored) – Not used, present here for API consistency by convention.
- Raises
TypeError – if any entry in D is not iterable itself OR if any item is not hashable OR if all items are not comparable with each other.
transform(D, return_tids=False, lexicographic_order=True, max_length=-1, out=None)[source]¶
Return the set of closed itemsets, with respect to the minimum support.
- Parameters
D (pd.Series or Iterable) – The input transactional database, where every entry contains singular items that must be both hashable and comparable. Does not influence the results. Present for compatibility with scikit-learn.
return_tids (bool, default=False) – Whether to return transaction ids along with each itemset. Defaults to False, which returns supports instead.
lexicographic_order (bool, default=True) – Whether the items in each itemset are ordered lexicographically or left unordered.
max_length (int, default=-1) – Maximum length of an itemset. By default, -1 means that LCM returns itemsets of all lengths.
out (str, default=None) – File where results are written; the method then returns None. The out option is useful to save memory: instead of storing every branch of the LCM tree in memory, each root branch is written to a separate file in a temporary directory, and all files are concatenated into the final out file.
- Returns
- DataFrame with the following columns:
itemset: a list of co-occurring items
support: frequency of this itemset
- If return_tids=True, then:
itemset: a tuple of co-occurring items
support: frequency of this itemset
tids: a bitmap tracking positions
- Return type
pd.DataFrame
Example
>>> from skmine.itemsets import LCM
>>> D = [[1, 2, 3, 4, 5, 6], [2, 3, 5], [2, 5]]
>>> LCM(min_supp=2).fit_transform(D, lexicographic_order=True)
     itemset  support
0     [2, 5]        3
1  [2, 3, 5]        2
>>> LCM(min_supp=2).fit_transform(D, return_tids=True)
     itemset  support       tids
0     [2, 5]        3  (0, 1, 2)
1  [2, 3, 5]        2     (0, 1)
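The result above can be checked against a brute-force reading of the definition: a frequent itemset is closed iff no proper superset has the same support. The reference implementation below is exponential in the number of items and is for illustration only; LCM avoids this enumeration entirely:

```python
from itertools import chain, combinations

def closed_itemsets(D, min_supp):
    """Brute-force closed frequent itemsets: frequent itemsets with no
    proper superset of identical support. Exponential -- toy data only."""
    items = sorted(set(chain.from_iterable(D)))

    def support(iset):
        return sum(1 for tx in D if iset <= set(tx))

    frequent = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            s = support(set(combo))
            if s >= min_supp:
                frequent[frozenset(combo)] = s
    return {iset: s for iset, s in frequent.items()
            if not any(iset < other and s == frequent[other] for other in frequent)}

D = [[1, 2, 3, 4, 5, 6], [2, 3, 5], [2, 5]]
closed = closed_itemsets(D, min_supp=2)
```

On this toy database the closed frequent itemsets are exactly {2, 5} (support 3) and {2, 3, 5} (support 2), matching the doctest above.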
LCMMax¶
class skmine.itemsets.lcm.LCMMax(*, min_supp=0.2, n_jobs=1, verbose=False)[source]¶
Linear time Closed item set Miner adapted to Maximal itemsets (or borders).
A maximal itemset is an itemset with no frequent superset.
- Parameters
min_supp (int or float, default=0.2) – The minimum support for itemsets to be rendered in the output: either an int representing absolute support, or a float for relative support. Defaults to 0.2 (20%).
n_jobs (int, default=1) – The number of jobs to use for the computation. Each single item is attributed a job to discover potential itemsets, considering this item as a root in the search space. Processes are preferred over threads.
See also
LCM
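The distinction with LCM can be sketched directly from the definition above: among the frequent itemsets, the maximal ones are those with no frequent proper superset (every maximal itemset is closed, but not conversely):

```python
def maximal_itemsets(frequent):
    """Keep only the frequent itemsets with no frequent proper superset."""
    return {iset for iset in frequent if not any(iset < other for other in frequent)}

# Frequent itemsets (min_supp=2) of D = [[1, 2, 3, 4, 5, 6], [2, 3, 5], [2, 5]]
frequent = {frozenset(s) for s in ({2}, {3}, {5}, {2, 3}, {2, 5}, {3, 5}, {2, 3, 5})}
maximal = maximal_itemsets(frequent)
```

Here only {2, 3, 5} is maximal: the closed itemset {2, 5} is excluded because its superset {2, 3, 5} is still frequent.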
transform(X=None, *args, **kwargs)[source]¶
Return the set of maximal itemsets, with respect to the minimum support.
- Parameters
D (pd.Series or Iterable) – The input transactional database, where every entry contains singular items that must be both hashable and comparable. Does not influence the results. Present for compatibility with scikit-learn.
return_tids (bool, default=False) – Whether to return transaction ids along with each itemset. Defaults to False, which returns supports instead.
lexicographic_order (bool, default=True) – Whether the items in each itemset are ordered lexicographically or left unordered.
max_length (int, default=-1) – Maximum length of an itemset. By default, -1 means that LCM returns itemsets of all lengths.
out (str, default=None) – File where results are written; the method then returns None. The out option is useful to save memory: instead of storing every branch of the LCM tree in memory, each root branch is written to a separate file in a temporary directory, and all files are concatenated into the final out file.
- Returns
- DataFrame with the following columns:
itemset: a list of co-occurring items
support: frequency of this itemset
- If return_tids=True, then:
itemset: a tuple of co-occurring items
support: frequency of this itemset
tids: a bitmap tracking positions
- Return type
pd.DataFrame