Itemsets

Models and functions to mine itemsets

SLIM

class skmine.itemsets.SLIM(pruning=True, max_time=-1, items=None)[source]

SLIM: Directly Mining Descriptive Patterns

SLIM looks for a compressed representation of transactional data. This compressed representation is a set of descriptive patterns, which can be used to:

  • provide a natively interpretable modeling of this data

  • make predictions on new data, using this condensed representation as an encoding scheme

Note: The reconstruction of the initial dataset from the codetable itemsets is lossless.

Parameters
  • pruning (bool, default=True) – Whether to activate pruning. Pruned itemsets may be useful at prediction time, so it is usually recommended to set it to False when building a classifier. The model will be less concise, but will lead to more accurate predictions on average.

  • max_time (int, default=-1) – Maximum computation time (in seconds) before SLIM returns the itemsets. This parameter is useful for large datasets, on which SLIM may take a long time to return a code table. Moreover, after a while the compression progress of the database becomes weaker and weaker, so capping the run time can be worthwhile.

Examples

>>> from skmine.itemsets import SLIM
>>> D = [['bananas', 'milk'], ['milk', 'bananas', 'cookies'], ['cookies', 'butter', 'tea']]
>>> SLIM().fit(D).transform(D, singletons=True, return_tids=True)
           itemset    tids
0  [bananas, milk]  (0, 1)
1    [butter, tea]     (2)
2        [cookies]  (1, 2)

References

1

Smets, K & Vreeken, J “Slim: Directly Mining Descriptive Patterns”, 2012

2

Gandhi, M & Vreeken, J “Slimmer, outsmarting Slim”, 2014

fit(X, y=None)[source]

fit SLIM on a transactional dataset

This generates new candidate patterns and adds those which improve compression, iteratively refining self.codetable_

Parameters
  • X (iterable of iterables or array-like) – Transactional dataset, either as an iterable of iterables or encoded as tabular binary data

  • y (Ignored) – Not used, present here for API consistency by convention.

get_code_length(D)[source]

Compute covers on new data, and return their code lengths

Setting pruning to False when creating the model is recommended for covering unseen data, especially when building a classifier.

Parameters

D (list or np.ndarray or pd.DataFrame) – new data to make predictions on, in tabular format.

Note

The type of input D should be the same as the one ingested by the fit method.

Returns

codes_length – code lengths for each transaction in D (float32)

Return type

pandas.Series

Example

>>> from skmine.itemsets import SLIM; import pandas as pd
>>> def to_tabular(D): return pd.Series(D).str.join('|').str.get_dummies(sep="|")
>>> D = [['bananas', 'milk'], ['milk', 'bananas', 'cookies'], ['cookies', 'butter', 'tea']]
>>> new_D = to_tabular([['cookies', 'butter']])
>>> slim = SLIM().fit(to_tabular(D))
>>> slim.get_code_length(new_D)
0    5.584963
dtype: float32
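As an illustration of how code lengths arise from a codetable, here is a minimal sketch (not skmine's exact encoding, which also accounts for singletons and the cover order) assuming Shannon-optimal codes, where an itemset with usage u out of a total usage U gets a code of length -log2(u/U); the helper names shannon_code_lengths and transaction_code_length are hypothetical:

```python
import math

def shannon_code_lengths(usages):
    """Shannon-optimal code length per itemset: -log2(usage / total usage)."""
    total = sum(usages.values())
    return {itemset: -math.log2(u / total) for itemset, u in usages.items() if u > 0}

def transaction_code_length(cover, lengths):
    """Code length of a transaction = sum of the lengths of the itemsets covering it."""
    return sum(lengths[itemset] for itemset in cover)

# Hypothetical usage counts for a toy codetable
usages = {frozenset({"bananas", "milk"}): 2,
          frozenset({"butter", "tea"}): 1,
          frozenset({"cookies"}): 2}
lengths = shannon_code_lengths(usages)
# Frequently used itemsets get shorter codes
assert lengths[frozenset({"bananas", "milk"})] < lengths[frozenset({"butter", "tea"})]
```

Under this simplification, a transaction is expensive to encode when it can only be covered by rarely used itemsets, which is exactly why code length is a useful anomaly or classification signal.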

See also

cover, discover

decision_function(D)[source]

Function used to predict the class of transactions in D (a list of iterables of items).

It is used by a classifier's predict method, as in sklearn.multiclass.OneVsRestClassifier, which selects the class with the highest decision_function value.

The shorter the code length $cl \in \,]0, +\infty[$ of D, the higher the probability that D belongs to the class. The mapping function _fonc(.) should therefore be monotonically decreasing on $]0, +\infty[$. By default we choose $_fonc(x) = \exp(-0.2x)$, which yields a probability in $]0, 1]$, but other mappings can be coded, such as $_fonc(x) = -x$.

Unlike the scikit-learn convention (https://scikit-learn.org/stable/glossary.html#term-decision_function), this decision function does not return a signed distance to a boundary, because such a distance is not easy to compute here.

Parameters

D (list or np.ndarray or pd.DataFrame) – new data to make predictions on, in tabular format

Note

The type of input D should be the same as the one ingested by the fit method.

Returns

prob – probability for each transaction in D to belong to the class

Return type

pandas.Series
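The default mapping described above can be sketched on its own (the name default_mapping is hypothetical; skmine's internal _fonc may differ in detail):

```python
import math

def default_mapping(code_length):
    """Monotonically decreasing map from a code length in ]0, +inf[ to a score in ]0, 1]."""
    return math.exp(-0.2 * code_length)

# Shorter code length => higher score, and scores stay within ]0, 1]
assert default_mapping(1.0) > default_mapping(10.0)
assert 0.0 < default_mapping(50.0) <= 1.0
```

Because the mapping is strictly decreasing, taking the argmax of these scores over classes is equivalent to taking the argmin of the code lengths.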

generate_candidates(stack=None)[source]

Generate candidates from a copy of the codetable and return them sorted in descending order of estimated gain. For each element of the codetable, the union with each following element is computed and an estimated gain is calculated. If this gain is positive, the union is kept as a candidate.

The result is an in-memory list.

Parameters

stack (set[frozenset], default=set()) – A set of already seen itemsets, which will not be considered in the output. Note that this function updates the stack in place, as it is passed by reference.
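A simplified sketch of this pairwise-union generation (the helper name generate_candidates_sketch and the toy gain function are hypothetical; SLIM's real estimated gain depends on the current usages and the MDL objective):

```python
def generate_candidates_sketch(codetable, estimated_gain, seen=None):
    """Pairwise unions of codetable itemsets, kept when their estimated gain is
    positive, returned sorted by descending gain.  `estimated_gain` is a stand-in
    for SLIM's internal estimate."""
    seen = set() if seen is None else seen
    candidates = []
    for i, x in enumerate(codetable):
        for y in codetable[i + 1:]:
            union = x | y
            if union in seen:
                continue
            seen.add(union)  # the real method also updates the stack in place
            gain = estimated_gain(union)
            if gain > 0:
                candidates.append((union, gain))
    candidates.sort(key=lambda c: -c[1])
    return [c[0] for c in candidates]

codetable = [frozenset("ab"), frozenset("bc"), frozenset("d")]
# Toy gain favouring short unions, purely illustrative
cands = generate_candidates_sketch(codetable, lambda u: 5 - len(u))
```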

evaluate(candidate) → tuple[source]

Evaluate candidate, considering the current codetable and a dataset D

Parameters

candidate (frozenset) – a new candidate to be evaluated

Returns

the updated (data size, model size, codetable)

Return type

(float, float, dict)

update(candidate=None, model_size=None, data_size=None, usages=None) → None[source]

Update the current codetable.

If candidate is passed as None, model_size, data_size and usages will be used as-is. If candidate is not None, model_size, data_size and usages will be computed by calling .evaluate.

Parameters
  • candidate (frozenset, default=None) – candidate to be inserted

  • model_size (float, default=None) – new model size (in bits) to be set

  • data_size (float) – new data size (in bits) to be set

  • usages (dict, default=None) – optional, for use outside this class, e.g. if one simply needs to include an itemset in the current codetable, as in interactive data mining

Raises

AssertionError

cover(D) → pandas.core.frame.DataFrame[source]

Cover unseen data.

Items never seen during fit are dropped.

Examples

>>> from skmine.itemsets import SLIM
>>> D = ["ABC", "AB", "BCD"]
>>> s = SLIM().fit(D)
>>> s.cover(["BC", "AB"])
   (A, B)   (B,)   (A,)   (C,)
0   False   True  False   True
1    True  False  False  False
Returns

a boolean matrix with one row per transaction and one column per codetable itemset, True where the itemset takes part in the cover of the transaction

Return type

pd.DataFrame
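The cover strategy can be sketched as a greedy pass over the codetable in order, using each itemset that still fits in the uncovered part of the transaction; this toy greedy_cover (a hypothetical helper, not skmine's implementation) reproduces the example above:

```python
def greedy_cover(codetable, transaction):
    """Cover a transaction greedily, in codetable order: use each itemset that is
    still fully contained in the uncovered part; unseen items are simply dropped."""
    uncovered = set(transaction)
    used = []
    for itemset in codetable:
        if itemset <= uncovered:
            used.append(itemset)
            uncovered -= itemset
    return used

# Codetable from the example above: [(A, B), (B,), (A,), (C,)]
codetable = [frozenset("AB"), frozenset("B"), frozenset("A"), frozenset("C")]
assert greedy_cover(codetable, "BC") == [frozenset("B"), frozenset("C")]
assert greedy_cover(codetable, "AB") == [frozenset("AB")]
```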

transform(D, singletons=True, return_tids=False, lexicographic_order=True, drop_null_usage=True, return_dl=False, out=None)[source]

Get a user-friendly copy of the codetable

Parameters
  • D (iterable of iterables or array-like) – Transactional dataset, either as an iterable of iterables or encoded as tabular binary data. Does not influence the results. Present for compatibility with scikit-learn.

  • singletons (bool, default=True) – Whether to include itemsets of length 1 in the result

  • return_tids (bool, default=False) – Whether to return the tids of each itemset associated with the coverage, or, if set to False, simply the usage, i.e. the number of times the itemset is used.

  • lexicographic_order (bool, default=True) – Whether to sort the items of each itemset lexicographically; if False, items keep their internal order.

  • drop_null_usage (bool, default=True) – Whether to drop itemsets with no usage in the training data (i.e. itemsets entirely under the cover of other itemsets)

  • return_dl (bool, default=False) – Whether to also return the total size of the model L(CT, D) according to the MDL (Minimum Description Length) principle, which is the sum of the encoded database size L(D | CT) and the encoded code table size L(CT | D).

  • out (str, default=None) – File where results are written; in that case transform returns None. In the file, each itemset is preceded by its usage in parentheses. If the return_tids option is enabled, the line under each itemset lists the ids of the transactions containing it.

Example

>>> from skmine.itemsets import SLIM
>>> D = ["ABC", "AB", "BCD"]
>>> SLIM().fit_transform(D, singletons=True, return_tids=False, lexicographic_order=True, drop_null_usage=True)
  itemset  usage
0  [A, B]      2
1  [B, D]      1
2     [C]      2
Returns

codetable containing the patterns and their usages, or the ids of the transactions in which they are used

Return type

pd.DataFrame

reconstruct() → pandas.core.series.Series[source]

Reconstruct the original data from the current self.codetable_. This is possible because SLIM is a lossless algorithm.

Example

>>> from skmine.itemsets import SLIM
>>> D = ["ABC", "AB", "BCD"]
>>> slim = SLIM()
>>> slim.fit(D) 
>>> slim.reconstruct() 
0    [A, B, C]
1       [A, B]
2    [B, C, D]
dtype: object
Returns

original database containing a list of transactions

Return type

pd.Series

get_support(*items)[source]

Get the support of an itemset

Note

Items in an itemset must be passed as positional arguments

Unseen items will raise an error
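The support of an itemset can be computed as the size of the intersection of the tid sets of its items; a minimal sketch (the helper support and the plain-set tids are assumptions, as skmine tracks compressed bitmaps):

```python
def support(tids_by_item, *items):
    """Support of an itemset = size of the intersection of its items' tid sets.
    An unseen item raises a KeyError, mirroring get_support's behaviour."""
    tid_sets = [tids_by_item[item] for item in items]
    return len(set.intersection(*tid_sets))

# tid sets for D = [['bananas', 'milk'], ['milk', 'bananas', 'cookies'],
#                   ['cookies', 'butter', 'tea']]
tids = {"bananas": {0, 1}, "milk": {0, 1}, "cookies": {1, 2},
        "butter": {2}, "tea": {2}}
assert support(tids, "bananas", "milk") == 2
assert support(tids, "milk", "cookies") == 1
```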

prefit(D, y=None)[source]

Parameters

D (iterable of iterables or array-like) – Transactional dataset, either as an iterable of iterables or encoded as tabular binary data

Note

Works in 3 steps:

  1. ingest data D

  2. track bitmaps for the top self.n_items frequent items from D

  3. set self.data_size_ and self.model_size given the standard codetable
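Step 2 above can be sketched as building one tid set per item (a simplification: skmine tracks compressed bitmaps and keeps only the top frequent items; prefit_sketch is a hypothetical name):

```python
from collections import defaultdict

def prefit_sketch(D):
    """One tid set per item; the real prefit also restricts to the top frequent
    items and then computes data and model sizes for the standard codetable."""
    tids_by_item = defaultdict(set)
    for tid, transaction in enumerate(D):
        for item in transaction:
            tids_by_item[item].add(tid)
    return dict(tids_by_item)

D = [["bananas", "milk"], ["milk", "bananas", "cookies"], ["cookies", "butter", "tea"]]
index = prefit_sketch(D)
assert index["milk"] == {0, 1}
assert index["cookies"] == {1, 2}
```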

SLIM Classifier

class skmine.itemsets.slim_classifier.SlimClassifier(items=None, pruning=False)[source]

Classifier using the SLIM compression algorithm. Works for binary and multi-class problems.

This classifier uses one SLIM instance per class in the database, resulting in one code table per class. To classify a transaction, we simply assign the class whose code table provides the minimal encoded length for the transaction.

Parameters
  • items (set, default=None) – The set of items in the complete dataset, not only the training set. Providing it improves the accuracy of the model; without it, the classifier still works but performs worse, particularly on small datasets.

  • pruning (bool, default=False) – Indicates whether each SLIM classifier enables pruning

Variables
  • classes_ (array-like) – All the unique classes

  • models_ (list) – A list of SLIM instances corresponding to classes_

  • classes_X_ (list) – A list where each element is a subset of X and each element contains the transactions of X associated with the class from classes_ of the same index
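The decision rule can be sketched independently of SLIM itself: given the code length of each transaction under each per-class model, predict the class with the minimal code length (predict_sketch is a hypothetical helper):

```python
def predict_sketch(per_class_code_lengths):
    """per_class_code_lengths: {class: [code length of each transaction under that
    class's model]}.  Predict, per transaction, the class with minimal code length."""
    classes = list(per_class_code_lengths)
    n_transactions = len(next(iter(per_class_code_lengths.values())))
    return [min(classes, key=lambda c: per_class_code_lengths[c][i])
            for i in range(n_transactions)]

# Hypothetical code lengths for two transactions under two per-class models
lengths = {"pos": [3.2, 9.1], "neg": [7.5, 2.4]}
assert predict_sketch(lengths) == ["pos", "neg"]
```

The intuition is MDL-style: a transaction is assigned to the class whose patterns compress it best.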

fit(X, y)[source]

Fit the model according to the given training data.

Parameters
  • X (iterable or array-like) – an iterable of transactions, each itself an iterable of items

  • y (array-like of shape (n_samples,)) – Target vector relative to X.

Returns

self – An instance of the estimator

Return type

object

predict(X)[source]

Perform classification on samples in X

Parameters

X (iterable) – an iterable of transactions, each itself an iterable of items

Returns

y_pred – Class labels for samples in X

Return type

np.array of shape (n_samples,)

LCM

class skmine.itemsets.LCM(*, min_supp=0.2, n_jobs=1, verbose=False)[source]

Linear time Closed item set Miner.

LCM can be used as a general-purpose miner, yielding patterns that can later be submitted to a custom acceptance criterion.

It can also be used to simply discover the set of closed itemsets from a transactional dataset.

Parameters
  • min_supp (int or float, default=0.2) – The minimum support for itemsets to be rendered in the output: either an int representing the absolute support, or a float for relative support. Defaults to 0.2 (20%).

  • n_jobs (int, default=1) – The number of jobs to use for the computation. Each single item is attributed a job to discover potential itemsets, considering this item as a root in the search space. Processes are preferred over threads. Carefully adjust the number of jobs, otherwise the results may be corrupted, especially if you see the following warning: UserWarning: A worker stopped while some jobs were given to the executor.

References

1

Takeaki Uno, Masashi Kiyomi, Hiroki Arimura “LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets”, 2004

2

Alexandre Termier “Pattern mining rock: more, faster, better”

Examples

>>> from skmine.itemsets import LCM
>>> from skmine.datasets.fimi import fetch_chess
>>> chess = fetch_chess()
>>> lcm = LCM(min_supp=2000)
>>> patterns = lcm.fit_transform(chess)
>>> patterns.head()
    itemset  support
0      [58]     3195
1      [52]     3185
2  [52, 58]     3184
3      [29]     3181
4  [29, 58]     3180
>>> patterns[patterns.itemset.map(len) > 3]  
fit(D, y=None)[source]

fit LCM on the transactional database, by keeping records of singular items and their transaction ids.

Parameters
  • D (pd.Series or iterable) – a transactional database. All entries in D should be lists. If D is a pandas.Series, then (D.map(type) == list).all() should return True.

  • y (Ignored) – Not used, present here for API consistency by convention.

Raises

TypeError – if any entry in D is not itself iterable, if any item is not hashable, or if items are not pairwise comparable.

transform(D, return_tids=False, lexicographic_order=True, max_length=-1, out=None)[source]

Return the set of closed itemsets, with respect to the minimum support

Parameters
  • D (pd.Series or Iterable) – The input transactional database, where every entry contains singular items, which must be both hashable and comparable. Does not influence the results. Present for compatibility with scikit-learn.

  • return_tids (bool, default=False) – Whether to return transaction ids along with each itemset. Defaults to False, in which case supports are returned instead.

  • lexicographic_order (bool, default=True) – Whether to sort the items of each itemset lexicographically; if False, items keep their internal order.

  • max_length (int, default=-1) – Maximum length of an itemset. By default, -1 means that LCM returns itemsets of all lengths.

  • out (str, default=None) – File where results are written; in that case transform returns None. The out option is useful to save memory: instead of storing every branch of the LCM tree in memory, each root branch is written to a separate file in a temporary directory (TEMP_dir), and all files are concatenated into the final out file.

Returns

DataFrame with the following columns:

itemset

a list of co-occurring items

support

frequency of this itemset

If return_tids=True, then instead:

itemset

a tuple of co-occurring items

support

frequency of this itemset

tids

a bitmap tracking the ids of the transactions containing the itemset

Return type

pd.DataFrame

Example

>>> from skmine.itemsets import LCM
>>> D = [[1, 2, 3, 4, 5, 6], [2, 3, 5], [2, 5]]
>>> LCM(min_supp=2).fit_transform(D, lexicographic_order=True)
     itemset  support
0     [2, 5]        3
1  [2, 3, 5]        2
>>> LCM(min_supp=2).fit_transform(D, return_tids=True)
     itemset  support       tids
0     [2, 5]        3  (0, 1, 2)
1  [2, 3, 5]        2     (0, 1)
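The closed-itemset semantics can be checked by brute force on tiny datasets (an exponential enumeration, nothing like LCM's efficient search; closed_itemsets is a hypothetical helper). It reproduces the example above:

```python
from itertools import combinations

def closed_itemsets(D, min_supp):
    """Brute-force closed frequent itemsets: frequent, and no proper superset has
    the same support.  Exponential; only for checking tiny datasets."""
    items = sorted({i for t in D for i in t})
    transactions = [set(t) for t in D]

    def supp(s):
        return sum(s <= t for t in transactions)

    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            if supp(s) >= min_supp:
                frequent[s] = supp(s)
    # keep only itemsets with no proper superset of identical support
    return {s: c for s, c in frequent.items()
            if not any(s < t and frequent[t] == c for t in frequent)}

D = [[1, 2, 3, 4, 5, 6], [2, 3, 5], [2, 5]]
assert closed_itemsets(D, min_supp=2) == {frozenset({2, 5}): 3, frozenset({2, 3, 5}): 2}
```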

LCMMax

class skmine.itemsets.lcm.LCMMax(*, min_supp=0.2, n_jobs=1, verbose=False)[source]

Linear time Closed item set Miner adapted to Maximal itemsets (or borders).

A maximal itemset is an itemset with no frequent superset.

Parameters
  • min_supp (int or float, default=0.2) – The minimum support for itemsets to be rendered in the output: either an int representing the absolute support, or a float for relative support. Defaults to 0.2 (20%).

  • n_jobs (int, default=1) – The number of jobs to use for the computation. Each single item is attributed a job to discover potential itemsets, considering this item as a root in the search space. Processes are preferred over threads.

See also

LCM
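Analogously, the maximal-itemset semantics can be checked by brute force on tiny data (maximal_itemsets is a hypothetical helper, not LCMMax's algorithm):

```python
from itertools import combinations

def maximal_itemsets(D, min_supp):
    """Brute-force maximal frequent itemsets: frequent, with no frequent proper
    superset.  Exponential; only for checking tiny datasets."""
    items = sorted({i for t in D for i in t})
    transactions = [set(t) for t in D]
    frequent = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            if sum(s <= t for t in transactions) >= min_supp:
                frequent.add(s)
    # keep only itemsets with no frequent proper superset
    return {s for s in frequent if not any(s < t for t in frequent)}

D = [[1, 2, 3, 4, 5, 6], [2, 3, 5], [2, 5]]
assert maximal_itemsets(D, min_supp=2) == {frozenset({2, 3, 5})}
```

On this dataset the only maximal itemset is [2, 3, 5]: it is frequent, and none of its supersets reaches the minimum support.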

transform(X=None, *args, **kwargs)[source]

Return the set of maximal itemsets, with respect to the minimum support

Parameters
  • D (pd.Series or Iterable) – The input transactional database, where every entry contains singular items, which must be both hashable and comparable. Does not influence the results. Present for compatibility with scikit-learn.

  • return_tids (bool, default=False) – Whether to return transaction ids along with each itemset. Defaults to False, in which case supports are returned instead.

  • lexicographic_order (bool, default=True) – Whether to sort the items of each itemset lexicographically; if False, items keep their internal order.

  • max_length (int, default=-1) – Maximum length of an itemset. By default, -1 means that LCM returns itemsets of all lengths.

  • out (str, default=None) – File where results are written; in that case transform returns None. The out option is useful to save memory: instead of storing every branch of the LCM tree in memory, each root branch is written to a separate file in a temporary directory (TEMP_dir), and all files are concatenated into the final out file.

Returns

DataFrame with the following columns:

itemset

a list of co-occurring items

support

frequency of this itemset

If return_tids=True, then instead:

itemset

a tuple of co-occurring items

support

frequency of this itemset

tids

a bitmap tracking the ids of the transactions containing the itemset

Return type

pd.DataFrame