Emerging

Module dedicated to emerging patterns

The EPM problem was defined by Dong & Li as the task of finding patterns whose support increases significantly from one dataset to another

MBD-LLBorder

class skmine.emerging.MBDLLBorder(min_growth_rate=2, min_supp=0.1, n_jobs=1)[source]

MBD-LLBorder aims at discovering patterns characterizing the difference between two collections of data.

It first discovers two sets of maximal itemsets, one for each collection. It then looks for borders of these sets, and characterizes the difference by only manipulating theses borders.

This results in the algorithm only keeping a concise description (borders) as an internal state.

Last but not least, it enumerates the set of emerging patterns from the borders.

Parameters
  • min_growth_rate (int or float, default=2) – A pattern is considered as emerging iff its support in the first collection is at least min_growth_rate times its support in the second collection.

  • min_supp (int or float, default=.1) – Minimum support in each of the collection Must be a relative support between 0 and 1 Default to 0.1 (10%)

  • n_jobs (int, default=1) – The number of jobs to use for the computation.

Variables

borders_ (list of tuple[list[set], set]) – List of pairs representing left and right borders For every pair, emerging patterns will be uncovered by enumerating itemsets from the right side, while checking for non membership in the left side

References

.. [1]

Guozhu Dong, Jinyan Li “Efficient Mining of Emerging Patterns : Discovering Trends and Differences”, 1999

fit(D, y)[source]

fit MBD-LLBorder on D, splitted on y

This is done in two steps
  1. Discover maximal itemsets for the two disinct collections contained in D

  2. Mine borders from these maximal itemsets

Parameters
  • D (pd.Series) – The input transactional database Where every entry contains singular items Items must be both hashable and comparable

  • y (array-like of shape (n_samples,)) – Targets on which to split D Must contain only two disctinct values, i.e len(np.unique(y)) == 2

discover(min_size=3)[source]

Enumerate emerging patterns from borders Subsets are drawn from the right borders, and are accepted iff they do not belong to the corresponding left border.

This implementation is parallel, we consider every couple of right/left border in a separate worker and run the computation

Parameters

min_size (int) – minimum size for an itemset to be valid