Feature Extraction

The skmine.feature_extraction module provides a set of methods that can be used to extract features in a format supported by machine learning algorithms from datasets containing raw data.

SLIMVectorizer

class skmine.feature_extraction.SLIMVectorizer(strategy='codes', k=5, pruning=False, stop_items=None, **kwargs)[source]

SLIM mining, turned into a feature extraction step for sklearn

k new itemsets (associations of one or more items) are learned at training time

The model (pattern set) is then used to cover new data, in order of usage. This is similar to one-hot encoding, except that the dimensionality is much smaller, because the columns correspond to patterns learned via an MDL criterion.

Parameters
  • strategy (str, default="codes") –

    If the chosen strategy is set to one-hot, non-zero cells are filled with ones.

    If the chosen strategy is left to codes (the default), non-zero cells are filled with code lengths, i.e. the probability of the pattern in the training data.

  • k (int, default=5) –

    Number of non-singleton itemsets to mine. A singleton is an itemset containing a single item.

    Calls to .transform will output a pandas.DataFrame with k columns.

  • pruning (bool, default=False) – Whether to activate pruning or not.

  • stop_items (iterable, default=None) – Set of items to filter out while ingesting the input data.

Examples

>>> from skmine.feature_extraction import SLIMVectorizer
>>> D = [['bananas', 'milk'], ['milk', 'bananas', 'cookies'], ['cookies', 'butter', 'tea']]
>>> res = SLIMVectorizer(k=2).fit_transform(D)
>>> print(res.to_string())
   (bananas, milk)  (cookies,)
0              0.4         0.0
1              0.4         0.4
2              0.0         0.4
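
For comparison, here is a minimal sketch of the one-hot strategy on the same data; it assumes the strategy value is spelled 'one-hot' as in the parameter description above, in which case non-zero cells are filled with ones rather than code lengths.

>>> res_onehot = SLIMVectorizer(strategy='one-hot', k=2).fit_transform(D)  # assumed spelling of the strategy value
>>> res_onehot.columns.tolist()  # same pattern columns as above; non-zero cells hold 1 instead of the code length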

Notes

This transformer does not output scipy.sparse matrices, as SLIM should learn a concise description of the data, and covering new data with this small set of high usage patterns should output matrices with very few zeros.
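
If a sparse representation is nevertheless required downstream, the dense output can be converted explicitly. This is only a sketch using scipy, which is not part of this module:

>>> import scipy.sparse as sp
>>> X_sparse = sp.csr_matrix(res.values)  # convert the dense DataFrame from the example above to CSR format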

fit(D, y=None)[source]

Fit SLIM on a transactional dataset.

This generates new candidate patterns and adds those which improve compression, iteratively refining self.codetable_.

Parameters
  • D (iterable of iterables or array-like) – Transactional dataset, either as an iterable of iterables or encoded as tabular binary data

  • y (Ignored) – Not used, present here for API consistency by convention.
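
A minimal usage sketch of calling fit directly on the example data D from above; it assumes the usual sklearn convention of fit returning the estimator, and the exact structure of the fitted codetable_ attribute may vary between versions:

>>> vec = SLIMVectorizer(k=2)
>>> _ = vec.fit(D)              # learns at most k non-singleton itemsets from D
>>> patterns = vec.codetable_   # pattern set refined during fitting (exact format may vary)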

transform(newD, y=None, **tsf_params) → pandas.core.frame.DataFrame[source]

Transform new data

Parameters

newD (iterable) – new transactional data

Returns

a pandas.DataFrame with len(newD) rows and self.k columns

Return type

pd.DataFrame
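
A short sketch of covering previously unseen transactions with the patterns learned at fit time; as stated above, the output has len(newD) rows and k columns:

>>> vec = SLIMVectorizer(k=2)
>>> _ = vec.fit(D)                          # D is the training data from the Examples section
>>> new_D = [['bananas', 'tea'], ['cookies', 'milk']]
>>> out = vec.transform(new_D)              # DataFrame with len(new_D) rows and k columns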