Datasets

Methods to fetch, generate, and describe datasets.

Standard datasets for itemset mining

skmine.datasets.fimi.fetch_file(filepath, separator=' ', int_values=False)[source]

Loader for files in FIMI format

Parameters
  • filepath (str) – Path of the file to load

  • separator (str) – Indicate a custom separator between the items. By default, it is a space.

  • int_values (bool, default=False) – Specify if the items in the file are all integers. If not, then the items are considered as strings. With integers, the algorithms are more efficient.

Returns

Transactions from the requested dataset, as an in-memory pandas Series

Return type

pd.Series
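
To illustrate the kind of input this loader reads, here is a minimal sketch, not the library's implementation. It assumes a FIMI file holds one transaction per line, with items separated by `separator`. The helper name `parse_fimi_lines` is hypothetical and it returns plain Python lists rather than a pandas Series.

```python
def parse_fimi_lines(lines, separator=" ", int_values=False):
    """Parse FIMI-style lines into a list of transactions (hypothetical helper)."""
    transactions = []
    for line in lines:
        # Remove surrounding whitespace/newline, then split on the separator
        items = [tok for tok in line.strip().split(separator) if tok]
        if not items:
            continue  # skip empty lines
        if int_values:
            # Mirrors the int_values flag: convert every item to an integer
            items = [int(tok) for tok in items]
        transactions.append(items)
    return transactions


raw = ["1 3 4\n", "2 3 5\n", "\n", "1 2 3 5\n"]
parsed = parse_fimi_lines(raw, int_values=True)
print(parsed)
```

With `int_values=False` the same call would keep every item as a string, which matches the default behaviour described above.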

skmine.datasets.fimi.fetch_any(filename, base_url='http://fimi.uantwerpen.be/data/', data_home=None)[source]

Base loader for all datasets from the FIMI and CGI repositories. Each unique transaction is represented as a Python list in the resulting pandas Series.

see: http://fimi.uantwerpen.be/data/ https://cgi.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/DataSets/dataSets.html

Parameters
  • data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in ~/scikit_mine_data/ subfolders.

  • filename (str) – Name of the file to fetch

  • base_url (str) – URL indicating where to fetch the dataset

Returns

Transactions from the requested dataset, as an in-memory pandas Series

Return type

pd.Series
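
The download-and-cache behaviour described by the parameters above might be sketched as follows. This is an illustration under assumptions, not the library's code: `plan_fetch` is a hypothetical helper, the file name `chess.dat` and the cache folder name are used only for the example, and no network access is performed.

```python
import os
from urllib.parse import urljoin


def plan_fetch(filename, base_url="http://fimi.uantwerpen.be/data/",
               data_home="scikit_mine_data"):
    """Return (url, local_path, needs_download) for a dataset file (hypothetical)."""
    # The remote location is the base URL followed by the file name
    url = urljoin(base_url, filename)
    # The local copy lives inside the cache folder
    local_path = os.path.join(os.path.expanduser(data_home), filename)
    # Only download when no cached copy exists yet
    needs_download = not os.path.exists(local_path)
    return url, local_path, needs_download


url, local_path, needs_download = plan_fetch("chess.dat")
print(url)
```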

skmine.datasets.fimi.fetch_chess(data_home=None)[source]

Fetch and return the chess dataset (Frequent Itemset Mining)

Nb of items

75

Nb of transactions

3196

Avg transaction size

37.0

Density

0.493

Parameters

data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.

Returns

Transactions from the chess dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.

Return type

pd.Series
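
As a quick consistency check on the figures listed for the chess dataset, the density can be recomputed from the other properties. This is a small arithmetic sketch using only the numbers quoted above; the variable names are illustrative.

```python
n_items = 75
n_transactions = 3196
avg_transaction_size = 37.0

# Density as average transaction size over number of items
density = avg_transaction_size / n_items

# Same quantity seen as "ones over cells" in a binary transaction-by-item matrix
n_ones = n_transactions * avg_transaction_size
n_cells = n_transactions * n_items
density_from_matrix = n_ones / n_cells

print(round(density, 3))
```

Both expressions give the same value, which rounds to the listed density of 0.493.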

skmine.datasets.fimi.fetch_connect(data_home=None)[source]

Fetch and return the connect dataset (Frequent Itemset Mining).

Nb of items

129

Nb of transactions

67557

Avg transaction size

43.0

Density

0.333

Parameters

data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.

Returns

Transactions from the connect dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.

Return type

pd.Series

skmine.datasets.fimi.fetch_mushroom(data_home=None, return_y=False)[source]

Fetch and return the mushroom dataset (Frequent Itemset Mining)

The Mushroom data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family.

It contains information about 8124 mushrooms (transactions). 4208 (51.8%) are edible and 3916 (48.2%) are poisonous.

The data contains 22 nominal features plus the class attribute (edible or not). These features were translated into 117 items.

Nb of items

117

Nb of transactions

8124

Avg transaction size

22.0

Density

0.188

Parameters
  • data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.

  • return_y (bool, default=False) – If True, returns a tuple for both the data and the associated labels (0 for edible, 1 for poisonous)

Returns

  • mush (pd.Series) – Transactions from the mushroom dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.

  • (mush, y) (tuple) – if return_y is True

Examples

>>> from skmine.datasets.fimi import fetch_mushroom
>>> from skmine.datasets.utils import describe
>>> X, y = fetch_mushroom(return_y=True)
>>> describe(X)['n_items']
117
>>> y.value_counts()
0    4208
1    3916
Name: mushroom, dtype: int64

skmine.datasets.fimi.fetch_pumsb(data_home=None)[source]

Fetch and return the pumsb dataset (Frequent Itemset Mining)

The Pumsb dataset contains census data for population and housing.

Nb of items

2113

Nb of transactions

49046

Avg transaction size

74.0

Density

0.035

Parameters

data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.

Returns

Transactions from the pumsb dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.

Return type

pd.Series

skmine.datasets.fimi.fetch_pumsb_star(data_home=None)[source]

Fetch and return the pumsb_star dataset (Frequent Itemset Mining)

Nb of items

2088

Nb of transactions

49046

Avg transaction size

50.48

Density

0.024

Parameters

data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.

Returns

Transactions from the pumsb_star dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.

Return type

pd.Series

skmine.datasets.fimi.fetch_kosarak(data_home=None)[source]

Fetch and return the kosarak dataset (Frequent Itemset Mining)

Click-stream data from a Hungarian on-line news portal.

Nb of items

36855

Nb of transactions

990002

Avg transaction size

8.1

Density

0.000220

Parameters

data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.

Returns

Transactions from the kosarak dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.

Return type

pd.Series

skmine.datasets.fimi.fetch_retail(data_home=None)[source]

Fetch and return the retail dataset (Frequent Itemset Mining)

Contains market basket data from a Belgian retail store, anonymized.

see: http://fimi.uantwerpen.be/data/retail.pdf

Nb of items

16470

Nb of transactions

88162

Avg transaction size

10.3

Density

0.000626

Retail market basket data set supplied by an anonymous Belgian retail supermarket store.

The data covers approximately 5 months. The total number of receipts collected equals 88,163.

In total, 5,133 customers purchased at least one product during the data collection period.

Parameters

data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.

Returns

Transactions from the retail dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.

Return type

pd.Series

skmine.datasets.fimi.fetch_accidents(data_home=None)[source]

Fetch and return the accidents dataset (Frequent Itemset Mining)

Traffic accident data, anonymized.

see: http://fimi.uantwerpen.be/data/accidents.pdf

Nb of items

468

Nb of transactions

340183

Avg transaction size

33.807

Density

0.072

Parameters

data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in ~/scikit_mine_data.

Returns

Transactions from the accidents dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.

Return type

pd.Series

Logs datasets for periodic pattern mining

skmine.datasets.fetch_health_app(data_home=None, filename='health_app.csv')[source]

Fetch and return the health app log dataset

see: https://github.com/logpai/loghub

HealthApp is a mobile application for Android devices. Logs were collected from an Android smartphone after 10+ days of use.

Logs have been grouped by their types, hence resulting in only 20 different events.

Number of occurrences

2000

Number of events

20

Average delta per event

Timedelta('0 days 00:53:24.984000')

Average nb of points per event

100.0

Parameters
  • filename (str, default: health_app.csv) – Name of the file (without the data_home directory) where the dataset will be or is already downloaded.

  • data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.

Returns

System logs from the health app dataset, as an in-memory pandas Series. Events are indexed by timestamps.

Return type

pd.Series

skmine.datasets.fetch_canadian_tv(data_home=None, filename='canadian_tv.txt')[source]

Fetch and return Canadian TV logs from August 2020

see: https://zenodo.org/record/4671512

If the dataset has never been downloaded before, it will be downloaded and stored.

The returned dataset contains only TV series programs indexed by their associated timestamps. Adverts are ignored when loading the dataset.

Number of occurrences

2093

Number of events

98

Average delta per event

Timedelta('19 days 02:13:36.122448979')

Average nb of points per event

21.35714285714285

Parameters
  • filename (str, default: canadian_tv.txt) – Name of the file (without the data_home directory) where the dataset will be or is already downloaded.

  • data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.

Returns

TV series events from Canadian TV, as an in-memory pandas Series. Events are indexed by timestamps.

Return type

pd.Series

Notes

For now the entire .zip file is downloaded, which takes about 90 MB on disk. Downloading a preprocessed dataset from zenodo.org is being considered.

Synthetic data generation

skmine.datasets.make_transactions(n_transactions=1000, n_items=100, density=0.5, random_state=None, item_start=0)[source]

Generate a transactional dataset with predefined properties

see: https://liris.cnrs.fr/Documents/Liris-3716.pdf

Transaction sizes follow a normal distribution, centered around density * n_items. Individual items are integer values between 0 and n_items.

Parameters
  • n_transactions (int, default=1000) – The number of transactions to generate

  • n_items (int, default=100) – The number of individual items, i.e. the size of the set of symbols

  • density (float, default=0.5) – Density of the resulting dataset

  • random_state (int, RandomState instance, default=None) – Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.

Example

>>> from skmine.datasets import make_transactions
>>> make_transactions(n_transactions=5, n_items=20, density=.25)  
0    [0, 6, 18, 10, 1, 12]
1          [2, 18, 10, 14]
2                [4, 5, 1]
3         [10, 11, 16, 19]
4     [9, 4, 19, 8, 12, 5]
dtype: object

Notes

With a binary matrix representation of the resulting dataset, we have the following equality
\[density = { Number\ of\ ones \over Number\ of\ cells }\]
This is equivalent to
\[density = { Average\ transaction\ size \over number\ of\ items }\]

Returns

pd.Series – Each entry is a list of integer values

Return type

a Series of shape (n_transactions,)
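
The generation principle described above (transaction sizes normally distributed around density * n_items) can be sketched in plain Python. This is not the library's implementation: the function name `sketch_make_transactions`, the spread of the size distribution (10% of n_items), the clipping of sizes to [1, n_items] and the item range 0 to n_items - 1 are all assumptions made for the illustration.

```python
import random


def sketch_make_transactions(n_transactions=1000, n_items=100, density=0.5, seed=None):
    """Generate transactions as lists of distinct integers (illustrative sketch)."""
    rng = random.Random(seed)
    mean_size = density * n_items
    # Assumption: the text does not give the spread, so use 10% of n_items
    std_size = 0.1 * n_items
    transactions = []
    for _ in range(n_transactions):
        # Draw a size from a normal distribution, then clip it to a valid range
        size = int(round(rng.gauss(mean_size, std_size)))
        size = max(1, min(n_items, size))
        # Pick `size` distinct items among 0 .. n_items - 1
        transactions.append(rng.sample(range(n_items), size))
    return transactions


sample = sketch_make_transactions(n_transactions=5, n_items=20, density=0.25, seed=0)
for transaction in sample:
    print(transaction)
```

Averaged over many transactions, the mean transaction size divided by n_items stays close to the requested density, which is the equality stated in the Notes.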

skmine.datasets.make_classification(n_samples=100, n_items_per_class=100, *, n_classes=2, weights=None, class_sep=0.2, shuffle=True, random_state=None, densities=None)[source]

Generate a random n-class classification problem

Acts like the scikit-learn version of make_classification, but produces transactional data instead. Transactions of each class are drawn from n_items_per_class items, and the class_sep parameter ensures that transactions of different classes are drawn from different alphabets.

A class_sep value of 0.0 will result in transactions being drawn from the same set of symbols.

Densities can be defined for each class using the densities parameter.

Parameters
  • n_samples (int, default=100) – The number of samples

  • n_items_per_class (int, default=100) – The number of items per class. This is similar to the n_features parameter in scikit-learn, but operates at a class level.

  • n_classes (int, default=2) – The number of classes (or labels) of the classification problem

  • weights (array-like of shape (n_classes,), default=None) – The proportions of samples assigned to each class. If None, then classes are balanced

  • class_sep (float, default=0.2) – The proportion of items that differ between classes. Setting this to 1.0 will make classification trivial.

  • shuffle (boolean, default=True) – Shuffle the samples and the labels

  • random_state (int, RandomState instance, default=None) – Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.

Returns

  • D (pd.Series of shape [n_samples, ]) – The generated samples

  • y (pd.Series of shape [n_samples]) – Labels associated to D

See also

make_transactions

which is used internally to generate samples

utils

skmine.datasets.utils.describe(D)[source]

Give some high level properties on transactions

Number of items

int

Number of transactions

int

Average transaction size

float

Density

float in [0, 1]

Parameters

D (pd.Series) – A transactional dataset

Notes

\[density = { avg\_transaction\_size \over n\_items }\]
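
The properties and the density formula above can be sketched in plain Python. This is an illustration rather than the library's code: `sketch_describe` is a hypothetical name, it takes a list of lists instead of a pandas Series, and it assumes that n_items counts the distinct items seen across all transactions.

```python
def sketch_describe(transactions):
    """Compute high-level properties of a list of transactions (illustrative)."""
    n_transactions = len(transactions)
    # Distinct items observed across all transactions
    items = set()
    for transaction in transactions:
        items.update(transaction)
    n_items = len(items)
    avg_transaction_size = sum(len(t) for t in transactions) / n_transactions
    return {
        "n_items": n_items,
        "avg_transaction_size": avg_transaction_size,
        "n_transactions": n_transactions,
        # density = avg_transaction_size / n_items
        "density": avg_transaction_size / n_items,
    }


D = [[1, 2, 3], [2, 3], [1, 4], [2, 3, 4, 5]]
print(sketch_describe(D))
```

Here the four transactions hold 11 items in total over 5 distinct symbols, so the average size is 2.75 and the density is 2.75 / 5 = 0.55.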

Example

>>> from skmine.datasets.fimi import fetch_chess
>>> from skmine.datasets.utils import describe
>>> describe(fetch_chess())  
{'n_items': 75, 'avg_transaction_size': 37.0, 'n_transactions': 3196, 'density': 0.4933}

skmine.datasets.utils.describe_logs(D)[source]

Give some high level properties on logs

Number of events

int

Average delta per event

pd.Timedelta

Average nb of points per event

float

Parameters

D (pd.Series) – A dataset containing logs
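
One possible reading of the properties listed above is sketched below in plain Python. This is an assumption about how the statistics could be computed, not the library's implementation: `sketch_describe_logs` is a hypothetical name, the input is a list of (timestamp, event) pairs instead of a pandas Series, and the average delta is taken as the mean, over events, of the mean gap between consecutive occurrences of that event.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sketch_describe_logs(log):
    """log: iterable of (timestamp, event) pairs (hypothetical input format)."""
    by_event = defaultdict(list)
    for stamp, event in log:
        by_event[event].append(stamp)

    mean_gaps = []
    for stamps in by_event.values():
        stamps.sort()
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        if gaps:  # an event seen only once has no gap
            mean_gaps.append(sum(gaps, timedelta()) / len(gaps))

    n_events = len(by_event)
    avg_delta = sum(mean_gaps, timedelta()) / len(mean_gaps) if mean_gaps else None
    return {
        "n_events": n_events,
        "avg_delta_per_event": avg_delta,
        "avg_nb_points_per_event": sum(len(s) for s in by_event.values()) / n_events,
    }


t0 = datetime(2020, 8, 1)
log = [
    (t0, "wake up"),
    (t0 + timedelta(hours=1), "coffee"),
    (t0 + timedelta(days=1), "wake up"),
    (t0 + timedelta(days=1, hours=3), "coffee"),
    (t0 + timedelta(days=2), "wake up"),
]
stats = sketch_describe_logs(log)
print(stats)
```

The same arithmetic applies to the figures listed for the health app dataset: 2000 occurrences spread over 20 events give an average of 100.0 points per event.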

Example

>>> from skmine.datasets.periodic import fetch_health_app
>>> from skmine.datasets.utils import describe_logs
>>> describe_logs(fetch_health_app()) 
{'n_events': 20,
'avg_delta_per_event': Timedelta('0 days 00:53:24.984000'),
'avg_nb_points_per_event': 100.0}

skmine.datasets.get_data_home(data_home=None)[source]

Return the path of the scikit-mine data home directory.

This folder is used by some large dataset loaders to avoid downloading data several times.

By default, data_home is ./scikit_mine_data/

Alternatively, it can be set by the SCIKIT_MINE_DATA environment variable or programmatically by giving an explicit folder path.

Parameters

data_home (str | None) – The path to scikit-mine data dir.
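
The ways of setting the data home described above can be sketched as follows. This is an illustration, not the library's code: `sketch_get_data_home` is a hypothetical name, the precedence (explicit argument, then the SCIKIT_MINE_DATA environment variable, then a default folder) is an assumption, and the default folder is passed as a parameter because its exact location is not asserted here. The sketch only resolves the path and does not create any directory.

```python
import os


def sketch_get_data_home(data_home=None, default="scikit_mine_data"):
    """Resolve the data home path (illustrative sketch, creates nothing on disk)."""
    if data_home is None:
        # Assumed precedence: environment variable first, then the default folder
        data_home = os.environ.get("SCIKIT_MINE_DATA", default)
    # Expand "~" and make the path absolute
    return os.path.abspath(os.path.expanduser(data_home))


print(sketch_get_data_home("~/my_cache"))
```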