Datasets
Methods to fetch, generate, and describe datasets.
Standard datasets for itemset mining
-
skmine.datasets.fimi.fetch_file(filepath, separator=' ', int_values=False)
Loader for files in FIMI format.
- Parameters
filepath (str) – Path of the file to load
separator (str) – Indicate a custom separator between the items. By default, it is a space.
int_values (bool, default=False) – Specify if the items in the file are all integers. If not, then the items are considered as strings. With integers, the algorithms are more efficient.
- Returns
Transactions from the requested dataset, as an in-memory pandas Series
- Return type
pd.Series
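As a rough illustration of the input this loader expects, the sketch below parses FIMI-style text in which each line is one transaction and items are split on `separator`. The helper name `parse_fimi_lines` is hypothetical, and a plain Python list stands in for the pandas Series; this is not the library's implementation.

```python
import io


def parse_fimi_lines(lines, separator=" ", int_values=False):
    """Parse FIMI-style lines, one transaction per line (hypothetical helper)."""
    transactions = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Drop empty tokens produced by repeated separators.
        items = [tok for tok in line.split(separator) if tok]
        if int_values:
            items = [int(tok) for tok in items]
        transactions.append(items)
    return transactions


# A tiny in-memory "file" holding three transactions.
sample = io.StringIO("1 2 3\n2 3\n1 4 5 6\n")
as_strings = parse_fimi_lines(sample)
sample.seek(0)
as_ints = parse_fimi_lines(sample, int_values=True)
print(as_strings)
print(as_ints)
```

With `int_values=False` the items stay as strings, which mirrors the default described above.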
-
skmine.datasets.fimi.fetch_any(filename, base_url='http://fimi.uantwerpen.be/data/', data_home=None)
Base loader for all datasets from the FIMI and CGI repositories. Each unique transaction is represented as a Python list in the resulting pandas Series.
see: http://fimi.uantwerpen.be/data/ and https://cgi.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/DataSets/dataSets.html
- Parameters
data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in ~/scikit_mine_data/ subfolders.
filename (str) – Name of the file to fetch
base_url (str) – URL indicating where to fetch the dataset
- Returns
Transactions from the requested dataset, as an in-memory pandas Series
- Return type
pd.Series
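The download-and-cache behaviour implied by `data_home` can be sketched as follows. This is a minimal sketch under stated assumptions, not the library's code: `fetch_with_cache` and the injected `download` callable are hypothetical, and the file name `chess.dat` is used only for illustration. Injecting the downloader lets the example run without network access.

```python
import os
import tempfile


def fetch_with_cache(filename, base_url, data_home, download):
    """Return the local path of `filename`, downloading it only if missing.

    `download(url, dest)` is a hypothetical injected callable.
    """
    os.makedirs(data_home, exist_ok=True)
    local_path = os.path.join(data_home, filename)
    if not os.path.exists(local_path):
        download(base_url + filename, local_path)
    return local_path


calls = []


def fake_download(url, dest):
    # Stand-in for a real HTTP download: record the URL, write dummy content.
    calls.append(url)
    with open(dest, "w") as fh:
        fh.write("1 2 3\n")


base = "http://fimi.uantwerpen.be/data/"
with tempfile.TemporaryDirectory() as tmp:
    first = fetch_with_cache("chess.dat", base, tmp, fake_download)
    second = fetch_with_cache("chess.dat", base, tmp, fake_download)
    same_path = first == second

print(calls)  # the second call finds the cached file, so only one download is recorded
```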
-
skmine.datasets.fimi.fetch_chess(data_home=None)
Fetch and return the chess dataset (Frequent Itemset Mining).
Nb of items: 75
Nb of transactions: 3196
Avg transaction size: 37.0
Density: 0.493
- Parameters
data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.
- Returns
Transactions from the chess dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.
- Return type
pd.Series
-
skmine.datasets.fimi.fetch_connect(data_home=None)
Fetch and return the connect dataset (Frequent Itemset Mining).
Nb of items: 129
Nb of transactions: 67557
Avg transaction size: 43.0
Density: 0.333
- Parameters
data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.
- Returns
Transactions from the connect dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.
- Return type
pd.Series
-
skmine.datasets.fimi.fetch_mushroom(data_home=None, return_y=False)
Fetch and return the mushroom dataset (Frequent Itemset Mining).
The Mushroom data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family.
It contains information about 8124 mushrooms (transactions). 4208 (51.8%) are edible and 3916 (48.2%) are poisonous.
The data contains 22 nominal features plus the class attribute (edible or not). These features were translated into 117 items.
Nb of items: 117
Nb of transactions: 8124
Avg transaction size: 22.0
Density: 0.188
- Parameters
data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.
return_y (bool, default=False) – If True, returns a tuple for both the data and the associated labels (0 for edible, 1 for poisonous)
- Returns
mush (pd.Series) – Transactions from the mushroom dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.
(mush, y) (tuple) – if return_y is True
Examples
>>> from skmine.datasets.fimi import fetch_mushroom
>>> from skmine.datasets.utils import describe
>>> X, y = fetch_mushroom(return_y=True)
>>> describe(X)['n_items']
117
>>> y.value_counts()
0    4208
1    3916
Name: mushroom, dtype: int64
-
skmine.datasets.fimi.fetch_pumsb(data_home=None)
Fetch and return the pumsb dataset (Frequent Itemset Mining).
The Pumsb dataset contains census data for population and housing.
Nb of items: 2113
Nb of transactions: 49046
Avg transaction size: 74.0
Density: 0.035
- Parameters
data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.
- Returns
Transactions from the pumsb dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.
- Return type
pd.Series
-
skmine.datasets.fimi.fetch_pumsb_star(data_home=None)
Fetch and return the pumsb_star dataset (Frequent Itemset Mining).
Nb of items: 2088
Nb of transactions: 49046
Avg transaction size: 50.48
Density: 0.024
- Parameters
data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.
- Returns
Transactions from the pumsb_star dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.
- Return type
pd.Series
-
skmine.datasets.fimi.fetch_kosarak(data_home=None)
Fetch and return the kosarak dataset (Frequent Itemset Mining).
Click-stream data from a Hungarian online news portal.
Nb of items: 36855
Nb of transactions: 990002
Avg transaction size: 8.1
Density: 0.000220
- Parameters
data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.
- Returns
Transactions from the kosarak dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.
- Return type
pd.Series
-
skmine.datasets.fimi.fetch_retail(data_home=None)
Fetch and return the retail dataset (Frequent Itemset Mining).
Contains market basket data from a Belgian retail store, anonymized.
see: http://fimi.uantwerpen.be/data/retail.pdf
Nb of items: 16470
Nb of transactions: 88162
Avg transaction size: 10.3
Density: 0.000626
Retail market basket data set supplied by an anonymous Belgian retail supermarket store.
The collection covers approximately 5 months of data. The total number of receipts collected equals 88,163.
In total, 5,133 customers purchased at least one product during the data collection period.
- Parameters
data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.
- Returns
Transactions from the retail dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.
- Return type
pd.Series
-
skmine.datasets.fimi.fetch_accidents(data_home=None)
Fetch and return the accidents dataset (Frequent Itemset Mining).
Traffic accident data, anonymized.
see: http://fimi.uantwerpen.be/data/accidents.pdf
Nb of items: 468
Nb of transactions: 340183
Avg transaction size: 33.807
Density: 0.072
- Parameters
data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in ~/scikit_mine_data.
- Returns
Transactions from the accidents dataset, as an in-memory pandas Series. Each unique transaction is represented as a Python list.
- Return type
pd.Series
Logs datasets for periodic pattern mining
-
skmine.datasets.fetch_health_app(data_home=None, filename='health_app.csv')
Fetch and return the health app log dataset.
see: https://github.com/logpai/loghub
HealthApp is a mobile application for Android devices. Logs were collected from an Android smartphone after 10+ days of use.
Logs have been grouped by their types, hence resulting in only 20 different events.
Number of occurrences: 2000
Number of events: 20
Average delta per event: Timedelta('0 days 00:53:24.984000')
Average nb of points per event: 100.0
- Parameters
filename (str, default: health_app.csv) – Name of the file (without the data_home directory) where the dataset will be or is already downloaded.
data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.
- Returns
System logs from the health app dataset, as an in-memory pandas Series. Events are indexed by timestamps.
- Return type
pd.Series
-
skmine.datasets.fetch_canadian_tv(data_home=None, filename='canadian_tv.txt')
Fetch and return Canadian TV logs from August 2020.
see: https://zenodo.org/record/4671512
If the dataset has never been downloaded before, it will be downloaded and stored.
The returned dataset contains only TV series programs indexed by their associated timestamps. Adverts are ignored when loading the dataset.
Number of occurrences: 2093
Number of events: 98
Average delta per event: Timedelta('19 days 02:13:36.122448979')
Average nb of points per event: 21.35714285714285
- Parameters
filename (str, default: canadian_tv.txt) – Name of the file (without the data_home directory) where the dataset will be or is already downloaded.
data_home (optional, default: None) – Specify another download and cache folder for the datasets. By default, all scikit-mine data is stored in scikit-mine_data.
- Returns
TV series events from Canadian TV, as an in-memory pandas Series. Events are indexed by timestamps.
- Return type
pd.Series
Notes
For now the entire .zip file is downloaded, which takes about 90 MB on disk. Downloading a preprocessed dataset from zenodo.org is under consideration.
Synthetic data generation
-
skmine.datasets.make_transactions(n_transactions=1000, n_items=100, density=0.5, random_state=None, item_start=0)
Generate a transactional dataset with predefined properties.
see: https://liris.cnrs.fr/Documents/Liris-3716.pdf
Transaction sizes follow a normal distribution, centered around density * n_items. Individual items are integer values between 0 and n_items.
- Parameters
n_transactions (int, default=1000) – The number of transactions to generate
n_items (int, default=100) – The number of individual items, i.e. the size of the set of symbols
density (float, default=0.5) – Density of the resulting dataset
random_state (int, RandomState instance, default=None) – Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.
Example
>>> from skmine.datasets import make_transactions
>>> make_transactions(n_transactions=5, n_items=20, density=.25)
0    [0, 6, 18, 10, 1, 12]
1          [2, 18, 10, 14]
2                [4, 5, 1]
3         [10, 11, 16, 19]
4     [9, 4, 19, 8, 12, 5]
dtype: object
Notes
- With a binary matrix representation of the resulting dataset, we have the following equality
- \[density = { Number\ of\ ones \over Number\ of\ cells }\]
- This is equivalent to
- \[density = { Average\ transaction\ size \over number\ of\ items }\]
- Returns
pd.Series – Each entry is a list of integer values
- Return type
a Series of shape (n_transactions,)
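The generation scheme described above (sizes drawn from a normal distribution centered on density * n_items) can be sketched with the standard library alone. This is an illustration under assumptions, not the library's implementation: `sketch_transactions` is a hypothetical name, the standard deviation is a guess since the text does not specify it, and plain lists replace the pandas Series.

```python
import random


def sketch_transactions(n_transactions=1000, n_items=100, density=0.5, seed=None):
    """Generate lists of distinct integer items (hypothetical sketch)."""
    rng = random.Random(seed)
    mean_size = density * n_items
    std_size = max(1.0, mean_size / 10)  # assumption: the spread is not given in the text
    transactions = []
    for _ in range(n_transactions):
        size = int(round(rng.gauss(mean_size, std_size)))
        size = min(max(size, 1), n_items)  # keep sizes within [1, n_items]
        # Sample distinct items from 0 .. n_items - 1.
        transactions.append(rng.sample(range(n_items), size))
    return transactions


data = sketch_transactions(n_transactions=500, n_items=20, density=0.25, seed=0)
avg_size = sum(len(t) for t in data) / len(data)
observed_density = avg_size / 20  # average transaction size over number of items
print(round(observed_density, 2))
```

The observed density should land close to the requested value, which is the equality stated in the notes above.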
-
skmine.datasets.make_classification(n_samples=100, n_items_per_class=100, *, n_classes=2, weights=None, class_sep=0.2, shuffle=True, random_state=None, densities=None)
Generate a random n-class classification problem.
Acts like the sklearn version of make_classification, but produces transactional data instead. Transactions are drawn from a n_items_per_class number of items, respecting the class_sep parameter to ensure transactions are drawn from different alphabets for different classes.
A class_sep value of 0.0 will result in transactions being drawn from the same set of symbols.
Densities can be defined for each class given the densities parameter.
- Parameters
n_samples (int, default=100) – The number of samples
n_items_per_class (int, default=100) – The number of items per class. This is similar to the n_features parameter in scikit-learn, but operates at a class level.
n_classes (int, default=2) – The number of classes (or labels) of the classification problem
weights (array-like of shape (n_classes,), default=None) – The proportions of samples assigned to each class. If None, then classes are balanced
class_sep (float, default=0.2) – The proportion of items that differ between classes. Setting this to 1.0 makes the classification problem trivial.
shuffle (boolean, default=True) – Shuffle the samples and the labels
random_state (int, RandomState instance, default=None) – Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.
- Returns
D (pd.Series of shape [n_samples, ]) – The generated samples
y (pd.Series of shape [n_samples]) – Labels associated to D
See also
make_transactions
which is used internally to generate samples
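To make the effect of class_sep concrete, the sketch below builds one item alphabet per class and counts how many items two classes share. The shifting rule is an assumption made for illustration (each class starts `int(class_sep * n_items_per_class)` items after the previous one), and `class_alphabets` is a hypothetical helper, not part of the library.

```python
def class_alphabets(n_items_per_class=100, n_classes=2, class_sep=0.2):
    """Return one set of item ids per class (hypothetical sketch).

    Assumption: class k draws from n_items_per_class consecutive integers
    starting at k * int(class_sep * n_items_per_class).
    """
    shift = int(class_sep * n_items_per_class)
    return [
        set(range(k * shift, k * shift + n_items_per_class))
        for k in range(n_classes)
    ]


same = class_alphabets(n_items_per_class=10, class_sep=0.0)
partial = class_alphabets(n_items_per_class=10, class_sep=0.2)
disjoint = class_alphabets(n_items_per_class=10, class_sep=1.0)

print(len(same[0] & same[1]))          # shared items when class_sep=0.0
print(len(partial[0] & partial[1]))    # shared items when class_sep=0.2
print(len(disjoint[0] & disjoint[1]))  # shared items when class_sep=1.0
```

Under this assumption, a separation of 0.0 gives identical alphabets and 1.0 gives disjoint ones, which matches the two extremes described in the text.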
utils
-
skmine.datasets.utils.describe(D)
Give some high level properties on transactions.
Number of items: int
Number of transactions: int
Average transaction size: float
Density: float in [0, 1]
- Parameters
D (pd.Series) – A transactional dataset
Notes
\[density = { avg\_transaction\_size \over n\_items }\]
Example
>>> from skmine.datasets.fimi import fetch_chess
>>> from skmine.datasets.utils import describe
>>> describe(fetch_chess())
{'n_items': 75, 'avg_transaction_size': 37.0, 'n_transactions': 3196, 'density': 0.4933}
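The density formula can be checked by hand on a toy dataset. The helper below is a hypothetical stand-in that works on plain Python lists; it only mirrors the four properties listed above and is not the library's implementation.

```python
def describe_transactions(transactions):
    """Compute summary statistics for a list of transactions (hypothetical helper)."""
    items = set()
    for t in transactions:
        items.update(t)
    n_items = len(items)
    n_transactions = len(transactions)
    avg_size = sum(len(t) for t in transactions) / n_transactions
    return {
        "n_items": n_items,
        "avg_transaction_size": avg_size,
        "n_transactions": n_transactions,
        "density": avg_size / n_items,  # avg_transaction_size over n_items
    }


# 4 distinct items, 4 transactions, 8 item occurrences in total.
toy = [["a", "b", "c"], ["a", "c"], ["b", "d"], ["a"]]
stats = describe_transactions(toy)
print(stats)
```

Here the binary matrix has 4 x 4 = 16 cells and 8 ones, so both forms of the density give 0.5.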
-
skmine.datasets.utils.describe_logs(D)
Give some high level properties on logs.
Number of events: int
Average delta per event: pd.Timedelta
Average nb of points per event: float
- Parameters
D (pd.Series) – A dataset containing logs
Example
>>> from skmine.datasets.periodic import fetch_health_app
>>> from skmine.datasets.utils import describe_logs
>>> describe_logs(fetch_health_app())
{'n_events': 20, 'avg_delta_per_event': Timedelta('0 days 00:53:24.984000'), 'avg_nb_points_per_event': 100.0}
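The count-based properties are simple ratios: for the health app logs, 2000 occurrences over 20 events gives 100.0 points per event. The sketch below reproduces that arithmetic on a toy log. `count_log_events` is a hypothetical helper working on a plain list of event names; the average delta is left out because its exact definition is not given here.

```python
from collections import Counter


def count_log_events(events):
    """Return count-based statistics for a sequence of log events (hypothetical helper)."""
    counts = Counter(events)  # occurrences per distinct event
    n_events = len(counts)
    n_occurrences = sum(counts.values())
    return {
        "n_events": n_events,
        "n_occurrences": n_occurrences,
        "avg_nb_points_per_event": n_occurrences / n_events,
    }


toy_log = ["start", "sync", "start", "stop", "sync", "start"]
summary = count_log_events(toy_log)
print(summary)
```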
-
skmine.datasets.get_data_home(data_home=None)
Return the path of the scikit-mine data home directory.
This folder is used by some large dataset loaders to avoid downloading data several times.
By default, data_home is ./scikit_mine_data/. Alternatively, it can be set by the SCIKIT_MINE_DATA environment variable or programmatically by giving an explicit folder path.
- Parameters
data_home (str | None) – The path to scikit-mine data dir.
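The lookup order described above can be sketched as follows. This is an assumption-laden illustration, not the library's code: `resolve_data_home` is a hypothetical name, the precedence (explicit argument, then SCIKIT_MINE_DATA, then the default folder) is inferred from the description, and the paths used are arbitrary examples.

```python
import os


def resolve_data_home(data_home=None, default="./scikit_mine_data/"):
    """Resolve the data directory path (hypothetical sketch).

    Assumed precedence: explicit argument, then the SCIKIT_MINE_DATA
    environment variable, then the default folder.
    """
    if data_home is None:
        data_home = os.environ.get("SCIKIT_MINE_DATA", default)
    return os.path.abspath(os.path.expanduser(data_home))


# No argument and no environment variable: fall back to the default folder.
os.environ.pop("SCIKIT_MINE_DATA", None)
from_default = resolve_data_home()

# Environment variable set: it replaces the default.
os.environ["SCIKIT_MINE_DATA"] = "/tmp/mine_cache"
from_env = resolve_data_home()

# Explicit argument: it takes priority over the environment variable.
from_arg = resolve_data_home("/tmp/explicit")
os.environ.pop("SCIKIT_MINE_DATA", None)

print(from_default, from_env, from_arg)
```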