Fetch FIMI datasets

FIMI (Frequent Itemset Mining Implementations) is a popular repository of standard datasets and algorithms for pattern mining.

Load individual datasets from the repository

[1]:

import skmine

print("This tutorial was tested with the following version of skmine:", skmine.__version__)

This tutorial was tested with the following version of skmine: 1.0.0

[2]:

from skmine.datasets.fimi import (
    fetch_accidents,
    fetch_chess,
    fetch_iris,
    fetch_kosarak,
)

[3]:

chess = fetch_chess()
chess.head()
# Each row is a list of items; the pandas .str accessor allows horizontal slicing, e.g. chess.str[:5]

100% [............................................................................] 342294 / 342294

[3]:

0    [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25...
1    [1, 3, 5, 7, 9, 12, 13, 15, 17, 19, 21, 23, 25...
2    [1, 3, 5, 7, 9, 12, 13, 16, 17, 19, 21, 23, 25...
3    [1, 3, 5, 7, 9, 11, 13, 15, 17, 20, 21, 23, 25...
4    [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25...
Name: chess, dtype: object
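Because each row of the Series is a list of items, the pandas .str accessor can be used to slice every transaction at once. The sketch below illustrates this on a small hand-made Series (the `transactions` data is made up for illustration and merely stands in for the chess dataset, so no download is needed).

```python
import pandas as pd

# Hand-made stand-in for the chess Series: each row is a list of item identifiers.
transactions = pd.Series(
    [[1, 3, 5, 7, 9], [1, 3, 5, 7, 9], [2, 4, 6]],
    name="toy",
)

# Slice every transaction "horizontally": keep only its first three items.
first_three = transactions.str[:3]

# Pick a single position in every transaction (here, the first item).
first_item = transactions.str[0]

print(first_three.tolist())
print(first_item.tolist())
```

The same pattern should apply to the Series returned by the fetch functions, for example `chess.str[:5]`.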

[4]:

accidents = fetch_accidents()
accidents.head()

100% [........................................................................] 35509823 / 35509823

[4]:

0    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
1    [2, 5, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18...
2    [7, 10, 12, 13, 14, 15, 16, 17, 18, 20, 25, 28...
3    [1, 5, 8, 10, 12, 14, 15, 16, 17, 18, 19, 20, ...
4    [5, 8, 10, 12, 14, 15, 16, 17, 18, 21, 22, 24,...
Name: accidents, dtype: object

You also have access to other well-known datasets, such as iris, in discretized form. They come from https://cgi.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/DataSets/dataSets.html.

[5]:

iris = fetch_iris()
iris.head()


100% [..................................................................................] 318 / 318

[5]:

0     [2, 9, 12, 15, 18]
1    [1, 10, 11, 14, 17]
2    [5, 10, 13, 16, 19]
3     [2, 6, 12, 15, 18]
4     [1, 8, 11, 14, 17]
Name: iris.D19.N150.C3.num, dtype: object

Some datasets, like iris, are also suited to classification and have a target column. To get X and y separately, use the return_y parameter.

[6]:

X, y = fetch_iris(return_y=True)
print(X.head(), y.head())

0     [2, 9, 12, 15]
1    [1, 10, 11, 14]
2    [5, 10, 13, 16]
3     [2, 6, 12, 15]
4     [1, 8, 11, 14]
Name: iris.D19.N150.C3.num, dtype: object 0    18
1    17
2    19
3    18
4    17
Name: iris.D19.N150.C3.num, dtype: int64
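Comparing this output with the full transactions printed earlier, y appears to hold the last item of each transaction and X the remaining items. The sketch below reproduces that relationship on plain Python lists, using the five rows shown above. `split_target` is a hypothetical helper written for illustration; it is not part of skmine and may not reflect how the library performs the split internally.

```python
# First five iris transactions, copied from the output of fetch_iris() above.
transactions = [
    [2, 9, 12, 15, 18],
    [1, 10, 11, 14, 17],
    [5, 10, 13, 16, 19],
    [2, 6, 12, 15, 18],
    [1, 8, 11, 14, 17],
]


def split_target(rows):
    """Hypothetical helper: treat the last item of each row as the class label."""
    X = [row[:-1] for row in rows]
    y = [row[-1] for row in rows]
    return X, y


X, y = split_target(transactions)
print(X[0], y[0])
```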

Load your own files in FIMI format

The fetch_file function lets you load your own dataset in FIMI format. Use int_values to indicate whether the values in the file (the item identifiers) are integers or something else, such as strings. Algorithms perform better with integers. You can also specify the separator between items with the separator parameter. By default, it is a single space. (Example FIMI format here)

[7]:

from skmine.datasets.fimi import fetch_file
db = fetch_file('data.dat', int_values=True)
db

[7]:

0    [0, 1, 2]
1       [0, 1]
2    [1, 2, 3]
Name: data, dtype: object
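To make the file layout concrete, here is a minimal sketch of a FIMI-style file and a pure-Python reader for it. In this format each line is one transaction and items are separated by the separator. The file content below is an assumption chosen to match the output above, and `read_fimi` is a hypothetical helper for illustration, not the skmine implementation.

```python
import os
import tempfile


def read_fimi(path, int_values=True, separator=" "):
    """Hypothetical reader: one transaction per line, items split on `separator`."""
    transactions = []
    with open(path) as handle:
        for line in handle:
            items = [item for item in line.strip().split(separator) if item]
            if not items:
                continue  # skip blank lines
            transactions.append([int(i) for i in items] if int_values else items)
    return transactions


# Assumed content for a file like 'data.dat': three transactions over items 0..3.
content = "0 1 2\n0 1\n1 2 3\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.dat")
    with open(path, "w") as handle:
        handle.write(content)
    as_ints = read_fimi(path, int_values=True)
    as_strings = read_fimi(path, int_values=False)

print(as_ints)
print(as_strings)
```

With `int_values=False` the items stay as strings, which is the case to use when item identifiers are not numeric.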

Print basic statistics about a dataset

[8]:

from skmine.datasets.utils import describe
describe(chess)

[8]:

{'n_items': 75,
 'avg_transaction_size': 37.0,
 'n_transactions': 3196,
 'density': 0.49333333333333335}
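The numbers above are consistent with density being the average transaction size divided by the number of distinct items (37.0 / 75 ≈ 0.4933). The sketch below computes the same four statistics on a toy dataset under that assumption. `describe_transactions` is a hypothetical re-implementation inferred from the printed output, not the code of skmine's describe.

```python
def describe_transactions(transactions):
    """Hypothetical re-implementation of the four statistics printed above."""
    n_transactions = len(transactions)
    # Distinct items across all transactions.
    items = {item for transaction in transactions for item in transaction}
    n_items = len(items)
    avg_transaction_size = sum(len(t) for t in transactions) / n_transactions
    return {
        "n_items": n_items,
        "avg_transaction_size": avg_transaction_size,
        "n_transactions": n_transactions,
        # Assumed definition: share of filled cells in the transaction/item matrix.
        "density": avg_transaction_size / n_items,
    }


toy = [[0, 1, 2], [0, 1], [1, 2, 3]]
stats = describe_transactions(toy)
print(stats)
```

A density close to 1 means most transactions contain most items, while a density close to 0 indicates sparse data.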