Fetch FIMI datasets
FIMI (Frequent Itemset Mining Implementations) is a popular repository of standard datasets and algorithms for pattern mining.
Load individual datasets from the repository
[1]:
import skmine
print("This tutorial was tested with the following version of skmine :", skmine.__version__)
This tutorial was tested with the following version of skmine : 1.0.0
[2]:
from skmine.datasets.fimi import fetch_chess
from skmine.datasets.fimi import fetch_accidents
from skmine.datasets.fimi import fetch_kosarak
from skmine.datasets.fimi import fetch_iris
[3]:
chess = fetch_chess()
chess.head()
# .str accessor allows horizontal slicing
100% [............................................................................] 342294 / 342294
[3]:
0 [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25...
1 [1, 3, 5, 7, 9, 12, 13, 15, 17, 19, 21, 23, 25...
2 [1, 3, 5, 7, 9, 12, 13, 16, 17, 19, 21, 23, 25...
3 [1, 3, 5, 7, 9, 11, 13, 15, 17, 20, 21, 23, 25...
4 [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25...
Name: chess, dtype: object
[4]:
accidents = fetch_accidents()
accidents.head()
100% [........................................................................] 35509823 / 35509823
[4]:
0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
1 [2, 5, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18...
2 [7, 10, 12, 13, 14, 15, 16, 17, 18, 20, 25, 28...
3 [1, 5, 8, 10, 12, 14, 15, 16, 17, 18, 19, 20, ...
4 [5, 8, 10, 12, 14, 15, 16, 17, 18, 21, 22, 24,...
Name: accidents, dtype: object
You can also fetch other well-known datasets from the LUCS-KDD repository (https://cgi.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/DataSets/dataSets.html). These datasets have been discretized, as is the case for the iris dataset.
[5]:
iris = fetch_iris()
iris.head()
0% [ ] 0 / 318
100% [..................................................................................] 318 / 318
[5]:
0 [2, 9, 12, 15, 18]
1 [1, 10, 11, 14, 17]
2 [5, 10, 13, 16, 19]
3 [2, 6, 12, 15, 18]
4 [1, 8, 11, 14, 17]
Name: iris.D19.N150.C3.num, dtype: object
Some datasets, such as iris, also support classification and include a target column. To get X and y separately, use the return_y parameter.
[6]:
X, y = fetch_iris(return_y=True)
print(X.head(), y.head())
0 [2, 9, 12, 15]
1 [1, 10, 11, 14]
2 [5, 10, 13, 16]
3 [2, 6, 12, 15]
4 [1, 8, 11, 14]
Name: iris.D19.N150.C3.num, dtype: object 0 18
1 17
2 19
3 18
4 17
Name: iris.D19.N150.C3.num, dtype: int64
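Comparing the two outputs above suggests that, for iris, the class label is stored as the last item of each transaction (18, 17 or 19 in the rows shown). The split obtained with return_y=True can be pictured with the minimal sketch below. Note that split_target is a hypothetical helper written for illustration, not part of skmine, and it assumes the label is always the last item of a transaction.

```python
def split_target(transactions):
    """Split transactions into (X, y), assuming the label is the last item.

    Hypothetical helper for illustration only; this is not skmine code.
    """
    X = [t[:-1] for t in transactions]  # every item except the last one
    y = [t[-1] for t in transactions]   # the last item, taken as the class label
    return X, y


# First rows of the discretized iris dataset, copied from the output of cell [5]
rows = [
    [2, 9, 12, 15, 18],
    [1, 10, 11, 14, 17],
    [5, 10, 13, 16, 19],
]

X_demo, y_demo = split_target(rows)
print(X_demo)
print(y_demo)
```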
Load your own files in FIMI format
The fetch_file function lets you load your own dataset in FIMI format, that is, a text file with one transaction per line and the items of each transaction separated by a delimiter. Use int_values to indicate whether the item identifiers in the file are integers or not (e.g. strings); the algorithms perform better with integers. Use separator to set the delimiter between items. By default, it is a blank space.
[7]:
from skmine.datasets.fimi import fetch_file
db = fetch_file('data.dat', int_values=True)
db
[7]:
0 [0, 1, 2]
1 [0, 1]
2 [1, 2, 3]
Name: data, dtype: object
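To make the file layout concrete, here is a minimal sketch of how such a file can be read with the standard library alone. The read_fimi helper is hypothetical and is not how skmine implements fetch_file; its int_values and separator arguments merely mirror the parameters described above. The file content written here is an assumption, chosen to be consistent with the output of cell [7].

```python
import os
import tempfile


def read_fimi(path, int_values=True, separator=" "):
    """Read a FIMI-style file: one transaction per line.

    Hypothetical helper for illustration only; this is not skmine code.
    """
    transactions = []
    with open(path) as f:
        for line in f:
            # Split the line on the separator and drop empty tokens
            items = [tok for tok in line.strip().split(separator) if tok]
            if not items:
                continue  # skip blank lines
            if int_values:
                items = [int(tok) for tok in items]
            transactions.append(items)
    return transactions


# Write a small example file (assumed content), then read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.dat")
    with open(path, "w") as f:
        f.write("0 1 2\n0 1\n1 2 3\n")
    parsed = read_fimi(path)

print(parsed)
```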
Print basic statistics about a dataset
[8]:
from skmine.datasets.utils import describe
describe(chess)
[8]:
{'n_items': 75,
'avg_transaction_size': 37.0,
'n_transactions': 3196,
'density': 0.49333333333333335}
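The values above are consistent with density being the average transaction size divided by the number of distinct items (37.0 / 75 ≈ 0.4933). The sketch below recomputes the same four statistics on a plain list of transactions under that assumption. The describe_transactions helper is hypothetical and is not skmine's implementation of describe; it expects a non-empty dataset.

```python
def describe_transactions(transactions):
    """Compute basic statistics for a non-empty list of transactions.

    Hypothetical helper for illustration only; this is not skmine code.
    """
    distinct_items = set()
    total_size = 0
    for t in transactions:
        distinct_items.update(t)  # collect every distinct item
        total_size += len(t)      # accumulate transaction lengths

    n_transactions = len(transactions)
    n_items = len(distinct_items)
    avg_transaction_size = total_size / n_transactions
    return {
        "n_items": n_items,
        "avg_transaction_size": avg_transaction_size,
        "n_transactions": n_transactions,
        # Assumed definition: average fraction of all items present in a transaction
        "density": avg_transaction_size / n_items,
    }


stats = describe_transactions([[0, 1, 2], [0, 1], [1, 2, 3]])
print(stats)
```

On this toy dataset there are 4 distinct items and 8 items in total across 3 transactions, so the average transaction size is 8 / 3 and the density is (8 / 3) / 4.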