{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Fetch [FIMI](http://fimi.uantwerpen.be/data/) datasets\n", "`FIMI` is a popular repository that references standard datasets and algorithms in pattern mining.\n", "\n", "### Load individual datasets from the repository" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This tutorial was tested with the following version of skmine : 1.0.0\n" ] } ], "source": [ "import skmine\n", "\n", "print(\"This tutorial was tested with the following version of skmine :\", skmine.__version__)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from skmine.datasets.fimi import fetch_chess\n", "from skmine.datasets.fimi import fetch_accidents\n", "from skmine.datasets.fimi import fetch_kosarak\n", "from skmine.datasets.fimi import fetch_iris" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100% [............................................................................] 342294 / 342294" ] }, { "data": { "text/plain": [ "0 [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25...\n", "1 [1, 3, 5, 7, 9, 12, 13, 15, 17, 19, 21, 23, 25...\n", "2 [1, 3, 5, 7, 9, 12, 13, 16, 17, 19, 21, 23, 25...\n", "3 [1, 3, 5, 7, 9, 11, 13, 15, 17, 20, 21, 23, 25...\n", "4 [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25...\n", "Name: chess, dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chess = fetch_chess()\n", "chess.head()\n", "# .str accessor allows horizontal slicing" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100% [........................................................................] 
35509823 / 35509823" ] }, { "data": { "text/plain": [ "0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...\n", "1 [2, 5, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18...\n", "2 [7, 10, 12, 13, 14, 15, 16, 17, 18, 20, 25, 28...\n", "3 [1, 5, 8, 10, 12, 14, 15, 16, 17, 18, 19, 20, ...\n", "4 [5, 8, 10, 12, 14, 15, 16, 17, 18, 21, 22, 24,...\n", "Name: accidents, dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accidents = fetch_accidents()\n", "accidents.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You also have access to other well-known datasets from https://cgi.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/DataSets/dataSets.html that have been **discretized**, such as the **iris** dataset." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", " 0% [ ] 0 / 318\r\n", "100% [..................................................................................] 
318 / 318" ] }, { "data": { "text/plain": [ "0 [2, 9, 12, 15, 18]\n", "1 [1, 10, 11, 14, 17]\n", "2 [5, 10, 13, 16, 19]\n", "3 [2, 6, 12, 15, 18]\n", "4 [1, 8, 11, 14, 17]\n", "Name: iris.D19.N150.C3.num, dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris = fetch_iris()\n", "iris.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Some datasets, like iris, also support **classification and have a target column**. To get X and y separately, use the *return_y* parameter." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 [2, 9, 12, 15]\n", "1 [1, 10, 11, 14]\n", "2 [5, 10, 13, 16]\n", "3 [2, 6, 12, 15]\n", "4 [1, 8, 11, 14]\n", "Name: iris.D19.N150.C3.num, dtype: object 0 18\n", "1 17\n", "2 19\n", "3 18\n", "4 17\n", "Name: iris.D19.N150.C3.num, dtype: int64\n" ] } ], "source": [ "X, y = fetch_iris(return_y=True)\n", "print(X.head(), y.head())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Load your own files in FIMI format\n", "The **fetch_file** function lets you load your own dataset in FIMI format. You can indicate whether the values in the file (the item identifiers) are **integers or not** (e.g. strings). The algorithms perform better with integers. You can also **specify the separator** between items with 'separator'. By default, it is a blank space. 
(Example FIMI format [here](http://fimi.uantwerpen.be/data/chess.dat))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [0, 1, 2]\n", "1 [0, 1]\n", "2 [1, 2, 3]\n", "Name: data, dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from skmine.datasets.fimi import fetch_file\n", "db = fetch_file('data.dat', int_values=True)\n", "db" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Print basic statistics about a dataset" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'n_items': 75,\n", " 'avg_transaction_size': 37.0,\n", " 'n_transactions': 3196,\n", " 'density': 0.49333333333333335}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from skmine.datasets.utils import describe\n", "describe(chess)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "261.517px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }