{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Classification examples with SLIM"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This tutorial was tested with the following version of skmine : 1.0.0\n"
]
}
],
"source": [
"import skmine\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"print(\"This tutorial was tested with the following version of skmine :\", skmine.__version__)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"MDL-based algorithms encode data according to a given codetable.\n",
"\n",
"When calling ``.fit``, we iteratively look for the codetable that best compresses the training data.\n",
"\n",
"**Once training is done, we can use the refined codetable to make predictions.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SLIM Classifier for binary and multiclass classification (k>=2)\n",
"\n",
"scikit-mine ships an **integrated classifier** that can **solve binary and multiclass problems**. It uses the SLIM compression algorithm. \n",
"\n",
"To use it, we need a **discretized dataset**. Let's take the **discretized iris dataset**, which has **3 classes**, as an example."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-> Data:\n",
" 0 [2, 9, 12, 15]\n",
"1 [1, 10, 11, 14]\n",
"2 [5, 10, 13, 16]\n",
"3 [2, 6, 12, 15]\n",
"4 [1, 8, 11, 14]\n",
" ... \n",
"145 [3, 9, 13, 16]\n",
"146 [1, 10, 11, 14]\n",
"147 [3, 8, 12, 15]\n",
"148 [5, 9, 13, 16]\n",
"149 [5, 10, 13, 16]\n",
"Name: iris.D19.N150.C3.num, Length: 150, dtype: object\n",
"-> Unique label : [17 18 19]\n"
]
}
],
"source": [
"from skmine.datasets.fimi import fetch_iris\n",
"X, y = fetch_iris(return_y=True) # without return_y=True, the method would have returned the whole dataset in one variable\n",
"label_names = ['setosa', 'versicolor', 'virginica']\n",
"print(\"-> Data:\\n\", X)\n",
"print(\"-> Unique label :\", np.unique(y))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that in the discretized iris dataset, each feature is discretized **with its own set of labels**: "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"unique items in column 0 : [1 2 3 4 5]\n",
"unique items in column 1 : [ 6  7  8  9 10]\n",
"unique items in column 2 : [11 12 13]\n",
"unique items in column 3 : [14 15 16]\n"
]
}
],
"source": [
"import numpy as np\n",
"X_2d = np.array(X.to_list())  # stack the transactions into a 2-D array, one column per feature\n",
"for k in range(X_2d.shape[-1]): \n",
"    print(f\"unique items in column {k} : {np.unique(X_2d[:,k])}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The goal is to **predict the class label** (the last column of the original database) **from the 4 feature items**. The possible targets are 17, 18 and 19. We can now prepare our training and test sets."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"X_train shape: (120,) y_train shape: (120,)\n",
"X_test shape: (30,) y_test shape: (30,)\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"(X_train, X_test, y_train, y_test) = train_test_split(X, y, random_state=1, test_size=0.2, shuffle=True)\n",
"print(\"X_train shape:\", X_train.shape, \"y_train shape:\", y_train.shape)\n",
"print(\"X_test shape:\", X_test.shape, \"y_test shape:\", y_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can use our **SlimClassifier**."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"items {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}\n"
]
},
{
"data": {
"text/plain": [
"SlimClassifier(items={1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16})"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from skmine.itemsets.slim_classifier import SlimClassifier\n",
"\n",
"# You can pass the set of all items to the classifier as a parameter.\n",
"# This can improve its performance, especially on small datasets like iris.\n",
"items = set(item for transaction in X for item in transaction)\n",
"print(\"items\", items)\n",
"clf = SlimClassifier(items=items) # You can also enable or disable the pruning of SLIM compressors via the `pruning` parameter\n",
"clf.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use many scikit-learn functions that are compatible with classifiers. For example, you can build a **confusion matrix**, or use **GridSearchCV** or **cross validation**.\n",
"\n",
"- **Confusion matrix**"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-> Accuracy : 83.3 %\n",
"-> Confusion matrix :\n",
"             setosa  versicolor  virginica\n",
"setosa          13           1          0\n",
"versicolor       0           8          1\n",
"virginica        0           3          4\n"
]
}
],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"y_pred = clf.predict(X_test)\n",
"print(f\"-> Accuracy : {round(clf.score(X_test, y_test)*100,1)} %\")\n",
"\n",
"print(\"-> Confusion matrix :\\n\", pd.DataFrame(data=confusion_matrix(y_test, y_pred),columns=label_names, index=label_names))"
]
},
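{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (plain arithmetic on the table above, nothing specific to skmine), the accuracy can be recovered from the confusion matrix: correct predictions sit on the diagonal, so the accuracy is the diagonal sum divided by the number of test samples.\n",
"\n",
"$$\\text{accuracy} = \\frac{13 + 8 + 4}{30} = \\frac{25}{30} \\approx 83.3\\,\\%$$\n",
"\n",
"Assuming rows are true labels and columns are predictions (the scikit-learn convention for `confusion_matrix`), most errors come from virginica, where 3 of the 7 test samples are predicted as versicolor."
]
},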
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **GridSearchCV** (this method tests several parameter combinations for a classifier and keeps the best one)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-> Best params : {'items': {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}, 'pruning': False}\n",
"-> Accuracy : 98.3 %\n"
]
}
],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"parameters = {'pruning': [False, True], 'items': [None, items]}\n",
"grid = GridSearchCV(clf, parameters)\n",
"grid.fit(X_train,y_train)\n",
"print(\"-> Best params :\", grid.best_params_)\n",
"print(f\"-> Accuracy : {round(grid.score(X_train, y_train)*100,1)} %\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the best parameters, GridSearchCV reports an **accuracy of more than 98%**. Note that this score is computed on the **training set** (`grid.score(X_train, y_train)`), so it is optimistic and cannot be compared directly with the 83.3% obtained on the test set. In the best combination, the item set is passed as a parameter and pruning is disabled. Since pruning does not improve the compression of codetables in the SLIM algorithm on iris, it does not matter whether it is enabled or not.\n",
"\n",
"To get a more reliable estimate of how well the classifier generalizes, we can use the cross-validation tools of sklearn.\n",
"- **Cross validation**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-> 10 Cross validation: [0.93 0.93 0.87 0.93 0.93 0.93 1. 1. 1. 0.93]\n",
"-> Mean Accuracy : 94.7 %\n"
]
}
],
"source": [
"from sklearn.model_selection import cross_val_score\n",
"\n",
"cross_validation = cross_val_score(clf, X, y, cv=10)\n",
"print(f\"-> 10 Cross validation: {cross_validation.round(2)}\")\n",
"print(f\"-> Mean Accuracy : {round(cross_validation.mean()*100,1)} %\") "
]
},
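{
"cell_type": "markdown",
"metadata": {},
"source": [
"A worked reading of the fold scores above, assuming each of the 10 folds holds 150 / 10 = 15 test samples: a score of 0.93 then corresponds to 14 correct predictions out of 15, 0.87 to 13 out of 15, and 1.0 to 15 out of 15. Six folds score 0.93, one scores 0.87 and three score 1.0, so:\n",
"\n",
"$$\\text{mean accuracy} = \\frac{6 \\times 14 + 1 \\times 13 + 3 \\times 15}{150} = \\frac{142}{150} \\approx 94.7\\,\\%$$\n",
"\n",
"Under that assumption, 8 of the 150 flowers are misclassified over the whole cross-validation."
]
},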
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After cross-validation, we see that the **accuracy is almost 95% on average**: roughly 95 flowers out of 100 are assigned to the right species."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SLIM classifier from a numerical dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the **iris dataset** from scikit-learn, which is **not discretized**: "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-> X.shape : (150, 4)\n",
"-> X, 5 rows :\n",
" [[5.1 3.5 1.4 0.2]\n",
" [4.9 3. 1.4 0.2]\n",
" [4.7 3.2 1.3 0.2]\n",
" [4.6 3.1 1.5 0.2]\n",
" [5. 3.6 1.4 0.2]]\n",
"-> Unique labels : [0 1 2]\n"
]
}
],
"source": [
"from sklearn.datasets import load_iris\n",
"data = load_iris()\n",
"X, y = data.data, data.target\n",
"print(\"-> X.shape : \", X.shape)\n",
"print(\"-> X, 5 rows :\\n\", X[0:5])\n",
"print(\"-> Unique labels : \", np.unique(y))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Classic standardisation "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-> Xst.shape : (150, 4)\n",
"-> Xst, 5 rows :\n",
" [[-0.901 1.019 -1.34 -1.315]\n",
" [-1.143 -0.132 -1.34 -1.315]\n",
" [-1.385 0.328 -1.397 -1.315]\n",
" [-1.507 0.098 -1.283 -1.315]\n",
" [-1.022 1.249 -1.34 -1.315]]\n"
]
}
],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"Xst = StandardScaler().fit_transform(X)\n",
"print(\"-> Xst.shape : \", Xst.shape)\n",
"print(\"-> Xst, 5 rows :\\n\", Xst[0:5].round(3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"KBins discretisation"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-> Xt.shape : (150, 4)\n",
"-> Xt, 5 rows :\n",
" [[2 1 1 1]\n",
" [1 1 1 1]\n",
" [2 1 1 1]\n",
" [1 0 1 1]\n",
" [1 0 1 1]]\n"
]
}
],
"source": [
"from sklearn.preprocessing import KBinsDiscretizer\n",
"Xt = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform').fit_transform(Xst).astype(int)\n",
"# n_bins=3 : for each column we want to have 3 categorical values\n",
"print(\"-> Xt.shape : \", Xt.shape)\n",
"print(\"-> Xt, 5 rows :\\n\", Xt[50:55])\n",
"# In the output, each column is discretized in 3 values : 0, 1 and 2."
]
},
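{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that in this discretization of the iris dataset, **each feature** is discretized with the **same labels**, which **is not what we want**: an item is identified by its label only, so the value 1 in column 0 would be indistinguishable from the value 1 in column 1."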
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"unique items in column 0 : [0 1 2]\n",
"unique items in column 1 : [0 1 2]\n",
"unique items in column 2 : [0 1 2]\n",
"unique items in column 3 : [0 1 2]\n"
]
}
],
"source": [
"for k in range(4): \n",
"    print(f\"unique items in column {k} : {np.unique(Xt[:,k])}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We must **shift** values in columns in order to **avoid identical labels between columns**. "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"unique items in column 0 : [0 1 2]\n",
"unique items in column 1 : [3 4 5]\n",
"unique items in column 2 : [6 7 8]\n",
"unique items in column 3 : [ 9 10 11]\n",
"-> Xt.shape : (150,)\n",
"-> Xt, 5 rows :\n",
" 50 [2, 4, 7, 10]\n",
"51 [1, 4, 7, 10]\n",
"52 [2, 4, 7, 10]\n",
"53 [1, 3, 7, 10]\n",
"54 [1, 3, 7, 10]\n",
"dtype: object\n"
]
}
],
"source": [
"# Offset added to each column so that no two columns share a label.\n",
"# This is collision-free here because every column has the same largest label (n_bins=3).\n",
"shift_col = np.max(Xt, axis=0)\n",
"for k in range(1, len(shift_col)) : \n",
"    shift_col[k]+= shift_col[k-1] + 1\n",
"shift_col+=-shift_col[0]\n",
"\n",
"for k in range(len(shift_col)) : \n",
"    Xt[:,k]+=shift_col[k]\n",
"\n",
"for k in range(4): \n",
"    print(f\"unique items in column {k} : {np.unique(Xt[:,k])}\")\n",
"\n",
"Xt = pd.Series( Xt.tolist() ) # we must transform the array into a Series of lists\n",
"print(\"-> Xt.shape : \", Xt.shape)\n",
"print(\"-> Xt, 5 rows :\\n\", Xt[50:55])"
]
},
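{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the shift explicit, here is the computation of the cell above written out for this dataset (a worked example, not an additional step). With $m_k$ the largest label in column $k$, the loop builds $s_0 = m_0$ and $s_k = m_k + s_{k-1} + 1$, and the offset finally added to column $k$ is $s_k - s_0$:\n",
"\n",
"$$m = (2, 2, 2, 2) \\;\\Rightarrow\\; s = (2, 5, 8, 11) \\;\\Rightarrow\\; s - s_0 = (0, 3, 6, 9)$$\n",
"\n",
"Column $k$ therefore receives the labels $3k$, $3k+1$ and $3k+2$, which matches the printed output. Note that the offsets are derived from the column maxima, so this reading assumes every column actually reaches its top bin, as is the case here."
]
},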
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### In pipelines"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"X_train shape: (120, 4) y_train shape: (120,)\n",
"X_test shape: (30, 4) y_test shape: (30,)\n"
]
}
],
"source": [
"from skmine.itemsets.slim_classifier import SlimClassifier\n",
"\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler, KBinsDiscretizer\n",
"\n",
"\n",
"class MultiLabelsKbins(KBinsDiscretizer): # KBinsDiscretizer whose output uses distinct item labels for each column\n",
"    def transform(self, X):\n",
"        Xt = super().transform(X).astype(int)\n",
"        \n",
"        # shift each column so that no two columns share a label\n",
"        # (note: the shift is recomputed from the data passed to transform)\n",
"        shift_col = np.max(Xt, axis=0)\n",
"        for k in range(1, len(shift_col)) : \n",
"            shift_col[k]+= shift_col[k-1] + 1\n",
"        shift_col+=-shift_col[0]\n",
"        for k in range(len(shift_col)) : \n",
"            Xt[:,k]+=shift_col[k]\n",
"        \n",
"        # return one list of items per transaction\n",
"        return pd.Series(Xt.tolist())\n",
"\n",
"data = load_iris()\n",
"X, y = data.data, data.target\n",
"\n",
"(X_train, X_test, y_train, y_test) = train_test_split(X, y, random_state=1, test_size=0.2, shuffle=True)\n",
"print(\"X_train shape:\", X_train.shape, \"y_train shape:\", y_train.shape)\n",
"print(\"X_test shape:\", X_test.shape, \"y_test shape:\", y_test.shape)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(steps=[('StandardScaler', StandardScaler()),\n",
" ('MultiLabelsKbins',\n",
" MultiLabelsKbins(encode='ordinal', n_bins=3,\n",
" strategy='uniform'))])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"preproc = Pipeline([\n",
" ('StandardScaler', StandardScaler()),\n",
" ('MultiLabelsKbins', MultiLabelsKbins(n_bins=3, encode='ordinal', strategy='uniform')),\n",
"])\n",
"preproc.fit(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-> Xt.shape : (120,)\n",
"-> Xt, 5 rows :\n",
" 50 [0, 5, 6, 9]\n",
"51 [0, 4, 6, 9]\n",
"52 [0, 5, 6, 9]\n",
"53 [2, 4, 8, 11]\n",
"54 [0, 4, 6, 9]\n",
"dtype: object\n"
]
}
],
"source": [
"Xt = preproc.transform(X_train)\n",
"print(\"-> Xt.shape : \", Xt.shape)\n",
"print(\"-> Xt, 5 rows :\\n\", Xt[50:55])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can add SlimClassifier to the pipeline."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# Collect every item produced by the preprocessing step and pass the set to SLIM.\n",
"# On small datasets, a train or test split may not contain every item; the codetable would then\n",
"# be incomplete, which can affect the quality of the results.\n",
"items = set(item for transaction in Xt for item in transaction)\n",
"\n",
"pipe = Pipeline([\n",
" ('preproc', preproc),\n",
" ('SlimClassifier', SlimClassifier(items=items))\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(steps=[('preproc',\n",
" Pipeline(steps=[('StandardScaler', StandardScaler()),\n",
" ('MultiLabelsKbins',\n",
" MultiLabelsKbins(encode='ordinal', n_bins=3,\n",
" strategy='uniform'))])),\n",
" ('SlimClassifier',\n",
" SlimClassifier(items={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}))])"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe.fit(X_train,y_train)\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-> Predictions : [0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 2 0 2 1 0 0 1 2]\n",
"-> Pipe Accuracy : 96.7 %\n"
]
}
],
"source": [
"y_preds = pipe.predict(X_test)\n",
"print(\"-> Predictions : \", y_preds)\n",
"print(f\"-> Pipe Accuracy : {round(pipe.score(X_test, y_test)*100,1)} %\") "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----------\n",
"\n",
"### OneVsRest classifier for more than 2 classes\n",
"\n",
"The **SLIM algorithm** is also compatible with **scikit-learn** meta-classifiers such as **One-vs-the-rest (OvR)** (https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html). The limitation of this approach is that it **only works for multiclass classification** problems (more than 2 classes), whereas the embedded SlimClassifier works for both binary and multiclass problems."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"from skmine.itemsets import SLIM\n",
"from sklearn.preprocessing import MultiLabelBinarizer"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"class TransactionEncoder(MultiLabelBinarizer): # pandas DataFrames are easier to read ;)\n",
"    def transform(self, X):\n",
"        _X = super().transform(X)\n",
"        return pd.DataFrame(data=_X, columns=self.classes_)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"   bananas  butter  cookies  milk  tea\n",
"0        1       0        0     1    0\n",
"1        1       0        1     1    0\n",
"2        0       1        1     0    1\n",
"3        0       0        0     0    1\n",
"4        1       0        0     1    1"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transactions = [ \n",
" ['bananas', 'milk'], \n",
" ['milk', 'bananas', 'cookies'], \n",
" ['cookies', 'butter', 'tea'], \n",
" ['tea'], \n",
" ['milk', 'bananas', 'tea'], \n",
"]\n",
"te = TransactionEncoder()\n",
"D = te.fit(transactions).transform(transactions)\n",
"D "
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"           itemset  usage\n",
"0  [bananas, milk]      3\n",
"1            [tea]      3\n",
"2        [cookies]      2\n",
"3         [butter]      1"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"slim = SLIM()\n",
"codetable = slim.fit(D).transform(D)\n",
"codetable"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We keep this **codetable** in mind, as we will later use it **to interpret our predictions**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----------\n",
"#### First \"predictions\" \n",
"\n",
"\n",
"We define a new transactional dataset made of different itemsets. Let $x$ denote an itemset, such as the first one here: `['bananas', 'milk']`.\n",
"From the codetable fitted on dataset D, we derive two quantities:\n",
"- the code length of $x$, written $c_l(x)$, which lies in $[0,+\\infty[$. It is obtained with the method `.get_code_length`.\n",
"- a **score** for $x$. A short code length means that $x$ is close to the fitted dataset, and the method `.decision_function` reflects this closeness: the lowest code lengths give the highest scores. To map code lengths to values in $]0;1]$ (the same range as a sigmoid output), a negative exponential function is used: `decision_function`$(x)$ = $\\exp(-0.2 \\times c_l(x))$"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/cregan/miniconda3/envs/test_skmine/lib/python3.8/site-packages/scikit_learn-1.2.2-py3.8-linux-x86_64.egg/sklearn/preprocessing/_label.py:895: UserWarning: unknown class(es) ['sirup'] will be ignored\n",
" warnings.warn(\n"
]
},
{
"data": {
"text/plain": [
"   bananas  butter  cookies  milk  tea\n",
"0        1       0        0     1    0\n",
"1        0       0        1     1    0\n",
"2        0       1        0     0    1"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_transactions = [ \n",
" ['bananas', 'milk'], \n",
" ['milk', 'sirup', 'cookies'], \n",
" ['butter', 'tea'], \n",
"]\n",
"new_D = te.transform(new_transactions)\n",
"new_D"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"              transaction  code length  score\n",
"0         [bananas, milk]        1.907  0.683\n",
"1  [milk, sirup, cookies]        6.229  0.288\n",
"2           [butter, tea]        4.814  0.382"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"codes_length = slim.get_code_length(new_D).round(3)\n",
"scores = slim.decision_function(new_D).round(3)\n",
"pd.DataFrame([new_transactions, codes_length, scores], index=['transaction', 'code length' , 'score']).T"
]
},
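{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a check of the formula given earlier, the scores in the table can be recomputed by hand from the code lengths (a worked example using the rounded values shown above, so the results are approximate):\n",
"\n",
"$$\\exp(-0.2 \\times 1.907) \\approx 0.683, \\quad \\exp(-0.2 \\times 6.229) \\approx 0.288, \\quad \\exp(-0.2 \\times 4.814) \\approx 0.382$$\n",
"\n",
"The mapping is strictly decreasing, so ranking transactions by highest score is the same as ranking them by shortest code length."
]
},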
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Built-in interpretations\n",
"Now we can interpret the codes for the new data directly by **looking at the codetable inferred from the training data**.\n",
"\n",
"First observations\n",
"\n",
"* `[milk, sirup, cookies]` has the longest code length, so the lowest score. From the codetable we see that `milk` and `cookies` are not grouped together, while `sirup` was never seen during training (the encoder ignores it, as the warning above shows).\n",
" \n",
"\n",
"* `[bananas, milk]` has the shortest code length, so the highest score. It contains `bananas` and `milk`, which are grouped together in the codetable and occur frequently in the training data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Shortest code wins!\n",
"Next, we are going to use an ensemble of SLIM encoding schemes through a ``OneVsRest`` methodology to perform **multi-class classification**.\n",
"The methodology is very simple:\n",
"\n",
"1. We clone our base estimator as many times as we need (one per class)\n",
"2. We fit every estimator on the entries corresponding to its class in the input data\n",
"3. When calling ``.predict``, we actually call ``.decision_function`` and get scores for every class\n",
"4. The shortest code wins: for a given transaction, we choose the class with the lowest code length, hence the highest score"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.multiclass import OneVsRestClassifier\n",
"from sklearn.pipeline import Pipeline"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"pipe = Pipeline([\n",
" ('transaction_encoder', TransactionEncoder(sparse_output=False)),\n",
" ('slim', SLIM()),\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"transactions = [\n",
" ['milk', 'bananas'],\n",
" ['tea', 'New York Times', 'El Pais'],\n",
" ['New York Times'],\n",
" ['El Pais', 'The Economist'],\n",
" ['milk', 'tea'],\n",
" ['croissant', 'tea'],\n",
" ['croissant', 'chocolatine', 'milk'],\n",
"]\n",
"target = [\n",
" 'foodstore', \n",
" 'newspaper', \n",
" 'newspaper', \n",
" 'newspaper', \n",
" 'foodstore',\n",
" 'bakery',\n",
" 'bakery',\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"   El Pais  New York Times  The Economist  bananas  chocolatine  croissant  \n",
"0        0               0              0        1            0          0  \\\n",
"1        1               1              0        0            0          0  \n",
"2        0               1              0        0            0          0  \n",
"3        1               0              1        0            0          0  \n",
"4        0               0              0        0            0          0  \n",
"5        0               0              0        0            0          1  \n",
"6        0               0              0        0            1          1  \n",
"\n",
"   milk  tea  \n",
"0     1    0  \n",
"1     0    1  \n",
"2     0    0  \n",
"3     0    0  \n",
"4     1    1  \n",
"5     0    1  \n",
"6     1    0  "
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"te = TransactionEncoder()\n",
"D = te.fit(transactions).transform(transactions)\n",
"D"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"ovr = OneVsRestClassifier(SLIM())"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[SLIM(), SLIM(), SLIM()]"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ovr.fit(D, y=target)\n",
"ovr.estimators_ # 3 estimators SLIM , one per class"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"   bakery  foodstore  newspaper  predictions\n",
"0   0.238      0.400      0.218    foodstore\n",
"1   0.116      0.142      0.234    newspaper\n",
"2   0.488      0.488      0.641    newspaper\n",
"3   0.238      0.238      0.366    newspaper\n",
"4   0.238      0.400      0.266    foodstore\n",
"5   0.596      0.291      0.266       bakery\n",
"6   0.596      0.160      0.102       bakery"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res = pd.DataFrame(ovr.decision_function(D).round(3), columns=ovr.classes_)\n",
"res['predictions'] = ovr.predict(D)\n",
"res"
]
},
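{
"cell_type": "markdown",
"metadata": {},
"source": [
"To connect this table back to \"shortest code wins\", the scores can be converted back into code lengths. This is a sketch that assumes each per-class SLIM model uses the same mapping as before, $\\text{score}(x) = \\exp(-0.2 \\times c_l(x))$, and it works from the rounded scores, so the values are approximate:\n",
"\n",
"$$c_l(x) = -\\frac{\\ln(\\text{score}(x))}{0.2} = -5\\,\\ln(\\text{score}(x))$$\n",
"\n",
"For transaction 0 (`['milk', 'bananas']`), the three columns give:\n",
"\n",
"$$-5 \\ln(0.400) \\approx 4.58 \\;(\\text{foodstore}), \\quad -5 \\ln(0.238) \\approx 7.18 \\;(\\text{bakery}), \\quad -5 \\ln(0.218) \\approx 7.62 \\;(\\text{newspaper})$$\n",
"\n",
"The foodstore model yields the shortest code, which is why `foodstore` is predicted for this transaction."
]
},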
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Questions on binary OneVsRest classifier"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.multiclass import OneVsRestClassifier\n",
"from sklearn.pipeline import Pipeline\n",
"\n",
"pipe = Pipeline([\n",
" ('transaction_encoder', TransactionEncoder(sparse_output=False)),\n",
" ('slim', SLIM()),\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"transactions = [\n",
" ['milk', 'bananas'],\n",
" ['tea', 'New York Times', 'El Pais'],\n",
" ['New York Times'],\n",
" ['El Pais', 'The Economist'],\n",
" ['milk', 'tea'],\n",
"]\n",
"target = [\n",
" 'foodstore', \n",
" 'newspaper', \n",
" 'newspaper', \n",
" 'newspaper', \n",
" 'foodstore',\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"   El Pais  New York Times  The Economist  bananas  milk  tea\n",
"0        0               0              0        1     1    0\n",
"1        1               1              0        0     0    1\n",
"2        0               1              0        0     0    0\n",
"3        1               0              1        0     0    0\n",
"4        0               0              0        0     1    1"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"te = TransactionEncoder()\n",
"D = te.fit(transactions).transform(transactions)\n",
"D"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"ovr = OneVsRestClassifier(SLIM())"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[SLIM()]"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ovr.fit(D, y=target)\n",
"ovr.estimators_"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"   score  predictions\n",
"0  0.238    foodstore\n",
"1  0.268    foodstore\n",
"2  0.670    newspaper\n",
"3  0.400    foodstore\n",
"4  0.291    foodstore"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res = pd.DataFrame(ovr.decision_function(D).round(3),columns=['score'])\n",
"res['predictions'] =ovr.predict(D)\n",
"res\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For **binary classification**, OneVsRest creates only one model and compares the score of an input $x$ with a threshold. For instance, an SVM score is a signed distance to a hyperplane: if the score is positive, $x$ is assigned to the current class. If scores lie in $[0,1]$, the threshold is set to 0.5 (like the activation threshold of a sigmoid). In the table above, only transaction 2 (score 0.670) exceeds 0.5, so two of the three `newspaper` transactions are predicted as `foodstore`.\n",
"\n",
"For SLIM, we have no distance to such a boundary and no reference value for that threshold. So **OneVsRest is not suitable for binary classification**. To classify, the implemented SlimClassifier instead creates 2 models (one per label) and simply compares their scores for the itemset $x$."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px",
"width": "342.417px"
},
"toc_section_display": true,
"toc_window_display": true
},
"vscode": {
"interpreter": {
"hash": "c4418ac5ac56bcc1d654aebcb97e1ca3ff1be77625ab045a7b4b5e8ee820789e"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}