Callback API demo
In this demo we use the callback API to track useful information while training a SLIM miner.
We simply define custom Python functions and pass them to skmine.callbacks.CallBacks to create an instance.
[2]:
import skmine
print("This tutorial was tested with the following version of skmine :", skmine.__version__)
This tutorial was tested with the following version of skmine : 1.0.0
[2]:
import pandas as pd
import numpy as np
from skmine.callbacks import CallBacks
from skmine.itemsets import SLIM
from skmine.datasets.fimi import fetch_mushroom
[3]:
mushroom = fetch_mushroom()
mushroom.head()
[3]:
0 [1, 3, 9, 13, 23, 25, 34, 36, 38, 40, 52, 54, ...
1 [2, 3, 9, 14, 23, 26, 34, 36, 39, 40, 52, 55, ...
2 [2, 4, 9, 15, 23, 27, 34, 36, 39, 41, 52, 55, ...
3 [1, 3, 10, 15, 23, 25, 34, 36, 38, 41, 52, 54,...
4 [2, 3, 9, 16, 24, 28, 34, 37, 39, 40, 53, 54, ...
Name: mushroom, dtype: object
[4]:
# cap the running time with max_time so that compressing the dataset does not take too long
slim = SLIM(max_time=30)
Define your own callbacks
We define custom functions that take as input the result of the function they target. Those results are stored for later reuse.
Here we define two functions:
post_evaluate is executed after SLIM.evaluate. It tracks the sizes of both the data and the model.
post_gen is executed after SLIM.generate_candidates. It just records the size of the current batch of candidates.
[8]:
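sizes = []            # (data_size, model_size) recorded after each call to evaluate
candidate_sizes = []  # number of candidates in each generated batch

def post_evaluate(data_size, model_size, *args):
    sizes.append((data_size, model_size))

def post_gen(candidates):
    candidate_sizes.append(len(candidates))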
A skmine.callbacks.CallBacks is a collection of callbacks: a mapping between function names and their dedicated callbacks.
When an instance of skmine.callbacks.CallBacks is called (with the () operator) on an object, it looks for the object's methods with matching names and tries to attach the callbacks to them.
[9]:
callbacks = CallBacks(evaluate=post_evaluate, generate_candidates=post_gen)
callbacks(slim)
warning : `f_name`='set_output' return an error for `callable(getattr(miner, f_name)`
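To make the attachment step more concrete, here is a minimal sketch of how such a mechanism could work in plain Python. It is not skmine's actual implementation: attach_callbacks, _wrap and ToyMiner are hypothetical names introduced only for illustration, and the unpacking of tuple results is an assumption inferred from the signature of post_evaluate above.

```python
import functools


def _wrap(method, callback):
    """Return a wrapper that calls `method`, then feeds its result to `callback`."""
    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        result = method(*args, **kwargs)
        # Assumption: tuple results are unpacked so the callback can name each value
        if isinstance(result, tuple):
            callback(*result)
        else:
            callback(result)
        return result  # the wrapped method still returns its original result
    return wrapper


def attach_callbacks(obj, **callbacks):
    """Hypothetical helper: wrap each named method of `obj` with its callback."""
    for name, callback in callbacks.items():
        method = getattr(obj, name, None)
        if not callable(method):
            continue  # skip names that do not refer to a callable method
        setattr(obj, name, _wrap(method, callback))
    return obj


class ToyMiner:
    """Hypothetical stand-in for a miner; it does no real mining."""

    def generate_candidates(self):
        return [("a", "b"), ("b", "c"), ("a", "c")]

    def evaluate(self, candidate):
        # Made-up sizes, only here so the callback has something to receive
        return 100.0 - len(candidate), 10.0 + len(candidate)

    def fit(self):
        for candidate in self.generate_candidates():
            self.evaluate(candidate)
        return self


sizes, candidate_sizes = [], []
miner = attach_callbacks(
    ToyMiner(),
    evaluate=lambda data_size, model_size, *args: sizes.append((data_size, model_size)),
    generate_candidates=lambda candidates: candidate_sizes.append(len(candidates)),
)
miner.fit()
print(sizes, candidate_sizes)
```

Because the wrappers are set on the instance rather than on the class, other instances of the same class are left untouched in this sketch.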
[10]:
%time slim.fit_transform(mushroom)
CPU times: user 30.2 s, sys: 266 ms, total: 30.4 s
Wall time: 30.1 s
[10]:
itemset | usage | |
---|---|---|
0 | [2, 23, 28, 34, 36, 39, 53, 56, 59, 63, 85, 86... | 864 |
1 | [2, 23, 28, 34, 36, 39, 53, 56, 59, 63, 85, 86... | 864 |
2 | [1, 24, 29, 34, 36, 39, 52, 56, 61, 66, 85, 86... | 648 |
3 | [1, 24, 29, 34, 36, 39, 52, 56, 61, 66, 85, 86... | 648 |
4 | [1, 24, 34, 36, 38, 48, 53, 58, 59, 63, 85, 86... | 432 |
... | ... | ... |
155 | [22] | 16 |
156 | [75] | 8 |
157 | [89] | 8 |
158 | [8] | 4 |
159 | [12] | 4 |
160 rows × 2 columns
Inner view of MDL learning
The plot below shows how SLIM performs compression.
The blue curve represents the size of the data, and red vertical lines mark the end of each batch of candidates.
We can clearly distinguish the beginning of a batch of candidates, where the curve drops abruptly, from the end of a batch, where it reaches a plateau.
[11]:
sizes
[11]:
[(1064327.8204536438, 2049.1517753601074),
(1064327.8204536438, 2049.1517753601074),
(1019045.3720026016, 2051.0723099708557),
(1019045.3720026016, 2051.0723099708557),
(980036.730214119, 2075.213014602661),
(980036.730214119, 2075.213014602661),
(948613.6116323471, 2102.6085658073425),
(948613.6116323471, 2102.6085658073425),
(927717.8376092911, 2118.527575492859),
(927717.8376092911, 2118.527575492859),
(907849.3784189224, 2138.470380783081),
(907849.3784189224, 2138.470380783081),
(889519.3478527069, 2157.9796571731567),
(889519.3478527069, 2157.9796571731567),
(871974.2353172302, 2192.7303881645203),
(871974.2353172302, 2192.7303881645203),
(854607.3957920074, 2208.895321369171),
(854607.3957920074, 2208.895321369171),
(838443.3827610016, 2225.1477360725403),
(838443.3827610016, 2225.1477360725403),
(823428.2101373672, 2264.49365234375),
(823428.2101373672, 2264.49365234375),
(808966.1101493835, 2287.104096889496),
(808966.1101493835, 2287.104096889496),
(792759.5926294327, 2293.201919078827),
(792759.5926294327, 2293.201919078827),
(780306.3465356827, 2344.104003429413),
(780306.3465356827, 2344.104003429413),
(768081.5152606964, 2392.268000602722),
(768081.5152606964, 2392.268000602722),
(756362.2376337051, 2409.8285970687866),
(756362.2376337051, 2409.8285970687866),
(745031.9830083847, 2421.723343372345),
(745031.9830083847, 2421.723343372345),
(734200.0674972534, 2446.078966140747),
(734200.0674972534, 2446.078966140747),
(723434.2704172134, 2456.3236446380615),
(723434.2704172134, 2456.3236446380615),
(712732.7913103104, 2461.307161808014),
(712732.7913103104, 2461.307161808014),
(702638.7939929962, 2530.0034675598145),
(702638.7939929962, 2530.0034675598145),
(692757.5342140198, 2553.6208057403564),
(692757.5342140198, 2553.6208057403564),
(683180.6969127655, 2565.5511989593506),
(683180.6969127655, 2565.5511989593506),
(673874.3940858841, 2619.1964559555054),
(673874.3940858841, 2619.1964559555054),
(664669.040602684, 2694.46364402771),
(664669.040602684, 2694.46364402771),
(655349.6174497604, 2703.4007120132446),
(655349.6174497604, 2703.4007120132446),
(645934.4946928024, 2729.645245552063),
(645934.4946928024, 2729.645245552063),
(636988.622294426, 2742.826548576355),
(636988.622294426, 2742.826548576355),
(628349.489894867, 2745.7087755203247),
(628349.489894867, 2745.7087755203247),
(619778.3561525345, 2791.6579875946045),
(619778.3561525345, 2791.6579875946045),
(612687.4520568848, 2765.7485609054565),
(612687.4520568848, 2765.7485609054565),
(604839.8522281647, 2794.495764732361),
(604839.8522281647, 2794.495764732361),
(597740.2815599442, 2801.642425060272),
(597740.2815599442, 2801.642425060272),
(591211.9014139175, 2805.2541251182556),
(591211.9014139175, 2805.2541251182556),
(585517.1406850815, 2851.7055792808533),
(585517.1406850815, 2851.7055792808533),
(579929.8582677841, 2877.48024892807),
(579929.8582677841, 2877.48024892807),
(574430.0130376816, 2916.439881801605),
(574430.0130376816, 2916.439881801605),
(569052.748840332, 2923.763541698456),
(569052.748840332, 2923.763541698456),
(563976.3026885986, 2941.1488165855408),
(563976.3026885986, 2941.1488165855408),
(558968.347108841, 2923.6829719543457),
(558968.347108841, 2923.6829719543457),
(554284.5226974487, 3014.9049005508423),
(554284.5226974487, 3014.9049005508423),
(549905.7714662552, 3019.4470205307007),
(549905.7714662552, 3019.4470205307007),
(544997.0604438782, 3024.4853563308716),
(544997.0604438782, 3024.4853563308716),
(540950.6344032288, 3050.9515557289124),
(540950.6344032288, 3050.9515557289124),
(537252.2906827927, 3066.4759969711304),
(537252.2906827927, 3066.4759969711304),
(532131.9468197823, 3083.5683188438416),
(532131.9468197823, 3083.5683188438416),
(527885.6507277489, 3100.84166765213),
(527885.6507277489, 3100.84166765213),
(522150.7439804077, 3105.057946205139),
(522150.7439804077, 3105.057946205139),
(516739.6784582138, 3125.0889949798584),
(516739.6784582138, 3125.0889949798584),
(511636.4079031944, 3152.4172925949097),
(511636.4079031944, 3152.4172925949097),
(508022.9669327736, 3199.241506099701),
(508022.9669327736, 3199.241506099701),
(504501.93383073807, 3269.2803320884705),
(504501.93383073807, 3269.2803320884705),
(500812.76487112045, 3274.8100786209106),
(500812.76487112045, 3274.8100786209106),
(497417.9516406059, 3365.587016582489),
(497417.9516406059, 3365.587016582489),
(492801.83930683136, 3369.3371634483337),
(492801.83930683136, 3369.3371634483337),
(489391.1609711647, 3467.619252681732),
(489391.1609711647, 3467.619252681732),
(485460.7992911339, 3472.116714000702),
(485460.7992911339, 3472.116714000702),
(482427.1781849861, 3477.5840849876404),
(482427.1781849861, 3477.5840849876404),
(479693.1807589531, 3497.0940837860107),
(479693.1807589531, 3497.0940837860107),
(476574.6060566902, 3516.7616815567017),
(476574.6060566902, 3516.7616815567017),
(473364.8910665512, 3536.486557483673),
(473364.8910665512, 3536.486557483673),
(470712.14954042435, 3547.9589014053345),
(470712.14954042435, 3547.9589014053345),
(468034.41232585907, 3630.8001165390015),
(468034.41232585907, 3630.8001165390015),
(465488.4886255264, 3650.3392448425293),
(465488.4886255264, 3650.3392448425293),
(462855.69601774216, 3748.0396962165833),
(462855.69601774216, 3748.0396962165833),
(460318.39687108994, 3846.327386379242),
(460318.39687108994, 3846.327386379242),
(457533.85771226883, 3850.8916816711426),
(457533.85771226883, 3850.8916816711426),
(455091.3696861267, 3860.352683544159),
(455091.3696861267, 3860.352683544159),
(451439.63231420517, 3890.251036167145),
(451439.63231420517, 3890.251036167145),
(449009.84144067764, 3881.3497910499573),
(449009.84144067764, 3881.3497910499573),
(446528.81369543076, 3898.210277557373),
(446528.81369543076, 3898.210277557373),
(444225.6063990593, 3904.679087638855),
(444225.6063990593, 3904.679087638855),
(441868.0961327553, 3924.2356181144714),
(441868.0961327553, 3924.2356181144714),
(439658.52277326584, 3959.026750564575),
(439658.52277326584, 3959.026750564575),
(437459.57193517685, 3981.3535590171814),
(437459.57193517685, 3981.3535590171814),
(435022.1490917206, 4005.7886300086975),
(435022.1490917206, 4005.7886300086975),
(432903.7034549713, 4031.1638164520264),
(432903.7034549713, 4031.1638164520264),
(430781.06565475464, 4047.012207508087),
(430781.06565475464, 4047.012207508087),
(428734.2130870819, 4052.7625794410706),
(428734.2130870819, 4052.7625794410706),
(426716.5517811775, 4061.1834139823914),
(426716.5517811775, 4061.1834139823914),
(424729.18721199036, 4069.7247982025146),
(424729.18721199036, 4069.7247982025146),
(422736.1689748764, 4099.728016376495),
(422736.1689748764, 4099.728016376495),
(420768.4673900604, 4107.899528026581),
(420768.4673900604, 4107.899528026581),
(418418.19539165497, 4113.049150466919),
(418418.19539165497, 4113.049150466919),
(416493.6862053871, 4118.254427909851),
(416493.6862053871, 4118.254427909851),
(414565.99742126465, 4124.003650665283),
(414565.99742126465, 4124.003650665283),
(412686.0111503601, 4114.948963165283),
(412686.0111503601, 4114.948963165283),
(410796.34900665283, 4121.841695308685),
(410796.34900665283, 4121.841695308685),
(408918.20883369446, 4112.796407222748),
(408918.20883369446, 4112.796407222748),
(407037.3795146942, 4129.733689785004),
(407037.3795146942, 4129.733689785004),
(405187.4784345627, 4138.643919467926),
(405187.4784345627, 4138.643919467926),
(403360.82039642334, 4147.217342376709),
(403360.82039642334, 4147.217342376709),
(401540.30622291565, 4152.1692061424255),
(401540.30622291565, 4152.1692061424255),
(399756.1656303406, 4169.316841125488),
(399756.1656303406, 4169.316841125488),
(398016.8108062744, 4188.394077777863),
(398016.8108062744, 4188.394077777863),
(396290.0697169304, 4233.978398323059),
(396290.0697169304, 4233.978398323059),
(394593.36030578613, 4247.3708510398865),
(394593.36030578613, 4247.3708510398865)]
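In the output above, every (data_size, model_size) pair appears twice in a row. The notebook does not say why, so the following is only a sketch of one way to collapse consecutive repeats before plotting, using the first six recorded pairs as sample input. The helper name drop_consecutive_duplicates is hypothetical, and collapsing is only appropriate if the repeats carry no information you need.

```python
from itertools import groupby


def drop_consecutive_duplicates(records):
    """Keep one copy of each run of identical consecutive records."""
    return [key for key, _ in groupby(records)]


# First six pairs copied from the recorded sizes
sample = [
    (1064327.8204536438, 2049.1517753601074),
    (1064327.8204536438, 2049.1517753601074),
    (1019045.3720026016, 2051.0723099708557),
    (1019045.3720026016, 2051.0723099708557),
    (980036.730214119, 2075.213014602661),
    (980036.730214119, 2075.213014602661),
]

deduped = drop_consecutive_duplicates(sample)
print(len(sample), "->", len(deduped))
```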
[12]:
df = pd.DataFrame(sizes, columns=['data_size', 'model_size'])
ax = df.data_size.plot()
Here is how the model size grows over the same evaluations:
[13]:
df.model_size.plot()
[13]:
<AxesSubplot:>
Finally, here is the evolution of the total size of our dataset after compression via SLIM, following the MDL principle:
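In MDL terms, this total is the usual two-part code length. Writing M for the model and D for the data (notation introduced here, not taken from the notebook), it reads:

```latex
L(D, M) = L(M) + L(D \mid M)
```

In the DataFrame, model_size plays the role of L(M) and data_size the role of L(D | M).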
[14]:
df['total_size'] = df['model_size']+df['data_size']
df.total_size.plot()
[14]:
<AxesSubplot:>
From the sizes saved in df, we can compute the compression ratio of our dataset after applying SLIM, expressed as the final total size as a percentage of the first recorded total size
[15]:
compression_percentage = df['total_size'].iloc[-1]/df['total_size'].iloc[0]*100
compression_percentage
[15]:
37.401476358134936
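As a cross-check, the same figure can be recomputed by hand from the first and last pairs recorded in sizes. The sketch below copies those two pairs from the output above. The variable names are illustrative, and treating the first recorded total as the baseline is an assumption that mirrors the cell above.

```python
# First and last (data_size, model_size) pairs copied from the recorded sizes
first_data, first_model = 1064327.8204536438, 2049.1517753601074
last_data, last_model = 394593.36030578613, 4247.3708510398865

first_total = first_data + first_model
last_total = last_data + last_model

# Share of the first recorded total size that remains after compression
remaining_pct = last_total / first_total * 100
# Complementary view: how much of that size was saved
saved_pct = 100 - remaining_pct

print(f"remaining: {remaining_pct:.2f}%")
print(f"saved: {saved_pct:.2f}%")
```

Read this way, a value of about 37.4 means the final encoding takes roughly 37.4% of the first recorded size, which is a reduction of roughly 62.6%.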