{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Periodic pattern mining, an example with Canadian TV programs\n",
""
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This tutorial was tested with the following version of skmine : 1.0.0\n",
"\n",
"Be sure to execute this notebook from the directory `docs/tutorials/periodic`, \n",
"so that the `.csv` data files can be loaded. \n",
"If not, please change directory and adapt the commands below. \n",
"\n"
]
},
{
"data": {
"text/plain": [
"'/home/hcourtei/Projects/F-WIN/scikit-mine/codes/skMineDev/docs/tutorials/periodic'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import skmine\n",
"\n",
"print(\"This tutorial was tested with the following version of skmine :\", skmine.__version__)\n",
"\n",
"print(\"\"\"\n",
"Be sure to execute this notebook from the directory `docs/tutorials/periodic`, \n",
"so that the `.csv` data files can be loaded. \n",
"If not, please change directory and adapt the commands below. \n",
"\"\"\")\n",
"\n",
"%pwd\n",
"# %cd docs/tutorials/periodic"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### The problem, informally\n",
"Let's take a simple example. \n",
"\n",
"Imagine you set an alarm to wake up every day around 7:30AM, and go to work. Sometimes you wake up a bit earlier (your body anticipates the alarm), and sometimes a bit later, for example if you press the \"snooze\" button and refuse to face the fact that you have to wake up.\n",
"\n",
"In Python we can load those \"wake up\" events as logs, and store them in a [pandas.Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html), like this:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"2020-04-16 07:30:00 wake up\n",
"2020-04-17 07:29:00 wake up\n",
"2020-04-18 07:29:00 wake up\n",
"2020-04-19 07:30:00 wake up\n",
"2020-04-20 07:32:00 wake up\n",
"2020-04-23 07:30:00 wake up\n",
"dtype: object"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import datetime as dt\n",
"import pandas as pd\n",
"\n",
"one_day = 60 * 24 # a day in minutes\n",
"minutes = [0, one_day - 1, one_day * 2 - 1, one_day * 3, one_day * 4 + 2, one_day * 7]\n",
"\n",
"S = pd.Series(\"wake up\", index=minutes)\n",
"start = dt.datetime.strptime(\"16/04/2020 07:30\", \"%d/%m/%Y %H:%M\")\n",
"S.index = S.index.map(lambda e: start + dt.timedelta(minutes=e))\n",
"S.index = S.index.round(\"min\") # minutes as the lowest unit of difference\n",
"S"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"array(['2020-04-16T07:30:00.000000000', '2020-04-17T07:29:00.000000000',\n",
" '2020-04-18T07:29:00.000000000', '2020-04-19T07:30:00.000000000',\n",
" '2020-04-20T07:32:00.000000000', '2020-04-23T07:30:00.000000000'],\n",
" dtype='datetime64[ns]')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"S.index.to_numpy()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see the wake-up time is not exactly the same every day, but overall a regular pattern seems to emerge.\n",
"\n",
"Now imagine that, in addition to wake-up times, we also have records of other daily activities (meals, work, household chores, etc.), and that rather than a handful of days, those records span several years and make up several thousand events.\n",
"\n",
"**Can you automatically identify regularities to better understand your daily routine?**"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Introduction to periodic pattern mining\n",
"Periodic pattern mining aims at exploiting regularities not only about *what happens*, by finding coordinated event occurrences, but also about *when it happens* and *how it happens*. The goal is not just to find things that happen repeatedly, but things that happen regularly, with a consistent periodicity.\n",
"\n",
"Next, we introduce the concept of cycles.\n",
"\n",
"#### The cycle: a building block for periodic pattern mining\n",
"A cycle describes an event occurring several times in a row, starting at some point in time, with successive occurrences separated by a (roughly constant) period.\n",
"\n",
"This definition, while relatively simple, is general enough to allow us to find regularities in different types of logs.\n",
"\n",
"#### Handling noise in our timestamps\n",
"\n",
"Needless to say, it would be too easy if events in our data were equally spaced. Since real data is often noisy, we have to be fault tolerant, and allow small errors to sneak into our cycles. \n",
"\n",
"That's the role of **shift corrections**, which capture the small deviations from perfectly regular periodic repetitions, and allow us to reconstruct the (noisy) original sequence of events, using the relation $t_0 = \\tau$, $t_k = t_{k-1} + p + e_k$, where $p$ is the period and $e_k$ the $k$-th shift correction\n",
""
]
},
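{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch in plain Python (not the scikit-mine implementation), a cycle's noisy occurrences can be rebuilt from its starting point $\\tau$, its period $p$ and its shift corrections, via $t_0 = \\tau$ and $t_k = t_{k-1} + p + e_k$. The parameters below are hypothetical, chosen to match the wake-up example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import datetime as dt\n",
"\n",
"tau = dt.datetime(2020, 4, 16, 7, 30)  # starting point of the cycle\n",
"p = dt.timedelta(days=1)  # period\n",
"shifts = [dt.timedelta(minutes=m) for m in (-1, 0, 1, 2)]  # shift corrections e_k\n",
"\n",
"# t_0 = tau ; t_k = t_{k-1} + p + e_k\n",
"occurrences = [tau]\n",
"for e in shifts:\n",
"    occurrences.append(occurrences[-1] + p + e)\n",
"occurrences  # 07:30, 07:29, 07:29, 07:30, 07:32 on consecutive days"
]
},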
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"#### A tiny example with scikit-mine\n",
"**scikit-mine** offers a `PeriodicPatternMiner`, out of the box.\n",
"You can use it to detect regularities, in the form of *(possibly nested) cycles*, which we call *periodic patterns*.\n",
"\n",
"These patterns are evaluated using an MDL (Minimum Description Length) criterion, which helps us extract a compact and representative collection of patterns from the data, in particular by avoiding redundancy between patterns."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" t0 pattern repetition_major \\\n",
"0 2020-04-16 07:30:00 (wake up)[r=5 p=1 day, 0:00:00] 5 \n",
"\n",
" period_major sum_E \n",
"0 1 days 0 days 00:04:00 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from skmine.periodic import PeriodicPatternMiner\n",
"pcm = PeriodicPatternMiner().fit(S)\n",
"pcm.transform(S)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see one cycle has been extracted for our event **wake up**. The cycle covers the business week, but not the following Monday, which is separated from it by the weekend.\n",
"\n",
"It has a length of 5 and a period close to 1 day, as expected.\n",
"\n",
"Note that some information is summarized here: a period of 1 day is the best summary for this data.\n",
"Accessing the little \"shifts\" observed in the original data is also possible, with an extra argument in our call: `pcm.transform(S, dE_sum=False)` "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 [-1 days +23:59:00, 0 days 00:00:00, 0 days 00:01:00, 0 days 00:02:00]\n",
"Name: E, dtype: object"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.set_option('display.max_colwidth', 100)\n",
"res_wake = pcm.transform(S,dE_sum=False)\n",
"res_wake.E\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The last column, named `E`, contains the list of shifts to apply to our cycle if we want to reconstruct the original data. Trailing zeros have been removed for efficiency, and the shift values are *relative to the period*, but we can see there is:\n",
" * a -1 minute shift (-1 days +23:59:00) between the 1st and 2nd entry (waking up at 7:30 on 04-16, then at 7:29 on 04-17)\n",
" * no shift between the 2nd and 3rd entry (still waking up at 7:29 on 04-18)\n",
" * a +1 minute shift between the 3rd and 4th entry (back to 7:30 on 04-19)\n",
" * a +2 minute shift between the 4th and 5th entry (waking up at 7:32 on 04-20)\n",
"\n",
"Events that are not captured by any pattern are called `residuals`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" time event\n",
"0 2020-04-23 07:30:00 wake up"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pcm.get_residuals()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This way `pcm` does not store all the data, yet it holds all the information needed to reconstruct it entirely!"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" time event\n",
"0 2020-04-16 07:30:00 wake up\n",
"1 2020-04-17 07:29:00 wake up\n",
"2 2020-04-18 07:29:00 wake up\n",
"3 2020-04-19 07:30:00 wake up\n",
"4 2020-04-20 07:32:00 wake up"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pcm.reconstruct()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"-----\n",
"### An example with Canadian TV programs\n",
"#### Fetching logs from Canadian TV\n",
"\n",
"In this section we are going to load some event logs of TV programs (the *WHAT*), indexed by their broadcast timestamps (the *WHEN*).\n",
"\n",
"`PeriodicPatternMiner` is here to help us discover regularities (the *HOW*)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from skmine.datasets import fetch_canadian_tv\n",
"from skmine.periodic import PeriodicPatternMiner"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Searching for cycles in TV programs\n",
"\n",
"Remember the definition of cycles?\n",
"Let's apply it to our TV programs.\n",
"\n",
"In our case:\n",
"\n",
"* $\\alpha$ is the name of a TV program\n",
"\n",
"* $r$ is the number of broadcasts (repetitions) of this TV program (inside this cycle)\n",
"\n",
"* $p$ is the optimal time delta between broadcasts in this cycle. If a program airs every day at 14:00, then $p$ is likely to be *1 day*\n",
"\n",
"* $\\tau$ is the first broadcast time in this cycle\n",
"\n",
"* $dE$ are the shift corrections between the expected and the actual broadcast times of the events. If a TV program was scheduled for 8:30:00AM and went on air at 8:30:23AM the same day, then we keep track of a *23-second shift*. This way we can summarize our data (via cycles), and reconstruct it (via shift corrections). \n",
"\n",
"\n",
"Finally, we are going to dig a little deeper into these cycles, to answer quite complex questions about our logs. We will see that cycles contain useful information about our input data."
]
},
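{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the role of $dE$ concrete, here is a small hypothetical sketch (plain Python, not scikit-mine code): given the broadcast times of a single program and a candidate period, the shift corrections are the inter-broadcast gaps minus the period:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import datetime as dt\n",
"\n",
"# hypothetical broadcast times, scheduled daily at 14:00\n",
"times = [dt.datetime(2020, 8, day, 14, 0) for day in (1, 2, 3)]\n",
"times[1] += dt.timedelta(seconds=23)  # second broadcast went on air 23s late\n",
"\n",
"p = dt.timedelta(days=1)  # candidate period\n",
"dE = [(b - a) - p for a, b in zip(times, times[1:])]\n",
"[e.total_seconds() for e in dE]  # +23s, then -23s back on schedule"
]
},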
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"timestamp\n",
"2020-08-01 06:00:00 The Moblees\n",
"2020-08-01 06:11:00 Big Block Sing Song\n",
"2020-08-01 06:13:00 Big Block Sing Song\n",
"2020-08-01 06:15:00 CBC Kids\n",
"2020-08-01 06:15:00 CBC Kids\n",
"Name: canadian_tv, dtype: string"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ctv_logs = fetch_canadian_tv()\n",
"ctv_logs.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute only simple cycles on ctv_logs with the keyword argument *complex=False*:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/hcourtei/Projects/F-WIN/scikit-mine/codes/skMineDev/skmine/periodic/cycles.py:131: UserWarning: found 234 duplicates in the input sequence, they have been removed.\n",
" warnings.warn(f\"found {diff} duplicates in the input sequence, they have been removed.\")\n"
]
},
{
"data": {
"text/plain": [
" t0 pattern \\\n",
"20 2020-08-01 04:00:00 (Rick Mercer Report)[r=5 p=7 days, 0:00:00] \n",
"21 2020-08-01 04:30:00 (Rick Mercer Report)[r=5 p=7 days, 0:00:00] \n",
"0 2020-08-01 05:00:00 (Grand Designs)[r=31 p=1 day, 0:00:00] \n",
"59 2020-08-01 06:00:00 (The Moblees)[r=5 p=7 days, 0:00:00] \n",
"29 2020-08-01 06:11:00 (Big Block Sing Song)[r=5 p=7 days, 0:00:00] \n",
".. ... ... \n",
"254 2020-08-26 14:00:00 (Jamie's Super Foods)[r=3 p=1 day, 0:00:00] \n",
"183 2020-08-27 02:00:00 (Mr. D)[r=4 p=0:30:00] \n",
"2 2020-08-28 00:00:00 (Schitt's Creek)[r=8 p=0:30:00] \n",
"129 2020-08-29 06:15:00 (CBC Kids)[r=23 p=0:12:00] \n",
"126 2020-08-30 06:11:00 (CBC Kids)[r=11 p=0:12:00] \n",
"\n",
" repetition_major period_major sum_E \n",
"20 5 7 days 00:00:00 0 days 00:00:00 \n",
"21 5 7 days 00:00:00 0 days 00:00:00 \n",
"0 31 1 days 00:00:00 0 days 00:00:00 \n",
"59 5 7 days 00:00:00 0 days 00:00:00 \n",
"29 5 7 days 00:00:00 0 days 00:00:00 \n",
".. ... ... ... \n",
"254 3 1 days 00:00:00 0 days 00:00:00 \n",
"183 4 0 days 00:30:00 0 days 00:00:00 \n",
"2 8 0 days 00:30:00 0 days 00:00:00 \n",
"129 23 0 days 00:12:00 0 days 02:26:00 \n",
"126 11 0 days 00:12:00 0 days 00:49:00 \n",
"\n",
"[278 rows x 5 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pcm = PeriodicPatternMiner(complex=False).fit(ctv_logs)\n",
"pcm.transform(ctv_logs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute both simple and complex cycles (with horizontal and vertical combinations) on ctv_logs:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/hcourtei/Projects/F-WIN/scikit-mine/codes/skMineDev/skmine/periodic/cycles.py:131: UserWarning: found 234 duplicates in the input sequence, they have been removed.\n",
" warnings.warn(f\"found {diff} duplicates in the input sequence, they have been removed.\")\n"
]
}
],
"source": [
"pcm = PeriodicPatternMiner(complex=True).fit(ctv_logs)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" t0 \\\n",
"34 2020-08-01 04:00:00 \n",
"0 2020-08-01 05:00:00 \n",
"7 2020-08-01 06:00:00 \n",
"19 2020-08-01 06:15:00 \n",
"95 2020-08-01 06:15:00 \n",
".. ... \n",
"109 2020-08-26 14:00:00 \n",
"86 2020-08-27 02:00:00 \n",
"16 2020-08-28 00:00:00 \n",
"100 2020-08-29 06:15:00 \n",
"92 2020-08-31 07:00:00 \n",
"\n",
" pattern \\\n",
"34 (Rick Mercer Report [d=0:30:00] Rick Mercer Report)[r=5 p=7 days, 0:00:00] \n",
"0 (Grand Designs)[r=31 p=1 day, 0:00:00] \n",
"7 (The Moblees [d=0:11:00] Big Block Sing Song [d=0:02:00] Big Block Sing Song [d=0:02:00] CBC Kid... \n",
"19 (CBC Kids [d=1:40:00] Big Block Sing Song [d=0:02:00] Furchester Hotel [d=0:38:00] Ollie: The Bo... \n",
"95 (CBC Kids)[r=29 p=0:12:00] \n",
".. ... \n",
"109 (Jamie's Super Foods)[r=3 p=1 day, 0:00:00] \n",
"86 (Mr. D)[r=4 p=0:30:00] \n",
"16 (Schitt's Creek)[r=8 p=0:30:00] \n",
"100 (CBC Kids)[r=23 p=0:12:00] \n",
"92 (CBC Kids)[r=17 p=0:13:00] \n",
"\n",
" repetition_major period_major sum_E \n",
"34 5 7 days 00:00:00 0 days 00:00:00 \n",
"0 31 1 days 00:00:00 0 days 00:00:00 \n",
"7 5 7 days 00:00:00 0 days 00:03:00 \n",
"19 5 7 days 00:00:00 0 days 00:02:00 \n",
"95 29 0 days 00:12:00 0 days 02:54:00 \n",
".. ... ... ... \n",
"109 3 1 days 00:00:00 0 days 00:00:00 \n",
"86 4 0 days 00:30:00 0 days 00:00:00 \n",
"16 8 0 days 00:30:00 0 days 00:00:00 \n",
"100 23 0 days 00:12:00 0 days 02:26:00 \n",
"92 17 0 days 00:13:00 0 days 02:15:00 \n",
"\n",
"[121 rows x 5 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.set_option('display.max_colwidth', 100)\n",
"cycles = pcm.transform(ctv_logs)\n",
"cycles"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**: no need to worry about the warning; it simply notifies you that duplicate event/timestamp pairs were found and removed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Viewing the patterns"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes patterns are hard to understand in their textual form. The `draw_pattern` function allows you to visualise them: simply pass the id of the pattern you wish to view, taken from the index of the previous transform's results."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pcm.draw_pattern(19) \n",
"# if you want to save the tree, you can add a `directory` parameter to indicate the location of the results (pdf + generated DOT file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Patterns import/export"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" t0 \\\n",
"0 2020-08-01 04:00:00 \n",
"1 2020-08-01 05:00:00 \n",
"2 2020-08-01 06:00:00 \n",
"3 2020-08-01 06:15:00 \n",
"4 2020-08-01 06:15:00 \n",
".. ... \n",
"116 2020-08-26 14:00:00 \n",
"117 2020-08-27 02:00:00 \n",
"118 2020-08-28 00:00:00 \n",
"119 2020-08-29 06:15:00 \n",
"120 2020-08-31 07:00:00 \n",
"\n",
" pattern \\\n",
"0 (Rick Mercer Report [d=0:30:00] Rick Mercer Report)[r=5 p=7 days, 0:00:00] \n",
"1 (Grand Designs)[r=31 p=1 day, 0:00:00] \n",
"2 (The Moblees [d=0:11:00] Big Block Sing Song [d=0:02:00] Big Block Sing Song [d=0:02:00] CBC Kid... \n",
"3 (CBC Kids [d=1:40:00] Big Block Sing Song [d=0:02:00] Furchester Hotel [d=0:38:00] Ollie: The Bo... \n",
"4 (CBC Kids)[r=29 p=0:12:00] \n",
".. ... \n",
"116 (Jamie's Super Foods)[r=3 p=1 day, 0:00:00] \n",
"117 (Mr. D)[r=4 p=0:30:00] \n",
"118 (Schitt's Creek)[r=8 p=0:30:00] \n",
"119 (CBC Kids)[r=23 p=0:12:00] \n",
"120 (CBC Kids)[r=17 p=0:13:00] \n",
"\n",
" repetition_major period_major sum_E \n",
"0 5 7 days 00:00:00 0 days 00:00:00 \n",
"1 31 1 days 00:00:00 0 days 00:00:00 \n",
"2 5 7 days 00:00:00 0 days 00:03:00 \n",
"3 5 7 days 00:00:00 0 days 00:02:00 \n",
"4 29 0 days 00:12:00 0 days 02:54:00 \n",
".. ... ... ... \n",
"116 3 1 days 00:00:00 0 days 00:00:00 \n",
"117 4 0 days 00:30:00 0 days 00:00:00 \n",
"118 8 0 days 00:30:00 0 days 00:00:00 \n",
"119 23 0 days 00:12:00 0 days 02:26:00 \n",
"120 17 0 days 00:13:00 0 days 02:15:00 \n",
"\n",
"[121 rows x 5 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Store the patterns in a JSON file (\"patterns.json\" by default)\n",
"pcm.export_patterns(file=\"patterns.json\")\n",
"\n",
"# reinitialize the PeriodicPatternMiner\n",
"pcm = PeriodicPatternMiner()\n",
"\n",
"# import the JSON file (\"patterns.json\" by default)\n",
"pcm.import_patterns(file=\"patterns.json\")\n",
"\n",
"# recompute the patterns\n",
"cycles = pcm.transform(ctv_logs)\n",
"cycles"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have our cycles in a [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), we can play with the pandas API and answer questions about our logs\n",
"\n",
"#### Did I find cycles for the TV show \"Arthur Shorts\"?"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" t0 \\\n",
"0 2020-08-01 04:00:00 \n",
"1 2020-08-01 05:00:00 \n",
"2 2020-08-01 06:00:00 \n",
"3 2020-08-01 06:15:00 \n",
"4 2020-08-01 06:15:00 \n",
".. ... \n",
"116 2020-08-26 14:00:00 \n",
"117 2020-08-27 02:00:00 \n",
"118 2020-08-28 00:00:00 \n",
"119 2020-08-29 06:15:00 \n",
"120 2020-08-31 07:00:00 \n",
"\n",
" pattern \\\n",
"0 (Rick Mercer Report [d=0:30:00] Rick Mercer Report)[r=5 p=7 days, 0:00:00] \n",
"1 (Grand Designs)[r=31 p=1 day, 0:00:00] \n",
"2 (The Moblees [d=0:11:00] Big Block Sing Song [d=0:02:00] Big Block Sing Song [d=0:02:00] CBC Kid... \n",
"3 (CBC Kids [d=1:40:00] Big Block Sing Song [d=0:02:00] Furchester Hotel [d=0:38:00] Ollie: The Bo... \n",
"4 (CBC Kids)[r=29 p=0:12:00] \n",
".. ... \n",
"116 (Jamie's Super Foods)[r=3 p=1 day, 0:00:00] \n",
"117 (Mr. D)[r=4 p=0:30:00] \n",
"118 (Schitt's Creek)[r=8 p=0:30:00] \n",
"119 (CBC Kids)[r=23 p=0:12:00] \n",
"120 (CBC Kids)[r=17 p=0:13:00] \n",
"\n",
" repetition_major period_major sum_E \n",
"0 5 7 days 00:00:00 0 days 00:00:00 \n",
"1 31 1 days 00:00:00 0 days 00:00:00 \n",
"2 5 7 days 00:00:00 0 days 00:03:00 \n",
"3 5 7 days 00:00:00 0 days 00:02:00 \n",
"4 29 0 days 00:12:00 0 days 02:54:00 \n",
".. ... ... ... \n",
"116 3 1 days 00:00:00 0 days 00:00:00 \n",
"117 4 0 days 00:30:00 0 days 00:00:00 \n",
"118 8 0 days 00:30:00 0 days 00:00:00 \n",
"119 23 0 days 00:12:00 0 days 02:26:00 \n",
"120 17 0 days 00:13:00 0 days 02:15:00 \n",
"\n",
"[121 rows x 5 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cycles"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" t0 \\\n",
"15 2020-08-01 10:48:00 \n",
"50 2020-08-03 09:35:00 \n",
"112 2020-08-24 09:35:00 \n",
"\n",
" pattern \\\n",
"15 (Addison [d=0:12:00] Arthur Shorts [d=0:13:00] Stella & Sam [d=0:12:00] Wandering Wenda [d=0:34:... \n",
"50 ((Arthur Shorts [d=0:13:00] Arthur Shorts)[r=5 p=1 day, 0:00:00] [d=0:45:00] (Furchester Hotel [... \n",
"112 (Arthur Shorts [d=0:13:00] Arthur Shorts [d=0:15:00] Kiri & Lou [d=0:17:00] Furchester Hotel [d=... \n",
"\n",
" repetition_major period_major sum_E \n",
"15 5 7 days 00:00:00 0 days 00:00:00 \n",
"50 3 6 days 23:59:00 0 days 00:22:00 \n",
"112 5 0 days 23:59:00 0 days 00:11:00 "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cycles[cycles[\"pattern\"].apply(lambda x: \"Arthur Shorts\" in x)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### What are the top 10 longest cycles?"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" time event\n",
"0 2020-08-01 05:00:00 Grand Designs\n",
"1 2020-08-02 05:00:00 Grand Designs\n",
"2 2020-08-03 05:00:00 Grand Designs\n",
"3 2020-08-04 05:00:00 Grand Designs\n",
"4 2020-08-05 05:00:00 Grand Designs\n",
"5 2020-08-06 05:00:00 Grand Designs\n",
"6 2020-08-07 05:00:00 Grand Designs\n",
"7 2020-08-08 05:00:00 Grand Designs\n",
"8 2020-08-09 05:00:00 Grand Designs\n",
"9 2020-08-10 05:00:00 Grand Designs\n",
"10 2020-08-11 05:00:00 Grand Designs\n",
"11 2020-08-12 05:00:00 Grand Designs\n",
"12 2020-08-13 05:00:00 Grand Designs\n",
"13 2020-08-14 05:00:00 Grand Designs\n",
"14 2020-08-15 05:00:00 Grand Designs\n",
"15 2020-08-16 05:00:00 Grand Designs\n",
"16 2020-08-17 05:00:00 Grand Designs\n",
"17 2020-08-18 05:00:00 Grand Designs\n",
"18 2020-08-19 05:00:00 Grand Designs\n",
"19 2020-08-20 05:00:00 Grand Designs\n",
"20 2020-08-21 05:00:00 Grand Designs\n",
"21 2020-08-22 05:00:00 Grand Designs\n",
"22 2020-08-23 05:00:00 Grand Designs\n",
"23 2020-08-24 05:00:00 Grand Designs\n",
"24 2020-08-25 05:00:00 Grand Designs\n",
"25 2020-08-26 05:00:00 Grand Designs\n",
"26 2020-08-27 05:00:00 Grand Designs\n",
"27 2020-08-28 05:00:00 Grand Designs\n",
"28 2020-08-29 05:00:00 Grand Designs\n",
"29 2020-08-30 05:00:00 Grand Designs\n",
"30 2020-08-31 05:00:00 Grand Designs"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pcm.reconstruct([1])"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" time event\n",
"143 2020-08-01 01:12:00 OLYMPIC GAMES REPLAY\n",
"287 2020-08-01 13:00:00 Bondi Vet\n",
"47 2020-08-01 13:30:00 Basketball\n",
"122 2020-08-01 15:55:00 Basketball\n",
"347 2020-08-01 19:00:00 Still Standing\n",
".. ... ...\n",
"351 2020-08-31 16:00:00 Escape to the Country\n",
"348 2020-08-31 17:00:00 Murdoch Mysteries\n",
"409 2020-08-31 17:59:00 News\n",
"386 2020-08-31 19:00:00 Hockey Night in Canada\n",
"74 2020-08-31 23:15:00 CBC News: The National\n",
"\n",
"[454 rows x 2 columns]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pcm.get_residuals()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### What are the 10 most unpunctual TV programs?\n",
"For this we are going to:\n",
" 1. extract the shift corrections along with other information about our cycles\n",
" 2. compute the sum of the absolute values of the shift corrections for every cycle\n",
" 3. get the 10 biggest sums"
]
},
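{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cells below are kept commented out. As a self-contained sketch of step 2, here is the same absolute-sum idea on toy data (the real `dE` column produced by `pcm.transform` is assumed to hold, per cycle, a list of shift corrections as timedeltas):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# toy stand-in for the \"dE\" column: per-cycle lists of shift corrections\n",
"dE = pd.Series(\n",
"    [[pd.Timedelta(\"-1 min\"), pd.Timedelta(\"2 min\")], [pd.Timedelta(\"0 min\")]],\n",
"    index=[\"News\", \"Basketball\"],\n",
")\n",
"\n",
"# sum of the absolute values of the shift corrections, for every cycle\n",
"shift_sums = dE.map(lambda shifts: sum((abs(s) for s in shifts), pd.Timedelta(0)))\n",
"shift_sums.nlargest(1)"
]
},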
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# full_cycles = pcm.transform(x)\n",
"# full_cycles.head()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# def absolute_sum(*args):\n",
"# return sum(map(abs, *args))\n",
"\n",
"# # level 0 is the name of the TV program\n",
"# shift_sums = full_cycles[\"dE\"].map(absolute_sum).groupby(level=[0]).sum()\n",
"# shift_sums.nlargest(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### What TV programs have been broadcast every day for at least 5 days straight?\n",
"Let's use the [pandas.DataFrame.query](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html) method to express our question in an SQL-like syntax."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# cycles.query('length_major >= 5 and period_major >= 3600', engine='python')\n",
"# # cycles.query('length >= 5 and period == \"1 days\"', engine='python')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### What TV programs are broadcast only on business days?\n",
"The previous query shows that we have a lot of cycles of length 5 with a period of 1 day.\n",
"An intuition is that these cycles take place on business days. Let's confirm this by considering cycles with\n",
" 1. start timestamps on Mondays\n",
" 2. periods of roughly 1 day "
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# monday_starts = cycles[cycles.start.dt.weekday == 0] # start on monday\n",
"# monday_starts.query('length == 5 and period == \"1 days\"', engine='python')"
]
},
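{
"cell_type": "markdown",
"metadata": {},
"source": [
"The query above is commented out; here is a minimal standalone sketch of the same filter on a toy table with the assumed `start`, `length` and `period` columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# toy cycles table: 2020-08-03 is a Monday, 2020-08-04 a Tuesday\n",
"toy = pd.DataFrame({\n",
"    \"start\": pd.to_datetime([\"2020-08-03\", \"2020-08-04\"]),\n",
"    \"length\": [5, 4],\n",
"    \"period\": [pd.Timedelta(\"1 days\"), pd.Timedelta(\"1 days\")],\n",
"})\n",
"\n",
"monday_starts = toy[toy.start.dt.weekday == 0]  # keep cycles starting on a Monday\n",
"monday_starts.query('length == 5 and period == \"1 days\"', engine=\"python\")"
]
},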
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"------\n",
"### Load your own custom datasets\n",
"\n",
"If you want to **use your own datasets from a file**, you can use the `fetch_file` function. \n",
"There are **3 ways to write the file**:\n",
"\n",
"#### 1. **Each line consists of a datetime followed by its associated event.** \n",
"By default, **the separator is a `,`**; this is customisable. \n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['2020-04-16 07:30:00,wake up',\n",
" '2020-04-17 07:29:00,wake up',\n",
" '2020-04-17 07:29:00,wake up',\n",
" '2020-04-18 07:29:00,wake up',\n",
" '2020-04-19 07:30:00,wake up',\n",
" '2020-04-20 07:32:00,wake up',\n",
" '2020-04-23 07:30:00,wake up']"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with open(\"dataset_datetime.csv\", \"r\") as f:\n",
" data = f.read().splitlines()\n",
" \n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/hcourtei/Projects/F-WIN/scikit-mine/codes/skMineDev/skmine/periodic/cycles.py:131: UserWarning: found 1 duplicates in the input sequence, they have been removed.\n",
" warnings.warn(f\"found {diff} duplicates in the input sequence, they have been removed.\")\n"
]
},
{
"data": {
"text/plain": [
" t0 pattern repetition_major \\\n",
"0 2020-04-16 07:30:00 (wake up)[r=5 p=1 day, 0:00:00] 5 \n",
"\n",
" period_major sum_E \n",
"0 1 days 0 days 00:04:00 "
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from skmine.datasets.periodic import fetch_file\n",
"\n",
"data = fetch_file(\"dataset_datetime.csv\", separator=\",\") # by default separator is already comma\n",
"pcm = PeriodicPatternMiner().fit(data)\n",
"pcm.transform(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that several datetime formats can be inferred automatically. You can also force parsing according to a specific format with the extra option `format=\"%d/%m/%Y %H:%M:%S\"`; see https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior for all format codes.\n",
"- `2020-04-23 07:30:00` -> YYYY-MM-DD hh:mm:ss\n",
"- `1997-07-23T19:20+01:00` -> YYYY-MM-DDThh:mmTZD (TZD = time zone designator)\n",
"- `1997-07-23T19:20:30.45+01:00` -> YYYY-MM-DDThh:mm:ss.sTZD\n",
"- `02/23/2018` -> MM/DD/YYYY\n",
"- `02/23/18` -> MM/DD/YY\n",
"- ...\n",
"\n",
"**2 points of caution**:\n",
"- **Beware of formats like \"Feb 23, 2018\" (Mth D, YYYY)**, because the comma is the default separator. To use such a format, you must put **double quotes around the date**.\n",
"- **The MM/DD (month/day) ordering is inferred by default (when the day is 12 or less)**; for DD/MM, use the `format` option as in the example below: \n",
" "
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"timestamp\n",
"2018-02-10 07:30:00 wake up\n",
"2018-02-23 07:30:00 coffee\n",
"2018-03-10 07:30:00 wake up\n",
"Name: dataset_datetime_DD_MM.csv, dtype: string"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fetch_file(\"dataset_datetime_DD_MM.csv\", separator=',' ,format=\"%d/%m/%Y %H:%M:%S\")"
]
},
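{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same day-first caution applies when parsing dates with pandas directly; a minimal sketch of the ambiguity:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# MM/DD is inferred by default -> February 3rd\n",
"inferred = pd.to_datetime(\"02/03/2018\")\n",
"\n",
"# forcing DD/MM with an explicit format -> March 2nd\n",
"forced = pd.to_datetime(\"02/03/2018\", format=\"%d/%m/%Y\")\n",
"\n",
"inferred.month, forced.month"
]
},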
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n",
"#### 2. **Datetimes can also be replaced by integers**. \n",
" Example:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['1,wake up',\n",
" '3,wake up',\n",
" '10,wake up',\n",
" '30,wake up',\n",
" '40,wake up',\n",
" '70,wake up',\n",
" '100,wake up']"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with open(\"dataset_integer.csv\", \"r\") as f:\n",
" data = f.read().splitlines()\n",
" \n",
"data"
]
},
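{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of what such an integer-indexed file becomes once parsed, using plain pandas (no skmine required):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from io import StringIO\n",
"\n",
"# same layout as dataset_integer.csv: an integer timestamp, then the event\n",
"raw = \"1,wake up\\n3,wake up\\n10,wake up\\n\"\n",
"s = pd.read_csv(StringIO(raw), header=None, index_col=0).squeeze(\"columns\")\n",
"s"
]
},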
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3. **Use only the names of the events**.\n",
"In this case, **the indexes correspond to the line numbers (starting from 0)**. \n",
"Example:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['wake up', 'wake up', 'wake up', 'wake up', 'wake up', 'wake up', 'wake up']"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with open(\"dataset_no_index.csv\", \"r\") as f:\n",
" data = f.read().splitlines()\n",
" \n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"timestamp\n",
"0 wake up\n",
"1 wake up\n",
"2 wake up\n",
"3 wake up\n",
"4 wake up\n",
"5 wake up\n",
"6 wake up\n",
"Name: dataset_no_index.csv, dtype: string"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = fetch_file(\"dataset_no_index.csv\")\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"References\n",
"----------\n",
"\n",
"1. Galbrun, E & Cellier, P & Tatti, N & Termier, A & Crémilleux, B\n",
" \"Mining Periodic Pattern with a MDL Criterion\"\n",
"\n",
"2. Galbrun, E\n",
" \"The Minimum Description Length Principle for Pattern Mining : A survey\" \n",
"\n",
"3. Termier, A\n",
" [\"Periodic pattern mining\"](http://people.irisa.fr/Alexandre.Termier/dmv/DMV_Periodic_patterns.pdf) "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px",
"width": "247.55px"
},
"toc_section_display": true,
"toc_window_display": true
},
"vscode": {
"interpreter": {
"hash": "c4418ac5ac56bcc1d654aebcb97e1ca3ff1be77625ab045a7b4b5e8ee820789e"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}