Great time series analysis example using the "Ages at Death of the Kings of England" Dataset
This is a great example of how ignoring outliers can make your analysis go very wrong. We will show you the wrong way and then the right way. A quote comes to mind: "A good forecaster is not smarter than everyone else, he merely has his ignorance better organized".
A fun dataset to explore is the "age of the death of kings of England". The data comes from the 1977 book by McNeill called "Interactive Data Analysis" and is an example used by some to perform time series analysis. We intend to show you the right way and the wrong way (we have seen examples of this!). Here is the data so you can try this out yourself: 60,43,67,50,56,42,50,65,68,43,65,34,47,34,49,41,13,35,53,56,16,43,69,59,48,59,86,55,68,51,33,49,67,77,81,67,71,81,68,70,77,56
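To make the series easy to experiment with, the values listed above can be placed in a plain Python list. This is only a convenience sketch; the variable name `kings_ages` is chosen here for illustration and does not come from the original analysis.

```python
# Ages at death of 42 English monarchs, in the order given in the text above.
kings_ages = [
    60, 43, 67, 50, 56, 42, 50, 65, 68, 43, 65, 34, 47, 34,
    49, 41, 13, 35, 53, 56, 16, 43, 69, 59, 48, 59, 86, 55,
    68, 51, 33, 49, 67, 77, 81, 67, 71, 81, 68, 70, 77, 56,
]

print(len(kings_ages), min(kings_ages), max(kings_ages))
```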
It begins at William the Conqueror from the year 1028 to present (excluding the current Queen Elizabeth II) and shows the ages at death for 42 kings. It is an interesting example in that there is an underlying variable where life expectancy gets larger over time due to better health, eating, medicine, cryogenic chambers???, etc., and that is ignored in the "wrong way" example. We have seen the wrong way example because its authors are not looking for deterministic approaches to modeling and forecasting. Box-Jenkins ignored deterministic aspects of modeling when they formulated the ARIMA modeling process in 1976. The world has changed since then with research done by Tsay, Chatfield/Prothero ("Box-Jenkins seasonal forecasting: Problems in a case study (with discussion)", J. Roy. Statist. Soc., A, 136, 295-352), I. Chang, and Fox that showed how important it is to consider deterministic options to arrive at a better model and forecast.
As for this dataset, there could be an argument that there would be no autocorrelation in the age between each king, but an argument could be made that heredity/genetics could have an autocorrelative impact, or that periods of stability or instability of the government would also matter. There could be an argument that there is an upper limit to how long we can live so there should be a cap on the maximum life span.
If you looked at the dataset and knew nothing about statistics, you might say that the first dozen observations look stable and see that there is a trend up with some occasional really low values. If you ignored the outliers you might say there has been a change to a new higher mean, but that is when you ignore outliers and fall prey to Simpson's paradox or, simply put, "local vs global" inferences.
If you have some knowledge about time series analysis and were using your "rule book" on how to model, you might look at the ACF and PACF and say the series has no need for differencing and an AR1 model would suit it just fine. We have seen examples on the web where these experts use their brain and see the need for differencing and an AR1 as they like the forecast.
You might (incorrectly) look at the Autocorrelation function and Partial Autocorrelation, see a spike at Lag 1, conclude that there is autocorrelation at lag 1, and then include an AR1 component in the model. Not shown here, but if you calculate the ACF on the first 10 observations the sign is negative and if you do the same on the last 32 observations it is positive, supporting the "two trend" theory.
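The sub-sample claim above can be checked with a few lines of NumPy. This is a minimal sketch using the usual sample lag-1 autocorrelation formula; the helper name `lag1_autocorr` is hypothetical, and the split into the first 10 and last 32 observations follows the text.

```python
import numpy as np

ages = np.array([
    60, 43, 67, 50, 56, 42, 50, 65, 68, 43, 65, 34, 47, 34,
    49, 41, 13, 35, 53, 56, 16, 43, 69, 59, 48, 59, 86, 55,
    68, 51, 33, 49, 67, 77, 81, 67, 71, 81, 68, 70, 77, 56,
], dtype=float)

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1 (deviations taken from the sample mean)."""
    d = x - x.mean()
    return float(np.sum(d[1:] * d[:-1]) / np.sum(d * d))

r_first = lag1_autocorr(ages[:10])   # first 10 observations
r_last = lag1_autocorr(ages[10:])    # last 32 observations
print(f"first 10: {r_first:.2f}, last 32: {r_last:.2f}")
```

The two signs differ, which is the "local vs global" effect described above: a single lag-1 coefficient computed on the whole series blends two different regimes.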
The PACF looks as follows:
Here is the forecast when using differencing and an AR1 model.
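For readers who want to reproduce the flavour of this "wrong way" model, here is a rough NumPy sketch. It assumes a plain least-squares estimate of the AR1 coefficient on the differenced series with no constant, which is a simplification and not necessarily how any particular package estimates the model.

```python
import numpy as np

ages = np.array([
    60, 43, 67, 50, 56, 42, 50, 65, 68, 43, 65, 34, 47, 34,
    49, 41, 13, 35, 53, 56, 16, 43, 69, 59, 48, 59, 86, 55,
    68, 51, 33, 49, 67, 77, 81, 67, 71, 81, 68, 70, 77, 56,
], dtype=float)

d = np.diff(ages)                                  # first-order differences
# Least-squares AR1 coefficient for d[t] = phi * d[t-1] + e[t] (no constant).
phi = float(np.sum(d[1:] * d[:-1]) / np.sum(d[:-1] ** 2))

# Iterate the model forward: each forecast difference shrinks by a factor phi.
level, step = ages[-1], d[-1]
forecast = []
for _ in range(10):
    step = phi * step
    level = level + step
    forecast.append(level)

print(f"phi = {phi:.2f}")
print(np.round(forecast, 1))
```

Because the estimated coefficient lies between -1 and 0, the forecast oscillates briefly and then settles on a constant level, with no trend in it at all.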
The ACF and PACF of the residuals look OK, and here are the residuals. This is where you start to see how the outliers have been ignored, with big spikes at 11, 17, 23, 27, 31 and general underfitting, with values on the high side in the second half of the data, as the model is inadequate. We want the residuals to be random around zero.
Now, to do it the right way....and with no human intervention whatsoever.
Autobox finds an AR1 to be significant and brings in a constant. It then identifies two time trends and 4 outliers to be brought into the model. We all know what "step down" regression modeling is, but when you are adding variables to the model it is called "step up". This is what is lacking in other forecasting software.
Note that the first trend is not significant at the 95% level. Autobox uses a sliding scale based on the number of observations. So, for large N .05 is the critical value, but this data set only has 42 observations so the critical value is adjusted. When all of the variables are assembled in the model, the model looks like this:
If you consider deterministic variables like outliers, level shifts, and time trends, your model and forecast will look very different. Do we expect people to live longer in a straight line? No. This is just a time series example showing you how to model data. Is the current monarch (Queen Elizabeth II) 87 years old? Yes. Are people living longer? Yes. The trend variable is a surrogate for the general population's longer life expectancy.
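To illustrate what "deterministic variables" means in practice, the sketch below fits an ordinary least-squares regression with a constant, a single linear time trend and two pulse (one-time outlier) dummies. This is not Autobox's procedure or its final model: the single trend and the choice of observations 17 and 21 (the two lowest values, 13 and 16) are assumptions made here purely for illustration.

```python
import numpy as np

ages = np.array([
    60, 43, 67, 50, 56, 42, 50, 65, 68, 43, 65, 34, 47, 34,
    49, 41, 13, 35, 53, 56, 16, 43, 69, 59, 48, 59, 86, 55,
    68, 51, 33, 49, 67, 77, 81, 67, 71, 81, 68, 70, 77, 56,
], dtype=float)

n = len(ages)
t = np.arange(1, n + 1)                    # time index 1..42
pulse_17 = (t == 17).astype(float)         # assumed outlier: age 13
pulse_21 = (t == 21).astype(float)         # assumed outlier: age 16

# Design matrix: constant, linear trend, two pulse dummies.
X = np.column_stack([np.ones(n), t, pulse_17, pulse_21])
coef, *_ = np.linalg.lstsq(X, ages, rcond=None)
const, slope, eff_17, eff_21 = coef

print(f"trend per reign: {slope:.2f}")
print(f"pulse effects: {eff_17:.1f}, {eff_21:.1f}")
```

Each pulse dummy absorbs its own observation, so the trend is estimated from the remaining points and is not dragged down by the two unusually early deaths.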
Here are the residuals. They are pretty random. There is some underfitting in the middle part of the dataset, but the model is more robust and sensible than the flat forecast kicked out by the difference, AR1 model.
Here is the actual and cleansed history of outliers. It's when you correct for outliers that you can really see why Autobox is doing what it is doing.
In the Facebook Live code along session on the 4th of January, we checked out Google Trends data of keywords 'diet', 'gym' and 'finance' to see how they vary over time. We asked ourselves: could there be more searches for these terms in January, when we're all trying to turn over a new leaf?
In this tutorial, you'll go through the code that we put together during the session step by step. You're not going to do much mathematics but you are going to do the following:
- Source your data
- Wrangle your data
- Exploratory Data Analysis
- Trends and seasonality in time series data
- Identifying Trends
- Seasonal patterns
- First Order Differencing
- Periodicity and Autocorrelation
The emphasis of this tutorial will be squarely on a visual exploration of the dataset in question.
For more on pandas, check out DataCamp's Data Manipulation with Python track. For more on time series with pandas, check out the Manipulating Time Series Data in Python course.
Importing Packages and Data
So the question remains: could there be more searches for these terms in January when we're all trying to turn over a new leaf?
Let's find out by going here and checking out the data. Note that this tutorial is inspired by this FiveThirtyEight piece.
You can also download the data as a .csv, save to file and import into your very own Python environment to perform your own analysis. You'll do this now. Let's get it!
To start, you'll import some packages: in this case, you'll make use of pandas, Matplotlib and Seaborn.
Additionally, if you want the images to be plotted in the Jupyter Notebook, you can make use of the IPython magic by adding %matplotlib inline to your code. Alternatively, you can also switch to the Seaborn plotting defaults.
Import the data that you downloaded with pandas' read_csv() and check out the first several rows with .head().
Note that you add the skiprows argument to skip the first row at the start of the file.
| Month | diet: (Worldwide) | gym: (Worldwide) | finance: (Worldwide) |
You can also use the .info() method to check out your data types, number of rows and more:
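The import and inspection steps above can be sketched as follows. Since the downloaded file is not part of this text, the example reads a small made-up excerpt from a string; the sample values and the extra first line are assumptions that mimic the layout described above, and in practice you would pass the path of your own .csv file instead.

```python
import io
import pandas as pd

# Made-up excerpt in the layout described above: one extra line at the top,
# then the real header row. The numbers are illustrative, not real data.
raw = """Category: All categories
Month,diet: (Worldwide),gym: (Worldwide),finance: (Worldwide)
2004-01,100,31,48
2004-02,75,26,49
2004-03,67,24,47
"""

# skiprows=1 drops the first line so the second line becomes the header.
df = pd.read_csv(io.StringIO(raw), skiprows=1)

print(df.head())   # first rows of the table
df.info()          # column dtypes, non-null counts, number of rows
```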
Now that you've imported your data from Google trends and had a brief look at it, it's time to wrangle your data and get it into the form you want to prepare it for data analysis.
Wrangle Your Data
The first thing that you want to do is rename the columns of your DataFrame so that they have no whitespaces in them. There are multiple ways to do this, but for now, you'll assign the DataFrame's columns attribute a list of what you want the columns to be called.
Double check the result of your reassignment by looking at the first rows again:
Next, you'll turn the month column into a DateTime data type and make it the index of the DataFrame.
Note that you do this because you saw in the result of the .info() method that the month column was stored with the generic object data type. Now, that generic data type encapsulates everything from strings to integers, etc. That's not exactly what you want when you want to be looking at time series data. That's why you'll use pandas' to_datetime() to convert the month column in your DataFrame to a DateTime.
Be careful! Make sure to include the inplace argument when you're setting the index of the DataFrame so that you actually alter the original index and set it to the month column.
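The wrangling steps above might look like this. It is a sketch on a tiny made-up DataFrame; the lowercase column names are an assumption chosen for illustration.

```python
import pandas as pd

# Tiny made-up stand-in for the imported Google Trends table.
df = pd.DataFrame({
    "Month": ["2004-01", "2004-02", "2004-03"],
    "diet: (Worldwide)": [100, 75, 67],
    "gym: (Worldwide)": [31, 26, 24],
    "finance: (Worldwide)": [48, 49, 47],
})

# 1. Rename the columns so that they contain no whitespace.
df.columns = ["month", "diet", "gym", "finance"]

# 2. Convert the month strings to real datetimes.
df["month"] = pd.to_datetime(df["month"])

# 3. Make the month the index; inplace=True modifies df itself
#    instead of returning a new DataFrame.
df.set_index("month", inplace=True)

print(df.head())
```

Without `inplace=True`, `set_index` returns a new DataFrame and leaves `df` unchanged, so you would need to write `df = df.set_index("month")` instead.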
Now it's time to explore your DataFrame visually.
A bit of Exploratory Data Analysis (EDA)
You can use a built-in pandas visualization method to plot your data as 3 line plots on a single figure (one for each column, namely, 'diet', 'gym', and 'finance').
Note that you can also specify some arguments to this method, such as figsize, linewidth and fontsize to set the figure size, line width and font size of the plot, respectively.
Additionally, you'll see that what you see on the x-axis is not the months, as the default label suggests, but the years. To make your plot a bit more accurate, you'll set the label on the x-axis to 'Year' and also set the font size to 20.
Tip: if you want to suppress the Matplotlib output, just add a semicolon to your last line of code!
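A plotting call along the lines described above could look like the sketch below. The data is synthetic (a trend, a January spike and noise, all made up) because the real export is not included here, and the `Agg` backend line is only there so that the example also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import numpy as np
import pandas as pd

# Synthetic monthly data standing in for the Google Trends export.
idx = pd.date_range("2004-01-01", periods=120, freq="MS", name="month")
rng = np.random.default_rng(0)
january = (idx.month == 1).astype(float)
df = pd.DataFrame({
    "diet": 60 - 0.3 * np.arange(120) + 20 * january + rng.normal(0, 2, 120),
    "gym": 30 + 0.3 * np.arange(120) + 15 * january + rng.normal(0, 2, 120),
    "finance": 45 + rng.normal(0, 2, 120),
}, index=idx)

# One line per column on a single figure, with larger lines and fonts.
ax = df.plot(figsize=(20, 10), linewidth=5, fontsize=20)
ax.set_xlabel("Year", fontsize=20)
```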
Note that this data is relative. As you can read on Google trends:
Numbers represent search interest relative to the highest point on the chart for the given region and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. Likewise a score of 0 means the term was less than 1% as popular as the peak.
If you want, you can also plot the 'diet' column by itself as a time series:
Note: the first thing to notice is that there is seasonality: each January, there's a big jump. Also, there seems to be a trend: it seems to go slightly up, then down, back up and then back down. In other words, it looks like there are trends and seasonal components to these time series.
With this in mind, you'll learn how to identify trends in your time series!
Trends and Seasonality in Time Series Data
Identifying Trends in Time Series
There are several ways to think about identifying trends in time series. One popular way is by taking a rolling average, which means that, for each time point, you take the average of the points on either side of it. Note that the number of points is specified by a window size, which you need to choose.
Because you take the average, this tends to smooth out noise and seasonality. You'll see an example of that right now. Check out this rolling average of 'diet' using the built-in pandas methods.
When it comes to determining the window size, here, it makes sense to first try out one of twelve months, as you're talking about yearly seasonality.
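A rolling average with a twelve-month window can be sketched as follows. The data is synthetic (a downward trend, a January spike and noise, all made up), standing in for the real 'diet' series.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data standing in for the Google Trends export.
idx = pd.date_range("2004-01-01", periods=120, freq="MS", name="month")
rng = np.random.default_rng(0)
january = (idx.month == 1).astype(float)
df = pd.DataFrame(
    {"diet": 60 - 0.3 * np.arange(120) + 20 * january + rng.normal(0, 2, 120)},
    index=idx,
)

diet = df[["diet"]]                    # two sets of brackets -> DataFrame
diet_trend = diet.rolling(12).mean()   # 12-month window, chained with .mean()

print(diet_trend.dropna().head())
```

Every window of twelve consecutive months contains exactly one January, so the yearly spike is averaged away and what remains is mostly the trend. Note that, by default, pandas places each average at the end of its window, so the first eleven values are missing.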
Note that in the code chunk above you used two sets of square brackets to extract the 'diet' column as a DataFrame; if you had used one set, like df['diet'], you would have created a pandas Series.
In the code chunk above, you also chained methods: you called methods on an object one after another. Method chaining is pretty popular and pandas is one of the packages that really allows you to use that style of programming to the max!
Now you have the trend that you're looking for! You have removed most of the seasonality compared to the previous plot.
You can also plot the rolling average of 'gym' using built-in methods with the same window size as you took for the 'diet' data:
You have successfully removed the seasonality and you see an upward trend for "gym"! But how do these two search terms compare?
You can figure this out by plotting the trends of 'gym' and 'diet' on a single figure:
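Putting both trends into one DataFrame could be done as in this sketch, again on synthetic data (made-up trends, January spikes and noise); the name `df_rm` is chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data standing in for the Google Trends export.
idx = pd.date_range("2004-01-01", periods=120, freq="MS", name="month")
rng = np.random.default_rng(0)
january = (idx.month == 1).astype(float)
df = pd.DataFrame({
    "diet": 60 - 0.3 * np.arange(120) + 20 * january + rng.normal(0, 2, 120),
    "gym": 30 + 0.3 * np.arange(120) + 15 * january + rng.normal(0, 2, 120),
}, index=idx)

diet = df[["diet"]]
gym = df[["gym"]]

# axis=1 places the two rolling averages side by side as columns.
df_rm = pd.concat([diet.rolling(12).mean(), gym.rolling(12).mean()], axis=1)

print(df_rm.dropna().head())
# df_rm.plot(figsize=(20, 10), linewidth=5, fontsize=20) would draw both trends.
```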
You created a new DataFrame that has two columns with the rolling average of 'diet' and 'gym'. You used the pd.concat() function, which takes a list of the columns as a first argument and, since you want to concatenate them as columns, you also added the axis argument, which you set to 1.
Next, you plotted the DataFrame with the .plot() method, just like you did before! So now, removing the seasonality, you see that 'diet' potentially has some form of seasonality, whereas 'gym' is actually increasing!
With the trends in the data identified, it's time to think about seasonality, which is the repetitive nature of your time series. As you saw in the beginning of this tutorial, it looked like there were trends and seasonal components to the time series of the data.
Seasonal Patterns in Time Series Data
One way to think about the seasonal components to the time series of your data is to remove the trend from a time series, so that you can more easily investigate seasonality. To remove the trend, you can subtract the trend you computed above (rolling mean) from the original signal. This, however, will be dependent on how many data points you averaged over.
Another way to remove the trend is called "differencing", where you look at the difference between successive data points (called "first-order differencing", because you're only looking at the difference between one data point and the one before it).
You can use pandas and the .diff() and .plot() methods to compute and plot the first order difference of the 'diet' Series:
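First-order differencing can be sketched like this on synthetic data (a made-up trend, January spike and noise standing in for the real 'diet' series).

```python
import numpy as np
import pandas as pd

# Synthetic monthly data standing in for the Google Trends export.
idx = pd.date_range("2004-01-01", periods=120, freq="MS", name="month")
rng = np.random.default_rng(0)
january = (idx.month == 1).astype(float)
diet = pd.Series(
    60 - 0.3 * np.arange(120) + 20 * january + rng.normal(0, 2, 120),
    index=idx, name="diet",
)

# Each value minus the one before it; the first entry has no predecessor.
diet_diff = diet.diff()

# The months with the largest month-to-month increases.
top_months = diet_diff.nlargest(3).index.month
print(list(top_months))
# diet_diff.plot(figsize=(20, 10), linewidth=5, fontsize=20) would draw the series.
```

In this synthetic series the largest jumps all fall in January, which is the same pattern the text describes for the real data.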
See that you have removed much of the trend and you can really see the peaks in January every year. Each January, there is a huge spike of 20 or more percent on the highest search item you've seen!
Note: you can also perform 2nd order differencing, which means differencing the first-order differences once more (that is, taking the difference of the differences), if the trend is not yet entirely removed. See here for more on differencing.
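As a small worked example of second-order differencing, the sketch below applies `.diff()` twice to a short quadratic series and compares the result with the equivalent formula y[t] - 2*y[t-1] + y[t-2]. The series is made up for illustration.

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0])   # squares: a curved "trend"

first = s.diff()           # 3, 5, 7, 9: still trending upwards
second = s.diff().diff()   # 2, 2, 2: the trend is gone

# The same thing written out explicitly.
manual = s - 2 * s.shift(1) + s.shift(2)

print(second.dropna().tolist())   # [2.0, 2.0, 2.0]
```

One round of differencing removes a linear trend; a second round removes what is left of a quadratic one.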
Differencing is super helpful in turning your time series into a stationary time series. You won't get too much into these here but a stationary time series is one whose statistical properties (such as mean and variance) don't change over time. These time series are useful because many time series forecasting methods are based on the assumption that the time series is approximately stationary.
With all of this at hand, you'll now analyze the periodicity in your time series by looking at its autocorrelation function. But before that, you'll take a short detour into correlation.
Periodicity and Autocorrelation
A time series is periodic if it repeats itself at equally spaced intervals, say, every 12 months.
Another way to think of this is that if the time series has a peak somewhere, then it will have a peak 12 months after that and, if it has a trough somewhere, it will also have a trough 12 months after that.
Yet another way of thinking about this is that the time series is correlated with itself shifted by 12 months. That means that, if you took the time series and moved it 12 months backwards or forwards, it would map onto itself in some way.
Considering the correlation of a time series with such a shifted version of itself is captured by the concept of autocorrelation.
You'll get to this in a minute.
First, remind yourself about correlation and take an intuitive approach to this concept!
The correlation coefficient of two variables captures how linearly related they are. To understand this, you'll take a look at a practical example with the help of the iris data set, which contains measurements of flowers.
To study this in further detail, you'll import the iris dataset from scikit-learn, turn it into a DataFrame and view the first rows with the help of .head():
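One way to get the iris data into a DataFrame is sketched below. It assumes scikit-learn 0.23 or later, where `load_iris` accepts `as_frame=True`; the variable name `df_iris` is chosen here for illustration, and the original session may have built the DataFrame differently.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects; .frame holds the four measurement
# columns plus a 'target' column with the species encoded as 0, 1, 2.
iris = load_iris(as_frame=True)
df_iris = iris.frame

print(df_iris.head())
```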
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
Just as a reminder for you to understand this data set, all flowers contain a sepal and a petal. The sepal encloses the petals and is typically green and leaf-like, while the petals are typically colored leaves. The target column, which is the target variable, is the species of the iris flowers, which can either be Setosa, Versicolor or Virginica. In the table above, they are encoded as 0, 1, and 2, respectively.
Now, to think about correlation, you'll take a look at how the sepal length of the iris flowers is correlated with the sepal width. To do this, you'll build a scatter plot of 'sepal length (cm)' against 'sepal width (cm)':
Note that you turned off the linear regression fit in this plot.
Are sepal length and width positively or negatively correlated across all flowers? Are they positively or negatively correlated within each species? This is an essential distinction.
Remember that the former means that as the sepal length increases, the sepal width also increases in a linear manner. The latter means that if the sepal length increases, the sepal width would decrease in a linear fashion.
At first sight, it seems that there is a negative correlation in the above plot: as the sepal length increases, you see that the sepal width decreases slightly.
Let's now build a scatter plot of 'sepal length (cm)' against 'sepal width (cm)', coloured by the target:
At first sight, it seems like the above plot exhibits a positive correlation: for each species of the iris flower, you see that when the sepal length increases, the sepal width also increases.
Visualizations are a great way to get an intuition of correlation, but the way you could think about that in greater detail is to actually compute a correlation coefficient.
You can compute the correlation coefficients of each pair of measurements with the help of the .corr() method:
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
| sepal length (cm) | 1.000000 | -0.109369 | 0.871754 | 0.817954 | 0.782561 |
| sepal width (cm) | -0.109369 | 1.000000 | -0.420516 | -0.356544 | -0.419446 |
| petal length (cm) | 0.871754 | -0.420516 | 1.000000 | 0.962757 | 0.949043 |
| petal width (cm) | 0.817954 | -0.356544 | 0.962757 | 1.000000 | 0.956464 |
Note that 'sepal length (cm)' and 'sepal width (cm)' seem to be negatively correlated! And they are, over the entire population of flowers measured. You see that the correlation coefficient is -0.1. However, they are not negatively correlated within each species; the 0.78 in the table is the correlation between sepal length and the target, not a within-species coefficient.
For those interested, this is known as Simpson's paradox and is essential when thinking about causal inference. You can read more here.
Let's explore this further. Let's compute the correlation coefficients of each pair of measurements within each species. The way to do this is by chaining the .groupby() and .corr() methods, to group by the target and print the correlation coefficient:
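The overall and per-species coefficients can be compared side by side as in this sketch. It assumes scikit-learn 0.23 or later for `as_frame=True`. The exact numbers depend on the version of the iris data that ships with your scikit-learn, so they may differ slightly from the tables in this tutorial.

```python
from sklearn.datasets import load_iris

df_iris = load_iris(as_frame=True).frame
cols = ["sepal length (cm)", "sepal width (cm)"]

# Correlation over all 150 flowers, ignoring the species.
overall = df_iris[cols].corr().loc[cols[0], cols[1]]

# Correlation matrix computed separately for each species (target 0, 1, 2).
by_species = df_iris.groupby("target")[cols].corr()

# Pick the sepal length vs sepal width entry for every species.
within = by_species.xs(cols[0], level=1)[cols[1]]

print(f"overall: {overall:.2f}")
print(within.round(2))
```

The overall coefficient is negative while all three within-species coefficients are positive, which is the reversal discussed here.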
| target | | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) |
| 0.0 | petal length (cm) | 1.000000 | 0.306308 | 0.263874 | 0.176695 |
| | petal width (cm) | 0.306308 | 1.000000 | 0.279092 | 0.279973 |
| | sepal length (cm) | 0.263874 | 0.279092 | 1.000000 | 0.746780 |
| | sepal width (cm) | 0.176695 | 0.279973 | 0.746780 | 1.000000 |
| 1.0 | petal length (cm) | 1.000000 | 0.786668 | 0.754049 | 0.560522 |
| | petal width (cm) | 0.786668 | 1.000000 | 0.546461 | 0.663999 |
| | sepal length (cm) | 0.754049 | 0.546461 | 1.000000 | 0.525911 |
| | sepal width (cm) | 0.560522 | 0.663999 | 0.525911 | 1.000000 |
| 2.0 | petal length (cm) | 1.000000 | 0.322108 | 0.864225 | 0.401045 |
| | petal width (cm) | 0.322108 | 1.000000 | 0.281108 | 0.537728 |
| | sepal length (cm) | 0.864225 | 0.281108 | 1.000000 | 0.457228 |
| | sepal width (cm) | 0.401045 | 0.537728 | 0.457228 | 1.000000 |
In this correlation matrix, you can see that:
- For target 0, the sepal length and width have a correlation of 0.75;
- For target 1, you have a coefficient of 0.53; and
- For target 2, you get a correlation of 0.46.
These are all decreasing amounts of positive correlation, but they're all a lot more positively correlated than your original negative correlation was.
This is incredibly telling and is a clear reminder of the importance of analyzing your data thoroughly.
Now that you have taken a closer look at correlation, you're ready to analyze the periodicity in your time series by looking at its autocorrelation function!
To start off, plot all your time series again to remind yourself of what they look like:
Then, compute the correlation coefficients of all of these time series with the help of .corr():
Now, what does the above tell you?
Let's focus on 'diet' and 'gym'; they are negatively correlated. That's very interesting! Remember that you have a seasonal and a trend component. From the correlation coefficient, 'diet' and 'gym' are negatively correlated. However, from looking at the time series, it looks as though their seasonal components would be positively correlated and their trends negatively correlated.
The actual correlation coefficient is actually capturing both of those.
What you want to do now is plot the first-order differences of these time series and then compute the correlation of those because that will be the correlation of the seasonal components, approximately. Remember that removing the trend may reveal correlation in seasonality.
Start off by plotting the first-order differences with the help of .diff() and .plot():
You see that 'diet' and 'gym' are incredibly correlated once you remove the trend. Now, you'll compute the correlation coefficients of the first-order differences of these time series:
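The effect of differencing on the correlation can be reproduced on synthetic data, as in the sketch below. The two series are made up: they share a January spike but have opposite trends, which is the situation the text describes for 'diet' and 'gym'.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data: opposite trends, shared January seasonality.
idx = pd.date_range("2004-01-01", periods=120, freq="MS", name="month")
rng = np.random.default_rng(0)
january = (idx.month == 1).astype(float)
df = pd.DataFrame({
    "diet": 60 - 0.3 * np.arange(120) + 20 * january + rng.normal(0, 2, 120),
    "gym": 30 + 0.3 * np.arange(120) + 15 * january + rng.normal(0, 2, 120),
}, index=idx)

# Correlation of the raw series: dominated by the opposite trends.
levels_corr = df.corr().loc["diet", "gym"]

# Correlation of the first-order differences: the trends are gone,
# so the shared seasonal pattern dominates.
diff_corr = df.diff().corr().loc["diet", "gym"]

print(f"levels: {levels_corr:.2f}, differences: {diff_corr:.2f}")
```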
Note that once again, there was a slight negative correlation when you were thinking about the trend and the seasonal component. Now, you can see that with the seasonal component, 'diet' and 'gym' are highly correlated, with a coefficient of 0.76.
Now you've taken a dive into correlation of variables and correlation of time series, it's time to plot the autocorrelation of the series: on the x-axis, you have the lag and on the y-axis, you have how correlated the time series is with itself at that lag.
So, this means that if the original time series repeats itself every two days, you would expect to see a spike in the autocorrelation function at 2 days.
Here, you'll look at the plot and what you should expect to see here is a spike in the autocorrelation function at 12 months: the time series is correlated with itself shifted by twelve months.
Use the plotting interface of pandas, which has the autocorrelation_plot() function. You can use this function to plot the autocorrelation of the 'diet' time series:
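The numbers behind such a plot can also be inspected directly. The sketch below uses the pandas `Series.autocorr` method at a few chosen lags on synthetic data (a made-up trend, January spike and noise); `pandas.plotting.autocorrelation_plot` draws the same quantity for all lags at once.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data standing in for the 'diet' series.
idx = pd.date_range("2004-01-01", periods=120, freq="MS", name="month")
rng = np.random.default_rng(0)
january = (idx.month == 1).astype(float)
diet = pd.Series(
    60 - 0.3 * np.arange(120) + 20 * january + rng.normal(0, 2, 120),
    index=idx, name="diet",
)

# Correlation of the series with itself shifted by 6, 12, 18 and 24 months.
autocorr = {lag: diet.autocorr(lag=lag) for lag in (6, 12, 18, 24)}

for lag, value in autocorr.items():
    print(f"lag {lag:2d}: {value:.2f}")
```

The values at lags 12 and 24 stand out above their neighbours because the January spikes line up again after a whole number of years.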
If you included more lags in your axes, you'd see that it is 12 months at which you have this huge peak in correlation. You have another peak at a 24 month interval, where it's also correlated with itself. You have another peak at 36, but as you move further away, there's less and less of a correlation.
Of course, you have a correlation of itself with itself at a lag of 0.
The dotted lines in the above plot actually tell you about the statistical significance of the correlation. In this case, you can say that the series is genuinely autocorrelated with a lag of twelve months.
You have identified the seasonality of this 12 month repetition!
In this tutorial, you covered a lot of ground! You checked out Google trends data of keywords 'diet', 'gym' and looked cursorily at 'finance' to see how they vary over time. You covered concepts such as seasonality, trends, correlation, autocorrelation, ...
For those eager data scientists, there are two things you could do right away:
- You could look into the 'finance' column and report what you find;
- Use ARIMA modeling to make some time series forecasts as to what these search trends will look like over the coming years. Jason Brownlee at Machine Learning Mastery has a cool tutorial on ARIMA modeling in Python, and DataCamp has a great ARIMA Modeling with R course and will also have a Python Time Series forecasting course up and running this year.