ARIMA Modeling and Forecasting
This video introduces ARIMA time series models and explains how to build one using Python's statsmodels package. An ARIMA model helps in forecasting data N timestamps into the future and is, thus, extremely helpful. The techniques involved in the creation and usage of such a model are explained in detail. This series is curated for intermediate and advanced users.
What You'll Learn
> Introduction to ARIMA time series and its usefulness
> Building an ARIMA time series in Python
> Autocorrelation Function plot and Partial Autocorrelation Function plot
Code, R & Python Script Repository can be accessed here.
Hi, welcome back to this Data Science Dojo video tutorial series on time series. In part one, we left it at differencing our data to make it more stationary, as this is a requirement of many time series models. In part two, we’ll take our differenced data, start modeling on it, and forecast into the future.
So, what we need to do now is look at the autocorrelation function and partial autocorrelation function plots, or ACF and PACF for short. These plots help determine the number of autoregressive terms and moving average terms in an autoregressive moving average model, or help to spot seasonality or periodic trends.
So I’ll explain what I mean by autoregressive and moving average. An autoregressive model is basically able to forecast the next timestamp’s value by regressing over the previous values, and a moving average model is able to forecast the next timestamp’s value by averaging the previous values. The autoregressive integrated moving average model, which is the one we’re going to use, is useful for non-stationary data, as it allows us to difference the data, plus it has an additional seasonal differencing parameter for seasonal non-stationary data. So first let’s produce these plots and then I’ll explain how to interpret them.
Our first plot is going to be an ACF plot, in a different style, and we’re going to produce a PACF plot as well. Okay, let’s have a look at these. The ACF and PACF plots include a 95% confidence interval band, so anything outside this shaded band here is a statistically significant correlation. If we see a significant spike at lag X in the ACF, that helps us determine the number of moving average terms, and if we see a significant spike at lag X in the PACF, that helps us determine the number of autoregressive terms. Here in the ACF plot we see a spike at about one, so that will help us determine the number of moving average terms. If we look at the PACF, we can see two major spikes, one at about five and one, I think, at about thirteen. So that will help us determine the number of AR terms.
For now we’re just going to go ahead with a model that only includes about five AR terms and see how that goes. So, now that we have looked at our ACF and PACF plots, we can build our ARIMA model, which takes into account the number of terms that we need to use. And just keep in mind this model’s also going to infer the frequency, so we need to make sure there are no gaps between our datetimes before we start modeling.
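One way to check for gaps before modeling is to compare the timestamps against a complete hourly range. The sketch below uses pandas only; the five values and their timestamps are hypothetical, and filling the gap by interpolation is an assumption, not something the video prescribes.

```python
import pandas as pd

# Hypothetical hourly timestamps with one hour (03:00) deliberately missing.
index = pd.to_datetime([
    "2019-02-06 00:00", "2019-02-06 01:00", "2019-02-06 02:00",
    "2019-02-06 04:00", "2019-02-06 05:00",
])
series = pd.Series([0.10, 0.12, 0.11, 0.15, 0.14], index=index)

# Build the full hourly range and list the timestamps that are absent.
full_range = pd.date_range(index.min(), index.max(), freq="h")
missing = full_range.difference(series.index)
print(missing)

# One possible fix: reindex onto a regular hourly grid and interpolate.
regular = series.asfreq("h").interpolate()
print(regular)
```

After asfreq the index carries an explicit hourly frequency, so the model does not have to guess it.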
So let’s call this our ARMA 1 model. I’m going to apply our ARIMA model, and we’re going to give it our data. The order is going to be our ARMA terms and differencing. So, first we’ll put in the number of AR terms here, then two rounds of differencing, and one MA term here. And I’m going to put an option here, and specify transparams as false. If you set it as true, it ensures that things are kept stationary, but you’ll see why I have to set this as false later on in the video tutorial series, when we talk about issues with our model. And we’re going to print the summary of our model, so we can get a few details of the model here. I’ll explain how to interpret the summary as well. Okay, let’s go ahead and run this.
So we’ve had a look at our autocorrelation and partial autocorrelation, and now we’ve built our model. Alright, so this shows us a summary of our model here. We probably want to look at the p-values for the coefficients of our terms here, so our AR terms and our MA terms. Looking at this is useful because if the p-value for, say, an AR or an MA coefficient is greater than 0.05, which is our significance level, a kind of cut-off mark to determine whether it’s significant or not, then we can say it’s probably not significant enough of a term to keep in the model. So looking at this, we might want to remodel and include only this AR or MA term here, as the other ones might not be necessary. But for the purpose of demonstration, let’s go ahead, and then we’ll discuss issues with our model later on.
The next step is, we want to predict the next 5 hours, or the next 5 timestamps ahead, which is our test holdout set. So I’ll comment these out so they’re not too much of a distraction. We’ll give it our model and use the predict function here. The last timestamp was basically 6:00 p.m. on the 6th of February 9, 2019, so I’m going to take the timestamps into the future from the last timestamp, which is from 7:00 p.m. to 11:00 p.m., the five timestamps ahead. I’m also going to set the type to levels, and you’ll see later on why we need to specify that. Okay, let’s run this. I’m also going to print these predictions.
Alright, so here are our forecasts, or our predictions, for the next five hours ahead. We can see it going in this sort of downward trajectory here, so it predicts that sentiment is likely to turn in a kind of bad direction. But what we need to keep in mind is that, with time series, we need to back-transform our differenced predicted values so that they line up with our de-differenced, or original, actual values. This is done automatically when predicting, since we specified type levels here: we wanted to predict on the original scale, not on the differenced scale.
Nevertheless, we’re going to demonstrate how to de-transform, say, two rounds of differences using cumulative sum, when you’ve been given the original data. The first step is to get the second round of differences back to the first round of differences, and then take that de-differenced data and get it back to the original. So it’s a two-step process. Let’s go ahead and demonstrate this. As I said, we want to get our second round of differences back to the first round, so I’ll just call this undiff 1. We take our second round of differences, and we’re going to fill in any missing values just so they don’t cause us any problems.
The next step is to get that undifferenced data back to the original. So this is undiff 2. Once again, we fill in any missing values. Okay, now we can compare these. There are going to be small differences between our original data and our undifferenced data, so we’re going to round to six places after the decimal point. Our values only come in six places after the decimal point anyway, so these aren’t big differences to care about; they’re essentially the same once we round to six places past the decimal point. So let’s have a look at this. We’ll round our original data first, to six places after the decimal point, and see if it’s equal to our undifferenced data, also rounded to six places after the decimal point. And just for our own sanity check, we can look at the first few values of the original data and compare them with the undifferenced values to see if they’re on par. Let’s have a look at this. So, it’s come back as true, as there are no real differences between them. Our undifferenced data and our original values are on par. And you can do your own sanity check here to make sure, say, the first few examples are definitely the same.
Now that we have modeled the data and made our predictions, we’ll compare our predictions against the actual values in part three.
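The two-step undifferencing and the rounded comparison described above can be sketched with pandas as follows. The variable names echo the ones spoken in the video, but the exact code is an assumption, and `original` is a hypothetical series with six decimal places.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the original hourly series (six decimals).
index = pd.date_range("2019-02-01", periods=48, freq="h")
original = pd.Series(np.sin(np.arange(48) / 5.0).round(6), index=index)

# Two rounds of differencing, as done before modeling.
diff1 = original.diff()  # first value is missing
diff2 = diff1.diff()     # first two values are missing

# Step 1: second differences back to first differences.
# Missing values are filled with 0 so they add nothing to the running sum,
# and the first valid first difference restores the starting level.
undiff1 = diff2.fillna(0).cumsum() + diff1.iloc[1]
undiff1.iloc[0] = 0.0    # no first difference exists for the first timestamp

# Step 2: first differences back to the original scale.
undiff2 = undiff1.cumsum() + original.iloc[0]

# Compare after rounding to six decimal places.
matches = bool((undiff2.round(6) == original.round(6)).all())
print(matches)

# Sanity check on the first few values, side by side.
print(pd.DataFrame({"original": original, "undifferenced": undiff2}).head())
```

Rounding matters because the cumulative sums accumulate tiny floating-point errors, so an exact equality test on the unrounded values could fail even though the series are effectively identical.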
Thanks for watching. If you found this video tutorial useful, give us a like.
Rebecca Merrett - Rebecca holds a bachelor’s degree of information and media from the University of Technology Sydney and a post graduate diploma in mathematics and statistics from the University of Southern Queensland. She has a background in technical writing for games dev and has written for tech publications.
© Copyright – Data Science Dojo