ARIMA Models for Time Series Forecasting¶
In this example, we will build ARIMA and seasonal ARIMA (SARIMA) models to forecast the future price of Bitcoin (BTC).
Import modules¶
To import BTC prices from Yahoo Finance, we will install the yfinance package using pip, the package installer for Python:
pip install yfinance
# Import libraries
import yfinance as yahooFinance
import datetime
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose
Reading and Displaying BTC Time Series Data¶
Let's read in the historical prices for BTC as a dataframe by defining start and end dates for our data pull. We’ll specifically use the 'Close' price for our forecasting models.
# Start of the date range; adjust as needed
startDate = datetime.datetime(2018, 1, 1)
# End of the date range; adjust as needed (the last row returned is the day before this date)
endDate = datetime.datetime(2020, 12, 2)
GetPrices = yahooFinance.Ticker("BTC-USD")
# Pull the price history between the start and end dates
prices = GetPrices.history(start=startDate, end=endDate)
# Format the date index as YYYY-MM-DD strings
prices.index = pd.to_datetime(prices.index).strftime('%Y-%m-%d')
prices
| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|
| 2018-01-01 | 14112.200195 | 14112.200195 | 13154.700195 | 13657.200195 | 10291200000 | 0.0 | 0.0 |
| 2018-01-02 | 13625.000000 | 15444.599609 | 13163.599609 | 14982.099609 | 16846600192 | 0.0 | 0.0 |
| 2018-01-03 | 14978.200195 | 15572.799805 | 14844.500000 | 15201.000000 | 16871900160 | 0.0 | 0.0 |
| 2018-01-04 | 15270.700195 | 15739.700195 | 14522.200195 | 15599.200195 | 21783199744 | 0.0 | 0.0 |
| 2018-01-05 | 15477.200195 | 17705.199219 | 15202.799805 | 17429.500000 | 23840899072 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-11-27 | 17153.914062 | 17445.023438 | 16526.423828 | 17108.402344 | 38886494645 | 0.0 | 0.0 |
| 2020-11-28 | 17112.933594 | 17853.939453 | 16910.652344 | 17717.414062 | 32601040734 | 0.0 | 0.0 |
| 2020-11-29 | 17719.634766 | 18283.628906 | 17559.117188 | 18177.484375 | 31133957704 | 0.0 | 0.0 |
| 2020-11-30 | 18178.322266 | 19749.263672 | 18178.322266 | 19625.835938 | 47728480399 | 0.0 | 0.0 |
| 2020-12-01 | 19633.769531 | 19845.974609 | 18321.921875 | 18802.998047 | 49633658712 | 0.0 | 0.0 |
1066 rows × 7 columns
Writing to CSV¶
We'll write our data to a CSV file so that we don't have to pull it from the API repeatedly.
# Write to CSV (by default, the file is saved in the same folder as this notebook)
prices.to_csv("prices.csv")
# Import csv
prices = pd.read_csv("prices.csv")
print(prices.head())
Date Open High Low Close \
0 2018-01-01 14112.200195 14112.200195 13154.700195 13657.200195
1 2018-01-02 13625.000000 15444.599609 13163.599609 14982.099609
2 2018-01-03 14978.200195 15572.799805 14844.500000 15201.000000
3 2018-01-04 15270.700195 15739.700195 14522.200195 15599.200195
4 2018-01-05 15477.200195 17705.199219 15202.799805 17429.500000
Volume Dividends Stock Splits
0 10291200000 0.0 0.0
1 16846600192 0.0 0.0
2 16871900160 0.0 0.0
3 21783199744 0.0 0.0
4 23840899072 0.0 0.0
Setting the Date Index¶
For our model to recognize that it's working with time series data, let's convert the 'Date' column to datetime values and set it as the dataframe index.
# Reset the index as the Date
prices.index = pd.to_datetime(prices['Date'], format='%Y-%m-%d')
del prices['Date']
prices.head()
| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|
| 2018-01-01 | 14112.200195 | 14112.200195 | 13154.700195 | 13657.200195 | 10291200000 | 0.0 | 0.0 |
| 2018-01-02 | 13625.000000 | 15444.599609 | 13163.599609 | 14982.099609 | 16846600192 | 0.0 | 0.0 |
| 2018-01-03 | 14978.200195 | 15572.799805 | 14844.500000 | 15201.000000 | 16871900160 | 0.0 | 0.0 |
| 2018-01-04 | 15270.700195 | 15739.700195 | 14522.200195 | 15599.200195 | 21783199744 | 0.0 | 0.0 |
| 2018-01-05 | 15477.200195 | 17705.199219 | 15202.799805 | 17429.500000 | 23840899072 | 0.0 | 0.0 |
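As a possible shortcut, the CSV import and the date-index step can be combined in a single `pd.read_csv` call. The sketch below is only an illustration: it reads two rows copied from the table above out of an in-memory string (`csv_text` and `sample` are names made up for this example) rather than from prices.csv.

```python
import io
import pandas as pd

# Two rows copied from the table above, standing in for prices.csv
csv_text = """Date,Close
2018-01-01,13657.200195
2018-01-02,14982.099609
"""

# index_col picks the index column and parse_dates converts it to datetimes,
# so the result already has a DatetimeIndex
sample = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)
print(type(sample.index).__name__)
print(sample)
```

With the real file, the same arguments would be passed as `pd.read_csv("prices.csv", index_col="Date", parse_dates=True)`.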
Plotting our time series¶
Let’s plot our time series data using the Seaborn and Matplotlib libraries.
sns.set()
plt.ylabel('BTC Price')
plt.xlabel('Date')
plt.xticks(rotation=45)
plt.plot(prices.index, prices['Close'])
plt.show()
Decompose Time Series¶
Let's split our time series using the statsmodels seasonal_decompose function to see how much of our data is made up of trend, seasonality, and noise.
Seasonality: describes the periodic signal in your time series.
Trend: describes whether the time series is decreasing, constant, or increasing over time.
Noise: describes what remains after seasonality and trend have been separated from the time series. In other words, it’s the variability in the data that cannot be explained by the model.
df = prices[['Close']]
# Use the additive model, since the trend and seasonal variation appear relatively constant.
# A period of 30 daily observations approximates a monthly cycle.
decompose_data = seasonal_decompose(df, model="additive", period=30)
# Draw each component on its own figure so the three lines do not overlap
decompose_data.seasonal.plot(title="Seasonal")
plt.show()
decompose_data.trend.plot(title="Trend")
plt.show()
decompose_data.resid.plot(title="Residual")
plt.show()
Train/Test Split¶
We will split our data so that everything before November 2020 serves as training data and everything from November 1, 2020 onward serves as testing data:
train = prices[prices.index < pd.to_datetime("2020-11-01", format='%Y-%m-%d')]
# Use >= so that the 2020-11-01 observation is not dropped from both sets
test = prices[prices.index >= pd.to_datetime("2020-11-01", format='%Y-%m-%d')]
plt.plot(train.index, train['Close'], color = "black")
plt.plot(test.index, test['Close'], color = "red")
plt.ylabel('BTC Price')
plt.xlabel('Date')
plt.xticks(rotation=45)
plt.title("Train/Test split for BTC Data")
plt.show()
ARIMA Model¶
The Autoregressive Integrated Moving Average (ARIMA) model does not require the input series to be stationary (that is, free of trend): its integrated component differences the series until the trend is removed. It does still assume that the data exhibits little to no seasonality. The model then describes the differenced series as a combination of its own lagged values (the autoregressive part) and lagged forecast errors (the moving-average part).
# Let's define our model input
y = train['Close']
ARIMAmodel = ARIMA(y, order = (5, 4, 2))
# order = (p, d, q):
# p is the number of autoregressive lags (past values),
# d is the number of times the series is differenced (this is what makes
# non-stationary data stationary), and q is the number of moving-average
# terms (lagged forecast errors, which capture the effect of past shocks)
ARIMAmodel = ARIMAmodel.fit()
y_pred_2 = ARIMAmodel.get_forecast(len(test.index))
y_pred_df_2 = y_pred_2.conf_int(alpha = 0.05) #--alpha = 0.05 requests a 95% confidence interval around the forecast
y_pred_df_2["Predictions"] = ARIMAmodel.predict(start = y_pred_df_2.index[0], end = y_pred_df_2.index[-1])
y_pred_df_2.index = test.index
y_pred_out_2 = y_pred_df_2["Predictions"]
SARIMA Model¶
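The seasonal ARIMA (SARIMA) variant extends ARIMA with a second set of autoregressive, differencing, and moving-average terms applied at a fixed seasonal lag, so it can work with non-stationary data and also capture some seasonality.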
# Let's define our model input
y = train['Close']
SARIMAXmodel = SARIMAX(y, order = (5, 4, 2), seasonal_order=(2,2,2,12))
# The first 3 parameters of seasonal_order (P, D, Q) play the same roles as the non-seasonal (p, d, q)
# The fourth is the seasonal period: the number of observations in one seasonal cycle
# With daily data, 12 means the pattern is assumed to repeat every 12 days
SARIMAXmodel = SARIMAXmodel.fit()
y_pred_3 = SARIMAXmodel.get_forecast(len(test.index))
y_pred_df_3 = y_pred_3.conf_int(alpha = 0.05)
y_pred_df_3["Predictions"] = SARIMAXmodel.predict(start = y_pred_df_3.index[0], end = y_pred_df_3.index[-1])
y_pred_df_3.index = test.index
y_pred_out_3 = y_pred_df_3["Predictions"]
Visualize Results¶
Let's plot the results of our ARIMA models and get their RMSE values to further compare their performance.
plt.plot(y_pred_out_3, color='Blue', label = 'SARIMA Predictions')
plt.plot(y_pred_out_2, color='Yellow', label = 'ARIMA Predictions')
plt.plot(train.index, train['Close'], color = "black", label = 'Training')
plt.plot(test.index, test['Close'], color = "red", label = 'Testing')
plt.ylabel('BTC Price')
plt.xlabel('Date')
plt.xticks(rotation=45)
plt.title("ARIMA and SARIMA Forecasts for BTC Data")
plt.legend()
plt.show()
sarima_rmse = np.sqrt(mean_squared_error(test["Close"].values, y_pred_df_3["Predictions"]))
print("SARIMA RMSE: ",sarima_rmse)
arima_rmse = np.sqrt(mean_squared_error(test["Close"].values, y_pred_df_2["Predictions"]))
print("ARIMA RMSE: ",arima_rmse)
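For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so lower is better and the result is in the same units as the price. The sketch below uses three made-up values (not BTC prices) to show that the manual formula matches the scikit-learn computation used above. It also scores a hypothetical naive baseline that repeats the last known value, which can be a useful yardstick for any forecast.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration only
actual = np.array([100.0, 110.0, 120.0])
predicted = np.array([100.0, 110.0, 126.0])

# RMSE by hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((actual - predicted) ** 2))
# Same quantity via scikit-learn, as in the cells above
rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))

# Hypothetical naive baseline: repeat the last known value (assumed to be 100.0)
naive = np.full_like(actual, 100.0)
rmse_naive = np.sqrt(mean_squared_error(actual, naive))

print("Manual RMSE: ", rmse_manual)
print("sklearn RMSE:", rmse_sklearn)
print("Naive RMSE:  ", rmse_naive)
```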
Reference: For more information about ARIMA and forecasting you can refer to this article from which the above example was derived.