Using Vector Autoregression on the SPY ETF
For this post, I took a break from intraday options trading to try equities. I’ve received several comments from readers who didn't want to spend a lot of money on data brokers, so I figured it was worthwhile to use free data from a simple source like yFinance. Ideally, everything I code and model in this post can be done at no cost.
For those unaware, yFinance is a reasonably useful and accurate free Python library that lets you pull prices and other financial information collected from the Yahoo! Finance website.
It can provide historical, technical, and fundamental data, including price data at multiple intervals from 1-minute bars up to monthly. yFinance data is often limited and imprecise, but I believe it's still a fantastic starting point for ideas.
So I decided to take a look at asset prices at market Close and generate Close-to-Close returns. The Close is the usual choice in stock analysis: markets fluctuate heavily at the Open, and a day's High or Low can't be timed at the point of purchase. Close prices are the cleanest dirty shirt of price metrics.
After a little deliberation and experimentation, I pulled the following assets:
SPY - S&P 500 ETF
/ES - E-Mini S&P 500 Future (front month)
^SPX - S&P 500 Index
^VIX - CBOE Volatility Index
^VVIX - CBOE Volatility of Volatility Index
USO - United States Oil Fund
SHY - iShares 1-3 Year Treasury Bond Fund
TLH - iShares 10-20 Year Treasury Bond ETF
GLD - SPDR Gold Shares ETF
Using the following code:
import pandas as pd
import yfinance as yf

ticker = "SPY"
# We'll use these to predict SPY daily returns.
# (Note: yFinance's ticker for the front-month E-Mini S&P 500 future is "ES=F".)
ticker_list = [ticker, "ES=F", "^SPX", "^VVIX", "^VIX", "USO", "SHY", "TLH", "GLD"]
interval = "1d"
start = "2006-01-01"
# A far-future end date is a safe way to always get the most up-to-date data.
end = "2030-01-01"
test = yf.download(ticker_list, start=start, end=end, interval=interval)["Adj Close"]
monthly = yf.download(ticker_list, start=start, end=end, interval="1mo")["Adj Close"]
test["Date"] = test.index
test["Year"] = test.index.year
test["Month"] = test.index.month
test["Day"] = test.index.day
A Look at our Data:
If we take a look at a correlation matrix of the returns, we see that there's a decent amount of interaction between the assets. SPY, /ES, and ^SPX have correlations near 1 with one another; ^VVIX and ^VIX are correlated with each other; GLD and SPY have near-zero correlation; and so on.
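For reference, that matrix can be produced straight from the return columns. A quick sketch using the temp dataframe assumed above (seaborn is optional, just handy for eyeballing clusters of related assets):
import seaborn as sns
import matplotlib.pyplot as plt
corr = temp.corr()
print(corr.round(2))
# Heatmap version of the same matrix.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()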
We like to see interesting correlations in our data, but correlation is not necessarily *predictive* of what happens next. We want better predictions of whether to buy or sell an asset than a coin toss. So what if we wanted to use one asset's prior returns to predict a different asset's returns? Could we do that?
Granger Causality:
Possibly. There's a concept from time series analysis called Granger Causality that can assist us. It doesn't establish *actual* causality, but it's good enough for prediction, and prediction is all we care about right now. In essence, given time series A and B, Granger Causality tells you whether A's past values add predictive power for B's present value beyond what B's own past values already provide.
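In regression terms (a sketch of the idea, not statsmodels' exact internals), we compare a restricted model that predicts B only from its own lags against an unrestricted model that also includes A's lags:
B(t) = a0 + a1*B(t-1) + ... + ap*B(t-p) + e(t)                                         (restricted)
B(t) = a0 + a1*B(t-1) + ... + ap*B(t-p) + c1*A(t-1) + ... + cp*A(t-p) + e(t)           (unrestricted)
If the c coefficients are jointly significant (via an F- or chi-squared test), we say A 'Granger-Causes' B.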
To explain this concept without math or figures, imagine that Person A lives in Chicago and Person B lives in Indianapolis (a city a bit southeast of Chicago), and both want to know whether it will rain on Monday, given the prior week's weather. Because weather systems tend to move from west to east, Person B can't infer much from their own recent weather alone, but they can reasonably guess that their weather will look like Chicago's weather from 1-2 days ago. Chicago's weather doesn't *cause* Indianapolis's weather, but knowing Chicago's past weather improves predictions of Indianapolis's weather with a slight time delay, while the reverse isn't true. So in this context we say Chicago 'Granger-Causes' Indianapolis, but Indianapolis does not 'Granger-Cause' Chicago.
This video also gives a good explanation of Granger Causality, if you want to look further into it.
For our purposes, we want to see whether SPY is Granger-Caused by our chosen variables (tickers: /ES, ^SPX, ^VVIX, ^VIX, USO, SHY, TLH, GLD), though we're still interested in the relations between all of these assets. We'll use the 'grangercausalitytests' function from the Python library statsmodels instead of building our own. In fact, we can run an all-vs.-all calculation of Granger causality for our chosen assets using the code below, where each number is the p-value for the null hypothesis that there is no Granger Causality:
# Use to check the Average and Minimum p-values
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

def granger_causation_matrix(data, variables, test='ssr_chi2test', verbose=False, maxlag=5):
    # Check Granger Causality between each possible pair of time series.
    # Rows are the response (Y) variables, columns are the predictors (X).
    # The values in the tables are p-values. P-values < the significance level (0.05)
    # mean the null hypothesis (the coefficients of the lagged X are 0, i.e.,
    # X does not Granger-cause Y) can be rejected.
    # data: pandas DataFrame; variables: list of time series columns in data.
    df_min = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)
    df_avg = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)
    for col in df_min.columns:
        for row in df_min.index:
            test_result = grangercausalitytests(data[[row, col]], maxlag=maxlag, verbose=False)
            p_values = [round(test_result[i+1][0][test][1], 4) for i in range(maxlag)]
            if verbose:
                print(f'Y = {row}, X = {col}, P Values = {p_values}')
            df_min.loc[row, col] = np.min(p_values)
            df_avg.loc[row, col] = np.mean(p_values)
    df_min.columns = [var + '_x' for var in variables]
    df_min.index = [var + '_y' for var in variables]
    df_avg.columns = [var + '_x' for var in variables]
    df_avg.index = [var + '_y' for var in variables]
    return df_min, df_avg
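Calling it looks something like this, assuming the temp dataframe of daily returns and the return_list column names sketched earlier:
df_min, df_avg = granger_causation_matrix(temp, variables=return_list, maxlag=5)
print(df_min.round(4))  # best (minimum) p-value across lags 1-5
print(df_avg.round(4))  # average p-value across lags 1-5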
If we plug our training data into the Granger Causality test matrix function, we get the following:
Since we only want to try trading SPY, we'll just worry about the first row. We see that SPY_1p_y (the SPY ETF's 1-period return, in this case a day) has excellent minimum p-values for Granger Causality from every variable besides SHY and TLH. Within 5 lags (a week's worth of daily returns), most of our variables indeed show enough evidence to suggest Granger Causality at the 5%, or even 1%, level.
Vector Autoregression:
We have multiple assets that could have Granger causality to SPY. So we want to regress each asset's returns on its own prior returns, as well as on the prior returns of every other asset in our list. In a two-variable VAR(1) model (the setup shown in the picture I stole from a Medium article), we regress X and Y on their shared past values:
X(t) = b11*X(t-1) + b12*Y(t-1) + u(t)
Y(t) = b21*X(t-1) + b22*Y(t-1) + v(t)
Here b11, b12, b21, and b22 are our coefficients, and u and v are 'white noise' error terms.
Increasing the number of lags (t-i) and the number of variables (X, Y, Z, ...) quickly makes these equations unwieldy to write out, so we condense the notation into vectors, hence 'Vector Autoregression'.
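In that condensed form, a VAR(p) over all of our return series looks like:
z(t) = c + A1*z(t-1) + A2*z(t-2) + ... + Ap*z(t-p) + e(t)
where z(t) stacks every asset's return at time t into one vector, c is a vector of constants, each Ai is a matrix of coefficients, and e(t) is a vector of white noise.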
So I ran a vector autoregression on our data from around 2006 to now, using returns at the daily level, and split the data so that the last 20% of trading days formed the test set. As with the Granger Causality tests, the statsmodels Python library has a decent VAR class we can use instead of building our own.
# VAR training code:
from statsmodels.tsa.api import VAR

test_size = int(len(temp) * 0.20)
print(test_size)
# Train on everything except the last 20% of trading days; test on that last 20%.
var_data_train = temp[:-test_size]
var_data_test = temp[-test_size:]
var_data_train = var_data_train[return_list
                                #+exog
                                ]
var_data_test = var_data_test[return_list
                              #+exog
                              ]
model = VAR(var_data_train[return_list],
            #exog=var_data_train[exog]
            )
# Let each information criterion suggest a lag order, and keep the smallest positive one.
order_type = None
order_size = 1000
for i in ['aic', 'bic', 'hqic']:
    # maxlags is the largest number of lags we want to test;
    # ic is the information criterion (method) used to suggest an order.
    results = model.fit(maxlags=5, ic=i)
    order = results.k_ar
    print(f"The suggested VAR order from {i} is {order}")
    if (order < order_size) and (order > 0):
        order_size = order
        order_type = i
print("Chose: ", order_type, order_size)
model = VAR(var_data_train[return_list],
            #exog=var_data_train[exog]
            )
model_fit = model.fit(maxlags=order_size, ic=order_type, trend='c')
print(model_fit.summary())
# Check the coefficient values for the SPY return equation.
print("Coefficient Values:")
print(model_fit.params[f"f_{ticker_list[0]}_1p"].to_string())
# To test for the absence of significant residual autocorrelation,
# use the test_whiteness method of VARResults.
test_corr = model_fit.test_whiteness(nlags=order_size + 2, signif=0.05, adjusted=False)
print(f"Whiteness Method Test: {round(test_corr.pvalue, 4)}")
# VAR test and prediction code
# var_data_test kept its DatetimeIndex from the slice above, so no reindexing is needed.
# The first `order_size` predictions come out as NaN, since the shifts below
# need that many prior rows of test data.
# If we only look at the first ~3 months of trading data in the test data?
#var_data_test = var_data_test[:60]
# Build the SPY prediction by hand from the fitted coefficients: start with the
# constant, then add each lag coefficient times the matching shifted return column.
var_data_test[f"{ticker}_pred"] = model_fit.params[f"f_{ticker}_1p"].loc["const"]
for i in return_list:
    for num in range(1, order_size + 1):
        var_data_test[f"{ticker}_pred"] += (
            model_fit.params[f"f_{ticker}_1p"].loc[f"L{num}.{i}"]
            * var_data_test[i].shift(num)
        )
# Track the error only when the predicted direction misses the real direction.
var_data_test[f"Pred_error_{ticker}"] = np.where(
    np.sign(var_data_test[f"{ticker}_pred"]) != np.sign(var_data_test[f"f_{ticker}_1p"]),
    abs(var_data_test[f"{ticker}_pred"] - var_data_test[f"f_{ticker}_1p"]),
    0,
)
# Determining when to buy and sell (.loc avoids pandas' chained-assignment warnings).
var_data_test["B/S"] = "None"
var_data_test.loc[var_data_test[f"{ticker}_pred"] > 0, "B/S"] = "Buy"
var_data_test.loc[var_data_test[f"{ticker}_pred"] < 0, "B/S"] = "Sell"
When I first plotted the results for the final 20% of the dataset, I was disappointed. Performance was inconsistent, and over multiple timeframes it quickly showed massive losses within 2-3 years.
But I noticed something: if you only look at the first 20-40 trading days of a given test period, the model does a lot better, at least in tracking the SPY's performance.
I think this suggests that our model performs better with recent data rather than with as much data as possible. To keep recent data impactful in the regression, we need to drop older data from the model. We'll do this with a monthly sliding window for our VAR: each month, the model pushes out roughly the oldest month of data and tacks on the most recent month, as sketched below.
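Here's a sketch of that monthly refit loop. The 1000-day window matches the context-window discussion at the end of this post, but the one-step-ahead forecast call and the monthly_coefs bookkeeping are my assumptions about the mechanics:
window = 1000  # ~4 years of trading days
preds = []
monthly_coefs = {}
oos = temp.index[window:]  # out-of-sample dates
for month_start in pd.date_range(oos[0], oos[-1], freq="MS"):
    # Train on the `window` trading days just before this month.
    train = temp[temp.index < month_start].tail(window)[return_list]
    fit = VAR(train).fit(maxlags=order_size, trend='c')
    monthly_coefs[month_start] = fit.params[f"f_{ticker}_1p"]  # saved for the coefficient plots below
    # One-step-ahead forecast for each trading day of the month.
    month_days = temp.index[(temp.index >= month_start) &
                            (temp.index < month_start + pd.offsets.MonthBegin(1))]
    for day in month_days:
        lags = temp[temp.index < day].tail(fit.k_ar)[return_list].values
        preds.append((day, fit.forecast(lags, steps=1)[0][return_list.index(f"f_{ticker}_1p")]))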
And voilà! Our model is much more powerful, and much more capable of knowing when to buy and sell over a long period of time. We note that consistent outperformance truly begins around 2020, during the COVID pandemic. We see that our updated algorithm significantly outperforms Buy and Hold, as well as randomly buying and selling (as if you flipped a coin each day to decide whether to go long or short).
Digging into the model's decision-making skills:
Because the rebalancing and new data should change the coefficients, I decided to plot out the monthly coefficients for comparison.
Note that the coefficients for SPY_1p and ^SPX_1p consistently oppose one another and grow in magnitude roughly in tandem. Since we know from our correlation heat maps that SPY and ^SPX are very highly correlated, I suspect there are tiny differences between the S&P 500 index's and the SPY ETF's returns that our VAR algorithm is picking up and turning into useful signals, especially since the SPY and ^SPX coefficients explode around 2020, the same time our model starts greatly outperforming Buy-and-Hold.
Trading fees and slippage:
I set trading fees to $0 and slippage to roughly 0.0001% because we'd be trading SPY, which is incredibly liquid (a $0.01 bid-ask spread) and can probably be traded through any low-cost or no-commission broker.
Prediction Errors:
I tried tracking the error between predicted and real returns, using the earlier trick of Moving Average Regimes to flag increasing or decreasing errors. None of Squared Error (here actually the square root, since 0 <= error <= 1 and squaring would shrink already-small values), Absolute Error, or a Conditional Error [if the predicted direction does not match the real return's direction, use the error; otherwise set it to 0] added much to returns. Using raw error values or regimes tended to decrease both the number of days traded and the profits.
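For the curious, the regime filter looked something like this. This is a sketch of the moving-average trick from my earlier posts; the 5/20-day windows and the B/S_filtered column are arbitrary choices of mine:
# Moving-average regime on the conditional prediction error: only keep trades
# while recent errors aren't trending above their longer-run average.
err = var_data_test[f"Pred_error_{ticker}"]
fast = err.rolling(5).mean()
slow = err.rolling(20).mean()
ok_to_trade = fast <= slow
var_data_test["B/S_filtered"] = var_data_test["B/S"].where(ok_to_trade, "None")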
Would I trade this strategy?
As is, I wouldn't trade this model. From 2014-2018, the model did not outperform our SPY ETF benchmark. Would I be able to handle 4 solid years of underperformance and not consider changing my model? Probably not. I noticed that Whiteness Tests and Residual Normality Tests failed spectacularly for pretty much every monthly rebalance. The model was still good for picking whether to buy or sell, but I think there's a performance lift that can still be found.
Things to improve on in the future:
- Use better data: yFinance is useful, but its data could be more accurate. Though it somewhat defeats the purpose of this exercise, I'd probably rerun this using data from an actual data provider like FinancialModelingPrep, Polygon, or EODHD. If you print out a dataframe of our yFinance-provided asset prices from 2006 to now, there are multiple periods with missing data.
- Optimize the context window: I used 1000 trading days because it looked like a nice round number, and because I thought having roughly 4 years of data should give the model enough time to contain choppy, gaining, or dropping markets. However, 300, 500, 1500, or even 2000 trading days for the VAR's training data might lead to better results.
- Use a better model: Although we used VARs, we could increase the model's complexity to a VARMA, which adds Moving Averages. Additionally, we could find another variable that isn't a financial asset, like the unemployment rate, and add it as an exogenous variable to make the model a VARMAX. We could also go for less complex models and only worry about the SPY's price data itself, using an ARIMA model.
- Optimize the retraining window: I used the start of each calendar month to push the former test data into the model and retrain it. The model might do better with a 2-week, 1-week, or 1-day retraining period.
- Change factors: Currently we are only using Closing price changes. We could possibly increase performance by including Open, High, Low, or Volume data in the model.
- Switch factors: There are thousands of ETFs with 10+ years of price data. It's possible that a gold miner ETF or the Gold futures contract might give better signals than the GLD ETF.
More extensive code for the model and results can be found here.
Updates:
In case anyone was curious: I've been running the SPY/VIX/VVIX strategy since August 1st, and I hope by September 1st I'll have interesting results to share with you all.
- LayQuant