GitHub - ManuelPonzi/triplets-trading: Project exploring the possibility of building a statistical arbitrage strategy based on pricing inefficiencies between three S&P 500 ETFs. The approach combines Johansen cointegration analysis to define mean-reverting spreads, and ARMA forecasting to generate trading signals.

1. Introduction

Statistical arbitrage strategies often seek to exploit temporary mispricings between assets that, over time, tend to exhibit stable relationships. Triplets Trading, a variation of traditional pairs trading, extends this approach by analysing the joint dynamics of three assets. The core idea is to build a linear combination of their prices to create a synthetic spread expected to display mean-reverting behavior. Trading decisions are based on the assumption that deviations from this equilibrium will eventually correct, offering opportunities for profit.

In this project, the analysis focuses on three exchange-traded funds: the SPDR S&P 500 ETF Trust (SPY), the iShares Core S&P 500 ETF (IVV), and the Vanguard S&P 500 ETF (VOO), all of which track the S&P 500 index. Despite having the same underlying benchmark, these ETFs are issued by different providers and can exhibit slight pricing discrepancies due to differences in expense ratios, liquidity, and dividend treatment. The strategy aims to identify moments when the price divergence between these ETFs exceeds typical levels, signaling potential opportunities for statistical arbitrage.

The project is structured as follows:

Time period selection and preliminary data analysis;
Implementation of the Johansen multivariate approach to cointegration;
Implementation of the trading strategy.

2. Time period selection and preliminary data analysis

A crucial component of the model’s development is the selection of an appropriate time period for analysis. Since the statistical relationships between the considered ETFs are not constant over time, the stability of the spread, and therefore the reliability of the trading signals, is highly dependent on the timeframe chosen for model calibration.

The period considered for the analysis is June 1, 2016 - December 30, 2023. Although this period spans about 7.5 years and includes the Covid-19 pandemic outbreak, no evident structural breaks or regime shifts that would undermine the stability of the relationships among the three series were detected.

The identified timeframe was divided into a training set and a test set, with the split occurring on September 7, 2021. Data up to this date were used to estimate the models that define the trading spreads, while data after this date were used to test the performance of the triplets trading strategy out-of-sample.
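The split described above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the helper name `split_by_date` and the sample rows are made up, and treating the split date itself as part of the training set is an assumption based on the wording "up to this date".

```python
from datetime import date

# Split date taken from the text; the rows below are made-up placeholders.
SPLIT_DATE = date(2021, 9, 7)


def split_by_date(rows, split_date=SPLIT_DATE):
    """Split (date, value) rows into training (<= split_date) and test (> split_date)."""
    train = [row for row in rows if row[0] <= split_date]
    test = [row for row in rows if row[0] > split_date]
    return train, test


rows = [
    (date(2021, 9, 3), 100.0),
    (date(2021, 9, 7), 101.0),  # assumed to belong to the training set
    (date(2021, 9, 8), 100.5),
    (date(2021, 9, 9), 100.8),
]
train, test = split_by_date(rows)
print(len(train), len(test))
```

Keeping the split strictly chronological ensures that no test-set observation influences the estimated models.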

Since the cointegration analysis relies on Johansen's approach, which requires the time series to be I(1), the stationarity of the price series was assessed by inspecting the ACF and PACF plots and, more formally, by means of the Augmented Dickey-Fuller (ADF) test.
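For reference, a common form of the ADF regression for a single series $y_t$ is sketched below. The choice of deterministic terms (here only a constant $c$) and the number of lagged differences $k$ are assumptions, since the text does not state which were used:

$$ \Delta y_t = c + \gamma y_{t-1} + \sum_{i=1}^{k} \delta_i \Delta y_{t-i} + e_t $$

The null hypothesis is $\gamma = 0$ (unit root) against the alternative $\gamma < 0$ (stationarity). A series is consistent with I(1) when the null is not rejected for the levels $y_t$ but is rejected for the first differences $\Delta y_t$.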

3. Johansen Multivariate Approach to Cointegration

Given that all three series contain a unit root, the Johansen cointegration method was applied to test for the existence of stationary linear combinations among them. More precisely, the following operations were carried out:

Estimation of a Vector Autoregressive (VAR) model for the series to identify the appropriate lag length $p$, which captures the short-run dynamics among the variables;
Computation of the trace test and the maximum eigenvalue test to determine the rank of the cointegration matrix, for different specifications of the cointegration relationship;
Estimation of the Vector Error Correction Models (VECM) corresponding to the different specifications of the cointegration relationship for which the cointegration matrix has rank $r > 0$, and computation of the stationary time series resulting from the estimated cointegration relationships.

3.1 Estimation of the Autoregressive Order

The first step consists of fitting a VAR model to the time series:

$$ X_t = \Phi_0 + \sum_{i=1}^{p} \Phi_i X_{t-i} + u_t $$

Where:

$X_t \in \mathbb{R}^3$ is the vector of time series at time $t$;
$\Phi_i \in \mathbb{R}^{3 \times 3}$ are the autoregressive coefficient matrices;
$u_t \sim \text{i.i.d. } (0, \Sigma_u)$ is a white noise error term.

To select the appropriate lag length $p$, a first estimate was derived from the minimization of the following information criteria: Akaike Information Criterion (AIC), Hannan-Quinn Information Criterion (HQIC), Bayesian Information Criterion (BIC). The quality of this estimate was then assessed by studying the serial correlation of each component of the residual vector $u_t$ with the Ljung-Box test.
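As an illustration of the residual check, the Ljung-Box statistic $Q = T(T+2)\sum_{k=1}^{h} \hat{\rho}_k^2 / (T-k)$ can be computed for one residual component as sketched below. The function name, the toy residuals and the lag choice $h = 1$ are illustrative assumptions, not taken from the notebook.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic for autocorrelation up to max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [x - mean for x in residuals]
    denom = sum(c * c for c in centered)
    q = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation at lag k
        rho_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q


# Toy, strongly autocorrelated "residuals" (alternating sign), for illustration only.
toy_residuals = [1.0, -1.0] * 5
q_stat = ljung_box_q(toy_residuals, max_lag=1)
print(round(q_stat, 3))
```

Under the null of no serial correlation, $Q$ is approximately $\chi^2$ distributed with $h$ degrees of freedom (usually reduced by the number of estimated parameters when applied to model residuals). For the toy series above $Q$ is about 10.8, well above the 5% critical value of 3.841 for $\chi^2_1$, so the null would be rejected and a larger $p$ would be considered.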

3.2 Cointegration Rank and Model Specification

The Johansen cointegration test is based on the VECM representation of a VAR(p) model. The VECM formulation is given by:

$$ \Delta X_t = \Pi X_{t-1} + \sum_{i=1}^{p-1} \Gamma_i \Delta X_{t-i} + \varepsilon_t $$

Where:

$\Delta X_t$ is the differenced process;
$\Pi \in \mathbb{R}^{3 \times 3}$ is the long-run coefficient matrix;
$\Gamma_i \in \mathbb{R}^{3 \times 3}$ are short-run adjustment coefficient matrices;
$\varepsilon_t \sim \text{i.i.d. } (0, \Sigma)$ is a white noise innovation.
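For completeness, the VECM coefficients can be expressed in terms of the VAR(p) coefficient matrices $\Phi_i$. Subtracting $X_{t-1}$ from both sides of the VAR(p) equation and regrouping the lags, with deterministic terms left aside, gives:

$$ \Pi = \sum_{i=1}^{p} \Phi_i - I_3, \qquad \Gamma_i = -\sum_{j=i+1}^{p} \Phi_j, \quad i = 1, \dots, p-1 $$

Here $I_3$ denotes the $3 \times 3$ identity matrix. For example, with $p = 2$ the mapping reduces to $\Pi = \Phi_1 + \Phi_2 - I_3$ and $\Gamma_1 = -\Phi_2$.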

The long-run coefficient matrix can be decomposed as:

$$ \Pi = \alpha \beta^T $$

Where:

$\alpha \in \mathbb{R}^{3 \times r}$ contains the adjustment speed coefficients;
$\beta \in \mathbb{R}^{3 \times r}$ contains the cointegrating vectors;
$r \leq 3$ is the cointegration rank.

The term $\beta^T X_{t-1}$ gives the long-run equilibrium relations.

To estimate the cointegration rank, two tests are considered within the Johansen approach:

Trace test: a joint test where the null hypothesis is that the number of cointegrating vectors is less than or equal to $r$, against the alternative that it is more than $r$;
Maximum eigenvalue test: tests the null hypothesis that the number of cointegrating vectors is exactly $r$, against the alternative of $r + 1$.
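Both statistics are functions of the ordered eigenvalues $\hat{\lambda}_1 \geq \hat{\lambda}_2 \geq \hat{\lambda}_3$ obtained in the Johansen procedure: $\lambda_{\text{trace}}(r) = -T \sum_{i=r+1}^{3} \ln(1 - \hat{\lambda}_i)$ and $\lambda_{\max}(r) = -T \ln(1 - \hat{\lambda}_{r+1})$. The sketch below evaluates them for hypothetical eigenvalues and a hypothetical sample size; it does not estimate the eigenvalues and does not supply critical values, which are non-standard and depend on the deterministic specification.

```python
import math


def johansen_statistics(eigenvalues, n_obs):
    """Return (trace, max_eig) lists indexed by the null rank r = 0..K-1."""
    lam = sorted(eigenvalues, reverse=True)
    # One term -T * ln(1 - lambda_i) per eigenvalue, largest eigenvalue first
    log_terms = [-n_obs * math.log(1.0 - value) for value in lam]
    # Trace statistic for rank r sums the terms of the K - r smallest eigenvalues
    trace = [sum(log_terms[r:]) for r in range(len(lam))]
    # Max-eigenvalue statistic for rank r uses only the (r+1)-th eigenvalue
    max_eig = list(log_terms)
    return trace, max_eig


# Hypothetical eigenvalues and sample size, for illustration only.
trace_stats, max_eig_stats = johansen_statistics([0.5, 0.2, 0.1], n_obs=100)
for r, (tr, me) in enumerate(zip(trace_stats, max_eig_stats)):
    print(f"r = {r}: trace = {tr:.2f}, max-eig = {me:.2f}")
```

The rank is typically chosen sequentially: starting from $r = 0$, the null is tested against tabulated critical values, and the first $r$ for which it is not rejected is retained.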

Such tests were performed on a more general version of the VECM:

$$ \Delta X_t = \alpha \left( \beta^T X_{t-1} + \mu + \lambda t \right) + \sum_{i=1}^{p-1} \Gamma_i \Delta X_{t-i} + \varepsilon_t $$

Where:

$\mu \in \mathbb{R}^{r \times 1}$ is a constant term in the cointegration relation;
$\lambda \in \mathbb{R}^{r \times 1}$ is a linear trend coefficient in the cointegration relation.

The following model specifications were considered:

No intercept and no deterministic trend: $\mu = 0, \lambda = 0$;
Intercept and no deterministic trend: $\mu \neq 0, \lambda = 0$;
Intercept and linear deterministic trend: $\mu \neq 0, \lambda \neq 0$.

3.3 Spread Definition

Three possible trading spreads, $s^n_t$, $s^c_t$ and $s^{cl}_t$, were defined as follows:

$$ s^n_t := \beta_n^T X_t $$

$$ s^c_t := \beta_c^T X_t + \mu_c $$

$$ s^{cl}_t := \beta_{cl}^T X_t + \mu_{cl} + \lambda_{cl} t $$
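Once a cointegrating vector and the deterministic terms are available, each spread is a plain linear combination of the three prices. The sketch below evaluates the most general form, $\beta^T X_t + \mu + \lambda t$, and recovers the other two by setting $\lambda = 0$ or $\mu = \lambda = 0$. The function name, the coefficient values and the prices are hypothetical, and starting the time index at 0 on the first row is an assumption.

```python
def spread(prices, beta, mu=0.0, lam=0.0, t0=0):
    """Compute s_t = beta' X_t + mu + lam * t for each row X_t of prices."""
    return [
        sum(b * x for b, x in zip(beta, row)) + mu + lam * (t0 + t)
        for t, row in enumerate(prices)
    ]


# Made-up prices (one row per day, one column per ETF) and made-up coefficients.
prices = [[10.0, 8.0, 12.0], [11.0, 8.0, 12.0]]
beta = (1.0, -0.5, -0.5)

s_n = spread(prices, beta)                    # no intercept, no trend
s_c = spread(prices, beta, mu=0.5)            # intercept only
s_cl = spread(prices, beta, mu=0.5, lam=0.1)  # intercept and linear trend
print(s_n, s_c, s_cl)
```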

To retrieve the parameters needed, the VECM models corresponding to these spreads were estimated. More precisely:

For the spread $s^n_t$, the estimated VECM model is:

$$ \Delta X_t = \alpha_n \beta_n^T X_{t-1} + \sum_{i=1}^{p-1} \Gamma_i^n \Delta X_{t-i} + \varepsilon_t^n $$

For the spread $s^c_t$, the estimated VECM model is:

$$ \Delta X_t = \alpha_c \left( \beta_c^T X_{t-1} + \mu_c \right) + \sum_{i=1}^{p-1} \Gamma_i^c \Delta X_{t-i} + \varepsilon_t^c $$

For the spread $s_t^{cl}$, the estimated VECM model is:

$$ \Delta X_t = \alpha_{cl} \left( \beta_{cl}^T X_{t-1} + \mu_{cl} + \lambda_{cl}(t-1) \right) + \sum_{i=1}^{p-1} \Gamma_i^{cl} \Delta X_{t-i} + \varepsilon_t^{cl} $$

The stationarity of the retrieved spreads was assessed via the ADF test.

4. Implementation of the trading strategy

4.1 Strategy Design Based on a Stationary Spread

Having defined a stationary spread, the trading strategy is built upon the following principle: open a position on the three ETFs when a trading signal is detected, that is, when the observed value of the spread deviates significantly from its expected value (typically zero), and close the position when the spread reverts to equilibrium (i.e., approaches zero).

To define trading signals rigorously, an ARMA process is fitted to the spread computed on the training set. The general form of the $\text{ARMA}(p, q)$ process used is:

$$ s_t = \mu + \sum_{i=1}^p \phi_i s_{t-i} + \varepsilon_t + \sum_{j=1}^q \theta_j \varepsilon_{t-j} $$

Where:

$s_t$ is the spread at time $t$;
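$\mu$ is a constant term (for a stationary process, the unconditional mean of the spread is $\mu / (1 - \sum_{i=1}^p \phi_i)$);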
$\phi_i$ and $\theta_j$ are AR and MA coefficients;
$\varepsilon_t$ is a white noise error term.

For each time step in the test set, the observed spread is compared to the one-step-ahead forecast generated by the fitted ARMA model. A trading signal is triggered when the absolute difference between the forecasted and observed spread exceeds the forecast's standard deviation, indicating a statistically significant deviation from equilibrium. In general, the strength of the trading signal can be tuned by replacing the standard deviation of the forecast $\sigma_t$ with $\eta\sigma_t$, where $\eta$ is a scalar parameter.
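The signal rule can be sketched with a deliberately simplified model. The example below swaps the general $\text{ARMA}(p, q)$ for an AR(1) process with fixed, hypothetical parameters, and treats the forecast standard deviation $\sigma_t$ as a constant $\sigma$; both simplifications, as well as the function name and the sample values, are assumptions made for illustration.

```python
def ar1_signals(spread_values, mu, phi, sigma, eta=1.0):
    """Flag time steps where |observed - one-step forecast| > eta * sigma.

    Uses s_t = mu + phi * s_{t-1} + eps_t with fixed (mu, phi, sigma).
    The forecast for step t is built from the observed value at t - 1,
    so the model state follows the data while the parameters stay fixed.
    """
    signals = []
    for t in range(1, len(spread_values)):
        forecast = mu + phi * spread_values[t - 1]
        deviation = spread_values[t] - forecast
        signals.append(abs(deviation) > eta * sigma)
    return signals


# Made-up spread observations and hypothetical fitted parameters.
observed = [0.0, 0.1, 0.5, 0.2]
print(ar1_signals(observed, mu=0.0, phi=0.5, sigma=0.2))           # eta = 1
print(ar1_signals(observed, mu=0.0, phi=0.5, sigma=0.2, eta=3.0))  # stricter threshold
```

With $\eta = 1$ only the jump to 0.5 is flagged; raising $\eta$ to 3 widens the band to $3\sigma$ and suppresses that signal, which shows how $\eta$ trades off signal frequency against signal strength.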

After each forecast, the ARMA model's internal state is updated with the newly observed value of the spread. Because every forecast is then made only one step ahead of the latest observation, forecast uncertainty does not accumulate over the test set. Note, however, that the model is not refit: its parameters remain fixed as estimated from the training data.

This approach ensures that the trading strategy is reactive to new information, while maintaining the stability and interpretability of the original model fit.

4.2 Testing of the trading strategy

The trading strategy was tested by examining how its performance metric, the out-of-sample return, evolves under varying assumptions about market frictions and signal strength thresholds.

Since successful triplet strategies involve an initial positive inflow of capital and no outflow upon closure, standard return metrics like linear returns $r_{0,T} = \frac{V_T}{V_0} - 1$ or log-returns $r_{0,T} = \ln\left(\frac{V_T}{V_0}\right)$ are not appropriate.

Instead, the method proposed by Gatev, Goetzmann, and Rouwenhorst (Review of Financial Studies, 2006) is adopted: if there is no open position at time $t$, the return $r_t$ is zero, while, if a position is open, the return is given by:

$$ r_t = S_t \cdot \frac{S_t - S_{t-1}}{V_t^S} - f \cdot \left| S_t - S_{t-1} \right| $$

Where:

$S_t$ is the value of the spread at time $t$;
$V_t^S$ is the market value of the spread position;
$f$ is the transaction cost parameter.

The overall return over the out-of-sample period is then calculated as:

$$ R_{0,T} = \prod_{t=1}^{T} (1 + r_t) - 1 $$
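The compounding step can be sketched as follows. The per-period returns are made-up numbers and the function name is hypothetical; periods without an open position contribute a zero return and therefore leave the product unchanged.

```python
import math


def cumulative_return(period_returns):
    """Compound per-period returns: prod(1 + r_t) - 1."""
    return math.prod(1.0 + r for r in period_returns) - 1.0


# Made-up daily returns: zero on days with no open position.
daily_returns = [0.0, 0.01, 0.0, -0.005]
total = cumulative_return(daily_returns)
print(f"{total:.5f}")
```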

This metric was computed for different values of the transaction cost parameter $f$ and of the trading signal strength $\eta$.
