
Formulas and Mathematical Detail
================================

Seemingly Unrelated Regression (SUR/SURE)
-----------------------------------------

Seemingly unrelated regression is a system regression estimator which
jointly estimates multiple models. This allows joint hypothesis testing
of parameters across models since the parameter covariance is robust to
correlation of residuals across models. It can also lead to more precise
parameter estimates if some residuals are conditionally homoskedastic
and regressors differ across equations.

Basic Notation
~~~~~~~~~~~~~~

Assume :math:`K` series are observed for :math:`N` time periods. Denote
the stacked observations of series :math:`i` as :math:`Y_{i}` and its
corresponding regressor matrix :math:`X_{i}`. The complete model
spanning all series can be specified as

.. math::

   \begin{aligned}
   Y & =X\beta+\epsilon\\
   \left[\begin{array}{c}
   Y_{1}\\
   Y_{2}\\
   \vdots\\
   Y_{K}
   \end{array}\right] & =\left[\begin{array}{cccc}
   X_{1} & 0 & 0 & 0\\
   0 & X_{2} & 0 & 0\\
   \vdots & \vdots & \ddots & \vdots\\
   0 & 0 & 0 & X_{K}
   \end{array}\right]\left[\begin{array}{c}
   \beta_{1}\\
   \beta_{2}\\
   \vdots\\
   \beta_{K}
   \end{array}\right]+\left[\begin{array}{c}
   \epsilon_{1}\\
   \epsilon_{2}\\
   \vdots\\
   \epsilon_{K}
   \end{array}\right].\end{aligned}

The OLS estimator of :math:`\beta` is then

.. math:: \hat{\beta}=\left(X^{\prime}X\right)^{-1}X^{\prime}Y.

A GLS estimator can be similarly defined as

.. math:: \hat{\beta}_{GLS}=\left(X^{\prime}\Omega^{-1}X\right)^{-1}X^{\prime}\Omega^{-1}Y

where :math:`\Omega=\Sigma\otimes I_{N}` is the joint covariance of the
residuals. In practice :math:`\Sigma` is not known as so a feasible GLS
(FGLS) is implemented in two steps. The first uses OLS to estimate
:math:`\hat{\epsilon}` and then estimates the residual covariance as

.. math::

   \hat{\Sigma}=N^{-1}\left[\begin{array}{cccc}
   \hat{\epsilon}_{1} & \hat{\epsilon}_{2} & \ldots & \hat{\epsilon}_{N}\end{array}\right]^{\prime}\left[\begin{array}{cccc}
   \hat{\epsilon}_{1} & \hat{\epsilon}_{2} & \ldots & \hat{\epsilon}_{N}\end{array}\right].

The feasible GLS estimator is then

.. math:: \hat{\beta}_{FGLS}=\left(X^{\prime}\hat{\Omega}^{-1}X\right)^{-1}X^{\prime}\hat{\Omega}^{-1}Y

where :math:`\hat{\Omega}=\hat{\Sigma}\otimes I_{N}`.

Covariance Estimation
---------------------

**Homoskedastic Data**

When residuals are assumed to be homoskedastic, the covariance can be
consistently estimated by

.. math:: \left(X^{\prime}\Delta^{-1}X\right)^{-1}\left(X^{\prime}\Delta^{-1}\hat{\Omega}\Delta^{-1}X\right)\left(X^{\prime}\Delta^{-1}X\right)^{-1}

where :math:`\Delta` is the weighting matrix used in the parameter
estimator. For example, in OLS :math:`\Delta=I_{NK}` while in
FGLS\ :math:`\Delta=\hat{\Omega}`. The estimator supports using FGLS
with an assumption that :math:`\hat{\Sigma}` is diagonal or using a
user-specified value of :math:`\Sigma`. When the FGLS estimator is used,
this simplifies to

.. math:: \left(X^{\prime}\Delta^{-1}X\right)^{-1}

**Heteroskedastic Data**

When residuals are heteroskedastic, the covariance can be consistently
estimated by

.. math:: \left(X^{\prime}\Delta^{-1}X\right)^{-1}\hat{S}\left(X^{\prime}\Delta^{-1}X\right)^{-1}

where :math:`\hat{S}` is a covariance estimator that is clustered by
time. Define the matrix of scores as

.. math::

   \hat{g}=\left[\begin{array}{c}
   \hat{g}_{1}\\
   \hat{g}_{2}\\
   \vdots\\
   \hat{g}_{N}
   \end{array}\right]=\Delta^{-\frac{1}{2}}X\odot\left(\hat{\epsilon}\iota_{M}^{\prime}\right)

where :math:`\iota_{M}` is a vector of ones with :math:`M` elements
where :math:`M` is the number of columns in :math:`X` and :math:`\odot`
is element by element multiplication. The clustered-by-time score
covariance estimator is then

.. math:: \hat{S}=N^{-1}\sum_{i=1}^{N}\Psi_{i}^{\prime}\Psi_{i}

where

.. math:: \Psi_{i}=\sum_{j=1}^{K}\hat{g}_{ij}

is the sum of the scores across models :math:`\left(j\right)` that have
the same observation index :math:`i`. This estimator allows arbitrary
correlation of residuals with the same observation index.

**Debiased Estimators**

When the debiased flag is set, a small sample adjustment if applied so
that element :math:`ij` of :math:`\hat{\Sigma}` is scaled by

.. math:: \frac{T}{\sqrt{\left(T-P_{i}\right)\left(T-P_{j}\right)}}.

Other Statistics
~~~~~~~~~~~~~~~~

**Goodness of fit**

The reported :math:`R^{2}` is always for the data, or weighted data is
weights are used, for either OLS or GLS. The means that the reported
:math:`R^{2}` for the GLS estimator may be negative

**F statistics**

When the debiased covariance estimator is used (small sample adjustment)
the reported :math:`F` statistics use :math:`K\left(T-\bar{P}\right)`
where :math:`\bar{P}=K^{-1}\sum_{i=1}^{K}P_{i}` where :math:`P_{i}` is
the number of variables including the constant in model :math:`i`. When
models include restrictions it may be the case that the covariance is
singular. When this occurs, the :math:`F` statistic cannot be
calculated.

Memory efficient calculations
-----------------------------

The parameter estimates in the SUR model can are

.. math:: \left(X^{\prime}\Omega^{-1}X\right)^{-1}\left(X^{\prime}\Omega^{-1}Y\right)^{-1}

where

.. math::

   \Omega^{-1}=\left[\begin{array}{cccc}
   \sigma^{11}I_{T} & \sigma^{12}I_{T} & \ldots & \sigma^{1k}I_{T}\\
   \sigma^{21}I_{T} & \sigma^{22}I_{T} & \dots & \sigma^{2k}I_{T}\\
   \vdots & \vdots & \ddots & \vdots\\
   \sigma^{k2}I_{T} & \sigma^{k2}I_{T} & \dots & \sigma^{kk}I_{T}
   \end{array}\right]

where :math:`\sigma^{ij}=\left[\Sigma^{-1}\right]_{ij}`. Since :math:`X`
is block diagonal,

.. math::

   X^{\prime}\Omega^{-1}=\left[\begin{array}{cccc}
   \sigma^{11}X_{1}^{\prime} & \sigma^{12}X_{1}^{\prime} & \ldots & \sigma^{1k}X_{1}^{\prime}\\
   \sigma^{21}X_{2}^{\prime} & \sigma^{22}X_{2}^{\prime} & \dots & \sigma^{2k}X_{2}^{\prime}\\
   \vdots & \vdots & \ddots & \vdots\\
   \sigma^{k2}X_{k}^{\prime} & \sigma^{k2}X_{k}^{\prime} & \dots & \sigma^{kk}X_{k}^{\prime}
   \end{array}\right]

and

.. math::

   X^{\prime}\Omega^{-1}X=\left[\begin{array}{cccc}
   \sigma^{11}X_{1}^{\prime}X_{1} & \sigma^{12}X_{1}^{\prime}X_{2} & \ldots & \sigma^{1k}X_{1}^{\prime}X_{k}\\
   \sigma^{21}X_{2}^{\prime}X_{1} & \sigma^{22}X_{2}^{\prime}X_{2} & \dots & \sigma^{2k}X_{2}^{\prime}X_{k}\\
   \vdots & \vdots & \ddots & \vdots\\
   \sigma^{k2}X_{k}^{\prime}X_{1} & \sigma^{k2}X_{k}^{\prime}X_{2} & \dots & \sigma^{kk}X_{k}^{\prime}X_{k}
   \end{array}\right]

Similarly

.. math::

   X^{\prime}\Omega^{-1}Y=\left[\begin{array}{c}
   \sigma^{11}X_{1}^{\prime}Y_{1}+\sigma^{12}X_{1}^{\prime}Y_{2}+\ldots\sigma^{1k}X_{1}^{\prime}Y_{k}\\
   \sigma^{21}X_{2}^{\prime}Y_{1}+\sigma^{22}X_{2}^{\prime}Y_{2}+\dots+\sigma^{2k}X_{2}^{\prime}Y_{k}\\
   \vdots\\
   \sigma^{k2}X_{k}^{\prime}Y_{1}+\sigma^{k2}X_{k}^{\prime}Y_{2}+\dots+\sigma^{kk}X_{k}^{\prime}Y_{k}
   \end{array}\right]

This suggests that constructing the high dimensional :math:`\Omega^{-1}`
can be avoided and many needless multiplications can be avoided by
directly computing these two components and the solution can be found
using ``solve``.

Three Stage Least Squares (3SLS)
--------------------------------

Three-stage least squares extends SUR to the case of endogenous
variables. It is the system extension of two-stage least squares (2SLS).
Like SUR, 3SLS jointly estimates multiple models which allows joint
hypothesis testing of parameters. It can also lead to more precise
parameter estimates if some residuals are conditionally homoskedastic
and regressors differ across equations.

Basic Notation
~~~~~~~~~~~~~~

Assume :math:`K` series are observed for :math:`N` time periods. Denote
the stacked observations of series :math:`i` as :math:`Y_{i}` and its
corresponding regressor matrix :math:`X_{i}`. The complete model
spanning all series can be specified as

.. math::

   \begin{aligned}
   Y & =X\beta+\epsilon\\
   \left[\begin{array}{c}
   Y_{1}\\
   Y_{2}\\
   \vdots\\
   Y_{K}
   \end{array}\right] & =\left[\begin{array}{cccc}
   X_{1} & 0 & 0 & 0\\
   0 & X_{2} & 0 & 0\\
   \vdots & \vdots & \ddots & \vdots\\
   0 & 0 & 0 & X_{K}
   \end{array}\right]\left[\begin{array}{c}
   \beta_{1}\\
   \beta_{2}\\
   \vdots\\
   \beta_{K}
   \end{array}\right]+\left[\begin{array}{c}
   \epsilon_{1}\\
   \epsilon_{2}\\
   \vdots\\
   \epsilon_{K}
   \end{array}\right].\end{aligned}

where :math:`X_{i}` contains both exogenous and endogenous variables.
For each equation denote the available instruments as :math:`Z_{i}`. The
instrument include both exogenous regressors and excluded instrumental
variables. Define the fitted values of the exogenous variables as

.. math:: \hat{X}_{i}=Z_{i}\left(Z_{i}^{\prime}Z_{i}\right)^{-1}Z_{i}^{\prime}X_{i}.

The IV estimator of :math:`\beta` is then

.. math:: \hat{\beta}_{IV}=\left(\hat{X}^{\prime}\hat{X}\right)^{-1}\hat{X}^{\prime}Y

where

.. math::

   \hat{X}=\left[\begin{array}{cccc}
   \hat{X}_{1} & 0 & 0 & 0\\
   0 & \hat{X}_{2} & 0 & 0\\
   \vdots & \vdots & \ddots & \vdots\\
   0 & 0 & 0 & \hat{X}_{N}
   \end{array}\right].

A GLS estimator can be similarly defined as

.. math:: \hat{\beta}_{GLS}=\left(\hat{X}^{\prime}\Omega^{-1}\hat{X}\right)^{-1}\hat{X}^{\prime}\Omega^{-1}Y.

Inference is identical to the SUR replacing :math:`X` with
:math:`\hat{X}` where model residuals are estimated using :math:`X` and
:math:`Y`.

System Generalized Method of Moments (GMM)
------------------------------------------

The system GMM estimator uses a set of moment conditions of the form

.. math::

   \begin{aligned}
   E\left[g_{i}\right] & =0,i=1,\ldots,T\\
   E\left[\begin{array}{c}
   g_{1i}\\
   g_{2i}\\
   \vdots\\
   g_{Ki}
   \end{array}\right] & =0\end{aligned}

where :math:`g_{ji}=z_{ji}^{\prime}\left(y_{ji}-x_{ji}\beta\right)`
where :math:`j` is the equation index, :math:`z_{ji}` is a set of
instruments for equation :math:`j` at time period i and :math:`x_{ji}`
and :math:`y_{ji}` are similarly defined. Define

.. math:: g_{N}\left(\beta\right)=N^{-1}\sum_{i=1}^{N}g_{i}\left(\beta\right),

then the parameters are estimated to minimize

.. math:: \min_{\beta}g_{N}\left(\beta\right)^{\prime}W^{-1}g_{N}\left(\beta\right).

The solution to this minimization problem is

.. math:: \hat{\beta}=\left(X^{\prime}ZW^{-1}Z^{\prime}X\right)^{-1}\left(X^{\prime}ZW^{-1}Z^{\prime}Y\right).

:math:`W` is the moment weighting matrix and for efficiency should be
set to the covariance of the moment conditions. In practice a 2-step
estimator is required since estimators of :math:`W` depend on
:math:`\beta`. In the first step, :math:`W=N^{-1}Z^{\prime}Z` is used to
construct an initial estimate of :math:`\hat{\beta}`. This would be
efficient if residuals were homoskedastic and uncorrelated. This choice
will produce estimates that are identical to system 2SLS. Using the
initial parameter estimated, the covariance of the moment conditions is
computed used one of two forms using the estimated model residuals
:math:`\hat{\epsilon}`.

Homoskedastic Weighting
~~~~~~~~~~~~~~~~~~~~~~~

When residuals are assumed to be conditionally homoskedastic, then

.. math:: \hat{W}=N^{-1}Z^{\prime}\left(\hat{\Sigma}\otimes I_{N}\right)Z

where :math:`\hat{\Sigma}`\ is the :math:`K` by :math:`K` covariance of
the model residuals. This specification assumes that residuals are not
correlated across observation index (only across models)

Heteroskedastic Weighting
~~~~~~~~~~~~~~~~~~~~~~~~~

When residuals are heteroskedastic, the weighting matrix is estimated
from

.. math:: \hat{W}=N^{-1}\sum\hat{g}_{i}\hat{g}_{i}^{\prime}

which allows cross-sectional dependence but no dependence across
observation index.

The efficient estimates are computed using the same expression only with
:math:`\hat{W}`,

.. math:: \hat{\beta}=\left(X^{\prime}Z\hat{W}^{-1}Z^{\prime}X\right)^{-1}\left(X^{\prime}Z\hat{W}^{-1}Z^{\prime}Y\right).

Both weighting estimators can be centered which will enforce an exact
mean 0 to the moment conditions. When models are over-identified, this
can improve accuracy of 2-step estimators.

Parameter Covariance
~~~~~~~~~~~~~~~~~~~~

Covariance estimators are available for either homoskedastic or
heteroskedastic data. The only difference is in how the covariance of
the moments is computed. In either case, the parameter covariance can be
estimated by

.. math:: \widehat{Var\left(\hat{\beta}\right)}=N^{-1}\left(\frac{X^{\prime}Z}{N}\hat{W}^{-1}\frac{Z^{\prime}X}{N}\right)^{-1}\left(\frac{X^{\prime}Z}{N}\hat{W}^{-1}\hat{\Omega}\hat{W}^{-1}\frac{Z^{\prime}X}{N}\right)\left(\frac{X^{\prime}Z}{N}\hat{W}^{-1}\frac{Z^{\prime}X}{N}\right)^{-1}.

The covariance of the scores :math:`\hat{\Omega}` estimated using either
the homoskedastic weighting formula or the heteroskedastic weighting
formula immediately above. This allow the choice of weighting matrix to
be set separately from the score covariance estimator, although in most
cases these should be the same, and so the covariance of the estimated
parameters will simplify to

.. math:: \widehat{Var\left(\hat{\beta}\right)}=N^{-1}\left(\frac{X^{\prime}Z}{N}\hat{W}^{-1}\frac{Z^{\prime}X}{N}\right)^{-1}.
