def vech(A):
    p = A.shape[0]
    if A.shape[1] != p:
        raise ValueError("A must be a square matrix.")
    vech_A = []
    for i in range(p):
        for j in range(i + 1):
            vech_A.append(A[i, j])
    return np.array(vech_A)
Documentation
Problem Setup
We will be following the paper1 for their problem, modeling, and optimization problem formulation. We implement additional optimization algorithms not considered in the original paper.
We consider observing i.i.d. data \((Y_{1},\ldots ,Y_{n})\sim \mathcal{N}(0,\Sigma^{*})\), \(\Sigma^{*}\in \mathbb{R}^{p\times p}\), \(\Sigma^{*}\succ 0\). Our goal is to estimate \(\Sigma^{*}\), and we in particular consider the high-dimensional, sparse covariance setting: \(p\gg n\), and \(\Sigma^{*}\) has many \(0\) off-diagonal entries. A naive estimator is the sample covariance \(S=\frac{1}{n}Y^{\mathsf{T}}Y\), where \(Y:=\begin{pmatrix} Y_{1}^{\mathsf{T}} \\ \vdots \\ Y_{n}^{\mathsf{T}} \end{pmatrix}\in \mathbb{R}^{n\times p}\). We can treat the sample covariance matrix as an observation and formulate estimation as a regression problem, so that we can utilize LASSO-based ideas: \[ S=\Sigma^{*}+E, \] where \(E\) is simply the difference between the sample covariance and the true covariance, which we treat as noise.
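As a quick standalone illustration of the setup (the toy sparse \(\Sigma^{*}\) below is our own choice, not from the paper), the sample covariance concentrates around \(\Sigma^{*}\) as \(n\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 100_000
# a toy sparse covariance: only one nonzero off-diagonal pair
Sigma_star = np.array([[1.0, 0.3, 0.0],
                       [0.3, 1.0, 0.0],
                       [0.0, 0.0, 2.0]])
Y = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)  # n x p data matrix
S = Y.T @ Y / n  # sample covariance (mean known to be zero)
print(np.round(S, 2))  # close to Sigma_star for large n
```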
Vectorizing matrices, and matricizing vectors
It is clear that we would like to vectorize the matrices in the above problem so that it is indeed a regression, and we can avoid using matrix norms in later optimizations. On the other hand, after we solve a LASSO-like problem, we would like to convert the vector back to a covariance matrix. This section is devoted to the related implementations.
Vectorizing a symmetric matrix
Given \(A\in\mathbb{R}^{p\times p}\) a symmetric matrix, it is clear that we only need a \(\frac{p(p+1)}{2}\)-dimensional vector to represent the matrix. This is the functionality of vech(\(A\)); the output will be a vector of the form \[ (A_{11}, A_{21}, A_{22}, \ldots, A_{p1}, \ldots, A_{pp}). \] The implementation is straightforward: we loop through the matrix in the index pattern above.
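For example, on a \(3\times 3\) symmetric matrix the ordering looks like this (a standalone sketch with a minimal vech inlined):

```python
import numpy as np

def vech(A):
    # row-wise lower triangle: (A11, A21, A22, A31, A32, A33)
    p = A.shape[0]
    return np.array([A[i, j] for i in range(p) for j in range(i + 1)])

A = np.array([[1.0, 2.0, 4.0],
              [2.0, 3.0, 5.0],
              [4.0, 5.0, 6.0]])
print(vech(A))  # [1. 2. 3. 4. 5. 6.]
```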
Matricizing a vector
Inversely, we can map a \(\frac{p(p+1)}{2}\)-dimensional vector back to a symmetric matrix. The implementation simply reverses the loop above.
def vech_inverse(a):
    n = a.shape[0]
    p = int((np.sqrt(1 + 8 * n) - 1) / 2)
    if p * (p + 1) // 2 != n:
        raise ValueError("Invalid dimension.")
    A = np.zeros((p, p))
    i = 0
    for j in range(p):
        for k in range(j + 1):
            A[j, k] = a[i]
            if j != k:
                A[k, j] = a[i]
            i += 1
    return A
Duplication matrix
Acknowledgement: ChatGPT wrote this function.
If we have a general matrix \(A\), we would need a \(p^{2}\)-dimensional vector to represent it, and such a representation is denoted by vec(\(A\)). The ordering of elements is typically defined as the concatenation of columns, i.e. \[ (A_{11}, A_{21}, \ldots, A_{p1}, A_{12}, \ldots, A_{1p}, \ldots, A_{pp}). \] Although in our formulation we do not encounter any non-symmetric matrices, some expressions, such as the covariance of vech(\(S\)), require vec(\(S\)). It is clear that there is a linear relationship between vec\((A)\) and vech\((A)\), and the duplication matrix encodes this linear relationship, i.e. \[ \text{vec}(A)=D_{p}\text{vech}(A). \] For each dimension the matrix is fixed. Here we explicitly fill the entries ourselves, as our ordering of vech\((A)\) is nonstandard, which means existing libraries do not have such a matrix stored.
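To make the relationship concrete for \(p=2\) (a self-contained sketch; the \(4\times 3\) duplication matrix under our vech ordering is written out by hand):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 3.0]])
vech_A = np.array([1.0, 2.0, 3.0])        # (A11, A21, A22), our row-wise ordering
vec_A = A.flatten(order='F')              # column-major vec: (A11, A21, A12, A22)
# duplication matrix for p = 2 under our vech ordering
D2 = np.array([[1, 0, 0],
               [0, 1, 0],
               [0, 1, 0],
               [0, 0, 1]])
print(np.array_equal(D2 @ vech_A, vec_A))  # True
```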
def duplication_matrix(p):
    D = np.zeros((p * p, p * (p + 1) // 2))
    idx = 0
    for i in range(p):
        for j in range(i + 1):
            vec_idx1 = i + j * p
            vec_idx2 = j + i * p
            D[vec_idx1, idx] = 1
            if i != j:
                D[vec_idx2, idx] = 1
            idx += 1
    return D
Diagonal and off-diagonal indices in vector representation
After we vectorize \(S\), it will be of interest to optimize the diagonal terms and off-diagonal terms of \(\widehat{\Sigma}\) differently, as the diagonal terms of \(S\) are very good estimators of the variances, while the off-diagonal terms are rather noisy and uninformative. Therefore, we need two index sets to directly access the corresponding entries in the vectorized representation, and that is the role of vech_idx(\(p\)). Again, this is a fixed value for each dimension \(p\).
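For example, for \(p=3\) the diagonal entries \(A_{11},A_{22},A_{33}\) sit at positions \(0, 2, 5\) of the vech vector, matching the formula \(k(k+1)/2-1\) (a standalone sketch):

```python
p = 3
diag_idx = [k * (k + 1) // 2 - 1 for k in range(1, p + 1)]
off_idx = sorted(set(range(p * (p + 1) // 2)) - set(diag_idx))
print(diag_idx, off_idx)  # [0, 2, 5] [1, 3, 4]
```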
def vech_idx(p):
    all_idx = range(p * (p + 1) // 2)
    diag_idx = [k * (k + 1) // 2 - 1 for k in range(1, p + 1)]
    off_idx = sorted(set(all_idx) - set(diag_idx))
    return diag_idx, off_idx
Normalization
After defining the above vectorizing and devectorizing functions, we look back at the regression problem \[ S = \Sigma^{*}+E\iff \text{vech}(S)=\text{vech}(\Sigma^{*})+\text{vech}(E). \] We define \(s:=\text{vech}(S)\), \(\sigma^{*}:=\text{vech}(\Sigma^{*})\), and \(\epsilon :=\text{vech}(E)\), so that we have \[ s = \sigma^{*}+\epsilon. \] But here, \(\epsilon\) doesn’t look like a noise vector, as its entries are correlated due to its relationship with the sample covariance matrix. Therefore, we need to normalize it so that we can better treat it as noise. That is, we want to compute \[ V:=\text{Cov}(\epsilon)=\text{Cov}(\text{vech}(S)), \] and then we can rewrite the problem as \[ V^{-1/2}s = V^{-1/2}\sigma^{*}+V^{-1/2}\epsilon. \] \(V\) indeed has a closed form expression: \[ V = \frac{2}{n} D_p^{+} \left( \Sigma^* \otimes \Sigma^* \right) D_p^{+\top}, \] and unfortunately it is clear that we need to first estimate \(\Sigma^{*}\) to have a good estimate of \(V\). Yet since we only need \(V^{-1}\), we can roughly write: \[ V^{-1} \approx \frac{n}{2} D_p^{\top} \left( \Omega \otimes \Omega \right) D_p, \] where \(\Omega:=\left(\Sigma^{*}\right)^{-1}\), so it suffices to have a good estimator of \(\Omega\). A good result in this regime is known as CLIME2. In this section we implement CLIME by wrapping the authors’ original R implementation, and then compute \(V^{-1}\) from it.
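The two closed forms are consistent: \(\left(D_p^{+}(\Sigma\otimes\Sigma)D_p^{+\top}\right)^{-1}=D_p^{\top}(\Omega\otimes\Omega)D_p\) is a standard duplication-matrix identity for nonsingular \(\Sigma\), and it can be checked numerically (a standalone sketch with the duplication matrix inlined and a toy \(\Sigma\) of our own choosing):

```python
import numpy as np

def duplication_matrix(p):
    # maps our row-wise vech(A) to column-major vec(A)
    D = np.zeros((p * p, p * (p + 1) // 2))
    idx = 0
    for i in range(p):
        for j in range(i + 1):
            D[i + j * p, idx] = 1
            if i != j:
                D[j + i * p, idx] = 1
            idx += 1
    return D

rng = np.random.default_rng(0)
p, n = 3, 50
G = rng.standard_normal((p, p))
Sigma = G @ G.T + p * np.eye(p)            # a toy positive definite Sigma
Omega = np.linalg.inv(Sigma)
D = duplication_matrix(p)
D_plus = np.linalg.pinv(D)
V = (2 / n) * D_plus @ np.kron(Sigma, Sigma) @ D_plus.T
V_inv = (n / 2) * D.T @ np.kron(Omega, Omega) @ D
print(np.allclose(V @ V_inv, np.eye(p * (p + 1) // 2)))  # True
```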
Sample covariance matrix
Before going into the implementation of CLIME, we first consider how to calculate \(S\). Given a data matrix \(X\in\mathbb{R}^{n\times p}\), numpy already provides a function to calculate the sample covariance matrix; here we simply wrap the numpy interface under our own function name \(\textbf{sample}\textunderscore\textbf{covariance}(X)\). Note that np.cov uses the unbiased \(1/(n-1)\) scaling by default.
def sample_covariance(X):
    return np.cov(X, rowvar=False)
CLIME
Acknowledgement: ChatGPT wrote this function.
CLIME is solving the following problem: \[
\widehat{\Omega} = \arg\min_{\Omega \in \mathbb{R}^{p \times p}} \|\Omega\|_{1} \quad \text{subject to} \quad \|S \Omega - I\|_{\infty} \leq \rho.
\] We use the authors’ original implementation in R3 to solve this problem. The R package clime is thus an implicit dependency of our package that pip will not install automatically; one needs to run install.packages("clime") in an R console. We will denote the output by \(\widehat{\Omega}\).
def clime(S, rho):
    numpy2ri.activate()
    robjects.globalenv['S'] = numpy2ri.py2rpy(S)
    robjects.globalenv['lambda'] = rho
    robjects.r('''
        library(clime)
        result <- clime(S, lambda=lambda, standardize=FALSE)
    ''')
    Omega_r = robjects.r('result$Omegalist[[1]]')
    return np.array(Omega_r)
Normalizing matrix
After we compute \(\widehat{\Omega}\) as an estimator of \(\Omega\), we can obtain an estimator of \(V^{-1}\) as \[ \widehat{V}^{-1} := \frac{n}{2} D_p^{\top} \left( \widehat{\Omega} \otimes \widehat{\Omega} \right) D_p, \] where \(\otimes\) is the Kronecker product. The normalization function essentially accepts \(\widehat{\Omega}\) and performs the above computation.
def normalization(n, Omega):
    p = Omega.shape[0]
    D = duplication_matrix(p)
    V_inverse = (n / 2) * ((D.T @ np.kron(Omega, Omega)) @ D)
    return V_inverse
Optimization
We now formulate a LASSO-like problem for the regression problem we previously formulated: \[\begin{align} \min_{\boldsymbol{\sigma}} \quad & \frac{1}{2n} \left\| \widehat{V}^{-1/2} s - \widehat{V}^{-1/2} \sigma \right\|_2^2 + \lambda P(\sigma_{[o]})\\ \text{s.t.} \quad & \sigma_{[d]} = s_{[d]},\; \Sigma(\sigma_{[d]},\sigma_{[o]})\succ 0. \end{align}\] Here \(P(\cdot)\) is going to be the \(\mathcal{L}^{1}\)-norm, and \([d],[o]\) are the two outputs of vech_idx(\(p\)), one representing diagonal indices and the other off-diagonal indices. Essentially, we want to force the diagonal of the estimate to be the same as that of \(S\), while performing soft-thresholding on the off-diagonal entries to maintain sparsity. We also impose the condition that the estimate must be positive definite, as any valid covariance matrix should be. In the paper4, the authors use ADMM to solve the optimization problem: introducing \(\theta\) with the constraint \(\theta=\sigma\) and omitting the positive definiteness constraint, the objective becomes: \[\begin{align} \frac{1}{2n} \left\| \widehat{V}^{-1/2} s - \widehat{V}^{-1/2} \sigma \right\|_2^2 + \lambda \left\| \theta_{[o]} \right\|_1 + \lambda_{\infty} \left\| \theta_{[d]} - s_{[d]} \right\|_1 + \nu^\top (\sigma - \theta) + \frac{1}{2\gamma} \left\| \sigma - \theta \right\|_2^2. \end{align}\] Letting \(\eta:=\gamma \nu\), we can compute the update rule: \[\begin{align*} \sigma^{(i+1)} &= \left( \frac{1}{n} \widehat{V}^{-1} + \frac{1}{\gamma} I \right)^{-1} \left\{ \frac{1}{n} \widehat{V}^{-1} s + \frac{1}{\gamma} \left( \theta^{(i)} - \eta^{(i)} \right) \right\}, \\[1ex] \theta_{[o]}^{(i+1)} &= \mathcal{S}_{\lambda \gamma} \left( \sigma_{[o]}^{(i+1)} + \eta_{[o]}^{(i)} \right), \\[1ex] \eta^{(i+1)} &= \eta^{(i)} + \left( \sigma^{(i+1)} - \theta^{(i+1)} \right), \end{align*}\] where \(\mathcal{S}_{\lambda\gamma}\) is the soft-thresholding function with level \(\lambda\gamma\).
We also need an extra projection step to guarantee the positive definiteness of \(\Sigma(\sigma)\). That is \[ \tilde{\sigma}^{(i+1)}\in\left\{\sigma:\;\Sigma(\sigma) = \sum_{j=1}^{p} \max\{\lambda_j,\delta\} \, q_j q_j^{\top}\right\}, \] where \(\delta>0\) is some preset lower bound on the projected eigenvalues. We can thus state the complete update rule \[\begin{align} \sigma^{(i+1)} &= \left( \frac{1}{n} \widehat{V}^{-1} + \frac{1}{\gamma} I \right)^{-1} \left\{ \frac{1}{n} \widehat{V}^{-1} s + \frac{1}{\gamma} \left( \theta^{(i)} - \eta^{(i)} \right) \right\}= \text{vech}\left(\sum_{j=1}^{p} \lambda_j \, q_j q_j^{\top}\right), \\[1ex] \tilde{\sigma}^{(i+1)}&\in\left\{\sigma:\;\Sigma(\sigma) = \sum_{j=1}^{p} \max\{\lambda_j,\delta\} \, q_j q_j^{\top}\right\}, \\[1ex] \theta_{[o]}^{(i+1)} &= \mathcal{S}_{\lambda \gamma} \left( \widetilde{\sigma}_{[o]}^{(i+1)} + \eta_{[o]}^{(i)} \right), \\[1ex] \eta^{(i+1)} &= \eta^{(i)} + \left( \widetilde{\sigma}^{(i+1)} - \theta^{(i+1)} \right). \end{align}\] The first goal of this section is to implement the ADMM optimization algorithm: first implementing each update rule, then combining them into an aggregate covariance estimation function.
The second goal of this section is to implement a proximal gradient descent algorithm for solving the optimization problem. Compared to ADMM, tuning parameter selection for PGD is much more intuitive. This part is not discussed in the original paper, and we will detail the (simple) update rule later.
The ADMM approach to sparse linear modeling
Soft-thresholding
It is clear that in a LASSO formulation, soft-thresholding must be utilized, and thus must be implemented. The precise definition of soft-thresholding is \[ \mathcal{S}_{\lambda}(x) = \begin{cases} x - \lambda, & \text{if } x > \lambda \\ 0, & \text{if } |x| \leq \lambda \\ x + \lambda, & \text{if } x < -\lambda \end{cases}. \] We implement directly from this definition.
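For instance, with \(\lambda = 1.5\) (a standalone numeric sketch):

```python
import numpy as np

lam = 1.5
x = np.array([-3.0, -1.0, 0.0, 1.2, 2.0])
s = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
print(s)  # entries with |x| <= lambda are zeroed; the rest shrink toward 0 by lambda
```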
def soft_thresholding(a, lambda_):
    return np.sign(a) * np.maximum(np.abs(a) - lambda_, 0.0)
Projection onto nearest positive definite matrix
Given a general square matrix \(A\in\mathbb{R}^{p\times p}\), we project it onto the cone of positive definite matrices by first symmetrizing it: \[ \widehat{A} := \frac{1}{2}\left(A+A^{\top}\right), \] then performing the eigendecomposition: \[ \widehat{A} = \sum_{j=1}^{p}\lambda_{j}q_{j}q_{j}^{\top}, \] and finally lower bounding the eigenvalues by a positive constant \(\delta\) to generate a new positive definite matrix: \[ \text{proj}_{PD(\delta)}(A)=\sum_{j=1}^{p}\max\{\lambda_{j},\delta\}q_{j}q_{j}^{\top}. \]
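A quick self-contained check (with the projection inlined) that the output's smallest eigenvalue is floored at \(\delta\):

```python
import numpy as np

def proj_pd(A, delta=1e-6):
    A = (A + A.T) / 2                        # symmetrize
    eigvals, eigvecs = np.linalg.eigh(A)
    eigvals = np.clip(eigvals, delta, None)  # floor the eigenvalues at delta
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])                   # indefinite: eigenvalues -1 and 3
B = proj_pd(A, delta=0.1)
print(np.linalg.eigvalsh(B))                 # smallest eigenvalue is now 0.1
```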
def proj_pd(A, delta=1e-6):
    A = (A + A.T) / 2
    eigvals, eigvecs = np.linalg.eigh(A)
    eigvals_clipped = np.clip(eigvals, delta, None)
    return eigvecs @ np.diag(eigvals_clipped) @ eigvecs.T
Update rule for \(\sigma\)
As we have discussed previously, the essential step we perform for \(\sigma\) is \[\begin{align} \sigma^{(i+1)} &= \left( \frac{1}{n} \widehat{V}^{-1} + \frac{1}{\gamma} I \right)^{-1} \left\{ \frac{1}{n} \widehat{V}^{-1} s + \frac{1}{\gamma} \left( \theta^{(i)} - \eta^{(i)} \right) \right\}= \text{vech}\left(\sum_{j=1}^{p} \lambda_j \, q_j q_j^{\top}\right), \\[1ex] \tilde{\sigma}^{(i+1)}&\in\left\{\sigma:\;\Sigma(\sigma) = \sum_{j=1}^{p} \max\{\lambda_j,\delta\} \, q_j q_j^{\top}\right\}, \end{align}\] i.e. a gradient-descent-like step followed by an eigendecomposition-and-projection step. We implement them together in one update function.
def sigma_update(theta, eta, V_inverse, s, n, gamma, delta):
    d = V_inverse.shape[0]
    A = (1 / n) * V_inverse + (1 / gamma) * np.eye(d)
    b = ((1 / n) * V_inverse) @ s + (1 / gamma) * (theta - eta)
    sigma_tilde = np.linalg.solve(A, b)
    Sigma_tilde = vech_inverse(sigma_tilde)
    Sigma = proj_pd(Sigma_tilde, delta)
    return vech(Sigma)
Update rule for \(\theta\)
We recall the update rule for \(\theta\): \[\begin{align} \theta_{[o]}^{(i+1)} &= \mathcal{S}_{\lambda \gamma} \left( \widetilde{\sigma}_{[o]}^{(i+1)} + \eta_{[o]}^{(i)} \right), \end{align}\] which is a direct application of the soft-thresholding function on the off-diagonal entries.
def theta_update(sigma, eta, s, diag_idx, off_idx, lambda_, gamma):
    theta = np.zeros_like(sigma)
    # diagonal entries are pinned to those of s (the large-lambda_inf limit)
    theta[diag_idx] = s[diag_idx]
    theta[off_idx] = soft_thresholding(sigma[off_idx] + eta[off_idx], lambda_ * gamma)
    return theta
Update rule for \(\eta\)
We recall the update rule for \(\eta\): \[ \eta^{(i+1)} = \eta^{(i)} + \left( \widetilde{\sigma}^{(i+1)} - \theta^{(i+1)} \right), \] which is a simple add-and-subtract step.
def eta_update(sigma, theta, eta):
    return eta + sigma - theta
Put everything together
We now aggregate all of the above into one ADMM based covariance estimation function. The essential steps: calculate the sample covariance from the input data \(X\), calculate the normalization matrix \(V^{-1}\) using CLIME, initialize \(\sigma,\theta,\eta\), and iterate, subject to the \(\textbf{max}\textunderscore\textbf{iter}\) cap and the exit tolerance \(\textbf{tol}\).
SpLCM_ADMM
SpLCM_ADMM (X, lambda_=0.1, rho=0.1, gamma=1.0, delta=0.01, V_inverse_0=None, sigma_0=None, theta_0=None, eta_0=None, max_iter=700, tol=1e-05)
Exported source
def SpLCM_ADMM(X, lambda_=0.1, rho=0.1, gamma=1.0, delta=0.01,
               V_inverse_0=None, sigma_0=None, theta_0=None, eta_0=None,
               max_iter=700, tol=1e-5):
    n = X.shape[0]
    p = X.shape[1]
    S = sample_covariance(X)
    s = vech(S)
    diag_idx, off_idx = vech_idx(p)
    V_inverse = V_inverse_0
    if V_inverse_0 is None:
        Omega = clime(S, rho)
        V_inverse = normalization(n, Omega)
    sigma = sigma_0
    theta = theta_0
    eta = eta_0
    if sigma_0 is None:
        sigma = s.copy()
    if theta_0 is None:
        theta = s.copy()
    if eta_0 is None:
        eta = np.zeros_like(sigma)
    for i in range(max_iter):
        sigma_temp = sigma_update(theta, eta, V_inverse, s, n, gamma, delta)
        if np.linalg.norm(sigma_temp - sigma) < tol:
            break
        sigma = sigma_temp
        theta = theta_update(sigma, eta, s, diag_idx, off_idx, lambda_, gamma)
        eta = eta_update(sigma, theta, eta)
    return vech_inverse(sigma)
The PGD approach to sparse linear modeling
Given our optimization formulation, which is convex, we do not necessarily need the ADMM based approach, where parameter selection is critical, as we will show in later examples. Here we consider a simpler yet still empirically efficient approach (for \(\mathcal{O}(10^{1})\) dimensions), utilizing proximal gradient descent (PGD). We essentially have the following update rule: \[\begin{align} x^{(k+\frac{1}{4})} &= x^{(k)} - \eta \cdot \frac{1}{n} \hat{V}^{-1}(x^{(k)} - s), \\ x^{(k+\frac{1}{2})}_{[o]} &= \mathcal{S}_{\eta\lambda}\left( x^{(k+\frac{1}{4})}_{[o]} \right), \\ x^{(k+\frac{1}{2})}_{[d]} &= s_{[d]} + \mathcal{S}_{\eta\lambda_{\infty}}\left( x^{(k+\frac{1}{4})}_{[d]} - s_{[d]} \right), \\ x^{(k+1)} & = \text{Proj}_{PD(\delta)}(x^{(k+\frac{1}{2})}), \end{align}\] and we can still utilize our previously defined functions. Here, the thresholding parameters have a much more direct meaning, and we can use intuition to tune the sparsity of the estimate.
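The mechanics of PGD can be illustrated on a generic lasso-type toy objective \(\frac{1}{2}x^{\top}Qx - b^{\top}x + \lambda\Vert x\Vert_{1}\) (toy \(Q\) and \(b\) of our own choosing, without the diagonal constraint or the PD projection):

```python
import numpy as np

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -0.2])
lam = 0.3
step = 1.0 / np.max(np.linalg.eigvalsh(Q))   # step size 1/L, L = largest eigenvalue
x = np.zeros(2)
for _ in range(1000):
    g = Q @ x - b                            # gradient of the smooth part
    y = x - step * g                         # gradient step
    x = np.sign(y) * np.maximum(np.abs(y) - step * lam, 0.0)  # prox of step*lam*||.||_1
print(x)  # a sparse-leaning minimizer of the toy objective
```

The returned point satisfies the lasso optimality conditions: on the support, the smooth gradient plus \(\lambda\,\text{sign}(x)\) vanishes; off the support, the gradient is bounded by \(\lambda\).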
SpLCM_PGD
SpLCM_PGD (X, lambda_=0.1, lambda_i=1000, rho=0.1, delta=0.01, V_inverse_0=None, sigma_0=None, max_iter=700, tol=1e-05)
Exported source
def SpLCM_PGD(X, lambda_=0.1, lambda_i=1000, rho=0.1, delta=0.01,
              V_inverse_0=None, sigma_0=None, max_iter=700, tol=1e-5):
    n = X.shape[0]
    p = X.shape[1]
    S = sample_covariance(X)
    s = vech(S)
    diag_idx, off_idx = vech_idx(p)
    V_inverse = V_inverse_0
    if V_inverse_0 is None:
        Omega = clime(S, rho)
        V_inverse = normalization(n, Omega)
    V_inverse_op = np.max(np.linalg.eigvalsh(V_inverse))
    eta = n / V_inverse_op
    sigma = sigma_0
    if sigma_0 is None:
        sigma = s.copy()
    for i in range(max_iter):
        grad = (1.0 / n) * V_inverse @ (sigma - s)
        sigma_temp = sigma - eta * grad
        sigma_temp[off_idx] = soft_thresholding(sigma_temp[off_idx], lambda_)
        sigma_temp[diag_idx] = s[diag_idx] + soft_thresholding(sigma_temp[diag_idx] - s[diag_idx], lambda_i)
        Sigma_temp = vech_inverse(sigma_temp)
        Sigma_temp = proj_pd(Sigma_temp, delta)
        sigma_temp = vech(Sigma_temp)
        if np.linalg.norm(sigma_temp - sigma) < tol:
            break
        sigma = sigma_temp
    return vech_inverse(sigma)
The dual BCD approach to sparse linear modeling
In this section we consider whether it is possible to do covariance matrix estimation via a dual structure. Such an idea was explored in a previous work5. The following derivation originates from an unpublished note by Jacob Bien: we begin by zooming out from our specifically formulated vectorized regression setting, and make an observation on the general optimization setup: \[ \min_{x}\left\{f_{1}(x)+f_{2}(x)+f_{3}(x)+f_{4}(x)\right\}=\min_{x,y,z,a}\left\{f_{1}(x)+f_{2}(y)+f_{3}(z)+f_{4}(a)\;\text{s.t.}\;x=y,\;x=z,\;x=a \right\}, \] and therefore, the dual objective is \[\begin{align} g(u,v,w)&=\min_{x,y,z,a}\left\{f_{1}(x)+f_{2}(y)+f_{3}(z) + f_{4}(a) +u^{\top}(x-y)+v^{\top}(x-z)+w^{\top}(x-a)\right\}\\ &=\min_{x}\left\{f_{1}(x)+(u+v+w)^{\top}x \right\}+\min_{y}\left\{f_{2}(y)-u^{\top}y\right\}+\min_{z}\left\{f_{3}(z)-v^{\top}z\right\}+\min_{a}\left\{f_{4}(a)-w^{\top}a\right\}\\ &=-f_{1}^{*}(-u-v-w)-f_{2}^{*}(u)-f_{3}^{*}(v)-f_{4}^{*}(w), \end{align}\] where \(f_{i}^{*}\) is the Fenchel conjugate of \(f_{i}\). And so the dual problem is \[ \min_{u,v,w}\left\{f_{1}^{*}(-u-v-w)+f_{2}^{*}(u)+f_{3}^{*}(v)+f_{4}^{*}(w) \right\}. \] We now go back to our regression setting. In our problem we have: \[\begin{align} & f_{1}(\sigma)=\frac{1}{2n}\left\Vert V^{-1/2}(s-\sigma)\right\Vert_{2}^{2}=\frac{1}{2}\left\Vert s-\sigma\right\Vert_{\frac{1}{n}V^{-1}}^{2},\\ &f_{2}(\sigma)=\lambda \left\Vert \sigma_{[o]}\right\Vert_{1} = \lambda \left\Vert A\sigma\right\Vert_{1},\\ &f_{3}(\sigma)=\lambda_{\infty} \left\Vert s_{[d]} - \sigma_{[d]}\right\Vert_{1} = \lambda_{\infty} \left\Vert B\left(s - \sigma\right)\right\Vert_{1},\\ &f_{4}(\sigma)=\textbf{1}_{\infty}\left\{ \text{vech}^{-1}\left(\sigma\right) \succ 0\right\}, \end{align}\] where \(A\) and \(B\) are the boolean matrices that extract the off-diagonal and diagonal entries, respectively, i.e.:
Boolean matrices for extracting diagonal and off-diagonal indices
def boolean_matricies(p):
    I = np.identity(p * (p + 1) // 2)
    A = np.zeros_like(I)
    B = np.zeros_like(I)
    all_idx = range(p * (p + 1) // 2)
    diag_idx = [k * (k + 1) // 2 - 1 for k in range(1, p + 1)]
    off_idx = sorted(set(all_idx) - set(diag_idx))
    A[:, off_idx] = I[:, off_idx]
    B[:, diag_idx] = I[:, diag_idx]
    return A, B
So their duals are \[\begin{align} & f_{1}^{*}(-u-v-w)=\frac{1}{2}\left\Vert u+v+w\right\Vert_{nV}^{2}-\langle u,s\rangle-\langle v,s\rangle-\langle w,s\rangle,\\ & f_{2}^{*}(u)=\textbf{1}_{\infty}\left\{\exists\;p,\;A^{\top}p=u,\; \left\Vert p \right\Vert_{\infty}\leq \lambda\right\},\\ & f_{3}^{*}(v)=\langle v, s\rangle + \textbf{1}_{\infty}\left\{\exists\;q,\;B^{\top}q=v,\; \left\Vert q \right\Vert_{\infty}\leq \lambda_{\infty}\right\},\\ & f_{4}^{*}(w) = \textbf{1}_{\infty}\left\{ \text{vech}^{-1}\left(w\right) \prec 0\right\}. \end{align}\] And so we have the specific dual problem: \[\begin{align} \min_{u,v,w} \quad & \frac{1}{2}\left\Vert u+v+w\right\Vert_{nV}^{2}-\langle u,s\rangle-\langle w,s\rangle\\ \text{s.t.} \quad&\exists\;p,\;A^{\top}p=u,\; \left\Vert p \right\Vert_{\infty}\leq \lambda, \\ \quad&\exists\;q,\;B^{\top}q=v,\; \left\Vert q \right\Vert_{\infty}\leq \lambda_{\infty}, \\ \quad& \text{vech}^{-1}\left(w\right) \prec 0. \end{align}\] We first note that in the second constraint, when \(\lambda_{\infty}\) is very large, we can drop the bound \(\left\Vert q \right\Vert_{\infty}\leq \lambda_{\infty}\). We also note that the three constraints are decoupled and the main objective is differentiable; therefore this problem can be solved via BCD. In particular, we split it into three problems: \[\begin{align} \min_{u} \quad & \frac{1}{2}\left\Vert u+v+w\right\Vert_{nV}^{2}-\langle u,s\rangle\\ \text{s.t.} \quad&\exists\;p,\;A^{\top}p=u,\; \left\Vert p \right\Vert_{\infty}\leq \lambda.\\ \min_{v} \quad & \frac{1}{2}\left\Vert u+v+w\right\Vert_{nV}^{2}\\ \text{s.t.}\quad&\exists\;q,\;B^{\top}q=v.\\ \min_{w} \quad & \frac{1}{2}\left\Vert u+v+w\right\Vert_{nV}^{2}-\langle w,s\rangle\\ \text{s.t.} \quad& \text{vech}^{-1}\left(w\right) \prec 0. \end{align}\] We essentially perform convex optimization on the three problems in a cycle.
A main problem with the dual formulation is that the weighting is now \(nV\), where \(V\) is supposed to involve the inverse of the CLIME estimator, \(\hat{\Omega}^{-1}\), which is itself an estimator of \(\Sigma\). What’s worse is that the sparsity of \(\hat{\Omega}^{-1}\) is likely very poor. So even if these subproblems take nice forms, if we continue using the CLIME estimator as weighting, the empirical efficiency will likely suffer. On the bright side, we can always use some rough and sparse estimator of \(\Sigma\) to compute a rough and sparse \(V\).
We now compute the update rule. The first problem is equivalently \[\begin{align} \min_{p} \quad & \frac{1}{2}\left\Vert A^{\top}p+c\right\Vert_{nV}^{2}-\langle A^{\top}p,s\rangle\\ \text{s.t.} \quad& \left\Vert p \right\Vert_{\infty}\leq \lambda, \end{align}\] where \(c=v+w\) is a constant. Moreover, \[\begin{align} \frac{1}{2}\left\Vert A^{\top}p+c\right\Vert_{nV}^{2}-\langle A^{\top}p,s\rangle=\frac{1}{2}\left\Vert p\right\Vert_{nAVA^{\top}}^{2}-\langle p,A(s-nVc)\rangle + \frac{1}{2}\Vert c\Vert_{nV}^{2}. \end{align}\] So the update rule is \[ u^{(k+1)} = A^{\top}\text{argmin}_{\left\Vert p \right\Vert_{\infty}\leq \lambda}\left\{ \frac{1}{2}\left\Vert p\right\Vert_{nAVA^{\top}}^{2}-\left\langle p,A\left(s-nV\left(v^{(k)}+w^{(k)}\right)\right)\right\rangle\right\}, \] which amounts to solving a quadratic programming problem.
The second problem is equivalently \[\begin{align} \min_{q} \quad & \frac{1}{2}\left\Vert B^{\top}q+c\right\Vert_{nV}^{2}, \end{align}\] and since \[\begin{equation*} \frac{1}{2}\left\Vert B^{\top}q+c\right\Vert_{nV}^{2} = \frac{1}{2}\Vert q\Vert_{nBVB^{\top}}^{2}+\left\langle q, nBVc \right\rangle+\frac{1}{2}\Vert c\Vert_{nV}^{2}, \end{equation*}\] the problem reduces to \[\begin{align*} \min_{q} \quad & \frac{1}{2}\Vert q\Vert_{nBVB^{\top}}^{2}+\left\langle q, nBVc \right\rangle, \end{align*}\] which is even simpler: an unconstrained QP. We write the update rule as \[ v^{(k+1)} = B^{\top}\text{argmin}_{q}\left\{\frac{1}{2}\Vert q\Vert_{nBVB^{\top}}^{2}+\left\langle q, nBV\left( u^{(k+1)}+w^{(k)}\right) \right\rangle \right\}. \]
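The pseudoinverse-based closed form for such a singular unconstrained QP can be sanity-checked on a toy instance (matrices of our own choosing; the linear term is constructed to lie in the range of the quadratic form, mimicking \(nBVc\)):

```python
import numpy as np

rng = np.random.default_rng(1)
# a rank-deficient PSD matrix, playing the role of n B V B^T
L = rng.standard_normal((5, 2))
M = L @ L.T                          # rank 2, positive semidefinite
a = M @ rng.standard_normal(5)       # linear term in the range of M
q = -np.linalg.pinv(M) @ a           # minimizer of 0.5 q^T M q + a^T q
print(np.allclose(M @ q + a, 0.0))   # gradient vanishes at the minimizer: True
```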
The third problem is equivalently \[\begin{align} \min_{w} \quad & \frac{1}{2}\Vert w-c\Vert_{nV}^{2}+\langle w,s\rangle\\ \text{s.t.} \quad& \text{vech}^{-1}\left(w\right) \succ 0. \end{align}\] So the update rule is \[ w^{(k+1)}=-1\cdot\text{argmin}_{\text{vech}^{-1}\left(w\right) \succ 0}\left\{ \frac{1}{2}\Vert w\Vert_{nV}^{2}+\left\langle w,s-nV\left(u^{(k+1)}+v^{(k+1)}\right)\right\rangle\right\}, \] which again amounts to solving a quadratic programming problem. We now put the update rules together: \[\begin{align} &u^{(k+1)} = A^{\top}\text{argmin}_{\left\Vert p \right\Vert_{\infty}\leq \lambda}\left\{ \frac{1}{2}\left\Vert p\right\Vert_{nAVA^{\top}}^{2}-\left\langle p,A\left(s-nV\left(v^{(k)}+w^{(k)}\right)\right)\right\rangle\right\},\\ &v^{(k+1)} = B^{\top}\text{argmin}_{q}\left\{\frac{1}{2}\Vert q\Vert_{nBVB^{\top}}^{2}+\left\langle q, nBV\left( u^{(k+1)}+w^{(k)}\right) \right\rangle \right\},\\ &w^{(k+1)}=-1\cdot\text{argmin}_{\text{vech}^{-1}\left(w\right) \succ 0}\left\{ \frac{1}{2}\Vert w\Vert_{nV}^{2}+\left\langle w,s-nV\left(u^{(k+1)}+v^{(k+1)}\right)\right\rangle\right\}. \end{align}\] We now see a second major problem with the dual formulation: even though we are leveraging quadratic programming, \(AVA^{\top}\) and \(BVB^{\top}\) are only positive semidefinite, not positive definite. This means each update likely does not admit a unique solution. In practice, we can use tricks like perturbing the normalizing matrix with the identity, but this is a major theoretical problem with the formulation. We first implement a numerically stable version with the perturbation trick.
Update rule for \(u\)
We recall the update rule for \(u\) is \[\begin{equation*}
u^{(k+1)} = A^{\top}\text{argmin}_{\left\Vert p \right\Vert_{\infty}\leq \lambda}\left\{ \frac{1}{2}\left\Vert p\right\Vert_{nAVA^{\top}}^{2}-\left\langle p,A\left(s-nV\left(v^{(k)}+w^{(k)}\right)\right)\right\rangle\right\},
\end{equation*}\] We can implement this using a package such as quadprog or cvxpy with the OSQP solver.
def u_update(v, w, A, V, s, n, lambda_, epsilon=1e-8):
    AVAT = A @ V @ A.T
    Q = 0.5 * n * (AVAT + AVAT.T)
    Q += epsilon * np.eye(Q.shape[0])
    a = A @ (s - (n * V) @ (v + w))
    m = Q.shape[0]
    p = cp.Variable(m)
    objective = 0.5 * cp.quad_form(p, Q) - a @ p
    constraints = [cp.abs(p) <= lambda_]
    prob = cp.Problem(cp.Minimize(objective), constraints)
    prob.solve(solver=cp.OSQP)
    return A.T @ p.value
Update rule for \(v\)
We recall the update rule for \(v\) is \[\begin{equation*} v^{(k+1)} = B^{\top}\text{argmin}_{q}\left\{\frac{1}{2}\Vert q\Vert_{nBVB^{\top}}^{2}+\left\langle q, nBV\left( u^{(k+1)}+w^{(k)}\right) \right\rangle \right\}. \end{equation*}\] This is an unconstrained QP, so we can use the closed form solution \[\begin{equation*} v^{(k+1)}=-B^{\top}\left(BVB^{\top}\right)^{+}\left(BV\left(u^{(k+1)}+w^{(k)} \right) \right). \end{equation*}\]
def v_update(u, w, B, V, epsilon=1e-8):
    BVBT = B @ V @ B.T
    Q = 0.5 * (BVBT + BVBT.T)
    Q += epsilon * np.eye(Q.shape[0])
    a = (B @ V) @ (u + w)
    q = (-1) * np.linalg.solve(Q, a)
    return B.T @ q
Update rule for \(w\)
We recall the update rule for \(w\) is \[\begin{equation*}
w^{(k+1)}=-1\cdot\text{argmin}_{\text{vech}^{-1}\left(w\right) \succ 0}\left\{ \frac{1}{2}\Vert w\Vert_{nV}^{2}+\left\langle w,s-nV\left(u^{(k+1)}+v^{(k+1)}\right)\right\rangle\right\}.
\end{equation*}\] We note that, without the constraint, we can write the closed form solution as \[\begin{align*}
w^{(k+1)}&=-1\cdot \left(-\left(nV\right)^{+}\left(s-nV\left(u^{(k+1)}+v^{(k+1)}\right) \right)\right)\\
&=-1\cdot \left(-\left(nV\right)^{+}s+\left(u^{(k+1)}+v^{(k+1)} \right)\right).
\end{align*}\] So with the constraint, the solution is \[\begin{equation*}
w^{(k+1)}=-1\cdot\text{Proj}_{\text{vech}\left(PD(\delta)\right)}\left(-\left(nV\right)^{+}s+\left(u^{(k+1)}+v^{(k+1)} \right)\right),
\end{equation*}\] where the projection can be done utilizing the proj_pd we previously wrote.
def w_update(u, v, V, s, n, delta=1e-6):
    Q = n * V
    r = (-1) * np.linalg.solve(Q, s) + (u + v)
    R = vech_inverse(r)
    proj_R = proj_pd(R, delta)
    proj_r = vech(proj_R)
    return -1 * proj_r
Normalizing matrix under the dual formulation
We are one step away from putting the three update rules together: we haven’t given an expression for \(V\). In principle, we would use \[\begin{equation*} V=\text{Cov}(\text{vech}(S))=\frac{2}{n} D_p^{+} \left( \Sigma^{*} \otimes \Sigma^{*} \right) D_p^{+\top}, \end{equation*}\] which means \[\begin{equation*} \widehat{V}=\frac{2}{n} D_p^{+} \left( S \otimes S \right) D_p^{+\top}. \end{equation*}\] Yet in practice, this is impractical. My empirical observation is that with my unoptimized, unparallelized code, each round of QP with this exact \(V\) takes around 3 minutes, and convergence should take at least tens of rounds. So in our implementation, we will use the diagonal of \(\widehat{V}\), which accelerates the optimization. In the following implementation, we still output the full \(\widehat{V}\) for reference only. It is also of interest to explore whether there are strategies to accelerate the optimization with a dense \(V\). The user can always supply their own \(V\).
def normalization_dual(n, S):
    p = S.shape[0]
    D = duplication_matrix(p)
    D_inv = np.linalg.pinv(D)
    V = (2 / n) * ((D_inv @ np.kron(S, S)) @ D_inv.T)
    return V
Put everything together
The primal-dual relationship is \[\begin{equation*} \sigma = s - nV(u + v + w). \end{equation*}\]
SpLCM_BCD
SpLCM_BCD (X, lambda_=0.1, delta=0.01, epsilon=1e-08, V_0=None, u_0=None, v_0=None, w_0=None, max_iter=700, tol=0.0001)
Exported source
def SpLCM_BCD(X, lambda_=0.1, delta=0.01, epsilon=1e-8, V_0=None, u_0=None, v_0=None, w_0=None,
              max_iter=700, tol=0.0001):
    n = X.shape[0]
    p = X.shape[1]
    S = sample_covariance(X)
    s = vech(S)
    A, B = boolean_matricies(p)
    V = V_0
    if V_0 is None:
        V = normalization_dual(n, S)
        V = np.diag(np.diag(V))  # keep only the diagonal of V-hat for speed
    u = u_0
    v = v_0
    w = w_0
    if u_0 is None:
        u = np.zeros_like(s)
    if v_0 is None:
        v = np.zeros_like(s)
    if w_0 is None:
        w = np.zeros_like(s)
    # precompute the (perturbed) quadratic forms shared across iterations
    AVAT = A @ V @ A.T
    Q_u = 0.5 * n * (AVAT + AVAT.T)
    Q_u += epsilon * np.eye(Q_u.shape[0])
    BVBT = B @ V @ B.T
    Q_v = 0.5 * (BVBT + BVBT.T)
    Q_v += epsilon * np.eye(Q_v.shape[0])
    x = u + v + w
    for i in range(max_iter):
        # *_with_Q are variants of u_update / v_update that reuse the precomputed Q_u, Q_v
        u_temp = u_update_with_Q(v, w, A, V, s, n, Q_u, lambda_)
        v_temp = v_update_with_Q(u_temp, w, B, V, Q_v)
        w_temp = w_update(u_temp, v_temp, V, s, n, delta)
        x_temp = u_temp + v_temp + w_temp
        if np.linalg.norm(x_temp - x) < tol:
            break
        u = u_temp
        v = v_temp
        w = w_temp
        x = x_temp
    sigma = s - n * (V @ x)
    return vech_inverse(sigma)
Parameter Tuning
We now consider finding the optimal \(\lambda\) and \(\rho\) for the above algorithms. Due to runtime reasons, which will become clear in the simulation section, we give up on implementing parameter tuning for ADMM and focus on PGD. As already discussed in the paper6, the key quantity for judging parameter performance is the BIC criterion: \[ \text{BIC}(\lambda, \rho) = n\log \det\left(\widehat{\Sigma}_{\lambda,\rho}\right)+n\text{tr} \left(S\widehat{\Sigma}_{\lambda,\rho}^{-1}\right)+\log(n)\sum_{j<k}\mathbf{1}\{\widehat{\sigma}_{jk}\neq 0\}. \] It is not immediately justified that BIC is indeed the right choice, so we examine this empirically. We consider two alternative choices: \[\begin{align*} \text{AIC}(\lambda, \rho) &= n\log \det\left(\widehat{\Sigma}_{\lambda,\rho}\right)+n\text{tr} \left(S\widehat{\Sigma}_{\lambda,\rho}^{-1}\right)+2\sum_{j<k}\mathbf{1}\{\widehat{\sigma}_{jk}\neq 0\},\\ \text{EBIC}(\lambda, \rho) &= n\log \det\left(\widehat{\Sigma}_{\lambda,\rho}\right)+n\text{tr} \left(S\widehat{\Sigma}_{\lambda,\rho}^{-1}\right)+\log(n)\sum_{j<k}\mathbf{1}\{\widehat{\sigma}_{jk}\neq 0\}+2\gamma \log\binom{p(p-1)/2}{\sum_{j<k}\mathbf{1}\{\widehat{\sigma}_{jk}\neq 0\}}. \end{align*}\]
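A minimal sketch of evaluating the BIC criterion for one candidate estimate (toy inputs of our own choosing; slogdet is used for numerical stability):

```python
import numpy as np

def bic(Sigma_hat, S, n):
    # n*log det(Sigma_hat) + n*tr(S Sigma_hat^{-1}) + log(n) * #{nonzero off-diagonals, j < k}
    sign, logdet = np.linalg.slogdet(Sigma_hat)
    fit = n * logdet + n * np.trace(S @ np.linalg.inv(Sigma_hat))
    nnz = np.count_nonzero(Sigma_hat[np.triu_indices_from(Sigma_hat, k=1)])
    return fit + np.log(n) * nnz

S = np.array([[1.0, 0.2],
              [0.2, 1.5]])
Sigma_hat = np.diag(np.diag(S))  # a fully sparse candidate: off-diagonals zeroed
print(bic(Sigma_hat, S, n=100))
```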
In this section, our goal is to implement an automatic search wrapper for the PGD implementation above.
PGD with parameter searching
Parallelization
Acknowledgement: ChatGPT wrote the first version of the functions in this section.
The following are essentially PyTorch versions of our previous PGD implementation.
SpLCM_PGD_torch
SpLCM_PGD_torch (X, lambda_=0.1, lambda_i=1000, rho=0.1, delta=0.01, V_inverse_0=None, sigma_0=None, max_iter=700, tol=1e-05, device=device(type='cpu'))
Exported source
def SpLCM_PGD_torch(X, lambda_=0.1, lambda_i=1000, rho=0.1, delta=0.01,
                    V_inverse_0=None, sigma_0=None, max_iter=700, tol=1e-5,
                    device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    X_torch = to_tensor(X, device=device)
    n, p = X_torch.shape
    S = sample_covariance(X)
    S_torch = to_tensor(S, device=device)
    s = vech_torch(S_torch)
    diag_idx, off_idx = vech_idx_torch(p, device=device)
    if V_inverse_0 is None:
        Omega = clime(S, rho)
        V_inverse = normalization(n, Omega)
        V_inverse = torch.from_numpy(V_inverse).float().to(device)
    else:
        V_inverse = torch.from_numpy(V_inverse_0).float().to(device)
    V_inverse_op = torch.linalg.eigvalsh(V_inverse).max()
    eta = n / V_inverse_op
    if sigma_0 is None:
        sigma = s.clone()
    else:
        sigma = to_tensor(sigma_0, device=device)
    for i in range(max_iter):
        grad = (1.0 / n) * torch.matmul(V_inverse, (sigma - s))
        sigma_temp = sigma - eta * grad
        sigma_temp[off_idx] = soft_thresholding_torch(sigma_temp[off_idx], lambda_)
        sigma_temp[diag_idx] = s[diag_idx] + soft_thresholding_torch(sigma_temp[diag_idx] - s[diag_idx], lambda_i)
        Sigma_temp = vech_inverse_torch(sigma_temp)
        Sigma_temp = proj_pd_torch(Sigma_temp, delta)
        sigma_temp = vech_torch(Sigma_temp)
        if torch.norm(sigma_temp - sigma) < tol:
            break
        sigma = sigma_temp
    Sigma = vech_inverse_torch(sigma)
    return Sigma.cpu().numpy()

Searching with parameter evaluation
Acknowledgement: ChatGPT wrote the first version of this function.
We now automate the parameter searching for the PGD implementation.
SpLCM_PGD_tuned
SpLCM_PGD_tuned (X, lambda_grid, rho_grid, eval_func, lambda_i=1000, delta=0.01, max_iter=700, tol=1e-05, numerical_tol=0.001, gamma=0.5, early_stopping=True, eval_tol=100, device=device(type='cpu'))
Exported source
def SpLCM_PGD_tuned(X, lambda_grid, rho_grid, eval_func,
                    lambda_i=1000, delta=0.01, max_iter=700, tol=1e-5, numerical_tol=1e-3, gamma=0.5,
                    early_stopping=True, eval_tol=100,
                    device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    score_opt = None
    Sigma_opt = None
    score_traj = []
    Sigma_traj = []
    if not isinstance(lambda_grid, (list, tuple, set, np.ndarray)):
        lambda_grid = [lambda_grid]
    if not isinstance(rho_grid, (list, tuple, set, np.ndarray)):
        rho_grid = [rho_grid]
    for lambda_ in lambda_grid:
        for rho in rho_grid:
            params = dict(locals())
            params['lambda_0'] = lambda_
            params['rho_0'] = rho
            Sigma_hat, score = smart_call(eval_func, **params)
            score_traj.append(score)
            Sigma_traj.append(Sigma_hat)
            if score_opt is None:
                score_opt = score
                Sigma_opt = Sigma_hat
            elif score < score_opt:
                # early stopping: stop the first time the improvement
                # over the current best falls below eval_tol
                if early_stopping and score_opt - score <= eval_tol:
                    break
                score_opt = score
                Sigma_opt = Sigma_hat
    return Sigma_opt, score_traj, Sigma_traj

Information Criterion
BIC
We first implement the BIC evaluation for parameters as defined above.
def SpLCM_PGD_BIC_eval(X, lambda_0, rho_0, lambda_i=1000, delta=0.01, max_iter=700, tol=1e-5, numerical_tol=1e-3,
                       device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    S = sample_covariance(X)
    S = torch.from_numpy(S).float().to(device)
    n = X.shape[0]
    Sigma_hat = smart_call(SpLCM_PGD_torch_tensor, **locals())
    sign, logdet = torch.linalg.slogdet(Sigma_hat)
    Sigma_inv = torch.linalg.inv(Sigma_hat)
    preconditioning = torch.trace(S @ Sigma_inv)
    upper_mask = torch.triu(torch.ones_like(Sigma_hat), diagonal=1)
    sparsity = torch.count_nonzero(torch.abs(Sigma_hat * upper_mask) > numerical_tol)
    Sigma_hat_np = Sigma_hat.cpu().numpy()
    bic = (n * logdet + n * preconditioning + torch.log(torch.tensor(n, dtype=Sigma_hat.dtype, device=device)) * sparsity).cpu().numpy()
    return (Sigma_hat_np, bic)

We can combine this evaluation function with our tuned PGD function.
SpLCM_PGD_BIC
SpLCM_PGD_BIC (X, lambda_grid, rho_grid, lambda_i=1000, delta=0.01, max_iter=700, tol=1e-05, numerical_tol=0.001, early_stopping=True, eval_tol=100, device=device(type='cpu'))
Exported source
def SpLCM_PGD_BIC(X, lambda_grid, rho_grid,
                  lambda_i=1000, delta=0.01,
                  max_iter=700, tol=1e-5, numerical_tol=1e-3,
                  early_stopping=True, eval_tol=100,
                  device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    params = dict(locals())
    params['eval_func'] = SpLCM_PGD_BIC_eval
    Sigma_opt, score_traj, Sigma_traj = smart_call(SpLCM_PGD_tuned, **params)
    return Sigma_opt, score_traj, Sigma_traj

AIC
Acknowledgement: the implementation is produced by letting ChatGPT imitate the BIC implementation.
def SpLCM_PGD_AIC_eval(X, lambda_0, rho_0, lambda_i=1000, delta=0.01, max_iter=700, tol=1e-5, numerical_tol=1e-3,
                       device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    S = sample_covariance(X)
    S = torch.from_numpy(S).float().to(device)
    n = X.shape[0]
    Sigma_hat = smart_call(SpLCM_PGD_torch_tensor, **locals())
    sign, logdet = torch.linalg.slogdet(Sigma_hat)
    Sigma_inv = torch.linalg.inv(Sigma_hat)
    preconditioning = torch.trace(S @ Sigma_inv)
    upper_mask = torch.triu(torch.ones_like(Sigma_hat), diagonal=1)
    sparsity = torch.count_nonzero(torch.abs(Sigma_hat * upper_mask) > numerical_tol)
    Sigma_hat_np = Sigma_hat.cpu().numpy()
    aic = (n * logdet + n * preconditioning + 2 * sparsity).cpu().numpy()
    return (Sigma_hat_np, aic)

SpLCM_PGD_AIC
SpLCM_PGD_AIC (X, lambda_grid, rho_grid, lambda_i=1000, delta=0.01, max_iter=700, tol=1e-05, numerical_tol=0.001, early_stopping=True, eval_tol=100, device=device(type='cpu'))
Exported source
def SpLCM_PGD_AIC(X, lambda_grid, rho_grid,
                  lambda_i=1000, delta=0.01,
                  max_iter=700, tol=1e-5, numerical_tol=1e-3,
                  early_stopping=True, eval_tol=100,
                  device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    params = dict(locals())
    params['eval_func'] = SpLCM_PGD_AIC_eval
    Sigma_opt, score_traj, Sigma_traj = smart_call(SpLCM_PGD_tuned, **params)
    return Sigma_opt, score_traj, Sigma_traj

EBIC
Acknowledgement: the implementation is produced by letting ChatGPT imitate the BIC implementation.
def SpLCM_PGD_EBIC_eval(X, lambda_0, rho_0, lambda_i=1000, delta=0.01, max_iter=700, tol=1e-5, numerical_tol=1e-3, gamma=0.5,
                        device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    S = sample_covariance(X)
    S = torch.from_numpy(S).float().to(device)
    n = X.shape[0]
    p = S.shape[0]
    Sigma_hat = smart_call(SpLCM_PGD_torch_tensor, **locals())
    sign, logdet = torch.linalg.slogdet(Sigma_hat)
    Sigma_inv = torch.linalg.inv(Sigma_hat)
    preconditioning = torch.trace(S @ Sigma_inv)
    upper_mask = torch.triu(torch.ones_like(Sigma_hat), diagonal=1)
    sparsity = torch.count_nonzero(torch.abs(Sigma_hat * upper_mask) > numerical_tol).item()
    Sigma_hat_np = Sigma_hat.cpu().numpy()
    # comb expects an integer argument, so use integer division for p(p-1)/2
    log_comb = 0 if sparsity == 0 else math.log(comb(p * (p - 1) // 2, sparsity))
    ebic = (n * logdet + n * preconditioning + math.log(n) * sparsity + 2 * gamma * log_comb)
    return (Sigma_hat_np, ebic)

SpLCM_PGD_EBIC
SpLCM_PGD_EBIC (X, lambda_grid, rho_grid, lambda_i=1000, delta=0.01, max_iter=700, tol=1e-05, numerical_tol=0.001, gamma=0.5, early_stopping=True, eval_tol=100, device=device(type='cpu'))
Exported source
def SpLCM_PGD_EBIC(X, lambda_grid, rho_grid,
                   lambda_i=1000, delta=0.01,
                   max_iter=700, tol=1e-5, numerical_tol=1e-3, gamma=0.5,
                   early_stopping=True, eval_tol=100,
                   device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    params = dict(locals())
    params['eval_func'] = SpLCM_PGD_EBIC_eval
    Sigma_opt, score_traj, Sigma_traj = smart_call(SpLCM_PGD_tuned, **params)
    return Sigma_opt, score_traj, Sigma_traj

Tools For Empirical Comparison
We now introduce several functions to generate genuinely sparse yet structured covariance matrices, in order to compare the ADMM, PGD, and dual BCD approaches.
Generating sparse matrices and their heat maps
Acknowledgement: ChatGPT wrote the first version of these functions.
Generating random almost diagonal sparse covariance matrices
This function generates an almost diagonal positive definite matrix with a small amount of nontrivial off-diagonal correlation. The probability that an arbitrary \((i,j)\)-entry in the upper triangular part (and, by symmetry, the lower triangular part) is nonzero is controlled independently by the sparsity parameter. By default we also consider a scaled covariance matrix, i.e. the diagonal entries are normalized to \(1\).
def sparse_covariance(p, sparsity=0.05, off_diag_range=(0.4, 0.9), delta=1e-6, signed=True, scaled=True):
    Sigma = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            if np.random.rand() < sparsity:
                val = np.random.uniform(*off_diag_range)
                if signed:
                    val *= np.random.choice([-1, 1])
                Sigma[i, j] = Sigma[j, i] = val
    Sigma = proj_pd(Sigma, delta)
    if scaled is True:
        D = np.sqrt(np.diag(Sigma))
        Sigma = Sigma / np.outer(D, D)
    return Sigma

Plot sparsity of matrices using the heat map
This function gives a visual impression of the sparsity of a covariance matrix via a heat map.
def plot_matrix_heatmap(M, threshold=0.0, vmax=None, cmap="Reds"):
    if isinstance(M, dict):
        if ("result" not in M) or ("time" not in M):
            raise Exception("Invalid input")
        M = M["result"]
    if isinstance(M, tuple):
        M = M[0]
    A = np.abs(M)
    if threshold is not None:
        A[A < threshold] = 0.0
    plt.figure(figsize=(6, 5))
    im = plt.imshow(A, cmap=cmap, vmin=0, vmax=vmax)
    plt.colorbar(im, fraction=0.045)
    plt.tight_layout()
    plt.show()

Performance metrics
Normed distance
One natural choice for measuring the accuracy of an estimator is its normed distance to the true quantity. Here we consider two norms, the Frobenius norm and the operator \(\mathcal{L}^{2}\) norm; these are the choices considered in the paper7. We essentially wrap the numpy.linalg.norm function so that the two metrics are returned together.
def normed_error(Sigma_hat, Sigma):
    diff = Sigma_hat - Sigma
    frob = np.linalg.norm(diff, ord='fro')
    l2 = np.linalg.norm(diff, ord=2)
    return frob, l2

Support recovery
In the sparse matrix setting, a performance metric that is easier to interpret is the support recovery rate. In the paper8, the authors consider two rates, two sides of the same coin: \[\begin{align} \text{True positive rate} &= \frac{\#\{(i, j) : \hat{\Sigma}_{ij} \ne 0,\ \Sigma_{ij} \ne 0\}}{\#\{(i, j) : \Sigma_{ij} \ne 0\}},\\ \text{False positive rate} &= \frac{\#\{(i, j) : \hat{\Sigma}_{ij} \ne 0,\ \Sigma_{ij} = 0\}}{\#\{(i, j) : \Sigma_{ij} = 0\}}. \end{align}\] In our implementation, we remove diagonal entries from every nonzero count in the above definition, because variances are always nonzero and our estimators explicitly project the diagonal, making those entries nonzero as well. In general, this implementation detail should not affect the relative performance ordering of the estimators.
def support_recovery_error(Sigma_hat, Sigma, tol=1e-3):
    off_diag = ~np.eye(Sigma_hat.shape[0], dtype=bool)
    mask_hat = (np.abs(Sigma_hat) > tol) & off_diag
    mask = (np.abs(Sigma) > tol) & off_diag
    true_positive = np.sum(mask_hat & mask)
    false_positive = np.sum(mask_hat & ~mask)
    false_negative = np.sum(~mask_hat & mask)
    tpr = true_positive / (true_positive + false_negative)
    fpr = false_positive / np.sum(~mask & off_diag)
    return tpr, fpr

All-in-one error evaluation
We now put the previous performance metrics into one function for simplicity.
error
error (result, Sigma, est_name, tol=0.001)
Exported source
def error(result, Sigma, est_name, tol=1e-3):
    Sigma_hat = result
    time = None
    if isinstance(result, dict):
        if ("result" not in result) or ("time" not in result):
            raise Exception("Invalid input")
        Sigma_hat = result["result"]
        time = result["time"]
    if isinstance(Sigma_hat, tuple):
        Sigma_hat = Sigma_hat[0]
    tpr, fpr = support_recovery_error(Sigma_hat, Sigma, tol=tol)
    frob, l2 = normed_error(Sigma_hat, Sigma)
    row = {
        'Estimator': est_name,
        'TPR': tpr,
        'FPR': fpr,
        'Frobenius': frob,
        'L2': l2,
    }
    if time is not None:
        row['Time'] = time
    return row

Comparing multiple estimators
We implement a function that takes multiple estimators and prints their errors row by row in a pandas DataFrame.
error_table
error_table (results, Sigma, tol=0.001)
Exported source
def error_table(results, Sigma, tol=1e-3):
    rows = []
    for est_name, result in results.items():
        row = error(result, Sigma, est_name, tol)
        rows.append(row)
    df = pd.DataFrame(rows)
    return df.set_index('Estimator')

Empirical Study
In this section, our goal is to evaluate the three optimization algorithms as approaches to the sparse covariance matrix estimation problem, and to assess whether BIC is a good information criterion for parameter tuning under high dimensional settings. We generate a \(60\times 60\) covariance matrix using the sparse covariance generating function defined previously. The heat map shows the sparsity pattern and the scaled diagonal, which is exactly why we want to match the diagonal of the estimator with that of the sample covariance matrix.
np.random.seed(616)
sparse_cov = sparse_covariance(60)
plot_matrix_heatmap(sparse_cov, threshold=1e-3)
We first plot a heat map of the sample covariance matrix. It is clearly overly noisy in the off-diagonal entries.
X1 = np.random.multivariate_normal(np.zeros(60), sparse_cov, size=30)
sample_cov_1 = sample_covariance(X1)
plot_matrix_heatmap(sample_cov_1, threshold=1e-3)
Comparison of ADMM, PGD, and BCD
We first run the ADMM based algorithm for the sparse linear model, with \(\lambda=0.7\) and \(\rho=0.1\). The maximum number of iterations is 700 by default, and the convergence criterion is \(\left\Vert\widehat{\sigma}^{(k+1)}-\widehat{\sigma}^{(k)}\right\Vert_{2}<10^{-5}\). Judging from the heat map, the result is almost identical to the sample covariance matrix. What's worse, it takes over 160 seconds to finish (under our single thread implementation).
admm_low_reg = SpLCM_ADMM_timed(X1, lambda_ = 0.7)
plot_matrix_heatmap(admm_low_reg, threshold=1e-3)
SpLCM_ADMM_timed ran in 164.5997s

We now choose \(\lambda=100\) to encourage sparsity. Under the same convergence criterion, we now indeed get a rather sparse and informative estimate. But again it takes around 150 seconds to run, and it has likely not converged, being capped at 700 iterations.
admm_high_reg = SpLCM_ADMM_timed(X1, lambda_ = 100)
plot_matrix_heatmap(admm_high_reg, threshold=1e-3)

SpLCM_ADMM_timed ran in 150.2583s

We now use PGD to perform the same estimation. We first set \(\lambda=0.005\) so that, analogously to the ADMM case, the estimate stays close to the sample covariance. It indeed does, in a much shorter time of around 8 seconds.
pgd_low_reg = SpLCM_PGD_timed(X1, lambda_ = 0.005)
plot_matrix_heatmap(pgd_low_reg, threshold=1e-3)

SpLCM_PGD_timed ran in 7.9843s

We now encourage sparsity by setting \(\lambda=0.025\), and we similarly recover, in roughly 5 seconds, a result that costs ADMM around 150 seconds.
pgd_high_reg = SpLCM_PGD_timed(X1, lambda_ = 0.025)
plot_matrix_heatmap(pgd_high_reg, threshold=1e-3)

SpLCM_PGD_timed ran in 4.5279s

We now use dual BCD for the covariance estimation. We can see it is significantly slower than PGD, but also significantly faster than ADMM, and it recovers the sparsity and overall structure of the covariance matrix well, even with a compromised (diagonal) normalizing matrix.
bcd = SpLCM_BCD_timed(X1)
plot_matrix_heatmap(bcd, threshold=1e-3)

SpLCM_BCD_timed ran in 35.1180s

The above are purely observations from the plots. We now look at the exact performance of these optimization algorithms and the estimators they produce.
opt_results = {
r"ADMM with $\lambda=0.7$": admm_low_reg,
r"ADMM with $\lambda=100$": admm_high_reg,
r"PGD with $\lambda=0.005$": pgd_low_reg,
r"PGD with $\lambda=0.025$": pgd_high_reg,
r"BCD with $\lambda=0.1$": bcd
}

From the table, we can see that from a normed distance perspective, the estimator produced by ADMM with \(\lambda=100\) is the best, followed by that produced by PGD with \(\lambda=0.025\). Their difference in normed distance is not that significant, but their difference in running time is. Turning to the support recovery error, the result is not very informative, as TPR and FPR appear highly correlated, which in general should not be the case. A potential reason is that we set too low a tolerance for counting nonzero entries, which is \(0.001\) by default. Recall that in our matrix generating function, any off-diagonal nonzero entry by default only takes values in \((0.4, 0.9)\). It is possible that estimators tend to take rather uncontrolled small values in these entries.
opt_table_low_tol = error_table(opt_results, sparse_cov)
display(opt_table_low_tol)

| Estimator | TPR | FPR | Frobenius | L2 | Time |
|---|---|---|---|---|---|
| ADMM with $\lambda=0.7$ | 0.988816 | 0.997131 | 11.707416 | 5.149339 | 164.599737 |
| ADMM with $\lambda=100$ | 0.970177 | 0.982783 | 4.876153 | 1.416867 | 150.258311 |
| PGD with $\lambda=0.005$ | 0.875116 | 0.882353 | 9.258855 | 3.512117 | 7.984298 |
| PGD with $\lambda=0.025$ | 0.404473 | 0.417504 | 5.275957 | 1.601727 | 4.527934 |
| BCD with $\lambda=0.1$ | 0.678472 | 0.687231 | 6.546633 | 2.369557 | 35.117952 |
With the above intuition, we look at the table again with the tolerance raised to \(0.05\). We can now see decoupled behaviour between TPR and FPR. Estimators with little regularization, such as ADMM with \(\lambda=0.7\), still have a very good TPR but also a very bad FPR, which is not mitigated by the higher tolerance. The overall best estimator is still ADMM with \(\lambda=100\), closely followed by PGD with \(\lambda=0.025\). Again, their difference in performance is not that significant, but their difference in running time is.
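The sensitivity to the counting tolerance can be seen on a toy example (the matrix below is made up for illustration): an off-diagonal entry of \(0.01\) counts as support at tolerance \(10^{-3}\) but not at \(0.05\).

```python
import numpy as np

# a hypothetical estimate with one strong and one weak off-diagonal entry
Sigma_hat = np.array([[1.0, 0.5, 0.01],
                      [0.5, 1.0, 0.0],
                      [0.01, 0.0, 1.0]])
iu = np.triu_indices(3, k=1)
# with a small tolerance the weak 0.01 entry counts as support ...
print(int(np.sum(np.abs(Sigma_hat[iu]) > 1e-3)))  # 2
# ... while with a larger tolerance it does not
print(int(np.sum(np.abs(Sigma_hat[iu]) > 0.05)))  # 1
```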
opt_table_high_tol = error_table(opt_results, sparse_cov, tol=0.05)
display(opt_table_high_tol)

| Estimator | TPR | FPR | Frobenius | L2 | Time |
|---|---|---|---|---|---|
| ADMM with $\lambda=0.7$ | 0.898649 | 0.781751 | 11.707416 | 5.149339 | 164.599737 |
| ADMM with $\lambda=100$ | 0.567568 | 0.159063 | 4.876153 | 1.416867 | 150.258311 |
| PGD with $\lambda=0.005$ | 0.804054 | 0.630703 | 9.258855 | 3.512117 | 7.984298 |
| PGD with $\lambda=0.025$ | 0.567568 | 0.183724 | 5.275957 | 1.601727 | 4.527934 |
| BCD with $\lambda=0.1$ | 0.736486 | 0.414303 | 6.546633 | 2.369557 | 35.117952 |
Information Criteria
We now consider the tuned version of PGD. Our parameter search space fixes \(\rho=0.1\) and takes 50 values of \(\lambda\) equally spaced on a log scale over \(\left[10^{-2}, 10^{-0.5}\right]\). We first run the parameter search with exact BIC optimization, i.e. without any early stopping criterion. We can see the estimator is very sparse, with a structural pattern matching the true covariance matrix.
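The grid itself is easy to sanity-check: `np.logspace(-2, -0.5, 50)` places 50 values between \(10^{-2}=0.01\) and \(10^{-0.5}\approx 0.316\), evenly spaced on a log scale, so consecutive ratios are constant:

```python
import numpy as np

grid = np.logspace(-2, -0.5, 50)
print(round(float(grid[0]), 6), round(float(grid[-1]), 6))  # 0.01 0.316228
# on a log-spaced grid, the ratio between consecutive values is constant
ratios = grid[1:] / grid[:-1]
print(bool(np.allclose(ratios, ratios[0])))  # True
```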
pgd_bic_unstopped = SpLCM_PGD_BIC_timed(X1, np.logspace(-2, -0.5, 50), 0.1, early_stopping=False)
plot_matrix_heatmap(pgd_bic_unstopped, threshold=1e-3)

SpLCM_PGD_BIC_timed ran in 128.2309s

We now run the BIC optimization with an early stopping criterion: stop the first time the BIC score drops by less than 100. The reason for this criterion is that BIC sometimes prefers a completely diagonal estimator, and early stopping can prevent this from happening. In our case the early stopped estimator matches the exact one, plausibly because we are searching over a dense enough grid of \(\lambda\).
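The stopping rule can be illustrated on a toy score sequence (the scores below are made up; the real search applies the same rule with threshold `eval_tol` to BIC scores along the \(\lambda\) grid):

```python
def first_stop(scores, eval_tol=100):
    """Return the index at which the search would stop: the first time
    a new best score improves on the previous best by at most eval_tol.
    If that never happens, return len(scores) (full sweep)."""
    best = scores[0]
    for i, s in enumerate(scores[1:], start=1):
        if s < best:
            if best - s <= eval_tol:
                return i
            best = s
    return len(scores)

# improvements are 500, 300, then 50 <= 100, so the search stops at index 3
print(first_stop([2000, 1500, 1200, 1150]))  # 3
```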
pgd_bic_stopped = SpLCM_PGD_BIC_timed(X1, np.logspace(-2, -0.5, 50), 0.1)
plot_matrix_heatmap(pgd_bic_stopped, threshold=1e-3)

SpLCM_PGD_BIC_timed ran in 129.7535s

We similarly run the AIC and EBIC optimization with the early stopping mentioned previously. They appear to produce exactly the same estimator as the BIC optimization.
pgd_aic_stopped = SpLCM_PGD_AIC_timed(X1, np.logspace(-2, -0.5, 50), 0.1)
plot_matrix_heatmap(pgd_aic_stopped, threshold=1e-3)

SpLCM_PGD_AIC_timed ran in 117.5695s

pgd_ebic_stopped = SpLCM_PGD_EBIC_timed(X1, np.logspace(-2, -0.5, 50), 0.1)
plot_matrix_heatmap(pgd_ebic_stopped, threshold=1e-3)

SpLCM_PGD_EBIC_timed ran in 123.6517s

We now compare the tuned estimators with previously untuned estimators under performance metrics.
results = {
r"ADMM with $\lambda=0.7$": admm_low_reg,
r"ADMM with $\lambda=100$": admm_high_reg,
r"PGD with $\lambda=0.005$": pgd_low_reg,
r"PGD with $\lambda=0.025$": pgd_high_reg,
r"BCD with $\lambda=0.1$": bcd,
"PGD with BIC without stopping": pgd_bic_unstopped,
"PGD with BIC with stopping": pgd_bic_stopped,
"PGD with AIC with stopping": pgd_aic_stopped,
"PGD with EBIC with stopping": pgd_ebic_stopped
}

We can see the tuned estimators are the best under the normed distance. They also have an extremely good FPR, but this comes with an extremely bad TPR under the sensitive tolerance. One thing worth noticing is that the tuned PGD still runs faster than ADMM, but this should not be taken too seriously: here we implemented a torch version of PGD with automatic parallelization, while the ADMM implementation is vanilla. It is generally accepted that ADMM benefits more from parallelization than PGD.
table_low_tol = error_table(results, sparse_cov)
display(table_low_tol)

| Estimator | TPR | FPR | Frobenius | L2 | Time |
|---|---|---|---|---|---|
| ADMM with $\lambda=0.7$ | 0.988816 | 0.997131 | 11.707416 | 5.149339 | 164.599737 |
| ADMM with $\lambda=100$ | 0.970177 | 0.982783 | 4.876153 | 1.416867 | 150.258311 |
| PGD with $\lambda=0.005$ | 0.875116 | 0.882353 | 9.258855 | 3.512117 | 7.984298 |
| PGD with $\lambda=0.025$ | 0.404473 | 0.417504 | 5.275957 | 1.601727 | 4.527934 |
| BCD with $\lambda=0.1$ | 0.678472 | 0.687231 | 6.546633 | 2.369557 | 35.117952 |
| PGD with BIC without stopping | 0.062442 | 0.011478 | 4.932425 | 1.424776 | 128.230863 |
| PGD with BIC with stopping | 0.062442 | 0.011478 | 4.932425 | 1.424776 | 129.753501 |
| PGD with AIC with stopping | 0.062442 | 0.011478 | 4.932425 | 1.424776 | 117.569497 |
| PGD with EBIC with stopping | 0.062442 | 0.011478 | 4.932425 | 1.424776 | 123.651720 |
Raising the tolerance to \(0.05\) gives a better picture of the tuned estimators. They have nearly no false positives and an acceptable true positive rate, recovering about 30 percent of the nonzero entries. In this regard, the tuned estimators can be considered quite good.
table_high_tol = error_table(results, sparse_cov, tol=0.05)
display(table_high_tol)

| Estimator | TPR | FPR | Frobenius | L2 | Time |
|---|---|---|---|---|---|
| ADMM with $\lambda=0.7$ | 0.898649 | 0.781751 | 11.707416 | 5.149339 | 164.599737 |
| ADMM with $\lambda=100$ | 0.567568 | 0.159063 | 4.876153 | 1.416867 | 150.258311 |
| PGD with $\lambda=0.005$ | 0.804054 | 0.630703 | 9.258855 | 3.512117 | 7.984298 |
| PGD with $\lambda=0.025$ | 0.567568 | 0.183724 | 5.275957 | 1.601727 | 4.527934 |
| BCD with $\lambda=0.1$ | 0.736486 | 0.414303 | 6.546633 | 2.369557 | 35.117952 |
| PGD with BIC without stopping | 0.324324 | 0.006165 | 4.932425 | 1.424776 | 128.230863 |
| PGD with BIC with stopping | 0.324324 | 0.006165 | 4.932425 | 1.424776 | 129.753501 |
| PGD with AIC with stopping | 0.324324 | 0.006165 | 4.932425 | 1.424776 | 117.569497 |
| PGD with EBIC with stopping | 0.324324 | 0.006165 | 4.932425 | 1.424776 | 123.651720 |
Concluding Remarks
In the above two sections, we compared estimators produced by different optimization algorithms, different parameter selections, and with or without tuning. The general conclusion is that under vanilla implementations, with parameter tuning, PGD achieves similar accuracy to ADMM with a much smaller running time. Therefore, for problem dimensions around or under 100 and no significant parallelization, PGD seems the better choice. BCD is less competitive in both accuracy and efficiency. Another conclusion is that BIC does produce meaningful tuning, especially when we want to keep false positives low while recovering a moderate amount of structure.
The open questions center around problems of much higher dimension, such as a \(10000\times 10000\) covariance matrix. It would be interesting to see whether ADMM with parallelization can significantly outperform PGD in both accuracy and efficiency, and whether BIC, AIC, and EBIC would behave differently in that situation, as currently there is no significant difference among the three information criteria.
Footnotes
R. Kim and I. Gaynanova, ‘A sparse linear model for positive definite estimation of covariance matrices’, arXiv [stat.ME], 11-Mar-2025.↩︎
T. Cai, W. Liu, and X. Luo, ‘A constrained ℓ1 minimization approach to sparse precision matrix estimation', J. Am. Stat. Assoc., vol. 106, no. 494, pp. 594–607, Jun. 2011.↩︎
T. Cai, W. Liu, and X. Luo, ‘A constrained ℓ1 minimization approach to sparse precision matrix estimation', J. Am. Stat. Assoc., vol. 106, no. 494, pp. 594–607, Jun. 2011.↩︎
R. Kim and I. Gaynanova, ‘A sparse linear model for positive definite estimation of covariance matrices’, arXiv [stat.ME], 11-Mar-2025.↩︎
J. Bien, ‘Graph-guided banding of the covariance matrix’, arXiv [stat.ME], 01-Jun-2016.↩︎
R. Kim and I. Gaynanova, ‘A sparse linear model for positive definite estimation of covariance matrices’, arXiv [stat.ME], 11-Mar-2025.↩︎
R. Kim and I. Gaynanova, ‘A sparse linear model for positive definite estimation of covariance matrices’, arXiv [stat.ME], 11-Mar-2025.↩︎
R. Kim and I. Gaynanova, ‘A sparse linear model for positive definite estimation of covariance matrices’, arXiv [stat.ME], 11-Mar-2025.↩︎