def vech(A):
    p = A.shape[0]
    if A.shape[1] != p:
        raise ValueError("A must be a square matrix.")
    vech_A = []
    for i in range(p):
        for j in range(i + 1):
            vech_A.append(A[i, j])
    return np.array(vech_A)
Documentation
Problem Setup
We will be following the paper1 for their problem, modeling, and optimization problem formulation. We implement additional optimization algorithms not considered in the original paper.
We consider observing i.i.d. data \((Y_{1},\ldots ,Y_{n})\sim \mathcal{N}(0,\Sigma^{*})\), \(\Sigma^{*}\in \mathbb{R}^{p\times p}\), \(\Sigma^{*}\succ 0\). Our goal is to estimate \(\Sigma^{*}\), and we in particular consider the high-dimensional, sparse covariance setting: \(p\gg n\), and \(\Sigma^{*}\) has many \(0\) off-diagonal entries. A naive estimator is the sample covariance \(S=\frac{1}{n}Y^{\mathsf{T}}Y\), where \(Y:=\begin{pmatrix} Y_{1}^{\mathsf{T}} \\ \vdots \\ Y_{n}^{\mathsf{T}} \end{pmatrix}\in \mathbb{R}^{n\times p}\). We can treat the sample covariance matrix as an observation and formulate estimation as a regression problem, so that we can utilize LASSO-based ideas: \[ S=\Sigma^{*}+E, \] where \(E\) is simply the difference between the sample covariance and the true covariance, which we treat as noise.
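As a quick standalone illustration of the setup (the toy sparse \(\Sigma^{*}\) below is our own choice, not from the paper), the sample covariance concentrates around \(\Sigma^{*}\) as \(n\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 100_000
# a toy sparse covariance: only one nonzero off-diagonal pair
Sigma_star = np.array([[1.0, 0.3, 0.0],
                       [0.3, 1.0, 0.0],
                       [0.0, 0.0, 2.0]])
Y = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)  # n x p data matrix
S = Y.T @ Y / n  # sample covariance (mean known to be zero)
print(np.round(S, 2))  # close to Sigma_star for large n
```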
Vectorizing matrices, and matricizing vectors
It is clear that we would like to vectorize the matrices in the above problem so that it is indeed a regression, and we can avoid using matrix norms in later optimizations. On the other hand, after we solve a LASSO-like problem, we would like to convert the vector back to a covariance matrix. This section is devoted to the related implementations.
Vectorizing a symmetric matrix
Given \(A\in\mathbb{R}^{p\times p}\) a symmetric matrix, it is clear that we only need a \(\frac{p(p+1)}{2}\)-dimensional vector to represent the matrix. This is the functionality of vech(\(A\)); the output will be a vector of the form \[ (A_{11}, A_{21}, A_{22}, \ldots, A_{p1}, \ldots, A_{pp}). \] The implementation is straightforward: we loop through the matrix in the index pattern above.
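For example, on a \(3\times 3\) symmetric matrix the ordering looks like this (a standalone sketch with a minimal vech inlined):

```python
import numpy as np

def vech(A):
    # row-wise lower triangle: (A11, A21, A22, A31, A32, A33)
    p = A.shape[0]
    return np.array([A[i, j] for i in range(p) for j in range(i + 1)])

A = np.array([[1.0, 2.0, 4.0],
              [2.0, 3.0, 5.0],
              [4.0, 5.0, 6.0]])
print(vech(A))  # [1. 2. 3. 4. 5. 6.]
```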
Matricizing a vector
Inversely, we can map a \(\frac{p(p+1)}{2}\)-dimensional vector back to a symmetric matrix. The implementation simply reverses the loop above.
def vech_inverse(a):
    n = a.shape[0]
    p = int((np.sqrt(1 + 8 * n) - 1) / 2)
    if p * (p + 1) // 2 != n:
        raise ValueError("Invalid dimension.")
    A = np.zeros((p, p))
    i = 0
    for j in range(p):
        for k in range(j + 1):
            A[j, k] = a[i]
            if j != k:
                A[k, j] = a[i]
            i += 1
    return A
Duplication matrix
Acknowledgement: ChatGPT wrote this function.
If we have a general matrix \(A\), we would need a \(p^{2}\)-dimensional vector to represent it, and such a representation is denoted by vec(\(A\)). The ordering of elements is typically defined as the concatenation of columns, i.e. \[ (A_{11}, A_{21}, \ldots, A_{p1}, A_{12}, \ldots, A_{1p}, \ldots, A_{pp}). \] Although in our formulation we do not encounter any non-symmetric matrices, some expressions, such as the covariance of vech(\(S\)), require vec(\(S\)). It is clear that there is a linear relationship between vec\((A)\) and vech\((A)\), and the duplication matrix encodes this linear relationship, i.e. \[ \text{vec}(A)=D_{p}\text{vech}(A). \] For each dimension the matrix is fixed. Here we explicitly fill the entries ourselves, as our ordering of vech\((A)\) is nonstandard, which means existing libraries do not have such a matrix stored.
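To make the relationship concrete for \(p=2\) (a self-contained sketch; the \(4\times 3\) duplication matrix under our vech ordering is written out by hand):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 3.0]])
vech_A = np.array([1.0, 2.0, 3.0])        # (A11, A21, A22), our row-wise ordering
vec_A = A.flatten(order='F')              # column-major vec: (A11, A21, A12, A22)
# duplication matrix for p = 2 under our vech ordering
D2 = np.array([[1, 0, 0],
               [0, 1, 0],
               [0, 1, 0],
               [0, 0, 1]])
print(np.array_equal(D2 @ vech_A, vec_A))  # True
```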
def duplication_matrix(p):
    D = np.zeros((p * p, p * (p + 1) // 2))
    idx = 0
    for i in range(p):
        for j in range(i + 1):
            vec_idx1 = i + j * p
            vec_idx2 = j + i * p
            D[vec_idx1, idx] = 1
            if i != j:
                D[vec_idx2, idx] = 1
            idx += 1
    return D
Diagonal and off-diagonal indices in vector representation
After we vectorize \(S\), it will be of interest to optimize the diagonal terms and off-diagonal terms of \(\widehat{\Sigma}\) differently, as the diagonal terms of \(S\) are very good estimators of the variances, while the off-diagonal terms are rather noisy and uninformative. Therefore, we need two index sets to directly access the corresponding entries in the vectorized representation, and that is the role of vech_idx(\(p\)). Again, this is a fixed value for each dimension \(p\).
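For example, for \(p=3\) the diagonal entries \(A_{11},A_{22},A_{33}\) sit at positions \(0, 2, 5\) of the vech vector, matching the formula \(k(k+1)/2-1\) (a standalone sketch):

```python
p = 3
diag_idx = [k * (k + 1) // 2 - 1 for k in range(1, p + 1)]
off_idx = sorted(set(range(p * (p + 1) // 2)) - set(diag_idx))
print(diag_idx, off_idx)  # [0, 2, 5] [1, 3, 4]
```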
def vech_idx(p):
    all_idx = range(p * (p + 1) // 2)
    diag_idx = [k * (k + 1) // 2 - 1 for k in range(1, p + 1)]
    off_idx = sorted(set(all_idx) - set(diag_idx))
    return diag_idx, off_idx
Normalization
After defining the above vectorizing and devectorizing functions, we look back at the regression problem \[ S = \Sigma^{*}+E\iff \text{vech}(S)=\text{vech}(\Sigma^{*})+\text{vech}(E). \] We define \(s:=\text{vech}(S)\), \(\sigma^{*}:=\text{vech}(\Sigma^{*})\), and \(\epsilon :=\text{vech}(E)\), so that we have \[ s = \sigma^{*}+\epsilon. \] But here, \(\epsilon\) doesn’t look like a noise vector, as its entries are correlated due to its relationship with the sample covariance matrix. Therefore, we need to normalize it so that we can better treat it as noise. That is, we want to compute \[ V:=\text{Cov}(\epsilon)=\text{Cov}(\text{vech}(S)), \] and then we can rewrite the problem as \[ V^{-1/2}s = V^{-1/2}\sigma^{*}+V^{-1/2}\epsilon. \] \(V\) indeed has a closed form expression: \[ V = \frac{2}{n} D_p^{+} \left( \Sigma^* \otimes \Sigma^* \right) D_p^{+\top}, \] and unfortunately it is clear that we need to first estimate \(\Sigma^{*}\) to have a good estimate of \(V\). Yet since we only need \(V^{-1}\), we can roughly write: \[ V^{-1} \approx \frac{n}{2} D_p^{\top} \left( \Omega \otimes \Omega \right) D_p, \] where \(\Omega:=\left(\Sigma^{*}\right)^{-1}\), so it suffices to have a good estimator of \(\Omega\). A good result in this regime is known as CLIME2. In this section we implement CLIME by wrapping the authors’ original R implementation, and then compute \(V^{-1}\) from it.
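The two closed forms are consistent: \(\left(D_p^{+}(\Sigma\otimes\Sigma)D_p^{+\top}\right)^{-1}=D_p^{\top}(\Omega\otimes\Omega)D_p\) is a standard duplication-matrix identity for nonsingular \(\Sigma\), and it can be checked numerically (a standalone sketch with the duplication matrix inlined and a toy \(\Sigma\) of our own choosing):

```python
import numpy as np

def duplication_matrix(p):
    # maps our row-wise vech(A) to column-major vec(A)
    D = np.zeros((p * p, p * (p + 1) // 2))
    idx = 0
    for i in range(p):
        for j in range(i + 1):
            D[i + j * p, idx] = 1
            if i != j:
                D[j + i * p, idx] = 1
            idx += 1
    return D

rng = np.random.default_rng(0)
p, n = 3, 50
G = rng.standard_normal((p, p))
Sigma = G @ G.T + p * np.eye(p)            # a toy positive definite Sigma
Omega = np.linalg.inv(Sigma)
D = duplication_matrix(p)
D_plus = np.linalg.pinv(D)
V = (2 / n) * D_plus @ np.kron(Sigma, Sigma) @ D_plus.T
V_inv = (n / 2) * D.T @ np.kron(Omega, Omega) @ D
print(np.allclose(V @ V_inv, np.eye(p * (p + 1) // 2)))  # True
```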
Sample covariance matrix
Before going into the implementation of CLIME, we first consider how to calculate \(S\). Given a data matrix \(X\in\mathbb{R}^{n\times p}\), numpy already provides a function to calculate the sample covariance matrix; here we simply wrap the numpy interface under our own function name \(\textbf{sample}\textunderscore\textbf{covariance}(X)\). Note that np.cov uses the unbiased \(1/(n-1)\) scaling by default.
def sample_covariance(X):
    return np.cov(X, rowvar=False)
CLIME
Acknowledgement: ChatGPT wrote this function.
CLIME is solving the following problem: \[
\widehat{\Omega} = \arg\min_{\Omega \in \mathbb{R}^{p \times p}} \|\Omega\|_{1} \quad \text{subject to} \quad \|S \Omega - I\|_{\infty} \leq \rho.
\] We use the authors’ original implementation in R3 to solve this problem. The R package clime is thus an implicit dependency of our package that pip will not install automatically; one needs to run install.packages("clime") in an R console. We will denote the output by \(\widehat{\Omega}\).
def clime(S, rho):
    numpy2ri.activate()
    robjects.globalenv['S'] = numpy2ri.py2rpy(S)
    robjects.globalenv['lambda'] = rho
    robjects.r('''
        library(clime)
        result <- clime(S, lambda=lambda, standardize=FALSE)
    ''')
    Omega_r = robjects.r('result$Omegalist[[1]]')
    return np.array(Omega_r)
Normalizing matrix
After we compute \(\widehat{\Omega}\) as an estimator of \(\Omega\), we can obtain an estimator of \(V^{-1}\) as \[ \widehat{V}^{-1} := \frac{n}{2} D_p^{\top} \left( \widehat{\Omega} \otimes \widehat{\Omega} \right) D_p, \] where \(\otimes\) is the Kronecker product. The normalization function essentially accepts \(\widehat{\Omega}\) and performs the above computation.
def normalization(n, Omega):
    p = Omega.shape[0]
    D = duplication_matrix(p)
    V_inverse = (n / 2) * ((D.T @ np.kron(Omega, Omega)) @ D)
    return V_inverse
Optimization
We now formulate a LASSO-like problem for the regression problem we previously formulated: \[\begin{align} \min_{\boldsymbol{\sigma}} \quad & \frac{1}{2n} \left\| \widehat{V}^{-1/2} s - \widehat{V}^{-1/2} \sigma \right\|_2^2 + \lambda P(\sigma_{[o]})\\ \text{s.t.} \quad & \sigma_{[d]} = s_{[d]},\; \Sigma(\sigma_{[d]},\sigma_{[o]})\succ 0. \end{align}\] Here \(P(\cdot)\) is going to be the \(\mathcal{L}^{1}\)-norm, and \([d],[o]\) are the two outputs of vech_idx(\(p\)), one representing diagonal indices and the other off-diagonal indices. Essentially, we want to force the diagonal of the estimate to be the same as that of \(S\), while performing soft-thresholding on the off-diagonal entries to maintain sparsity. We also impose the condition that the estimate must be positive definite, as any valid covariance matrix should be. In the paper4, the authors use ADMM to solve the optimization problem: introducing \(\theta\) with the constraint \(\theta=\sigma\) and omitting the positive definiteness constraint, the objective becomes: \[\begin{align} \frac{1}{2n} \left\| \widehat{V}^{-1/2} s - \widehat{V}^{-1/2} \sigma \right\|_2^2 + \lambda \left\| \theta_{[o]} \right\|_1 + \lambda_{\infty} \left\| \theta_{[d]} - s_{[d]} \right\|_1 + \nu^\top (\sigma - \theta) + \frac{1}{2\gamma} \left\| \sigma - \theta \right\|_2^2. \end{align}\] Letting \(\eta:=\gamma \nu\), we can compute the update rule: \[\begin{align*} \sigma^{(i+1)} &= \left( \frac{1}{n} \widehat{V}^{-1} + \frac{1}{\gamma} I \right)^{-1} \left\{ \frac{1}{n} \widehat{V}^{-1} s + \frac{1}{\gamma} \left( \theta^{(i)} - \eta^{(i)} \right) \right\}, \\[1ex] \theta_{[o]}^{(i+1)} &= \mathcal{S}_{\lambda \gamma} \left( \sigma_{[o]}^{(i+1)} + \eta_{[o]}^{(i)} \right), \\[1ex] \eta^{(i+1)} &= \eta^{(i)} + \left( \sigma^{(i+1)} - \theta^{(i+1)} \right), \end{align*}\] where \(\mathcal{S}_{\lambda\gamma}\) is the soft-thresholding function with level \(\lambda\gamma\).
We also need an extra projection step to guarantee the positive definiteness of \(\Sigma(\sigma)\). That is \[ \tilde{\sigma}^{(i+1)}\in\left\{\sigma:\;\Sigma(\sigma) = \sum_{j=1}^{p} \max\{\lambda_j,\delta\} \, q_j q_j^{\top}\right\}, \] where \(\delta>0\) is some preset lower bound on the projected eigenvalues. We can thus state the complete update rule \[\begin{align} \sigma^{(i+1)} &= \left( \frac{1}{n} \widehat{V}^{-1} + \frac{1}{\gamma} I \right)^{-1} \left\{ \frac{1}{n} \widehat{V}^{-1} s + \frac{1}{\gamma} \left( \theta^{(i)} - \eta^{(i)} \right) \right\}= \text{vech}\left(\sum_{j=1}^{p} \lambda_j \, q_j q_j^{\top}\right), \\[1ex] \tilde{\sigma}^{(i+1)}&\in\left\{\sigma:\;\Sigma(\sigma) = \sum_{j=1}^{p} \max\{\lambda_j,\delta\} \, q_j q_j^{\top}\right\}, \\[1ex] \theta_{[o]}^{(i+1)} &= \mathcal{S}_{\lambda \gamma} \left( \widetilde{\sigma}_{[o]}^{(i+1)} + \eta_{[o]}^{(i)} \right), \\[1ex] \eta^{(i+1)} &= \eta^{(i)} + \left( \widetilde{\sigma}^{(i+1)} - \theta^{(i+1)} \right). \end{align}\] The first goal of this section is to implement the ADMM optimization algorithm: first implementing each update rule, then combining them into an aggregate covariance estimation function.
The second goal of this section is to implement a proximal gradient descent algorithm for solving the optimization problem. Compared to ADMM, tuning parameter selection for PGD is much more intuitive. This part is not discussed in the original paper, and we will detail the (simple) update rule later.
The ADMM approach to sparse linear modeling
Soft-thresholding
It is clear that in a LASSO formulation, soft-thresholding must be utilized, and thus must be implemented. The precise definition of soft-thresholding is \[ \mathcal{S}_{\lambda}(x) = \begin{cases} x - \lambda, & \text{if } x > \lambda \\ 0, & \text{if } |x| \leq \lambda \\ x + \lambda, & \text{if } x < -\lambda \end{cases}. \] We implement directly from this definition.
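For instance, with \(\lambda = 1.5\) (a standalone numeric sketch):

```python
import numpy as np

lam = 1.5
x = np.array([-3.0, -1.0, 0.0, 1.2, 2.0])
s = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
print(s)  # entries with |x| <= lambda are zeroed; the rest shrink toward 0 by lambda
```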
def soft_thresholding(a, lambda_):
    return np.sign(a) * np.maximum(np.abs(a) - lambda_, 0.0)
Projection onto nearest positive definite matrix
Given a general square matrix \(A\in\mathbb{R}^{p\times p}\), we project it onto the cone of positive definite matrices by first symmetrizing it: \[ \widehat{A} := \frac{1}{2}\left(A+A^{\top}\right), \] then performing the eigendecomposition: \[ \widehat{A} = \sum_{j=1}^{p}\lambda_{j}q_{j}q_{j}^{\top}, \] and finally lower bounding the eigenvalues by a positive constant \(\delta\) to generate a new positive definite matrix: \[ \text{proj}_{PD(\delta)}(A)=\sum_{j=1}^{p}\max\{\lambda_{j},\delta\}q_{j}q_{j}^{\top}. \]
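A quick self-contained check (with the projection inlined) that the output's smallest eigenvalue is floored at \(\delta\):

```python
import numpy as np

def proj_pd(A, delta=1e-6):
    A = (A + A.T) / 2                        # symmetrize
    eigvals, eigvecs = np.linalg.eigh(A)
    eigvals = np.clip(eigvals, delta, None)  # floor the eigenvalues at delta
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])                   # indefinite: eigenvalues -1 and 3
B = proj_pd(A, delta=0.1)
print(np.linalg.eigvalsh(B))                 # smallest eigenvalue is now 0.1
```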
def proj_pd(A, delta=1e-6):
    A = (A + A.T) / 2
    eigvals, eigvecs = np.linalg.eigh(A)
    eigvals_clipped = np.clip(eigvals, delta, None)
    return eigvecs @ np.diag(eigvals_clipped) @ eigvecs.T
Update rule for \(\sigma\)
As we have discussed previously, the essential step we perform for \(\sigma\) is \[\begin{align} \sigma^{(i+1)} &= \left( \frac{1}{n} \widehat{V}^{-1} + \frac{1}{\gamma} I \right)^{-1} \left\{ \frac{1}{n} \widehat{V}^{-1} s + \frac{1}{\gamma} \left( \theta^{(i)} - \eta^{(i)} \right) \right\}= \text{vech}\left(\sum_{j=1}^{p} \lambda_j \, q_j q_j^{\top}\right), \\[1ex] \tilde{\sigma}^{(i+1)}&\in\left\{\sigma:\;\Sigma(\sigma) = \sum_{j=1}^{p} \max\{\lambda_j,\delta\} \, q_j q_j^{\top}\right\}, \end{align}\] i.e. a gradient-descent-like step followed by an eigendecomposition-and-projection step. We implement them together in one update function.
def sigma_update(theta, eta, V_inverse, s, n, gamma, delta):
    d = V_inverse.shape[0]
    A = (1 / n) * V_inverse + (1 / gamma) * np.eye(d)
    b = ((1 / n) * V_inverse) @ s + (1 / gamma) * (theta - eta)
    sigma_tilde = np.linalg.solve(A, b)
    Sigma_tilde = vech_inverse(sigma_tilde)
    Sigma = proj_pd(Sigma_tilde, delta)
    return vech(Sigma)
Update rule for \(\theta\)
We recall the update rule for \(\theta\): \[\begin{align} \theta_{[o]}^{(i+1)} &= \mathcal{S}_{\lambda \gamma} \left( \widetilde{\sigma}_{[o]}^{(i+1)} + \eta_{[o]}^{(i)} \right), \end{align}\] which is a direct application of the soft-thresholding function on the off-diagonal entries.
def theta_update(sigma, eta, s, diag_idx, off_idx, lambda_, gamma):
    theta = np.zeros_like(sigma)
    # diagonal entries are pinned to those of s (the large-lambda_inf limit)
    theta[diag_idx] = s[diag_idx]
    theta[off_idx] = soft_thresholding(sigma[off_idx] + eta[off_idx], lambda_ * gamma)
    return theta
Update rule for \(\eta\)
We recall the update rule for \(\eta\): \[ \eta^{(i+1)} = \eta^{(i)} + \left( \widetilde{\sigma}^{(i+1)} - \theta^{(i+1)} \right), \] which is a simple add-and-subtract step.
def eta_update(sigma, theta, eta):
    return eta + sigma - theta
Put everything together
We now aggregate all of the above into one ADMM based covariance estimation function. The essential steps: calculate the sample covariance from the input data \(X\), calculate the normalization matrix \(V^{-1}\) using CLIME, initialize \(\sigma,\theta,\eta\), and iterate, subject to the \(\textbf{max}\textunderscore\textbf{iter}\) cap and the exit tolerance \(\textbf{tol}\).
SpLCM_ADMM
SpLCM_ADMM (X, lambda_=0.1, rho=0.1, gamma=1.0, delta=0.01, V_inverse_0=None, sigma_0=None, theta_0=None, eta_0=None, max_iter=700, tol=1e-05)
Exported source
def SpLCM_ADMM(X, lambda_=0.1, rho=0.1, gamma=1.0, delta=0.01,
               V_inverse_0=None, sigma_0=None, theta_0=None, eta_0=None,
               max_iter=700, tol=1e-5):
    n = X.shape[0]
    p = X.shape[1]
    S = sample_covariance(X)
    s = vech(S)
    diag_idx, off_idx = vech_idx(p)
    V_inverse = V_inverse_0
    if V_inverse_0 is None:
        Omega = clime(S, rho)
        V_inverse = normalization(n, Omega)
    sigma = sigma_0
    theta = theta_0
    eta = eta_0
    if sigma_0 is None:
        sigma = s.copy()
    if theta_0 is None:
        theta = s.copy()
    if eta_0 is None:
        eta = np.zeros_like(sigma)
    for i in range(max_iter):
        sigma_temp = sigma_update(theta, eta, V_inverse, s, n, gamma, delta)
        if np.linalg.norm(sigma_temp - sigma) < tol:
            break
        sigma = sigma_temp
        theta = theta_update(sigma, eta, s, diag_idx, off_idx, lambda_, gamma)
        eta = eta_update(sigma, theta, eta)
    return vech_inverse(sigma)
The PGD approach to sparse linear modeling
Given our optimization formulation, which is convex, we do not necessarily need the ADMM based approach, where parameter selection is critical, as we will show in later examples. Here we consider a simpler yet still empirically efficient approach (for \(\mathcal{O}(10^{1})\) dimensions), utilizing proximal gradient descent (PGD). We essentially have the following update rule: \[\begin{align} x^{(k+\frac{1}{4})} &= x^{(k)} - \eta \cdot \frac{1}{n} \hat{V}^{-1}(x^{(k)} - s), \\ x^{(k+\frac{1}{2})}_{[o]} &= \mathcal{S}_{\eta\lambda}\left( x^{(k+\frac{1}{4})}_{[o]} \right), \\ x^{(k+\frac{1}{2})}_{[d]} &= s_{[d]} + \mathcal{S}_{\eta\lambda_{\infty}}\left( x^{(k+\frac{1}{4})}_{[d]} - s_{[d]} \right), \\ x^{(k+1)} & = \text{Proj}_{PD(\delta)}(x^{(k+\frac{1}{2})}), \end{align}\] and we can still utilize our previously defined functions. Here, the thresholding parameters have a much more direct meaning, and we can use intuition to tune the sparsity of the estimate.
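The mechanics of PGD can be illustrated on a generic lasso-type toy objective \(\frac{1}{2}x^{\top}Qx - b^{\top}x + \lambda\Vert x\Vert_{1}\) (toy \(Q\) and \(b\) of our own choosing, without the diagonal constraint or the PD projection):

```python
import numpy as np

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -0.2])
lam = 0.3
step = 1.0 / np.max(np.linalg.eigvalsh(Q))   # step size 1/L, L = largest eigenvalue
x = np.zeros(2)
for _ in range(1000):
    g = Q @ x - b                            # gradient of the smooth part
    y = x - step * g                         # gradient step
    x = np.sign(y) * np.maximum(np.abs(y) - step * lam, 0.0)  # prox of step*lam*||.||_1
print(x)  # a sparse-leaning minimizer of the toy objective
```

The returned point satisfies the lasso optimality conditions: on the support, the smooth gradient plus \(\lambda\,\text{sign}(x)\) vanishes; off the support, the gradient is bounded by \(\lambda\).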
SpLCM_PGD
SpLCM_PGD (X, lambda_=0.1, lambda_i=1000, rho=0.1, delta=0.01, V_inverse_0=None, sigma_0=None, max_iter=700, tol=1e-05)
Exported source
def SpLCM_PGD(X, lambda_=0.1, lambda_i=1000, rho=0.1, delta=0.01,
              V_inverse_0=None, sigma_0=None, max_iter=700, tol=1e-5):
    n = X.shape[0]
    p = X.shape[1]
    S = sample_covariance(X)
    s = vech(S)
    diag_idx, off_idx = vech_idx(p)
    V_inverse = V_inverse_0
    if V_inverse_0 is None:
        Omega = clime(S, rho)
        V_inverse = normalization(n, Omega)
    V_inverse_op = np.max(np.linalg.eigvalsh(V_inverse))
    eta = n / V_inverse_op
    sigma = sigma_0
    if sigma_0 is None:
        sigma = s.copy()
    for i in range(max_iter):
        grad = (1.0 / n) * V_inverse @ (sigma - s)
        sigma_temp = sigma - eta * grad
        sigma_temp[off_idx] = soft_thresholding(sigma_temp[off_idx], lambda_)
        sigma_temp[diag_idx] = s[diag_idx] + soft_thresholding(sigma_temp[diag_idx] - s[diag_idx], lambda_i)
        Sigma_temp = vech_inverse(sigma_temp)
        Sigma_temp = proj_pd(Sigma_temp, delta)
        sigma_temp = vech(Sigma_temp)
        if np.linalg.norm(sigma_temp - sigma) < tol:
            break
        sigma = sigma_temp
    return vech_inverse(sigma)
The dual BCD approach to sparse linear modeling
In this section we consider whether it is possible to do covariance matrix estimation via a dual structure. Such an idea was explored in a previous work5. The following derivation originates from an unpublished note by Jacob Bien: we begin by zooming out from our specifically formulated vectorized regression setting, and make an observation on the general optimization setup: \[ \min_{x}\left\{f_{1}(x)+f_{2}(x)+f_{3}(x)+f_{4}(x)\right\}=\min_{x,y,z,a}\left\{f_{1}(x)+f_{2}(y)+f_{3}(z)+f_{4}(a)\;\text{s.t.}\;x=y,\;x=z,\;x=a \right\}, \] and therefore, the dual objective is \[\begin{align} g(u,v,w)&=\min_{x,y,z,a}\left\{f_{1}(x)+f_{2}(y)+f_{3}(z) + f_{4}(a) +u^{\top}(x-y)+v^{\top}(x-z)+w^{\top}(x-a)\right\}\\ &=\min_{x}\left\{f_{1}(x)+(u+v+w)^{\top}x \right\}+\min_{y}\left\{f_{2}(y)-u^{\top}y\right\}+\min_{z}\left\{f_{3}(z)-v^{\top}z\right\}+\min_{a}\left\{f_{4}(a)-w^{\top}a\right\}\\ &=-f_{1}^{*}(-u-v-w)-f_{2}^{*}(u)-f_{3}^{*}(v)-f_{4}^{*}(w), \end{align}\] where \(f_{i}^{*}\) is the Fenchel conjugate of \(f_{i}\). And so the dual problem is \[ \min_{u,v,w}\left\{f_{1}^{*}(-u-v-w)+f_{2}^{*}(u)+f_{3}^{*}(v)+f_{4}^{*}(w) \right\}. \] We now go back to our regression setting. In our problem we have: \[\begin{align} & f_{1}(\sigma)=\frac{1}{2n}\left\Vert V^{-1/2}(s-\sigma)\right\Vert_{2}^{2}=\frac{1}{2}\left\Vert s-\sigma\right\Vert_{\frac{1}{n}V^{-1}}^{2},\\ &f_{2}(\sigma)=\lambda \left\Vert \sigma_{[o]}\right\Vert_{1} = \lambda \left\Vert A\sigma\right\Vert_{1},\\ &f_{3}(\sigma)=\lambda_{\infty} \left\Vert s_{[d]} - \sigma_{[d]}\right\Vert_{1} = \lambda_{\infty} \left\Vert B\left(s - \sigma\right)\right\Vert_{1},\\ &f_{4}(\sigma)=\textbf{1}_{\infty}\left\{ \text{vech}^{-1}\left(\sigma\right) \succ 0\right\}, \end{align}\] where \(A\) and \(B\) are the boolean matrices that extract the off-diagonal and diagonal entries, respectively, i.e.:
Boolean matrices for extracting diagonal and off-diagonal indices
def boolean_matricies(p):
    I = np.identity(p * (p + 1) // 2)
    A = np.zeros_like(I)
    B = np.zeros_like(I)
    all_idx = range(p * (p + 1) // 2)
    diag_idx = [k * (k + 1) // 2 - 1 for k in range(1, p + 1)]
    off_idx = sorted(set(all_idx) - set(diag_idx))
    A[:, off_idx] = I[:, off_idx]
    B[:, diag_idx] = I[:, diag_idx]
    return A, B
So their duals are \[\begin{align} & f_{1}^{*}(-u-v-w)=\frac{1}{2}\left\Vert u+v+w\right\Vert_{nV}^{2}-\langle u,s\rangle-\langle v,s\rangle-\langle w,s\rangle,\\ & f_{2}^{*}(u)=\textbf{1}_{\infty}\left\{\exists\;p,\;A^{\top}p=u,\; \left\Vert p \right\Vert_{\infty}\leq \lambda\right\},\\ & f_{3}^{*}(v)=\langle v, s\rangle + \textbf{1}_{\infty}\left\{\exists\;q,\;B^{\top}q=v,\; \left\Vert q \right\Vert_{\infty}\leq \lambda_{\infty}\right\},\\ & f_{4}^{*}(w) = \textbf{1}_{\infty}\left\{ \text{vech}^{-1}\left(w\right) \prec 0\right\}. \end{align}\] And so we have the specific dual problem: \[\begin{align} \min_{u,v,w} \quad & \frac{1}{2}\left\Vert u+v+w\right\Vert_{nV}^{2}-\langle u,s\rangle-\langle w,s\rangle\\ \text{s.t.} \quad&\exists\;p,\;A^{\top}p=u,\; \left\Vert p \right\Vert_{\infty}\leq \lambda, \\ \quad&\exists\;q,\;B^{\top}q=v,\; \left\Vert q \right\Vert_{\infty}\leq \lambda_{\infty}, \\ \quad& \text{vech}^{-1}\left(w\right) \prec 0. \end{align}\] We first note that in the second constraint, when \(\lambda_{\infty}\) is very large, we can drop the bound \(\left\Vert q \right\Vert_{\infty}\leq \lambda_{\infty}\). We also note that the three constraints are decoupled and the main objective is differentiable; therefore this problem can be solved via BCD. In particular, we split it into three problems: \[\begin{align} \min_{u} \quad & \frac{1}{2}\left\Vert u+v+w\right\Vert_{nV}^{2}-\langle u,s\rangle\\ \text{s.t.} \quad&\exists\;p,\;A^{\top}p=u,\; \left\Vert p \right\Vert_{\infty}\leq \lambda.\\ \min_{v} \quad & \frac{1}{2}\left\Vert u+v+w\right\Vert_{nV}^{2}\\ \text{s.t.}\quad&\exists\;q,\;B^{\top}q=v.\\ \min_{w} \quad & \frac{1}{2}\left\Vert u+v+w\right\Vert_{nV}^{2}-\langle w,s\rangle\\ \text{s.t.} \quad& \text{vech}^{-1}\left(w\right) \prec 0. \end{align}\] We essentially perform convex optimization on the three problems in a cycle.
A main problem with the dual formulation is that the weighting is now \(nV\), where \(V\) is supposed to involve the inverse of the CLIME estimator, \(\hat{\Omega}^{-1}\), which is itself an estimator of \(\Sigma\). What’s worse is that the sparsity of \(\hat{\Omega}^{-1}\) is likely very poor. So even if these subproblems take nice forms, if we continue using the CLIME estimator as weighting, the empirical efficiency will likely suffer. On the bright side, we can always use some rough and sparse estimator of \(\Sigma\) to compute a rough and sparse \(V\).
We now compute the update rule. The first problem is equivalently \[\begin{align} \min_{p} \quad & \frac{1}{2}\left\Vert A^{\top}p+c\right\Vert_{nV}^{2}-\langle A^{\top}p,s\rangle\\ \text{s.t.} \quad& \left\Vert p \right\Vert_{\infty}\leq \lambda, \end{align}\] where \(c=v+w\) is a constant. Moreover, \[\begin{align} \frac{1}{2}\left\Vert A^{\top}p+c\right\Vert_{nV}^{2}-\langle A^{\top}p,s\rangle=\frac{1}{2}\left\Vert p\right\Vert_{nAVA^{\top}}^{2}-\langle p,A(s-nVc)\rangle + \frac{1}{2}\Vert c\Vert_{nV}^{2}. \end{align}\] So the update rule is \[ u^{(k+1)} = A^{\top}\text{argmin}_{\left\Vert p \right\Vert_{\infty}\leq \lambda}\left\{ \frac{1}{2}\left\Vert p\right\Vert_{nAVA^{\top}}^{2}-\left\langle p,A\left(s-nV\left(v^{(k)}+w^{(k)}\right)\right)\right\rangle\right\}, \] which amounts to solving a quadratic programming problem.
The second problem is equivalently \[\begin{align} \min_{q} \quad & \frac{1}{2}\left\Vert B^{\top}q+c\right\Vert_{nV}^{2}, \end{align}\] and since \[\begin{equation*} \frac{1}{2}\left\Vert B^{\top}q+c\right\Vert_{nV}^{2} = \frac{1}{2}\Vert q\Vert_{nBVB^{\top}}^{2}+\left\langle q, nBVc \right\rangle+\frac{1}{2}\Vert c\Vert_{nV}^{2}, \end{equation*}\] the problem reduces to \[\begin{align*} \min_{q} \quad & \frac{1}{2}\Vert q\Vert_{nBVB^{\top}}^{2}+\left\langle q, nBVc \right\rangle, \end{align*}\] which is even simpler: an unconstrained QP. We write the update rule as \[ v^{(k+1)} = B^{\top}\text{argmin}_{q}\left\{\frac{1}{2}\Vert q\Vert_{nBVB^{\top}}^{2}+\left\langle q, nBV\left( u^{(k+1)}+w^{(k)}\right) \right\rangle \right\}. \]
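The pseudoinverse-based closed form for such a singular unconstrained QP can be sanity-checked on a toy instance (matrices of our own choosing; the linear term is constructed to lie in the range of the quadratic form, mimicking \(nBVc\)):

```python
import numpy as np

rng = np.random.default_rng(1)
# a rank-deficient PSD matrix, playing the role of n B V B^T
L = rng.standard_normal((5, 2))
M = L @ L.T                          # rank 2, positive semidefinite
a = M @ rng.standard_normal(5)       # linear term in the range of M
q = -np.linalg.pinv(M) @ a           # minimizer of 0.5 q^T M q + a^T q
print(np.allclose(M @ q + a, 0.0))   # gradient vanishes at the minimizer: True
```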
The third problem is equivalently \[\begin{align} \min_{w} \quad & \frac{1}{2}\Vert w-c\Vert_{nV}^{2}+\langle w,s\rangle\\ \text{s.t.} \quad& \text{vech}^{-1}\left(w\right) \succ 0. \end{align}\] So the update rule is \[ w^{(k+1)}=-1\cdot\text{argmin}_{\text{vech}^{-1}\left(w\right) \succ 0}\left\{ \frac{1}{2}\Vert w\Vert_{nV}^{2}+\left\langle w,s-nV\left(u^{(k+1)}+v^{(k+1)}\right)\right\rangle\right\}, \] which again amounts to solving a quadratic programming problem. We now put the update rules together: \[\begin{align} &u^{(k+1)} = A^{\top}\text{argmin}_{\left\Vert p \right\Vert_{\infty}\leq \lambda}\left\{ \frac{1}{2}\left\Vert p\right\Vert_{nAVA^{\top}}^{2}-\left\langle p,A\left(s-nV\left(v^{(k)}+w^{(k)}\right)\right)\right\rangle\right\},\\ &v^{(k+1)} = B^{\top}\text{argmin}_{q}\left\{\frac{1}{2}\Vert q\Vert_{nBVB^{\top}}^{2}+\left\langle q, nBV\left( u^{(k+1)}+w^{(k)}\right) \right\rangle \right\},\\ &w^{(k+1)}=-1\cdot\text{argmin}_{\text{vech}^{-1}\left(w\right) \succ 0}\left\{ \frac{1}{2}\Vert w\Vert_{nV}^{2}+\left\langle w,s-nV\left(u^{(k+1)}+v^{(k+1)}\right)\right\rangle\right\}. \end{align}\] We now see a second major problem with the dual formulation: even though we are leveraging quadratic programming, \(AVA^{\top}\) and \(BVB^{\top}\) are only positive semidefinite, not positive definite. This means each update likely does not admit a unique solution. In practice, we can use tricks like perturbing the normalizing matrix with the identity, but this is a major theoretical problem with the formulation. We first implement a numerically stable version with the perturbation trick.
Update rule for \(u\)
We recall the update rule for \(u\) is \[\begin{equation*}
u^{(k+1)} = A^{\top}\text{argmin}_{\left\Vert p \right\Vert_{\infty}\leq \lambda}\left\{ \frac{1}{2}\left\Vert p\right\Vert_{nAVA^{\top}}^{2}-\left\langle p,A\left(s-nV\left(v^{(k)}+w^{(k)}\right)\right)\right\rangle\right\},
\end{equation*}\] We can implement this using a package such as quadprog or cvxpy with the OSQP solver.
def u_update(v, w, A, V, s, n, lambda_, epsilon=1e-8):
    AVAT = A @ V @ A.T
    Q = 0.5 * n * (AVAT + AVAT.T)
    Q += epsilon * np.eye(Q.shape[0])
    a = A @ (s - (n * V) @ (v + w))
    m = Q.shape[0]
    p = cp.Variable(m)
    objective = 0.5 * cp.quad_form(p, Q) - a @ p
    constraints = [cp.abs(p) <= lambda_]
    prob = cp.Problem(cp.Minimize(objective), constraints)
    prob.solve(solver=cp.OSQP)
    return A.T @ p.value
Update rule for \(v\)
We recall the update rule for \(v\) is \[\begin{equation*} v^{(k+1)} = B^{\top}\text{argmin}_{q}\left\{\frac{1}{2}\Vert q\Vert_{nBVB^{\top}}^{2}+\left\langle q, nBV\left( u^{(k+1)}+w^{(k)}\right) \right\rangle \right\}. \end{equation*}\] This is an unconstrained QP, so we can use the closed form solution \[\begin{equation*} v^{(k+1)}=-B^{\top}\left(BVB^{\top}\right)^{+}\left(BV\left(u^{(k+1)}+w^{(k)} \right) \right). \end{equation*}\]
def v_update(u, w, B, V, epsilon=1e-8):
    BVBT = B @ V @ B.T
    Q = 0.5 * (BVBT + BVBT.T)
    Q += epsilon * np.eye(Q.shape[0])
    a = (B @ V) @ (u + w)
    q = (-1) * np.linalg.solve(Q, a)
    return B.T @ q
Update rule for \(w\)
We recall the update rule for \(w\) is \[\begin{equation*}
w^{(k+1)}=-1\cdot\text{argmin}_{\text{vech}^{-1}\left(w\right) \succ 0}\left\{ \frac{1}{2}\Vert w\Vert_{nV}^{2}+\left\langle w,s-nV\left(u^{(k+1)}+v^{(k+1)}\right)\right\rangle\right\}.
\end{equation*}\] We note that, without the constraint, we can write the closed form solution as \[\begin{align*}
w^{(k+1)}&=-1\cdot \left(-\left(nV\right)^{+}\left(s-nV\left(u^{(k+1)}+v^{(k+1)}\right) \right)\right)\\
&=-1\cdot \left(-\left(nV\right)^{+}s+\left(u^{(k+1)}+v^{(k+1)} \right)\right).
\end{align*}\] So with the constraint, the solution is \[\begin{equation*}
w^{(k+1)}=-1\cdot\text{Proj}_{\text{vech}\left(PD(\delta)\right)}\left(-\left(nV\right)^{+}s+\left(u^{(k+1)}+v^{(k+1)} \right)\right),
\end{equation*}\] where the projection can be done utilizing the proj_pd we previously wrote.
def w_update(u, v, V, s, n, delta=1e-6):
    Q = n * V
    r = (-1) * np.linalg.solve(Q, s) + (u + v)
    R = vech_inverse(r)
    proj_R = proj_pd(R, delta)
    proj_r = vech(proj_R)
    return -1 * proj_r
Normalizing matrix under the dual formulation
We are one step away from putting the three update rules together: we haven’t given an expression for \(V\). In principle, we would use \[\begin{equation*} V=\text{Cov}(\text{vech}(S))=\frac{2}{n} D_p^{+} \left( \Sigma^{*} \otimes \Sigma^{*} \right) D_p^{+\top}, \end{equation*}\] which means \[\begin{equation*} \widehat{V}=\frac{2}{n} D_p^{+} \left( S \otimes S \right) D_p^{+\top}. \end{equation*}\] Yet in practice, this is impractical. My empirical observation is that with my unoptimized, unparallelized code, each round of QP with this exact \(V\) takes around 3 minutes, and convergence should take at least tens of rounds. So in our implementation, we will use the diagonal of \(\widehat{V}\), which accelerates the optimization. In the following implementation, we still output the full \(\widehat{V}\) for reference only. It is also of interest to explore whether there are strategies to accelerate the optimization with a dense \(V\). The user can always supply their own \(V\).
def normalization_dual(n, S):
    p = S.shape[0]
    D = duplication_matrix(p)
    D_inv = np.linalg.pinv(D)
    V = (2 / n) * ((D_inv @ np.kron(S, S)) @ D_inv.T)
    return V
Put everything together
The primal-dual relationship is \[\begin{equation*} \sigma = s - nV(u + v + w). \end{equation*}\]
SpLCM_BCD
SpLCM_BCD (X, lambda_=0.1, delta=0.01, epsilon=1e-08, V_0=None, u_0=None, v_0=None, w_0=None, max_iter=700, tol=0.0001)
Exported source
def SpLCM_BCD(X, lambda_=0.1, delta=0.01, epsilon=1e-8, V_0=None, u_0=None, v_0=None, w_0=None,
              max_iter=700, tol=0.0001):
    n = X.shape[0]
    p = X.shape[1]
    S = sample_covariance(X)
    s = vech(S)
    A, B = boolean_matricies(p)
    V = V_0
    if V_0 is None:
        V = normalization_dual(n, S)
        V = np.diag(np.diag(V))  # keep only the diagonal of V-hat for speed
    u = u_0
    v = v_0
    w = w_0
    if u_0 is None:
        u = np.zeros_like(s)
    if v_0 is None:
        v = np.zeros_like(s)
    if w_0 is None:
        w = np.zeros_like(s)
    # precompute the (perturbed) quadratic forms shared across iterations
    AVAT = A @ V @ A.T
    Q_u = 0.5 * n * (AVAT + AVAT.T)
    Q_u += epsilon * np.eye(Q_u.shape[0])
    BVBT = B @ V @ B.T
    Q_v = 0.5 * (BVBT + BVBT.T)
    Q_v += epsilon * np.eye(Q_v.shape[0])
    x = u + v + w
    for i in range(max_iter):
        # *_with_Q are variants of u_update / v_update that reuse the precomputed Q_u, Q_v
        u_temp = u_update_with_Q(v, w, A, V, s, n, Q_u, lambda_)
        v_temp = v_update_with_Q(u_temp, w, B, V, Q_v)
        w_temp = w_update(u_temp, v_temp, V, s, n, delta)
        x_temp = u_temp + v_temp + w_temp
        if np.linalg.norm(x_temp - x) < tol:
            break
        u = u_temp
        v = v_temp
        w = w_temp
        x = x_temp
    sigma = s - n * (V @ x)
    return vech_inverse(sigma)
Parameter Tuning
We now consider finding the optimal \(\lambda\) and \(\rho\) for the above algorithms. Due to runtime reasons, which will become clear in the simulation section, we give up on implementing parameter tuning for ADMM and focus on PGD. As already discussed in the paper6, the key quantity for judging parameter performance is the BIC criterion: \[ \text{BIC}(\lambda, \rho) = n\log \det\left(\widehat{\Sigma}_{\lambda,\rho}\right)+n\text{tr} \left(S\widehat{\Sigma}_{\lambda,\rho}^{-1}\right)+\log(n)\sum_{j<k}\mathbf{1}\{\widehat{\sigma}_{jk}\neq 0\}. \] It is not immediately justified that BIC is indeed the right choice, so we examine this empirically. We consider two alternative choices: \[\begin{align*} \text{AIC}(\lambda, \rho) &= n\log \det\left(\widehat{\Sigma}_{\lambda,\rho}\right)+n\text{tr} \left(S\widehat{\Sigma}_{\lambda,\rho}^{-1}\right)+2\sum_{j<k}\mathbf{1}\{\widehat{\sigma}_{jk}\neq 0\},\\ \text{EBIC}(\lambda, \rho) &= n\log \det\left(\widehat{\Sigma}_{\lambda,\rho}\right)+n\text{tr} \left(S\widehat{\Sigma}_{\lambda,\rho}^{-1}\right)+\log(n)\sum_{j<k}\mathbf{1}\{\widehat{\sigma}_{jk}\neq 0\}+2\gamma \log\binom{p(p-1)/2}{\sum_{j<k}\mathbf{1}\{\widehat{\sigma}_{jk}\neq 0\}}. \end{align*}\]
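A minimal sketch of evaluating the BIC criterion for one candidate estimate (toy inputs of our own choosing; slogdet is used for numerical stability):

```python
import numpy as np

def bic(Sigma_hat, S, n):
    # n*log det(Sigma_hat) + n*tr(S Sigma_hat^{-1}) + log(n) * #{nonzero off-diagonals, j < k}
    sign, logdet = np.linalg.slogdet(Sigma_hat)
    fit = n * logdet + n * np.trace(S @ np.linalg.inv(Sigma_hat))
    nnz = np.count_nonzero(Sigma_hat[np.triu_indices_from(Sigma_hat, k=1)])
    return fit + np.log(n) * nnz

S = np.array([[1.0, 0.2],
              [0.2, 1.5]])
Sigma_hat = np.diag(np.diag(S))  # a fully sparse candidate: off-diagonals zeroed
print(bic(Sigma_hat, S, n=100))
```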
In this section, our goal is to implement an automatic search wrapper for the PGD implementation above.
PGD with parameter searching
Parallelization
Acknowledgement: ChatGPT wrote the first version of the functions in this section.
The following are essentially PyTorch versions of our previous PGD implementation.
SpLCM_PGD_torch
SpLCM_PGD_torch (X, lambda_=0.1, lambda_i=1000, rho=0.1, delta=0.01, V_inverse_0=None, sigma_0=None, max_iter=700, tol=1e-05, device=device(type='cpu'))
Exported source
def SpLCM_PGD_torch(X, lambda_=0.1, lambda_i=1000, rho=0.1, delta=0.01,
                    V_inverse_0=None, sigma_0=None, max_iter=700, tol=1e-5,
                    device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    X_torch = to_tensor(X, device=device)
    n, p = X_torch.shape
    S = sample_covariance(X)
    S_torch = to_tensor(S, device=device)
    s = vech_torch(S_torch)
    diag_idx, off_idx = vech_idx_torch(p, device=device)
    if V_inverse_0 is None:
        Omega = clime(S, rho)
        V_inverse = normalization(n, Omega)
        V_inverse = torch.from_numpy(V_inverse).float().to(device)
    else:
        V_inverse = torch.from_numpy(V_inverse_0).float().to(device)
    V_inverse_op = torch.linalg.eigvalsh(V_inverse).max()
    eta = n / V_inverse_op
    if sigma_0 is None:
        sigma = s.clone()
    else:
        sigma = to_tensor(sigma_0, device=device)
    for i in range(max_iter):
        grad = (1.0 / n) * torch.matmul(V_inverse, (sigma - s))
        sigma_temp = sigma - eta * grad
        sigma_temp[off_idx] = soft_thresholding_torch(sigma_temp[off_idx], lambda_)
        sigma_temp[diag_idx] = s[diag_idx] + soft_thresholding_torch(sigma_temp[diag_idx] - s[diag_idx], lambda_i)
        Sigma_temp = vech_inverse_torch(sigma_temp)
        Sigma_temp = proj_pd_torch(Sigma_temp, delta)
        sigma_temp = vech_torch(Sigma_temp)
        if torch.norm(sigma_temp - sigma) < tol:
            break
        sigma = sigma_temp
    Sigma = vech_inverse_torch(sigma)
    return Sigma.cpu().numpy()

Searching with parameter evaluation
Acknowledgement: ChatGPT wrote the first version of this function.
We now automate the parameter searching for the PGD implementation.
SpLCM_PGD_tuned
SpLCM_PGD_tuned (X, lambda_grid, rho_grid, eval_func, lambda_i=1000, delta=0.01, max_iter=700, tol=1e-05, numerical_tol=0.001, gamma=0.5, early_stopping=True, eval_tol=100, device=device(type='cpu'))
Exported source
def SpLCM_PGD_tuned(X, lambda_grid, rho_grid, eval_func,
                    lambda_i=1000, delta=0.01, max_iter=700, tol=1e-5, numerical_tol=1e-3, gamma=0.5,
                    early_stopping=True, eval_tol=100,
                    device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    score_opt = None
    Sigma_opt = None
    score_traj = []
    Sigma_traj = []
    if not isinstance(lambda_grid, (list, tuple, set, np.ndarray)):
        lambda_grid = [lambda_grid]
    if not isinstance(rho_grid, (list, tuple, set, np.ndarray)):
        rho_grid = [rho_grid]
    for lambda_ in lambda_grid:
        for rho in rho_grid:
            params = dict(locals())
            params['lambda_0'] = lambda_
            params['rho_0'] = rho
            Sigma_hat, score = smart_call(eval_func, **params)
            score_traj.append(score)
            Sigma_traj.append(Sigma_hat)
            if score_opt is None:
                score_opt = score
                Sigma_opt = Sigma_hat
            elif score < score_opt:
                # early stopping: stop the first time the improvement
                # over the current best falls below eval_tol
                if early_stopping and score_opt - score <= eval_tol:
                    break
                score_opt = score
                Sigma_opt = Sigma_hat
    return Sigma_opt, score_traj, Sigma_traj

Information Criterion
BIC
We first implement the BIC evaluation for parameters as defined above.
def SpLCM_PGD_BIC_eval(X, lambda_0, rho_0, lambda_i=1000, delta=0.01, max_iter=700, tol=1e-5, numerical_tol=1e-3,
                       device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    S = sample_covariance(X)
    S = torch.from_numpy(S).float().to(device)
    n = X.shape[0]
    Sigma_hat = smart_call(SpLCM_PGD_torch_tensor, **locals())
    sign, logdet = torch.linalg.slogdet(Sigma_hat)
    Sigma_inv = torch.linalg.inv(Sigma_hat)
    preconditioning = torch.trace(S @ Sigma_inv)
    upper_mask = torch.triu(torch.ones_like(Sigma_hat), diagonal=1)
    sparsity = torch.count_nonzero(torch.abs(Sigma_hat * upper_mask) > numerical_tol)
    Sigma_hat_np = Sigma_hat.cpu().numpy()
    bic = (n * logdet + n * preconditioning + torch.log(torch.tensor(n, dtype=Sigma_hat.dtype, device=device)) * sparsity).cpu().numpy()
    return (Sigma_hat_np, bic)

We can combine this evaluation function with our tuned PGD function.
SpLCM_PGD_BIC
SpLCM_PGD_BIC (X, lambda_grid, rho_grid, lambda_i=1000, delta=0.01, max_iter=700, tol=1e-05, numerical_tol=0.001, early_stopping=True, eval_tol=100, device=device(type='cpu'))
Exported source
def SpLCM_PGD_BIC(X, lambda_grid, rho_grid,
                  lambda_i=1000, delta=0.01,
                  max_iter=700, tol=1e-5, numerical_tol=1e-3,
                  early_stopping=True, eval_tol=100,
                  device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    params = dict(locals())
    params['eval_func'] = SpLCM_PGD_BIC_eval
    Sigma_opt, score_traj, Sigma_traj = smart_call(SpLCM_PGD_tuned, **params)
    return Sigma_opt, score_traj, Sigma_traj

AIC
Acknowledgement: the implementation is produced by letting ChatGPT imitate the BIC implementation.
def SpLCM_PGD_AIC_eval(X, lambda_0, rho_0, lambda_i=1000, delta=0.01, max_iter=700, tol=1e-5, numerical_tol=1e-3,
                       device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    S = sample_covariance(X)
    S = torch.from_numpy(S).float().to(device)
    n = X.shape[0]
    Sigma_hat = smart_call(SpLCM_PGD_torch_tensor, **locals())
    sign, logdet = torch.linalg.slogdet(Sigma_hat)
    Sigma_inv = torch.linalg.inv(Sigma_hat)
    preconditioning = torch.trace(S @ Sigma_inv)
    upper_mask = torch.triu(torch.ones_like(Sigma_hat), diagonal=1)
    sparsity = torch.count_nonzero(torch.abs(Sigma_hat * upper_mask) > numerical_tol)
    Sigma_hat_np = Sigma_hat.cpu().numpy()
    aic = (n * logdet + n * preconditioning + 2 * sparsity).cpu().numpy()
    return (Sigma_hat_np, aic)

SpLCM_PGD_AIC
SpLCM_PGD_AIC (X, lambda_grid, rho_grid, lambda_i=1000, delta=0.01, max_iter=700, tol=1e-05, numerical_tol=0.001, early_stopping=True, eval_tol=100, device=device(type='cpu'))
Exported source
def SpLCM_PGD_AIC(X, lambda_grid, rho_grid,
                  lambda_i=1000, delta=0.01,
                  max_iter=700, tol=1e-5, numerical_tol=1e-3,
                  early_stopping=True, eval_tol=100,
                  device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    params = dict(locals())
    params['eval_func'] = SpLCM_PGD_AIC_eval
    Sigma_opt, score_traj, Sigma_traj = smart_call(SpLCM_PGD_tuned, **params)
    return Sigma_opt, score_traj, Sigma_traj

EBIC
Acknowledgement: the implementation is produced by letting ChatGPT imitate the BIC implementation.
def SpLCM_PGD_EBIC_eval(X, lambda_0, rho_0, lambda_i=1000, delta=0.01, max_iter=700, tol=1e-5, numerical_tol=1e-3, gamma=0.5,
                        device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    S = sample_covariance(X)
    S = torch.from_numpy(S).float().to(device)
    n = X.shape[0]
    p = S.shape[0]
    Sigma_hat = smart_call(SpLCM_PGD_torch_tensor, **locals())
    sign, logdet = torch.linalg.slogdet(Sigma_hat)
    Sigma_inv = torch.linalg.inv(Sigma_hat)
    preconditioning = torch.trace(S @ Sigma_inv)
    upper_mask = torch.triu(torch.ones_like(Sigma_hat), diagonal=1)
    sparsity = torch.count_nonzero(torch.abs(Sigma_hat * upper_mask) > numerical_tol).item()
    Sigma_hat_np = Sigma_hat.cpu().numpy()
    # comb expects an integer argument, so use integer division for p(p-1)/2
    log_comb = 0 if sparsity == 0 else math.log(comb(p * (p - 1) // 2, sparsity))
    ebic = (n * logdet + n * preconditioning + math.log(n) * sparsity + 2 * gamma * log_comb)
    return (Sigma_hat_np, ebic)

SpLCM_PGD_EBIC
SpLCM_PGD_EBIC (X, lambda_grid, rho_grid, lambda_i=1000, delta=0.01, max_iter=700, tol=1e-05, numerical_tol=0.001, gamma=0.5, early_stopping=True, eval_tol=100, device=device(type='cpu'))
Exported source
def SpLCM_PGD_EBIC(X, lambda_grid, rho_grid,
                   lambda_i=1000, delta=0.01,
                   max_iter=700, tol=1e-5, numerical_tol=1e-3, gamma=0.5,
                   early_stopping=True, eval_tol=100,
                   device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
    params = dict(locals())
    params['eval_func'] = SpLCM_PGD_EBIC_eval
    Sigma_opt, score_traj, Sigma_traj = smart_call(SpLCM_PGD_tuned, **params)
    return Sigma_opt, score_traj, Sigma_traj

Tools For Empirical Comparison
We now introduce several functions to generate genuinely sparse yet structured covariance matrices, in order to compare the ADMM, PGD, and dual BCD approaches.
Generating sparse matrices and their heat maps
Acknowledgement: ChatGPT wrote the first version of these functions.
Generating random almost diagonal sparse covariance matrices
This function generates an almost diagonal positive definite matrix with a small amount of nontrivial off-diagonal correlation. The probability that an arbitrary \((i,j)\)-entry in the upper triangular part (and, by symmetry, the lower triangular part) is nonzero is controlled independently by the sparsity parameter. By default we also consider a scaled covariance matrix, i.e. the diagonal entries are normalized to \(1\).
def sparse_covariance(p, sparsity=0.05, off_diag_range=(0.4, 0.9), delta=1e-6, signed=True, scaled=True):
    Sigma = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            if np.random.rand() < sparsity:
                val = np.random.uniform(*off_diag_range)
                if signed:
                    val *= np.random.choice([-1, 1])
                Sigma[i, j] = Sigma[j, i] = val
    Sigma = proj_pd(Sigma, delta)
    if scaled is True:
        D = np.sqrt(np.diag(Sigma))
        Sigma = Sigma / np.outer(D, D)
    return Sigma

Plot sparsity of matrices using the heat map
This function gives a visual impression of the sparsity of a covariance matrix via a heat map.
def plot_matrix_heatmap(M, threshold=0.0, vmax=None, cmap="Reds"):
    if isinstance(M, dict):
        if ("result" not in M) or ("time" not in M):
            raise Exception("Invalid input")
        M = M["result"]
    if isinstance(M, tuple):
        M = M[0]
    A = np.abs(M)
    if threshold is not None:
        A[A < threshold] = 0.0
    plt.figure(figsize=(6, 5))
    im = plt.imshow(A, cmap=cmap, vmin=0, vmax=vmax)
    plt.colorbar(im, fraction=0.045)
    plt.tight_layout()
    plt.show()

Performance metrics
Normed distance
One natural choice for measuring the accuracy of an estimator is its normed distance to the true quantity. Here we consider two norms, the Frobenius norm and the operator \(\mathcal{L}^{2}\) norm; these are the choices considered in the paper7. We essentially wrap the numpy.linalg.norm function so that the two metrics are returned together.
def normed_error(Sigma_hat, Sigma):
    diff = Sigma_hat - Sigma
    frob = np.linalg.norm(diff, ord='fro')
    l2 = np.linalg.norm(diff, ord=2)
    return frob, l2

Support recovery
In the sparse matrix setting, a performance metric that is easier to interpret is the support recovery rate. In the paper8, the authors consider two rates, two sides of the same coin: \[\begin{align} \text{True positive rate} &= \frac{\#\{(i, j) : \hat{\Sigma}_{ij} \ne 0,\ \Sigma_{ij} \ne 0\}}{\#\{(i, j) : \Sigma_{ij} \ne 0\}},\\ \text{False positive rate} &= \frac{\#\{(i, j) : \hat{\Sigma}_{ij} \ne 0,\ \Sigma_{ij} = 0\}}{\#\{(i, j) : \Sigma_{ij} = 0\}}. \end{align}\] In our implementation, we remove diagonal entries from every nonzero count in the above definition, because variances are always nonzero and our estimators explicitly project the diagonal, making those entries nonzero as well. In general, this implementation detail should not affect the relative performance ordering of the estimators.
def support_recovery_error(Sigma_hat, Sigma, tol=1e-3):
    off_diag = ~np.eye(Sigma_hat.shape[0], dtype=bool)
    mask_hat = (np.abs(Sigma_hat) > tol) & off_diag
    mask = (np.abs(Sigma) > tol) & off_diag
    true_positive = np.sum(mask_hat & mask)
    false_positive = np.sum(mask_hat & ~mask)
    false_negative = np.sum(~mask_hat & mask)
    tpr = true_positive / (true_positive + false_negative)
    fpr = false_positive / np.sum(~mask & off_diag)
    return tpr, fpr

All-in-one error evaluation
We now put the previous performance metrics into one function for simplicity.
error
error (result, Sigma, est_name, tol=0.001)
Exported source
def error(result, Sigma, est_name, tol=1e-3):
    Sigma_hat = result
    time = None
    if isinstance(result, dict):
        if ("result" not in result) or ("time" not in result):
            raise Exception("Invalid input")
        Sigma_hat = result["result"]
        time = result["time"]
    if isinstance(Sigma_hat, tuple):
        Sigma_hat = Sigma_hat[0]
    tpr, fpr = support_recovery_error(Sigma_hat, Sigma, tol=tol)
    frob, l2 = normed_error(Sigma_hat, Sigma)
    row = {
        'Estimator': est_name,
        'TPR': tpr,
        'FPR': fpr,
        'Frobenius': frob,
        'L2': l2,
    }
    if time is not None:
        row['Time'] = time
    return row

Comparing multiple estimators
We implement a function that takes multiple estimators and prints their errors row by row in a pandas DataFrame.
error_table
error_table (results, Sigma, tol=0.001)
Exported source
def error_table(results, Sigma, tol=1e-3):
    rows = []
    for est_name, result in results.items():
        row = error(result, Sigma, est_name, tol)
        rows.append(row)
    df = pd.DataFrame(rows)
    return df.set_index('Estimator')

Empirical Study
In this section, our goal is to evaluate the three optimization algorithms as approaches to the sparse covariance matrix estimation problem, and to assess whether BIC is a good information criterion for parameter tuning under high dimensional settings. We generate a \(60\times 60\) covariance matrix using the sparse covariance generating function defined previously. The heat map shows the sparsity pattern and the scaled diagonal, which is exactly why we want to match the diagonal of the estimator with that of the sample covariance matrix.
np.random.seed(616)
sparse_cov = sparse_covariance(60)
plot_matrix_heatmap(sparse_cov, threshold=1e-3)
We first plot a heat map of the sample covariance matrix. It is clearly overly noisy in the off-diagonal entries.
X1 = np.random.multivariate_normal(np.zeros(60), sparse_cov, size=30)
sample_cov_1 = sample_covariance(X1)
plot_matrix_heatmap(sample_cov_1, threshold=1e-3)
Comparison of ADMM, PGD, and BCD
We first run the ADMM based algorithm for the sparse linear model, with \(\lambda=0.7\) and \(\rho=0.1\). The maximum number of iterations is 700 by default, and the convergence criterion is \(\left\Vert\widehat{\sigma}^{(k+1)}-\widehat{\sigma}^{(k)}\right\Vert_{2}<10^{-5}\). Judging from the heat map, the result is almost identical to the sample covariance matrix. What's worse, it takes over 160 seconds to finish (under our single thread implementation).
admm_low_reg = SpLCM_ADMM_timed(X1, lambda_ = 0.7)
plot_matrix_heatmap(admm_low_reg, threshold=1e-3)
SpLCM_ADMM_timed ran in 164.5997s

We now choose \(\lambda=100\) to encourage sparsity. Under the same convergence criterion, we now indeed get a rather sparse and informative estimate. But again it takes around 150 seconds to run, and it has likely not converged, being capped at 700 iterations.
admm_high_reg = SpLCM_ADMM_timed(X1, lambda_ = 100)
plot_matrix_heatmap(admm_high_reg, threshold=1e-3)

SpLCM_ADMM_timed ran in 150.2583s

We now use PGD to perform the same estimation. We first set \(\lambda=0.005\) so that, analogously to the ADMM case, the estimate stays close to the sample covariance. It indeed does, in a much shorter time of around 8 seconds.
pgd_low_reg = SpLCM_PGD_timed(X1, lambda_ = 0.005)
plot_matrix_heatmap(pgd_low_reg, threshold=1e-3)

SpLCM_PGD_timed ran in 7.9843s

We now encourage sparsity by setting \(\lambda=0.025\), and we similarly recover, in roughly 5 seconds, a result that costs ADMM around 150 seconds.
pgd_high_reg = SpLCM_PGD_timed(X1, lambda_ = 0.025)
plot_matrix_heatmap(pgd_high_reg, threshold=1e-3)

SpLCM_PGD_timed ran in 4.5279s

We now use dual BCD for the covariance estimation. We can see it is significantly slower than PGD, but also significantly faster than ADMM, and it recovers the sparsity and overall structure of the covariance matrix well, even with a compromised (diagonal) normalizing matrix.
bcd = SpLCM_BCD_timed(X1)
plot_matrix_heatmap(bcd, threshold=1e-3)

SpLCM_BCD_timed ran in 35.1180s

The above are purely observations from the plots. We now look at the exact performance of these optimization algorithms and the estimators they produce.
opt_results = {
r"ADMM with $\lambda=0.7$": admm_low_reg,
r"ADMM with $\lambda=100$": admm_high_reg,
r"PGD with $\lambda=0.005$": pgd_low_reg,
r"PGD with $\lambda=0.025$": pgd_high_reg,
r"BCD with $\lambda=0.1$": bcd
}

From the table, we can see that from a normed distance perspective, the estimator produced by ADMM with \(\lambda=100\) is the best, followed by that produced by PGD with \(\lambda=0.025\). Their difference in normed distance is not that significant, but their difference in running time is. Turning to the support recovery error, the result is not very informative, as TPR and FPR appear highly correlated, which in general should not be the case. A potential reason is that we set too low a tolerance for counting nonzero entries, which is \(0.001\) by default. Recall that in our matrix generating function, any off-diagonal nonzero entry by default only takes values in \((0.4, 0.9)\). It is possible that estimators tend to take rather uncontrolled small values in these entries.
opt_table_low_tol = error_table(opt_results, sparse_cov)
display(opt_table_low_tol)

| Estimator | TPR | FPR | Frobenius | L2 | Time |
|---|---|---|---|---|---|
| ADMM with $\lambda=0.7$ | 0.988816 | 0.997131 | 11.707416 | 5.149339 | 164.599737 |
| ADMM with $\lambda=100$ | 0.970177 | 0.982783 | 4.876153 | 1.416867 | 150.258311 |
| PGD with $\lambda=0.005$ | 0.875116 | 0.882353 | 9.258855 | 3.512117 | 7.984298 |
| PGD with $\lambda=0.025$ | 0.404473 | 0.417504 | 5.275957 | 1.601727 | 4.527934 |
| BCD with $\lambda=0.1$ | 0.678472 | 0.687231 | 6.546633 | 2.369557 | 35.117952 |
With the above intuition, we look at the table again with the tolerance raised to \(0.05\). We can now see decoupled behaviour between TPR and FPR. Estimators with little regularization, such as ADMM with \(\lambda=0.7\), still have a very good TPR but also a very bad FPR, which is not mitigated by the higher tolerance. The overall best estimator is still ADMM with \(\lambda=100\), closely followed by PGD with \(\lambda=0.025\). Again, their difference in performance is not that significant, but their difference in running time is.
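The sensitivity to the counting tolerance can be seen on a toy example (the matrix below is made up for illustration): an off-diagonal entry of \(0.01\) counts as support at tolerance \(10^{-3}\) but not at \(0.05\).

```python
import numpy as np

# a hypothetical estimate with one strong and one weak off-diagonal entry
Sigma_hat = np.array([[1.0, 0.5, 0.01],
                      [0.5, 1.0, 0.0],
                      [0.01, 0.0, 1.0]])
iu = np.triu_indices(3, k=1)
# with a small tolerance the weak 0.01 entry counts as support ...
print(int(np.sum(np.abs(Sigma_hat[iu]) > 1e-3)))  # 2
# ... while with a larger tolerance it does not
print(int(np.sum(np.abs(Sigma_hat[iu]) > 0.05)))  # 1
```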
opt_table_high_tol = error_table(opt_results, sparse_cov, tol=0.05)
display(opt_table_high_tol)

| Estimator | TPR | FPR | Frobenius | L2 | Time |
|---|---|---|---|---|---|
| ADMM with $\lambda=0.7$ | 0.898649 | 0.781751 | 11.707416 | 5.149339 | 164.599737 |
| ADMM with $\lambda=100$ | 0.567568 | 0.159063 | 4.876153 | 1.416867 | 150.258311 |
| PGD with $\lambda=0.005$ | 0.804054 | 0.630703 | 9.258855 | 3.512117 | 7.984298 |
| PGD with $\lambda=0.025$ | 0.567568 | 0.183724 | 5.275957 | 1.601727 | 4.527934 |
| BCD with $\lambda=0.1$ | 0.736486 | 0.414303 | 6.546633 | 2.369557 | 35.117952 |
Information Criteria
We now consider the tuned version of PGD. Our parameter search space fixes \(\rho=0.1\) and takes 50 values of \(\lambda\) equally spaced on a log scale over \(\left[10^{-2}, 10^{-0.5}\right]\). We first run the parameter search with exact BIC optimization, i.e. without any early stopping criterion. We can see the estimator is very sparse, with a structural pattern matching the true covariance matrix.
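The grid itself is easy to sanity-check: `np.logspace(-2, -0.5, 50)` places 50 values between \(10^{-2}=0.01\) and \(10^{-0.5}\approx 0.316\), evenly spaced on a log scale, so consecutive ratios are constant:

```python
import numpy as np

grid = np.logspace(-2, -0.5, 50)
print(round(float(grid[0]), 6), round(float(grid[-1]), 6))  # 0.01 0.316228
# on a log-spaced grid, the ratio between consecutive values is constant
ratios = grid[1:] / grid[:-1]
print(bool(np.allclose(ratios, ratios[0])))  # True
```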
pgd_bic_unstopped = SpLCM_PGD_BIC_timed(X1, np.logspace(-2, -0.5, 50), 0.1, early_stopping=False)
plot_matrix_heatmap(pgd_bic_unstopped, threshold=1e-3)

SpLCM_PGD_BIC_timed ran in 128.2309s

We now run the BIC optimization with an early stopping criterion: stop the first time the BIC score drops by less than 100. The reason for this criterion is that BIC sometimes prefers a completely diagonal estimator, and early stopping can prevent this from happening. In our case the early stopped estimator matches the exact one, plausibly because we are searching over a dense enough grid of \(\lambda\).
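The stopping rule can be illustrated on a toy score sequence (the scores below are made up; the real search applies the same rule with threshold `eval_tol` to BIC scores along the \(\lambda\) grid):

```python
def first_stop(scores, eval_tol=100):
    """Return the index at which the search would stop: the first time
    a new best score improves on the previous best by at most eval_tol.
    If that never happens, return len(scores) (full sweep)."""
    best = scores[0]
    for i, s in enumerate(scores[1:], start=1):
        if s < best:
            if best - s <= eval_tol:
                return i
            best = s
    return len(scores)

# improvements are 500, 300, then 50 <= 100, so the search stops at index 3
print(first_stop([2000, 1500, 1200, 1150]))  # 3
```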
pgd_bic_stopped = SpLCM_PGD_BIC_timed(X1, np.logspace(-2, -0.5, 50), 0.1)
plot_matrix_heatmap(pgd_bic_stopped, threshold=1e-3)

SpLCM_PGD_BIC_timed ran in 129.7535s

We similarly run the AIC and EBIC optimization with the early stopping mentioned previously. They appear to produce exactly the same estimator as the BIC optimization.
pgd_aic_stopped = SpLCM_PGD_AIC_timed(X1, np.logspace(-2, -0.5, 50), 0.1)
plot_matrix_heatmap(pgd_aic_stopped, threshold=1e-3)

SpLCM_PGD_AIC_timed ran in 117.5695s

pgd_ebic_stopped = SpLCM_PGD_EBIC_timed(X1, np.logspace(-2, -0.5, 50), 0.1)
plot_matrix_heatmap(pgd_ebic_stopped, threshold=1e-3)

SpLCM_PGD_EBIC_timed ran in 123.6517s

We now compare the tuned estimators with previously untuned estimators under performance metrics.
results = {
r"ADMM with $\lambda=0.7$": admm_low_reg,
r"ADMM with $\lambda=100$": admm_high_reg,
r"PGD with $\lambda=0.005$": pgd_low_reg,
r"PGD with $\lambda=0.025$": pgd_high_reg,
r"BCD with $\lambda=0.1$": bcd,
"PGD with BIC without stopping": pgd_bic_unstopped,
"PGD with BIC with stopping": pgd_bic_stopped,
"PGD with AIC with stopping": pgd_aic_stopped,
"PGD with EBIC with stopping": pgd_ebic_stopped
}

We can see the tuned estimators are the best under the normed distance. They also have an extremely good FPR, but this comes with an extremely bad TPR under the sensitive tolerance. One thing worth noticing is that the tuned PGD still runs faster than ADMM, but this should not be taken too seriously: here we implemented a torch version of PGD with automatic parallelization, while the ADMM implementation is vanilla. It is generally accepted that ADMM benefits more from parallelization than PGD.
table_low_tol = error_table(results, sparse_cov)
display(table_low_tol)

| Estimator | TPR | FPR | Frobenius | L2 | Time |
|---|---|---|---|---|---|
| ADMM with $\lambda=0.7$ | 0.988816 | 0.997131 | 11.707416 | 5.149339 | 164.599737 |
| ADMM with $\lambda=100$ | 0.970177 | 0.982783 | 4.876153 | 1.416867 | 150.258311 |
| PGD with $\lambda=0.005$ | 0.875116 | 0.882353 | 9.258855 | 3.512117 | 7.984298 |
| PGD with $\lambda=0.025$ | 0.404473 | 0.417504 | 5.275957 | 1.601727 | 4.527934 |
| BCD with $\lambda=0.1$ | 0.678472 | 0.687231 | 6.546633 | 2.369557 | 35.117952 |
| PGD with BIC without stopping | 0.062442 | 0.011478 | 4.932425 | 1.424776 | 128.230863 |
| PGD with BIC with stopping | 0.062442 | 0.011478 | 4.932425 | 1.424776 | 129.753501 |
| PGD with AIC with stopping | 0.062442 | 0.011478 | 4.932425 | 1.424776 | 117.569497 |
| PGD with EBIC with stopping | 0.062442 | 0.011478 | 4.932425 | 1.424776 | 123.651720 |
Raising the tolerance to \(0.05\) gives a better picture of the tuned estimators. They have nearly no false positives and an acceptable true positive rate, recovering about 30 percent of the nonzero entries. In this regard, the tuned estimators can be considered quite good.
table_high_tol = error_table(results, sparse_cov, tol=0.05)
display(table_high_tol)

| Estimator | TPR | FPR | Frobenius | L2 | Time |
|---|---|---|---|---|---|
| ADMM with $\lambda=0.7$ | 0.898649 | 0.781751 | 11.707416 | 5.149339 | 164.599737 |
| ADMM with $\lambda=100$ | 0.567568 | 0.159063 | 4.876153 | 1.416867 | 150.258311 |
| PGD with $\lambda=0.005$ | 0.804054 | 0.630703 | 9.258855 | 3.512117 | 7.984298 |
| PGD with $\lambda=0.025$ | 0.567568 | 0.183724 | 5.275957 | 1.601727 | 4.527934 |
| BCD with $\lambda=0.1$ | 0.736486 | 0.414303 | 6.546633 | 2.369557 | 35.117952 |
| PGD with BIC without stopping | 0.324324 | 0.006165 | 4.932425 | 1.424776 | 128.230863 |
| PGD with BIC with stopping | 0.324324 | 0.006165 | 4.932425 | 1.424776 | 129.753501 |
| PGD with AIC with stopping | 0.324324 | 0.006165 | 4.932425 | 1.424776 | 117.569497 |
| PGD with EBIC with stopping | 0.324324 | 0.006165 | 4.932425 | 1.424776 | 123.651720 |
Concluding Remarks
In the above two sections, we compared estimators produced by different optimization algorithms, different parameter selections, and with or without tuning. The general conclusion is that under vanilla implementations, with parameter tuning, PGD achieves similar accuracy to ADMM with a much smaller running time. Therefore, for problem dimensions around or under 100 and no significant parallelization, PGD seems the better choice. BCD is less competitive in both accuracy and efficiency. Another conclusion is that BIC does produce meaningful tuning, especially when we want to keep false positives low while recovering a moderate amount of structure.
The open questions center around problems of much higher dimension, such as a \(10000\times 10000\) covariance matrix. It would be interesting to see whether ADMM with parallelization can significantly outperform PGD in both accuracy and efficiency, and whether BIC, AIC, and EBIC would behave differently in that situation, as currently there is no significant difference among the three information criteria.
Footnotes
R. Kim and I. Gaynanova, ‘A sparse linear model for positive definite estimation of covariance matrices’, arXiv [stat.ME], 11-Mar-2025.↩︎
T. Cai, W. Liu, and X. Luo, ‘A constrained ℓ1 minimization approach to sparse precision matrix estimation', J. Am. Stat. Assoc., vol. 106, no. 494, pp. 594–607, Jun. 2011.↩︎
T. Cai, W. Liu, and X. Luo, ‘A constrained ℓ1 minimization approach to sparse precision matrix estimation', J. Am. Stat. Assoc., vol. 106, no. 494, pp. 594–607, Jun. 2011.↩︎
R. Kim and I. Gaynanova, ‘A sparse linear model for positive definite estimation of covariance matrices’, arXiv [stat.ME], 11-Mar-2025.↩︎
J. Bien, ‘Graph-guided banding of the covariance matrix’, arXiv [stat.ME], 01-Jun-2016.↩︎
R. Kim and I. Gaynanova, ‘A sparse linear model for positive definite estimation of covariance matrices’, arXiv [stat.ME], 11-Mar-2025.↩︎
R. Kim and I. Gaynanova, ‘A sparse linear model for positive definite estimation of covariance matrices’, arXiv [stat.ME], 11-Mar-2025.↩︎
R. Kim and I. Gaynanova, ‘A sparse linear model for positive definite estimation of covariance matrices’, arXiv [stat.ME], 11-Mar-2025.↩︎