Monte Carlo Estimates of Log-Determinants

In statistics, we often need to compute a log-determinant as part of a multivariate maximum likelihood problem for non-independent data. The matrix for which we may need a log-determinant can be very large, though more often than not it is also very sparse. Direct methods for computing the determinant of a symmetric variance-covariance matrix typically use the Cholesky decomposition, or, in certain cases, an LU decomposition.

I have recently needed to tackle this problem for the estimation of some spatial econometric models. One good thing about the variance-covariance matrices in spatial econometric models is that they're sparse. Unfortunately, all of the sparse Cholesky routines that I am aware of are not suitable for inclusion in BSD-compatible projects like statsmodels and pysal. Furthermore, sparse methods (sometimes) rely on particular orderings of the matrices to achieve good performance; however, these orderings are not always attainable with spatial weighting matrices. Another good thing about working with spatial econometric models is that we often don't need exact estimates; approximations often do just fine. This led me to a 1999 paper by Barry and Pace in which they introduce a simple Monte Carlo method to approximate log-determinants of matrices, "A Monte Carlo Estimator of Log Determinants of Large Sparse Matrices" [pdf]. See also Chapter 4.4 in LeSage and Pace's Introduction to Spatial Econometrics.

To describe the problem in a little more detail, we need to calculate $\ln|I_n - \rho W|$ where $I_n$ is an $n\times n$ identity matrix, $\rho$ is a scalar in $(-1,1)$, and $W$ has been rescaled such that its eigenvalues are in $(-1,1)$. Direct methods would require the computation of this log-determinant for a number of different values of $\rho$. However, we can develop an MC approximation that avoids these recomputations by first noting that

$$\ln|I_n - \rho W| = \operatorname{tr}(\ln(I_n - \rho W))$$

where $\operatorname{tr}$ denotes the trace of the matrix. Next note that the matrix logarithm has the power series expansion

$$\ln(I_n - \rho W) = -\sum_{i=1}^{\infty}\frac{\rho^i W^i}{i}$$

Since the trace is a linear operation, we can write

$$\ln|I_n - \rho W| = -\sum_{i=1}^{\infty}\frac{\rho^i \operatorname{tr}(W^i)}{i}$$

Note that this will also hold for non-symmetric W with complex eigenvalues or for complex $\rho$. Assuming that truncation error is small, we can write a finite approximation of the above power series

$$\ln|I_n - \rho W| \approx -\sum_{i=1}^{m}\frac{\rho^i \operatorname{tr}(W^i)}{i}$$
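As a quick sanity check on the identity and the truncated series, here is a minimal sketch using a small random row-normalized matrix. The names W_small and r, the seed, and the number of terms are arbitrary choices for illustration; scipy.linalg.logm computes the matrix logarithm.

    import numpy as np
    from scipy.linalg import logm

    np.random.seed(0)
    W_small = np.random.rand(5, 5)
    W_small /= W_small.sum(axis=1)[:, None]  # row-normalize so eigenvalues lie in [-1, 1]
    r = 0.3
    A = np.eye(5) - r * W_small

    # ln|A| three ways: slogdet, tr(ln(A)), and the truncated power series (m = 30)
    print np.linalg.slogdet(A)[1]
    print np.trace(logm(A)).real
    print -sum(r**i * np.trace(np.linalg.matrix_power(W_small, i)) / i
               for i in range(1, 31))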

Computing $\operatorname{tr}(W^i)$ can be costly for larger values of $i$. Barry and Pace introduce the following "trick" to estimate the trace. Let the vector $\boldsymbol{x}$ come from a $N_n(0,I)$ distribution and assume we have some matrix $D$. Since $x_i^2\sim\chi^2_1$ so that $\mathbb{E}[x_i^2]=1$, and $\mathbb{E}[x_ix_j]=0$ for $i\neq j$ by independence, then for the quadratic form $\boldsymbol{x}^\prime D\boldsymbol{x}$ we have the following, using $n=2$ for illustration

$$\begin{align}\boldsymbol{x}^\prime D\boldsymbol{x} &=x_1^2d_{1,1} + x_2^2d_{2,2} + x_1x_2(d_{1,2} + d_{2,1}) \cr \mathbb{E}[\boldsymbol{x}^\prime D\boldsymbol{x}] &=\mathbb{E}[x_1^2d_{1,1} + x_2^2d_{2,2} + x_1x_2(d_{1,2} + d_{2,1})] \cr &= d_{1,1} + d_{2,2} \cr &= \operatorname{tr}(D) \end{align}$$
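A quick simulation illustrates this. This is a minimal sketch; D here is just an arbitrary small matrix and the number of draws is arbitrary.

    import numpy as np

    np.random.seed(12345)
    D = np.random.rand(4, 4)
    draws = np.random.randn(100000, 4)          # rows are independent N(0, I) vectors
    quad_forms = (draws.dot(D) * draws).sum(1)  # x'Dx for each draw
    print quad_forms.mean(), np.trace(D)        # the two should be close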

This result suggests the following algorithm, which I'll implement below.

 pick x_i = randn(n)
 set z = x_i
 for t = 1:m
     z = Wz
     T(t) = n*(x_i'z)/(t*x_i'x_i)    # estimate of tr(W^t)/t

Storing all of the estimates $T_t \approx \operatorname{tr}(W^t)/t$, we can use them to compute our truncated power series as

$$-a^{\prime}b$$

where $a$ holds the trace estimates and $b=[\rho, \rho^2, \dots, \rho^m]$. It's common to compute the estimate for a grid of values of $\rho$, so that the log determinant for all values of interest can be computed in a vectorized way, and then to use interpolation to get intermediate values as needed.

Notice that I used $x_i$ for the random vector above. The subscript is there because we can repeat this procedure with several independent draws and average the results to get a better estimate of the log determinant. Now let's look at the implementation.

In [43]:
import numpy as np
np.set_printoptions(suppress=True)

As an example, take a simple contiguity matrix.

In [44]:
W = np.zeros((7,7), dtype=float)
W[0,1] = 1.
W[-1, -2] = 1.
for i in range(5):
    W[i+1,i] = .5
    W[i+1, i+2] = .5
In [45]:
W
Out[45]:
array([[ 0. ,  1. ,  0. ,  0. ,  0. ,  0. ,  0. ],
       [ 0.5,  0. ,  0.5,  0. ,  0. ,  0. ,  0. ],
       [ 0. ,  0.5,  0. ,  0.5,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0.5,  0. ,  0.5,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0.5,  0. ,  0.5,  0. ],
       [ 0. ,  0. ,  0. ,  0. ,  0.5,  0. ,  0.5],
       [ 0. ,  0. ,  0. ,  0. ,  0. ,  1. ,  0. ]])

This is what is called a row-stochastic weight matrix in the spatial regression literature; that is, the rows all sum to 1. This guarantees that the eigenvalues have modulus at most 1, and here they lie in $[-1, 1]$.

In [46]:
np.linalg.eig(W)[0]
Out[46]:
array([-1.       , -0.8660254, -0.5      , -0.       ,  0.5      ,
        1.       ,  0.8660254])
In [47]:
rho = .25

Direct computation of $\ln|I_n - \rho W|$ yields

In [48]:
np.log(np.linalg.det(np.eye(W.shape[0]) - rho*W))
Out[48]:
-0.12829609729207081

Or better yet, use the more numerically stable slogdet.

In [49]:
np.linalg.slogdet(np.eye(W.shape[0]) - rho*W)
Out[49]:
(1.0, -0.12829609729207084)

Below is the implementation of the algorithm of Barry and Pace. I haven't done any timings or optimizations for it. I first need to see how it performs in practice. Notice also that it only takes a single $\rho$. This is easily updated to work with a grid for $\rho$, though I'm thinking about using a memoization pattern for the trace estimates in my own code, depending on speed, so I've left it as is here.

One addition to the algorithm above is given via the exact keyword. If $W$ is symmetric with zeros on the diagonal, as it often is in spatial econometric applications, then we can compute the first two moments exactly at small cost: $\operatorname{tr}(W)=0$ and $\operatorname{tr}(W^2)=\sum W^2$, where the square is taken elementwise.
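A small check of these two facts; B below is just a toy symmetric matrix with a zero diagonal.

    import numpy as np

    B = np.array([[0., .5, 0., 0.],
                  [.5, 0., .5, 0.],
                  [0., .5, 0., .5],
                  [0., 0., .5, 0.]])
    print np.trace(B)                        # tr(B) is 0
    print np.trace(B.dot(B)), np.sum(B**2)   # tr(B^2) equals the sum of squared elements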

In [50]:
from scipy import sparse

def logdet_mc(W, rho=1, seed=None, nterms=10, nrepl=100, exact=True):
    """
    Compute the approximate log-determinant of I - rho*W.
    
    Parameters
    ----------
    W : array or sparse
        An n x n array or a scipy.sparse matrix.
    rho : scalar
        The autoregressive coefficient.
    seed : int or None
        Seed passed to numpy.random.seed.
    nterms : int
        The number of terms for the power series estimate of the log determinant.
    nrepl : int
        The number of replications of the power series to estimate. The log
        determinant estimate is the average of these.
    exact : bool
        If True, use the exact first two moments. This is only valid if W is
        symmetric with zeros on the diagonal.
    """
    if seed is not None:
        np.random.seed(seed)       

    if hasattr(W, 'dot'): # then prefer it (for sparse) and newer numpy
        Wdot = W.dot
    else:
        Wdot = lambda x : np.dot(W, x)
        
    nobs = W.shape[0]
    if W.shape[1] != nobs:
        raise ValueError("W is not square.")
    
    allT = np.empty((nrepl, nterms))
    
    range_ = np.arange(1., nterms+1)    
    
    for j in range(nrepl):

        T = np.empty(nterms)
        # get the "seed"
        u1 = np.random.randn(nobs, 1)
        z_t = u1
        uTu = np.dot(u1.T, u1).item() # scalar
    
        # compute the terms in the expansion
        for i in range(1, nterms+1):
            z_t = Wdot(z_t)
            T[i-1] = np.dot(u1.T, z_t).item()
        
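        # scale so that T[i-1] holds this replication's estimate of tr(W^i) / i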
        T *= nobs/(range_*uTu)
        allT[j] = T
    
    
    if exact: # must be symmetric and zeros on diagonal
        if sparse.issparse(W):
            exactT = [0, np.sum(W.data**2)/2.]
        else:
            exactT = [0, np.sum(W**2)/2.]
        
        allT[:,:2] = exactT
    
    meanT = np.mean(allT, axis=0)
    
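    # b_i = -(rho**i); the minus sign comes from the power series expansion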
    b = -rho ** range_

    return np.dot(b, meanT)
In [51]:
rho = .75

Exact estimate

In [52]:
np.linalg.slogdet(np.eye(W.shape[0]) - rho * W)
Out[52]:
(1.0, -1.5261936420271165)

Monte Carlo estimate

In [53]:
logdet_mc(W, rho, seed=1234, nterms=30, exact=False, nrepl=1000)
Out[53]:
-1.5140194676583663
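As noted earlier, in practice it's common to evaluate the estimator over a grid of $\rho$ values and interpolate for intermediate values. Here is a minimal sketch of that idea using logdet_mc as defined above; the grid, seed, and replication count are arbitrary, and a more careful implementation would reuse the trace estimates across the grid rather than re-simulating them for each $\rho$.

    # evaluate the Monte Carlo log-determinant on a coarse grid of rho values
    rho_grid = np.linspace(-0.9, 0.9, 19)
    logdets = np.array([logdet_mc(W, r, seed=1234, nterms=30, exact=False, nrepl=100)
                        for r in rho_grid])

    # interpolate to approximate the log-determinant at intermediate values of rho
    print np.interp(0.75, rho_grid, logdets)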

(EDIT 8/14/2013) The estimate below using CHOLMOD is incorrect since $W$ is not symmetric. The Cholesky decomposition estimate is only valid for symmetric positive-definite matrices. See below for a correct estimate using the LU decomposition.

Estimate using CHOLMOD from scikits.sparse.

In [54]:
sparseW = sparse.csc_matrix(W)
In [55]:
from scikits.sparse import cholmod
In [56]:
f = cholmod.cholesky((sparse.eye(7, format="csc") - rho*sparseW))
f.logdet()
Out[56]:
-2.0152902872928142

(EDIT 8/14/2013 Added this section). We can use the LU decomposition to get the log-determinant of a non-symmetric (but nonsingular) matrix.

In [57]:
from scipy.sparse import linalg as sp_linalg
In [58]:
umfpack = sparse.linalg.umfpack.UmfpackContext()
L, U, P, Q, R, do_recip = umfpack.lu((sparse.eye(7, format="csc") - rho*sparseW))
print np.sum(np.log(U.diagonal()/R))
-1.52619364203

Pace and Barry Data

The following example uses the example data from Pace and Barry (1997) on 3,107 US Counties for the 1980 Presidential election. You can find the data and more information here.

In [59]:
import pandas as pd
from scipy import sparse
In [60]:
dta = pd.read_table("http://spatial-econometrics.com/data/elect.dat", sep=" *", header=None)[[4,5]]
dta.rename(columns={4 : "lat", 5 : "long"}, inplace=True)

Nearest Neighbors

Compute the 5 nearest neighbors using Euclidean distance. We use 5 because each point counts itself as a neighbor, and we want the 4 nearest neighbors.

In [61]:
from sklearn import neighbors
In [62]:
neigh = neighbors.NearestNeighbors(n_neighbors=5)
neigh.fit(dta.values)
Out[62]:
NearestNeighbors(algorithm='auto', leaf_size=30, n_neighbors=5, p=2,
         radius=1.0)
In [63]:
nobs = len(dta)
W = sparse.lil_matrix((nobs, nobs))
In [64]:
for i, row in dta.iterrows():
    closest = neigh.kneighbors(row.values, return_distance=False)
    for j in closest[0]:
        if j == i: # don't count own observation
            continue
        W[i, j] = 1.

Make row stochastic

In [65]:
W = W / 4.
assert all(W.sum(1) == 1)

Convert to CSC format for calculations

In [66]:
W = W.tocsc()

Set $\rho$

In [67]:
rho = .1

Calculate the exact log determinant.

In [68]:
np.linalg.slogdet(np.eye(W.shape[0]) - rho*W.toarray())
Out[68]:
(1.0, -3.346557139406221)

Get the Monte Carlo approximation of the log determinant.

In [94]:
logdet_mc(W, rho, seed=124, nterms=10, exact=False, nrepl=2500)
Out[94]:
-3.2609283583388167

(EDIT 8/14/2013 This is wrong. See above note. LU decomposition added below.) Compute the log determinant using CHOLMOD and scikits.sparse again.

In [70]:
f = cholmod.cholesky(sparse.eye(W.shape[0], format="csc") - rho * W)
In [71]:
f.logdet()
Out[71]:
-3.9400025132708647

(EDIT 8/14/2013 Add LU decomposition.)

In [73]:
L, U, P, Q, R, do_recip = umfpack.lu((sparse.eye(W.shape[0], format="csc") - rho*W))
print np.sum(np.log(U.diagonal()/R))
-3.34655713941
