Topic Modeling for Economics:
Latent Dirichlet Allocation and Beyond

Skipper Seabold
February 10, 2012

http://jseabold.net/presentations/topicmodelintro.html

The Problem

What are all these texts about?

Which of these texts contain discussions of market phenomena?

Sources of Electronic Texts

Optical Character Recognition

The Solution: Latent Dirichlet Allocation

Blei, David M.; Ng, Andrew Y.; and Jordan, Michael I. (2003)
     "Latent Dirichlet Allocation." Journal of Machine Learning
     Research. 3, 993-1022.

Applications of LDA

Practical LDA Overview

Parse documents (Remove punctuation, stemming, stop words)

Create a vocabulary (top n words by tf-idf)

Turn documents into a numerical representation

Run LDA

Inference and Exploration (maybe repeat the process)
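
The first three steps above can be sketched in a few lines of standard-library Python. This is only a minimal illustration under stated assumptions: the toy documents, the stop-word list, the helper names, and the particular tf-idf score (corpus term frequency times log inverse document frequency) are all hypothetical choices, and stemming is left out.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
DOCS = [
    "The pressure of the fluid upon the valve assists the spring.",
    "The velocity of the bullet depends on the charge of powder.",
    "The fluid pressure and the density of the fluid were measured.",
]
STOP_WORDS = {"the", "of", "and", "on", "upon", "were"}


def tokenize(text):
    """Lowercase, strip punctuation, and drop stop words (no stemming here)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_vocabulary(token_lists, n):
    """Keep the n terms with the highest corpus-level tf-idf score."""
    num_docs = len(token_lists)
    term_freq = Counter(w for toks in token_lists for w in toks)
    doc_freq = Counter(w for toks in token_lists for w in set(toks))
    scores = {w: term_freq[w] * math.log(num_docs / doc_freq[w]) for w in term_freq}
    # Sort by score (descending), then alphabetically so ties are deterministic.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return {w: i for i, w in enumerate(ranked[:n])}


def to_bag_of_words(tokens, vocab):
    """Represent a document as sorted (term id, count) pairs."""
    counts = Counter(vocab[w] for w in tokens if w in vocab)
    return sorted(counts.items())


token_lists = [tokenize(d) for d in DOCS]
vocab = build_vocabulary(token_lists, n=5)
corpus = [to_bag_of_words(toks, vocab) for toks in token_lists]
```

The resulting `corpus` of (term id, count) pairs is the kind of numerical representation that an LDA implementation would take as input.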

Example 1: The Human Test

topic 016     topic 018     topic 036      topic 089       topic 093     topic 097
area          equation      plane          equation        curve         fluid
gravity       value         axis           function        equation      pressure
vertical      roots         axes           integral        cubic         particles
inclination   coefficients  circle         differential    conic         value
centre        symbol        parallel       variable        tangent       theory
solid         theorem       perpendicular  elements        cayley        mass
floating      square        section        condition       intersection  equation
horizontal    sign          centre         coefficients    memoir        attraction
stability     function      curve          problem         condition     density
ordinates     multiplied    equation       arbitrary       plane         velocity
section       factor        intersection   professor       value         equilibrium
ship          integral      sphere         value           contact       solid
volume        formula       tangent        partial         reciprocal    elastic
axis          sin           prop           satisfied       coordinates   axis
construction  substituting  base           transformation  writing       molecules

Example 1: The Human Test

Hirst, T.A. (1863) "On the Volumes of Pedal Surfaces."
     Philosophical Transactions of the Royal Society of London.
     153, 13-32.

Example 2: The Human Test

Thompson, B. (1781) "New Experiments upon
    Gun-Powder, with Occasional Observations and
    Practical Inferences; To Which are Added, an Account of
    a New Method of Determining the Velocities of All
    Kinds of Military Projectiles, and the Description of a
    Very Accurate Eprouvette for Gun-Powder." Philosophical
     Transactions of the Royal Society. 71, 229-328.

Example 2: The Human Test

Explosives   Guns        Metalworking  Metallurgy  Chemistry
topic 031    topic 032   topic 050     topic 058   topic 078
explosion    velocity    iron          metal       sulphuric
wire         resistance  strength      copper      antimony
vanadium     friction    bars          gold        melted
exploded     powder      resistance    silver      powder
flame        fluid       square        iron        liquid
cane         machine     beam          alloys      metal
fulminating  barrel      ends          conducting  phosphorus
guncotton    elastic     cast          wire        ozone
detonation   cylinder    cylinder      resin       luminous
burning      bullet      breaking      zinc        deposited
powder       raised      steel         pure        acid
ignited      charge      compressed    bismuth     flame
combustion   center      section       gravity     iron
cannes       bore        strain        palladium   shining

History

LDA: Terms and Notation

$K$ The number of topics
$V$ The size of the vocabulary
$D$ The number of documents
$N_{d}$ The number of words in a document $d$
$\alpha$ Hyperparameter for document's topic mixture components, $K$-vector or scalar if symmetric
$\eta$ Hyperparameter for topic's word mixture components, $V$-vector or scalar if symmetric
$\theta_d$ Document-level topic proportions; distribution over topic indices $1,\dots,K$
$\beta_k$ Topic-level word proportions; distribution over the $V$ terms of the vocabulary
$w_{d,n}$ Term indicator for word $n$ in document $d$
$z_{d,n}$ Mixture indicator for the topic of word $n$ in document $d$

LDA: The Generative Model
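
Using the notation above, the generative process that LDA assumes can be simulated directly: draw each topic $\beta_k$ from a Dirichlet with parameter $\eta$, draw each document's proportions $\theta_d$ from a Dirichlet with parameter $\alpha$, then for every word position draw a topic $z_{d,n}$ from $\theta_d$ and a term $w_{d,n}$ from $\beta_{z_{d,n}}$. The sketch below assumes symmetric scalar hyperparameters; the function names and the toy sizes are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(K, V, doc_lengths, alpha, eta, seed=0):
    """Simulate LDA: returns topics beta, proportions theta, assignments z, words w."""
    rng = random.Random(seed)
    # One distribution over the V vocabulary terms per topic.
    beta = [sample_dirichlet([eta] * V, rng) for _ in range(K)]
    theta, z, w = [], [], []
    for n_d in doc_lengths:
        theta_d = sample_dirichlet([alpha] * K, rng)
        # Topic assignment for each word position, then the word itself.
        z_d = rng.choices(range(K), weights=theta_d, k=n_d)
        w_d = [rng.choices(range(V), weights=beta[k])[0] for k in z_d]
        theta.append(theta_d)
        z.append(z_d)
        w.append(w_d)
    return beta, theta, z, w


beta, theta, z, w = generate_corpus(K=3, V=10, doc_lengths=[8, 5], alpha=0.5, eta=0.1)
```

Inference runs this story in reverse: only `w` is observed, and `beta`, `theta`, and `z` have to be recovered.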

Example Text

The pressure 47 of this fluid 23 upon the valve 561 assists 737 the action of the spring, 255 by which means the valve 561 is more expeditiously 7271 and more effectually 2319 closed. The valve 561 was very accurately fitted 1082 to the aperture 461 by grind- 1927 ing 2681 them together with powdered 195 emery, 4738 and afterwards po- lishing 4696 them one upon the other. And it is very certain, that no part of the elastic 295 fluid 23 made its escape 695 by this vent; 2079 for, upon firing the piece, there was only a simple flash 2351 from the explosion 431 of the priming, 1234 and no stream 484 of fire was to be seen issuing 1172 from the vent, 2079 as is always to be observed when a com- mon vent 2079 is made use of, and in all other cases where this fluid 23 finds a passage.

Statistics Background: Multinomial Distribution

$$p(\vec{n} | \vec{p}, N) = \binom{N}{\vec{n}}\prod_{k=1}^{K}p_{k}^{n^{(k)}} \equiv Mult(\vec{n} | \vec{p}, N)$$

where

$\binom{N}{\vec{n}} = \frac{N!}{\prod_{k}n^{(k)}!}$ and $\sum_{k}p_{k} = 1$; $\sum_{k}n^{(k)} = N$
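
As a quick numerical check of the formula, the probability mass can be computed with the standard library alone. The function name and the example counts are illustrative choices, not part of the slides.

```python
import math


def multinomial_pmf(counts, probs):
    """Probability of observing `counts` in N = sum(counts) draws."""
    n_total = sum(counts)
    # Multinomial coefficient N! / prod_k n_k!, kept as an exact integer.
    coefficient = math.factorial(n_total)
    for n_k in counts:
        coefficient //= math.factorial(n_k)
    likelihood = math.prod(p ** n_k for p, n_k in zip(probs, counts))
    return coefficient * likelihood


# Four draws over three categories with probabilities 0.5, 0.3, 0.2.
value = multinomial_pmf([2, 1, 1], [0.5, 0.3, 0.2])
```

Here the coefficient is $4!/(2!\,1!\,1!) = 12$ and the product term is $0.5^2 \cdot 0.3 \cdot 0.2 = 0.015$, so the probability is $0.18$.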

Statistics Background: Dirichlet Distribution

$$p(\vec{\theta}|\vec{\alpha}) = Dir(\vec{\theta}|\vec{\alpha}) = \frac{\Gamma\left(\sum_{k=1}^K\alpha_{k}\right)}{\prod_{k=1}^K\Gamma\left(\alpha_k\right)}\prod_{k=1}^K\theta_{k}^{\alpha_{k}-1}$$
where $\sum_{k}\theta_{k}=1$; $\theta_{k}>0$
and $\Gamma(x)$ is the Gamma function. The parameter $\vec{\alpha}$ controls the mean and sparsity of $\vec{\theta}$.
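
The effect of the parameter on sparsity can be seen by sampling. The sketch below evaluates the log density and compares the average size of the largest component under a small and a large symmetric parameter; the helper names, the number of draws, and the two parameter values are arbitrary assumptions made for illustration.

```python
import math
import random


def dirichlet_log_density(theta, alpha):
    """Log of Dir(theta | alpha) for theta on the simplex and alpha_k > 0."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def mean_largest_component(alpha, num_draws=2000, seed=0):
    """Average of max_k theta_k over many draws; near 1 means sparse draws."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, rng)) for _ in range(num_draws)) / num_draws


sparse = mean_largest_component([0.1] * 5)   # mass piles onto one component
smooth = mean_largest_component([10.0] * 5)  # draws stay close to uniform
```

With a small parameter most of the mass lands on a single component, which is the behaviour wanted when each document should be about only a few topics.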

Statistics Background: Graphical Models

$$p(y,x_1,\cdots,x_N)=p(y)\prod_{n=1}^{N}p\left(x_n|y\right)$$
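
This factorization says that, once $y$ is known, the $x_n$ are independent, so the joint probability is a product of small local terms. A minimal sketch with a hypothetical two-class model over a three-symbol alphabet (all names and numbers below are made up for illustration):

```python
import math


def log_joint(y, xs, prior, conditional):
    """log p(y, x_1..x_N) = log p(y) + sum_n log p(x_n | y)."""
    return math.log(prior[y]) + sum(math.log(conditional[y][x]) for x in xs)


# Hypothetical model: p(y) and p(x | y) as lookup tables.
prior = {"a": 0.6, "b": 0.4}
conditional = {
    "a": {0: 0.7, 1: 0.2, 2: 0.1},
    "b": {0: 0.1, 1: 0.3, 2: 0.6},
}

observations = [0, 0, 2]
joint = {y: math.exp(log_joint(y, observations, prior, conditional)) for y in prior}
# Normalizing the joint over y gives the posterior p(y | x_1..x_N).
evidence = sum(joint.values())
posterior = {y: joint[y] / evidence for y in joint}
```

The same pattern, with more layers of hidden variables, is what the LDA graphical model encodes.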

LDA Graphical Model

$$p(w, \theta, \beta, z | \alpha, \eta) = \prod_{k=1}^{K}p\left(\beta_{k}|\eta\right)\prod_{d=1}^{D}p\left(\theta_{d}|\alpha\right)\left[\prod_{n=1}^{N_{d}}p\left(z_{d,n}|\theta_{d}\right)p\left(w_{d,n}|z_{d,n},\beta_{1:K}\right)\right]$$

LDA Inference

The Posterior Distribution

The goal of inference is to obtain the joint posterior distribution of the hidden variables ($\theta$, $\beta$, $z$) conditional on the observations $w$.
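
Written out with Bayes' rule (a standard identity, stated here in the notation of the joint distribution from the graphical model), the posterior is that joint divided by the marginal likelihood of the words:

$$p(\theta, \beta, z | w, \alpha, \eta) = \frac{p(w, \theta, \beta, z | \alpha, \eta)}{p(w | \alpha, \eta)}$$

The denominator requires integrating over $\theta$ and $\beta$ and summing over every possible assignment $z$, which cannot be computed exactly for realistic corpora. This is why approximate methods are used in practice.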

Approximate Inference

Existing methods:

Batch Variational Bayes

Variational Bayes Overview

Variational Distributions

Batch VB Algorithm

    Initialize $\lambda$ randomly.
    while relative improvement in $\mathcal{L}(w,\phi,\gamma,\lambda) > 10^{-6}$ (corpus tolerance) do
      E step:
      for d=1 to D do
        Initialize $\gamma_{d,k} = 1$
        repeat
          $\phi_{d,w,k} :\propto \exp\left\{\mathbb{E}_q[\log\theta_{d,k}] + \mathbb{E}_q[\log\beta_{k,w}]\right\}$ (update topic assignment)
          $\gamma_{d,k} := \alpha + \sum_w\phi_{d,w,k}n_{d,w}$ (update topic proportions)
        until $\frac{1}{K}\sum_{k}|\Delta\gamma_{d,k}| < 10^{-6}$ (per-document tolerance)
      end for
      M step:
      $\lambda_{k,w} := \eta + \sum_dn_{d,w}\phi_{d,w,k}$ (update topics with aggregated per-document stats)
    end while
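
The loop above can be turned into a small standard-library Python sketch. Several things here are assumptions rather than part of the slides: the outer loop stops when $\lambda$ stops changing (or after `max_passes`) instead of monitoring the bound $\mathcal{L}$, the `digamma` helper is a hand-rolled approximation used to evaluate the Dirichlet expectations $\mathbb{E}_q[\log\theta_{d,k}]$ and $\mathbb{E}_q[\log\beta_{k,w}]$, and all function names, defaults, and the toy corpus are hypothetical.

```python
import math
import random


def digamma(x):
    """Approximate digamma for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def expected_log(params):
    """E_q[log x_i] when x ~ Dirichlet(params)."""
    total = digamma(sum(params))
    return [digamma(p) - total for p in params]


def normalize_exp(logits):
    """Exponentiate and normalize, subtracting the max for stability."""
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    norm = sum(weights)
    return [x / norm for x in weights]


def batch_vb(corpus, K, V, alpha=0.1, eta=0.01, tol=1e-6, max_passes=100, seed=0):
    """corpus: one dict per document mapping term id -> count n_dw."""
    rng = random.Random(seed)
    lam = [[rng.gammavariate(100.0, 0.01) for _ in range(V)] for _ in range(K)]
    gamma = [[1.0] * K for _ in corpus]
    for _ in range(max_passes):
        e_log_beta = [expected_log(row) for row in lam]
        stats = [[0.0] * V for _ in range(K)]  # accumulates n_dw * phi_dwk
        # E step: fit phi and gamma for each document with lambda held fixed.
        for d, doc in enumerate(corpus):
            gamma_d = [1.0] * K
            for _ in range(200):  # safety cap on the inner loop (an assumption)
                e_log_theta = expected_log(gamma_d)
                phi = {
                    w: normalize_exp([e_log_theta[k] + e_log_beta[k][w] for k in range(K)])
                    for w in doc
                }
                new_gamma = [
                    alpha + sum(n * phi[w][k] for w, n in doc.items()) for k in range(K)
                ]
                change = sum(abs(a - b) for a, b in zip(new_gamma, gamma_d)) / K
                gamma_d = new_gamma
                if change < tol:
                    break
            gamma[d] = gamma_d
            for w, n in doc.items():
                for k in range(K):
                    stats[k][w] += n * phi[w][k]
        # M step: update topics from the aggregated per-document statistics.
        new_lam = [[eta + stats[k][w] for w in range(V)] for k in range(K)]
        shift = sum(abs(a - b) for r1, r2 in zip(new_lam, lam) for a, b in zip(r1, r2))
        lam = new_lam
        if shift / (K * V) < tol:
            break
    return gamma, lam


toy_corpus = [{0: 3, 1: 2}, {2: 4, 3: 1}, {0: 1, 1: 1, 2: 1}]
gamma, lam = batch_vb(toy_corpus, K=2, V=4)
```

Normalizing each row of `lam` gives an estimate of the topic-word distributions, and normalizing each row of `gamma` gives the estimated topic proportions per document.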

LDA Implementations

Shortcomings of LDA

Extensions: CTM

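Correlated Topic Model: Topic proportions are assumed to come from a logistic normal distribution with mean $\mu$ and covariance $\Sigma$, which allows topics to be correlated within a document.
For each document of length $N_{d}$ (assuming topics are known and fixed), a Gaussian draw $\eta$ (here a per-document vector, not the hyperparameter defined earlier) is mapped to topic proportions by the function $f:\mathbb{R}^K \rightarrow (K-1)$-simplex $$f(\eta_i) = \frac{\exp(\eta_i)}{\sum_{j}\exp(\eta_j)}$$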
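
In the logistic normal construction, a Gaussian vector with mean $\mu$ and covariance $\Sigma$ is passed through $f$ to produce a point on the simplex. The sketch below assumes the covariance is supplied as a lower-triangular Cholesky factor; the function names and the particular factor are hypothetical, chosen so that the first two topics are positively correlated.

```python
import math
import random


def softmax(eta):
    """Map a real vector onto the simplex: f(eta_i) = exp(eta_i) / sum_j exp(eta_j)."""
    top = max(eta)  # subtracting the max leaves f unchanged and avoids overflow
    weights = [math.exp(v - top) for v in eta]
    total = sum(weights)
    return [x / total for x in weights]


def sample_logistic_normal(mu, chol, rng):
    """Draw eta ~ N(mu, Sigma) with Sigma = chol * chol^T, then return f(eta)."""
    noise = [rng.gauss(0.0, 1.0) for _ in mu]
    eta = [m + sum(chol[i][j] * noise[j] for j in range(i + 1)) for i, m in enumerate(mu)]
    return softmax(eta)


rng = random.Random(0)
mu = [0.0, 0.0, 0.0]
# Hypothetical Cholesky factor: unit variances, covariance 0.9 between topics 0 and 1.
chol = [[1.0, 0.0, 0.0], [0.9, 0.436, 0.0], [0.0, 0.0, 1.0]]
theta = sample_logistic_normal(mu, chol, rng)
```

A Dirichlet cannot express this kind of positive dependence between components, which is the motivation for replacing it with the logistic normal.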

Extensions: DTM

Dynamic Topic Model: The topics at time $t$ are assumed to differ from, but be correlated with, the topics at time $t-1$
For the documents at each time slice $t$

Model Performance

Applying LDA: Patents and Innovation

Testing the Alternative Hypothesis

New Approach: Determine whether country-specific intellectual activity determined innovations

These Slides