<p><em>Introduction to statistical causal inference. Ben Lansdell, 2018-04-03. benlansdell.github.io/statistics/causalinference</em></p>
<h1 id="but-what-is-causality">But what is causality?</h1>
<p>Say we observe a negative relationship between number of apples eaten
per day and heart disease. Does this relationship mean that apples are
protective against disease? Maybe. It is well known that correlation
does not imply causation. Perhaps in this case number of apples eaten
per day correlates with general diet, or general fitness, which instead
are the cause of lower heart disease. Such factors are a source of
confounding. How do we distinguish between these possibilities? A
statistician answers these causal inference questions in two ways: by
considering counterfactuals and interventions.</p>
<p>A counterfactual is simply a potential event that did not occur. A given
patient either does or does not receive the treatment on a given trial.
Whichever event does not occur is the counterfactual. Under a
counterfactual account of causality to claim that a proposed treatment
causes disease remission is to claim that had the patient not received
the treatment then the disease outcome would be different (greater).</p>
<p>Interventionist accounts are similar but focus on the notion of
manipulability. Here to claim that one variable causes another is to
claim that if through intervention one variable is forced to a given
state then a change in the other variable will be observed. Here the
notion of intervention is treated as a primitive and causal
relationships are derived from that.</p>
<p>Thus, had a given patient <em>not</em> had so many apples per day, would their
health be worse? And, if a patient was <em>forced</em> to eat many apples per
day, would their health be better? Here we focus on frameworks that
attempt to answer these questions in the presence of confounding. The
basic idea is that if we observe all the factors that could reasonably
confound the estimate, then we can correct for them.</p>
<h1 id="learning-causal-relationships">Learning causal relationships</h1>
<p>Randomized controlled trials (RCTs) are the gold standard for causal
inference. The idea simply being that if assignment to a treatment group
is randomized then the distribution of covariates in the control and
treatment groups will be identical, and therefore any difference in
outcome between the control and treatment groups can then only be
attributed to the fact that one group received a treatment while the
other did not.</p>
<p>However, sometimes RCTs are difficult, expensive, or unethical to
perform. This motivates considering when causal relationships can be
inferred from observational data alone. In the absence of randomization,
receiving treatment may be correlated with many other factors which
could also impact the outcome. What are conditions in which the effects
of confounding can be mitigated?</p>
<p>Counterfactual outcomes are not observed for individual patients – they
either receive a treatment or do not. This is known as the fundamental
problem of causal inference. As a result often we need to (or in fact
want to) consider aggregate causal effects estimated over a population.
This has two consequences for analysis.</p>
<p>The first is that in considering causal relationships in this aggregate
sense, the timing of pairs of events is often unspecified, vaguely
defined, or implicit in how data is collected. By losing this timing
information it is harder to analyze cases of mutual causation. Thus the
assumption made here is that one variable is the cause of another, or
vice versa, or not at all – there is a directedness to the relationship
over the time window in which observations are made. In addition to
excluding mutual causation from consideration, it is simplest to further
exclude causal loops (e.g. $A \to B \to C \to A$). The
second consequence is that, by considering a population of
subjects/events, it becomes more necessary to allow for probabilistic
causal relationships, in which one variable’s occurrence affects
another’s probability of occurring and the relationship need in no way
be deterministic. These considerations motivate summarizing causal
relationships between a set of variables using directed acyclic graphs
(DAGs), and using a probabilistic framework.</p>
<h1 id="counterfactuals-the-causal-effect-as-difference-in-potential-outcomes">Counterfactuals: the causal effect as difference in potential outcomes</h1>
<p>Measuring causal effects in terms of counterfactuals is a relatively old
idea (as far as statistics goes), dating back to 1923 from work of
Neyman. The Neyman-Rubin causal model provides a framework for reasoning
about causal effects with counterfactuals. In a simple setting, the
model considers two <em>potential outcomes</em>: an outcome when a subject does
receive a treatment, $Y(1)$, and an outcome when a subject does not
receive a treatment, $Y(0)$ (i.e. a control subject). For a given
subject, $i$, the <em>causal effect</em> is the difference in potential
outcomes:
<script type="math/tex">\begin{align}E_i = Y_i(1)-Y_i(0).\end{align}</script></p>
<p>If we let $W_i$ be a treatment random variable then assuming consistency between potential and
observed outcome, $Y_i$, we have:
<script type="math/tex">\begin{align}\label{eq:consistent}
Y_i = W_iY_i(1) + (1-W_i)Y_i(0).\end{align}</script></p>
<p>As an aside, note that the potential outcomes $Y(0)$ and $Y(1)$ are
treated as kinds of hypothetical random variables. In a sense neither is
observed, and they are only related to observation through the
consistency assumption above. This is a somewhat subtle point that is perhaps not well
reflected in the notation. Equations in causal models can have quite
different interpretations to standard statistical models, despite having
similar notation, which is important to be aware of.</p>
<p>Per the <em>fundamental problem of causal inference</em>, only one of these
potential outcomes is ever observed. To get around this, causal effects
can be measured over a population of subjects, some of which receive the
treatment and some of which do not. Over a population we can consider
the <em>average causal effect</em>:
<script type="math/tex">\begin{align}\tau = \mathbb{E}(Y(1)-Y(0)).\end{align}</script></p>
<p>If $W_i$ is assigned to each subject at random then $\tau$ can be
computed directly from the treatment and control subpopulation means. In
randomized cases, $W_i$ is independent from the potential outcomes. If
$W_i$ were not independent from the potential outcomes then the measured
causal effect (difference in means) could simply be a result of this
correlation.</p>
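<p>As a quick numerical illustration (a pure-Python sketch with made-up numbers, not from the text): generate both potential outcomes for every subject, randomize assignment, and observe that the difference in group means recovers the average causal effect.</p>

```python
import random

random.seed(0)

def simulate(n=100_000, tau=2.0):
    """Randomized assignment: the difference in observed group means
    estimates the average causal effect tau."""
    treated, control = [], []
    for _ in range(n):
        y0 = random.gauss(0, 1)    # potential outcome without treatment
        y1 = y0 + tau              # potential outcome with treatment
        if random.random() < 0.5:  # randomized treatment assignment
            treated.append(y1)     # consistency: observe Y(1) if treated
        else:
            control.append(y0)     # ... and Y(0) otherwise
    return sum(treated) / len(treated) - sum(control) / len(control)

print(round(simulate(), 1))  # close to tau = 2.0
```

<p>If assignment instead depended on the potential outcomes, the same difference-in-means estimator would be biased.</p>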
<h2 id="causal-assumptions-for-identifiability">Causal assumptions for identifiability</h2>
<p>Being able to measure a causal effect in an unbiased (unconfounded) way
means the effect is <em>identifiable</em>. Within this counterfactual
framework, this linking of potential outcomes to causal effects relies
on four <em>causal assumptions</em>. Some of these have been alluded to above.
They are:</p>
<ol>
<li>
<p>(<strong>SUTVA</strong>) Stable Unit Treatment Value Assumption. This means:</p>
<ul>
<li>
<p>There is no interference in treatments – one subject receiving
treatment does not affect others’ treatment.</p>
</li>
<li>
<p>There is only one form of treatment.</p>
</li>
</ul>
</li>
<li>
<p>(<strong>Consistency</strong>) This assumption links the hypothetical potential
outcomes to observed data. If we assume consistency then we are
assuming:
<script type="math/tex">\begin{align}Y_i = W_iY_i(1) + (1-W_i)Y_i(0),\end{align}</script> as discussed above.</p>
</li>
<li>
<p>(<strong>No unmeasured confounders/ignorability</strong>) The treatment
assignment is independent of the potential outcomes:
<script type="math/tex">\begin{align}Y(1),Y(0) {\mathrel{\text{$\perp\mkern-10mu\perp$}}}W.\end{align}</script></p>
<p>In most cases of interest both the outcome and treatment variable
are related to a set of observed covariates, $X$. Causal inference
then requires:
<script type="math/tex">\begin{align}Y(1),Y(0) {\mathrel{\text{$\perp\mkern-10mu\perp$}}}W | X.\end{align}</script>
In RCTs this assumption may be reasonable. This says that the
distribution of potential outcomes $(Y(1), Y(0))$ is the same across
treatment levels $W$, conditioned on $X$. In observational settings
often this is the primary assumption that is a road block
to identifiability.</p>
<p>Another way to understand this is as follows. We want to relate
observed quantities to hypothetical potential outcomes. We can do
this if we assume ignorability: <script type="math/tex">% <![CDATA[
\begin{aligned}
\mathbb{E}(Y|W=1)-\mathbb{E}(Y|W=0) &= \mathbb{E}(WY(1)+(1-W)Y(0)|W=1)-\mathbb{E}(WY(1)+(1-W)Y(0)|W=0)\\
&= \mathbb{E}(Y(1)|W=1)-\mathbb{E}(Y(0)|W=0)\\
\text{(ignorability)}\quad &= \mathbb{E}(Y(1) - Y(0))\\
&= \tau\end{aligned} %]]></script></p>
</li>
<li>
<p>(<strong>Positivity</strong>) Additionally, causal inference requires a non-zero
probability of assignment to a treatment group for all subjects:
<script type="math/tex">% <![CDATA[
\begin{align}0 < \mathbb{P}(W_i = 1|X_i = x) < 1, \quad \forall x.\end{align} %]]></script> This is
known as the <em>positivity</em>, or overlap, assumption.</p>
<p>Simply, a causal effect cannot be measured if no subjects receive
the treatment, or they all do.</p>
</li>
</ol>
<h1 id="directed-acyclic-graphs-and-probability-distributions">Directed acyclic graphs and probability distributions</h1>
<p>In a sense the conditional independence between treatment and potential
outcome is the main assumption that requires analysis in the above set
of assumptions. This analysis can be aided by encoding our assumptions
about the relations between different variables in a graph. This section
defines and describes the behavior of these graphs. The following
section contains criteria that can be used to identify sets of variables
that are sufficient to act as controls, that remove the effect of
confounding and hence that satisfy ignorability. These models are types
of graphical models, sometimes known as <em>Bayesian networks</em>, and were
first developed by Pearl in the 1980s.</p>
<p>Here we will consider a set of random variables $\mathcal{X}$ as nodes
on a directed acyclic graph $\mathcal{G}$. Let this graph have edges
$\mathcal{E}$ that represent relations between the variables.
Ignorability requires conditional independence of the outcome from the
treatment variable, so here we will let the directed edges encode
conditional independence assumptions. (A <em>causal Bayesian network</em> has
additional semantics that are discussed in Section [sec:cbn]. For the
moment the directed edges only encode information about conditional
independence.)</p>
<p>First note that the DAG imposes an ordering on the variables
$\mathcal{X}$, from which we can talk about a node’s parents, children,
ancestors or descendants. Note also that any multivariate distribution
can be decomposed into a product of conditional probabilities for any
ordering of the variables:
<script type="math/tex">\begin{align}P(X) = \prod_{j=1}^N P(X_j|\{X_k\}_{k>j}).\end{align}</script></p>
<p>Given this, if we assume that the variables are ordered in a way that
respects the ordering of the DAG, then we will say $\mathcal{X}$ is a
Bayesian network with respect to $\mathcal{G}$ if the joint distribution
over variables $\mathcal{X}$ factors according to:
<script type="math/tex">\begin{align}P(X) = \prod_{j=1}^N P(X_j|\text{Pa}(X_j)),\end{align}</script> where $\text{Pa}(X_j)$
is the parents of node $X_j$. That is, each node $X_j$ is conditionally
independent of its non-descendants given its parents:
<script type="math/tex">\begin{align}P(X_j|\{X_k\}_{k>j}) = P(X_j|\text{Pa}(X_j)).\end{align}</script> This is the <em>Markov
condition</em>, or Markov assumption, for a Bayesian network. A node is
conditionally independent of the entire network given its <em>Markov
blanket</em> – its parents, its children, and its children’s other parents.</p>
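<p>As a concrete sketch (arbitrary made-up probabilities), the factorization and the Markov condition can be checked numerically for the chain $A \to B \to C$:</p>

```python
import itertools

# CPTs for the chain A -> B -> C (binary variables); the numbers are arbitrary.
pA = {0: 0.6, 1: 0.4}
pB_A = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # pB_A[a][b] = P(B=b|A=a)
pC_B = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # pC_B[b][c] = P(C=c|B=b)

# Joint distribution via the Markov factorization P(a,b,c) = P(a)P(b|a)P(c|b)
joint = {(a, b, c): pA[a] * pB_A[a][b] * pC_B[b][c]
         for a, b, c in itertools.product([0, 1], repeat=3)}

def marg(fixed):
    """Sum the joint over entries matching the fixed coordinate values."""
    return sum(p for (a, b, c), p in joint.items()
               if all(v == (a, b, c)[i] for i, v in fixed.items()))

# Markov condition here: A and C are independent given B.
for b in (0, 1):
    p_ac = marg({0: 1, 1: b, 2: 1}) / marg({1: b})
    p_a = marg({0: 1, 1: b}) / marg({1: b})
    p_c = marg({1: b, 2: 1}) / marg({1: b})
    assert abs(p_ac - p_a * p_c) < 1e-12
print("A is independent of C given B in the factored joint")
```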
<p>Often also invoked is the <em>faithfulness condition</em>, which is the
condition that the conditional independencies implied by the graph are
the only conditional independencies in the distribution. E.g. assuming
faithfulness in the graph $A \to B$ says that there is in fact a
dependence between $A$ and $B$.</p>
<h2 id="some-types-of-graphs">Some types of graphs</h2>
<p>Some properties of a Bayesian network can be inferred graphically. For
instance three basic components of DAGs are:</p>
<ol>
<li>
<p>Chain: $A \to B \to C$</p>
</li>
<li>
<p>Fork: $A \leftarrow B \to C$</p>
</li>
<li>
<p>Collider (inverted fork): $A \to B \leftarrow C$</p>
</li>
</ol>
<p>These graphs behave differently when conditioning on parts of them.
Compare the fork and the inverted fork.</p>
<ul>
<li>
<p>For the fork, $A$ and $C$ are dependent. Yet when conditioned on
$B$, $A$ and $C$ become independent.</p>
</li>
<li>
<p>The converse is true for the inverted fork. Without conditioning,
$A$ and $C$ are independent. Yet when conditioned on $B$, $A$ and
$C$ become dependent. This may seem a little counter-intuitive. An
example of this phenomenon is if $B$ is determined through tossing
two independent coins, $A$ and $C$:
<script type="math/tex">\begin{align}B = \begin{cases}
1, \quad A=H, C = H;\\
0, \quad \text{else.}
\end{cases}\end{align}</script> By itself, knowing $A$ tells you nothing about $C$.
But knowing $B$ and $A$ together tells you something about $C$.</p>
</li>
</ul>
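<p>This coin example can be simulated directly (a pure-Python sketch):</p>

```python
import random
random.seed(1)

# Two fair coins A and C; B = 1 only when both land heads (the collider).
data = [(a, c, int(a and c))
        for a, c in ((random.random() < 0.5, random.random() < 0.5)
                     for _ in range(100_000))]

# Unconditionally, A tells us nothing about C.
p_c = sum(c for _, c, _ in data) / len(data)
p_c_given_a = sum(c for a, c, _ in data if a) / sum(a for a, _, _ in data)

# Conditioned on B = 0, learning A = heads makes C = heads impossible.
b0 = [(a, c) for a, c, b in data if b == 0]
p_c_b0 = sum(c for _, c in b0) / len(b0)
p_c_b0_a = sum(c for a, c in b0 if a) / max(1, sum(a for a, _ in b0))

print(p_c, p_c_given_a)  # both near 0.5: unconditional independence
print(p_c_b0, p_c_b0_a)  # near 1/3 vs exactly 0: conditional dependence
```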
<p>Note that the chain behaves in the same way as the fork: $A$ and $C$
are dependent, yet when conditioned on $B$ they become independent. The
two structures therefore cannot be distinguished by conditional
independence relations alone.</p>
<h2 id="d-separation">d-separation</h2>
<p>For more complicated graphs, are a given set of variables sufficient
controls to render two nodes conditionally independent? Here the notion
of <em>d-separation</em> is useful.</p>
<p>The d stands for directional. Let $P$ be a path from node $u$ to $v$. A
path is a loop-free, undirected (i.e. all edge directions are ignored)
path between two nodes. Then $P$ is said to be d-separated by a set of
nodes $Z$ if any of the following conditions holds:</p>
<ul>
<li>
<p>$P$ contains a directed chain, $u \cdots \to m \to \cdots v$, such
that the middle node $m$ is in $Z$, or</p>
</li>
<li>
<p>$P$ contains a fork, $u \cdots \leftarrow m \to \cdots v$, such
that the middle node $m$ is in $Z$, or</p>
</li>
<li>
<p>$P$ contains an inverted fork (or collider),
$u \cdots \to m \leftarrow \cdots v$, such that the middle node $m$
is <em>not</em> in $Z$ and no descendant of $m$ is in $Z$.</p>
</li>
</ul>
<p>Nodes $u$ and $v$ are said to be d-separated by $Z$ if all paths between
them are d-separated. If $u$ and $v$ are not d-separated, they are
called d-connected.</p>
<p>We have the result that $X_u$ and $X_v$ being d-separated by $Z$ tells
us that $X_u$ and $X_v$ are conditionally independent given $Z$.</p>
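<p>One standard way to test d-separation programmatically is via the moralized ancestral graph: restrict the DAG to the ancestors of the query nodes, marry co-parents, drop edge directions, delete the conditioning set, and check connectivity. A minimal sketch (the parent-map representation of the DAG is a choice made here, not from the text):</p>

```python
from collections import deque

def ancestors(dag, nodes):
    """All ancestors of `nodes`, including the nodes themselves.
    `dag` maps each node to the set of its parents."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in dag[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(dag, u, v, z):
    """Restrict to the ancestral graph of {u, v} | z, moralize (marry
    co-parents, drop directions), delete z, and test connectivity."""
    keep = ancestors(dag, {u, v} | set(z))
    adj = {n: set() for n in keep}
    for n in keep:
        ps = sorted(dag[n])               # parents (ancestor-closed, so in keep)
        for p in ps:                      # undirected parent edges
            adj[n].add(p); adj[p].add(n)
        for i, a in enumerate(ps):        # marry co-parents
            for b in ps[i + 1:]:
                adj[a].add(b); adj[b].add(a)
    frontier, seen = deque([u]), {u} | set(z)   # z blocks the search
    while frontier:
        for n in adj[frontier.popleft()] - seen:
            if n == v:
                return False
            seen.add(n)
            frontier.append(n)
    return True

# chain A -> B -> C and collider A -> B <- C, as parent maps
chain = {'A': set(), 'B': {'A'}, 'C': {'B'}}
collider = {'A': set(), 'B': {'A', 'C'}, 'C': set()}
print(d_separated(chain, 'A', 'C', {'B'}))     # True: conditioning blocks
print(d_separated(collider, 'A', 'C', set()))  # True: unconditioned collider blocks
print(d_separated(collider, 'A', 'C', {'B'}))  # False: conditioning opens the path
```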
<h2 id="markov-equivalence-classes">Markov equivalence classes</h2>
<p>Note that a DAG may prescribe a factorization of the probability
distribution, but the converse is not true. That is, knowing a
factorization of the joint distribution does not always imply a unique
DAG. Instead it prescribes a <em>Markov equivalence class</em> of DAGs. This
means that if we want to think of the directed edges as representing
causal relationships then knowing a joint distribution factorization
does not always provide a unique graph of causal relationships. This
limits what we can learn about causal relationships from a joint
(observational) distribution alone.</p>
<p>Two graphs are Markov equivalent iff they share the same conditional
independencies. Equally, they are Markov equivalent iff they have the
same d-separations. That is, if $u$ and $v$ are d-separated by $C$ in
$\mathcal{G}_1$ then they are d-separated by $C$ in $\mathcal{G}_2$, and
vice versa. Some examples of DAGs that are Markov equivalent are shown
in Figure [fig:dags].</p>
<p><img src="./../images/dags.svg" alt="Examples of DAGs in the same Markov equivalence class.<span
data-label="fig:dags"></span>" /></p>
<p>In fact a simple graphical rule tells us if two DAGs are in the same
Markov equivalence class. The <em>skeleton</em> of a network is its
underlying undirected graph, obtained by ignoring edge directions. Two
DAGs are in the same equivalence class (observationally equivalent) if
they have the same skeleton and the same set of ‘v-structures’ – the
same set of two converging arrows whose tails are not connected by an
arrow.</p>
<h1 id="controlling-for-confounders">Controlling for confounders</h1>
<p>Now we know some of the behavior of Bayesian networks we can return to
the question of identifying variables that can be controlled for to
remove confounding. This means we want to identify variables $X$ such
that ignorability holds:
<script type="math/tex">\begin{align}Y(1),Y(0) {\mathrel{\text{$\perp\mkern-10mu\perp$}}}W | X.\end{align}</script></p>
<p>Note that the observed outcome is of the form $Y = W Y(1) + (1-W)Y(0)$,
which induces a conditional dependence between $W$ and $Y$ – the
corresponding DAG will have a directed edge from $W$ to $Y$.
Ignorability requires essentially that any <em>other</em> paths from $W$ to $Y$
are blocked (i.e. controlled for, conditioned on). Which choices of $X$
achieve this? Three such criteria are identified below, stated without
proof. An example of each is shown in Figure [fig:criteria].</p>
<h2 id="backdoor-criterion">Backdoor criterion</h2>
<p>If a set of variables $X$ satisfy the following conditions:</p>
<ol>
<li>
<p>$X$ blocks every path from $W$ to $Y$ that has an arrow into $W$
(blocks the back door), and</p>
</li>
<li>
<p>No node in $X$ is a descendant of $W$.</p>
</li>
</ol>
<p>then $X$ satisfies the backdoor criterion with respect to nodes $W$ and
$Y$.</p>
<h2 id="disjunctive-cause-criterion">Disjunctive cause criterion</h2>
<p>Sometimes simpler to apply than the backdoor criterion, which can
involve analyzing the entire DAG, is the disjunctive cause criterion. It
is simply:</p>
<ul>
<li>Control for all causes of the treatment variable, of the outcome
variable (that are not descendants of the treatment), or of both.</li>
</ul>
<p>Sometimes this is an easier set to identify than other (potentially
smaller) sets that satisfy the backdoor criterion.</p>
<h2 id="frontdoor-criterion">Frontdoor criterion</h2>
<p>If a set of variables $Z$ satisfy the following conditions:</p>
<ol>
<li>
<p>$Z$ blocks all directed paths from $X_i$ to $X_j$, and</p>
</li>
<li>
<p>there is no backdoor path from $X_i$ to $Z$, and</p>
</li>
<li>
<p>all backdoor paths from $Z$ to $X_j$ are blocked by $X_i$</p>
</li>
</ol>
<p>then $Z$ satisfies the frontdoor criterion with respect to nodes $X_i$
and $X_j$.</p>
<p><img src="../../images/dags_criteria.svg" alt="Three criteria through which conditioning on $Z$ will render the
effect of $X$ on $Y$ identifiable.<span
data-label="fig:criteria"></span>" /></p>
<h1 id="some-common-methods">Some common methods</h1>
<p>Once a set of variables to control for has been identified, how do we
actually use this knowledge to identify causal effects? In theory, if we
observe controls $X$ then we can measure the causal effect from:
<script type="math/tex">\begin{align}\tau = \mathbb{E}(\mathbb{E}(Y|W=1,X)-\mathbb{E}(Y|W=0,X)).\end{align}</script></p>
<p>In practice however this requires a lot of data to get reliable
estimates of each conditional expectation. In biomedical/social science
settings this is often an issue. Generally each conditional expectation
has to be estimated parametrically to capture the dependence on $X$.
This introduces bias through choice of model, etc. Thus methods that can
estimate causal effects without this modeling are attractive. One way of
doing this is to match the distribution of the confounders $X$ between
the control and treatment groups, thereby making treatment independent
of the covariates and the data more like what is produced in a
randomized controlled trial. This balancing of distributions among the
control and treatment groups is achieved through sampling subjects in
different ways.</p>
<h2 id="matching">Matching</h2>
<p>The basic idea of matching is as follows. For each condition $W= 1$ and
$W=0$ there are only a finite number of samples:
<script type="math/tex">\begin{align}\{y_i^{w=0}, x_i^{w=0}\}_{i=1}^{I_0} \text{ and } \{y_i^{w=1}, x_i^{w=1}\}_{i=1}^{I_1}.\end{align}</script>
Matching simply pairs one sample in the treatment group with one sample
in the control group whose control covariates are close:
<script type="math/tex">\begin{align}(y_i^{w=0}, x_i^{w=0})\leftrightarrow (y_j^{w=1}, x_j^{w=1}), \quad x_i^{w=0}\sim x_j^{w=1}.\end{align}</script>
Since $X$ then has roughly the same distribution in the two treatment
groups, the dependence of the outcome on $X$ does not need to be
modeled. This allows the above causal effect expectation to be
approximated.</p>
<p>Choices must be made about the metric that is used to decide when two
points are similar. And choices must be made about how to deal with
different treatment and control population sizes. One possibility is to
discard all samples for which no match is made. Another possibility is
to match one sample in the treatment group to more than one sample in
the control group.</p>
<p>A common approach is to match each sample in the treatment group. This then estimates
what is known as the <em>causal effect of treatment on the treated</em>, often
a quantity of interest. If we let $C(i)$ represent the sample index in
the control population that is matched to sample $i$ in the treatment
population then the causal effect is estimated from:
<script type="math/tex">\begin{align}\tau \approx \frac{1}{I_1}\sum_{i=1}^{I_1} y_i^{w=1} - y_{C(i)}^{w=0}.\end{align}</script></p>
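<p>A minimal sketch of matching on the treated, with a single confounder and made-up functional forms (the true effect here is 2, and the naive difference in means is biased upward):</p>

```python
import random
random.seed(2)

# x drives both treatment probability and outcome (confounding).
treated, control = [], []
for _ in range(2000):
    x = random.uniform(0, 1)
    w = random.random() < 0.2 + 0.6 * x   # confounded assignment
    y = 2.0 * w + 3.0 * x                 # true causal effect = 2
    (treated if w else control).append((x, y))

# Naive difference in means picks up the confounding bias.
naive = (sum(y for _, y in treated) / len(treated)
         - sum(y for _, y in control) / len(control))

# Match each treated unit to the control unit with the closest x.
matched = sum(y - min(control, key=lambda c: abs(c[0] - x))[1]
              for x, y in treated) / len(treated)

print(f"naive {naive:.2f}, matched {matched:.2f}")  # matched is near 2
```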
<p>Matching can be performed on all covariates, or just covariates that are
identified as confounders, according to the backdoor or other criterion.
Note that matching does not remove the need for ignorability –
unmeasured confounders can still affect the analysis, thus $X$ still
must satisfy the backdoor criterion.</p>
<h2 id="propensity-score-matching">Propensity score matching</h2>
<p>Matching directly on controls $X$ can be difficult if $X$ is
high-dimensional. Instead, we can match on what is called the propensity
score, which is the probability of being treated given a set of
controls: <script type="math/tex">\begin{align}\pi(X) = P(W = 1| X).\end{align}</script></p>
<p>Matching on $\pi(X)$ has the same effect as matching on $X$ directly.
This is because subjects at the same propensity level have, by
definition, the same probability of being assigned to the treatment
group. Thus, for these subjects, treatment assignment is randomized
(independent of $X$). In this way the distribution of $X$ in treatment
and control groups are made to be the same.</p>
<p>The propensity score is known, by definition, in randomized
controlled trials. It has to be estimated in observational studies. But
since it
only involves observed data $X$ and $W$ this is straightforward. For
example, one can use logistic regression.</p>
<p>Again, propensity score matching still requires the ignorability
assumption with controls $X$. Without it, even if the distribution of
$X$ is balanced between control and treatment groups, unobserved
confounders can still be different amongst control and treatment.</p>
<h2 id="inverse-probability-of-treatment-weighting">Inverse probability of treatment weighting</h2>
<p>Instead of matching on propensity score, which may discard some samples,
we can simply reweight each subject by the inverse of its probability of
receiving treatment – known as the inverse probability of treatment
weighting (IPTW). This matches one unit in a treatment group with a
certain number of ‘pseudo-units’ in the control group at a rate
proportional to the relative probability of receiving treatment at a
given level in $X$. In this way balance is achieved across levels.</p>
<p>This is a type of importance sampling.</p>
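<p>A sketch with a single binary confounder and the propensity score assumed known (a Hájek-style weighted estimator; the numbers are made up and the true effect is 2):</p>

```python
import random
random.seed(3)

pi = {0: 0.2, 1: 0.8}           # P(W=1 | X=x), assumed known here
num1 = den1 = num0 = den0 = 0.0
for _ in range(100_000):
    x = int(random.random() < 0.5)
    w = int(random.random() < pi[x])
    y = 2.0 * w + 3.0 * x       # true causal effect = 2
    if w:
        num1 += y / pi[x]        # weight treated units by 1/pi(x)
        den1 += 1 / pi[x]
    else:
        num0 += y / (1 - pi[x])  # weight controls by 1/(1 - pi(x))
        den0 += 1 / (1 - pi[x])

tau_iptw = num1 / den1 - num0 / den0
print(round(tau_iptw, 2))  # near 2.0
```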
<h1 id="sec:cbn">Causal Bayesian networks</h1>
<p>This is the framework developed most significantly by Pearl. A causal model is a Bayesian network along with a
mechanism to determine how the model will respond to intervention. Now,
rather than using the notion of potential outcomes and counterfactuals,
causal effects are measured as the result of intervention. In addition
to parents/children, we also think of the directed edges in the DAG as
representing causal relationships, meaning a node’s parents and children
are also its causes and effects.</p>
<p>The <em>causal Markov condition</em> is the condition that all nodes are
independent of their non-effects, given their direct causes. In the
event that the structure of a Bayesian network accurately depicts
causality, this is equivalent to the Markov condition. However, a
network may accurately embody the Markov condition without depicting
causality, in which case it should not be assumed to embody the causal
Markov condition.</p>
<h2 id="interventions-and-causal-effects">Interventions and causal effects</h2>
<p>An intervention on a single variable is denoted ${\text{do}}(X_i = y)$.
Intervening on a variable removes the edges to that variable from its
parents and forces the variable to take on a specific value:
<script type="math/tex">P(x_i|{\text{Pa}}_{X_i}=\mathbf{x_i}) = \delta(x_i = y)</script>. The
interventional joint distribution, $P_{X_i=y}$, is then defined as:
<script type="math/tex">\begin{align}P_{X_i=y}(\mathbf{x}) = \prod_{j\ne i}^N P(x_j | {\text{Pa}}_{X_j} = \mathbf{x}_j)\delta(x_i = y),\end{align}</script>
also abbreviated to $P_{X_i}$. Expectations under interventions then
take the form:
<script type="math/tex">\begin{align}\mathbb{E}(X_j|{\text{do}}(X_i = y)) = \int x_j P_{X_i=y}(x_j)\,dx_j = \mathbb{E}_{X_i=y}(X_j).\end{align}</script>
The idea of intervention is shown in Figure [fig:inter].</p>
<p><img src="../../images/dags_intervene.svg" alt="Intervening on $X$ changes the graph and underlying distribution.<span
data-label="fig:inter"></span>" /></p>
<p>Now, given the ability to intervene, the average causal effect between
an outcome variable $X_j$ and a binary variable $X_i$ can be defined as:
<script type="math/tex">\begin{align}\tau = \mathbb{E}(X_j|{\text{do}}(X_i = 1)) - \mathbb{E}(X_j|{\text{do}}(X_i = 0)).\end{align}</script></p>
<p>In general, the ‘do’ conditional is different to standard probabilistic
conditioning. However criteria exist under which the interventional
conditional distribution coincides with the probabilistic conditional
distribution. The causal effect from node $X_i$ to $X_j$ can be inferred
for conditional distributions that satisfy these criteria. These are
actually the same criteria identified above in the counterfactual
framework when searching for controls that provide ignorability. The
interventional and counterfactual frameworks thus are compatible with
one another. Pearl argues the interventional framework subsumes the
older counterfactual framework.</p>
<p>For instance, if $S_{ij}$ satisfy the backdoor criteria with respect to
$X_i\to X_j$ then we can relate the interventional and observational
expectations as follows: <script type="math/tex">% <![CDATA[
\begin{aligned}
\nonumber \mathbb{E}(X_j|{\text{do}}(X_i = y)) &= \int x_j P_{X_i = y}(x_j)\,dx_j\\
\nonumber &= \int\int x_j P_{X_i = y}(x_j|\mathbf{s}_{ij})P_{X_i = y}(\mathbf{s}_{ij})\,dx_j d\mathbf{s}_{ij}\\
\nonumber &= \int\int x_j P(x_j|\mathbf{s}_{ij}, X_i = y)P(\mathbf{s}_{ij})\,dx_j d\mathbf{s}_{ij} \\
\label{eq:doce}&= \mathbb{E}\left(\mathbb{E}(X_j|\mathbf{S}_{ij}, X_i = y)\right),\end{aligned} %]]></script>
from which a causal effect can be measured.</p>
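<p>Numerically, the last line is just an average of stratum means weighted by the marginal of $\mathbf{S}$. A sketch with a single binary confounder (made-up parameters; the true effect is 2, while the naive conditional difference is roughly 4.4):</p>

```python
import random
random.seed(4)

# Confounder S -> X and S -> Y; S satisfies the backdoor criterion for X -> Y.
def sample():
    s = int(random.random() < 0.5)
    x = int(random.random() < (0.9 if s else 0.1))  # confounded treatment
    y = 2 * x + 3 * s + random.gauss(0, 0.1)        # true effect of x is 2
    return s, x, y

data = [sample() for _ in range(100_000)]

def mean(vals):
    return sum(vals) / len(vals)

# Naive conditional difference (confounded)
naive = (mean([y for _, x, y in data if x == 1])
         - mean([y for _, x, y in data if x == 0]))

# Backdoor adjustment: average E(Y | S=s, X=x) over the marginal of S
def adjusted(x):
    return sum(mean([y for s_, x_, y in data if s_ == s and x_ == x])
               * mean([1.0 if s_ == s else 0.0 for s_, _, _ in data])
               for s in (0, 1))

print(round(naive, 1), round(adjusted(1) - adjusted(0), 1))  # biased vs ~2.0
```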
<h1 id="structural-equation-models">Structural equation models</h1>
<p>The above frameworks are non-parametric, dealing simply with
factorizations of joint distributions. The parametric form of a causal
Bayesian network is the structural equation model (SEM). Each node is
described by: <script type="math/tex">\begin{align}X_j = f_j(\text{Pa}(X_j), \epsilon_j; \theta_j),\end{align}</script> for
some independent noise variable $\epsilon_j$, and parameters $\theta_j$.</p>
<p>Note that the equality here is of a different nature to an algebraic
equality. It conveys assignment rather than comparison. (Similar to the
difference between = and == in programming languages.) Some authors use
$\leftarrow$ instead of = to communicate this difference. This means
that structural equation models have an invariance property that
standard statistical models do not: the SEM is robust to intervention.
The model should describe the data equally well regardless of whether it
comes from observation or interventional experiments.</p>
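<p>A sketch of this invariance with a hypothetical two-equation SEM: the regression slope of $Y$ on $X$ is the same whether the data are observational or generated by intervening on $X$, because $Y$'s assignment equation is untouched by the intervention:</p>

```python
import random
random.seed(5)

# Two-equation SEM:  X := eps_X ;  Y := 2*X + eps_Y.
# ':=' is assignment: do(X = x) replaces X's equation only.
def run(do_x=None):
    x = random.gauss(0, 1) if do_x is None else do_x
    y = 2 * x + random.gauss(0, 1)   # Y's mechanism is invariant
    return x, y

def slope(pairs):
    """Ordinary least-squares slope of y on x."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    return (sum((x - mx) * (y - my) for x, y in pairs)
            / sum((x - mx) ** 2 for x, _ in pairs))

obs = [run() for _ in range(50_000)]                      # observational
inter = [run(do_x=random.gauss(0, 1)) for _ in range(50_000)]  # interventional
print(round(slope(obs), 1), round(slope(inter), 1))  # both near 2.0
```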
<h1 id="some-further-reading">Some further reading</h1>
<p>An overview of the counterfactual framework can be found in the short
Coursera course. The interventionist framework of Pearl is described in
his influential 2000 book. A more modern treatment, based on structural
equation models, which in some sense subsume the above two frameworks
can be found in Peters et al 2017.</p>
<ul>
<li>
<p>“A Crash Course in Causality: Inferring Causal Effects from
Observational Data” Coursera course. By Jason Roy.
<a href="https://www.coursera.org/learn/crash-course-in-causality/">www.coursera.org/learn/crash-course-in-causality/</a></p>
</li>
<li>
<p>“Causality: Models, Reasoning and Inference” Judea Pearl, 2000.</p>
</li>
<li>
<p>“Elements of Causal Inference: Foundations and Learning Algorithms”
Jonas Peters, Dominik Janzing and Bernhard Schölkopf, 2017.</p>
</li>
</ul>
<p><em>MathJax, Jekyll and github pages. Ben Lansdell, 2016-06-27. benlansdell.github.io/computing/mathjax</em></p>
<p>Integrating MathJax with Jekyll is a very convenient way of typesetting mathematics in a blog hosted on github pages. There are a few guides online, which were (almost) helpful in achieving this on a github hosted site. The steps, as of September 2016, are:</p>
<p>Ensure the markdown engine is set to <code class="highlighter-rouge">kramdown</code> in <code class="highlighter-rouge">_config.yml</code>. This is now the <a href="https://help.github.com/articles/updating-your-markdown-processor-to-kramdown/">only supported markdown processor</a> on github pages, so this should be set anyway.</p>
<p>Include a new file in <code class="highlighter-rouge">_includes</code> named <code class="highlighter-rouge">_mathjax_support.html</code> (a clever idea from <a href="http://haixing-hu.github.io/programming/2013/09/20/how-to-use-mathjax-in-jekyll-generated-github-pages/">here</a>):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: {
equationNumbers: {
autoNumber: "AMS"
}
},
tex2jax: {
inlineMath: [ ['$', '$'] ],
displayMath: [ ['$$', '$$'] ],
processEscapes: true,
}
});
MathJax.Hub.Register.MessageHook("Math Processing Error",function (message) {
alert("Math Processing Error: "+message[1]);
});
MathJax.Hub.Register.MessageHook("TeX Jax - parse error",function (message) {
alert("Math Processing Error: "+message[1]);
});
</script>
<script type="text/javascript" async
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
</code></pre></div></div>
<p>The bottom two hooks alert the user/writer about math and tex errors.</p>
<p>Importantly, in contrast to older guides online, note the https in the MathJax CDN. Unencrypted access to the CDN is a security risk and now will either not render in some browsers (didn’t work in Chrome for me), or will issue warnings in other browsers (Firefox). See the MathJax <a href="http://docs.mathjax.org/en/latest/start.html#secure-access-to-the-cdn">documentation</a> for more information.</p>
<p>Next, include in the <code class="highlighter-rouge">&lt;head&gt;</code> of <code class="highlighter-rouge">_layouts/default.html</code>:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{% if page.use_math %}
{% include mathjax_support.html %}
{% endif %}
</code></pre></div></div>
<p>Now to include $\LaTeX$ in a post you just need to set the variable <code class="highlighter-rouge">use_math: true</code> in the YAML front-matter of the page/post! Enclose inline formulas in <code class="highlighter-rouge">$</code>s and display formulas in <code class="highlighter-rouge">$$</code>s. For instance,</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$$
K(a,b) = \int \mathcal{D}x(t) \exp(2\pi i S[x]/\hbar)
$$
</code></pre></div></div>
<p>produces:</p>
<script type="math/tex; mode=display">K(a,b) = \int \mathcal{D}x(t) \exp(2\pi i S[x]/\hbar)</script>
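<p>Putting this together, the front matter of a math-enabled post might look like the following (the title and category here are just placeholders):</p>

```yaml
---
layout: post
title: "A post with equations"   # placeholder title
categories: statistics
use_math: true
---
```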
<h2 id="alignment">Alignment</h2>
<p>Note that any equations requiring alignment (use of the ampersand <code class="highlighter-rouge">&amp;</code>) need some care. The solution I found was to wrap any of these elements in <code class="highlighter-rouge">&lt;div&gt;</code>s.</p>
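<p>For example, wrapping an aligned display in a plain HTML <code class="highlighter-rouge">&lt;div&gt;</code> keeps kramdown from processing the ampersands:</p>

```html
<div>
$$
\begin{align}
(a+b)^2 &= (a+b)(a+b) \\
        &= a^2 + 2ab + b^2
\end{align}
$$
</div>
```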
<h2 id="changing-typeset-fontsize">Changing typeset fontsize</h2>
<p>Add the following to <code class="highlighter-rouge">MathJax.Hub.Config</code>:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CommonHTML: {
scale: 85
}
</code></pre></div></div>
<h2 id="some-references">Some references:</h2>
<ul>
<li>http://cwoebker.com/posts/latex-math-magic – no longer seems to work</li>
<li>http://haixing-hu.github.io/programming/2013/09/20/how-to-use-mathjax-in-jekyll-generated-github-pages/</li>
<li>MathJax guide: http://docs.mathjax.org/en/latest/tex.html</li>
<li>MathJax details: http://docs.mathjax.org/en/latest/advanced/model.html</li>
</ul>
<p>On the relation between maximum likelihood and KL divergence (2016-06-26) http://benlansdell.github.io/statistics/likelihood</p>
<p>In this post I describe some of the theory of maximum likelihood estimation (MLE), highlighting its relation to information theory. In a later post I will develop the theory of maximum entropy models, also drawing connections to information theory, hoping to clarify the relation between MLE and MaxEnt.</p>
<p>Maximum likelihood was developed and advocated by many figures throughout the history of mathematics (see [1] for a nice overview). In a sense it was first considered in a significant way by Lagrange, but it was also considered by Bernoulli, Laplace, and Gauss, among others. Indeed, what was known as the ‘Gaussian method’ involved maximum a posteriori estimation of a model with normally distributed errors and a uniform prior, resulting in what is now known as the method of least squares. However, its theory and use were advanced most strongly by Fisher in the 1920s and 30s. Fisher worked for many years to demonstrate conditions needed for both the consistency and efficiency of MLE. While his later results have stood up to scrutiny, the theory, as it stands, does not possess quite the generality he sought. Nonetheless, it remains a cornerstone of contemporary statistics.</p>
<h2 id="maximum-likelihood-estimation">Maximum likelihood estimation</h2>
<p>Much of statistics relies on identifying models of data that are, in some sense, close to our observations. Indeed, in many cases it seems sensible that we seek models that are the closest to our observations. Maximum likelihood provides one principle by which we may identify these closest distributions. It has many appealing properties that make it an appropriate measure, and is a broadly applicable method. As we will see, its simplicity is somewhat deceiving.</p>
<p>The theory is easiest to describe in a discrete setting, which we will address first. Let</p>
<script type="math/tex; mode=display">x = (x_1, x_2, \cdots, x_N)</script>
<p>describe $N$ observations drawn from a discrete probability distribution. Each draw $x_n\in\mathcal{X}$ is taken from an alphabet of $M$ characters, $\mathcal{X}=(a_1, \dots, a_M)$. Let $p_m$ denote the probability of drawing character $m$ in any one draw, and let $f_m$ denote the number of times character $m$ is observed in the $N$ draws. Note that we’re just describing a multinomial distribution having $M$ parameters $p_m$.</p>
<p>Given our observations, how should we estimate the multinomial parameters $\mathbf{p}$? The principle of maximum likelihood states simply that we take parameters that result in our observations having highest probability, when compared with all other possible choices of parameters. If we assume that each draw is independently and identically distributed (i.i.d.) then this is</p>
<div>
$$
\begin{align}
\hat{\mathbf{p}}_{MLE} & = \text{argmax}_{p\in \mathcal{P}} \prod_{n=1}^Np_{x_n}\\
& = \text{argmax}_{p\in\mathcal{P}}\log\left(\prod_{n=1}^Np_{x_n} \right) \\
& =\text{argmax}_{p\in\mathcal{P}}\sum_{n=1}^N\log \left( p_{x_n} \right) \\
& =\text{argmax}_{p\in\mathcal{P}}\sum_{m=1}^M f_m \log \left( p_m \right)
\end{align}
$$
</div>
<p>For many reasons, some of which will become clear here, expressing the maximization problem in terms of logarithms is the natural choice, so the last line above is one we will be optimizing. (As a brief aside, note the step taken to reach the last line appears a trivial manipulation, but if we were to write out what was happening in a general probability space, it is roughly analogous to the pushforward change of variables:</p>
<script type="math/tex; mode=display">\begin{align*}
\int_\mathbb{R} \log (p(x)) dF(x) = \int_\Omega \log(p(X(\omega))) d\mu(\omega)\end{align*}</script>
<p>we make when shifting between expectations in terms of a measure $\mu$ and a distribution function $F$. The LHS being given by a Lebesgue-Stieltjes integral, the RHS by a Lebesgue integral.)</p>
<p>The problem is constrained by the fact that $\sum_m p_m = 1$ and $p_m\ge 0$ for all $m$. This constrained optimization problem can be solved using Lagrange multipliers. Recall this involves augmenting our objective function with our constraints</p>
<script type="math/tex; mode=display">\text{argmax}_{p\in\mathcal{P}} \sum_{m=1}^M f_m \log \left(p_m\right) - \lambda \left(\sum_{m=1}^M p_m - 1\right) + \sum_{m=1}^M\mu_m p_m = \text{argmax}_{p\in\mathcal{P}} \mathcal{\tilde{L}}(\mathbf{p}, \mathbf{f})</script>
<p>We set the partial derivative of the Lagrangian $\mathcal{\tilde{L}}$ taken with respect to $p_m$ to zero to obtain</p>
<script type="math/tex; mode=display">p_m = \frac{f_m}{\lambda - \mu_m}</script>
<p>Since the optimum has each $p_m > 0$, complementary slackness gives $\mu_m = 0$, and the normalization constraint then forces $\lambda = N$, so that $p_m = f_m/N$. We find that the optimum occurs, not surprisingly, at the empirical frequencies. Thus the MLE estimates for our multinomial distribution are given by the empirical distribution.</p>
<h2 id="kl-divergence">KL-divergence</h2>
<p>I mentioned earlier that we would like some measure of closeness, and would like to find distributions that are ‘close’ to our observations. What measure of closeness have we just minimized? In the discrete case, we have just minimized the relative entropy, or Kullback-Leibler divergence, between the empirical distribution and the model.</p>
<p>Recall the empirical distribution is</p>
<script type="math/tex; mode=display">\hat{q}_m = \frac{1}{N}\sum_{n=1}^N\mathbf{I}(x_n = a_m) = \frac{f_m}{N}</script>
<p>so that:</p>
<div>
$$
\begin{align*}
\hat{\mathbf{p}}_{MLE}
& = \text{argmax}_{\mathbf{p}\in\mathcal{P}} \sum_{m=1}^M f_m \log \left( {p_m} \right) \\
& = \text{argmin}_{\mathbf{p}\in\mathcal{P}} \sum_{m=1}^M \hat{q}_m \log \left( \hat{q}_m \right) - \hat{q}_m \log \left( {p_m} \right) \\
& = \text{argmin}_{\mathbf{p}\in\mathcal{P}} \sum_{m=1}^M \hat{q}_m \log \left( \frac{\hat{q}_m}{p_m} \right) \\
& = \text{argmin}_{\mathbf{p}\in\mathcal{P}} D_{KL}( \hat{\mathbf{q}} \,\|\, \mathbf{p}) \\
\end{align*}
$$
</div>
<p>where we have taken as convention $0\log(0) = 0$. <em>Thus in the discrete case, at least, maximizing likelihood corresponds to minimizing the KL divergence.</em></p>
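<p>This equivalence is easy to verify numerically. The following sketch (using numpy and scipy; the alphabet size and probabilities are chosen arbitrarily for illustration) maximizes the likelihood over the simplex and checks the result against the empirical distribution and the KL divergence:</p>

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import entropy  # entropy(q, p) computes D_KL(q || p)

rng = np.random.default_rng(0)
true_p = np.array([0.5, 0.3, 0.2])            # illustrative 3-letter alphabet
x = rng.choice(3, size=2000, p=true_p)
f = np.bincount(x, minlength=3)               # counts f_m
q_hat = f / f.sum()                           # empirical distribution

# maximize sum_m f_m log p_m over the probability simplex
res = minimize(lambda p: -np.sum(f * np.log(p)),
               x0=np.ones(3) / 3,
               bounds=[(1e-9, 1)] * 3,
               constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1}])

print(np.round(res.x, 3), np.round(q_hat, 3))  # the MLE is the empirical distribution
print(entropy(q_hat, res.x))                   # and D_KL(q_hat || p_MLE) is ~0
```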
<p>We have discussed only the case where $\mathbf{p}$ is estimated non-parameterically. What if we instead have some probability model described by a parameter $\theta$? The optimization problem now becomes</p>
<div>
$$
\begin{align}\hat{\theta}_{MLE} & = \text{argmax}_{\theta\in\Theta}\mathcal{L}(\theta|\mathbf{f})\\
& = \text{argmax}_{\theta\in\Theta}\sum_{m=1}^M f_m \log \left(p_m(\theta)\right)\\\end{align}
$$
</div>
<p>which is typically found by solving $\frac{\partial}{\partial \theta}\mathcal{L}(\theta|\mathbf{f}) = 0$. The result is the same however – we minimize the difference between the empirical distribution and the parametric model.</p>
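<p>As a concrete sketch of the parametric case (the Binomial model and its parameter value are chosen arbitrarily): take $\mathcal{X} = \{0,\dots,M-1\}$ with $p_m(\theta)$ the Binomial$(M-1,\theta)$ pmf, for which the MLE has the closed form $\hat{\theta}=\bar{x}/(M-1)$:</p>

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import binom

rng = np.random.default_rng(1)
M = 6                                          # alphabet 0..5
x = rng.binomial(M - 1, 0.35, size=2000)       # draws from the model
f = np.bincount(x, minlength=M)                # counts f_m

# maximize sum_m f_m log p_m(theta) for the Binomial(M-1, theta) model
def neg_ll(theta):
    return -np.sum(f * binom.logpmf(np.arange(M), M - 1, theta))

res = minimize_scalar(neg_ll, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean() / (M - 1))               # numerical and closed-form MLE agree
```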
<p>In the continuous case things are not so straightforward. We try the same argument for a real-valued random variable, $N$ observations $X^N = \{x_1, \dots, x_N\}$, and a parametric model $f(x; \theta)$. If we denote by</p>
<script type="math/tex; mode=display">p_D(x) = \frac{1}{N}\sum_{i=1}^{N}\delta(x-x_i)</script>
<p>the ‘empirical density’, then trying the same argument gives:</p>
<div>
$$
\begin{align*}
\hat{\theta}_{MLE}
& = \text{argmax}_{\theta} \sum_{i=1}^N \log \left( f(x_i;\theta) \right) \\
\text{(??)} \qquad & = \text{argmax}_{\theta} \int p_D(x) \log \left( {f(x;\theta)} \right) \,dx\\
\text{(??)} \qquad & = \text{argmin}_{\theta} \int p_D(x) \log \left( \frac{p_D(x)}{f(x;\theta)} \right) \,dx\\
& = \text{argmin}_{\theta} D_{KL}( p_D \,\|\, f(\cdot\,; \theta)) \\
\end{align*}
$$
</div>
<p>However, the question-marked lines don’t quite make sense – it’s unclear what $\log p_D(x)$ means in a continuous setting. For such a line to make sense the empirical distribution would have to be absolutely continuous, so that $p_D$ was actually a density.</p>
<h2 id="consistency">Consistency</h2>
<p>That said, the intuition about MLE and min. $D_{KL}$ does carry through to the continuous setting in the following sense. Let the data be generated by a model given by $f(x; \theta^*)$ and let</p>
<script type="math/tex; mode=display">M_N(\theta) = \frac{1}{N}\sum_{i=1}^N \log(f(x_i;\theta))</script>
<p>denote the quantity we maximize under MLE. We note that this is an approximation to the expectation</p>
<script type="math/tex; mode=display">M_N(\theta) \approx \mathbb{E}_{\theta^*} \log(f(X;\theta))</script>
<p>and that, through the law of large numbers, this indeed converges to</p>
<script type="math/tex; mode=display">\lim_{N\to\infty} M_N(\theta) =M(\theta) = \mathbb{E}_{\theta^*} \log(f(X;\theta)).</script>
<p>But maximizing $M(\theta)$ is the same as minimizing $D_{KL}(f(\theta^*)\|f(\theta))$ since:</p>
<div>
$$
\max_{\theta} M(\theta) = \max_{\theta} \mathbb{E}_{\theta^*} \log \left(\frac{f(X;\theta)}{f(X;\theta^*)}\right) = \min_{\theta} D_{KL}(f(\theta^*)\|f(\theta))
$$
</div>
<p>Note that since $D_{KL} \ge 0$ and $D_{KL}(f\|g) = 0 \iff f = g$ then, under regularity conditions not specified here, this result implies the consistency of the MLE – that $\hat{\theta}_N \to \theta^*$ in probability as $N\to\infty$.</p>
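<p>This convergence is easy to see numerically; in the sketch below (a Gaussian location model with arbitrary values) the MLE is the sample mean, and its error shrinks like $1/\sqrt{N}$:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
theta_star = 2.0
for N in [10, 1000, 100000]:
    x = rng.normal(theta_star, 1.0, size=N)
    # for the Gaussian location model the MLE is the sample mean
    print(N, x.mean())
```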
<h2 id="a-reference">A reference</h2>
<ol>
<li>“The Epic Story of Maximum Likelihood” Stephen M. Stigler. 2007.</li>
</ol>
<p>Installing NEURON with python and Neuronvisio (2016-06-17) http://benlansdell.github.io/computing/neurovisio</p>
<p>Since it took some time, I’m going to describe the steps I took to install NEURON with support for python and the 3D visualization tool neuronvisio. I’m running Ubuntu 14.04, python 2.7.</p>
<p>First, I installed the neuron python package found at http://neuralensemble.org/people/eilifmuller/software.html. Note that this installs NEURON 7.1, which is not the latest version.
I also had to install libreadline-dev to get nrnivmodl (the hoc compiler) working:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt-get install libreadline-dev
</code></pre></div></div>
<p>Then install the prerequisites for neuronvisio:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt-get install python-qt4 python-matplotlib python-setuptools python-tables mayavi2 python-pip
</code></pre></div></div>
<p>Install neuronvisio</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd ~/python/
git clone git://github.com/mattions/neuronvisio.git
cd neuronvisio
python setup.py install
</code></pre></div></div>
<p>Add ~/python to your $PYTHONPATH, and add ~/python/neuronvisio/bin/neuronvisio (or links thereto) to your path.
The final change (the strangest one, and the one that’s making me write this down) is as follows. The preceding steps should get NEURON and neuronvisio working in python. I could run simulations, plot the 3D structures, etc. The one thing I was unable to do was select segments in the 3D visualization window. I would receive the error:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Pick() takes exactly 4 arguments (2 given)
</code></pre></div></div>
<p>The workaround I found was to edit the following file in mayavi:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo vim /usr/lib/python2.7/dist-packages/mayavi/core/mouse_pick_dispatcher.py
</code></pre></div></div>
<p>so that line 168 changes from</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>picker.pick((x, y, 0), self.scene.scene.renderer)
</code></pre></div></div>
<p>to</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>picker.pick(x, y, 0, self.scene.scene.renderer)
</code></pre></div></div>
<p>It’s annoying that such a strange workaround was necessary. However, a similar fix was suggested here (https://github.com/enthought/mayavi/issues/21), so perhaps I wasn’t alone.</p>
<p>Path integrals and SDEs in neuroscience – part two (2016-01-30) http://benlansdell.github.io/statistics/sdesII</p>
<p>In the <a href="http://benlansdell.github.io/statistics/sdes/">previous post</a> we defined path integrals through a simple ‘time-slicing’ approach, and used them to compute moments of simple stochastic DEs. In this follow-up post we will examine how expansions can be used to approximate moments, how we can use the moment generating functional to compute probability densities, and how these methods may be helpful in some cases in neuroscience.</p>
<h3 id="sub:perturbation">Perturbative approaches</h3>
<p>For a general, non-linear SDE the series will not terminate and must be
truncated at some point. It is then necessary to determine which terms
will contribute to the sum, and to include these terms up to a given order.
We mention briefly three such possibilities, though do not discuss them
in any detail. One possibility arises when some terms in $S_{I}$
($v_{mn}\int x^{n}\tilde{x}^{m}$, $m\ge2$) are small. Then we can simply
let each such vertex contribute a small parameter $\alpha$ and perform
an expansion in orders of $\alpha$ (known as a ‘weak coupling expansion’
<sup id="fnref:3"><a href="#fn:3" class="footnote">1</a></sup>).</p>
<p>Another option is to perform a weak noise, or loop, expansion. Here we
scale the entire exponent in the MGF by some factor $h$</p>
<script type="math/tex; mode=display">Z=\int\mathcal{D}x(t)\mathcal{D}\tilde{x}(t)e^{-\frac{1}{h}(S-\int\tilde{J}x-\int J\tilde{x})}</script>
<p>Then each vertex of $S_{I}$ gains a factor of $1/h$ and each edge of
$S_{F}$ gains a factor $h$ which implies we can expand in powers of $h$.
In performing this expansion, if we let $E$ denote the number of
external edges of a diagram, $I$ the number of internal edges and $V$
the number of vertices then each connected graph has a factor of
<script type="math/tex">h^{I+E-V}</script> and, in fact, it can be shown by induction that:
<script type="math/tex">L=I-V+1</script> where $L$ is the number of <em>loops</em> the diagram contains.
Thus, each graph collects a factor of $h^{E+L-1}$. This allows us to
order the expansion in terms of the number of loops in each diagram.
Diagrams which contain no loops are trees, or classical diagrams. Such
diagrams form the basis of the <em>semi-classical</em> approximation.</p>
<p>This expansion is of course only valid when the contribution of the
higher loop number diagrams is smaller than that of the lower loop
number diagrams. The <em>Ginzburg criterion</em> says when this expansion is
indeed valid.</p>
<h2 id="some-other-examples">Some other examples</h2>
<p>We present two further examples which demonstrate how these methods are
used.</p>
<h3 id="example-1">Example 1</h3>
<p>A simple extension of the OU process so that it is now <em>mean-reverting</em>
(to something not zero, as in the previous case) is the SDE</p>
<script type="math/tex; mode=display">\dot{x}(t)+a(b+x(t))-\sqrt{D}\eta(t)=0.</script>
<p>This problem is obviously very similar to the above problem and is of
course solved almost identically. This time the action of the process is</p>
<script type="math/tex; mode=display">S=\int\left[\tilde{x}(t)(\dot{x}(t)+a(b+x(t)))+\tilde{x}(t)y\delta(t-t_{0})-\frac{D}{2}\tilde{x}^{2}(t)\right]\, dt</script>
<p>such that the free action is as before:</p>
<script type="math/tex; mode=display">S_{F}=\int\tilde{x}(t)\left[\dot{x}(t)+ax(t))\right]\, dt</script>
<p>(the
linear, homogeneous part, for which a Green’s function can be calculated)
and the interacting action is:</p>
<script type="math/tex; mode=display">S_{I}=\int\left[\tilde{x}(t)\left(y\delta(t-t_{0})+ba\right)-\frac{D}{2}\tilde{x}^{2}(t)\right]\, dt.</script>
<p>The only details that change are thus that the vertex linear in
$\tilde{x}(t)$ changes from</p>
<script type="math/tex; mode=display">\int\tilde{x}(t)y\delta(t-t_{0})dt\to\int\tilde{x}(t)\left(y\delta(t-t_{0})+ba\right)dt,</script>
<p>which adds an extra term to the expression for the mean:</p>
<script type="math/tex; mode=display">\langle x(t)\rangle=H(t-t_{0})\left(ye^{-a(t-t_{0})}-b\left(1-e^{-a(t-t_{0})}\right)\right).</script>
<p>Since only this internal vertex is affected, and the second order vertex
($D\tilde{x}(t)^{2}/2$) is unaffected, the solution for the second-order
cumulant will in fact be the same as our original example:</p>
<script type="math/tex; mode=display">\langle x(t)x(s)\rangle_{C}=D\frac{e^{-a|t-s|}-e^{-a(t+s-2t_{0})}}{2a}</script>
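<p>These moments can be checked directly with a Monte Carlo (Euler-Maruyama) simulation of the SDE; the following sketch uses arbitrary illustrative parameter values, and obtains the theoretical mean and variance by solving the deterministic moment equations of this SDE:</p>

```python
import numpy as np

# dx = -a(b + x)dt + sqrt(D)dW, x(t0) = y, with arbitrary illustrative parameters
a, b, D, y = 1.0, 0.5, 0.2, 1.0
dt, T, n_paths = 1e-3, 2.0, 20000
rng = np.random.default_rng(0)

x = np.full(n_paths, y)
for _ in range(int(T / dt)):
    x += -a * (b + x) * dt + np.sqrt(D * dt) * rng.standard_normal(n_paths)

# exact moments of this linear SDE (solving dm/dt = -a(b + m), etc.)
mean_theory = y * np.exp(-a * T) - b * (1 - np.exp(-a * T))
var_theory = D * (1 - np.exp(-2 * a * T)) / (2 * a)
print(x.mean(), mean_theory)
print(x.var(), var_theory)
```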
<h3 id="example-2">Example 2</h3>
<p>Consider the harmonic oscillator with noise:</p>
<script type="math/tex; mode=display">\ddot{x}+2\gamma\dot{x}+\omega^{2}x=\sqrt{D}\eta(t)</script>
<p>where $\eta$ is
a white noise process. Subject to initial conditions $x(0)=x_{0}$ and
$\dot{x}(0)=v_{0}$ ($t_{0}=0$). The action for this process is</p>
<script type="math/tex; mode=display">S=\int dt\,\left(\tilde{x}\left[\ddot{x}+2\gamma\dot{x}+\omega^{2}x+v_{0}\delta(t)+x_{0}\delta'(t)\right]-\frac{D}{2}\tilde{x}^{2}\right)</script>
<p>which we will split into free and interacting components:</p>
<div>
$$\begin{aligned}
S_{F} & = \int dt\,\left(\tilde{x}\left[\ddot{x}+2\gamma\dot{x}+\omega^{2}x\right]\right)\\
S_{I} & = \int dt\,\left(\tilde{x}\left[v_{0}\delta(t)+x_{0}\delta'(t)\right]-\frac{D}{2}\tilde{x}^{2}\right)\end{aligned}$$
</div>
<p>The free action gives the propagator as the Green’s function:</p>
<script type="math/tex; mode=display">\left(\frac{d^{2}}{dt^{2}}+2\gamma\frac{d}{dt}+\omega^{2}\right)G(t,t')=\delta(t-t')</script>
<p>which can be shown to be</p>
<script type="math/tex; mode=display">G(t,t')=\frac{1}{\omega_{1}}H(t-t')e^{-\gamma(t-t')}\sin[\omega_{1}(t-t')],</script>
<p>for $\omega_{1}=\sqrt{\omega^{2}-\gamma^{2}}$. Once $G$ is determined
the mean and covariance can be immediately calculated through the
diagrams and calculations of Figure 1.</p>
<figure class="center" style="width:500px">
<img src="../../images/feynman3.png" alt="img txt" />
<figcaption>Figure 1. Computation of the first and second cumulants of the damped harmonic oscillator
driven by Gaussian white noise. Diagrams (with internal vertices labeled
adjacent to diagram), and the equivalent integral to evaluate to obtain
each cumulant.
</figcaption>
</figure>
<p>We find, as expected, the following mean and covariance:</p>
<div>
$$\begin{aligned}
\langle x(t)\rangle & = \int[\delta(t')v_{0}+\delta'(t')x_{0}]G(t,t')dt'\\
& = v_{0}G(t,0)+x_{0}G'(t,0)\\
& = e^{-\gamma t}\left(\frac{\gamma x_{0}+v_{0}}{\omega_{1}}\sin[\omega_{1}t]+\cos[\omega_{1}t]\right)\end{aligned}$$
</div>
<p>and (assuming $t_{1}<t_{2}$)</p>
<div>
$$\begin{aligned}
\langle x(t_{1})x(t_{2})\rangle_{C} & = D\int G(t_{1},t)G(t_{2},t)dt\\
& = \frac{D}{\omega_{1}^{2}}\int_{0}^{\infty}e^{-\gamma(t_{1}+t_{2}-2t)}H(t_{1}-t)H(t_{2}-t)\sin[\omega_{1}(t_{1}-t)]\sin[\omega_{1}(t_{2}-t)]dt\\
& = \frac{D{\rm e}^{-\gamma\,{\it t_{1}}-\gamma\,{\it t_{2}}}}{4\omega_{1}^{2}\omega^{2}\gamma}\left({\rm e}^{2\,\gamma\,{\it t_{1}}}\left[\cos\left(\omega_{1}\,{\it t_{1}}\right)\cos\left(\omega_{1}\,{\it t_{2}}\right)\omega_{1}^{2}+\cos\left(\omega_{1}\,{\it t_{1}}\right)\sin\left(\omega_{1}\,{\it t_{2}}\right)\gamma\,\omega_{1}-\cos\left(\omega_{1}\,{\it t_{2}}\right)\sin\left(\omega_{1}\,{\it t_{1}}\right)\gamma\,\omega_{1}\right]\right.\\
& + {\rm e}^{2\,\gamma\,{\it t_{1}}}\sin\left(\omega_{1}\,{\it t_{1}}\right)\sin\left(\omega_{1}\,{\it t_{2}}\right)\omega_{1}^{2}-\cos\left(\omega_{1}\,{\it t_{1}}\right)\cos\left(\omega_{1}\,{\it t_{2}}\right)\omega_{1}^{2}-\omega_{1}\,\cos\left(\omega_{1}\,{\it t_{1}}\right)\sin\left(\omega_{1}\,{\it t_{2}}\right)\gamma\\
& + \left.\omega_{1}\,\sin\left(\omega_{1}\,{\it t_{1}}\right)\cos\left(\omega_{1}\,{\it t_{2}}\right)\gamma-2\,\gamma^{2}\sin\left(\omega_{1}\,{\it t_{1}}\right)\sin\left(\omega_{1}\,{\it t_{2}}\right)-\sin\left(\omega_{1}\,{\it t_{1}}\right)\sin\left(\omega_{1}\,{\it t_{2}}\right)\omega^{2}\right)\end{aligned}$$
</div>
<p>which, with some rearrangement, simplifies to the variance<sup id="fnref:4"><a href="#fn:4" class="footnote">2</a></sup>:</p>
<script type="math/tex; mode=display">\langle x(t)^{2}\rangle_{C}=\frac{D}{4\gamma\omega^{2}}\left[1-\exp(-2\gamma t)\left\{ 1+\frac{\gamma}{\omega_{1}}\left(\sin(2\omega_{1}t)+\frac{2\gamma}{\omega_{1}}\sin^{2}(\omega_{1}t)\right)\right\} \right].</script>
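<p>This variance can likewise be checked by simulating the noisy oscillator as a first-order system in $(x,\dot{x})$ with Euler-Maruyama; again the parameter values below are arbitrary:</p>

```python
import numpy as np

# x'' + 2*gamma*x' + omega**2 * x = sqrt(D)*eta, with x(0) = v(0) = 0
gamma, omega, D = 0.5, 2.0, 1.0
dt, T, n_paths = 1e-3, 1.0, 20000
rng = np.random.default_rng(0)

x = np.zeros(n_paths)
v = np.zeros(n_paths)
for _ in range(int(T / dt)):
    dW = np.sqrt(dt) * rng.standard_normal(n_paths)
    # explicit Euler step for (x, v); noise enters the velocity equation
    x, v = x + v * dt, v + (-2 * gamma * v - omega**2 * x) * dt + np.sqrt(D) * dW

w1 = np.sqrt(omega**2 - gamma**2)
var_theory = D / (4 * gamma * omega**2) * (
    1 - np.exp(-2 * gamma * T) * (1 + (gamma / w1) * (np.sin(2 * w1 * T)
        + (2 * gamma / w1) * np.sin(w1 * T) ** 2)))
print(x.var(), var_theory)
```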
<h2 id="connection-to-fokker-planck-equation">Connection to Fokker-Planck Equation</h2>
<p>So far we have considered the moment generating functional, and the
probability density functional $P[x(t)]$, however often of interest is
the probability density $p(x,t)$. This can be computed from the above
framework with the following derivation.</p>
<p>Let $U(x_{1},t_{1}|x_{0},t_{0})$ be the transition probability between a
start point $x_{0},t_{0}$ to $x_{1},t_{1}$, then</p>
<div>
$$\begin{aligned}
U(x_{1},t_{1}|x_{0},t_{0}) & = \int\mathcal{D}x(t)\delta(x(t_{1})-x_{1})P[x(t)]\\
& = \frac{1}{2\pi i}\int d\lambda\int\mathcal{D}x(t)e^{\lambda(x(t_{1})-x_{1})}P[x(t)]\\
& = \frac{1}{2\pi i}\int d\lambda e^{-\lambda(x_{1}-x_{0})}Z_{CM}(\lambda)\end{aligned}$$
</div>
<p>where $Z_{CM}$ gives the moments of $x(t_{1})-x_{0}$ given
$x(t_{0})=x_{0}$</p>
<script type="math/tex; mode=display">Z_{CM}=\int\mathcal{D}xe^{\lambda(x(t_{1})-x_{0})}P[x(t)]</script>
<p>Using the
following two relations:</p>
<div>
$$\begin{aligned}
Z_{CM}(\lambda) & = 1+\sum_{n=1}^{\infty}\frac{\lambda^{n}}{n!}\langle(x(t_{1})-x_{0})^{n}\rangle_{x(t_{0})=x_{0}}\\
\frac{1}{2\pi i}\int d\lambda\, e^{-\lambda(x_{1}-x_{0})}\lambda^{n} & = \left(-\frac{\partial}{\partial x_{1}}\right)^{n}\delta(x_{1}-x_{0})\end{aligned}$$
</div>
<p>then $U$ becomes</p>
<script type="math/tex; mode=display">U(x_{1},t_{1}|x_{0},t_{0})=\left(1+\sum_{n=1}^{\infty}\frac{1}{n!}\left(-\frac{\partial}{\partial x_{1}}\right)^{n}\langle(x(t_{1})-x_{0})^{n}\rangle_{x(t_{0})=x_{0}}\right)\delta(x_{1}-x_{0}).</script>
<p>From here we can derive a relation for $p(x,t)$:</p>
<div>
$$\begin{aligned}
p(y,t+\Delta t) & = \int U(y,t+\Delta t|y',t)p(y',t)\, dy'\\
& = \int\left(1+\sum_{n=1}^{\infty}\frac{1}{n!}\left(-\frac{\partial}{\partial y}\right)^{n}\langle(x(t+\Delta t)-y')^{n}\rangle_{x(t)=y'}\right)\delta(y-y')p(y',t)\, dy'\\
& = \left(1+\sum_{n=1}^{\infty}\frac{1}{n!}\left(-\frac{\partial}{\partial y}\right)^{n}\langle(x(t+\Delta t)-y)^{n}\rangle_{x(t)=y}\right)p(y,t)\end{aligned}$$
</div>
<p>and thus a PDE for $p(x,t)$:</p>
<div>
$$\begin{aligned}
\frac{\partial p(y,t)}{\partial t}\Delta t & = \sum_{n=1}^{\infty}\frac{1}{n!}\left(-\frac{\partial}{\partial y}\right)^{n}\langle(x(t+\Delta t)-y)^{n}\rangle_{x(t)=y}p(y,t)+O(\Delta t^{2})\\
\frac{\partial p(y,t)}{\partial t} & = \sum_{n=1}^{\infty}\frac{1}{n!}\left(-\frac{\partial}{\partial y}\right)^{n}D_{n}(y,t)p(y,t)\end{aligned}$$
</div>
<p>as $\Delta t\to0$. This is the Kramers-Moyal expansion, where
the $D_{n}$ are</p>
<p><script type="math/tex">D_{n}(y,t)=\lim_{\Delta t\to0}\left.\frac{\langle(x(t+\Delta t)-y)^{n}\rangle}{\Delta t}\right|_{x(t)=y}</script>
and are computed from the SDE. For example, for the Ito process</p>
<p><script type="math/tex">dx=f(x,t)dt+g(x,t)dB_{t}</script> we can compute $D_{1}(y,t)=f(y,t)$ and
$D_{2}(y,t)=g(y,t)^{2}$, $D_{n}=0$ for $n>2$. Hence the PDE becomes a
Fokker-Planck equation</p>
<p><script type="math/tex">\frac{\partial p(y,t)}{\partial t}=\left(-\frac{\partial}{\partial y}D_{1}(y,t)+\frac{1}{2}\frac{\partial^{2}}{\partial y^{2}}D_{2}(y,t)\right)p(y,t)</script>
We can then compute $p(x,t)=U(x,t|0,0)$ as</p>
<div>
$$\begin{aligned}
p(x,t) & = \frac{1}{2\pi i}\int d\lambda\, e^{-\lambda x}Z_{CM}(\lambda)\\
& = \frac{1}{2\pi i}\int d\lambda\, e^{-\lambda x}\exp\left[\sum_{n=1}\frac{1}{n!}\lambda^{n}\langle x(t)^{n}\rangle_{C}\right]\end{aligned}$$
</div>
<p>For OU, we know the cumulants hence</p>
<script type="math/tex; mode=display">p(x,t)=\sqrt{\frac{a}{\pi D(1-e^{-2a(t-t_{0})})}}\exp\left(\frac{-a(x-ye^{-a(t-t_{0})})^{2}}{D(1-e^{-2a(t-t_{0})})}\right)</script>
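<p>This is simply a Gaussian density with mean $ye^{-a(t-t_{0})}$ and variance $D(1-e^{-2a(t-t_{0})})/2a$, which is easy to confirm numerically (parameter values below are arbitrary):</p>

```python
import numpy as np
from scipy.stats import norm

a, D, y, t0, t = 1.0, 0.2, 1.0, 0.0, 1.5    # arbitrary illustrative values
tau = t - t0
mean = y * np.exp(-a * tau)
var = D * (1 - np.exp(-2 * a * tau)) / (2 * a)

xs = np.linspace(-2, 3, 11)
# the density from the post, evaluated on a grid
p = np.sqrt(a / (np.pi * D * (1 - np.exp(-2 * a * tau)))) \
    * np.exp(-a * (xs - mean) ** 2 / (D * (1 - np.exp(-2 * a * tau))))
print(np.allclose(p, norm.pdf(xs, loc=mean, scale=np.sqrt(var))))  # True
```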
<h1 id="statistical-mechanics-of-the-neocortex">Statistical mechanics of the neocortex</h1>
<p>Having spent some time on how path integrals can be used as
calculation devices for studying stochastic DEs, we now turn to some
specific examples of their use in neuroscience.</p>
<h2 id="neural-field-models">Neural field models</h2>
<p>A neural field model represents a continuum approximation to neural
activity (particularly in models of cortex). They are often expressed as
integro-differential equations:</p>
<script type="math/tex; mode=display">dU=\left[-U+\int_{-\infty}^{\infty}w(x-y)F(U(y,t))dy\right]dt</script>
<p>where
$U=U(x,t)$ may be either the mean firing rate or a measure of synaptic
input at position $x$ and time $t$. The function $w(x,y)=w(|x-y|)$ is a
weighting function often taken to represent the synaptic weight as a
function of distance from $x$. $F(U)$ is a measure of the firing rate as
a function of inputs. For tractability, $F$ may often be taken to be a
heaviside function, or a sigmoid curve. It is called a field because
each continuous point $x$ is assigned a value $U$, instead of modelling
the activity of individual neurons. A number of spatio-temporal pattern
forming systems may be studied in the context of these models. The
formation of ocular-dominance columns, geometric hallucinations,
persistent ‘bump models’ of activity associated with working memory, and
perceptual switching in optical illusions are all examples of pattern
formation that can be modelled by such a theory. Refer to Bressloff 2012
for a comprehensive review.</p>
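<p>A minimal deterministic simulation of such a model can be sketched as follows; the Mexican-hat weight function, sigmoidal firing rate, and all parameter values here are invented purely for illustration:</p>

```python
import numpy as np

# dU/dt = -U + integral of w(x - y) F(U(y, t)) dy, discretized on a 1D grid
n, L, dt, steps = 200, 10.0, 0.01, 500
xgrid = np.linspace(-L / 2, L / 2, n)
dx = xgrid[1] - xgrid[0]

# local excitation with broader lateral inhibition ("Mexican hat")
dist = np.abs(xgrid[:, None] - xgrid[None, :])
w = np.exp(-dist**2) - 0.5 * np.exp(-dist**2 / 4)

def F(u):                                  # sigmoidal firing-rate function
    return 1.0 / (1.0 + np.exp(-5.0 * (u - 0.2)))

U = 0.5 * np.exp(-xgrid**2)                # localized initial bump
for _ in range(steps):
    U = U + dt * (-U + (w @ F(U)) * dx)    # forward-Euler step

print(U.min(), U.max())                    # activity remains bounded
```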
<p>The addition of additive noise to the above model:</p>
<script type="math/tex; mode=display">dU=\left[-U+\int_{-\infty}^{\infty}w(x-y)F(U(y,t))dy\right]dt+g(U)dW(x,t)\label{eq:neuralfield}</script>
<p>for $dW(x,t)$ a white noise process has been studied by Bressloff
from both a path integral approach, and by studying a
perturbation expansion of the resulting master equation more directly.
We describe briefly how the path integral approach is formulated, and
the results that can be computed as a result. More details are found in
Bressloff 2009.</p>
<p>As in the derivations of Section 2, the stochastic neural field equation above is
discretized in both time and space to give:</p>
<script type="math/tex; mode=display">U_{i+1,m}-U_{i,m}=\left[-U_{i,m}+\Delta d\sum_{n}w_{mn}F(U_{i,n})\right]\Delta t+\frac{\sqrt{\Delta t}}{\sqrt{\Delta d}}g(U_{i,m})dW_{i,m}+\Phi_{m}\delta_{i,0}</script>
<p>for initial condition function $\Phi(x)=U(x,0).$ Where each noise
process is a zero-mean, delta correlated process:</p>
<script type="math/tex; mode=display">\langle dW_{i,m}\rangle=0,\quad\langle dW_{i,m}dW_{j,n}\rangle=\delta_{i,j}\delta_{m,n}.</script>
<p>Let $U$ and $W$ represent vectors with components $U_{i,m}$ and
$W_{i,m}$ such that we can write down the probability density function
conditioned on a particular realization of $W$:</p>
<script type="math/tex; mode=display">P(U|W)=\prod_{m}\prod_{i=1}^{N}\delta\left(U_{i+1,m}-U_{i,m}+\left[U_{i,m}-\Delta d\sum_{n}w_{mn}F(U_{i,n})\right]\Delta t-\frac{\sqrt{\Delta t}}{\sqrt{\Delta d}}g(U_{i,m})dW_{i,m}-\Phi_{m}\delta_{i,0}\right)</script>
<p>where we again use the Fourier representation of the delta function:</p>
<script type="math/tex; mode=display">P(U|W)=\int\prod_{m}\prod_{i=1}^{N}\frac{d\tilde{U}_{i,m}}{2\pi}\exp\left[-i\tilde{U}_{i,m}\left(U_{i+1,m}-U_{i,m}+\left[U_{i,m}-\Delta d\sum_{n}w_{mn}F(U_{i,n})\right]\Delta t-\frac{\sqrt{\Delta t}}{\sqrt{\Delta d}}g(U_{i,m})dW_{i,m}-\Phi_{m}\delta_{i,0}\right)\right].</script>
<p>Knowing the density for the random vector $W$ we can write the
probability of a vector $U$:</p>
<script type="math/tex; mode=display">P(U)=\int\prod_{m}\prod_{i=1}^{N}\frac{d\tilde{U}_{i,m}}{2\pi}\exp\left[-i\tilde{U}_{i,m}\left(U_{i+1,m}-U_{i,m}+\left[U_{i,m}-\Delta d\sum_{n}w_{mn}F(U_{i,n})\right]\Delta t-\Phi_{m}\delta_{i,0}\right)-\frac{\Delta t}{2\Delta d}g^{2}(U_{i,m})\tilde{U}_{i,m}^{2}\right].</script>
<p>Taking the continuum limit gives the density:</p>
<script type="math/tex; mode=display">P[U]=\int\mathcal{D}\tilde{U}e^{-S[U,\tilde{U}]},</script>
<p>for action</p>
<script type="math/tex; mode=display">S[U,\tilde{U}]=\int dx\int_{0}^{T}dt\left(\tilde{U}\left[U_{t}(x,t)+U(x,t)-\int w(x-y)F(U(y,t))dy-\Phi(x)\delta(t)\right]-\frac{1}{2}\tilde{U}^{2}g^{2}(U(x,t))\right).</script>
<p>Given the action, the moment generating functional and propagator can be
defined as previously. In linear cases the moments can be computed
exactly.</p>
<h3 id="the-weak-noise-expansion">The weak-noise expansion</h3>
<p>Suppose the noise term is scaled by a small parameter, $g(U)\to\sigma g(U)$
with $\sigma\ll1$. (For instance, in the case of a Langevin approximation
to the master equation, it is the case that $\sigma\approx1/N$ for $N$
the number of neurons.) Rescaling variables
$\tilde{U}\to\tilde{U}/\sigma^{2}$ and
$\tilde{J}\to\tilde{J}/\sigma^{2}$ then the generating functional
becomes:</p>
<script type="math/tex; mode=display">Z=\int\mathcal{D}U\mathcal{D}\tilde{U}e^{-\frac{1}{\sigma^{2}}S[U,\tilde{U}]}e^{\frac{1}{\sigma^{2}}\int dx\int_{0}^{T}dt[\tilde{U}J+\tilde{J}U]},</script>
<p>which can be thought of in terms of the loop expansion described in
the earlier section on perturbative approaches. Performing the expansion in orders of
$\sigma$ allows for a ‘semi-classical’ expansion to be performed. The
corrections to the deterministic equations take the form</p>
<script type="math/tex; mode=display">\frac{\partial v}{\partial t}=-v(x,t)+\int w(x-y)F(v(y,t))dy+\frac{\sigma^{2}}{2}\int w(x-y)C(x,y,t)F''(v(y,t))dy+O(\sigma^{4})</script>
<p>for $C(x,y,t)$ the second-order cumulant (covariance) function. The
expression for $C(x,y,t)$ is derived and studied in more detail in Buice
<em>et al</em> 2010.</p>
<h2 id="mean-field-wilson-cowan-equations-and-corrections">Mean-field Wilson-Cowan equations and corrections</h2>
<p>Another approach using path integrals has been extensively studied by
Buice and Cowan (Buice 2007; see also Bressloff
2009). Here, we envision a network of neurons, each of which exists
in one of two or three states, depending on the time scales of
interest relative to the time scales of the neurons being studied. Each
neuron in the network is modeled as a Markov process which transitions
between active and quiescent states (and a refractory state, if relevant).</p>
<p>For the two state model, assume that each neuron in the network creates
spikes and that these spikes have an impact on the network dynamics for
an exponentially distributed time given by a decay rate $\alpha.$ Let
$n_{i}$ denote the number of ‘active’ spikes at a given time for neuron
$i$ and let $\mathbf{n}$ denote the state of all neurons at a given
time. We assume that neuron $i$ becomes active at a rate given by the
function</p>
<script type="math/tex; mode=display">f\left(\sum_{j}w_{ij}n_{j}+I\right)</script>
<p>for some firing rate function
$f$ and some external input $I$. Then the master equation for the state
of the system is:</p>
<script type="math/tex; mode=display">\frac{dP(\mathbf{n},t)}{dt}=\sum_{i}\left\{\alpha(n_{i}+1)P(\mathbf{n}_{i+},t)-\alpha n_{i}P(\mathbf{n},t)+f\left(\sum_{j}w_{ij}n_{j}+I\right)\left[P(\mathbf{n}_{i-},t)-P(\mathbf{n},t)\right]\right\}</script>
<p>where we denote by $\mathbf{n}_{i\pm}$ the state of the network
$\mathbf{n}$ with one more or one fewer active spike in neuron $i$. The
assumption is made that each neuron is identical and that the weight
function $w_{ij}=w_{|i-j|}$, that is, it only depends on the distance
between the two neurons. Of interest is the mean activity of neuron $i$:</p>
<script type="math/tex; mode=display">a_{i}(t)=\langle n_{i}(t)\rangle.</script>
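<p>This master equation can be simulated exactly with Gillespie’s algorithm. The sketch below is illustrative only – the parameters and the constant rate function are my own choices, not taken from Buice and Cowan – and reads the rate of neuron $i$ as $f$ applied to its summed input:</p>

```python
import numpy as np

def gillespie(w, f, I, alpha, T, n0, rng):
    """Exact stochastic simulation of the two-state master equation:
    neuron i gains an active spike at rate f((w @ n)[i] + I) and
    loses one at rate alpha * n[i]."""
    n = np.array(n0, dtype=float)
    N = n.size
    t = 0.0
    while t < T:
        rates = np.concatenate([alpha * n, f(w @ n + I)])
        total = rates.sum()
        if total <= 0:
            break
        t += rng.exponential(1.0 / total)  # time to next event
        k = rng.choice(2 * N, p=rates / total)
        if k < N:
            n[k] -= 1.0       # a spike decays
        else:
            n[k - N] += 1.0   # neuron k - N fires
    return n

# Uncoupled sanity check: with w = 0 and a constant rate f = lam, each
# neuron's count equilibrates to a Poisson distribution with mean lam / alpha.
rng = np.random.default_rng(0)
N, lam, alpha = 100, 2.0, 1.0
n = gillespie(np.zeros((N, N)), lambda u: np.full(u.shape, lam), 0.0,
              alpha, 20.0, np.zeros(N), rng)
```

<p>In the uncoupled case the simulated mean activity settles near $\lambda/\alpha$, the fixed point of the corresponding rate equation.</p>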
<p>Using an operator representation it is possible to derive a stochastic
field theory in the continuum limit ($N\to\infty$ and
$n_{i}(t)\to n(x,t)$) to give moments of $n_{i}(t)$ in terms of the
interaction between two fields $\varphi(x,t)$ and
$\tilde{\varphi}(x,t)$. The details of the derivation are contained in
the Appendices of Buice and Cowan 2007. These fields can be
related to quantities of interest through</p>
<script type="math/tex; mode=display">a(x,t)=\langle n(x,t)\rangle=\langle\varphi(x,t)\rangle</script>
<p>and</p>
<script type="math/tex; mode=display">\langle n(x_{1},t_{1})n(x_{2},t_{2})\rangle=\langle\varphi(x_{1},t_{1})\varphi(x_{2},t_{2})\rangle+\langle\varphi(x_{1},t_{1})\tilde{\varphi}(x_{2},t_{2})\rangle a(x_{2},t_{2})</script>
<p>for $t_{1}>t_{2}$. The propagator, as before, is</p>
<script type="math/tex; mode=display">G(x_{1},t_{1};x_{2},t_{2})=\langle\varphi(x_{1},t_{1})\tilde{\varphi}(x_{2},t_{2})\rangle</script>
<p>and the generating function is given by</p>
<script type="math/tex; mode=display">Z[J,\tilde{J}]=\int\mathcal{D}\varphi\mathcal{D}\tilde{\varphi}e^{-S[\varphi,\tilde{\varphi}]+J\tilde{\varphi}+\tilde{J}\varphi}.</script>
<p>For the master equation above the action is given by:</p>
<script type="math/tex; mode=display">S[\varphi,\tilde{\varphi}]=\int dx\int_{0}^{T}dt\left(\tilde{\varphi}\partial_{t}\varphi+\alpha\tilde{\varphi}\varphi-\tilde{\varphi}f\left(w\star[\tilde{\varphi}\varphi+\varphi]+I\right)\right)-\int dx\,\bar{n}(x)\tilde{\varphi}(x,0),</script>
<p>for convolution $\star$ and initial condition $\bar{n}(x).$ The
action can now be divided into a free and an interacting part,
and perturbation expansions can be performed.</p>
<p>The loop expansion provides a useful ordering and, as described in
Section 2, amounts to organizing Feynman diagrams by the number of loops
contained in them. The zeroth order, mean-field theory, corresponds to
Feynman diagrams containing zero loops; such diagrams are called tree
diagrams. It can be shown that the dynamics of this expansion obey:
<script type="math/tex">\partial_{t}a_{0}(x,t)+\alpha a_{0}(x,t)-f(w\star a_{0}(x,t)+I)=0</script>
which is a simple form of the well-known Wilson-Cowan equations.
That these equations are recovered as the zeroth order expansion of the
continuum limit of the master equation gives confidence that the
higher-order terms of the expansion will indeed correspond to relevant
dynamics. The ‘one-loop’ correction is given by:</p>
<script type="math/tex; mode=display">\partial_{t}a_{1}(x,t)+\alpha a_{1}(x,t)-f(w\star a_{1}(x,t)+I)+h\mathcal{N}(a_{1},\Delta)=0</script>
<p>for</p>
<script type="math/tex; mode=display">\mathcal{N}(a,\Delta)=\int dx_{1}dx_{2}dx'dt'dx''f^{(2)}(x,t)w(x-x_{1})w(x-x_{2})f^{(1)}(x',t')w(x'-x'')\Delta(x_{1}-x',t-t')\Delta(x_{2}-x'',t-t')a(x'',t')</script>
<p>and $\Delta$ the ‘tree-level’ propagator.</p>
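<p>The zeroth-order (tree-level) Wilson-Cowan equation above can be integrated numerically with a forward-Euler scheme. This sketch assumes an illustrative Gaussian coupling kernel and sigmoidal rate function on a periodic domain – none of these choices come from the text:</p>

```python
import numpy as np

def simulate_wilson_cowan(alpha=1.0, I=0.0, L=10.0, nx=128, T=20.0, dt=0.01):
    """Forward-Euler integration of the mean-field (tree-level) dynamics
    da/dt = -alpha * a + f(w * a + I) on a ring of circumference L."""
    x = np.linspace(0.0, L, nx, endpoint=False)
    dx = x[1] - x[0]
    # ring distance between grid points, Gaussian coupling footprint
    d = np.abs(x[:, None] - x[None, :])
    d = np.minimum(d, L - d)
    w = np.exp(-d**2 / 2.0) * dx                    # kernel, with quadrature weight
    f = lambda u: 1.0 / (1.0 + np.exp(-(u - 1.0)))  # sigmoidal rate function
    a = 0.1 * np.ones(nx)
    for _ in range(round(T / dt)):
        a = a + dt * (-alpha * a + f(w @ a + I))
    return a
```

<p>With a homogeneous initial condition the activity relaxes to a spatially uniform fixed point $a^{*}$ satisfying $\alpha a^{*}=f(Wa^{*}+I)$, where $W$ is the integral of the kernel.</p>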
<h1 id="summary">Summary</h1>
<p>We have described how path integrals can be used to compute moments and
densities of a stochastic differential equation, and how they can be
used to perform perturbation expansions around ‘mean-field’, or
classical, solutions.</p>
<p>Path integral methods, once one is accustomed to their use, can provide
a quick and intuitive way of solving particular problems. However, it is
worth highlighting that there are few examples of problems which can be
solved with path integral methods but not with other, perhaps more
standard, methods. Thus, while they are a powerful and general tool,
their utility is often countered by the fact that, for many problems,
simpler solution techniques exist.</p>
<p>Further, it should be highlighted that the path integral as it was
defined here – as a limit of finite-dimensional integration
($\int\prod_{i}^{N}dx_{i}\to\int\mathcal{D}x(t)$) – does not result in a
valid measure. In some cases the Wiener measure may equivalently be
used, but in other cases the path integral as formulated by Feynman
remains a mathematically unjustified entity.</p>
<p>With these caveats in mind, their principal benefit may instead
come from the intuition that they bring to novel mathematical and
physical problems. When unsure how to proceed, having many different
ways of approaching a problem can only be beneficial. Indeed, in 1965
Feynman said in his Nobel acceptance lecture: “Theories of the known,
which are described by different physical ideas may be equivalent in all
their predictions and are hence scientifically indistinguishable.
However, they are not psychologically identical when trying to move from
that base into the unknown. For different views suggest different kinds
of modifications which might be made and hence are not equivalent in the
hypotheses one generates from them in one’s attempt to understand what
is not yet understood.”</p>
<div class="footnotes">
<ol>
<li id="fn:3">
<p><em>e.g.</em> in QED this coupling is related to the charge of the electron
($e$): $\alpha\approx1/137=\text{fine structure constant}$ <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>The expression for the mean and variance I was able to verify (pp.
83–85 of Gitterman 2005). The expression for the covariance I was
unable to locate in another source to verify; that it reduces to the
correct variance is encouraging, however. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Ben Lansdellben dot lansdell at gmail dot comIn the previous post we defined path integrals through a simple ‘time-slicing’ approach. And used them to compute moments of simple stochastic DEs. In this follow-up post we will examine how expansions can be used to approximate moments, how we can use the moment generating functional to compute probability densities, and how these methods may be helpful in some cases in neuroscience.Path integrals and SDEs in neuroscience2016-01-30T00:00:00+00:002016-01-30T00:00:00+00:00http://benlansdell.github.io/statistics/sdes<h2 id="introduction">Introduction</h2>
<p>The path integral was first considered by Wiener in the 1930s in his study of diffusion and Brownian motion. It was later co-opted by Dirac and by Richard Feynman in Lagrangian formulations of quantum mechanics. They provide a quite general and powerful approach to tackling problems not just in quantum field theory but in stochastic differential equations more generally. There is an associated learning curve to being able to make use of path integral methods, however, and for many problems simpler solution techniques exist.</p>
<p>Nonetheless, it is interesting to think about their application to neuroscience. In the following two posts I will describe how path integrals can be defined and used to solve simple SDEs, and why such ‘statistical mechanics’ tools may be useful in studying the brain. The following largely follows material from the paper “Path Integral Methods for Stochastic Differential Equations”, by Carson Chow and Michael Buice, 2012. This post will assume some familiarity with probability and stochastic processes.</p>
<h2 id="statistical-mechanics-in-the-brain">Statistical mechanics in the brain</h2>
<p>The cerebral cortex is the outermost layer of the mammalian brain. In a human brain the <em>neocortex</em> consists of approximately 30 billion neurons. Of all parts of the human brain, its neural activity is the most correlated with our <em>higher-order</em> behaviour: language, self-control, learning, attention, memory, planning. Lesion and stroke studies make clear that the cortex has significant functional localization. Despite this localization, however, individual neurons from different regions of cortex generally require expert training to distinguish – these differences in functionality appear to arise largely from differences in connectivity.</p>
<p>The interplay between structural homogeneity and functional heterogeneity of different cortical regions poses significant challenges to obtaining quantitative models of the large-scale activity of the cortex. Since structured neural activity is observed on spatial scales involving thousands to billions of neurons, and given that this activity is associated with particular functions and pathologies, dynamical models of large-scale cortical networks are necessary for an understanding of these functions and dysfunctions. Examples of large-scale activity include wave-like activity during development, bump models of working memory, avalanches in awake and sleeping states, and pathological oscillations responsible for epileptic seizures.</p>
<p>A particular challenge to building such models is noise: it is well known that significant neural variability at both the individual and population level exists in response to repeated stimuli. The spike trains of individual cortical neurons are in general very noisy, such that their firing is often well approximated by a Poisson process. The primary source of cell-intrinsic noise is fluctuation in ion channel activity, arising from the finite number of ion channels opening and closing, while the primary source of extrinsic noise is uncorrelated synaptic input – a neuron may have thousands of synapses whose inputs often carry no meaningful, correlated signal. Population responses are similarly highly variable. Models of cortical networks must account for this variability, or demonstrate that it is irrelevant to the particular questions being asked.</p>
<p>Methods from statistical mechanics lend themselves well to modelling both of these factors – statistical, but meaningful, connectivity, and noisy, but meaningful, neural responses to stimulus – in networks with large numbers of neurons. With these thoughts in mind, let’s see how path integrals may be used to study SDEs relevant to tackling the above issues.</p>
<h2 id="a-path-integral-representation-of-stochastic-differential-equations">A path integral representation of stochastic differential equations</h2>
<p>We will begin by describing in some detail how they are constructed and manipulated. In general, we would like to study SDEs that may be of the form:</p>
<script type="math/tex; mode=display">\frac{d\mathbf{x}}{dt}=\mathbf{f}(\mathbf{x})+\mathbf{g}(\mathbf{x})\mathbf{\eta}(t)</script>
<p>for some noise process $\eta(t).$ Such a process may be characterized either by its probability density function (pdf, $p(x,t)$) or, equivalently, by its <em>moment hierarchy</em></p>
<script type="math/tex; mode=display">\langle x(t)\rangle,\quad\langle x(t)x(t')\rangle,\dots</script>
<p>A generic SDE in the above form may be studied as either a Langevin equation, or can be written as a Fokker-Planck equation, but perturbation methods in either of these forms may be difficult to apply. The path integral approach provides more mechanical methods for performing particular types of perturbation expansions. In the following sections we will derive a path integral formulation of a moment generating functional of an SDE, using the Ornstein-Uhlenbeck process as an example. This will be used to demonstrate the use of perturbation techniques using Feynman diagrams. We will also derive the pdf $p(x,t)$ of such a process.</p>
<h2 id="path-integrals">Path integrals</h2>
<p>A path integral, loosely, is an integral in which the domain of
integration is not a subset of a finite dimensional space (say
$\mathbb{R}^{n}$) but instead an infinite dimensional function space.
For instance, if we can define the probability density associated with a
particular realization of a random trajectory according to a given SDE,
then the probability that a particle travels from a point $\mathbf{a}$
to a point $\mathbf{b}$ can be computed by marginalizing (summing) over
all paths connecting these two points, subject to a suitable
normalization. Before taking this further it is useful to review some relevant concepts.</p>
<h3 id="moment-generating-functions">Moment generating functions</h3>
<p>The moment generating function (MGF) forms a crucial component to this
framework. Recall that for a single random variable $X$, the <em>moments</em>
($\langle X^{n}\rangle=\int x^{n}P(x)\, dx$) are obtained from the MGF
<script type="math/tex; mode=display">Z(\lambda)=\langle e^{\lambda x}\rangle=\int e^{\lambda x}P(x)\, dx</script>
<p>by taking derivatives</p>
<script type="math/tex; mode=display">\langle X^{n}\rangle=\left.\frac{1}{Z(0)}\frac{d^{n}}{d\lambda^{n}}Z(\lambda)\right|_{\lambda=0},</script>
<p>and that the MGF contains all information about RV $X$, as an alternative to studying the pdf directly.</p>
<p>In a similar fashion we can define <script type="math/tex">W(\lambda)=\log Z(\lambda),</script> so that</p>
<script type="math/tex; mode=display">\langle X^{n}\rangle_{C}=\frac{d^{n}}{d\lambda^{n}}\left.W(\lambda)\right|_{\lambda=0}</script>
<p>are the <em>cumulants</em> of RV $X$.</p>
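<p>These relations are easy to verify numerically: estimate $Z(\lambda)$ from samples and differentiate by finite differences. A small sketch for $X\sim N(2,1)$ (an arbitrary choice of example):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=100_000)  # samples of X ~ N(a = 2, sigma^2 = 1)

def Z(lam):
    """Sample estimate of the MGF <exp(lam X)>; normalized, so Z(0) = 1."""
    return np.mean(np.exp(lam * x))

eps = 1e-3
m1 = (Z(eps) - Z(-eps)) / (2 * eps)            # central difference: <X>
m2 = (Z(eps) - 2 * Z(0.0) + Z(-eps)) / eps**2  # second difference: <X^2>
```

<p>Because the same samples enter every evaluation of $Z$, the finite-difference noise largely cancels, and the estimates land close to $\langle X\rangle=2$ and $\langle X^{2}\rangle=a^{2}+\sigma^{2}=5$.</p>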
<p>For an $n$-dimensional random variable $\mathbf{x}=(x_{1},\dots,x_{n})$, the generating function is</p>
<script type="math/tex; mode=display">Z(\mathbf{\lambda})=\langle e^{\mathbf{\lambda}\cdot\mathbf{x}}\rangle=\int\prod_{i=1}^{n}dx_{i}e^{\mathbf{\lambda}\cdot\mathbf{x}}P(\mathbf{x})</script>
<p>for $\lambda=(\lambda_{1},\dots,\lambda_{n})$. Here, the $k$-th order moments are obtained via</p>
<script type="math/tex; mode=display">\left\langle \prod_{i=1}^{k}x_{(i)}\right\rangle =\left.\frac{1}{Z(0)}\prod_{i=1}^{k}\frac{\partial}{\partial\lambda_{(i)}}Z(\lambda)\right|_{\lambda=0}.</script>
<p>And, as before, the cumulant generating function is $W(\lambda)=\log Z(\lambda)$.</p>
<h3 id="stochastic-processes">Stochastic processes</h3>
<p>Instead of considering random variables in $n$ dimensions, we can
consider ‘infinite dimensional’ random variables through a time-slicing
limiting process. That is, we identify with each $x_{i}$ in $\mathbf{x}$
a time $t=ih$ such that $x_{i}=x(ih)$, and we let total time $T=nh$,
thereby splitting the interval $[0,T]$ into $n$ segments of length $h.$
From here, leaving any questions of convergence, etc, aside for the time
being, we can take the limit $n\to\infty$ (with $h=T/n$) such that
$x_{i}\to x(ih)=x(t)$, $\lambda_{i}\to\lambda(t)$ and
$P(\mathbf{x})\to P[x(t)]=\exp(-S[x(t)])$ for some functional $S[x]$
that we will call the <em>action</em>. Thus we envision that to compute the
MGF, instead of summing over all points in $\mathbb{R}^{n}$
$\left(\int\prod_{i=1}^{n}dx_{i}\right)$, we are instead summing over
all paths using a differential denoted $\int\mathcal{D}x(t)$:</p>
<script type="math/tex; mode=display">Z[\lambda]=\int\mathcal{D}x(t)\, e^{-S[x]+\int\lambda(t)x(t)\, dt}.</script>
<p>From this formula, moments can now be obtained via</p>
<script type="math/tex; mode=display">\left\langle \prod_{i=1}^{k}x(t_{(i)})\right\rangle =\frac{1}{Z[0]}\left.\prod_{i=1}^{k}\frac{\delta}{\delta\lambda(t_{(i)})}Z[\lambda]\right|_{\lambda(t)=0},</script>
<p>with the cumulant generating functional again being</p>
<script type="math/tex; mode=display">W[\lambda]=\log(Z[\lambda]).</script>
<h3 id="generic-gaussian-processes">Generic Gaussian processes</h3>
<p>The most important random process we consider is the Gaussian. Recall
that in one dimension the RV $X\sim N(a,\sigma^{2})$ has MGF</p>
<script type="math/tex; mode=display">Z(\lambda)=\int_{-\infty}^{\infty}\exp\left[\frac{-(x-a)^{2}}{2\sigma^{2}}+\lambda x\right]\, dx=\sqrt{2\pi}\sigma\exp(\lambda a+\lambda^{2}\sigma^{2}/2),</script>
<p>which is obtained by a ‘completing the square’ manipulation, and has cumulant GF</p>
<script type="math/tex; mode=display">W(\lambda)=\lambda a+\frac{1}{2}\lambda^{2}\sigma^{2}+\log(Z(0)),</script>
<p>so that the cumulants are
$\langle x\rangle_{C}=a,\langle x^{2}\rangle_{C}=\text{var}{X}=\sigma^{2}$ and
$\langle x^{k}\rangle_{C}=0$ for all $k>2$.</p>
<p>The $n$ dimensional Gaussian RV $X\sim N(0,K)$, with covariance matrix $K$, has MGF</p>
<script type="math/tex; mode=display">Z(\lambda)=\int_{-\infty}^{\infty}e^{-\frac{1}{2}\sum_{jk}x_{j}K_{jk}^{-1}x_{k}+\sum_{j}\lambda_{j}x_{j}}\, dx</script>
<p>This integral can also be evaluated exactly. Indeed, since $K$ is
symmetric positive definite (and so is $K^{-1}$), we can diagonalise in
orthonormal coordinates, making each dimension independent, and allowing
the integration to be performed one dimension at a time. This provides
<script type="math/tex">Z(\lambda)=[(2\pi)^{n}\det(K)]^{1/2}e^{\frac{1}{2}\sum_{jk}\lambda_{j}K_{jk}\lambda_{k}}.</script>
In an analogous fashion, through the same limiting process described
above, the infinite dimensional case is</p>
<script type="math/tex; mode=display">Z[\lambda]=\int\mathcal{D}x(t)e^{-\frac{1}{2}\int x(s)K^{-1}(s,t)x(t)dsdt+\int\lambda(t)x(t)dt}=Z[0]e^{\frac{1}{2}\int\lambda(s)K(s,t)\lambda(t)dsdt}.</script>
<p>Importantly for perturbation techniques, higher order (centered) moments
of multivariate Gaussian random variables can be expressed simply as a
sum of products of their second moments. This result is known as Wick’s
theorem:</p>
<div>
$$
\left\langle \prod_{i=1}^{k} x_{(i)} \right\rangle = \begin{cases} 0, & k\text{ odd}\\
\sum_{\sigma\in A}K_{\sigma(1)\sigma(2)}K_{\sigma(3)\sigma(4)}\cdots K_{\sigma(k-1)\sigma(k)}, & k\text{ even}\end{cases}
$$
</div>
<p>for $A=\{\text{all pairings of }x_{(i)}\}$. Only even
moments are non-zero. Note that this means that the covariance matrix
$K$ is the key to determining all higher order moments. Wick’s theorem lies at the heart of calculations utilizing Feynman diagrams.</p>
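<p>Wick’s theorem can be checked mechanically by enumerating pairings. A small, self-contained sketch (the helper names are mine):</p>

```python
def pairings(items):
    """Yield every way to partition a sequence into unordered pairs."""
    if not items:
        yield ()
        return
    first, rest = items[0], items[1:]
    for k in range(len(rest)):
        for p in pairings(rest[:k] + rest[k + 1:]):
            yield ((first, rest[k]),) + p

def wick_moment(K, idx):
    """<x_{i1} ... x_{ik}> for a zero-mean Gaussian with covariance K:
    the sum over all pairings of products of K entries; odd moments vanish."""
    if len(idx) % 2:
        return 0.0
    total = 0.0
    for pairing in pairings(tuple(idx)):
        term = 1.0
        for i, j in pairing:
            term *= K[i][j]
        total += term
    return total
```

<p>For a scalar unit-variance Gaussian this reproduces $\langle x^{4}\rangle=3\sigma^{4}=3$, and the number of pairings of $2k$ terms is $(2k-1)!!$ – e.g. 3 pairings of 4 terms and 15 pairings of 6.</p>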
<h2 id="applications-to-sdes">Applications to SDEs</h2>
<p>The previous construction for generic Gaussian processes can be adapted to construct a moment generating functional for generic SDEs of the form</p>
<script type="math/tex; mode=display">\frac{dx}{dt}=f(x,t)+g(x)\eta(t)+y\delta(t-t_{0}),</script>
<p>for $t\in[0,T]$. The process involves the same time-slicing approach, in which the above SDE is discretized in time steps $h$</p>
<script type="math/tex; mode=display">x_{i+1}-x_{i}=f_{i}(x_{i})h+g_{i}(x_{i})w_{i}\sqrt{h}+y\delta_{i,0}</script>
<p>under the Ito interpretation. We assume that each $w_{i}$ is Gaussian with $\langle w_{i}\rangle=0$ and
$\langle w_{i}w_{j}\rangle=\delta_{ij}$, such that $w_{i}$ describes a
Gaussian white noise process. Then the PDF of $\mathbf{x}$ given a
particular instantiation of a random walk $\{w_{i}\}$ is</p>
<script type="math/tex; mode=display">P[x|w;y]=\prod_{i=0}^{N}\delta[x_{i+1}-x_{i}-f_{i}(x_{i})h-g_{i}(x_{i})w_{i}\sqrt{h}-y\delta_{i,0}].</script>
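<p>Before continuing the derivation, note that this time-sliced update rule is precisely the Euler–Maruyama scheme and can be simulated directly. A minimal sketch, using the OU drift and constant noise amplitude that appear as the example later in the post (the parameter values are my own):</p>

```python
import numpy as np

def euler_maruyama(f, g, x0, T, h, rng):
    """Ito time-slicing of dx/dt = f(x) + g(x) eta(t):
    x_{i+1} = x_i + f(x_i) h + g(x_i) w_i sqrt(h), with w_i ~ N(0, 1)."""
    n = round(T / h)
    x = np.empty(n + 1)
    x[0] = x0
    for i in range(n):
        x[i + 1] = x[i] + f(x[i]) * h + g(x[i]) * np.sqrt(h) * rng.standard_normal()
    return x

# One sample path of dx/dt = -a x + sqrt(D) eta(t) with a = 1, D = 0.5
rng = np.random.default_rng(0)
path = euler_maruyama(lambda x: -x, lambda x: np.sqrt(0.5),
                      x0=1.0, T=10.0, h=0.01, rng=rng)
```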
<p>Taking the Fourier transform of the PDF gives:</p>
<script type="math/tex; mode=display">P[x|w;y]=\int\prod_{j=0}^{N}\frac{dk_{j}}{2\pi}e^{-i\sum_{j}k_{j}(x_{j+1}-x_{j}-f_{j}(x_{j})h-g_{j}(x_{j})w_{j}\sqrt{h}-y\delta_{j,0})}</script>
<p>where we’ve made use of the fact that the Dirac delta function has
Fourier transform:</p>
<script type="math/tex; mode=display">\mathcal{F}\{\delta(x-x_{0});x\to k\}=\frac{1}{2\pi}e^{-ix_{0}k}.</script>
<p>Marginalizing over all random trajectories $\{w_{i}\}$ and evaluating the
resulting Gaussian integral gives:</p>
<script type="math/tex; mode=display">P[x|y]=\int\prod_{j=0}^{N}\frac{dk_{j}}{2\pi}e^{-\sum_{j}(ik_{j})\left(\frac{x_{j+1}-x_{j}}{h}-f_{j}(x_{j})-y\delta_{j,0}/h\right)h+\sum_{j}\frac{1}{2}g_{j}^{2}(x_{j})(ik_{j})^{2}h}</script>
<p>Again we take the continuum limit by letting $h\to0$ with $N=T/h$, and
by replacing $ik_{j}$ with $\tilde{x}(t)$ and
${\displaystyle \frac{x_{j+1}-x_{j}}{h}}$ with $\dot{x}(t)$:</p>
<script type="math/tex; mode=display">P[x(t)|y,t_{0}]=\int\mathcal{D}\tilde{x}(t)e^{-\int[\tilde{x}(t)(\dot{x}(t)-f(x(t),t)-y\delta(t-t_{0}))-\frac{1}{2}\tilde{x}^{2}g^{2}(x(t),t)]dt}.</script>
<p>The function $\tilde{x}(t)$ arises from the wave numbers
$k_{j}$; thus we can write down a moment generating functional over both
the position variable and its conjugate:</p>
<script type="math/tex; mode=display">Z[J,\tilde{J}]=\int\mathcal{D}x(t)\mathcal{D}\tilde{x}(t)e^{-S[x,\tilde{x}]+\int\tilde{J}x\, dt+\int J\tilde{x}\, dt}</script>
<p>More generally, instead of $g(x)\eta(t)$ with $\eta(t)$ a white noise
process, an SDE driven by a noise process with cumulant generating functional $W[\lambda(t)]$
will have the PDF:</p>
<div>
$$\begin{aligned}
P[x(t)|y,t_{0}] & = \int\mathcal{D}\eta(t)\delta[\dot{x}(t)-f(x,t)-\eta(t)-y\delta(t-t_{0})]e^{-S[\eta(t)]}\\
& = \int\mathcal{D}\eta(t)\mathcal{D}\tilde{x}(t)e^{-\int\tilde{x}(t)(\dot{x}(t)-f(x,t)-y\delta(t-t_{0}))\, dt+W[\tilde{x}(t)]}\end{aligned}$$
</div>
<p>If $\eta(t)$ is delta correlated
($\langle\eta(t)\eta(t')\rangle=\delta(t-t')$) then $W[\tilde{x}(t)]$
can be Taylor expanded in both $x(t)$ and $\tilde{x}(t)$:</p>
<script type="math/tex; mode=display">W[\tilde{x}(t)]=\sum_{n=1,m=0}^{\infty}\frac{v_{nm}}{n!}\int\tilde{x}^{n}(t)x^{m}(t)\, dt.</script>
<p>Note that the summation over $n$ starts at one because
$W[0]=\log(Z[0])=0$.</p>
<h2 id="the-ornstein-uhlenbeck-process">The Ornstein-Uhlenbeck process</h2>
<p>As an example, consider the Ornstein-Uhlenbeck process</p>
<script type="math/tex; mode=display">\dot{x}(t)+ax(t)-\sqrt{D}\eta(t)=0</script>
<p>which has the action</p>
<script type="math/tex; mode=display">S[x,\tilde{x}]=\int\left[\tilde{x}(t)(\dot{x}(t)+ax(t)-y\delta(t-t_{0}))-\frac{D}{2}\tilde{x}^{2}(t)\right]\, dt.</script>
<p>The moments could be found immediately, since the action is quadratic in
$\tilde{x}(t)$; however, we instead demonstrate how to study the
problem through a perturbation expansion. In this case the perturbation
series will truncate to the exact, and already known, solution. The idea is to
break the action into a ‘free’ and an ‘interacting’ component. The
break the action into a ‘free’ and ‘interacting’ component. The
terminology comes from quantum field theory in which free terms
typically represent a particle without any interaction with a field or
potential, and would have a quadratic action. The free action can
therefore be evaluated exactly, and the interaction term can be
expressed as an ‘asymptotic series’ around this solution. Let the action
be written</p>
<div>
$$\begin{aligned}
S & = S_{F}+S_{I}\\
& = \int\tilde{x}(t)\left[\dot{x}(t)+ax(t)\right]\, dt-\int\left[\tilde{x}(t)y\delta(t-t_{0})+\frac{D}{2}\tilde{x}^{2}(t)\right]\, dt\end{aligned}$$
</div>
<p>We define the function $G$, known as the linear response function,
correlator, or propagator, to be the Green’s function of the linear
differential operator corresponding to the free action:</p>
<script type="math/tex; mode=display">\left(\frac{d}{dt}+a\right)G(t,t')=\delta(t-t')</script>
<p>Note that $G(t,t')$
is in fact exactly equivalent to $K(t,t')$ from the generic Gaussian
stochastic process derived previously. Note also that, in general, the
‘inverse’ of a Green’s function $G(t,t')$ is an integral operator
satisfying:</p>
<script type="math/tex; mode=display">\int dt''G^{-1}(t,t'')G(t'',t')=\delta(t-t'),</script>
<p>for some $G^{-1}(t,t')$. Comparing with the defining equation
$\left(\frac{d}{dt}+a\right)G(t,t')=\delta(t-t')$ shows that the
following choice of $G^{-1}$ is indeed such an inverse:</p>
<script type="math/tex; mode=display">G^{-1}(t,t')=\left(\frac{d}{dt}+a\right)\delta(t-t').</script>
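<p>The propagator can be sanity-checked numerically: convolving $G(t,t')=H(t-t')e^{-a(t-t')}$ (derived below) against a drive $s(t)$ should solve $\dot{x}+ax=s$. A sketch with the illustrative choices $a=1$ and $s(t)=\sin t$, for which the convolution has the closed form $(\sin t-\cos t+e^{-t})/2$:</p>

```python
import numpy as np

a = 1.0
T, n = 10.0, 2001
t = np.linspace(0.0, T, n)
dt = t[1] - t[0]

def G(t1, t2):
    """Green's function of d/dt + a: H(t - t') exp(-a (t - t'))."""
    return np.where(t1 >= t2, np.exp(-a * (t1 - t2)), 0.0)

s = np.sin(t)  # an arbitrary test drive
# x(t) = int_0^T G(t, t') s(t') dt' by the trapezoidal rule on the grid
integrand = G(t[:, None], t[None, :]) * s[None, :]
x = dt * (integrand.sum(axis=1) - 0.5 * (integrand[:, 0] + integrand[:, -1]))

# closed-form solution of x' + x = sin(t) with x(0) = 0
x_exact = 0.5 * (np.sin(t) - np.cos(t) + np.exp(-t))
```

<p>The quadrature agrees with the closed form up to the $O(h)$ error incurred at the discontinuity of $G$ along $t'=t$.</p>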
<p>The free
generating functional, then, is</p>
<script type="math/tex; mode=display">Z_{F}[J,\tilde{J}]=\int\mathcal{D}x(t)\mathcal{D}\tilde{x}(t)e^{-\int dtdt'\tilde{x}(t)G^{-1}(t,t')x(t')+\int\tilde{x}(t)J(t)\, dt+\int x(t)\tilde{J}(t)\, dt}.</script>
<p>So, analogous to the multivariate Gaussian case, we can evaluate this
integral exactly to obtain:</p>
<script type="math/tex; mode=display">Z_{F}[J,\tilde{J}]=e^{\int\tilde{J}(t)G(t,t')J(t')\, dtdt'}.</script>
<p>For the OU process we can in fact solve the linear differential equation
for the Green’s function $G$: <script type="math/tex">G(t,t')=H(t-t')e^{-a(t-t')},</script> for $H$ the Heaviside step function. The <em>free</em>
<em>moments</em> are then given by</p>
<script type="math/tex; mode=display">\left\langle \prod_{ij}x(t_{i})\tilde{x}(t_{j})\right\rangle _{F}=\left.\prod_{ij}\frac{\delta}{\delta\tilde{J}(t_{i})}\frac{\delta}{\delta J(t_{j})}e^{\int\tilde{J}(t)G(t,t')J(t')\, dtdt'}\right|_{J=\tilde{J}=0}.</script>
<p>Importantly, note that</p>
<script type="math/tex; mode=display">\left\langle x(t_{1})\tilde{x}(t_{2})\right\rangle _{F}=\left.\frac{\delta}{\delta\tilde{J}(t_{1})}\frac{\delta}{\delta J(t_{2})}e^{\int\tilde{J}(t)G(t,t')J(t')\, dtdt'}\right|_{J=\tilde{J}=0}=G(t_{1},t_{2})</script>
<p>and
$\langle\tilde{x}(t_{1})\tilde{x}(t_{2})\rangle_{F}=\langle x(t_{1})x(t_{2})\rangle_{F}=0$.</p>
<p>Since the only non-zero second order moments are those in which an
$x(t)$ is paired with an $\tilde{x}(t’)$ then Wick’s theorem means that
all non-zero higher order <em>free moments</em> must have equal numbers of
$x$’s as $\tilde{x}$’s. This is important in performing the expansions
below.</p>
<h3 id="using-feynman-diagrams">Using Feynman diagrams</h3>
<p>We have split the action into, loosely, linear and non-linear parts,
$S=S_{F}+S_{I}$, so that the MGF can be written:</p>
<div>
$$\begin{aligned}
Z[J,\tilde{J}] & = \int\mathcal{D}x(t)\mathcal{D}\tilde{x}(t)e^{-S_{F}-S_{I}+\int\tilde{J}x+\int J\tilde{x}}\\
& = \int\mathcal{D}x(t)\mathcal{D}\tilde{x}(t)P_{F}[x(t),\tilde{x}(t)]e^{-S_{I}+\int\tilde{J}x+\int J\tilde{x}}\\
& = \int\mathcal{D}x(t)\mathcal{D}\tilde{x}(t)P_{F}[x(t),\tilde{x}(t)]\sum_{n=0}^{\infty}\frac{1}{n!}(-S_{I}+\int\tilde{J}x+\int J\tilde{x})^{n}\\
& = \sum_{n=0}^{\infty}\frac{1}{n!}\left\langle \mu^{n}\right\rangle _{F}\end{aligned}$$
</div>
<p>with $\mu=-S_{I}+\int\tilde{J}x\, dt+\int J\tilde{x}\, dt$.</p>
<p>We have now expressed the MGF in terms of a sum of free moments, which
we know how to evaluate. To proceed, expand $S_{I}$:</p>
<script type="math/tex; mode=display">S_{I}=\sum_{m\ge0,n\ge0}V_{mn}=\sum_{m\ge0,n\ge0}v_{mn}\int x^{m}\tilde{x}^{n}\, dt.</script>
<p>In evaluating the expression for $Z$, there exists a diagrammatic way to
visualize each term that we need to consider for a desired moment.
Recall that the only free moments that are going to be non-zero are the
ones containing equal numbers of $x(t)$ and $\tilde{x}(t)$ terms. Wick’s
theorem then expresses these moments as the sum of the product of all
possible pairings between the $x(t)$ and $\tilde{x}(t)$ terms. Thus each
term of the multinomial expansion</p>
<script type="math/tex; mode=display">\left\langle \left(\sum_{n\ge0,m\ge0}v_{mn}\int x^{m}\tilde{x}^{n}\, dt+\int\tilde{J}xdt+\int J\tilde{x}dt\right)^{n}\right\rangle _{F}</script>
<p>can be thought of in terms of these pairings. The idea is that with each
$V_{mn}$ in $S_{I}$ we associate an <em>internal vertex</em> having $m$
entering edges and $n$ exiting edges. The $\int{J}\tilde{x}$ and
$\int\tilde{J}{x}$ terms contribute, respectively, entering and exiting
<em>external vertices.</em> Edges connecting vertices then correspond to a
pairing between an $x(t)$ and $\tilde{x}(t)$. Finally, since</p>
<script type="math/tex; mode=display">\left\langle \prod_{i=1}^{N}\prod_{j=1}^{M}x(t_{i})\tilde{x}(t_{j})\right\rangle =\frac{1}{Z[0,0]}\left.\prod_{i=1}^{N}\prod_{j=1}^{M}\frac{\delta}{\delta\tilde{J}(t_{i})}\frac{\delta}{\delta J(t_{j})}Z\right|_{J=\tilde{J}=0}</script>
<p>then only the terms in the expansion for $Z$ having $N$ entering and $M$
exiting external vertices (and thus $N$ and $M$ auxiliary terms) will
contribute to that moment. These terms are represented by <em>Feynman
diagrams</em>, which are graphs composed of a combination of these vertices
in which each of the $N$ entering external vertices is connected (paired
with), possibly through a number of the internal vertices, the $M$ exiting
external vertices. Moments can be simply computed by writing down all possible
diagrams with the requisite number of external vertices.</p>
<p>As an example, the coupling between external vertex
$\int\tilde{J}x\, dt$ and internal vertex
$\int\delta(t-t_{0})y\tilde{x}(t)\, dt$ in $Z$ can be evaluated as:</p>
<div>
$$\begin{aligned}
Z & = \left\langle \int dtdt'\,\tilde{J}(t)x(t)y\delta(t'-t_{0})\tilde{x}(t')\right\rangle _{F}+\text{all other terms}\\
& = \int dtdt'\,\tilde{J}(t)y\delta(t'-t_{0})\left\langle x(t)\tilde{x}(t')\right\rangle _{F}+\text{all other terms}\\
& = \int dt\, y\tilde{J}(t)G(t,t_{0})+\text{all other terms}.\end{aligned}$$
</div>
<p>But this is best explained diagrammatically. In our case we have:</p>
<script type="math/tex; mode=display">S_{I}=-\int dt\, y\delta(t-t_{0})\tilde{x}(t)-\int dt\,\frac{D}{2}\tilde{x}^{2}(t),</script>
<p>and the relevant vertices are illustrated in Figure 1. The process for
then computing the first and second moment for the OU process is
illustrated in Figure 2. We can see that each term will be written as an
integral involving the auxiliary functions $J$, $\tilde{J}$ and the
propagator $G$. In general, each vertex in each diagram is assigned
temporal index $t_{k}$.</p>
<figure class="center" style="width:300px">
<img src="../../images/feynman1.png" alt="img txt" />
<figcaption>Figure 1. Vertices involved in evaluating moments of example OU process. First
two vertices are internal vertices and are a part of the interacting
action $S_{I}$, the next two vertices are external vertices associated
with an auxiliary variable $J$, $\tilde{J}$. Each edge of a Feynman
diagram contributes a propagator $G(t,t')$.
</figcaption>
</figure>
<figure class="center" style="width:500px">
<img src="../../images/feynman2.png" alt="img txt" />
<figcaption>Figure 2. Computation of the first and second cumulants using Feynman diagrams. The mean
is given by the functional derivative with respect to one auxiliary function
$\tilde{J}$, evaluated at zero. The only non-zero term is
represented by a diagram containing one exiting vertex, and no entering
vertex. In this case the only diagram possible is composed of the
internal vertex representing the initial condition paired with the
exiting vertex. Evaluating the free moment and taking the functional
derivative of this term gives the mean in terms of $G(t,t')$. In a
similar fashion, the second cumulant is also calculated.
</figcaption>
</figure>
<p>For the OU process, in fact, only a finite number of diagrams need be
considered, and the exact mean and covariance can be determined. This is a result of the
linearity of the SDE: a linear SDE can be written to have no $x$ terms
in $S_{I}$, which means all internal vertices have no entering edges and
that all moments in $x$ must correspond to a finite number of diagrams
(in contrast to internal vertices with both entering and exiting edges
which can then be combined in an infinite number of ways). In this case,
from Figure 2, the mean and covariance are given by:</p>
<script type="math/tex; mode=display">\langle x(t)\rangle=yH(t-t_{0})e^{-a(t-t_{0})}</script>
<p>and</p>
<script type="math/tex; mode=display">\langle x(t)x(s)\rangle_{C}=\frac{D}{2a}\left(e^{-a|t-s|}-e^{-a(t+s-2t_{0})}\right).</script>
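<p>These cumulants can be checked against a direct ensemble simulation of the OU process. A sketch with illustrative parameters and $t_{0}=0$, comparing the sample mean with the propagator result $ye^{-a(t-t_{0})}$ and the long-time variance with the standard OU value $D/2a$:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
a, D, y = 1.0, 0.5, 1.0          # decay rate, noise strength, initial kick
h, steps, paths = 0.01, 1000, 20_000
x = np.full(paths, y)            # every path starts at x(t0) = y
for _ in range(steps):           # Euler-Maruyama, vectorized over paths
    x = x - a * x * h + np.sqrt(D * h) * rng.standard_normal(paths)

t = steps * h                    # t = 10, many decay times after t0
mean_pred = y * np.exp(-a * t)   # ~ 0: the initial condition has decayed
var_pred = D / (2 * a)           # stationary variance, 0.25
```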
<h2 id="in-summary">In summary</h2>
<p>We’ve seen how to construct a path integral formulation of a generic SDE, and how to use Feynman diagrams to perform perturbation expansions for the solution. In a <a href="http://benlansdell.github.io/statistics/sdesII/">follow-up post</a> we will consider more examples of how they can be used.</p>Ben Lansdellben dot lansdell at gmail dot comIntroduction