All You Can Eat

The arXiv Book Club

A few friends and I started a reading group earlier this month since we’re all hanging out in Chicago over the summer. This is the first of (hopefully) many posts related to various topics we cover over the next few months. Our entree (this post) summarises the theoretical results from the seminal paper on the Indian Buffet Process by Griffiths and Ghahramani (2011).

Preliminaries

For random variables $A$ and $B$, Bayes’ rule gives us

$$ \begin{align} P(B \mid A) &= \frac{P(B \cap A)}{P(A)} \nonumber \newline &= \frac{P(A \mid B) \cdot P(B)}{P(A)} \label{bayesrule} \end{align} $$

Now, suppose we have some hypothetical distribution $X$ from which data can be drawn, and some hypothetical distribution from which the parameters $\theta$ of $X$ can be drawn. Then, a natural problem is to find the value(s) of $\theta$ that best explain the observed data from $X$. Applying Bayes’ rule from $\eqref{bayesrule}$ is one approach

$$ \begin{align} P(\theta \mid X) &= \frac{P(X \mid \theta) \cdot P(\theta)}{P(X)} \nonumber \newline &\propto P(X \mid \theta) \cdot P(\theta) \nonumber \end{align} $$

Here, $P(\theta)$ is called the prior distribution of $\theta$, which represents our beliefs about $\theta$ before any data is observed; $P(X \mid \theta)$ is the likelihood, which represents how likely we are to observe the current instance of $X$ given the parameter values $\theta$; and $P(X)$ is the likelihood marginalised over all possible $\theta$ values. Bayes’ rule tells us that we can update the prior of $\theta$ using observations from $X$ so that our belief in the value of $\theta$ reflects the information we currently have – we call this updated distribution $P(\theta \mid X)$ the posterior distribution of $\theta$.
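To make this concrete, here is a minimal numerical sketch (the coin-flip setup, grid size, and counts are my own invention, not from the paper): we place a uniform prior on the bias $\theta$ of a coin, observe a handful of flips, and compute the posterior on a discretised grid directly from Bayes’ rule.

```python
import numpy as np

# Hypothetical example: theta is the bias of a coin, with a uniform prior over [0, 1].
# We observe 10 flips (7 heads) and update our beliefs via Bayes' rule on a grid.
theta_grid = np.linspace(0.001, 0.999, 999)   # discretised values of theta
prior = np.ones_like(theta_grid)              # uniform prior P(theta)
prior /= prior.sum()

heads, flips = 7, 10
likelihood = theta_grid**heads * (1 - theta_grid)**(flips - heads)  # P(X | theta)

posterior = likelihood * prior                # proportional to P(theta | X)
posterior /= posterior.sum()                  # normalise by the marginal likelihood P(X)

print(theta_grid[np.argmax(posterior)])       # posterior mode, close to 7/10
```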

Motivating infinite mixture models

We will consider a simple example to make some of these ideas more concrete. Suppose we are the owners of a supermarket and we are interested in better understanding the different types of customers that shop in our store. We have access to the purchasing history of each customer, including information such as which items they purchase, the prices they pay, when they purchase those items, and how many they purchase at one time. We want to use this data to infer attributes of each customer that are naturally unknown, such as their price elasticity, whether they live by themselves or as part of a large family, and whether they are young or old. The intention is to use these attributes to improve store-level operations and ensure customer satisfaction remains high.

This example is a classic unsupervised learning problem, where each customer represents an observation, their purchasing history can be summarized as observable features associated with each observation, and we want to make a prediction that classifies each observation into some unobservable customer-type category. In the literature, the hidden customer category is more generally referred to as a latent variable (or latent feature) in such models, and each category itself is referred to as a class of that latent variable.

One key problem here is determining a reasonable form for this latent structure, such as the number of possible customer types. This is often resolved by assuming there is a unique, finite number of classes that correctly characterises what is observed. Alternatively, we can assume that the correct representation of the latent structure is potentially unbounded, or “infinite”, in the number of customer types – although in practice this is bounded by the number of unique customers we observe. The benefit here is that there is no need to define an additional hyperparameter for how many ground-truth classes should exist in our sample, which seems rather contrived without first analysing the data. This family of models is central to the study of nonparametric Bayesian statistics, where nonparametric refers to the fact that these models have a potentially infinite number of parameters.

The Dirichlet process (DP) mixture model is one such example in this family, where each customer is assigned to one customer-type category, and each category is associated with a distribution over some set of observable properties, such as average price per item or average elapsed time since last purchase. While the Dirichlet process is nonparametric, it is somewhat unnatural to think of customers as being solely associated with one class from the set of all possible classes. Rather, it seems more appropriate that our customers are multi-faceted and that their observed behaviour results from multiple causes. The Indian buffet process (IBP) is a generative approach that is primarily motivated by the DP model, but treats customers (observations) as possessing combinations of causes or drivers (features) that are potentially infinite in number.

Extending finite mixture models…

We begin by first reviewing the construction of finite mixture models, which will motivate the construction of infinite mixture models. Suppose we have $N$ observations $x_1, \dots, x_N$, where each $x_i \in \mathbb{R}^{D}$ is a vector of $D$ observable properties for $i = 1, \dots, N$. Then, let $X = \left[ x_1^\intercal, \dots, x_N^\intercal \right]^\intercal$ be the $N \times D$ data matrix, and let the vector $c = \left[ c_1, \dots, c_N \right]^\intercal \in \{1, \dots, K\}^{N}$ indicate the class assignments of the $N$ observations. Following the definitions from earlier, $P(c)$ is the assignment prior, which specifies the number of classes and their relative probabilities, and $P(X \mid c)$ is the likelihood, which determines how classes relate to the observable properties in $X$. Assuming that the assignment of an observation to a class is independent of the assignments of the other observations, if there are $K$ classes and $\theta = \left[\theta_1, \dots, \theta_K \right]^\intercal$ is the vector of class probabilities, we can write the distribution of the class assignments given $\theta$ as

$$ \begin{align} P(c \mid \theta) &= \prod_{i=1}^{N} P(c_i \mid \theta) \nonumber \newline &= \prod_{i=1}^{N} \theta_{c_i} \label{thetadist} \end{align} $$

where $\theta_{k}$ is the probability of assigning an observation to class $k$ given $\theta$ (so that $P(c_i = k \mid \theta) = \theta_k$), and it follows that $c \mid \theta \sim Multinomial(N, \theta_1, \dots, \theta_K)$. Then, using $\eqref{thetadist}$, we can write the distribution of the data given $\theta$, marginalising over the class assignments, as

$$ \begin{align} P(X \mid \theta) &= \prod_{i=1}^{N} P(x_i \mid \theta) \nonumber \newline &= \prod_{i=1}^{N} \sum_{k=1}^{K} P(x_i \mid c_i = k) \cdot P(c_i = k \mid \theta) \nonumber \newline &= \prod_{i=1}^{N} \sum_{k=1}^{K} P(x_i \mid c_i = k) \cdot \theta_{k} \nonumber \end{align} $$

The distribution of the data is thus a finite mixture over $K$ classes with weights $\theta_k$, and the weights themselves can be treated as parameters that need to be estimated. One common choice is to let $\theta \mid \alpha \sim Dirichlet(K, \alpha_1, \dots, \alpha_K)$ and to set $\alpha_1 = \alpha_2 = \dots = \alpha_K = \frac{\alpha}{K}$. The complete finite mixture model is then

$$ \begin{align} \theta &\mid \alpha \sim Dirichlet \left(K, \frac{\alpha}{K}, \dots, \frac{\alpha}{K} \right) \nonumber \newline c &\mid \theta \sim Multinomial(N, \theta_1, \dots, \theta_K) \nonumber \end{align} $$
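As a small illustrative sketch of this generative process (the parameter values and the function name are mine; I draw each $c_i$ from a discrete distribution with weights $\theta$, which matches the multinomial statement above once we count class memberships):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_assignments(N, K, alpha):
    """Draw theta from a symmetric Dirichlet(alpha/K) prior, then assign each
    of the N observations to a class with probabilities theta."""
    theta = rng.dirichlet(np.full(K, alpha / K))   # mixing weights theta_1, ..., theta_K
    c = rng.choice(K, size=N, p=theta)             # class assignment c_i for each observation
    return theta, c

theta, c = sample_assignments(N=10, K=5, alpha=1.0)
print(c)   # with small alpha, most observations tend to share a few classes
```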

Marginalising $\theta$, we can express the assignment prior as

$$ \begin{align} P(c) &= \int_{\Theta} \prod_{i=1}^{N} P(c_i \mid \theta) \cdot P(\theta) \quad d\theta \label{thetamarg1} \newline &= \int_{\Theta} \frac{1}{D\left(\frac{\alpha}{K}, \dots, \frac{\alpha}{K} \right)} \prod_{k=1}^{K} \theta_{k}^{m_k + \frac{\alpha}{K} - 1} \quad d\theta \label{thetamarg2} \newline &= \frac{D\left(m_1 + \frac{\alpha}{K}, \dots, m_K + \frac{\alpha}{K} \right)}{D\left(\frac{\alpha}{K}, \dots, \frac{\alpha}{K} \right)} \label{thetamarg3} \newline &= \frac{\prod_{k=1}^{K} \Gamma\left(m_k + \frac{\alpha}{K}\right)}{\left[\Gamma \left(\frac{\alpha}{K} \right)\right]^{K}} \cdot \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} \label{thetamarg4} \end{align} $$

where $m_k = \sum_{i=1}^{N} \delta(c_i = k)$ is the number of observations assigned to class $k$, and $D\left(\alpha_1, \dots, \alpha_K \right) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma(\sum_{k=1}^{K} \alpha_k)}$ is the normalising constant in a Dirichlet distribution with parameters $\alpha_1, \dots, \alpha_K$. Here, $\eqref{thetamarg1}$ follows from the law of total probability, $\eqref{thetamarg2}$ follows from combining the density functions of the Dirichlet and multinomial distributions, and $\eqref{thetamarg3}$ and $\eqref{thetamarg4}$ follow from simplifying the resulting expression.
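As a sanity check on $\eqref{thetamarg4}$, the sketch below (my own, with arbitrary choices of $c$, $K$, $\alpha$ and sample size) compares the closed form against a Monte Carlo estimate of $\int \prod_i \theta_{c_i} \, P(\theta) \, d\theta$ obtained by averaging over Dirichlet draws.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)

def log_p_c_closed_form(c, K, alpha):
    """log P(c) from the marginalised form in (thetamarg4)."""
    N = len(c)
    m = np.bincount(c, minlength=K)          # class counts m_k
    return (np.sum(gammaln(m + alpha / K)) - K * gammaln(alpha / K)
            + gammaln(alpha) - gammaln(N + alpha))

def log_p_c_monte_carlo(c, K, alpha, S=200_000):
    """Estimate log P(c) by averaging prod_i theta_{c_i} over draws of theta."""
    theta = rng.dirichlet(np.full(K, alpha / K), size=S)   # S draws of theta
    return np.log(np.mean(np.prod(theta[:, c], axis=1)))

c = np.array([0, 0, 1, 2, 1, 0])
print(log_p_c_closed_form(c, K=3, alpha=1.0))    # exact
print(log_p_c_monte_carlo(c, K=3, alpha=1.0))    # should be close to the exact value
```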

… to infinity

In order to replicate the same derivations for the infinite case, we make use of the recursive identity of the gamma function

$$ \begin{align} \Gamma(x) &= (x-1) \cdot \Gamma(x-1) \label{gammarec} \end{align} $$

Applying $\eqref{gammarec}$ repeatedly to $\eqref{thetamarg4}$ – so that $\Gamma \left(m_k + \frac{\alpha}{K}\right) = \Gamma \left(\frac{\alpha}{K} + 1\right) \prod_{j=1}^{m_k - 1} \left(j + \frac{\alpha}{K} \right)$ for each class with $m_k \geq 1$ – we can write the assignment prior as

$$ \begin{align} P(c) &= \left( \prod_{k=1}^{K} \frac{\Gamma \left(m_k + \frac{\alpha}{K} \right)}{\Gamma \left(\frac{\alpha}{K} \right)} \right) \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} \nonumber \newline &= \left( \prod_{k=1}^{K_+} \frac{\Gamma \left(\frac{\alpha}{K} + 1\right) \prod_{j=1}^{m_k - 1} \left(j + \frac{\alpha}{K} \right)}{\Gamma \left(\frac{\alpha}{K} \right)} \right) \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} \nonumber \newline &= \left(\frac{\alpha}{K} \right)^{K_+} \left(\prod_{k=1}^{K_+} \prod_{j=1}^{m_k - 1} \left(j + \frac{\alpha}{K} \right) \right) \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} \label{assignprior} \end{align} $$

where $K_+$ is the number of classes for which $m_k > 0$, and the indices are reordered so that $m_k > 0$ for all $k \le K_+$ (the classes with $m_k = 0$ contribute a factor of $1$ and drop out). This reordering does not affect the value of the assignment prior. Moreover, $P(c)$ depends on $c$ only through the counts $m_k$, so it is invariant to permutations of the observation order – this property is known as exchangeability. Now, observe that there are $K^{N}$ possible values for $c$, which diverges as $K \rightarrow \infty$, and thus the probability of observing any single assignment vector goes to 0. This motivates modelling the distribution over equivalence classes, rather than over the assignment vectors themselves. We explore these concepts next.

A partition is a division of a set of $N$ observations into subsets, such that each observation belongs to exactly one subset and the ordering of the subsets does not matter. Thus, two assignment vectors that result in the same division of observations correspond to the same partition. A partition therefore defines an equivalence class $[c]$ of assignment vectors $c$, where two assignment vectors belong to the same equivalence class if they correspond to the same partition. Suppose we partition $N$ observations into $K_+$ subsets, and we have $K = K_0 + K_+$ class labels that can be applied to these subsets. Then, there are $\binom{K}{K_+} \times K_+! = \frac{K!}{K_0!}$ assignment vectors that belong to the equivalence class defined by that partition, and we can define a probability distribution over partitions by summing $P(c)$ over all assignment vectors that belong to that class. Using the result from $\eqref{assignprior}$ and taking the limit gives us

$$ \begin{align} P([c]) &= \sum_{c \in [c]} P(c) \nonumber \newline &= \frac{K!}{K_0!} \left(\frac{\alpha}{K} \right)^{K_+} \left(\prod_{k=1}^{K_+} \prod_{j=1}^{m_k - 1} \left(j + \frac{\alpha}{K} \right) \right) \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} \nonumber \newline \lim_{K \rightarrow \infty} P([c]) &= \alpha^{K_+} \left(\prod_{k=1}^{K_+} (m_k - 1)! \right) \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} \label{partprior} \newline \end{align} $$

We therefore have a valid distribution over partitions, and hence over equivalence classes of assignment vectors, which provides an assignment prior for an infinite mixture model. Like the assignment prior from the finite mixture model, this distribution satisfies exchangeability.
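Since $\eqref{partprior}$ depends only on the counts of the occupied classes, it is straightforward to evaluate; here is a small helper (my own, purely illustrative) that computes it on the log scale.

```python
import numpy as np
from scipy.special import gammaln

def log_partition_prob(m, alpha):
    """log of (partprior): the limiting probability of a partition whose occupied
    classes have counts m = [m_1, ..., m_{K_+}], with concentration parameter alpha."""
    m = np.asarray(m)
    N = m.sum()
    K_plus = len(m)
    return (K_plus * np.log(alpha)
            + np.sum(gammaln(m))                  # sum_k log (m_k - 1)!, since Gamma(m) = (m - 1)!
            + gammaln(alpha) - gammaln(N + alpha))

print(np.exp(log_partition_prob([3, 2, 1], alpha=1.0)))   # probability of partitioning 6 items as 3/2/1
```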

Arbitrarily large tables

A Chinese restaurant process (CRP) is an analogous formulation of the “distribution over partitions” described above. Imagine a restaurant with an infinite number of tables, and an infinite number of chairs at each table. As customers enter the restaurant, each chooses an already occupied table with probability proportional to the number of people already seated there, or chooses the next vacant table with probability proportional to the parameter $\alpha$. It follows that customer $i$ is assigned to table $k$ with probability

$$ \begin{align} P(c_i = k \mid c_1, \dots, c_{i-1}) &= \begin{cases} \frac{m_k}{i - 1 + \alpha} \quad &\text{if} \quad k \le K_+ \newline \frac{\alpha}{i - 1 + \alpha} \quad &\text{if} \quad k = K_+ + 1 \end{cases} \nonumber \end{align} $$

where $m_k$ is the number of customers already seated at table $k$ and $K_+$ is the number of currently occupied tables. The assignment of all customers (observations) to tables (classes) via this process produces a distribution over partitions of the same form as $\eqref{partprior}$. The elegance of the CRP is that it provides an intuitive way of generating assignment priors for an infinite mixture model.
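A minimal simulation of the CRP seating rule above (the function name and defaults are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def chinese_restaurant_process(N, alpha):
    """Seat N customers sequentially and return their table assignments."""
    assignments = [0]                        # the first customer opens the first table
    counts = [1]                             # number of customers at each occupied table
    for i in range(2, N + 1):                # customer i arrives
        probs = np.array(counts + [alpha]) / (i - 1 + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):             # a new table, chosen with probability alpha / (i - 1 + alpha)
            counts.append(1)
        else:                                # an occupied table k, chosen with probability m_k / (i - 1 + alpha)
            counts[table] += 1
        assignments.append(table)
    return np.array(assignments)

print(chinese_restaurant_process(N=10, alpha=1.0))
```

Averaging the number of occupied tables over many runs should come out close to $\sum_{i=1}^{N} \frac{\alpha}{i - 1 + \alpha}$, which grows only logarithmically in $N$.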

Connections to latent feature models

We now have the tools for representing observations in terms of infinitely many latent features. In a latent feature model, each observation is represented by a vector of latent feature values $f_i \in \mathbb{R}^{K}$, and the observable properties $x_i \in \mathbb{R}^{D}$ are generated from a distribution that depends on $f_i$. Let $F = [f_1^\intercal, \dots, f_N^\intercal]^\intercal$ be the $N \times K$ matrix of latent features for all $N$ observations, with prior $P(F)$ and likelihood $P(X \mid F)$. Then, $F$ can be expressed as the elementwise (Hadamard) product of two matrices $Z$ and $V$, where $Z \in \{0, 1\}^{N \times K}$ is a binary matrix indicating whether observation $i$ possesses feature $k$, and $V \in \mathbb{R}^{N \times K}$ is a matrix of feature values for each observation. The prior on $F$ can then be defined by specifying priors on $Z$ and $V$ separately, with $P(F) = P(Z) \cdot P(V)$. If $Z$ is sparse, we can define a prior for infinite latent feature models by defining a distribution over infinite binary matrices. Our analysis suggests that such a distribution should satisfy exchangeability, and this can be achieved by starting with a model that assumes a finite number of features, and then taking the limit.
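A tiny sketch of the $F = Z \odot V$ decomposition with made-up matrices (the dimensions and distributions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

N, K = 4, 3
Z = rng.integers(0, 2, size=(N, K))   # binary: does observation i possess feature k?
V = rng.normal(size=(N, K))           # the value of each feature for each observation
F = Z * V                             # elementwise (Hadamard) product: zero wherever a feature is absent

print(F)
```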

From a la carte…

We begin in the same way as in our finite-dimensional mixture model formulation by considering a finite latent feature model. Suppose we have $N$ observations and $K$ features, and the possession of feature $k$ by observation $i$ is indicated by a binary variable $z_{i,k}$, such that the values of $z_{i,k}$ form an $N \times K$ binary matrix $Z$. Assume observation $i$ possesses feature $k$ independently with probability $\pi_k$, where $0 \le \pi_k \le 1$. Then, let $\pi = [\pi_1, \dots, \pi_K]$ and we can write the likelihood as

$$ \begin{align} P(Z \mid \pi) &= \prod_{k=1}^{K} \prod_{i=1}^{N} P(z_{i,k} \mid \pi_k) \nonumber \newline &= \prod_{k=1}^{K} \pi_k^{m_k} (1 - \pi_k)^{N - m_k} \nonumber \end{align} $$

where $m_k = \sum_{i=1}^{N} z_{i,k}$ is the number of observations possessing feature $k$. Note that each column of $Z$ contributes a product of independent Bernoulli likelihoods with parameter $\pi_k$, and for the same reason (conjugacy) that we chose a Dirichlet prior on $\theta$ in our latent class model, here we place a beta prior on each $\pi_k$ with parameters $\frac{\alpha}{K}$ and $1$

$$ \begin{align} P(\pi_k) &= \frac{\pi_k^{\frac{\alpha}{K}-1}}{B \left(\frac{\alpha}{K}, 1 \right)} \nonumber \newline &= \frac{\alpha}{K} \cdot \pi_k^{\frac{\alpha}{K}-1} \nonumber \end{align} $$

where $B(r,s) = \frac{\Gamma(r) \Gamma(s)}{\Gamma(r+s)}$ is the beta function. The complete finite feature model is then

$$ \begin{align} \pi_k \mid \alpha &\sim Beta \left( \frac{\alpha}{K}, 1 \right) \nonumber \newline z_{i,k} \mid \pi_k &\sim Bernoulli(\pi_k) \nonumber \end{align} $$

and the marginal distribution of $Z$ is

$$ \begin{align} P(Z) &= \prod_{k=1}^{K} \int_{\Pi} P(z_{1,k}, \dots, z_{N,k} \mid \pi_k) \cdot P(\pi_k) \quad d\pi_k \label{zmarg1} \newline &= \prod_{k=1}^{K} \int_{\Pi} \left(\prod_{i=1}^{N} P(z_{i,k} \mid \pi_k) \right) P(\pi_k) \quad d\pi_k \label{zmarg2} \newline &= \prod_{k=1}^{K} \frac{B(m_k + \frac{\alpha}{K}, N - m_k + 1)}{B \left( \frac{\alpha}{K}, 1 \right)} \label{zmarg3} \newline &= \prod_{k=1}^{K} \frac{\frac{\alpha}{K} \cdot \Gamma \left(m_k + \frac{\alpha}{K} \right) \Gamma(N - m_k + 1)}{\Gamma \left( N + 1 + \frac{\alpha}{K} \right)} \label{zmarg4} \end{align} $$

Here, $\eqref{zmarg1}$ follows from the law of total probability and the independence of the columns of $Z$ given $\pi$, $\eqref{zmarg2}$ follows from the independence of the $z_{i,k}$ given $\pi_k$, $\eqref{zmarg3}$ follows from combining the densities of the beta and binomial distributions, and $\eqref{zmarg4}$ follows from the definition of the beta function. Note that $P(Z)$ satisfies exchangeability as desired, since it depends on $Z$ only through the column counts $m_k$.
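Putting the pieces of the finite feature model together, here is a short generative sketch (the values of $N$, $K$ and $\alpha$ are arbitrary, and the function name is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_Z(N, K, alpha):
    """Draw pi_k ~ Beta(alpha/K, 1) for each feature, then z_{i,k} ~ Bernoulli(pi_k)."""
    pi = rng.beta(alpha / K, 1.0, size=K)        # per-feature probabilities
    Z = (rng.random((N, K)) < pi).astype(int)    # column k is i.i.d. Bernoulli(pi_k)
    return pi, Z

pi, Z = sample_Z(N=10, K=50, alpha=2.0)
print(Z.sum(axis=0))   # the counts m_k: most columns are empty when alpha/K is small
```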

… to all you can eat

The technical details of equivalence classes for binary matrices are omitted here for brevity – feel free to refer to section 4.2 of the paper if you are interested. Instead, we simply define $[Z]$ as the equivalence class of the binary matrix $Z$ and write

$$ \begin{align} P([Z]) &= \sum_{Z \in [Z]} P(Z) \nonumber \newline &= \frac{K!}{\prod_{h=0}^{2^{N}-1} K_{h}!} \prod_{k=1}^{K} \frac{\frac{\alpha}{K} \Gamma(m_k + \frac{\alpha}{K}) \Gamma(N - m_k + 1) }{\Gamma(N+1+\frac{\alpha}{K})} \label{equivZ1} \end{align} $$

Note that $\frac{K!}{\prod_{h=0}^{2^N-1} K_h!}$ is the cardinality of $[Z]$, where $K_h$ is the number of columns of $Z$ with full history $h$ (see section 4.2 of the paper for details). Reordering the columns of $Z$ such that $m_k > 0$ if $k \le K_+$, and $m_k = 0$ otherwise, we can write the product in $\eqref{equivZ1}$ as

$$ \begin{align} &\prod_{k=1}^{K} \frac{\frac{\alpha}{K} \Gamma(m_k + \frac{\alpha}{K}) \Gamma(N - m_k + 1) }{\Gamma(N+1+\frac{\alpha}{K})} \nonumber \newline = \quad & \left(\frac{\frac{\alpha}{K} \Gamma \left(\frac{\alpha}{K} \right) \Gamma(N+1)}{\Gamma \left( N + 1 + \frac{\alpha}{K} \right)} \right)^{K - K_+} \prod_{k=1}^{K_+} \frac{ \frac{\alpha}{K} \Gamma \left(m_k +\frac{\alpha}{K}\right) \Gamma(N - m_k + 1) }{\Gamma \left( N+1+\frac{\alpha}{K} \right)} \label{equivZ2} \newline = \quad & \left(\frac{\frac{\alpha}{K} \Gamma \left(\frac{\alpha}{K} \right) \Gamma(N+1)}{\Gamma \left( N + 1 + \frac{\alpha}{K} \right)} \right)^{K} \prod_{k=1}^{K_+} \frac{\Gamma \left(m_k +\frac{\alpha}{K}\right) \Gamma(N - m_k + 1) }{\Gamma \left( \frac{\alpha}{K} \right) \Gamma \left( N + 1 \right)} \label{equivZ3} \newline = \quad & \left( \frac{N!}{\prod_{j=1}^{N} \left( j + \frac{\alpha}{K} \right) } \right)^{K} \left(\frac{\alpha}{K} \right)^{K_+} \prod_{k=1}^{K_+} \frac{(N - m_k)! \prod_{j=1}^{m_k-1} \left(j + \frac{\alpha}{K} \right)}{N!} \label{equivZ4} \end{align} $$

Here, $\eqref{equivZ2}$ follows from applying the definition of the column reordering, $\eqref{equivZ3}$ follows from reordering the terms in the product, and $\eqref{equivZ4}$ follows from applying the recursive identity of the gamma function from $\eqref{gammarec}$. Substituting $\eqref{equivZ4}$ into $\eqref{equivZ1}$, and taking the limit gives us

$$ \begin{align} &\lim_{K \rightarrow \infty} P([Z]) \nonumber \newline = \quad & \frac{\alpha^{K_+}}{\prod_{h=1}^{2^{N}-1} K_{h}!} \cdot \exp \left\{-\alpha \sum_{j=1}^{N} \frac{1}{j} \right\} \cdot \prod_{k=1}^{K_+} \frac{(N - m_k)! (m_k - 1)!}{N!} \label{equivZ5} \end{align} $$

Again, this distribution satisfies exchangeability since neither the number of identical columns nor the column sums are affected by the ordering of observations.

Arbitrarily large plates

We are now ready to define the IBP, in much the same way as we defined the CRP. Suppose $N$ customers enter a restaurant sequentially, and each customer encounters a buffet consisting of infinitely many dishes arranged in a line. The first customer starts at the beginning of the line and takes a serving from each dish, stopping after a $Poisson(\alpha)$ number of dishes are on their plate. The $i^{th}$ customer then moves along the buffet line and samples each previously sampled dish with probability $\frac{m_k}{i}$, where $m_k$ is the number of previous customers who have sampled dish $k$. Upon reaching the end of the previously sampled dishes, the $i^{th}$ customer samples a $Poisson \left( \frac{\alpha}{i} \right)$ number of new dishes. We can record which customers sample which dishes using a binary matrix $Z$ with $N$ rows and infinitely many columns, where the entry in position $(i,k)$ is equal to $1$ if customer $i$ sampled dish $k$. Letting $K_1^{(i)}$ denote the number of new dishes sampled by the $i^{th}$ customer, the probability of any particular binary matrix being produced by this process is

$$ \begin{align} P(Z) &= \frac{\alpha^{K_+}}{\prod_{i=1}^{N} K_1^{(i)}! } \cdot \exp \left\{-\alpha \sum_{j=1}^{N} \frac{1}{j} \right\} \cdot \prod_{k=1}^{K_+} \frac{(N - m_k)! (m_k - 1)!}{N!} \nonumber \end{align} $$

and $\frac{\prod_{i=1}^{N} K_1^{(i)}!}{\prod_{h=1}^{2^{N}-1} K_{h}!}$ matrices generated by this process map to the same equivalence class, which gives us the exchangeable distribution $P([Z])$ from $\eqref{equivZ5}$. It is also possible to produce the equivalence class distribution directly by grouping dishes with the same history $h$ – in this exchangeable variant of the IBP, the $i^{th}$ customer makes a single decision for each set of dishes with the same history. If there are $K_h$ dishes with history $h$ and $m_h$ previous customers have sampled each of these dishes, then the $i^{th}$ customer samples a $Binomial \left(\frac{m_h}{i}, K_h \right)$ number of those dishes before sampling $Poisson \left(\frac{\alpha}{i} \right)$ new dishes.
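To tie the construction together, here is a minimal simulation of the (non-exchangeable) sequential IBP described above (the function name and bookkeeping are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def indian_buffet_process(N, alpha):
    """Simulate N customers moving along the buffet; returns the N x K_+ binary matrix Z."""
    dishes = []                                        # dishes[k] = customers who have sampled dish k
    for i in range(1, N + 1):
        # sample each previously tried dish k with probability m_k / i
        for samplers in dishes:
            if rng.random() < len(samplers) / i:
                samplers.append(i)
        # then sample a Poisson(alpha / i) number of brand-new dishes
        for _ in range(rng.poisson(alpha / i)):
            dishes.append([i])
    Z = np.zeros((N, len(dishes)), dtype=int)
    for k, samplers in enumerate(dishes):
        Z[np.array(samplers) - 1, k] = 1
    return Z

Z = indian_buffet_process(N=10, alpha=2.0)
print(Z.shape, Z.sum(axis=0))   # on average the number of dishes K_+ is roughly alpha * sum_{j<=N} 1/j
```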

Tips are welcome!

Griffiths and Ghahramani (2011) go into further detail with examples (section 5) and applications of the IBP approach (section 6), and conclude with a discussion of extensions to the IBP (section 7) and suggestions for future work (section 8). This post covers most of sections 1 through 4, with the exception of the derivation of Gibbs samplers for both the CRP and IBP (see sections 2.4 and 4.7 of the paper if you are interested). Thank you for bearing with all the food-related puns.
