Motivated by topic models and Dirichlet noise in AlphaZero, I suggest two complementary heuristics for setting sparse Dirichlet priors.
machine learning
bayesian statistics
optimization
Author: Eivind Otto Hjelle
Published: July 16, 2025
TLDR: See Section 6 for a concise summary and the practical takeaways.
Introduction
A few days ago, I came across the following statement about Dirichlet distributions on Wikipedia:
An example of where a sparse prior (concentration parameter much less than 1) is called for, consider a topic model, which is used to learn the topics that are discussed in a set of documents, where each “topic” is described using a categorical distribution over a vocabulary of words. A typical vocabulary might have 100,000 words, leading to a 100,000-dimensional categorical distribution. The prior distribution for the parameters of the categorical distribution would likely be a symmetric Dirichlet distribution. However, a coherent topic might only have a few hundred words with any significant probability mass. Accordingly, a reasonable setting for the concentration parameter might be 0.01 or 0.001. With a larger vocabulary of around 1,000,000 words, an even smaller value, e.g. 0.0001, might be appropriate.
This made me wonder: Where do these numbers come from? Is there a principled way to know that \(\alpha\) around \(0.01 - 0.001\) will yield distributions with the appropriate sparsity, i.e. that samples will have mass concentrated on a few hundred words?
I’ve also encountered a related question in my project about AlphaZero. Namely, in the original paper they inject Dirichlet noise at the root node to help exploration of new strategies. The concentration parameter \(\alpha\) is scaled in inverse proportion to the number of legal moves, to values of 0.3, 0.15, and 0.03 for chess, shogi, and Go. Again, is there a principled way to set these values of \(\alpha\) if we know the branching factor1 of the game? And even if we believe that \(\alpha\) should be inversely proportional to the branching factor, which constant of proportionality should we use?
1 The average number of legal moves.
2 The heuristic \(\alpha\) is also close to “optimal” (in a specific sense) for small values of \(N / K\). For example \(N / K < 0.02\) yields \(\alpha\) within 5% of the optimal value. This will all be explained in more detail.
In this post I will argue that there are indeed principled ways to answer these questions, leading to surprisingly simple rules of thumb. In the example of the topic model, we get the following: Given a vocabulary size \(K\) and a target number \(N\) of significant words, set \(\alpha = 1.54 \frac{N}{K}\).2 Using \(N = 200\) and \(K = 100,000\) yields \(\alpha = 0.003\), aligning well with the recommendation from Wikipedia.
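As a quick check of the arithmetic (a minimal sketch; the numbers are the ones quoted above):

# Rule of thumb derived later in the post: alpha = 1.54 * N / K
K = 100_000  # vocabulary size
N = 200      # target number of significant words
alpha = 1.54 * N / K
print(alpha)  # 0.00308, i.e. roughly 0.003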
Warning
I believe that some of the results in this post are novel. But I’m not an expert on the topic and this work is not peer-reviewed. If you find any errors or have any comments, please contact me!
Two heuristics for setting \(\alpha\)
The heuristics for setting \(\alpha\) for the topic model from Wikipedia and the Dirichlet noise in AlphaZero can be formulated as two complementary optimization problems. The first goal is to state these optimization problems precisely; they serve as the theoretical foundation for the simple rules of thumb derived later on.
First, let us standardize some notation and make precise the context in which we work. For a given \(K\), which can represent the vocabulary size or the number of available actions in a given position, we consider the symmetric Dirichlet distribution \(\operatorname{Dir}_K(\alpha)\) of order \(K\) with concentration parameter \(\alpha\). The problem we are trying to solve is that of selecting \(\alpha\).
Heuristic for target number of significant words
Consider the situation of the topic model, where we are given the vocabulary size \(K\), and we have some idea of the number \(N\) of significant words that should be used to describe a topic.
It is important here to clarify exactly what is meant by a significant word. Given a sampled distribution over words \(\mathbf{X} = (X_1, \dots, X_K) \sim \operatorname{Dir}_K(\alpha)\), a word \(i\) is significant if \(X_i\) is large. But “large” is not well defined, so to be more precise, we say that a word \(i\) is significant relative to a significance threshold \(\epsilon\) if \(X_i > \epsilon\).
The main idea for the heuristic can then be stated informally as follows:
Tip 1: Heuristic for target number of significant words
For a target number \(N\) of significant words, set the \(\alpha\) which maximizes the significance threshold \(\epsilon\) with respect to which the expected number of significant words is at least \(N\).
In other words, Tip 1 can be formulated as choosing \(\alpha\) that solves the following optimization problem:
Maximize: \(\epsilon\)
Constraint: \(\operatorname{E}[ \# \lbrace i | X_i > \epsilon \rbrace ] = N\), where the expectation is taken over \(\operatorname{Dir}_K(\alpha)\).
The punchline is that this optimization has the surprisingly simple (approximate) solution \(\alpha = 1.54 \frac{N}{K}\) as long as \(N / K\) is sufficiently small. This will be given thorough justification in Section 3.3.
Visualizing the optimal choice
It is worth exploring in more detail what sets distributions sampled from the optimal \(\operatorname{Dir}_K(\alpha)\) apart, by running a simple experiment. For a given \(\alpha\), take a sample \(\mathbf{X} \sim \operatorname{Dir}_K(\alpha)\), sort the entries of \(\mathbf{X}\) in descending order, and plot the result. This provides a visualization of how concentrated the probability mass of \(\mathbf{X}\) is in some of its entries. For a vocabulary size \(K = 10000\) and target number \(N = 30\), we can make these plots for \(\alpha = 1.54 N / K \approx 0.005\), \(\alpha_{\text{low}} = \alpha / 4\) and \(\alpha_{\text{high}} = 4 \alpha\).
Show code for experiment
import numpy as np
import matplotlib.pyplot as plt

# Setup parameters
N_TARGET = 30
K_VOCAB = 10000

# Concentration parameters to compare
alpha_optimal = 1.54 * N_TARGET / K_VOCAB
alpha_low = alpha_optimal / 4
alpha_high = 4 * alpha_optimal

# Generate one sample for each C
np.random.seed(42)  # for reproducibility
sample_low = np.random.dirichlet([alpha_low] * K_VOCAB)
sample_optimal = np.random.dirichlet([alpha_optimal] * K_VOCAB)
sample_high = np.random.dirichlet([alpha_high] * K_VOCAB)

# Sort the components in descending order
sample_low.sort()
sample_optimal.sort()
sample_high.sort()
sorted_low = sample_low[::-1]
sorted_optimal = sample_optimal[::-1]
sorted_high = sample_high[::-1]

# Create the first plot (log-log)
plt.style.use('seaborn-v0_8-whitegrid')
fig1, ax1 = plt.subplots(figsize=(12, 7))
ax1.plot(sorted_low, label=f'Too Sparse (α={alpha_low:.4f})', color='blue', lw=2)
ax1.plot(sorted_optimal, label=f'Optimal (α={alpha_optimal:.4f})', color='red', lw=2.5)
ax1.plot(sorted_high, label=f'Too Uniform (α={alpha_high:.4f})', color='green', lw=2)

# Mark the target N
ax1.axvline(x=N_TARGET, color='black', linestyle='--', label=f'Target N = {N_TARGET}')
ax1.set_title('Structure of a Single Sample from Different Dirichlet Priors (Log Scale)', fontsize=16)
ax1.set_xlabel('Component Index (Sorted by Value)', fontsize=14)
ax1.set_ylabel('Log Probability Mass', fontsize=14)
ax1.set_xscale('log')
ax1.set_yscale('log')
ax1.set_ylim(bottom=1e-5, top=1.0)  # Limit y-axis to a meaningful range
ax1.legend(fontsize=12)
ax1.grid(True, which="both")
Figure 1
Figure 1 illustrates a different perspective on the optimal \(\alpha\), namely the one for which the \(N\)th highest element in a typical sample is as significant as possible.
Heuristic for target significance threshold
The second heuristic is motivated by the problem of setting \(\alpha\) for Dirichlet noise in AlphaZero. For this problem, \(K\) refers to the number of available actions in a typical position. The goal here is to tune \(\alpha\) so that the number of actions seeing significant exploration is maximized.
The exact details of how Dirichlet noise is implemented in AlphaZero will be covered in Section 5.2. For now, it suffices to say that “significant exploration” refers to the amount of time spent exploring an action supplied by the Dirichlet noise, and this can be translated into a value for the significance threshold \(\epsilon\). So if we have an idea of how much time constitutes significant exploration, we know the significance threshold \(\epsilon\) to aim for, leading to the following heuristic for setting \(\alpha\).
Tip 2: Heuristic for target significance threshold
For a target significance threshold \(\epsilon\), set the \(\alpha\) which maximizes the expected number of significant actions.
In other words, Tip 2 can be stated as the following optimization problem:
Maximize: \(\operatorname{E}[ \# \lbrace i | X_i > \epsilon \rbrace]\), where the expectation is taken over \(\operatorname{Dir}_K(\alpha)\).
Constraint: \(\epsilon\) is fixed.
Just like for Tip 1, this also has a remarkably simple approximate solution given by \(\alpha = \frac{0.43}{K \epsilon}\), as will be shown in Section 3.2.
A few remarks
The symmetric perspective
It should be clear that the optimization problems in heuristics 1 and 2 are complementary. Both can be stated in terms of \(\alpha, \epsilon\), and \(N(\alpha, \epsilon) = \operatorname{E}[ \# \lbrace i | X_i > \epsilon \rbrace]\). In Tip 1, \(N(\alpha, \epsilon)\) is kept fixed and \(\epsilon\) is maximized, whereas in Tip 2, \(\epsilon\) is fixed and \(N(\alpha, \epsilon)\) is maximized.
But it goes deeper than that, because the two heuristics define a correspondence between \(N\) and \(\epsilon\). More precisely: If you start with a target \(N\) and use Tip 1, you get a value \(\alpha\), but also an \(\epsilon\) (the maximal one). If you then use this \(\epsilon\) as a target in Tip 2, you get back the same \(\alpha\) and \(N\) (the maximal one) you started with. And the other way around.
Note (Proof sketch)
One way to see this is to use calculus to show that both optimization problems reduce to the equation
\[ \frac{\partial N}{\partial \alpha} = 0 . \]
Case by case:
Tip 2 is very straightforward. The goal is to maximize \(N(\alpha, \epsilon)\) subject to \(\epsilon\) being fixed, i. e. to maximize a function of one variable. As \(\alpha \to 0\), \(\operatorname{Dir}_K(\alpha)\) moves into the very sparse one-hot region, and \(N(\alpha, \epsilon) \to 1\). In the other extreme, as \(\alpha \to \infty\), \(\operatorname{Dir}_K(\alpha)\) moves into the very dense uniform region, and \(N(\alpha, \epsilon) \to K\) if \(\epsilon < 1/K\) and \(N(\alpha, \epsilon) \to 0\) if \(\epsilon \geq 1/K\). Since by assumption we are in the sparse regime (i. e. \(\epsilon \gg 1/K\)), the maximum has to occur at a critical point in between the extremes, where \(\partial N / \partial \alpha = 0\).
For Tip 1, we want to maximize \(\epsilon\) subject to \(N(\alpha, \epsilon) = N\) fixed. Here, we should assume that \(N\) is well within the range \(1 < N < K\) to remain within the sparse regime and to exclude solutions at the extreme values \(\alpha \to 0\) and \(\alpha \to \infty\). Using Lagrange multipliers, we still arrive at the equation \(\partial N / \partial \alpha = 0\), as sketched below.
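For concreteness, here is the Lagrange-multiplier step (with multiplier \(\lambda\)):
\[ \mathcal{L}(\alpha, \epsilon, \lambda) = \epsilon - \lambda \bigl( N(\alpha, \epsilon) - N \bigr), \qquad \frac{\partial \mathcal{L}}{\partial \epsilon} = 1 - \lambda \frac{\partial N}{\partial \epsilon} = 0, \qquad \frac{\partial \mathcal{L}}{\partial \alpha} = - \lambda \frac{\partial N}{\partial \alpha} = 0 . \]
The first stationarity condition forces \(\lambda \neq 0\), so the second reduces to \(\partial N / \partial \alpha = 0\).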
Another consideration is whether \(\partial N / \partial \alpha = 0\) has a unique solution. Using the exact formula from Section 3.1.1, this seems to be the case, but is a bit annoying to work out precisely.
This symmetry between the two heuristics suggests that we should think of \(\alpha\) as attuned to the pair \((N, \epsilon)\).
Note
According to the approximate solutions found in Section 3.2 and Section 3.3, \(\epsilon\) and \(N\) are approximately related by
\[ N \epsilon = 0.28 . \]
It is not surprising that \(N\) and \(\epsilon\) should be inversely proportional, but this tells us the constant of proportionality. More precisely, the constant is \(A E_1(A) \approx 0.28\), where \(E_1(z) = \int_z^\infty t^{-1}e^{-t} dt\) is the exponential integral, and \(A > 0\) is the unique solution of \(E_1(z) = e^{-z}\).
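A quick numerical check of this constant (a minimal sketch using SciPy):

import numpy as np
from scipy.optimize import root_scalar
from scipy.special import exp1

# Solve E_1(x) = e^(-x) for the constant A
A = root_scalar(lambda x: exp1(x) - np.exp(-x), bracket=[0.1, 1.0]).root
print(A)            # approximately 0.4348
print(A * exp1(A))  # approximately 0.281, the constant relating N and epsilon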
The heuristics are really about priors
The heuristics above are not magic ways to find the optimal \(\alpha\), because they rely on some prior assumptions about \(N\) or \(\epsilon\). What they accomplish is to set \(\alpha\) in a way that is attuned to those priors.
In practice, one doesn’t know in advance exactly which \(N\) or \(\epsilon\) will be optimal; at best, one has a belief about a range of reasonable values, i. e. a prior distribution. Via the approximate heuristics \(\alpha = 1.54 N / K\) and \(\alpha = 0.43 / (K \epsilon)\), any prior belief about \(N\) or \(\epsilon\) translates in a straightforward way into a prior for \(\alpha\), on the level of distributions. This is the natural way to interpret the heuristics from the perspective of Bayesian inference.
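As an illustration, here is a minimal sketch of pushing a prior belief about \(N\) through the rule of thumb to obtain a prior over \(\alpha\). The log-normal prior on \(N\) is a hypothetical choice for illustration, not a recommendation.

import numpy as np

rng = np.random.default_rng(0)
K = 100_000
# Hypothetical prior belief: N is "a few hundred", modeled as log-normal around 200
N_samples = rng.lognormal(mean=np.log(200), sigma=0.5, size=10_000)
# Push the belief through the rule of thumb alpha = 1.54 * N / K
alpha_samples = 1.54 * N_samples / K
print(np.percentile(alpha_samples, [5, 50, 95]))  # induced prior range for alpha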
Solutions to the heuristics
In Section 2.1, the heuristics Tip 1 and Tip 2 were formulated as complementary optimization problems. In this section, remarkably simple approximate solutions to these optimization problems will be provided:
For Tip 1 (target number \(N\) of significant words): \(\alpha = 1.54 \frac{N}{K}\).
For Tip 2 (target significance threshold \(\epsilon\)): \(\alpha = \frac{0.43}{K \epsilon}\).
The accuracy of these approximate solutions, i. e. how close they are to the exact solutions, will be addressed in Section 4.
Let us first give a quick overview of how these solutions are obtained. First, an exact formula for the number of significant words \(N(K, \alpha, \epsilon)\) in terms of Beta distributions will be provided. To go from here, two ideas are needed:
Idea 1: The total concentration
The first idea is to reformulate the optimization problems in terms of the total concentration \(C = K \alpha\).3 As it turns out, \(C\) controls the sparsity of distributions sampled from \(\operatorname{Dir}_K (\alpha)\) for large \(K\). Specifically, the number of significant words \(N(\alpha, \epsilon) = \operatorname{E}[ \# \lbrace i | X_i > \epsilon \rbrace]\) has a limit as \(K \to \infty\) for a fixed value of \(C = K \alpha\).
3 The parameter \(\alpha\) is called the concentration parameter, which is why we call \(C = K \alpha\) the total concentration.
Idea 2: The exponential integral approximation
The second idea is to use an approximation of \(N(\alpha, \epsilon)\) in terms of the exponential integral \(E_1(x) = \int_x^\infty t^{-1} e^{-t} dt\), and further replace the heuristics with their approximate counterparts. The upshot is that it becomes easy to obtain exact solutions, requiring nothing more than a few lines of calculus. These solutions are exactly the rules of thumb given above.
The exact formula
The central quantity is the expected number of significant words
\[ N(\alpha, \epsilon) = \operatorname{E}[ \# \lbrace i | X_i > \epsilon \rbrace ] , \]
which shows up in the heuristics. There is an expression for this number in terms of Beta distributions:
\[ N(\alpha, \epsilon) = K P(Y > \epsilon), \quad \text{ where }Y \sim \operatorname{Beta}(\alpha, (K-1)\alpha) . \]
Note (Proof)
Since each marginal \(X_i \sim \operatorname{Beta}(\alpha, (K-1)\alpha)\) and expectation is linear, \(N(\alpha, \epsilon) = \sum_i P(X_i > \epsilon) = K P(Y > \epsilon)\).
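The formula is easy to check by Monte Carlo (a minimal sketch; the parameter values are arbitrary):

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
K, alpha, eps = 1000, 0.05, 0.01

# Empirical estimate of E[#{i : X_i > eps}] from Dirichlet samples
X = rng.dirichlet([alpha] * K, size=2000)
N_mc = (X > eps).sum(axis=1).mean()

# Exact formula: K * P(Y > eps) with Y ~ Beta(alpha, (K-1)*alpha)
N_formula = K * beta.sf(eps, alpha, (K - 1) * alpha)
print(N_mc, N_formula)  # should agree up to Monte Carlo error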
This formula makes it feasible to solve the heuristics numerically, for example using the following code:
Numerical solvers using the exact formula
import numpy as np
from scipy.optimize import root_scalar, minimize_scalar
from scipy.special import exp1, betaincc, betaincinv

# First, find the constant A which is the unique solution to E_1(x) = e^(-x).
# This is used for initial guesses in the numerical solvers.
def _f_for_A(x):
    return exp1(x) - np.exp(-x)

_solution = root_scalar(_f_for_A, bracket=[0.1, 1.0])
A = _solution.root
E1_A_inv = 1 / exp1(A)

def get_epsilon(alpha, K, N):
    """
    For a given alpha, K, N, find epsilon such that N = K * P(Y > epsilon)
    where Y ~ Beta(alpha, (K-1)*alpha). This is equivalent to P(Y < epsilon) = 1 - N/K.
    """
    if alpha <= 0 or N <= 0 or N >= K:
        return 0
    # betaincinv(a, b, p) finds x such that P(Y < x) = p for Y ~ Beta(a,b)
    # We want to find epsilon such that P(Y > epsilon) = N/K
    # i.e. P(Y < epsilon) = 1 - N/K
    try:
        return betaincinv(alpha, (K - 1) * alpha, 1 - N / K)
    except (ValueError, RuntimeError):
        return 0

def find_alpha_exact_h1(K, N):
    """Finds the exact alpha that maximizes epsilon for a given K and N."""
    def objective(alpha):
        # We want to maximize epsilon, so we minimize -epsilon
        return -get_epsilon(alpha, K, N)
    # Search for alpha around the approximate solution
    alpha_approx = E1_A_inv * N / K
    # The search bounds need to be reasonably wide but not too wide
    # to avoid numerical issues.
    lower_bound = alpha_approx * 1e-2
    upper_bound = alpha_approx * 1e2
    # Brent's method is often good for 1D optimization.
    res = minimize_scalar(objective, bounds=(lower_bound, upper_bound), method='bounded')
    return res.x if res.success else np.nan

def find_alpha_exact_h2(K, epsilon):
    """Finds the exact alpha that maximizes N for a given K and epsilon."""
    def objective(alpha):
        if alpha <= 0:
            return np.inf
        try:
            return -betaincc(alpha, (K - 1) * alpha, epsilon)
        except (ValueError, RuntimeError):
            return np.inf
    ke = K * epsilon
    alpha_approx = A / ke
    lower_bound = alpha_approx * 1e-2
    upper_bound = alpha_approx * 1e2
    res = minimize_scalar(objective, bounds=(lower_bound, upper_bound), method='bounded')
    return res.x if res.success else np.nan
Let’s see what this says about the Wikipedia example from the introduction, and compare it to the approximate value obtained there. For \(N = 200\) and \(K = 100,000\), the exact solution of Tip 1 computed this way agrees with the rule-of-thumb value \(\alpha = 1.54 N / K \approx 0.003\) to within about 1%, which is more than good enough for our purposes.
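In code, using the solvers defined above (a usage sketch; the exact output depends on the numerical optimizer):

K, N = 100_000, 200
alpha_exact = find_alpha_exact_h1(K, N)  # exact solution of the Tip 1 optimization problem
alpha_rule = E1_A_inv * N / K            # rule of thumb, approximately 1.54 * N / K
print(alpha_exact, alpha_rule)           # these agree to within about 1%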
Total concentration and the asymptotic formula
For large \(K\), the sparsity is controlled by the total concentration \(C = K \alpha\). Specifically, keeping \(C\) constant, we can calculate the limit
\[ \lim_{K \to \infty} N(\alpha, \epsilon) = C \int_{\epsilon}^1 t^{-1} (1-t)^{C - 1} \, dt . \]
Note (Proof)
Writing \(K = C / \alpha\) and using the exact formula from Section 3.1.1,
\[ N(\alpha, \epsilon) = K P(Y > \epsilon) = \frac{C \int_{\epsilon}^1 t^{\alpha - 1} (1-t)^{C - \alpha - 1} \, dt}{\alpha B(\alpha, C - \alpha)} . \]
The integrand in the numerator is bounded, so \(\lim_{\alpha \to 0} \text{(numerator)} = C \int_{\epsilon}^1 t^{-1} (1-t)^{C - 1} dt\) by the dominated convergence theorem. In addition, \(\lim_{\alpha \to 0} \text{(denominator)} = 1\) since \(B(\alpha, C - \alpha) = \Gamma(\alpha) \Gamma(C - \alpha) / \Gamma(C)\), and \(\alpha \Gamma(\alpha) = \Gamma(\alpha + 1)\).
The exponential integral approximation
While the asymptotic formula is conceptually satisfying, there is a much better approximation for the purposes of solving the heuristics. Keeping \(C\) fixed, we have
\[ N(\alpha, \epsilon) \approx C E_1(C \epsilon), \quad \text{ for sufficiently large } K , \]
where \(E_1(x) = \int_x^\infty t^{-1} e^{-t} dt\) is a special function known as the exponential integral.
Note (Proof of the approximation)
A sample \(\mathbf{X} \sim \operatorname{Dir}_K(\alpha)\) can be viewed as being generated from Gamma distributions: Take \(K\) independent samples \(Z_i \sim \operatorname{Gamma}(\alpha, 1)\), and set \(X_i = \frac{Z_i}{\sum_j Z_j}\). For large \(K\), \(\sum_j Z_j \approx K \alpha = C\) by the central limit theorem, so \(X_i \approx \frac{Z_i}{C}\). For small \(\alpha\), \(\Gamma(\alpha) \approx 1/\alpha\), so \(P(Z_i > C \epsilon) = \frac{1}{\Gamma(\alpha)} \int_{C\epsilon}^\infty t^{\alpha - 1} e^{-t} \, dt \approx \alpha E_1(C \epsilon)\). Therefore, \(N(\alpha, \epsilon) = K\, P(X_i > \epsilon) \approx K \alpha E_1(C \epsilon) = C E_1(C \epsilon)\).
Alternatively, substitute \(t = u/C\) in the asymptotic formula from Section 3.1.2 to obtain
\[ \int_{\epsilon}^1 t^{-1} (1 - t)^{C - 1} \, dt = \int_{C\epsilon}^{C} u^{-1} \left(1 - \tfrac{u}{C}\right)^{C - 1} du . \]
Since \((1 - u/C)^{C} \approx e^{-u}\) for reasonably large \(C\), the right-hand side is approximately \(\int_{C \epsilon}^\infty u^{-1} e^{-u} \, du = E_1(C \epsilon)\); multiplying by \(C\) gives \(N(\alpha, \epsilon) \approx C E_1(C \epsilon)\), and more detailed bounds can be worked out.
Note
While the error term in the approximation isn’t quantified here, empirical estimates show good accuracy.
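Here is a quick empirical comparison of the exact formula and the exponential integral approximation (a minimal sketch; the parameter values are chosen arbitrarily):

from scipy.special import exp1
from scipy.stats import beta

K, C, eps = 10_000, 10.0, 0.05
alpha = C / K

N_exact = K * beta.sf(eps, alpha, (K - 1) * alpha)  # exact formula from Section 3.1.1
N_approx = C * exp1(C * eps)                        # exponential integral approximation
print(N_exact, N_approx)  # close for large K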
Motivated by the approximate formula, define the function \(F(C, \epsilon) = C E_1(C \epsilon)\). With this new notation, the approximate formula can be restated as \(\operatorname{E}[ \# \lbrace i | X_i > \epsilon \rbrace ] \approx F(C, \epsilon)\) for large \(K\) (or equivalently small \(\alpha\)).
Code for generating plot of F(C, epsilon).
import numpy as np
import plotly.graph_objects as go
from scipy.special import exp1

# Define the function F(C, ε)
def F(C, epsilon):
    return np.clip(C * exp1(C * epsilon), 0, 1e6)

# Create a grid of C and ε values
C_vals = np.linspace(0.1, 1000, 100)
epsilon_vals = np.logspace(-3, -1, 100)  # Epsilon from 0.001 to 0.1
C_grid, epsilon_grid = np.meshgrid(C_vals, epsilon_vals)

# Calculate F for each point on the grid
F_vals = F(C_grid, epsilon_grid)

# Create an interactive 3D plot with Plotly
fig = go.Figure(data=[go.Surface(z=F_vals, x=C_grid, y=epsilon_grid, colorscale='viridis')])
fig.update_layout(
    title='Interactive plot for F(C, ε)',
    scene=dict(
        xaxis_title='C',
        yaxis_title='ε',
        zaxis_title='F(C, ε)',
        yaxis_type='log'
    ),
    margin=dict(l=65, r=50, b=65, t=90)
)
fig.show()
Approximate solution for Tip 2
Instead of solving the optimization problem for Tip 2 directly, we solve the corresponding optimization problem using the approximation \(F(C, \epsilon)\): Namely, maximize \(F(C, \epsilon)\) for a given \(\epsilon\). This will turn out to have a particularly nice solution, which is why the approximation \(F(C, \epsilon)\) was chosen in the first place.
The standard calculus problem of maximizing the function \(C \mapsto F(C, \epsilon)\) can be solved by setting the derivative to zero. Using the fundamental theorem of calculus (which gives \(E_1'(x) = -e^{-x}/x\)),
\[ \frac{\partial F}{\partial C} = E_1(C \epsilon) + C \frac{\partial}{\partial C} E_1(C \epsilon) = E_1(C \epsilon) - e^{-C \epsilon} . \]
The key point is now that the equation \(E_1(x) = e^{- x}\) has a unique solution \(A\), which is computed numerically in the code cell below.
Code for finding numerical solution
import numpy as np
from scipy.optimize import root_scalar
from scipy.special import exp1

# Define the function f(x) = E_1(x) - e^(-x)
def f(x):
    return exp1(x) - np.exp(-x)

# Find the root of f(x) = 0
solution = root_scalar(f, bracket=[0.1, 1.0])
A = solution.root
The unique solution \(A\) of \(E_1(x) = e^{-x}\) is given by \(A =\) 0.4348182043849037. The critical point of \(F(C, \epsilon)\) for fixed \(\epsilon\) is therefore given by (using an approximate value of \(A\))
\[ K \alpha \epsilon = C \epsilon = 0.4348 . \]
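As a numerical sanity check (a minimal sketch, for an arbitrary \(\epsilon\)), evaluating \(F\) around \(C = A / \epsilon\) confirms that the critical point is a maximum:

from scipy.special import exp1

def F(C, epsilon):
    return C * exp1(C * epsilon)

eps = 0.05
C_star = A / eps  # A was computed above, approximately 0.4348
for C in [0.5 * C_star, C_star, 2 * C_star]:
    print(C, F(C, eps))  # F(C, eps) is largest at C_star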
The ridge
In the course of solving the optimization problem for Tip 2, we found that the optimal \(C\) is given by \(C \epsilon = A\), where \(A = 0.4348\) is the unique solution of \(E_1(x) = e^{-x}\). Geometrically, the equation \(C \epsilon = A\) describes a curve in the \(C\epsilon\)-plane which lies below the “ridge” of the surface \(F(C, \epsilon)\). This is illustrated in the figure below.
Code
import numpy as np
import plotly.graph_objects as go
from scipy.special import exp1

# --- Data for the surface (same as before) ---
C_vals = np.linspace(0.1, 500, 100)
epsilon_vals = np.logspace(-3, -1, 100)
C_grid, epsilon_grid = np.meshgrid(C_vals, epsilon_vals)
F_vals = F(C_grid, epsilon_grid)

# --- Data for the ridge line ---
# A is available from the previous cell
C_ridge = np.linspace(C_vals.min(), C_vals.max(), 200)
epsilon_ridge = A / C_ridge
F_ridge = F(C_ridge, epsilon_ridge)

# --- Create the plot ---
# Create the surface plot
fig = go.Figure(data=[go.Surface(
    z=F_vals,
    x=C_grid,
    y=epsilon_grid,
    colorscale='viridis',
    opacity=0.8,
    showscale=False
)])

# Add the ridge line
fig.add_trace(go.Scatter3d(
    x=C_ridge,
    y=epsilon_ridge,
    z=F_ridge,
    mode='lines',
    line=dict(color='red', width=5),
    name='Ridge (Cε = A)'
))

# --- Configure layout ---
fig.update_layout(
    title='Surface of F(C, ε) with highlighted ridge over Cε=A',
    scene=dict(
        xaxis_title='C',
        yaxis_title='ε',
        zaxis_title='F(C, ε)',
        yaxis_type='log'
    ),
    margin=dict(l=65, r=50, b=65, t=90)
)
fig
Figure 3: The surface F(C, ε) with the ridge Cε=A highlighted.
Approximate solution for Tip 1
Using the approximate formula, a simple approximate solution for Tip 1 can also be found. The problem is now to maximize \(\epsilon\) under the constraint that \(N = F(C, \epsilon) = C E_1(C \epsilon)\). By staring at Figure 3, it is clear that the solution of this optimization problem also happens at a point where \(\frac{\partial F}{\partial C} = 0\).4 From the previous section, this equation describes a curve \(C \epsilon = A\), where \(A= 0.4348\) is the unique solution of \(E_1(x) = e^{-x}\). Therefore, \(N = C E_1(C \epsilon) = C E_1(A)\), so the optimal \(C\) is given by the linear relationship
4 The fact that the maximum of \(\epsilon\) is achieved where \(\frac{\partial F}{\partial C} = 0\) can also be shown using Lagrange multipliers.
\[ K \alpha = C = E_1(A)^{-1} N . \]
The slope can be calculated numerically as \(E_1(A)^{-1} = 1.5446822169704288 \approx 1.54\).
When are the rules of thumb accurate?
We now turn to the question of when the approximate solutions found in Section 3.3 and Section 3.2 are close to the exact solutions using the exact formula for \(N(K, \alpha, \epsilon)\).
The comparison requires a choice of metric. A natural choice is the relative error of the variable being optimized, i.e. \(\epsilon\) for Tip 1 and \(N\) for Tip 2. The “regions of validity” will denote the values for which the rules of thumb yield \(\epsilon\) or \(N\) within some pre-defined error threshold.
It is worth emphasizing that this analysis doesn’t answer when the heuristics themselves should be applied. But in a stroke of good fortune, it does seem like the regions of validity coincide with values often used in real world problems!
For Tip 2, let \(\alpha_{\text{exact}}(K, \epsilon)\) denote the exact solution to the optimization problem formulated in Tip 2 for a given \(K\) and target \(\epsilon\), and let \(N_{\text{exact}}(K, \epsilon) = N(K, \alpha_{\text{exact}}, \epsilon)\) denote the resulting expected number of significant words. Similarly, let \(\alpha_{\text{approx}} = A / (K \epsilon)\) denote the approximate value found in Section 3.2, and let \(N_{\text{approx}} (K, \epsilon) = N(K, \alpha_{\text{approx}}, \epsilon)\) denote the resulting expected number of significant words (calculated with the exact formula).
The metric used to evaluate the approximation will be the relative error
\[ E_{2, \text{rel}}(K, \epsilon) = \frac{\left| N_{\text{exact}}(K, \epsilon) - N_{\text{approx}}(K, \epsilon) \right|}{N_{\text{exact}}(K, \epsilon)} . \]
Specifically, the region of validity for threshold \(T\) is the set of \((K, \epsilon)\) where \(E_{2, \text{rel}}(K, \epsilon) \leq T\).
Show code for calculating region of validity
import matplotlib.pyplot as plt
from scipy.optimize import brentq
from scipy.special import betaincc

# A is calculated in a previous cell, it is the solution to E_1(x) = e^(-x)
# A approx 0.4348

def get_rel_error_N_h2(K, epsilon):
    """Calculates the relative error in N for Heuristic 2."""
    if epsilon <= 1 / K or epsilon >= 1:
        return np.inf  # Less than 1/K: Dense regime
    alpha_exact = find_alpha_exact_h2(K, epsilon)
    alpha_approx = A / (K * epsilon)
    if np.isnan(alpha_exact):
        return np.inf
    N_exact = betaincc(alpha_exact, (K - 1) * alpha_exact, epsilon)
    N_approx = betaincc(alpha_approx, (K - 1) * alpha_approx, epsilon)
    if N_exact == 0:
        return np.inf
    rel_error = np.abs(N_exact - N_approx) / N_exact
    return rel_error

def find_all_valid_intervals_h2(K, tolerance, error_func, search_space):
    """
    Finds all disjoint intervals where the error is within tolerance.
    Robust to oscillating error curves.
    """
    def root_func(val):
        try:
            return error_func(K, val) - tolerance
        except (ValueError, RuntimeError):
            return np.inf

    errors = np.array([root_func(val) for val in search_space])
    valid_mask = errors <= 0
    valid_indices = np.where(valid_mask)[0]
    if not valid_indices.size:
        return []

    # Find start and end indices of contiguous blocks of valid points
    diffs = np.diff(valid_indices)
    block_ends_mask = np.append(np.where(diffs > 1)[0], len(valid_indices) - 1)
    block_starts_mask = np.insert(block_ends_mask[:-1] + 1, 0, 0)
    block_starts_idx = valid_indices[block_starts_mask]
    block_ends_idx = valid_indices[block_ends_mask]

    intervals = []
    for start_idx, end_idx in zip(block_starts_idx, block_ends_idx):
        lower_bound = search_space[start_idx]
        upper_bound = search_space[end_idx]
        # Refine lower bound
        if start_idx > 0:
            try:
                lower_bound = brentq(root_func, search_space[start_idx - 1], search_space[start_idx])
            except (ValueError, RuntimeError):
                pass
        # Refine upper bound
        if end_idx < len(search_space) - 1:
            try:
                upper_bound = brentq(root_func, search_space[end_idx], search_space[end_idx + 1])
            except (ValueError, RuntimeError):
                pass
        intervals.append((lower_bound, upper_bound))
    return intervals

# --- Main calculation ---
K_vals = np.logspace(2, 4, 15)
tolerances = [0.05, 0.01]
results = {tol: [] for tol in tolerances}
epsilon_search_space = np.linspace(0, 1, 1000)

for tol in tolerances:
    for K in K_vals:
        intervals = find_all_valid_intervals_h2(K, tol, get_rel_error_N_h2, epsilon_search_space)
        results[tol].append(intervals)

# --- Plotting ---
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(figsize=(12, 7))
colors = {'0.05': 'skyblue', '0.01': 'lightgreen'}
labels = {'0.05': '5% tolerance', '0.01': '1% tolerance'}

for tol in tolerances:
    has_added_label = False
    for i, K in enumerate(K_vals):
        intervals = results[tol][i]
        if not intervals:
            continue
        # Define x-range for the vertical bar for this K on a log scale
        if i > 0:
            x_start = np.sqrt(K_vals[i - 1] * K)
        else:
            x_start = K / np.sqrt(K_vals[1] / K_vals[0]) if len(K_vals) > 1 else K * 0.9
        if i < len(K_vals) - 1:
            x_end = np.sqrt(K * K_vals[i + 1])
        else:
            x_end = K * np.sqrt(K_vals[-1] / K_vals[-2]) if len(K_vals) > 1 else K * 1.1
        for min_val, max_val in intervals:
            label = labels[str(tol)] if not has_added_label else None
            ax.fill_between([x_start, x_end], min_val, max_val, color=colors[str(tol)], alpha=0.5, label=label)
            has_added_label = True

ax.set_xscale('log')
ax.set_yscale('log')
ax.set_ylim(1e-4, 1)
ax.set_xlabel('Vocabulary Size (K)', fontsize=12)
ax.set_ylabel('Valid Range for ε', fontsize=12)
ax.set_title('Region of Validity for Heuristic 2 (Error in N)', fontsize=14)
ax.legend()
ax.grid(True, which="both")
plt.show()
Figure 4: For T = 0.01 or 0.05, the plot shows regions of (K, ε) where E_rel(K, ε) ≤ T.
The plot reveals some subtle but important properties of the approximation. Let’s address some of these in QA format:
Q: Why does the approximation break down for small \(\epsilon\)?
Small \(\epsilon\) corresponds to the dense regime, but the approximation was designed for the sparse regime. Specifically, with \(\epsilon \ll 1/K\), the best way to maximize the number of significant words \(N\) is to sample near the uniform distribution (all words are significant).
Q: Why does the approximation break down for large \(\epsilon\)?
This is a consequence of (small) absolute errors in the approximation. For example, if \(\epsilon = 0.25\), then \(N \leq 4\) since \(N \epsilon \leq 1\). On the other hand, some computations reveal that the absolute difference is around \(|N_{\text{exact}} - N_{\text{approx}}| \approx 0.68\) for a reasonably large \(K\). This is not a large error in absolute terms, but it results in a relative error of roughly 5% as is clear from the plot.
The point is just that \(N\) gets small, making the relative error larger.
Q: Doesn’t this contradict the claim that the approximate formula is accurate for large \(K\)?
The exact relation between \(N_{\text{exact}}\) and \(N_{\text{approx}}\) is somewhat delicate:
The approximate formula describes \(N\) for large \(K\) when the total concentration \(C = K \alpha\) is held constant. If \(K\), \(\alpha\), and \(\epsilon\) are all allowed to vary, it becomes more difficult to quantify the exact error in the approximation.
The errors are not only due to the difference between the exact and approximate value of \(N\) at a point \((K, \epsilon, \alpha)\), but also due to the difference between their maxima, which occur at different points.
The analysis will be similar to the one for Tip 2. This time, \(K\) and \(N\) are fixed and \(\epsilon_{\text{exact}}(K, N)\) denotes the exact maximum for the optimization problem formulated in Tip 1. On the other hand, the approximate maximum derived in Section 3.3 was given by \(\alpha_{\text{approx}}(K, N) = E_1(A)^{-1} N / K\). Define \(\epsilon_{\text{approx}}(K, N)\) to be the significance threshold achieved by this \(\alpha_{\text{approx}}\).
That is, \(\epsilon_{\text{approx}}(K, N)\) solves the optimization problem defining Tip 1 for a fixed \(\alpha = \alpha_{\text{approx}}(K, N)\). Using the exact formula from Section 3.1.1, this is equivalent to solving the equation \(N = K P(Y > \epsilon)\) for \(\epsilon\), where \(Y \sim \operatorname{Beta}(\alpha_{\text{approx}}, (K-1)\alpha_{\text{approx}})\).
Note
Note that \(\epsilon_{\text{approx}}(K,N)\) is not the \(\epsilon\) which appears in the equation \(C \epsilon = A\) from Section 3.2, although they should be related: \(\epsilon_{\text{approx}}\) should be close to \(\frac{A}{C} = \frac{A}{K \alpha_{\text{approx}}} = \frac{A E_1(A)}{N}\) according to the approximation.
The metric used to evaluate the approximation in this case is the relative error
\[ E_{\text{rel}}(K, N) = \frac{\left| \epsilon_{\text{exact}}(K, N) - \epsilon_{\text{approx}}(K, N) \right|}{\epsilon_{\text{exact}}(K, N)} . \]
Again, the region of validity for a threshold \(T\) is the set of \((K, N)\) where \(E_{\text{rel}}(K, N) \leq T\).
Show code for calculating region of validity
import matplotlib.pyplot as plt
from scipy.optimize import brentq
from scipy.special import betaincinv

# A is calculated in a previous cell, it is the solution to E_1(x) = e^(-x), approx 0.4348

def get_rel_error_epsilon_h1(K, N_over_K):
    """Calculates the relative error in epsilon for Heuristic 1."""
    N = N_over_K * K
    if N <= 1 or N >= K:
        return np.inf
    # 1. Find exact maximum epsilon
    alpha_exact = find_alpha_exact_h1(K, N)
    if np.isnan(alpha_exact):
        return np.inf
    epsilon_exact = get_epsilon(alpha_exact, K, N)
    # 2. Find epsilon from approximate alpha
    alpha_approx = E1_A_inv * N / K
    epsilon_from_approx = get_epsilon(alpha_approx, K, N)
    if epsilon_exact == 0:
        return np.inf
    rel_error = np.abs(epsilon_exact - epsilon_from_approx) / epsilon_exact
    return rel_error

def find_all_valid_intervals_h1(K, tolerance, search_space_N_over_K):
    """
    Finds all disjoint intervals for N/K where the error is within tolerance.
    """
    def root_func(n_over_k):
        try:
            return get_rel_error_epsilon_h1(K, n_over_k) - tolerance
        except (ValueError, RuntimeError):
            return np.inf

    errors = np.array([root_func(val) for val in search_space_N_over_K])
    valid_mask = errors <= 0
    valid_indices = np.where(valid_mask)[0]
    if not valid_indices.size:
        return []

    # Find start and end indices of contiguous blocks of valid points
    diffs = np.diff(valid_indices)
    block_ends_mask = np.append(np.where(diffs > 1)[0], len(valid_indices) - 1)
    block_starts_mask = np.insert(block_ends_mask[:-1] + 1, 0, 0)
    block_starts_idx = valid_indices[block_starts_mask]
    block_ends_idx = valid_indices[block_ends_mask]

    intervals = []
    for start_idx, end_idx in zip(block_starts_idx, block_ends_idx):
        lower_bound = search_space_N_over_K[start_idx]
        upper_bound = search_space_N_over_K[end_idx]
        # Refine interval boundaries using a root finder
        if start_idx > 0:
            try:
                lower_bound = brentq(root_func, search_space_N_over_K[start_idx - 1], search_space_N_over_K[start_idx])
            except (ValueError, RuntimeError):
                pass
        if end_idx < len(search_space_N_over_K) - 1:
            try:
                upper_bound = brentq(root_func, search_space_N_over_K[end_idx], search_space_N_over_K[end_idx + 1])
            except (ValueError, RuntimeError):
                pass
        intervals.append((lower_bound, upper_bound))
    return intervals

# --- Main calculation ---
K_vals = np.logspace(2, 6, 15)
tolerances = [0.05, 0.01]
results_h1 = {tol: [] for tol in tolerances}
# We scan over N/K instead of N
N_over_K_search_space = np.linspace(1e-6, 0.5, 200)

for tol in tolerances:
    for K in K_vals:
        intervals = find_all_valid_intervals_h1(K, tol, N_over_K_search_space)
        results_h1[tol].append(intervals)

# --- Plotting ---
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(figsize=(12, 7))
colors = {'0.05': 'skyblue', '0.01': 'lightgreen'}
labels = {'0.05': '5% tolerance', '0.01': '1% tolerance'}

for tol in tolerances:
    has_added_label = False
    for i, K in enumerate(K_vals):
        intervals = results_h1[tol][i]
        if not intervals:
            continue
        # Define x-range for the vertical bar for this K on a log scale
        if i > 0:
            x_start = np.sqrt(K_vals[i - 1] * K)
        else:
            x_start = K / np.sqrt(K_vals[1] / K_vals[0]) if len(K_vals) > 1 else K * 0.9
        if i < len(K_vals) - 1:
            x_end = np.sqrt(K * K_vals[i + 1])
        else:
            x_end = K * np.sqrt(K_vals[-1] / K_vals[-2]) if len(K_vals) > 1 else K * 1.1
        for min_val, max_val in intervals:
            label = labels[str(tol)] if not has_added_label else None
            ax.fill_between([x_start, x_end], min_val, max_val, color=colors[str(tol)], alpha=0.5, label=label)
            has_added_label = True

ax.set_xscale('log')
ax.set_yscale('log')
ax.set_ylim(1e-6, 1)
ax.set_xlabel('Vocabulary Size (K)', fontsize=12)
ax.set_ylabel('Valid Range for N/K', fontsize=12)
ax.set_title('Region of Validity for Heuristic 1 (Error in ε)', fontsize=14)
ax.legend()
ax.grid(True, which="both")
plt.show()
Figure 5: For T = 0.01 or 0.05, the plot shows regions of (K, N/K) where E_rel(K, N/K) ≤ T.
Validating the heuristics against conventional wisdom
To verify that the heuristics are reasonable, we can compare them to the conventional wisdom, i. e. ad-hoc recommendations for setting \(\alpha\) that can be found in the literature and elsewhere.
Topic models
The quote from Wikipedia at the beginning of this post is a good example of conventional wisdom for setting \(\alpha\) for a topic model. In the introduction it was verified that the simple heuristic \(\alpha = 1.54 N / K\) falls within the recommended range.
Another heuristic mentioned in some places is setting \(\alpha = 50 / K\), or equivalently \(C = 50\). Using the approximate formulas, this corresponds to \(N = C / 1.544 \approx 32.4\) significant words at a significance threshold of \(\epsilon = 0.4348 / C \approx 0.009\).
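The same translation in code (a minimal sketch, reusing the constant \(A\) computed earlier):

from scipy.special import exp1

C = 50.0                 # total concentration implied by alpha = 50 / K
N_implied = C * exp1(A)  # approximately 32 significant words
eps_implied = A / C      # significance threshold, approximately 0.009
print(N_implied, eps_implied)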
Dirichlet noise in AlphaZero
The purpose of Dirichlet noise in AlphaZero is to enable meaningful random exploration. Recall the way this works: For a “prior policy” \(\mathbf{p}\) over \(K\) available actions at the root node, supplied by a neural net, Dirichlet noise samples \(\mathbf{X} \sim \operatorname{Dir}_K(\alpha)\) and replaces \(\mathbf{p}\) by \(\mathbf{p}_{\text{new}} = (1 - \lambda)\mathbf{p} + \lambda\mathbf{X}\) for some \(\lambda \in [0, 1]\).5
5 The \(\lambda\) here is usually denoted \(\epsilon\), and often referred to as “Dirichlet epsilon” in Dirichlet noise.
Think of \(\lambda\) as the allotted budget for purely random exploration, and \(\mathbf{X}\) as the distribution dictating how this random exploration is to be done. For a specific action \(i\), the value of \(X_i\) controls whether action \(i\) is explored in a significant way. The goal is to tune \(\alpha\) so that the number of actions seeing significant exploration is maximized.
It is instructive to think about what happens in the edge cases. If \(\alpha\) is too large, \(\mathbf{X}\) is on average closer to a uniform distribution and each \(X_i\) is small, meaning that each action is explored at most a few times. If \(\alpha\) is very small, \(\mathbf{X}\) is close to a one-hot distribution, in which case all of the random exploration budget is spent exploring a single action. The sweet spot is somewhere in between.
For example, suppose we believe that a meaningful amount of exploration is achieved when an action being explored is visited at least 10 times. If we run 800 simulations in total, using \(\lambda = 0.25\), this leaves 200 simulations for random exploration. Out of those 200 simulations, we want a significant action to be visited 10 times, or 5% of the budget. That means we should use a significance threshold of \(\epsilon = 0.05\). Using the approximate formula, this corresponds to a total concentration of \(K \alpha = C = 0.4348 / \epsilon = 8.696\).
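The same arithmetic in code (a minimal sketch; the simulation budget and visit threshold are the illustrative values above):

total_simulations = 800
noise_weight = 0.25  # lambda, the share of the root policy given to the noise
noise_budget = noise_weight * total_simulations  # 200 simulations
target_visits = 10   # what we consider significant exploration

eps = target_visits / noise_budget  # 0.05
C = A / eps                         # total concentration, approximately 8.7
print(eps, C)
# For a game with branching factor K, the heuristic then gives alpha = C / K.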
Let’s run the numbers, using the values from the AlphaZero paper:
Table 1: Values for \(\alpha\) used in the AlphaZero paper, known values of the branching factors \(K\), and the corresponding values for \(C\) and \(N\) using the approximate formulas. For exploration constant \(\lambda = 0.25\) and \(800\) simulations, \(200 \epsilon\) estimates the number of times a significant action is explored.
| Game  | \(\alpha\) | \(K\) | \(C\) | \(\epsilon = 0.4348 / C\) | \(200 \epsilon\) |
|-------|------------|-------|-------|---------------------------|------------------|
| Chess | 0.3        | 35    | 10.5  | 0.041                     | 8.2              |
| Shogi | 0.15       | 80    | 12    | 0.036                     | 7.2              |
| Go    | 0.03       | 250   | 7.5   | 0.058                     | 11.6             |
Note that these values of \(C\) are pretty close to the one we obtained, \(8.7\).
Summary and practical takeaways
The heuristics in this post concern the estimation of a good concentration parameter \(\alpha\), based on the relation
\[ N = \operatorname{E}[ \# \lbrace i | X_i > \epsilon \rbrace ] \]
between the number \(N\) of significant words and the significance threshold \(\epsilon\), where the expectation is taken over \(\mathbf{X} \sim \operatorname{Dir}_K(\alpha)\), the symmetric Dirichlet distribution of order \(K\) with concentration parameter \(\alpha\). Two heuristics were suggested, based on the following two cases:
When we have an idea about a good target value of \(N\), as in the case of the topic model,
When we have an idea about a good target value of \(\epsilon\), as in the case of Dirichlet noise in AlphaZero.
The heuristics, formulated as optimization problems in Section 2.1, lead to simple rules of thumb summarized in the following table.
Table 2: Summary of the heuristics.
| Known target value | Rule of thumb | Within 5% of exact heuristic when (roughly) |
|--------------------|---------------|---------------------------------------------|
| number \(N\) of significant words | \(\alpha = 1.54 N / K\) | \(2 < N < 0.2 K\) |
| significance threshold \(\epsilon\) | \(\alpha = 0.43 / (K \epsilon)\) | \(2 / K < \epsilon < 0.2\) |
The constant \(0.43\) is an approximation of the unique solution \(A\) of \(E_1(x) = e^{-x}\), and \(1.54\) is an approximation of \(E_1(A)^{-1}\).
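For convenience, here is a minimal sketch bundling the two rules of thumb into helper functions (the function names are my own, not taken from any library):

import numpy as np
from scipy.optimize import root_scalar
from scipy.special import exp1

# A solves E_1(x) = e^(-x); the slope for Tip 1 is 1 / E_1(A)
A = root_scalar(lambda x: exp1(x) - np.exp(-x), bracket=[0.1, 1.0]).root  # ~0.4348
SLOPE = 1 / exp1(A)                                                       # ~1.5447

def alpha_from_target_N(K, N):
    """Tip 1: target number N of significant words out of K."""
    return SLOPE * N / K

def alpha_from_target_eps(K, eps):
    """Tip 2: target significance threshold eps."""
    return A / (K * eps)

print(alpha_from_target_N(100_000, 200))  # ~0.003, the topic model example
print(alpha_from_target_eps(250, 0.05))   # ~0.035, a Go-like branching factor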
Concentration parameters \(\alpha\) determined by these heuristics seem to align well with those found in other, ad-hoc recommendations.