Motivated by topic models and Dirichlet noise in AlphaZero, I suggest two complementary heuristics for setting sparse Dirichlet priors.
machine learning
bayesian statistics
optimization
Author: Eivind Otto Hjelle
Published: July 16, 2025
TLDR: See Section 6 for a concise summary and the practical takeaways.
Introduction
A few days ago, I came across the following statement about Dirichlet distributions on Wikipedia:
An example of where a sparse prior (concentration parameter much less than 1) is called for, consider a topic model, which is used to learn the topics that are discussed in a set of documents, where each “topic” is described using a categorical distribution over a vocabulary of words. A typical vocabulary might have 100,000 words, leading to a 100,000-dimensional categorical distribution. The prior distribution for the parameters of the categorical distribution would likely be a symmetric Dirichlet distribution. However, a coherent topic might only have a few hundred words with any significant probability mass. Accordingly, a reasonable setting for the concentration parameter might be 0.01 or 0.001. With a larger vocabulary of around 1,000,000 words, an even smaller value, e.g. 0.0001, might be appropriate.
This made me wonder: Where do these numbers come from? Is there a principled way to know that \(\alpha\) around \(0.01 - 0.001\) will yield distributions with the appropriate sparsity, i.e. that samples will have mass concentrated on a few hundred words?
I’ve also encountered a related question in my project about AlphaZero. Namely, in the original paper they inject Dirichlet noise at the root node to help exploration of new strategies. The concentration parameter \(\alpha\) is scaled in inverse proportion to the number of legal moves, to values of 0.3, 0.15, and 0.03 for chess, shogi, and Go. Again, is there a principled way to set these values of \(\alpha\) if we know the branching factor1 of the game? And even if we believe that \(\alpha\) should be inversely proportional to the branching factor, which constant of proportionality should we use?
1 The average number of legal moves.
2 The heuristic \(\alpha\) is also close to “optimal” (in a specific sense) for small values of \(N / K\). For example \(N / K < 0.02\) yields \(\alpha\) within 5% of the optimal value. This will all be explained in more detail.
In this post I will argue that there are indeed principled ways to answer these questions, leading to surprisingly simple rules of thumb. In the example of the topic model, we get the following: Given a vocabulary size \(K\) and a target number \(N\) of significant words, set \(\alpha = 1.54 \frac{N}{K}\).2 Using \(N = 200\) and \(K = 100,000\) yields \(\alpha = 0.003\), aligning well with the recommendation from Wikipedia.
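As a quick check of the arithmetic (a minimal sketch; the numbers are the ones quoted above):

# Rule of thumb derived later in the post: alpha = 1.54 * N / K
K = 100_000  # vocabulary size
N = 200      # target number of significant words
alpha = 1.54 * N / K
print(alpha)  # 0.00308, i.e. roughly 0.003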
Warning
I believe that some of the results in this post are novel. But I’m not an expert on the topic and this work is not peer-reviewed. If you find any errors or have any comments, please contact me!
Two heuristics for setting \(\alpha\)
The heuristics for setting \(\alpha\) for the topic model from Wikipedia and the Dirichlet noise in AlphaZero can be formulated as two complementary optimization problems. The first goal is to state these optimization problems precisely; they serve as the theoretical foundation for the simple rules of thumb derived later on.
First, let us standardize some notation and make precise the context in which we work. For a given \(K\), which can represent the vocabulary size or the number of available actions in a given position, we consider the symmetric Dirichlet distribution \(\operatorname{Dir}_K(\alpha)\) of order \(K\) with concentration parameter \(\alpha\). The problem we are trying to solve is that of selecting \(\alpha\).
Heuristic for target number of significant words
Consider the situation of the topic model, where we are given the vocabulary size \(K\), and we have some idea of the number \(N\) of significant words that should be used to describe a topic.
It is important here to clarify exactly what is meant by a significant word. Given a sampled distribution over words \(\mathbf{X} = (X_1, \dots, X_K) \sim \operatorname{Dir}_K(\alpha)\), a word \(i\) is significant if \(X_i\) is large. But “large” is not well defined, so to be more precise, we say that a word \(i\) is significant relative to a significance threshold \(\epsilon\) if \(X_i > \epsilon\).
The main idea for the heuristic can then be stated informally as follows:
Tip 1: Heuristic for target number of significant words
For a target number \(N\) of significant words, set the \(\alpha\) which maximizes the significance threshold \(\epsilon\) with respect to which the expected number of significant words is at least \(N\).
In other words, Tip 1 can be formulated as choosing \(\alpha\) that solves the following optimization problem:
Maximize: \(\epsilon\)
Constraint: \(\operatorname{E}[ \# \lbrace i | X_i > \epsilon \rbrace ] = N\), where the expectation is taken over \(\operatorname{Dir}_K(\alpha)\).
The punchline is that this optimization has the surprisingly simple (approximate) solution \(\alpha = 1.54 \frac{N}{K}\) as long as \(N / K\) is sufficiently small. This will be given thorough justification in Section 3.3.
Visualizing the optimal choice
It is worth exploring in more detail what sets distributions sampled from the optimal \(\operatorname{Dir}_K(\alpha)\) apart, by running a simple experiment. For a given \(\alpha\), take a sample \(\mathbf{X} \sim \operatorname{Dir}_K(\alpha)\), sort the entries of \(\mathbf{X}\) in descending order, and plot the result. This provides a visualization of how concentrated the probability mass of \(\mathbf{X}\) is in some of its entries. For a vocabulary size \(K = 10000\) and target number \(N = 30\), we can make these plots for \(\alpha = 1.54 N / K \approx 0.005\), \(\alpha_{\text{low}} = \alpha / 4\) and \(\alpha_{\text{high}} = 4 \alpha\).
Show code for experiment
import numpy as np
import matplotlib.pyplot as plt

# Setup parameters
N_TARGET = 30
K_VOCAB = 10000

# Concentration parameters to compare
alpha_optimal = 1.54 * N_TARGET / K_VOCAB
alpha_low = alpha_optimal / 4
alpha_high = 4 * alpha_optimal

# Generate one sample for each C
np.random.seed(42)  # for reproducibility
sample_low = np.random.dirichlet([alpha_low] * K_VOCAB)
sample_optimal = np.random.dirichlet([alpha_optimal] * K_VOCAB)
sample_high = np.random.dirichlet([alpha_high] * K_VOCAB)

# Sort the components in descending order
sample_low.sort()
sample_optimal.sort()
sample_high.sort()
sorted_low = sample_low[::-1]
sorted_optimal = sample_optimal[::-1]
sorted_high = sample_high[::-1]

# Create the first plot (log-log)
plt.style.use('seaborn-v0_8-whitegrid')
fig1, ax1 = plt.subplots(figsize=(12, 7))
ax1.plot(sorted_low, label=f'Too Sparse (α={alpha_low:.4f})', color='blue', lw=2)
ax1.plot(sorted_optimal, label=f'Optimal (α={alpha_optimal:.4f})', color='red', lw=2.5)
ax1.plot(sorted_high, label=f'Too Uniform (α={alpha_high:.4f})', color='green', lw=2)

# Mark the target N
ax1.axvline(x=N_TARGET, color='black', linestyle='--', label=f'Target N = {N_TARGET}')
ax1.set_title('Structure of a Single Sample from Different Dirichlet Priors (Log Scale)', fontsize=16)
ax1.set_xlabel('Component Index (Sorted by Value)', fontsize=14)
ax1.set_ylabel('Log Probability Mass', fontsize=14)
ax1.set_xscale('log')
ax1.set_yscale('log')
ax1.set_ylim(bottom=1e-5, top=1.0)  # Limit y-axis to a meaningful range
ax1.legend(fontsize=12)
ax1.grid(True, which="both")
Figure 1
Figure 1 illustrates a different perspective on the optimal \(\alpha\), namely the one for which the \(N\)th highest element in a typical sample is as significant as possible.
Heuristic for target significance threshold
The second heuristic is motivated by the problem of setting \(\alpha\) for Dirichlet noise in AlphaZero. For this problem, \(K\) refers to the number of available actions in a typical position. The goal here is to tune \(\alpha\) so that the number of actions seeing significant exploration is maximized.
The exact details of how Dirichlet noise is implemented in AlphaZero will be covered in Section 5.2. For now, it suffices to say that “significant exploration” refers to the amount of time spent exploring an action supplied by the Dirichlet noise, and this can be translated into a value for the significance threshold \(\epsilon\). So if we have an idea of how much time constitutes significant exploration, we know the significance threshold \(\epsilon\) to aim for, leading to the following heuristic for setting \(\alpha\).
Tip 2: Heuristic for target significance threshold
For a target significance threshold \(\epsilon\), set the \(\alpha\) which maximizes the expected number of significant actions.
In other words, Tip 2 can be stated as the following optimization problem:
Maximize: \(\operatorname{E}[ \# \lbrace i | X_i > \epsilon \rbrace]\), where the expectation is taken over \(\operatorname{Dir}_K(\alpha)\).
Constraint: \(\epsilon\) is fixed.
Just like for Tip 1, this also has a remarkably simple approximate solution given by \(\alpha = \frac{0.43}{K \epsilon}\), as will be shown in Section 3.2.
A few remarks
The symmetric perspective
It should be clear that the optimization problems in heuristics 1 and 2 are complementary. Both can be stated in terms of \(\alpha, \epsilon\), and \(N(\alpha, \epsilon) = \operatorname{E}[ \# \lbrace i | X_i > \epsilon \rbrace]\). In Tip 1, \(N(\alpha, \epsilon)\) is kept fixed and \(\epsilon\) is maximized, whereas in Tip 2, \(\epsilon\) is fixed and \(N(\alpha, \epsilon)\) is maximized.
But it goes deeper than that, because the two heuristics define a correspondence between \(N\) and \(\epsilon\). More precisely: If you start with a target \(N\) and use Tip 1, you get a value \(\alpha\), but also an \(\epsilon\) (the maximal one). If you then use this \(\epsilon\) as a target in Tip 2, you get back the same \(\alpha\) and \(N\) (the maximal one) you started with. And the other way around.
Note (Proof sketch)
One way to see this is to use calculus to show that both optimization problems reduce to the equation
\[ \frac{\partial N}{\partial \alpha} = 0 . \]
Case by case:
Tip 2 is very straightforward. The goal is to maximize \(N(\alpha, \epsilon)\) subject to \(\epsilon\) being fixed, i. e. to maximize a function of one variable. As \(\alpha \to 0\), \(\operatorname{Dir}_K(\alpha)\) moves into the very sparse one-hot region, and \(N(\alpha, \epsilon) \to 1\). In the other extreme, as \(\alpha \to \infty\), \(\operatorname{Dir}_K(\alpha)\) moves into the very dense uniform region, and \(N(\alpha, \epsilon) \to K\) if \(\epsilon < 1/K\) and \(N(\alpha, \epsilon) \to 0\) if \(\epsilon \geq 1/K\). Since by assumption we are in the sparse regime (i. e. \(\epsilon \gg 1/K\)), the maximum has to occur at a critical point in between the extremes, where \(\partial N / \partial \alpha = 0\).
For Tip 1, we want to maximize \(\epsilon\) subject to \(N(\alpha, \epsilon) = N\) fixed. Here, we should assume that \(N\) is well within the range \(1 < N < K\) to remain within the sparse regime and to exclude solutions at the extreme values \(\alpha \to 0\) and \(\alpha \to \infty\). Using Lagrange multipliers, we still arrive at the equation \(\partial N / \partial \alpha = 0\), as sketched below.
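For concreteness, here is the Lagrange-multiplier step (with multiplier \(\lambda\)):
\[ \mathcal{L}(\alpha, \epsilon, \lambda) = \epsilon - \lambda \bigl( N(\alpha, \epsilon) - N \bigr), \qquad \frac{\partial \mathcal{L}}{\partial \epsilon} = 1 - \lambda \frac{\partial N}{\partial \epsilon} = 0, \qquad \frac{\partial \mathcal{L}}{\partial \alpha} = - \lambda \frac{\partial N}{\partial \alpha} = 0 . \]
The first stationarity condition forces \(\lambda \neq 0\), so the second reduces to \(\partial N / \partial \alpha = 0\).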
Another consideration is whether \(\partial N / \partial \alpha = 0\) has a unique solution. Using the exact formula from Section 3.1.1, this seems to be the case, but is a bit annoying to work out precisely.
This symmetry between the two heuristics suggests that we should think of \(\alpha\) as attuned to the pair \((N, \epsilon)\).
Note
According to the approximate solutions found in Section 3.2 and Section 3.3, \(\epsilon\) and \(N\) are approximately related by
\[ N \epsilon = 0.28 . \]
It is not surprising that \(N\) and \(\epsilon\) should be inversely proportional, but this tells us the constant of proportionality. More precisely, the constant is \(A E_1(A) \approx 0.28\), where \(E_1(z) = \int_z^\infty t^{-1}e^{-t} dt\) is the exponential integral, and \(A > 0\) is the unique solution of \(E_1(z) = e^{-z}\).
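A quick numerical check of this constant (a minimal sketch using SciPy):

import numpy as np
from scipy.optimize import root_scalar
from scipy.special import exp1

# Solve E_1(x) = e^(-x) for the constant A
A = root_scalar(lambda x: exp1(x) - np.exp(-x), bracket=[0.1, 1.0]).root
print(A)            # approximately 0.4348
print(A * exp1(A))  # approximately 0.281, the constant relating N and epsilon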
The heuristics are really about priors
The heuristics above are not magic ways to find the optimal \(\alpha\), because they rely on some prior assumptions about \(N\) or \(\epsilon\). What they accomplish is to set \(\alpha\) in a way that is attuned to those priors.
In practice, one doesn’t know in advance exactly which \(N\) or \(\epsilon\) will be optimal; at best, one has a belief about a range of reasonable values, i. e. a prior distribution. Via the approximate heuristics \(\alpha = 1.54 N / K\) and \(\alpha = 0.43 / (K \epsilon)\), any prior belief about \(N\) or \(\epsilon\) translates in a straightforward way into a prior for \(\alpha\), on the level of distributions. This is the natural way to interpret the heuristics from the perspective of Bayesian inference.
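As an illustration, here is a minimal sketch of pushing a prior belief about \(N\) through the rule of thumb to obtain a prior over \(\alpha\). The log-normal prior on \(N\) is a hypothetical choice for illustration, not a recommendation.

import numpy as np

rng = np.random.default_rng(0)
K = 100_000
# Hypothetical prior belief: N is "a few hundred", modeled as log-normal around 200
N_samples = rng.lognormal(mean=np.log(200), sigma=0.5, size=10_000)
# Push the belief through the rule of thumb alpha = 1.54 * N / K
alpha_samples = 1.54 * N_samples / K
print(np.percentile(alpha_samples, [5, 50, 95]))  # induced prior range for alpha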
Solutions to the heuristics
In Section 2.1, the heuristics Tip 1 and Tip 2 were formulated as complementary optimization problems. In this section, remarkably simple approximate solutions to these optimization problems will be provided:
For Tip 1 (target number \(N\) of significant words): \(\alpha = 1.54 \frac{N}{K}\).
For Tip 2 (target significance threshold \(\epsilon\)): \(\alpha = \frac{0.43}{K \epsilon}\).
The accuracy of these approximate solutions, i. e. how close they are to the exact solutions, will be addressed in Section 4.
Let us first give a quick overview of how these solutions are obtained. First, an exact formula for the number of significant words \(N(K, \alpha, \epsilon)\) in terms of Beta distributions will be provided. To go from here, two ideas are needed:
Idea 1: The total concentration
The first idea is to reformulate the optimization problems in terms of the total concentration \(C = K \alpha\).3 As it turns out, \(C\) controls the sparsity of distributions sampled from \(\operatorname{Dir}_K (\alpha)\) for large \(K\). Specifically, the number of significant words \(N(\alpha, \epsilon) = \operatorname{E}[ \# \lbrace i | X_i > \epsilon \rbrace]\) has a limit as \(K \to \infty\) for a fixed value of \(C = K \alpha\).
3 The parameter \(\alpha\) is called the concentration parameter, which is why we call \(C = K \alpha\) the total concentration.
Idea 2: The exponential integral approximation
The second idea is to use an approximation of \(N(\alpha, \epsilon)\) in terms of the exponential integral \(E_1(x) = \int_x^\infty t^{-1} e^{-t} dt\), and further replace the heuristics with their approximate counterparts. The upshot is that it becomes easy to obtain exact solutions, requiring nothing more than a few lines of calculus. These solutions are exactly the rules of thumb given above.
The exact formula
The central quantity is the expected number of significant words
\[ N(\alpha, \epsilon) = \operatorname{E}[ \# \lbrace i | X_i > \epsilon \rbrace ] , \]
which shows up in the heuristics. There is an expression for this number in terms of Beta distributions:
\[ N(\alpha, \epsilon) = K P(Y > \epsilon), \quad \text{ where }Y \sim \operatorname{Beta}(\alpha, (K-1)\alpha) . \]
Note (Proof)
Since each marginal \(X_i \sim \operatorname{Beta}(\alpha, (K-1)\alpha)\) and expectation is linear, \(N(\alpha, \epsilon) = \sum_i P(X_i > \epsilon) = K P(Y > \epsilon)\).
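The formula is easy to check by Monte Carlo (a minimal sketch; the parameter values are arbitrary):

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
K, alpha, eps = 1000, 0.05, 0.01

# Empirical estimate of E[#{i : X_i > eps}] from Dirichlet samples
X = rng.dirichlet([alpha] * K, size=2000)
N_mc = (X > eps).sum(axis=1).mean()

# Exact formula: K * P(Y > eps) with Y ~ Beta(alpha, (K-1)*alpha)
N_formula = K * beta.sf(eps, alpha, (K - 1) * alpha)
print(N_mc, N_formula)  # should agree up to Monte Carlo error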
This formula makes it feasible to solve the heuristics numerically, for example using the following code:
Numerical solvers using the exact formula
import numpy as np
from scipy.optimize import root_scalar, minimize_scalar
from scipy.special import exp1, betaincc, betaincinv

# First, find the constant A which is the unique solution to E_1(x) = e^(-x).
# This is used for initial guesses in the numerical solvers.
def _f_for_A(x):
    return exp1(x) - np.exp(-x)

_solution = root_scalar(_f_for_A, bracket=[0.1, 1.0])
A = _solution.root
E1_A_inv = 1 / exp1(A)

def get_epsilon(alpha, K, N):
    """
    For a given alpha, K, N, find epsilon such that N = K * P(Y > epsilon)
    where Y ~ Beta(alpha, (K-1)*alpha). This is equivalent to P(Y < epsilon) = 1 - N/K.
    """
    if alpha <= 0 or N <= 0 or N >= K:
        return 0
    # betaincinv(a, b, p) finds x such that P(Y < x) = p for Y ~ Beta(a,b)
    # We want to find epsilon such that P(Y > epsilon) = N/K
    # i.e. P(Y < epsilon) = 1 - N/K
    try:
        return betaincinv(alpha, (K - 1) * alpha, 1 - N / K)
    except (ValueError, RuntimeError):
        return 0

def find_alpha_exact_h1(K, N):
    """Finds the exact alpha that maximizes epsilon for a given K and N."""
    def objective(alpha):
        # We want to maximize epsilon, so we minimize -epsilon
        return -get_epsilon(alpha, K, N)
    # Search for alpha around the approximate solution
    alpha_approx = E1_A_inv * N / K
    # The search bounds need to be reasonably wide but not too wide
    # to avoid numerical issues.
    lower_bound = alpha_approx * 1e-2
    upper_bound = alpha_approx * 1e2
    # Brent's method is often good for 1D optimization.
    res = minimize_scalar(objective, bounds=(lower_bound, upper_bound), method='bounded')
    return res.x if res.success else np.nan

def find_alpha_exact_h2(K, epsilon):
    """Finds the exact alpha that maximizes N for a given K and epsilon."""
    def objective(alpha):
        if alpha <= 0:
            return np.inf
        try:
            return -betaincc(alpha, (K - 1) * alpha, epsilon)
        except (ValueError, RuntimeError):
            return np.inf
    ke = K * epsilon
    alpha_approx = A / ke
    lower_bound = alpha_approx * 1e-2
    upper_bound = alpha_approx * 1e2
    res = minimize_scalar(objective, bounds=(lower_bound, upper_bound), method='bounded')
    return res.x if res.success else np.nan
Let’s see what this says about the Wikipedia example from the introduction, and compare it to the approximate value obtained there. For \(N = 200\) and \(K = 100,000\), the exact solution of Tip 1 computed this way agrees with the rule-of-thumb value \(\alpha = 1.54 N / K \approx 0.003\) to within about 1%, which is more than good enough for our purposes.
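In code, using the solvers defined above (a usage sketch; the exact output depends on the numerical optimizer):

K, N = 100_000, 200
alpha_exact = find_alpha_exact_h1(K, N)  # exact solution of the Tip 1 optimization problem
alpha_rule = E1_A_inv * N / K            # rule of thumb, approximately 1.54 * N / K
print(alpha_exact, alpha_rule)           # these agree to within about 1%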
Total concentration and the asymptotic formula
For large \(K\), the sparsity is controlled by the total concentration \(C = K \alpha\). Specifically, keeping \(C\) constant, we can calculate the limit
\[ \lim_{K \to \infty} N(\alpha, \epsilon) = C \int_{\epsilon}^1 t^{-1} (1-t)^{C - 1} \, dt . \]
Note (Proof)
Writing \(K = C / \alpha\) and using the exact formula from Section 3.1.1,
\[ N(\alpha, \epsilon) = K P(Y > \epsilon) = \frac{C \int_{\epsilon}^1 t^{\alpha - 1} (1-t)^{C - \alpha - 1} \, dt}{\alpha B(\alpha, C - \alpha)} . \]
The integrand in the numerator is bounded, so \(\lim_{\alpha \to 0} \text{(numerator)} = C \int_{\epsilon}^1 t^{-1} (1-t)^{C - 1} dt\) by the dominated convergence theorem. In addition, \(\lim_{\alpha \to 0} \text{(denominator)} = 1\) since \(B(\alpha, C - \alpha) = \Gamma(\alpha) \Gamma(C - \alpha) / \Gamma(C)\), and \(\alpha \Gamma(\alpha) = \Gamma(\alpha + 1)\).
The exponential integral approximation
While the asymptotic formula is conceptually satisfying, there is a much better approximation for the purposes of solving the heuristics. Keeping \(C\) fixed, we have
\[ N(\alpha, \epsilon) \approx C E_1(C \epsilon), \quad \text{ for sufficiently large } K , \]
where \(E_1(x) = \int_x^\infty t^{-1} e^{-t} dt\) is a special function known as the exponential integral.
Note (Proof of the approximation)
A sample \(\mathbf{X} \sim \operatorname{Dir}_K(\alpha)\) can be viewed as being generated from Gamma distributions: Take \(K\) independent samples \(Z_i \sim \operatorname{Gamma}(\alpha, 1)\), and set \(X_i = \frac{Z_i}{\sum_j Z_j}\). For large \(K\), \(\sum_j Z_j \approx K \alpha = C\) by the central limit theorem, so \(X_i \approx \frac{Z_i}{C}\). For small \(\alpha\), \(\Gamma(\alpha) \approx 1/\alpha\), so \(P(Z_i > C \epsilon) = \frac{1}{\Gamma(\alpha)} \int_{C\epsilon}^\infty t^{\alpha - 1} e^{-t} \, dt \approx \alpha E_1(C \epsilon)\). Therefore, \(N(\alpha, \epsilon) = K\, P(X_i > \epsilon) \approx K \alpha E_1(C \epsilon) = C E_1(C \epsilon)\).
Alternatively, substitute \(t = u/C\) in the asymptotic formula from Section 3.1.2 to obtain
\[ \int_{\epsilon}^1 t^{-1} (1 - t)^{C - 1} \, dt = \int_{C\epsilon}^{C} u^{-1} \left(1 - \tfrac{u}{C}\right)^{C - 1} du . \]
Since \((1 - u/C)^{C} \approx e^{-u}\) for reasonably large \(C\), the right-hand side is approximately \(\int_{C \epsilon}^\infty u^{-1} e^{-u} \, du = E_1(C \epsilon)\); multiplying by \(C\) gives \(N(\alpha, \epsilon) \approx C E_1(C \epsilon)\), and more detailed bounds can be worked out.
Note
While the error term in the approximation isn’t quantified here, empirical estimates show good accuracy.
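Here is a quick empirical comparison of the exact formula and the exponential integral approximation (a minimal sketch; the parameter values are chosen arbitrarily):

from scipy.special import exp1
from scipy.stats import beta

K, C, eps = 10_000, 10.0, 0.05
alpha = C / K

N_exact = K * beta.sf(eps, alpha, (K - 1) * alpha)  # exact formula from Section 3.1.1
N_approx = C * exp1(C * eps)                        # exponential integral approximation
print(N_exact, N_approx)  # close for large K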
Motivated by the approximate formula, define the function \(F(C, \epsilon) = C E_1(C \epsilon)\). With this new notation, the approximate formula can be restated as \(\operatorname{E}[ \# \lbrace i | X_i > \epsilon \rbrace ] \approx F(C, \epsilon)\) for large \(K\) (or equivalently small \(\alpha\)).
Code for generating plot of F(C, epsilon).
import numpy as np
import plotly.graph_objects as go
from scipy.special import exp1

# Define the function F(C, ε)
def F(C, epsilon):
    return np.clip(C * exp1(C * epsilon), 0, 1e6)

# Create a grid of C and ε values
C_vals = np.linspace(0.1, 1000, 100)
epsilon_vals = np.logspace(-3, -1, 100)  # Epsilon from 0.001 to 0.1
C_grid, epsilon_grid = np.meshgrid(C_vals, epsilon_vals)

# Calculate F for each point on the grid
F_vals = F(C_grid, epsilon_grid)

# Create an interactive 3D plot with Plotly
fig = go.Figure(data=[go.Surface(z=F_vals, x=C_grid, y=epsilon_grid, colorscale='viridis')])
fig.update_layout(
    title='Interactive plot for F(C, ε)',
    scene=dict(
        xaxis_title='C',
        yaxis_title='ε',
        zaxis_title='F(C, ε)',
        yaxis_type='log'
    ),
    margin=dict(l=65, r=50, b=65, t=90)
)
fig.show()
Approximate solution for Tip 2
Instead of solving the optimization problem for Tip 2 directly, we solve the corresponding optimization problem using the approximation \(F(C, \epsilon)\): Namely, maximize \(F(C, \epsilon)\) for a given \(\epsilon\). This will turn out to have a particularly nice solution, which is why the approximation \(F(C, \epsilon)\) was chosen in the first place.
The standard calculus problem of maximizing the function \(C \mapsto F(C, \epsilon)\) can be solved by setting the derivative to zero. Using the fundamental theorem of calculus (which gives \(E_1'(x) = -e^{-x}/x\)),
\[ \frac{\partial F}{\partial C} = E_1(C \epsilon) + C \frac{\partial}{\partial C} E_1(C \epsilon) = E_1(C \epsilon) - e^{-C \epsilon} . \]
The key point is now that the equation \(E_1(x) = e^{- x}\) has a unique solution \(A\), which is computed numerically in the code cell below.
Code for finding numerical solution
import numpy as np
from scipy.optimize import root_scalar
from scipy.special import exp1

# Define the function f(x) = E_1(x) - e^(-x)
def f(x):
    return exp1(x) - np.exp(-x)

# Find the root of f(x) = 0
solution = root_scalar(f, bracket=[0.1, 1.0])
A = solution.root
The unique solution \(A\) of \(E_1(x) = e^{-x}\) is given by \(A =\) 0.4348182043849037. The critical point of \(F(C, \epsilon)\) for fixed \(\epsilon\) is therefore given by (using an approximate value of \(A\))
\[ K \alpha \epsilon = C \epsilon = 0.4348 . \]
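As a numerical sanity check (a minimal sketch, for an arbitrary \(\epsilon\)), evaluating \(F\) around \(C = A / \epsilon\) confirms that the critical point is a maximum:

from scipy.special import exp1

def F(C, epsilon):
    return C * exp1(C * epsilon)

eps = 0.05
C_star = A / eps  # A was computed above, approximately 0.4348
for C in [0.5 * C_star, C_star, 2 * C_star]:
    print(C, F(C, eps))  # F(C, eps) is largest at C_star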
The ridge
In the course of solving the optimization problem for Tip 2, we found that the optimal \(C\) is given by \(C \epsilon = A\), where \(A = 0.4348\) is the unique solution of \(E_1(x) = e^{-x}\). Geometrically, the equation \(C \epsilon = A\) describes a curve in the \(C\epsilon\)-plane which lies below the “ridge” of the surface \(F(C, \epsilon)\). This is illustrated in the figure below.
Code
import numpy as np
import plotly.graph_objects as go
from scipy.special import exp1

# --- Data for the surface (same as before) ---
C_vals = np.linspace(0.1, 500, 100)
epsilon_vals = np.logspace(-3, -1, 100)
C_grid, epsilon_grid = np.meshgrid(C_vals, epsilon_vals)
F_vals = F(C_grid, epsilon_grid)

# --- Data for the ridge line ---
# A is available from the previous cell
C_ridge = np.linspace(C_vals.min(), C_vals.max(), 200)
epsilon_ridge = A / C_ridge
F_ridge = F(C_ridge, epsilon_ridge)

# --- Create the plot ---
# Create the surface plot
fig = go.Figure(data=[go.Surface(
    z=F_vals,
    x=C_grid,
    y=epsilon_grid,
    colorscale='viridis',
    opacity=0.8,
    showscale=False
)])

# Add the ridge line
fig.add_trace(go.Scatter3d(
    x=C_ridge,
    y=epsilon_ridge,
    z=F_ridge,
    mode='lines',
    line=dict(color='red', width=5),
    name='Ridge (Cε = A)'
))

# --- Configure layout ---
fig.update_layout(
    title='Surface of F(C, ε) with highlighted ridge over Cε=A',
    scene=dict(
        xaxis_title='C',
        yaxis_title='ε',
        zaxis_title='F(C, ε)',
        yaxis_type='log'
    ),
    margin=dict(l=65, r=50, b=65, t=90)
)
fig
Figure 3: The surface F(C, ε) with the ridge Cε=A highlighted.
Approximate solution for Tip 1
Using the approximate formula, a simple approximate solution for Tip 1 can also be found. The problem is now to maximize \(\epsilon\) under the constraint that \(N = F(C, \epsilon) = C E_1(C \epsilon)\). By staring at Figure 3, it is clear that the solution of this optimization problem also happens at a point where \(\frac{\partial F}{\partial C} = 0\).4 From the previous section, this equation describes a curve \(C \epsilon = A\), where \(A= 0.4348\) is the unique solution of \(E_1(x) = e^{-x}\). Therefore, \(N = C E_1(C \epsilon) = C E_1(A)\), so the optimal \(C\) is given by the linear relationship
4 The fact that the maximum of \(\epsilon\) is achieved where \(\frac{\partial F}{\partial C} = 0\) can also be shown using Lagrange multipliers.
\[ K \alpha = C = E_1(A)^{-1} N . \]
The slope can be calculated numerically as \(E_1(A)^{-1} = 1.5446822169704288 \approx 1.54\).
When are the rules of thumb accurate?
We now turn to the question of when the approximate solutions found in Section 3.3 and Section 3.2 are close to the exact solutions using the exact formula for \(N(K, \alpha, \epsilon)\).
The comparison requires a choice of metric. A natural choice is the relative error of the variable being optimized, i.e. \(\epsilon\) for Tip 1 and \(N\) for Tip 2. The “regions of validity” will denote the values for which the rules of thumb yield \(\epsilon\) or \(N\) within some pre-defined error threshold.
It is worth emphasizing that this analysis doesn’t answer when the heuristics themselves should be applied. But in a stroke of good fortune, it does seem like the regions of validity coincide with values often used in real world problems!
For Tip 2, let \(\alpha_{\text{exact}}(K, \epsilon)\) denote the exact solution to the optimization problem formulated in Tip 2 for a given \(K\) and target \(\epsilon\), and let \(N_{\text{exact}}(K, \epsilon) = N(K, \alpha_{\text{exact}}, \epsilon)\) denote the resulting expected number of significant words. Similarly, let \(\alpha_{\text{approx}} = A / (K \epsilon)\) denote the approximate value found in Section 3.2, and let \(N_{\text{approx}} (K, \epsilon) = N(K, \alpha_{\text{approx}}, \epsilon)\) denote the resulting expected number of significant words (calculated with the exact formula).
The metric used to evaluate the approximation will be the relative error
\[ E_{2, \text{rel}}(K, \epsilon) = \frac{\left| N_{\text{exact}}(K, \epsilon) - N_{\text{approx}}(K, \epsilon) \right|}{N_{\text{exact}}(K, \epsilon)} . \]
Specifically, the region of validity for threshold \(T\) is the set of \((K, \epsilon)\) where \(E_{2, \text{rel}}(K, \epsilon) \leq T\).
Show code for calculating region of validity
import matplotlib.pyplot as plt
from scipy.optimize import brentq
from scipy.special import betaincc

# A is calculated in a previous cell, it is the solution to E_1(x) = e^(-x)
# A approx 0.4348

def get_rel_error_N_h2(K, epsilon):
    """Calculates the relative error in N for Heuristic 2."""
    if epsilon <= 1 / K or epsilon >= 1:
        return np.inf  # Less than 1/K: Dense regime
    alpha_exact = find_alpha_exact_h2(K, epsilon)
    alpha_approx = A / (K * epsilon)
    if np.isnan(alpha_exact):
        return np.inf
    N_exact = betaincc(alpha_exact, (K - 1) * alpha_exact, epsilon)
    N_approx = betaincc(alpha_approx, (K - 1) * alpha_approx, epsilon)
    if N_exact == 0:
        return np.inf
    rel_error = np.abs(N_exact - N_approx) / N_exact
    return rel_error

def find_all_valid_intervals_h2(K, tolerance, error_func, search_space):
    """
    Finds all disjoint intervals where the error is within tolerance.
    Robust to oscillating error curves.
    """
    def root_func(val):
        try:
            return error_func(K, val) - tolerance
        except (ValueError, RuntimeError):
            return np.inf

    errors = np.array([root_func(val) for val in search_space])
    valid_mask = errors <= 0
    valid_indices = np.where(valid_mask)[0]
    if not valid_indices.size:
        return []

    # Find start and end indices of contiguous blocks of valid points
    diffs = np.diff(valid_indices)
    block_ends_mask = np.append(np.where(diffs > 1)[0], len(valid_indices) - 1)
    block_starts_mask = np.insert(block_ends_mask[:-1] + 1, 0, 0)
    block_starts_idx = valid_indices[block_starts_mask]
    block_ends_idx = valid_indices[block_ends_mask]

    intervals = []
    for start_idx, end_idx in zip(block_starts_idx, block_ends_idx):
        lower_bound = search_space[start_idx]
        upper_bound = search_space[end_idx]
        # Refine lower bound
        if start_idx > 0:
            try:
                lower_bound = brentq(root_func, search_space[start_idx - 1], search_space[start_idx])
            except (ValueError, RuntimeError):
                pass
        # Refine upper bound
        if end_idx < len(search_space) - 1:
            try:
                upper_bound = brentq(root_func, search_space[end_idx], search_space[end_idx + 1])
            except (ValueError, RuntimeError):
                pass
        intervals.append((lower_bound, upper_bound))
    return intervals

# --- Main calculation ---
K_vals = np.logspace(2, 4, 15)
tolerances = [0.05, 0.01]
results = {tol: [] for tol in tolerances}
epsilon_search_space = np.linspace(0, 1, 1000)

for tol in tolerances:
    for K in K_vals:
        intervals = find_all_valid_intervals_h2(K, tol, get_rel_error_N_h2, epsilon_search_space)
        results[tol].append(intervals)

# --- Plotting ---
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(figsize=(12, 7))
colors = {'0.05': 'skyblue', '0.01': 'lightgreen'}
labels = {'0.05': '5% tolerance', '0.01': '1% tolerance'}

for tol in tolerances:
    has_added_label = False
    for i, K in enumerate(K_vals):
        intervals = results[tol][i]
        if not intervals:
            continue
        # Define x-range for the vertical bar for this K on a log scale
        if i > 0:
            x_start = np.sqrt(K_vals[i - 1] * K)
        else:
            x_start = K / np.sqrt(K_vals[1] / K_vals[0]) if len(K_vals) > 1 else K * 0.9
        if i < len(K_vals) - 1:
            x_end = np.sqrt(K * K_vals[i + 1])
        else:
            x_end = K * np.sqrt(K_vals[-1] / K_vals[-2]) if len(K_vals) > 1 else K * 1.1
        for min_val, max_val in intervals:
            label = labels[str(tol)] if not has_added_label else None
            ax.fill_between([x_start, x_end], min_val, max_val, color=colors[str(tol)], alpha=0.5, label=label)
            has_added_label = True

ax.set_xscale('log')
ax.set_yscale('log')
ax.set_ylim(1e-4, 1)
ax.set_xlabel('Vocabulary Size (K)', fontsize=12)
ax.set_ylabel('Valid Range for ε', fontsize=12)
ax.set_title('Region of Validity for Heuristic 2 (Error in N)', fontsize=14)
ax.legend()
ax.grid(True, which="both")
plt.show()
Figure 4: For T = 0.01 or 0.05, the plot shows regions of (K, ε) where E_rel(K, ε) ≤ T.
The plot reveals some subtle but important properties of the approximation. Let’s address some of these in QA format:
Q: Why does the approximation break down for small \(\epsilon\)?
Small \(\epsilon\) corresponds to the dense regime, but the approximation was designed for the sparse regime. Specifically, with \(\epsilon \ll 1/K\), the best way to maximize the number of significant words \(N\) is to sample near the uniform distribution (all words are significant).
Q: Why does the approximation break down for large \(\epsilon\)?
This is a consequence of (small) absolute errors in the approximation. For example, if \(\epsilon = 0.25\), then \(N \leq 4\) since \(N \epsilon \leq 1\). On the other hand, some computations reveal that the absolute difference is around \(|N_{\text{exact}} - N_{\text{approx}}| \approx 0.68\) for a reasonably large \(K\). This is not a large error in absolute terms, but it results in a relative error of roughly 5% as is clear from the plot.
The point is just that \(N\) gets small, making the relative error larger.
Q: Doesn’t this contradict the claim that the approximate formula is accurate for large \(K\)?
The exact relation between \(N_{\text{exact}}\) and \(N_{\text{approx}}\) is somewhat delicate:
The approximate formula describes \(N\) for large \(K\) when the total concentration \(C = K \alpha\) is held constant. If \(K\), \(\alpha\), and \(\epsilon\) are all allowed to vary, it becomes more difficult to quantify the exact error in the approximation.
The errors are not only due to the difference between the exact and approximate value of \(N\) at a point \((K, \epsilon, \alpha)\), but also due to the difference between their maxima, which occur at different points.
The analysis will be similar to the one for Tip 2. This time, \(K\) and \(N\) are fixed and \(\epsilon_{\text{exact}}(K, N)\) denotes the exact maximum for the optimization problem formulated in Tip 1. On the other hand, the approximate maximum derived in Section 3.3 was given by \(\alpha_{\text{approx}}(K, N) = E_1(A)^{-1} N / K\). Define \(\epsilon_{\text{approx}}(K, N)\) to be the significance threshold achieved by this \(\alpha_{\text{approx}}\).
That is, \(\epsilon_{\text{approx}}(K, N)\) solves the optimization problem defining Tip 1 for a fixed \(\alpha = \alpha_{\text{approx}}(K, N)\). Using the exact formula from Section 3.1.1, this is equivalent to solving the equation \(N = K P(Y > \epsilon)\) for \(\epsilon\), where \(Y \sim \operatorname{Beta}(\alpha_{\text{approx}}, (K-1)\alpha_{\text{approx}})\).
Note
Note that \(\epsilon_{\text{approx}}(K,N)\) is not the \(\epsilon\) which appears in the equation \(C \epsilon = A\) from Section 3.2, although they should be related: \(\epsilon_{\text{approx}}\) should be close to \(\frac{A}{C} = \frac{A}{K \alpha_{\text{approx}}} = \frac{A E_1(A)}{N}\) according to the approximation.
The metric used to evaluate the approximation in this case is the relative error
\[ E_{\text{rel}}(K, N) = \frac{\left| \epsilon_{\text{exact}}(K, N) - \epsilon_{\text{approx}}(K, N) \right|}{\epsilon_{\text{exact}}(K, N)} . \]
Again, the region of validity for a threshold \(T\) is the set of \((K, N)\) where \(E_{\text{rel}}(K, N) \leq T\).
Show code for calculating region of validity
import matplotlib.pyplot as plt
from scipy.optimize import brentq
from scipy.special import betaincinv

# A is calculated in a previous cell, it is the solution to E_1(x) = e^(-x), approx 0.4348

def get_rel_error_epsilon_h1(K, N_over_K):
    """Calculates the relative error in epsilon for Heuristic 1."""
    N = N_over_K * K
    if N <= 1 or N >= K:
        return np.inf
    # 1. Find exact maximum epsilon
    alpha_exact = find_alpha_exact_h1(K, N)
    if np.isnan(alpha_exact):
        return np.inf
    epsilon_exact = get_epsilon(alpha_exact, K, N)
    # 2. Find epsilon from approximate alpha
    alpha_approx = E1_A_inv * N / K
    epsilon_from_approx = get_epsilon(alpha_approx, K, N)
    if epsilon_exact == 0:
        return np.inf
    rel_error = np.abs(epsilon_exact - epsilon_from_approx) / epsilon_exact
    return rel_error

def find_all_valid_intervals_h1(K, tolerance, search_space_N_over_K):
    """
    Finds all disjoint intervals for N/K where the error is within tolerance.
    """
    def root_func(n_over_k):
        try:
            return get_rel_error_epsilon_h1(K, n_over_k) - tolerance
        except (ValueError, RuntimeError):
            return np.inf

    errors = np.array([root_func(val) for val in search_space_N_over_K])
    valid_mask = errors <= 0
    valid_indices = np.where(valid_mask)[0]
    if not valid_indices.size:
        return []

    # Find start and end indices of contiguous blocks of valid points
    diffs = np.diff(valid_indices)
    block_ends_mask = np.append(np.where(diffs > 1)[0], len(valid_indices) - 1)
    block_starts_mask = np.insert(block_ends_mask[:-1] + 1, 0, 0)
    block_starts_idx = valid_indices[block_starts_mask]
    block_ends_idx = valid_indices[block_ends_mask]

    intervals = []
    for start_idx, end_idx in zip(block_starts_idx, block_ends_idx):
        lower_bound = search_space_N_over_K[start_idx]
        upper_bound = search_space_N_over_K[end_idx]
        # Refine interval boundaries using a root finder
        if start_idx > 0:
            try:
                lower_bound = brentq(root_func, search_space_N_over_K[start_idx - 1], search_space_N_over_K[start_idx])
            except (ValueError, RuntimeError):
                pass
        if end_idx < len(search_space_N_over_K) - 1:
            try:
                upper_bound = brentq(root_func, search_space_N_over_K[end_idx], search_space_N_over_K[end_idx + 1])
            except (ValueError, RuntimeError):
                pass
        intervals.append((lower_bound, upper_bound))
    return intervals

# --- Main calculation ---
K_vals = np.logspace(2, 6, 15)
tolerances = [0.05, 0.01]
results_h1 = {tol: [] for tol in tolerances}
# We scan over N/K instead of N
N_over_K_search_space = np.linspace(1e-6, 0.5, 200)

for tol in tolerances:
    for K in K_vals:
        intervals = find_all_valid_intervals_h1(K, tol, N_over_K_search_space)
        results_h1[tol].append(intervals)

# --- Plotting ---
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(figsize=(12, 7))
colors = {'0.05': 'skyblue', '0.01': 'lightgreen'}
labels = {'0.05': '5% tolerance', '0.01': '1% tolerance'}

for tol in tolerances:
    has_added_label = False
    for i, K in enumerate(K_vals):
        intervals = results_h1[tol][i]
        if not intervals:
            continue
        # Define x-range for the vertical bar for this K on a log scale
        if i > 0:
            x_start = np.sqrt(K_vals[i - 1] * K)
        else:
            x_start = K / np.sqrt(K_vals[1] / K_vals[0]) if len(K_vals) > 1 else K * 0.9
        if i < len(K_vals) - 1:
            x_end = np.sqrt(K * K_vals[i + 1])
        else:
            x_end = K * np.sqrt(K_vals[-1] / K_vals[-2]) if len(K_vals) > 1 else K * 1.1
        for min_val, max_val in intervals:
            label = labels[str(tol)] if not has_added_label else None
            ax.fill_between([x_start, x_end], min_val, max_val, color=colors[str(tol)], alpha=0.5, label=label)
            has_added_label = True

ax.set_xscale('log')
ax.set_yscale('log')
ax.set_ylim(1e-6, 1)
ax.set_xlabel('Vocabulary Size (K)', fontsize=12)
ax.set_ylabel('Valid Range for N/K', fontsize=12)
ax.set_title('Region of Validity for Heuristic 1 (Error in ε)', fontsize=14)
ax.legend()
ax.grid(True, which="both")
plt.show()
Figure 5: For T = 0.01 or 0.05, the plot shows regions of (K, N/K) where E_rel(K, N/K) ≤ T.
Validating the heuristics against conventional wisdom
To verify that the heuristics are reasonable, we can compare them to the conventional wisdom, i. e. ad-hoc recommendations for setting \(\alpha\) that can be found in the literature and elsewhere.
Topic models
The quote from Wikipedia at the beginning of this post is a good example of conventional wisdom for setting \(\alpha\) for a topic model. In the introduction it was verified that the simple heuristic \(\alpha = 1.54 N / K\) falls within the recommended range.
Another heuristic mentioned in some places is setting \(\alpha = 50 / K\), or equivalently \(C = 50\). Using the approximate formulas, this corresponds to \(N = C / 1.544 \approx 32.4\) significant words at a significance threshold of \(\epsilon = 0.4348 / C \approx 0.009\).
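The same translation in code (a minimal sketch, reusing the constant \(A\) computed earlier):

from scipy.special import exp1

C = 50.0                 # total concentration implied by alpha = 50 / K
N_implied = C * exp1(A)  # approximately 32 significant words
eps_implied = A / C      # significance threshold, approximately 0.009
print(N_implied, eps_implied)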
Dirichlet noise in AlphaZero
The purpose of Dirichlet noise in AlphaZero is to enable meaningful random exploration. Recall the way this works: For a “prior policy” \(\mathbf{p}\) over \(K\) available actions at the root node, supplied by a neural net, Dirichlet noise samples \(\mathbf{X} \sim \operatorname{Dir}_K(\alpha)\) and replaces \(\mathbf{p}\) by \(\mathbf{p}_{\text{new}} = (1 - \lambda)\mathbf{p} + \lambda\mathbf{X}\) for some \(\lambda \in [0, 1]\).5
5 The \(\lambda\) here is usually denoted \(\epsilon\), and often referred to as “Dirichlet epsilon” in Dirichlet noise.
Think of \(\lambda\) as the allotted budget for purely random exploration, and \(\mathbf{X}\) as the distribution dictating how this random exploration is to be done. For a specific action \(i\), the value of \(X_i\) controls whether action \(i\) is explored in a significant way. The goal is to tune \(\alpha\) so that the number of actions seeing significant exploration is maximized.
It is instructive to think about what happens in the edge cases. If \(\alpha\) is too large, \(\mathbf{X}\) is on average closer to a uniform distribution and each \(X_i\) is small, meaning that each action is explored at most a few times. If \(\alpha\) is very small, \(\mathbf{X}\) is close to a one-hot distribution, in which case all of the random exploration budget is spent exploring a single action. The sweet spot is somewhere in between.
For example, suppose we believe that a meaningful amount of exploration is achieved when an action being explored is visited at least 10 times. If we run 800 simulations in total, using \(\lambda = 0.25\), this leaves 200 simulations for random exploration. Out of those 200 simulations, we want a significant action to be visited 10 times, or 5% of the budget. That means we should use a significance threshold of \(\epsilon = 0.05\). Using the approximate formula, this corresponds to a total concentration of \(K \alpha = C = 0.4348 / \epsilon = 8.696\).
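The same arithmetic in code (a minimal sketch; the simulation budget and visit threshold are the illustrative values above):

total_simulations = 800
noise_weight = 0.25  # lambda, the share of the root policy given to the noise
noise_budget = noise_weight * total_simulations  # 200 simulations
target_visits = 10   # what we consider significant exploration

eps = target_visits / noise_budget  # 0.05
C = A / eps                         # total concentration, approximately 8.7
print(eps, C)
# For a game with branching factor K, the heuristic then gives alpha = C / K.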
Let’s run the numbers, using the values from the AlphaZero paper:
Table 1: Values for \(\alpha\) used in the AlphaZero paper, known values of the branching factors \(K\), and the corresponding values for \(C\) and \(N\) using the approximate formulas. For exploration constant \(\lambda = 0.25\) and \(800\) simulations, \(200 \epsilon\) estimates the number of times a significant action is explored.
| Game  | \(\alpha\) | \(K\) | \(C\) | \(\epsilon = 0.4348 / C\) | \(200 \epsilon\) |
|-------|------------|-------|-------|---------------------------|------------------|
| Chess | 0.3        | 35    | 10.5  | 0.041                     | 8.2              |
| Shogi | 0.15       | 80    | 12    | 0.036                     | 7.2              |
| Go    | 0.03       | 250   | 7.5   | 0.058                     | 11.6             |
Note that these values of \(C\) are pretty close to the one we obtained, \(8.7\).
Summary and practical takeaways
The heuristics in this post concern the estimation of a good concentration parameter \(\alpha\), based on the relation
\[ N = \operatorname{E}[ \# \lbrace i | X_i > \epsilon \rbrace ] \]
between the number \(N\) of significant words and the significance threshold \(\epsilon\), where the expectation is taken over \(\mathbf{X} \sim \operatorname{Dir}_K(\alpha)\), the symmetric Dirichlet distribution of order \(K\) with concentration parameter \(\alpha\). Two heuristics were suggested, based on the following two cases:
When we have an idea about a good target value of \(N\), as in the case of the topic model,
When we have an idea about a good target value of \(\epsilon\), as in the case of Dirichlet noise in AlphaZero.
The heuristics, formulated as optimization problems in Section 2.1, lead to simple rules of thumb summarized in the following table.
Table 2: Summary of the heuristics.
| Known target value | Rule of thumb | Within 5% of exact heuristic when (roughly) |
|--------------------|---------------|---------------------------------------------|
| number \(N\) of significant words | \(\alpha = 1.54 N / K\) | \(2 < N < 0.2 K\) |
| significance threshold \(\epsilon\) | \(\alpha = 0.43 / (K \epsilon)\) | \(2 / K < \epsilon < 0.2\) |
The constant \(0.43\) is an approximation of the unique solution \(A\) of \(E_1(x) = e^{-x}\), and \(1.54\) is an approximation of \(E_1(A)^{-1}\).
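For convenience, here is a minimal sketch bundling the two rules of thumb into helper functions (the function names are my own, not taken from any library):

import numpy as np
from scipy.optimize import root_scalar
from scipy.special import exp1

# A solves E_1(x) = e^(-x); the slope for Tip 1 is 1 / E_1(A)
A = root_scalar(lambda x: exp1(x) - np.exp(-x), bracket=[0.1, 1.0]).root  # ~0.4348
SLOPE = 1 / exp1(A)                                                       # ~1.5447

def alpha_from_target_N(K, N):
    """Tip 1: target number N of significant words out of K."""
    return SLOPE * N / K

def alpha_from_target_eps(K, eps):
    """Tip 2: target significance threshold eps."""
    return A / (K * eps)

print(alpha_from_target_N(100_000, 200))  # ~0.003, the topic model example
print(alpha_from_target_eps(250, 0.05))   # ~0.035, a Go-like branching factor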
Concentration parameters \(\alpha\) determined by these heuristics seem to align well with those found in other, ad-hoc recommendations.