22  Quantile Neural Networks

It is better to be roughly right than precisely wrong.” – John Maynard Keynes

In Chapter 3, we explored how to learn posterior distributions \(p(\theta\mid y)\) using Bayesian learning and to use it for predictions, hypothesis testing, and other tasks. In Chapter 4, we explored how rational agents make decisions under uncertainty by maximizing expected utility. Traditional Bayesian approaches to such problems require computing posterior distributions \(p(\theta\mid y)\), which in turn requires specifying likelihood functions \(p(y\mid \theta)\) and often involves challenging density estimation. But what if we could bypass density estimation entirely and directly learn the quantities we need for decision-making and other tasks?

This chapter introduces quantile neural networks, a powerful approach that learns posterior distributions through their quantile functions rather than their densities. This shift from densities to quantiles has profound implications: it enables likelihood-free inference, provides natural connections to decision theory through the quantile-expectation identity, and scales to high-dimensional problems where density estimation becomes intractable.

The key insight is surprisingly simple. Any expectation—including the expected utility central to decision theory—can be represented as an integral over quantiles: \[ E[f(\theta)] = \int_0^1 F^{-1}_{f(\theta) | y}(\tau) d\tau \] Rather than learning densities and then computing expectations via sampling, we can directly learn the quantile function \(F^{-1}_{f(\theta) | y}(\tau)\) using neural networks. This approach is not only more efficient but also naturally handles simulation-based models where likelihoods are unavailable or computationally expensive.

Throughout this chapter, we explore four major application domains where quantile neural networks demonstrate their power and versatility. We begin with maximum expected utility problems (Section 22.5), showing how to extend generative Bayesian methods to decision problems by incorporating utility functions directly into the training process. This enables efficient computation of optimal decisions without requiring explicit density estimation. Next, we turn to supply chain forecasting (Section 22.8), examining how companies like Amazon use quantile regression to forecast demand under uncertainty. By predicting entire distributions rather than point estimates, these methods enable more sophisticated inventory optimization. We then demonstrate how quantile methods handle the classic problem of optimal portfolio allocation under parameter uncertainty (Section 22.6), extending beyond the limited cases where closed-form solutions exist. Finally, we connect to modern artificial intelligence through distributional reinforcement learning (Section 22.9), showing how quantile methods enable agents to learn entire distributions of returns rather than just expected values, leading to more robust decision-making in complex environments.

We begin by developing the theoretical foundations of quantile regression, deriving the loss functions from first principles, and then show how neural networks provide a flexible architecture for learning complex quantile functions in high dimensions.

22.1 From Densities to Quantiles: A Generative Approach

Let \((X,Y) \sim P_{X,Y}\) be input-output pairs and \(P_{X,Y}\) a joint measure from which we can simulate a training dataset \((x_i, y_i)_{i=1}^N \sim P_{X,Y}\). Standard prediction techniques focus on the conditional posterior mean \(\hat{X}(Y) = E(X|Y) = f(Y)\) of the input given the output. To do this, consider the multivariate non-parametric regression \(X = f(Y) + \epsilon\) and provide methods for estimating the conditional mean. Typical estimators, \(\hat{f}\), include KNN and kernel methods. Recently, deep learners have been proposed and the theoretical properties of superpositions of affine functions (a.k.a. ridge functions) have been provided N. G. Polson and Sokolov (2023).

Generative methods take this approach one step further. Let \(Z \sim P_Z\) be a base measure for a latent variable, \(Z\), typically a standard multivariate normal or vector of uniforms. The goal of generative methods is to characterize the posterior measure \(P_{X|Y}\) from the training data \((x_i, y_i)_{i=1}^N \sim P_{X,Y}\) where \(N\) is chosen to be suitably large. A deep learner is used to estimate \(\hat{f}\) via the non-parametric regression \(X = f(Y, Z)\). In the case where \(Z\) is uniform, this amounts to inverse CDF sampling, namely \(X = F_{X|Y}^{-1}(Z)\).

In general, we characterize the posterior map for any output \(Y\). Simply evaluate the network at any \(Y\) via the transport map \[ X = H(S(Y), \psi(Z)) \] Here \(\psi\) denotes the embedding function. The deep learners \(H\) and \(S\) are estimated from the triples \((X_i, Y_i, \psi(Z_i))_{i=1}^N \sim P_{X,Y} \times P_Z\). The ensuing estimator \(\hat{H}\) can be thought of as a transport map from the base distribution to the posterior as required.

Specifically, the idea of generative methods is straightforward. Let \(y\) denote data and \(\theta\) a vector of parameters including any hidden states (a.k.a. latent variables) \(z\). First, we generate a “look-up” table of “fake” data \(\{y^{(i)}, \theta^{(i)}\}_{i=1}^N\). By simulating a training dataset of outputs and parameters, we can use deep learning to solve for the inverse map via a supervised learning problem. Generative methods have the advantage of being likelihood-free. For example, our model might be specified by a forward map \(y^{(i)} = f(\theta^{(i)})\) rather than a traditional random draw from a likelihood function \(y^{(i)} \sim p(y^{(i)}|\theta^{(i)})\). The base distribution \(P_Z\) is typically uniform (for univariate problems) or a very high-dimensional Gaussian vector (for multivariate problems).

The theoretical foundation for this approach is the noise outsourcing theorem, which guarantees that we can represent any posterior distribution through a deterministic function of the data and a base random variable.

Noise Outsourcing Theorem (Kallenberg 1997): If \((Y, \Theta)\) are random variables in a Borel space \((\mathcal{Y}, \Theta)\), then there exists a random variable \(\tau \sim U(0,1)\) which is independent of \(Y\) and a function \(H: [0,1] \times \mathcal{Y} \rightarrow \Theta\) such that \[ (Y, \Theta) \stackrel{a.s.}{=} (Y, H(Y, \tau)) \] Moreover, if there is a sufficient statistic \(S(Y)\) with \(Y\) independent of \(\Theta | S(Y)\), then \[ \Theta\mid Y \stackrel{a.s.}{=} H(S(Y), \tau). \]

This result tells us that posterior uncertainty can be characterized via an inverse non-parametric regression problem where we predict \(\theta^{(i)}\) from \(y^{(i)}\) and \(\tau^{(i)}\), where \(\tau^{(i)}\) is drawn from a base distribution \(p(\tau)\). The base distribution is typically uniform (for univariate problems) or a very high-dimensional Gaussian vector (for multivariate problems). We train a deep neural network \(H\) on \[ \theta^{(i)} = H(S(y^{(i)}), \tau^{(i)}). \] Here \(S(y)\) is a statistic used to perform dimension reduction with respect to the signal distribution—analogous to sufficient statistics in traditional Bayesian inference. A remarkable result due to Brillinger (2012) shows that we can learn \(S\) independently of \(H\) via ordinary least squares, simplifying the overall estimation problem.

Specifying the architecture of \(H\) is key to the efficiency of the approach. N. Polson, Ruggeri, and Sokolov (2024) propose using quantile neural networks implemented with ReLU activation functions, which we detail in Section 22.7.

22.2 Quantile Regression

Before diving into neural network implementations, we present the foundational concepts of quantile regression. This section derives the quantile loss function from first principles and discusses applications across multiple fields, setting the stage.

It is easy to show that, given observed values \(y_1, \ldots, y_n\), the median is the value that minimizes the expected absolute deviation: \[ m = \arg\min_q \frac{1}{n} \sum_{i=1}^n |y_i - q| = \arg\min_q E[|Y - q|] \] Intuitively, the sum of absolute deviations is minimized when the median is the value that is closest to half of the observations.

What if we want to find quantile \(\tau \in (0,1)\)? We can use the generalization of the absolute value function to find the minimizer of the expected absolute deviation: \[ q_\tau = \arg\min_q \frac{1}{n} \sum_{i=1}^n \rho_\tau(y_i - q). \] Here \(\rho_\tau(u)\) is the check loss or pinball loss and is defined as: \[ \rho_\tau(u) = u(\tau - I(u < 0)) = \begin{cases} \tau u & \text{if } u \geq 0 \\ (\tau - 1) u & \text{if } u < 0 \end{cases} \] This can also be written in the more computationally convenient form: \[ \rho_\tau(u) = \max(u\tau, u(\tau-1)) \]

Example 22.1 (Linear Quantile Regression) To illustrate quantile regression in practice, we’ll analyze the relationship between engine displacement and fuel efficiency using the classic mtcars dataset. Rather than estimating the mean relationship (as ordinary least squares would), we’ll estimate conditional quantiles to understand how this relationship varies across the distribution of fuel efficiency.

# Load mtcars dataset
data(mtcars)

# Define the check loss (pinball loss) function
check_loss <- function(u, tau) {
  # rho_tau(u) = u * (tau - I(u < 0))
  u * (tau - (u < 0))
}

# Objective function: sum of check losses for quantile regression
quantile_objective <- function(beta, X, y, tau) {
  # Predicted values
  y_pred <- X %*% beta
  # Residuals
  residuals <- y - y_pred
  # Sum of check losses
  sum(check_loss(residuals, tau))
}

# Prepare data
X <- cbind(1, log(mtcars$disp))  # Design matrix with intercept
y <- log(mtcars$mpg)

# Fit quantile regression models using optim()
quantiles <- c(0.1, 0.5, 0.9)
results <- list()

for (tau in quantiles) {
  # Initial values (use OLS estimates as starting point)
  ols_fit <- lm(log(mpg) ~ log(disp), data = mtcars)
  beta_init <- coef(ols_fit)
  
  # Optimize using BFGS (quasi-Newton method)
  # We use BFGS because the check loss is non-differentiable at zero,
  # but BFGS can handle this with numerical approximations
  optim_result <- optim(
    par = beta_init,
    fn = quantile_objective,
    X = X,
    y = y,
    tau = tau,
    method = "BFGS"
  )
  
  results[[as.character(tau)]] <- list(
    coefficients = optim_result$par,
    value = optim_result$value,
    convergence = optim_result$convergence
  )
  
  cat(sprintf("tau = %.1f: Convergence = %d, Loss = %.4f\n", 
              tau, optim_result$convergence, optim_result$value))
}
## tau = 0.1: Convergence = 0, Loss = 0.5155
## tau = 0.5: Convergence = 0, Loss = 1.5182
## tau = 0.9: Convergence = 0, Loss = 0.6907
# Also fit OLS for comparison
ols_model <- lm(log(mpg) ~ log(disp), data = mtcars)

# Create visualization
plot(log(mtcars$disp), log(mtcars$mpg), 
     pch = 16, col = "gray30",
     xlab = "Displacement (cu.in.)", 
     ylab = "Miles per Gallon",
     main = "Quantile Regression: Fuel Efficiency vs. Engine Displacement")

# Add regression lines
disp_range <- seq(min(log(mtcars$disp)), max(log(mtcars$disp)), length.out = 100)

# OLS line
abline(ols_model, col = "black", lwd = 2, lty = 2)

# Quantile regression lines
colors <- c("blue", "darkgreen", "red")
for (i in seq_along(quantiles)) {
  tau <- quantiles[i]
  beta <- results[[as.character(tau)]]$coefficients
  # Predicted values: y = beta_0 + beta_1 * x
  pred <- beta[1] + beta[2] * disp_range
  lines(disp_range, pred, col = colors[i], lwd = 2)
}

# Add legend
legend("topright", 
       legend = c("OLS (Mean)", 
                  expression(tau == 0.1),
                  expression(tau == 0.5 ~ "(Median)"),
                  expression(tau == 0.9)),
       col = c("black", colors),
       lty = c(2, 1, 1, 1),
       lwd = 2,
       bg = "white")

grid()

Quantile regression on mtcars dataset. The relationship between engine displacement and fuel efficiency varies across quantiles, revealing heteroskedasticity in the data.
# Print model summaries
cat("\n\nQuantile Regression Results (via optim):\n")
## 
## 
## Quantile Regression Results (via optim):
cat("=========================================\n\n")
## =========================================
for (tau in quantiles) {
  cat(sprintf("Quantile tau = %.1f:\n", tau))
  cat(sprintf("  Intercept: %.4f\n", results[[as.character(tau)]]$coefficients[1]))
  cat(sprintf("  Slope:     %.4f\n", results[[as.character(tau)]]$coefficients[2]))
  cat("\n")
}
## Quantile tau = 0.1:
##   Intercept: 5.6246
##   Slope:     -0.5340
## 
## Quantile tau = 0.5:
##   Intercept: 5.4783
##   Slope:     -0.4781
## 
## Quantile tau = 0.9:
##   Intercept: 5.0835
##   Slope:     -0.3660
cat(sprintf("OLS Regression:\n"))
## OLS Regression:
cat(sprintf("  Intercept: %.4f\n", coef(ols_model)[1]))
##   Intercept: 5.3810
cat(sprintf("  Slope:     %.4f\n", coef(ols_model)[2]))
##   Slope:     -0.4586

The quantile regression results reveal several important patterns in the relationship between engine displacement and fuel efficiency. First, the slopes differ substantially across quantiles, indicating heteroskedasticity in the conditional distribution. The 10th percentile slope is -0.5340, while the 90th percentile slope is -0.3660. This difference suggests that the negative relationship between displacement and fuel efficiency is stronger for more fuel-efficient cars.

Second, the widening gap between the 10th and 90th percentiles as displacement increases reveals increasing uncertainty in fuel efficiency for larger engines. This pattern likely reflects varying driving conditions, maintenance practices, and vehicle technology across different cars with similar engine sizes.

Third, the median regression (\(\tau\) = 0.5) demonstrates robustness to outliers. Unlike OLS, which minimizes squared errors and thus heavily weights extreme values, quantile regression uses the asymmetric absolute loss that treats positive and negative residuals differently based on the quantile level. This asymmetry makes the estimator less sensitive to unusual observations.

Finally, the check loss \(\rho_\tau(u)\) is piecewise linear, making it non-differentiable at zero. This property explains why we use BFGS rather than gradient-based methods that assume smoothness. The BFGS algorithm builds a quasi-Newton approximation that handles the kink effectively, converging reliably despite the non-smooth objective function.

Traditional quantile regression assumes \(f_\tau(x, \theta)\) is linear in parameters. This limitation becomes severe in several important scenarios. First, many real-world relationships are inherently nonlinear—demand forecasting, for instance, involves complex interactions between time, seasonality, promotions, and product features that cannot be captured by linear models. Second, when working with high-dimensional inputs such as image or text data, we need feature learning capabilities that neural networks provide naturally. Third, when estimating multiple quantiles simultaneously, neural networks can learn shared representations across quantiles, substantially improving computational efficiency. Finally, as we demonstrate in Section 22.5, neural architectures enable us to incorporate utility functions directly into the learning process, seamlessly integrating prediction with decision-making.

Neural quantile regression addresses these limitations by combining the robustness and interpretability of quantile methods with the flexibility and scalability of deep learning. This synthesis proves particularly valuable in applications where both distributional uncertainty and complex feature interactions matter.

22.3 Nonlinear Quantile Regression via Trend Filtering

Before exploring the full power of neural networks for quantile regression, we examine an elegant middle ground: trend filtering combined with quantile loss. This approach, developed by N. G. Polson and Scott (2016), provides nonlinear function approximation while maintaining computational tractability through a hierarchical representation. Trend filtering estimates smooth, nonlinear functions by penalizing differences in derivatives rather than the function values themselves, making it particularly suitable for data with local smoothness but global complexity.

Consider the nonparametric regression problem where we observe pairs \((x_i, y_i)\) for \(i = 1, \ldots, n\) with \(x_1 < x_2 < \ldots < x_n\). Traditional smoothing methods like cubic splines require choosing knot locations, while kernel smoothing requires bandwidth selection. Trend filtering offers an alternative: estimate a function \(f(x)\) by solving

\[ \min_{f} \sum_{i=1}^n \rho_\tau(y_i - f(x_i)) + \lambda \sum_{i=1}^{n-k} |\Delta^k f_i| \]

where \(\Delta^k\) denotes the \(k\)-th order discrete derivative operator and \(\rho_\tau\) is the check loss for quantile \(\tau\). The penalty term controls smoothness: \(k=1\) penalizes changes in slope (linear trend filtering), \(k=2\) penalizes changes in curvature (quadratic trend filtering), and so on.

For \(k=2\), the penalty becomes \(\sum_{i=2}^{n-1} |f_{i+1} - 2f_i + f_{i-1}|\), which approximates the integrated squared second derivative \(\int (f''(x))^2 dx\) used in smoothing splines. However, the \(\ell_1\) penalty produces locally adaptive estimates—sharp changes are preserved while smooth regions remain smooth.

N. G. Polson and Scott (2016) show that trend filtering admits an elegant hierarchical representation through envelope duality. The key insight is that the \(\ell_1\) penalty can be represented as an exponential prior in a hierarchical model. Specifically, for second-order trend filtering with quantile loss, we have the hierarchical model:

\[ \begin{aligned} y_i &\sim \text{AsymmetricLaplace}(f_i, \tau, \sigma) \\ \Delta^2 f_i &\sim \text{Laplace}(0, 1/\lambda) \quad \text{for } i = 2, \ldots, n-1 \end{aligned} \]

The asymmetric Laplace distribution naturally arises from the check loss—it is the distribution whose maximum likelihood estimator at quantile \(\tau\) minimizes \(\rho_\tau\). This connection between optimization (minimizing penalized quantile loss) and probability (maximum a posteriori estimation in a hierarchical model) provides both computational and conceptual advantages.

The hierarchical formulation enables efficient computation through data augmentation schemes. Rather than directly optimizing the non-smooth objective, we introduce auxiliary variables that yield closed-form conditional distributions, leading to straightforward EM or Gibbs sampling algorithms.

Implementation with R

We now demonstrate trend filtering for nonlinear quantile regression using synthetic data with both smooth regions and sharp transitions:

# Set seed for reproducibility
set.seed(123)

# Generate nonlinear synthetic data with heteroskedasticity
n <- 100
x <- sort(runif(n, 0, 10))

# True function with different regimes
f_true <- ifelse(x < 3, 1 + 0.5 * x,
                ifelse(x < 5, 2.5 + 2 * (x - 3),
                       6.5 + 0.2 * (x - 5)))

# Heteroskedastic noise that increases with x
sigma <- 0.3 + 0.1 * x
y <- f_true + rnorm(n, 0, sigma)

# Define second-order difference operator
D2 <- function(n) {
  D <- matrix(0, n - 2, n)
  for (i in 1:(n - 2)) {
    D[i, i:(i + 2)] <- c(1, -2, 1)
  }
  return(D)
}

# Trend filtering objective function for quantile regression
trend_filter_obj <- function(f, y, lambda, tau, D) {
  # Check loss
  residuals <- y - f
  quantile_loss <- sum(residuals * (tau - (residuals < 0)))
  
  # L1 penalty on second differences
  penalty <- lambda * sum(abs(D %*% f))
  
  return(quantile_loss + penalty)
}

# Fit trend filtering for multiple quantiles using optim
# For better performance, we use BFGS with box constraints
quantiles <- c(0.1, 0.5, 0.9)
lambda <- 2.0  # Smoothing parameter
D <- D2(n)

results_tf <- list()

for (tau in quantiles) {
  # Initialize with linear quantile regression
  X_init <- cbind(1, x)
  init_fit <- lm(y ~ x)
  f_init <- predict(init_fit)
  
  # Optimize using L-BFGS-B
  opt_result <- optim(
    par = f_init,
    fn = trend_filter_obj,
    y = y,
    lambda = lambda,
    tau = tau,
    D = D,
    method = "L-BFGS-B"
  )
  
  results_tf[[as.character(tau)]] <- opt_result$par
}

# Visualization
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# Left panel: Data and trend filtering estimates
plot(x, y, pch = 16, col = "gray50", cex = 0.8,
     xlab = "x", ylab = "y",
     main = "Trend Filtering Quantile Regression")

# True function
lines(x, f_true, col = "black", lwd = 2, lty = 2)

# Quantile estimates
colors <- c("blue", "darkgreen", "red")
for (i in seq_along(quantiles)) {
  tau <- quantiles[i]
  lines(x, results_tf[[as.character(tau)]], 
        col = colors[i], lwd = 2.5)
}

legend("topleft", 
       legend = c("True mean", "tau = 0.1", "tau = 0.5", "tau = 0.9"),
       col = c("black", colors),
       lty = c(2, 1, 1, 1),
       lwd = c(2, 2.5, 2.5, 2.5),
       bg = "white")

# Right panel: Uncertainty quantification
plot(x, y, pch = 16, col = "gray50", cex = 0.8,
     xlab = "x", ylab = "y",
     main = "Conditional Quantiles and Prediction Intervals")

# Fill prediction intervals
polygon(c(x, rev(x)),
        c(results_tf[["0.1"]], rev(results_tf[["0.9"]])),
        col = rgb(0.5, 0.5, 1, 0.3), border = NA)

# Quantile curves
lines(x, results_tf[["0.1"]], col = "blue", lwd = 2, lty = 2)
lines(x, results_tf[["0.5"]], col = "darkgreen", lwd = 2.5)
lines(x, results_tf[["0.9"]], col = "red", lwd = 2, lty = 2)

legend("topleft", 
       legend = c("P50 (Median)", "P10-P90 (80% interval)"),
       col = c("darkgreen", rgb(0.5, 0.5, 1)),
       lwd = c(2.5, 8),
       bg = "white")

Nonlinear quantile regression via trend filtering. The method adapts to both smooth regions and sharp transitions while estimating conditional quantiles.
# Calculate empirical coverage
coverage <- mean(y >= results_tf[["0.1"]] & y <= results_tf[["0.9"]])
cat(sprintf("\nEmpirical 80%% interval coverage: %.1f%%\n", coverage * 100))
## 
## Empirical 80% interval coverage: 69.0%

Trend filtering for quantile regression exhibits several desirable properties that make it particularly attractive for practical applications, such as computational efficient and flexibility in matching complex patterns.

Trend filtering provides an important conceptual bridge to neural quantile networks. Both approaches learn nonlinear functions through composition: trend filtering composes piecewise polynomials, while neural networks compose nonlinear activation functions. The key difference lies in how they handle high-dimensional inputs. Trend filtering is most effective for univariate or low-dimensional problems with ordered inputs, while neural networks excel when inputs are high-dimensional or lack natural ordering.

For problems with structured, low-dimensional inputs—time series, spatial data along a transect, dose-response curves—trend filtering often provides better interpretability and requires less data than neural networks. For high-dimensional problems—images, text, complex multivariate relationships—neural networks become essential. Understanding both approaches allows practitioners to choose the right tool for their specific problem structure.

22.4 Bayes Rule for Quantiles

Having established the fundamentals of quantile regression, we now develop the connection to Bayesian inference. Parzen (2004) showed that quantile methods provide direct alternatives to density-based Bayesian computations. This section establishes the theoretical foundation for using quantiles to perform Bayesian updating.

Given a cumulative distribution function \(F_{\theta|y}(u)\) (non-decreasing, right-continuous), we define the quantile function as: \[Q_{\theta| y} (u) \defeq F^{-1}_{\theta|y} ( u ) = \inf \left \{ \theta : F_{\theta|y} (\theta) \geq u \right \}\]

The quantile function is non-decreasing and left-continuous. Parzen (2004) established the fundamental probabilistic property: \[ \theta \stackrel{P}{=} Q_\theta ( F_\theta (\theta ) ) \]

This identity enables efficient implementation: we can increase computational efficiency by ordering the samples of \(\theta\) and the baseline uniform draws \(\tau\), exploiting the monotonicity of the inverse CDF map.

A crucial property for understanding why quantiles naturally compose (and thus suit deep learning) is the following. Let \(g(y)\) be non-decreasing and left-continuous with \(g^{-1}(z) = \sup \{ y : g(y) \leq z \}\). Then the transformed quantile has a compositional nature: \[Q_{g(Y)}(u) = g(Q(u))\]

This composition property shows that quantiles act as superpositions—exactly the structure that deep neural networks learn through their layered architecture.

Conditional Quantile Representation

The connection to Bayesian learning is made explicit through the conditional quantile representation. For the Bayesian learning problem, we have the following result for updating prior to posterior quantiles: \[Q_{\theta | Y=y}(u) = Q_\theta(s) \quad \text{where} \quad s = Q_{F(\theta) | Y=y}(u)\]

To compute \(s\), note that by definition: \[ u = F_{F(\theta) | Y=y}(s) = P(F(\theta) \leq s | Y=y) = P(\theta \leq Q_\theta(s) | Y=y) = F_{\theta | Y=y}(Q_\theta(s)) \]

This result shows that Bayesian updating can be performed entirely in terms of quantile functions, without ever computing or manipulating density functions. The posterior quantile function is obtained by composing the prior quantile function with a learned transformation.

22.5 Maximum Expected Utility via Quantile Neural Networks

Having established how to learn posterior distributions via quantile neural networks, we now show how to extend this framework to decision problems—the central application where quantile methods truly shine. Recall from Chapter 4 that optimal Bayesian decisions maximize expected utility: \[ d^\star(y) = \arg \max_d E_{\theta|y}[U(d, \theta)] = \arg \max_d \int U(d, \theta) p(\theta | y) d\theta \]

The naive approach would be to first learn the posterior \(p(\theta|y)\), then use Monte Carlo to approximate the expected utility for each decision \(d\), and finally optimize over \(d\). However, this approach is inefficient for several reasons:

  1. Computational waste: Monte Carlo requires many samples in regions of high posterior probability, but utility functions often place high weight on tail events (risk scenarios) that have low posterior probability.

  2. Density estimation: We must first estimate the potentially high-dimensional posterior density before we can compute expectations.

  3. Optimization difficulty: The expectation must be recomputed for each candidate decision during optimization.

Quantile neural networks provide a more direct path. The key insight is that we can incorporate the utility function directly into the training process rather than as a post-processing step.

The Quantile-Expectation Identity

The foundation of our approach is a classical result relating expectations to quantiles. Given any random variable \(U\), its expectation can be computed as an integral over its quantile function: \[ E[U] = \int_0^1 F^{-1}_{U}(\tau) d\tau \] This is sometimes called the quantile representation of expectations or the Lorenz curve identity. For decision problems, this means: \[ E_{\theta|y}[U(d, \theta)] = \int_0^1 F^{-1}_{U|d,y}(\tau) d\tau \]

Rather than learning \(p(\theta|y)\) and then computing the expectation, we directly learn the quantile function \(F^{-1}_{U|d,y}(\tau)\) of the utility distribution.

Implementation Strategy

To extend our generative method to MEU problems, we assume that the utility function \(U(d, \theta)\) is given (a standard assumption in decision theory). The training procedure is as follows:

  1. Generate synthetic dataset: Simulate triples \(\{y^{(i)}, \theta^{(i)}, \tau^{(i)}\}_{i=1}^N\) where \(y^{(i)} \sim p(y|\theta^{(i)})\), \(\theta^{(i)} \sim p(\theta)\), and \(\tau^{(i)} \sim U(0,1)\).

  2. Compute utilities: For each decision \(d\) of interest, compute \(U^{(i)}_d \defeq U(d,\theta^{(i)})\).

  3. Augment training data: Create the augmented dataset \[ \{U_d^{(i)}, S(y^{(i)}), \tau^{(i)}, d\}_{i=1}^N. \]

  4. Train quantile network: Learn a neural network \(H\) that predicts: \[ U_d^{(i)} = H(S(y^{(i)}), \tau^{(i)}, d) \]

Once trained, the network \(H\) represents the quantile function \(F^{-1}_{U|d,y}(\tau)\). For any observed data \(y\) and candidate decision \(d\), we can:

  • Compute expected utility: Numerically integrate \(\int_0^1 H(S(y), \tau, d) d\tau\)
  • Find optimal decision: \[ d^\star(y) = \arg \max_d \int_0^1 H(S(y), \tau, d) d\tau \]

This approach has several advantages over naive Monte Carlo:

  1. The network learns to focus on regions of the \((\theta, \tau)\) space that matter for utility computation.
  2. We avoid explicit density estimation of \(p(\theta|y)\).
  3. The same network handles all decisions \(d\) simultaneously if \(d\) is included as an input.
  4. The approach naturally handles likelihood-free models where \(p(y|\theta)\) is unavailable but we can simulate from the forward model.

Formal Framework

For completeness, we provide the formal measure-theoretic framework. Let \(\mathcal{Y}\) denote a locally compact metric space of signals \(y\) and \(\Theta\) a space of parameters \(\theta\) (including any latent variables). Let \(P(dy|\theta)\) denote the conditional distribution of signals given parameters. Let \(\Pi(d\theta|y)\) denote the posterior distribution. In many cases, \(\Pi\) is absolutely continuous with density \(\pi\): \[ \Pi(d\theta|y) = \pi(\theta|y) \mu(d\theta). \]

The framework handles both traditional likelihood-based models where \(P(dy|\theta) = p(y|\theta) \lambda(dy)\) and likelihood-free models specified by forward simulators \(y = f(\theta)\). This generality is crucial for modern applications in economics, epidemiology, and climate science where complex simulation models replace closed-form likelihoods.

For multivariate parameters \(\theta = (\theta_1, \ldots, \theta_p)\), we can use autoregressive structures to model the sequence of conditional quantiles: \[ (F^{-1}_{\theta_1}(\tau_1), F^{-1}_{\theta_2|\theta_1}(\tau_2), \ldots, F^{-1}_{\theta_p|\theta_{1:p-1}}(\tau_p)) \] This factorization is analogous to autoregressive density models but operates directly on quantiles, avoiding normalization constraints.

An important architectural choice distinguishes our approach from standard posterior learning followed by Monte Carlo integration: we incorporate the utility function \(U(d, \theta)\) directly into the first layer of the network. This allows the network to learn representations optimized for utility computation rather than pure posterior approximation. As utility functions often place high weight on tail events (representing rare but consequential outcomes), this direct incorporation significantly improves efficiency compared to the naive two-step approach.

Example: Normal-Normal Model and Wang Distortion

For illustration, we consider the normal-normal conjugate learning model—a case where the quantile updating rule can be derived analytically. This example connects quantile methods to Wang’s risk distortion measure from finance, showing that the distortion function is precisely the transformation that needs to be learned.

Consider the model: \[ y_1, \ldots, y_n \mid \theta \sim N(\theta, \sigma^2) \] \[ \theta \sim N(\mu,\alpha^2) \]

The sufficient statistic is \(S(y) = \bar y = \frac{1}{n} \sum_{i=1}^n y_i\). The posterior is \(\theta \mid y \sim N(\mu_*, \sigma_*^2)\) with \[ \mu_* = \frac{\sigma^2 \mu + n\alpha^2\bar{y}}{t}, \quad \sigma^2_* = \frac{\alpha^2 \sigma^2}{t} \] where \(t = \sigma^2 + n\alpha^2\).

The remarkable result is that the posterior and prior CDFs are related via a Wang distortion function: \[ 1-\Phi(\theta; \mu_*,\sigma_*) = g(1 - \Phi(\theta; \mu, \alpha^2)) \] where \(\Phi(\cdot; \mu, \sigma^2)\) denotes the normal CDF. The Wang distortion is: \[ g(p) = \Phi\left(\lambda_1 \Phi^{-1}(p) + \lambda\right) \] with distortion parameters: \[ \lambda_1 = \frac{\alpha}{\sigma_*}, \quad \lambda = \frac{\alpha\lambda_1(n\bar{y}-n\mu)}{t} \]

Proof sketch: The key is to rewrite the tail probabilities: \[ g(1 - \Phi(\theta; \mu, \alpha^2)) = g\left(\Phi\left(-\frac{\theta - \mu}{\alpha}\right)\right) = \Phi\left(\lambda_1 \left(-\frac{\theta - \mu}{\alpha}\right) + \lambda\right) \] \[ = 1 - \Phi\left(\frac{\theta - (\mu+ \alpha\lambda/\lambda_1)}{\alpha/\lambda_1}\right) = 1-\Phi(\theta; \mu_*,\sigma_*) \]

This analytical result has several implications:

  1. Wang distortions are natural: The distortion functions used in risk management (Wang 1996) arise naturally from Bayesian updating in the normal case.

  2. Learning the distortion: In more complex models, a neural network can learn this distortion function \(g\) directly from data, without requiring conjugacy.

  3. Computational efficiency: When the distortion is smooth and well-behaved, neural networks with relatively few parameters can accurately represent it.

Numerical Example: Consider prior \(\theta \sim N(0,5)\) and data from \(y_i \sim N(3,10)\) with \(n=100\) observations. The posterior is \(\theta \mid y \sim N(3.28, 0.98)\).

(a) Model for simulated data
(b) Distortion Function \(g\)
(c) Survival Functions 1 - \(\Phi\)
Figure 22.1: The Normal-Normal learning model. Left: Prior, likelihood, and posterior densities. Center: The Wang distortion function \(g\) that transforms the prior CDF to the posterior CDF. Right: Survival functions showing how \(g\) maps the prior tail probabilities to posterior tail probabilities.

22.6 Portfolio Optimization with Quantile Neural Networks

A classic application of decision theory in finance is portfolio optimization. Consider an investor allocating wealth between a risk-free asset with return \(r_f\) and a risky asset with uncertain return \(R\). Let \(\omega \in (0,1)\) denote the fraction allocated to the risky asset. The portfolio return is: \[ W = (1-\omega)r_f + \omega R \]

With exponential utility \(U(W) = -e^{-\gamma W}\) (measuring risk aversion \(\gamma > 0\)) and normally distributed returns \(R \sim N(\mu, \sigma^2)\), the expected utility has a closed form due to the moment-generating function of the normal distribution: \[ U(\omega) = E[-e^{-\gamma W}] = -\exp\left\{-\gamma \left[(1-\omega)r_f + \omega\mu\right] + \frac{1}{2}\gamma^2\omega^2\sigma^2 \right\} \]

Maximizing expected utility yields the celebrated Kelly-Merton optimal allocation: \[ \omega^* = \frac{\mu - r_f}{\gamma\sigma^2} \]

This formula has a simple interpretation: invest more when expected excess return \((\mu - r_f)\) is high, less when risk \((\sigma^2)\) or risk aversion \((\gamma)\) is high.

Learning Under Parameter Uncertainty

The closed-form solution assumes \(\mu\) and \(\sigma^2\) are known. In practice, these parameters must be estimated from historical data, introducing parameter uncertainty. Traditional approaches use plug-in estimates \(\hat{\mu}\), \(\hat{\sigma}^2\), ignoring estimation error. A Bayesian approach would specify priors \(p(\mu, \sigma^2)\), observe returns \(\{R_t\}_{t=1}^T\), and compute the posterior \(p(\mu, \sigma^2 | \{R_t\})\).

The optimal decision under parameter uncertainty integrates over the posterior: \[ \omega^* = \arg\max_\omega E_{\mu, \sigma^2 | \{R_t\}}[U(\omega, \mu, \sigma^2)] \]

For conjugate priors (normal-inverse-gamma), this can be computed analytically. For more complex models—non-normal returns, time-varying volatility, multi-asset portfolios—analytical solutions are unavailable. This is where quantile neural networks excel.

The Quantile Approach

Rather than integrating expected utility via Monte Carlo sampling from \(p(\mu, \sigma^2 | \{R_t\})\), we use the quantile-expectation identity. The utility distribution \(U(\omega) = -e^{-\gamma W}\) can be integrated via its quantile function: \[ E[U(W)] = \int_{0}^{1}F_{U(W)}^{-1}(\tau)d\tau \]

The quantile neural network learns \(F_{U(W)}^{-1}(\tau, \omega, \{R_t\})\) directly from simulated data. Once trained, we can:

  1. Compute expected utility for any \(\omega\) by numerical integration over \(\tau \in (0,1)\)
  2. Optimize \(\omega\) by evaluating the network on a grid or using gradient-based optimization
  3. Handle non-normal returns, transaction costs, and constraints naturally

Numerical Example

Consider the parameter values: - Risk-free rate: \(r_f = 0.05\) - Expected return: \(\mu = 0.15\) - Volatility: \(\sigma = 0.25\) - Risk aversion: \(\gamma = 2\)

The closed-form optimal allocation is: \[ \omega^* = \frac{0.15 - 0.05}{2 \times 0.25^2} = \frac{0.10}{0.125} = 0.80 \]

# Parameters
rf <- 0.05
mu <- 0.15
sigma <- 0.25
gamma <- 2

# Portfolio utility for different weights
utility_quantile <- function(tau, omega, rf, mu, sigma, gamma) {
  W_quantile <- (1 - omega) * rf + omega * (mu + sigma * qnorm(tau))
  U <- -exp(-gamma * W_quantile)
  return(U)
}

# Expected utility via quantile integration
expected_utility <- function(omega, rf, mu, sigma, gamma, n_quantiles = 1000) {
  tau_grid <- seq(0.001, 0.999, length.out = n_quantiles)
  U_quantiles <- sapply(tau_grid, utility_quantile, omega = omega, 
                       rf = rf, mu = mu, sigma = sigma, gamma = gamma)
  # Integrate using trapezoidal rule
  return(mean(U_quantiles))
}

# Compute expected utility for a grid of portfolio weights
omega_grid <- seq(0.01, 0.99, by = 0.01)
EU_grid <- sapply(omega_grid, expected_utility, rf = rf, mu = mu, 
                 sigma = sigma, gamma = gamma)

# Find optimal weight
omega_opt <- omega_grid[which.max(EU_grid)]

# Plotting
par(mfrow = c(1, 2))

# Plot 1: Quantile functions for different weights
tau_plot <- seq(0.01, 0.99, length.out = 100)
plot(tau_plot, utility_quantile(tau_plot, 0.2, rf, mu, sigma, gamma), 
     type = 'l', col = 'blue', lwd = 2,
     xlab = expression(tau), ylab = expression(F[U]^{-1}(tau)),
     main = "Utility Quantile Functions")
lines(tau_plot, utility_quantile(tau_plot, 0.5, rf, mu, sigma, gamma), 
      col = 'green', lwd = 2)
lines(tau_plot, utility_quantile(tau_plot, 0.8, rf, mu, sigma, gamma), 
      col = 'red', lwd = 2)
legend("bottomright", legend = c("omega = 0.2", "omega = 0.5", "omega = 0.8"),
       col = c("blue", "green", "red"), lwd = 2)
grid()

# Plot 2: Expected utility as function of portfolio weight
plot(omega_grid, EU_grid, type = 'l', lwd = 2, col = 'darkblue',
     xlab = expression(omega), ylab = "Expected Utility",
     main = "Portfolio Optimization")
abline(v = omega_opt, col = 'red', lty = 2, lwd = 2)
text(omega_opt + 0.1, max(EU_grid) * 0.95, 
     paste0("omega* = ", round(omega_opt, 2)), col = 'red')
grid()

Portfolio optimization using quantile representation. Left: Quantile function of utility for different portfolio weights. Right: Expected utility as a function of portfolio weight, with maximum at omega* = 0.80.

The quantile approach recovers the analytical optimum \(\omega^* = 0.80\). The left panel shows how utility quantiles shift with portfolio weight: more aggressive allocations (\(\omega\) closer to 1) increase both upside and downside risk. The right panel shows expected utility as a function of \(\omega\), maximized at the Kelly-Merton allocation.

In the generative framework with parameter uncertainty \(p(\mu, \sigma^2 | \{R_t\})\), we would train a quantile NN on triples \((\omega, \tau, \{R_t\}, U)\) where \((\mu, \sigma^2)\) are drawn from the posterior and \(U\) is the resulting utility. The trained network then provides expected utilities for any observed return history, enabling real-time portfolio rebalancing.

22.7 Neural Network Implementation

Having established the theory, we now detail how to implement quantile neural networks in practice. The key components are: (1) an appropriate loss function, (2) a neural architecture that handles the quantile input \(\tau\), and (3) training strategies for learning multiple quantiles simultaneously.

Wasserstein Distance and Quantile Loss

The 1-Wasserstein distance (also known as earth mover’s distance) provides theoretical justification for quantile methods. For two distributions with quantile functions \(F^{-1}_U\) and \(F^{-1}_V\), the 1-Wasserstein distance is: \[ W_1(F^{-1}_U, F^{-1}_V) = \int_0^1 |F^{-1}_U(\tau) - F^{-1}_V(\tau)| d\tau \]

This distance can be computed efficiently using order statistics (Levina and Bickel 2001). The success of Wasserstein GANs (Arjovsky, Chintala, and Bottou 2017) over vanilla GANs stems partly from this improved metric—Wasserstein distance provides meaningful gradients even when distributions have non-overlapping supports.

The key insight is that minimizing the quantile loss is equivalent to minimizing the 1-Wasserstein distance. For a target quantile \(q_\tau = F^{-1}_U(\tau)\), the check loss \(\rho_\tau\) we derived in Section 22.2 minimizes the expected prediction error: \[ q_\tau = \arg\min_q E_U[\rho_{\tau}(U-q)] \]

Combined Loss Function

For training neural networks, we use a combination of quantile loss and mean-squared error (MSE). Given training data \(\{x_i, y_i\}_{i=1}^N\) and a quantile \(\tau\), the loss is: \[ L_{\tau}(\theta) = \sum_{i=1}^N \rho_{\tau}(y_i - f(\tau, x_i, \theta)) \]

Empirically, adding an MSE term improves stability and predictive accuracy: \[ L(\theta) = \alpha L_{\tau}(\theta) + (1-\alpha) \cdot \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i, \theta))^2 \]

The weighting parameter \(\alpha \in [0,1]\) balances quantile accuracy against overall fit. Typical values are \(\alpha \in [0.7, 0.9]\). The MSE term encourages the median prediction (\(\tau = 0.5\)) to align with the conditional mean, which often improves generalization.

Learning Multiple Quantiles Simultaneously

Rather than training separate networks for each quantile \(\tau_k\), it is more efficient to learn all quantiles with a single network that takes \(\tau\) as an input. Given quantiles \(0 < \tau_1 < \tau_2 < \ldots < \tau_K < 1\), we minimize: \[ L(\theta) = \frac{1}{NK} \sum_{i=1}^N \sum_{k=1}^K \rho_{\tau_k}(y_i - f_{\tau_k}(x_i, \theta)) \]

This approach has several advantages:

  1. Shared representations: The network learns features useful across all quantiles, improving sample efficiency.
  2. Enforcing monotonicity: A single network makes it easier to ensure quantiles don’t cross.
  3. Smooth quantile function: Interpolation between trained quantiles is more reliable.

Non-Crossing Constraints

A valid distribution function must satisfy \(F^{-1}(\tau_i) \leq F^{-1}(\tau_j)\) for \(\tau_i < \tau_j\). Without explicit constraints, neural networks may learn quantile functions that cross: \[ f_{\tau_i}(x, \theta) > f_{\tau_j}(x, \theta) \quad \text{for some } x, \text{ despite } \tau_i < \tau_j \]

Several approaches address this (Chernozhukov, Fernández-Val, and Galichon 2010; Cannon 2018):

  1. Soft penalties: Add a term to the loss penalizing violations.
  2. Monotonic networks: Design architectures that guarantee monotonicity in \(\tau\) (e.g., using monotonic activation functions or cumulative link structures).
  3. Post-processing: After training, rearrange predictions to enforce monotonicity.

For the implementations in this chapter, we use soft penalties during training combined with post-processing for final predictions.

Cosine Embedding for \(\tau\)

A key architectural choice is how to incorporate the quantile level \(\tau \in (0,1)\) as an input to the network. Simply concatenating \(\tau\) as an additional feature works but is inefficient—the network must learn the entire relationship between \(\tau\) and the output from scratch.

A more effective approach uses a cosine embedding to represent \(\tau\) in a higher-dimensional feature space. This leverages Fourier analysis: smooth functions can be well-approximated by cosine bases. The quantile function \(F^{-1}(\tau, x)\) is typically smooth in \(\tau\), making Fourier representations natural.

We represent the quantile network as: \[ F^{-1}(\tau, x) = f_\theta(\tau, x) = g(\psi(x) \circ \phi(\tau)) \] where \(\circ\) denotes element-wise multiplication (Hadamard product), \(g\) and \(\psi\) are feed-forward networks, and \(\phi\) is the cosine embedding: \[ \phi_j(\tau) = \mathrm{ReLU}\left(\sum_{i=0}^{n-1} \cos(\pi i \tau) w_{ij} + b_j\right) \]

The cosine embedding \(\phi(\tau)\) transforms the scalar \(\tau\) into a vector of dimension \(m\), where \(n\) controls the frequency resolution. This embedding has several advantages:

  1. Smooth interpolation: The cosine basis ensures smooth quantile functions.
  2. Universal approximation: Barron (1993) showed that cosine-embedded networks achieve approximation rates of \(O(N^{-1/2})\) for sufficiently smooth functions.
  3. Parameter efficiency: The embedding significantly reduces the number of parameters needed compared to learning the \(\tau\) dependence from scratch.

This architecture was successfully applied to distributional reinforcement learning by Dabney et al. (2018), where it enabled agents to learn entire distributions of returns rather than just expectations. We use the same principle here for Bayesian posterior quantiles.

Synthetic Data Example

To validate our approach before applying it to real data, we first test on synthetic data where the true quantile function is known. This allows us to assess both accuracy and the quality of uncertainty quantification.

Consider synthetic data generated from the model: \[ x \sim U(-1, 1), \quad y | x \sim N\left(\frac{\sin(\pi x)}{\pi x}, \frac{\exp(1-x)}{10}\right) \]

This model has heteroskedastic noise—the variance increases as \(x\) decreases. The conditional mean is the sinc function \(\text{sinc}(x) = \sin(\pi x)/(\pi x)\), which has interesting non-linear behavior near zero.

The true conditional \(\tau\)-quantile function is: \[ f_{\tau}(x) = \frac{\sin(\pi x)}{\pi x} + \Phi^{-1}(\tau) \sqrt{\frac{\exp(1-x)}{10}} \]

We train two types of quantile networks:

  1. Implicit network: Uses cosine embedding for \(\tau\), trained on random \((\tau, x, y)\) triples
  2. Explicit network: Trains separate outputs for fixed quantiles \(\tau \in \{0.05, 0.5, 0.95\}\)

Quantile neural network predictions on synthetic data. Both implicit and explicit architectures recover the true quantile functions accurately. The shaded region shows the 90% prediction interval (\(\tau = 0.05\) to \(\tau = 0.95\)).

The figure shows both networks recover the true quantiles accurately. The implicit network provides smooth interpolation across all \(\tau\) values, while the explicit network gives predictions only at the three trained quantiles. For applications requiring full distributional predictions, the implicit approach is preferable despite slightly higher computational cost during training.

22.8 Supply Chain Forecasting at Scale

Having developed the theory and core implementations, we now turn to a major industrial application: probabilistic forecasting for inventory management. This case study illustrates how quantile neural networks scale to millions of products while providing the uncertainty quantification essential for optimal decision-making.

The Business Problem

Amazon, Walmart, and other large retailers face a fundamental challenge: for millions of products across thousands of warehouses, how much inventory should be kept in stock? Too little inventory leads to stockouts and lost sales; too much ties up capital and risks obsolescence. The optimal inventory level depends critically on future demand, which is inherently uncertain.

Traditional forecasting provides point estimates—a single expected demand value. But optimal inventory decisions require understanding the entire demand distribution:

  • Safety stock (\(P_{10}\)): Conservative estimate—ensures 90% of demand scenarios are covered, preventing stockouts
  • Base stock (\(P_{50}\)): Median demand—balances inventory holding costs against stockout risk
  • Capacity planning (\(P_{90}\)): Optimistic scenario—ensures warehouse and logistics capacity for high-demand periods

The asymmetry matters enormously. For a highly profitable product with long lead times, the cost of stockouts (lost sales, customer dissatisfaction) far exceeds the cost of overstock. The optimal policy may target \(P_{70}\) or \(P_{80}\). For perishable goods or fast-moving consumer products with low margins, the reverse holds—target \(P_{30}\) or \(P_{40}\) to minimize waste.

Amazon’s DeepAR and Quantile Forecasting

Amazon developed DeepAR (Salinas, Flunkert, and Gasthaus 2019), a deep learning approach for probabilistic time series forecasting. DeepAR uses recurrent neural networks (specifically LSTMs) to model temporal dependencies, but its key innovation is forecasting entire probability distributions rather than point estimates.

The model predicts parameters of a parametric distribution (e.g., mean and variance of a Gaussian or negative binomial). To obtain quantiles, one samples from this learned distribution. However, parametric assumptions can be restrictive. Amazon later extended the approach to directly forecast quantiles using the check loss functions we developed in Section 22.2.

The quantile approach has several advantages for demand forecasting:

  1. Robustness: Demand data often contains outliers (promotional events, viral products). Quantile regression is robust to these.

  2. Intermittent demand: Many products have sparse, intermittent demand (many zeros). Parametric distributions struggle; quantile methods handle this naturally.

  3. Asymmetric costs: Different quantiles inform different decisions. The \(P_{10}\) quantile is more relevant for safety stock than the mean.

  4. Model flexibility: Neural quantile regression makes no distributional assumptions beyond smoothness.

Demand Forecasting Setup

Consider forecasting demand \(y_t\) for a product given:

  • Temporal features: Day of week, month, holiday indicators, days since launch
  • Lagged demand: \(y_{t-1}, y_{t-7}, y_{t-28}\) (yesterday, last week, last month)
  • Product features: Category, price, promotional flags
  • External covariates: Weather, events, competitor prices

The model predicts conditional quantiles: \[ q_\tau(t) = F^{-1}_{Y_t | X_t}(\tau) \]

where \(X_t\) represents all available features at time \(t\). Training data consists of historical demand \((y_t, x_t)\) for \(t = 1, \ldots, T\) across many products.

Why Neural Networks?

Traditional quantile regression assumes linear relationships. For demand forecasting, this is inadequate:

  • Non-linear seasonality: Holiday effects interact with day-of-week patterns non-linearly
  • Product interactions: Substitution and complementarity effects between products
  • Promotional dynamics: Price elasticity varies by product category, time, and competitive environment
  • Cross-product learning: Products with similar characteristics exhibit similar demand patterns; neural networks can learn shared representations

A quantile neural network can:

  • Learn shared embeddings for product categories
  • Capture complex temporal patterns through recurrent or attention layers
  • Model interactions between features automatically
  • Transfer learning from data-rich products to new products with limited history

Implementation Strategy

For a retailer with millions of SKUs (stock-keeping units), computational efficiency is critical. The architecture typically involves:

  1. Embedding layers: Map categorical variables (product ID, category, location) to dense vectors
  2. Temporal encoder: LSTM or Transformer to process time series history
  3. Feature fusion: Combine embeddings with numerical features
  4. Quantile head: Final layer produces \(\hat{q}_\tau(t)\) for input \(\tau\), using cosine embedding

Training uses mini-batches sampling randomly across products and time periods, with the combined quantile + MSE loss we discussed. The model is trained to predict multiple quantiles simultaneously (\(\tau \in \{0.1, 0.2, \ldots, 0.9\}\)).

At inference time, for a given product and time period, the model outputs all quantiles instantly, enabling inventory optimization algorithms to compute optimal stock levels.

R Implementation: Demand Forecasting with Quantile Regression

We now provide a complete R implementation for demand forecasting using quantile neural networks. This example simulates realistic demand data with seasonality, trends, and promotional effects, then trains a quantile model to forecast the entire demand distribution.

# Load required libraries
library(quantreg)  # For quantile regression
library(ggplot2)
library(dplyr)
library(tidyr)

# Set seed for reproducibility
set.seed(42)

# Generate synthetic demand data for multiple products
generate_demand_data <- function(n_days = 365 * 2, n_products = 5) {
  days <- 1:n_days
  
  # Create data frame
  data_list <- lapply(1:n_products, function(p) {
    # Base parameters varying by product
    base_demand <- 50 + p * 20
    trend <- 0.05 * p
    
    # Seasonal components
    weekly_pattern <- 10 * sin(2 * pi * days / 7 + p)
    monthly_pattern <- 15 * sin(2 * pi * days / 30.5 + p * 0.5)
    
    # Promotional effect (random promotions)
    promo <- rbinom(n_days, 1, 0.1)  # 10% of days have promotions
    promo_effect <- promo * (30 + 20 * p)
    
    # Trend + seasonality + promotions + noise
    mu <- base_demand + trend * days + weekly_pattern + monthly_pattern + promo_effect
    
    # Heteroskedastic noise (variance increases with demand)
    sigma <- 5 + 0.1 * mu
    demand <- pmax(0, rnorm(n_days, mu, sigma))  # Demand can't be negative
    
    data.frame(
      product = paste0("Product_", p),
      day = days,
      demand = demand,
      promo = promo,
      day_of_week = (days %% 7) + 1,
      day_of_month = ((days - 1) %% 30) + 1
    )
  })
  
  do.call(rbind, data_list)
}

# Generate data
demand_data <- generate_demand_data(n_days = 365 * 2, n_products = 3)

# Create lagged features
demand_data <- demand_data %>%
  group_by(product) %>%
  arrange(day) %>%
  mutate(
    lag_1 = lag(demand, 1, default = NA),
    lag_7 = lag(demand, 7, default = NA),
    lag_30 = lag(demand, 30, default = NA)
  ) %>%
  ungroup() %>%
  filter(!is.na(lag_30))  # Remove rows with NA lags

# Train/test split
train_days <- max(demand_data$day) - 90
train_data <- demand_data %>% filter(day <= train_days)
test_data <- demand_data %>% filter(day > train_days)

# Quantile regression for multiple quantiles
quantiles <- c(0.1, 0.3, 0.5, 0.7, 0.9)
models <- list()

cat("Training quantile regression models...\n")
## Training quantile regression models...
for (q in quantiles) {
  cat(sprintf("  Quantile %.2f...\n", q))
  models[[as.character(q)]] <- rq(
    demand ~ product + day + promo + day_of_week + day_of_month + 
             lag_1 + lag_7 + lag_30,
    tau = q,
    data = train_data
  )
}
##   Quantile 0.10...
##   Quantile 0.30...
##   Quantile 0.50...
##   Quantile 0.70...
##   Quantile 0.90...
# Generate predictions for test set
predictions <- test_data
for (q in quantiles) {
  col_name <- paste0("q_", gsub("\\.", "", sprintf("%.2f", q)))
  predictions[[col_name]] <- predict(models[[as.character(q)]], newdata = test_data)
}

# Calculate prediction intervals and coverage
predictions <- predictions %>%
  mutate(
    in_80_interval = (demand >= q_010 & demand <= q_090),
    in_60_interval = (demand >= q_030 & demand <= q_070)
  )

# Calculate metrics
coverage_80 <- mean(predictions$in_80_interval, na.rm = TRUE)
coverage_60 <- mean(predictions$in_60_interval, na.rm = TRUE)
mae_median <- mean(abs(predictions$demand - predictions$q_050), na.rm = TRUE)

cat(sprintf("\nForecast Performance:\n"))
## 
## Forecast Performance:
cat(sprintf("  80%% interval coverage: %.1f%% (target: 80%%)\n", coverage_80 * 100))
##   80% interval coverage: 79.6% (target: 80%)
cat(sprintf("  60%% interval coverage: %.1f%% (target: 60%%)\n", coverage_60 * 100))
##   60% interval coverage: 33.7% (target: 60%)
cat(sprintf("  Median absolute error: %.2f units\n", mae_median))
##   Median absolute error: 21.94 units
# Visualization
par(mfrow = c(2, 1), mar = c(4, 4, 3, 1))

# Plot 1: Training data for one product
product_1_train <- train_data %>% filter(product == "Product_1")
plot(product_1_train$day, product_1_train$demand, 
     type = 'l', lwd = 1.5, col = 'darkblue',
     xlab = "Day", ylab = "Demand",
     main = "Historical Demand (Product 1)")
points(product_1_train$day[product_1_train$promo == 1],
       product_1_train$demand[product_1_train$promo == 1],
       col = 'red', pch = 16, cex = 0.5)
legend("topleft", legend = c("Demand", "Promotion"),
       col = c("darkblue", "red"), lwd = c(2, NA), pch = c(NA, 16))
grid()

# Plot 2: Forecasts with quantiles (fan chart)
product_1_test <- predictions %>% filter(product == "Product_1")
plot(product_1_test$day, product_1_test$demand,
     type = 'l', lwd = 2, col = 'black',
     xlab = "Day", ylab = "Demand",
     main = "Demand Forecasts with Quantiles (Product 1)",
     ylim = range(c(product_1_test$q_010, product_1_test$q_090, product_1_test$demand)))

# Add prediction intervals
polygon(c(product_1_test$day, rev(product_1_test$day)),
        c(product_1_test$q_010, rev(product_1_test$q_090)),
        col = rgb(0.7, 0.7, 1, 0.3), border = NA)
polygon(c(product_1_test$day, rev(product_1_test$day)),
        c(product_1_test$q_030, rev(product_1_test$q_070)),
        col = rgb(0.5, 0.5, 1, 0.4), border = NA)

# Add quantile lines
lines(product_1_test$day, product_1_test$q_010, col = 'blue', lty = 2, lwd = 1.5)
lines(product_1_test$day, product_1_test$q_050, col = 'darkblue', lwd = 2)
lines(product_1_test$day, product_1_test$q_090, col = 'blue', lty = 2, lwd = 1.5)

# Actual demand
lines(product_1_test$day, product_1_test$demand, col = 'black', lwd = 2)

legend("topleft", 
       legend = c("Actual", "P50 (Median)", "P10-P90 (80%)", "P30-P70 (60%)"),
       col = c("black", "darkblue", rgb(0.7, 0.7, 1), rgb(0.5, 0.5, 1)),
       lwd = c(2, 2, 8, 8), lty = c(1, 1, 1, 1))
grid()

Demand forecasting with quantile neural networks. Top: Historical demand with trend and seasonality. Bottom: Out-of-sample forecasts showing P10, P50, P90 quantiles (fan chart).

Inventory Optimization

Given the quantile forecasts, we can compute optimal inventory levels. For a newsvendor problem with underage cost \(c_u\) (cost of stockout) and overage cost \(c_o\) (cost of excess inventory), the optimal quantile to stock is: \[ \tau^* = \frac{c_u}{c_u + c_o} \]

# Example: High-margin product with stockout cost >> overstock cost
c_u <- 50  # Lost profit from stockout
c_o <- 5   # Cost of excess unit (storage, markdown, disposal)
optimal_tau <- c_u / (c_u + c_o)

cat(sprintf("Inventory Optimization Example:\n"))
## Inventory Optimization Example:
cat(sprintf("  Underage cost (stockout): $%.2f\n", c_u))
##   Underage cost (stockout): $50.00
cat(sprintf("  Overage cost (overstock): $%.2f\n", c_o))
##   Overage cost (overstock): $5.00
cat(sprintf("  Optimal service level (tau): %.1f%%\n", optimal_tau * 100))
##   Optimal service level (tau): 90.9%
# For low-margin perishable product
c_u_perishable <- 5
c_o_perishable <- 20
optimal_tau_perishable <- c_u_perishable / (c_u_perishable + c_o_perishable)

cat(sprintf("\nPerishable Product:\n"))
## 
## Perishable Product:
cat(sprintf("  Optimal service level (tau): %.1f%%\n", optimal_tau_perishable * 100))
##   Optimal service level (tau): 20.0%
cat(sprintf("  Stock at lower quantile to minimize waste\n"))
##   Stock at lower quantile to minimize waste

This implementation demonstrates the key principles:

  1. Quantile regression captures the full demand distribution, not just the mean
  2. Multiple quantiles provide prediction intervals essential for risk management
  3. Lagged features and seasonality are naturally incorporated
  4. Decision-making (optimal inventory levels) follows directly from quantile forecasts

For production systems handling millions of SKUs, this approach scales by using neural networks (e.g., via keras or torch in R) with embeddings for categorical features and recurrent layers for temporal dependencies.

22.9 Distributional Reinforcement Learning

Quantile neural networks have also revolutionized reinforcement learning (RL). Recall from Chapter 9 that RL agents learn policies \(\pi\) that maximize expected cumulative reward. Traditional RL algorithms like Q-learning estimate the expected value function \(Q^\pi(s,a) = E[R | s, a, \pi]\)—the expected return from taking action \(a\) in state \(s\) and following policy \(\pi\) thereafter.

Distributional reinforcement learning (Bellemare, Dabney, and Munos 2017) extends this by learning the entire distribution of returns rather than just the expectation. Using quantile neural networks, agents learn \(F^{-1}_{R|s,a}(\tau)\), the quantile function of returns.

Why Distributions Matter in RL

Learning return distributions provides several advantages:

  1. Risk-sensitive policies: Different quantiles inform different behaviors. Conservative agents might maximize \(P_{10}\) (avoid worst-case scenarios), while risk-seeking agents target \(P_{90}\) (optimize for best-case outcomes).

  2. Better learning signals: The Bellman operator naturally contracts in Wasserstein distance when formulated for quantiles, leading to more stable learning (Dabney et al. 2018).

  3. Uncertainty quantification: Understanding return variance helps agents explore intelligently—high uncertainty indicates potential for learning.

Implementation

Dabney et al. (2018) use quantile neural networks for distributional Q-learning. The key insight is that expectations can be computed as integrals over quantiles (the Lorenz curve identity): \[ E[R] = \int_{-\infty}^{\infty} r dF(r) = \int_0^1 F^{-1}(u) du \]

The distributional RL algorithm finds the optimal policy: \[ \pi^\star(s) = \arg\max_a \int_0^1 F^{-1}_{R|s,a}(\tau) d\tau = \arg\max_a E_{Z \sim z(s, a)}[Z] \]

The network is trained using the quantile loss \(\rho_\tau\) we developed, and Q-learning updates can be applied since the quantile projection operator preserves the contraction property of the Bellman operator. This approach has achieved state-of-the-art performance on Atari games and robotic control tasks.

Connections to dual utility theory (Yaari 1987) suggest that distributional RL naturally incorporates risk preferences—agents can be trained to maximize any utility functional, not just expectations.

22.10 Discussion and Summary

This chapter developed quantile neural networks as a powerful framework for decision-making under uncertainty. We have shown how learning posterior distributions through their quantile functions—rather than densities—provides computational and conceptual advantages across diverse applications.

Key Takeaways

  1. Quantile-expectation identity: The fundamental insight \(E[U] = \int_0^1 F^{-1}_U(\tau) d\tau\) enables direct learning of expected utilities without intermediate density estimation.

  2. Loss function derivation: The check loss \(\rho_\tau(u) = u(\tau - I(u < 0))\) minimizes Wasserstein distance and naturally handles asymmetric costs.

  3. Neural architectures: Cosine embeddings for \(\tau\) leverage Fourier approximation theory, providing efficient universal approximators with \(O(N^{-1/2})\) convergence rates.

  4. Applications span domains:

    • Finance: Portfolio optimization under parameter uncertainty (Section 22.6)
    • Supply chain: Demand forecasting for inventory management (Section 22.8)
    • Bayesian inference: Posterior quantile learning via Wang distortions (Section 22.4)
    • Reinforcement learning: Distributional Q-learning for risk-sensitive policies (Section 22.9)

When to Use Quantile Neural Networks

Quantile NNs are particularly well-suited when:

  • Decision problems: You need expectations or quantiles for optimization, not full densities
  • Likelihood-free models: Complex simulators where \(p(y|\theta)\) is unavailable
  • Robustness: Data contains outliers or heavy tails
  • Asymmetric costs: Different quantiles drive different decisions (e.g., stockouts vs. overstock)
  • High dimensions: Density estimation becomes intractable, but quantile regression remains feasible
  • Real-time requirements: Inference must be fast; networks provide instant quantile predictions

Limitations and Open Questions

Despite their power, quantile methods have limitations:

  1. Multivariate outputs: Extension to high-dimensional outputs \(\mathbf{X} \in \mathbb{R}^p\) requires careful handling of dependence structure. Autoregressive factorizations help but lose some joint information.

  2. Extreme quantiles: Estimating \(P_{0.01}\) or \(P_{99.99}\) requires substantial data in tails. Extreme value theory may help regularize these estimates.

  3. Non-crossing enforcement: Ensuring monotonicity in \(\tau\) adds computational overhead. Recent work on monotonic neural networks (Wehenkel and Louppe 2019) offers promise.

  4. Interpretability vs. flexibility: Neural networks are black boxes. For high-stakes decisions (medical, legal), transparency may favor simpler quantile methods.

  5. Calibration: Like all predictive models, quantile NNs can be miscalibrated. Post-hoc calibration techniques (Kuleshov, Fenner, and Ermon 2018) should be applied, especially for critical applications.

Connections to Broader AI

Quantile methods connect to several contemporary themes in AI and machine learning:

  • Generative modeling: Quantile networks are generative models—sample from uniform, pass through \(F^{-1}\), obtain samples from target distribution
  • Implicit distributions: Like GANs, quantile NNs avoid explicit likelihood computation
  • Conformal prediction: Quantile regression provides natural prediction sets with finite-sample coverage guarantees (Romano, Patterson, and Candes 2019)
  • Bayesian deep learning: Quantile NNs offer an alternative to variational inference and MC dropout for uncertainty quantification
  • Causal inference: Quantile treatment effects (Firpo, Fortin, and Lemieux 2009) extend average treatment effects, capturing heterogeneous impacts across the outcome distribution

Future Directions

Active research directions include:

  • Theoretical understanding: Tighter approximation bounds for quantile NNs in high dimensions
  • Efficient architectures: Attention mechanisms and transformers for temporal quantile modeling
  • Multi-task learning: Sharing representations across related forecasting problems
  • Online learning: Adapting quantile models in non-stationary environments
  • Integration with causal methods: Combining quantile regression with instrumental variables and difference-in-differences

Conclusion

Quantile neural networks represent a convergence of classical statistical theory (quantile regression, robust statistics) with modern machine learning (deep learning, representation learning). By focusing on the quantities we actually need—quantiles for decision-making—rather than intermediate densities, these methods offer a pragmatic and powerful approach to uncertainty quantification.

As Keynes observed, it is better to be roughly right than precisely wrong. Quantile methods embrace this philosophy: they provide the distributional information needed for sound decisions without claiming to know the complete probability model. In an era of increasingly complex data and high-stakes applications, this combination of flexibility, robustness, and decision-focus makes quantile neural networks an essential tool for the modern data scientist.