Robust Bayesian Recourse

September 24, 2022

1. Introduction

Our society is moving increasingly towards algorithmic reliance for a range of decision problems. These decisions directly affect not only individuals but also society as a whole when they are adopted by organizations, some of which are fundamental, such as financial and educational institutions. This raises significant requirements on the trustworthiness and explainability of machine learning models. Among approaches, post-hoc explanations balance between transparency and safeguarding exploitative gaming behaviors. Post-hoc explanations aim to demonstrate why unfavorable decisions are made and how different the input should be so one can obtain a favorable outcome. Thus, these explanations can also be offered as recourses to benefit users, rather than being sole explanations for the system.
A recourse recommends the actions that an individual should take in order to receive an alternate algorithmic outcome. For example, consider an applicant who is rejected for a particular job, a recourse for a future success application may come in the form of personalized recommendations:

Complete a 6-month full-stack engineer internship
Score 60 more points in an ability test

However, decision systems could be updated frequently in practice due to data distribution shifts, rendering recommended recourses useless.
Following this line, our work focus on robust recourses, e.g., recourses that are effective at reversing the algorithmic outcome even under model shifts.

2. Bayesian Recourse

We consider a generic covariate space $X \in \mathcal{X} = \mathbb{R}^p$ and a binary predicted label $\hat{Y} \in \mathcal{Y} = \{0, 1\}$ , where class 0 denotes an unfavorable outcome while class 1 denotes a favorable one. Consider the joint random vector of covariate-label $(X, \hat{Y}) \in \mathcal{X} \times \mathcal{Y}$ , then the class posterior probability of any input $x$ can be represented by the conditional random variable $\hat Y | X = x$ .

Bayesian Recourse

Given an input $x_0$ , let $\mathbb{X}$ be a neighborhood around $x_0$ . A Bayesian recourse $x_{Bayes} \in \mathbb{X}$ is an alternative that minimizes the Bayesian posterior odds ratio

$x_{Bayes} \triangleq \underset{x \in \mathbb{X}}{\arg \min}~ \displaystyle \frac{\mathbb{P}( \hat{Y} = 0 | X = x)}{ \mathbb{P}( \hat{Y} = 1 | X = x)}$

for some joint distribution $\mathbb{P}$ of $(X, \hat{Y})$ induced by the sampling of the synthetic covariate $X$ and the synthetic predicted label $\hat{Y} = \mathcal{C}(X)$ .

Suppose that we can use a sampling mechanism to sample $n$ covariates $\widehat{x}_i$ , then query the given classifier to obtain the predicted labels $\widehat{y}_i$ , $i = 1, \ldots, n$ .
Let $\mathcal{I}_y = \{ i \in [n]: \widehat{y}_i = y \}$ be the indices of samples in class $y \in \mathcal{Y}$ . Let $N_y = | \mathcal{I}_y|$ be the number of training samples with class $y$ , then we can use $\gamma_y = N_y/n$ , the empirical proportion of data of class $y$ , as an estimate of $\mathbb{\hat{Y}} = y$ .
We take the nonparametric statistics approach to estimate the likelihood $\mathbb{P}(X = x | \hat{Y} = y)$ using kernel density estimation [1]. For a concrete example, we choose the Gaussian kernel with bandwidth $h > 0$ , thus the kernel density estimate of the quantity $\mathbb{P}(X = x | \hat{Y} = y)$ is

$L_{\mathrm{KDE}}(x | \hat{Y} = y) = \frac{1}{N_y} \sum_{i \in \mathcal{I}_y} \exp\left( - \frac{1}{2h^2} \| x - \widehat{x}_i \|_2^2 \right).$

We now can formulate the empirical version of Bayesian recourse, termed KDE-Bayesian recourse:

KDE-Bayesian Recourse

$\min_{x \in \mathbb{X}}~ \displaystyle \frac{\gamma_0 \times L_{\mathrm{KDE}}( x | \hat{Y} = 0) }{ \gamma_1 \times L_{\mathrm{KDE}}( x | \hat{Y} = 1)},$

which further simplifies to

$\min_{x \in \mathbb{X}}~\frac{\sum_{i \in \mathcal{I}_0} \exp\left( - \frac{1}{2h^2} \| x - \widehat{x}_i \|_2^2 \right)}{\sum_{i \in \mathcal{I}_1} \exp\left( - \frac{1}{2h^2} \| x - \widehat{x}_i \|_2^2 \right)}$

by exploiting the definition of $L_{\mathrm{KDE}}$ .

To facilitate the empirical evaluation of the neighborhood around $x_0$ , we need to characterize the feasible set $\mathbb{X}$ . It is desirable to control the recourse to be in a strict neighborhood of distance $\delta$ from the input [2]. Thus, we can impose a feasible set of the form

Feasible Set

$\mathbb{X} = \{ x \in \mathcal{X} ~:~ \varphi(x, x_0) \leq \delta \},$

where $\varphi$ is a measure of dissimilarity (distance or norm) on the covariate space $\mathcal{X}$ .

If we use a boundary sampler, we may also opt for the neighborhood satisfying $\varphi(x, x^b) \leq \delta'$ around the boundary point $x^b$ . A good choice of $\varphi$ is the norm-1, which promotes sparse modifications to the input.
In order to construct plausible and meaningful recourses, we could additionally consider the actionability constraints that forbid unrealistic recourses. For example, genders and races should be considered immutable features of a person. These constraints could be directly injected into the definition of $\mathbb{X}$ following [3]. The optimal recourse for the KDE-Bayesian objective over the feasible set can be solved effectively by projected gradient descent.

3. Robust Bayesian Recourse

To robustify the Bayesian recourse, the robust Bayesian recourse aims to perturb directly the empirical conditional distributions of $X | \hat{Y} = y$ , which then reshapes the decision boundary in the covariate space in an adversarial manner. Holistically, our approach can be decomposed into the following steps:

Forming the empirical conditional distributions of $X | \hat{Y} = y$ , then smoothen them by convoluting an isotropic Gaussian noise to each data point.
Formulating the ambiguity set for each conditional distribution of $X | \hat{Y} = y$ .
Solving a min-max problem to find the recourse that minimizes the worst-case Bayesian posterior odds ratio.

We assume that the conditional distribution can be perturbed in an ambiguity set $\mathbb{B}_{\varepsilon_y}(\hat{\mathbb{P}}_y^\sigma)$ . This set $\mathbb{B}_{\varepsilon_y}(\hat{\mathbb{P}}_y^\sigma)$ is defined as a neighborhood of radius $\varepsilon_y \geq 0$ centered at the nominal distribution $\hat{\mathbb{P}}_y^\sigma$ . The robust Bayesian recourse is defined as the optimal solution to the following problem

$\min_{x \in \mathbb{X}}~ \displaystyle \max_{\mathbb{Q}_0 \in \mathbb{B}_{\varepsilon_0}(\hat{\mathbb{P}}_0^\sigma), \mathbb{Q}_1 \in \mathbb{B}_{\varepsilon_1}(\hat{\mathbb{P}}_1^\sigma)}~\frac{ \gamma_0 \mathbb{Q}_0( X = x) }{ \gamma_1 \mathbb{Q}_1( X = x ) }.$

By internalizing the maximization term inside the fraction and replacing $\mathbb{Q}_y(X = x)$ by the likelihood $L(x, \mathbb{Q}_y)$ , the above problem is equivalent to

$\min_{x \in \mathbb{X}}F(x), \quad F(x) \triangleq \frac{\gamma_0 \times \max_{\mathbb{Q}_0 \in \mathbb{B}_{\varepsilon_0}(\hat{\mathbb{P}}_0^\sigma)}~L(x, \mathbb{Q}_0) }{ \gamma_1 \times \min_{\mathbb{Q}_1 \in \mathbb{B}_{\varepsilon_1}(\hat{\mathbb{P}}_1^\sigma)} L(x, \mathbb{Q}_1) }.$

For any $x \in \mathbb{X}$ , evaluating its objective value $F(x)$ requires solving the maximization of the likelihood in the numerator, termed the optimistic likelihood:

$\max~\{ L(x, \mathbb{Q}_0) : \mathbb{Q}_0 \in \mathbb{B}_{\varepsilon_0}(\hat{\mathbb{P}}_0^\sigma)\},$

and the minimization of the likelihood in the denominator, termed the pessimistic likelihood:

$\min~\{ L(x, \mathbb{Q}_1) : \mathbb{Q}_1 \in \mathbb{B}_{\varepsilon_1}(\hat{\mathbb{P}}_1^\sigma)\}.$

Figure 1: An example of the robust Bayesian recourse on a toy 2-dimensional instance. The star denotes the input $x_0$ , and the black circle denotes the boundary point $x^b$ . Green and blue circles are locally sampled data with favorable and unfavorable predicted values, respectively. The red circle denotes the robust Bayesian recourse, and the curved line denotes the continuum of intermediate solutions of the gradient descent algorithm. The robust Bayesian recourse moves to the interior of the favorable region (green), and thus is more likely to be valid subject to model shifts.

4. Wasserstein-Gaussian Mixture Ambiguity Sets

A central problem in solving the robust Bayesian recourse is how to construct the ambiguity set $\mathbb{B}_{\varepsilon_y}(\hat{\mathbb{P}}_y^\sigma)$ . Our construction relies on representing a Gaussian mixture distribution as a discrete distribution on the mean vector and covariance matrix space.
Notice that the smoothed measure $\hat{\mathbb{P}}_y^\sigma$ is a Gaussian mixture and any Gaussian distribution is fully characterized by its mean vector and its covariance matrix. Thus, $\hat{\mathbb{P}}_y^\sigma$ is associated with the discrete distribution $\widehat{\nu_y} = N_y^{-1} \sum_{i \in \mathcal{I}_y} \delta_{(\widehat{x}_i, \sigma^2 I)}$ on the space of mean vector and covariance matrix $\mathbb{R}^p \times \mathbb{S}^p_+$ .
Then, for any $y \in {0, 1}$ , we formally define the ambiguity set as

$\mathbb{B}_{\varepsilon_y}(\hat{\mathbb{P}}_y^\sigma) \triangleq \left\{ \mathbb{Q}_y: \begin{array}{l} \nu_y \in \mathcal{P}(\mathbb{R}^p \times \mathbb{S}_{\geq\sigma}^p),~\mathbb{W}_c(\nu_y, \widehat{\nu}_y) \leq \varepsilon_y \\ \mathbb{Q}_y \text{ is a Gaussian mixture associated with } \nu_y \end{array}\right\}.$

Here, $\mathbb{S}_{\geq\sigma}^p \triangleq { \Sigma \in \mathbb{S}_+^p: \Sigma \succeq \sigma^2 I} \subset \mathbb{S}_+^p$ is the set of covariance matrices whose eigenvalues are lower bounded by $\sigma^2 > 0$ , with $\sigma^2$ is the isotropic variance of the smoothing convolution. $\mathcal{P}(\mathbb{R}^p \times \mathbb{S}_{\geq\sigma}^p)$ denotes the set of all possible distributions supported on $\mathbb{R}^p \times \mathbb{S}_{\geq\sigma}^p$ .
Intuitively, $\mathbb{B}_{\varepsilon_y}(\hat{\mathbb{P}}_y^\sigma)$ contains all Gaussian mixtures $\mathbb{Q}_y$ associated with some $\nu_y$ having a distance less than or equal to $\varepsilon_y$ from the nominal measure $\widehat{\nu}_y$ . The distance between $\nu_y$ and $\widehat{\nu}_y$ is measured by an optimal transport $\mathbb{W}_c$ .
Particularly, we use $\mathbb{W}_c$ as the type- $\infty$ Wasserstein distance, which differentiates us from the literature whose focuses are type-1 and type-2 distances [4, 5]. This type- $\infty$ construction is critical for the computational tractability of our problem.

Figure 2: Another visualization of the worst-case distributions on a toy dataset. The dashed, opaque dots and circles represent the isotropic Gaussian around each data sample. The solid dots and circles represent the worst-case distributions corresponding to the boundary point $x^b$ . For blue (unfavorably predicted) samples, the worst-case distribution is formed by perturbing the distribution towards $x^b$ — which leads to maximizing the posterior probability of unfavorable prediction. For green (favorably predicted) samples, the worst-case distribution is formed by perturbing the distribution away from $x^b$ — which leads to minimizing the posterior probability of favorable prediction. These worst-case distributions will maximize the posterior probability odds ratio.

5. Reformulation of the likelihood evaluation problems

In the paper, by leveraging the essential supremum in the definition of the type- $\infty$ Wasserstein distance, we prove that the evaluation of the optimistic and pessimistic likelihoods can be decomposed into solving smaller subproblems, and that each is an optimization problem over the mean vector – covariance matrix space $\mathbb{R}^p \times \mathbb{S}_{\geq\sigma}^p$ .
We further show that each of these subproblems can be reduced to a 2-dimensional subproblem independent of the original dimension $p$ .
For the sake of brevity, we only state here the final reformulations of our likelihood evaluation problems. Interested readers are welcome to read our papers for the full derivation.

Optimistic likelihood
For each $i \in \mathcal{I}_0$ , let $\alpha_i$ be the optimal value of the following two-dimensional optimization problem

$\min_{\substack{a \in \mathbb{R}_+,~d_p \in [\sigma, +\infty) \\ a^2 + (d_p - \sigma)^2 \leq \epsilon_0^2 }} ~\left\{\log d_p + \frac{(\|x - \widehat{x}_i\|_2 - a)^2}{2d_p^2} + (p-1) \log \sigma\right\}.$

Then, we have

$\max~\{ L(x, \mathbb{Q}_0) : \mathbb{Q}_0 \in \mathbb{B}{\varepsilon_0}(\hat{\mathbb{P}}_0^\sigma)\} = \frac{\sum_{i \in \mathcal{I}_0} \exp(-\alpha_i)}{N_0 (2\pi)^{p/2}}.$

Pessimistic likelihood
For each $i \in \mathcal{I}_1$ , let $\alpha_i$ be the optimal value of the following two-dimensional optimization problem

$\min_{\substack{a \in \mathbb{R}_+,~d_1 \in [\sigma, +\infty)\\ a^2 + p(d_1 - \sigma)^2 \leq \epsilon_1^2}} ~ \left\{-\log d_1 - \frac{(\|x - \widehat{x}_i\|_2 + a)^2}{2d_1^2} - (p-1) \log \left(\sigma + \sqrt{\frac{\epsilon_1^2 - a^2 - (d_1 - \sigma)^2}{p - 1}}\right) \right\}.$

Then we have

$\max~\{ L(x, \mathbb{Q}_1) : \mathbb{Q}_1 \in \mathbb{B}_{\varepsilon_1}(\hat{\mathbb{P}}_1^\sigma)\} = \frac{\sum_{i \in \mathcal{I}_1} \exp(-\alpha_i)}{N_1 (2\pi)^{p/2}}.$

with the above theorems, we can design an iterative scheme to solve the robust Bayesian recourse problem. For any value $x \in \mathbb{X}$ , we can use projected gradient descent to solve a series of two-dimensional subproblems to evaluate the objective value $F(x)$ .

6. Experiments

We evaluate the robustness of our method to model shifts of different recourses, together with the trade-off against the cost of adopting the recourse’s recommendation. We compare our proposed robust Bayesian recourse (RBR) method against the counterfactual explanations of Wachter, and ROAR robust recourses generated using either LIME and LIMES surrogate models.

Experimental setup

Datasets

We examine the recourse generators on both a synthetic dataset and the real-world datasets: German Credit [6, 7] Small Bussiness Administration (SBA) [8], and Give Me Some Credit (GMC) [9]. Each dataset contains two sets of data: $D_1$ and $D_2$ . The former is the current data which is used to train a current classifier to generate recourses. The latter represents the possible data arriving in the future.
For each dataset, we use 80% of the instances in the current data $D_1$ to train the underlying predictive model and fix this classifier to construct recourses for the remaining 20% of the instances. The future data $D_2$ will be used to train future classifiers, which are for evaluation only.

Classifier

We use a three-layer MLP with 20, 50 and 20 nodes, respectively with a ReLU activation in each consecutive layer. The sigmoid function is used in the last layer to produce predictive probabilities. The performance of the MLP classifier is reported in Table 1 below.

Table 1. Accuracy and AUC results of the MLP classifier on the synthetic and real-world datasets

Metrics

To measure the ease of adopting a recourse, we use the $\ell_1$ -distance as the cost function $\varphi$ on the covariate space $\mathcal{X}$ , this choice is similar to [3] and [10]. We define the current validity as the validity of the recourses with respect to the current classifier $\mathcal{C}$ . To evaluate the robustness of recourses to the changes in model’s parameters, we sample 20% of the instances in the data set $D_2$ as the arrival data. We then re-train the classifier with the old data (80% of $D_1$ ) coupled with this arrival data to simulate the future classifiers $\tilde{\mathcal{C}}$ . We repeat this procedure 100 times to obtain 100 future classifiers and report the future validity of a recourse as the fraction of the future classifiers with respect to which the recourse is valid.
Readers are welcome to read our paper for more details on the experiments.

Cost-validity trade-off

We obtain the Pareto front for the trade-off between the cost of adopting recourses produced by RBR and their validity by varying the ambiguity sizes $\varepsilon_1$ and $\varepsilon_0$ , along the maximum recourse cost $\delta = \|x_0 - x_b\|_1 + \delta_+$ . Particularly, we consider $\varepsilon_0, \varepsilon_1 \in \{ 0.5k \; | \; k = 0, \ldots, 2\}$ and $\delta_+ \in \{0.2l \, | \, l = 0, \ldots, 5\}$
The frontiers for ROAR-based methods are obtained by varying $\delta_{\max} \in \{ 0.02m \; | \; m = 0, \ldots, 10 \}$ , where $\delta_{\max}$ is the tuning parameter of ROAR.
As shown in the figure below, increasing $\varepsilon_1$ and $\delta_+$ generally increase the future validity of recourses yielded by RBR at the sacrifice of the cost, while sustaining the current validity. Yet, the frontiers obtained by RBR either dominate or are comparable to other frontiers of Wachter, LIME-ROAR, and LIMELS-ROAR.

Figure 3. Pareto frontiers of the cost-validity trade-off with the MLP classifier, on synthetic, German Credit, Small Business Administration, and Give Me Some Credit datasets.

7. References
[1] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2008.
[2] Suresh Venkatasubramanian and Mark Alfano. The philosophical basis of algorithmic recourse. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, page 284–293, 2020.
[3] Sohini Upadhyay, Shalmali Joshi, and Himabindu Lakkaraju. Towards robust and reliable algorithmic recourse. In Advances in Neural Information Processing Systems, 2021.
[4] Viet Anh Nguyen, Soroosh Shafieezadeh-Abadeh, ManChung Yue, Daniel Kuhn, and Wolfram Wiesemann. Calculating optimistic likelihoods using (geodesically) convex optimization. In Advances in Neural Information Processing Systems 32, pages 13942–13953, 2019.
[5] Viet Anh Nguyen, Nian Si, and Jose Blanchet. Robust Bayesian classification using an optimistic score ratio. In Proceedings of the 37th International Conference on Machine Learning, 2020.
[6] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml
[7] U Groemping. South German credit data: Correcting a widely used data set. Reports in Mathematics, Physics and Chemistry, Department II, Beuth University of Applied Sciences Berlin, 2019.
[8] Min Li, Amy Mickel, and Stanley Taylor. “Should this loan be approved or denied?”: A large dataset with class assignment guidelines. Journal of Statistics Education, 26(1):55–66, 2018.
[9] Kaggle Competition. Give me some credit. improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years., 2011. URL https://www.kaggle.com/competitions/GiveMeSomeCredit/data
[10] Berk Ustun, Alexander Spangher, and Yang Liu. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 10–19, 2019.