can't get tables to work, charts will have to do

pat-alt · pat-alt · commit 6050300afaab · 2022-08-18T10:21:59.000+02:00
diff --git a/_freeze/dev/notebooks/appendix/execute-results/html.json b/_freeze/dev/notebooks/appendix/execute-results/html.json
diff --git a/build/dev/notebooks/appendix.html b/build/dev/notebooks/appendix.html
diff --git a/dev/Project.toml b/dev/Project.toml
@@ -6,7 +6,6 @@ CounterfactualExplanations = "2f13d31b-18db-44c1-bc43-ebaf2cff0be0"
 DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
 Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
 Gadfly = "c91e804a-d5a3-530f-b6f0-dfbca275c004"
-Gumbo = "708ec375-b3d6-5a57-a7ce-8257bf98657a"
 LaplaceRedux = "c52c1a26-f7c5-402b-80be-ba1e638ad478"
 LibGit2 = "76f85450-5226-5b5a-8eaa-529ad045b433"
 LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
diff --git a/dev/notebooks/experiments/mitigation_strategies.qmd b/dev/notebooks/experiments/mitigation_strategies.qmd
@@ -132,6 +132,34 @@ for img in img_files
 end
 ```
 
+### Bootstrap
+
+```{julia}
+n_bootstrap = 1000
+using AlgorithmicRecourseDynamics.Evaluation: evaluate_system
+using DataFrames
+df = DataFrame()
+for (key, val) in results
+    n_folds = length(val.experiment.recourse_systems)
+    for fold in 1:n_folds
+        for i in 1:length(val.experiment.system_identifiers)
+            rec_sys = val.experiment.recourse_systems[fold][i]
+            model_name, gen_name = collect(val.experiment.system_identifiers)[i]
+            df_ = evaluate_system(rec_sys, val.experiment; n=n_bootstrap)
+            df_.model .= model_name
+            df_.generator .= gen_name
+            df_.fold .= fold
+            df = vcat(df, df_)
+        end
+    end
+end
+df = mapcols(x -> typeof(x) == Vector{Symbol} ? string.(x) : x, df)
+using RCall
+save_path = joinpath(output_path, "bootstrap_synthetic.csv")
+using CSV
+CSV.write(save_path, df)
+```
+
 ### Chart in paper
 
 @fig-mit-paper shows the chart that went into the paper.
@@ -292,6 +320,34 @@ for img in img_files
 end
 ```
 
+### Bootstrap
+
+```{julia}
+n_bootstrap = 1000
+using AlgorithmicRecourseDynamics.Evaluation: evaluate_system
+using DataFrames
+df = DataFrame()
+for (key, val) in results
+    n_folds = length(val.experiment.recourse_systems)
+    for fold in 1:n_folds
+        for i in 1:length(val.experiment.system_identifiers)
+            rec_sys = val.experiment.recourse_systems[fold][i]
+            model_name, gen_name = collect(val.experiment.system_identifiers)[i]
+            df_ = evaluate_system(rec_sys, val.experiment; n=n_bootstrap)
+            df_.model .= model_name
+            df_.generator .= gen_name
+            df_.fold .= fold
+            df = vcat(df, df_)
+        end
+    end
+end
+df = mapcols(x -> typeof(x) == Vector{Symbol} ? string.(x) : x, df)
+using RCall
+save_path = joinpath(output_path, "bootstrap_latent.csv")
+using CSV
+CSV.write(save_path, df)
+```
+
 ### Chart in paper
 
 @fig-mit-latent-paper shows the chart that went into the paper.
@@ -434,6 +490,34 @@ for (data_name, res) in results
 end
 ```
 
+### Bootstrap
+
+```{julia}
+n_bootstrap = 1000
+using AlgorithmicRecourseDynamics.Evaluation: evaluate_system
+using DataFrames
+df = DataFrame()
+for (key, val) in results
+    n_folds = length(val.experiment.recourse_systems)
+    for fold in 1:n_folds
+        for i in 1:length(val.experiment.system_identifiers)
+            rec_sys = val.experiment.recourse_systems[fold][i]
+            model_name, gen_name = collect(val.experiment.system_identifiers)[i]
+            df_ = evaluate_system(rec_sys, val.experiment; n=n_bootstrap)
+            df_.model .= model_name
+            df_.generator .= gen_name
+            df_.fold .= fold
+            df = vcat(df, df_)
+        end
+    end
+end
+df = mapcols(x -> typeof(x) == Vector{Symbol} ? string.(x) : x, df)
+using RCall
+save_path = joinpath(output_path, "bootstrap_real_world.csv")
+using CSV
+CSV.write(save_path, df)
+```
+
 ### Chart in paper
 
 @fig-mit-latent-paper shows the chart that went into the paper.
diff --git a/dev/notebooks/experiments/real_world.qmd b/dev/notebooks/experiments/real_world.qmd
@@ -141,6 +141,35 @@ for img in img_files
 end
 ```
 
+### Bootstrap
+
+```{julia}
+n_bootstrap = 1000
+using AlgorithmicRecourseDynamics.Evaluation: evaluate_system
+using DataFrames
+df = DataFrame()
+for (key, val) in results
+    n_folds = length(val.experiment.recourse_systems)
+    for fold in 1:n_folds
+        for i in 1:length(val.experiment.system_identifiers)
+            rec_sys = val.experiment.recourse_systems[fold][i]
+            model_name, gen_name = collect(val.experiment.system_identifiers)[i]
+            df_ = evaluate_system(rec_sys, val.experiment; n=n_bootstrap)
+            df_.model .= model_name
+            df_.generator .= gen_name
+            df_.fold .= fold
+            df = vcat(df, df_)
+        end
+    end
+end
+df = mapcols(x -> typeof(x) == Vector{Symbol} ? string.(x) : x, df)
+using RCall
+save_path = joinpath(output_path, "bootstrap.csv")
+using CSV
+CSV.write(save_path, df)
+```
+
+
 ### Chart in paper
 
 @fig-real-paper shows the chart that went into the paper.
diff --git a/dev/notebooks/experiments/synthetic.qmd b/dev/notebooks/experiments/synthetic.qmd
@@ -283,7 +283,7 @@ end
 ### Bootstrap
 
 ```{julia}
-n_bootstrap = 1
+n_bootstrap = 1000
 using AlgorithmicRecourseDynamics.Evaluation: evaluate_system
 using DataFrames
 df = DataFrame()
@@ -303,19 +303,9 @@ for (key, val) in results
 end
 df = mapcols(x -> typeof(x) == Vector{Symbol} ? string.(x) : x, df)
 using RCall
-save_path = joinpath(output_path, "bootstrap.html")
-R"""
-dt <- DT::datatable($df) |>
-    DT::formatRound(columns=c("value"), digits=3)
-DT::saveWidget(dt, $save_path)
-"""
-```
-
-```{julia}
-#| eval: true
-using Gumbo
-save_path = joinpath(output_path, "bootstrap.html")
-parsehtml(read(save_path, String))
+save_path = joinpath(output_path, "bootstrap.csv")
+using CSV
+CSV.write(save_path, df)
 ```
 
 ### Chart in paper {#sec-app-synthetic-paper}
diff --git a/dev/notebooks/generators/clap_roar_generator.qmd b/dev/notebooks/generators/clap_roar_generator.qmd
@@ -13,8 +13,6 @@ output_path = output_dir("generator")
 www_path = www_dir("generator")
 ```
 
-# `ClapROARGenerator`
-
 ```{julia}
 using MLJ
 N = 1000
@@ -129,4 +127,5 @@ for (name,ce) ∈ counterfactuals
     plts = vcat(plts..., plt)
 end
 plt = plot(plts..., size=(800,300), layout=(1,2))
+display(plt)
 ```
diff --git a/dev/notebooks/generators/gravitational_generator.qmd b/dev/notebooks/generators/gravitational_generator.qmd
@@ -13,8 +13,6 @@ output_path = output_dir("generator")
 www_path = www_dir("generator")
 ```
 
-# `GravitationalGenerator`
-
 ```{julia}
 using MLJ
 N = 1000
@@ -139,5 +137,3 @@ plt = plot(plt1, plt2, size=(850,350), layout=(1,2))
 savefig(plt, joinpath(www_path,"gravitational_generator_comparison.png"))
 ```
 
-
-# References
diff --git a/paper/paper.Rmd b/paper/paper.Rmd
@@ -115,3 +115,7 @@ knitr::opts_chunk$set(
 
 ::: {#refs}
 :::
+
+# Appendix
+
+
diff --git a/paper/sections/empirical.rmd b/paper/sections/empirical.rmd
@@ -55,15 +55,15 @@ We use four synthetic binary classification datasets consisting of 1000 samples
 knitr::include_graphics("www/synthetic_data.png")
 ```
 
-Ex-ante we expect to see that by construction Wachter will create a new cluster of counterfactual instances in the proximity of the initial decision boundary. Thus, the choice of a black-box model may have an impact on the paths of the recourse. For generators that use latent space search (REVISE @joshi2019towards, CLUE @antoran2020getting) or rely on (and have access to) probabilistic models (CLUE @antoran2020getting, Greedy @schut2021generating) we expect that counterfactuals will end up in regions of the target domain that are densely populated by training samples. Of course, this is expectation hinges on how effective said probabilistic models are at capturing predictive uncertainty. Finally, we expect to see the counterfactuals generated by DiCE to be uniformly spread around the feature space inside the target class^[As we mentioned earlier, the diversity constraint used by DiCE is only effective for when at least two counterfactuals are being generated. We have therefore decided to always generate 5 counterfactuals for each generator and randomly pick one of them.]. In summary, we expect that the endogenous shifts induced by Wachter outsize those induced by all other generators, since Wachter is the only approach that is not concered with generating what we have defined as meaningful counterfactuals. 
+Ex-ante we expect to see that by construction Wachter will create a new cluster of counterfactual instances in the proximity of the initial decision boundary. Thus, the choice of a black-box model may have an impact on the paths of the recourse. For generators that use latent space search (REVISE @joshi2019towards, CLUE @antoran2020getting) or rely on (and have access to) probabilistic models (CLUE @antoran2020getting, Greedy @schut2021generating) we expect that counterfactuals will end up in regions of the target domain that are densely populated by training samples. Of course, this expectation hinges on how effective said probabilistic models are at capturing predictive uncertainty. Finally, we expect to see the counterfactuals generated by DiCE to be uniformly spread around the feature space inside the target class^[As we mentioned earlier, the diversity constraint used by DiCE is only effective when at least two counterfactuals are being generated. We have therefore decided to always generate 5 counterfactuals for each generator and randomly pick one of them.]. In summary, we expect that the endogenous shifts induced by Wachter outsize those induced by all other generators, since Wachter is the only approach that is not concerned with generating what we have defined as meaningful counterfactuals. 
 
 ### Real-world data
 
-We use three different real-world datasets from the Finance and Economics domain, all of which are tabular and can be used for binary classification. Firstly, we use the **Give Me Some Credit** dataset which was open-sourced on Kaggle for the task to predict whether a borrower is likely to experience financial difficulties in the next two years [@gmsc_data]. Originally consisting of 250,000 instances with 11 numerical attributes. Secondly, we use the **UCI defaultCredit** dataset [@yeh2009comparisons], a benchmark dataset that can be used to train binary classifiers to predict the binary outcome variable, whether credit card clients default on their payment. In its raw form it consists of 23 explanatory variables - 4 categorical features relating to demographic attributes^[These have been ommitted from the analysis. See Section \@ref(limit-data) for details.] and 19 continuous features largely relating to individuals' payment histories and amount of credit outstanding. Both of these datasets have been used in the literature on Algorithmic Recourse before (see for example @pawelczyk2021carla, @joshi2019towards and @ustun2019actionable), presumably because they constitute real-world classification tasks involving individuals that compete for access to credit. 
+We use three different real-world datasets from the Finance and Economics domain, all of which are tabular and can be used for binary classification. Firstly, we use the **Give Me Some Credit** dataset which was open-sourced on Kaggle for the task of predicting whether a borrower is likely to experience financial difficulties in the next two years [@gmsc_data]. It originally consists of 250,000 instances with 11 numerical attributes. Secondly, we use the **UCI defaultCredit** dataset [@yeh2009comparisons], a benchmark dataset that can be used to train binary classifiers to predict the binary outcome variable, whether credit card clients default on their payment. In its raw form it consists of 23 explanatory variables - 4 categorical features relating to demographic attributes^[These have been omitted from the analysis. See Section \@ref(limit-data) for details.] and 19 continuous features largely relating to individuals' payment histories and amount of credit outstanding. Both of these datasets have been used in the literature on Algorithmic Recourse before (see for example @pawelczyk2021carla, @joshi2019towards and @ustun2019actionable), presumably because they constitute real-world classification tasks involving individuals that compete for access to credit. 
 
 As a third dataset we include the **California Housing** dataset derived from the 1990 U.S. census and sourced through scikit-learn [@pedregosa2011scikit, @pace1997sparse]. It consists of 8 continuous features that can be used to predict the median house price for California districts. The continuous outcome variable is binarized as $\tilde{y}=\mathbb{I}_{y>\text{median}(Y)}$ indicating whether or not the median house price of a given district is above or below the median of all districts. While we have not seen this dataset used in the previous literature on AR, others have used the Boston Housing dataset in a similar fashion (see for example @schut2021generating). While we initially also conducted experiments on that dataset, we eventually discarded this dataset, since it has been found to suffer from an ethical problem [@carlisle2019racist].
 
-Since the simulations involve generating counterfactuals for a significant proportion of the entire sample of individuals, we have randomly undersampled each dataset to yield balanced subsamples consisting of 2,500 individuals each. We have also standardized all explanatory features since our chosen classifiers are sensetive to scale.
+Since the simulations involve generating counterfactuals for a significant proportion of the entire sample of individuals, we have randomly undersampled each dataset to yield balanced subsamples consisting of 2,500 individuals each. We have also standardized all explanatory features since our chosen classifiers are sensitive to scale.
 
 ## $G$ -- Generators
 
diff --git a/paper/sections/empirical_2.rmd b/paper/sections/empirical_2.rmd
@@ -34,7 +34,7 @@ The same broad pattern also emerges in the third row: we observe the smallest de
 
 Figure \@ref(fig:syn) also indicates that the estimated effects are strongest for the simplest linear classifier, a pattern that we have observed fairly consistently. Conversely, there is virtually no difference in outcomes between the deep ensemble and the MLP. It is possible that the deep ensembles simply fail to capture predictive uncertainty well and hence counterfactual generators like Greedy, that explicitly address this quantity, fail to work as expected.
 
-The findings for the other synthetic datasets are broadly consistent with the observations above. For the Moons data the same broad patterns emerge, although in this case it is less evident that Latent Space generators induce relatively smaller shifts. For the Circles data, it also appears at first sight that Latent Space search yields better results, but it turns out that in this case counterfactual search is simply largely unsuccessful^[We suspect that this in this case the generative model has failed to learn an accuracte representation of the data geenrating process.]. Model shifts and performance deterioration are also quantitatively quantitatively smaller than in what we can observe in Figure \@ref(fig:syn). The same broadly holds for the Moons data,  For the Linearly Separable data we also find substantial domain and model shifts, but no reduction in model performance. 
+The findings for the other synthetic datasets are broadly consistent with the observations above. For the Moons data the same broad patterns emerge, although in this case it is less evident that Latent Space generators induce relatively smaller shifts. For the Circles data, it also appears at first sight that Latent Space search yields better results, but it turns out that in this case counterfactual search is simply largely unsuccessful^[We suspect that in this case the generative model has failed to learn an accurate representation of the data-generating process.]. Model shifts and performance deterioration are also quantitatively smaller than what we can observe in Figure \@ref(fig:syn). The same broadly holds for the Moons data. For the Linearly Separable data we also find substantial domain and model shifts, but no reduction in model performance. 
 
 Finally, it is also worth noting at this point that the observed dynamics and patterns are consistent throughout the course of the experiment. That is to say that we start observing shifts already after just a few rounds and these tend to increase proportionately for the different generators over the course of the experiment.
 
diff --git a/paper/sections/introduction.rmd b/paper/sections/introduction.rmd
@@ -21,7 +21,7 @@ Suppose Figure \@ref(fig:poc) relates to an automated decision-making system use
 :::
 
 ::: {.example #student name="Student Admission"}
-Suppose Figure \@ref(fig:poc) relates to an automated decision-making system used by a university in their student admission process. Assume that the two features are actually meaningful in the sense that the likelihood of students successfully completing their degree increases in the south-east direction. Then we can think of the outcome in panel (b) as representing a situation where more students are admitted to university (orange), but they are more liekly to fail their degree than students that were admitted in previous years. The university admission committee catches on to this and suspends its efforts to offer Algorithmic Recourse. This represents an opportunity cost to future student applicants, that may have derived utility from being offered recourse.
+Suppose Figure \@ref(fig:poc) relates to an automated decision-making system used by a university in their student admission process. Assume that the two features are actually meaningful in the sense that the likelihood of students successfully completing their degree increases in the south-east direction. Then we can think of the outcome in panel (b) as representing a situation where more students are admitted to university (orange), but they are more likely to fail their degree than students that were admitted in previous years. The university admission committee catches on to this and suspends its efforts to offer Algorithmic Recourse. This represents an opportunity cost to future student applicants who may have derived utility from being offered recourse.
 :::
 
 Both of these examples are exaggerated simplifications of potential real-world scenarios, but they serve to illustrate the point that recourse for one single individual may exert negative externalities on other individuals.
diff --git a/paper/sections/mitigation.rmd b/paper/sections/mitigation.rmd
diff --git a/paper/sections/related.rmd b/paper/sections/related.rmd
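
The preprocessing described in `paper/sections/empirical.rmd` (binarizing a continuous outcome at its median, random undersampling to balanced classes, and standardizing features) can be sketched in Julia as follows. This is a minimal illustration on toy data and is not taken from the repository: the helper names `binarize`, `balanced_indices` and `standardize`, the toy sample sizes, and the choice to undersample before standardizing are all assumptions.

```julia
using Random, Statistics

# Binarize a continuous outcome at its median: 1 if y > median(Y), else 0.
# (Hypothetical helper, mirroring the indicator defined in the paper text.)
binarize(y) = Int.(y .> median(y))

# Randomly undersample so that each class contributes the same number of rows.
# (Hypothetical helper; returns row indices into the original data.)
function balanced_indices(rng, y_bin, n_per_class)
    idx0 = shuffle(rng, findall(==(0), y_bin))
    idx1 = shuffle(rng, findall(==(1), y_bin))
    n = min(n_per_class, length(idx0), length(idx1))
    return vcat(idx0[1:n], idx1[1:n])
end

# Standardize each feature column to zero mean and unit standard deviation.
function standardize(X)
    mu = mean(X; dims=1)
    sigma = std(X; dims=1)
    return (X .- mu) ./ sigma
end

rng = MersenneTwister(42)
X = randn(rng, 200, 3) .* [1.0 10.0 100.0]   # toy features on different scales
y = X[:, 1] .+ 0.1 .* randn(rng, 200)        # toy continuous outcome
y_bin = binarize(y)
keep = balanced_indices(rng, y_bin, 50)      # 50 rows per class (toy size)
X_bal, y_bal = standardize(X[keep, :]), y_bin[keep]
```

Standardizing after undersampling means the scaling statistics reflect only the retained subsample; computing them on the full dataset first would be an equally plausible reading of the text.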