1 Introduction
2 Related work
 | \(H_0\) is true in reality (\(H_0\)) | \(H_0\) is false in reality (\(\lnot H_0\)) |
---|---|---|
The decision inferred from the sample is “reject \(H_0\)” (\(R\)) | \(\mathrm {Pr}(R|H_0) = \alpha\) (Type 1 error) | \(\mathrm {Pr}(R|\lnot H_0) = 1-\beta\) (Power) |
The decision inferred from the sample is “do not reject \(H_0\)” (\(\lnot R\)) | \(\mathrm {Pr}(\lnot R|H_0) = 1-\alpha\) | \(\mathrm {Pr}(\lnot R|\lnot H_0) = \beta\) (Type 2 error) |
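The first row of the table can be checked by simulation: when \(H_0\) is true, a well-calibrated test rejects in roughly a fraction \(\alpha\) of repetitions. A minimal Monte Carlo sketch, where the four-category uniform setting and the generic \(\chi^2\) goodness-of-fit test are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
alpha, trials, sample_size = 0.05, 2000, 400

rejections = 0
for _ in range(trials):
    # H0 is true by construction: the draws really do come from the
    # uniform distribution over four categories
    draws = rng.choice(4, size=sample_size)
    counts = np.bincount(draws, minlength=4)
    _, p_value = chisquare(counts)  # expected counts default to uniform
    rejections += p_value <= alpha

type1_rate = rejections / trials  # empirical Type 1 error rate, close to alpha
```

Raising `trials` tightens the estimate; running the same loop with a biased probability vector passed to `rng.choice` would instead estimate the test's power \(1-\beta\).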
2.1 Uniform random samplers
2.1.1 Atomic mutations (QuickSampler)
2.1.2 Hashing-based sampling (Unigen2)
2.1.3 Counting-based sampling (KUS, Smarch, and Spur)
2.1.4 New: BDDSampler, a scalable and uniform sampler
- BDDSampler, KUS, QuickSampler, and Spur are considerably faster than Smarch and Unigen2.
- In terms of uniformity, there are three types of samplers: (i) those that mostly fail to produce uniform samples (QuickSampler), (ii) those that usually work but from time to time generate non-uniform samples (KUS and Spur), and (iii) those that always produce uniform samples (BDDSampler, Smarch, and Unigen2).
2.2 Prior work on testing sampler uniformity
2.2.1 Method 1: Generate a massive sample with \(\mathbb {S}\), and compare it with another one obtained simulating an ideal uniform sampler
2.2.2 Method 2: Assume the existence of a uniform sampler \(\mathbb {U}\), and compare the samples generated by both \(\mathbb {S}\) and \(\mathbb {U}\)
2.2.3 Method 3: Measure the distance between the theoretical variable probabilities with the empirical variable frequencies in a sample
2.2.4 Method 4: A statistical goodness-of-fit test that compares the theoretical variable probabilities with the empirical variable frequencies in a sample
2.2.5 New: an improved goodness-of-fit test
2.2.6 Recap
3 The BDDSampler Tool
3.1 From configuration models to Boolean formulas
3.2 A brief introduction to BDDs
3.3 How BDDSampler works
4 Assessing the uniformity of SAT solution samplers
4.1 The SFpC goodness-of-fit test
- \(F=[f_0, f_1, \ldots, f_n]\) stores the SAT-solution frequency distribution (i.e., the red histograms in Fig. 7). That is,
$$\begin{aligned} f_0 &= \frac{\#\text{SAT solutions in the sample with no variables assigned to true}}{\text{sample size}} \\ f_1 &= \frac{\#\text{SAT solutions in the sample with } 1 \text{ variable assigned to true}}{\text{sample size}} \\ &\;\;\vdots \\ f_n &= \frac{\#\text{SAT solutions in the sample with all variables assigned to true}}{\text{sample size}} \end{aligned}$$
- \(P=[p_0, p_1, \ldots, p_n]\) stores the theoretical SAT-solution probability distribution of Fig. 7. That is,
$$\begin{aligned} p_i = \frac{\#\text{SAT solutions in the population with } i \text{ variables assigned to true}}{\text{population size}} \end{aligned}$$
- BDDSampler:
$$\begin{aligned} \mathrm{JSD}(P||F) &= 0.001085388 \\ 2 \cdot 17{,}738 \,(\ln 2)\, \mathrm{JSD}(P||F) &= 26.68979 \\ p\text{-value} &= \mathrm{Pr}\big(\text{getting a value} \ge 26.68979 \,\big|\, H_0\big) \sim 1 > \alpha \\ &\Rightarrow \text{Test result: Do not reject } H_0 \end{aligned}$$
- QuickSampler:
$$\begin{aligned} \mathrm{JSD}(P||F) &= 0.5255 \\ 2 \cdot 17{,}738 \,(\ln 2)\, \mathrm{JSD}(P||F) &= 12{,}923.04 \\ p\text{-value} &= \mathrm{Pr}\big(\text{getting a value} \ge 12{,}923.04 \,\big|\, H_0\big) \sim 0 \le \alpha \\ &\Rightarrow \text{Test result: Reject } H_0 \end{aligned}$$
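The two worked examples above follow the same recipe: build \(F\) from the sample, compute \(\mathrm{JSD}(P||F)\), scale it by \(2N\ln 2\), and read a p-value off a \(\chi^2\) distribution. A minimal sketch in Python; the choice of \(n\) degrees of freedom (bins minus one) and the use of SciPy's `jensenshannon` are assumptions of this illustration:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import chi2

def empirical_distribution(sample):
    """F = [f_0, ..., f_n]: fraction of sampled SAT solutions with
    0, 1, ..., n variables assigned to true. `sample` is a 0/1 matrix
    with one row per solution and one column per variable."""
    sample = np.asarray(sample)
    counts = np.bincount(sample.sum(axis=1), minlength=sample.shape[1] + 1)
    return counts / len(sample)

def sfpc_test(p, f, sample_size, alpha=0.01):
    """Goodness-of-fit test on the statistic 2 * N * (ln 2) * JSD(P||F).
    ASSUMPTION of this sketch: len(p) - 1 degrees of freedom."""
    jsd = jensenshannon(p, f, base=2) ** 2        # JSD measured in bits
    stat = 2 * sample_size * np.log(2) * jsd
    p_value = chi2.sf(stat, df=len(p) - 1)
    return stat, p_value, p_value > alpha         # True => do not reject H0
```

A sample that matches \(P\) exactly gives \(\mathrm{JSD}=0\), a statistic of 0, and a p-value of 1, so \(H_0\) is not rejected.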
4.2 Sample size estimation
- The significance level \(\alpha\) sets the probability of making a Type 1 error, i.e., the probability of rejecting \(H_0\) when it is indeed true (false positive). It is worth noting that \(\alpha\) is also the threshold for rejecting \(H_0\) (i.e., \(H_0\) is rejected whenever the p-value \(\le \alpha\)).
- \(\beta\) sets the probability of making a Type 2 error, i.e., the probability of accepting a false \(H_0\) (false negative). The expression \(1-\beta\) is called the test’s power, i.e., the probability of rejecting a false \(H_0\).
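Given \(\alpha\), \(\beta\), and an effect size, the required sample size follows from the power of the \(\chi^2\) test: under the alternative hypothesis, the statistic is approximately noncentral \(\chi^2\) with noncentrality \(N w^2\). A sketch using Cohen's effect size \(w\) and a plain linear search, both assumptions of this illustration rather than the paper's exact procedure:

```python
from scipy.stats import chi2, ncx2

def sample_size(alpha, beta, w, df):
    """Smallest N for which a chi-square test with `df` degrees of freedom
    detects an effect of size `w` with power 1 - beta at significance alpha."""
    crit = chi2.isf(alpha, df)  # rejection threshold under H0
    n = 2
    # Power = P(noncentral chi2 with noncentrality N * w^2 exceeds crit);
    # it grows monotonically with N, so scan until it reaches 1 - beta.
    while ncx2.sf(crit, df, n * w * w) < 1 - beta:
        n += 1
    return n
```

For \(\alpha=0.05\), \(\beta=0.2\), \(w=0.3\), and one degree of freedom this yields a sample size of about 88; tightening any of the three parameters increases \(N\), which is why strict settings lead to samples of several thousand configurations.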
5 Empirical evaluation
- G1: Samplers’ evaluation. The first goal G1 is to evaluate the scalability and uniformity of BDDSampler and the following state-of-the-art samplers: KUS (Sharma et al. 2018), QuickSampler (Dutra et al. 2018), Smarch (Oh et al. 2019), Spur (Achlioptas et al. 2018), and Unigen2 (Chakraborty et al. 2013; Chakraborty 2015). G1 is broken down into Questions Q1 and Q2:
- Q1: Samplers’ scalability. Are BDDSampler, KUS, QuickSampler, Smarch, Spur, or Unigen2 able to generate samples with 1,000 configurations for models of all sizes within one hour?
- Q2: Samplers’ uniformity. Do BDDSampler, KUS, QuickSampler, Smarch, Spur, or Unigen2 always generate uniform samples?
- G2: SFpC’s evaluation. The second goal G2 is to evaluate the scalability and quality, in terms of validity and reliability, of our SFpC test. G2 is refined into Questions Q3-Q5:
- Q3: SFpC’s scalability. How much time and how many configurations does SFpC need to check the uniformity of a sampler on a model?
- Q4: SFpC’s validity. Does SFpC produce results consistent with the results obtained by other uniformity testing methods?
- Q5: SFpC’s reliability. When SFpC is applied repeatedly to the same model and sampler, are the results consistent?
5.1 Experimental setup
- The DIMACS files of the industrial SAT formulas and JHipster were retrieved from Plazar et al. (2019).
- The DIMACS file of LargeAutomotive was gathered from Krieter et al. (2018).
- The DIMACS file of DellSPLOT was obtained from Nöhrer and Egyed (2013).
- We generated all NNF files from their respective DIMACS files with the d-DNNF compiler d4 (Lagniez and Marquis 2017) that is embedded in KUS.
Model | #Variables | #Clauses | #SAT-Solutions |
---|---|---|---|
JHipster (Halin et al. 2019) | 45 | 104 | 26,256 |
axTLS 1.5.3 | 64 | 96 | \(3.924\cdot 10^{12}\) |
Fiasco 20140928 | 113 | 4,717 | \(5.144\cdot 10^{9}\) |
DellSPLOT (Nöhrer and Egyed 2013) | 118 | 2,181 | \(7.440\cdot 10^{6}\) |
uClibc 20150420 | 298 | 903 | \(7.503\cdot 10^{50}\) |
ToyBox 0.5.2 | 544 | 1,020 | \(1.450\cdot 10^{17}\) |
BusyBox 1.23.2 | 613 | 530 | \(7.428\cdot 10^{146}\) |
EmbToolkit 1.7.0 | 2,331 | 6,437 | \(3.961\cdot 10^{334}\) |
LargeAutomotive (Krieter et al. 2018) | 17,365 | 321,897 | \(5.260\cdot 10^{1,441}\) |
5.2 Q1: Scalability of samplers
Time (in seconds) each sampler needed to generate a sample of 1,000 configurations (“timeout” means the sampler did not finish within one hour):

Model | BDDSampler | KUS | QuickSampler | Smarch | Spur | Unigen2 |
---|---|---|---|---|---|---|
JHipster | 0.04 | 0.27 | 0.07 | 911.08 | 0.03 | 3.59 |
axTLS | 0.04 | 0.34 | 0.20 | 1,993.90 | 0.03 | timeout |
Fiasco | 0.07 | 0.45 | 1.47 | timeout | 0.06 | timeout |
DellSPLOT | 0.08 | 0.44 | 0.44 | 3,278.09 | 0.07 | 187.58 |
uClibc | 0.14 | 0.99 | 0.50 | timeout | 0.23 | timeout |
ToyBox | 0.25 | 1.25 | 0.78 | timeout | 0.09 | timeout |
BusyBox | 0.26 | 1.87 | 0.67 | timeout | 0.17 | timeout |
EmbToolkit | 2.61 | timeout | 4.62 | timeout | 9.15 | timeout |
LargeAutomotive | 12.07 | 119.26 | 77.06 | timeout | 24.57 | timeout |
5.3 Q2: Uniformity of samplers
p-values obtained when testing each sampler’s uniformity (\(H_0\): the sampler is uniform):

Model | BDDSampler | KUS | QuickSampler | Smarch | Spur | Unigen2 |
---|---|---|---|---|---|---|
JHipster | \(\sim 1\) | 0.99 | \(\sim 0\) | \(\sim 1\) | \(\sim 1\) | \(\sim 1\) |
axTLS | \(\sim 1\) | \(\sim 1\) | \(\sim 0\) | timeout | \(\sim 1\) | timeout |
Fiasco | \(\sim 1\) | \(\sim 1\) | 0.30 | timeout | \(\sim 1\) | timeout |
DellSPLOT | \(\sim 1\) | 0.99 | 0.85 | timeout | \(\sim 1\) | 0.96 |
uClibc | \(\sim 1\) | \(\sim 1\) | \(\sim 1\) | timeout | \(\sim 1\) | timeout |
ToyBox | \(\sim 1\) | \(\sim 1\) | \(\sim 0\) | timeout | \(\sim 1\) | timeout |
BusyBox | \(\sim 1\) | \(\sim 1\) | \(\sim 0\) | timeout | \(\sim 1\) | timeout |
EmbToolkit | \(\sim 1\) | timeout | \(\sim 1\) | timeout | \(\sim 0\) | timeout |
LargeAutomotive | \(\sim 1\) | timeout | \(\sim 0\) | timeout | \(\sim 1\) | timeout |
- As Fig. 13 shows, 16.4% of the KUS samples got a p-value in [0, 0.1]. Furthermore, in 15.89% of the cases, the p-values were less than \(\alpha =0.01\), and thus rejected the uniformity hypothesis. Figure 14 shows four examples where KUS’s uniformity was rejected. Each subfigure compares, for a particular model, the histogram of the SAT-solution distribution of the whole population (in blue) with the distribution of the generated sample (in red). Unfortunately, the rejected samples do not show any clear pattern that explains the causes of KUS’s failures. For instance, KUS exhibits difficulties with small models (blasted_case63) but also with large ones (blasted_squaring26), with normal distributions (blasted_case63 and s1238a_7_4) and non-normal distributions (s526_15_7 and blasted_squaring26), with left-skewed distributions (s1238a_7_4) and right-skewed distributions (blasted_case63), etc.
- In our previous evaluation (Heradio et al. 2020), we detected that Spur generated uniform samples for all models except EmbToolkit. We thought our test was making a Type 1 error, misjudging the sampler’s uniformity because an extremely low p-value had arisen by chance. However, when we checked the samplers’ uniformity with our new test, we obtained exactly the same result for this particular model, which raised our suspicions. We repeated the experiment one thousand times, and Spur never generated a uniform sample for EmbToolkit. Figure 15 shows the results for two of those repetitions. In this case, Spur’s error always displays the same pattern: the solutions in the sample have more variables assigned to true than in the population.
5.4 Q3: Scalability of the SFpC test
Model | Sample size for the SFpC test | Sample size for the FP test |
---|---|---|
JHipster | 4,664 | 5,994 |
axTLS | 6,314 | 7,198 |
Fiasco | 5,460 | 7,646 |
DellSPLOT | 3,889 | 9,131 |
uClibc | 10,987 | 13,047 |
ToyBox | 8,517 | 10,739 |
BusyBox | 17,738 | 18,041 |
EmbToolkit | 26,482 | 28,866 |
LargeAutomotive | 37,626 | 84,522 |
Model | Time (s) |
---|---|
JHipster | 0.03 |
axTLS | 0.07 |
Fiasco | 0.03 |
DellSPLOT | 0.10 |
uClibc | 0.38 |
ToyBox | 0.16 |
BusyBox | 0.41 |
EmbToolkit | 567.93 |
LargeAutomotive | 164.35 |
5.5 Q4: Validity of SFpC
Article | Uniform | Non-Uniform |
---|---|---|
Achlioptas et al. (2018) | Spur and Unigen2 | – |
Chakraborty et al. (2015) | Unigen2 | – |
Chakraborty et al. (2019) | Unigen2 | QuickSampler |
Oh et al. (2019) | Smarch and Unigen2 | – |
Plazar et al. (2019) | Unigen | QuickSampler |
Sharma et al. (2018) | KUS and Spur | – |
This paper (SFpC) | BDDSampler, Smarch, and Unigen2 | KUS, QuickSampler, and Spur |
5.6 Q5: Reliability of SFpC
6 Discussion
- Q2: Samplers’ uniformity. Three categories of samplers can be distinguished: (i) those that mostly fail to produce uniform samples (QuickSampler), (ii) those that usually work but occasionally generate non-uniform samples (KUS and Spur), and (iii) those that always produce uniform samples (BDDSampler, Smarch, and Unigen2). QuickSampler’s inability to generate uniform samples was previously reported by Chakraborty et al. (2019), Plazar et al. (2019), and Heradio et al. (2020). However, this paper is the first to detect problems with KUS and Spur. We attribute this finding to SFpC’s ability to test samplers’ uniformity on considerably more complex models than previous tests.
- Q3: SFpC’s scalability. SFpC is the most scalable uniformity test to date. It requires the smallest sample size of all existing tests, enabling the verification of samplers’ uniformity on large models even under the strictest quality settings (\(\alpha = 0.01\), \(\beta = 0.01\), and \(w = 0.1\)).
- Q4: SFpC’s validity. According to the results, SFpC’s judgments are consistent with the verdicts given by the alternative methods proposed in the literature.
- Q5: SFpC’s reliability. The results show that SFpC’s judgments are reliable, i.e., when SFpC is applied repeatedly to the same model and sampler, the conclusions reached are notably consistent.
- Sampling parallelization. Although any sampler can be run in a multi-core fashion, thus producing samples concurrently, only Unigen2 and Smarch were specifically designed for that. The focus of our evaluation is on the sampling techniques, not on how those techniques can be parallelized efficiently. Therefore, all samplers were run on a single thread.
- Use of preprocessing techniques. There are methods to preprocess a model’s Boolean formula to speed up further computations. For example, Ivrii et al. (2016) claim that sampling with the formula’s minimal independent support (MIS) yields a performance improvement of 2-3 orders of magnitude. Nevertheless, Plazar et al.’s (2019) empirical results contradict that claim, showing no running-time difference between sampling from the whole formula and from the MIS. In any case, we decided to focus on the sampling techniques, not on how additional preprocessing methods may affect them.
7 Conclusions
8 Material
- BDDSampler is available at https://github.com/davidfa71/BDDSampler
- The code scripts to replicate our experimental validation (i.e., to calculate each model’s sample size, run the samplers, and test the scalability/uniformity of the samplers) are available at https://github.com/rheradio/ConfSystSampling
- A detailed report on every research question in Section 5 is available at https://rheradio.github.io/ConfSystSampling
- The data of Experiments E1 and E2 (including the benchmark models in DIMACS/DDDMP/NNF formats, the generated samples, the goodness-of-fit test results, etc.) are available at https://doi.org/10.5281/zenodo.4514919
- The data of Experiment E3 are available at https://doi.org/10.5281/zenodo.5509947