## 1 Introduction

For the release of microdata (i.e., record-level data on individuals or business establishments collected through surveys or censuses), a commonly-used approach by statistical agencies generates synthetic data from statistical models estimated on the original data (Rubin1993synthetic; Little1993synthetic). We call such a release mechanism $\xi(\theta \mid \mathbf{x})$, where $\mathbf{x}$ are the protected data and $\theta$ are the parameters used to generate the synthetic data. In particular, we consider the case where $\xi(\theta \mid \mathbf{x})$ is a posterior distribution. The Exponential Mechanism (EM) of McSherryTalwar2007 is a popular approach to generating draws of parameters, $\theta$, and associated synthetic data which are differentially private (Definition 2). For a Bayesian model, the EM is equivalent to exponentiating the likelihood by a specific scalar weight.

###### Definition 1

The Exponential Mechanism releases values of $\theta$ from a distribution proportional to,

$$\exp\left(\frac{\epsilon \, u(\mathbf{x}, \theta)}{2\Delta_u}\right) \xi(\theta), \qquad (1)$$

where $u(\mathbf{x}, \theta)$ is a utility function and $\xi(\theta)$ is a base or prior distribution. Let $\Delta_u = \sup_{\mathbf{x}, \mathbf{x}' \in \mathcal{X}^{n}:\, \delta(\mathbf{x}, \mathbf{x}') = 1} \, \sup_{\theta} \lvert u(\mathbf{x}, \theta) - u(\mathbf{x}', \theta) \rvert$ be the sensitivity, defined globally over $\mathcal{X}^{n}$, the algebra of datasets of size $n$ governed by the product measure, where $\delta(\mathbf{x}, \mathbf{x}')$ is the Hamming distance between databases $\mathbf{x}$ and $\mathbf{x}'$. Then each draw of $\theta$ from the exponential mechanism satisfies $\epsilon$-DP, where $\epsilon$ is the privacy budget supplied by the publishing statistical agency.
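As a concrete illustration (our own sketch, not from the paper; the candidate grid and utility values are toy assumptions), sampling from the EM over a discrete set of parameter values amounts to exponentiating scaled utilities and normalizing:

```python
import numpy as np

def exponential_mechanism(utilities, sensitivity, epsilon, rng):
    """Sample an index with probability proportional to exp(eps * u / (2 * Delta_u))."""
    scores = epsilon * np.asarray(utilities, dtype=float) / (2.0 * sensitivity)
    scores -= scores.max()          # numerical stabilization before exponentiating
    probs = np.exp(scores)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
utilities = [-3.0, -1.0, -0.5]      # toy utility of each candidate theta
idx = exponential_mechanism(utilities, sensitivity=1.0, epsilon=1.0, rng=rng)
```

Candidates with higher utility are exponentially more likely to be released, with the sensitivity tempering how sharply the distribution concentrates.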

The EM requires the availability of the sensitivity $\Delta_u$ for a chosen utility function $u$. WassermanZhou2010 and SnokeSlavkovic2018PSD construct utility functions that are naturally bounded over all $\mathbf{x} \in \mathcal{X}^{n}$; however, they are not generally applicable to any population model, and in the latter case are very difficult to implement in a computationally tractable way. Dimitrakakis:2017:DPB:3122009.3122020 propose the use of the log-likelihood as the utility function. Their proposed posterior mechanism achieves an $\epsilon$-DP guarantee for each posterior draw of $\theta$ if the log-likelihood utility function is Lipschitz continuous with bound $\Delta$ over the space of databases and the space of parameters; however, computing a finite $\Delta$ in practice, as acknowledged by Dimitrakakis:2017:DPB:3122009.3122020, is difficult for an unbounded parameter space.

To facilitate computing a finite bound for a wider range of synthesizing models, SavitskyWilliamsHu2020ppm propose a *pseudo* posterior mechanism, where the utility function is the log-pseudo likelihood with vector weights $\alpha = (\alpha_1, \ldots, \alpha_n) \in [0, 1]^{n}$. It generalizes the EM utility from a single scalar weight to by-record vector weights to create a new mechanism. This setup allows the control of the level of privacy protection at $(\epsilon = 2\Delta_\alpha)$, where $\Delta_\alpha$ is the Lipschitz bound for the weighted log-pseudo likelihood utility. This new mechanism is readily implementable for any population synthesizing model.

The new mechanism of SavitskyWilliamsHu2020ppm estimates an $\alpha$-weighted pseudo posterior distribution,

$$\xi^{\alpha}(\theta \mid \mathbf{x}) \propto \left[ \prod_{i=1}^{n} p(x_i \mid \theta)^{\alpha_i} \right] \xi(\theta), \qquad (2)$$

where the utility is $u^{\alpha}(\mathbf{x}, \theta) = \sum_{i=1}^{n} \alpha_i \log p(x_i \mid \theta)$. The associated DP definition for the weighted pseudo posterior mechanism extends Dimitrakakis:2017:DPB:3122009.3122020 with,

###### Definition 2

(Differential Privacy)

$$\sup_{\mathbf{x}, \mathbf{x}' \in \mathcal{X}^{n}:\, \delta(\mathbf{x}, \mathbf{x}') = 1} \; \sup_{B \in \beta_{\Theta}} \left\lvert \log \frac{\xi^{\alpha}(B \mid \mathbf{x})}{\xi^{\alpha}(B \mid \mathbf{x}')} \right\rvert \le \epsilon, \qquad (3)$$

where $\beta_{\Theta}$ denotes the algebra of measurable sets on the parameter space $\Theta$.

Definition 2 limits the change in the pseudo posterior distribution over all sets, $B \in \beta_{\Theta}$ (i.e., $\beta_{\Theta}$ is the algebra of measurable sets on $\Theta$), from the inclusion of a single record. Although the pseudo posterior distribution mass assigned to $B$ depends on $\mathbf{x}$, the guarantee is defined as the supremum over all $B \in \beta_{\Theta}$ and over all pairs $(\mathbf{x}, \mathbf{x}')$ which differ by one record (i.e., $\delta(\mathbf{x}, \mathbf{x}') = 1$).

In this article, we focus on two vector-weighted (pseudo posterior) synthesizers under DP for microdata dissemination. The first synthesizer, labeled Lipschitz-weighted (LW), sets each by-record weight, $\alpha_i$, to be inversely proportional to $\Delta_{\alpha,i}$, where $\Delta_{\alpha,i}$ represents the maximum absolute value of the log-likelihood for record $i$ over the space of parameter values. This maximum is referred to as the Lipschitz bound for record $i$. The second synthesizer, labeled count-weighted (CW), sets each by-record weight, $\alpha_i$, to be inversely proportional to $p_i$, where $p_i$ denotes the disclosure risk probability of record $i$. The probability $p_i$ is a measure of isolation based on counting the number of records that fall outside a radius around the true value of record $i$ (HuSavitskyWilliams2020rebds). In both synthesizers, the record-indexed vector weights are used to exponentiate the likelihood contributions, and the weights are designed to target high-risk records by downweighting their likelihood contributions.

We conduct two simulation studies with generated count data, a data type commonly collected by statistical agencies (e.g., employment). Results show that the vector-weighted synthesizer LW is the most “efficient" weighting scheme under an equivalent privacy budget $\epsilon$: it generates synthetic data with the highest utility.

We further propose a new re-weighting strategy that maximizes utility for any vector-weighted synthesizer: the strategy improves the utility of the vector-weighted synthesizer while maintaining an equivalent privacy budget. Specifically, it targets records whose Lipschitz values are relatively low compared to the Lipschitz bound for the database. The strategy increases the weights for those records, resulting in post re-weighting by-record Lipschitz bounds approaching the Lipschitz bound for the database.

The remainder of the article is organized as follows. Section 2 describes the key steps and the implementation algorithms of the two vector-weighted synthesizers under DP. We present two simulation studies in Section 3, then proceed to introduce the re-weighting strategy in Section 4 with an algorithm and a simulation study. Section 5 ends this article with a few concluding remarks.

## 2 Vector-weighted Synthesizer Algorithms

We enumerate the calculation of vector weights for the two synthesizers in Section 2.1 and Section 2.2, with Algorithm 1 and Algorithm 2, respectively.

Each algorithm starts by computing the weights, $\alpha = (\alpha_1, \ldots, \alpha_n)$, which are then used to construct the pseudo likelihood and estimate the pseudo posterior. Next, we draw parameters from the estimated pseudo posterior distribution and compute the overall Lipschitz bound, $\Delta_\alpha$. The resulting DP guarantee is $\epsilon = 2\Delta_\alpha$ and is “local" to the database, $\mathbf{x}$; the $\epsilon$ is *indirectly* controlled through the weights, $\alpha$.
SavitskyWilliamsHu2020ppm show that the local Lipschitz bound, $\Delta_\alpha$, contracts on the “global" Lipschitz bound over all databases of size $n$ as $n$ increases. Synthetic data are then generated from the corresponding data generating model using the parameter values drawn from the weighted pseudo posterior distribution in step 5 of each algorithm.
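In outline, the local guarantee can be read off from the weighted log-likelihood draws, as in this hypothetical sketch (the matrix name and shapes are our own assumptions; the paper performs this computation in R on Stan output):

```python
import numpy as np

def local_epsilon(weighted_log_lik):
    """weighted_log_lik: S x N matrix of alpha_i * log p(x_i | theta_s) over S draws."""
    # by-record Lipschitz bounds: maximum absolute contribution over posterior draws
    lipschitz_i = np.abs(weighted_log_lik).max(axis=0)
    delta_alpha = lipschitz_i.max()   # overall (local) Lipschitz bound
    return 2.0 * delta_alpha          # local DP guarantee: eps = 2 * Delta_alpha

rng = np.random.default_rng(3)
eps = local_epsilon(rng.normal(-2, 1, size=(100, 30)))
```

Because the weights shrink each record's log-likelihood contribution, lowering the weights lowers the bound and hence the budget, which is the sense in which $\epsilon$ is controlled only indirectly.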

### 2.1 Generating synthetic data under the LW vector-weighted synthesizer

We specify the algorithm for generating synthetic data under the LW weighted pseudo posterior distribution, which produces synthetic data with an $(\epsilon = 2\Delta_\alpha)$-DP guarantee for database $\mathbf{x}$.

To implement the LW vector-weighted synthesizer, we first fit an unweighted synthesizer and obtain the absolute value of the log-likelihood for each database record $i$ and each Markov chain Monte Carlo (MCMC) draw $s$ of $\theta$. A Lipschitz bound for each record, $\Delta_{\alpha,i}$, is computed by taking the maximum of the absolute log-likelihoods over the draws of $\theta$. We formulate by-record weights, $\alpha_i$, to be inversely proportional to the by-record Lipschitz bounds. See step 1 to step 4 in Algorithm 1. As can be seen in the algorithm, LW surgically downweights high-risk records in the data distribution. Moreover, LW *indirectly* sets the local DP guarantee, $\epsilon = 2\Delta_\alpha$, through the computation of the likelihood weights, $\alpha$. Sample R scripts of Algorithm 1 are available in the Appendix.
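Steps 1 to 4 can be sketched compactly as follows (a hypothetical Python rendering of the R scripts in the Appendix; the constants mirroring `c` and `g` there are agency-chosen and fixed here to illustrative defaults):

```python
import numpy as np

def lw_steps_1_to_4(log_lik, scale=1.0, shift=0.0):
    """log_lik: S x N matrix of by-record log-likelihoods over S MCMC draws."""
    # steps 1-2: the leave-one-out log-ratio for record i equals |log p(x_i | theta_s)|
    log_ratio = np.abs(log_lik)
    # step 3: by-record Lipschitz bound, rescaled linearly to [0, 1] "risks"
    lipschitz_i = log_ratio.max(axis=0)
    risks = (lipschitz_i - lipschitz_i.min()) / (lipschitz_i.max() - lipschitz_i.min())
    # step 4: weights inversely proportional to risks, truncated to [0, 1]
    return np.clip(scale * (1.0 - risks) + shift, 0.0, 1.0)

rng = np.random.default_rng(2)
weights = lw_steps_1_to_4(rng.normal(-3, 1, size=(200, 50)))
```

Records whose log-likelihood contributions are largest in magnitude (the riskiest ones) receive weights near zero, while the bulk of the data keeps weights near one.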

### 2.2 Generating synthetic data under the CW vector-weighted synthesizer

Next, we present the algorithm of HuSavitskyWilliams2020rebds for generating synthetic data under the CW weighted synthesizer. The weights are estimated as probabilities of identification disclosure based on assumptions about the behavior of a putative intruder seeking to identify a target record.

To compute a weight for each record, $i$, we first calculate its estimated probability of identification disclosure. We assume that an intruder knows the data value of the record she seeks and that she will randomly choose among records that are close to that value. More formally, we cast a ball, $B(x_i, r)$, around the true value of $x_i$ for record $i$ with a radius $r$. A record with few other records inside its ball is viewed as “isolated" and, therefore, vulnerable to identification. The radius, $r$, is a policy hyperparameter set by the agency who owns the closely-held data. We count the number of records falling *outside* of the radius and take the ratio of this count over the total number of records, a proportion that we label the risk probability of identification, $p_i$. We then formulate by-record weights, $\alpha_i$, that are inversely proportional to the by-record risk probabilities. See step 1 to step 4 in Algorithm 2.
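The risk probability computation can be sketched as follows (a hypothetical Python rendering; the radius is the agency-set policy hyperparameter, and the weight construction only loosely mirrors Algorithm 2):

```python
import numpy as np

def cw_weights(x, radius):
    """By-record weights inversely proportional to isolation-based risk probabilities."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # p_i: proportion of records falling OUTSIDE the ball of given radius around x_i
    p = np.array([np.sum(np.abs(x - xi) > radius) / n for xi in x])
    # weights inversely proportional to risk, truncated to [0, 1]
    return np.clip(1.0 - p, 0.0, 1.0)

x = np.array([1, 2, 2, 3, 50])   # the value 50 sits alone, far from the others
w = cw_weights(x, radius=2.0)
```

The isolated record (value 50) has nearly all other records outside its ball, so its risk probability is high and its weight is pushed toward zero.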

Even though the weights under CW are computed based on assumptions about intruder behavior, we are still able to compute its DP guarantee, because SavitskyWilliamsHu2020ppm show that any weighting scheme with $\alpha_i \in [0, 1]$ produces this formal privacy guarantee.

## 3 Comparison of Performances on Simulated Data

We demonstrate the properties of both vector-weighted synthesizers using two sets of simulated data of size 1000: i) simulated data from a Poisson distribution, which are nearly symmetric to slightly skewed due to the large value of the Poisson mean; and ii) simulated data from a mixture of two negative binomial distributions, each governed by an over-dispersion parameter under which the variance is allowed to be larger than the mean, which produces data with a skewed distribution. For all results, we label the vector-weighted synthesizer in Section 2.1 as LW, and that in Section 2.2 as CW. For comparison, we include a scalar-weighted synthesizer that applies the same weight to every record, calibrated to the local Lipschitz bound of the unweighted synthesizer; this scheme has been shown equivalent to the EM under a log-likelihood utility (SavitskyWilliamsHu2020ppm) and produces an $(\epsilon = 2\Delta_\alpha)$-DP guarantee. This allows us to set the EM DP privacy guarantee equal to that for the LW vector-weighted synthesizer, so that we may compare their relative utility performances. Results based on this weighting method are labeled “SW" (scalar-weighted). Finally, we include the unweighted synthesizer, labeled “Unweighted", which is a Poisson synthesizer for the Poisson-simulated data and a negative binomial synthesizer for the mixture of negative binomial-simulated data.
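The exact simulation parameters are not reproduced here; as a purely illustrative stand-in (all parameter values below are hypothetical, not the paper's), the two designs could be generated as:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000

# i) nearly symmetric count data: Poisson with a large mean (value hypothetical)
y_pois = rng.poisson(lam=50, size=n)

# ii) highly skewed count data: mixture of two negative binomials
#     (component parameters and mixture weight are hypothetical)
comp = rng.random(n) < 0.7
y_mix = np.where(comp,
                 rng.negative_binomial(n=5, p=0.5, size=n),    # low-mean component
                 rng.negative_binomial(n=2, p=0.05, size=n))   # long right tail
```

The second component's small success probability inflates its variance well above its mean, producing the heavy right tail that stresses the synthesizers in Section 3.2.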

All three weighted synthesizers are scaled to have an equivalent local $\Delta_\alpha$, the overall Lipschitz bound; that is, they all satisfy the same $(\epsilon = 2\Delta_\alpha)$-DP guarantee. This allows us to compare their by-record patterns of downweighting and their risk performances. All model estimation is done in Stan (Rstan). The Stan script for a weighted Poisson synthesizer is available in the Appendix.

### 3.1 Less skewed simulated data

Figure (a) plots distributions of the record-level Lipschitz bounds, $\Delta_{\alpha,i}$. The overall Lipschitz bound, $\Delta_\alpha$, is the maximum Lipschitz bound across all records, i.e., the maximum value of each violin plot on the y-axis for each synthesizer, weighted or unweighted. All three weighted synthesizers have equivalent $\Delta_\alpha$'s, by design, to allow comparisons of the weighting distributions and their utility performances. All three overall Lipschitz bounds are substantially lower than that of Unweighted, indicating that the weights, $\alpha$, may be used to control the DP privacy guarantee. Figure (b) plots the associated distribution of record-level weights, $\alpha_i$.

When the data distribution is less skewed, we see in Figure (a) that more of the records for LW express relatively high values of $\Delta_{\alpha,i}$, concentrated around the overall Lipschitz bound, $\Delta_\alpha$. This suggests that LW generally avoids overly downweighting in a fashion that would produce lower by-record values. The overall privacy guarantee is governed solely by $\Delta_\alpha$, the *maximum* value of the by-record Lipschitz bounds, such that having records with even lower bounds does not improve the DP privacy guarantee.

Moreover, for LW, the distribution of the weights $\alpha_i$ in Figure (b) is skewed towards values closer to 1, which indicates that we expect relatively good utility performance in the resulting synthetic data in terms of preserving the real data distribution (HuSavitskyWilliams2020rebds). Figure (a) and Figure (b) together indicate a relatively efficient design with large values of the weights and concentrated Lipschitz bounds.

By contrast, CW shows relatively more records where $\Delta_{\alpha,i}$ and $\alpha_i$ are low relative to those of LW, which means that CW overly downweights more records and is less targeted than LW. This suggests that CW is a less efficient weighting scheme that requires more downweighting to achieve the same DP privacy guarantee.

SW utilizes a scalar weight, and most of its records show Lipschitz bounds much lower than the overall Lipschitz bound that governs the DP privacy guarantee. Since all records are equally downweighted, this synthesizer has the effect of reducing the effective sample size, which would be expected to increase uncertainty in the estimation of model parameters, such as the mean parameters in the Poisson and negative binomial models.

For utility evaluation, we include violin plots of the 15th and 90th quantile estimates of the generated synthetic data from all synthesizers. For comparison, we include those from the data, labeled “Data", in Figure (a). We can see that, between the two vector-weighted synthesizers, CW performs relatively worse, with notably biased quantile estimates. This is because CW, compared to LW, more heavily downweights the mass of the data distribution (Figure (a) and Figure (b)), which translates into its reduction of utility in Figure (a). These results suggest that LW is a more efficient weighting scheme than the probability-based weighting scheme of CW. An additional utility plot of mean and median estimates with a similar outcome is available in the Appendix.

While the scalar-weighted synthesizer SW produces synthetic data that reasonably well preserve the quantiles of the real data distribution in Figure (a), when we look at Figure (b) of the posterior density of the mean parameter of the real data, the credibility interval for SW is much wider than those of the two vector-weighted schemes, indicating a substantial loss of information and inferential power. Going back to Figure (b), we see that SW uses a scalar weight around 0.375 for every record (to achieve a $\Delta_\alpha$ equivalent to LW and CW), which is substantially lower than most of the weights of LW and CW. This suggests that SW has the effect of reducing the effective sample size with a scalar weight. Figure (b) also shows the utility distortion induced by CW due to overly downweighting relatively low-risk records.

### 3.2 Highly skewed simulated data

When the data distribution is highly skewed, the comparison of by-record Lipschitz bound distributions and that of by-record weights shows similar trends as in the Poisson simulation; see Figure (b) and Figure (a).

However, utility plots in Figure (c) to Figure (e) show, for the two vector-weighted synthesizers, that: i) LW does not achieve high utility in the skewed data compared to when the data are less skewed; it overestimates the 15th quantile and underestimates the 90th quantile, indicating worse control of the tails. However, because the mass of the distribution is not much downweighted, the mean and median estimates are relatively robust; ii) CW’s utility performance also worsens and is, in some situations, unacceptable; the quantile estimates are more extreme than those of LW, indicating even worse control of the tails. Moreover, its mean and median are overestimated, indicating worse control of the distribution mass. As in Section 3.1, the scalar-weighted SW achieves correctly-centered estimates but with too much estimation uncertainty.

Figure (f) compares data distribution plots across the true, closely-held data and the four synthesizers. Similar to the previous discussion, the vector-weighted synthesizers, LW and CW, concentrate downweighting in the tails, resulting in synthetic data with shorter tails on both sides. CW downweights the tails more severely for the same privacy protection.

We also see competing effects of SW that distort the data distribution in Figure (f). On the one hand, SW is based on the riskiest records in the tails and downweights those likelihood contributions, which will pull in or truncate the tails. On the other hand, since the weight is the same for all records, the main mode of the data distribution is flattened more than under the vector-weighted synthesizers LW and CW, which then reverses some of the truncation in the tails. This also explains SW's relatively good performance on the quantiles in Figure (c).

One way to improve utility is to increase the weights (HuSavitskyWilliams2020rebds; SavitskyWilliamsHu2020ppm). At the same time, we want to keep an equivalent level for the overall Lipschitz bound; otherwise, we can no longer maintain the same DP guarantee. We now turn to a re-weighting strategy that maximizes the utility of any vector-weighted synthesizer while maintaining an equivalent DP guarantee.

## 4 Re-weighting strategy to maximize utility for any privacy budget

### 4.1 Motivation and proposed method

To motivate our re-weighting strategy, we revisit Figure (b) and focus on the two vector-weighted synthesizers, LW and CW. We know that the DP guarantee is controlled by the maximum Lipschitz bound, $\Delta_\alpha$. As long as the maximum of the by-record Lipschitz bounds stays unchanged, we can increase the by-record Lipschitz bounds to be closer to that maximum. Increased Lipschitz bounds are associated with increased weights, which will result in improved utility (HuSavitskyWilliams2020rebds; SavitskyWilliamsHu2020ppm).

Figure 4 shows scatter plots of the by-record Lipschitz bound, $\Delta_{\alpha,i}$, against the by-record weight, $\alpha_i$, for LW and CW. In each case, the red dashed line indicates the maximum Lipschitz bound, $\Delta_\alpha$. For LW, while a small number of records have low Lipschitz bounds, the majority have relatively high Lipschitz bounds with high weights. By contrast, CW has many records with low Lipschitz bounds, and many of them also have low weights. These observations serve as further justification of why LW is a more efficient vector-weighted scheme than CW: LW does a relatively good job of targeting high-risk records and downweighting them, while CW overly downweights lower-risk records and is less targeted than LW (to achieve the same privacy protection).

Figure 4 reveals that both vector-weighted synthesizers could improve their weighting efficiency in achieving a given maximum Lipschitz bound under a DP risk criterion. We can increase the weights so that the by-record Lipschitz bounds, $\Delta_{\alpha,i}$, move closer to the red dashed line. In this way, we can improve the utility performance of LW and CW while maintaining an equivalent $\Delta_\alpha$. In the limit, the best efficiency that may be achieved by a vector-weighted scheme is one where the plot of by-record weights on the x-axis against by-record Lipschitz bounds on the y-axis is horizontal; that is, there is no relationship between the weights and the Lipschitz bounds.

Our re-weighting strategy calculates re-weighted weights by:

$$\alpha_i^{\text{new}} = c \times \frac{\Delta_\alpha}{\Delta_{\alpha,i}} \times \alpha_i, \qquad (4)$$

where $\Delta_\alpha$ is the maximum Lipschitz bound and $\Delta_{\alpha,i}$ is the Lipschitz bound for record $i$; $\alpha_i$ is the weight before re-weighting, and a constant, $c$, is used to ensure that the final maximum Lipschitz bound remains equivalent before and after this re-weighting step. The implementation is given in Algorithm 3, which should be inserted between step 4 and step 5 in Algorithm 1 and Algorithm 2 for LW and CW, respectively.
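A minimal sketch of this re-weighting step (our own Python rendering; the truncation of re-weighted values to $[0, 1]$ is our added assumption, since weights must remain valid):

```python
import numpy as np

def reweight(alpha, lipschitz_i, c=1.0):
    """Scale up weights of records whose Lipschitz bounds sit below the maximum."""
    delta_alpha = lipschitz_i.max()
    alpha_new = c * (delta_alpha / lipschitz_i) * alpha
    return np.clip(alpha_new, 0.0, 1.0)   # weights stay in [0, 1]

alpha = np.array([0.9, 0.5, 0.7])
lipschitz_i = np.array([4.0, 1.0, 2.0])
w = reweight(alpha, lipschitz_i)
```

The record already at the maximum bound keeps its weight, while records with lower bounds are scaled up, pushing their post re-weighting Lipschitz values toward the bound that governs the guarantee.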

### 4.2 Application to highly skewed data

We demonstrate our re-weighting strategy on the highly skewed negative binomial mixture data, for which we saw unsatisfactory utility outcomes for LW and CW in Section 3.2. The re-weighting constant is chosen to produce an equivalent overall Lipschitz bound. The re-weighted synthetic data results are labeled “LW_final" and “CW_final", for LW and CW respectively.

Figure (a) and Figure (b) show the before vs. after re-weighting scatter plots of Lipschitz bounds and weights. As evident in Figure (a), the curve showing the Lipschitz-weight relationship becomes nearly horizontal at the maximum Lipschitz bound from LW to LW_final, which indicates maximum efficiency. The curve in Figure (b) becomes much less vertical from CW to CW_final, indicating much improved efficiency.

Figure (a) confirms that our re-weighting strategy increases the weighting efficiency of LW and CW: the by-record Lipschitz bounds are increased while maintaining an equivalent maximum Lipschitz bound. At the record level, Figure (b) illustrates that every record receives a higher weight from LW to LW_final, and from CW to CW_final. We get further confirmation from the weight plots in Figure (c) and Figure (d), which show that weights increase after re-weighting.

Now turning to the utility performances, Figure (a) and Figure (b) show a huge improvement in the utility of CW after re-weighting for all estimates of the generated synthetic data. Its tendency to overly downweight the tails before re-weighting, as we saw in Section 3.2, is greatly mitigated by the re-weighting strategy: the extreme quantiles, the mean, and the median estimates are much more accurate. The improvement in the utility of LW is less impressive; however, we can certainly see improvement in estimating the mean, median, and 90th quantile.

Moreover, the utility of the mean parameter estimation in Figure (c) improves after re-weighting for LW and CW, with a bigger improvement for CW. When we turn to violin plots of one synthetic dataset across the data and the synthesizers in Figure (d), we see that the re-weighting strategy improves the tails in LW_final and CW_final, compared to LW and CW, respectively. This reduced downweighting of the distribution tails is a feature of the re-weighting strategy for any vector-weighted scheme applied to highly skewed data.

## 5 Concluding Remarks

In this article, we demonstrate that the LW vector-weighted scheme provides better utility for equivalent privacy protection than either CW vector-weighted or the scalar-weighted SW. When the data distribution is highly skewed, more records express high risk, especially in the right tail. LW performs worse than under less skewed data in reproducing the tail, though it keeps the mass of the data distribution relatively unaffected.

We propose a re-weighting strategy that improves utility of any vector-weighted scheme in the difficult case of a highly-skewed data distribution, while maintaining an equivalent privacy budget. Applied to both LW and CW, this strategy improves their weighting efficiency by increasing by-record weights that produce Lipschitz bounds below the maximum bound. Improved weighting efficiency substantially mitigates the tendency for vector-weighted schemes to overly downweight the tails, especially for CW. Using re-weighting induces a contraction of *any* vector-weighted scheme towards the most efficient scheme under a DP formal privacy guarantee.

## References

## Appendix

### 1. R scripts of Algorithm 1 in Section 2.1

Computing the weights, $\alpha$

The object `stan_estimate_unweighted` below is the Stan output of the unweighted synthesizer.

```r
library(Rfast)  # for rowMaxs(), colMaxs()

## step 1
log_lik <- stan_estimate_unweighted$log_lik ## S x N matrix
N <- ncol(log_lik)
S <- nrow(log_lik)
log_ratio <- matrix(0, S, N)
log_ratio_theta <- matrix(0, S, 1)
pos <- rep(TRUE, N)
## step 2
for (s in 1:S) {
  log_like_xs <- sum(log_lik[s, ]) ## full data
  for (i in 1:N) {
    pos_i <- pos
    pos_i[i] <- FALSE
    log_like_xsi <- sum(log_lik[s, pos_i])
    log_ratio[s, i] <- abs(log_like_xs - log_like_xsi)
  }
}
log_ratio_theta <- rowMaxs(log_ratio, value = TRUE)
L <- quantile(log_ratio_theta, thresh) ## thresh: quantile level set by the agency
## step 3
log_ratio_data <- colMaxs(log_ratio, value = TRUE) ## by-record Lipschitz bounds
f_linres <- function(x) { (x - min(x)) / (max(x) - min(x)) }
risks <- f_linres(log_ratio_data)
## step 4: c and g are scaling and shift constants chosen by the agency
weights <- c * (1 - risks) + g
weights[weights <= 0] <- 0
weights[weights >= 1] <- 1
```

Computing the Lipschitz bound, $\Delta_\alpha$

The object `stan_estimate_weighted` below is the Stan output of the weighted synthesizer. Step 5 is done by Stan estimation.

```r
## step 6
log_lik <- stan_estimate_weighted$log_lik ## S x N matrix
N <- ncol(log_lik)
S <- nrow(log_lik)
log_ratio <- matrix(0, S, N)
log_ratio_theta <- matrix(0, S, 1)
pos <- rep(TRUE, N)
for (s in 1:S) {
  log_like_xs <- sum(log_lik[s, ]) ## full data
  for (i in 1:N) {
    pos_i <- pos
    pos_i[i] <- FALSE
    log_like_xsi <- sum(log_lik[s, pos_i])
    log_ratio[s, i] <- abs(log_like_xs - log_like_xsi)
  }
}
## step 7
log_ratio_theta <- rowMaxs(log_ratio, value = TRUE)
L <- quantile(log_ratio_theta, thresh)
```

### 2. Stan script for a weighted Poisson synthesizer

```stan
functions {
  real wt_pois_lpmf(int[] y, vector mu, vector weights, int n) {
    real check_term;
    check_term = 0.0;
    for (i in 1:n) {
      check_term += weights[i] * poisson_log_lpmf(y[i] | mu[i]);
    }
    return check_term;
  }
  real wt_poisi_lpmf(int y_i, real mu_i, real weights_i) {
    real check_term;
    check_term = weights_i * poisson_log_lpmf(y_i | mu_i);
    return check_term;
  }
}
data {
  int<lower=1> n;
  int<lower=1> K;
  int<lower=0> y[n];
  vector<lower=0>[n] weights;
  matrix[n, K] X;
}
transformed data {
  vector<lower=0>[K] zeros_beta;
  zeros_beta = rep_vector(0, K);
}
parameters {
  vector[K] beta;
  vector<lower=0>[K] sigma_beta;
  cholesky_factor_corr[K] L_beta;
}
transformed parameters {
  vector[n] mu;
  mu = X * beta;
}
model {
  L_beta ~ lkj_corr_cholesky(6);
  sigma_beta ~ student_t(3, 0, 1);
  beta ~ multi_normal_cholesky(zeros_beta, diag_pre_multiply(sigma_beta, L_beta));
  target += wt_pois_lpmf(y | mu, weights, n);
}
generated quantities {
  vector[n] log_lik;
  for (i in 1:n) {
    log_lik[i] = wt_poisi_lpmf(y[i] | mu[i], weights[i]);
  }
}
```
