More Privacy: Randomized Response
One way to define a privacy measure is through the concept of differential privacy. What is the privacy loss a person incurs by taking part in a survey rather than abstaining? Consider two sets of people differing by one person. What is that person’s privacy loss from being included in the sample surveyed? Denote the set with the person included as X and the set without as Y.
The privacy loss from being included in a survey is defined as the natural log of the ratio of two probabilities. The numerator of the ratio is the probability that the survey of X would yield the observed result, and the denominator is the corresponding probability for a survey of Y.
For certain classes of methods, one can define an upper bound on the privacy loss.
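To make the definition concrete, here is a minimal sketch in Python; the two probabilities below are hypothetical numbers standing in for the probability of the observed result when surveying X and when surveying Y.

from math import log

def privacy_loss(p_with, p_without):
    # ln of the ratio: probability of the observed result when the person
    # is included (X) over the probability when she is excluded (Y).
    return log(p_with / p_without)

# Hypothetical numbers: the observed result is twice as likely when the
# person is included, so the privacy loss is ln(2), about 0.69.
print(privacy_loss(0.5, 0.25))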
Let us say one is a member of a despised group A and might be penalized if that membership were revealed. How could a survey be designed to estimate the fraction of the population in group A, and what would be the privacy loss to the respondents? Or take a similar case where admitting to either membership or non-membership of group A might have adverse consequences.
One way to address this is randomized response. Each respondent tosses two fair coins. If both are heads, she replies that she belongs to A. If both are tails, she says not-A. Otherwise she answers truthfully.
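Here is a minimal sketch of one respondent’s answer under that rule, assuming Python with simulated coin tosses (the function name is my own choice).

import random

def randomized_response(truly_in_a):
    # Toss two fair coins.
    heads1 = random.random() < 0.5
    heads2 = random.random() < 0.5
    if heads1 and heads2:
        return True            # both heads: claim membership in A
    if not heads1 and not heads2:
        return False           # both tails: deny membership
    return truly_in_a          # one head, one tail: answer truthfully

print(randomized_response(True))   # a single respondent's (randomized) answer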
If the actual fraction of people in A is P and the survey measures a fraction f, then it is easy to see that P = 2(f - 1/4): a respondent answers "A" with probability 1/4 + P/2, so on average f = 1/4 + P/2. With some more work we can show that the upper bound on the privacy loss is ln(3): a member of A answers "A" with probability 3/4 and a non-member with probability 1/4 (and the reverse for "not-A"), so the ratio of probabilities never exceeds 3.
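A quick simulation illustrates both facts. The true fraction of 0.30 and the sample size of 100,000 are made-up values, and the respondent rule is the same two-coin scheme as above.

import random
from math import log

def respond(truly_in_a):
    # Two fair coins: HH -> say "A", TT -> say "not A", otherwise be truthful.
    h1, h2 = random.random() < 0.5, random.random() < 0.5
    if h1 and h2:
        return True
    if not h1 and not h2:
        return False
    return truly_in_a

true_p = 0.30                           # assumed true fraction in group A
n = 100_000                             # assumed number of respondents
answers = [respond(random.random() < true_p) for _ in range(n)]
f = sum(answers) / n                    # measured fraction answering "A"
print("estimated P:", 2 * (f - 0.25))   # inverts f = 1/4 + P/2, close to 0.30

# Worst-case ratio of response probabilities: a member says "A" with
# probability 3/4, a non-member with probability 1/4.
print("privacy loss bound:", log(0.75 / 0.25))   # ln(3), about 1.10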
This upper bound is only guaranteed for a one-time survey. If you could repeatedly poll the respondents, the upper bound would grow linearly with the number of repeat surveys.
Erlingsson et al. devised a scheme, RAPPOR, in 2014 (doi:10.1145/2660267.2660348) to address this problem of repetition. Essentially, it retains the respondent’s first randomized response, and in each successive survey applies randomized response again to that retained value rather than to the true answer. They can demonstrate that the total privacy loss never exceeds that of the first randomized response. The mechanism is ingenious and deserves more detail, but this post is getting quite long. Perhaps another day.
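As a rough sketch of just that memoization idea (this is not the full mechanism from the paper; the two-coin probabilities are carried over from the scheme above purely for illustration):

import random

def two_coin(answer):
    # The same two-coin randomized response as before, applied to any boolean.
    h1, h2 = random.random() < 0.5, random.random() < 0.5
    if h1 and h2:
        return True
    if not h1 and not h2:
        return False
    return answer

class MemoizedRespondent:
    def __init__(self, truly_in_a):
        # Randomize the true answer once and remember only that value.
        self.retained = two_coin(truly_in_a)

    def respond(self):
        # Every repeat survey randomizes the retained value, never the truth.
        return two_coin(self.retained)

alice = MemoizedRespondent(truly_in_a=True)
print([alice.respond() for _ in range(5)])   # repeated polls of the same person

Because later surveys only ever touch the already-randomized retained value, an observer learns nothing more about the true answer than the first response revealed.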
sidd