LACE = Large-scale Assurance of Confidentiality Environment
Ideally, we can learn lessons from software projects across multiple organizations. However, a major impediment to such knowledge sharing are the privacy concerns of software development organizations.
Extracting project data from organizations is often very difficult due to the business sensitivity associated with the data. For example, for eleven years Tim Menzies worked at NASA’s software IV&V center. NASA divides its IV&V work between multiple contractors who all work on similar projects. Every three years, all the contracts are subject to competitive bidding. Hence, those contractors are very reluctant to share data about (say) the effort associated with finding mission-critical errors in manned spacecraft, lest that data is used against them during the bidding process.
To convince data collectors/curators to publish their data, the LACE provides them with algorithms (MORPH, CLIFF and LeaF) to privatize their data in such a way as to prevent the competition from learning specific sensitive metric values from their released data; and ensures that the privatized data remain useful for research/classification purposes.
LACE applies novel data privatization algorithms for two data sharing scenarios:
- The case where one data owner wants to share their data; and
- The case where multiple data owners want to collaborate with each other and share their data collectively.
LACE also includes a means to measure the level of sensitive attribute disclosure of a privatized data candidate. This is called IPR, the Increased Privacy Ratio. We cannot claim absolute data privacy with LACE, however, with IPR we can show a data curator how protected the sensitive attribute values of their data will be against sensitive attribute disclosure attacks. It is then up to them to decide whether or not to share their data based on the IPR.
MORPH for Constrained Obfuscation
MORPH is the randomization algorithm in the LACE framework. It exploits the difference between the goals of an adversary and the goals of utility by randomly changing data instances such that they never move across the boundary between the original instance and instances of another class.
The MORPHed instances are created with the following equation:
Here x ∈ D, is the original data point to be changed, y is the resulting MORPHed data point, and z ∈ D is the nearest unlike neighbor of x, that is x's nearest neighbor
(via Euclidean Distance) with a different class label than x. Finally, the random number r is calculated with the property:
This work was published for ICSE 2012 with the title,
Privacy and Utility for Defect Prediction: Experiments with "MORPH"
CLIFF for Data Minimization
One core aspect of LACE is to look at less data. CLIFF is used for data minimization. It assumes tables of training data can be divided into classes. For example, for a table of defect data containing code metrics, different rows might be labeled accordingly (defective or not defective).
CLIFF executes as follows:
- For each column of data, find the power of each attribute sub-range; i.e., how much more frequently that sub-range appears in one class more than any other.
- To control the amount of instances left by CLIFF, we find the product of the powers for each row then,
- Remove the less powerful rows. The result is a reduced data set with fewer rows. In theory, this reduced data set is less susceptible to privacy threats such as sensitive attribute disclosure.
LeaF for Selective Sharing
LeaF is a leader-follower algorithm. It is an online incremental technique for clustering data. The cluster centers are the “leaders” and all other instances are the “followers”. For this work we are only interested in the leaders. The basic algorithm works as follows: (1) initialize cluster centers, then (2) for each instance in the data, find its nearest center. If the distance to the center is less than a user defined distance, then (3) update the cluster. Otherwise, create a new cluster with the instance as the center.
In the scenario of collaboration among multiple data owners, LeaF is applied to each instance selected by CLIFF from the party’s data set to determine if it should be included in the private cache. To fit our work, we adapt LeaF in four ways:
- The cluster centers are never updated to create centroids.
- The initial centers are the two instances I1, I2 that are farthest from each other in the data. We call the distance between Ii and I2, the separation of the data.
- We use the value d = separation/10 to determine if data from a data curator should be included in the private cache (new data is added to the cache if it falls more than d away from its nearest exemplar and their class attribute values are different.)
- Prior to a new instance being added to the cache, it is first MORPHed. LeaF is only used by the data owner who is the initiator for multiparty data sharing. All other data owners that follow, use the private cache. They input their CLIFFed data and the current private cache. The result is an updated cache whose new MORPHed exemplars adhere to the conditions for entering the private cache.
IPR: Increased Privacy Ratio
For IPR, we assume the role of an adversary armed with some background knowledge from the original data set and also supplied with the private data set. When the adversary predicts the sensitive attribute value of a target we use the original data to see if the prediction is correct. If it is we consider this as sensitive attribute disclosure otherwise it is not.