The goals of data privacy and utility (the usefulness of data for software engineering tasks) conflict. Data privacy seeks to avoid unwanted information disclosure. Many privacy algorithms achieve privacy, but at a cost: the degradation of data utility (e.g., for defect prediction or test coverage).
The more private a data set is made, the less useful it becomes.
My work so far has focused on finding the balance between utility and data privacy for defect data in software engineering.
Using instance pruning with CLIFF in combination with synthetic data generation with MORPH, I have found promising results. In fact, experiments have uncovered exceptions to the rule depicted to the left.
Yes, CLIFF and MORPH show that it is possible to privatize data while not only maintaining utility comparable to the non-privatized data, but in some cases surpassing it. This work will appear in a future issue of IEEE Transactions on Software Engineering (TSE).
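To make the MORPH side of this concrete, here is a minimal sketch of a MORPH-style perturbation: each instance is nudged a random fraction away from its nearest unlike neighbor (an instance with a different class label), so no original row survives verbatim while class boundaries are largely preserved. The function name, signature, and the 0.15-0.35 perturbation range are illustrative assumptions, not the published implementation.

```python
import numpy as np

def morph(X, y, rng=None, beta=(0.15, 0.35)):
    """MORPH-style perturbation sketch (illustrative, not the published code).

    Each instance is moved a random fraction (drawn from `beta`) away from
    its nearest unlike neighbor, obscuring the original values while keeping
    instances on the same side of the class boundary.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_new = X.copy()
    for i, (xi, yi) in enumerate(zip(X, y)):
        unlike = X[y != yi]  # instances with a different class label
        if len(unlike) == 0:
            continue  # no unlike neighbor exists; leave the instance as-is
        # Nearest unlike neighbor by squared Euclidean distance.
        z = unlike[np.argmin(((unlike - xi) ** 2).sum(axis=1))]
        r = rng.uniform(*beta)
        X_new[i] = xi + (xi - z) * r  # push away from the boundary
    return X_new
```

In this sketch, every perturbed instance differs from its original, yet each stays closer to its own class than to the unlike neighbor it was pushed away from, which is why downstream learners (e.g., defect predictors) can still find the same boundary.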
This site was developed to share my continued work and interest in privacy issues in software engineering. Of particular interest is how privacy is measured.
The privacy literature is rife with varying privacy metrics, making it difficult to compare the effectiveness of one privacy algorithm against another. It would be interesting to see whether applying these different privacy metrics yields similar or opposing results, and, in the latter case, which metric a user should trust.