Data Privacy in Software Engineering

The goals of data privacy and utility (the usefulness of data for software engineering tasks) conflict. Data privacy seeks to avoid unwanted information disclosure, and many privacy algorithms achieve it only at a cost: degraded data utility for tasks such as defect prediction or test coverage.

The more private a data set is made, the less useful it becomes.

My work so far has focused on finding the balance between utility and data privacy for defect data in software engineering.

Using instance pruning with CLIFF in combination with synthetic data generation with MORPH, I have obtained promising results. In fact, the experiments have found exceptions to the rule above.

Yes, CLIFF and MORPH show that it is possible to privatize data and not only maintain utility comparable to that of the non-privatized data but, in some cases, surpass it. This work will appear in a future issue of IEEE Transactions on Software Engineering (TSE).
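
To make the combination concrete, below is a minimal Python sketch of the two ideas: prune away the least "powerful" instances (the CLIFF idea), then push every surviving row away from its nearest unlike neighbor (the MORPH idea). This is not the published implementation; the power score, bin count, keep rate, and perturbation range are illustrative choices only.

```python
import numpy as np

def cliff_prune(X, y, keep=0.4, bins=10):
    """Instance pruning in the spirit of CLIFF (simplified sketch).

    Rows whose discretized attribute values occur mostly within their own
    class are treated as more 'powerful' and kept; the rest are pruned.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, d = X.shape
    # Discretize each attribute into equal-width bins.
    binned = np.empty_like(X, dtype=int)
    for j in range(d):
        edges = np.histogram_bin_edges(X[:, j], bins=bins)
        binned[:, j] = np.digitize(X[:, j], edges[1:-1])
    # Score each row: product over attributes of
    # P(bin | same class) / (P(bin | same class) + P(bin | other class)).
    scores = np.zeros(n)
    for i in range(n):
        same, other = (y == y[i]), (y != y[i])
        p = 1.0
        for j in range(d):
            like = np.mean(binned[same, j] == binned[i, j])
            dislike = np.mean(binned[other, j] == binned[i, j])
            p *= like / (like + dislike + 1e-12)
        scores[i] = p
    # Keep the top `keep` fraction of rows within each class.
    keep_idx = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        k = max(1, int(keep * len(idx)))
        keep_idx.extend(idx[np.argsort(scores[idx])[-k:]])
    keep_idx = np.sort(keep_idx)
    return X[keep_idx], y[keep_idx]

def morph(X, y, alpha=(0.15, 0.35), rng=None):
    """Synthetic perturbation in the spirit of MORPH (simplified sketch).

    Each instance is pushed a random distance away from its nearest unlike
    neighbor, so no original row survives unchanged while class boundaries
    are largely preserved.
    """
    rng = np.random.default_rng(rng)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    out = X.copy()
    for i in range(len(X)):
        unlike = np.where(y != y[i])[0]
        if len(unlike) == 0:
            continue
        dists = np.linalg.norm(X[unlike] - X[i], axis=1)
        z = X[unlike[np.argmin(dists)]]           # nearest unlike neighbor
        r = rng.uniform(*alpha, size=X.shape[1])  # random perturbation factors
        out[i] = X[i] + r * (X[i] - z)            # move away from the boundary
    return out, y
```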

Result of a privacy vs. utility experiment using defect data (ant-1.3) from the PROMISE repository. Circles represent CLIFF and MORPH, squares k-anonymity, triangles data swapping, and the star the non-privatized data. The red horizontal line marks the utility (defect prediction performance) of the Naive Bayes defect model, and the red vertical line marks the privacy threshold. Points above and to the right of these lines are both private enough (over 80%) and perform adequately (as well as or better than the original data).
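
The utility axis of a plot like this can be reproduced, roughly, by scoring a Naive Bayes defect model before and after privatization. The sketch below is only illustrative: the file name, the "bug" column, and the use of recall as the utility measure are assumptions, and it reuses the cliff_prune and morph sketches from above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score

# Placeholder file and column names for a PROMISE-style defect data set:
# numeric static-code metrics as features, plus a defect count per module.
data = pd.read_csv("ant-1.3.csv")
X = data.select_dtypes("number").drop(columns=["bug"]).to_numpy(dtype=float)
y = (data["bug"] > 0).to_numpy(dtype=int)  # defective vs. non-defective

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

def utility(X_tr, y_tr):
    """Recall (probability of detection) of a Naive Bayes defect model
    trained on the given data and tested on held-out original rows."""
    model = GaussianNB().fit(X_tr, y_tr)
    return recall_score(y_test, model.predict(X_test))

# Compare the original training data against its privatized counterpart,
# built with the cliff_prune and morph sketches above.
X_priv, y_priv = morph(*cliff_prune(X_train, y_train))
print(f"utility on original data:   {utility(X_train, y_train):.2f}")
print(f"utility on privatized data: {utility(X_priv, y_priv):.2f}")
```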


This site was developed to share my continued work and interest in privacy issues in software engineering. Of particular interest is how privacy is measured.

The privacy literature is rife with varying privacy metrics, making it difficult to compare the effectiveness of one privacy algorithm against another. It would be interesting to see whether applying these different metrics yields similar or opposing results, and, in the latter case, to ask which metric a user should trust.
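
To make that concern concrete, here are two common styles of privacy measure, sketched in Python, that could be computed on the same privatized table: a k-anonymity-style count of the smallest group of rows sharing quasi-identifier values, and a nearest-neighbor re-identification rate estimating how often an attacker can link a privatized row back to its source. Both functions are illustrative, not standard library APIs, and the second assumes row i of the privatized data was derived from row i of the original.

```python
import numpy as np

def k_anonymity_level(quasi_ids):
    """Size of the smallest equivalence class over the quasi-identifier
    columns: the data set is k-anonymous for this k."""
    _, counts = np.unique(np.asarray(quasi_ids), axis=0, return_counts=True)
    return int(counts.min())

def reidentification_rate(original, privatized):
    """Fraction of privatized rows whose nearest original row (Euclidean
    distance) is the row they were derived from, i.e. an attacker's
    linkage success rate. Assumes row-wise correspondence."""
    original = np.asarray(original, dtype=float)
    privatized = np.asarray(privatized, dtype=float)
    hits = 0
    for i, row in enumerate(privatized):
        nearest = int(np.argmin(np.linalg.norm(original - row, axis=1)))
        hits += (nearest == i)
    return hits / len(privatized)
```

The two measures need not rank two privatized versions of the same data set in the same order, which is exactly the ambiguity raised above: a table can look safe under one metric and exposed under another.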