Balancing Privacy and Utility in Cross-Company Defect Prediction - TSE 2013

by Fayola Peters, Tim Menzies, Liang Gong and Hongyu Zhang

Background: Cross-company defect prediction (CCDP) is a field of study where an organization lacking enough local data can use data from other organizations for building defect predictors. To support CCDP, data must be shared. Such shared data must be privatized, but that privatization could severely damage the utility of the data.

Aim: To enable effective defect prediction from shared data while preserving privacy.

Method: We explore privatization algorithms that maintain class boundaries in a dataset. CLIFF is an instance pruner that deletes irrelevant examples. MORPH is a data mutator that moves the data a random distance, taking care not to cross class boundaries. CLIFF+MORPH are tested in a CCDP study among 10 defect datasets from the PROMISE data repository.
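
The MORPH step described above can be illustrated with a minimal sketch. It assumes details the abstract does not state: that each instance is moved along the line to its nearest unlike neighbor (the closest instance of a different class), that distance is Euclidean, and that the random fraction is drawn from a range kept below 0.5 so a moved instance stays closer to its origin than to the other class. The function names `nearest_unlike_neighbor` and `morph` and the default bounds `lo` and `hi` are hypothetical choices for illustration, not the authors' implementation.

```python
import math
import random


def nearest_unlike_neighbor(i, rows, labels):
    """Return the row closest to rows[i] (Euclidean) that has a different label."""
    best, best_dist = None, math.inf
    for row, label in zip(rows, labels):
        if label == labels[i]:
            continue  # same class: not an "unlike" neighbor
        d = math.dist(rows[i], row)
        if d < best_dist:
            best, best_dist = row, d
    return best


def morph(rows, labels, lo=0.15, hi=0.35, rng=None):
    """Move each row a random fraction of its distance to the nearest unlike neighbor.

    lo and hi are assumed bounds on that fraction; keeping hi below 0.5
    means a moved row never ends up nearer to the unlike neighbor than
    to its own starting point.
    """
    rng = rng or random.Random()
    morphed = []
    for i, x in enumerate(rows):
        z = nearest_unlike_neighbor(i, rows, labels)
        if z is None:
            # Only one class present: nothing to measure against, keep the row.
            morphed.append(list(x))
            continue
        r = rng.uniform(lo, hi)          # random step size
        sign = rng.choice((-1.0, 1.0))   # toward or away from the neighbor
        morphed.append([xi + sign * r * (xi - zi) for xi, zi in zip(x, z)])
    return morphed
```

Called as `morph(rows, labels)`, this returns a new list of rows with the labels unchanged, so a defect predictor can be trained on the mutated data in place of the original.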

Results: We find that 1) the CLIFFed+MORPHed algorithms provide more privacy than the state-of-the-art privacy algorithms, and 2) in terms of utility measured by defect prediction, CLIFF+MORPH performs significantly better.

Conclusions: For the OO defect data studied here, data can be privatized and shared without a significant degradation in utility. To the best of our knowledge, this is the first published result where privatization does not compromise defect prediction.

author={F. Peters and T. Menzies and L. Gong and H. Zhang}, 
journal={IEEE Transactions on Software Engineering}, 
title={Balancing Privacy and Utility in Cross-Company Defect Prediction},