Keywords:
Machine learning
Multi-label classification
Privacy-preserving data mining
Abstract:
Our study treats the identification of patients' demographic attributes as a multi-label learning problem. This is a novel approach to estimating how accurately patients' quasi-identifiers (race and gender attributes) can be classified. To classify the sets of attributes, we applied ensembles of several multi-label learning algorithms. The best-performing multi-label ensembles include decision tree algorithms. In the empirical part of this study, we used the UCI diabetes dataset of over 100,000 records collected from 130 US hospitals. The dataset's attributes include patient demographics (race, gender, age), diagnosis codes, lab results, etc. Experiments conducted on datasets of 1000, 10,000, and 20,000 examples show that the best classifier achieves an overall accuracy of 0.533 (1000 examples), 0.702 (10,000 examples), and 0.569 (20,000 examples), improving over the baseline majority-class classification, which achieved accuracies of 0.526, 0.586, and 0.562, respectively; the gain is largest on the 10,000-example dataset. Our approach can be further integrated into privacy-preserving data mining, where it can be used to assess the risk of identification of different groups of individuals within a large data set.
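
The majority-class baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes that accuracy means an exact match on the full (race, gender) label set, which the abstract does not specify. The function names and the toy records are hypothetical and are not taken from the study or from the UCI data.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return the most frequent label combination in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def exact_match_accuracy(true_labels, predicted_labels):
    """Fraction of records whose full label set is predicted correctly."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Made-up toy records (assumption): each one is a (race, gender) pair.
train = [("A", "F"), ("A", "F"), ("A", "M"), ("B", "F"), ("A", "F")]
test = [("A", "F"), ("B", "M"), ("A", "F"), ("A", "M")]

# The baseline predicts the same label set for every test record.
baseline = majority_baseline(train)
predictions = [baseline] * len(test)
baseline_accuracy = exact_match_accuracy(test, predictions)
print(baseline, baseline_accuracy)
```

Under this assumption, a multi-label classifier is useful for risk assessment only to the extent that its exact-match accuracy exceeds the value this baseline reaches on the same records.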