Missing data and the preprocessing perceptron
2004 (English) Report (Other academic)
This paper describes and discusses several ways to handle missing data, e.g. removing cases, mean imputation, and multiple imputation. The Pima-Indians-Diabetes data set is used as a case study. This data set is of particular interest because it has not been obvious to all users that it actually contains a substantial amount of missing data. The data set is described in detail, and the methods for coping with missing data mentioned in the text are applied to it.
The preprocessing perceptron is used to train decision support systems on the data sets. A sketch of a way to impute missing data using the preprocessing perceptron is also proposed and discussed. The accuracy of the trained decision support systems, at the optimal efficiency point, lay in the interval 76-82% for the different methods. The highest values were obtained when all cases with missing data were removed from both the test and the training set. This is, however, not a good way to handle missing data, since the resulting decision support system is biased. Furthermore, it will not be able to handle missing data when used on real data in the future. The results of the remaining methods were surprisingly similar; one reason for this might be that the data set used is rather large. Differences between methods would probably be larger in a smaller data set with a larger amount of missing data.
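Two of the simpler strategies named in the abstract, removing cases and mean imputation, can be illustrated with a minimal Python sketch. It is not the report's implementation. The rows are invented toy values, the function names are hypothetical, and the assumption that a zero in a feature column stands for "not measured" is made here only for illustration.

```python
import math

# Hypothetical toy rows: (glucose, bmi, label). These values are invented
# for illustration and are NOT taken from the Pima-Indians-Diabetes data set.
ROWS = [
    (148.0, 33.6, 1),
    (0.0, 26.6, 0),   # assumption: 0 is a placeholder for "not measured"
    (183.0, 0.0, 1),
    (89.0, 28.1, 0),
]


def mark_missing(rows, n_features=2):
    """Replace zero placeholders in the feature columns with NaN."""
    return [
        tuple(math.nan if i < n_features and v == 0 else v
              for i, v in enumerate(row))
        for row in rows
    ]


def remove_cases(rows, n_features=2):
    """Case deletion: keep only rows whose features are all observed."""
    return [
        row for row in rows
        if not any(math.isnan(row[j]) for j in range(n_features))
    ]


def mean_impute(rows, n_features=2):
    """Mean imputation: fill each gap with the mean of the observed
    values in the same column (assumes every column has at least one)."""
    means = []
    for j in range(n_features):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        means.append(sum(observed) / len(observed))
    return [
        tuple(means[j] if j < n_features and math.isnan(v) else v
              for j, v in enumerate(row))
        for row in rows
    ]


marked = mark_missing(ROWS)
complete_only = remove_cases(marked)   # drops every row with a gap
imputed = mean_impute(marked)          # keeps all rows, fills the gaps
```

On this toy input, case deletion discards half of the rows, while mean imputation keeps all four. That difference in how much data survives is the kind of trade-off the abstract points to.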
Place, publisher, year, edition, pages
Dept. of Computing Science, Umeå University, 2004. 30 p.
UMINF, 04.02
Identifiers
URN: urn:nbn:se:umu:diva-8399
OAI: oai:DiVA.org:umu-8399
DiVA: diva2:148070