Background: The presence of missing values in databases poses a serious threat to knowledge
extraction. Especially in large databases integrated from multiple sources, the number of
missing values may be high, which in turn may lead to biased inferences. Many methods have been
proposed for handling ignorable (Missing At Random and Missing Completely At
Random) and non-ignorable (Not Missing At Random) missingness. Still, gaps remain in (i)
handling heterogeneous missing attributes, (ii) imputing missing values in large databases, and (iii)
dealing with both ignorable and non-ignorable missingness.
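
To make the three missingness mechanisms concrete, the following is a minimal sketch that removes values from one column under each of them. The column names `age` and `income`, the function `make_missing`, and the median-split rule are illustrative assumptions and are not taken from the paper.

```python
import random

def make_missing(rows, mechanism, rate=0.3, seed=0):
    """Return a copy of rows with some 'income' values replaced by None.

    rows: list of dicts with numeric 'age' and 'income' (hypothetical columns).
    mechanism: 'MCAR', 'MAR' or 'NMAR'.
    """
    rng = random.Random(seed)
    age_median = sorted(r["age"] for r in rows)[len(rows) // 2]
    income_median = sorted(r["income"] for r in rows)[len(rows) // 2]
    high = min(1.0, 2 * rate)  # removal probability in the affected half
    out = []
    for r in rows:
        if mechanism == "MCAR":
            # Missingness is independent of every value in the row.
            p = rate
        elif mechanism == "MAR":
            # Missingness depends only on an observed covariate (age).
            p = high if r["age"] > age_median else 0.0
        elif mechanism == "NMAR":
            # Missingness depends on the unobserved value itself (income).
            p = high if r["income"] > income_median else 0.0
        else:
            raise ValueError("mechanism must be 'MCAR', 'MAR' or 'NMAR'")
        new_row = dict(r)  # copy so the input rows are left untouched
        if rng.random() < p:
            new_row["income"] = None
        out.append(new_row)
    return out
```

Under MCAR and MAR the missingness can be explained from observed data, which is why these patterns are called ignorable; under NMAR the removed values themselves drive the pattern, so an imputation method cannot recover the mechanism from the observed columns alone.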
Objective: This paper addresses all three issues with a single algorithm, Repopulated
Bayesian Ant Colony Optimization (RPBACO), which hybridizes Bayesian and Ant Colony
Optimization (ACO) techniques.
Methodology: ACO chooses the covariate values required for optimal imputation of missing
values based on the ants' selection probabilities and pheromone updates. Bayesian principles are used
to evaluate the fitness of solutions in the ACO process, which involves local beam search for
repopulation in successive generations. RPBACO is applied to large real datasets to impute heterogeneous
(discrete and continuous) missing values with both ignorable and non-ignorable patterns.
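
The selection and pheromone steps described above can be illustrated with the generic Ant System rules. The sketch below is not the RPBACO implementation: the function names, the parameters `alpha`, `beta`, `rho` and `beam_width`, and the caller-supplied `fitness` function (standing in for the Bayesian evaluation) are all assumptions made for illustration.

```python
import random

def choose_value(candidates, pheromone, heuristic, alpha=1.0, beta=1.0, rng=random):
    """Pick one candidate with probability proportional to tau^alpha * eta^beta."""
    weights = [(pheromone[c] ** alpha) * (heuristic[c] ** beta) for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def update_pheromone(pheromone, deposits, rho=0.1):
    """Evaporate all trails by a factor (1 - rho), then add each deposit."""
    for c in pheromone:
        pheromone[c] *= 1.0 - rho
    for c, fitness in deposits:
        pheromone[c] += fitness

def keep_best(scored, k):
    """Beam-style selection: keep the k highest-scoring (candidate, fitness) pairs."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

def run_colony(candidates, fitness, n_ants=5, n_iter=50, beam_width=2, seed=0):
    """Run a toy colony; fitness is a hypothetical stand-in for the Bayesian score."""
    rng = random.Random(seed)
    pheromone = {c: 1.0 for c in candidates}
    heuristic = {c: 1.0 for c in candidates}  # uninformative heuristic (assumption)
    for _ in range(n_iter):
        picks = [choose_value(candidates, pheromone, heuristic, rng=rng)
                 for _ in range(n_ants)]
        scored = [(c, fitness(c)) for c in picks]
        # Only the best solutions of this generation reinforce the trails.
        update_pheromone(pheromone, keep_best(scored, beam_width))
    return pheromone
```

For example, `run_colony(["a", "b", "c"], lambda c: {"a": 0.1, "b": 0.9, "c": 0.2}[c])` returns the final pheromone levels, in which candidates with higher fitness tend to accumulate more pheromone and are therefore chosen more often in later generations.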
Results: The experimental results are encouraging when compared with existing standard
techniques in terms of both imputation accuracy and computational time, measured at
missing rates from 5% to 50%. The statistical tests conducted to validate the experimental results also
confirm the superiority of RPBACO on all the datasets considered.
Conclusion: RPBACO can be successfully used for handling both ignorable and non-ignorable
missing values in heterogeneous attributes of large datasets, with better imputation accuracy
than the standard techniques it was compared against.