Abstract
Real-world datasets often contain large numbers of unlabeled data points because obtaining labels carries additional cost. Semi-supervised learning (SSL) algorithms train on both labeled and unlabeled data points, which can yield higher classification accuracy on these datasets. Traditional SSL algorithms generally assign tentative labels to the unlabeled data points on the basis of the smoothness assumption that neighboring points should have the same label. When this assumption is violated, unlabeled points are mislabeled, injecting noise into the final classifier. An alternative SSL approach is cluster-then-label (CTL), which partitions all the data points (labeled and unlabeled) into clusters and builds a classifier from those clusters. CTL is based on the less restrictive cluster assumption that data points in the same cluster should have the same label. As we show, this allows CTL algorithms to achieve higher classification accuracy on many datasets where the cluster assumption holds but the smoothness assumption does not. However, cluster configuration problems (e.g., irrelevant features, insufficient clusters, and incorrectly shaped clusters) can violate the cluster assumption. We propose a new framework for CTL that uses a genetic algorithm (GA) to evolve classifiers free of these cluster configuration problems: the GA removes irrelevant attributes, updates the number of clusters, and changes the shape of the clusters. We demonstrate that a CTL algorithm based on this framework achieves accuracy comparable to or higher than that of both traditional SSL and CTL algorithms on 12 University of California, Irvine machine learning datasets.
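The basic cluster-then-label idea described above can be illustrated with a minimal sketch. It is not the GA-based framework proposed in the paper: it assumes plain k-means clustering, a fixed number of clusters, and a majority vote among the labeled members of each cluster. The function names (`kmeans`, `cluster_then_label`) and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means. Assumption: the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign


def cluster_then_label(labeled, unlabeled, k):
    """labeled: list of (point, label) pairs; unlabeled: list of points."""
    # Cluster labeled and unlabeled points together (labeled points come first).
    points = [p for p, _ in labeled] + list(unlabeled)
    centers, assign = kmeans(points, k)

    # Majority vote among the labeled members of each cluster.
    votes = {c: Counter() for c in range(k)}
    for (_, y), c in zip(labeled, assign):
        votes[c][y] += 1

    # Clusters without any labeled point fall back to the overall majority label.
    fallback = Counter(y for _, y in labeled).most_common(1)[0][0]
    cluster_label = {c: (v.most_common(1)[0][0] if v else fallback)
                     for c, v in votes.items()}

    def predict(x):
        # A new point inherits the label of its nearest cluster center.
        nearest = min(range(k), key=lambda c: math.dist(x, centers[c]))
        return cluster_label[nearest]

    return predict


# Toy data: two well-separated groups, one labeled point in each.
labeled = [((0.0, 0.0), "a"), ((5.0, 5.0), "b")]
unlabeled = [(0.2, 0.1), (0.1, 0.3), (5.1, 4.9), (4.8, 5.2)]
predict = cluster_then_label(labeled, unlabeled, k=2)
print(predict((0.3, 0.2)), predict((4.9, 5.0)))
```

In this sketch the number of clusters, the feature set, and the (spherical) cluster shape are all fixed in advance, which is exactly where the cluster configuration problems named in the abstract can arise.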
Original language | English (US) |
---|---|
Pages (from-to) | 201-232 |
Number of pages | 32 |
Journal | Computational Intelligence |
Volume | 31 |
Issue number | 2 |
State | Published - May 1 2015 |
Keywords
- cluster-then-label
- genetic algorithm
- semi-supervised learning
- unsupervised clustering
ASJC Scopus subject areas
- Computational Mathematics
- Artificial Intelligence