Clustering for Unsupervised Learning

Clustering algorithms have been widely studied in many scientific areas, such as data mining, knowledge discovery, bioinformatics, and machine learning. More specifically, cluster analysis, an essential technique for unsupervised learning, aims to find the underlying structure of a dataset according to given clustering criteria and the specific properties of the input data.

A density-based clustering algorithm called density peaks (DP), proposed by Rodriguez and Laio [Science, 344(6191): 1492-1496, 2014], outperforms almost all other approaches. Although the DP algorithm performs well in many cases, there is still room for improvement in the precision of its output clusters and in the quality of the selected centers. We therefore propose a more accurate clustering algorithm, seed-and-extension-based density peaks (SDP). SDP selects centers that capture the features of their clusters while building a spanning forest and, at the same time, constructs the output clusters in a seed-and-extension manner. Experimental results demonstrate the effectiveness of SDP, especially on clusters with relatively high densities. Specifically, we show that SDP is more accurate than the DP algorithm and other state-of-the-art clustering approaches in terms of the quality of both the output clusters and the cluster centers, while maintaining a running time similar to that of the DP algorithm, particularly on a variety of time-series (i.e., non-metric) data. Moreover, SDP outperforms DP in the dynamic model, in which data points can be inserted and deleted.
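
For readers unfamiliar with the baseline, the following is a minimal sketch of the original DP idea of Rodriguez and Laio, not of SDP. Each point gets a local density (the number of neighbors within a cutoff distance) and a distance to its nearest denser point; points with a large product of the two are taken as centers, and every other point joins the cluster of its nearest denser neighbor. The function name `density_peaks`, the parameters `d_c` and `n_centers`, and the index-based tie-breaking are assumptions made for illustration. The original paper selects centers by inspecting a decision graph rather than by a fixed count.

```python
import math

def density_peaks(points, d_c, n_centers):
    """Baseline density peaks sketch; returns (labels, center_indices)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density rho_i: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Densest first; equal densities are ordered by index (an assumption).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n    # distance to the nearest denser point
    parent = [-1] * n    # index of that nearest denser point
    # By convention, the densest point gets its largest distance to any point.
    delta[order[0]] = max(dist[order[0]])
    for rank in range(1, n):
        i = order[rank]
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i], parent[i] = dist[i][j], j
    # Centers: the n_centers points with the largest rho * delta.
    centers = sorted(order, key=lambda i: -rho[i] * delta[i])[:n_centers]
    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Visit points in density order so each parent is labeled before its child.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers
```

Calling `density_peaks(points, d_c=1.0, n_centers=2)` on two well-separated groups of 2-D tuples returns one label per point and the indices of the two chosen centers. The sketch builds the full pairwise distance matrix, so it is only suitable for small inputs.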

However, properly selecting a clustering algorithm for a dataset without prior knowledge is nontrivial, and the choice strongly affects the quality of the clustering results. Another line of our work therefore develops a mechanism that automatically selects an appropriate algorithm, together with a good setting of its hyperparameters, for any dataset. To achieve this goal, we build an Automated Machine Learning (AutoML) framework for clustering, which chooses a suitable algorithm and feature preprocessing steps for a new dataset at hand and sets their respective hyperparameters. We also propose two AutoML clustering approaches: one incorporates several ensemble clustering methods into hyperparameter tuning, and the other adjusts the weights of clustering validation indices (CVIs) when tuning its objective function. We conduct numerical experiments and demonstrate the effectiveness of the proposed frameworks for AutoML clustering.
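
The idea of scoring candidate configurations with a weighted combination of CVIs can be illustrated with a small sketch. This is not the proposed framework: it substitutes an exhaustive grid search for whatever tuning strategy the framework uses, keeps the CVI weights fixed instead of adjusting them, and omits feature preprocessing and ensemble clustering. The two toy algorithms (`kmeans`, `single_linkage`), the two CVIs (`silhouette`, `dunn`), and the function `select_configuration` are all hypothetical names chosen for illustration.

```python
import math

def kmeans(points, k, iters=20):
    """Tiny deterministic k-means; the first k points seed the centroids."""
    cents = [points[i] for i in range(k)]
    assign = lambda: [min(range(k), key=lambda c: math.dist(p, cents[c]))
                      for p in points]
    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster becomes empty
                cents[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign()

def single_linkage(points, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(points))]
    gap = lambda a, b: min(math.dist(points[i], points[j])
                           for i in clusters[a] for j in clusters[b])
    while len(clusters) > k:
        a, b = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: gap(*ab))
        clusters[a] += clusters.pop(b)
    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient in [-1, 1]; higher is better."""
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    if len(clusters) < 2:
        return -1.0  # undefined for one cluster; treated as worst (assumption)
    total = 0.0
    for i, l in enumerate(labels):
        own = [j for j in clusters[l] if j != i]
        if not own:  # a singleton cluster contributes 0 by convention
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(sum(math.dist(points[i], points[j]) for j in m) / len(m)
                for other, m in clusters.items() if other != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def dunn(points, labels):
    """Dunn index: smallest between-cluster gap over largest cluster diameter."""
    pairs = [(i, j) for i in range(len(points)) for j in range(i + 1, len(points))]
    gap = min((math.dist(points[i], points[j]) for i, j in pairs
               if labels[i] != labels[j]), default=0.0)
    diam = max((math.dist(points[i], points[j]) for i, j in pairs
                if labels[i] == labels[j]), default=0.0)
    return gap / diam if diam > 0 else 0.0  # 0.0 fallback is an assumption

def select_configuration(points, search_space, weighted_cvis):
    """Grid search over (algorithm, k); objective is a weighted sum of CVIs."""
    best = None
    for name, (algorithm, k_values) in search_space.items():
        for k in k_values:
            labels = algorithm(points, k)
            score = sum(w * cvi(points, labels) for cvi, w in weighted_cvis)
            if best is None or score > best[0]:
                best = (score, name, k, labels)
    return best
```

A search space such as `{"kmeans": (kmeans, [2, 3, 4]), "single_linkage": (single_linkage, [2, 3, 4])}` with weights `[(silhouette, 0.5), (dunn, 0.5)]` returns the best-scoring `(score, name, k, labels)` tuple. The two indices live on different scales (the silhouette is bounded, the Dunn index is not), so a practical objective would probably normalize them before weighting.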

From a practical perspective, we believe that SDP and the AutoML frameworks will be helpful for unsupervised learning as well as for many real-world applications.

SDP: Seed-and-extension-based Density Peaks Clustering Algorithm

1. Ming-Hao Tung, Yi-Ping Phoebe Chen, Chen-Yu Liu, Chung-Shou Liao. A Fast and More Accurate Seed-and-Extension Density-based Clustering Algorithm, IEEE Transactions on Knowledge and Data Engineering, published online, March 2022. DOI: 10.1109/TKDE.2022.3161117

2. Yen-Hua Fang and Chung-Shou Liao. (2022) Clustering with Learning-based Hyperparameter Selection for a Time-series Monitoring System, submitted.

3. Chin-Fu Lin and Chung-Shou Liao. (2022) AutoML for Ensemble Clustering, submitted.
