Unsupervised information-based hierarchical clustering of big data

Publication date: 19/07/2022

A hierarchical cluster analyzer finds clusters in a big data set by identifying topological structure rather than relying on distance-based metrics. The analyzer stochastically partitions the data set to create pseudo-partitions. The stochastic partitioning may be implemented with a random forest classifier, which uses ensemble techniques to reduce variance and guard against overfitting. The analyzer then applies random intersection leaves (RIL), a data mining technique that grows an intersection tree by intersecting candidate sets generated from the pseudo-partitions. For each leaf node of the intersection tree, the analyzer updates an association matrix according to the co-occurrences of data points within that leaf. Data points that co-occur in a leaf exhibit a high degree of similarity, which the association matrix records. A hierarchy of clusters may then be formed by finding community structure in the association matrix.
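
The pipeline described above can be illustrated with a minimal Python sketch. It is not the patented implementation, and it makes several simplifying assumptions: random axis-aligned cuts stand in for the random forest classifier, and connected components of the thresholded association matrix stand in for a community-detection method. All function names and parameters (random_partition, intersection_leaves, association_matrix, components, hierarchy, part_depth, tree_depth, branching) are hypothetical and chosen only for illustration.

```python
import random
from itertools import combinations

def random_partition(points, depth, rng):
    """One pseudo-partition: split point indices with random axis-aligned cuts.
    (Assumption: a stand-in for the random forest named in the abstract.)"""
    cells = [set(range(len(points)))]
    for _ in range(depth):
        next_cells = []
        for cell in cells:
            if len(cell) < 2:
                next_cells.append(cell)
                continue
            axis = rng.randrange(len(points[0]))
            values = [points[i][axis] for i in cell]
            cut = rng.uniform(min(values), max(values))
            left = {i for i in cell if points[i][axis] <= cut}
            right = cell - left
            next_cells.extend(c for c in (left, right) if c)
        cells = next_cells
    return cells

def intersection_leaves(partitions, depth, branching, rng):
    """Grow an intersection tree level by level and return the non-empty
    index sets held by its leaf nodes."""
    level = [set(rng.choice(rng.choice(partitions)))]  # root: one candidate set
    for _ in range(depth):
        level = [node & rng.choice(rng.choice(partitions))
                 for node in level for _child in range(branching)]
        level = [node for node in level if node]  # drop empty intersections
    return level

def association_matrix(points, n_trees, rng, n_partitions=10,
                       part_depth=1, tree_depth=3, branching=2):
    """Count how often each pair of points co-occurs in a leaf node."""
    n = len(points)
    assoc = [[0] * n for _ in range(n)]
    for _ in range(n_trees):
        partitions = [random_partition(points, part_depth, rng)
                      for _ in range(n_partitions)]
        for leaf in intersection_leaves(partitions, tree_depth, branching, rng):
            for i, j in combinations(sorted(leaf), 2):
                assoc[i][j] += 1
                assoc[j][i] += 1
    return assoc

def components(assoc, threshold):
    """Connected components of the graph whose edges are the pairs with
    association >= threshold. (Assumption: a simple stand-in for
    community detection.)"""
    n = len(assoc)
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            i = stack.pop()
            if i in group:
                continue
            group.add(i)
            stack.extend(j for j in range(n)
                         if j != i and j not in group
                         and assoc[i][j] >= threshold)
        seen |= group
        groups.append(group)
    return groups

def hierarchy(assoc, thresholds):
    """Nested clusterings: a higher threshold yields finer clusters."""
    return {t: components(assoc, t) for t in sorted(thresholds)}

if __name__ == "__main__":
    rng = random.Random(0)
    points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
              (5.0, 5.1), (5.2, 4.9), (4.8, 5.3)]
    assoc = association_matrix(points, n_trees=50, rng=rng)
    for level, groups in hierarchy(assoc, [1, 10, 25]).items():
        print(level, groups)
```

Note that no pairwise distance is ever computed: similarity is read entirely from how often two points land in the same leaf. Raising the threshold can only remove edges, so each finer clustering is nested inside the coarser one, which is what makes the result a hierarchy.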
