Advanced approaches to hierarchical cluster analysis in categorical and mixed-type data

Project period: 1 March 2024 - 28 February 2026
Principal investigator: Ing. Jaroslav Horníček
Workplace: Faculty of Informatics and Statistics
Department of Statistics and Probability (4100)

Sole investigator
Provider: Ministry of Education, Youth and Sports
Programme: Internal Grant Agency of VŠE
Total budget: 304 450 CZK
Registration number: F4/35/2024
Order number: IG410014
The project examines advanced approaches to hierarchical clustering of objects in mixed-type data containing missing values. Such data frequently occur in real-world applications, e.g., segmentation of bank clients or shopping mall customers, especially when the data come from questionnaire surveys. Currently, mixed-type data are commonly clustered using the Gower dissimilarity measure, which is simplistic in two respects: its categorical part relies on simple matching, and its treatment of missing values excludes variables with incomplete values from the calculation. In recent years, new approaches to clustering with missing values have been proposed, e.g., the zero, mode, or tolerance methods for calculating the dissimilarity matrix. In addition, these approaches could be used for missing data imputation. Therefore, the first goal of the project is to compare newly proposed or little-explored dissimilarity measures for mixed-type data with each other and with the Gower dissimilarity measure. The second goal is to examine different methods for clustering incomplete data and clustering-based methods for missing data imputation, and to compare the quality of their results using suitable coefficients. The third goal is to analyze similarity measures for objects characterized by binary variables and to propose a method for their classification with respect to their application in hierarchical agglomerative cluster analysis. Finally, enhanced hierarchical clustering methods for incomplete categorical and mixed-type data, together with new tools for data visualization and dataset generation, will be implemented in the nomclust R package for use in scientific and business domains.
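
As a rough illustration of the two Gower simplifications described above, the following Python sketch computes the dissimilarity between two records. It assumes the textbook form of the measure: range-normalized absolute differences for numeric variables, simple matching for categorical ones, and pairwise omission of any variable with a missing value. The function name, the use of `None` as the missing-value marker, and the sample records are hypothetical and are not taken from the nomclust package.

```python
import math


def gower_dissimilarity(a, b, is_numeric, ranges):
    """Gower dissimilarity between two records (illustrative sketch).

    a, b       -- sequences of values; None marks a missing value (assumption)
    is_numeric -- one flag per variable: True for numeric, False for categorical
    ranges     -- range of each numeric variable in the data set
                  (ignored for categorical variables)
    """
    total = 0.0
    used = 0
    for x, y, numeric, rng in zip(a, b, is_numeric, ranges):
        if x is None or y is None:
            # Missing value: the variable is left out for this pair of objects.
            continue
        if numeric:
            # Numeric part: absolute difference scaled by the variable's range.
            total += abs(x - y) / rng if rng else 0.0
        else:
            # Categorical part: simple matching (0 if equal, 1 otherwise).
            total += 0.0 if x == y else 1.0
        used += 1
    # No comparable variables: the dissimilarity is undefined.
    return total / used if used else math.nan


# Hypothetical records: (age, residence, owns a credit card)
is_numeric = (True, False, False)
ranges = (40, None, None)  # assumed age range in the data set
client_a = (30, "urban", "yes")
client_b = (50, "rural", "yes")
client_c = (None, "rural", "yes")  # age is missing

d_ab = gower_dissimilarity(client_a, client_b, is_numeric, ranges)
d_ac = gower_dissimilarity(client_a, client_c, is_numeric, ranges)
d_bc = gower_dissimilarity(client_b, client_c, is_numeric, ranges)
```

In this sketch, `client_b` and `client_c` come out as identical because the only variable on which they could differ is missing. The zero, mode, and tolerance methods mentioned above are alternative ways of treating such missing values.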

