Enhancing random forest models for survival analysis: Novel methodologies and applications in statistical inference

Science and research

Project period: March 1, 2025 - February 28, 2027
Principal investigator: Ing. MUDr. Lubomír Štěpánek, Ph.D.
Workplace: Faculty of Informatics and Statistics
Department of Statistics and Probability (4100)

Sole investigator
Provider: Ministry of Education, Youth and Sports
Program: Internal Grant Agency of VŠE (Interní grantová agentura VŠE)
Total budget: 146,250 CZK
Registration number: F4/51/2025
Order number: IG410035
This project redefines Classical Random Forest (CRF) models, positioning them as a robust tool for statistical inference and, in part, prediction in survival analysis. Random forests are nearly assumption-free, which makes them versatile across diverse datasets.

The well-established random survival forests method uses a hazard function in the nodes of decision trees. We instead propose a special transformation of the input data into individualized survival probabilities for each class of interest, which can then be used for decision rules in tree nodes. The methodology is therefore novel in several respects compared with the popular random survival forests.

The research analyzes the Poisson-Binomial distribution as a model of the probability with which individual trees within a random forest reject the null hypothesis that there is no difference between the classes defined by their survival curves. This approach offers insight into the rejection dynamics within the ensemble. It also makes the behavior of random forests, usually regarded as a data-driven black box, more interpretable through mathematical and statistical description.
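As an illustration of the Poisson-Binomial idea, the following minimal sketch computes the distribution of the number of trees that reject the null hypothesis when each tree rejects with its own probability. It is written in Python for illustration only and is not the project's R implementation. The function names, the independence of trees, and the example probabilities are assumptions made here, not details taken from the project.

```python
from typing import List, Sequence


def poisson_binomial_pmf(probs: Sequence[float]) -> List[float]:
    """Return P(K = k) for k = 0..n, where K counts rejections among n trees.

    probs[i] is the (hypothetical) probability that tree i rejects the null
    hypothesis. Trees are ASSUMED independent, which is a simplification.
    """
    pmf = [1.0]  # with zero trees, zero rejections is certain
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this tree does not reject
            nxt[k + 1] += mass * p       # this tree rejects
        pmf = nxt
    return pmf


def prob_at_least(probs: Sequence[float], threshold: int) -> float:
    """Probability that at least `threshold` trees reject the null hypothesis."""
    return sum(poisson_binomial_pmf(probs)[threshold:])


if __name__ == "__main__":
    # Made-up per-tree rejection probabilities, for illustration only.
    example = [0.10, 0.35, 0.60, 0.80, 0.95]
    print(poisson_binomial_pmf(example))
    print(prob_at_least(example, 3))  # chance that a majority of 5 trees reject
```

When all per-tree probabilities are equal, this reduces to the ordinary binomial distribution. The recurrence adds one tree at a time, so the cost grows quadratically with the number of trees.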

To understand the classification power of individual trees, the study analyzes the null distribution of their ability to classify into only one class defined by the associated survival curve. This investigation yields valuable information on the discriminatory capabilities of trees within the random forest framework.

Incorporating graph theory into the analysis of trees within random forests adds a novel dimension. Coupled with Poisson-Binomial modeling, this approach explores relationships and dependencies among trees, enriching the understanding of their collective behavior in a survival analysis setting. While the Poisson-Binomial approach may work better for random forests treated as whole sets of decision trees, the graph theory approach is more suitable for studying the classification power of individual trees.

The study also explores non-binary splitting within individual trees of random forests. This less common approach aims to capture nuanced patterns and structures in the data, contributing to the adaptability and precision of survival analysis models based on random forests. It may also enable more precise estimation of metrics for null hypothesis rejection, such as the popular p-value.

A critical aspect of the research is estimating the minimum number of trees a random forest needs for the CRF methodology to be effective in survival analysis. This practical consideration helps balance computational efficiency against model accuracy.
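One way to make the question of a minimum number of trees concrete is sketched below in Python, for illustration only. It rests on strong simplifying assumptions that are not taken from the project: every tree rejects the null hypothesis independently with the same probability `p`, and the forest rejects when a strict majority of its trees do. The function names and the majority rule are hypothetical.

```python
from math import comb
from typing import Optional


def majority_rejection_prob(n_trees: int, p: float) -> float:
    """P(more than half of n_trees reject), ASSUMING independent trees that
    each reject with the same probability p (a binomial simplification)."""
    need = n_trees // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n_trees, k) * p**k * (1.0 - p) ** (n_trees - k)
        for k in range(need, n_trees + 1)
    )


def min_trees_for_majority(p: float, target: float,
                           max_trees: int = 999) -> Optional[int]:
    """Smallest odd forest size whose majority rejects with probability of at
    least `target`, or None if no size up to max_trees is enough.

    max_trees is capped at 999 here so the binomial coefficients still fit
    into a float; a log-space version would be needed for larger forests.
    """
    if max_trees > 999:
        raise ValueError("this sketch supports at most 999 trees")
    for n in range(1, max_trees + 1, 2):  # odd sizes avoid tied votes
        if majority_rejection_prob(n, p) >= target:
            return n
    return None


if __name__ == "__main__":
    # Made-up inputs: each tree rejects with probability 0.6, target 0.9.
    print(min_trees_for_majority(p=0.6, target=0.9))
```

Under these assumptions, a per-tree probability of at most 0.5 never reaches a target above 0.5, so the function returns `None` in that case. Dependence among trees, which the project studies, would change the answer.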

Applications of these refined CRF models span several domains. They include statistical inference and, in part, prediction in economic models, for example of upcoming bankruptcies or of customers becoming unable to pay their debts. The models may also help predict rare events or compare their development over time, e.g., the emergence of new COVID-19 variants or other highly complex events that are hard to predict or to distinguish from similar events using standard methods. Finally, to facilitate widespread use, the study emphasizes implementing these methodologies in the R programming language.