While survival analysis is a well-established field of statistics, the vast majority of its commonly used methods are limited by strict statistical assumptions. Moreover, as in other branches of statistics, many of the techniques may suffer from a high rate of false positive results or from low statistical power. Since many events of interest that play an essential role in economics, finance, investment, actuarial science, or medicine are rare, low statistical power in particular affects the analyses that handle such rare events. As a result, both statistical inference, which investigates whether groups differ in the rate of the event of interest, and prediction of rare events are imprecise or, even worse, may fail to produce a correct result. The rarity of an event of interest thus makes both its prediction and the related statistical inference difficult. However, a method able to overcome these issues can help to prevent the event of interest from happening at all, which is most valuable for rare events with a negative connotation, e.g., bank loan payment failures or deaths connected to COVID-19.
Continuing our long-term research, in this project we address the risk of imprecise or incorrect analytical results produced by commonly used methods in survival analysis, a risk that usually arises when the expected statistical assumptions are not met. We also extend the toolbox of methods with alternatives based on machine learning or on other non-conventional approaches such as combinatorial geometry. Since most machine-learning algorithms are almost or fully assumption-free, using them as tools for statistical inference and prediction in survival analysis opens room for overcoming the limitations that traditional methods inherit from their statistical assumptions. The time-to-event variable used to model an event of interest can be decomposed into its time component and its event component. Accordingly, the approaches we mostly work with are regression algorithms, such as regression trees, random forests, and neural networks, which predict when the event of interest occurs (if it occurs at all), and classifiers, such as naive Bayes classifiers, decision trees, random forests, and neural networks, which classify whether the event of interest occurs or not. Furthermore, we focus on presetting the tuning parameters of the machine-learning techniques, which, if preset correctly, may effectively reduce the rate of false positive results or increase the technique's statistical power, i.e., its ability to correctly reject a false null hypothesis.
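The decomposition of the time-to-event variable into an event component and a time component can be illustrated with a minimal sketch, written in Python purely for illustration. It is not the project's actual method: the toy records, the single covariate, and the helper names `fit_event_stump`, `fit_time_stump`, and `predict` are all hypothetical, and one-split decision stumps stand in for the richer classifiers and regressors named above. As a further simplifying assumption, the time model is fitted only on records where the event occurred, so the information in censored times is ignored.

```python
from statistics import mean

# Hypothetical toy data (not from the project): (covariate x, observed time, event).
# event = 1: the event occurred at `time`; event = 0: follow-up ended without it.
RECORDS = [
    (0.2, 9.0, 0), (0.4, 8.5, 0), (0.5, 10.0, 0), (0.9, 7.0, 0),
    (1.6, 4.0, 1), (1.8, 3.5, 1), (2.1, 2.0, 1), (2.5, 1.5, 1),
]


def midpoints(xs):
    """Candidate split thresholds: midpoints between consecutive distinct values."""
    u = sorted(set(xs))
    return [(a + b) / 2 for a, b in zip(u, u[1:])]


def fit_event_stump(xs, events):
    """Classifier for the event component: one split, majority label per side."""
    def majority(labels):
        return int(2 * sum(labels) > len(labels))  # ties go to 0 (no event)

    best = None
    for thr in midpoints(xs):
        left = [e for x, e in zip(xs, events) if x <= thr]
        right = [e for x, e in zip(xs, events) if x > thr]
        l_lab, r_lab = majority(left), majority(right)
        errors = sum(e != l_lab for e in left) + sum(e != r_lab for e in right)
        if best is None or errors < best[0]:
            best = (errors, thr, l_lab, r_lab)
    return best[1:]


def fit_time_stump(xs, times):
    """Regressor for the time component: one split, mean time per side."""
    def sse(vals):
        m = mean(vals)
        return sum((v - m) ** 2 for v in vals)

    best = None
    for thr in midpoints(xs):
        left = [t for x, t in zip(xs, times) if x <= thr]
        right = [t for x, t in zip(xs, times) if x > thr]
        loss = sse(left) + sse(right)
        if best is None or loss < best[0]:
            best = (loss, thr, mean(left), mean(right))
    return best[1:]


# Event component: fitted on all records.
xs = [x for x, _, _ in RECORDS]
events = [e for _, _, e in RECORDS]
event_model = fit_event_stump(xs, events)

# Time component: fitted only on records where the event occurred
# (a simplification; censored times are ignored here).
ev_xs = [x for x, _, e in RECORDS if e == 1]
ev_times = [t for _, t, e in RECORDS if e == 1]
time_model = fit_time_stump(ev_xs, ev_times)


def predict(x):
    """Return (event, time): predicted occurrence and, if it occurs, when."""
    thr, l_lab, r_lab = event_model
    event = l_lab if x <= thr else r_lab
    if event == 0:
        return 0, None
    t_thr, l_mean, r_mean = time_model
    return 1, (l_mean if x <= t_thr else r_mean)
```

Calling `predict(x)` first asks the classifier whether the event is expected at all and consults the regressor only if it is, mirroring the "if it occurs" condition in the text. In this sketch, the split thresholds play the role of tuning parameters learned from the data; in practice, parameters such as tree depth would be preset or tuned separately.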
So far, supported by several previous grants, we have published multiple papers dealing with various details of the topics introduced above; however, our intensive research in the field continuously brings new challenges we would like to work on. Through the collaborations we have been involved in, we have obtained several valuable (non-public) datasets together with permission to (re)use them for academic purposes and publication. Namely, we have at our disposal data on COVID-19 cases that are still being improved and enriched with new observations. COVID-19 provides an excellent model of a rare event of interest with an interdisciplinary impact and a negative connotation. However, the toolbox of methods we work on is, of course, applicable to events of interest of any provenance, e.g., loan payment failures, bank failures, company insolvencies, etc. Datasets on the latter kinds of events, in contrast, are often publicly available.
Besides improving machine-learning-based survival methods for statistical inference and prediction and reducing the need to meet statistical assumptions, we also plan to implement the techniques as a package for the R programming language and statistical environment, offering the toolbox to a wider audience of statistical practitioners.