PRECOG: Predicting REsearch COncepts of siGnificance

Science and Research

Project period: 1 March 2022 - 29 February 2024
Principal investigator: Ing. Lucie Dvořáčková
Workplace: Faculty of Informatics and Statistics
Department of Econometrics (4030)

Sole investigator
Provider: Ministry of Education, Youth and Sports
Programme: Internal Grant Agency of VŠE
Total budget: 404,880 CZK
Registration number: F4/16/2022
Order number: IG403022
The main goal of the project is to use embedding techniques in supervised prediction tasks (specifically binary classification) to identify what we call documents and concepts of future significance. We assume that embedding techniques can capture such future changes and that, by using information on semantic relationships among words, they have the potential to identify new significant keywords (concepts) as important predictors of future events that are not yet observable in the currently available training corpus.

The novel and crucial ingredient of this project is its focus on the notion of semantic distance among words in a text. Standard embeddings usually use first-order relations as predictors, such as "a word occurs frequently" or "a word occurs at the beginning of the text". The semantic relationship, or semantic distance, is a second-order relation on pairs of words that measures whether the two words often occur in similar contexts. Although some formal definitions exist in the literature, we intend both to investigate the existing ones and to design new ones. Based on preliminary results in the literature, this second-order relation can be expected to serve as a strong predictor of future events, i.e. properties that will become observable in texts added to the same corpus in the future but that are not yet available as training data.
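To make the idea of a second-order relation concrete, the following is a minimal sketch of one possible semantic distance: each word is described by counts of the words that appear around it, and two words are compared by the cosine distance of those context counts. The function names, the window size, and the toy corpus are illustrative assumptions, not the definitions used in the project.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """For every word, count the words occurring within `window` positions of it."""
    contexts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word][tokens[j]] += 1
    return contexts


def semantic_distance(u, v):
    """Cosine distance between two sparse count vectors, clamped to [0, 1]."""
    dot = sum(count * v[w] for w, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # a word without any context is treated as maximally distant
    return min(1.0, max(0.0, 1.0 - dot / (norm_u * norm_v)))


# Toy corpus (hypothetical): "drug" and "medicine" share contexts, "edges" does not.
toy_corpus = [
    "the drug reduces pain".split(),
    "the medicine reduces pain".split(),
    "the graph has edges".split(),
]
ctx = context_vectors(toy_corpus)
d_close = semantic_distance(ctx["drug"], ctx["medicine"])
d_far = semantic_distance(ctx["drug"], ctx["edges"])
```

In this toy corpus the pair that shares its surrounding words ends up closer than the pair that does not, even though none of the three words ever co-occurs with another. This is the sense in which the relation is second-order.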

Binary text classification is a type of problem where the goal is to classify new documents into one of two categories based on a set of training data that contains documents whose category is known. Our work focuses specifically on tasks where a particular document may have a different category in different periods, depending on the nature of the target variable itself. For example, a specific article may receive few citations in 2014 (and so fall into the category "lowly cited") but, owing to various changes over time, may come to the fore and be highly cited the next year, 2015 (new category: "highly cited").
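A time-dependent target of this kind can be sketched as follows. The citation threshold and the citation counts are hypothetical values chosen only for illustration; how the project actually separates "lowly cited" from "highly cited" is not specified here.

```python
def citation_class(citations, threshold):
    """Assign the binary category for one period from that period's citation count."""
    return "highly cited" if citations >= threshold else "lowly cited"


def switches_to_high(yearly_citations, year, threshold):
    """True if the paper is lowly cited in `year` and highly cited in `year + 1`."""
    now = citation_class(yearly_citations.get(year, 0), threshold)
    nxt = citation_class(yearly_citations.get(year + 1, 0), threshold)
    return now == "lowly cited" and nxt == "highly cited"


# Hypothetical paper: few citations in 2014, many in 2015.
paper = {2014: 2, 2015: 40}
label_2014 = citation_class(paper[2014], threshold=10)
label_2015 = citation_class(paper[2015], threshold=10)
target = switches_to_high(paper, 2014, threshold=10)
```

The same document thus carries a different label in each period, and the switch indicator is the quantity a classifier would try to predict from text available in the earlier period.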

The semantic distance approach is surprisingly promising for predicting a paper's possible future switch from the "low" to the "high" state. This is the main result of our team's recent submission to Scientometrics (Beranova et al., 2021), which is now under revision and serves as the starting point for continuing this work within the currently proposed project.

Scientometrics is just one area where our approach has already achieved interesting results; other examples can be found, e.g., in materials engineering (Tshitoyan et al., 2019).

In addition to determining the future class of a given document, it is possible to focus on specific words (concepts) that had a decisive influence on the class of the entire document. For example, in biomedical articles, we can focus on the specific drugs mentioned in them and examine their future relevance. The whole model therefore offers two possibilities for interpretation: important future documents and important future concepts.

The research consists of four main phases.

1) Vectorization of text data using embedding techniques;
2) Training of the classification model on vectorized data;
3) Prediction of the future document class;
4) Prediction of future relevant concepts.
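
The four phases can be sketched end to end on toy data. Everything specific below is an assumption made for illustration: the two-dimensional word vectors are hand-made rather than learned, logistic regression stands in for whatever classifier the project uses, and scoring a single word's vector with the trained model is only one possible way to rank concepts.

```python
import math

# Hypothetical 2-d word embeddings; in practice they would come from a trained model.
EMBEDDINGS = {
    "graphene": (0.9, 0.1), "battery": (0.8, 0.3), "catalyst": (0.7, 0.2),
    "survey": (0.1, 0.8), "note": (0.2, 0.9), "erratum": (0.1, 0.7),
}


def vectorize(tokens):
    """Phase 1: represent a document as the mean of its known word vectors."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, epochs=2000, lr=0.5):
    """Phase 2: fit logistic regression with batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, target in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - target
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(model, x):
    """Probability of the positive class ("significant in the future")."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy training set: label 1 = later became significant, 0 = did not.
train_docs = [
    (["graphene", "battery"], 1), (["catalyst", "graphene"], 1),
    (["survey", "note"], 0), (["erratum", "note"], 0),
]
X = [vectorize(tokens) for tokens, _ in train_docs]
y = [label for _, label in train_docs]
model = train(X, y)

# Phase 3: predict the future class of unseen documents.
p_new = predict_proba(model, vectorize(["battery", "catalyst"]))
p_old = predict_proba(model, vectorize(["survey", "erratum"]))

# Phase 4: rank individual concepts by the score the model gives their vectors.
ranked_concepts = sorted(
    EMBEDDINGS, key=lambda word: predict_proba(model, EMBEDDINGS[word]), reverse=True
)
```

Because documents and words live in the same vector space, the one trained model yields both readings described above: a score per document and a score per concept.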
