The thesis and presentation are available in the description folder (here and here).
The algorithms are based on two approaches:
- TextRank.
- Sentence clustering using K-Means.
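
As a rough illustration of the two approaches above, the following minimal sketch ranks sentences with a TextRank-style PageRank over a cosine-similarity graph, and picks one representative sentence per K-Means cluster. It is not the code from this repository: the function names `textrank` and `kmeans_pick`, the damping factor, the iteration counts, and the deterministic centroid initialization are all assumptions made for the example. Sentence vectors are expected as plain lists of floats produced by any feature extractor.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def textrank(vectors, damping=0.85, iterations=50):
    """Score sentences by PageRank over a weighted sentence-similarity graph."""
    n = len(vectors)
    # Edge weights: similarity between distinct sentences, no self-loops.
    w = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out = [sum(row) for row in w]  # total outgoing weight per sentence
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(w[j][i] / out[j] * scores[j] for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return scores

def dist2(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans_pick(vectors, k, iterations=20):
    """Cluster sentences with K-Means; return the index nearest each centroid."""
    # Assumption: deterministic init from the first k sentences (real code
    # would more likely use random or k-means++ initialization).
    centroids = [list(v) for v in vectors[:k]]
    dim = len(vectors[0])
    clusters = []
    for _ in range(iterations):
        # Assignment step: each sentence joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for idx, v in enumerate(vectors):
            nearest = min(range(k), key=lambda c: dist2(v, centroids[c]))
            clusters[nearest].append(idx)
        # Update step: move each non-empty cluster's centroid to its mean.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(vectors[m][d] for m in members) / len(members)
                                for d in range(dim)]
    picks = [min(members, key=lambda m: dist2(vectors[m], centroids[c]))
             for c, members in enumerate(clusters) if members]
    return sorted(picks)  # keep the original sentence order
```

A summary would then consist of the top-scoring sentences from `textrank`, or of the sentences at the indices returned by `kmeans_pick`.
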
Several text feature extraction models were studied:
- Bag of words + TF-IDF.
- FastText (pretrained model from DeepPavlov lib).
- RuBERT (pretrained model from DeepPavlov lib).
- RuSBERT (pretrained model from DeepPavlov lib).
- MlSBERT (self-trained model using Sentence BERT for English).
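
To make the simplest of these feature extractors concrete, here is a minimal sketch of bag-of-words vectors weighted by TF-IDF, written in plain Python. The function name `tfidf_vectors`, the whitespace tokenizer, and the smoothed IDF variant are assumptions for the example and may differ from what the repository uses. The pretrained models in the list would instead produce dense sentence embeddings, which can be passed to the same downstream algorithms.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Return (vocabulary, vectors): one TF-IDF weighted vector per sentence."""
    # Assumption: naive lowercase + whitespace tokenization; Russian text
    # would normally also be lemmatized and stripped of stop words.
    tokenized = [s.lower().split() for s in sentences]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n = len(tokenized)
    # Document frequency: number of sentences that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF, so terms present in every sentence keep a positive weight.
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocabulary}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Term frequency (relative count) times IDF; empty sentences map to zeros.
        vectors.append([counts[tok] / len(toks) * idf[tok] if toks else 0.0
                        for tok in vocabulary])
    return vocabulary, vectors
```

Each sentence treated as a "document" is a common choice for extractive summarization, since the goal is to compare sentences with each other rather than whole texts.
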
The research showed that the best summarization algorithm is "Mixed", which is based on the union of the TextRank algorithm and MlSBERT_KMeans.
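
One plausible reading of "union" is sketched below: the sentence indices selected by each summarizer are merged, duplicates are dropped, and the result is emitted in original document order. The function name `mixed_summary` and its arguments are hypothetical, and the actual merging rule used in the repository may differ.

```python
def mixed_summary(sentences, textrank_indices, kmeans_indices):
    """Union of two extractive selections, returned in document order."""
    # Assumption: both summarizers return indices into the same sentence list.
    selected = sorted(set(textrank_indices) | set(kmeans_indices))
    return [sentences[i] for i in selected]
```

A sentence chosen by both summarizers appears only once, so the combined summary is at most as long as the two selections together.
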
All algorithms are in the "src/Rus_summarizers" folder.