Towards Topic Detection Using Minimal New Sets

Madeleine Appert ; Lisa Stenberg
Göteborg : Chalmers tekniska högskola, 2017. 85 pp.
[Degree project, advanced level]

A common way of detecting topics in a collection of articles is to use Hierarchical Clustering, which has been shown to cluster texts, and thereby detect topics, successfully. However, it is computationally expensive, since the similarities between all pairs of articles must be computed. This thesis examines whether smaller amounts of data can be used to detect topics. Building on the work of Damaschke [9] and Guðjónsson [13], we processed the articles as a sequence of chronologically ordered documents and represented each document by its previously unseen word combinations, more formally known as minimal new sets of words. Based on the words that represent each article, we selected the articles containing a given word or word combination. We compared this selection to a ground truth created with Hierarchical Clustering to see whether minimal new sets can be used to approximate clustering in a streaming setting. We performed three experiments and evaluated their results. In the first, we selected articles based on one given word. In the second, we selected articles based on a given two-word combination. The third built on the second, but split the selected articles into separate groups whenever two consecutive articles were not published within a given time limit of each other. Of these experiments, tracking a pair of words gave the best result. Additionally, we found that the Jaccard index of the word combination affected the result: words that appear together more often gave better results. The results indicate that minimal new sets can be used to detect topics, and our model performs significantly better than the corresponding random model. However, we do not yet consider our model able to hold up against established methods. We therefore do not think that our current method is suitable for a topic detection system, but rather that it could be built upon.
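The ideas named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes that a set of words is "new" in a document if it was not contained in any earlier document, and "minimal new" if none of its proper subsets is new as well. Only sets of size one and two are handled, and the function names `minimal_new_sets`, `jaccard`, and `split_by_gap` are hypothetical.

```python
from itertools import combinations


def minimal_new_sets(docs):
    """Return, per document, its minimal new sets of size 1 or 2.

    Assumption: a set is "new" if no earlier document contains it, and
    "minimal" if no proper subset is new too. Larger minimal new sets
    can exist but are ignored in this sketch.
    """
    seen_words = set()
    seen_pairs = set()
    result = []
    for doc in docs:
        words = set(doc)
        # A single word is minimal new the first time it appears.
        new = [frozenset([w]) for w in sorted(words - seen_words)]
        # A pair is minimal new only if both words were seen before
        # (otherwise a singleton subset is already new) and the two
        # words never appeared together in an earlier document.
        for a, b in combinations(sorted(words & seen_words), 2):
            pair = frozenset((a, b))
            if pair not in seen_pairs:
                new.append(pair)
        result.append(new)
        seen_words |= words
        seen_pairs.update(frozenset(p) for p in combinations(sorted(words), 2))
    return result


def jaccard(docs, a, b):
    """Jaccard index of two words: documents with both / documents with either."""
    with_a = {i for i, d in enumerate(docs) if a in d}
    with_b = {i for i, d in enumerate(docs) if b in d}
    union = with_a | with_b
    return len(with_a & with_b) / len(union) if union else 0.0


def split_by_gap(timestamps, max_gap):
    """Split chronologically sorted timestamps wherever a gap exceeds max_gap."""
    groups = []
    for t in timestamps:
        if groups and t - groups[-1][-1] <= max_gap:
            groups[-1].append(t)
        else:
            groups.append([t])
    return groups


docs = [
    {"storm", "coast"},
    {"storm", "election"},
    {"coast", "election"},
    {"storm", "coast", "election"},
]
mns = minimal_new_sets(docs)
```

In this toy stream the third document has no new words, so its only minimal new set is the pair of `coast` and `election`, which had not appeared together before. The time-limit split from the third experiment corresponds to `split_by_gap`, here with plain numbers standing in for publication times.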

Keywords: Minimal New Sets, Topic Detection, Text Mining

The publication was registered 2017-12-19. It was last modified 2017-12-19.

CPL ID: 253923