Term Based Semantic Clusters for Very Short Text Classification

Jasper Paalman, Shantanu Mullick, Kalliopi Zervanou, Yingqian Zhang

Research output: Contribution to conference › Abstract › peer-review


Very short texts, such as tweets and invoices, present challenges in classification.
Such texts abound in ellipsis, grammatical errors, misspellings, and semantic
variation. Although term occurrences are strong indicators of content, in very
short texts sparsity makes it difficult to capture enough content for a semantic
classifier. A solution calls for a method that not only considers term occurrence
but also handles sparseness well. In this work, we introduce such an approach
for the classification of short invoice descriptions, in which each class
reflects a different group of products or services. The developed algorithm is
called Term Based Semantic Clusters (TBSeC).
TBSeC attempts to exploit the information about semantically related words
offered by pre-trained word embeddings, which act as a query expansion technique for our invoice descriptions. The contribution of this paper is twofold: (i) it combines the advantages of word embeddings with conventional term extraction techniques, and (ii) it applies our method to an application domain not previously investigated, namely invoice text, which is characterised by specialised terminology and very short, elliptical and/or ungrammatical text. The invoices are written in Dutch, a language that is morphologically richer than English and therefore poses an additional challenge for statistical approaches.
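To illustrate how word embeddings can act as a query expansion step, the following minimal sketch looks up the nearest neighbours of a term by cosine similarity. The toy three-dimensional vectors, the vocabulary, and the `expand` helper are all hypothetical and stand in for real pre-trained embeddings; the abstract does not describe how TBSeC performs the expansion.

```python
import math

# Toy 3-dimensional vectors standing in for pre-trained word embeddings.
# The words and values are invented for illustration only.
EMBEDDINGS = {
    "laptop":   [0.9, 0.1, 0.0],
    "notebook": [0.8, 0.2, 0.1],
    "toner":    [0.1, 0.9, 0.0],
    "printer":  [0.2, 0.8, 0.1],
    "lunch":    [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(term, k=1):
    """Return the k vocabulary words most similar to `term`, excluding itself.

    Terms without an embedding cannot be expanded and yield an empty list.
    """
    if term not in EMBEDDINGS:
        return []
    scored = [
        (cosine(EMBEDDINGS[term], vec), word)
        for word, vec in EMBEDDINGS.items()
        if word != term
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:k]]


print(expand("laptop"))
print(expand("toner"))
```

With such an expansion, a sparse description that mentions only one term can still be matched against related terms that never occur in it.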
Our proposed method, TBSeC, consists of two stages. In the first stage, we
use class-specific textual information to build semantic concept clusters. Concept
clusters are vector representations of strongly related terms that are distinctive
for a certain class. In the second stage, we compute cluster similarity scores
between a given description and the generated concept clusters, thereby forming a
semantic feature space. This serves as a ranking function that can be used in both
unsupervised and supervised learning tasks (in this work we use Support Vector
Machines as a supervised learning algorithm).
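The two stages can be sketched roughly as follows. This is not the authors' implementation: the abstract does not say how clusters are formed or how similarity is scored, so the sketch assumes that a concept cluster is the centroid of a class's distinctive term vectors and that a description's score for a cluster is the highest cosine similarity of any of its known tokens. The embeddings, class names, and term lists are invented.

```python
import math

# Toy vectors standing in for pre-trained word embeddings (invented values).
EMBEDDINGS = {
    "laptop":   [0.9, 0.1, 0.0],
    "notebook": [0.8, 0.2, 0.1],
    "toner":    [0.1, 0.9, 0.0],
    "printer":  [0.2, 0.8, 0.1],
    "lunch":    [0.0, 0.1, 0.9],
}

# Hypothetical distinctive terms per class, e.g. from a term extraction step.
CLASS_TERMS = {
    "hardware": ["laptop", "notebook"],
    "office_supplies": ["toner", "printer"],
    "catering": ["lunch"],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_clusters(class_terms):
    """Stage 1 (assumed): one centroid vector per class."""
    return {
        label: centroid([EMBEDDINGS[term] for term in terms])
        for label, terms in class_terms.items()
    }


def cluster_features(description, clusters):
    """Stage 2 (assumed): best token-to-cluster similarity per cluster.

    Tokens without an embedding are skipped; if none remain, every score is 0.0.
    """
    tokens = [t for t in description.lower().split() if t in EMBEDDINGS]
    return {
        label: max((cosine(EMBEDDINGS[t], vec) for t in tokens), default=0.0)
        for label, vec in clusters.items()
    }


clusters = build_clusters(CLASS_TERMS)
features = cluster_features("Notebook 15inch", clusters)
best = max(features, key=features.get)  # unsupervised use: rank clusters by score
print(best, features)
```

In a supervised setting, the score vector for each description would be passed as features to a classifier such as a Support Vector Machine, in place of the simple arg-max used here.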
Original language: English
Publication status: Published - 2019
Externally published: Yes
Event: 31st Benelux Conference on Artificial Intelligence - Brussels, Belgium
Duration: 6 Nov 2019 to 8 Nov 2019


Conference: 31st Benelux Conference on Artificial Intelligence
Abbreviated title: BNAIC/BENELEARN 2019


  • Machine learning

