Abstract
Very short texts, such as tweets and invoices, are challenging to classify. Such texts abound in ellipsis, grammatical errors, misspellings, and semantic variation. Although term occurrences are strong indicators of content, in very short texts sparsity makes it difficult to capture enough content for a semantic classifier. A solution calls for a method that not only considers term occurrence but also handles sparseness well. In this work, we introduce such an approach for the classification of short invoice descriptions, in which each class reflects a different group of products or services. The developed algorithm is called Term Based Semantic Clusters (TBSeC).
TBSeC attempts to exploit the information about semantically related words offered by pre-trained word embeddings, using it as a query expansion technique for our invoice descriptions. The contribution of this paper lies in (i) combining the advantages of word embeddings with conventional term extraction techniques, and (ii) applying our method in an application domain not previously investigated, namely invoice text, which is characterised by specialised terminology and very short, elliptical and/or ungrammatical text. The invoices are in Dutch, a language that is morphologically richer than English and therefore poses an additional challenge for statistical approaches.
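As an illustration of embedding-based query expansion in general (not the authors' implementation), the following minimal sketch adds the nearest neighbours of each description term to the term list. The toy three-dimensional vectors, the function names `cosine` and `expand_terms`, and the `top_k` and `min_similarity` parameters are all assumptions made for this example; a real system would load pre-trained Dutch embeddings instead.

```python
import math

# Hypothetical toy embeddings; real pre-trained vectors have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "laptop": [0.9, 0.1, 0.0],
    "notebook": [0.8, 0.2, 0.1],
    "computer": [0.85, 0.15, 0.05],
    "koffie": [0.0, 0.9, 0.2],
    "thee": [0.1, 0.8, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_terms(terms, embeddings, top_k=2, min_similarity=0.9):
    """Append up to top_k sufficiently similar neighbours of each known term."""
    expanded = list(terms)
    for term in terms:
        if term not in embeddings:
            continue  # out-of-vocabulary terms are kept but not expanded
        neighbours = sorted(
            (
                (cosine(embeddings[term], vector), other)
                for other, vector in embeddings.items()
                if other != term
            ),
            reverse=True,
        )
        for similarity, other in neighbours[:top_k]:
            if similarity >= min_similarity and other not in expanded:
                expanded.append(other)
    return expanded


if __name__ == "__main__":
    print(expand_terms(["laptop"], TOY_EMBEDDINGS))
```

With these toy vectors, a description containing only "laptop" is expanded with "computer" and "notebook", which is how expansion can offset the sparsity of a very short text.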
Our proposed method, TBSeC, consists of two stages. In the first stage, we use class-specific textual information to build semantic concept clusters. Concept clusters are vector representations of strongly related terms that are distinctive for a certain class. In the second stage, for a given description, we compute similarity scores against the generated concept clusters, thereby forming a semantic feature space. This serves as a ranking function that can be used in both unsupervised and supervised learning tasks (in this work, we use Support Vector Machines as the supervised learning algorithm).
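The two stages can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the TBSeC algorithm itself: the abstract does not specify how clusters are built or scored, so the sketch assumes each concept cluster is the centroid of a class's term vectors and scores a description by the maximum cosine similarity of its tokens to that centroid. The toy embeddings, the `CLASS_TERMS` mapping, and all function names are hypothetical.

```python
import math

# Hypothetical toy embeddings and class-specific terms, for illustration only.
TOY_EMBEDDINGS = {
    "laptop": [0.9, 0.1, 0.0],
    "notebook": [0.8, 0.2, 0.1],
    "computer": [0.85, 0.15, 0.05],
    "koffie": [0.0, 0.9, 0.2],
    "thee": [0.1, 0.8, 0.3],
}
CLASS_TERMS = {
    "hardware": ["laptop", "computer"],
    "catering": ["koffie", "thee"],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_clusters(class_terms, embeddings):
    """Stage 1 (assumed form): one centroid vector per class from its known terms."""
    clusters = {}
    for label, terms in class_terms.items():
        vectors = [embeddings[t] for t in terms if t in embeddings]
        if not vectors:
            continue  # no embedded terms for this class, so no cluster
        dim = len(vectors[0])
        clusters[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return clusters


def cluster_features(description, clusters, embeddings):
    """Stage 2 (assumed form): max token-to-centroid similarity per cluster."""
    tokens = [t for t in description.lower().split() if t in embeddings]
    return {
        label: max((cosine(embeddings[t], centroid) for t in tokens), default=0.0)
        for label, centroid in clusters.items()
    }


def rank_classes(features):
    """Unsupervised use: order class labels by descending similarity score."""
    return sorted(features, key=features.get, reverse=True)


if __name__ == "__main__":
    clusters = build_clusters(CLASS_TERMS, TOY_EMBEDDINGS)
    features = cluster_features("2x notebook dell", clusters, TOY_EMBEDDINGS)
    print(rank_classes(features), features)
```

In a supervised setting, the per-cluster scores returned by `cluster_features` would be passed as a feature vector to a classifier such as a Support Vector Machine instead of being ranked directly.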
Original language | English |
---|---|
Publication status | Published - 2019 |
Externally published | Yes |
Event | 31st Benelux Conference on Artificial Intelligence, Brussels, Belgium. Duration: 6 Nov 2019 → 8 Nov 2019. http://ceur-ws.org/Vol-2491/ |
Conference
Conference | 31st Benelux Conference on Artificial Intelligence |
---|---|
Abbreviated title | BNAIC/BENELEARN 2019 |
Country/Territory | Belgium |
City | Brussels |
Period | 6/11/19 → 8/11/19 |
Internet address | http://ceur-ws.org/Vol-2491/ |
Keywords
- Machine learning