Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents


Beyond its own merits, text classification (TC) has become a cornerstone of many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to automatically create modules that speed up classification for any text categorization task, and it also serves as a tool for creating Arabic text corpora. In particular, we build a text classification process for Arabic news articles downloaded from web news portals and sites. The suggested procedure is a pilot project that uses a predefined set of documents that humans have assigned to subjects, or categories. Vectorized Term Frequency-Inverse Document Frequency (TF-IDF) processing was used for the initial verification of the categories. The resulting validated categories were then used to predict categories for new documents. The experiment used 1000 initial documents pre-assigned to five categories, with 200 documents in each. An initial set of 2195 documents was downloaded from a number of Arabic news sources and preprocessed to test the utility of the suggested classification procedure, using cosine similarity as the classifier. Results were very encouraging, with very satisfying precision, recall, and F1-score. The authors intend to improve the procedure and to use it for Arabic corpora creation.
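
As a rough illustration of the idea summarized above, the following minimal Python sketch classifies a new document by its cosine similarity to TF-IDF vectors built from labeled seed documents. It is not the authors' implementation. The function names, the whitespace tokenizer, the smoothed IDF formula, the toy seed documents, the two category labels, and the choice to compare against one summed vector per category (rather than against individual seed documents) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: plain whitespace tokenization; real Arabic preprocessing
    # (normalization, stop-word removal, stemming) would be richer.
    return text.split()


def fit_idf(docs):
    # Smoothed inverse document frequency computed over the seed documents.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    # Sparse TF-IDF vector; terms unseen in the seed set are ignored.
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_category_vectors(seed_docs, seed_labels, idf):
    # Sum the TF-IDF vectors of each category's seed documents.
    # Cosine similarity is scale-invariant, so a sum behaves like a mean here.
    sums = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(seed_docs, seed_labels):
        for term, weight in tfidf(doc, idf).items():
            sums[label][term] += weight
    return {label: dict(vec) for label, vec in sums.items()}


def classify(text, category_vectors, idf):
    # Predict the category whose vector is most similar to the document.
    vec = tfidf(text, idf)
    return max(category_vectors, key=lambda label: cosine(vec, category_vectors[label]))


# Hypothetical toy seed set (two categories, two documents each).
seed_docs = [
    "فريق مباراة هدف",
    "مباراة فريق لاعب",
    "سوق أسهم اقتصاد",
    "اقتصاد سوق بنك",
]
seed_labels = ["sports", "sports", "economy", "economy"]

idf = fit_idf(seed_docs)
category_vectors = build_category_vectors(seed_docs, seed_labels, idf)
predicted = classify("فريق يسجل هدف", category_vectors, idf)
```

In this sketch, `predicted` holds the label of the most similar category. Precision, recall, and F1-score could then be computed by comparing such predictions against human-assigned labels on a held-out set.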

