2014 6th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia

Automated Document Classification for News Article in Bahasa Indonesia based on Term Frequency Inverse Document Frequency (TF-IDF) Approach

Ari Aulia Hakim, Alva Erwin, Kho I Eng, Maulahikmah Galinium, Wahyu Muliady
Faculty of Engineering and Information Technology, Swiss German University, BSD, Tangerang, Indonesia
{ari.hakim[at]student., alva.erwin[at], ie.kho[at], maulahikmah.galinium[at]}sgu.ac.id, wahyu.muliadi[at]akonteknologi.com

978-1-4799-5303-5/14/$31.00 ©2014 IEEE

Abstract— The exponential growth of data may lead to an information explosion era, an era in which most data can no longer be managed easily. Text mining research is believed to help prevent the world from entering that era. One text mining study that may prevent the explosion era is text classification, a way to classify articles into several predefined categories. In this research, the classifier implements the TF-IDF algorithm. TF-IDF computes a word's weight by considering both the frequency of the word (TF) and the number of documents in which the word can be found (IDF). Because the IDF accounts for how many documents contain a term, it controls the weight of each word: a word found in very many documents is treated as unimportant. TF-IDF has been proven to create a classifier that can classify news articles in Bahasa Indonesia with high accuracy: 98.3%.

Index Terms—Text mining, text classification, TF-IDF approach

I. INTRODUCTION

Most data is stored in the form of text, and the exponential growth of data has triggered an increase in the amount of text mining research. These conditions may lead us into the information explosion era, an era in which data can no longer be maintained easily [1].

One interesting text mining topic that may prevent us from entering the information explosion era is text classification. Text classification is a way to automatically categorize uncategorized data into several predefined categories [2]. As data are grouped by topic, data management becomes easier. Text classification can also give readers brief information about a document before they read through it [3].

Several algorithms can be used for document classification, such as k-nearest neighbor (KNN) and naive Bayes. In this research, however, we chose term frequency inverse document frequency (TF-IDF) as the main algorithm, for two reasons. First, TF-IDF is one of the most recognized word weighting algorithms. Second, its accuracy in classifying articles is promising, because TF-IDF weighs every word using two measures: the frequency of a term and the number of documents in which the term can be found.

The purpose of this research is to create a classifier that can classify online news articles in Bahasa Indonesia with high accuracy.

The remainder of this paper is organized into five sections: Section II discusses related work, Section III presents the system overview, Section IV covers the implementation, Section V presents the results and discussion, and Section VI gives the conclusion.

II. RELATED WORK

Numerous studies on the classification of digital news articles have been conducted, and many algorithms have been implemented in order to build classifiers with accuracy close to 100%.

Decision trees, support vector machines [4], and k-nearest neighbor [5], [6] are the most common algorithms used in this field of study; statistical approaches can also be applied [7]. These algorithms can be applied to several different languages.

Unfortunately, the number of Indonesian researchers participating in this field is very low, as evidenced by the equally small number of scientific papers on text classification in Bahasa Indonesia.

One text classification study in Bahasa Indonesia used a K-NN approach [5], and its reported accuracy looks promising: 92%.

To pursue higher accuracy, however, other algorithms have to be tested, and we decided to implement term frequency inverse document frequency in this research.

III. SYSTEM OVERVIEW

The goal of this research is to create an online news article classifier that assigns articles to fifteen predefined categories: beauty, business, football, economy, entertainment, health, food and drink, lifestyle, automotive, education, politics, property, sport, technology, and travel.
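The weighting idea behind TF-IDF described above, a term's frequency moderated by how many documents contain it, can be illustrated with a minimal sketch. The toy corpus, the function names, and the choice of base-10 logarithm are our own assumptions for illustration; the paper does not show code or specify a log base:

```python
import math

# Toy corpus of three tokenized "documents"; the words are illustrative only.
docs = [
    ["pemilu", "partai", "politik", "politik"],
    ["gol", "pemain", "politik"],
    ["gol", "klub", "pemain"],
]

def tf(term, doc):
    # Term frequency: how often the term occurs in one document.
    return doc.count(term)

def idf(term, docs):
    # Log of the inverse fraction of documents containing the term:
    # a term found in many documents gets a weight close to zero.
    containing = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "politik" occurs in 2 of 3 documents, so its idf is modest;
# "klub" occurs in only 1 document, so it is treated as distinctive.
print(tfidf("politik", docs[0], docs))  # 2 * log10(3/2)
print(tfidf("klub", docs[2], docs))     # 1 * log10(3/1)
```

Note how a word found in every document would get idf = log(1) = 0, which is exactly the "unimportant word" behavior the abstract describes.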
The main process is divided into two phases: the pre-processing phase and the processing phase. In the pre-processing phase, the word and weight dictionary is created; in the processing phase, uncategorized articles are categorized based on their topics.

Creating the word and weight dictionary takes seven steps: tokenization, bigram creation, duplicate removal, stop-word removal, word filtering based on term frequency, supervised word removal, and TF-IDF implementation to obtain the weight of each word.

In the processing phase, the system maps each word of the input (uncategorized) file to that word's weight in each category, totals the weights per category, and outputs the category with the largest total.

All of these steps are explained in the next section.

IV. IMPLEMENTATION

To create a good classifier with the TF-IDF algorithm, the classifier has to be trained using articles that have already been grouped by category so that it can classify articles precisely; the more articles in the training set, the better the output. For this purpose, 7,500 online news articles were gathered (500 articles per category) and grouped based on their topics.

As mentioned in the previous section, several steps are needed to create a good word and weight dictionary:

A. Tokenization

Most text mining research involves words or sentences, and to process them further the text has to be split word by word [8]. In this phase, all words in the sentences are segmented and all punctuation is discarded, since punctuation cannot represent any category; removing it also simplifies the calculation in the next steps. A tokenization example:

Input: This paper divided into six parts.
Output: [This] [paper] [divided] [into] [six] [parts]

B. Bigram Creation

To enrich the dictionary and obtain higher accuracy, it is also provided with bigrams, terms consisting of two words. A bigram is created by combining the word at index n with the word at index n+1 in the token list produced by the tokenization phase.

Input: [This] [paper] [divided] [into] [six] [parts]
Output: [This] [paper] [divided] [into] [six] [parts] [This paper] [paper divided] [divided into] [into six] [six parts]

C. Duplicate Removal

All duplicate words in the lexicon have to be removed; otherwise they increase the processing time and make the word weight calculation more complicated.

D. Stop Word Removal

Stop-word removal is a method that has been used since the early days of information retrieval research [9]. It eliminates the terms that appear most and least often, as well as unimportant words or words without specific meaning, for example "the", "a", and "an" in English, or "yang", "di", and "ke" in Bahasa Indonesia, along with other prepositions and conjunctions.

The removal process uses an Indonesian stop-word dictionary that can be obtained for free from the internet [10]. This phase is conducted for the same reason as the previous step: to reduce the processing time and calculation complexity.

E. Term Frequency Filtering

To reduce the dimensionality, all words with a low term frequency have to be removed. In this research, 200 was set as the threshold frequency, so words that appear fewer than 200 times are removed.

F. Supervised Word Removal

To reduce the dimensionality and calculation complexity further, the lexicon has to be checked manually, since bigram creation places many invalid words in the lexicon. The previous steps also reduce the lexicon size, but some invalid words remain. This step may not discard all invalid words either, but it still reduces the processing time of the lexicon creation phase.

G. Term Frequency Inverse Document Frequency Implementation

After the lexicon has been created, the weight of each word in it is calculated based on term frequency inverse document frequency (TF-IDF).

TF-IDF is one of the most recognized algorithms in text mining research [11]. Term frequency is the number of times a word occurs in a document, and idf is the logarithm of the inverse probability of the word being found in any document [12]. TF-IDF is calculated as in equation (1):

tfidf(t,d,D) = tf(t,d) × idf(t,D)    (1)

where:
• tfidf(t,d,D) = the TF-IDF weight of a term
• tf(t,d) (term frequency) = how many times a term can be found in a text or a category
• idf(t,D) (inverse document frequency) = the log of the inverse probability of the term being found in any document
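Steps A and B above (tokenization and bigram creation) can be sketched as follows. The regular expression, lowercasing, and function names are our own choices; the paper does not specify how case or punctuation are handled internally:

```python
import re

def tokenize(text):
    # Step A: split into words and discard punctuation.
    # Lowercasing is an assumption; the paper's example keeps case.
    return re.findall(r"[A-Za-z0-9]+", text.lower())

def add_bigrams(tokens):
    # Step B: append each pair of adjacent tokens (index n and n+1).
    return tokens + [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(add_bigrams(tokenize("This paper divided into six parts.")))
```

The unigrams come first, followed by "this paper", "paper divided", and so on, mirroring the paper's example output.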
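Steps C through E (duplicate removal, stop-word removal, and term frequency filtering with the paper's threshold of 200) might look like the sketch below. The stop-word set here is a tiny stand-in for the downloaded Indonesian stop-word dictionary, and the function name is ours:

```python
from collections import Counter

# Tiny stand-in for the Indonesian stop-word dictionary used in step D.
STOP_WORDS = {"yang", "di", "ke", "dan", "dari"}

def build_lexicon(all_tokens, min_freq=200):
    # Counting the tokens also collapses duplicates (step C).
    counts = Counter(all_tokens)
    return {
        term
        for term, n in counts.items()
        if term not in STOP_WORDS  # step D: stop-word removal
        and n >= min_freq          # step E: term frequency filtering
    }

# Demo with a small threshold; the paper's actual threshold is 200.
print(build_lexicon(["bola"] * 3 + ["yang"] * 5 + ["unik"], min_freq=2))
```

With the demo threshold of 2, only "bola" survives: "yang" is a stop word and "unik" is too rare.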
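Equation (1), applied with each category's training articles treated as one "document" as step G describes, could be implemented roughly as below. The data layout and function name are assumptions, and base-10 log is assumed since the paper does not state the base:

```python
import math

def category_weights(category_tokens):
    """category_tokens maps category name -> list of tokens from its
    training articles. Returns (term, category) -> TF-IDF weight,
    following equation (1) with each category as one document."""
    n = len(category_tokens)
    # Document frequency: in how many categories each term occurs.
    df = {}
    for tokens in category_tokens.values():
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    weights = {}
    for cat, tokens in category_tokens.items():
        for term in set(tokens):
            # tf(t,d) * idf(t,D); a term found in every category gets
            # idf = log(1) = 0, i.e. it is treated as unimportant.
            weights[(term, cat)] = tokens.count(term) * math.log10(n / df[term])
    return weights

w = category_weights({
    "football": ["gol", "gol", "klub"],
    "politics": ["partai", "gol"],
    "economy":  ["pasar"],
})
print(w[("gol", "football")])  # 2 * log10(3/2)
```

Category-specific terms such as "klub" end up with the largest weights, which is consistent with the paper's later observation that football articles classify well because of their distinctive vocabulary.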
The TF-IDF calculation is illustrated in subsections G and H. Table I shows a TF-IDF training set example.

Table I
TF-IDF Training Set Example

Seq  Doc
1    The game of life is a game of everlasting learning
2    The unexamined life is not worth living
3    Never stop learning and learning

Assume we want to calculate the weight of the term "learning" in documents #1 and #3. As seen in the table, the term "learning" appears once in #1 and twice in #3, and it occurs in two of the three documents. The calculation is as follows:

• Tf-idf(learning,1) = 1 × log(3/2) = 0.23
• Tf-idf(learning,3) = 2 × log(3/2) = 0.46

The word "learning" thus has two different weights in #1 and #3; this happens because the term has more influence in #3.

H. Normalizing the TF-IDF Weight

TF-IDF has been proven to produce a good computation [14], but documents today differ in size, which may skew the results. The output of the previous step should therefore be normalized so that the weights for a term sum to 1. The calculation is shown below, where NTf-idf is the normalized tf-idf:

NTf-idf(#1) = 0.23 / (0.23 + 0.46) = 0.33
NTf-idf(#3) = 0.46 / (0.23 + 0.46) = 0.67

To test the classifier, more than 12,000 articles were gathered, and 53 persons were personally asked to group the articles manually by topic (800 articles per category) as the ground truth for the testing phase.

V. RESULT AND ANALYSIS

After all of the steps had been implemented, the results were generated; they can be read in Table II.

Table II
Accuracy Table

Category         Accuracy (%)
Beauty           98.7
Business         98.4
Food and Drink   99.1
Economy          97.5
Education        98.87
Entertainment    99.75
Football         99.875
Health           97.13
Lifestyle        99
Automotive       97.4
Politic          96.13
Property         98.3
Sport            99.8
Technology       97.75
Travel           97.3

The category this classifier categorizes best is football, at 99.875%, followed by sport at 99.8% and entertainment at 99.75%; the worst is politics, at 96.13%. The TF-IDF algorithm can thus categorize online news articles with high accuracy, since the average accuracy of this classifier is 98.3% [14].

The accuracy of the classifier varies across categories, with football articles classified most accurately and politics least. The most likely reason is that football articles contain many terms that belong only to that topic, such as football club names and player names. Fig. 1 illustrates an article that is categorized correctly: it belongs to a category and contains many words that represent that category.

Fig. 1. Correctly Categorized Article Example

As seen in Fig. 1, the article contains terms that can only represent football articles, such as "world cup", "football", and "Deschamps" (a former French football player and the coach of the France football team at the 2014 World Cup).

Furthermore, Fig. 2 shows an example of an incorrectly categorized article.
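The normalization in section H, dividing each weight by the sum so the per-term weights total 1, is a one-liner to sketch; the inputs below are the paper's worked values for "learning":

```python
def normalize(weights):
    # Section H: divide each TF-IDF weight by the total so they sum to 1.
    total = sum(weights)
    return [w / total for w in weights]

print(normalize([0.23, 0.46]))  # approximately [0.33, 0.67]
```

Because the two raw weights are in ratio 1:2, the normalized values come out to one third and two thirds regardless of the log base used to compute them.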
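The scoring used in the processing phase (and behind per-category totals like those in Table III) can be sketched as follows; the lexicon fragment and its weights are invented for illustration, not taken from the paper's dictionary:

```python
from collections import defaultdict

def classify(tokens, lexicon):
    # lexicon maps term -> {category: TF-IDF weight}.
    # Sum each token's weight per category; rank categories best-first.
    scores = defaultdict(float)
    for token in tokens:
        for cat, weight in lexicon.get(token, {}).items():
            scores[cat] += weight
    return sorted(scores, key=scores.get, reverse=True)

lexicon = {
    "pemilu": {"politics": 0.9},
    "gol": {"football": 0.8, "politics": 0.1},
}
print(classify(["pemilu", "pemilu", "gol"], lexicon))  # politics first
```

An article dominated by words weighted toward another category is ranked under that category, which is exactly the failure mode the misclassification example below illustrates.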
Fig. 2. Incorrectly Categorized Article Example

Even though a human would classify the article in Fig. 2 as a football article, it contains many words that can represent other categories; the weight calculation for this article can be read in Table III.

Table III
The Result of the Incorrectly Categorized Article Computation

No  Category  Result
1   Politic   23.06
2   Football  15.51
3   Business  14.48

As seen in Table III, the classifier categorized this article as a politics article, although it placed football in second position.

The classification process by computer took 151 hours, in the following environment:

• Processor: Intel(R) Core(TM)2 Duo CPU @ 2.26 GHz
• Memory: 2048 MB
• Hard disk drive: 500 GB
• Operating system: Windows 8 Pro 32-bit
• Developed in the Java programming language

VI. CONCLUSION

The TF-IDF algorithm can categorize online news articles with high accuracy. This has been proved by the output of this research: the average accuracy is 98.3%, which is quite satisfying.

TF-IDF depends heavily on the lexicon, as shown by the failures in classifying articles that belong to one topic but resemble another. 7,500 articles are enough to create a reasonably good lexicon, which in turn yields a classifier with high accuracy.

Online news article classification by computer is more effective than classification by humans. As stated in the fourth section, humans took 89 hours to classify the articles; even though a human can do it faster than the computer, the fact that a human can only work 8 hours a day makes the overall process longer than for a computer that can run 24 hours a day.

The algorithm promises high accuracy, but it still has weaknesses. One is categorizing an article that belongs to one category but contains many words describing another category; since the algorithm weighs every word, this can make the system produce the wrong output. The other weakness is that the algorithm takes a lot of time, due to the large number of words in the dictionary.

VII. ACKNOWLEDGEMENT

The research presented in this paper is supported by the Information Technology Department of Swiss German University.

VIII. REFERENCES

[1] Jasiliu A. Kadiri and Niran A. Adetoro, "Information Explosion and the Challenges of Information and Communication Technology Utilization in Nigerian Libraries and Information Centres," Ozean Journal of Social Sciences 5, 2012.
[2] Chee-Hong Chan, Aixin Sun, and Ee-Peng Lim, "Automated Online News Classification with Personalization," 4th International Conference of Asian Digital Library (ICADL), pp. 320-329, December 2001.
[3] Shlomo Argamon, Casey Whitelaw, Paul Chase, and Sushant Dhawle, "Stylistic Text Classification Using Functional Lexical Features," 2005.
[4] Motaz K. Saad, The Impact of Text Preprocessing and Term Weighting on Arabic Text Classification, 2010.
[5] Arni Darliani Asy'arie and Adi Wahyu Pribadi, "Automatic News Articles Classification in Indonesian Language by Using Naive Bayes Classifier Method," iiWAS, 2009.
[6] Thiago Salles and Leonardo Rocha, "Automatic Document Classification Temporally Robust."
[7] Shrikanth Shankar and George Karypis, "A Feature Weight Adjustment Algorithm for Document Categorization."
[8] Riyad Al-Shalabi, "Stop-Word Removal Algorithm for Arabic Language," 2004.
[9] Wang Pidong. (2011, June) Wang Pidong's Homepage. [Online]. http://wangpidong.blogspot.com/2011/06/indonesian-stop-words-list.html
[10] Tian Xia and Yanmei Chai, "An Improvement to TF-IDF: Term Distribution Based Term Weight Algorithm," Journal of Software, p. 413, 2011.
[11] Jana Vembunarayanan. (2013, October) Seeking Wisdom. [Online]. http://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/
[12] Mingyong Liu and Jiangang Yang, "An Improvement of TFIDF Weighting in Text Categorization," in International Conference on Computer Technology and Science (ICCTS), Singapore, 2012.
[13] Sabine Schulte. (2011, August) Tokenizing. [Online]. http://www.coli.uni-saarland.de/~schulte/Teaching/ESSLLI-06/Referenzen/Tokenisation/schmid-hsk-tok.pdf
[14] Ari Aulia Hakim, "Automated Document Classification for News Article in Bahasa Indonesia Based on Term Frequency Inverse Document Frequency (TF-IDF) Approach," Bachelor Thesis, Swiss German University, Tangerang, 2014.