
Proceedings of 2019 4th International Conference on Information Technology

Published by b.pramuk, 2019-10-20 23:27:43

Description: USB - Proceedings of 2019 4th International Conference on Information Technology

Keywords: InCIT2019,TNI,CITT



V. SYSTEM EVALUATION

The researchers designed a questionnaire survey using a Likert scale to collect data from three system experts and ten students. The criteria consist of four parts:

1) Utility
2) Reliability
3) Convenience
4) Effectiveness

A 5-point Likert scale was used, from the least to the most (1 to 5) [6]:

1 - Strongly disagree
2 - Disagree
3 - Neither agree nor disagree
4 - Agree
5 - Strongly agree

The attendance recording by QR code via smartphone was evaluated by the three experts and the ten users. The results are shown in TABLE I and TABLE II, respectively.

TABLE I. SATISFACTION RATE EVALUATED BY EXPERTS (overall mean 3.76, S.D. 0.48)

TABLE II. SATISFACTION RATE EVALUATED BY USERS (overall mean 4.56, S.D. 0.27)

Both tables report the mean, S.D. and satisfaction rate for each question, grouped into Part 1 System Benefits (1.1 Covered functions for users, 1.2 This system can be truly used), Part 2 System Credibility (2.1 No problem during using the system, 2.2 Convenient for users, 2.3 Day and time cheating protection, 2.4 Subjects cheating protection), Part 3 System Convenience (3.1 Not complicated page, 3.2 Convenient page, 3.3 Not complicated system, 3.4 Easy to use and not much equipment required, 3.5 Suitable graphic and color of result page, 3.6 Data completeness), and Part 4 System Effectiveness (4.1 Login steps, 4.2 Information accuracy).

From TABLE I, the evaluation of the satisfaction rate by the three experts shows an average score of 3.76 (S.D. = 0.48). Most scores are in the "very satisfy" range for the benefits, convenience, and information accuracy. This is because the system can show all data, and students cannot cheat the date, time and subjects.

From TABLE II, the evaluation of the satisfaction rate by the ten users shows an average score of 4.56 (S.D. = 0.27). Most scores are in the "very satisfy" range for the convenience, benefits, information accuracy, efficiency and effectiveness. Moreover, the system is not complicated to use; it requires only a smartphone and the internet. The system also shows all data, which is convenient for the teachers.

VI. CONCLUSION AND FUTURE WORK

This research reports attendance recording by QR code via smartphone. The system is organized into three modules: 1) Teacher Module: sharing and creating a classroom, 2) Generate Module: QR code generation, and 3) Student Module: data recording. The evaluation of the satisfaction rate by the experts shows that the score is very high and the system is suitable for use; the average is 3.76 (S.D. = 0.48), which means "very satisfy". The evaluation by the users gives an average of 4.56 (S.D. = 0.27), which also means "very satisfy". The data were recorded correctly (as indicated by the highest satisfaction rates).

For further research, the researchers plan to improve this system so that it can be used in other classes. In addition, they aim to ensure that the recorded data are corrected accurately and that the system becomes more convenient for users.

REFERENCES
[1] Jun-Chou Chuang, Yu-Chen Hu and Hsien-Ju Ko, "A Novel Secret Sharing Technique Using QR Code", International Journal of Image Processing (IJIP), Volume (4): Issue (5), pp. 468-475, 2010.
[2] Tan Jin Soon, "QR Code", 2008. [Available online] at http://qrbcn.com/imatgesbloc/Three_QR_Code.pdf. (August 17, 2017).
[3] Allen Grove, Anne Marie Helmenstine, Jocelly Meiners and Robert Longley, "Smartphone Technologies of the Future", May 25, 2019. [Available online] at https://www.thoughtco.com/future-smartphone-technology-4151990. (June 18, 2019).
[4] Francisco Liébana-Cabanillas, Iviane Ramos de Luna and Francisco J. Montoro-Ríos, "User behaviour in QR mobile payment system: the QR Payment Acceptance Model", Technology Analysis & Strategic Management, 27:9, 1031-1049, 2015. DOI: 10.1080/09537325.2015.1047757.
[5] Jaeseok Yang, "Mobile Assisted Language Learning: Review of the Recent Applications of Emerging Mobile Technologies", English Language Teaching, Volume (6): Issue (7), pp. 19-25, June 2013.
[6] Dane Bertram, "Likert Scales". [Available online] at http://poincare.matf.bg.ac.rs/~kristina/topic-dane-likert.pdf (August 18, 2017).

Unveiling Malicious Activities in LAN with Honeypot

Zhiqing Zhang, Hiroshi Esaki, Hideya Ochiai
The University of Tokyo
[email protected], [email protected], [email protected]

Abstract—Security monitoring of a remote local area network (LAN) is becoming more and more important these days, as cyber attacks shift their targets to the hosts in LANs. However, monitoring alone cannot recognize the differences between vulnerability tests and malware attacks, which may cause confusion among network operators. This paper proposes an architecture for a cloud-based LAN-security monitoring system that can differentiate vulnerability tests and malware attacks happening in remote LANs. We also design an algorithm to classify those activities into (1) ARP scan, (2) TCP port scan, (3) application-level connection establishment and (4) intrusion; the latter two activities are mostly caused by malware, not by vulnerability testing. We demonstrate with our prototype implementation that our system can differentiate those behaviors, i.e., vulnerability tests and malware attacks.

Index Terms—local area network, security, honeypot, intrusion detection system, malware

Fig. 1: Architecture of monitoring system: Nodes are deployed in each LAN for monitoring traffic through it, and send data files to a server for processing and reporting.

I. INTRODUCTION

Cyber attacks towards local area networks (LANs) have shown an increasing trend in recent years. By adopting methods such as phishing email, malware intrudes into an innocent host in a LAN and infects it, posing a serious threat to all hosts through its capability of propagation. Devices may stop working simultaneously and cause a huge impact. Malware intrusion in a LAN needs to be detected as early as possible, before it makes serious attacks after proliferation.

Network monitoring is a basic approach for detecting such intrusions. However, from the analysis of the network traffic alone, the system also detects vulnerability tests [1] as "intrusion". From the network operators' point of view, it is important to detect vulnerability tests and malware attacks separately in order to warn the network administrator with higher priority in the case of malware attacks.

This paper proposes an architecture for a cloud-based LAN-security monitoring system that can differentiate vulnerability tests and malware attacks that happen in remote local area networks (Fig. 1). In the architecture, the monitoring node runs several honeypots that are supposed to interact with malware in the LAN, along with basic packet capturing. The server collects the logs of the honeypots and the packet capture, and our algorithm running on the server recognizes the events.

In order to propagate to other hosts or steal important information from databases in a LAN, malware tries to find open TCP ports on all the hosts in the network. In this process, it broadcasts ARP requests to look up the MAC address of every IP address one-by-one, which we call ARP scan in this paper. Then, it sends TCP SYNs to potentially available ports, which is TCP port scan. If it gets a TCP SYN+ACK from the target host, it means that the TCP port is open. This behaviour is actually similar to nmap, a vulnerability tester.

Malware then establishes a connection to the target host at the application layer and tries several username and password pairs, i.e., brute-force attacks to log in. After getting privileges to work on the platform, it downloads files onto the platform or steals data from the database. These actions are almost unique to malware's behavior. We recognize that there are some vulnerability testers that behave in a similar way [2], but they would be deployed with extensive attention.

In this work, we have developed a prototype of the monitoring node and the collection and processing server. We also tested the basic functions of detecting vulnerability tests and malware attacks in our laboratory in order to show that the system can detect them separately and report to the network administrator. Statistical analysis of detected attacks over a certain duration in a real network environment will be provided in our next step, i.e., it is beyond the scope of this paper.

This paper is organized as follows. We summarize related work in Section II. In Section III, we present the design and details of our monitoring system. Then we demonstrate its functions in Section IV. After that, we discuss our next steps in Section V and conclude in Section VI.

Fig. 2: Design of monitoring node: Honeypots are installed on the node covering a large range of protocols to differentiate vulnerability testers and malware, and the node sends data files to the server by ssh.

II. RELATED WORK

By representing a real computer system, honeypots are used as a trap for unauthorized communication in a network. They perform as a part of an IDS: they can be used to detect malicious activities and profile attackers, and they provide the IDS with comprehensive and detailed information on attacks [3]. With their feature of reaction, honeypots extract attack data without it being mingled with production activity data [4], which can be used in the intrusion detection of a system. Besides, honeypots can be, and are recommended to be, combined with other tools in IDSs [3].

The main advantage of using honeypots in a LAN is that they can profile attacks from both outside and inside. Since malware usually propagates in a LAN by infecting other hosts, LAN intrusion detection is supposed to also be alert to hijacked hosts. In this scenario, low-interaction honeypots should be used to avoid the security risk that opening real services would cause in the LAN. Li et al. [5] set up both a virtual honeypot and a physical honeypot in their system, where the former works for server protection and the latter for trapping attackers. Lionel et al. [6] noticed the weakness of IoT devices in LANs. Besides telnet, which is the main target of malware, they found a list of ports on IoT devices that are easy to attack. These works concentrate only on honeypot logs without any other security tools or methods. Miroslaw et al. [7] monitored both network traffic and attacks logged in honeypots in their system, and from the network traffic they found ARP scans from the gateway and other hosts. However, they failed to see the connection between the results of these two tools. In our system, we combine ARP scan detection, TCP SYN detection and honeypots, providing a method for detecting abnormal activities and unveiling malicious sources.

III. DESIGN OF MONITORING SYSTEM

A. Architecture

Fig. 1 shows the design of our monitoring system. Our target is to monitor intrusions and malicious activities in each LAN. We collect the data files delivered from the monitoring nodes and notify the network administrator after processing on the server.

Each node is deployed in each LAN. It is noteworthy that in each LAN we monitor the traffic captured at our monitoring node; we do not sniff the packets of the whole network. We connect our node to a normal port, not an advanced one (mirroring port), without any requirement for or modification of the switch configuration, which enables large-scale deployment of our nodes. Each node is equipped with honeypots for reaction towards attackers. The honeypot logs are sent to the server together with the pcap file periodically.

The server collects the data files of all nodes and processes them to detect suspicious activities related to ARP scan, TCP port scan, application-level connection establishment and intrusion. The server generates reports and notifies the network administrator of abnormalities.

B. Design of monitoring node

We configure the tcpdump command to record all traffic related to our node. ARP scan and TCP port scan detection are processed based on the recorded pcap file. From the ARP requests listed in the pcap file, we can tell whether an ARP scan is happening, and from the TCP SYNs listed, we can detect a TCP port scan in the LAN.

Honeypots play a role in verifying real intrusions in our system. Normally, when a malicious host detects an existing host by getting its IP address with an ARP scan, it directly sends TCP SYNs to potentially open ports. If a port is closed and the related service is not in use, a TCP RST-ACK is returned. Obviously, for our monitoring node we do not want the connection to fail so easily, in which case we cannot get more information about the intruding host. We want the host to complete the three-way handshake and interact with our node. With this consideration, we install honeypots on the node. Specifically, we use two honeypots for reaction, because a single honeypot covers only a limited number of protocols and we want to monitor more services that might suffer from attacks.
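As a concrete illustration of the pcap-based ARP scan and TCP SYN detection described above, the following is a minimal sketch, not the authors' implementation, that counts distinct ARP-request targets and bare TCP SYNs per source host from a capture file. It assumes the scapy library, and ARP_TARGET_THRESHOLD is a hypothetical tuning parameter.

```python
# Minimal sketch of pcap-based ARP-scan / TCP-SYN counting.
# Assumes scapy is installed; the threshold is a hypothetical parameter.
from collections import defaultdict
from scapy.all import rdpcap, ARP, TCP, IP

ARP_TARGET_THRESHOLD = 50  # hypothetical: distinct targets that suggest a scan

def summarize_pcap(path):
    arp_targets = defaultdict(set)   # source IP -> set of queried target IPs
    syn_counts = defaultdict(int)    # source IP -> number of bare TCP SYNs
    for pkt in rdpcap(path):
        if ARP in pkt and pkt[ARP].op == 1:          # "who-has" ARP request
            arp_targets[pkt[ARP].psrc].add(pkt[ARP].pdst)
        elif IP in pkt and TCP in pkt:
            flags = int(pkt[TCP].flags)
            if flags & 0x02 and not flags & 0x10:    # SYN set, ACK not set
                syn_counts[pkt[IP].src] += 1
    arp_scanners = {ip for ip, targets in arp_targets.items()
                    if len(targets) >= ARP_TARGET_THRESHOLD}
    return arp_scanners, syn_counts
```

In the paper, the ARP-scan decision is actually made with the degree-trend fitting of [8] rather than a fixed threshold; the cutoff here only stands in for that step.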

For data collection, the node connects to the server with SSH and uses the rsync command to deliver the pcap file and the honeypot logs. As our nodes are installed in real working LAN environments, heavy traffic is likely to produce a large number of packets saved in the pcap and logs. This, in turn, puts a heavier burden on the network because of the transmission of large files. To solve this problem, we limit the transmission rate and utilize idle time (usually at night) for communication.

C. Processing and Reporting

We set up a server for collecting data from the LANs and processing it for reporting to the network administrator. Our server has two interfaces: 1) data collection, for receiving pcap files and honeypot logs from all nodes in their LANs over SSH, and 2) reporting, for notifying the administrator by email of the activities that we consider malicious. The main function of our server is to detect abnormal activity in each LAN, including ARP scan, TCP port scan, application-level connection establishment and intrusion, based on the relationship among these activities.

Specifically, we classify hosts with abnormal activity into three types.

1) Host with ARP scan. This kind of host makes a sudden ARP scan of the other hosts in the LAN to get their MAC addresses by broadcasting for their IP addresses, but it has not shown any TCP-related behavior so far. Such a host can be malicious, but the scan can also be intentional, as vulnerability tests often include an ARP scan after network configuration. Let A be the set of source IP addresses that make ARP scans in the LAN.

2) Host with only TCP port scan and ARP scan. Among the hosts that make ARP scans, there is a subset that sends TCP SYNs to ports on the detected hosts. Let T be the set of source IP addresses that send TCP SYNs to the monitoring node, and TA be the set of IP addresses that send TCP SYNs after an ARP scan:

TA = A ∧ T

This kind of host can also be innocent, as there are cases when vulnerability testers make TCP port scans with tools like nmap, after a modification of the network configuration or even periodically.

3) Host with application-level connection establishment or intrusion after an ARP scan. These hosts should be considered malicious, because after ARP-scanning the hosts in the LAN and finding ones that exist, the scanning host makes connections to the discovered hosts with several kinds of behavior. The behaviors include, but are not limited to, brute-force connections, trying usernames and passwords to log in, executing commands or scripts, and downloading malicious files onto the discovered host. Let H be the set of hosts making connections and trapped by the honeypots; we define this kind of host as:

HA = A ∧ H

Here we describe our method for detecting the different behaviors on each node. Algorithm 1 shows our method of setting up a dictionary for honeypot events. With this structure, we can access the sessions created from a source IP in O(1) time complexity.

Algorithm 1 HONEYPOT_LOG_PROCESSING(log)
1: HP_dict = {}
2: for event in log do
3:   HP_dict[event.src_ip][session_id].add(event)
4: end for
5: return HP_dict

Algorithm 2 DATA_FILE_HANDLER(pcap, log)
1: A = DETECT_ARP_SCAN(pcap)
2: REPORT_ARP_SCAN(A)
3: TA, TA_packets = {}
4: HA, HA_sessions = {}
5: for p in pcap do
6:   if p.TCP.SYN and !p.TCP.ACK and p.ip.src_ip in A then
7:     TA.add(p.ip.src_ip)
8:     TA_packets[p.ip.src_ip].add(p)
9:   end if
10: end for
11: REPORT_TCP_PORT_SCAN(TA, TA_packets)
12: HP_dict = HONEYPOT_LOG_PROCESSING(log)
13: for IP in TA do
14:   if IP in HP_dict then
15:     HA.add(IP)
16:     HA_sessions[IP].add(HP_dict[IP])
17:   end if
18: end for
19: REPORT_TCP_INTRUSION(HA, HA_sessions)

Algorithm 2 shows our main process of classification and reporting. A is detected by fitting a model of the degree of the ARP request trend [8]. The degree of ARP requests is the number of target hosts to which ARP requests are sent by a source host. By storing historical degree values in a database and applying a fitting algorithm, a sudden increase of the degree, which indicates an abnormal ARP scan, can be detected. We define p as a packet in the pcap, and from the TCP SYN packets we filter the hosts with TCP port scans. From the honeypot log dictionary we filter the hosts making connections to the target host. We report these behaviors separately.
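The two routines above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: honeypot events are taken to be dicts with "src_ip" and "session_id" keys, and the ARP-scan detector and report callbacks are supplied elsewhere.

```python
# Sketch of Algorithm 1 and the classification part of Algorithm 2.
# Assumes events are dicts with "src_ip" and "session_id" keys and that
# detect_arp_scan() and the report_* callbacks are provided elsewhere.
from collections import defaultdict

def honeypot_log_processing(log):
    # Algorithm 1: index honeypot events by source IP and session id.
    hp_dict = defaultdict(lambda: defaultdict(list))
    for event in log:
        hp_dict[event["src_ip"]][event["session_id"]].append(event)
    return hp_dict

def data_file_handler(pcap, log, arp_sources, report_scan, report_intrusion):
    # Algorithm 2 (classification part): arp_sources is the set A.
    ta, ta_packets = set(), defaultdict(list)
    for p in pcap:                        # p: parsed packet with .syn/.ack/.src
        if p.syn and not p.ack and p.src in arp_sources:
            ta.add(p.src)
            ta_packets[p.src].append(p)
    report_scan(ta, ta_packets)           # hosts doing TCP port scan after ARP scan

    hp_dict = honeypot_log_processing(log)
    ha = {ip for ip in ta if ip in hp_dict}   # hosts that also touched a honeypot
    report_intrusion(ha, {ip: hp_dict[ip] for ip in ha})
```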
IV. DEMONSTRATION

A. Experiment Settings

For testing, we used a Raspberry Pi 3 as our monitoring node platform. For honeypots, we adopted Cowrie [9] and Dionaea [10], which cover ssh, telnet, smb, mysql, ftp, http and other services.

We deployed one monitoring node (n001) on a LAN in our laboratory. Our server was also set up in our lab. From another host, we conducted two types of vulnerability tests using nmap. We also simulated two types of real attacks that try to intrude into the monitoring node. We checked whether our algorithm could differentiate these events appropriately. Our node was set to the static IP address 172.16.1.231, while our attacker was set to 172.16.1.208.

Case 1: ARP Scan. We made an ARP scan of 256 IP addresses within 172.16.1.0/24 with the nmap command:
nmap -Pn 172.16.1.0/24

Case 2: TCP Port Scan. After the ARP scan, we made a TCP port scan of the IP address of the monitoring node by:
nmap -sT 172.16.1.231

Case 3: Application-Layer Connection Establishment. We used an SMB client to connect to the SMB service run on the monitoring node by the Dionaea honeypot, and closed the connection. We repeated this several times. This simulates brute-force attacks made by malware.

Case 4: Downloading a File. We logged in to telnet running on the node with a guessed username and password, and after a successful login we executed commands to generate a file on the honeypot's platform.

B. Case 1 and 2: Vulnerability Tests

For Case 1, the system generated the following report.

Detected ARP scan from IP: 172.16.1.208 (MAC: a0:99:9b:1a:30:7f) on n001. It scans 256 IP addresses.
...
2019-04-15 16:38:40.175681 Who has 172.16.1.1 tell 172.16.1.208
2019-04-15 16:38:40.175683 Who has 172.16.1.2 tell 172.16.1.208
2019-04-15 16:38:40.177130 Who has 172.16.1.3 tell 172.16.1.208
2019-04-15 16:38:40.177133 Who has 172.16.1.4 tell 172.16.1.208
2019-04-15 16:38:40.177137 Who has 172.16.1.5 tell 172.16.1.208
2019-04-15 16:38:40.177139 Who has 172.16.1.6 tell 172.16.1.208
2019-04-15 16:38:40.177142 Who has 172.16.1.7 tell 172.16.1.208
...

This report indicates that there was a potential activity for discovering all the hosts in the network.

For Case 2, the system generated the following report.

Detected 1008 TCP SYN attacks from IP: 172.16.1.208 (MAC: a0:99:9b:1a:30:7f) during and after the ARP scan.
...
2019-04-15 16:41:22.088210 172.16.1.208:40668 -> 172.16.1.231:135
2019-04-15 16:41:22.088470 172.16.1.208:41788 -> 172.16.1.231:1720
2019-04-15 16:41:22.088810 172.16.1.208:38802 -> 172.16.1.231:23
2019-04-15 16:41:22.089080 172.16.1.208:39808 -> 172.16.1.231:443
2019-04-15 16:41:22.089320 172.16.1.208:60796 -> 172.16.1.231:1025
2019-04-15 16:41:22.089565 172.16.1.208:43746 -> 172.16.1.231:139
2019-04-15 16:41:22.089835 172.16.1.208:52368 -> 172.16.1.231:8080
2019-04-15 16:41:22.090004 172.16.1.208:55746 -> 172.16.1.231:587
2019-04-15 16:41:22.090156 172.16.1.208:55028 -> 172.16.1.231:5900
2019-04-15 16:41:22.090400 172.16.1.208:35452 -> 172.16.1.231:1723
2019-04-15 16:41:22.090593 172.16.1.208:58040 -> 172.16.1.231:21
2019-04-15 16:41:22.090845 172.16.1.208:39818 -> 172.16.1.231:445
...

These results indicate that only TCP scan activities were observed after the ARP scan, which were potentially just scans similar to nmap, a vulnerability tester.

C. Case 3 and 4: Malware Attacks

For Case 3, the system generated the following report.

Timestamp / Dionaea Status / Source IP / Protocol
2019-04-15 17:04:25.112226 connection.tcp.accept 172.16.1.208 smbd
2019-04-15 17:04:27.774802 connection.free 172.16.1.208 smbd
2019-04-15 17:04:28.456049 connection.tcp.accept 172.16.1.208 smbd
2019-04-15 17:04:30.581191 connection.free 172.16.1.208 smbd
2019-04-15 17:04:31.068269 connection.tcp.accept 172.16.1.208 smbd
2019-04-15 17:04:33.179649 connection.free 172.16.1.208 smbd
2019-04-15 17:04:33.694936 connection.tcp.accept 172.16.1.208 smbd
2019-04-15 17:04:36.424633 connection.free 172.16.1.208 smbd

This result shows that the host established connections to the monitoring node with the SMB protocol, which could potentially be a malware attack or an enhanced network scan [2].

For Case 4, the system generated the following report.

Timestamp / Cowrie Status / Source IP / Protocol / Details
2019-04-15 17:05:48.405094 session.connect 172.16.1.208 telnet
2019-04-15 17:05:57.662858 login.success 172.16.1.208 telnet user/pwd: "root/123"
2019-04-15 17:06:19.874716 command.input 172.16.1.208 telnet cmd: "echo a >> t.txt"
2019-04-15 17:06:22.686089 file_download 172.16.1.208 telnet url: "/home/root/t.txt"
2019-04-15 17:06:22.700734 log.closed 172.16.1.208 telnet
2019-04-15 17:06:22.714036 session.closed 172.16.1.208 telnet

This result shows that the malware intruded into the honeypot area of the monitoring node and downloaded a file onto the platform. This means that the malicious host really contained real malware and attacked the available nodes in the LAN.

V. DISCUSSION

In this paper, we have studied the basic design and functions of detecting vulnerability tests and malware attacks separately using honeypots, and we have shown that our proposed system can recognize these differences. In the real field, honeypots will interact with real malware and will sometimes generate too many logs or consume the disk space. Thus, a real implementation of honeypots should cover such issues along with resilience to malware attacks: i.e., the honeypot itself should not be infected. After careful testing of the monitoring node, it can be deployed into the real field, which was beyond the scope of this paper.

The importance of distinguishing vulnerability tests from malicious activities should not be ignored, because tests of network status, host existence and port status are not always conducted by administrators. Each host with security tools or software can potentially conduct a test to check whether the current LAN is secure enough for it to connect to, which has been proved by our discovery of several security tools that make ARP scans and check TCP ports periodically.

Within the scope of this paper, we do not consider IP spoofing, but it can really happen in real networks. There are also cases in which malicious hosts continuously change their IP addresses. With the expansion of the experiment range, we will include these considerations in our future work.

VI. CONCLUSION

We have proposed an architecture for a cloud-based LAN-security monitoring system that can differentiate vulnerability tests and malware attacks happening in remote local area networks. We presented the design of a monitoring node with honeypots installed. We designed the algorithm, run on the collection server, to recognize abnormal activities such as ARP scan, TCP port scan, application-level connection establishment and intrusion. We demonstrated that our system can recognize those behaviors, pointing out that the latter two activities are mostly caused by malware, not by vulnerability testing.

REFERENCES
[1] G. F. Lyon, Nmap network scanning: The official Nmap project guide to network discovery and security scanning. Insecure, 2009.
[2] S. Limjitti, H. Ochiai, H. Esaki, and K. Sripanidkulchai, "IoT-Vulock: Locking IoT device vulnerability with enhanced network scans," in the 13th International Conference on Ubiquitous Information Management and Communication (IMCOM 2019), 2019.
[3] M. Baykara and R. Das, "A survey on potential applications of honeypot technology in intrusion detection systems," International Journal of Computer Networks and Applications (IJCNA), vol. 2, no. 5, pp. 203-208, 2015.
[4] M. Nawrocki, M. Wählisch, T. C. Schmidt, C. Keil, and J. Schönfelder, "A survey on honeypot software and data analysis," arXiv preprint arXiv:1608.06249, 2016.
[5] L. Li, H. Sun, and Z. Zhang, "The research and design of honeypot system applied in the LAN security," in 2011 IEEE 2nd International Conference on Software Engineering and Service Science. IEEE, 2011, pp. 360-363.
[6] L. Metongnon and R. Sadre, "Beyond telnet: Prevalence of IoT protocols in telescope and honeypot measurements," in Proceedings of the 2018 Workshop on Traffic Measurements for Cybersecurity. ACM, 2018, pp. 21-26.
[7] M. Skrzewski, "Monitoring malware activity on the LAN network," in International Conference on Computer Networks. Springer, 2010, pp. 253-262.
[8] M. Kai, S. Kobayashi, H. Esaki, and H. Ochiai, "ARP request trend fitting for detecting malicious activity in LAN," in the 13th International Conference on Ubiquitous Information Management and Communication (IMCOM 2019), 2019.
[9] M. Oosterhof, "Cowrie honeypot," Security Intelligence, 2014.
[10] Dionaea, "Documentation release 0.8.0," 2018.

Very Short-Term Solar Power Forecasting Using Genetic Algorithm Based Deep Neural Network

Sukrit Jaidee, Wanchalerm Pora
Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand
[email protected], [email protected]

Abstract—This paper presents a method for finding optimal parameters of a deep learning model by Genetic Algorithm (GA). The model is employed to forecast the output of a solar farm 4 hours in advance. Its inputs are both forecasted weather data and data obtained from weather monitoring instruments. The performance of four NN (Neural Network) types is compared: DNN (Deep Neural Network), LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit), and CuDNNGRU (CUDA Deep Neural Network Gated Recurrent Unit). Feature engineering by Exponential Moving Average (EMA) finds that a time series of irradiance helps to improve model performance. GA is exploited to find the most appropriate number of lookback steps (the window size) and the number of neurons in each of the three hidden layers. The GRU model yields the least RMSE at 7.83%. However, if the training time is to be considered, the CuDNNGRU model yields a slightly higher RMSE at 7.87%, but its training time is less than half of that of the GRU.

Keywords—Deep Neural Networks, Genetic Algorithm, Recurrent Neural Network, Numerical Weather Prediction, Solar Power Forecast

I. INTRODUCTION

Nowadays, solar power generation has increased significantly compared to other types of renewable energy. It is used to replace the original forms of generation, such as power generation from coal, which are unclean energy. On the other hand, solar power generation is not reliable due to natural uncertainty, so the reliability and stability of the electrical system can be degraded. Therefore, solar power forecasting is a major challenge in the integration of this volatile renewable energy into the grid system of Thailand in greater quantities. The higher the volume of integration, the larger the fluctuations in the system. If this volatile energy is not properly managed, it will cause problems for the reservation of power supply and frequency regulation. In order to deal with these problems, very accurate forecasting far in advance will allow us to prepare enough power, keep up with the demand at that time, and allow the system to maintain frequency stability.

The accuracy of a forecasting model depends on many factors, such as choosing the right algorithm, selecting the input variables and tuning the parameters of the model. Finding the most suitable values of these factors is a real challenge, as each problem has different suitable values and many tests must be conducted to get appropriate values. Generally, the structure of a Deep Neural Network is designed by experts through trial and error, which takes a long time to obtain accurate models. Nowadays, Graphics Processing Unit (GPU) performance has increased considerably, resulting in a wide use of Reinforcement Learning and Evolutionary Algorithms to automate the search for the best model structure. The Genetic Algorithm (GA) is used to select Artificial Neural Network (ANN) input data. It is also used to find the appropriate structure of an ANN: the number of hidden layers and the number of neurons in each hidden layer, or the connection weights between the neurons, which helps to obtain the most appropriate parameters without trial and error [1], [2].

Solar power forecasting generally uses a machine learning model to convert weather data into power through a training model [3]. But as solar power sources are located in many different areas and have different capacities, small generating sources typically do not have weather monitoring devices, as it is not worth the budget to install them. These sources, therefore, have no inputs for the model. Consequently, Numerical Weather Prediction (NWP), which provides weather forecasts of various variables as inputs to the model instead of the actual measurements, is introduced. NWP can provide weather forecast values from one hour to several days ahead with moderate precision and can be used with the actual measurements to increase the accuracy of forecasting. From the previous literature review, it was found that two types of NWP have been used together with an ANN to forecast the solar power of power plants [4]. Most forecasting methods use historical data or variables related to forecasted weather data. Photovoltaic (PV) power is also used directly in training models [5], or solar power is forecast from irradiance by using downscaling models [6].

This paper aims to increase the accuracy of solar power forecasting by automatically evaluating various parameters of the deep learning model (as shown in Figure 1) in order to find the number of lookback hours and the most suitable number of neurons in each hidden layer of the model with GA, using weather forecasting data from Numerical Weather Prediction (NWP), weather data measured from measuring devices, and data obtained from feature engineering processes.

II. OPTIMAL PARAMETERS OF DEEP LEARNING MODELS WITH GENETIC ALGORITHMS

A. Genetic Algorithm (GA)

GA is a problem-solving method using Stochastic Optimization Techniques with Heuristic Search, which uses already-known information to help find solutions. It is widely used to find near-optimal solutions for problems that require the most suitable solutions in a large parameter space [7]. The GA process is divided into two parts: 1) determine the Chromosome Encoding that acts as the solution or parameters to search; 2) determine the Fitness Function used to test whether the solutions obtained from GA are appropriate to be used as next-generation solutions. The Fitness Function is the Root Mean Square Error (RMSE) of the model. In this study, a Binary Array, whose bits of 0 or 1 represent the desired parameters, is used. The implementation of GA consists of three parts: 1) Selection: a comparison of the solutions in the space to choose good solutions and create the next generation of solutions while eliminating bad solutions; for selection in this study we use the Roulette Wheel Selection method. 2) Crossover: the creation of new solutions from existing solutions by combining the parents' solutions, because when there are various solutions there is an opportunity to choose a better, right solution. 3) Mutation: a method of increasing the variety of solutions after the crossover process by creating new solutions using random swapping or turning off solution bits, such as Binary Mutation. Creating solutions in different ways helps prevent getting stuck in a local optimal solution. Roulette Wheel Selection uses the principle that a solution with better suitability has a greater chance of being chosen, and random swapping is a change of value at a randomly selected position.

B. Genetic Algorithm Based DNN (GA-DNN)

Deep Neural Network (DNN) is a general model that is applied in various applications such as image recognition. However, this model comes with a restriction. Because it has no memory unit, it cannot remember the previous state to model a data sequence. This limitation is a major problem for sequence data such as text and time series. Moreover, with the same amount of time series data, a DNN cannot learn from the data as much as an RNN, despite being a deeper model, because having no memory unit results in a fixed-size input, whereas an RNN can adjust the amount of learning in each input cycle and can therefore extract more from the data.

In order to apply a DNN with time connections in the same way as RNN, LSTM and GRU, the training data set is normally created with the most recent n time steps as input and the (n+1)th time step as the target/output. For an efficient time series model, time features such as hour-of-day and day-of-year, as well as depth and complexity, need to be added. However, for short time series (less than 100 time steps, for instance), a model of intricate relationships through time does not need to be created, because a DNN can be as efficient as an RNN. Nonetheless, this also depends on the network architecture, training time, and initialization of the weights and parameters. Thus, the application of the Genetic Algorithm (GA) can help find the optimum model structure for the most efficient time series model. The GA-DNN structure derived from GA, shown in Table IV, has 8 nodes in the output layer because in this study we need to forecast 8 time steps, where each step is on a half-hour scale.
In conclusion, in terms of speed, a DNN can be trained faster than RNN, LSTM and GRU because having no memory unit results in fewer parameters, while in terms of efficiency, the DNN yields worse results. Theoretically, RNNs, including GRU and LSTM, can efficiently process sequence data, whereas the DNN is suitable for other tasks.

Fig. 1. Proposed approach.

C. Recurrent Neural Network (RNN)

RNN is a model that is suitable for time series data. There is an internal hidden state that holds state information. The previous hidden state is used to calculate the current hidden state, and the current hidden state is used in the calculation of the data in the next period, with equations (1) and (2):

h_t = σ(W_x x_t + W_h h_{t-1})   (1)
y_t = σ(W_y h_t)                 (2)

where h_t is the new state, h_{t-1} is the old state, t is the time, x_t is the input vector whose size corresponds to the lookback value from GA (that is, each input is a vector containing the values of the previous hours according to the lookback), y_t is the output vector, with t = 1, 2, ..., 8 being the forecast values at each time step, σ is the activation function, and W_x, W_h, W_y are the weight matrices for calculating the input, the previous hidden state and the present hidden state, respectively. RNN is suitable for time series data without very long time dependencies; if the distance is very large, the Vanishing Gradient problem will occur.
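As an illustration of equations (1)-(2), here is a minimal NumPy sketch of a single recurrent step. It is not the paper's code; the sizes and the choice of tanh as the activation are assumptions for illustration.

```python
# Minimal sketch of the RNN recurrence in equations (1)-(2).
# Shapes and the tanh activation are illustrative assumptions.
import numpy as np

lookback, hidden, outputs = 8, 16, 8          # assumed sizes
rng = np.random.default_rng(0)
W_x = rng.normal(size=(hidden, lookback))     # input-to-hidden weights
W_h = rng.normal(size=(hidden, hidden))       # hidden-to-hidden weights
W_y = rng.normal(size=(outputs, hidden))      # hidden-to-output weights

def rnn_step(x_t, h_prev):
    """One step: h_t = sigma(W_x x_t + W_h h_{t-1}), y_t = sigma(W_y h_t)."""
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev)   # eq. (1)
    y_t = np.tanh(W_y @ h_t)                  # eq. (2)
    return h_t, y_t

h = np.zeros(hidden)
for x_t in rng.normal(size=(24, lookback)):   # dummy 24-step input sequence
    h, y = rnn_step(x_t, h)
```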

D. Genetic Algorithm Based GRU (GA-GRU)

The GA-GRU structure derived from GA is shown in Table IV. GRU [8] is a structure developed from LSTM to reduce the complexity of the structure by adjusting the gates of the LSTM into a reset gate, which serves to assess how much of the previous state information should be combined with the current input data, and an update gate, which is responsible for assessing how much of the previous state data should be kept, with equations (3)-(5):

r_t = σ(W_r x_t + U_r h_{t-1})   (3)
z_t = σ(W_z x_t + U_z h_{t-1})   (4)
h_t = z_t ∘ tanh(W x_t + U (r_t ∘ h_{t-1})) + (1 − z_t) ∘ h_{t-1}   (5)

where r_t is the reset gate, z_t is the update gate, h_{t-1} is the old state, σ is the activation function, W and U are the weight matrices, and t is the time.

For time series data, RNN is more efficient than DNN, as stated in Section B. So, in this section, the advantages of GRU over LSTM are specified as follows. 1) GRU's computation is more efficient than LSTM's because it does not require an internal memory unit, giving a less complicated structure. GRU has two gates (reset and update gates), whereas LSTM has three gates (namely input, output and forget gates). The input and forget gates are coupled into an update gate, and the reset gate is applied directly to the previous hidden state; thus, the responsibility of the reset gate in LSTM is really split up into both r and z. 2) GRU can be trained faster than LSTM. It also requires less data to generalize because it has fewer parameters (U and W are smaller), resulting in easier computation, since there are only two gates. However, LSTM can yield better results if the amount of data is above some level. 3) GRU exposes the complete memory, unlike LSTM. In other words, GRU exposes the full hidden content without control, so for applications in which that acts as an advantage, it might be helpful.
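The gating in equations (3)-(5) can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the standard GRU cell of Cho et al. [8], not the authors' implementation, and the sizes are assumptions.

```python
# Sketch of one GRU step following equations (3)-(5); assumed sizes.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_in, n_h = 8, 16                                      # assumed input / hidden sizes
rng = np.random.default_rng(1)
W_r, U_r = rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h))
W_z, U_z = rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h))
W_h, U_h = rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h))

def gru_step(x_t, h_prev):
    r = sigmoid(W_r @ x_t + U_r @ h_prev)              # eq. (3): reset gate
    z = sigmoid(W_z @ x_t + U_z @ h_prev)              # eq. (4): update gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev))  # candidate state
    return z * h_tilde + (1.0 - z) * h_prev            # eq. (5): new state
```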
E. Genetic Algorithm Based LSTM (GA-LSTM)

The GA-LSTM structure derived from GA is shown in Table IV. The LSTM model [9] was developed to reduce the Vanishing Gradient problem that occurs in RNN. LSTM uses a hidden state and a cell state to store data and send it to be processed in the next period, where the various gates evaluate how much data will be stored in the hidden state and the cell state. Those gates consist of the input gate, the output gate, and the forget gate. Each gate is responsible for deciding whether information can go through or not, by considering the importance of the data; data with very little value will not be able to pass through the gate. This helps reduce the occurrence of the Vanishing Gradient, with equations (6)-(10):

i_t = σ(W_i x_t + U_i h_{t-1})   (6)
f_t = σ(W_f x_t + U_f h_{t-1})   (7)
o_t = σ(W_o x_t + U_o h_{t-1})   (8)
c_t = (f_t ∘ c_{t-1}) + i_t ∘ tanh(W_c x_t + U_c h_{t-1})   (9)
h_t = o_t ∘ tanh(c_t)   (10)

where i_t is the input gate, f_t is the forget gate, o_t is the output gate, c_t is the cell state, h_{t-1} is the old state, σ is the activation function, W and U are the weight matrices, and t is the time. The strength of LSTM is that it can tell when to write, forget or allow reading.

The advantage of LSTM over GRU is stated here. For work that needs a model that can process sequence data with long-distance relations (more look-back on the fed data), LSTM is more efficient than GRU, because LSTM is designed with the ability to remember sequence data with long-distance relations, in which it is far better than GRU. However, that causes a complicated structure (a large number of parameters), resulting in a longer training time for LSTM and a great deal of memory bandwidth usage for computation. Free platforms, namely Google Colab and Kaggle, and general computers are still restricted by memory bandwidth, and long-distance relations require a large amount of data for the model to be trained.

F. Genetic Algorithm Based CuDNNGRU (GA-CuDNNGRU)

The GA-CuDNNGRU structure obtained from GA is shown in Table IV. CuDNNGRU is based on the deep learning library developed by NVIDIA, which speeds up computation on the GPU; it is generally referred to as a fast GRU.

G. Feature Engineering with Exponential Moving Average (EMA)

In order to improve the accuracy of forecasting, we add the EMA (Exponential Moving Average) of irradiance as a variable, because EMA gives more weight to data near the current time and a lower weight to data that is far away. It is a Weighted Moving Average (WMA) whose weights decay exponentially, as in the equation below:

EMA(x)_t = x_t,                                  t = 1
EMA(x)_t = α · x_t + (1 − α) · EMA(x)_{t-1},     t > 1      (11)

where α = 0.9. The trend of irradiance in the next few hours is related to the trend of irradiance in the previous few hours, which is the reason for adding this variable as one of the inputs of this study.
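A minimal pandas sketch of the EMA feature of equation (11) is shown below, assuming α = 0.9; it is an assumption-based illustration rather than the authors' pipeline, and the example series values are hypothetical.

```python
# Sketch of the EMA irradiance feature of equation (11), with alpha = 0.9.
import pandas as pd

ALPHA = 0.9

def ema_feature(irradiance: pd.Series) -> pd.Series:
    # With adjust=False, pandas ewm uses exactly the recursion of eq. (11):
    # ema_1 = x_1, ema_t = alpha * x_t + (1 - alpha) * ema_{t-1}.
    return irradiance.ewm(alpha=ALPHA, adjust=False).mean()

# Hypothetical hourly irradiance values (W/m^2) for illustration only.
hourly_irradiance = pd.Series([310.0, 420.0, 515.0, 600.0, 570.0])
print(ema_feature(hourly_irradiance))
```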

H. Numerical Weather Prediction (NWP)

A numerical forecasting model provides various weather forecast data, such as temperature, air pressure, and short- and long-wave radiation. In this study, the NWP model is used as another tool for forecasting, as it gives forecast values in the form of grid points throughout the configured area, in which each grid point is referenced by latitude and longitude. The NWP model is widely used in forecasts that need weather variables as inputs. The NWP model is also used in solar power forecasting by using the weather parameters of the four points that are closest to the position that needs to be forecasted [10], meaning that four sets of data from the four nearest points are used as features for the model. The values obtained from the NWP model at the four nearest points are interpolated to get an estimated value for the area that needs to be predicted, and this estimated value is used for the forecast; alternatively, the forecast data from each of the 4 nearest points are used independently to perform an ensemble, a technique that uses many models to help find solutions [10]. There is also forecasting from only the single nearest point, using the distance between the point with forecasted weather data and the point of interest as a weight for each point; the nearest point has a higher weight than a point that is far away [11].

III. DATA DESCRIPTION

The actual measurement data from the equipment used in the model are from Chulalongkorn University (Chulalongkorn University Building Energy Management System; CUBEMS). CUBEMS is a system that pulls and collects data from measuring devices to manage power within the building. The information used in this paper is from January 1, 2017 to October 31, 2018, covering the whole day. The weather data obtained from CUBEMS consist of solar irradiance, temperature, relative humidity, UV index, and wind speed. All variables are retrieved and stored in the database every minute, but in the experiment we resampled the data to the hour scale by taking the average (mean), because the data from the NWP model are on the hour scale.

The NWP model used in this paper is the Weather Research and Forecasting (WRF) model. The inputs used with the WRF model come from the Global Forecast System (GFS), and the outputs of the model are given at 00 UTC and 12 UTC by the National Center for Environmental Prediction (NCEP). This model runs twice a day to predict weather forecasts 2 days in advance. Forecast values are available every hour. The data used in the training model are divided into 2 sets, namely the training set, with data from 1 January 2017 to 31 December 2017, and the testing set, with data from 1 January 2018 to 31 October 2018.

The performance indices used to evaluate the results of this study are the Root Mean Square Error (RMSE) and the Mean Bias Error (MBE), with the following equations:

RMSE = 100% × sqrt( (1/N) Σ_t ( P_forecast(t) − P_actual(t) )² ) / (installed capacity)   (12)
MBE = (1/N) Σ_t ( P_forecast(t) − P_actual(t) )

where P_forecast(t) is the forecast power at time t, P_actual(t) is the measured power, N is the number of samples, and the installed capacity of CUBEMS is 8 kW.

IV. EXPERIMENTS AND RESULTS

A. Experiments on GA-DNN

In this study, we developed the model on Keras [12], the Python deep learning library, because it supports recurrent networks and can also run on a Graphics Processing Unit (GPU). For GA, we use Distributed Evolutionary Algorithms in Python (DEAP), a Python package providing an evolutionary computation framework for rapid prototyping and the testing of ideas [13].

To create the GA-DNN model, the input data consist of 23 variables, covering values at time t and at time steps t+1 to t+8, where t is the time on the hour scale, EMA(t) is the EMA of irradiance, NWP(t+1) to NWP(t+8) are the weather variables obtained from the NWP model, and the remaining variables are the actual measurement values from the devices. In the process of training and assessment, the model is handled in three steps: 1) decode the GA solution, which is in the form of a Binary Array, into decimal values to obtain the lookback and the number of neurons in each hidden layer; 2) prepare the data set using the lookback obtained from GA and divide the data set into a training set and a validation set; 3) train the DNN model and calculate the RMSE value on the validation set, which becomes the fitness score of the GA solution in the current generation. The DEAP package is used to define the GA, with a Binary Array 28 bits long representing the solution. The initial values are randomly selected using a Bernoulli distribution [14], and the positions used in crossover, mutation, and roulette wheel selection are also randomized using the Bernoulli distribution. In terms of the GA parameter configuration in this study, the population size is set to 5, the number of generations is also 5, and the gene length is 28 genes. The GA parameters are set up based on trial and error, and the GA parameters of each problem will vary depending on the size and nature of the problem. In the final step, the structure of the appropriate DNN model is obtained, i.e., the structure that provides the least RMSE value. This model is then trained on the training set and tested on the testing set prepared at the beginning. In the training process, the batch size (the number of samples per mini-batch) is 32 and 50 epochs are used (the greatest number of times that the training algorithm cycles through all samples), with the Rectified Linear Unit (ReLU) as the activation function, Root Mean Square Propagation (RMSProp) [15] as the optimizer, and Mean Squared Error (MSE) as the loss function. Early stopping is used to find the best model, with min delta set to zero and patience set to 5 epochs; if the validation loss does not improve for 5 epochs, training stops.

The results of this model experiment gave an RMSE of 8.88% (Figure 2). The RMSE value tends to increase at the far-away time steps, and the largest prediction error is at 10.00 am each day, according to Figure 3.

B. Experiments on GA-GRU

In building the GA-GRU model, we used the same data and methods as in creating the GA-DNN model. The results of this model experiment gave an RMSE value of 7.83% (Figure 2). The RMSE values at each time step follow the same pattern as those of the DNN model, but the RMSE values for the times before noon are clearly lower than those of the DNN model. From Figure 3, we can see that the model has its largest prediction errors at 6.00 am and 1.00 pm each day: at 6 o'clock in the morning there is under-prediction, and at 1.00 pm over-prediction.

C. Experiments on GA-LSTM

In building the GA-LSTM model, the results from this model experiment gave an RMSE value of 7.92%. From Figure 4, it is evident that this model is accurate in the morning, especially from 7.00 am to 10.00 am; after that, there is over-prediction.

D. Experiments on GA-CuDNNGRU

In building the GA-CuDNNGRU model, the results from this model experiment gave an RMSE value of 7.87%.
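The GA-over-Keras search described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the split of the 28-bit chromosome into a lookback field and three hidden-layer sizes is hypothetical, dummy data stand in for the CUBEMS/NWP features, and roulette selection is applied to 1/RMSE so that DEAP's selRoulette sees a positive, maximised fitness.

```python
# Sketch of the GA hyperparameter search over a Keras DNN.
# Chromosome split, dummy data and cxpb/mutpb values are assumptions.
import random
import numpy as np
from deap import base, creator, tools, algorithms
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

N_FEATURES, N_OUTPUTS, GENE_LENGTH = 23, 8, 28

# Dummy data as a stand-in for the prepared training/validation sets.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(256, N_FEATURES)), rng.normal(size=(256, N_OUTPUTS))
X_val, y_val = rng.normal(size=(64, N_FEATURES)), rng.normal(size=(64, N_OUTPUTS))

def decode(bits):
    """Hypothetical decoding: 7 bits -> lookback, 3 x 7 bits -> hidden sizes."""
    def to_int(chunk):
        return int("".join(map(str, chunk)), 2) + 1          # avoid zero-sized layers
    lookback = to_int(bits[0:7])
    hidden = [to_int(bits[7 + 7 * i: 14 + 7 * i]) for i in range(3)]
    return lookback, hidden

def fitness(individual):
    _lookback, hidden = decode(individual)                   # lookback unused with dummy data
    model = Sequential([Dense(hidden[0], activation="relu", input_shape=(N_FEATURES,)),
                        Dense(hidden[1], activation="relu"),
                        Dense(hidden[2], activation="relu"),
                        Dense(N_OUTPUTS)])
    model.compile(optimizer="rmsprop", loss="mse")
    stop = EarlyStopping(monitor="val_loss", min_delta=0, patience=5)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              batch_size=32, epochs=50, verbose=0, callbacks=[stop])
    rmse = float(np.sqrt(model.evaluate(X_val, y_val, verbose=0)))
    return (1.0 / rmse,)                                      # maximise 1/RMSE

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)
toolbox = base.Toolbox()
toolbox.register("bit", random.randint, 0, 1)                 # Bernoulli(0.5) bits
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.bit, GENE_LENGTH)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", fitness)
toolbox.register("mate", tools.cxOnePoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.01)
toolbox.register("select", tools.selRoulette)

pop = toolbox.population(n=5)                                  # population 5, 5 generations
algorithms.eaSimple(pop, toolbox, cxpb=0.85, mutpb=0.1, ngen=5, verbose=False)
best = tools.selBest(pop, k=1)[0]
print("best lookback / hidden sizes:", decode(best))
```

Maximising 1/RMSE instead of minimising RMSE directly is a design choice forced by DEAP's roulette selection, which requires a positive fitness to be maximised.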

TABLE I. THE RMSE (%) OF EACH MODEL IN EACH TIME STEP

Model      t+1   t+2   t+3   t+4   t+5    t+6    t+7    t+8
DNN        4.60  5.11  5.28  6.13  10.05  11.41  11.75  12.2
GRU        1.70  1.53  1.53  2.56  8.69   10.73  11.58  12.10
LSTM       2.56  2.39  4.09  3.92  8.86   10.56  11.41  11.75
CuDNNGRU   1.70  2.04  1.53  1.87  8.69   10.90  11.75  12.27

TABLE II. THE RMSE (%) OF EACH MODEL IN EACH HOUR OF DAY

Model      h6    h7    h8    h9    h10   h11   h12
DNN        4.60  5.11  6.47  8.01  9.37  8.18  7.50
GRU        2.56  2.21  1.70  1.36  1.36  1.53  1.87
LSTM       3.24  3.24  3.41  3.41  3.58  3.58  3.92
CuDNNGRU   2.73  2.90  2.56  2.90  2.21  1.87  1.87

Model      h13   h14   h15   h16   h17   h18
DNN        6.64  5.45  1.87  2.21  2.21  2.56
GRU        2.21  1.70  1.53  1.19  0.34  0.34
LSTM       3.41  2.90  2.21  1.02  1.02  1.19
CuDNNGRU   2.21  1.70  1.87  1.19  1.70  2.04

TABLE III. THE MBE (%) OF EACH MODEL IN EACH HOUR OF DAY (hours 6-18)

TABLE IV. THE STRUCTURE OF EACH MODEL

Model      T   I   H1   H2   H3   O   RMSE (%)
DNN        33  23  70   116  130  8   8.88
GRU        41  23  72   61   251  8   7.83
LSTM       62  23  250  24   443  8   7.92
CuDNNGRU   35  23  46   30   15   8   7.87

where T is the lookback, I is the input layer, O is the output layer, and H1-H3 are the hidden layers.

TABLE V. THE TRAINING TIME OF EACH MODEL

Model        DNN  GRU  LSTM  CuDNNGRU
Time (sec)   42   126  603   51

V. SUMMARY AND DISCUSSION OF RESULTS

From the results of the experiments in the previous sections, it can be seen that the GRU model yields the best forecast in terms of accuracy, with an RMSE of 7.83%. The DNN model consumes the least training time, 42 seconds. Considering both the training time and the RMSE together, the CuDNNGRU model gives the best results, with an RMSE of 7.87% and a training time of 51 seconds. It is well known that the accuracy of a model depends on the quality and quantity of the data used in training. In this case, an appropriate setting of the GA configuration is also important. For example, the crossover probability should be between 0.6 and 0.95 (some research suggests 0.85-0.95 [16]), and the mutation probability should be between 0 and 0.01 per chromosome position, because if it is set too high, GA behaves similarly to random search. If the population size is too large, the training time becomes unacceptable (a population size of 10 times the number of features has been suggested [17]). These facts show that GA still has limitations; therefore, fine tuning is still required to improve the performance. Automatic tuning of GA is a good research topic.

Fig. 2. The RMSE of each model.

Fig. 3. The RMSE of each model in each time of day.

Fig. 4. The MBE of each model in each time of day.

Fig. 5. The actual value compared to the forecast value of each model.

Fig. 6. The RMSE percentage of each model.

ACKNOWLEDGMENT

The author would like to express gratitude to the Electricity Generating Authority of Thailand for granting a scholarship for the master's degree and research funding, and to thank the advisors who have provided advice and those who gave advice in this study.

REFERENCES
[1] Y. Tao and Y. Chen, "Distributed PV power forecasting using genetic algorithm based neural network approach," Proceedings of the 2014 International Conference on Advanced Mechatronic Systems, Kumamoto, 2014, pp. 557-560.
[2] Harendra Kumar Yadav, Yash Pal and Madan Mohan Tripathi, "A novel GA-ANFIS hybrid model for short-term solar PV power forecasting in Indian electricity market," Journal of Information and Optimization Sciences, 40:2, 377-395, 2019.
[3] Chow SKH, Lee EWM, Li DHW, "Short-term prediction of photovoltaic energy generation by intelligent approach," Energy Build. 2012; 55: 660-667.
[4] Fernandez-Jimenez LA, Muñoz-Jimenez A, Falces A, Mendoza-Villena M, Garcia-Garrido E, Lara-Santillan PM, Zorzano-Alba E, Zorzano-Santamaria PJ, "Short-term power forecasting system for photovoltaic plants," Renewable Energy 2012; 44: 311-317.
[5] Bacher P, Madsen H, Nielsen HA, "Online short-term solar power forecasting," Solar Energy 2009; 83(10): 1772-1783.
[6] Yona A, Senjyu T, Saber AY, Funabashi T, Sekine H, Kim CH, "Application of neural network to one-day-ahead 24 hours generating power forecasting of photovoltaic system," in Intelligent Systems Applications to Power Systems, 2007. ISAP 2007. International Conference on, 2008, pp. 1-6.
[7] Gen, M. and Cheng, R., "Genetic Algorithm and Engineering Design." John Wiley & Sons, New York, 1997.
[8] Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation," 2014.
[9] Greff, K., Srivastava, R.K., Koutnik, J., Steunebrink, B.R., Schmidhuber, J., "LSTM: a search space odyssey." IEEE Trans. Neural Netw. Learn. Syst. 28(10).
[10] Romain Juban, Patrick Quach, "Predicting daily incoming solar energy from weather data," Stanford University - CS229 Machine Learning.
[11] Jeff Patra, "Solar Energy Prediction Using Machine Learning," 2017.
[12] F. Chollet, "Keras: Deep learning library for theano and tensorflow." [Online]. Available: https://keras.io/, 2018.
[13] François-Michel De Rainville, Félix-Antoine Fortin, Marc-André Gardner, Marc Parizeau and Christian Gagné, "DEAP: A Python Framework for Evolutionary Algorithms," in EvoSoft Workshop, Companion proc. of the Genetic and Evolutionary Computation Conference (GECCO 2012), July 07-11, 2012.
[14] Colin Reeves, "Genetic Algorithms," Chapter 3.
[15] Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, Benjamin Recht, "The Marginal Value of Adaptive Gradient Methods," 2017.
[16] Zhong, Jinghui, Hu, Xiaomin, Zhang, Jun and Gu, Min, "Comparison of Performance between Different Selection Strategies on Simple Genetic Algorithms," 2005.
[17] James Bergstra, Yoshua Bengio, "Random Search for Hyper-Parameter Optimization," Journal of Machine Learning Research, 13(Feb): 281-305, 2012.

A Proposal of a Students' Voluntary Growth System of Generic Skills by Objective Evaluation Methods

Koji KAWASAKI, Hideyuki KOBAYASHI, Yoshikastsu KUBOTA, Kuniaki YAJIMA
Dept. of General Engineering, National Institute of Technology, Sendai College, Sendai, Miyagi, JAPAN
[email protected], [email protected], [email protected], [email protected]

Abstract—A continuous survey of students' Generic Skills (GSs) has been conducted since the academic year 2014 at the National Institute of Technology, Sendai College (Sendai KOSEN). Five years have passed since the survey started, and the survey from students' admission to their graduation at Sendai KOSEN has been completed. Our students' growth characteristics of GSs in the educational curriculum at Sendai KOSEN became evident from this five-year continuous survey. Based on the survey results, we propose a "students' voluntary growth system of GSs." Using this proposed method, students can be expected to enter a cycle of improving their own generic skills voluntarily and effectively.

Keywords—Quantitative evaluation of Generic Skills, Utilizations of PROG, Voluntary feedback by the students themselves

I. INTRODUCTION

Sendai KOSEN [1] is a National Institute of Technology located in Sendai, Miyagi, and consists of two campuses, Hirose and Natori. At Sendai KOSEN, high-quality engineering education is provided. At Hirose Campus, which consists of three departments (Information System, Information and Telecommunication System, and Intelligent and Electronics System), we especially provide high-quality and high-level education on information technology.

With the rapid development of ICT, the diversity and complexity of society have increased, and the changing speed of social infrastructure has become faster. Under such circumstances, in addition to the expertise and technical skills acquired at colleges and universities, it is important to nurture students with GSs, consisting of fundamental competencies and literacy skills, so that they can make good use of their expertise and skills. The reformation program of the educational environment at Sendai KOSEN [2] was adopted as an Acceleration Program for University Education Rebuilding (AP) [3] in 2014, and it was a driving force to tackle the nurturing and evaluation of students' GSs in addition to the education of information technology.

A continuous survey of students' GSs starting from the academic year 2014 has been conducted for 5 years. This means that the survey from students' admission to their graduation at Sendai KOSEN has been completed, and the results clarified our students' growth characteristics of GSs in the educational curriculum at Sendai KOSEN. In this paper, we first report the GSs growth characteristics of Hirose Campus students. For the utilization of the evaluation results of the GSs survey, furthermore, we propose the "students' voluntary growth system of GSs" as a method to encourage students to develop their GSs effectively.

II. OBJECTIVE EVALUATION OF GENERIC SKILLS

In order to quantify GSs, there are two representative methods: direct evaluation by teachers and students using rubrics, and indirect evaluation using external assessments. We have conducted a five-year continuous survey of our college students' Generic Skills using the Progress Report on Generic Skills (PROG) [4], which is an assessment of GSs. PROG is an objective test, so we can use it to compare the scores of an examinee with the average score of all examinees. This means that examinees can recognize their strong and weak points by comparing their scores to the average scores of their classmates at school (and to those of other universities as well). An outline of PROG is described below.

The PROG test was originally developed by KAWAI-JUKU [5], and the test consists of two parts: the Literacy part, which evaluates the examinee's ability to apply knowledge to solve new or inexperienced problems, and the Competency part, which evaluates the examinee's ability to cope with their surroundings, including decision-making and action-principle characteristics. The evaluation components of the PROG test were defined by reference to the key competencies determined by the DeSeCo project [6] of the OECD. The evaluation contents of the Literacy part were classified into six categories, and those of the Competency part were classified into three categories that consist of 9 contents and 33 components. (The evaluation items of PROG are shown in Figure 1.)

The questions of the Literacy part are similar to those of the Synthetic Personality Inventory (SPI) [7]. In the Literacy part, 30 questions are presented, on topics such as numerical reasoning, reading comprehension, and the understanding of figures and graphs. In the Competency part, on the other hand, 260 questions are given in a questionnaire format, to examine the characteristics of the examinee's behaviors. An example of a question in the Competency part is given below.
do you prioritize in the discussion at the meeting?” and the examinees have to choose their degree of priority from the options “My opinion 1 - 2 - 3 - 4 - 5 Group opinion”. The scores of the components in the Competency part are evaluated by comparing the answers of the examinees with statistically processed exemplary answers from many Japanese businesspersons who were classified into the high level. The scores of the PROG test are quantified with values from 1 to 7 (or 5, depending on the component), where larger numbers indicate better results.

Fig. 1. List of PROG evaluation components (Literacy part: six evaluation elements; Competency part: three main categories, nine medium contents and 33 small components).

III. RESULTS (GROWTH CHARACTERISTICS OF GSS)

Table I indicates the grades of the students who took the PROG test in each academic year. The continuous survey of students' GSs started from the academic year of 2014, and five years have passed since then. Now we can assess how the students' GSs change as their grade progresses at Sendai KOSEN, Hirose Campus.

TABLE I. THE GRADES OF STUDENTS WHO TOOK THE PROG TEST IN EACH YEAR

Fig. 2. Yearly changes of overall scores of the same students from their 1st years to 5th years in Literacy and Competency parts.

Figure 2 shows the yearly changes of the overall scores of the same students from their 1st year to 5th year in the Literacy and Competency parts. It is obvious from Fig. 2 that their abilities in both Literacy and Competency had steadily grown with the progress of their grade. However, the Competency scores did not increase in the one year from 2nd to 3rd grade, and the Literacy scores also did not increase in the one year from 3rd to 4th grade. On the other hand, Figure 3 shows the comparison of overall scores from 1st year students to 5th year students in the Literacy and Competency parts in 2018. For the Literacy part, there is a significant difference in scores between the 1st grade students and the 3rd grade students, while the scores of
2019 4th International Conference on Information Technology (InCIT2019) the 3rd, the 4th and the 5th grade students do not show much difference. For the Competency part, in contrast, it shows a significant difference in the scores of the upper grade students, while there is a little difference in the scores of the lower grade students. It was found out that the Literacy abilities of Hirose Campus students grew first until 3rd grade, and the Competency abilities of them grew after the 3rd grade. From the results of both the follow-up survey of the same students and the comparative survey in the same year, it was revealed that the education curriculum at Sendai KOSEN was leading the sufficient growth in both students’ literacy and competencies. However, on the other hand, it has become clear that at some grades the students’ did not show the growth for the abilities of the literacy or competency. Since we have clarified the growth characteristics of our students, our future works are the analysis of relation between contents of education and the GSs growth characteristics, improvement of the curriculum and lessons. When the good relations are observed and the improvement of curriculum and lessons are achieved, we will report the results at the future meeting of InCIT. Fig. 3. Comparison of overall scores from 1st year students to 5th year students in Literacy and Competency parts in 2018. Fig. 4. Our college’s learning-achievement-record (LAR) system 192
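The comparisons behind Figs. 2 and 3 are simple aggregations of PROG scores, once by cohort over time and once across grades within a single year. The following is a minimal sketch of such a summary, not the authors' analysis code; the table layout and column names are assumptions.

```python
import pandas as pd

# Hypothetical layout: one row per student per test sitting.
scores = pd.DataFrame([
    {"student_id": 1, "grade": 1, "year": 2014, "literacy": 3.1, "competency": 2.8},
    {"student_id": 1, "grade": 3, "year": 2016, "literacy": 4.0, "competency": 2.9},
    {"student_id": 2, "grade": 3, "year": 2018, "literacy": 3.8, "competency": 3.0},
    # ... remaining records of the five-year survey
])

# Follow-up view (as in Fig. 2): mean scores of one admission cohort per grade.
cohort = scores[scores["student_id"].isin([1])]      # students of one cohort
print(cohort.groupby("grade")[["literacy", "competency"]].mean())

# Cross-sectional view (as in Fig. 3): mean scores of every grade in one year.
print(scores[scores["year"] == 2018]
      .groupby("grade")[["literacy", "competency"]].mean())
```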
IV. HOW TO TAKE ADVANTAGE OF OBJECTIVE EVALUATION RESULTS OF GSS (A PROPOSAL OF A STUDENTS' VOLUNTARY GROWTH SYSTEM)

We propose a “students' voluntary growth system of GSs” as a way of utilizing the objective assessment method of GSs. The proposed system is a combination of our college's learning-achievement-record system and visualized PROG results.

First, we describe our college's learning-achievement-record (LAR) system shown in Figure 4. The LAR system is a system in which a student evaluates and records several evaluation items for 1) fundamental knowledge, 2) internationality and ethics, 3) problem solving ability, solution ability and engineering design ability, 4) communication ability and 5) abilities required in their department. Students can recognize how much they have grown by evaluating their own abilities at the beginning and the end of the academic year. In addition, students can clarify their goals in each grade by entering them into the system by themselves, and teachers can support the students' growth by commenting on their goals.

By adding an input function for PROG scores and a graph function for the yearly change of PROG scores to the LAR system, the students can easily recognize their growth because their growth characteristics are visualized as in Figure 5. Furthermore, by adding a function for comparing one's scores with those of other examinees, which is one of the advantages of the objective evaluation method (PROG), the students can compare their scores to the average value of their class, grade and other schools including universities. By visually recognizing their own growth characteristics and comparing their scores with the average value of their class and grade, they can clearly recognize their strong and weak points. It is otherwise difficult for students to accurately recognize their own GSs, because GSs reflect behavioral characteristics and intentions. Hence, it is very helpful that they can recognize the growth of their objectively evaluated GSs and notice their own strong and weak points. Students can recognize their abilities and strengths/weaknesses clearly, and can live their daily lives with enhancing their strengths or improving their weaknesses in mind. A conscious life is expected to foster GSs more effectively and efficiently than living unaware of them. By repeating the cycle of continuing such a conscious life for one year, taking the PROG test, recognizing the growth, and re-setting their goals for the next year, students will voluntarily improve their lives and realize the GSs growth cycle.

Furthermore, we are now trying to analyze the relation between the contents of our education and the GSs growth characteristics. If an obvious correlation between them is observed, we are planning to incorporate support functions into the proposed system for efficient growth of students' GSs, for example, lesson selection guide functions to develop a certain ability of the competencies.

This proposed “students' voluntary growth system of GSs” is in the specification design phase, and is being developed towards its implementation and operation.

Fig. 5. Image of a students' voluntary growth system of GSs: yearly changes of scores in each element of the Literacy and Competency parts are plotted against grade, together with the average score of the class and the average score of the grade.

V. CONCLUSION

A continuous survey of students' Generic Skills (GSs) has been conducted since the academic year of 2014 at Sendai KOSEN. Five years have passed since the survey started, and the survey from admission to graduation at Sendai KOSEN has been completed. Our students' growth characteristics of GSs in the educational content at Sendai KOSEN became clear from this five-year continuous survey. It is obvious from the follow-up survey of the same students from their 1st year to 5th year that their abilities in both Literacy and Competency are steadily growing with the progress of the grade. On the other hand, from 1st year students to 5th year students in the same year 2018, their abilities in Literacy have strongly grown in the two years from 1st to 3rd grade, while their abilities in Competency have not shown strong growth in the two years from 3rd to 5th grade.

Furthermore, we propose a “students' voluntary growth system of GSs” that utilizes the objective evaluation results of the survey. In the proposed system, students will voluntarily realize the GSs growth cycle by repeating the cycle of continuing a conscious life for one year, taking the PROG test, recognizing the growth, and re-setting their goals for the next year.

We will tackle the following points as our future tasks. For an improvement of the school curriculum, we will analyze in detail the students' growth characteristics and our educational contents, and will realize the improvement of the curriculum and classes. In addition, for feedback to the students, it is very important to develop and operate the proposed system as soon as possible. Furthermore, it is considered that there is an optimal distribution of GSs for each work content. Therefore, the system can be used for career support by analyzing the necessary generic skills for each work content and incorporating them into the proposed system.
2019 4th International Conference on Information Technology (InCIT2019) ACKNOWLEDGMENT [3] About Acceleration Program for University Education Rebuilding (AP):http://www.mext.go.jp/a_menu/koutou/kaikaku/ap/ (URL: In This survey was supported by the Acceleration Program Japanese) for University Education Rebuilding, Ministry of Education in Japan. We are deeply grateful to the principal, teachers and [4] About Progress Report on Generic Skills (PROG) test: https://www. support staff of our college for their contribution to this kawaijuku.jp/jp/research/prog/ (URL: In Japanese) project. In addition, we appreciate Mr. Kondo at RIASEC Inc. for his helpful cooperation in the analysis of data. [5] About KAWAI-JUKU: http://www.kawai-juku.ac.jp/ (URL: In Japanese) REFERENCES [6] About DeSeCo Project at OECD: http://www.oecd.org/education [1] About National Institude of Technology, Sendai college (Sendai /skills-beyond-school/definitionandselectionofcompetenciesdeseco. KOSEN), https://www.sendai-nct.ac.jp/english/ (URL) htm (URL) [2] A. Takahashi, Y. Kashiwaba, et al: A3 Learning System: Advanced [7] About Synthetic Personality Inventory (SPI): https://www.spi.recruit. Active and Autonomous Learning System, International Journal of co.jp/ (URL: In Japanese) Engineering Pedagogy (iJEP), Vol.6, No.2, pp.52-58 (2016). 194
2019 4th International Conference on Information Technology (InCIT2019) Power Allocation for Sum Rate Maximization in 5G NOMA System with Imperfect SIC : A Deep Learning Approach Worawit Saetan Sakchai Thipchaksurat Department of Computer Engineering Department of Computer Engineering Faculty of Engineering Faculty of Engineering King Mongkut’s Institute of Technology Ladkrabang King Mongkut’s Institute of Technology Ladkrabang Bangkok, Thailand Bangkok, Thailand [email protected] [email protected] Abstract—Non-orthogonal multiple access (NOMA) is In this paper, we propose a power allocation scheme for regarded as a promising technology for enhancing the spectral sum rate maximization for downlink NOMA system in the efficiency (SE) in 5G communication system. In this paper, we presence of imperfect SIC. The main idea is to use the deep propose a power allocation scheme for maximizing sum rate for learning to predict the optimal power allocation. The optimal downlink NOMA system in the presence of imperfect successive scheme can be obtained by formulating a power allocation as a interference cancellation (SIC). The proposed scheme uses sum rate maximization problem and then exhaustive searching deep learning to predict the optimal power allocation through the solution of the formulated problem. exhaustive search. Simulation results reveal that the proposed scheme can achieve the sum rate performance close to the The rest of this paper is organized as follows. System model optimal scheme but with much lower computational complexity. and problem formulation are introduced in section II. The proposed power allocation scheme is presented in section III. Keywords—Non-orthogonal multiple access (NOMA), power The simulation results are shown in section IV. Finally, the allocation, sum rate maximization, deep learning, imperfect SIC conclusion is drawn in section V. I. INTRODUCTION II. SYSTEM MODEL AND PROBLEM FORMULATION Non-orthogonal multiple access (NOMA) is a multiple A. System Model access technology for 5G system [1]. The principle of NOMA is to permit more than one user to occupy the same resources We consider a downlink NOMA system, where one base via a power domain. As a result, NOMA can achieve more spectrum efficiency than orthogonal multiple access (OMA) station (BS) serves all K users and both of them are equipped such as TDMA. with single antennas. The location of BS is center of the cell In [2], the authors reveal that NOMA can provide superior sum rate than OMA with fixed power allocation and uniformly and all users are randomly distributed within that cell. deployed users. In [3], the authors focus on optimal power allocation for sum rate maximization in a NOMA system with The transmitted signal which is performed at BS by using impaired CSI at transmitter. In [4], the issue of non orthogonal user selection, power allocation among the selected users and superposition coding through power domain multiplexing can power allocation across the sub-bands have been addressed for an OFDM based NOMA system. In [5], the impact of power be expressed as allocation on the fairness of downlink NOMA system was investigated. In [6], the authors proposed an energy efficient K√ (1) power allocation scheme which involved to the sum rate x = pksk maximization problem for SC-NOMA system. In [7], energy efficient power allocation was studied for a hybrid system with k=1 NOMA integrated to OMA. 
where sk is the transmitted signal of user k and pk is the Most previous researches on NOMA assumed that SIC transmission power for signal of user k. The received signal process is performed perfectly in NOMA system. However, in at user k can be written by practical scenarios, error propagation during SIC process can occur. As a result, the receiver cannot cancel the interference yk = hkx + nk (2) from the other users’ signals with poorer channel gain. This uncancelled interference can degrade the system performance. where hk is the channel coefficient from user k to BS , whose value depends on the path loss between BS and user k. The term nk denotes additive white Gaussian noise (AWGN) with variance σ2. In this paper, we assume that channel state information (CSI) is perfectly known at BS and the channel coefficients are sorted as |h1| ≥ |h2| ≥ · · · ≥ |hK |. In NOMA principle, the SIC process is used to cancel interference at the receiver and may be imperfect in practical 195
systems. This imperfect SIC can cause residual interference. As a result, the received SINR after the SIC process with residual interference at user k can be expressed as

SINR_k = \frac{H_k \alpha_k P_{tot}}{H_k P_{tot} \sum_{i=k+1}^{K} \epsilon_{k,i}\,\alpha_i + H_k P_{tot} \sum_{j=1}^{k-1} \alpha_j + 1}    (3)

where H_k = |h_k|^2/σ^2 is the channel response normalized by noise (CRNN) of user k, P_tot is the total transmission power, α_k is the power allocation factor of the signal of user k, and ε_{k,i} is the uncancelled fraction of the signal power of user i that relates to user k [8]. Using the Shannon capacity formula, the achievable data rate of user k is given by R_k = log2(1 + SINR_k).

B. Problem Formulation

In (3), the transmission power can enhance the desired signal but can also raise the residual interference, which can decrease the system performance. In order to improve the system performance, we propose a power allocation for sum rate maximization. The sum rate is given by

R_{sum} = \sum_{k=1}^{K} R_k    (4)

The power allocation problem for sum rate maximization can be formulated as the following optimization problem

\max_{\alpha_k} R_{sum}
subject to   C1: \sum_{k=1}^{K} \alpha_k \le 1,   C2: \alpha_k \ge 0, \forall k,   C3: R_k \ge R_k^{min}    (5)

where C1 is the transmission power constraint for the BS, C2 ensures that the signal power of each user is not negative, and C3 indicates that the data rate of each user must be larger than the minimum user data rate R_k^min.

III. THE PROPOSED POWER ALLOCATION SCHEME

Our proposed approach learns the exhaustive search scheme by using a deep neural network (DNN), where the input feature of the DNN is the set of channel responses normalized by noise (CRNN) {H_k}, the total transmission power P_tot, and the uncancelled fraction of signal power ε, and the output label of the DNN is the set of power allocation factors {α_k}.

The tuple ({ε, P_tot, {H_k}}, {α_k}) is used as the training data, where the total transmission power P_tot and the uncancelled fraction of signal power ε are in a fixed range, and the set of CRNNs {H_k} is calculated from the users' locations with a uniform distribution and the channel realizations with a Rayleigh distribution. The set of optimal power allocation factors {α_k} is obtained by running the exhaustive search power allocation algorithm.

We generate multiple sets of the above tuple ({ε, P_tot, {H_k}}, {α_k}) and then use them to train a DNN. The mean square error between the output of the DNN and the label {α_k} is used as the loss function. The scaled conjugate gradient (SCG) algorithm is adopted as the optimization algorithm. The hyperbolic tangent is used as the activation function for the hidden layers, and the logistic function is used as the activation function for the output layer.

In the testing stage, all input parameters are generated with the same distribution as in the training stage. The tuple ({ε, P_tot, {H_k}}, {α_k}) is also used as the testing data. We pass each tuple through the trained DNN and then collect the predicted output. We calculate the spectrum efficiency of the power allocation factors predicted by the trained DNN and compare it with that provided by the exhaustive search algorithm.

IV. SIMULATION RESULTS

In this section, the performance of the proposed scheme is evaluated for the downlink NOMA system. We compare our proposed scheme with the optimal power allocation (OPA) scheme obtained by exhaustively searching the solution of the optimization problem (5). The OPA scheme provides the upper-bound performance but is not practical due to its high computational complexity. We define the minimum distance between the users and the BS as 40 m. The cell radius is set to 300 m. For the sake of simplicity, all R_k^min are equal to R_min and all ε_{k,i} are equal to ε [8]. In addition, Table I shows the simulation parameters.

TABLE I. SIMULATION PARAMETERS
Minimum User Data Rate (R_min): 0.2 bps/Hz
Path Loss Exponent (v): 2.3
Noise Power Density (N_0): 174 dBm
Total Transmission Power (P_tot): 10 - 100 mW
Uncancelled fraction of signal power (ε): 0, 0.01

For the 2-user NOMA system, Figure 1 compares our proposed scheme with the OPA scheme. We investigate the sum rate versus the total transmission power. The proposed scheme can provide a sum rate close to that of the OPA scheme. For the case ε = 0.00, the proposed scheme provides 99.78% of the sum rate of the OPA scheme. For the case ε = 0.01, the proposed scheme provides 99.48% of the sum rate of the OPA scheme. Nevertheless, the sum rate of both schemes reduces because of the uncancelled fraction of signal power.

For the 3-user NOMA system, Figure 2 compares our proposed scheme with the OPA scheme. We investigate the sum rate versus the total transmission power. The proposed scheme can provide a sum rate close to that of the OPA scheme. For the case ε = 0.00, the proposed scheme provides 97.97% of the sum rate of the OPA scheme. For the case ε = 0.01, the proposed scheme provides 98.83% of the sum rate of the OPA scheme. Nevertheless, the sum rate of both schemes reduces because of the uncancelled fraction of signal power.
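To make the formulation concrete, the sketch below evaluates the sum rate of (3)-(4) under imperfect SIC, generates optimal labels for a two-user case by a simple grid search subject to C1-C3, and fits a small regression network to them. This is a hedged illustration, not the authors' implementation: the grid step, the toy channel model, the network size, and the Adam solver (scikit-learn does not offer scaled conjugate gradient) are all assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def sum_rate(alpha, H, P_tot, eps):
    """Sum of log2(1 + SINR_k), with SINR_k as in (3): residual interference
    eps * alpha_i from imperfectly cancelled weaker users (i > k) and full
    interference from stronger users (j < k)."""
    rates = []
    for k in range(len(H)):
        residual = H[k] * P_tot * eps * np.sum(alpha[k + 1:])
        uncancelled = H[k] * P_tot * np.sum(alpha[:k])
        sinr = H[k] * alpha[k] * P_tot / (residual + uncancelled + 1.0)
        rates.append(np.log2(1.0 + sinr))
    return sum(rates), rates

def exhaustive_search(H, P_tot, eps, r_min=0.2, step=0.01):
    """Grid search over alpha_1 for K = 2 (alpha_2 = 1 - alpha_1), keeping C1-C3."""
    best, best_alpha = -np.inf, None
    for a1 in np.arange(step, 1.0, step):
        alpha = np.array([a1, 1.0 - a1])
        total, rates = sum_rate(alpha, H, P_tot, eps)
        if min(rates) >= r_min and total > best:
            best, best_alpha = total, alpha
    return best_alpha, best

# Training tuples ({eps, P_tot, {H_k}}, {alpha_k}) from a toy channel model.
rng = np.random.default_rng(0)
X, y = [], []
for _ in range(500):
    P_tot = rng.uniform(0.01, 0.1)                  # 10-100 mW, in watts
    eps = rng.choice([0.0, 0.01])
    dist = rng.uniform(40.0, 300.0, size=2)         # user distances in metres
    h = rng.rayleigh(size=2) * dist ** (-2.3 / 2)   # toy path loss + fading
    H = np.sort(np.abs(h) ** 2)[::-1] * 1e12        # CRNN, arbitrary noise scale
    alpha, _ = exhaustive_search(H, P_tot, eps)
    if alpha is not None:
        X.append([eps, P_tot, *H])
        y.append(alpha)

# Small regression network standing in for the paper's DNN (tanh hidden layers;
# Adam instead of scaled conjugate gradient, which scikit-learn does not offer).
model = MLPRegressor(hidden_layer_sizes=(16, 16), activation="tanh",
                     max_iter=2000, random_state=0).fit(X, np.array(y))
print(model.predict(X[:1]))                         # predicted power split
```

In the paper the labels come from an exhaustive search over all K users and the network is trained on the full tuple; the sketch only mirrors that pipeline at toy scale.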
Fig. 1. The sum rate of the proposed scheme and the optimal scheme (OPA) for the 2-user NOMA system.

Fig. 2. The sum rate of the proposed scheme and the optimal scheme (OPA) for the 3-user NOMA system.

Fig. 3. The average CPU time of the proposed scheme and the optimal scheme (OPA) for the 2-user NOMA system.

Fig. 4. The average CPU time of the proposed scheme and the optimal scheme (OPA) for the 3-user NOMA system.

For the 2-user NOMA system, Figure 3 shows the average CPU time versus the total transmission power. In this figure, we set ε = 0.01. Obviously, the proposed scheme takes much less average CPU time than the OPA scheme. For example, at a total transmission power of 80 mW, the average CPU time of the proposed scheme is 0.0029 ms while that of the OPA scheme is 0.2777 ms, which is about 95.75 times longer. Particularly, when the total transmission power increases, the average CPU time of the proposed scheme stays constant, but that of the OPA scheme grows exponentially. This is because the exhaustive search uses a large number of combinations.

For the 3-user NOMA system, Figure 4 shows the average CPU time versus the total transmission power. In this figure, we set ε = 0.01. Obviously, the proposed scheme takes much less average CPU time than the OPA scheme. For example, at a total transmission power of 80 mW, the average CPU time of the proposed scheme is 0.0072 ms while that of the OPA scheme is 9.8191 ms, which is about 1363.76 times longer. Particularly, when the total transmission power increases, the average CPU time of the proposed scheme stays constant, but that of the OPA scheme grows exponentially. This is because the exhaustive search uses a large number of combinations.

V. CONCLUSION

In this paper, we propose a power allocation scheme for sum rate maximization for the downlink NOMA system in the presence of imperfect SIC. The power allocation problem is formulated as a sum rate maximization problem. Then, we find the solution of the formulated problem via exhaustive search. A deep neural network is trained to learn the optimal power allocation. The simulation results verify that the proposed scheme can closely attain the sum rate of the optimal scheme but with much lower computational complexity.

REFERENCES
[1] L. Dai, B. Wang, Y. Yuan, S. Han, C. I, and Z. Wang, "Non-orthogonal multiple access for 5G: solutions, challenges, opportunities, and future research trends," IEEE Communications Magazine, vol. 53, no. 9, pp. 74–81, Sep. 2015.
[2] Z. Ding, Z. Yang, P. Fan, and H. V. Poor, "On the performance of non-orthogonal multiple access in 5G systems with randomly deployed users," IEEE Signal Processing Letters, vol. 21, no. 12, pp. 1501–1505, Dec. 2014.
[3] M. R. Zamani, M. Eslami, and M. Khorramizadeh, "Optimal sum-rate maximization in a NOMA system with channel estimation error," in Electrical Engineering (ICEE), Iranian Conference on, May 2018, pp. 720–724.
2019 4th International Conference on Information Technology (InCIT2019) [4] P. Parida and S. S. Das, “Power allocation in ofdm based noma systems: A dc programming approach,” in 2014 IEEE Globecom Workshops (GC Wkshps), Dec 2014, pp. 1026–1031. [5] S. Timotheou and I. Krikidis, “Fairness for non-orthogonal multiple access in 5g systems,” IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1647–1651, Oct 2015. [6] Y. Zhang, H. Wang, T. Zheng, and Q. Yang, “Energy-efficient transmis- sion design in non-orthogonal multiple access,” IEEE Transactions on Vehicular Technology, vol. 66, no. 3, pp. 2852–2857, March 2017. [7] M. Zeng, A. Yadav, O. A. Dobre, and H. V. Poor, “Energy-efficient power allocation for hybrid multiple access systems,” in 2018 IEEE In- ternational Conference on Communications Workshops (ICC Workshops), May 2018, pp. 1–5. [8] A. Agrawal, J. G. Andrews, J. M. Cioffi, and T. Meng, “Iterative power control for imperfect successive interference cancellation,” IEEE Transactions on Wireless Communications, vol. 4, no. 3, pp. 878–884, May 2005. 198
2019 4th International Conference on Information Technology (InCIT2019) Dimensionality Reduction Based on Feature Selection for Rice Varieties Recognition Huu-Thanh Duong Vinh Truong Hoang Faculty of Computer Science Faculty of Computer Science Ho Chi Minh City Open University Ho Chi Minh City Open University Ho Chi Minh, Vietnam Ho Chi Minh, Vietnam [email protected] [email protected] Abstract—Rice is a primary source of people all around chai [4] develop a Rice Seed Germination Evaluation System the world. Rice quality is mainly depend on the genetic based on image processing neural networks. This systems characteristics of a rice variety. One of the most important tasks combine color, morphological and textural features extracted for quality control of rice production is rice seed classification. from rice images in order to perform germination prediction. We present an approach for rice seed images classification Chen et al. [5] fuse geometric, shape and color features ex- based on Histogram of Oriented Gradient descriptor and tracted from color images of corn kernels. Then, two optimal feature selection. The experiment is applied on a VNRICE feature sets were reduced by discriminant analysis, and used benchmark dataset and show the efficiency of the proposed as inputs into neural network for classifying. Mebatsion et approach by reducing the number of selected features and al. [6] combine Fourier descriptors and three geometrical increasing the accuracy. for automatic classification of non-touching cereal grains. The combined model defined by morphological and color Index Terms—HOG descriptor, feature selection, color space, attributes achieved a classification accuracy more than 98% Fisher score, rice seed image for the two datasets from Canada (Western Amber Durum and Western Red Spring). Szczypin´ski et al. [7] identify the I. INTRODUCTION barley varieties based on image attributes extracted from shape, color and texture of individual kernels. Chaugule Rice is the most important agricultural plant in many other and Mali [8] propose a new feature extraction approach countries and it is a primary source of food consumed by for classifying paddy seeds based on seed color, shape, almost half of world population. The purity of rice seeds is and texture from Horizontal–Vertical and Front–Rear angles. one of the most important factors to have a high yielding Kuo et al. [9] recognize rice grains image by using the crops. Thus, rice quality is mainly depend on the genetic sparse-representation based classification. Li et al. [10] use characteristics of a rice variety. In reality, different rice the laser scanning system to acquire the three dimensional varieties might be mixed, which can be affected to the quality point cloud of a rice seed. The length, width, thickness of a rice crop. An inspection process is needed to identify the and shape of rice seed are computed based on the oriented category of rice variety. Currently, the identification process bounding box. Kurtulmus¸ and U¨ nal [11] apply three groups of unwanted seeds is done manually by visual inspection in of texture features extracted from gray level co-occurrence Vietnamese companies. This process is time consuming and matrix, gray level run length matrix and Local Binary Pattern can lead to a degradation of the quality of seeds. Therefore, (LBP) for classifying the seven rapeseed varieties scanner a computer vision system is needed to automate the process. images. 
Two color spaces (HSI and Y CbCr) are further investigated to obtain the robust features. Grains are vital The advances of this field have brought several benefits to people all around the world, thus the demand for efficient to our daily life. A huge of machine vision systems have methods in grain production is growing. Phan et al. [12] eval- been developed and applied to deal with real-life problems. uate and compare different local image descriptors (GIST, In agriculture, there are several works that apply computer SIFT, morphological features) and classifier (Random Forest, vision for analysis of defects, automatic characterization, KNN, SVM) for rice seed varieties identification. They quality evaluation of fruits, vegetables, and grains [1], [2], showed that the random forest gives the best results for [3]. Grain seed varieties identification is one of the most discriminating rice seed images. important tasks in agriculture. Several approaches have been proposed to solve this problem. It is possible to realize a In many real world applications, data with high dimen- method for identifying the variety of rice seeds in varieties sionality decreases the performance of learning process due mixed with the advancement of technology and engineering. to the curse of dimensionality and the existence of irrelevant, This using an automatic computer-aided vision system for redundant, and noisy features. Processing and stocking such analyze rice seeds and determine their purity. For this, the step of analysis and image recognition require the definition of descriptors effective in representing and discriminating different classes of seed textures. Lurstwut and Pornpanom- 199
amounts of high dimensional data become a challenge. It is necessary to choose a small subset of the relevant features from the original ones in order to reduce the computing time and also the memory needed to store the data. In order to solve the class discrimination problem in color texture classification, feature selection is applied by using the class labels to identify a subset of the most discriminative variables. Feature extraction methods project features into a new feature space with lower dimensionality, and the newly constructed features are usually combinations of the original features. Examples of feature extraction methods include Principal Component Analysis and Linear Discriminant Analysis [13]. Since feature extraction builds a set of new features, we cannot recover the physical meaning of these features in the transformed space by further analysis. In contrast, feature selection selects a subset of features from the original feature set without any transformation and maintains the physical meaning of the original features. In that context, feature selection gives models better readability and interpretability. Feature selection is widely used in diverse applications: machine learning, data analysis, and recently it has been successfully applied in computer vision, such as information retrieval and visual object tracking [14].

In the past decade, various discriminative and computationally efficient image descriptors have been proposed, which contribute significantly to pattern recognition tasks. Among them, the Histograms of Oriented Gradient (HOG) descriptor is one of the most successful descriptors for recognizing and detecting objects. In order to have a compact characterization and reduce the high-dimensional feature vector, we propose an approach that applies feature selection to HOG features extracted from rice variety images based on different color spaces. The rest of this paper is organized as follows. Section II briefly reviews the HOG descriptor and feature selection methods. Next, section III presents the experimental results of rice seed image classification. Finally, the conclusion is given in section IV.

II. FEATURE EXTRACTION AND SELECTION

A. Histograms of Oriented Gradient descriptor

The HOG descriptor has been successfully applied to different computer vision recognition tasks [15]. It was first proposed by Dalal and Triggs [16] and is mainly used for object detection and classification. The HOG feature is extracted by counting the occurrences of gradient orientations based on the gradient angle and the gradient magnitude of local patches of an image. The gradient angle and magnitude at each pixel are computed in an 8 × 8 pixel patch. Next, the 64 gradient feature vectors are divided into 9 angular bins over 0 − 180° (20° each). Then, on each patch, the histogram of orientations is built by accumulating the magnitudes of the gradient. Finally, the histograms of orientations from each patch are normalized and concatenated. The gradient magnitude M and angle A at each position (i, j) of an image I are computed as follows:

\Delta_i = |I(i-1, j) - I(i+1, j)|    (1)
\Delta_j = |I(i, j-1) - I(i, j+1)|    (2)
M(i, j) = \sqrt{\Delta_i^2 + \Delta_j^2}    (3)
A(i, j) = \tan^{-1}\left(\frac{\Delta_i}{\Delta_j}\right)    (4)

B. Feature selection

In terms of the availability of supervised information, feature selection techniques can be roughly classified into three groups: supervised, unsupervised and semi-supervised methods [17]. Most supervised and semi-supervised feature selection methods assess the relevance of features using the class label information. Based on the different evaluation strategies, feature selection can also be classified into three groups: filter, wrapper and hybrid methods [18]. Filter methods select the subset of features as a pre-processing step without involving the classifiers. Typical filter methods consist of two steps. In the first step, feature relevance is ranked by a feature score according to some feature evaluation criteria, which can be either univariate or multivariate. Wrapper methods evaluate each candidate feature subset through the classification algorithm, using the estimated accuracy of the classification algorithm as the evaluation metric. They then select the most discriminative subset of features by minimizing the prediction error rate of a particular classifier. Hybrid methods combine both filter and wrapper methods into a single framework, in order to provide a more efficient solution to the feature selection problem [19].

Given a data matrix X = [x_1, ..., x_i, ..., x_N] ∈ R^{D×N}, each image I_i is associated with a class label y_i, {x_i, y_i}, y_i ∈ {1, ..., c, ..., C}, where C is the number of classes and N_c denotes the number of instances in class c. Fisher score is one of the most widely used supervised feature selection scores. The principal idea of the Fisher score is to identify a subset of features so that the distances between samples in different classes are as large as possible, while the distances between samples in the same class are as small as possible.

Let μ_r denote the mean of all instances on the r-th feature, and μ_r^c and (σ_r^c)^2 the mean and variance of class c corresponding to the r-th feature, respectively. The Fisher score of the r-th feature, which should be maximized, is calculated as follows [20]:

Fisher_r = \frac{\sum_{c=1}^{C} N_c (\mu_r^c - \mu_r)^2}{\sum_{c=1}^{C} N_c (\sigma_r^c)^2}    (5)

where the numerator is the between-class variance considering the r-th feature and the denominator is the within-class variance considering the r-th feature. The features are then ranked in ascending order to select the relevant ones according to their score value.
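As an illustration of the two building blocks reviewed in this section, the sketch below extracts HOG features with 9 orientation bins and 8 × 8 pixel cells and computes the Fisher score of equation (5). It is not the authors' code: the scikit-image parameters not stated in the text (block size and normalization) and the grayscale simplification (the paper works with components of several color spaces) are assumptions.

```python
import numpy as np
from skimage import color, io, transform
from skimage.feature import hog

def hog_features(path):
    """HOG of a rice seed image: 9 orientation bins, 8 x 8 pixel cells."""
    img = color.rgb2gray(io.imread(path))
    img = transform.resize(img, (103, 441))      # common size used in the paper
    return hog(img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

def fisher_score(X, y):
    """Eq. (5): between-class variance over within-class variance, per feature."""
    X, y = np.asarray(X), np.asarray(y)
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)

# X: (n_samples, n_features) matrix of HOG vectors, y: rice variety labels
# scores = fisher_score(X, y); ranking = np.argsort(scores)[::-1]  # most discriminative first
```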
III. EXPERIMENTAL RESULTS

A. Dataset and experimental setup

The benchmark rice seed (VNRICE) dataset consists of six common Vietnamese rice seed varieties: BC-15, Huong Thom-1, Nep-87, Q-5, Thien Uu-8 and Xi-23. These rice seeds were sampled from a rice seed production company where the rice varieties were grown and harvested following certain conditions for standard rice seed production [12]. All images were acquired by CMOS image sensor color cameras. Figure 1 illustrates example images from this dataset. Each image represents a class of rice seed varieties. In order to show the efficiency of the selection stage, the 1-NN classifier associated with the L1 distance is considered due to its simplicity and non-parametric nature. The classification performance is evaluated by the accuracy rate.

Furthermore, color images are acquired by devices that code the colors in the RGB space. In reality, there are many other color spaces with different specific properties [21], and it is known that the classification performance depends on the choice of the color space in which a classifier operates [22]. We exploit eight well-known color spaces (RGB, HSV, I1I2I3, Luv, XYZ, YCbCr, YUV, YIQ) to code the rice images. All images are resized to the most common size of 103 × 441 pixels before extracting HOG features, since the classifier needs input vectors of the same dimension. We randomly divide the original dataset into training and testing sets by the hold-out method. Table I presents the characteristics of the VNRICE dataset and the number of images of each class after the decomposition.

TABLE I. CHARACTERISTIC OF DATASET
Rice variety     # Training set   # Testing set   Total images
BC-15            917              917             1,834
Huong Thom-1     1,048            1,048           2,096
Nep-87           700              699             1,399
Q-5              962              962             1,924
Thien Uu-8       513              513             1,026
Xi-23            1,115            1,114           2,229

The dimension of the feature vector extracted by the HOG descriptor is 21,384. We can observe that the classification performance varies with the color space. We first present the results obtained by HOG features on the eight color spaces (Table II) without using a feature selection method. However, we can observe that the RGB space does not give the best performance. This result confirms again the work in [23].

TABLE II. CLASSIFICATION RESULTS OF THE DIFFERENT COLOR SPACES ON THE VNRICE TEXTURE DATABASE WITHOUT USING THE FEATURE SELECTION APPROACH (FEATURE DIMENSION: 21,384 FOR ALL COLOR SPACES). THE VALUE IN BOLD INDICATES THE HIGHEST ACCURACY OBTAINED.
Color space      Accuracy
RGB              92.14
HSV              93.07
I1I2I3           91.53
Luv              90.90
XYZ              92.23
YCbCr            91.26
YUV              91.53
YIQ              91.22

Fisher score is then applied to compute the score of each feature of the training set and rank the features according to their values. Different feature subsets are chosen by setting a cutoff on the number of selected features. We use various cutoff values from 1% to 100% of the relevant features. Once the number of selected features is determined on the training set, it is then applied to the testing set. Figure 2 shows the results obtained by using the Fisher score with 1-NN as a function of the number of selected features. We see that the HSV color space outperforms the other spaces for the different cutoff values. This space gives a classification rate of more than 91.50% with only 214 features (at the 10% cutoff) by using the Fisher score.

Table III shows the classification results obtained by using the Fisher score under different color spaces and the maximized accuracy associated with the number of selected features. Similarly, the HSV color space gives the best rate, with 8,981 features selected. Comparing the results obtained for each color space in Table II and Table III, the Fisher score allows a large reduction of the number of selected features; for example, on the YIQ space the accuracy improves by +0.18% while using only 6% of the total number of features. It is worth noting that all the rates obtained with the Fisher score are higher than the results without using feature selection.

TABLE III. CLASSIFICATION RESULTS OBTAINED BY USING FISHER SCORE UNDER DIFFERENT COLOR SPACES AND ITS MAXIMIZED ACCURACY ASSOCIATED WITH NUMBER OF SELECTED FEATURES. THE VALUE IN BOLD INDICATES THE BEST RESULTS OBTAINED.
Color space      Accuracy   Cutoff ratio (%)   Number of selected features
RGB              92.42      30                 6,415
HSV              93.34      42                 8,981
I1I2I3           91.78      22                 4,704
Luv              91.01      64                 13,685
XYZ              92.29      95                 20,314
YCbCr            91.38      66                 14,113
YUV              91.78      22                 4,704
YIQ              91.40      6                  1,283
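The evaluation protocol described above (rank features on the training split with the Fisher score, sweep cutoff ratios, and score a 1-NN classifier with the L1 distance on the held-out split) can be sketched as follows. This is not the authors' implementation; the function names, and the reuse of fisher_score() from the previous sketch, are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def accuracy_per_cutoff(X_train, y_train, X_test, y_test, cutoffs):
    """Rank features by Fisher score on the training split, then evaluate a
    1-NN / L1 classifier on the test split for each cutoff ratio."""
    ranking = np.argsort(fisher_score(X_train, y_train))[::-1]
    results = {}
    for ratio in cutoffs:
        k = max(1, int(ratio * X_train.shape[1]))
        idx = ranking[:k]
        clf = KNeighborsClassifier(n_neighbors=1, metric="manhattan")
        clf.fit(X_train[:, idx], y_train)
        results[ratio] = clf.score(X_test[:, idx], y_test)
    return results

# results = accuracy_per_cutoff(X_train, y_train, X_test, y_test,
#                               cutoffs=np.arange(0.01, 1.01, 0.01))
```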

Computer vision and artifi- [15] Oscar Deniz, Gloria Bueno, Jesu´s Salido, and Fernando De la Torre. cial intelligence in precision agriculture for grain crops: A systematic Face recognition using histograms of oriented gradients. Pattern review. Computers and Electronics in Agriculture, 153:69–81, 2018. Recognition Letters, 32:1598–1603, 09 2011. [3] Avinash C. Tyagi. Towards a second green revolution. Irrigation and [16] N. Dalal and B. Triggs. Histograms of oriented gradients for human Drainage, 65(4):388–389, 2016. detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 886–893. [4] Benjamaporn Lurstwut and Chomtip Pornpanomchai. Image analysis IEEE. based on color, shape and texture for rice seed ( Oryza sativa L. ) germination evaluation. Agriculture and Natural Resources, [17] K. Benabdeslem and M. Hindawi. Constrained laplacian score 51(5):383–389, October 2017. for semi-supervised feature selection. In Machine Learning and Knowledge Discovery in Databases, pages 204–218. Springer, 2011. [18] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003. [19] M. Monirul Kabir, M. Monirul Islam, and K. Murase. A new wrapper feature selection approach using neural network. Neurocomputing, 73(16-18):3273–3283, October 2010. [20] C. M. Bishop. Neural networks for pattern recognition. Oxford university press, 1995. [21] Imtnan-Ul-H. Qazi, O. Alata, J. C. Burie, A. Moussa, and C. Fernandez-Maloigne. Choice of a pertinent color space for color texture characterization using parametric spectral analysis. Pattern Recognition, 44(1):16–31, January 2011. [22] A. Porebski, N. Vandenbroucke, and L. Macaire. Supervised texture classification: color space or texture feature selection? Pattern Analysis and Applications, 16(1):1–18, 2013. [23] Vinh Truong Hoang. Multi color space LBP-based feature selection for texture classification. Ph.d thesis, University of Littoral Coast Opal, February 2018. 202
2019 4th International Conference on Information Technology (InCIT2019) Examination of A-txt System Independent from OSs After Developed the iOS and Android Version Shin-nosuke Suzuki Naoki Terajima Yutaro Akimoto Dept. of Innovative Electrical and Dept. of Innovative Electrical and Dept. of Innovative Electrical and Electronic Engineering Electronic Engineering Electronic Engineering National Institute of Technology National Institute of Technology National Institute of Technology (KOSEN),Oyama College (KOSEN),Oyama College (KOSEN),Oyama College Oyama, Japan Oyama, Japan Oyama, Japan [email protected] Akira Okada Ryohei Kameyama Masaya Yamaguchi Dept. of General Education Covelline, LLC. Covelline, LLC. National Institute of Technology Tokyo, Japan Tokyo, Japan (KOSEN),Oyama College [email protected] [email protected] Oyama, Japan Abstract— The Active Textbook System (A-txt) is the next behavior as an iOS device. On the other hand, some problems generation learning material system. This system can be to in A-txt software development came up from the viewpoint of enhance an ordinary book by adding digital content to the program configuration in iOS and Android. Based on them, figures in the book by using AR technology. The content can be guidelines for future A-txt development have been considered browsed using the users’ own smart devices. Also, once the and it has been examined that the transition to a system content is downloaded, it can be used offline. In this time, independent from OSs using Unity in order to solve the following the iOS version of A-txt, the system for Android problems identified in the development process. devices, which have more users, have been developed. Since the Android OS is free, there are a wide variety of terminals II. OUTLINE OF A-TXT SYSTEM adopted. Therefore, recommended hardware specifications Figure 1 shows the outline of the A-txt system and the flow have been formulated by benchmark evaluation. As a result, of utilization. Here, teachers who produce content are applications compatible with two major operating systems regarded as producers, and students who learn using the became available. Based on the results, it has been examined system are regarded as learners. First, the producer selects that the transition to a system is independent from OSs using images, figures and formulas, etc. to be markers in a book, and Unity in order to solve the problems identified in the creates a pattern file by extracting the feature quantity from development process. them using software. Next, the pattern file and the additional content to be linked to it are registered in the content server. Keywords— AL (Active learning), A-txt (Active textbook In this system, these operations can be operated by graphical system), iOS, Android, Unity user interface (GUI) on a PC screen. Furthermore, access to the content server is possible remotely within the same I. INTRODUCTION network as the one on campus. Therefore, the producers can upload contents from their room to the server. In A-txt, Various studies of digital technology to education have general movies, static images, audio files, etc. can be used as been reported [1], as well as active learning (hereinafter AL), content. It also supports digitization of handwritten materials which is considered as a method for learning ambitiously [2]. with cameras or scanners. 
As a result, it is a system that is easy This study is one of them, and the authors propose an active for even producers who do not have ICT expertise to handle. textbook system (A-txt) [3], [4]. A-txt is a learning material that aims to enhance the students’ understanding and interest Fig. 1. Outline of the A-txt system. by adding digital content to ordinary textbooks using Augmented Reality (AR) technology [5]. Content production and registration to the server can be done relatively easily using a general-purpose personal computer (PC), and the content added to the book can be browsed using the users’ own smart devices. Therefore, the introduction cost can be reduced without the need for a dedicated machine. Also, once content is downloaded, it can be used offline. This point is an advantage that other systems do not have. So far, A-txt has been developed only for iOS devices (Apple). As those devices have standardized the hardware specifications, image recognition in AR is stable. In this time, our research team has succeeded in developing A-txt for Android (Google) devices used by more users around the world [6]. Additionally, a recommended specification has been formulated using some benchmark softwares for the hardware of an Android device that achieves the same 203
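Purely as an illustration of the workflow just described (a producer links a pattern file and content on the server; a learner's device downloads the content once and reuses it from a local cache when offline), a minimal sketch is given below. It is not the actual A-txt implementation; every name, path and URL here is hypothetical.

```python
import os
import urllib.request

REGISTRY = {}   # marker_id -> pattern file and content URL (server side)

def register_marker(marker_id, pattern_file, content_url):
    """Producer side: link a pattern file extracted from a figure to content."""
    REGISTRY[marker_id] = {"pattern": pattern_file, "content": content_url}

def fetch_content(marker_id, cache_dir="cache"):
    """Learner side: return the cached file if present, download it otherwise,
    so that content fetched once can be displayed offline afterwards."""
    os.makedirs(cache_dir, exist_ok=True)
    local = os.path.join(cache_dir, f"{marker_id}.bin")
    if os.path.exists(local):
        return local
    urllib.request.urlretrieve(REGISTRY[marker_id]["content"], local)
    return local
```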
2019 4th International Conference on Information Technology (InCIT2019) On the other hand, the learners start the A-txt application different performance. Additionally, as a quantitative pre-installed on their mobile devices and download additional evaluation, benchmark tests were also applied using some content via the network. After that, by capturing the marker benchmark softwares that can compare both OSs. The image registered as the pattern file with the camera of their softwares used here were “Antutu benchmark” [8], which devices, the content is displayed on the screen. Here, the performs comprehensive measurement, and “3D Mark” [9] content is stored in the cache in the devices, so once it is which emphasizes measurement of graphic performance. downloaded, A-txt can be used offline thereafter. Therefore, Each hardware was tested three times with each benchmark unless many downloads are done at the same time, it will not software, and the average value was used as the evaluation be a network load. Additionally, the application also has a value. Table 1 and 2 shows the comparison of the main function to contact the producer [4]. The learners can take a hardware specifications, and the numerical value of the picture of the part where the content is not placed or the part benchmarks and the users’ sensory evaluation of both OS they want to ask using the function in the application, add a hardwares. These evaluations were classified into four levels: comment, and send it to the producer. This realizes interactive smooth, somewhat smooth, somewhat heavy, and heavy. education. From the results, in Android, with the model developed in Currently, the system that consists of A-txt uses Mac mini recent years, such as Pixel3 and Xperia XZ2 Compact, no (Apple) as a server machine, and creates a pattern file from problems were found in operation, and a very high benchmark markers based on ARToolKit6 [7], a free AR development score equal to or higher than the iOS device was recorded. On library, and developed dedicated applications for servers and the other hand, low-performance and hardwares which users’ devices. The System development is collaborated with developed several years ago, such as P20 lite and XperiaA4, a Japanese venture company, Covelline, LLC. had low benchmark values. Although the static images could be displayed, the marker capture was unstable itself and the III. DEVELOPMENT OF ANDROID VERSION A-TXT movies could not be played smoothly. In the iOS device, the tablet types are better than the smartphones in both the A. Function of Android version A-txt operating status and the benchmark value. Tablet computers In this time, Android version A-txt has been developed so have a large screen and are easy to view, so it is preferable to use A-txt for tablets for educational purposes. that more users can use it. The Android version was aimed to realize the same function as the iOS version, but some were Since the Android OS is free, the hardware that adopts the not achieved. In the Android version, static images can be AR OS has a wide range of performance. In order for A-txt to display without any problems. However, for movies, as shown work without any problems, in addition to the comprehensive in Figure 2, capturing a marker was possible but AR display evaluation in the benchmark, the GPU evaluation on graphics cannot be done. In order to play the movies, it is necessary to requires high numbers. 
In the hardware of iPhone6s or later display the full screen after capturing the marker image. In versions which realize almost comfortable operation, the addition, depending on the model of Android, processing is comprehensive evaluation in Antutu is over 200,000, and the quite slow, and the screen may change after a while by tapping GPU evaluation is 66,000 or more. In 3D Mark, iPhone 7 plus the screen several times. The reason for this is due to the is inferior to 6S in graphic score, but it operates comfortably. unique structure of the program in Android OS. The 3D According to the test results, the numerical value which display function is not included in this version for the same considered it becomes an index. In the Android device tested reason as the AR display function. this time, it is Pixel3 and Xperia XZ2 Compact that became smooth operation. Based on these results, the recommended B. Performance comparison using multiple hardware and requirements for Android devices that operate A-txt quantitative evaluation by benchmark tests comfortably are as follows. In order to find hardware specifications that allow A-txt to operate comfortably, including the Android version as well as the iOS version, quantitative evaluation is required. Therefore, operation tests were implemented on multiple models with (a) Android version does not display (b) While captured, tap the screen. It (c) The movie starts by the play AR just by capturing. becomes full screen display button. Fig. 2. Steps for playing movie for Android version A-txt. 204

2019 4th International Conference on Information Technology (InCIT2019) TABLE I. COMPARISON OF HARDWARE SPECIFICATIONS, BENCHMARK AND SENSORY EVALUATION IN ANDROID TERMINALS. Device name Xperia XZ2 Compact Pixel3 Android P20 Lite XperiaA4 Brand SONY Google Huawei SONY OS 9.0.0 9.0.0 8.0.0 6.0.1 OS Version Qualcomm Quad-Core snapdragon845 ARMv7 Processor CPU model Qualcomm SDM845 HiSilicon Kirin 659 8 4 Number of CPU core 576.0-2803.2MHz 8 8 300-2457MHz OpenGL ES 3.2 576-2803MHz 480-2362MHz OpenGL ES 3.0 Clock frequency OpenGL ES 3.2 OpenGL ES 3.0 4.93 4.51 GPU version 1080 × 2160 5.46 5.83 720 × 1280 1080 × 2160 1080 × 2280 Screen size (inch) 2019 2015 283,256 2018 2018 63,414 Screen resolution (pixel) 84,887 257,906 86,753 35,339 125,666 71,707 41,390 6,528 Release year 15,973 122,685 13,157 2,642 21,775 11,651 5,407 4,105 Comprehensive evaluation 3,346 15,049 3,616 2,889 6,655 2,688 2,386 1,638 CPU evaluation 4,411 9,866 5,604 1,186 8,154 3,503 1,208 GPU evaluation 1,869 4,491 124 1,116 Smooth 2,016 98 Heavy Antutu Data processing score Smooth 1,728 Somewhat heavy Image processing score RAM score ROM score Total Score 3D Mark Graphics score Physics score User's sensory evaluation TABLE II. COMPARISON OF HARDWARE SPECIFICATIONS, BENCHMARK AND SENSORY EVALUATION IN IOS TERMINALS. Device name iPhone7 Plus iPhone6s iPad (6th gen.) iPad Pro Apple Brand 12.2 12.1.4 iOS 12.2 Apple A10 Fusion Apple A9 APL0898 Apple A10X OS 12.2 4 2 Apple A10 Fusion 6 OS version 2.32GHz 1.8GHz 2.36GHz OpenGL ES 2.0metal OpenGL ES 3.0 4 OpenGL ES 3.0metal CPU model 2.34 GHz 5.5 4.7 OpenGL ES 3.2 12.9 Number of CPU core 1920 × 1080 1334 × 750 2732 × 2048 9.7 Clock frequency 2016 2015 2048 × 1536 2017 193,916 122,447 279,640 GPU version 78,921 47,811 2018 108,700 66,787 47,070 206,808 111,969 Screen size (inch) 8,198 4,437 88,206 9,852 2,188 1,269 66,338 2,342 Screen resolution (pixel) 4,367 2,718 9,035 4,740 3,964 2,212 5,156 Release year 2,251 824 4,526 4,792 2,935 2,027 4,211 6,413 Comprehensive evaluation 1,244 4,175 3,567 2,543 Smooth 723 5,896 Smooth CPU evaluation Somewhat smooth 1,500 Smooth GPU evaluation Antutu Data processing score Image processing score RAM score ROM score Total Score 3D Mark Graphics score Physics score User's sensory evaluation 205

2019 4th International Conference on Information Technology (InCIT2019) • Antutu Benchmark value: Comprehensive evaluation (a) iOS 200,000 or more, GPU evaluation 66,000 or more. (b) Android • 3D Mark Benchmark value: Total score 2,000 or Fig. 3. Comparison of AR display program configuration outline. more, Graphics score 3,000 or more. These problems in the Android version A-txt may be • CPU: Maximum clock frequency 2.8 GHz or more. solved if the library is changed from the current ARToolKit 6 to the latest version ARToolKitX [13], because, the source • GPU version: OpenGL ES 3.2 or later. code of ARToolKit series is open to the public, and However, in the case of static image indication, there was development is continuing. The second option is to change the no problem (except for unstable marker capture) in all the library. ARCore [14] is prepared for Android as opposed to models used this time, so it can be used sufficiently if it is an ARKit [15] for iOS. These libraries are expected to operate Android device that is currently commonly used in the world. smoothly because they can perform application development Therefore, A-txt has been realized in iOS and Android, which suitable for each hardware. However, since each library is are the two major operating systems of smart devices, and developed by different companies, there is no guarantee that almost all students can use A-txt. As a result, A-txt reached a devices of both OSs can realize the same function. In addition, level that can be used in class and self-study. In the future, we since both need to be programmed completely separately, the will move to the stage of measurement of learning effects time and the cost for development are greatly increased. This using A-txt. leads to the third option. Currently, A-txt is developed for each OS. Using Unity [16], [17] makes it possible to create an IV. FUTURE DEVELOPMENT PLAN OF A-TXT ~ TRANSITION TO application compatible with both OSs. Figure 4 shows the SYSTEM INDEPENDENT FROM OSS USING UNITY ~ outline of the system configuration of Unity version A-txt. There are other advantages to adopt Unity that is not In Android A-txt developed this time, it is impossible to dependent on the OS. Unity has its own 3D drawing system, display video in AR. This is because iOS and Android have and it is relatively easy to create 3D content including different development environments. As for development animations and sounds. In addition, since ARToolKit series environment and language, iOS is utilizing Xcode and support Unity, the know-how up to now can be used. On the Objective-C or Swift, and Android is Android Studio and Java other hand, Unity has difficulty in creating a screen of user or Kotlin. In iOS, as a library for AR display of content on interface such as setting screen, but it is not a big problem marker images, “SceneKit” [10] which handles 3D easily is considering that development for each OS is not necessary incorporated as a standard. On the other hand, the Android and various loads are reduced. does not have such a library, so it is necessary to write related set of programs using OpenGL directly. Therefore, difficulty From the above, it is considered that Unity version A-txt and cost of the Android version program are much higher than which has the same function, screen display and operation those of iOS version. method regardless of the OS is suitable as an educational application. 
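As a rough illustration only, the recommended requirements listed at the beginning of this section can be encoded as a simple compatibility check. The sketch below is not part of the A-txt system; the dictionary keys, the helper name and the example device values are hypothetical.

```python
# Illustrative sketch: encodes the recommended A-txt thresholds from this
# section as a simple check. Field names and example values are hypothetical.

RECOMMENDED = {
    "antutu_total": 200_000,   # Antutu comprehensive evaluation
    "antutu_gpu": 66_000,      # Antutu GPU evaluation
    "mark3d_total": 2_000,     # 3D Mark total score
    "mark3d_graphics": 3_000,  # 3D Mark graphics score
    "cpu_max_ghz": 2.8,        # maximum CPU clock frequency
    "gles_version": 3.2,       # OpenGL ES version
}

def meets_requirements(device: dict) -> bool:
    """Return True if every measured value reaches the recommended threshold."""
    return all(device.get(key, 0) >= threshold
               for key, threshold in RECOMMENDED.items())

if __name__ == "__main__":
    # Hypothetical measurements for a recent mid-range handset.
    example_device = {"antutu_total": 250_000, "antutu_gpu": 120_000,
                      "mark3d_total": 4_000, "mark3d_graphics": 4_400,
                      "cpu_max_ghz": 2.8, "gles_version": 3.2}
    print(meets_requirements(example_device))  # expected: True
```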
Based on the above concept, we have created a prototype of the Unity version of A-txt. Figure 5 shows the development screen. This time we used the macOS version of Unity (version 2018.3.5f1) and incorporated ARToolKitX. The prototype works in the same way on both iOS and Android, and its dependency on the Android device is almost the same as described in Chapter III-B. Although there is no problem with capturing AR markers, the screen display is still unstable and there is room for improvement. Nevertheless, we have succeeded in developing an application that operates independently of the OS.

2019 4th International Conference on Information Technology (InCIT2019) As the next step, our research team is going to develop and blush up the Unity version A-txt prototype. With the completion of Unity version, we will compare with iOS and Android version, and consider the platform suitable for introduction to various educational situations. Fig. 4. Constitution outline of Unity version A-txt. ACKNOWLEDGMENT This work is being supported by Grant-in-Aid for Fig. 5. Prototype of Unity version A-txt. scientific research from the Ministry of Education, Culture, Sports, Science and Technology of Japan (No. 18K02884). V. CONCLUSIONS AND FUTURE PLANS The authors are grateful to this grant. In this time, our research team has developed the Android version of the A-txt system. By using a general hardware, it REFERENCES has been shown that the Android version is at a practical level as well as the iOS version. Also, benchmark test has been [1] Kanematsu, Hideyuki; Barry, Dana M. STEM and ICT in intelligent performed to formulate a terminal index for A-txt to operate environments. Switzerland: Springer; 2016. comfortably. However, in the Android version, there are some problems with AR and 3D indication. As a future [2] Kuniaki Yajima and Soru Takahashi. Development of Evaluation development policy to solve those problems, the authors have System of Active Learning Students, Science Direct Procedia concluded the transition to the Unity version as application Computer Science 2017;112:1388-1395. development not based on OS. Finally, we have succeeded in developing a prototype of Unity version A-txt and confirmed [3] Shin-nosuke Suzuki, Yutaro Akimoto, Manabu Ishihara and Yukio the operation independent of the mobile terminal OS. Kobayashi. Basic Development of the Active Textbook System consisted of a General book and a Portable Electronic Terminal, Science Direct Procedia Computer Science 2017;112:109-116. [4] Shin-nosuke Suzuki, Yutaro Akimoto, Yasuhiro Kobayashi, Manabu Ishihara, Ryohei Kameyama, Masaya Yamaguchi. A Proposal of Method to Make Active Learning from Class to Self-Study using Active Note Taking and Active Textbook System, Science Direct Procedia Computer Science 2018;126: 957-966. [5] Augmented Reality in Education, https://www.apple.com/education/docs/ar-in-edu-lesson-ideas.pdf [6] Android v iOS market share 2019, https://deviceatlas.com/blog/android-v-ios-market-share. [7] ARToolKit6, http://www.arreverie.com/blogs/tag/artoolkit6/. [8] Antutu benchmark, http://www.antutu.com/. [9] 3D Mark, https://benchmarks.ul.com/. [10] SceneKit, https://developer.apple.com/documentation/scenekit. [11] AV Foundation, https://developer.apple.com/av-foundation/. [12] MediaCodec, https://developer.android.com/reference/android/media/MediaCodec. [13] ARToolKitX, http://www.artoolkitx.org/. [14] ARCore, https://developers.google.com/ar/?hl=ja. [15] ARKit, https://developer.apple.com/jp/arkit/. [16] Unity, https://unity.com/ja. [17] Yaxin Liang, Digital Circuit Learning System Based on Unity3D, Open Journal of Social Sciences 2018; 6: 333-343. 207

2019 4th International Conference on Information Technology (InCIT2019) Detection and Classification of Network Events in LAN Using CNN Yuwei Sun Hiroshi Esaki Hideya Ochiai Graduate School of Information Graduate School of Information Graduate School of Information Science and Technology Science and Technology Science and Technology The University of Tokyo The University of Tokyo The University of Tokyo Tokyo, Japan Tokyo, Japan Tokyo, Japan [email protected] [email protected] [email protected] Abstract—Security of local area network (LAN) attracts 56 56 many attentions in recent years. A malware delivered, for example by phishing e-mails, intrudes and expands into the 56 .0 56 other hosts in the LAN easily these days. Previous works have taken machine learning approach to detect anomaly caused by .4 malware's behavior. However, there are still not so many researches that try to clarify the reasons of such anomaly, i.e., 85 1 5 8 5 5 458 explanation of the anomaly causes. In this research, we propose a method using convolutional neural network (CNN) with a Fig. 1. We selected eight protocols and set different protocols at regions in refined normalization function and learning function to detect one single image, with a width and height of 48 pixels. and also classify different events happened in the LAN. We use Hilbert Curve, array exchange, and projection to generate This kind of activities is a network event. In this paper, we feature maps to represent protocol information of events within focus on ARP scan, TCP scan, UDP scan, and those scans to a predetermined time span. We have tested our scheme on three specific ports as the types of network events, which we target different active network cases. We have obtained an average to classify by our scheme. recall rate of 76% for detecting and classifying 8 types of events categorized as normal, arp scan, tcp scan, scan of tcp port 23, We put the protocol information of traffic data in the LAN scan of tcp port 80, udp scan, scan of udp port 137, scan of udp using Hilbert Curve (Fig. 1) to generate feature maps port 1900 for those active networks. representing different types of events, which we will discuss in detail in the section 3. We use CNN, a kind of machine Keywords—supervised learning, CNN, LAN, cyber security, learning method, usually used in tasks of object detection and visualization classification, for recognizing the network events happened in the LAN. I. INTRODUCTION After that, we can use the trained model to do inference, Recently, the problem of cyber security has been debated which is a method that constantly monitors and safeguards the by not only experts but also users of network. Especially, in network. Our model will detect and classify the types of the local area network (LAN), the attack known as phishing, attacks at the same time. In this way, we can provide some delivering malware and spread to social media, messaging evidences of the occurrence of attacking. Our research is services and applications. Attack like this can affect every focusing on the method of visualizing features of traffic data aspects of targets’ personal and working lives. Malware and training CNN model using these feature maps. invasion is also frequent. As a result, it is necessary to detect the anomaly in LAN and above that, identifying the type of This paper is organized as follows. Section 2 discusses attacking, which contributes to solving of the security problem. related works about anomaly detection from traffic data in the LAN. 
Section 3 provides an overview of our algorithm With the development of machine learning technology, it including the visualization of time related features of traffic is considerable that using machine learning to try to deal with data and training of the CNN model. Section 4 presents the the problem we mentioned above. As we all know, machine performance evaluation of the algorithm for events learning is used in the previous researches to identify the classification using a standard named recall. Section 5, we anomaly in the LAN, which tells you the network state is discuss the result of evaluation in different environments. normal or abnormal. The previous researches with machine Section 6, we conclude the paper and give out the possibility learning usually focus on the detecting of anomaly instead of of a broader practical use of our scheme. showing the detail of the attack itself, such as the occurrence of TCP scan or ARP scan. II. RELATED WORK In this paper, we propose a scheme that can learn and Anomaly detection in computer networks attracts many classify network events happening in the LAN using attentions, with more than 40 years of evolution [1]. With the convolutional neural network (CNN). We propose a method rapid growth and increasing complexity of network of creating an image from network packet dump of a certain infrastructures, and the evolution of attacks, identifying and duration (Fig. 1). By putting this image through CNN, it can preventing networks attacks are getting more and more learn the features of the network without hardcoding type of challenging. Considering the remanding robust attribute to the network events. When a malware intrudes into a LAN, and different network environments, in a wider perspective, many try to expand into (or steal some data from) the other hosts in the LAN, it tries to access some specific TCP or UDP ports of all the hosts. 208

2019 4th International Conference on Information Technology (InCIT2019) researchers have considered a semi-supervised approach, recorded in bit format can be expressed and represented as where they train the classifier with “normal” traffic data only, high-dimensional feature information in images. so that knowledge about anomalous events will be constructed and thus be detected. For example, Ugo et al. [2] described a A. Visualization of Time Related Features of Traffic Data usage of Restricted Boltzmann Machines (RBMs) to combine the expressive power of generative models with good In expressing the features of traffic data, we first introduce classification accuracy capabilities. Asmaa et al. [3] presented the concept of \"fineness\" here. Fineness is a parameter to show a comprehensive discussion of using RBM model for feature how finely we should consider the information hidden in big learning and the classifier model for anomaly detection. data. And the time span we use to record the traffic data to Another research by Hsu and Lin [4], they used support vector generate each feature map is defined as in (1). machine (SVM) models, showing the ability of the proposed methodology to detect DDoS attacks in an efficient and T = Tst × fineness× (,0-.1/)2 (1) accurate way. Moreover, in the research of Mehdi et al. [5], they used a class of advanced machine learning techniques, Here, Tst (time standard) is a standard time span for namely Deep Learning (DL), to facilitate the analytics and recording, which is defined as 60s in this research. St is a learning in the IoT domain. Kazumasa et al. [6] used the parameter of the basic recording segment, showing the approach of machine learning to define a feature vector for standard size of a feature map. And we use 8 pixels here (thus detecting Command and Control (C&C) server in botnet using the size of the image is 24 × 24). What’s more, the parameter network traffic data. of size (here, exempted values are 8, 16, 32, 64, 128) is the actual one we use to generate the feature maps. And fineness However, the introduce of machine learning is almost (here, exempted values are 5.0, 1.0, 0.5, 0.2, 0.1, 0.05, 0.02, aimed to detect the anomaly which is different from the 0.001) is mentioned above. purpose of our research. We proposed an approach to classify different types of events in the LAN. Different from only As described above, it is possible to use the parameters labeling normal data or both normal data and abnormal ones (fineness and size) to bring features of traffic data, even which is demonstrated in previous work using an approach of different in time span and fineness into images with the same machine learning, we labeled different types of events to train size. the CNN model, building a classifier in a dynamic way. Furthermore, researchers have demonstrated that it is possible Furthermore, it is necessary to comprehensively consider to apply CNN not only to image classification tasks but also the adaptability to the model of deep learning how to express to signal classification tasks. For instance, Tatsuya Harada and the time-related features of the whole event using feature maps Yuji Tokozume [7] discussed an approach using CNN to from each protocol. Hilbert Curve is a method used to classify different types of time-series environmental sounds. transform the structure of data so that it fills up all space in an In our research, we focus on representing the protocol image. 
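A minimal sketch of the preprocessing described above is shown below, assuming a capture file recorded with tcpdump (for example, tcpdump -i eth0 -w lan.pcap). It uses the dpkt library to count packets per protocol for every one-second bin and then packs one recording span into a small per-protocol image. The protocol set, file handling and image size are illustrative, and the Hilbert-curve ordering used in the paper (Fig. 1) is replaced by a plain row-major reshape for brevity; swapping in a Hilbert ordering would be a drop-in change.

```python
# Sketch of per-second, per-protocol packet counting and feature-map packing.
# Assumes a pcap file recorded with tcpdump; layout details are illustrative.
from collections import defaultdict

import dpkt
import numpy as np

PROTOCOLS = ("arp", "tcp", "udp", "icmp", "other")

def per_second_counts(pcap_path):
    """Return {second -> {protocol -> packet count}} for one capture file."""
    counts = defaultdict(lambda: defaultdict(int))
    with open(pcap_path, "rb") as f:
        for ts, buf in dpkt.pcap.Reader(f):
            eth = dpkt.ethernet.Ethernet(buf)
            if isinstance(eth.data, dpkt.arp.ARP):
                proto = "arp"
            elif isinstance(eth.data, dpkt.ip.IP):
                payload = eth.data.data
                if isinstance(payload, dpkt.tcp.TCP):
                    proto = "tcp"
                elif isinstance(payload, dpkt.udp.UDP):
                    proto = "udp"
                elif isinstance(payload, dpkt.icmp.ICMP):
                    proto = "icmp"
                else:
                    proto = "other"
            else:
                proto = "other"
            counts[int(ts)][proto] += 1
    return counts

def to_feature_map(counts, start_sec, side=16):
    """Pack side*side seconds of counts into one (protocol, side, side) array,
    one image cell per second. The paper orders the cells along a Hilbert
    curve inside each protocol region; row-major order is used here instead."""
    maps = np.zeros((len(PROTOCOLS), side, side), dtype=np.float32)
    for i, proto in enumerate(PROTOCOLS):
        series = [counts[start_sec + t][proto] for t in range(side * side)]
        maps[i] = np.asarray(series, dtype=np.float32).reshape(side, side)
    return maps
```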
As such, it is considered that traffic information in information of traffic data in the LAN by 2-D images data, a LAN can be represented by a single image on the premise that dataset built to train the CNN model. temporal characteristics will not be lost when using Hilbert Curve. Here, as a method of expressing time-related features III. NETWORK EVENTS CLASSIFICATION WITH CNN of an event in LAN, we put features of nine types of protocols which are captured in a predetermined time span into an image Traffic data in the LAN can be recorded through a tool (Fig. 1). named tcpdump, which allows a user to record TCP, IP, UDP and other packets being transmitted or received in That means, statistical information of nine types of a network and also which computer is attached by the packets. protocols collected in the LAN can be represented in different After that we use a library named dpkt to analyze the big data regions of an image (48 × 48) through array exchange and to understand which protocol is being used in the projection. communication. In this research, we analyze the traffic data and calculate the number of how many packets are transmitted B. Training CNN Model or received for each protocol within one second to generate a feature map representing the occurrence of the specific events. When it comes to time related data, two machine learning Then, in order to detect an event from the traffic data captured methods are mainly used. Recurrent neural networks (RNNs) in the LAN, a CNN model was introduced and trained using and convolutional neural networks (CNNs). For recurrent these feature maps. After that, trained model will be used as neural networks, the data along the time axis is input in order an initialization of the detection system, preparing an of time. On the other hand, in the case of a convolutional environment for inference. Then, the monitor server will network, it has a character of movement invariance with capture traffic data in the LAN and start data mining with a respect to the input of time related data, exhibiting good predetermined time span. At the analyzing step, as described performance. In this research, we use CNNs. The above, the communication information within a configuration of the convolutional network is considered to predetermined time span (every 128 seconds) of each protocol mainly include convolutional layers, some pooling layers and is statistically recorded by dpkt. Then, the density data of the fully-connected layers. Through using a kernel (also called a recorded time protocol will be converted to a time series filter) in each convolutional layer and the pooling layer in feature map using Hilbert Curve. Here, the generated feature some of the layers, we can compress the information of the maps are stored on the server simultaneously with the input data. It is thought that the information contained in the generation. Using the generated maps, the trained CNN model image can be expressed by combining the layers. Finally, in will compute the probability of the occurrence of each event. order to get the outcome as one-dimensional information, The one with the highest possibility is considered occurring in the LAN. It is understood that communication information 209

Fig. 2. A CNN model we used in our research: different from the former model, we used normalization before each convolutional layer to transfer the input data into a distribution of standard deviation. And we used a learning function named RMSProp, in order to attain sustainability of training and a higher accuracy.

some fully-connected layers are combined in the convolutional neural network, and the output can be narrowed to a specific range. In this way, a total of 2,340 data samples collected in the previous steps are used as the input, and the types of events to be detected are used as the labels. The CNN model considered here [8] is shown above (Fig. 2).

Considering the amount of data that can be collected at this stage and the complexity of the problem as a learning task, we designed the capacity and configuration of the model. The entire model is a four-layer CNN consisting of two convolutional layers and two fully-connected layers. The error function is the cross-entropy error (2), where K is the number of nodes in the output layer, k is the index of a node in the output layer, y_k is the output value, and t_k is the correct value:

    L = − Σ_k t_k log y_k                                  (2)

The parameters are then updated by the learning function using this loss. As the learning function we choose RMSProp, (3) and (4), which has the following characteristic: more emphasis is placed on the latest gradient information than on past gradient information, so the past gradients are gradually forgotten while the newest gradient is strongly reflected. Here L is the loss function defined above, W is the weight, ε is a small constant that prevents division by zero, and ρ is the decay rate, for which we take a value of 0.9:

    h_t = ρ h_{t−1} + (1 − ρ) (∂L/∂W) ⊙ (∂L/∂W)            (3)

    W_{t+1} = W_t − η (1/√(h_t + ε)) ⊙ (∂L/∂W)             (4)

In addition, more stable learning can be obtained by making the parameter values at initialization disperse according to a distribution whose standard deviation β is given by (5) [8], where n_1 is the number of nodes in the previous layer:

    β = 1/√(n_1)                                           (5)

After training the model in this way, the trained model is stored in the monitoring system (in this case, the personal computer on which the monitoring system is installed); after that, using the trained model, the system can analyze new traffic data and detect anomalies in the LAN.

IV. EVALUATION

Recall is an evaluation measure that represents how many of the true positives are recovered. For network event classification, we calculate the recall rate of all eight events: normal, the default scans, and the scans of specific ports of both UDP and TCP. Through recall we can see how many times each event is classified correctly, and the average recall rate can be taken as the overall performance of the scheme.

A. Experimental Approach

For evaluating our algorithm, we prepared three different network environments (IP A, IP B, IP C). Network A is a relatively small-scale LAN used for personal daily work. Network B is the LAN of our laboratory, connected to many personal devices and used both for research and for daily work. Network C is a large-scale network of the department building, consisting of several servers and IoT (Internet of Things) devices.

We then visualized and analyzed the communication information in these three different LANs. The constitution of the LAN in our experiment includes two terminals and a router. Specifically, one terminal is connected to the LAN as the monitoring system, and the other one performs the operations that are recognized as attacks, as follows:

• scan (arp): nmap [Network]
• scan (tcp): nmap -sT [Network]
• scan (tcp port 23): nmap -sT -p 23 [Network]
• scan (tcp port 80): nmap -sT -p 80 [Network]
• scan (udp): nmap -sU [Network]
• scan (udp port 137): nmap -sU -p 137 [Network]
• scan (udp port 1900): nmap -sU -p 1900 [Network]

Here, we captured the packet information in the LAN while each type of event occurred and saved the tcpdump files on the terminal. In this research, we ran a total of seven different kinds of commands to simulate different events in the LAN. Each command was run 5 times with a capture interval of one day. After running all commands, we obtained 5 capture files for each event as the dataset.

B. Feature Map

Feature information of each protocol is put into a single image as mentioned above. That is, the feature information of nine types of protocols regarding traffic data within a specific time span is included in one image (Fig. 3).

Fig. 3. Line 1, from left to right: feature maps of the events normal, default scan, default TCP scan, and specific scan of TCP port 23; Line 2, from left to right: feature maps of the specific scan of TCP port 80, default UDP scan, specific scan of UDP port 137, and specific scan of UDP port 1900 (when fineness is 0.5 and size is 16).

The statistical graphs and feature maps obtained by adjusting the fineness discussed in the previous step show that when the size is 16 and the fineness is 0.5, the feature representation of traffic data seems to be the best, considering the computation cost and the clarity of features. Then, multiplied by the standard value of 64.000 s, we can get a feature map (48 × 48) representing the time-sequential information of nine types of protocols recorded over 128.0 s. This is considered the most suitable representation of the features of traffic data when an event occurs.

By using the CNN, we can also obtain a set of compressed feature maps, which can be used to represent different relations between the protocols (Fig. 4).

Fig. 4. Feature maps generated at the last layer of the CNN model, namely the hidden4 layer shown in Figure 2, for each event: (a) normal, (b) default TCP scan, (c) TCP port 23 scan, (d) default scan, (e) TCP port 80 scan, (f) UDP port 137 scan, (g) UDP port 1900 scan, (h) default UDP scan.

C. Event Classification Accuracy

We use all three experimental environments and discuss the detection accuracy of the model. As the environment, we used a personal computer placed in a separate space on our campus to simulate the actual operating situation. Two terminals connected to the same LAN are prepared: one runs a scan or a scan of a targeted port, and the other runs tcpdump simultaneously in the same LAN as a monitoring server, a mechanism built to record the traffic data of the LAN. The influence of other facilities or other users in the LAN on the recorded data must therefore be taken into account.

We train this CNN model using a learning rate of 0.001. The training graphs of the model in the different experimental environments are shown below (Fig. 5). Also, the recall of each event for the trained model inferring in each network environment is shown below (TABLE I).

V. DISCUSSION

In this paper, we analyzed the traffic data in the LAN and realized the visualization of its features by bringing the time-related protocol information into feature maps. The parameters of adaptive fineness and size were adopted after discussing the effect of different values on the feature representation of traffic data. After that, a CNN model, a kind of deep learning method, was introduced and constructed according

2019 4th International Conference on Information Technology (InCIT2019) TABLE I. EVALUATION OF THE ALGORITHM UNDER DIFFERENT NETWORK ENVIRONMENTS Event Network A Network B Network C Train Data Test Data Recall Train Data Test Data Recall Train Data Test Data Recall normal scan (default) 280 44 0.787 600 123 0.549 1479 202 0.786 scan (tcp) 215 30 0.998 1200 101 0.995 560 56 0.714 scan (tcp port 23) 220 30 0.985 550 138 0.905 1651 300 1.00 scan (tcp port 80) 215 50 0.718 1098 250 0.590 1693 362 0.976 scan (udp) 430 100 0.665 1098 236 0.702 1190 363 0.979 scan (udp port 137) 395 100 0.575 701 114 0.649 1015 226 0.969 scan (udp port 1900) 425 100 0.988 1200 258 0.883 1190 291 0.624 425 100 0.871 1200 250 0.835 1231 360 0.477 Fig. 5. From left to right: Training graphs in Network A, Network B, Network C environments. From the graph we can find the difference between train loss and test loss is not so large to contribute to the occurring of the over-fitting, which shows a strong flexibility of the model. to the characteristics of the task. We discussed and compared Technology Laboratory, National Institute of Standards and the inference accuracy of the algorithm using recall in three Technology, 1980. different active network environments, contributing to an average recall rate of 76%. [2] U. Fiore, et al., Network anomaly detection with the restricted Boltzmann machine, Neurocomputing, 2013. VI. CONCLUSION [3] Hsu, C.-W., Lin, C.-J., A comparison of methods for multiclass support In the level of practical use, we aim to realize a highly vector machines. IEEE Trans. Neural Netw. 13 (2), 415–425., 2002. versatile monitoring system regardless of the maintenance status of the network environment in the LAN. In recent years, [4] Mehdi Mohammadi, Ala Al-Fuqaha, Sameh Sorour, Mohsen Guizani, malware incidents frequently occur in Asian countries such as Deep Learning for IoT Big Data and Streaming Analytics: A Survey, Thailand, Vietnam, and China. It is obvious that the IEEE COMMUNICATIONS SURVEYS & TUTORIALS, 2018. introduction of a highly versatile monitoring system is required, including the cause of imperfections in the network [5] Asmaa Elsaeidy, Kumudu S. Munasinghe, Dharmendra Sharma, Abbas environment. The security issues of such networks are likely Jamalipour, Intrusion detection in smart cities using Restricted to be mitigated. Boltzmann Machines, Journal of Network and Computer Applications 135 76–83, 2019. Besides, as deep learning is used to deal with the task of network security, this method to safeguard the network is [6] Kazumasa Yamauchi, Junpei Kawamoto, Yoshiaki Hori, Kouichi considerable in the field of the network security. Furthermore, Sakurai, Evaluation of Machine Learning Techniques for C&C Traffic with the development of technology, it seems that the control Classification, Information Processing Society of Japan Vol.56 No.9 of technology will also be very significant while researchers 1745–1753, 2015. incorporating AI technology. [7] Yuji Tokozume, Tatsuya Harada, LEARNING ENVIRONMENTAL In future, not only inferences will be made using trained SOUNDS WITH END-TO-END CONVOLUTIONAL NEURAL models, but also the design of models that can be changeable NETWORK, IEEE International Conference on Acoustics, Speech and in response to environmental situation in the network is Signal Processing, 2017. considerable. 
It is possible to improve the expressive facility of features by further devising which protocols should be [8] Yann LECun, Patrick Haffner, Leon Bottou, and Yoshua Bengio, chosen according to the attribute of the network existing in the Object Recognition with Gradient-Based Learning, Shape, Contour and actual network. Also, in order to reduce the burden of Grouping in Computer Vision, p.319, 1999. monitoring on the administrator, other than the type of the event occurring will be announced, it is possible that system [9] lan Goodfellow, Yoshua Bengio, Aaron Courvile, Deep Learning can automatically generate a log and at the same time narrow (Adaptive Computation and Machine Learning), Francis Bach, The down the collected traffic data to the targeted part where the MIT Press, 2016. event occurred. [10] Eric Krokos, Alexander Rowden, Kirsten Whitley, and Amitabh REFERENCES Warshney, “Visual Analytics for Root DNS Data” IEEE, 2018. [1] J P. Anderson, Computer Security Threat Monitoring and Surveillance, [11] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet Tech- nical Report, Computer Security Division of the Information Classification with Deep Convolutional Neural Networks, NIPS'12 Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, Pages 1097-1105, 2012. [12] Y.LeCun, K.Kavukcuoglu, and C.Farabet. Convolutional networks and applications in vision. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on. IEEE, 2010. [13] Y.LeCun, B.Boser, J.S.Denker, D.Henderson, R.E.Howard, W.Hubba rd and L.D.Jackel.Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, p.541-551,1989. [14] Anna L. Buczak, Member, IEEE, and Erhan Guven, A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection, IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 18, NO. 2, SECOND QUARTER 2016. 212
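Referring back to Sections III-B and IV of this paper, the four-layer model and the per-event recall metric can be sketched as follows. This is a hedged illustration using TensorFlow/Keras and scikit-learn rather than the authors' own implementation: the paper fixes only two convolutional layers, two fully-connected layers, normalization before each convolution, the cross-entropy loss, and RMSProp with decay 0.9 and learning rate 0.001, so the filter counts, kernel sizes and pooling below are assumptions.

```python
# Sketch of the described CNN and recall evaluation; layer widths are assumed.
import numpy as np
import tensorflow as tf
from sklearn.metrics import recall_score

NUM_EVENTS = 8            # normal + seven scan types
INPUT_SHAPE = (48, 48, 1)  # one feature-map image per sample

model = tf.keras.Sequential([
    tf.keras.layers.BatchNormalization(input_shape=INPUT_SHAPE),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_EVENTS, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9),
    loss="categorical_crossentropy",   # cross-entropy error, Eq. (2)
    metrics=["accuracy"],
)

def per_event_recall(y_true, y_prob):
    """Recall of each of the eight events, the quantity reported per class."""
    y_pred = np.argmax(y_prob, axis=1)
    return recall_score(y_true, y_pred, average=None,
                        labels=list(range(NUM_EVENTS)))
```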

2019 4th International Conference on Information Technology (InCIT2019) Data Augmentation Based on Color Features for Limited Training Texture Classification Huu-Thanh Duong Vinh Truong Hoang Faculty of Computer Science Faculty of Computer Science Ho Chi Minh City Open University Ho Chi Minh City Open University Ho Chi Minh, Vietnam Ho Chi Minh, Vietnam [email protected] [email protected] Abstract—Image classification problem requires a large images based on neural style transfer algorithm to augment number of labeled training images to obtain a good gener- the original image datasets [3]. Generative Adversarial Net- alization. However, it is very time consuming and expensive works (GAN) has been proposed to generate new images to to label data. The color information has been demonstrated improve the classification performance. Several works apply that it contains relevant information for characterizing texture GAN to generate samples in different applications [4], [5], image. This paper presents a novel data augmentation method [6]. Schliep et al. [7] propose to augment data based on dedicated to color texture classification. We firstly use color Markov chain Monte Carlo and parameter expansion. Wang features extracted from different color spaces to augment data et al. [8] extract the triple local features based on Gabor in case of one sample training per class. The experimental filters on different scale which exploit the distinctiveness results is very promising since it significantly improves the and similarity of facial images. Deng et al. [9] introduce classification performance on four benchmark color texture and and a modify two-layer CapsNet networks for hyper datasets. spectral imaging classification using very limited training images. Keywords—data augmentation, color features, texture classi- fication, LBP Texture analysis approaches were essentially designed for working with grayscale images. On the other hand, it I. INTRODUCTION has been demonstrated that color information is relevant to characterize the texture [10]. Color and texture are Texture classification refers to the task of assigning a two naturally related properties of the image, but these given texture to one of predefined texture categories. It is a characteristics are often analyzed separately. Different color fundamental issue of texture analysis, playing a major role texture classification approaches and color space have been in various applications such as biomedical image analysis, proposed in the literature [11]. Therefore, it is very difficult industrial inspection, biometrics. Texture classification has to determine a suitable color space for a specific application. become a challenging topic in computer vision because the Many authors propose to use a single space or multi color real world images often exhibit a high degree of complexity, spaces for texture analysis application [12]. In this work, we randomness and irregularity. In computer vision applica- present a novel method of data augmentation by using color tions, it seems to be impossible to obtain enough labeled features coded in several color spaces, aiming at solving the data to cover all object classes, especially when there is limited labeled samples training problems. The following of tens of thousands of categories. this paper is structured as follows. Section II introduces the Opponent Color Local Binary Patterns. Next the proposed Data augmentation is a task to deal with limited labeled approach is presented in section III. 
The experimental training samples in order to achieve a higher performance results and conclusion are given in section IV and V. or better generalization. For example, in medical imaging, the available data is insufficient due to the labeling data is II. OPPONENT COLOR LOCAL BINARY PATTERNS time-consuming and expensive. Recently, the deep learning approach requires large volumes of training data [1]. More- In the past decade, a various color descriptor have been over, image classification with single sample image per class proposed to solve different practical problems in computer is a real scenarios and very challenging application including vision and pattern recognition. Among of them, the Local surveillance, identity card since it lack reliable data for Binary Pattern (LBP) operator is one of the most successful training. Usually, there are a various approaches to create descriptor to characterize texture images due to its ease new image from original training data by applying various of implementation and low computational complexity [13]. transformations including random flipping or distorting the The definition of the original LBP operator has then been input image, adding Gaussian noise, or cropping a patch generalized to explore intensity values of points on a circular from a random position. Fawzi et al. [2] propose an adaptive neighborhood. The circular neighborhood is defined by con- algorithm via trust-region optimization strategy for choosing sidering the values of radius R and P neighbors around the the appropriate transformations in order to generate the central pixel. Mathematically, the LBPP,R code is computed new samples images. Zheng et al. use eight different style 213

2019 4th International Conference on Information Technology (InCIT2019) by comparing the gray value gc of the central pixel with the We propose to augment the training data in the context of gray values {gi}iP=−01 of its P neighbors, as follows: limited training sample by using other color spaces. Figure 1 illustrates an image of flower that has been coded in the P −1 (1) RGB, HSV and ISH color spaces. Under the human vision system, the RGB space reflects the flower of the real world LBPP,R = s (gi − gc) × 2i better than the others. We suppose that only one texture image per class in RGB color space is available, other color i=0 space allows to increase the number of training images. In this case, color information that representing texture where the threshold function s (t) is defined as: virtually enlarge the training size from one to three images. 1 if t ≥ 0 (2) IV. EXPERIMENTAL RESULTS s (t) = 0 otherwise A. Data preparation The original LBP computation is firstly computed on grayscale images. Opponent Color LBP (OCLBP) was in- troduced by Maenpaa et al. [14] which aims to extend LBP to color. It has been applied in various application of color texture classification [12]. In order to characterize color texture, OCLBP, the LBP operator is applied on each pixel and for each pair of components (Ck, Ck), k, k ∈ {1, 2, 3}. In this definition, opposing pairs such as (C1, C2) and (C2, C1) are considering to be highly redundant, and so, one of each pair is used in the analysis. In this work Extend Opponent Color LBP (EOCLBP) is considered to represent color texture image. RGB (c) USPTex (d) STex LBP Features HSV LBP Features (a) New BarkTex (b) OuTex-TC-00013 ISH Fig. 2: Example images from four texture datasets (a) New LBP BarkTex, (b) OuTex-TC-00013, (c) USPTex, (d) STex. Each image represent each class. Features TABLE I: Summary of image databases used in experiment. Fig. 1: A texture image in RGB space is augmented which coded in HSV and ISH color spaces. LBP features are Dataset name Image size # class # training # test Total extracted from each image to increase the training size for New BarkTex 64×64 6 816 816 1632 classifier input. Outex-TC-00013 128×128 68 680 680 1360 USPTex 128×128 191 1146 1146 2292 III. PROPOSED APPROACH STex 128×128 476 3808 3808 7616 Color is the perceptual result of light in the visible region The proposed method is evaluated on four benchmark of the electromagnetic spectrum. The human retina has color texture datasets such as New BarkTex [16], Outex-TC- three types of color photoreceptor cells, which respond to 00013 [17], USPTex [18] and STex (see several examples incident radiation with somewhat different spectral response in figure 2). The test suites of these dataset can be found1. curves [15]. Because there are exactly three types of color photoreceptor, three numerical components are necessary 1 https://www-lisic.univ-littoral.fr/∼porebski/Recherche.html and theoretically sufficient to represent a color. A color image can thus be converted in different color spaces by a linear transformation. 214
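A small sketch of the augmentation idea discussed above is given below: one labelled RGB texture image is re-coded in another colour space, and LBP histograms are extracted from every channel of each version, giving additional training vectors for the same class. It is a simplified illustration rather than the authors' pipeline: only the HSV conversion is shown (scikit-image provides no built-in ISH transform), and the single-channel LBP of Eq. (1) is used instead of the full opponent-colour EOCLBP.

```python
# Sketch: augment one training image by colour-space re-coding and extract
# per-channel LBP histograms. ISH and opponent-colour pairs are omitted.
import numpy as np
from skimage.color import rgb2hsv
from skimage.feature import local_binary_pattern

P, R = 8, 1   # neighbours and radius of the LBP_{P,R} operator

def lbp_histogram(channel, bins=256):
    """Normalized histogram of LBP codes for a single channel."""
    codes = local_binary_pattern(channel, P, R, method="default")
    hist, _ = np.histogram(codes, bins=bins, range=(0, bins), density=True)
    return hist

def augmented_features(rgb_image):
    """Feature vectors of the original image and of its HSV re-coding,
    all sharing the same class label."""
    versions = {"rgb": rgb_image, "hsv": rgb2hsv(rgb_image)}
    features = []
    for _, img in versions.items():
        hists = [lbp_histogram(img[:, :, c]) for c in range(3)]
        features.append(np.concatenate(hists))
    return features
```

In this sketch each colour-space version contributes one concatenated per-channel histogram, so a single labelled image yields several training vectors, which is the virtual enlargement of the training set that the section describes.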

2019 4th International Conference on Information Technology (InCIT2019) TABLE II: Classification results (accuracy) obtained by different color spaces according to with and without data augmentation techniques on the four color texture datasets. The value in bold indicates the best results obtained. New BarkTex Outex-TC-00013 USPTex STex Color space Augmentation Without Augmentation Without Augmentation Without Augmentation Without RGB HSV 44.70 43.46 76.06 76.45 70.08 70.05 75.01 65.90 ISH 43.66 33.70 70.80 69.08 67.84 61.76 75.18 61.68 Average 44.06 32.61 70.82 71.22 66.35 61.40 76.31 60.11 44.14 36.59 72.56 72.25 68.09 64.40 75.50 62.56 They are divided into training set and testing set by holdout ACKNOWLEDGMENT method (as shown in table I) in a supervised learning con- text. In order to evaluate our proposed approach, one image This work was supported by Ho Chi Minh City Open of each class of the predefined training set is randomly used University under Grant No E.2019.06.1. for data augmentation. It is applied by coding this image into HSV and ISH color spaces to obtain two more images REFERENCES of each class. LBP features of these three images are then extracted and used as input of 1-NN classifier associated [1] A. Antoniou, A. Storkey, and H. Edwards, “Augmenting Image Clas- with L1-distance. The estimated accuracy is applied on the sifiers Using Data Augmentation Generative Adversarial Networks,” testing set that indicates the number of images is classified in Artificial Neural Networks and Machine Learning – ICANN correctly. The final result was averaged over 100 random 2018, V. Ku˚rkova´, Y. Manolopoulos, B. Hammer, L. Iliadis, and selections one sample image per class. I. Maglogiannis, Eds. Cham: Springer International Publishing, 2018, vol. 11141, pp. 594–603. B. Results [2] A. Fawzi, H. Samulowitz, D. Turaga, and P. Frossard, “Adaptive Table II presents the experimental results obtained on data augmentation for image classification,” in 2016 IEEE the four benchmark color dataset. The first column indicates International Conference on Image Processing (ICIP). Phoenix, the first color space used for single original image. For each AZ, USA: IEEE, Sep. 2016, pp. 3688–3692. [Online]. Available: dataset, two sub-columns present the results obtained with http://ieeexplore.ieee.org/document/7533048/ data augmentation and the ones using only one sample image per training. The last row shows the average results on the [3] X. Zheng, T. Chalasani, K. Ghosal, S. Lutz, and A. Smolic, “STaDA: three color spaces considered. Using the single original im- Style Transfer as Data Augmentation,” in Proceedings of the 14th age in the RGB color space on the New BarkTex, the result International Joint Conference on Computer Vision, Imaging and obtained on the testing set in the same color space is 43.46%. Computer Graphics Theory and Applications. Prague, Czech The data augmentation technique by incorporating two more Republic: SCITEPRESS - Science and Technology Publications, color spaces, the result increases to 44.70%. Similarly, we 2019, pp. 107–114. obtain the same results on USPTex and STex dataset by increasing samples for each color space. Thus, the average [4] H. Shi, L. Wang, G. Ding, F. Yang, and X. 
Li, “Data results shows that this simple method allow to improve the Augmentation with Improved Generative Adversarial Networks,” accuracy rate +7.55%, +0.31%, +3.69%, +12.94% for New in 2018 24th International Conference on Pattern Recognition BarkTex, OuTex-TC-00013, USPTex, STex respectively. The (ICPR). Beijing: IEEE, Aug. 2018, pp. 73–78. [Online]. Available: generated images by transforming into another color space https://ieeexplore.ieee.org/document/8545894/ always keep the same texture or object. That is the reason it clearly enhances the performance comparing with other [5] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, augmentation technique (i.e flip or crop). “Synthetic data augmentation using gan for improved liver lesion classification,” 2018 IEEE 15th International Symposium on Biomed- V. CONCLUSION ical Imaging (ISBI 2018), pp. 289–293, 2018. We introduced a simple but effective data augmentation [6] D. Nie, R. Trullo, J. Lian, C. Petitjean, S. Ruan, Q. Wang, and technique for color texture classification based on color D. Shen, “Medical image synthesis with context-aware generative information. We virtually create a new texture image by adversarial networks,” in International Conference on Medical Image coding in other color space. This method is evaluated on Computing and Computer-Assisted Intervention. Springer, 2017, pp. four color texture dataset in the context of one sample 417–425. image training per class. The experimental results shows that this approach significantly improve the classification [7] E. M. Schliep and J. A. Hoeting, “Data augmentation performance. How to improve the feature extraction and and parameter expansion for independent or spatially obtain the robust descriptor based on color space is the future correlated ordinal data,” Computational Statistics & Data of this work. Analysis, vol. 90, pp. 1–14, Oct. 2015. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0167947315000973 [8] X. Wang, B. Zhang, M. Yang, K. Ke, and W. Zheng, “Robust joint representation with triple local feature for face recognition with single sample per Person,” Knowledge-Based Systems, p. S095070511930245X, May 2019. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S095070511930245X [9] F. Deng, S. Pu, X. Chen, Y. Shi, T. Yuan, and S. Pu, “Hyperspectral Image Classification with Capsule Network Using Limited Training Samples,” Sensors, vol. 18, no. 9, p. 3153, Sep. 2018. [Online]. Available: http://www.mdpi.com/1424-8220/18/9/3153 [10] F. S. Khan, R. M. Anwer, J. D. Weijer, M. Felsberg, and J. Laakso- nen, “Compact color–texture description for texture classification,” Pattern Recognition Letters, vol. 51, no. 0, pp. 16 – 22, 2015. 215

2019 4th International Conference on Information Technology (InCIT2019) [11] C. Ferna´ndez-Maloine, F. Robert-Inacio, and L. Macaire, Eds., Numerical color imaging. London, UK : Hoboken, NJ: ISTE Ltd. ; John Wiley & Sons, 2012. [12] V. Truong Hoang, “Multi color space lbp-based feature selection for texture classification,” PhD thesis, Universite´ du Littoral Coˆte d’Opale, Feb. 2018. [Online]. Available: https://tel.archives- ouvertes.fr/tel-01756931 [13] L. Liu, P. Fieguth, Y. Guo, X. Wang, and M. Pietika¨inen, “Local binary features for texture classification: Tax- onomy and experimental study,” Pattern Recognition, vol. 62, pp. 135–160, Feb. 2017. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S003132031630245X [14] T. Ma¨enpa¨a¨ and M. Pietika¨inen, “Classification with color and texture: jointly or separately?” Pattern Recognition, vol. 37, no. 8, pp. 1629–1640, Aug. 2004. [15] G. Wyszecki and W. S. Stiles, Color Science: concepts and methods, quantitative data and formulae, 2nd ed. New York: John Wiley & Sons, Inc., 1982. [16] A. Porebski, N. Vandenbroucke, L. Macaire, and D. Hamad, “A new benchmark image test suite for evaluating color texture classification schemes,” Multimedia Tools and Applications, vol. 70, 05 2014. [17] T. Ojala, T. Maenpaa, M. Pietikainen, J. Viertola, J. Kyllonen, and S. Huovinen, “Outex - new framework for empirical evaluation of texture analysis algorithms,” in Proceedings of the 16th International Conference on Pattern Recognition, vol. 1, 2002, pp. 701–706 vol.1. [18] A. R. Backes, D. Casanova, and O. M. Bruno, “Color texture analysis based on fractal descriptors,” Pattern Recognition, vol. 45, no. 5, pp. 1984 – 1992, 2012. 216
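Referring back to the evaluation protocol of Section IV-A above, the classification step can be illustrated as follows. This is a hedged sketch, not the authors' code: it assumes the LBP feature vectors and labels have already been prepared as arrays, and it uses scikit-learn's nearest-neighbour classifier with the L1 (cityblock) distance.

```python
# Sketch of the 1-NN / L1 evaluation averaged over random single-sample splits.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def evaluate_split(train_X, train_y, test_X, test_y):
    """Accuracy of 1-NN with L1 distance on one random split."""
    clf = KNeighborsClassifier(n_neighbors=1, metric="manhattan")
    clf.fit(train_X, train_y)
    return clf.score(test_X, test_y)

def averaged_accuracy(splits):
    """Average accuracy over the 100 random selections used in the paper."""
    return float(np.mean([evaluate_split(*s) for s in splits]))
```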

2019 4th International Conference on Information Technology (InCIT2019) 0LFUR51$*HQH6LJQDWXUHV3UHGLFWLRQIRUFDQFHUV ZLWK'UXJ'LVFRYHU\\  7DQ1LWD\\D 3UDWKDQ3KXPSKXDQJ &RPSXWHU6FLHQFH &RPSXWHU6FLHQFH 6FKRRORI,QIRUPDWLRQ7HFKQRORJ\\ 0HD)DK/XDQJ8QLYHUVLW\\ 6FKRRORI,QIRUPDWLRQ7HFKQRORJ\\ &KLDQJ5DL7KDLODQG 0HD)DK/XDQJ8QLYHUVLW\\ &KLDQJ5DL7KDLODQG 3LWXNSRQJ3RPMDOUHQ 1LOXERQ.XUXEDQMHUGMLW &RPSXWHU6FLHQFHDQG,QQRYDWLRQ ,4',75HVHDUFK*URXS 6FKRRORI,QIRUPDWLRQ7HFKQRORJ\\ 6FKRRORI,QIRUPDWLRQ7HFKQRORJ\\ 0HD)DK/XDQJ8QLYHUVLW\\ 0HD)DK/XDQJ8QLYHUVLW\\ &KLDQJ5DL7KDLODQG &KLDQJ5DL7KDLODQG Abstract—0LFUR51$VLVWKHRQHRINLQGRIQRQFRGLQJ51$ FDQFHU JHQHV LQWR 33, QHWZRUN KHOSV WR UHYHDO 33,V ZLWK ZKLFK FRQWURO JHQH H[SUHVVLRQ 1RZDGD\\V PDQ\\ RI 1H[W VLJQLILFDQW IXQFWLRQDO ELRORJLFDO SURFHVVHV DQG WKHUDSHXWLF *HQHUDWLRQ 6HTXHQFLQJ 1*6  XVH PLFUR51$V VHTXHQFH DV >@ &XUUHQWO\\ WKH LQWHUDFWLRQV EHWZHHQ GUXJV DQG WDUJHW EDVHGRIPHGLFDOWUHDWPHQWUHVHDUFK0LFUR51$VKDYHHPHUJHG SURWHLQVKDYHVWXGLHGDVLQWHUDFWLYHQHWZRUN>@0DQ\\GUXJV DQG DUH FRQVLGHUHG DV SRWHQWLDO WDUJHWV IRU FDQFHU SUHYHQWLRQ LQWHUDFW ZLWK SODVPD RU WLVVXH SURWHLQV WR IRUP D GUXJ DQG WKHUDS\\ ,W SOD\\V UROH LQ LQYROYLQJ LQ GUXJ IXQFWLRQV E\\ PDFURPROHFXOH FRPSOH[ +RZHYHU IHZ FRPSXWDWLRQDO FKDQJLQJ H[SUHVVLRQ OHYHO RI VSHFLILF JHQHV 7KLV UHVHDUFK UHVHDUFKHV SUHGLFW GUXJFDQFHU UHODWLRQVKLS WKURXJK FRPSRVLQJWZRVWHSVILUVWO\\PLFUR51$WDUJHWJHQHVRIYDULRXV PLFUR51$ GDWD 7KH ZRUN RI /LDQJ <X   VWXGLHG FDQFHUV ZHUH LGHQWLILHG E\\ .0HDQ DSSURDFK 7KHQ FDQFHU SUHGLFWLQJHIILFLHQWGUXJVIRUEUHDVWFDQFHUXVLQJPLFUR51$ SURWHLQWDUJHW GUXJV ZHUH GLVFRYHUHG WKURXJK SURWHLQSURWHLQ DQGWLVVXHLQIRUPDWLRQ>@7KLVZRUNFRQVWUXFWHGWKHQHWZRUN LQWHUDFWLRQ FOXVWHUVXVLQJ &OXVWHU21( DOJRULWKP PHWKRG 7KH RI GUXJPLFUR51$GLVHDVH WR REWDLQ WKH SRWHQWLDO GUXJ XOWLPDWHJRDORIWKLVZRUNLVWRLGHQWLI\\HIIHFWLYHDQGVDIHGUXJV GLVHDVHDVVRFLDWLRQIRUEUHDVWFDQFHU7KHLUUHVXOWVKRZVWRS IRUFDQFHUSDWLHQWEDVHGRQPLFUR51$H[SUHVVLRQIRUVHOHFWLQJ SUHGLFWHGHIILFLHQWGUXJVIRUEUHDVWFDQFHU>@1RZDGD\\V YXOQHUDEOH WDUJHWV DQG DGGUHVVLQJ LQQRYDWLYH WKHUDSLHV IRU DFRPSXWHUKDVPDQ\\UROHVZKLFKGHYHORSVDQGKHOSVKXPDQ FDQFHUV WR HYROXWLRQV 7KH SURV IHDWXUH RI FRPSXWHU IRU KHOSLQJ LQ FDVH LV SHUIRUPDQFH VWRUDJH DQDO\\VLV DQG SURFHVV WKH GDWD Keywords—microRNA, microRNA-target gene prediction, K- %LRLQIRUPDWLFVLVDILHOGRIVFLHQFHFROOHFWLQJELRORJLFDOGDWD Mean, network clustering, drug discovery, cancer VXFKDVJHQRPLFVSURWHRPLFVDQGPHWDERORPLFVE\\XVLQJD V\\VWHPDWLFGHYHORSLQJDSSOLFDWLRQIRUXVHGZLWKELJGDWD , ,1752'8&7,21  &DQFHU LV DEQRUPDO JURZLQJ DQG GLYLGLQJ RI FHOOV LQ KXPDQDQGRUJDQLVPERG\\>@,WLVWKHRQHLPSRUWDQWFDXVH  RI GHDWK RI ZRUOG SRSXODWLRQ ,Q  KDV  PLOOLRQ EHFRPH QHZ FDQFHU SDWLHQWV DQG DW WKH VDPH WLPH KDV  )LJXUH+XPDQ¶VPLFUR51$VWUXFWXUH PLOOLRQSHRSOHGLHEHFDXVHRIFDQFHU>@,Q7KDLODQGDURXQG  SHRSOHKDVEHHQGLHGEHFDXVHRIFDQFHULQ>@ 7KHUHIRUH RXU SUHVHQW ZRUN DLPV WR LGHQWLI\\ FOXVWHULQJ 7RGD\\ QH[W JHQHUDWLRQ VHTXHQFLQJ XVH PLFUR51$ DV PRGHOWRSUHGLFWPLFUR51$WDUJHWJHQHLQWHUDFWLRQRIYDULRXV EDVHGRIPHGLFDOWUHDWPHQW0LFUR51$VDUHWKHVPDOOSDUWRI FDQFHUV 7KHQ ZH IXUWKHU GLVFRYHUHG VLJQLILFDQW GUXJV 51$ WKDW FRQWURO JHQH UHJXODWLRQ >@ 7KH ILJXUH  FODULI\\ WKURXJKFDQFHUSURWHLQSURWHLQLQWHUDFWLRQPRGXOHV DERXW PLFUR51$ VWUXFWXUH LQVLGH KXPDQ ERG\\ ,Q KXPDQ ERG\\ WKHUH DUH WZR W\\SHV RI JHQHV UHODWLQJ WR FHOO ,, 0(7+2'2/2*< GHYHORSPHQWZKLFKDUHWXPRUVXSSUHVVRUJHQHDQGRQFRJHQH ,IWKHUHLVDQDEQRUPDOGHYHORSLQJRUPXWDWLRQLWFDQFDXVH A.Research Framework FDQFHU 0LFUR51$ IXQFWLRQ DV HLWKHU RQFRJHQHV RU WXPRU  ,QWKLVUHVHDUFKZHGLYLGHWRWZRSKDVHVZKLFKLV.0HDQ VXSSUHVVRUV XQGHU FHUWDLQ FRQGLWLRQV 
[Pages 217-221: conclusion of the preceding paper in these proceedings, on a K-Means clustering model for predicting microRNA-target genes of cancers and ClusterONE network clustering of microRNA-PPI-drug modules. These pages cover its data sources (DIANA, TargetScan, miRDB, BioGRID and DrugBank), the results (clustered instances and cluster centroids of the K-Means model, significant microRNA-PPI-drug modules, and the result web pages), the discussion and conclusion, the acknowledgment and the references. The body text of these pages is not legible in this extraction.]

2019 4th International Conference on Information Technology (InCIT2019)

Machine Trading by Time Series Models and Portfolio Optimization

Pongsak Thuankhonrak, Ekarat Rattagan, Suronapee Phoomvuthisarn
Faculty of Information Science and Technology, Mahanakorn University of Technology, Bangkok, Thailand
[email protected], [email protected], [email protected]

Abstract — Machine learning algorithms such as Support Vector Machine (SVM) and Artificial Neural Network (ANN) are commonly used for machine trading. The problem with SVM and ANN is that it is difficult to determine appropriate features for either model; moreover, backpropagation for ANN becomes time-consuming as the numbers of features and observations grow. Machine trading, however, is not confined to these two algorithms; time series models can also be used. In this study we employ the time series models Autoregressive Integrated Moving Average (ARIMA) and Holt-Winters' Exponential Smoothing (HW) as guidance for trading, since time series models require only the price series itself as input. We perform time series analysis to capture trading opportunities in the Stock Exchange of Thailand (SET). Of the fifty companies in the SET50 index, we choose five to invest in, ranked by Sharpe Ratio; the top five from this measurement are selected as the invested assets in a simulated portfolio. Furthermore, the well-known portfolio optimization framework of Harry Markowitz is used to ensure that the combination of invested assets lies on the efficient frontier. The result of this study is favorable, as the return generated by these activities outperforms the market return; furthermore, resampling the time series at different time lags yields higher returns, and combining the ARIMA and HW models to predict stock prices also improves the predictive power of the time series models. This study will help retail investors with limited resources overcome bias and intuition throughout the investment decision process, from finding stocks for investment to capturing market movements for trading opportunities.

I. INTRODUCTION

Investment in financial securities has become part of people's lives. Investment in financial instruments is not only a means of yield enhancement but also a vehicle for tax privileges, such as investing in a Long Term Equity Fund (LTF) or a Retirement Mutual Fund (RMF). Furthermore, corporates managing treasury operations also need investments apart from bank deposits for their surplus cash.

Some institutional investors utilize trading systems such as algorithmic trading or arbitrage strategies to exploit market discrepancies [1]; however, retail investors find it difficult to weather the market situation because they cannot access trading information in time and have a limited budget for advanced financial technology. They also have little trading knowledge compared with large financial institutions, which invest heavily in trading platforms and skilled personnel. There is therefore a need for a guideline on portfolio construction that can be implemented on a personal computer, so that individual investors, especially retail investors with some basic knowledge and a limited budget, can build up their portfolios efficiently, supporting the expansion of the financial market and the interest in financial instruments as alternative investments for wealth management.

Machine learning algorithms such as Support Vector Machine (SVM) and Artificial Neural Network (ANN) are the learning methods commonly used for machine trading [2]. However, these methods must determine all factors that influence the asset price and are therefore not scalable to all trading information that directly or indirectly affects a particular stock, such as GDP growth, foreign exchange rates and past financial performance. Time series models instead use only one factor, indexed in time order, to determine the asset price; hence the only input for model calibration is the price of the stock itself.

In this study we employ the time series models ARIMA and Holt-Winters' Exponential Smoothing (HW) as guidance for trading. There are fifty companies in the SET50 index announced by the Stock Exchange of Thailand (SET); only the top five measured by Sharpe Ratio [3] are invested in, and these five are selected as the invested assets in a simulated portfolio. Furthermore, the well-known portfolio optimization framework of Harry Markowitz [4] is used to ensure that the combination of invested assets lies on the efficient frontier.

This paper is conducted at the empirical level to evaluate the extended time series models using ARIMA and HW. Using mean square error (MSE), the two models are compared in terms of robustness and accuracy. Besides, by combining ARIMA and HW we can make better decisions in forecasting the closing price. The body of this paper is presented in nine sections. Sections 2-5 summarize the essential prior work related to time series models. The comparison of the extended ARIMA and HW models follows the data analysis process and is evaluated in Sections 6 and 7. Section 8 combines the ARIMA and HW models for better judgement of the closing price. The summary is discussed at the end.

II. LITERATURE REVIEWS

A number of numerical prediction models have been developed to forecast asset prices. Barack and Lawrence [5] implemented ANN to predict stock prices in the Nairobi Securities Exchange and the New York Stock Exchange; the models gave preferable results, but the authors could not carry out the experiment for all sixty stocks because of the time consumed in model tuning, as they needed around two hours to fit the model for a single stock; besides, they had to single out the stocks qualitatively as model inputs. SVM has also been used to predict stock prices with substantial performance, using time series of historical data including the open, low, high and closing prices as features [6]; however, selecting features is critical and controversial [7], as both ANN and SVM can give different results with different sets of features. In fundamental analysis, for example, there are many aspects to which an investor can pay attention; in top-down analysis [8] the factors we can take into account include the current state of the economic cycle, fiscal policy, monetary policy, the nature of the industry, regulations, and pricing and costing policy. These factors can be incorporated into both ANN and SVM and will not yield the same result. Besides, unlike ANN, time series models do not rely on the performance of the computer to perform backpropagation [9]; hence, in this study we employ time series models to capture stock movements, since they require fewer inputs (only the historical prices) and much less time to fit, which gives us more flexibility to capture market movements and guide the trading.

III. AUTOREGRESSIVE INTEGRATED MOVING AVERAGE (ARIMA) AND HOLT-WINTERS' EXPONENTIAL SMOOTHING WITH SEASONALITY

A. Autoregressive integrated moving average (ARIMA)

The ARIMA model is a time series model that can handle a non-stationary process. The foundation of the ARIMA model stems from the Autoregressive (AR) model and the Moving Average (MA) model.

The Autoregressive (AR) model is a time series model that works on a stable process and can be written in the following generalized form:

Xt = a0 + a1Xt-1 + a2Xt-2 + …… + apXt-p + Ԑt (1)

The AR model is a linear model in which the current value Xt depends on its former values Xt-1, Xt-2 and so forth, plus Gaussian white noise (Ԑt), where the white noise represents the effect of unexpected events on the time series. Ԑt is stable and follows the Normal distribution with mean (µ) equal to 0 and stable variance (σ2); Ԑt is also called a random shock, since it represents unforeseen circumstances that deviate from the process.

The Moving Average (MA) model differs from the AR model in that Xt depends on the white noise terms (Ԑt) from the past to the present, as can be seen from the following generalized form:

Xt = β0 + Ԑt - β1Ԑt-1 - β2Ԑt-2 - …… - βqԐt-q (2)

The MA model is a linear model comprising the series of parameters β and the white noise terms Ԑ from the past until the present, while β0 is the average of the selected period of the time series.

ARMA is the combination of the AR and MA models. The generalized form of ARMA can be described as:

Xt = a0 + a1Xt-1 + a2Xt-2 + …… + apXt-p + Ԑt - β1Ԑt-1 - β2Ԑt-2 - …… - βqԐt-q (3)

The ARMA model incorporates both the past values of the series and the white noise (Ԑt). The ARIMA model is the extension of ARMA that handles non-stationary processes caused by factors such as seasonality and trend within the time series. The standard form of the ARIMA model is formulated as follows:

α(L)∆dXt = β0 + β(L)Ԑt (4)

Where:
∆d = the d-th difference between lagged values, for example ∆Xt = Xt - Xt-1
α(L) = 1 - α1L - α2L2 - …… - αpLp (the AR polynomial)
β(L) = 1 - β1L - β2L2 - …… - βqLq (the MA polynomial)
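As a concrete illustration of fitting the model in (4) to a closing-price series, the sketch below uses Python with pandas and statsmodels. The file name and the (1, 1, 1) order are placeholder assumptions (the orders actually fitted per stock are reported later in Table I), so this is a sketch of the idea rather than the paper's exact implementation.

```python
# Minimal sketch: fit an ARIMA(p, d, q) model to daily closing prices
# and produce a one-step-ahead forecast (assumes pandas and statsmodels).
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical CSV with Date/Close columns, e.g. downloaded from Yahoo Finance.
prices = pd.read_csv("AOT.BK.csv", parse_dates=["Date"], index_col="Date")["Close"]

model = ARIMA(prices, order=(1, 1, 1))   # (p, d, q); an example order only
fitted = model.fit()

next_close = fitted.forecast(steps=1).iloc[0]
print("predicted next close:", round(next_close, 2))
```

Comparing the forecast with the last observed close is what later drives the long/short/hold decision in the trading simulation.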

B. Holt and Winters exponential smoothing model (HW)

The model was first developed by Professor Charles C. Holt for forecasting trends in production. Later, Professor Peter R. Winters improved the model by adding seasonality, so that the model can handle business time series, which normally exhibit trend and seasonality. HW can accommodate time series for which the stationarity assumption has to be disregarded. Hence HW yields great benefits, as a normal business time series does exhibit trend, seasonal, cyclical and irregular effects. The model can be written as Yt = Tt + St + Ct + It, where Yt is the value of the time series at time t, Tt is the trend at time t, St denotes the seasonal component at time t, Ct denotes the business cycle and It represents the irregularity at time t.

HW can be classified into two variants, additive and multiplicative. The difference between the two models is the calculation of the seasonal factor St (described below): the former handles seasonality by simply subtracting each value of the time series from its expected value, while the latter calculates the ratio between the value of the time series and its expected value. The multiplicative HW method is the more ubiquitously used of the two.

The model utilizes the triple exponential smoothing method, which comprises three components: level, trend and seasonal. The expressions for the components can be generalized as follows, where Lt denotes the level, Tt the trend and St the seasonal factor at time t, and Ft+k the k-step-ahead forecast.

Expression for the additive model:

Lt = α(Yt - St-p) + (1 - α)(Lt-1 + Tt-1)    Level (5)
Tt = β(Lt - Lt-1) + (1 - β)Tt-1    Trend (6)
St = γ(Yt - Lt) + (1 - γ)St-p    Seasonal (7)
Ft+k = Lt + kTt + St+k-p    Forecast (8)

Where the initial values can be computed as below:

Lp = (Y1 + Y2 + ⋯ + Yp) / p    (9)
Tp = [(Yp+1 - Y1) + (Yp+2 - Y2) + ⋯ + (Y2p - Yp)] / p2    (10)
S1 = Y1 - Lp, S2 = Y2 - Lp, ……, Sp = Yp - Lp    (11)

Expression for the multiplicative model:

Lt = α(Yt / St-p) + (1 - α)(Lt-1 + Tt-1)    Level (12)
Tt = β(Lt - Lt-1) + (1 - β)Tt-1    Trend (13)
St = γ(Yt / Lt) + (1 - γ)St-p    Seasonal (14)
Ft+k = (Lt + kTt) St+k-p    Forecast (15)

While the initial values can be computed as in the additive model, except for the seasonal factors:

S1 = Y1 / Lp, S2 = Y2 / Lp, ………., Sp = Yp / Lp    (16)

Where α, β and γ denote the exponential smoothing parameters and p is the number of seasons in a year, such as 12 for monthly collected data or 4 for quarterly data.
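The recursions (5)-(16) are available in standard libraries, so they do not need to be coded by hand. The fragment below is a minimal sketch, assuming Python with statsmodels and a pandas Series of closing prices named prices (as in the ARIMA sketch above); the additive configuration and seasonal_periods=5 (treating a trading week as one season) are illustrative assumptions, not the configuration reported in this paper.

```python
# Minimal sketch: additive Holt-Winters (triple exponential smoothing)
# fitted to a pandas Series of daily closing prices.
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# 'prices' is assumed to be a pandas Series of closes, as in the ARIMA sketch.
hw = ExponentialSmoothing(
    prices,
    trend="add",          # additive trend component Tt
    seasonal="add",       # additive seasonal component St, eqs (5)-(8)
    seasonal_periods=5,   # illustrative season length p (one trading week)
).fit()                   # alpha, beta, gamma are estimated unless supplied

print(hw.params)          # contains the fitted smoothing parameters
print(hw.forecast(1))     # one-step-ahead forecast Ft+1
```

The fitted smoothing parameters returned here play the role of the (α, β, γ) values reported later for each stock.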

IV. TIME SERIES

A time series is a data set indexed according to its time order. Examples of time series are the unemployment rate, the sales revenue of a company or even the inflation rate. The components of a time series include trend, cycle, seasonal variations and irregular fluctuation [10].

A. Trend
A trend is the component that makes a time series increase or decrease with the passage of time. An example of a trend is the sales value of a company, which can be small at the inception of the business and then become bigger as the company keeps growing.

B. Cycle
A cycle of a time series is a full loop of evolution over the passage of time. The full loop may begin with an upward trend followed by a downward trend, or vice versa. Normally a cycle takes 2 to 10 years to complete. An example of a cycle in a time series is the economic cycle, which normally includes a boom and a bust.

C. Seasonality
Most business time series present seasonality. Seasonality creates a time series with a pattern that repeats every calendar year. Seasonality differs from a cycle in that the loop of a seasonal pattern finishes within one calendar year, while the pattern of a cycle normally takes more than a year to complete.

D. Irregular event
Sometimes a time series exhibits all of the components, including trend, cycle and seasonality; however, there can also be an extraordinary event that influences its behaviour so that the figures no longer resemble the usual patterns. An example of such an occasion is the announcement of the 30% capital reserve requirement by the Bank of Thailand. After the announcement, the SET index dropped from 721.85 to 587.92, or 19.52%; once investors had absorbed all the information, the index adjusted itself back to the previous level on the next trading day.

V. SHARPE RATIO

In order to determine which investment assets should be included in the investment portfolio, we need to gauge the associated risk and return of each investment asset. Generally speaking, a high return from an investment always comes along with high associated risk, so there should be a measurement to evaluate the choice among investible assets.

The Sharpe ratio is one measurement used to compare candidate assets. It assesses the excess return of a candidate investment over the risk-free rate, adjusted by the asset's individual risk, as given by the following formula:

Sharpe Ratio = (Rx - Rf) / σx    (17)

Where:
Rx = return from the investment asset
Rf = risk-free return
σx = standard deviation of the investment asset's return

Clearly, the Sharpe ratio measures the return over the risk-free rate relative to the asset's stand-alone risk; the higher the Sharpe ratio, the higher the compensation of return over stand-alone risk. We therefore include in our portfolio the stocks that yield a high Sharpe ratio compared with their peers.

VI. PORTFOLIO THEORY

Markowitz [11] proposed the expected return and expected risk measurement of the portfolio; he also showed that the variance of the portfolio can be used as the risk measurement of the overall portfolio, which led to the study of diversification in asset allocation. Markowitz includes the following assumptions in his proposal.

1. Investors consider each investment alternative in the form of a distribution of expected returns described by a probability distribution.

2. Investors maximize one-period expected utility, and their utility functions exhibit diminishing marginal utility of wealth.

3. Investors estimate the risk of an investment on the basis of the variance of the portfolio's expected return.

4. Investors consider both expected return and risk, so the utility function of each investor depends on both the expected risk and the expected return.

5. Given the same amount of risk, investors will invest in the asset yielding the highest return, and vice versa.

The study of Markowitz suggests that, to construct the portfolio, the combination of invested assets must be located on the efficient frontier, as the efficient frontier represents the best reward of return for a given amount of risk measured by the standard deviation [12]. The construction of the portfolio maximizes the expected return Rp while keeping the variance (σ2p) of the portfolio at the minimum, as described below:

Rp = wTµ    (18)
σ2p = wT∑w    (19)

Where:
Rp = expected return of the portfolio
σ2p = portfolio variance
σp = portfolio standard deviation
µ = vector of expected returns of each asset
w = vector of shares of wealth in each asset
∑ = variance-covariance matrix of asset returns

σp = √(wT∑w)    (20)

VII. PROCEDURE

The procedure for the experiment follows the data analysis process, whose steps are as follows:
1. Data acquisition
2. Exploring and preparing data
3. Model training
4. Model performance evaluation

A. Data acquisition
To acquire the list of stocks comprising the SET50, we need to access The Stock Exchange of Thailand's (SET) website. SET has its own policy for including stocks in the SET50 index. The list is updated in accordance with SET's policy for selecting stocks to be included in, as well as removed from, the SET50; thus the list changes when stocks do or do not comply with SET's criteria.

We also need another source of data, because SET provides historical prices traced back only 6 months, or roughly 180 records, which is quite small; we therefore retrieve historical prices from Yahoo Finance to obtain longer data series.

B. Exploring and preparing data
All data from the acquisition need to be cleaned so that only the relevant data are used. Up to this point we have all 50 stocks and their time series; we then compute the Sharpe Ratio to select only the top 5 stocks yielding the highest scores, and the top performers are:

1. Airport of Thailand PLC (AOT)
2. Energy Absolute PLC (EA)
3. Global Power Synergy (GPSC)
4. Indorama Ventures PCL (IVL)
5. Beauty Community PCL (BEAUTY)

The next step is to find the portion for each candidate such that we obtain the optimum portfolio in accordance with Markowitz's portfolio theory; in this case we find the portion for each stock such that the risk (variance) of the portfolio is minimum. The result is that we invest 55%, 2.4%, 22%, 16.2% and 4.4% in AOT, EA, GPSC, IVL and BEAUTY respectively, and the efficient frontier is shown next.

Fig. 1. Efficient frontier for candidate stocks

C. Model training
We simulate trading for 100 days and then compare the result with the SET50 performance. The simulation employs a data set of 300 records starting from 6th April 2017 onward, because among the 50 stocks comprised in the SET50, TPI Polene Power Pcl. (TPIPP) was first traded on that day; hence older data prior to 6th April do not exist for TPIPP. We divide the data set into two subsets, the first for model training and the second for prediction; thus the trading is made on the last 100 days of the data set.

We simulate 100 trading days, and each day we build both the ARIMA and HW models once; we then use both models to predict the price of the stock and make the decision whether to long, short or hold the stock. We also incorporate a 0.25% commission on the trading value.

TABLE I. PARAMETERS IN ARIMA MODELS FROM THE FIRST ITERATION

Stock      ARIMA model
AOT        ARIMA(0,1,0)
EA         ARIMA(0,1,0)
GPSC       ARIMA(2,2,1)
IVL        ARIMA(1,1,1)
BEAUTY     ARIMA(0,1,0)

TABLE II. SMOOTHING PARAMETERS (α, β, γ) OF THE HOLT-WINTERS MODELS FROM THE FIRST ITERATION, FITTED FOR AOT, EA, GPSC, IVL AND BEAUTY
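To illustrate the selection and weighting steps of Sections B and C, the sketch below ranks stocks by the Sharpe ratio of (17) and then computes closed-form minimum-variance weights (proportional to the inverse covariance matrix applied to a vector of ones). It assumes a pandas DataFrame named closes of daily closing prices with one column per SET50 stock and a zero risk-free rate; since the closed form carries no short-sale constraint, it is an illustration of the idea rather than the exact optimiser behind the 55%, 2.4%, 22%, 16.2% and 4.4% weights reported above.

```python
# Sketch: rank stocks by Sharpe ratio and weight the top five by the
# unconstrained minimum-variance rule (assumes numpy and pandas).
import numpy as np
import pandas as pd

def sharpe_ratio(returns: pd.Series, rf: float = 0.0) -> float:
    # (mean return - risk-free rate) / standard deviation of returns, as in (17)
    return (returns.mean() - rf) / returns.std()

# 'closes': DataFrame of daily closing prices, one column per stock (assumed).
returns = closes.pct_change().dropna()
top5 = returns.apply(sharpe_ratio).nlargest(5).index

cov = returns[top5].cov().values            # variance-covariance matrix (Sigma)
w = np.linalg.solve(cov, np.ones(len(top5)))
w = w / w.sum()                             # normalise so the weights sum to 1
print(dict(zip(top5, np.round(w, 3))))      # note: weights may be negative here
```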

D. Model performance and evaluation

One dimension for selecting a model is to see whether the model generates a large amount of error. One methodology for gauging the accuracy of a model is to find the mean square error (MSE): the lower the MSE, the better the performance of the model. Here we summarize the MSE of each model for the traded stocks after the 100 simulated trading days in the table below.

TABLE III. SUMMARY OF MEAN SQUARE ERRORS (MSE) AND PERFORMANCE

Stocks     ARIMA model   Holt and Winters
AOT        4.43          0.81
EA         2.03          5.05
GPSC       1.18          2.95
IVL        0.26          1.17
BEAUTY     8.21          10.35
Total      16.11         20.33
Return     -9.38%        -9.65%

At first glance at the table, we can say that the ARIMA model outperforms the HW model, as the total MSE of the ARIMA model is lower than that of the HW model, which is consistent with the accuracy comparison described above. From the 100 simulated trading days we find that the portfolio using the ARIMA model as the predictor yields -9.38%, while HW yields -9.65%; an investor who simply invested in the market over the simulated period would have suffered an 11.43% loss. Thus only the ARIMA model outperforms the market return.

We can conclude from our experiment that, even though the models fit the time series quite well, the rule that triggers the execution to buy or sell matters, especially when the prediction signals an upward move but the signal is too weak, in other words only a little higher than the previous price. In this case the gain might not compensate for the commission fees, and eventually we lose because we trade too often; the loss comes not from the inaccuracy of the prediction but from the commission fees.

VIII. ENHANCED PERFORMANCE BY MULTIPLE-DAY TRADING

We implement the same strategy for both the ARIMA and HW models; however, the input time series is modified according to its lag time, from two up to five lags.

To be able to compare with our previous strategy, we also use the same assumptions as in the daily-trade strategy, including the invested assets (AOT, EA, IVL, GPSC and BEAUTY), the trading period and stocks (from 1st January 2018 to 28th June 2018) and the weight in each stock; all are the same as determined earlier, except that the time series used here is varied by its lag time.

What we mean by lag time here is the following: when we say we use two lags of time to forecast the future price, it means we collect the closing price of our stock once every period indicated by that lag. For example, with two lags of time we collect the closing price every two days and use it as our time series for building both the ARIMA and Holt-Winters models; with four lags we gather the closing price every four days, as depicted in the diagram below (a code sketch of this resampling follows the list of lag members).

T0  T1  T2  T3  T4  T5  T6  T7  T8  T9  T10
Fig. 2. Time line with different lag times

- Members of the two-lag time series: T0, T2, T4, T6, … T10
- Members of the four-lag time series: T0, T4 and T8
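A sketch of the lag-k resampling just described, assuming a pandas Series prices of daily closes: keeping every k-th observation yields the two-lag and four-lag series listed above, and dividing a strategy's total return by its number of trades gives the per-trade figures reported in Table IV below.

```python
# Sketch: build the lag-k closing-price series of Section VIII and
# normalise a total return by the number of trades (assumes pandas).
import pandas as pd

def lagged_series(prices: pd.Series, k: int) -> pd.Series:
    # Keep T0, Tk, T2k, ...: one closing price every k trading days.
    return prices.iloc[::k]

two_lag  = lagged_series(prices, 2)   # T0, T2, T4, ...
four_lag = lagged_series(prices, 4)   # T0, T4, T8, ...

def profit_per_trade(total_return_pct: float, n_trades: int) -> float:
    return total_return_pct / n_trades

# Example with the lag-2 ARIMA figures from Table IV: 32.75% over 50 trades.
print(round(profit_per_trade(32.75, 50), 3))   # -> 0.655 (% per trade)
```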
TABLE IV. SUMMARY OF PERFORMANCE FOR EACH LAG TIME

Number of lags   Profit (ARIMA)   Profit (Holt-Winters)   No. of trades   Profit per trade (ARIMA)   Profit per trade (Holt-Winters)
1                -9.38%           -9.65%                  100             -0.093%                    -0.096%
2                32.75%           20.64%                  50              0.655%                     0.413%
3                19.69%           14.41%                  33              0.596%                     0.436%
4                14.54%           1.95%                   25              0.581%                     0.078%
5                5.74%            4.32%                   20              0.287%                     0.216%

The table above summarizes the results of trading with different lags, up to five lags of time. We can see that using different lags of time yields different results. The most relevant criterion for evaluating performance is the profit per trade; as the name says, the profit per trade removes the bias caused by the number of trading transactions, since we scale the result down to a per-transaction basis to judge which model performs better.

We can see that when we use the time series with two lags of closing prices as the input, ARIMA provides the highest return, 0.655% per trade, while Holt-Winters performs best when using three lags of closing prices as the input. Seemingly, using a longer lag time enhances the performance of the overall model as a result of:

1. Lower commission fees, as the number of trades is drastically reduced. The first proposed methodology assumes that we make a transaction every day and pay 0.25% of the trading value on each transaction; implementing a time lag for trading therefore means fewer trading transactions, and two lags cut the commission fees by half.

2. Daily volatility causing unnecessary transactions. We can see in the next graph that the volatility of the AOT daily price is higher than that of the AOT four-lag price, as the latter graph is smoother than the daily one. A smoother time series enhances the model, because both ARIMA and HW employ the previous trading prices in building the model: the lower the volatility of the time series, the higher the accuracy of the model. Furthermore, daily trading will overlook the power of

