2020簡介0619

Published by tomoaki.sinica, 2020-06-21 22:43:42


Brochure 2020

In the mid to long term, building on our storage system based on crash-deterministic solid-state drives (SSDs), we will further develop a high-performance, highly reliable graph-computing system. Graph computation generates massive data movement between the memory and storage layers, so our next objective is to reduce this data movement and thereby improve overall system performance. Data-access behavior in graph computing is determined mainly by the graph layout and the graph algorithm. A graph layout is constructed from a set of nodes and edges. In social networks and the World Wide Web, graphs are usually scale-free: a small number of hub nodes have tens or even hundreds of times as many edges as ordinary nodes. We observe that when a scale-free graph is directed, its in-degree and out-degree counts differ enormously; compared with ordinary nodes, hub nodes have extremely high in-degree. This means that when analyzing a scale-free graph, if hub nodes can be identified effectively, prefetched into cache memory, and pinned there, massive data movement can be avoided. For the graph-layout part, we will therefore propose an effective hub-node identification and pinning mechanism. To avoid wasting memory, we will also propose hub-node prediction and eviction mechanisms that adapt to graph-access behavior that changes dynamically at run time. For applications whose graph nodes and edges change frequently at run time, we will integrate the crash-deterministic SSD developed in this project with its verified SSD management system, further improving the performance of modifying graph nodes and edges and writing them to the SSD while simplifying verification complexity, so that graph data can be recovered correctly and quickly, with its correctness verified, after a power failure or crash. Finally, we will integrate the three years of results into a system called GraphStor, realizing a graph-processing system that is both highly efficient and highly reliable.

IIS Collaborative Projects

GraphStor: Efficient & Recoverable Graph-processing System Design, from Devices to Systems and Applications

Principal Investigators: Dr. Yu-Fang Chen and Dr. Yuan-Hao Chang
Project Period: 2020/1~2022/12

Due to the rapid growth in "big data" production, the technique of graph computing to extract useful information from various big data sources is receiving significant attention from both industry and academia. The primary performance bottleneck in graph computing is the massive amount of data throughput between memory and other storage devices. Enormous amounts of data movement are unavoidable in current computer architectures because the rate of data production is growing faster than that of memory availability. It is impossible to store all such data in memory banks. Instead, some data has to be stored in cheaper but slower storage devices and moved to the faster memory on demand. Moreover, in modern computer systems, every abstraction layer (e.g., the device layer, file system layer, and database layer) usually maintains its own backups or data journals so that the system can recover damaged data caused by system crashes or sudden power shortages. However, performing data backups or journaling not only requires additional space but also entails excessive data movement. Against this background, our three-year project GraphStor began in January 2020 under the approval of IIS, with funding of NTD $2.5 million/year. The goal of GraphStor is to design a highly efficient and reliable graph-processing system to meet requirements in this era of big data. Our main design principle is to reduce the amount of data movement and data backup in the system stack (including at the device, file, and memory system levels). We are exploring a few possible solutions, including in-memory computing, ultra-low-latency (ULL) storage as slow memory, and file concept redesign with a data structure that is more suitable for graph computing.
A collaboration between the two project PIs is required to achieve the goal of developing a graph-processing system with high performance and reliability. Specifically, we need storage design experts to achieve performance improvement and formal method experts to ensure system reliability. The short-term objective of this project is to develop a highly efficient and reliable crash-deterministic solid-state drive (SSD)-based storage system (see below) that could enable the crash recovery mechanism of our entire graph-processing system. Our mid- to long-term objectives are to develop a highly efficient and reliable graph-processing system that employs our crash-deterministic SSD. It is unavoidable that storage systems will encounter occasional and sudden power shortages or crashes, so a crash recovery mechanism is a necessary component of storage system design. Our short-term objective is to design a vertically integrated crash recovery mechanism that covers every layer of the graph-processing system, from the storage device layer to the application layer, to reduce redundant backups and journals, thereby simultaneously achieving high efficiency and reliability. Our design will involve several complex data structures and algorithms, so we plan to leverage formal verification techniques to ensure that our implementation of this complex design is correct. "Complete formal verification is the only known way to guarantee that a system is free of programming errors." -- the seL4 team from Data61, who developed the first formally verified OS kernel

Currently, SSDs are considered mainstream storage devices. We have found that traditionally designed systems suffer from sudden failures and cannot always be guaranteed to recover to a steady state upon being rebooted. Thus, all written data created after the last flush operation can be lost. We illustrate an example of this scenario in Figure 1. Suppose that eight sectors exist in a device. W1 occurred before the latest flush operation was triggered, while W2 and W3 occurred after it, writing one and three sectors, respectively. If the system crashes upon completion of W3, each of the sector writes incurred by W2 and W3 may or may not have been persisted, resulting in 16 possible states after the system reboots. To deal with this issue, storage or graph-processing systems at the upper layers incur high overheads to guarantee user-data consistency. The general solution in current file systems (e.g., ext4) is to maintain a journal and recover the system through journaling techniques after system crashes.

Figure 1: SSD behaviors incurred by traditional and deterministic design.

We are using the out-of-place update characteristic of SSDs to design algorithms and data structures that ensure the storage system will recover, after a crash, to the system state right after the last flush operation. We have termed our proposed design the crash-deterministic SSD. Intuitively, each flush operation triggered in the crash-deterministic SSD can be considered a "state-storing" action. Thus, the design of recovery mechanisms in upper system layers can be simplified by utilizing this feature. For example, a database system can trigger the flush operation directly after each transaction has been processed, so that each transaction is "all or nothing" (atomic) upon recovery from a system crash.
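The nondeterminism described above can be made concrete with a small enumeration. The sketch below is a toy model, not the firmware implementation: it treats each post-flush sector write as independently persisted or lost, reproducing the 16 possible post-crash states of the Figure 1 scenario (sector numbers and values are illustrative).

```python
from itertools import product

def possible_crash_states(flushed_state, pending_sector_writes):
    """Enumerate every device state a traditionally designed SSD may
    present after a crash: each pending (post-flush) sector write may
    or may not have reached the medium, independently of the others."""
    states = set()
    for applied in product([False, True], repeat=len(pending_sector_writes)):
        state = dict(flushed_state)
        for done, (sector, value) in zip(applied, pending_sector_writes):
            if done:
                state[sector] = value
        states.add(tuple(sorted(state.items())))
    return states

# Figure 1 scenario: 8 sectors; after the last flush, W2 writes one
# sector and W3 writes three sectors.
flushed = {s: "old" for s in range(8)}
pending = [(0, "W2"), (1, "W3"), (2, "W3"), (3, "W3")]
print(len(possible_crash_states(flushed, pending)))  # prints 16
```

A crash-deterministic SSD collapses this whole set to a single state: the one captured by the last flush.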
We plan to use the crash-deterministic SSD as the foundation for the crash recovery mechanism of GraphStor.

In our preliminary experiments (Figure 2), we found that when we used crash-deterministic SSDs, we could still correctly recover the file system from a crash without maintaining a journal (e.g., using the ext2 instead of the ext4 file system). Moreover, for the database layer, the number of fsync operations (which clean the temporary state maintained by the current file system and trigger an SSD sync action) required to perform a single transaction can be reduced from 5 to 1 when crash-deterministic SSDs are used. The main drawback of the crash-deterministic SSD design is that it relies on complex data structures and algorithms, potentially giving rise to implementation errors in the firmware layer. Even a one-byte error in a storage device can disrupt the entire indexing mechanism and hence all stored data. To ensure the correctness of our firmware implementations, we are utilizing formal verification techniques. We are developing a system to check the crash-deterministic properties of our firmware implementation by integrating several verification tools, including C-language verification tools (e.g., Rosette+Serval) and theorem provers (e.g., Agda). This system will automatically check the C program written by firmware developers and confirm that the program is crash-deterministic or, if it is not, find a counterexample.

Figure 2: The advantage of crash-deterministic SSDs.

We assert that it is optimal to begin designing crash recovery mechanisms at the storage device layer. Both MIT and the University of Washington have proposed crash recovery systems over the past two years, but their crash recovery mechanisms are focused on the file system level. However, design complexity is much higher at the file system level than at the firmware level, so adding a verified crash-recovery mechanism at the file system level incurs higher verification costs.
Taking the MIT file system DFSCQ as an example, it took around ten person-years to implement the system and prove its correctness. Even worse, DFSCQ is three times slower than the standard ext4 file system in terms of performance. If we use crash-deterministic SSDs as the starting point for our crash-recovery mechanism, we expect that overall system performance will not diminish, despite requiring formal verification of its correctness. In fact, system performance might even be improved due to the removal of redundant crash recovery mechanisms (such as journaling) from other system layers. Moreover, overall implementation time can be reduced to 2~4 person-years because our formal verification system has a higher degree of automation than that employed by MIT.
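The crash-determinism property being verified can be illustrated with a toy model. The sketch below is an abstract Python stand-in for the firmware (the actual implementation is C, checked with tools such as Rosette+Serval and Agda): flush() is the single atomic commit point, and recovery always returns exactly the last-flushed state, whatever writes followed.

```python
class CrashDeterministicSSD:
    """Toy model of a crash-deterministic SSD: writes go out-of-place,
    flush() atomically commits the logical-to-physical map, and a crash
    discards everything after the last committed map."""

    def __init__(self, sectors):
        self.committed = {s: "init" for s in range(sectors)}
        self.shadow = dict(self.committed)

    def write(self, sector, value):
        self.shadow[sector] = value          # out-of-place: committed map untouched

    def flush(self):
        self.committed = dict(self.shadow)   # single atomic commit point

    def crash_and_recover(self):
        self.shadow = dict(self.committed)   # roll back to last flush
        return dict(self.committed)

# Property check: whatever writes happen after the last flush,
# recovery yields exactly the last-flushed state.
ssd = CrashDeterministicSSD(8)
ssd.write(0, "W1")
ssd.flush()
snapshot = dict(ssd.committed)
ssd.write(1, "W2")
ssd.write(2, "W3")
assert ssd.crash_and_recover() == snapshot
```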

Based on the results from our crash-deterministic SSD, we will further propose a highly efficient and reliable graph-processing system. Since graph computing entails heavy data traffic between the memory and storage layers, we will focus on developing techniques to hide or even eliminate extensive data movement to boost the performance of our graph-processing system, while remaining aware of graph access behaviors. Access patterns incurred by graph computing are mainly influenced by graph layouts and graph algorithms. Most graph layouts are constructed from a set of nodes and edges. Graph layouts in the scenarios of social networks and the World Wide Web usually follow the power law and are termed "scale-free graphs". In a scale-free graph, a few hub nodes have tens or even hundreds of times as many edges as other nodes. The access frequency of each node during graph computation is strongly correlated with the number of edges connected to the node, so hub nodes may endure tens or even hundreds of times more accesses than other nodes. Accordingly, data movement between the memory and storage layers can be reduced if we can efficiently recognize the hub nodes and prefetch/pin (or lock) them in main memory. Consequently, we are proposing hub-node pinning strategies to pin recently used hub nodes in memory. To avoid wasting memory resources and to prevent high miss rates, we will predict and unpin recently unused pinned hub nodes while remaining aware of the dynamic behavior of graph access events. We will also propose a vertical integration solution to integrate the proposed crash-deterministic SSD and validated SSD management system with dynamic graph applications, whose nodes and edges change over time.
That approach, which enhances SSD performance by simplifying validation complexity and accounting for graph access behaviors, will ensure that graph data can be quickly and correctly recovered when a system encounters sudden power failures or crashes. At the end of this three-year project, our GraphStor system will integrate all of these research outputs, representing a highly efficient and reliable graph-processing system.
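A minimal sketch of the hub-node pinning idea described above follows. It is illustrative only: the degree threshold, the pinned-pool capacity, and the LRU-style unpinning are stand-ins for the prediction/eviction mechanisms still under design.

```python
from collections import OrderedDict

def hub_nodes(edges, threshold):
    """Identify hub nodes of a directed graph by in-degree."""
    indeg = {}
    for src, dst in edges:
        indeg[dst] = indeg.get(dst, 0) + 1
    return {n for n, d in indeg.items() if d >= threshold}

class PinnedCache:
    """Pin hub nodes in memory; unpin the least-recently-used hub when
    capacity is exceeded (a toy stand-in for the proposed hub-node
    prediction and eviction mechanism)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pinned = OrderedDict()

    def access(self, node, is_hub):
        if node in self.pinned:
            self.pinned.move_to_end(node)        # hit: refresh recency
            return "hit"
        if is_hub:
            if len(self.pinned) >= self.capacity:
                self.pinned.popitem(last=False)  # unpin the coldest hub
            self.pinned[node] = True             # pin the newly seen hub
        return "miss"

# Toy scale-free-ish graph: node 0 is a hub with in-degree 49.
edges = [(i, 0) for i in range(1, 50)] + [(0, 1), (1, 2)]
hubs = hub_nodes(edges, threshold=10)
```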

Research Laboratories

Bioinformatics Laboratory (p. 108)
Computer System Laboratory (p. 114)
Data Mining and Machine Learning Laboratory (p. 120)
Multimedia Technologies Laboratory (p. 126)
Natural Language and Knowledge Processing Laboratory (p. 132)
Network System and Service Laboratory (p. 138)
Programming Languages and Formal Methods Laboratory
Computation Theory and Algorithms Laboratory


Bioinformatics Laboratory

Research Faculty: Huai-Kuang Tsai (Chair, Research Fellow), Ting-Yi Sung (Research Fellow), Jan-Ming Ho (Research Fellow), Chung-Yen Lin (Research Fellow), Arthur Chun-Chieh Shih (Research Fellow), Wen-Lian Hsu (Distinguished Research Fellow)

Our current research focuses on bioinformatics for "omics" studies, categorized into two main areas: I. genomics and transcriptomics, and II. glycan synthesis, proteomics and proteogenomics.

I. Genomics and Transcriptomics

Developing Methods for Sequencing Data. With the ascension of next-generation sequencing (NGS) as a predominant technology for genome and transcriptome studies, we have devoted ourselves to developing new methodologies and tools for analyzing NGS data. For NGS read mapping, we have developed an ultra-efficient divide-and-conquer algorithm, called Kart, which divides a read into small fragments that can be aligned independently. Our experiments show that Kart is 3 to 10 times faster than other alignment tools, yet still generates reliable alignments even when the error rate is as high as 15%. We also applied that same strategy to RNA-seq mapping and whole genome comparisons, and have obtained superb results. For de novo genome assembly, we have proposed an extension-based assembler, called JR-Assembler, which not only assembles a giga-base-pair genome using Illumina short reads but also achieves better overall assembly quality in lower execution time than rival systems. Moreover, the advantages of JR-Assembler in terms of memory usage and execution time improve as read length increases. For transcriptome assembly, we have also proposed a new de novo transcriptome assembler, called JR-Trans. Apart from using the kernel developed for JR-Assembler for "read extension by jumping" and "read extension with back-trimming", JR-Trans also detects breakpoint sites and provides a recursive procedure for branch extension and contig merging. In comparisons to current assembly tools using real data, JR-Trans achieves better overall assembly quality, requires lower execution time, and requires less memory.

Integrated Tool and Platform Development for Sequencing Data. We are developing methods to make genome drafts more complete and to discover more missing genes using 10x Genomics linked-read data. Draft assemblies usually result in several gapped scaffold sequences. The 10x Genomics company uses Chromium technology and a barcode system to prepare read libraries and to make read pairs from the same genomic fragment with the same barcode. The sizes of those genomic fragments are ~10-100 Kbp, which are useful for filling longer gaps. We have two goals for this project. First, we are developing intra-scaffold gap-filling methods based on the "expanded" local assembly method, which first identifies potential barcodes of mapped read pairs and then assembles all linked reads of potential barcodes. We have developed three methods, including scaffold-based, gap-based, and window-based local assembly methods. Preliminary results for a draft eel genome show that the gap-based method outperforms the others in terms of quality, including filled gapped bases and missing genes found. In terms of runtime, the gap-based method is second best and close to being the fastest method (i.e., just behind the window-based method). However, the runtime of the gap-based method using 96 cores in parallel is ~20 days, which is excessive. Therefore, we are developing fault-tolerant distributed algorithms to speed up the process by scaling out on IaaS computing clouds. For the second goal, we are developing new scaffolding methods, by which potentially adjacent scaffolds are identified first and then they are connected by expanded local assembly. These tools can be applied to personal genome assembly and genome assembly of non-model organisms.
Meanwhile, we are also developing a cloud-based architecture to accelerate execution of de novo genome assemblers. For assembled genomes, we have implemented pipelines that can decipher genome structure, annotate genes, and rapidly estimate expression profiles. Using our own web platform (http://molas.iis.sinica.edu.tw) and an integrated approach toward genomics, transcriptomics, proteomics, methylomics and other omics, we are tackling projects related to fusion gene discovery from clinical biopsies, functional annotation of non-model organisms (e.g., Giant grouper, http://molas.iis.sinica.edu.tw/grouper2016, and Japanese eel, http://molas.iis.sinica.edu.tw/jpeel2016), precision phenotyping of pathogens, and identification of mechanisms that restrict virus replication for vaccine development. Most of our advanced tools and platforms are publicly available on Docker Hub (https://hub.docker.com/u/lsbnb).

Transcriptional Regulation. Gene expression is finely regulated by multiple genetic factors. We are interested in understanding the dynamic interactions between cis- and trans-regulatory elements and the evolutionary signatures of genomes. We have recently made progress in: (1) transcriptional regulatory mechanisms and evolution, (2) epigenetics and enhancer functions, and (3) the discovery of non-coding RNA and mRNA isoforms. To help identify and test novel hypotheses in genome biology, we integrate multi-omics data and develop novel tools that combine concepts from computer science, statistics, and molecular biology. For example, from a study of cardiac hypertrophy in mice, we have

collected time-course transcriptome data isolated from hypertrophic murine hearts with or without transverse aorta banding surgery at different times. By analyzing the entire transcriptome dataset and reconstructing transcription factor co-expression networks, we discovered that global genetic changes began soon after cardiac pressure overload, i.e., before morphological changes from cardiac hypertrophy.

II. Glycan Synthesis, Proteomics and Proteogenomics

AI Approach for Glycan Synthesis. Carbohydrates play important roles in organisms, and glycan synthesis is essential to understand drug action and vaccine developments. The world-renowned programmable one-pot oligosaccharide synthesis approach developed by Dr. Chi-Huey Wong was the first automatic means of synthesizing a large number of oligosaccharides rapidly. Optimer, the early software for that one-pot method, provides synthetic blueprints by searching against a Building Block Library (BBL), also designed and synthesized by Wong. By ordering BBLs with appropriate relative reactivity values (RRVs), glycans can be synthesized efficiently and effectively in "one pot". However, there are two limitations to this method: (1) there are only ~150 BBLs with measured RRVs in the library, and (2) the current one-pot method can only synthesize small oligosaccharides due to RRV ordering requirements. To overcome the first limitation, we proposed the concept of "virtual BBLs". More than 50,000 virtually-generated BBLs, whose RRVs are predicted by machine learning (with >0.97 PCC accuracy), provide more synthetic solutions for chemists. To overcome the second limitation, we proposed the concept of "hierarchical one-pot synthesis" by recursively composing fragments using a one-pot approach and treating fragments as new BBLs in the library. Our new program, Auto-CHO, provides synthetic solutions for complex glycans by multiple hierarchical one-pot operations. This research has made great strides in glycan synthesis by computer simulation.

Figure 1: The one-pot synthesis illustration of Globo-H by Auto-CHO.

Bioinformatics for Proteomics. Studying proteomics is important because proteins perform various functions in cells and are targets for drugs to tackle disease. Mass Spectrometry (MS) has become the predominant technology for large-scale proteomic research. The main tasks of MS-based proteomics analysis are protein identification and quantitation. We have developed computational methods and implemented software tools for these two tasks. First, analyses of MS data involve a series of steps, even when using the popular TransProteomic Pipeline (TPP). To reduce the need for manual interventions to launch each step, we have developed a software tool, called WinProphet, which seamlessly integrates with TPP and other external command-line programs to allow users to configure, manage, and automatically execute pipelines through graphical interfaces. The constructed pipelines can be exported as XML files together with all of their parameter settings for reusability and portability. Second, isobaric-labeling techniques, such as the TMT 10plex reagent set, are being widely adopted in MS experiments to perform relative quantitation of proteins in multiple samples, e.g., tumor and adjacent normal tissues. We have developed a new tool called Multi-Q 2 (totally different from the previous version, Multi-Q) for isobaric-labeling quantitation analysis that exhibits improved quantitation accuracy and coverage. In addition, we have also conducted many analyses on different normalization schemes and other settings to ensure better quantitation accuracy.

The Human Proteome Project (HPP) organized by the Human Proteome Organization aims to characterize the entire human proteome. Its main goal in recent years has been focused on detecting missing proteins, i.e., those that have not been experimentally detected at the protein level. In recent years, we have been conducting research on HPP. First, we explored, from the informatics perspective, the challenges of why missing proteins are hard to detect from MS experiments. Second, since rigorous HPP data interpretation guidelines must be adopted to claim identification of a missing protein, we investigated the effect of isobaric substitution on missing protein identification. Third, given that unique in silico-digested peptides are crucial to identify a missing protein, we developed a web server called iHPDM, which contains a comprehensive proteolytic peptide database constructed from human proteins digested by 15 protease combinations to facilitate selection of proteases used in MS experiments. In addition, we have also helped collaborators within Academia Sinica to analyze their MS data and identify missing proteins according to HPP data interpretation guidelines.

Bioinformatics for Proteogenomics. Protein-coding region variations, including single amino acid variations (SAVs), indels, and alternative splicing junctions, may cause or have been linked to particular cancers. For example, SAV L858R in Epidermal Growth Factor Receptor has been observed at the genomic level in lung cancer patients in Taiwan. Proteogenomics research that can validate such variations at the protein level by means of MS experiments is receiving increased attention. However, such research must surmount two main challenges. First, sufficient SAV-harboring protein sequences must be generated to construct a customized database for identifying SAV variant peptides from MS data. Second, even when variant peptides have been identified from MS data, it is necessary to validate them, for instance, to check whether they can be obtained from known peptides by isobaric substitution. To tackle the first issue, we formulated the problem of generating a minimized number of SAV-harboring protein sequences containing all possible combinations of SAVs as the classical set covering problem, and proposed an efficient algorithm, called MinProtMaxVP, to solve it. In the future, we plan to develop a software tool for generating a customized target-decoy database for variant peptide identification based on MinProtMaxVP and appropriate decoy generation methods. To tackle the second issue, we have investigated the effect of isobaric substitution on variant peptide identification, in addition to missing protein identification, and have also proposed a so-called LeTE-fusion pipeline to evaluate the possibility of detecting variant peptides. In the future, we will further propose several methods for data validation to ensure rigorous identification of variant peptides. Our results can hopefully be applied to research on proteogenomic characterization of different cancers.
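The RRV-ordering constraint behind programmable one-pot synthesis can be sketched as a simple feasibility check. This is a deliberately simplified model: the strictly decreasing rule and the min_gap parameter are illustrative stand-ins for the actual chemistry-driven constraints used by Optimer/Auto-CHO, and the RRV values are invented.

```python
def one_pot_feasible(rrvs, min_gap=1.0):
    """Simplified feasibility check for a one-pot synthesis plan:
    building blocks must be added in strictly decreasing relative
    reactivity value (RRV) order, with a minimum gap between
    consecutive blocks (min_gap is an illustrative parameter, not a
    value from the Auto-CHO work)."""
    return all(a - b >= min_gap for a, b in zip(rrvs, rrvs[1:]))

# The most reactive block goes first; an out-of-order plan is rejected.
assert one_pot_feasible([120.0, 45.0, 6.2])
assert not one_pot_feasible([120.0, 119.5, 6.2])
```

Under this toy model, "hierarchical one-pot synthesis" corresponds to running the check per fragment and then treating each completed fragment as a new block with its own predicted RRV.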
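For background, the classical greedy approximation for the set covering problem, to which the SAV-harboring database-generation problem is reduced, looks as follows. This generic sketch is not necessarily the exact MinProtMaxVP procedure; the peptide and protein names are invented for illustration.

```python
def greedy_set_cover(universe, subsets):
    """Classical greedy set-cover approximation: repeatedly pick the
    subset that covers the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & set(s)))
        if not uncovered & set(best):
            break  # remaining elements cannot be covered
        chosen.append(best)
        uncovered -= set(best)
    return chosen

# Toy instance: variant peptides as elements, candidate SAV-harboring
# protein sequences as the sets that contain them.
peptides = {"p1", "p2", "p3", "p4"}
proteins = [("p1", "p2", "p3"), ("p3", "p4"), ("p4",)]
cover = greedy_set_cover(peptides, proteins)  # two sequences suffice
```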


Computer System Laboratory

Research Faculty: Jan-Jan Wu (Chair, Research Fellow), Chien-Min Wang (Associate Research Fellow), PeiZong Lee (Research Fellow), Yuan-Hao Chang (Research Fellow)

The Computer Systems Lab was established in 2009. Its primary research areas include binary translation, compiler and parallel architecture, deep learning with non-volatile memories, and one-memory computer systems.

I. Auto-parallelism with Dynamic Binary Translation

Parallelization is critical for multicore computing and cloud computing. Hardware manufacturers have adopted many distinct strategies to improve parallelism in microprocessor design. These strategies include multi-cores, many-cores, GPU, GPGPU, and SIMD (single instruction, multiple data), among others. However, these parallel architectures have very different parallel execution models, and several issues arise when migrating applications from one system to another: (1) application developers have to rewrite programs based on the target parallel model, increasing 'time to market'; (2) legacy applications become under-optimized due to under-utilization of parallelism in the target hardware, significantly diminishing potential performance gain; and (3) execution migration among heterogeneous architectures is difficult. To overcome these problems, we have developed an efficient and retargetable dynamic binary translator (DBT) to transparently transform application binaries among different parallel execution models. In its current iteration, the DBT dynamically transforms binaries of short-SIMD loops into equivalent long-SIMD ones, thereby exploiting the wider SIMD lanes of the hosts. We have shown that SIMD transformation from ARM NEON to x86 AVX2 can improve performance by 45% for a collection of applications, while doubling the parallelism factor. We plan to extend the DBT system by supporting more parallel architectures and execution models.
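The loop-widening transformation can be sketched at the trip-count level. The function below is an illustrative model, not the DBT's actual code: it shows how a guest loop consuming 4 elements per short-SIMD iteration (e.g., 128-bit NEON) is rewritten to consume 8 per wide-SIMD iteration (e.g., 256-bit AVX2), with short-SIMD and scalar epilogues handling the remainder.

```python
def widen_simd_loop(n_elems, short_lanes=4, wide_lanes=8):
    """Model of the DBT's loop-rewriting idea: rewrite a loop that
    consumes short_lanes elements per iteration into one that consumes
    wide_lanes per iteration, roughly halving the trip count. Returns
    (wide_iters, short_iters, scalar_iters)."""
    wide_iters, rem = divmod(n_elems, wide_lanes)
    # The remainder first falls back to the original short-SIMD width,
    # then to scalar iterations for what is left.
    short_iters, scalar_iters = divmod(rem, short_lanes)
    return wide_iters, short_iters, scalar_iters

w, s, sc = widen_simd_loop(100)
assert 8 * w + 4 * s + sc == 100  # every element is still processed
```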
Figure 1 : Dynamic Binary Translation and parallelism optimization system.

II. AI Compiler for Servers and Embedded Systems

In recent years, deep learning has become a rapidly growing trend in big data analysis and has been successfully applied to various fields. Many deep learning models (e.g. CNN, RNN, LSTM, and GAN) have been proven to work very well for recognition of images, natural language, etc. For this AI domain, we need a solution to efficiently run such deep learning models on a wide diversity of computing architectures. High-end servers may be equipped with powerful computing devices, e.g., a combination of high-end CPUs, GPUs, FPGAs, and AI accelerators. Small embedded systems may have a low-end CPU or DSP, and small memory capacity. Different compilation strategies are required to achieve optimal performance for any configuration of computing device and deep learning model.

Figure 2 : AI compiler framework.

Hence, an optimizing compiler framework is needed for deep learning, i.e., an AI compiler. The function of an AI compiler is to translate deep learning architectures and optimize performance according to the capability of the hardware architecture. That means developing a single optimizing compiler framework that can compile for a variety of hardware architectures. Performance portability and extensibility are also key issues in the design of such an AI compiler.

III. Deep Learning with Non-volatile Memories

Deep learning training usually consumes considerable DRAM space, but DRAM has high power leakage and unit costs. Due to these issues with DRAM, we propose to enable training of neural networks (NNs) on non-volatile-memory (NVM)-based computer systems. As NVMs exhibit low leakage power and high density, we propose to adopt NVM as the main memory to replace DRAM in NN training, which requires huge amounts of main memory. However, NVMs have asymmetric read/write features and limited write endurance. To overcome those limitations for NN training, we are analyzing NN data flow and data content during NN training. By utilizing the derived characteristics of data flow and data content, we propose to apply approximate computing technologies, including approximate memory and storage techniques, to improve NN training performance without losing training accuracy by reducing the cost of each write and trading some limited precision of written values. The results of this research could effectively resolve major problems with NN training without having to change existing computer architectures. Moreover, our research will create a whole new paradigm for computer system design to support deep learning.

Figure 3 : Approximate computing for deep learning training with non-volatile memories.

IV. One-Memory Computer Systems

Given that data-intensive applications are increasingly being run on computer systems, it has become critically important to improve I/O efficiency. In response to the low access performance of existing storage devices, enlarging DRAM capacity to support more data-intensive applications introduces greater energy consumption and higher hardware costs. Therefore, we propose to adopt non-volatile memory (NVM) as both main memory and storage. Replacing DRAM with NVM will reduce the energy consumption of main memory and, by replacing disk drives, we can enhance I/O performance because of NVM's superior byte-addressability, non-volatility, capacity for scalability, and high access performance. To achieve this goal, we are investigating the designs of existing memory management, storage, and file systems in different operating systems, e.g., Linux. We anticipate designing a new in-memory file system that resides in NVM-based storage, the capacity of which can be adapted to user data size and the access behavior of which is optimized for NVM. Meanwhile, the page cache system for storage devices can be altered to: (1) minimize data movement between memory and storage; and (2) maximize NVM lifetime by utilizing the fact that memory and storage reside in the same NVM device and have the same address space.

Figure 4 : One-memory architecture.
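The write-reduction idea behind approximate NVM writes can be made concrete with a small sketch. The snippet below is a hypothetical illustration, not the project's actual technique: it zeroes the low-order mantissa bits of float32 values before they are "written", which bounds the precision loss while reducing the number of set bits that would need to be programmed into memory cells.

```python
import numpy as np

def truncate_precision(x, drop_bits):
    """Zero the lowest `drop_bits` mantissa bits of float32 values before a
    write, trading a bounded precision loss for fewer programmed bits."""
    raw = x.astype(np.float32).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF >> drop_bits) << drop_bits)
    return (raw & mask).view(np.float32)

def bits_set(x):
    """Count the set bits across an array's float32 representations."""
    raw = x.astype(np.float32).view(np.uint32)
    return int(sum(bin(int(v)).count("1") for v in raw))

w = np.array([0.1234567, -3.1415926, 2.718281], dtype=np.float32)
w_approx = truncate_precision(w, 12)          # keep 11 of 23 mantissa bits
assert np.allclose(w, w_approx, rtol=1e-3)    # accuracy nearly preserved
assert bits_set(w_approx) <= bits_set(w)      # never more bits to program
```

Dropping 12 of the 23 mantissa bits bounds the relative error by about 2^-11, which is the kind of precision/energy trade-off the approximate-write approach exploits.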

資料探勘與機器學習實驗室

研究人員
楊得年 / 召集人 研究員
張原豪 研究員
陳孟彰 研究員
葉彌妍 研究員
廖弘源 特聘研究員

在此資料爆炸的時代,各類型的資料(如:感測器、軌跡、交易、多媒體、社群網路、網頁瀏覽紀錄等)正以越來越快的速度產生。隨著硬體與網路豐富及低成本的特性,探究所有運用這些資料的方法以增進現有應用,或是探討新技術解決難題的時機從未比現在更佳。資料探勘與機器學習實驗室組成的主要目標,就是開啟創新研究,並發展下列卓越之技術:(1) 有效收集、表示、儲存、處理與分析巨量資料;(2) 探究資料探勘技術以快速且有效地發現各式各樣資料中有價值的知識。目前本團隊的研究聚焦於下列領域:一、機器學習於網路廣告即時競價;二、社群網路分析與查詢處理;三、寫入受限記憶體與儲存裝置之高效資料管理系統;四、複合式類神經網路:理論與 PM2.5 預測的應用。在這些研究領域中,我們正進行下列研究:

一、機器學習於網路廣告即時競價

網路展示型廣告已由過去固定內容固定價位的銷售方式,演變成為現在根據不同瀏覽人次由競價結果動態決定內容:在使用者造訪網址且網頁完整呈現的短時間內,有興趣的廣告主們透過程式化交易競標爭取廣告曝光。在即時競價交易環境中,廣告主要能快速且精準地跨媒體購買廣告,必需仰賴好的預測模型自動出價。本實驗室目標協助廣告主在有限且不完整的歷史競價資料中,設計出合適的機器學習方法,以建立有效的預測模型來精準地預測廣告點擊率,進而能在有限的廣告成本下決定適切的出價策略。

二、社群網路分析與查詢處理

在新興的平台(如:Facebook Live 與 Twitch)中,直播串流的普及已促使新的影片內容與社群團體的急速增長,除了讓使用者得以即時開設各種主題(如:新聞、運動與競賽)的串流頻道外,直播中最重要的兩項特徵有 (1) 多重串流:觀眾可同時觀看活動實況的多個串流,與 (2) 即時的社群互動:觀眾可彼此聊天互動,也可和直播主聊天,或贈送虛擬禮物。然而,現有選擇直播頻道的方法仍聚焦於滿足使用者個人喜好,並未考慮供應觀眾間的即時社群互動以及提供多樣化的直播頻道內容。故我們設計一套「社群感知多樣化喜好直播頻道查詢(SDSQ)」,其同時選出一組多樣化且受歡迎的直播頻道及一群彼此間社群關係緊密的觀眾。我們證明 SDSQ 為 NP-hard,且在任意比例內不可近似,並設計一套 2 倍近似演算法 SDSSel,可保證其誤差上限。在 Twitch 上的使用者個案研究驗證了 SDSQ 的必要性及 SDSSel 的實用性。社群網路的普及度急遽上升卻造成了不當使用,日益增加的社群網路精神障礙(如:網路關係成癮、資訊過量負荷與強迫上網行為等)已在近期受到關注。這些精神障礙的症狀通常是被動地被發現的,這使得臨床介入治療的時機常被延誤,因此,探勘線上網路行為可提供及早並主動探測到社群網路精神障礙的機會。而探測社群網路精神障礙是具挑戰性的,因為從線上社群活動記錄並無法直接觀察到使用者的心理狀態。我們提出一套機器學習的框架──社群網路精神障礙探測(SNMDD)以及新的張量模型(STM),以協助心理治療專家及早減緩社群網站成癮患者的症狀。我們提出一套動態貼文抽換與輔助系統(N3S)的新框架,以動態貼文過濾及傳遞的方式作為社群網站成癮治療的輔助,其包括成癮程度模型(ADM)以測量動態貼文對不同使用者的成癮程度,以及一個新的最佳化問題,以在不犧牲使用者喜好的情況下,將行為治療的效果最大化。我們亦

設計一套具有理論效能保證的隨機演算法。大規模使用者個案研究及實驗結果皆驗證了 SNMDD 及 N3S 的效果。

三、寫入受限記憶體與儲存裝置之高效資料管理系統

近年來,由於大數據應用的快速發展,數位資料總量上升的速度急速加劇。為了跟上這個腳步,一些新型態的記憶體與儲存體也應運而生。例如非揮發性記憶體可降低主記憶體的成本,同時疊瓦式硬碟能進一步降低磁碟機的單位容量成本。基於這樣的發展主軸,我們研究如何在非揮發性記憶體上設計新的資料排序演算法,達成最小化寫入量,同時達成極低的讀取時間複雜度。此外,我們進一步研究如何能在具有循序寫入限制的疊瓦式硬碟上,設計一個高效的資料索引系統,此系統能在索引改變時,快速並有效地進行索引更新。這些設計經過一系列的系統實驗證實,我們的研究確實具有高度的執行效率。

四、複合式類神經網路:理論與 PM2.5 預測的應用

本研究針對「複合式類神經網路」探討其架構以及效能。複合式類神經網路是將多個「預學(pre-trained)」類神經網路模型以及「未學(non-instantiated)」神經網路模型視為元件,依照有向且無迴圈的有根(rooted)圖形的結構加以組合而成。所謂的預學類神經網路是指針對特定函數或具體任務精心訓練的模型;未學類神經網路則相反,是指尚未經過訓練的模型。雖然一般普遍相信複合式類神經網路的表現應該優於其中任何一個元件,但之前的文獻裡並沒有具體的理論保證。我們在此提供了複合式網路的建構方法,也從理論方面去證明複合式類神經網路高機率可以表現得比網路中的任何一個預學元件要好。此外,如果將一個額外的預學元件加入既存的複合式網路,則新的複合式類神經網路仍然高機率比其中的預學元件有較好的表現。在應用方面,為了在現實世界檢視前述理論上的結果,我們針對 PM2.5 預測這個複雜的問題設計了幾個實驗。實驗結果顯示我們實作出來的複合式網路,其預測的誤差值往往會小於任何一個預學元件,與理論推導的結果一致,同時也顯示了複合式網路的優點。

Data Mining and Machine Learning Laboratory

Research Faculty
De-Nian Yang / Chair, Research Fellow
Yuan-Hao Chang, Research Fellow
Meng-Chang Chen, Research Fellow
Mi-Yen Yeh, Research Fellow
Mark Liao, Distinguished Research Fellow

In this era of data availability, various types of data (e.g., sensory, trajectory, transactions, multimedia, social networks, Web browsing logs, etc.) are being generated at a monumental rate. Due to the abundant and inexpensive nature of hardware and networks, the time has never been better to explore all possible means of utilizing such data to enhance existing applications or to investigate new technologies for solving difficult problems. The Data Mining and Machine Learning Laboratory was formed with the main objectives of initiating innovative research and strengthening scientific and technological excellence in: (1) effective collection, representation, storage, processing, and analysis of massive data; and (2) exploring data mining technologies to efficiently and effectively discover valuable knowledge within various types of data. Currently, the research of this group focuses on the following areas: I. Modeling and Prediction for Real-Time Bidding on Online Display Advertising; II. Social Network Analysis and Query Processing; III. Efficient Data Management for Large-scale Computer Systems with Write-constrained Memory and Storage Devices; and IV. Composite Neural Network: Theory and Its Application to PM2.5 Prediction.

I. Modeling and Prediction for Real-Time Bidding on Online Display Advertising

Online display advertising is now programmatic. The ad impression displayed for each website visitor may be different and is dynamically determined by a mechanism called Real-Time Bidding (RTB), which brokers interactions between publishers and advertisers.
In the RTB environment, advertisers rely heavily on having a good prediction model for ad click-through rates to effectively and efficiently target potential customers and offer reasonable bids. We aim to design an appropriate learning method from historical bidding data with incomplete labels so that advertisers can have an effective model to accurately predict ad click-through rates and establish a good bidding strategy under budgetary constraints.

II. Social Network Analysis and Query Processing

The popularity of live streaming has led to explosive growth in new video content and social communities on emerging platforms such as Facebook Live and Twitch. In addition to allowing users to create streaming channels on various topics in real-time (e.g., news, sports, games), these new platforms support two unique features: (1) multi-streaming, with viewers on these platforms being able to follow multiple streams of live events simultaneously; and (2) live interactions, enabling viewers to chat with each other and the broadcaster and send virtual gifts to the broadcaster in real-time as events unfold. However, existing approaches for selecting live streaming channels still focus on satisfying the individual preferences of users, without considering the need to accommodate real-time social interactions among viewers and to diversify the content of streams. Therefore, we are formulating a new Social-aware Diverse and Preferred Live Streaming Channel Query (SDSQ) that jointly selects a set of diverse and preferred live streaming channels and a group of socially tight-knit viewers. We have demonstrated that SDSQ is NP-hard and inapproximable within any factor, and have designed SDSSel, a 2-approximation algorithm with a guaranteed error bound. We are performing a user study on Twitch to validate the need for SDSQ and the usefulness of SDSSel. Moreover, the explosive growth in popularity of social networking leads to problematic usage.
An increasing number of social network mental disorders (SNMDs)―such as Cyber-Relationship Addiction, Information Overload, and Net Compulsion―have been recently documented. Currently, symptoms of these mental disorders are usually observed passively, resulting in delayed clinical intervention. We argue that mining online social behaviors provides an opportunity to actively identify SNMDs at an early stage. It is

challenging to detect SNMDs because mental status cannot be observed directly from online social activity logs. Our approach, novel and innovative with regard to SNMD detection, does not rely on self-revealing of those mental factors via questionnaires. Instead, we are proposing a machine learning framework, namely Social Network Mental Disorder Detection (SNMDD), with a new SNMD-based Tensor Model (STM). We hope to assist mental healthcare professionals to alleviate the symptoms of users with Social Network Addictions (SNAs) in the early stages by means of behavioral therapy. We are proposing a novel framework, called Newsfeed Substituting and Supporting System (N3S), for newsfeed filtering and dissemination in support of SNA interventions. We first propose the Additive Degree Model (ADM) to measure addictiveness of newsfeeds for different users. We can then formulate a new optimization problem aimed at maximizing the efficacy of behavioral therapy without sacrificing user preferences. Accordingly, we will design a randomized algorithm with a theoretical bound. The results of large-scale user studies and experiments are demonstrating the efficacy of both SNMDD and N3S.

III. Efficient Data Management for Large-scale Computer Systems with Write-constrained Memory and Storage Devices

In recent years, digital data volumes have rapidly been growing larger due to emerging big data applications. Current memory and storage capacities cannot balance cost and performance to meet the requirements for data storage. Some novel memory technologies have been developed for large-scale data applications. We are considering a novel computer architecture that applies: (1) non-volatile random access memory (NVRAM) as the main memory for reducing power consumption of the system; and (2) shingled magnetic recording (SMR) drives as storage to enhance capacity. However, both of these new devices have the write constraint issue, and they may significantly affect performance and endurance. Therefore, to efficiently manage massive data over such architectures, we are focusing on index management and have proposed methodologies for mitigating the overheads of two different levels of the memory hierarchy. First, to accelerate the index search operation, we have proposed an NVRAM-friendly sorting algorithm called B*-sort. B*-sort not only enhances the performance of the sorting algorithm on NVRAM, but also maximizes lifetime while conducting the sorting algorithm. To mitigate the overhead for managing indexes on SMR drives, we have also developed a sequential-write-constrained B+tree index scheme, namely SW-B+tree. Both of these systems have been evaluated by a series of experiments and the results have demonstrated the effectiveness of the designs.

IV. Composite Neural Network: Theory and Its Application to PM2.5 Prediction

This work investigates the framework and performance issues of composite neural networks, which are composed of a collection of pre-trained and non-instantiated neural network models that are connected as a rooted directed acyclic graph to solve complex problems. A pre-trained neural network model is generally well trained and targeted to approximate a specific function. Despite a general belief that a composite neural network may perform better than a single component, overall performance characteristics are not clear. In this work, we have constructed the framework of a composite network and have shown that it performs better than any of its pre-trained components with a high probability bound. In addition, if an extra pre-trained component is added to a composite network, overall performance is typically not degraded. In our study, we are exploring a complex application, i.e., PM2.5 prediction, to validate composite network theory. In our empirical evaluations of PM2.5 prediction, our composite neural network models support the proposed theory and perform better than other machine learning models, demonstrating the advantages of our proposed framework.
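The composite-network claim can be illustrated with a toy example. The sketch below is a deliberately simplified assumption, not the paper's construction: two imperfect "pre-trained" components are combined with least-squares weights, and by construction the fitted composite can be no worse than either component on the training objective, mirroring the "no worse than any pre-trained component" intuition.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + 0.5 * x            # target function to approximate

# Two imperfect "pre-trained" components, each capturing part of y.
f1 = np.sin(3 * x)                     # component 1 (misses the linear term)
f2 = 0.6 * x                           # component 2 (misses the sine term)

# Composite layer: least-squares weights over component outputs plus a bias.
A = np.column_stack([f1, f2, np.ones_like(x)])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
composite = A @ w

mse = lambda p: float(np.mean((p - y) ** 2))
# Each single component lies in the span of the regression, so the fitted
# composite cannot have a larger training error than either one alone.
assert mse(composite) <= mse(f1) + 1e-12
assert mse(composite) <= mse(f2) + 1e-12
```

The formal result in the brochure is stronger (a high-probability guarantee for DAG-structured composites on unseen data); this snippet only shows the basic mechanism of weighting components so the composite dominates each one.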

八大 實驗 室 多媒體技術實驗室 Research Laboratories 研究人員 多媒體技術與生物科技及奈米科技,被公認為是二十一世紀最具影響力的科技產業。在過 去二十多年來,我們已見證了多媒體相關技術對於日常生活中的多層面影響與改善。多媒 蘇 黎 / 召集人 體科技的應用極廣,促使了包含視訊、音樂、立體動畫、影像、聲音等技術上的進步,並 衍生出更多科學研究的持續挑戰。多媒體技術實驗室成員的主要研究方向,包括多媒體訊 助研究員 號處理、電腦視覺和機器學習。每位研究人員除了專注於個人有興趣的研究專題外,也透 過共同參與大型計劃,以期在重要研究議題上能有關鍵性突破。目前本實驗室正在執行的 王新民 大型合作計畫共有兩項:一、結合視訊及音訊之多媒體應用;二、深度學習於多媒體資料 處理之研發及應用。茲分述如下: 研究員 一、結合視訊及音訊之多媒體應用: 呂俊賢 我們的研究重點,在於開發結合視訊及音訊特徵的多媒體技術和應用。更詳細地來 研究員 說,我們探討視訊混搭 (Video mashup) 的問題:鑑於手持式裝置越來越流行,人們可 以輕易在演唱會現場錄製一段現場演唱視訊,並上傳到 YouTube 或者 Vimeo 與朋友 林仁俊 共享。觀賞這些手持裝置在不同位置所錄得的視訊片段,往往感覺吃力且不愉快,原 因是眾多觀眾在錄影時並未事先協調好誰先錄,誰取那一段來錄等細節。因此,放在 助研究員 YouTube 上的眾多視訊片段難免重複播放或短少某些片段。為了能讓未到演唱會現場 的觀眾有愉快的「再次欣賞」機會,吾人需要一個有系統的「整合」演算法則。 陳祝嵩 演唱會視訊混搭面臨的挑戰有四個方面:(1) 要使混搭後的視訊扣人心弦,我們必須考 研究員 慮音樂與鏡頭敘事的關聯性。通常,導演會根據歌曲的曲風、節奏等因素,適當地切 換鏡頭,如距離長短、拍攝角度等,來彰顯歌曲所欲傳達的情感以及想法。因此,如 黃文良 何從音樂預測導演會使用的鏡頭類型,是一項艱鉅的挑戰;(2) 眾多由不同角度拍攝之 視訊片段如何從「視覺」或「聽覺」上找出其先後順序,並有效加以銜接,也是一挑 研究員 戰;(3) 觀眾所拍攝的演唱會視訊片段多半由手機所錄製,為了提升聆賞者在視聽上的 體驗,如何增強「畫面」及「聲音」的品質,也是一項挑戰;(4) 為了使視訊混搭系統 廖弘源 更容易落實,如何在不降低效能的前提下,輕量化類神經網路模型,也是一項艱鉅的 挑戰。舉例而言,要如何從音樂自動預測導演的鏡頭敘事方式,其最難的部分是音樂 特聘研究員 和鏡頭之間的關係很難被發現並建模。受到電影分鏡概念的啟發,我們發展了一套機 率式集成模型,使其能夠整合各種時間解析度的信息,編碼音樂以及鏡頭之間的關係, 劉庭祿 如下圖所示。另一方面,為了使模型更有效率,我們進一步提出一種模型蒸餾的技術, 通過與集成模型的合作訓練來學習出一個輕量級的分類器。我們相信考量音樂與鏡頭 研究員 之間的關聯性來混搭視訊是具吸引力的,因為音樂和視覺敘事被自然地融合在一起。 而輕量化的模型也將有希望地被落實在我們的生活中。 多 媒 二、深度學習於多媒體資料處理之研發及應用: 體 科 深度學習在近幾年是一火紅的研究方向。多媒體資料處理領域有許多經典問題及新興 技 議題。深度學習已被證明在樣式比對 (pattern matching) 極具效用,既存的多媒體資料 形 處理議題有一些原本在辨識上效果並不好,我們打算引入深 度學習來處理一些既存的 塑 議題,希望能大大提升原本不易突破的瓶頸。在新研究議題方面,我們一方面解決較 我 困難,例如與視訊、邊緣運算 (edge computing) 相關的議題,另一方面也進行深度學 們 習理論研究,發展新的機器學習架構。具體研究內容如下: 的 未 1. 半線性化 (un-rectifying) 神經網路技術:在神經網路中最難處理的部分是分析非線 來 120

Brochure 2020 圖一:所提出的機率式集成模型,用於音樂到鏡頭敘事的翻譯。 STFT 代表短時距傅立葉轉換,而 VQ 代表向量量化。 性的激活函數。半線性化的技術,將分段連續的 要。我們發展了濃縮化 (shrink) 與擴張化 (expand) 激活函數轉換成與輸入相關的方程及限制函數。 的交替步驟,能夠在避免任務遺忘的情況下,發 將這個技術運用到非線性神經網路的分析,可以 展出實現多工任務的緊緻模型。達成在學習新任 將神經網路轉成一系列跟輸入有關的方程式及限 務的同時,持續完整地保有舊任務的功能。我們 方法的特色是可完全避免遺忘,並在維持深度模 制函數。雖然輸入的數目是連續不可數的,但是 型緊緻性 (compactness) 的狀況下來進行擴充。 限制函數的數目是有限的,所以可以利用這個技 術可以將神經網路優化的問題轉換成為一般的有 3. 群眾行為分析:群眾行為分析牽涉許多人在時間 軸上的位置變化,要利用深度學習網路原本慣用 限制函數的優化問題。利用這個技術與分析神經 的架構去分析群眾並非易事,另外,在實際的應 網路,可得到以下結論:(1) 一個神經網路將輸入 用情境中往往需要即時偵測資訊,需要以執行邊 空間分割成一堆多邊形的組合,每增加一層神經 緣運算搭配輕量化的神經網路架構。我們打算設 網路會將原多邊形的組合切得更細,因此在輸入 計適當的深層網路去分析動態群眾行為。 空間的切割上,神經網路的行為很像分類樹;(2) 在每一個多邊形的定義域上計算一個仿射變換, 4. 音樂資訊檢索:音樂訊號往往包涵多種樂器、階 神經網路上所有的仿射變換的線性的部分都可以 層式的拍號結構和混雜的曲風,通常皆以多重標 用一個跟輸入有關的矩陣表示;(3) 假如以輸出 記描述。因此,音樂資訊檢索問題可以利用多重 兩點的距離與輸入兩點的距離放大率來度量神經 任務學習 (multi-task learning) 架構之下的深度神 經網路來處理。此方法已經被證實在音樂和弦辨 網路的穩定性的話,一個神經網路若要穩定,那 識的問題上(即同時處理和弦名稱和根音名稱) 麼它的連結的數目便要稀疏或者連結的強度要夠 有其實用性。此外,我們引入和弦斷詞的概念, 首先訓練神經網路辨識和弦單元的位置,繼而在 小。目前研究的重點,是在深化半線性化技術與 和弦的層級上模擬和聲進行,有效提昇神經網路 神經網路上的分析與應用,包含神經網路的學習, 對於和弦辨識與和聲分析的表現力。我們正在進 包含發展可逆的神經網路及有迴圈的神經網路。 行的研究還包括用於視聽語音增強和用戶識別的 多模式深度學習。未來我們將開發於視訊中結合 2. 永續深度學習:監督式深度學習主要利用已蒐集 影像與自然語言,挖掘出人物相關資訊的方法, 完成的資料集來進行學習。這面臨一個困難:資 以利於更高階的語意檢索。 料並非一次性即可蒐集與建構完成,而是分批獲 得;而學習的任務或技能也並非一次可達成,而 是 在 過 去 以 某 些 資 料 集 學 習 完 相 對 應 任 務 後, 接 著 再 以 新 的 資 料 集 學 習 新 的 任 務。 因 此 永 續 性 (continuous) 或終身 (life-long) 學習變得益加重 121

Multimedia Technologies Laboratory

Research Faculty
Li Su / Chair, Assistant Research Fellow
Hsin-Min Wang, Research Fellow
Chun-Shien Lu, Research Fellow
Jen-Chun Lin, Assistant Research Fellow
Chu-Song Chen, Research Fellow
Wen-Liang Hwang, Research Fellow
Mark Liao, Distinguished Research Fellow
Tyng-Luh Liu, Research Fellow

Multimedia shapes our future

Multimedia technology is considered one of the three most promising industries of the twenty-first century, along with biotechnology and nanotechnology. We have all witnessed how multimedia technology has influenced various aspects of our daily lives over the past two decades. Given its wide spectrum of applications, there is a constant challenge to advance a broad range of multimedia techniques, including those related to music, video, images, text, and 3D animation.

The main research interests of the Multimedia Technology Group members at IIS include multimedia signal processing, computer vision and machine learning. Apart from focusing on their own research interests, joint research activity by our group is best characterized by two ongoing major projects: I. Integration of Video and Audio Intelligence for Multimedia Applications; and II. Deep Learning for Multimedia Information Processing.

I. Integration of Video and Audio Intelligence for Multimedia Applications

This project explores new multimedia techniques and applications requiring fusion of video and audio intelligence. Through the prevalence of mobile devices, people can now easily film a live concert and create video clips of specific parts. Popular websites such as YouTube or Vimeo have further boosted this phenomenon, as data sharing becomes easier. Videos of this kind, recorded by audience members at different locations, can provide those who could not attend the event the opportunity to enjoy the same performance. However, the viewing experience is usually unpleasant since video-capture is not coordinated and incompleteness or redundancy frequently occurs. To ensure a pleasant viewing/listening experience, effective fusing of such videos through a smooth "decoration" process to generate a single near-professional audio/visual stream is indispensable.

Video mashup, an emerging research topic in multimedia, can satisfy the above-mentioned needs. A successful mashup process has to deal with all videos captured at different locations and convert them into a complete, non-overlapping, seamless, and high-quality outcome. To make a concert video mashup process successful, we propose to address the following issues: (1) we are exploring the relationship between music and visual storytelling of shots to make the mashup outcome strike a deep chord; (2) we are solving the problem of video clip order from visual and/or auditory aspects; (3) we are studying generative models such as GAN in depth to improve video/audio quality; and (4) we are investigating how to improve model learning to be compact yet accurate, thereby making mashup systems easier to deploy. For example, how to automatically generate a near-professional shot-type sequence from music is a challenging but desirable task. The hardest element of doing that is the difficulty in establishing and modeling the relationship between music and shots. Motivated by the concept of storyboards, we have created a probabilistic-based ensemble model that integrates information at different temporal resolutions to encode the relationship between music and shots, as shown in the figure below. To make the model efficient, we are further proposing a model distillation technique for learning a lightweight classifier through collaborative training with the ensemble model. A video composed in this way is attractive, as the music and visual storytelling blend naturally. We anticipate that the system could also be used in practical applications.

II. Deep Learning for Multimedia Information Processing

Deep learning-related research has attracted considerable attention in recent years owing to its effectiveness for solving various challenging tasks in computer science. In the field of Multimedia Information Processing, deep learning opens up brand new opportunities for both conventional and modern research topics. In the next few years, we aim to rigorously re-formulate or better solve emerging multimedia-related problems within the context of deep learning. These efforts can be highlighted by the following collaborative projects:

1. Un-rectified neural networks: The main obstacle to analyzing deep neural networks lies in the composition of non-linear activation functions. We are establishing an "un-rectifying" technique that replaces point-wise, piece-wise linear activation functions in neural networks with a finite number

Figure 1 : Illustration of the proposed probabilistic-based ensemble model for translation of music to visual storytelling using shots. Here, STFT stands for short-time Fourier transform, and VQ stands for vector quantization.

of data-dependent equations and constraints. Applying the un-rectifying procedure layer by layer in a non-linear network can generate a representation comprising data-dependent equations over those constraints. As a result, optimization problems on networks can be re-framed as data-dependent and constrained optimization problems in which the number of constraints is finite. By applying the technique to L-layer networks comprised of ReLU and Max-pooling activation functions, we show that: 1) the polytope domains of the affine mappings induced by representations partition the input space, and that partitioning is refined with increasing numbers of layers; 2) the linear parts of the affine mappings allow atomic decompositions as the matrix products W_L C W_1, where the data-dependent matrix C is induced from the un-rectifying process, and W_L and W_1 are the weight matrices; and 3) a Lipschitz bound can be estimated for the affine linear transforms of a network and used to characterize its stability, rendering a network asymptotically stable if the Lipschitz constant is a bounded function of the network depth L. Accordingly, there are connections between the stability of the network and the sparse or compressible weight distributions among its layers. Currently, our research is focused on developing optimization algorithms for learning representations induced from un-rectifying a network, studying invertible DNN networks, and analyzing dynamic behaviors of networks containing loops.

2. Lifelong learning of deep models: Continual lifelong learning is essential to many applications. We are proposing a simple but effective approach to continual deep learning. Our approach leverages the principles of deep model compression, critical weights selection, and progressive networks expansion. By enforcing their integration in an iterative manner, we are establishing an incremental learning method that is scalable to the number of sequential tasks in a continual learning process. Our approach is easy to implement and exhibits several favorable characteristics. First, it can avoid forgetting (i.e., it learns new tasks while remembering all previous tasks). Second, it allows model expansion but maintains model compactness when handling sequential tasks.

3. Crowd behavior analysis: a crowd can stay still or move with respect to time. Moreover, when a crowd moves, it moves in a non-rigid manner or may form sub-groups of people. Therefore, effectively analyzing crowd behavior is not a trivial task. In addition, in real-world scenarios, real-time detection of crowd behavior is often required. In those cases, it is necessary to employ edge computing with a lightweight neural network architecture. We aim to propose heuristic definitions to characterize a crowd. Based on these criteria, we hope to generate various deep-learning modules to construct a complete framework for analyzing crowd behaviors, including sub-group detection, sub-group merging, and abnormal behavior detection, amongst other attributes.

4. Music information retrieval: most descriptions of music apply at multiple granularities, since music signals are constructed with multiple instruments, hierarchical meter structures and mixed genres. The problem of music information retrieval can therefore be considered in the context of deep neural networks (DNNs) with multi-task learning (MTL). MTL-based DNNs are applicable in musical chord recognition (i.e., recognizing chord and root note in parallel). In addition, we are introducing the concept of chord segmentation. First, we train the neural network to identify the position of the chord unit and then infer harmony progression at the chord level. This approach effectively improves the neural network's expressive power for chord identification and functional harmony analysis.

Other ongoing research by our group includes multi-modal deep learning for audio-visual speech enhancement and user identification. In the future, we will develop effective approaches that can discover characteristics of a movie and mine their relationships by combining image and natural language information.
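The un-rectifying idea admits a compact numerical check: for a ReLU network, each activation can be replaced by a data-dependent 0/1 diagonal matrix D(x), so that on x the network acts as the affine map W2 D(x) W1. The sketch below is a minimal, hypothetical two-layer bias-free network constructed for illustration (the brochure's analysis covers general L-layer ReLU/Max-pooling networks):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(5, 3))           # first-layer weights
W2 = rng.normal(size=(2, 5))           # second-layer weights
relu = lambda z: np.maximum(z, 0.0)

def forward(x):
    """Ordinary non-linear forward pass."""
    return W2 @ relu(W1 @ x)

def unrectified(x):
    """'Un-rectify' the ReLU: replace it by the data-dependent 0/1
    diagonal matrix D(x), so the network is the affine map W2 D(x) W1."""
    D = np.diag((W1 @ x > 0).astype(float))
    return (W2 @ D @ W1) @ x

x = rng.normal(size=3)
assert np.allclose(forward(x), unrectified(x))
```

Because D(x) is constant on each activation-pattern region, the network is an affine map on each polytope of the induced input-space partition, which is exactly the structure exploited in conclusions 1) and 2) above.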

語言與知識處理實驗室

研究人員
許聞廉 / 召集人 特聘研究員
王新民 研究員
古倫維 副研究員
馬偉雲 副研究員
蘇克毅 研究員

大量資訊以多媒體的形式在網路世界散佈。為了達到智慧型的資訊處理,知識為本的訊息處理是本計劃的核心焦點。我們正在進行三個主要的長期研究課題:知識庫的建立、自然語言理解,及知識應用,尤其是中文自然語言的處理。

一、自然語言知識庫

1. 語言知識庫的建構
我們在過去二十多年發展了中文處理基礎建設,包含標記語料庫、句結構樹資料庫、詞彙庫、中文語法、知識地圖、廣義知網、中文字構形資料庫、詞彙分析系統及語句剖析器等。我們將利用「廣義知網」與中文剖析器來自動分析並抽取網路文件中隱含的語言知識及領域訊息,構建概念知識架構並建立領域知識庫及中文詞彙知識庫。我們將連結不同的知識庫,包含維基百科的詞條與分類,共同形成一個完整的知識網 (ConceptNet) 以提高計算機推理及語言了解能力。同時,目前廣義知網的基本知識節點(義原與基本概念)已全部人工對譯到英文詞網 (WordNet),廣義知網的詞彙則利用自動化技術跟英文詞網 (WordNet) 對譯,形成一個跨語系的知識網。此外,我們也自動蒐集了一個「共現」知識庫 COPA,將許多有意義的組合:名詞-名詞、動詞-名詞、形容詞-名詞等等放入。COPA 可以幫助我們將句子中詞與詞之間的修飾關係找出。

2. 知識庫的運用
我們特別專注於如何讓模型能夠執行長且複雜的推演。我們已經成功讓知識庫為本的問答系統能夠回答需要多次推論的複雜問句。在推薦系統及圖像故事生成兩個工作中,我們除了利用知識庫裡的知識擴大對真實世界實體的了解,更進一步利用知識庫提高多樣性,有效解決深度神經網路的缺陷。在推薦系統中,我們的模型能夠提供使用者更有興趣的文章或物件;在故事生成上,則使故事更為有趣,上下文也更通順。

二、自然語言理解

我們將研究知識架構的基礎理論及細緻語意的表達模式,藉由分析近義詞的細微差別來找出細緻語意的表達方式及語意的合成機制。我們將改善並整合當下最重要的一些知識架構如詞網、知網及事件框架網,以達到較佳的整合知識表達系統。我們也將研究知識邏輯及推理與知識結構整合的完整架構並應用於自動推理。希冀在廣義知網的本體架構下,短語或句子的語意可以由詞彙語意合成。

1. 知識為本的中文語言處理技術
我們將發展強健型的中文結構剖析及語意分析系統,注重以概念為中心的中文處理技術,將利用所發展的「廣義知網」詞彙知識架構及自動抽取得到的統計、語言語法及常識訊息作為基礎知識,用於分析文件的概念結構,並發展語意自動合成技術以瞭解文件的意義。由於複雜句都是由簡單句加上各式各樣的修飾語形成,我們可以利用 COPA 發掘句子中的修飾語,因而反向將複雜句簡化成簡單句。我們將這個方法稱為「簡化還原法」,可用來進行斷詞、語音辨識與依存語法剖析。

圖一:句子剖析範例。

2. 統計準則式模型(SPBA)
適當的語意模版有助於從文句中擷取名詞,及其名詞間的關係,以便於理解。我們設計了一種模版近似對應的方法,並採取了一種半自動學習的策略,能夠將大量使用者標註過的語料學習後,進行模版的摘要,使得摘要過的「準則式模版」更為強健,應用到嶄新的領域也有不錯的對應率。SPBA 與上述「簡化還原法」合併的機器學習效果極佳,而且學到的知識提綱挈領,易於理解,錯誤分析可準確提供改進方向,增強系統性能,是傳統機器學習不容易做到的。

3. 中文自然語言問答系統
中文自然語言問答是一項嶄新且充滿挑戰性的研究議題。我們結合了實驗室各種中文技術,如問題理解、文句擷取、專有名詞辨識、候選答案排序、語意模板等,研發出一套問答系統。這套系統在 2007 年日本東京 NTCIR 舉辦的第二屆中文問答系統競賽中,以 55.3% 正確率蟬聯第一名。目前,這個系統已經商品化,我們正在往流暢的對談系統努力中。

4. 整合詞彙知識庫來表達詞彙向量
詞彙知識庫,如廣義知網,是將每一個詞彙的屬性與語法語意以結構化的方式加以表達,好處在於提供清晰的解釋與穩定的應用,缺點在於人為定義成本高且語意表達的範圍有所侷限。另一方面,深度學習的作法則是利用大量語料訓練出每個詞彙的一組向量,面對實際的 NLP 問題時可以把詞彙向量當作後續可訓練調整的參數,提供一般化與語意表達擴張的能力。結合這兩者的長處具有很高的學理價值與應用突破。我們結合知識庫與語料共同從事詞彙向量的生成,將結果發表在 EACL 2017 及 IALP 2017,後者更得到大會的最佳論文獎。此外,在 IALP 2017 的預測情緒詞彙與片語的國際競賽中,我們利用語料訓練出的詞彙向量與廣義知網的同義詞集來預測情緒維度,奪得整體第三名以及片語 arousal 項目的冠軍。

三、自然語言應用

我們所發展的注音自動轉國字的軟體──自然輸入法,正確率接近 96%,已經普遍受到大眾的歡迎與接受。最近我們朝著更加國際化的方向前進,以針對常用字串「搜尋」為概念,發展中英文「快打」輸入法。未來,我們將發展情感分析、機器閱讀,以及口語處理技術以支援不同的應用系統。

1. 情感分析與意見探勘
主觀資訊的分析是自然語言處理中最具應用性的研究領域之一,且相關技術需深入理解文本內容及特定領域知識。我們在新聞、部落格、網路論壇、評論、讀者回應、對話及短訊文本中研究意見、情感、主觀性、表情、情緒、觀點等資訊,並同時為中文及英文文件,研發分析主觀資訊的資源與技術。資源方面,我們建立了中文情感分析最完整的套件 CSentiPackage,其中包含辭典、語料、計分工具,與深度學習的社群網路文本分析演算法 UTCNN;其中套件下載次數超過 250 次,而字典更超過 1,000 次。目前我們基於過去所發展的各項技術,將情感分析帶入人類生活的各種情境中。除了利用最新的遠程監督技術收集推特的情感與諷刺文本,有效提供研究社群更大更可靠的語料以加速研究進程外,更開展與以色列及美國的國際合作,其中正在進行的工作包括目前研究社群關注的熱門新議題謊言偵測與假新聞干預,也都獲得豐碩的研究成果。

2. 機器閱讀
我們將用之前所建立之不同分析模組(如中文分詞、句法剖析、語意角色標註、邏輯型式轉換等),來建立一個中文自然語言理解系統。我們會首先為這個長期的研究計劃建立一個中文機器閱讀程式,使本計劃可以用閱讀測驗來評估。我們將從國小課本開始,然後再進一步到中學課本,最後到專業智慧型問答系統。

3. 口語處理
我們的研究涵蓋了多種口語處理主題。最近的成果包括用於語音/說話人辨識的鑑別式自動編碼器、基於子空間的口語語言辨識模型、基於變分自動編碼器的語音轉換、自動語音評估模型及視聽語音增強。實驗室成員黃文勁獲得 ISCSLP2018 最佳學生論文獎。進行中的研究包括閩南語語音辨識、跨語言語音辨識及口語問答系統。

Natural Language and Knowledge Processing Laboratory

Research Faculty
Wen-Lian Hsu / Chair, Distinguished Research Fellow
Hsin-Min Wang, Research Fellow
Lun-Wei Ku, Associate Research Fellow
Wei-Yun Ma, Associate Research Fellow
Keh-Yih Su, Research Fellow

We focus on problems concerning knowledge-based information processing, strongly prompted by the over-abundance of information flooding the Internet. We are currently engaging in studies on knowledge acquisition, representation and utilization, with a particular emphasis on Chinese processing.

I. Knowledge Base

We are assessing strategies and methodologies to automate knowledge acquisition processes.

1. Construction of linguistic knowledge bases
Over the past twenty years or more, we have developed an infrastructure for Chinese language processing that includes a part-of-speech tagged corpus, tree-banks, Chinese lexical databases, Chinese grammar, InfoMap, word identification systems, and sentence parsers, among other attributes. We plan to use these tools to extract linguistic and domain knowledge to form a complete concept net. We have also automatically constructed a knowledge base, COPA, containing over 220 million collocation pairs of NN, NV, AN, etc., which can be used to discover the modifying relationships in a sentence.

2. Utilization of knowledge bases
We are studying how to enable long or lengthy inference and multi-hop knowledge-based question answering (KBQA). Furthermore, we are leveraging knowledge inference paths in recommendation systems to improve their precision, and applying this research to visual storytelling to improve text diversity and coherence.

II. Natural Language Understanding

We are remodeling current ontology structures of WordNet, HowNet, and FrameNet to achieve a more unified representation. We have designed a universal concept representational mechanism called E-HowNet, which is a frame-based entity-relation model. E-HowNet has semantic composition and decomposition capabilities that are intended to derive near-canonical sense representations of sentences through semantic composition of lexical senses. Senses in E-HowNet have been fully and manually connected to the corresponding synsets in WordNet, just as lexicons in E-HowNet are automatically linked to synsets in WordNet.

1. Knowledge-based Chinese language processing
We are focusing on conceptual processing of Chinese documents. Our system will utilize the statistical, linguistic, and commonsense knowledge derived by our evolving Knowledge Web and E-HowNet systems to parse the conceptual structures of sentences and to interpret sentence meanings. We have adopted COPA to reduce a sentence to a simpler one by discovering the modifying relationships of words within the sentence, which is called "reduction" and can be used to guide word segmentation, speech recognition, and dependency parsing.

2. Statistical Principle-based Approach (SPBA)
We have proposed a semi-supervised method that can cluster a large quantity of seed instances of diverse quality. SPBA coupled with reduction can provide flexible frame-based pattern matching and summarization. Our biological NER system won first place in the 2009 and 2017 BioCreative international contests.

3. Chinese question answering system
We have integrated several Chinese NLP techniques to construct a Chinese factoid QA system, which won first place in the NTCIR-5 and NTCIR-6 workshops. We will extend this system to answer "how" and "what" types of questions.
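The COPA-based "reduction" step can be illustrated with a toy example. Everything below is hypothetical (invented word pairs, English tokens rather than Chinese, no real COPA data): known (modifier, head) collocation pairs are used to drop modifiers and recover a simpler core sentence, which is the basic mechanism behind guiding segmentation and parsing with reduction.

```python
# Toy stand-in for the COPA resource: a set of (modifier, head) pairs.
# These entries are invented for illustration only.
copa = {("red", "car"), ("very", "fast")}

def reduce_sentence(tokens):
    """Drop a token when (token, next_token) is a known modifier
    collocation, keeping the head; repeatable to simplify further."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in copa:
            i += 1          # skip the modifier; the head is kept next
            continue
        out.append(tokens[i])
        i += 1
    return out

assert reduce_sentence(["the", "red", "car", "stopped"]) == ["the", "car", "stopped"]
```

The real system works over Chinese text with a 220-million-pair resource and feeds the simplified sentence back to the parser; this sketch only shows the modifier-dropping idea.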

Brochure 2020

Figure 1: Sentence Parsing Example.

4. Distributed Word Representation
Compared to traditional word meaning representation via symbols, distributional word representation provides additional computational ability and the advantage of generation, but lacks explanatory ability. Thus, fully making use of the strength of each kind of representation is crucial to resolving practical NLP tasks. We have studied how to infuse information into knowledge bases in the form of distributional word representation, and published our work in EACL 2017 and IALP 2017, with this latter manuscript receiving a best paper award. We have also developed a lexical sentiment analyzer using both E-HowNet and distributional word representation to predict sentiment for a given word or phrase. Our system won third place overall in the 2017 IALP contest, and first place for prediction of phrasal arousal.

III. Natural Language Applications
Our Chinese input system, GOING, is used by over one million people in Taiwan. Our knowledge representation kernel, InfoMap, has been applied to a wide variety of application systems. In the future, we will design event frames as a major building block for our learning system. We will also develop basic technologies for processing spoken languages and music to support various applications.
1. Sentiment Analysis and Opinion Mining
We have studied opinions, sentiments, subjectivities, affects, emotions and views in texts such as news articles, blogs, forums, reviews, comments, dialogs, and short messages, and then developed sentiment analysis techniques for both Chinese and English. We have built one of the most popular Chinese sentiment analysis toolkits, CSentiPackage, which includes sentiment dictionaries, scoring tools, and the deep neural network module UTCNN. Thus far, the number of applications utilizing this toolkit has exceeded 250, and more than 1,000 utilize the dictionary alone. We are currently working on a new effective supervision method for sentiment and sarcastic tweet collection to assist the research community and boost research progress. We have also established international cooperation with Israel and the US. We are now investigating sentiment information model- and system-wide for new emerging and vital research topics such as lie detection and fake news intervention, in which we anticipate making useful advances.
2. Machine Reading
Since most knowledge is nowadays expressed in text form, machine reading has become a very important topic. We are building a Conversational Open Domain Document-based Natural Speech Q&A system that can read from Wikipages and then answer questions in natural speech. We plan to start this long-term research project by using a Chinese machine reading program that can be evaluated with reading comprehension tests. Initially, we plan to focus on reading elementary school texts, and then gradually shift to high school and real domain-oriented applications (e.g., smart Q&A).
3. Spoken Language Processing
Our recent achievements include discriminative autoencoders for speech/speaker recognition, subspace-based models for spoken language recognition, variational autoencoder-based approaches for voice conversion, automatic speech assessment models, and audio-visual speech enhancement. Our group member Wen-Chin Huang won the best student paper award at ISCSLP 2018. Ongoing research includes Minnan speech recognition, cross-lingual speech recognition, and spoken question answering.
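The basic computation that distributional word representations enable, and that symbolic representations lack, is graded similarity. A minimal sketch with invented 4-dimensional toy vectors (real embeddings have hundreds of dimensions and are learned from corpora):

```python
import math

# Toy "distributional" vectors; the words and values are made up
# purely to illustrate cosine similarity.
VECTORS = {
    "king":  [0.9, 0.8, 0.1, 0.2],
    "queen": [0.9, 0.7, 0.2, 0.3],
    "apple": [0.1, 0.2, 0.9, 0.8],
}

def cosine(u, v):
    # Cosine of the angle between two vectors: dot product over norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

if __name__ == "__main__":
    print(cosine(VECTORS["king"], VECTORS["queen"]))  # relatively high
    print(cosine(VECTORS["king"], VECTORS["apple"]))  # relatively low
```

A lexical sentiment analyzer of the kind described above can, for instance, score an unseen word by its similarity to words with known sentiment labels.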

八大實驗室 Research Laboratories
網路系統與服務實驗室

研究人員
陳伶志 / 召集人，研究員
何建明，研究員
莊庭瑞，副研究員
陳孟彰，研究員
楊得年，研究員
張韻詩，客座講座

本實驗室的研究方向包括參與式感測系統、軟體定義網路、備災應變科技、室內定位、導航及物件跟蹤的應用和服務，以及資料協作專案，以下針對這些項目進行更詳細的介紹。

一、參與式感測系統
我們研究物聯網與參與式感測系統，並且結合這兩種概念發展一個細懸浮微粒 (PM2.5) 的大型觀測系統――空氣盒子。我們不但號召民眾參與環境感測，也提供詳盡的資源導引民眾製作低成本的 PM2.5 感測裝置，並且將所有收集到的量測資料免費開放給所有人使用，大幅提升 PM2.5 觀測在時間與空間上的精細度，豐富了環境研究的數據資料。直到 2020 年中為止，我們已經在超過 58 個國家，募集到使用者佈建超過 15,000 個裝置，並且在許多重要的物聯網與跨領域的環境監測議題中，保持和國內外相關團隊密切的交流合作。

除此之外，我們也著手研究空氣盒子系統的資料品質議題，並且提出一個能夠從即時資料流進行感測資料異常偵測的運算框架；我們也提出一個基於叢集分群的短時間 PM2.5 濃度預測方法，並且研究一種結合感測值與預測值以達到最小 PM2.5 曝露量的路徑導航規劃演算法；我們也開發一個因地制宜且支援大規模物聯網的感測器校正機制，並且設計結合多元資料源的資料融合技術。我們接下來的工作除了持續透過更深入的時空資料分析，找出感測資料中更多的細微特性，同時也將進一步擴展我們的研究成果到其他環境因子（例如噪音、臭味和輻射），支援更多元的參與式環境感測。

二、軟體定義網路
軟體定義網路（Software-Defined Networking, SDN）為單播和群播的流量工程（Traffic Engineering）提供了靈活的路由，以提高網路可擴展性。然而，由於可能的群播組數量與網路中的節點數量成指數關係，此問題對於群播更具有挑戰性。為了解決此問題，我們在真實網路一些重要的限制下，制定一系列新的 SDN 優化問題，並證明其為 NP-hard 並推導不可近似性，同時設計了具有嚴格下界近似比率的近似演算法。我們首先提出一個新近似演算法，以提高 SDN 中群播組通信的可擴展性，接著引入了一種新的可靠群播樹以降低在路徑和節點容量限制下之恢復成本（recovery cost）。接著，我們將提出的群播樹進行擴展，嵌入由一系列虛擬化網路功能（Virtualized Network Functions, VNFs）組成的服務鏈，以支援具有網路功能虛擬化（Network Function Virtualization, NFV）和多個群播組的線上 SDN 流量工程，並提出了低成本的重新路由演算法以支援分段樹（segment tree）路由之線上 SDN 流量工程。上述所提出之 δ 倍近似演算法和 δ 倍競爭演算法皆可以實現最嚴格的近似比例，其中 δ 是群播組中的最大目的用戶數。除了理論分析之外，上述演算法並實作於 HP OpenFlow 交換機與 Floodlight 控制器，在真實 SDN 網路中以 YouTube 進行驗證，以顯示其真實的適用性，且此結果已轉移至 Inventec 和 EstiNet。

三、備災應變科技
這些年來，永續科學研究計畫「DRBoaST，善用巨量開放資料與互聯網於強化災防應對與社區復原能力」為我們提供了與幾位地球所研究員和領先大學教師合作研發災害管理科技的機會。我們早期的目標為開發出一個用於建立開放並可永續維護與發展的災害管理資訊系統之架構，此架構的組件包括可信賴之資訊授權服務、具高適應力及穩健度的即插即用異質災害網路、動態和即時之資訊傳送服務、及群眾外包策略與演算法。我們已用這些組件的雛型證明了它們的概念和可行性，亦與中華電信研究所及國家災害科技防救中心合作、研發一個異質通訊網路之即時防災救災訊息中介軟體雛型。

我們近來比較專注於獲取與善用防救災資料，一方面研發獲取關鍵性災害資料之方法與工具，另一方面研發善用開放資料與互聯網的備災應變設備、軟體與服務。一個前者的例子是以群眾外包為基礎之災情資料蒐集與決策平台、名為

Brochure 2020 CROSS (CROwdsourcing Support system for disaster 來搜索和管理醫療裝置和設備、監控病人的位置 Surveillance), 其 會 在 傳 統 感 測 網 路 與 環 境監控系 和動作、以及促成設備使用統計數據的收集。 統無法提供足夠災情資料時,於網路社群中廣播勘 災資料蒐集之需求,基於志工之屬性,動態規畫志 我們正在進行現場試驗,以評估這些應用程序和服 工配置與勘災路徑,並整合志工回報之觀測資料與 務的功能和可用性。 測試地點包括台北國立台灣大 感測器資料,以有效掌握災害範圍與災情。我們特 學醫院和雲林分院(急診室和患者病房的 BOT,以 別著重在志工資源分配模型到勘災路徑規劃與災情 及門診病人就診的區域的 WPIN)。其他試驗地點包 資料整合之相關學理基礎研究,使這些問題的解法 括彰化基督教醫院,台北市政府和雲林榮譽國民之 能不僅為 CROSS 實用,而且可以推進基礎理論。我 家。 們已整合我們的志工資源分配模型到知名的開源災 害管理平台 Ushahidi。我們的地球科學同事在使用 五、資料協作專案 CROSS,他們在每次顯著地震發生後,隨即調度志工 在安全的情況下回報地震地表災害的科學參數,協 為了資料的協同生產、取用、與保存,我們開發了 助斷定對應的孕震構造與評估後續的潛在危害。 相關的工具、系統、與服務。資源受限的的研究計 畫通常仰賴第三方服務來分享與典藏資料。當計畫 我們的另一個重點是應用開放大數據中的資訊以及 成員眾多且以流動方式參與時,這不見得是好作法。 智慧物件於備災及應變的科技基礎,目標為大幅提 舉例來說,一項公民科學計畫可能有數千人參與。 昇建築物安全及設施管理基礎建設及應用系統的防 以下列出幾項跟(院內或院外)其他研究單位一起 災及應變功能。我們已研發了能自動啟動避險機制 進行的資料協作專案,其中資訊所擔負領頭的角色。 的主動式智慧型嵌入式裝置、手機應用程式與緊急 研究資料寄存所 ― 對所有人開放的研究資料寄存服 應變系統的雛型,以顯示其概念、可行性及有效性。 務。「 研 究 資 料 寄 存 所 」(https://data.depositar.io/) 我們並研發建築物資訊系統 (BeDIS),此系統可提供 建立在 CKAN 這套用來發布開放資料的自由軟體套件 做基於位置和環境之備災應變決策所需的資訊,所 之上,但我們加了以下功能:一、豐富的後設資料 以是讓主動式裝置、應用程式與緊急應變系統可以 支援;二、時空間資訊的註記和查詢;三、空間資 普遍使用於智能建築與環境中必需的基礎設施。 料集的預覽與疊套;四、使用維基資料管理關鍵詞; 五、並開放給所有人註冊使用。許多計畫已使用該 四、室內定位、導航及物件跟蹤的應用和服務 服務來管理研究資料。研究資料寄存所的原始程式 碼跟使用手冊都可以自由下載。我們也跟科技部一 DRBoaST 計畫於 2018 年 12 月結束後,我們將 BeDIS 同發展研究資料管理的規範與指引,提供給獲科技 的雛型強化為可立即部署的平台。BeDIS 系統中最重 部補助的專題計畫。 要的組件是位置信標 (Lbeacon, location beacons), 它 們是在建築物中普遍安裝的藍牙低能耗設備,在通 台 灣 動 物 路 死 觀 察 網 (TaiRON)。 TaiRON (https:// 常情況下,每個 Lbeacon 都定期地通過定向天線廣 roadkill.tw/) 以協作方式蒐集台灣的動物路死的(手 播自己的 3D 坐標和位置描述,此功能使建築物中的 機)照片紀錄。這項計畫 2011 年在台灣特有生物研 人可以通過手機準確定位及導航。每個 Lbeacon 還 究保育中心開始的時候,使用臉書社團為資料收集 收集在它覆蓋範圍內之藍牙標籤 (tags) 的 MAC 地址, 管道。我們重新設計了資料流程,讓參與者先將觀 此功能使 BeDIS 能夠即時的定位並跟蹤帶有藍牙標 察紀錄上傳到專屬的研究網站,然後由網站發出摘 籤的設備、工具、人員等。BeDIS 可以達到 3-5 米或 要訊息到臉書社團,進行物種辨識與社群互動。新 6-10 米的定位準確度,其響應時間通常為 2-3 秒。 的流程讓資料好管理與分析,也讓觀察紀錄得以個 BeDIS 既 是 能 支 持 室 內 定 位 和 室 內 導 航(IPIN) 應 別管控。TaiRON 的資料已用於狂犬病疫情監測以及 用程序的基礎架構,又是個即時室內位置特定服務 路殺熱點的標示跟改善。於 2019 年底首次舉行國家 (RT-ILBS)的平台,這是它的獨特優勢之一。BeDIS 農業科學獎,此計畫在生態永續類獲得卓越紮根獎。 
的卓越處也包括在此平台上構建的導航和物件跟蹤 台灣動物路死觀察網目前有五千多位積極的參與者。 程序能有足夠好、不受人潮影響的定位精準度和響 應時間,而且具有低的部署和維護成本。 318 公民運動文物紀錄典藏庫。 2014 年太陽花運動 要結束之時,中央研究院到被佔領的立法院議場, 除 BeDIS 之外,我們的成果包括: 與學生與運動者協議取得場內的物件。這些物件的 ・ WPIN(基於 Waypoint 的室內導航器): 此應用程 數位化檔案以及另外取得的照片與影音檔案,便是 這 典 藏 庫 (http://public.318.io/) 的 主 要 收 藏。 典 藏 序具有抬頭顯示樣式的 GUI 和語音導航指令,可 庫的目錄介面,讓眾人可以搜尋並指認物件;也提 在各種智能手機上為一般民眾在室內導航。專為 供機制讓物件的創作者,能將物件的高解析度數位 大醫院和診所的門診病人設計的 WPIN 也能規劃 檔案提供給公眾再次使用。太陽花運動修正了台灣 行程,以盡量減少每位患者完成其就診、醫療檢 的走向,並持續影響全國的政治面貌與社會反響。 查、付款並取藥所需的總時間; 和 這整批收藏已移轉給國立台灣歷史博物館,部份並 於「戰後台灣社會運動特展」中展出。 ・ BOT(BeDIS object tracker)和地理圍欄服務 : BOT 可用於醫院、老人護理和其他提供護理的場地中 129

八大實驗室 Research Laboratories
Network System and Service Laboratory

Research Faculty
Ling-Jyh Chen / Chair, Research Fellow
Meng-Chang Chen, Research Fellow
Tyng-Ruey Chuang, Associate Research Fellow
Jan-Ming Ho, Research Fellow
De-Nian Yang, Research Fellow
Jane W. S. Liu, Distinguished Visiting Chair

Our research addresses several aspects of network systems and services, including participatory sensing, software-defined networking, technologies for disaster preparedness and responses, indoor location-based applications and services, and collaborative data projects. These areas are described further below.

I. Participatory Sensing
We are studying the Internet of Things (IoT) and participatory sensing systems, and have integrated the two concepts to develop a large-scale system, called AirBox, for monitoring fine particulate matter (PM2.5). This project engages citizens to participate in environmental sensing, enabling them to make their own low-cost PM2.5-sensing devices. It also facilitates PM2.5 monitoring at fine spatio-temporal scales and supports environmental data analyses by making all measurement data freely available. Thus far, we have deployed more than 15,000 devices across 58 countries, and we have established collaborations with international and domestic researchers on several key issues regarding IoT systems and the interdisciplinary topic of environmental monitoring systems.

Specifically, we have investigated data quality issues with the AirBox system and have adopted an anomaly detection framework to detect anomalous events in the real-time data stream. We have developed a cluster-based approach for short-term PM2.5 concentration forecasting and designed a clean-air routing algorithm to allow path navigation with minimal PM2.5 exposure. Our adaptive sensor calibration mechanism and data fusion algorithm can integrate multiple PM2.5 data streams with different inherent accuracy levels. Currently, we are improving spatio-temporal data analysis techniques and identifying additional intrinsic properties of data generated by AirBox. We hope to extend our methodologies to support other sensing systems for various environmental factors, such as noise, odor, and radiation.

II. Software-defined Networking (SDN)
SDN enables flexible routing for traffic engineering of unicast and multicast flows to increase network scalability. However, traffic engineering is much more challenging for multicast flows since the number of possible multicast groups increases exponentially with the number of nodes in the network. To tackle this issue, we formulated a series of new optimization problems with practically important SDN constraints, proved NP-hardness and inapproximability, and designed new algorithms with the tightest possible approximation ratios. Specifically, we first proposed an approximation algorithm with the strictest bound to improve scalability for multicast group communications in SDN, and introduced a new and very reliable multicast tree for SDN to reduce recovery costs under link and node capacity constraints. Next, we extended the proposed multicast tree to support multiple multicast groups for online SDN traffic engineering with network function virtualization (NFV) by embedding a service chain consisting of a sequence of virtualized network functions (VNFs). Finally, we proposed low overhead rerouting schemes to support online SDN traffic engineering with and without segment tree routing. Our proposed δ-approximation and δ-competitive algorithms can achieve the tightest possible approximation ratios. In addition to sophisticated theoretical analyses, our algorithms have also been implemented in real SDN networks by means of HP OpenFlow switches and Floodlight for YouTube traffic to show their practical applicability.
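To make the object of study concrete: a multicast tree connects one source to a set of destinations inside the network graph. The sketch below builds one by taking the union of BFS shortest paths; this is a simple textbook heuristic, not the capacity-aware δ-approximation algorithms described above, and the topology is invented:

```python
from collections import deque

# Build a multicast tree as the union of shortest paths (in hop count)
# from the source to each destination.
def multicast_tree(adj, src, dests):
    parent = {src: None}
    q = deque([src])
    while q:                      # BFS from the source
        u = q.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                q.append(v)
    tree = set()
    for d in dests:               # walk each destination back to src
        while parent[d] is not None:
            tree.add((parent[d], d))
            d = parent[d]
    return tree

if __name__ == "__main__":
    # A toy 5-switch topology given as adjacency lists.
    adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}
    print(sorted(multicast_tree(adj, 1, [4, 5])))
```

In SDN, each edge of such a tree would become a forwarding rule installed by the controller; the hard part, which this sketch omits, is respecting link/node capacities and minimizing rule counts across many groups.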
The results of these implementations have been transferred to Inventec and EstiNet.

III. Technologies for Disaster Preparedness and Responses
For many years, the multi-disciplinary Sustainability Science Research project DRBoaST (Disaster Resilience through Big and open data and Smart Things) provided us with opportunities and support to collaborate with research fellows and faculty members in earth science, urban planning and computer science with regard to information and communication technologies for disaster management. Initially, our focus was on technologies for building open and sustainable disaster management information systems. Our results have included prototypes, models and algorithms for a responsive and trustworthy emergency information brokerage service, heterogeneous disaster-resilient and plug-n-play networks and dynamic logical information exchange, as well as models and tools for integrating data from intelligent things and people.

Our more recent emphasis has been on the generation and use of data. One research effort is directed towards developing methods and tools for generating and collecting data needed for disaster risk reduction, including CROSS (CROwdsourcing Support system for disaster Surveillance). A typical disaster surveillance system needs to make critically important decisions within minutes or hours before disasters strike. To crowdsource human sensor data, an emergency manager needs help to recruit and manage trained volunteers, select participants from volunteers, plan tours for them to cover locations where observational data are needed, and integrate in real-time data collected by them with physical sensor data to improve the quality of sensor coverage. CROSS was built to meet these needs. We solved realistic issues underlying participant selection, tour planning, and data integration, resulting in solutions that are not only of practical utility to CROSS but also have theoretical significance.
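Participant selection of the kind CROSS performs can be sketched as greedy maximum coverage: repeatedly pick the volunteer who covers the most still-unobserved locations. The volunteer names and coverage sets below are invented, and real CROSS selection also weighs volunteer attributes and plans tours, which this sketch omits:

```python
# Greedy set-cover-style participant selection.
def select_volunteers(coverage, needed):
    remaining = set(needed)
    chosen = []
    while remaining:
        # Pick the volunteer covering the most uncovered locations.
        best = max(coverage, key=lambda v: len(coverage[v] & remaining))
        gain = coverage[best] & remaining
        if not gain:
            break                 # nobody can cover what is left
        chosen.append(best)
        remaining -= gain
    return chosen, remaining

if __name__ == "__main__":
    coverage = {"ann": {1, 2, 3}, "bob": {3, 4}, "eve": {4, 5}}
    print(select_volunteers(coverage, [1, 2, 3, 4, 5]))
```

Greedy selection gives the classical (1 - 1/e) guarantee for maximum coverage, which is one reason such formulations are attractive both in practice and in theory.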
We have integrated selected components of CROSS into the well-known Ushahidi platform. CROSS has been used by our earth science colleagues who use trained volunteers after each significant earthquake in

Taiwan to collect data on new geo-hazards, enabling assessments of the risks of earthquake-triggered compound disasters.

Another major thrust of our research is directed toward developing active smart embedded devices, mobile applications and services/systems for smart homes and buildings, all of which can automatically process alerts from authorized senders and building safety systems and take location-specific actions to minimize the likelihood of injuries and reduce property damage when disasters strike. In particular, each device (or application) selects its action (or response instructions) in response to an alert based not only on the type and parameters of the alert but also on attributes of the building, interior layout and objects surrounding the device (or application). Our work has demonstrated that easily customizable and maintained active smart devices/applications for diverse purposes can be built on a common architectural framework from reusable components, and alert messages can be pushed asynchronously over the Internet to meet end-to-end delay requirements of time-critical alerts. Our experimentation with a prototype active emergency response system in an office building during a simulated strong earthquake demonstrated the effectiveness of such systems. As a step towards enabling pervasive use of active devices and applications, a proof-of-concept prototype building/environment data and information system (or BeDIS) was built as a part of the ICT infrastructure needed to support location-specific, active emergency preparedness and response within large public buildings.

IV. Indoor Location-based Applications and Services
Since the DRBoaST project ended in December 2018, we have been focusing on transforming the BeDIS prototype into a ready-for-deployment platform. BeDIS is a system of Lbeacons (location beacons). Lbeacons are Bluetooth Low Energy devices with directional antennas that are installed throughout a building. Under normal conditions, each Lbeacon broadcasts its own locally stored 3D coordinates and a brief location description, which enables hundreds to thousands of people in the building to locate themselves using their mobile phones and navigate amidst a dense crowd and moving objects. Each Lbeacon also collects and timestamps MAC addresses of Bluetooth tags carried by objects of interest such as valuable equipment, essential tools, and people requiring special care, amongst other targets. This capability enables BeDIS to locate these people/objects and track their movements in real-time with nominal response time of 2-3 seconds. Location accuracy of 3-5 m or 6-10 m is achieved by adjusting the ranges and beam widths of directional antennas used on Lbeacons.

BeDIS is unique among competing platforms by being both an infrastructure for indoor positioning and indoor navigation (IPIN) applications and a platform for indoor location-based services (ILBS). Furthermore, BeDIS has several distinct advantages compared to IPS (indoor positioning systems) built on competing technologies. For instance, location accuracy and response time degrade negligibly in the presence of large crowds, and it is easy to configure and inexpensive to deploy and maintain. Moreover, Lbeacons can be easily configured to act as a micro data server and used in an extended BeDIS that supports active disaster preparedness and responses. When triggered by a disaster/emergency alert from responsible government agencies or the building safety system, an extended BeDIS functions as a micro data server to deliver location- and situation-specific emergency response instructions to people and decision-support data to active devices and applications within fractions of a second.

To date, the following applications and services are ready-for-market:
• BeDIS: an infrastructure/platform for IPIN and RT-ILBS applications;
• WPIN (Waypoint-based indoor navigation): applications with head-up-display style GUI (graphical user interface) and voice-based navigation directions on diverse smart phones, for the general public and for special-purpose navigators that manage outpatient flows during visits to hospitals and health clinics, thereby minimizing the total time required by each patient to complete his/her visits; and
• BOT (BeDIS object tracking) and geo-fencing services for locating objects and tracking people indoors in general and, more specifically, within hospitals and other care-providing institutions.

Several field trials are being conducted to assess the functionality and usability of these applications. The test sites include National Taiwan University Hospital Taipei and Yunlin Branch (BOT in ER and patient wards, and WPIN in areas visited by outpatients). Other test sites include Changhua Christian Hospital, Taipei City Hall and Yunlin Veterans Home.

V. Collaborative Data Projects
We are working on tools, systems, and services for collaborative data production, access, and preservation. Resource-constrained research projects often rely on third-party services for on-going data sharing and archiving. Such practices may not work well when the project involves a large number of collaborators participating in a fluid modality. For example, thousands of people may participate in citizen science projects. Below, we present a few collaborative data projects we are undertaking jointly with other research units (both within and beyond Academia Sinica), with IIS taking a leading role.

depositar (Research Data Repositories for All): depositar (https://data.depositar.io/) is a research data repository built upon CKAN, a free software package for publishing open data. We have added several extensions to it, including: 1) rich metadata support; 2) spatiotemporal annotation and query; 3) preview and overlay of spatial datasets; 4) Wikidata-powered keywords; and 5) open registration for all. These services are frequently used for active research data management. The source code and user manual of depositar are freely available. We are also working with the Ministry of Science and Technology (MOST) to develop research data management policies and guidelines for research projects supported by MOST grants.

Taiwan Roadkill Observation Network (TaiRON): TaiRON (https://roadkill.tw/) is a collaborative project to collect (by smartphone) photo observations of roadkill and other animal mortality in Taiwan. It was started by the Taiwan Endemic Species Research Institute in 2011, initially by using a Facebook Group for data collection. We redesigned the data workflow so that participants first upload their observation records to a dedicated project website. The website then generates digests of the records and relays them to the Facebook Group for species identification and community interaction. The new workflow facilitates better data management and analysis, as well as individual control of observation records. TaiRON datasets have been instrumental in monitoring unusual animal deaths in Taiwan, and in designing new measures to reduce roadkill. The project received a 2019 National Agricultural Science Award in the Sustainable Environment category. TaiRON currently has more than 5,000 active participants.

The Sunflower Movement Archive: This public archive (http://public.318.io/) is a collection of artifacts, images, and videos from the 2014 Sunflower Movement. Most of the material was acquired by Academia Sinica when the students and activists were vacating the occupied chamber of Taiwan's Legislature. The archive permits anyone to search and identify objects in the collection, and it makes it possible for owners to release high-resolution images of their creations to the public for reuse. The Sunflower Movement signifies a turning point in Taiwan's recent history, and it continues to influence the political landscape and societal reflections of the populace. The entire collection has been transferred to the National Museum of Taiwan History, and was featured in a special exhibition on Social Movements in Post-War Taiwan.
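Since the depositar service described above is built on CKAN, its datasets can in principle be queried through CKAN's standard Action API, whose dataset search endpoint is `/api/3/action/package_search`. Whether the public depositar instance keeps this endpoint open is an assumption here; the sketch therefore only constructs the request URL and makes no network call:

```python
from urllib.parse import urlencode

# CKAN Action API search endpoint on the depositar instance.
BASE = "https://data.depositar.io/api/3/action/package_search"

def search_request(query, rows=10):
    """Build the GET URL for a CKAN dataset search (q = free text,
    rows = maximum number of results)."""
    return BASE + "?" + urlencode({"q": query, "rows": rows})

if __name__ == "__main__":
    print(search_request("air quality", rows=5))
```

A CKAN server answers such a request with a JSON object whose `result.results` field lists matching dataset records, which is what makes repositories like depositar scriptable for large collaborative projects.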

八大實驗室 Research Laboratories
程式語言與形式方法實驗室

研究人員
陳郁方 / 召集人，研究員
王柏堯，研究員
柯向上，助研究員
莊庭瑞，副研究員
穆信成，副研究員

程式語言與形式方法實驗室發展保證程式正確性之技術。我們在程式語言方面的研究，著重於發展正確程式之語法、語意、及實務上之議題；在形式方法方面的研究，強調分析真實程式中演算法、計算、以及實際方面之課題。我們應用數學及形式技術以探討我們的研究問題，亦致力於發展工具及方法論以幫助程式開發者寫作正確程式。

一、函數程式語言驗證與推導
函數程式語言提供一個單純、多限制的計算模型，但相對地，我們也因此擁有數學的好處：程式可當作數學來理解，可用包括歸納法在內的種種數學工具推論關於程式的性質。可從一個簡單、好理解、但作為程式執行效率不高的程式開始，用數學方法推導出一個和它「相等」，但執行效率高的程式。這套技巧稱作「程式推導」。我們做過幾個有趣的演算法推導，包括近似演算法、串列上的演算法等等。日後也將繼續尋找更多的程式推導例子，並建立推導程式的常用定理與步驟。

一般的印象是函數語言中不允許副作用的存在。事實上，許多實用的函數語言不僅支援副作用，甚至支援許多種副作用。只是在函數語言中，我們會希望在支援副作用的同時不能失去函數語言原有的好處：副作用也須以嚴謹、能以數學方法管理的方式出現在程式中。其中一種管理、描述副作用的方式是透過「單子」。我們近年來也研究如何證明單子程式的性質，並嘗試釐清確保多種副作用結合後仍保持其應有性質的充分條件。

二、依值型別編程
學界已有不少型別理論基礎和語言/系統根據「命題即型別」此一原則發展出來。在這些系統中，證明數學定理就只是編寫具精巧型別（學名為「依值型別」）的程式。其中較為大膽的一脈工作是依值型別編程，其目標是將完整的依值型別引入一般程式語言內（特別是函式語言），不僅保證程式正確，同時還保留自然的編程風格。依值型別編程的一大特徵是用迭構型別族系（最近亦能用迭解型別族系）將程式員關注的性質編入型別內，程式員操作這些型別的構造與計算時便能不著痕跡地建立或利用型別內嵌的性質，不需另寫證明，甚至能以型別資訊作為語意上的編程引導。我們已有一些初步實驗顯示這個新範式能成功融合程式與證明，並將進一步探究其潛力。

三、低階密碼程式之正規驗證
現代密碼學仰賴大有限體上之計算。常見之計算設備卻不支援長整數（如：255 位元）計算。因此，大有限體上之計算必須在 32 或 64 位元計算結構上來實現。而這些實現必然隨著計算結構不同而有所改變。在 OpenSSL 中，大有限體乘法已經有 x86、ARM、SPARC 上之實現。這些組合程式是否正確地被實現，在密碼學之應用非常重要。我們著重於發展實用的群與體運算密碼程式形式化驗證技術。藉由與密碼專家之合作，我們的技術已經被用於驗證多種在 OpenSSL、Bitcoin、wolfSSL 中橢圓曲線密碼系統之群與體運算。

四、字串限制式求解
網路程式安全是我們數位基礎建設的關鍵基石。在所有確保網路安全的方法中，自動化的網路程式安全分析明顯有很高的需求。字串限制式求解器可以說是新一代自動化網路安全分析技術的引擎。在這樣的高度需求下，近六年來，有大量關於這個主題的研究發表。我們和一些在歐洲的合作者，在這個題目上取得了一系列的成果。舉例來說，我們將 Craig 在 1957 年於數理邏輯提出的內插理論擴展到字串限制式，這是軟體模型驗證的關鍵技術，可以用來自動產生迴圈不變量。我們所開發的字串限制式求解器 Trau 目前是全世界最有效率的實作之一。儘管如此，網路程式分析對於字串限制式求解器效能有很高的要求，目前求解器的發展仍然和實際的完整需求有明顯的落差。我們正和全世界此領域的研究者密切共同合作，以期待能縮小這個差距。

五、敏感資料之群組共享
隱私保護的意涵不僅在於個人資料取得上的限制，同時也在於資訊使用與揭露的原則與規範。妥善管理資訊流向、鼓勵資料分享、並維持所分享資訊的私密性，可真是挑戰。一般作法是透過集中式的資料控制單位，將來自眾多個人的資料集去識別化後，再行釋出供二次使用。不過，這三方⸺

八大實驗室 Research Laboratories
Programming Languages and Formal Methods Laboratory

Research Faculty
Yu-Fang Chen / Chair, Research Fellow
Bow-Yaw Wang, Research Fellow
Hsiang-Shang Ko, Assistant Research Fellow
Tyng-Ruey Chuang, Associate Research Fellow
Shin-Cheng Mu, Associate Research Fellow

The Programming Languages and Formal Methods research group develops techniques to ensure program correctness. Our research into programming languages focuses on syntactic, semantic, and pragmatic issues in the development of correct programs. Our work on formal methods emphasizes the algorithmic, computational, and practical aspects of realistic program analyses. We apply mathematical and formal techniques to our investigations of research problems. We also aim to develop tools and methodologies that can help developers write code correctly.

I. Reasoning and Derivation of Functional and Monadic Programs
Purely functional languages, in which functions are true mathematical functions, offer a simple model of computation in which programs can be understood as mathematical equations whose properties can be proved using mathematical tools including induction. An efficient program can be derived from a program specification by stepwise equational reasoning, whereby every step is justified by a theorem. This technique, called "program derivation", aims to provide formalization and patterns that help programmers to develop programs and their correctness proofs at the same time. We have published some interesting case studies of derivation, including approximation algorithms and algorithms for segment and partition problems.

The general impression that purely functional languages do not allow side effects is misleading. Instead, they dictate that effects must be introduced in a mathematically manageable manner. One way to introduce effects is through monads. Monadic programs actually allow plenty of reasoning.
We are studying their derivation, and also trying to establish the conditions under which desired properties of one monadic effect can be preserved in the presence of other monadic effects.

II. Dependently Typed Programming
Based on the propositions-as-types principle, there have been type-theoretic foundations and languages/systems in which the activity of proving mathematical theorems is subsumed by programming with sophisticated types (called "dependent types"). Among these approaches, there is an ambitious line of work on dependently typed programming that aims to integrate full dependent types into conventional (in particular functional) programming languages, guaranteeing program correctness while retaining natural programming styles. A defining characteristic of dependently typed programming is the use of inductive families (and, recently, coinductive families too) to encode properties of interest in datatypes such that the dependently typed programmer can implicitly establish and exploit those properties while specifying computations on inhabitants of those datatypes. Consequently, the programmer does not need to write explicit and separate proofs about those properties, and can even get semantic hints in the form of type information during program construction. We have conducted some successful preliminary experiments on this new paradigm of blending programs and proofs, and are setting out to explore its full potential.
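The program-derivation style of Section I can be illustrated with one of the segment problems mentioned there: maximum segment sum. Below, rendered in Python rather than a functional language, are an O(n²) executable specification and the linear-time program one derives from it by equational reasoning (the derivation steps themselves are omitted); running both on the same input shows they agree:

```python
# Specification: the maximum sum over all contiguous segments,
# including the empty segment (sum 0). Clear but O(n^2).
def mss_spec(xs):
    return max(sum(xs[i:j]) for i in range(len(xs) + 1)
                            for j in range(i, len(xs) + 1))

# Derived program: a linear-time scan that carries the best sum of a
# segment ending at the current position (fold-fusion style result).
def mss_fast(xs):
    best = ending = 0
    for x in xs:
        ending = max(0, ending + x)
        best = max(best, ending)
    return best

if __name__ == "__main__":
    xs = [3, -4, 5, -1, 2, -6, 4]
    print(mss_spec(xs), mss_fast(xs))  # both print 6
```

In the derivational approach, the equality of these two programs is not checked by testing but proved, step by step, from algebraic laws such as fold fusion.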

III. Formal Verification on Group and Field Operations in Cryptographic Programs
Modern cryptography relies on computation over large finite fields. However, commodity computing devices do not natively support long-integer (such as 255-bit) computation. Therefore, computation over large finite fields must be implemented on 32- or 64-bit architectures. Such implementations necessarily vary depending on the architecture. In OpenSSL, multiplication over large finite fields has been implemented for x86, ARM and SPARC. Whether all assembly programs implement multiplication correctly is of utmost importance in applications of cryptography. We are focusing on developing practical and formal techniques for verifying algebraic properties of group and field operations in cryptographic programs. In collaboration with practical cryptographers, our technique has been applied to verify group and field operations in various elliptic curve cryptosystems in OpenSSL, Bitcoin and wolfSSL.

IV. String Constraint Solving
Web program security is a crucial building block of our digital infrastructure. Among other means of ensuring cybersecurity, there is an apparent demand for automatic web program analysis. String constraint solvers are the engine of modern web program analysis techniques. Due to high demand, there has been a considerable increase in the number of publications on this subject over the past six years. Together with our close collaborators in Europe, we have achieved a series of successes in this research direction. For example, we proposed the first string constraint solving procedure that is able to generate Craig interpolants, which are essential for invariant generation in software model checking. Our string constraint solver, named Trau, is among the most efficient implementations in the world. Nevertheless, the current state of the art in string constraint solving still lags far behind the levels required for web program analysis and, together with our colleagues globally, we are working toward bridging that gap.

V. Communal Sharing of Sensitive Data
Privacy refers not merely to restrictions on acquiring personal data, but represents a set of principles and rules that govern the use of information and its disclosure. How to appropriately manage information flows and to encourage data sharing yet keep shared information private remains a challenge. A typical approach is for a centralized data controller to de-identify a dataset collected from individuals before it is released for reuse. That the interests of the three actors involved, i.e., individuals, data controllers, and third-party data users, are not properly aligned is a central dilemma in sharing and reusing personal data. Together with legal scholars, we are working toward a communal approach to personal data management whereby, without an external authority, members of a community can pool sensitive information about themselves for mutual benefits. Principles from programming languages and formal methods will guide us in the development of good data-sharing schemes with verifiable properties. We will develop methods, tools, and systems to facilitate communal management of sensitive data and put them into use.

VI. Education
In addition to our research activities, we also dedicate significant resources to education. In order to introduce our research to students, we have organized the yearly Formosan Summer School on Logic, Language, and Computation (FLOLAC), in operation since 2007. Through FLOLAC, more and more students in Taiwan have been encouraged to study and conduct research on programming languages and formal methods.
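A tiny example of the kind of constraint the string solvers of Section IV handle is the word equation x·"ab" = "ab"·x together with a length constraint |x| = 4. The brute-force search below is purely for illustration; solvers such as Trau reason symbolically instead of enumerating, which is the only viable option at web-application scale:

```python
from itertools import product

# Find a string x over {a, b} with |x| = 4 such that x + "ab" == "ab" + x.
def solve(alphabet="ab", length=4):
    for chars in product(alphabet, repeat=length):
        x = "".join(chars)
        if x + "ab" == "ab" + x:
            return x
    return None

if __name__ == "__main__":
    print(solve())  # prints "abab"
```

Even this toy equation has structure a symbolic solver exploits: the solutions of x·u = u·x are exactly the strings sharing a primitive root with u, so x must be a power of "ab".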

八大實驗室 Research Laboratories
計算理論與演算法實驗室

研究人員
廖純中 / 召集人，研究員
王大為，研究員
呂及人，研究員
李德財，客座講座
徐讚昇，研究員
高明達，研究員
楊柏因，研究員
劉進興，副研究員
鐘楷閔，研究員

本組的研究目標在探討計算問題本身的極限和效用，並為資訊科學其他領域建立穩固的理論基礎。以下簡述幾個我們進行的研究主題。

一、量子與後量子密碼學
量子密碼學的目標是了解量子計算在密碼學中扮演的角色。對密碼學來說量子可說是一把雙面刃：在壞人手中，大型的量子電腦可以通過秀爾演算法 (Shor's algorithm) 攻破現行的大多數公鑰密碼系統實作。做為好人的一方，我們則需發展後量子密碼學 (Post-quantum Cryptography; PQC) 來應對。後量子密碼學是研究在量子電腦攻擊下仍然安全的密碼學演算法（通常專指公鑰密碼系統）的學科。另一方面，量子也增強了好人的能力。量子密碼學廣泛的探索如何利用這樣的能力來達到更高的安全性或更多的功能性。我們對於這兩個研究方向都有興趣，也同時從理論與實務的角度來進行研究。在前者，我們積極的參與美國國家標準局 (NIST) 後量子密碼學標準的公開徵選，我們提出的一個候選的密碼系統也順利晉級到目前第二輪的徵選階段。在理論方面，我們廣泛的參與多項研究課題，包含設備無關量子密碼學 (Device-independent Cryptography)、古典委任量子計算 (Classical Delegation of Quantum Computation)、安全多方量子計算 (Secure Multiparty Quantum Computation) 與量子預言機模型下的安全性 (Security in the Quantum Random Oracle Model) 等。

二、幾何計算
幾何計算考慮的是與幾何物件相關的計算問題，以及其演算法之開發。在這個領域中，最常用的距離度量是歐式距離；但在許多實際的應用中，也有需要導入其他距離度量以符合現實情境。舉例來說，曼哈頓距離用來表示在棋盤狀市區中的步行距離，而時間距離則用來描述在不同速度的載具所形成的交通路網中所需要的移動時間。如果對傳統的幾何問題（如沃羅諾伊圖形、凸包、路線規劃、設施放置等）導入不同的距離度量，我們將會衍生具有理論重要性的幾何問題，在交通網路、無線感測網路、或是超大型積體電路設計等不同的領域裡也可找到對應的實際應用。我們探討這些問題的性質與所選定的度量有何相依性，並利用新的性質設計高效率的演算法。

三、組合最佳化與近似演算法
組合最佳化這個領域考慮如何在有限卻龐大的解集合中，找出最佳解的相關問題與方法。在大部分的情況下，窮舉搜尋法對這領域的最佳化問題都不切實際，因此我們多半訴諸於啟發式演算法、隨機演算法或是近似演算法，取得儘可能接近最佳的解。近數十年來，設施放置問題（包括無容量限制與有容量限制）是這個領域中一個重要的研究課題。由於此問題在模型上的簡單性與基礎性、以及在抽象的線性規劃表述裡所表現出的多樣性，它在近代的近似演算法設計中扮演了非常重要的角色。我們基於過去的相關研究成果，試圖以不同的觀點來解析設施放置問題。在演算法的研究過程中，我們不僅想為這些問題設計高效率的演算法或最佳的近似解，而且要獲得更根本的分析技巧與工具，用以解決組合最佳化領域中更多的相關問題。

Brochure 2020

四、巨量資料
• 巨量資料相關之高速演算法設計:近年來大量資訊很容易在線上取得。我們研究如何利用這些巨量資料進行快速計算。研究題目包含傳染性疾病散播模型之快速動態模擬、疾病網路之建構和視覺化呈現、電腦對局理論和實作。
• 巨量資料之邏輯與知識表達:巨量資料之中隱藏許多有用的資訊與知識,我們將以形式邏輯的方法來探討相關的知識表徵與推理問題。

五、基礎圖論
圖論可以解決許多實際應用問題,而且也是很多理論研究的工具。我們通常先由基礎圖論性質的研究著手,然後藉由新性質的發現,設計高效率演算法,再進一步探討理論之突破,以及可能的應用價值。我們探討於實際應用產生的圖論演算法問題。目前的研究重點之一為河流模式 (streaming model) 下之演算法設計。

六、機器學習理論
在日常生活中,我們時常必須不斷在未知的環境中作決定,並為此付出代價。這可被抽象化為所謂的線上決策問題,而我們希望能為此問題設計出好的線上演算法,可以從過去的歷史中學習,而能在未來做出好的決定。對此問題,我們刻畫出一些自然而常見的環境條件,並在這些條件下設計出更有效率的線上演算法。此外,我們也為此問題在機器學習、賽局理論、複雜度理論等研究領域中找到新的應用。

七、機器人學
對於無人載具例如服務機器人、自駕車及無人機的應用,受到性能及操作限制,平滑路徑規劃是基礎的工作。在這方面的工作,多年來我們研究了不同基本路徑組成曲線和不同的路徑生成技術。

137

八大實驗室 Research Laboratories
Computation Theory and Algorithms Laboratory

Research Faculty: Churn-Jung Liau (Chair, Research Fellow), Da-Wei Wang (Research Fellow), Chi-Jen Lu (Research Fellow), Der-Tsai Lee (Distinguished Visiting Chair), Tsan-sheng Hsu (Research Fellow), Ming-Tat Ko (Research Fellow)

The principal research goals of our group are to understand the power and limitations of computation and to help establish solid foundations for other areas of computer science. Below, we briefly describe the research topics that we are currently focusing on.

I. Quantum and Post-Quantum Cryptography

We aim to understand the role of quantum power in cryptography, representing a double-edged sword. On the one hand, large-scale quantum computers can break most public-key cryptosystems in use by means of Shor's algorithm, which is being tackled by post-quantum cryptography (PQC). We are developing post-quantum cryptosystems (usually public-key cryptosystems) that are secure against quantum adversaries. On the other hand, quantum computing also enhances the abilities of honest parties, and quantum cryptography broadly explores the rich possibilities of stronger functionality or security that can be achieved. We are interested in both research directions, from both theoretical and practical aspects. We have actively participated in the post-quantum cryptography standardization process by the U.S. National Institute of Standards and Technology (NIST), with one of our submissions undergoing round 2 of that process. On the theoretical side, we have broad interests in various topics such as device-independent cryptography, classical delegation of quantum computation, secure multiparty quantum computation, and security in the quantum random oracle model.

II. Geometric Computing

Geometric computing considers algorithm development for problems that can be stated in terms of geometrical objects.
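The role the distance metric plays in such problems can be seen in a small sketch (illustrative only; the sites and query point are made up): the facility nearest to a query point under the Euclidean metric need not be nearest under the Manhattan metric.

```python
import math

def euclidean(p, q):
    # Straight-line distance in the plane.
    return math.hypot(p[0] - q[0], p[1] - q[1])

def manhattan(p, q):
    # Walking distance along a grid-like street network.
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def nearest(site_list, p, dist):
    # Nearest-facility query under a chosen metric.
    return min(site_list, key=lambda s: dist(p, s))

# From the origin, site (2, 2) is Euclidean-nearest (dist ~2.83 vs 3),
# but site (3, 0) is Manhattan-nearest (dist 3 vs 4).
sites = [(2, 2), (3, 0)]
origin = (0, 0)
```

Structures such as Voronoi diagrams generalize this query: changing the metric changes which regions of the plane are assigned to which site.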
Research Faculty (continued): Bo-Yin Yang (Research Fellow), Jing-Sin Liu (Associate Research Fellow), Kai-Min Chung (Research Fellow)

The most common distance metric used in this field is Euclidean distance, but for many practical applications other distance metrics are needed to fit real scenarios. For example, Manhattan distance represents the walking distance in a grid-like city, and time distance describes travel time in a transportation network consisting of vehicles of various speeds. By introducing these metrics into classical geometry problems―such as Voronoi diagrams, convex hulls, route planning, facility location, amongst others―we are investigating problems of theoretical importance that also have practical applications in diverse areas, including transportation networks, wireless sensor networks, VLSI design, and many others. We are interested in how the problem properties would change with respect to each metric and how we can make use of these properties to design efficient algorithms.

III. Combinatorial Optimization and Approximation Algorithms

Combinatorial optimization is concerned with the search for a best possible solution in a finite but large solution space. In most cases, exhaustive searching is not tractable, and

138

researchers have resorted to heuristic, randomized, or approximation algorithms to find near-optimal solutions. In this field, uncapacitated and capacitated facility location problems are among the most central research topics. These problems have played important roles over several decades in the development of approximation algorithms due to the simplicity and elegance of their problem models and the diversity revealed in their abstract linear programming formulations. Our goal is to tackle these kinds of facility location problems from a different perspective based on our previous research. Throughout the process of algorithmic research, we aim to develop not only efficient algorithms and suitable approximation solutions for these problems, but also endeavor to derive fundamental techniques and tools that can be used for a wide range of related problems in combinatorial optimization.

IV. Massive Data

• Efficient Data Intensive Algorithms: Given rapid developments of computer and communication technology, it has become much easier to access or store massive amounts of data electronically. We are interested in research problems concerning efficient epidemic simulation, visualization and construction of disease networks, and classical computer games.

• Logics for Massive Data: Considerable amounts of information and knowledge are implicit in massive data. We intend to study the problems of knowledge representation and reasoning in data science by using formal logic. With proper representation frameworks and logical formalisms, knowledge discovered from massive data can be used in data-intensive intelligent systems.

V. Graph Theory and Algorithms

Graphs are used to model many important applications and are the main tool for solving many theoretical problems. We often start by probing fundamental theoretical problems, such as the structures of graphs with certain properties. With these properties, we can then design efficient solutions. We are working on efficient graph algorithms based on streaming models.

VI. Computational Learning Theory

Many situations in daily life require us to make repeated decisions before knowing the resulting outcomes. Such scenarios have motivated study of the so-called online decision problem, in which one must iteratively choose an action and then experience some corresponding loss for a number of rounds. To investigate this problem, we are identifying natural scenarios in which online algorithms with improved performances can be designed. Moreover, we have discovered new applications of this problem in diverse areas, such as machine learning, game theory, and complexity theory.

VII. Robotics

Planning smooth paths is a fundamental task for applications of unmanned vehicles such as service robots, self-driving vehicles, and unmanned aerial vehicles, necessitating path and operation constraints for safety. We have studied different methodologies and variants of path primitives for smooth curvature-bounded path generation in scenarios such as lane change and state-to-state transfer with minimized travel time or path length as the criterion.

139
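The online decision problem described above is commonly approached with multiplicative-weights (Hedge-style) strategies. The following is a textbook sketch of that idea, not the laboratory's own algorithm, and the loss sequence is made up.

```python
import math

def hedge(loss_rounds, n_actions, eta=0.5):
    """Multiplicative-weights strategy for the online decision problem.
    Each round we play a distribution over actions, observe per-action
    losses in [0, 1], and exponentially down-weight lossy actions."""
    w = [1.0] * n_actions
    total_loss = 0.0
    for losses in loss_rounds:
        z = sum(w)
        probs = [wi / z for wi in w]            # play this distribution
        total_loss += sum(p * l for p, l in zip(probs, losses))
        w = [wi * math.exp(-eta * l) for wi, l in zip(w, losses)]
    return total_loss, w

# Action 0 always loses, action 1 never does: the weights shift toward
# action 1, so the learner's cumulative loss stays close to the best
# action's (zero) loss, i.e., the regret is small.
rounds = [[1.0, 0.0]] * 20
loss, weights = hedge(rounds, 2)
```

The classic guarantee is that, with a suitable learning rate, the learner's loss exceeds the best fixed action's loss by only O(sqrt(T log n)) over T rounds.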

特聘講座 / 特聘研究員 Distinguished Visiting Chair / Distinguished Research Fellow

李德財 Der-Tsai Lee 142 張韻詩 Jane Win-Shih Liu 144 許聞廉 Wen-Lian Hsu 146 廖弘源 Mark Liao 147 141

研究人員 Research Faculty
Academician & Distinguished Visiting Chair
李德財 Der-Tsai Lee
Ph.D., Computer Science, University of Illinois at Urbana-Champaign, United States
T +886-2-2788-3799 ext. 2209  F +886-2-2782-4814
E dtlee@iis.sinica.edu.tw  W www.iis.sinica.edu.tw/pages/dtlee

Research Description

In recent years, my research interests lie in the design and analysis of algorithms and information security. In the area of algorithm design, our goal is to study the complexities of computational problems and to design efficient algorithms for them, as this addresses fundamental issues in the theoretical aspects of computer science. Currently, we focus on the fields of geometric computing and combinatorial optimization. In geometric computing, we consider non-traditional distance metrics, such as time distance, inspired by different real-world concepts, and introduce them to classical geometric problems for performance evaluation. For example, we study variations of route planning and facility location problems, which arise in various areas, including transportation networks, communication networks, wireless sensor networks, VLSI design, etc. By introducing new distance metrics, we formulate both theoretically and practically important problems, and find efficient algorithmic solutions to these problems. In combinatorial optimization, we study capacitated facility location problems. These problems are among the most studied research topics in the field of approximation algorithms, and much research effort in the past has developed many basic design and analysis techniques of this field. We explore the underlying structures and properties of these problems, and obtain improved approximation solutions by developing new fundamental techniques derived from linear programming, which can be applied to a wide range of related problems. In the area of information security, our focus is to study mechanisms for secure message delivery.
We initiated the SecureKeyIn (SKI) project, which aims to provide a secure, accountable and convenient platform for modern mobile users in need of a trusted intermediary to exchange messages and data in a secure manner anytime and anywhere. We have developed a secure, end-to-end-encryption messenger, called SKI+, as an iOS/Android App and Windows/macOS/Unix application. Because messaging is one of the most popular online activities, along with social networking, the SKI project can differentiate itself from others for being agnostic and a trusted intermediary. For being agnostic, our first trial App can be agile and adaptive to the user's favorite messenger Apps. It is App-agnostic and platform-independent. The scope of secure delivery can be applied to email/audio/video/photo/document/calendar as well. For being a trusted intermediary, we would like to act as the user's fiduciary safeguard upon the usage of popular on-line offerings (such as Facebook, Gmail, etc.). The SKI+ App employs public key infrastructure. It can manage and exchange public/private keys in a manner transparent to the user. Acting as a trusted fiduciary intermediary, SKI+'s security standard is more stringent. The private key, once generated when the user first registers, does not leave the user's device; thus even SKI+'s staff or other prying eyes cannot view the user's messages in plaintext nor decrypt them, because the keys for decryption reside exclusively in the user's device. SKI+ requires minimal personal information, i.e., only the user's mobile phone number, which is required for registration and identification purposes. It offers strong security (authenticity and integrity) for the user's one-on-one and group chats. SKI+ also has an Account Migration feature for using the same account across multiple devices.
When the user reinstalls the SKI+ App or changes to a new mobile device, this feature allows the user to restore the previous state without having to start anew with an empty contact list and empty conversation threads.

142
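The time-distance metric mentioned in the research description above can be illustrated by a standard shortest-path computation in which edge weights are travel times rather than lengths. This is a generic sketch over a made-up road network, not the specific route-planning models under study.

```python
import heapq

def quickest_travel_time(graph, src, dst):
    """Dijkstra's algorithm with edge weights interpreted as travel
    times, so the 'distance' between two points is the fastest trip."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        t, u = heapq.heappop(heap)
        if u == dst:
            return t
        if t > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, minutes in graph.get(u, []):
            nt = t + minutes
            if nt < dist.get(v, float("inf")):
                dist[v] = nt
                heapq.heappush(heap, (nt, v))
    return float("inf")

# A slow direct road (30 min) vs. a faster detour via a highway ramp
# (10 + 5 min): under time distance the detour is the "shorter" route.
roads = {
    "home": [("office", 30), ("ramp", 10)],
    "ramp": [("office", 5)],
}
```

Under the time-distance metric, nearest-facility and Voronoi-style questions are then asked with respect to these fastest travel times rather than geometric lengths.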

・ Senior Advisor, National Security Council, Taiwan (2016-2020)
・ President & Computer Science Chair Professor, National Chung Hsing University, Taichung, Taiwan (2011-2015)
・ Distinguished Research Fellow (1998-2019) & Director, Institute of Information Science, Academia Sinica (1998-2008)
・ Assistant Professor (1978), Associate Professor (1981), Professor (1986-1999), EECS, Northwestern University
・ Program Director, Div. of Computer and Computation Research, National Science Foundation (1989-1990)
・ Alumni Award for Distinguished Service, College of Engineering, UIUC (2017); Distinguished Alumni Educator Award, Dept. of Computer Science, UIUC
・ Humboldt Research Award (2007); Ambassador Scientist, Alexander von Humboldt Foundation, Germany (2010-2016)
・ Member, The World Academy of Sciences (TWAS)
・ Academician, Academia Sinica (2004)
・ Fellow, Association for Computing Machinery (1997); Institute of Electrical and Electronic Engineers (1992)
・ Editor, Algorithmica, Int'l J. of Computational Geometry & Applications, Int'l J. of Information and Computer Security; Series Editor, Lecture Notes Series on Computing, World Scientific Publishing Co., Inc.
・ Ph.D. (1978), M.S. (1976), CS, University of Illinois at Urbana-Champaign; B.S., EE, National Taiwan University (1971)

Publications

1. M.J. Kao, J.Y. Shiau, C.C. Lin and D.T. Lee, "Tight Approximation for Partial Vertex Cover with Hard Capacities," Theoretical Comput. Sci., 778, 61-72, 2019.
2. M.J. Kao, H.L. Tu and D.T. Lee, "O(f) Bi-criteria Approximation for Capacitated Covering with Hard Capacities," Algorithmica, 81(5), 1800-1817, 2019.
3. H.I Yu, T.C. Lin, and D.T. Lee, "The (1|1)-Centroid Problem in the Plane with Distance Constraints," International Journal of Computational Geometry and Applications, 28(2), 81-109, 2018.
4. H.I Yu, C.C. Li, and D.T. Lee, "The Multi-service Center Problem," Theoretical Comput. Sci., 705, 58-74, 2018.
5. Y.H. Su, C.C. Lin, and D.T. Lee, "Broadcasting in Weighted Trees under the Postal Model," Theoretical Comput. Sci., 621: 73-81, 2016.
6. B.S. Dai, M.J. Kao, and D.T. Lee, "Optimal Time-Convex Hull for a Straight-Line Highway in Lp-Metrics," Comput. Geom., 53: 1-20, 2016.
7. M.J. Kao, H.L. Chen, and D.T. Lee, "Capacitated Domination: Problem Complexity and Approximation Algorithms," Algorithmica, 72(1): 1-43, 2015.
8. C.H. Liu, E. Papadopoulou, and D.T. Lee, "The k-Nearest Neighbor Voronoi Diagram Revisited," Algorithmica, 71(2): 429-449, 2015.
9. J.J. Chen, M.J. Kao, D.T. Lee, I. Rutter, and D. Wagner, "Online Dynamic Power Management with Hard Real-Time Guarantees," Theoretical Comput. Sci., 595: 46-64, 2015.
10. C.H. Liu, C.X. Lin, I.C. Chen, D.T. Lee, and T.C. Wang, "Efficient Multilayer Obstacle-Avoiding Rectilinear Steiner Tree Construction Based on Geometric Reduction," IEEE Trans. Comput.-Aided Design of Integrated Circuits and Systems, 33(12), 1928-41, 2014.
11. F. Aurenhammer, R. Klein and D.T. Lee, Voronoi Diagrams and Delaunay Triangulations, World Scientific Publishing, 2013.
12. M.J. Kao, B. Katz, M. Krug, D.T. Lee, I. Rutter, D. Wagner, "The Density Maximization Problem in Graphs," J. Combinatorial Optimization, pp. 723-754, Nov. 2013.

143

研究人員 Research Faculty
Distinguished Visiting Research Chair
張韻詩 Jane Win-Shih Liu
Sc.D., Electrical Engineering, Massachusetts Institute of Technology, United States
T +886-2-2788-3799 ext. 1807  F +886-2-2782-4814
E janeliu@iis.sinica.edu.tw  W www.iis.sinica.edu.tw/pages/janeliu

Research Description

Throughout the 80's and 90's, my research focus was on theories, algorithms, architectures and tools for building real-time systems. Those decades saw great advances in technologies for building time-critical systems from commodity components and rigorously validating their timing behavior. PERTS (Prototyping Environment for Real-Time Systems), built by my students and me in the mid-90's, is a part of our contribution. This system of schedulers and tools puts theorems and algorithms for scheduling, resource management, and validation into a ready-for-use form so that developers may validate, simulate and evaluate design alternatives of time-critical systems. PERTS was subsequently commercialized. We also developed the underlying principle of an open architecture for real-time applications. The architecture makes it possible to tune and validate in an open environment the timing behavior of a real-time component independently, and thus enables independently developed real-time and non-real-time applications to run together. Our approach was convincingly demonstrated by Windows and Linux prototypes of kernel-level CPU and disk bandwidth reservations. When I first joined the Institute of Information Science, my research focused on system architecture, components, platforms and tools for building easy-to-use, customizable and safe user-centric automation and assistive devices and systems at low cost. Some of the devices aim to enhance the quality of life and self-reliance of users, especially elderly individuals. Other devices can serve as point-of-care or automation tools for use at home or within care-providing institutions.
Publications on this work and links to open source software prototypes can be found at the SISARL homepage http://sisarl.iis.sinica.edu.tw. My recent work concerns technologies for disaster preparedness and response. This work was supported by the thematic projects OpenISDM (Open Information Systems for Disaster Management) and DRBoaST (Disaster Resilience through Big open Data and Smart Things) in the Academia Sinica Sustainability Science Research Program. The major thrust of the OpenISDM project was toward providing a framework for building open and sustainable disaster management information systems. Building blocks of the framework include trustworthy information brokerage services for responsive information access during emergencies and methods and tools for synergetic use of data from human sensors and physical sensors. Proof of concept prototypes and publications can be found at the OpenISDM homepage http://openisdm.iis.sinica.edu.tw. Project DRBoaST further emphasizes the generation and use of data and smart things. Significant contributions of this work include active devices, mobile applications and services/systems that can automatically process alerts from authorized senders and initiate location-specific preparedness and response actions to reduce risks when disasters strike. Our prototypes have demonstrated the feasibility of active smart devices/applications and the effectiveness of active emergency response systems containing them. To enable their pervasive use in smart buildings and environments, we designed and built a platform called the Building/environment Data and Information system (BeDIS). Structured as a fog, BeDIS is scalable and responsive under overload.
When triggered by a disaster/emergency alert from responsible government agencies or the building safety system, BeDIS functions as a system of micro data servers for delivering location- and situation-specific emergency response instructions to people and decision support data to active devices and applications. During normal times, it functions as an indoor positioning system, enabling people to locate themselves and navigate amidst dense crowds and moving objects indoors via their mobile phones. BeDIS also supports applications and services for locating objects and people indoors and tracking their movements with location accuracy of 3-5 m or 6-10 m and nominal response time of 2-3 seconds. BeDIS is being productized along with families of indoor positioning, indoor navigation, real-time object tracking, and geo-fence applications. These future products are undergoing field trials in which their effectiveness and usability are assessed by targeted end users in hospitals and elderly care facilities.
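Indoor positioning of the kind described above is often introduced through trilateration from beacon distance estimates. The following minimal two-dimensional sketch is illustrative only; it is not BeDIS's actual algorithm (which the text does not detail), and the beacon coordinates are made up.

```python
def trilaterate(b1, b2, b3):
    """Estimate a 2-D position from three (x, y, distance) beacon
    readings by linearizing the circle equations and solving the
    resulting 2x2 linear system with Cramer's rule."""
    (x1, y1, r1), (x2, y2, r2), (x3, y3, r3) = b1, b2, b3
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    c1 = r1**2 - r2**2 + x2**2 - x1**2 + y2**2 - y1**2
    c2 = r1**2 - r3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("beacons are collinear")
    return ((c1 * a22 - c2 * a12) / det, (a11 * c2 - a21 * c1) / det)

# A device at (1, 2) measured from beacons at (0, 0), (4, 0), (0, 4).
pos = trilaterate((0, 0, 5**0.5), (4, 0, 13**0.5), (0, 4, 5**0.5))
```

Real deployments must additionally cope with noisy distance estimates, typically by least-squares fitting over more than three beacons.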

・ University of Illinois, Computer Science Department, Distinguished Educator Award (2011)
・ Taiwan Linux Consortium, Linux Golden Penguin Award (2009)
・ Honor Medal, Taiwan Institute of Information and Computing Machinery (2008)
・ Technical Achievement and Leadership Award, IEEE Computer Society, Technical Committee on Real-Time Systems (2005)
・ Distinguished Visiting Research Fellow, Institute of Information Science, Academia Sinica (2004-present)
・ Shun Hing Group Honorary Chair Professor, Computer Science, Tsing Hua University (2004-present)
・ Software Architect, OS Core Technology, Windows, Microsoft (2000-2004)
・ Professor of Computer Science, University of Illinois at Urbana-Champaign, United States (1981-2000)
・ Sc.D., Massachusetts Institute of Technology (1968)
・ M.S. and EE, Massachusetts Institute of Technology (1966)
・ B.S., Cleveland State University (1959)
・ Life Fellow of IEEE

Publications

1. C. C. Li, J. Su, E. T.-H. Chu, and J. W. S. Liu, "Building/environment Data/information Enabled Location Specificity and Indoor Positioning," IEEE Internet of Things Journal, 2017, DOI: 10.1109/JIOT.2017.2748141
2. C.Y. Lin, K.L. Yeh, E.T.-H. Chu, and J.W.S. Liu, "Participant Selection Problem-Relative Performance of Five Optimization Solvers," Proceedings of the 8th International Conference on Computer Modeling and Simulation, January 2017, DOI: http://dx.doi.org/10.1145/3036331.3036334.
3. J.W.S. Liu, L.J. Chen, J. Su, C.C. Li and E.T.H. Chu, "A Building/environment Data Based Indoor Positioning Service," Proceedings of Data Science and Data Intensive Systems, December 2015, DOI: 10.1109/DSDIS.2015.46
4. J.W.S. Liu and E.T.H. Chu, "Dependability of Active Emergency Response Systems," Proceedings of the 8th International Conference on Dependability (DEPEND2015), pp. 33-37, August 2015 (Best Paper Award).
5. E.T.-H. Chu, C.-Y. Lin, P.H. Tsai and J.W.S. Liu, "Design and Implementation of Participant Selection for Crowdsourcing Disaster Information," International Journal on Safety and Security Engineering, Vol. 5, No. 1, pp. 48-62, March 2015
6. P.H. Tsai, Y.-C. Lin, Y.Z. Ou, E.T.-H. Chu, and J.W.S. Liu, "A Framework for Fusion of Human Sensor and Physical Sensor Data," IEEE Transactions on Systems, Man and Cybernetics: Systems, Vol. 44, Issue 9, pp. 1248-1261, September 2014.
7. C.Y. Lin, E.T.-H. Chu, L.-W. Ku, J.W.S. Liu, "Active Disaster Response Systems for Smart Building," Sensors, Vol. 14, No. 9, pp. 17451-17470, September 2014.
8. P.H. Tsai and J.W.S. Liu, "Intelligent Tools for Reducing Medication Dispensing and Administration Errors," In: Gibbons J., MacCaull W. (eds) Foundations of Health Information Engineering and Systems. FHIES 2013. Lecture Notes in Computer Science, Vol. 8315, pp. 1-21, Springer, Berlin, Heidelberg, 2014.
9. J.W.S. Liu, E.T.-H. Chu, and C.S. Shih, "Cyber-Physical Element of Disaster Prepared Smart Environment," IEEE Computer, volume 46, number 2, pages 69-75, February 2013.
10. T.Y. Chen, Y.C. Huang, T.S. Chou, C.S. Shih, and J.W.S. Liu, "Model-Based Development of User-Centric Automation and Assistive Devices/Systems," IEEE Systems Journal, Vol. 16, number 3, September 2012.
11. J. W. S. Liu, Real-Time Systems, Prentice Hall, 2000.
12. J.W.S. Liu, K.J. Lin, W.K. Shih, R. Bettati and J.Y. Chung, "Imprecise Computations," IEEE Proceedings, Vol. 82, pp. 1-12, January 1994.

145

研究人員 Research Faculty
Distinguished Research Fellow
許聞廉 Wen-Lian Hsu
Ph.D., Operations Research, Cornell University, United States
T +886-2-2788-3799 ext. 2211  E hsu@iis.sinica.edu.tw
F +886-2-2782-4814  W iasl.iis.sinica.edu.tw/hsu/en/about/
・ Associate Professor (with tenure), Department of IE/MS, Northwestern University (1986-1989)
・ K.T. Lee Research Breakthrough Award(李國鼎穿石獎)(1999)
・ President, Taiwanese Association for Artificial Intelligence (TAAI) (2001-2002)
・ NSC Appointed Outstanding Research Award(國科會傑出特約研究員獎)(2005)
・ IEEE Fellow (2006)
・ President, Association for Computational Linguistics and Chinese Language Processing (ACLCLP) (2011-2012)
・ Distinguished Research Fellow, Institute of Information Science, Academia Sinica (2008-present)
・ Director, Institute of Information Science, Academia Sinica (2012-2018)

Research Description

My main research interests include graph algorithms, natural language understanding, and bioinformatics. In graph algorithms, we have done extensive and ground-breaking work on two fundamental classes of special graphs, namely interval graphs and planar graphs, in which a new data structure, the PC-tree, was introduced as a natural representation for planar graph embedding. In natural language understanding, we have developed the following: a) A Chinese input system, GOING (自然輸入法), which is being used by more than a million people in Taiwan; b) A knowledge annotation and inference kernel, InfoMap, which supports the implementation of our NLU modules and has helped us win numerous awards in international contests, such as Chinese question answering, word segmentation, and recognizing inference in text tasks.
In biological literature mining, we also won first place in the BioCreative gene normalization task in 2009 and 2017; c) A statistical principle-based approach (SPBA) as well as a reduction algorithm (together called RPBA), which is a conceptual pattern-based machine learning (ML) approach that garners the benefits of both rule-based and traditional statistical ML approaches. RPBA can be simultaneously used for speech recognition, parsing, natural language generation, and machine translation, which is well suited to NLU. We have been awarded by MOST a 4-year AI innovation grant of US$400,000 a year on NLU and dialogue systems. In bioinformatics, we have developed the following: a) A suite of software for protein quantitation (Multi-Q, Maxi-Q, and Ideal-Q); b) An ultra-efficient divide-and-conquer algorithm, called Kart, for NGS read mapping, which divides a read into small fragments that can be aligned independently in parallel. Superb results were obtained applying this to RNA-seq and whole genome comparison; c) An automated method, Auto-CHO, for programmable one-pot oligosaccharide synthesis, drastically reducing the complexity of intermediate separation and protecting group manipulation.

Publications
Hsu, experimentally validated miRNA-target interactions database,\" \"HYPROSP: A Hybrid Protein Secondary Structure Prediction Nucleic Acids Research, doi: 10.1093/nar/gkv1258, November Algorithm - a Knowledge-Based Approach,\" Nucleic Acids 20, (2015). Research 32(17), 5059-5065, (2004). 8. Hsin-Nan Lin and Wen-Lian Hsu, \"Kart: a divide-and-conquer 4. Hong-Jie Dai, Chi-Hsin Huang, Ryan T. K. Lin, Richard Tzong- algorithm for NGS read alignment,\" Bioinformatics 33, 2281- Han Tsai, and Wen-Lian Hsu, \"BIOSMILE web search: a web 2287, (2017). application for annotating biomedical entities and relations,\" Nucleic Acids Research 36, W390-W398, (2008). 9. Cheng-Wei Cheng, Yixuan Zhou, Wen-Harn Pan, Supriya Dey, Chung-Yi Wu, Wen-Lian Hsu, Chi-Huey Wong, \"Hierarchical 5. Chih-Chiang Tsou, Chia-Feng Tasi, Ying-Hao Tsui, Putty-Reddy and programmable one-pot synthesis of oligosaccharides,\" Nature Sudhir, Yi-Ting Wang, Yu-Ju Chen, Jeou-Yuan Chen, Ting- Communications 9, Article number:5202, (2018). Yi Sung, and Wen-Lian Hsu, \"IDEAL-Q: An automated tool for label-free quantitation analysis using an efficient peptide 10. Ming-Siang Huang, Po-Ting Lai, Pei-Yen Lin, Yu-Ting You, alignment approach and spectral data validation,\" Molecular & Richard Tzong-Han Tsai and Wen-Lian Hsu, Biomedical Named Cellular Proteomics 9, 131-144, (2010). Entity Recognition and Linking Datasets: Survey and Our Recent Development, Briefings in Bioinformatics (Oxford), March 6. Yung-Chun Chang, Cen-Chieh Chen, Yu-Lun Hsieh, Chien (2020). 146

研究人員 Research Faculty
Distinguished Research Fellow
廖弘源 Mark Liao
Ph.D., Electrical Engineering, Northwestern University, United States
T +886-2-2788-3799 ext. 2203 or 1550  E liao@iis.sinica.edu.tw
F +886-2-2789-3309  W www.iis.sinica.edu.tw/pages/liao/
・ Honorary Chair Professor, National Chiao Tung University (2016-present)
・ IEEE Fellow (2013)
・ Academia Sinica Young Investigator Award (1998)
・ Academia Sinica Investigator Award (2010)
・ Distinguished Research Award, National Science Council (2003-2006, 2010-2016)
・ TECO Award 東元獎 (2016)
・ Associate Editor, IEEE Trans. on Multimedia (1998-2001), Image Processing (2009-2013), Information Forensics and Security (2009-2012), ACM Computing Surveys (2018-present)

Research Description

Beginning in 2018, my research team received a four-year AI-related project from the Ministry of Science and Technology (MOST) (2018/1~2021/12). The MOST grants about $280,000 to this project each year. The original expertise of my team is multimedia signal and information processing, and MOST encourages research teams with different training backgrounds to invest in artificial intelligence related theory or applied research from different aspects. In fact, the MOST had recognized that AI would be a very important direction in the future; if no resources were invested, Taiwan's competitiveness in science and industry would inevitably decline. Because the preparation time was short, we started with multimedia information processing combined with deep learning as the topic. The MOST is concerned that R&D teams will treat publishing papers as their ultimate goal, and therefore seriously considers cooperation with Taiwan's industry as one of the evaluation criteria. One of the most important criteria is "problems given by industry, and solved by academia." My research team then chose to cooperate with the listed IC company ELAN.
We hope to integrate our respective software and hardware advantages, and use this to develop "smart city traffic flow solutions" that are not only applicable to Taiwan, but can also be exported internationally. We expect to complete the following systems in four years: (1) A sub-system that can directly execute edge computing at intersections to compute various parameters of traffic flow; (2) A sub-system that can use computer vision technology to compute other traffic flow parameters between adjacent intersections; and (3) A sub-system that can make use of the parameters obtained above to trigger a reinforcement learning process, and then dynamically adjust all traffic signs in a certain range.

Publications

1. C. Y. Wang, H. Y. Mark Liao, P. Y. Chen, and J. W. Hsieh, "Enriching Variety of Layer-wise Learning Information by Gradient Combination," IEEE International Conference on Computer Vision Workshop (ICCVW) "Low Power Computer Vision," October 2019.
2. Jen-Chun Lin, Wen-Li Wei, Tyng-Luh Liu, C.-C. Jay Kuo, and H. Y. Mark Liao, "Tell Me Where is Still Blurry: Adversarial Blurred Region Mining and Refining," 27th ACM Multimedia Conference (long paper), Nice, France, October 2019.
3. Jen-Chun Lin, Wen-Li Wei, Tyng-Luh Liu, Yi-Hsuan Yang, Hsin-Min Wang, Hsiao-Rong Tyan, and Hong-Yuan Mark Liao, "Coherent Deep-Net Fusion To Classify Shots In Concert Videos," IEEE Transactions on Multimedia, volume 20, number 11, pages 3123-3136, November 2018.
4. Jen-Chun Lin, Wen-Li Wei, James Yang, Hsin-Min Wang, and Hong-Yuan Mark Liao, "Automatic Music Video Generation Based on Simultaneous Soundtrack Recommendation and Video Editing," 25th ACM Multimedia Conference (long paper), San Jose, USA, October 2017.
5. P. C. Wen, W. C. Cheng, Y. S. Wang, H. K. Khu, Nick C. Tang, and H. Y. Mark Liao, "Court Reconstruction for Camera Calibration in Broadcast Basketball Videos," IEEE Transactions on Visualization and Computer Graphics, volume 22, number 5, pages 1517-1526, May 2016.
6. Nick C. Tang, Yen-Yu Lin, Ming-Fang Weng, and Hong-Yuan Mark Liao, "Cross-Camera Knowledge Transfer for Multiview People Counting," IEEE Transactions on Image Processing, volume 24, number 1, pages 80-93, January 2015.
7. Y. L. Chen, C. T. Hsu, and H. Y. Mark Liao, "Simultaneous Tensor Decomposition and Completion Using Factor Priors," IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 36, number 3, pages 577-591, March 2014.
8. M. F. Weng, Y. Y. Lin, Nick C. Tang, and H. Y. Mark Liao, "Visual Knowledge Transfer among Multiple Cameras for People Counting with Occlusion Handling," 20th ACM Multimedia Conference (long paper), October 2012.
9. C. H. Ling, Y. M. Liang, C. W. Lin, Y. S. Chen, and H. Y. Mark Liao, "Human Object Inpainting Using Manifold Learning-based Posture Sequence Estimation," IEEE Trans. on Image Processing, volume 20, number 11, pages 3124-3135, November 2011.
10. Hsueh-Yi Sean Lin, Hong-Yuan Mark Liao, and Ja-Chen Lin, "Visual Salience-Guided Mesh Decomposition," IEEE Transactions on Multimedia, volume 9, number 1, pages 46-57, January 2007.

147

研究人員 Research Faculty

王大為 Da-Wei Wang 150
王建民 Chien-Min Wang 151
王柏堯 Bow-Yaw Wang 152
王新民 Hsin-Min Wang 153
古倫維 Lun-Wei Ku 154
何建明 Jan-Ming Ho 155
吳真貞 Jan-Jan Wu 156
呂及人 Chi-Jen Lu 157
呂俊賢 Chun-Shien Lu 158
宋定懿 Ting-Yi Sung 159
李丕榮 PeiZong Lee 160
林仲彥 Chung-Yen Lin 161
林仁俊 Jen-Chun Lin 162
施純傑 Arthur Chun-Chieh Shih 163
柯向上 Hsiang-Shang 'Josh' Ko 164
徐讚昇 Tsan-sheng Hsu 165
馬偉雲 Wei-Yun Ma 166
高明達 Ming-Tat Ko 167
張原豪 Yuan-Hao Chang 168
莊庭瑞 Tyng-Ruey Chuang 169
陳伶志 Ling-Jyh Chen 170
陳孟彰 Meng Chang Chen 171
陳郁方 Yu-Fang Chen 172
陳祝嵩 Chu-Song Chen 173
黃文良 Wen-Liang Hwang 174
楊柏因 Bo-Yin Yang 175
楊得年 De-Nian Yang 176
葉彌妍 Mi-Yen Yeh 177
廖純中 Churn-Jung Liau 178
劉進興 Jing-Sin Liu 179
劉庭祿 Tyng-Luh Liu 180
蔡懷寬 Huai-Kuang Tsai 181
穆信成 Shin-Cheng Mu 182
蘇克毅 Keh-Yih Su 183
蘇 黎 Li Su 184
鐘楷閔 Kai-Min Chung 185
林宗慶 Tsung-Ching Lin 186

研究人員 Research Fellow

王大為 Da-Wei Wang
Ph.D., Computer Science, Yale University, United States
T +886-2-2788-3799 ext. 1729
F +886-2-2782-4814
E [email protected]
W www.iis.sinica.edu.tw/pages/wdw

・ Deputy Director, Institute of Information Science, Academia Sinica, Taiwan (2007/7-2009/9) (2018/9-present)
・ President, Taiwan Association of Medical Informatics Science, Taiwan (2011/11-2013/12)
・ Director, Information Technology Services, Academia Sinica, Taiwan (2010/8-2017/10)
・ Assistant Research Fellow (1992/12-2000/3), Associate Research Fellow (2000/3-2007/5), Research Fellow (2007/5-present), Institute of Information Science, Academia Sinica, Taiwan

Research Description

My research interests are in the theoretical foundations of computer science and in applying fundamental tools to various areas. I am most interested in understanding the structure of objects to be manipulated computationally, classifying the resources needed for various computational problems, and identifying more efficient algorithms. On the application side, I am interested in information privacy, simulation, and medical informatics.

I was previously involved in the design of information privacy protection mechanisms and the development of privacy regulations in medical research. We developed a prototype system, CellSecu, to safeguard personal privacy. The system checks whether datasets intended for release contain too much personal information, and is based on an appropriate logic model that we discovered. We also proposed a method to quantify privacy leakage with information theory and economic models.

I have worked closely with the Centers for Disease Control in Taiwan to use computer simulations to facilitate decision-making processes. We developed an efficient agent-based stochastic simulator to study the spatial transmission patterns of influenza.
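As a rough illustration of this style of agent-based stochastic epidemic modeling, the sketch below runs a toy compartmental (SIR-like) simulation. It is not the project's spatially explicit simulator; all parameter names and values here are invented for illustration.

```python
import random

# Toy agent-based stochastic SIR sketch (illustrative only; the actual
# simulator models spatial structure and contact patterns).
# Agent states: "S" susceptible, "I" infectious, "R" recovered.
def simulate(n_agents=1000, n_days=100, beta=0.3, gamma=0.1, seed=1):
    rng = random.Random(seed)
    states = ["I"] * 5 + ["S"] * (n_agents - 5)  # seed 5 initial infections
    history = []
    for _ in range(n_days):
        infectious = states.count("I")
        # Each susceptible agent is infected with probability
        # proportional to the current infectious fraction.
        p_infect = beta * infectious / n_agents
        next_states = []
        for s in states:
            if s == "S" and rng.random() < p_infect:
                next_states.append("I")
            elif s == "I" and rng.random() < gamma:
                next_states.append("R")
            else:
                next_states.append(s)
        states = next_states
        history.append((states.count("S"),
                        states.count("I"),
                        states.count("R")))
    return history

history = simulate()
```

Even this minimal version exhibits the characteristic epidemic curve (infections rise, peak, and decline), which is the kind of aggregate behavior such simulators produce for decision support.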
We also studied problems related to vaccination strategy, such as searching for appropriate strategies with different objectives and different social structures in the population. We applied a surrogate-assisted genetic algorithm to make the search process more efficient, and we also applied machine learning methods to construct better surrogate functions. Recently, I have been involved with data bank related operations to provide guidance on privacy protection and information security.

Publications

1. Jian, Zong-De and Chang, Hung-Jui and Hsu, Tsan-sheng and Wang, Da-Wei, "Applying Deep Learning for Surrogate Construction of Simulation Systems," Springer, volume 873, pages 335-350, September 2019.
2. Jian, Zong-De and Hsu, Tsan-sheng and Wang, Da-Wei, "The Power of Surrogate-Assisted Evolutionary Computing in Searching Vaccination Strategy," International Conference on Simulation and Modeling Methodologies, Technologies and Applications, volume 676, pages 222-240, October 2018.
3. Wu, Chien Hua and Chiu, Ruey Kei and Yeh, Hong Mo and Wang, Da Wei, "Implementation of a cloud-based electronic medical record exchange system in compliance with the integrating healthcare enterprise's cross-enterprise document sharing integration profile," International Journal of Medical Informatics, volume 107, pages 30-39, September 2017.
4. Chien-Chou Chen, Jen-Hsiang Chuang, Da-Wei Wang, Chien-Min Wang, Bo-Cheng Lin, Ta-Chien Chan, "Balancing geo-privacy and spatial patterns in epidemiological studies," Geospatial Health, volume 12, number 2, pages 294-299, August 2017.
Zong-De Jian, Tsan-Sheng Hsu and Da-Wei Wang, "Searching Vaccination Strategy with Surrogate-assisted Evolutionary Computing," SIMULTECH 2016, July 2016.
5. Chia-Tung Kuo, Da-Wei Wang and Tsan-sheng Hsu, "Simple and efficient algorithms to get finer resolution in a stochastic discrete time agent-based simulation," Simulation and Modeling Methodologies, Technologies and Applications, Advances in Intelligent Systems and Computing, volume 256, pages 97-109, 2014.
6. Ting-Yu Liu, Yu Chen, Huai-Hsien Wang, Yen-Lin Huang, Yuan-Chun Chao, Kun-Tong Tsai, Wei-Chih Cheng, Chih-Yuan Chuang, Yi-Ho Tsai, Chung-Yueh Huang, Da-Wei Wang, Chi-Hung Lin, Juen-Kai Wang, and Yuh-Lin Wang, "Differentiation of Bacteria Cell Wall Using Raman Scattering Enhanced by Nanoparticle Array," Journal of Nanoscience and Nanotechnology, volume 12, number 6, pages 5004-5008, June 2012.
7. Yang-chih Fu, Da-Wei Wang, Jen-Hsiang Chuang, "Representative Contact Diaries for Modeling the Spread of Infectious Diseases in Taiwan," PLoS ONE, volume 7, number 10, pages e45113, October 2012.
8. Te-Kang Jan, Da-Wei Wang, Chi-Hung Lin and Hsuan-Tien Lin, "A Simple Methodology of Soft Cost-sensitive Classification," ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM KDD), August 2012.
9. Meng-Tsung Tsai, Tsurng-Chen Chern, Jen-Hsiang Chuang, Chih-Wen Hsueh, Hsu-Sung Kuo, Churn-Jung Liau, Steven Riley, Bing-Jie Shen, Chih-Hao Shen, Da-Wei Wang, Tsan-Sheng Hsu, "Efficient simulation of the spatial transmission dynamics of influenza," PLoS ONE, volume 5, number 11, pages e13292, November 2010.
10. Kung Chen, Yuan-Chun Chang, Da-Wei Wang, "Aspect-oriented design and implementation of adaptable access control for Electronic Medical Records," International Journal of Medical Informatics, volume 79, number 3, pages 181-203, March 2010.
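The surrogate-assisted search idea described in the research description above can be sketched as follows. This is a toy stand-in, not the published algorithm: the objective function, the nearest-neighbor surrogate, and all parameters are invented for illustration (the real system uses an expensive epidemic simulation as the objective and learned surrogate functions).

```python
import random

def expensive_eval(x):
    # Placeholder for a costly simulation run; here a simple quadratic
    # with an (arbitrary) optimum near 0.7 in every dimension.
    return sum((xi - 0.7) ** 2 for xi in x)

def surrogate_fit(archive):
    # A trivial surrogate: predict the cost of the nearest already
    # evaluated point. Real surrogates are learned regression models.
    def predict(x):
        nearest = min(archive,
                      key=lambda e: sum((a - b) ** 2 for a, b in zip(e[0], x)))
        return nearest[1]
    return predict

def search(dim=3, pop=20, gens=15, top_k=3, seed=7):
    rng = random.Random(seed)
    archive = []  # (candidate, true cost) pairs from expensive evaluations
    for _ in range(pop):
        x = [rng.random() for _ in range(dim)]
        archive.append((x, expensive_eval(x)))
    for _ in range(gens):
        predict = surrogate_fit(archive)
        best = min(archive, key=lambda e: e[1])[0]
        # Generate offspring by mutating the current best candidate.
        offspring = [[min(1.0, max(0.0, xi + rng.gauss(0, 0.1))) for xi in best]
                     for _ in range(pop)]
        # Pre-screen with the cheap surrogate; only the most promising
        # few candidates are sent to the expensive evaluator.
        offspring.sort(key=predict)
        for x in offspring[:top_k]:
            archive.append((x, expensive_eval(x)))
    return min(archive, key=lambda e: e[1])

best_x, best_cost = search()
```

The point of the design is visible in the evaluation budget: each generation performs `pop` cheap surrogate predictions but only `top_k` expensive evaluations, which is what makes the search tractable when each true evaluation is a full simulation run.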

