摘要:
A compressed text index is a data structure that stores a text in the compressed form while efficiently supports pattern searching queries. This thesis investigates three compressed text indexes and their applications in bioinformatics. Suffix tree, suffix array, and directed acyclic word graph (DAWG) are the pioneers text indexing structures developed during the 70's and 80's. Recently, the development of compressed data-structure research has created many structures that use surprisingly small space while being able to simulate all operations of the original structures. Many of them are compressed versions of suffix arrays and suffix trees, however, there is still no compressed structure for DAWG with full functionality. Our first work introduces an nH k ( S ) + 2 nH 0 ( T S ) + o ( n )-bit compressed data-structure for simulating DAWG where H k ( S ) and H 0 ( T S ) are the empirical entropy of the reversed input sequence and the suffix tree topology of the reversed sequence, respectively. Besides, we also proposed an application of DAWG that improves the time complexity of local alignment problem. In this application, using DAWG, the problem can be solved in O ( n 0.628 m ) average case time and O ( nm ) worst case time where n and m are the lengths of the database and the query, respectively. In the second work, we focus on text indexes for a set of similar sequences. In the context of genomic, these sequences are DNA of related species which are highly similar, but hard to compress individually. One of the effective compression schemes for this data (called delta compression) is to store the first sequence and the changes in term of insertions and deletions between each pair of sequences. However, using this scheme, many types of queries on the sequences cannot be supported effectively. In the first part of this work, we design a data structure to support the rank and select queries in the delta compressed sequences. The data structure is called multi-version