摘要:
Recent advancements in the field of compressed data structures create inter- esting opportunities for interdisciplinary research and applications. Com- pressed data structures provide essentially a time space tradeofffor solving a problem; while traditional data structures use extra space in addition to the input, compressed data structures replace the input and require space proportional to the compressed size of the input. The amount of available memory is often fixed, thus, the user might be willing to spend more time if it allows the use of larger inputs. However, despite the potential behind compressed data structures, they have not quite reached the audience of other disciplines. We study how to take advantage of compressed data structures in the fields of bioinformatics, data analysis and information retrieval. We present several novel applications for compressed data structures and include an experimental evaluation of the time space tradeoffs achieved. More precisely, we propose (i) a space-efficient string mining algorithm to recognise substrings that admit the given frequency constraints, (ii) both theoretical and practical methods for computing approximate overlaps be- tween all string pairs, (iii) a practical path-based graph kernel for predicting the function of unknown enzymatic reactions, and (iv) a compressed XML index that supports efficient XPath queries on both the tree-structure and textual content of XML documents. Problem (i) is motivated by knowledge discovery in databases, where the goal is to extract emerging substrings that discriminate two (or more) databases. Problem (ii) is one of the first phases in a sequence assembly pipeline and requires efficient algorithms due to the new high-throughput sequencing systems. Problem (iii) is motivated by machine learning, where kernels are used to measure the similarity of complex objects. Problem (iv) has its background in information retrieval. The proposed methods achieve theoretical and practica