1 Introduction
A highly repetitive text collection is a set of texts such that any two texts can be converted into each other with a few modifications. Such collections have become common in various fields of research and industry. Examples include genome sequences for the same species, version-controlled documents, and source code repositories. For human genome sequences, the difference between two individual human genomes is said to be around 0.1%, and there are huge collections of human genomes, such as that of the 1000 Genomes Project [1]. As another example, Wikipedia belongs to the category of version-controlled documents. There is clearly a strong, growing need for powerful methods to process huge collections of repetitive texts.
A self-index is a data representation for a text with support for random access and pattern search operations. Quite a few self-indexes for highly repetitive texts have been proposed thus far. Early methods include the SLP-index [5] and LZ-End [9]. The block tree index (BT-index) [14] is a recently proposed self-index that reaches compression close to that of Lempel-Ziv. The run-length FM-index (RLFM-index) [7] is a recent breakthrough built on the notion of a run-length encoded Burrows-Wheeler transform (BWT). The RLFM-index can be constructed in a space-efficient, online manner while supporting fast operations. Although existing self-indexes for highly repetitive texts are constructible in compressed space and also support fast operations, no proposed method supports dynamic updating of the self-index when editing a set of texts with a highly repetitive structure. Thus, an important open challenge is to develop a self-index for highly repetitive texts that supports not only fast random access and pattern search but also dynamic updating of the index.
The ESP-index [18] and signature encoding [15] are two self-indexes using the notion of a locally consistent parsing (LCPR) [11] for highly repetitive texts. While they have the advantages of being constructible in an online manner and supporting dynamic updates of their data structures, they have the large disadvantage of slow pattern search for short patterns. From a given string, these methods build a parse tree that guarantees upper bounds on parsing discrepancies between different appearances of the same substring. For pattern search, the ESP-index performs a top-down search of the parse tree to find candidate pattern appearances, and then it checks whether the pattern actually occurs around each candidate. In contrast, signature encoding uses 2D range reporting queries for pattern search. Traversing the search space, especially for short patterns, takes a long time for both methods, which limits their applicability in practice.
In this paper, we present a novel, compressed dynamic index for highly repetitive text collections, called a truncated suffix tree (TST) based index (TST-index). The TST-index improves on existing self-indexes built on an LCPR by supporting faster pattern search, leveraging the idea behind a truncated suffix tree (TST) [13]. A TST is built from an input text, and a self-index is created for the transformed text by using the TST, which greatly improves the search performance. In addition, the TST-index supports dynamic updates, which is a useful property when input texts can be edited, as noted above. Experiments with a benchmark dataset of highly repetitive texts show that pattern search with the TST-index is much faster than with signature encoding. In addition, the TST-index has pattern search performance competitive with that of the RLFM-index, yet the size of the TST-index can be smaller than that of the RLFM-index.
2 Preliminaries
Let be an ordered alphabet and . Let string be an element in , with its length denoted by . In general, a string is called a pattern. Let and throughout this paper. For string , , , and are called a prefix, substring, and suffix of , respectively. For two strings and , let .
The empty string is a string of length . Let . For any , denotes the th character of . For any , denotes the substring of that begins at position and ends at position . Let and for any . For strings and and indexes and , we define insertion and deletion operations as follows: and . For any strings and , let denote all occurrence positions of in ; that is, . Let . Similarly, for .
For a string and integer , we say that is a gram if . We then say that is a short pattern if , or a long pattern if . Similarly, we say that a suffix of length at most of is a short suffix of . Let be the set of all substrings of length and all suffixes of length at most in , i.e., . For example, if and , then . For a string , we say that is a colored sequence if holds for any .
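As an illustration of this set, all length-q substrings together with the suffixes shorter than q can be collected in a single pass (a minimal Python sketch; the helper name `gram_set` and the explicit parameter `q` are our own, since the symbols are elided above):

```python
def gram_set(t, q):
    """All length-q substrings of t plus all suffixes of t shorter than q."""
    # For i <= len(t) - q the slice t[i:i+q] is a q-gram; past that point
    # it is a short suffix of length < q.
    return {t[i:i + q] for i in range(len(t))}

print(sorted(gram_set("abab", 2)))  # ['ab', 'b', 'ba']
```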
For a string and an integer , the longest prefix of such that occurs in is called the longest previous factor without self-reference at position of . Furthermore, is called a factorization of if holds, where factor for each is a substring of and is the size of the factorization. Then, the run-length encoding is a factorization of string such that each factor is a maximal run of the same character in . Such a run is denoted as for length and character . For example, for , . is a colored sequence when we treat each factor as a character.
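The run-length encoding above can be sketched directly (a minimal Python version; the function name is our own):

```python
def run_length_encode(t):
    """Factorize t into maximal runs, returned as (character, length) pairs."""
    runs = []
    i = 0
    while i < len(t):
        j = i
        while j < len(t) and t[j] == t[i]:
            j += 1  # extend the current run while the character repeats
        runs.append((t[i], j - i))
        i = j
    return runs

print(run_length_encode("aabbba"))  # [('a', 2), ('b', 3), ('a', 1)]
```

Treating each (character, length) pair as a single symbol yields the colored sequence mentioned above, since adjacent runs have distinct characters by maximality.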
Next, the Lempel-Ziv 77 (LZ77) factorization without self-reference [21] for a string is a factorization satisfying the following conditions: (1) ; (2) ; (3) is the longest previous factor without self-reference at position of , if it exists for ; and (4) otherwise, . Hence, is the number of factors in LZ77, i.e., . For example, if , then .
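The definition above translates into a short (quadratic-time, for illustration only) Python sketch; each factor is the longest prefix of the remaining text that already occurs strictly inside the parsed prefix, or else a single fresh character:

```python
def lz77_factorize(t):
    """LZ77 factorization without self-reference (naive reference version)."""
    factors = []
    i = 0
    while i < len(t):
        length = 0
        # grow the factor while it still occurs entirely before position i
        while i + length < len(t) and t[i:i + length + 1] in t[:i]:
            length += 1
        if length == 0:
            factors.append(t[i])      # fresh character
            i += 1
        else:
            factors.append(t[i:i + length])
            i += length
    return factors

print(lz77_factorize("abab"))  # ['a', 'b', 'ab']
```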
A self-index is a data structure built on and supporting the following operations:

Count: Given a pattern , count the number of appearances of in .

Locate: Given a pattern , return all positions at which appears in .

Extract: Given a range , return .
Such a self-index is considered static. In contrast, a dynamic self-index supports not only the above three operations but also an update operation on the self-index for insertion and deletion of a (sub)string of length on a string of maximal length . A compressed dynamic self-index is a compressed representation of a dynamic self-index. We assume that for a static setting and for a dynamic setting.
Our model of computation is a unit-cost word RAM with a machine word size of bits. We evaluate the space complexity here in terms of the number of machine words. A bitwise evaluation of space complexity can be obtained with a multiplicative factor.
Finally, let , , and for a real positive number .
3 Literature Review
Self-indexes for highly repetitive text collections constitute an active research area, and many methods have been proposed thus far. In this section, we review the representative methods summarized in the upper part of Table 1. See [7] for a complete survey.
Self-indexes for highly repetitive text collections are broadly classified into three categories. The first category is the grammar-based self-index. Claude and Navarro [5] presented the SLP-index, which is built on a straight-line program (SLP), a context-free grammar (CFG) in Chomsky normal form deriving a single text. For the size of a CFG built from text , the SLP-index takes space in memory while supporting locate queries in time, where is a parameter [5]. Two other grammar-based self-indexes, the ESP-index [18] and signature encoding [15], use the notion of an LCPR. Each takes space in memory while supporting locate queries in time, where and . Signature encoding also supports dynamic updates in time for highly repetitive texts. Although the ESP-index and signature encoding have the advantage that they can be built in an online manner, and although signature encoding also supports dynamic updates, these self-indexes have the large disadvantage that locate queries are slow for short patterns.
The second category includes self-indexes based on the LZ77 factorization. Recently, Bille et al. [4] presented a self-index with a time-space trade-off on LZ77. Their self-index takes space while supporting locate queries in time, where is the number of factors in LZ77 with self-reference on and is an arbitrary constant. Navarro [14] presented a self-index based on a block tree (BT), called the BT-index. The BT-index uses space and locates a pattern in time for any constant .
The third category includes the RLFM-index built on the notion of a run-length encoded BWT. The RLFM-index was originally proposed in [10] but has recently been extended to support locate queries [7]. It uses space for the number of BWT runs and takes time for locate queries. Although the RLFM-index supports fast locate queries in compressed space, it has a serious limitation: the index cannot be updated dynamically.
Despite the importance of compressed dynamic self-indexes for highly repetitive text collections, no previous method supports both fast queries and dynamic updates of the index while achieving a high compression ratio for highly repetitive texts. We thus present a compressed dynamic self-index, called the TST-index, that meets both demands and is applicable to highly repetitive text collections in practice. The TST-index has the following theoretical property, which will be proven over the remaining sections.
Theorem 1.
For a string and an integer , the TST-index takes space while supporting the following four operations: (i) count queries in time for a pattern of length ; (ii) locate queries in time for a pattern of length ; (iii) extract queries in time; and (iv) update operations in time, where .
Method  Space  Locate time  Update time
RLFM-index [7]      Unsupported
Bille et al. [4]      Unsupported
BT-index [14]      Unsupported
SLP-index [5]      Unsupported
Signature encoding [15]      on average
TST-index (static; this study)  ()    Unsupported
TST-index (dynamic; this study)  ()
4 Fast queries with truncated suffix trees
In this section we present a novel text transformation, called a TST transformation, to improve the search performance of a selfindex. We first introduce the TST in the next subsection and then present the TST transformation in the following subsection.
4.1 Tries, compact tries, and truncated suffix trees
A trie for a set of strings is a rooted tree whose nodes represent all prefixes of the strings in (see the left side of Figure 1). Let be the set of nodes in , and let be the set of leaves in . We then define the following five operations for , , , and :

: Returns the string starting at the root and ending at node .

: Returns the node such that .

: Returns the set of all leaves whose prefixes contain .

: Returns the node such that holds if it exists.

: Returns the node such that holds if it exists.
All nodes in are categorized into two types. If node is an internal node and has only one child, then is called an implicit node; otherwise, is called an explicit node. Let and be the respective sets of implicit and explicit nodes in .
For , the function returns the lowest explicit node such that is an ancestor of . The computation times for , , , and are constant, , , and , respectively, where is the computation time for . Here, when we use perfect hashing [6] in space. Moreover, can be computed in constant time if node stores a pointer to in constant space. Hence, the data structure requires space.
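The five trie operations above can be mimicked with a plain pointer-based trie (a hedged Python sketch with method names of our own choosing; the actual data structure would add perfect hashing and stored parent pointers to achieve the stated bounds):

```python
class TrieNode:
    def __init__(self, parent=None, edge=""):
        self.parent = parent      # parent node (None for the root)
        self.edge = edge          # character on the edge from the parent
        self.children = {}        # character -> child TrieNode

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, s):
        """Insert string s and return the node representing it."""
        v = self.root
        for c in s:
            if c not in v.children:
                v.children[c] = TrieNode(v, c)
            v = v.children[c]
        return v

    def label(self, v):
        """String spelled from the root down to node v."""
        cs = []
        while v.parent is not None:
            cs.append(v.edge)
            v = v.parent
        return "".join(reversed(cs))

    def node(self, s):
        """Node whose label is s, or None if no such node exists."""
        v = self.root
        for c in s:
            v = v.children.get(c)
            if v is None:
                return None
        return v

    def leaves(self, v):
        """All leaves in the subtree rooted at v."""
        if not v.children:
            return [v]
        out = []
        for w in v.children.values():
            out += self.leaves(w)
        return out
```

For example, in the trie for {"ab", "ba", "b"}, the node for "b" is internal (it has the child "a"), so only "ab" and "ba" label leaves.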
A compact trie is a space-efficient representation of a trie such that all chains of implicit nodes in are collapsed into single edges (see the right side of Figure 1). We use two representations for node labels in a compact trie. The first representation uses a string such that each edge label in is represented as a pair of start and end positions in , resulting in a space-efficient representation of the edge labels in . This representation is called a compact trie with a reference string and takes space. The second approach represents each node label explicitly and is called a compact trie without a reference string. This representation takes space. The compact trie with a reference string is more space-efficient and is used in the static case, while the compact trie without a reference string is used in the dynamic case.
We can insert or delete a string into or from a compact trie without a reference string in time [12], where is the computation time for updating the data structure for when a child is inserted into or removed from (assume ). Here, when we use the predecessor/successor approach of Beame and Fich [3] in space, where is the number of children of .
Example 2.
The trie on the left side of Figure 1 is built on . In this trie, , , , , , , , , and .
The TST [13] of a string is a trie for . We assume that ends with a special character not included in . Since for Example 2, the right side of Figure 1 shows a TST built on . Trie is a TST built on , because .
An important fact is that always exists for every leaf node in the TST. We thus explicitly store for every leaf .
Vitale et al. showed that the reference string of a TST can be represented as a string of length , and that a TST for a string can be constructed in time in an online manner while using working space [20].
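Which members of the q-gram/short-suffix set label TST leaves can be checked directly: a member labels a leaf exactly when it is not a proper prefix of another member (a small Python sketch; the function name and explicit parameter `q` are our own):

```python
def tst_leaf_strings(t, q):
    """Strings of the q-gram/short-suffix set of t that label TST leaves,
    i.e., members that are not proper prefixes of another member."""
    grams = {t[i:i + q] for i in range(len(t))}
    return {g for g in grams
            if not any(h != g and h.startswith(g) for h in grams)}

print(sorted(tst_leaf_strings("abab", 3)))  # ['aba', 'bab']
```

When t ends with a unique terminal character, every short suffix is distinguished by that terminal, which is the reason the occurrence position stored at each leaf is well defined.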
4.2 TST transformation
We now present a string transformation using a TST, which we call a TST transformation. A string is transformed into a new string by a TST transformation built on . A selfindex with improved search performance can then be built on .
A TST transformation using a TST on is a transformation of into a new string by replacing every gram in by (i.e., by its node id). Formally, , where . Similarly, a pattern is transformed into by a TST transformation using the same TST , i.e., . Given these definitions, the following lemma holds.
Lemma 3.
For two strings and TST of , the following equation holds.
(1) 
For , when cannot be computed, i.e., when contains a new gram not occurring in .
Proof.
Recall that a gram beginning at position in is transformed to the corresponding TST leaf beginning at position in . This means that a substring of length at most and beginning at position in is transformed to a TST leaf containing as a prefix beginning at position in . Let be the set of all leaves containing as a prefix. Then all occurrence positions of all leaves of in are given by . Since , Equation 1 holds for short patterns.
Similarly for a long pattern , a substring beginning at position in is transformed to beginning at position in . clearly holds when cannot be computed. Therefore, Equation 1 also holds for long patterns. ∎
Example 4.
Let and , and let the trie in Figure 1 be a TST of . Then . Let and . Then holds, because , , , and . Similarly, , because .
We can compute in time. can also be computed in time by using and . This is because for , , and , holds if and exist.
Lemma 3 tells us that we can improve search queries for short patterns by answering them on the TST in time, provided that the pattern-search time of the general self-index exceeds , where is the computation time for with the general self-index. If the length of pattern is at most , i.e., , then we perform search queries on the TST. Otherwise, we perform search queries of on a self-index for .
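The dispatch described above, and the equality stated in Lemma 3, can be checked with a small self-contained simulation: each position of T contributes the q-gram (or short suffix) starting there, distinct strings get integer ids standing in for TST leaves, and search branches on the pattern length. All names here are our own, and the scan over the transformed text stands in for the self-index of the later sections:

```python
def build_ids(t, q):
    """Map each q-gram / short suffix of t to an integer id (a stand-in
    for a TST leaf id), and transform t into the id sequence t'."""
    ids, tprime = {}, []
    for i in range(len(t)):
        tprime.append(ids.setdefault(t[i:i + q], len(ids)))
    return ids, tprime

def locate(p, q, ids, tprime):
    if len(p) <= q:
        # short pattern: occurrences of every leaf having p as a prefix
        leaves = {v for g, v in ids.items() if g.startswith(p)}
        return [i for i, v in enumerate(tprime) if v in leaves]
    # long pattern: transform p with the same map; an unseen q-gram
    # means p cannot occur in t at all
    pp = []
    for i in range(len(p) - q + 1):
        g = p[i:i + q]
        if g not in ids:
            return []
        pp.append(ids[g])
    m = len(pp)
    return [i for i in range(len(tprime) - m + 1) if tprime[i:i + m] == pp]

ids, tp = build_ids("abab$", 2)
print(locate("a", 2, ids, tp))  # [0, 2]
```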
We obtain the following theorem by using Lemma 3.
Theorem 5.
Let be an index supporting in time and in time. Then there exists an index of space that supports in time for a short pattern , and that supports in time for a long pattern , where is the size of .
We apply Theorem 5 to signature encoding and present a novel self-index, named the TST-index, in the next section.
5 TST-index
We obtain the TST-index by combining Theorem 5 with signature encoding. First, we introduce the LCPR and signature encoding, and then we develop the TST-index.
5.1 Locally consistent parsing (LCPR)
LCPR [11] is a factorization of a string by using a bit string computed for . Let be the position of the th in . Then and satisfy the conditions given in Lemma 6.
Lemma 6 ([11]).
For a colored sequence of length at least , there exists a function that returns a bit sequence of length satisfying the following properties: (1) and hold for , where and . (2) If holds for integers and , then holds, where , and and are colored sequences.
An LCPR for a colored string using for is defined as .
Example 7.
Let , and , and assume that . Then , , , , , , and holds for any . Since holds, . Similarly, since holds, .
5.2 Signature encoding
A signature encoding [11] of a string is a context-free grammar generating the single text and built using and . Here, is a set of positive integers called variables. is a set of deterministic production rules (a.k.a. assignments), each being either a sequence of variables in or a single character in . is the start symbol deriving string .
A signature encoding corresponds to a balanced derivation tree of height built on the given string , where each internal node (respectively, leaf node) has a variable in (respectively, a character in ), as illustrated in Figure 2. Since this derivation tree is balanced, the node labels at each level can be considered as a sequence from the leftmost node to the rightmost node at a level. We define as a function returning a variable sequence , where is a single character in or a sequence of variables in , and is a single character in . In addition, holds for . Then, is a sequence of node labels at the th level of the derivation tree built from string and defined using as follows.
Here, holds for the minimum positive integer satisfying . Hence, is the height of the derivation tree of . Let be the size of , whose bound is given by the following lemma.
Lemma 8 ([17]).
The size of is bounded by .
satisfies the following properties by definition. (1) holds because for a positive even integer . (2) Every variable is at most because must be a colored sequence to apply an LCPR to . (3) can be stored in space, because every variable in is in one of three cases: (i) ; (ii) , where and ; or (iii) derives a variable sequence of length .
Example 9.
Let be a contextfree grammar, where , , , and . Assume that , , and hold. Then is the signature encoding of .
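The level-by-level shape of the derivation tree can be sketched in Python. This is a deliberately simplified stand-in: real signature encoding chooses pairing boundaries via an LCPR (Lemma 6) so that equal substrings are parsed consistently, whereas this sketch pairs left to right, keeping only the alternation of run-length and pairing steps. All names are our own:

```python
def signature_encode(seq, rules):
    """Simplified sketch: alternate a run-length step with a pairing step
    until a single start variable remains. `rules` maps rule bodies to
    variables and is shared across calls so equal strings get equal
    start variables. Assumes seq is nonempty."""
    def var(body):
        return rules.setdefault(body, len(rules))

    cur = [var(('chr', c)) for c in seq]      # level 0: one variable per char
    height = 0
    while len(cur) > 1:
        # run-length step: name maximal runs of equal variables
        rl, i = [], 0
        while i < len(cur):
            j = i
            while j < len(cur) and cur[j] == cur[i]:
                j += 1
            rl.append(var(('run', cur[i], j - i)))
            i = j
        # pairing step: name adjacent pairs (a lone last variable
        # becomes a unary block)
        cur = [var(('blk', tuple(rl[k:k + 2]))) for k in range(0, len(rl), 2)]
        height += 1
    return cur[0], height
```

Because the pairing here is not locally consistent, equal substrings inside different contexts may parse differently; the LCPR of Lemma 6 is exactly what repairs that in the real construction.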
5.3 Static TST-indexes
We can extend signature encoding into a self-index for pattern search. A key idea for pattern search is that there exists a common variable sequence of length representing every substring in the parse tree of . This sequence is called the core of and is computable from , as described in detail elsewhere [18, 15].
Since we can perform top-down searches of the parse tree for the core of instead of itself, we obtain the following lemma.¹

¹ Takabatake et al. [18] construct the grammar representing an input text by using a technique called alphabet reduction, which is essentially equal to an LCPR. Although the definition of their grammar is slightly different from ours, we can use their search algorithm with our signature encoding.
Lemma 10 ([18]).
For a signature encoding, there exists an index of space supporting count and locate queries in time, in time, and extract queries in time, where is the number of candidate occurrences of by this index.
Proof.
In the original paper by Takabatake et al. [18], their index performs count and locate queries in time and in time. We can remove the term, however, by using perfect hashing [6]. Since the term is the computation time for , it can also be reduced to by using the data structure proposed by Alstrup et al. [2]. Both of these data structures use space.
Next, we show that we can compute in time by using space. The directed acyclic graph (DAG) of represents the derivation tree of , and each node in the DAG represents a distinct variable in . For an edge in the DAG, let the relevant offset between the parent and child be the length of the string generated by , where represents and represents the th edge in . Then the occurrence position of a node in is the sum of relevant offsets in the path starting at the root and ending at the node. Hence, we can compute in time by storing all relevant offsets, because the height of is . Furthermore, the offsets can be stored in space, because we can represent the relevant offsets between and for by using space.
Next, for a node representing , we store the lowest ancestor of such that the in-degree of is at least 2 or is the root in the DAG, i.e., the node representing . We also store the sum of relevant offsets on the path starting at and ending at . By summing up these values, we can reduce to time, and the data structures use space. ∎
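The offset bookkeeping in the proof can be illustrated with a naive traversal of the grammar: the occurrence position of a derivation-tree node is the sum of the lengths of its left siblings (the relevant offsets) along the root-to-node path. The sketch below uses our own dict representation of rules and a plain DFS; the actual index compresses these path sums inside the DAG rather than traversing it:

```python
def expansion_lengths(rules):
    """Return a memoized function giving the length of the string each
    symbol derives (terminal characters have length 1)."""
    memo = {}
    def length(x):
        if isinstance(x, str):
            return 1
        if x not in memo:
            memo[x] = sum(length(y) for y in rules[x])
        return memo[x]
    return length

def occurrence_positions(start, target, rules):
    """All positions where `target` occurs as a derivation-tree node:
    each position is the sum of relevant offsets on the root-to-node path."""
    length = expansion_lengths(rules)
    out = []
    def dfs(x, pos):
        if x == target:
            out.append(pos)
            return
        if isinstance(x, str):
            return
        off = 0
        for y in rules[x]:
            dfs(y, pos + off)
            off += length(y)   # relevant offset of the next child
    dfs(start, 0)
    return sorted(out)
```

For instance, with rules {2: ('a','b'), 3: (2,2), 4: (3,3)} (deriving "abababab"), variable 2 occurs at positions 0, 2, 4, 6.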
We call the signature encoding built on a signature encoding. Our static self-index consists of the TST of and the self-index of Lemma 10 representing . In turn, we construct from , from , and the self-index of Lemma 10 from . The search algorithm is based on Theorem 5: for locate queries, we compute by using and output all occurrences of on by using locate queries on if . Otherwise, we compute by using and output all occurrences of on by using locate queries on . Count queries are performed similarly. Hence, we obtain the following performance for our self-index from Theorem 5 and Lemma 10.
Theorem 11.
For a string and an integer , there exists an index using space and supporting (i) locate queries in time for short patterns, (ii) locate and count queries in time for long patterns, and (iii) extract queries in time, where is the size of the signature encoding of .
We give the following upper bound for .
Theorem 12.
For a signature encoding of , holds.
Proof.
See Section 6. ∎
5.4 Dynamic TST-indexes
In this section, we consider maintaining our self-index for short patterns over a dynamic text . Recall that our index consists of a TST of and an index of . In the dynamic setting, the index is still based on Theorem 5, but we use the following data structure for instead of that given by Lemma 10.
Lemma 13 (Dynamic signature encoding [16]).
With the addition of data structures taking space, can support update operations in time. The modified data structure also supports computing in time for and extract queries in time.
Hence, our dynamic index supports locate queries in time for short patterns. For count queries, we append to as extra information. We also append to .
The remaining problem is how to update these data structures when is changed. The main difficulty is understanding how changes when is changed; once that process is understood, we can update the data structures through each update operation.
Let for a string and two integers and , i.e., . For strings , the following equations hold, where and :