Suffix Trees


Free PDF

A new novel from appropriation novelist Todd Van Buskirk. Van Buskirk uses the same approach as his novel “On the Exploration of Interrupts” by using a website called “SCIgen—The Automatic CS Paper Generator” to generate text for a whole novel. SCIgen is a program that generates random Computer Science research papers, including graphs, figures, and citations. It uses a hand-written context-free grammar to form all elements of the papers. The aim here is to maximize amusement, rather than coherence.

The plot of Suffix Trees:

In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations.

The suffix tree for a string is a tree whose edges are labeled with strings, such that each suffix of S corresponds to exactly one path from the tree’s root to a leaf. It is thus a radix tree (more specifically, a Patricia tree) for the suffixes of .

Constructing such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string’s suffix tree typically requires significantly more space than storing the string itself.

The concept was first introduced as a position tree by Weiner in 1973,[1] which Donald Knuth subsequently characterized as “Algorithm of the Year 1973”. The construction was greatly simplified by McCreight in 1976 ,[2] and also by Ukkonen in 1995.[3][4] Ukkonen provided the first online-construction of suffix trees, now known as Ukkonen’s algorithm, with running time that matched the then fastest algorithms. These algorithms are all linear-time for constant-size alphabet, and have worst-case running time in general.

In 1997, Martin Farach[5] gave the first suffix tree construction algorithm that is optimal for all alphabets. In particular, this is the first linear-time algorithm for strings drawn from an alphabet of integers in a polynomial range. This latter algorithm has become the basis for new algorithms for constructing both suffix trees and suffix arrays, for example, in external memory, compressed, succinct, etc.

Suffix trees can be used to solve a large number of string problems that occur in text-editing, free-text search, computational biology and other application areas.[8] Primary applications include:[8]

String search, in O(m) complexity, where m is the length of the sub-string (but with initial O(n) time required to build the suffix tree for the string)
Finding the longest repeated substring
Finding the longest common substring
Finding the longest palindrome in a string S.

Suffix trees are often used in bioinformatics applications, searching for patterns in DNA or protein sequences (which can be viewed as long strings of characters). The ability to search efficiently with mismatches might be considered their greatest strength. Suffix trees are also used in data compression; they can be used to find repeated data, and can be used for the sorting stage of the Burrows–Wheeler transform. Variants of the LZW compression schemes use suffix trees (LZSS). A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some search engines (first introduced in [9]).


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s