StatRepeats Documentation – Laboratory of Computational Genomics

StatRepeats is a tool for finding maximal repeats in an input nucleotide or protein sequence. It can extract four types of repeats: direct non-complementary, direct complementary, inverse non-complementary and inverse complementary. It can either extract all repeats or use statistical estimation to extract only the statistically significant ones. It can also filter the set of found repeats by a specified p-value.

Usage

min length Specifies the minimal length of the extracted repeats. If the specified length is too low for statistical filtering to work correctly, the program will suggest a minimal length. If zero is specified, StatRepeats automatically chooses the lowest value for which the probability estimation used for statistical filtering is valid.

General output format:

[<prefix>],<LP start position>, <LP end position>, <RP start position>, <RP end position>, <sequence length>, <LP>, <RP>
where LP and RP are the left and right parts of the repeat. The prefix is taken from the FASTA header of the input file. If the input has no FASTA header, the prefix is omitted. Both the start and the end position of each part are included to make further processing more efficient.
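As an illustration of downstream processing, the following minimal Python sketch parses one output line into named fields. It is not part of StatRepeats. The names `Repeat` and `parse_repeat_line` are hypothetical, and the sketch assumes that the square brackets around the prefix only mark it as optional, that fields are separated by commas with optional spaces, and that positions and length are integers. It splits from the right so that a prefix containing commas stays intact.

```python
from typing import NamedTuple, Optional


class Repeat(NamedTuple):
    """One parsed output line (field names are illustrative)."""
    prefix: Optional[str]
    lp_start: int
    lp_end: int
    rp_start: int
    rp_end: int
    length: int
    lp: str
    rp: str


def parse_repeat_line(line: str) -> Repeat:
    # Split from the right: the last seven fields are fixed, so anything
    # left over at the front is the (optional) prefix.
    parts = [p.strip() for p in line.strip().rsplit(",", 7)]
    if len(parts) == 8:
        prefix = parts[0] or None  # empty prefix is treated as absent
        fields = parts[1:]
    elif len(parts) == 7:
        prefix = None
        fields = parts
    else:
        raise ValueError(f"expected 7 or 8 fields, got {len(parts)}")

    lp_start, lp_end, rp_start, rp_end, length = (int(x) for x in fields[:5])
    return Repeat(prefix, lp_start, lp_end, rp_start, rp_end,
                  length, fields[5], fields[6])


if __name__ == "__main__":
    # Made-up example line, not real StatRepeats output.
    print(parse_repeat_line("seq1,10, 14, 30, 34, 5, ACGTA, ACGTA"))
```

Because both end positions are already present in each line, a consumer of these records does not need to recompute them from the start position and the length.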

Repeat types options:

Direct non-complementary repeats Search for direct non-complementary repeats
Direct complementary repeats Search for direct complementary repeats
Inverse non-complementary repeats Search for inverse non-complementary repeats
Inverse complementary repeats Search for inverse complementary repeats
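
To make the four types concrete, the minimal Python sketch below computes the right part that would pair with a given left part under each type. It is not taken from StatRepeats. It assumes the common convention that "inverse" means the right part is reversed and "complementary" means each nucleotide is replaced by its Watson-Crick complement. The function name `expected_right_part` is hypothetical, and the complement table covers only A, C, G and T, so the complementary types here apply to nucleotide sequences only.

```python
# Complement table for DNA; other characters are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def expected_right_part(left: str, inverse: bool, complementary: bool) -> str:
    """Return the right part matching `left` under the assumed convention."""
    s = left.translate(_COMPLEMENT) if complementary else left
    return s[::-1] if inverse else s


if __name__ == "__main__":
    left = "AACG"
    for inverse in (False, True):
        for complementary in (False, True):
            kind = ("inverse" if inverse else "direct") + " " + (
                "complementary" if complementary else "non-complementary")
            print(f"{kind}: {left} -> "
                  f"{expected_right_part(left, inverse, complementary)}")
```

Under this convention, an inverse complementary repeat is what is usually called a reverse-complement match.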

Probability estimation

StatRepeats uses probability estimation, as explained in the paper, to decide whether a result is statistically significant. The following options can be specified:
p-value Specifies the p-value threshold, a number between 0 and 1. The default is 0.05.
Without probability estimation Extract all repeat sequences without probability estimation.