RNATOPS version 1.0

GENERAL INFORMATION
-------------------
RNATOPS is a program that searches for RNA secondary structures based on the
notion of a structure graph to specify the consensus structure of an RNA
family. This program is implemented in C++. It has been compiled and tested on
several systems, including desktop Linux computers, a Linux cluster, and a SUN
workstation running SunOS 5.1.

Authors: Zhibin Huang, Yong Wu, Joseph Robertson
         Department of Computer Science
         University of Georgia
         Athens, GA 30602

Copyright (C) 2008 University of Georgia. All Rights Reserved.

WHAT IS CONTAINED IN THIS PACKAGE
---------------------------------
The software package of RNATOPS and one example are contained in this package.

INSTALLATION
------------
Compile RNATOPS. Type "make" to build an executable program called "rnatops".

USAGE
-----
1. How to run this program

   Usage: rnatops <-tf training_file> <-gf genome_file> [other options]

2. Command line parameter list:

   [-tf]   training_file
   [-gf]   genome_file
   [-pcnt] pseudocount_value [DEFAULT=0.001]
   [-k]    k_value [DEFAULT=10]
   [-th]   threshold_value_for_candidate_hit [DEFAULT=0.0]
   [-no]   number of overlapping nts between two stems when overlap is allowed [DEFAULT=2]
   [-mc]   to merge candidates in preprocessing
   [-ms]   when taking the merge strategy, take the candidate with the (m)ax score
           or the one with the (s)hortest length (m|s) [DEFAULT=s]
   [-ns]   number of shift positions allowed in the merge strategy [DEFAULT=0]
   [-ni]   number of insertions allowed in a null loop [DEFAULT=3]
   [-pcv]  weight for prior base pairing matrix [DEFAULT=0.4]
   [-pf]   prior_file_name (DEFAULT='./base_pair_prior.txt')
   [-pv]   pcoeff value [DEFAULT=2.0]
   [-st]   split threshold [DEFAULT=6]
   [-js]   stepsize in jump strategy [DEFAULT=1]
   [-jt]   score_filtering threshold in jump strategy [DEFAULT=0.0]
   [-r]    to also search reverse complement strand
   [-ps]   to print structure alignment info
   [-d]    to print out the debug info

3. Command line parameter explanation:

   [-tf] training_file
         File of RNA training data in the pasta format, used to produce the RNA
         structure model. An example (ytRNA_NC_001137_train.pasta) is included
         in the "example" directory.

   [-gf] genome_file
         File to be searched for the structure modeled with the training data.
         The current version requires that genome_file contain only ACGT
         nucleotide characters. See the section "PREPARATION OF THE GENOME FILE".

   [-pcnt] pseudocount_value [DEFAULT=0.001]
         The value of the pseudocount. Default value is 0.001.

   [-k] k_value [DEFAULT=10]
         The number of candidates for each stem in the structure to be searched
         in genome_file. Default value is 10.

   [-th] threshold_value_for_candidate_hit [DEFAULT=0.0]
         The threshold value used to decide whether the current structure
         counts as a hit. Default value is 0.0.

   [-no] number of overlapping nts between two stems when overlap is allowed [DEFAULT=2]
         The number of overlapping nucleotides allowed between adjacent
         candidates. Default value is 2.

   [-mc] to merge candidates in preprocessing
         With this option, similar candidates are merged.

   [-ms] when taking merge strategy, take the candidate with (m)ax score or one
         with the (s)hortest length (m|s) [DEFAULT=s]
         When merging candidates with the -mc option, this selects how the
         representative candidate is chosen: the one with the max score or the
         one with the shortest length. m: max score; s: shortest length.
         Default value is s.

   [-ns] number of shift positions allowed in the merge strategy [DEFAULT=0]
         When merging candidates, the candidates within this number of shift
         positions are merged together. Default value is 0.

   [-ni] number of insertions allowed in null loop [DEFAULT=3]
         The number of inserted nucleotides allowed in the null loop model.
         Default value is 3.

   [-pcv] prior weight value [DEFAULT=0.4]
         The weight for the prior base pairing frequency matrix. Default value
         is 0.4.

   [-pf] prior_file (DEFAULT='./base_pair_prior.txt')
         The file containing the prior pair frequency matrix (5x5):
         base_pair_prior.txt.
         (Note: this file can be placed anywhere as long as its absolute path
         is specified. A user-supplied matrix file can be used instead.)

   [-pv] pcoeff value [DEFAULT=2.0]
         The weight of the distance penalty. Default value is 2.0.

   [-st] split threshold [DEFAULT=6]
         The standard deviation value used to group the training sequences.
         Default value is 6.

   [-js] stepsize in jump strategy [DEFAULT=1]
         This program supports the skip-and-jump strategy. The scanning window
         is shifted by 'stepsize' nucleotides to speed up the search. A
         stepsize of 2 or 3 is suggested.

   [-jt] score_filtering threshold in jump strategy [DEFAULT=0.0]
         Default value is 0.0.

   [-r] to search the complementary strand
         This option can be used to search the complementary genome strand.

   [-ps] to print out info of the folded structure and structure alignment
         This option gives more information about each candidate hit the
         program finds. Definitions of folded structure and structure alignment
         are given in "EXPLANATION OF THE RESULTS AND OUTPUT" in this file.

   [-d] print out the debug info
         Use the -d option to debug the program. We suggest you do not debug
         this program until you have a good understanding of the code, because
         it produces a large volume of data, especially for the tree
         decomposition based dynamic programming.

4. The data directory 'example' contains:

   a). one training file for searching tmRNA
       ytRNA_NC_001137_train.pasta
   b). one tmRNA genome file
       NC_001137.fasta
   c). the script for running the program.
       ./rnatops -tf ./example/ytRNA_NC_001137_train.pasta -gf ./example/NC_001137.fasta -mc -js 2 -r -ps

EXPLANATION OF THE RESULTS AND OUTPUT
-------------------------------------
The following information is generated for every structure candidate found
with a score above the threshold: total alignment score; begin and end
positions in the genome; folded structure of the candidate; structural
alignment with the model; and the score for every stem and loop of the
structure.

The following is an example result from running on NC_001137.fasta.

profilescore: 18.1082[86604](86604~86682)
The possible pos range is [86604...86682]
begin position = 86604
end position = 86682
score = 18.1082
1........10........20........30........40........50........60........70........
....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....
Folded structure
GCAACTTGGCCGAGTGGTTAAGGCGAAAGATTAGAAATCTTTTGGGCTTTGCCCGCGCAGGTTCGAGTCCTGCAGTTGT
DDDDDDD..AAA........AAA..BBBBB.......BBBBB..............CCCCC.......CCCC*DDDDDD
Structure alignment
GCAACUUGG-CCGAGUGGUUAAGGCGAAAGAUUAGAAAUCUUUU-GGGCUUUGCCCG-CGCAGGUUCGAGUCCUGCAGUUGU
DDDDDDDmmAAAAmmmmmmmmAAAAmBBBBBmmmmmmmBBBBBmdiiiiiiiiiiimdmCCCCCmmmmmmmCCCC*DDDDDD
Score for Stem D : -0.335287
Score for Loop (D->A): -9.28159
Score for Stem A : -1.3706
Score for Loop (A->A): 8.70134
Score for Loop (A->B): 1.12378
Score for Stem B : 3.51535
Score for Loop (B->B): 2.44458
Score for Loop (B->C): -0.921154
Score for Stem C : 7.24402
Score for Loop (C->C): 6.98771
Score for Loop (C->D): 0

Note: The folded structure is the predicted secondary structure of the RNA
candidate found in the genome, where a base pair is annotated by a pair of
letters (see the referenced pasta paper). The same letter is used for every
base pair in a stem. The structure alignment shows the optimal alignment
between the model and the RNA sequence found in the genome. It contains
additional information on matches, deletions, and insertions computed by the
RNATOPS program.
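
For post-processing, a hit report like the one above can be read with a small
script. The following Python sketch is not part of RNATOPS; the function name
parse_hit and the regular expressions are assumptions based only on the layout
of the example output shown above, and other output variants may need
different patterns.

```python
import re

# Both patterns are guesses derived from the example report in this file,
# not from the RNATOPS source code.
_FIELD = re.compile(r"^(begin position|end position|score)\s*=\s*(\S+)$")
_PART = re.compile(
    r"^Score for (Stem|Loop)\s*\(?([A-Za-z]+(?:->[A-Za-z]+)?)\)?\s*:\s*(\S+)$"
)


def parse_hit(text):
    """Parse one hit report into a dict; lines that match nothing are ignored."""
    hit = {"begin": None, "end": None, "score": None, "parts": {}}
    for raw in text.splitlines():
        line = raw.strip()
        m = _FIELD.match(line)
        if m:
            name, value = m.groups()
            if name == "score":
                hit["score"] = float(value)
            else:
                # "begin position" -> "begin", "end position" -> "end"
                hit[name.split()[0]] = int(value)
            continue
        m = _PART.match(line)
        if m:
            kind, label, value = m.groups()
            # Keys look like "Stem D" or "Loop D->A".
            hit["parts"][kind + " " + label] = float(value)
    return hit
```

In the example report above, the eleven stem and loop scores add up to
18.108149, which agrees with the reported total of 18.1082 to the printed
precision. Whether this holds for every hit is not stated in this file, so a
check such as abs(sum(hit["parts"].values()) - hit["score"]) < 1e-3 should be
treated as a sanity test rather than a guarantee.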

SEARCH WITH FILTER PROFILES
---------------------------
RNATOPS can be used with filters in a two-step process to speed up the search.
In step 1, given a substructure profile as input, RNATOPS is used to search
for the substructure. In step 2, RNATOPS is used again to search for the whole
structure on the genome segments containing the substructure. With RNATOPS
v1.0, however, the connection between steps 1 and 2 must be made manually.

CONTACT INFORMATION
-------------------
For suggestions, questions, requests and bug reports, please contact
Liming Cai at cai@cs.uga.edu.
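
The manual hand-off between steps 1 and 2 described under SEARCH WITH FILTER
PROFILES could be scripted along the following lines. This Python sketch is
not part of RNATOPS. The function names, the flank parameter, and the
treatment of positions as 1-based and inclusive are all assumptions made for
illustration; check the coordinate convention against real output before
relying on it.

```python
def merge_windows(hits, flank, genome_length):
    """Pad each (begin, end) hit by `flank` and merge windows that overlap or touch.

    Positions are assumed to be 1-based and inclusive (an assumption, not
    something this README states).
    """
    windows = sorted(
        (max(1, begin - flank), min(genome_length, end + flank))
        for begin, end in hits
    )
    merged = []
    for start, stop in windows:
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or directly follows the previous window: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


def extract_segments(genome, hits, flank):
    """Return (start, stop, sequence) for each merged window around step 1 hits."""
    return [
        (start, stop, genome[start - 1:stop])
        for start, stop in merge_windows(hits, flank, len(genome))
    ]
```

Each returned sequence can then be written to its own genome file and passed
to a second rnatops run with -gf. The flank should be wide enough for the
whole structure to fit around the substructure hit; a suitable value depends
on the RNA family and is not given here.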