RNATOPS version 1.1

GENERAL INFORMATION
-------------------
RNATOPS is a program that searches for RNA secondary structures based on the
notion of a structure graph to specify the consensus structure of an RNA family.

This program is implemented in C++. It has been compiled and tested on several
systems, including Desktop Linux computers, a Linux cluster, and a SUN
workstation running SunOS 5.1.

Authors: Zhibin Huang, Yong Wu, Joseph Robertson, Yingfeng Wang
         Department of Computer Science
         University of Georgia
         Athens, GA 30602

Copyright (C) 2008 University of Georgia. All Rights Reserved.

IMPROVEMENTS OVER RNATOPS version V1.0
--------------------------------------
Support two kinds of filter-based searches to improve efficiency.

   [-hmmfilter]
       with this option, search produces hits based on an automatically
       selected profile hmm (as filter);

   [-userdefinedfilterfile]
       with this option, search produces hits based on a user defined
       substructure profile (as filter);

the program can be run again using the whole structure profile on these hit
sequences to produce final search results.

WHAT IS CONTAINED IN THIS PACKAGE
---------------------------------
The software package of RNATOPS and one example are contained in this package.

INSTALLATION
------------
Compile RNATOPS. Type "make" to build an executable program called "rnatops".

USAGE
-----
1. How to run this program

   Usage: rnatops <-tf training_file> <-gf genome_file> [other options]

2. Command line parameter list:

   [-tf]                     training_file
   [-gf]                     genome_file
   [-hmmfilter]              search with automatic-hmmfilter
   [-userdefinedfilterfile]  search with user-defined-filter
   [-pcnt]                   pseudocount_value [DEFAULT=0.001]
   [-k]                      k_value [DEFAULT=10]
   [-th]                     threshold_value_for_candidate_hit [DEFAULT=0.0]
   [-no]                     number of overlapping nts between two stems when overlap is allowed [DEFAULT=2]
   [-mc]                     to merge candidates in preprocessing
   [-ms]                     when taking the merge strategy, take the candidate with the (m)ax score or the one with the (s)hortest length (m|s) [DEFAULT=s]
   [-ns]                     number of shift positions allowed in the merge strategy [DEFAULT=0]
   [-ni]                     number of insertions allowed in a null loop [DEFAULT=3]
   [-pcv]                    weight for prior base pairing matrix [DEFAULT=0.4]
   [-pf]                     prior_file_name [DEFAULT='./base_pair_prior.txt']
   [-pv]                     pcoeff value [DEFAULT=2.0]
   [-st]                     split threshold [DEFAULT=6]
   [-js]                     stepsize in jump strategy [DEFAULT=1]
   [-jt]                     score_filtering threshold in jump strategy [DEFAULT=0.0]
   [-r]                      to also search reverse complement strand
   [-ps]                     to print structure alignment info
   [-d]                      to print out the debug info
   [-time]                   to print out the time info
   [-pscore]                 to print score info for stem and loop
   [-dfilter]                to print out the debug info for the filter based search
   [-ddpinput]               to print out the debug info of the input in dp search
   [-ddptree]                to print out the debug info of the tree in dp search
   [-ddpwin]                 to print out the debug info of the window in dp search
   [-ddptd]                  to print out the debug info of the td in dp search
   [-ddpsearch]              to print out the debug info of the dp search

3. Command line parameter explanation:

   [-tf] training_file
       File of RNA training data in the pasta format to produce the RNA
       structure model. An example (ytRNA_NC_001137_train.pasta) is included
       in the directory of "example".

   [-gf] genome_file
       File to be searched for the structure modeled with the training data.
       Current version requires that genome_file only contain ACGT nucleotide
       characters.
       See Section of "PREPARATION OF THE GENOME FILE".

   [-hmmfilter] search with automatic-hmmfilter
       This option runs the fast search based on an automatically selected
       hmm filter.

   [-userdefinedfilterfile] search with user-defined-filter
       This option runs the fast search based on a user-defined filter. If
       this option is used, please refer to "4. How to make
       user-defined-filter file.".

   [-pcnt] pseudocount_value [DEFAULT=0.001]
       The value of pseudocount. Default value is 0.001.

   [-k] k_value [DEFAULT=10]
       The number of candidates for each stem in the structure to be searched
       in genome_file. Default value is 10.

   [-th] threshold_value_for_candidate_hit [DEFAULT=0.0]
       The threshold value used to determine whether the current structure is
       a hit or not. Default value is 0.0.

   [-no] number of overlapping nts between two stems when overlap is allowed [DEFAULT=2]
       The number of overlapping nucleotides allowed between adjacent
       candidates. Default value is 2.

   [-mc] to merge candidates in preprocessing
       If you use this option, the strategy of merging similar candidates
       will be taken.

   [-ms] when taking merge strategy, take the candidate with (m)ax score or one with the (s)hortest length (m|s) [DEFAULT=s]
       When merging candidates with the -mc option, choose which candidate
       represents each merged group: the one with the max score or the one
       with the shortest length. m: max score; s: shortest length. Default
       value is s.

   [-ns] number of shift positions allowed in the merge strategy [DEFAULT=0]
       When merging candidates, the candidates within the number of shift
       positions will be merged together. Default value is 0.

   [-ni] number of insertions allowed in null loop [DEFAULT=3]
       For the null loop model, the number of inserted nucleotides allowed.
       Default value is 3.

   [-pcv] prior weight value [DEFAULT=0.4]
       The weight for prior base pairing frequency matrix. Default value is
       0.4.

   [-pf] prior_file [DEFAULT='./base_pair_prior.txt']
       The file containing the prior pair frequency matrix (5x5). The default
       is base_pair_prior.txt. (Note: this file can be placed anywhere as
       long as its absolute path is specified with this option; users can
       also supply their own matrix file.)

   [-pv] pcoeff value [DEFAULT=2.0]
       The weight of distance penalty. Default value is 2.0.

   [-st] split threshold [DEFAULT=6]
       The standard deviation value to group the training sequences. Default
       value is 6.

   [-js] stepsize in jump strategy [DEFAULT=1]
       This program supports the skip-and-jump strategy. The scanning window
       is shifted by 'stepsize' nucleotides to speed up search. A stepsize of
       2 or 3 is suggested.

   [-jt] score_filtering threshold in jump strategy [DEFAULT=0.0]
       Default value is 0.0.

   [-r] to search the complementary strand
       This option can be used to search the complementary genome strand.

   [-ps] to print out info of the folded structure and structure alignment.
       This option gives more info about each candidate hit the program
       finds. Definitions of folded structure and structure alignment are
       given in "EXPLANATION OF THE RESULTS AND OUTPUT" in this file.

   [-time] to print out the time info
       This option gives the time info about your run of RNATOPS.

   [-pscore] to print score info for stem and loop
       This option gives the score info for every stem and loop.

   [-dfilter]   to print out the debug info for the filter based search
   [-ddpinput]  to print out the debug info of the input in dp search
   [-ddptree]   to print out the debug info of the tree in dp search
   [-ddpwin]    to print out the debug info of the window in dp search
   [-ddptd]     to print out the debug info of the td in dp search
   [-ddpsearch] to print out the debug info of the dp search

   [-d] to print out the debug info
       If you want to debug the program, you can use -d option.
       We suggest you do not debug this program until you have a good
       understanding of the code, because it will produce a large volume of
       data, especially for the tree decomposition based dynamic programming.

4. How to make user-defined-filter file.

   Two kinds of filter file can be made from the original training file.

   4.1) One is the user-defined hmm-filter file. It should be part of the
        training file, except that the first line should be all '.'.

        For example, the original training file is

        wwwww..XXXXX...wwwww....xxxxx
        CGGCG-CCGGGGUUCGAGGAGAUUCCCGG
        CUCGU-CGAGGGUUCUUGGAGAUUCCCUC
        CUCGU-CGAGGGUUCUUGGAGAAUCCCUC
        CAUGU-CGAGGGUUCUUGGAAAUUCCCUC
        UUACU-CGAGAGUUCAAGGAGAAUCUCUC
        UUACU-CGAGAGUUCAAGGAGACUCUCUC
        UUACU-CGAGAGUUCAAGGAGAGUCUCUC
        UUACU-CGAGAGUUCAAGGAGAGUCUCUC
        UUUGUCCGAGAGUUCAAGGAGAGUCUCUC
        GUGUC-CGCGGGUUCCUGGAAAGUCCCGC
        GACCU-CGGCGGUUCUUGGAGACUCCGCC
        UAUUU-CGUGGGUUCUUGGAGACUCCCAC

        and you want to make your own hmm-filter file from the seventh column
        to the last column, then the hmm-filter file should be

        .......................
        CCGGGGUUCGAGGAGAUUCCCGG
        CGAGGGUUCUUGGAGAUUCCCUC
        CGAGGGUUCUUGGAGAAUCCCUC
        CGAGGGUUCUUGGAAAUUCCCUC
        CGAGAGUUCAAGGAGAAUCUCUC
        CGAGAGUUCAAGGAGACUCUCUC
        CGAGAGUUCAAGGAGAGUCUCUC
        CGAGAGUUCAAGGAGAGUCUCUC
        CGAGAGUUCAAGGAGAGUCUCUC
        CGCGGGUUCCUGGAAAGUCCCGC
        CGGCGGUUCUUGGAGACUCCGCC
        CGUGGGUUCUUGGAGACUCCCAC

   4.2) The other is the user-defined substructure-filter file. It should
        also be part of the training file, and its first line should contain
        a complete substructure.

        For example, the original training file is

        wwwww..XXXXX...wwwww....xxxxx
        CGGCG-CCGGGGUUCGAGGAGAUUCCCGG
        CUCGU-CGAGGGUUCUUGGAGAUUCCCUC
        CUCGU-CGAGGGUUCUUGGAGAAUCCCUC
        CAUGU-CGAGGGUUCUUGGAAAUUCCCUC
        UUACU-CGAGAGUUCAAGGAGAAUCUCUC
        UUACU-CGAGAGUUCAAGGAGACUCUCUC
        UUACU-CGAGAGUUCAAGGAGAGUCUCUC
        UUACU-CGAGAGUUCAAGGAGAGUCUCUC
        UUUGUCCGAGAGUUCAAGGAGAGUCUCUC
        GUGUC-CGCGGGUUCCUGGAAAGUCCCGC
        GACCU-CGGCGGUUCUUGGAGACUCCGCC
        UAUUU-CGUGGGUUCUUGGAGACUCCCAC

        and you want to make your own substructure-filter file from stem X-x
        because stem X-x has well conserved base pairs, then the
        substructure-filter file should be

        XXXXX............xxxxx
        CGGGGUUCGAGGAGAUUCCCGG
        GAGGGUUCUUGGAGAUUCCCUC
        GAGGGUUCUUGGAGAAUCCCUC
        GAGGGUUCUUGGAAAUUCCCUC
        GAGAGUUCAAGGAGAAUCUCUC
        GAGAGUUCAAGGAGACUCUCUC
        GAGAGUUCAAGGAGAGUCUCUC
        GAGAGUUCAAGGAGAGUCUCUC
        GAGAGUUCAAGGAGAGUCUCUC
        GCGGGUUCCUGGAAAGUCCCGC
        GGCGGUUCUUGGAGACUCCGCC
        GUGGGUUCUUGGAGACUCCCAC

        instead of

        XXXXX...wwwww....xxxxx
        CGGGGUUCGAGGAGAUUCCCGG
        GAGGGUUCUUGGAGAUUCCCUC
        GAGGGUUCUUGGAGAAUCCCUC
        GAGGGUUCUUGGAAAUUCCCUC
        GAGAGUUCAAGGAGAAUCUCUC
        GAGAGUUCAAGGAGACUCUCUC
        GAGAGUUCAAGGAGAGUCUCUC
        GAGAGUUCAAGGAGAGUCUCUC
        GAGAGUUCAAGGAGAGUCUCUC
        GCGGGUUCCUGGAAAGUCCCGC
        GGCGGUUCUUGGAGACUCCGCC
        GUGGGUUCUUGGAGACUCCCAC

5. How to use RNATOPS to do the search.

   Generally, users can do three kinds of search: automatic hmm-filter-based
   search, user-defined-filter-based search, and whole-structure search.

   We strongly recommend that users do the search in the following way.
   First, search with the automatic hmm-filter, and save the sequence
   segments flanking the hits. Second, search with the whole structure
   profile on the results from the first step.

   The data directory 'example' contains one sample using the hmm-filter and
   whole-structure search. The script for running the program is:

   a) -hmmfilter

      ./rnatops -tf ./example/tmRNA_43_training_24.txt -gf ./example/RC_NC_008787.fasta -hmmfilter -time -ps >./example/hmm_search/RC_NC_008787_hmmfilter_result.txt
      ./rnatops -tf ./example/tmRNA_43_training_24.txt -gf ./example/hmm_search/RC_NC_008787_hmmfilter_result.txt -k 10 -mc -ps -time >./example/wholestructure_search/RC_NC_008787_wholestructureresult_based_on_hmm_result.txt

   b) -userdefinedfilterfile

      ./rnatops -tf ./example/tmRNA_43_training_24.txt -gf ./example/RC_NC_008787.fasta -userdefinedfilterfile ./example/substructure_search/tmRNA_substructure_train_24.txt -mc -ps -time >./example/substructure_search/RC_NC_008787_substructurefilter_result.txt
      ./rnatops -tf ./example/tmRNA_43_training_24.txt -gf ./example/substructure_search/RC_NC_008787_substructurefilter_result_short.txt -k 10 -mc -ps -time >./example/wholestructure_search/RC_NC_008787_wholestructureresult_based_on_substructure_result_short.txt

EXPLANATION OF THE RESULTS AND OUTPUT
-------------------------------------
1. Result of hmm-filter-based search

   The following information is generated for every hmm candidate found with
   a score above the threshold:

      genome name;
      begin and end positions of the candidate found by the hmm search;
      begin and end positions of the segment flanking the hmm search candidate;
      genome segment flanking the hmm search candidate;
      hmm score;
      sequence alignment for this candidate.

   The last line reports the time spent in this hmm-filter-based search.

   >RC_NC_008787(338253-338294)[337847-338294]
   #Frequency of ACGT in RC_NC_008787 (0.345628, 0.152393, 0.153811, 0.348168)
   AAATATTGTCGTTCATATGAAAATGATGATTCTGTTTTAGCGGTTTTTAATCCTTTAATT
   GTTAATGTTCCAAATATTATTTCTAAATAAGGGAGCGACTTGGCTTCGACAGGAGTAAGT
   CTGCTTAGATGGCATGTCGCTTTGGGCAAAGCGTAAAAAGCCCAAATAAAATTAAACGCA
   AACAACGTTAAATTCGCTCCTGCTTACGCTAAAGCTGCGTAAGTTCAGTTGAGCCTGAAA
   TTTAAGTCATACTATCTAGCTTAATTTTCGGTCATTTTTGATAGTGTAGCCTTGCGTTTG
   ACAAGCGTTGAGGTGAAATAAAGTCTTAGCCTTGCTTTTGAGTTTTGGAAGATGAGCGAA
   GTAGGGTGAAGTAGTCATCTTTGCTAAGCATGTAGAGGTCTTTGTGGGATTATTTTTGGA
   CAGGGGTTCGATTCCCCTCGCTTCCAC
   #HMMScore=28.8389
   #Structure alignment info
   #GGAUUAUUUUUGGACAGGGGUUCGAUUCCCCUCGCUUCCACC
   #mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
   ...
   #Time for total filter search 0.00386944 hours

2. Result of the whole-structure-based search

   The following information is generated for every structure candidate found
   with a score above the threshold:

      total alignment score;
      begin and end positions in the genome;
      folded structure of the candidate;
      structural alignment with the model.

   The following is an example of the result of a run on RC_NC_008787.fasta.
training file name=tmRNA_43_training_24.txt HMM [Y/N] = N genome file name=RC_NC_008787_hmmfilter_result_01.txt -------------Parameter settings------------- pseudocount=0.001 k=10 Threshold=0 Number of overlap between stems=2 Take the merge-candidate strategy in preprocessing=Yes About the merge strategy, take the candidate with the (s)hortest length=Yes iShiftNumMergeCand=No iAllowedNullLoopInsNum=3 pcoeff=2 iJumpStrategy=Yes iStepSize=1 iScoreThresholdInJump=0 Print structure alignment option=Yes reverse complement search=No Debug option=No Print the debug info of input-checking in dp search=No Print the debug info of tree in dp search=No Print the debug info of window size in dp search=No Print the debug info of td in dp search=No Print the debug info of the dp search=No -------------------------------------------- --------------- Hit 1 RC_NC_008787 hit positions = (337938-338291) hit score = 17.3102 hit sequence GGAGCGACTTGGCTTCGACAGGAGTAAGTCTGCTTAGATGGCATGTCGCTTTGGGCAAAGCGTAAAAAGCCCAAATAAAATTAAACGCAAACAACGTTAAATTCGCTCCTGCTTACGCTAAAGCTGCGTAAGTTCAGTTGAGCCTGAAATTTAAGTCATACTATCTAGCTTAATTTTCGGTCATTTTTGATAGTGTAGCCTTGCGTTTGACAAGCGTTGAGGTGAAATAAAGTCTTAGCCTTGCTTTTGAGTTTTGGAAGATGAGCGAAGTAGGGTGAAGTAGTCATCTTTGCTAAGCATGTAGAGGTCTTTGTGGGATTATTTTTGGACAGGGGTTCGATTCCCCTCGCTTCC 1........10........20........30........40........50........60........70........80........90........100.......110.......120.......130.......140.......150.......160.......170.......180.......190.......200.......210.......220.......230.......240.......250.......260.......270.......280.......290.......300.......310.......320.......330.......340.......350.. ....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|.... 
Folded structure
GGAGCGACTTGGCTTCGACAGGAGTAAGTCTGCTTAGATGGCATGTCGCTTTGGGCAAAGCGTAAAAAGCCCAAATAAAATTAAACGCAAACAACGTTAAATTCGCTCCTGCTTACGCTAAAGCTGCGTAAGTTCAGTTGAGCCTGAAATTTAAGTCATACTATCTAGCTTAATTTTCGGTCATTTTTGATAGTGTAGCCTTGCGTTTGACAAGCGTTGAGGTGAAATAAAGTCTTAGCCTTGCTTTTGAGTTTTGGAAGATGAGCGAAGTAGGGTGAAGTAGTCATCTTTGCTAAGCATGTAGAGGTCTTTGTGGGATTATTTTTGGACAGGGGTTCGATTCCCCTCGCTTCC
YYYYYY..........WWWWWWWWWW.....VVVVVUUUUTTTTTTAAA**GGGGG.AAAAA......GGGGGGG.........................................IIIIIIHHHH........HHH*IIII*MMMMMM.LLLL*JJJ.BBBBBBJJJ*LLLLMMMMMMM........BBBBBB..OOOOOOOO**CCCC...OOOO.OOOOOO...CCCCCC....QQQQQQQQQQ.......P**DDDDD...P**QQQQQQQQ.........DDDiDDDD...TTTTTT...UUUU....VVVVV.WWWWWWWWWW.XXXXX.......XXXXXYYYY.YY
Structure alignment
GG-AGCGACUUGGCUUCGACAGGAGUAAGUCUGCUUAGAUGGCAUGUCGCUUUGGGCAAAGCGUAAAAAGCCCAAAUAA-AAUUAAACGCAAACAACGUUAAAU--UCGCUCCUGCUUACGCUAAAGCUGCGUAAGU-UCAGUUGAGCCUGAAAUUUAAGUCAUACUAUCUAGCUUAAUUUUCGGUCAUUUUUGAUAGU--GUAGCCUUGCGUUUGACAAGCGUUGAGGUG---AA--AUAAAGUCUUAGCCUUGCUUUUGAGUUUUGGAAGAUGAGCGAAGUAGGGUGAAGU--AGUCAUCUUUGCUAAGCAUGUAGAGGUCUUUGUGGGA----UUAUUUUUGGACAGGGGUUCGAUUCCCCUCGCUUCC
YYYYYYYmmmmmmmmmmWWWWWWWWWWmmmmmVVVVVUUUUTTTTTTAAA**GGGGGmAAAAAmmmmmiGGGGGGGmmmdmmmmmmmmmmmmmmmmmmmmmmmmddiiiiiiiiiiiimmIIIIIIHHHHmmmmmmmdmHHH*IIII*MMMMMMiLLLL*JJJmBBBBBBJJJ*LLLLMMMMMMMmmmmmmmmBBBBBBddiiOOOOOOOO**CCCC...OOOOrOOOOOOdddmmddrCCCCCCmmiiQQQQQQQQQQiiiiiimP**DDDDD...P**QQQQQQQQmmmmiddmmmmDDDiDDDDmmmTTTTTTmimUUUUiiiiVVVVVddddmWWWWWWWWWWmXXXXXmmmmmmmXXXXXYYYYYYY
--------------------------------------------------
Time for the whole search process 0.240542 hours

Note: The folded structure is the predicted secondary structure of the found
RNA candidate in the genome, where a base pair is annotated by a pair of
letters. (See referenced pasta paper). The structure alignment shows the
optimal alignment between the model and the RNA sequence found on the genome.
It contains additional information of matches, deletions, and insertions
computed by the RNATOPS program.
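
The output described above can be post-processed with a short script. The
following Python sketch is not part of the RNATOPS package; it is a minimal
illustration under stated assumptions. The function names
(parse_filter_header, summarize_annotation) are hypothetical. The header
layout ">name(hit_begin-hit_end)[segment_begin-segment_end]" is inferred only
from the sample hmm-filter result shown in this file, and treating every
uppercase letter in an annotation line as a stem position is likewise an
assumption based on the sample; other symbols are counted as they appear,
without interpretation.

```python
import re
from collections import Counter

# Header layout inferred from the sample hmm-filter output in this README:
#   >NAME(hit_begin-hit_end)[segment_begin-segment_end]
HEADER_RE = re.compile(r"^>(\S+?)\((\d+)-(\d+)\)\[(\d+)-(\d+)\]$")


def parse_filter_header(line):
    """Split one hmm-filter header line into its name and positions."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("not a recognized filter header: %r" % line)
    hit_begin, hit_end, seg_begin, seg_end = (int(g) for g in match.groups()[1:])
    return {
        "name": match.group(1),
        "hit_begin": hit_begin,
        "hit_end": hit_end,
        "segment_begin": seg_begin,
        "segment_end": seg_end,
    }


def summarize_annotation(annotation):
    """Count annotation symbols; uppercase letters are pooled as 'stem'.

    Assumption: uppercase letters mark stem (base-paired) positions.
    All other symbols are counted under their own character.
    """
    counts = Counter()
    for symbol in annotation:
        if symbol.isupper():
            counts["stem"] += 1
        else:
            counts[symbol] += 1
    return dict(counts)


if __name__ == "__main__":
    header = parse_filter_header(">RC_NC_008787(338253-338294)[337847-338294]")
    # Inclusive length of the hit reported by the filter.
    print(header["name"], header["hit_end"] - header["hit_begin"] + 1)
    print(summarize_annotation("YYmmWWiid.*"))
```

For the sample header, the inclusive hit length computed this way is 42,
which agrees with the 42-nucleotide alignment printed for that candidate.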

CONTACT INFORMATION
-------------------
For suggestions, questions, requests, and bug reports, please contact
Liming Cai at cai@cs.uga.edu.