RNApasta Help, 8 April 2010
RNApasta is a Java application to calculate a variety of useful statistics related to RNA stem-loop and pseudoknot structures. It will also perform functions related to alignment editing, primarily the generation of subsets of data where the original data set is heterogeneous with respect to some structural feature.
The input data may be in a "pasta" formatted file, a variant of the fasta format; alternatively, the program will also accept Stockholm formatted files as downloaded from the Rfam database.
The "pasta" format is defined below, along with more program details and notes.
Setup & Start
Unpacking, Installing
RNApasta is available compiled for Sun Java versions 1.6 and 1.5. The source files are also available in case you wish to compile it yourself.
RNApasta-jar-1.6.zip Version compiled for Sun Java 1.6
RNApasta-jar-1.5.zip Version compiled for Sun Java 1.5
RNApasta-mac-java1.5.zip Version compiled for Sun Java 1.5 and bundled for MacOSX
RNApasta.src.zip Java source files
Unzip the program version of your choice and place it in an appropriate directory. If you wish to create a program startup shortcut (Windows or Linux), it is best to point it at the jar-start.bat file (Windows) or jar-start.sh file (Linux), as described below. Users of the MacOS bundle may wish to simply drag it to the Applications folder.
Starting The Program
To start the program from the RNApasta.jar file, use:
jar-start.bat for Windows
jar-start.sh for Linux
The bundle specifically for MacOS can be started by double clicking on it. MacOS users can also use the jar-1.5 version by opening a command window and running jar-start.sh within the directory that contains the file.
Simply double clicking on RNApasta.jar may start the program, but with larger data files you need to reserve more memory, which is accomplished by the jar-start files.
The data directory or folder contains some sample files.
Command Line Options
jar-start Starts the program with a graphical interface.
jar-start infile Starts the program with a graphical interface and begins loading the infile.
jar-start infile outfile No graphical interface. The infile is interpreted and the results saved to outfile.
jar-start infile outfile integer No graphical interface. The infile is interpreted and then processed according to the integer (see below) with results saved to outfile.

integer = 0 All of analyses 1-10
integer = 1 base frequencies
integer = 2 basepair frequencies by position
integer = 3 basepair frequencies by region
integer = 4 basepair frequencies by stem position
integer = 5 entropy by position
integer = 6 stack doublets
integer = 7 pseudofrequencies
integer = 8 frequencies of regions
integer = 9 stem summary
integer = 10 loop summary
integer = 11 generates arc diagram
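For batch runs, the command-line forms above can be scripted. The following Python sketch only assembles the argument list for a run without the graphical interface. The launcher path ./jar-start.sh, the file names, and the helper name build_command are assumptions made for illustration; they are not part of RNApasta.

```python
# Analysis codes as listed in this help page.
ANALYSES = {
    0: "all of analyses 1-10",
    1: "base frequencies",
    2: "basepair frequencies by position",
    3: "basepair frequencies by region",
    4: "basepair frequencies by stem position",
    5: "entropy by position",
    6: "stack doublets",
    7: "pseudofrequencies",
    8: "frequencies of regions",
    9: "stem summary",
    10: "loop summary",
    11: "arc diagram",
}


def build_command(infile, outfile, analysis=None, launcher="./jar-start.sh"):
    """Return the argument list for a run without the graphical interface.

    `launcher` is a hypothetical path; on Windows it would be jar-start.bat.
    """
    command = [launcher, infile, outfile]
    if analysis is not None:
        if analysis not in ANALYSES:
            raise ValueError(f"unknown analysis code: {analysis}")
        command.append(str(analysis))
    return command


if __name__ == "__main__":
    # Entropy by position (code 5) on a hypothetical input file.
    print(build_command("data/example.sto", "entropy.txt", 5))
```

The returned list can be handed to subprocess.run from the directory that contains the jar-start file.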
Program Interface & Data Loading
Interface Overview
Files are loaded and displayed in the upper textbox, with results of analyses displayed in a series of tabbed panes in the lower textbox. One of the lower tabbed panes is "Interpreted" to indicate how RNApasta interpreted the input data. This is particularly useful when the input data is in the Stockholm (Rfam) format, and you wish to see how the program converted this into stem regions.
The information in any of the textboxes may be saved in a file.
The various analytical functions may be started either by pushing a button, or by selecting the same item in one of the menus.
The darker gray status pane at the bottom of the program screen has a larger left side status box that will display the name of the file most recently opened into the upper, input, text box.
The smaller right side box displays the position of the mouse selection. If you mouse click on some text and select (highlight) it, moving left to right, the right side status box will display the length, beginning and ending of the selection.
A right mouse click allows you to copy selected text to the system clipboard.
Top Row Buttons
Load File This loads a Stockholm or Pasta formatted file into the upper text box, and then interprets it, with output in one tab of the lower text box. The Stockholm format specifies individual basepairs, while the pasta format specifies pairing regions or stems. Hence it is useful to examine the results of the conversion in the "Interpreted" tab.
For Stockholm files, the user has a choice of permitting double-sided bulges within a stem, or not. Not allowing double-sided bulges will result in more separate stems, while allowing them will result in fewer separate stems.
 ...AAAA..AA...aa...aaaa..
 ...AAAA..BB...bb...aaaa..
The first line would result from allowing double-sided bulges, while the second would result from not allowing them.
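The effect of this choice can be illustrated with a small Python sketch. It is not RNApasta's own conversion code: it handles only the < and > characters, labels stems A, B, C in order of their 5' start, ignores the index line, and assumes that single-sided bulges never split a stem. The function names are hypothetical.

```python
def bracket_pairs(structure):
    """Return sorted (i, j) column pairs from a '<' / '>' structure line."""
    stack, pairs = [], []
    for col, ch in enumerate(structure):
        if ch == "<":
            stack.append(col)
        elif ch == ">":
            pairs.append((stack.pop(), col))
    return sorted(pairs)


def group_stems(pairs, allow_double_bulge):
    """Group base pairs into stems, outermost pair first."""
    paired = {col for pair in pairs for col in pair}
    stems = []
    for i, j in pairs:
        if stems:
            pi, pj = stems[-1][-1]
            nested = pi < i and j < pj
            # No other paired column may lie between the two pairs.
            clean = (not any(c in paired for c in range(pi + 1, i))
                     and not any(c in paired for c in range(j + 1, pj)))
            # Unpaired columns on both strands make a double-sided bulge.
            double_sided = i - pi > 1 and pj - j > 1
            if nested and clean and (allow_double_bulge or not double_sided):
                stems[-1].append((i, j))
                continue
        stems.append([(i, j)])
    return stems


def pasta_line(structure, allow_double_bulge):
    """Render the stems as upper/lower case letters, '.' elsewhere."""
    line = ["."] * len(structure)
    stems = group_stems(bracket_pairs(structure), allow_double_bulge)
    for n, stem in enumerate(stems):
        letter = chr(ord("A") + n)  # sketch: at most 26 stems
        for i, j in stem:
            line[i], line[j] = letter, letter.lower()
    return "".join(line)
```

For the structure line ...<<<<..<<...>>...>>>>.. this sketch yields the first line shown above when allow_double_bulge is True and the second when it is False.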
Save Interpreted This saves the interpreted output to a pasta formatted file.
Clear Bottom Analysis This clears all the bottom output panes.
Push Analysis Up Some of the functions create subsets of the original data in one of the output tabs. This button pushes the subset data to the upper (input) textbox, so that analysis of the subset can be done.
Options This controls various program options, such as the number of decimal points to display.
Help Program information and help.
Find & Copy
Find Find First and Find Next commands can be initiated from the edit menu to search for text in either the input or output text boxes.
Right Mouse Find/Copy Select some text within the "Interpreted" tab, then right mouse click to either copy it or search for it.
Highlighting Pairs
Stem Regions Right mouse click within the "Interpreted" output tab. You can then highlight all stems with different colors to visualize the pairing regions. This can also be done with the highlight button within the Pasta Edit function group.
Individual Basepairs Right mouse click within the "Interpreted" output tab, next to a base that belongs to a stem. This column of bases and the column to which it basepairs will be highlighted.
Analytical & Editing Functions
Basic Tools
Interpret Input This interprets whatever is in the upper (input) textbox and puts the results in a lower tabbed textbox labeled "Interpreted". This is useful if you have changed the input information in the input textbox, or if you have pushed up a subset created by an analysis.
Draw Arcs This generates an arc diagram of the secondary structure which can be saved in an image file.
Pasta Edit
Remove Partition Tag This function removes tags placed in the comment line for a sequence as a result of a partition analysis (described below).
Highlight Stems In the "Interpreted" tab, this places a color highlight on both halves of the stems. The same can be done by a right mouse click within this output tab.
Remove All Highlights This removes these highlights.
Analysis Tools
Base Freq This calculates the frequency of each base as a function of position in the alignment and overall.
BasePair Freq Pos This calculates the frequency of each possible basepair as a function of position in the alignment. The pairing used is that indicated by the pairing indicator line. If the pairing line indicates that position 2 pairs with position 670, then this function will calculate the observed base-pair frequencies for these positions, omitting the alignment induced -:- or gap:gap pairs. This function will calculate the frequencies for both ends of the pair, hence for a position 2:670 it will report a certain GC frequency, and for position 670:2 it will report the same frequency as CG.
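As an illustration of the BasePair Freq Pos calculation, the Python sketch below counts the base pairs observed at two paired columns and omits gap:gap pairs. It is a minimal sketch, not the program's code: columns are 0-based here, pairs with a gap on only one side are still counted (an assumption), and the function name is hypothetical.

```python
from collections import Counter


def basepair_frequencies(sequences, i, j):
    """Frequencies of the pairs seen at columns i and j, skipping gap:gap."""
    counts = Counter()
    for seq in sequences:
        a, b = seq[i], seq[j]
        if a == "-" and b == "-":
            continue  # alignment-induced gap:gap pair is omitted
        counts[a + b] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    alignment = ["GAAC", "GAAC", "CAAG", "-AA-"]
    print(basepair_frequencies(alignment, 0, 3))  # reported for position 0:3
    print(basepair_frequencies(alignment, 3, 0))  # same data seen from 3:0
```

Calling the function with the columns swapped gives the same frequencies with each pair reversed, which mirrors the GC and CG reporting described above.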
BasePair Freq Reg This reports the frequency of each possible basepair as a function of regions. First, overall frequencies of base-pairs are reported for the whole alignment, the pseudoknots, and non-pseudoknot stem-loops. A crossing helix is the stem of a pseudoknot which the program detects as actually crossing another stem, and is indicated by an XXX in one of the comment lines. Second, the base-pair frequencies are given for each of the labelled regions of the alignment - A, B, C .. etc.
BasePair Freq Stem Pos For each region, this calculates the frequency of each base-pair as a function of position within a stem, with positions reported as outer, middle, or inner. If there are 4 base-pairs in a stem, the middle 2 would get averaged. The overall base-pair frequency by stem position over all positions is also calculated.
Entropy by Pos This computes relative and absolute entropy by alignment position, with and without gaps. Entropy is H(X) = - Sum over i [ P(Xi) log P(Xi) ], and relative entropy is H(P||Q) = Sum over i [ P(Xi) log( P(Xi)/Q(Xi) ) ]; see Durbin et al., pages 305-308. Natural logarithms are used.
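The two entropy formulas can be written out for a single alignment column as follows. This Python sketch uses natural logarithms as stated above; the background distribution Q is passed in by the caller because this help page does not say which Q the program uses, so the uniform background in the example is an assumption. For simplicity the relative entropy function always ignores gaps.

```python
import math
from collections import Counter


def column_entropy(column, include_gaps=False):
    """H(X) = -sum P(x) ln P(x) over the symbols in one column."""
    symbols = [c for c in column if include_gaps or c != "-"]
    total = len(symbols)
    if total == 0:
        return 0.0
    counts = Counter(symbols)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def relative_entropy(column, background):
    """H(P||Q) = sum P(x) ln(P(x)/Q(x)); gaps are ignored in this sketch."""
    symbols = [c for c in column if c != "-"]
    total = len(symbols)
    if total == 0:
        return 0.0
    counts = Counter(symbols)
    return sum((n / total) * math.log((n / total) / background[s])
               for s, n in counts.items())


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}  # assumed Q
    column = "AAAC-AAA"
    print(column_entropy(column))
    print(column_entropy(column, include_gaps=True))
    print(relative_entropy(column, uniform))
```

A fully conserved column has entropy 0 and, against a uniform background over four bases, relative entropy ln 4.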
Stack Doublets This calculates the frequency with which basepairs follow each other in a stem, such as how often does a GC follow a CG.
PseudoFreqs This will recalculate the base and base-pair frequencies using pseudofrequencies. The user can choose which set of pseudofrequencies to use, or define their own, and then which method to use to average the pseudofrequencies with the measured frequencies. Zero-offset adds 1 to any 0 count; Fifty weights the pseudofrequencies as if they were from 50 sequences; Square-Root uses the square root of the number of sequences in the alignment as the weight, and Minimal Risk uses a modification of Square-Root developed by Wu et al., 1999. J. Comp. Bio. 6: 219-235. Wu et al. discusses each of these methods.
Region Freq This measures the frequency with which each pairing region appears in the sequence alignment, by not counting sequences which have no bases in that region of the alignment.
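To make the weighting schemes concrete, the Python sketch below blends observed counts with a set of pseudofrequencies. The blending formula (count + weight x pseudofrequency) / (total + weight) is a common pseudocount form and is an assumption here; this help page does not give the exact formula RNApasta applies, and Minimal Risk is not covered. The function name and the uniform pseudofrequencies are hypothetical.

```python
import math


def blend_frequencies(counts, pseudo, num_sequences, method):
    """Combine observed counts with pseudofrequencies (sketch only)."""
    if method == "zero-offset":
        # Add 1 to any count of 0, then renormalize.
        adjusted = {k: (n if n > 0 else 1) for k, n in counts.items()}
        total = sum(adjusted.values())
        return {k: n / total for k, n in adjusted.items()}
    if method == "fifty":
        weight = 50.0  # pseudofrequencies count as 50 sequences
    elif method == "square-root":
        weight = math.sqrt(num_sequences)
    else:
        raise ValueError(f"unsupported method: {method}")
    total = sum(counts.values())
    # Assumed blending formula; see the lead-in.
    return {k: (n + weight * pseudo[k]) / (total + weight)
            for k, n in counts.items()}


if __name__ == "__main__":
    counts = {"A": 16, "C": 0, "G": 0, "U": 0}
    pseudo = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}
    for method in ("zero-offset", "fifty", "square-root"):
        print(method, blend_frequencies(counts, pseudo, 16, method))
```

With 16 sequences, Square-Root gives the pseudofrequencies a weight of 4, far less than the weight of 50 used by Fifty, so the measured frequencies dominate more strongly.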
Stem Summary This function reports the length distributions of the subregions of stems. The subregions are the length from the beginning of the sequence, the length of the 5' stem, the central loop, the length of the 3' stem, and the length to the end of the sequence. Summary statistics are followed by the complete length distributions.
Loop Summary This function calculates the length distributions of each non-paired sequence region. These are named by the flanking pairing regions, so that non-paired region "AB" is in between pairing regions 'A' and 'B'.
Alignment Tools
Many of these functions create subsets of the original alignment using various methods.
The Push Analysis Up function can then be used to copy the new subalignment to the upper input textbox, after which Interpret Input will generate an analysis of the subalignment, making it the focus of study.
Non-Canonical This function searches for non-canonical base-pairs, and marks them with a "!" beneath them. Non-canonical is defined as anything other than A=U, G=C, or G=U. Bulges and gaps are also indicated by "b" and "g" below the positions.
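The canonical-pair test described for Non-Canonical can be sketched in Python as below. Only the "!" marking is shown; the "b" and "g" markers are left out because their exact rules are not spelled out here. Paired columns are given as 0-based (i, j) tuples, and the function name is hypothetical.

```python
# A=U, G=C and G=U in either orientation.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def mark_noncanonical(sequence, pairs):
    """Return a marker line with '!' under both bases of each
    non-canonical pair and '.' everywhere else."""
    marks = ["."] * len(sequence)
    for i, j in pairs:
        a, b = sequence[i].upper(), sequence[j].upper()
        if "-" in (a, b):
            continue  # gaps are not classified in this sketch
        if a + b not in CANONICAL:
            marks[i] = marks[j] = "!"
    return "".join(marks)


if __name__ == "__main__":
    seq = "GCAAAAAAGC"
    print(seq)
    print(mark_noncanonical(seq, [(0, 9), (1, 8), (2, 7)]))
```

In the example the G-C and C-G pairs are canonical, while the A-A pair at columns 2 and 7 is marked.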
Pseudoknot Removal Resolves pseudoknots by removing the conflicting stem based on the elimination gain method.
Extract Pseudoknot This function will extract one or more of the pseudoknot regions, creating an alignment of just those regions.
Extract Regions This function will extract one or more of the pairing regions, creating an alignment of just those regions.
If there is a pseudoknot crossing region which begins within the area being extracted, and which would pair with a region outside the area being extracted, then that pseudoknot pairing region is erased from the Pasta lines, as it will have nothing with which to pair.
Erase Region Label This will erase a pairing region (stem) from the pairing indicator line. It does not alter any of the sequences themselves.
Delete Gap Cols This function will remove columns from the alignment that contain a gap in all sequences and also in the pairing indicator line.
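A minimal Python sketch of the Delete Gap Cols idea follows. It assumes that "." and "-" both count as a gap in the pairing indicator line and that all lines have the same length; the function name is hypothetical and the index line is omitted for brevity.

```python
def delete_gap_columns(pairs_line, sequences, pair_gaps=".-"):
    """Drop columns that are a gap in every sequence and in the pairs line."""
    keep = [
        col for col in range(len(pairs_line))
        if pairs_line[col] not in pair_gaps
        or any(seq[col] != "-" for seq in sequences)
    ]
    new_pairs = "".join(pairs_line[col] for col in keep)
    new_seqs = ["".join(seq[col] for col in keep) for seq in sequences]
    return new_pairs, new_seqs


if __name__ == "__main__":
    pairs = "AA..aa-"
    seqs = ["GC--GC-", "GC-AGC-"]
    print(delete_gap_columns(pairs, seqs))
```

Here the third and last columns are removed, while the fourth is kept because one sequence has a base there.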
Del Seq by Reg Size This removes sequences from the alignment based upon the number of bases within a region. The user can choose to remove all sequences in which the length of stem C1 is less than 3, for example, or greater than 12.
Partition By Stem This allows the user to divide the data set in 2 based upon the size distribution of a stem length. The dialog allows one to select the stem of interest, then the "Select By Histogram" generates a histogram of the stem length distribution. A left mouse click selects the partition value.
The user has a choice of outputs. In one case the data is divided into 2 sets of sequences, one above and one below the selected partition length for the selected stem. In the other case, whether a given sequence is above or below the selected partition length for that stem is indicated by text added to the standard label line. ^(D1>4.6)=1 This example indicates that for stem D1, the sequence that follows has a length greater than 4.6.
This may be followed by a phylogenetic parse of the results.
Partition By Loop This allows the user to divide the data set in 2 based upon the size distribution of a loop length. The dialog allows one to select the loop of interest, then the "Select By Histogram" generates a histogram of the loop length distribution. A left mouse click selects the partition value.
The user has a choice of outputs. In one case the data is divided into 2 sets of sequences, one above and one below the selected partition length for the selected loop. In the other case, whether a given sequence is above or below the selected partition length for that loop is indicated by text added to the standard label line. ^(c1D1>3.6)=0 This example indicates that for loop c1D1, the sequence that follows has a length less than 3.6.
This may be followed by a phylogenetic parse of the results.
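Partition tags of the form shown in the stem and loop examples can be read back with a short regular expression. The Python sketch below is an illustration under assumptions: it takes the tag to be ^(name>threshold)=0 or =1 somewhere in the label line, as in the two examples, and the function name and sample labels are hypothetical.

```python
import re

# ^(D1>4.6)=1  ->  region "D1", threshold 4.6, above threshold
TAG = re.compile(r"\^\(([A-Za-z0-9]+)>([0-9]+(?:\.[0-9]+)?)\)=([01])")


def parse_partition_tag(label_line):
    """Return (region, threshold, is_above) or None if no tag is present."""
    match = TAG.search(label_line)
    if match is None:
        return None
    region, threshold, flag = match.groups()
    return region, float(threshold), flag == "1"


if __name__ == "__main__":
    print(parse_partition_tag(">seq1 ^(D1>4.6)=1"))
    print(parse_partition_tag(">seq2 ^(c1D1>3.6)=0"))
    print(parse_partition_tag(">seq3 no tag here"))
```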
Phylogenetic Tools
Phylogenetic Stem/Loop Parse This works with the Partition By Stem and Loop functions, and requires input of a phylogenetic tree in Newick format that relates the RNA sequences to each other.
The function reads partition information written in the comment line, then displays the evolutionary pattern of the stem or loop on top of the phylogenetic tree, based upon a simple parsimony reconstruction.
Format Details & Notes
Pasta Format
The pasta format is built upon a fasta sequence alignment with the addition of one or two lines of pairing indicators to indicate the RNA secondary structure:
; is a comment line
>pairs (the next line contains pairing indicator letters)
....AAAA....BBBB....aaaa....bbbb....AAAA..aaaa
>index (the next line contains index/subscript numbers)
....1111....1111....1111....1111....2222..2222.
>sequence label 1
GCUCAACCCAGUCAUUUGCCGGUUC---AAUGGCUAAACCCCGGUUG
>sequence label 2
UCGCAACCC--UCAUUUCGCGGUUCCAGAAUGGAUCAACCGCGGUUU
The pairing indicators are upper and lower case letters in the ">pairs" and numbers in the ">index" line. Regions that pair with each other are indicated by corresponding upper and lower case letters, while the numbers are used as subscripts to allow more than one pairing region using the same alphabetic letter. The "." is used for space between pairing indicators, while "-" is used to indicate an alignment gap or structural bulge in the sequences.
The base in each sequence in the column beneath the first A1 will pair with the corresponding base in the column beneath the last a1. The base beneath the last B1 will pair with the base beneath the first b1. In the example shown, pairing regions A1 a1 and B1 b1 form a pseudoknot.
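The pairing rule can be checked with a short Python sketch that reads the ">pairs" and ">index" lines of the example and lists which columns pair. It is an illustration, not the program's parser: it assumes single-digit indices, equal-length halves for each region, and 0-based columns, and the function names are hypothetical.

```python
from collections import defaultdict


def region_pairs(pairs_line, index_line=None):
    """Map each region (e.g. 'A1') to its list of paired columns."""
    if index_line is None:
        index_line = "1" * len(pairs_line)  # assumed default subscript
    upper, lower = defaultdict(list), defaultdict(list)
    for col, (letter, digit) in enumerate(zip(pairs_line, index_line)):
        if not letter.isalpha():
            continue
        key = letter.upper() + digit
        (upper if letter.isupper() else lower)[key].append(col)
    # First upper-case column pairs with the last lower-case column.
    return {key: list(zip(cols, reversed(lower[key])))
            for key, cols in upper.items()}


def crosses(stem_a, stem_b):
    """True if two stems interleave (i < k < j < l), as in a pseudoknot."""
    (i, j), (k, l) = stem_a[0], stem_b[0]
    return i < k < j < l or k < i < l < j


if __name__ == "__main__":
    pairs = "....AAAA....BBBB....aaaa....bbbb....AAAA..aaaa"
    index = "....1111....1111....1111....1111....2222..2222."
    seq = "GCUCAACCCAGUCAUUUGCCGGUUC---AAUGGCUAAACCCCGGUUG"
    regions = region_pairs(pairs, index)
    for name, cols in regions.items():
        print(name, [seq[i] + seq[j] for i, j in cols])
    print(crosses(regions["A1"], regions["B1"]))
```

For the first example sequence, every listed pair is A-U, U-A, C-G or G-C, and regions A1 and B1 are reported as crossing, consistent with the pseudoknot described above.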
Stockholm (Rfam) files get converted to Pasta by changing the <<..>> notation into the AA..aa notation, as well as resolving the interleaved sequence format. Stockholm (Rfam) notation is based upon individual basepairs, whereas pasta labels whole stems. The user may wish to compare the Stockholm pair structure line with the computed Pasta format line to ensure the conversion makes sense.
Interpreted Output
RNApasta will write information into the output textbox in response to the Load File or Interpret functions. This will include comment lines indicating the detected pseudoknots and their pairing region indicators. The line with the XXXX indicates a stem that was detected as the crossing helix of the pseudoknot.
;.....11111.....1111...11111........1111...
;...............XXXX................xxxx...
;.....AAAAA.....BBBB...aaaaa........bbbb...
This display is useful to see how the program interpreted your data, particularly if it was a Stockholm (Rfam) file.
Paired and End Gaps
In the paired sequence below, there are several categories of gaps. Some are artifacts of the process used to create the alignment, and do not reflect actual pairing within a single sequence.
CCGUUG-CACAC--
--CAAUCGUGUG-C
A gap:gap pair, such as the next to last pair, may be a result of alignment, rather than biological structure. End gaps, such as those on both ends, may also be artifacts of the alignment; for example, the CC at the beginning should not be counted as pairing with --, but rather should be scored as part of the non-paired sequence before the paired region. The internal gap opposite a C, however, is likely a biologically real gap, associated with a stem-region that has a bulge. The RNApasta program attempts to screen out gap pairings induced by the alignment process when computing pairing statistics.
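The gap categories described above can be told apart with a simple rule, sketched here in Python. This is one possible screening rule, stated as an assumption rather than the program's actual method: a column with a gap on one side counts as an end gap when it lies outside the span of columns where both strands have a base, and as a bulge otherwise. The function name is hypothetical.

```python
def classify_columns(five_prime, three_prime):
    """Label each column of a paired region as 'pair', 'gap:gap',
    'end gap' or 'bulge'."""
    columns = list(zip(five_prime, three_prime))
    both = [i for i, (a, b) in enumerate(columns) if a != "-" and b != "-"]
    first = both[0] if both else len(columns)
    last = both[-1] if both else -1
    labels = []
    for i, (a, b) in enumerate(columns):
        if a != "-" and b != "-":
            labels.append("pair")
        elif a == "-" and b == "-":
            labels.append("gap:gap")
        elif i < first or i > last:
            labels.append("end gap")
        else:
            labels.append("bulge")
    return labels


if __name__ == "__main__":
    top = "CCGUUG-CACAC--"
    bottom = "--CAAUCGUGUG-C"
    for i, label in enumerate(classify_columns(top, bottom)):
        print(i, top[i], bottom[i], label)
```

Applied to the example, the sketch labels the first two and the last columns as end gaps, the next to last column as gap:gap, and the gap opposite the internal C as a bulge.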
Disclaimer NO WARRANTY
This software was created in the course of academic and/or research endeavors and not as a commercial package. Its present version (which may still be in development) is distributed for a nominal fee to cover the cost of distribution and administrative costs, "AS IS, WITH ALL DEFECTS." By using the software, each user agrees to assume all responsibility for any and all such use. The author(s) and University of Georgia are not aware that the software or the use thereof infringe any proprietary right belonging to a third party. However, NO WARRANTY OR REPRESENTATION OF ANY KIND, EXPRESS OR IMPLIED, is made about the software, including without limitation any warranty of title, noninfringement, merchantability, or fitness for a particular purpose, by the author(s) or by University of Georgia.
Copyright 2010, Russell L. Malmberg, University of Georgia