|
|
RNApasta Help, 8 April 2010
|
RNApasta is a Java application that calculates a
variety of useful statistics related to RNA stem-loop and
pseudoknot structures. It also performs functions related to
alignment editing, primarily the generation of subsets of data
where the original data set is heterogeneous with respect to some
structural feature.
The input data may be in a "pasta" formatted file, a variant of
the fasta format; alternatively, the program will also accept
Stockholm formatted files as downloaded from the Rfam database.
The "pasta" format is defined below, along
with more program details and notes.
|
|
Setup & Start
Program Interface
Basic Tools
Pasta Edit
Analysis Tools
Alignment Tools
Phylogenetic Tools
Format Details & Notes
Disclaimer: No Warranty
|
Setup & Start
|
Unpacking, Installing
|
RNApasta is available compiled for Sun Java versions 1.6 and 1.5.
The source files are also available in case you wish to compile it yourself.
| RNApasta-jar-1.6.zip |
Version compiled for Sun Java 1.6 |
| RNApasta-jar-1.5.zip |
Version compiled for Sun Java 1.5 |
| RNApasta-mac-java1.5.zip |
Version compiled for Sun Java 1.5 and bundled for MacOSX |
| RNApasta.src.zip |
Java source files |
Unzip the program version of your choice and place it in an appropriate directory.
If you wish to create a program startup shortcut (Windows or Linux), it is best to point
it at the jar-start.bat file (Windows) or jar-start.sh file (Linux), as
described below. Users of the MacOS bundle may wish to simply drag the bundle to the
Applications folder.
|
|
Starting The Program
|
To start the program from the RNApasta.jar file, use:
jar-start.bat for Windows
jar-start.sh for Linux
The bundle specifically for MacOS can be started by double clicking on it.
MacOS users can also use the jar-1.5 version by opening a command window,
then running jar-start.sh within the directory that contains the file.
Simply double clicking on RNApasta.jar may start the program, but with larger
data files you need to reserve more memory, which is accomplished by the
jar-start files.
The data directory or folder contains some sample files.
|
|
Command Line Options
|
| jar-start |
Starts the program with a graphical interface. |
| jar-start infile |
Starts the program with a graphical interface and
begins loading the infile. |
| jar-start infile outfile |
No graphical interface. The infile is interpreted and the
results saved to outfile. |
| jar-start infile outfile integer |
No graphical interface. The infile is interpreted and then
processed according to the integer (see below) with results
saved to outfile. |
| integer = 0 |
All of analyses 1-10 |
| integer = 1 |
base frequencies |
| integer = 2 |
basepair frequencies by position |
| integer = 3 |
basepair frequencies by region |
| integer = 4 |
basepair frequencies by stem position |
| integer = 5 |
entropy by position |
| integer = 6 |
stack doublets |
| integer = 7 |
pseudofrequencies |
| integer = 8 |
frequencies of regions |
| integer = 9 |
stem summary |
| integer = 10 |
loop summary |
| integer = 11 |
generates arc diagram |
|
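For batch runs over many files, the non-graphical form of the command can be assembled in a small Java helper. The sketch below only builds the argument list in the order given in the table above (script, infile, outfile, optional integer); the class name, method name, and file names are hypothetical and are not part of RNApasta.

```java
import java.util.ArrayList;
import java.util.List;

final class JarStartCommand {
    /**
     * Builds "script infile outfile [integer]" for a non-graphical run.
     * Pass null for analysis to omit the integer argument.
     */
    public static List<String> build(String script, String infile,
                                     String outfile, Integer analysis) {
        // The help text lists analysis codes 0 through 11.
        if (analysis != null && (analysis < 0 || analysis > 11)) {
            throw new IllegalArgumentException("analysis code must be 0-11");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add(script);
        cmd.add(infile);
        cmd.add(outfile);
        if (analysis != null) {
            cmd.add(Integer.toString(analysis));
        }
        return cmd;
    }
}
```

The resulting list can be handed to new ProcessBuilder(cmd).inheritIO().start() to launch the run, for example with build("./jar-start.sh", "in.pasta", "out.txt", 5) for entropy by position.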
Program Interface & Data Loading
|
Interface Overview
|
Files are loaded and displayed in the upper textbox,
with results of analyses displayed in a series of tabbed panes in the lower textbox.
One of the lower tabbed panes is "Interpreted" to indicate how RNApasta
interpreted the input data. This is particularly useful when the input data
is in the Stockholm (Rfam) format, and you wish to see how the program converted
this into stem regions.
The information in any of the textboxes may be saved in a file.
The various analytical functions may be started either by pushing a button, or
by selecting the same item in one of the menus.
The darker gray status pane at the bottom of the program screen has
a larger left side status box that will display the name of the
file most recently opened into the upper, input, text box.
The smaller right side box displays the position of the mouse
selection. If you mouse click on some text and select (highlight)
it, moving left to right, the right side status box will display
the length, beginning and ending of the selection.
A right mouse click allows you to copy selected text to the system
clipboard.
|
|
Top Row Buttons
|
| Load File |
This loads a Stockholm or Pasta formatted file into the upper
text box, and then interprets it, with output in one tab
of the lower text box. The Stockholm format specifies
individual basepairs, while the pasta format specifies
pairing regions or stems. Hence it is useful to examine
the results of the conversion in the "Interpreted"
tab.
For Stockholm files, the user has a choice of permitting
double-sided bulges within a stem, or not. Not allowing
double-sided bulges will result in more separate stems,
while allowing them will result in fewer separate stems.
...AAAA..AA...aa...aaaa..
...AAAA..BB...bb...aaaa..
The first line would result from allowing double-sided bulges,
while the second would result from not allowing them.
|
| Save Interpreted |
This saves the interpreted output to a pasta formatted file. |
| Clear Bottom Analysis |
This clears all the bottom output panes. |
| Push Analysis Up |
Some of the functions create subsets of the original data in one of the
output tabs. This button pushes the subset data to the upper (input)
textbox, so that analysis of the subset can be done. |
| Options |
This controls various program options, such as the number of decimal places
to display. |
| Help |
Program information and help. |
|
|
Find & Copy
|
| Find |
Find First and Find Next commands can be initiated from the Edit menu
to search for text in either the input or output text boxes. |
| Right Mouse Find/Copy |
Selecting some text within the interpreted tab, then a right mouse click
allows either copying or searching. |
|
|
Highlighting Pairs
|
| Stem Regions |
Right mouse click within the "Interpreted" output tab.
You can then highlight all stems with different colors to
visualize the pairing regions. This can also be done with the
highlight button within the Pasta Edit function group. |
| Individual Basepairs |
Right mouse click within the "Interpreted" output tab,
next to a base that belongs to a stem. This column of bases
and the column to which it basepairs will be highlighted. |
|
Analytical & Editing Functions
|
Basic Tools
|
| Interpret Input |
This interprets whatever is in the upper (input) textbox
and puts the results in a lower tabbed textbox labeled
"Interpreted". This is useful if you have
changed the input information in the input textbox, or if
you have pushed up a subset created by an analysis. |
| Draw Arcs |
This generates an arc diagram of the secondary structure which
can be saved in an image file. |
|
|
Pasta Edit
|
| Remove Partition Tag |
This function removes tags placed in the comment line for a sequence
as a result of a partition analysis (described below). |
| Highlight Stems |
In the "Interpreted" tab, this places a color highlight on
both halves of the stems. The same can be done by a right
mouse click within this output tab. |
| Remove All Highlights |
This removes these highlights. |
|
|
Analysis Tools
|
| Base Freq |
This calculates the frequency of each base as a function of
position in the alignment and overall.
|
| BasePair Freq Pos |
This calculates the frequency of each possible basepair as a
function of position in the alignment. The pairing used is that
indicated by the pairing indicator line. If the pairing line
indicates that position 2 pairs with position 670, then
this function will calculate the observed base-pair frequencies
for these positions, omitting the alignment induced -:- or
gap:gap pairs. This function will calculate the frequencies for
both ends of the pair, hence for a position 2:670 it will report
a certain GC frequency, and for position 670:2 it will report
the same frequency as CG.
|
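The calculation described for BasePair Freq Pos can be sketched as follows. This is a minimal illustration, not RNApasta's own code: the class and method names are hypothetical, columns are 0-based, and only gap:gap pairs are omitted, as stated above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PairFrequency {
    /**
     * Frequencies of the base pairs observed at alignment columns i and j
     * (0-based). Pairs where both columns hold '-' are skipped.
     */
    public static Map<String, Double> pairFrequencies(List<String> seqs, int i, int j) {
        Map<String, Integer> counts = new TreeMap<>();
        int total = 0;
        for (String s : seqs) {
            char a = s.charAt(i);
            char b = s.charAt(j);
            if (a == '-' && b == '-') {
                continue; // alignment-induced gap:gap pair
            }
            counts.merge("" + a + b, 1, Integer::sum);
            total++;
        }
        Map<String, Double> freqs = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freqs.put(e.getKey(), e.getValue() / (double) total);
        }
        return freqs;
    }
}
```

Calling the method with (i, j) and then with (j, i) gives the two views described above: the GC frequency reported for one end is the CG frequency reported for the other.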
| BasePair Freq Reg |
This reports the frequency of each possible basepair as a function
of regions. First, overall frequencies of base-pairs are reported
for the whole alignment, the pseudoknots, and non-pseudoknot
stem-loops. A crossing helix is the stem of a pseudoknot which the
program detects as actually crossing another stem, and is indicated
by an XXX in one of the comment lines. Second, the base-pair
frequencies are given for each of the labelled regions of the
alignment - A, B, C .. etc.
|
| BasePair Freq Stem Pos |
For each region, this calculates the frequency of each base-pair as
a function of position within a stem, with positions reported as
outer, middle, or inner. If there are 4 base-pairs in a stem, the
middle 2 would get averaged. The overall base-pair frequency by
stem position over all positions is also calculated.
|
| Entropy by Pos |
This computes relative and absolute entropy by alignment position,
with and without gaps. Entropy is H(X) = - Sum over i [ P(Xi) log
P(Xi) ], and relative entropy is H(P||Q) = Sum over i [ P(Xi) log
P(Xi)/Q(Xi) ]; see Durbin et al., pages 305-308. Natural logarithms
are used.
|
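The two formulas can be written out directly. The sketch below assumes the per-position probabilities have already been computed and are passed in as arrays; the class and method names are hypothetical. Terms with P(Xi) = 0 are taken as contributing 0, which is the usual convention.

```java
final class ColumnEntropy {
    /** H(X) = - sum_i P(Xi) ln P(Xi), using the natural logarithm. */
    public static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0.0) {
                h -= pi * Math.log(pi);
            }
        }
        return h;
    }

    /** H(P||Q) = sum_i P(Xi) ln( P(Xi) / Q(Xi) ); p and q must have equal length. */
    public static double relativeEntropy(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("p and q must have the same length");
        }
        double h = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                // If q[i] is 0 here the result is infinite.
                h += p[i] * Math.log(p[i] / q[i]);
            }
        }
        return h;
    }
}
```

For a position where all four bases are equally frequent, entropy is ln 4 (about 1.386), and the relative entropy of that distribution against itself is 0.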
| Stack Doublets |
This calculates the frequency with which basepairs follow
each other in a stem, for example how often a GC follows a CG.
|
| PseudoFreqs |
This will recalculate the base and base-pair frequencies using
pseudofrequencies. The user can choose which set of
pseudofrequencies to use, or to define their own, and then which
method to average the pseudofrequencies with the measured
frequencies. Zero-offset adds 1 to any 0 count; Fifty weights the
pseudofrequencies as if they were from 50 sequences; Square-Root
uses the square root of the number of sequences in the alignment as
the weight, and Minimal Risk uses a modification of Square-Root
developed by Wu et al., 1999. J. Comp. Bio. 6: 219-235. Wu et al.
discuss each of these methods.
|
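As an illustration of the Fifty and Square-Root weightings, the sketch below uses the common weighted-average form (count + weight * pseudofrequency) / (N + weight), where N is the total count. This form is an assumption made for illustration and may differ in detail from what RNApasta computes; Zero-offset and Minimal Risk are not shown. All names are hypothetical.

```java
final class PseudoFrequency {
    /**
     * Blends observed counts with pseudofrequencies, treating the
     * pseudofrequencies as if they came from `weight` extra sequences.
     */
    public static double[] blend(double[] counts, double[] pseudo, double weight) {
        double n = 0.0;
        for (double c : counts) {
            n += c;
        }
        double[] out = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            out[i] = (counts[i] + weight * pseudo[i]) / (n + weight);
        }
        return out;
    }

    /** "Fifty": pseudofrequencies weighted as if from 50 sequences. */
    public static double[] fifty(double[] counts, double[] pseudo) {
        return blend(counts, pseudo, 50.0);
    }

    /** "Square-Root": weight is the square root of the number of sequences. */
    public static double[] squareRoot(double[] counts, double[] pseudo, int numSequences) {
        return blend(counts, pseudo, Math.sqrt(numSequences));
    }
}
```

With 16 sequences, counts of 12, 4, 0, 0 and uniform pseudofrequencies of 0.25, the Square-Root weight is 4, so the blended frequencies are 0.65, 0.25, 0.05 and 0.05.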
| Region Freq |
This measures the frequency with which each pairing region appears
in the sequence alignment. Sequences that have no bases in that
region of the alignment are not counted.
|
| Stem Summary |
This function reports the length distributions of the subregions of
stems. The subregions are the length from the beginning of the
sequence, the length of the 5' stem, the central loop, the length
of the 3' stem, and the length to the end of the sequence. Summary
statistics are followed by the complete length distributions.
|
| Loop Summary |
This function calculates the length distributions of each
non-paired sequence region. These are named by the flanking pairing
regions, so that non-paired region "AB" is in between pairing
regions 'A' and 'B'.
|
|
|
Alignment Tools
|
Many of these functions create subsets of the original alignment
using various methods.
The Push Analysis Up function can then be used to copy the
new subalignment to the upper input textbox, after which Interpret
will generate an analysis of the subalignment, making it the focus of
study.
| Non-Canonical |
This function searches for non-canonical base-pairs, and marks
them with a "!" beneath them. Non-canonical is defined as anything
other than A=U, G=C, or G=U. Bulges and gaps are also indicated by
"b" and "g" below the positions.
|
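The definition of a canonical pair used here is small enough to state as code. The sketch below is illustrative only; the class and method names are hypothetical, and both orientations of each pair are accepted.

```java
final class CanonicalPairs {
    /** True for A=U, G=C, or G=U in either orientation (case-insensitive). */
    public static boolean isCanonical(char a, char b) {
        String pair = ("" + a + b).toUpperCase();
        switch (pair) {
            case "AU":
            case "UA":
            case "GC":
            case "CG":
            case "GU":
            case "UG":
                return true;
            default:
                return false;
        }
    }
}
```

Any pair for which isCanonical returns false, such as A with G, is one that the Non-Canonical function would mark with "!".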
| Pseudoknot Removal |
Resolves pseudoknots by removing the conflicting stem based
on the elimination gain method.
|
| Extract Pseudoknot |
This function will extract one or more of the pseudoknot regions,
creating an alignment of just those regions.
|
| Extract Regions |
This function will extract one or more of the pairing regions,
creating an alignment of just those regions.
If there is a pseudoknot crossing region which begins within the
area being extracted, and which would pair with a region outside
the area being extracted, then that pseudoknot pairing region is
erased from the Pasta lines, as it will have nothing with which
to pair.
|
| Erase Region Label |
This will erase a pairing region (stem) from the pairing indicator
line. It does not alter any of the sequences themselves.
|
| Delete Gap Cols |
This function will remove columns from the alignment that contain a
gap in all sequences and also in the pairing indicator line.
|
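The column test used by Delete Gap Cols can be sketched as below. This is a minimal illustration under two assumptions: every line has the same length, and both "." and "-" count as empty in the pairing indicator line. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class GapColumns {
    /**
     * Removes columns that are '-' in every sequence and empty ('.' or '-')
     * in the pairing line. Returns the filtered pairing line first,
     * followed by the filtered sequences in their original order.
     */
    public static List<String> deleteGapColumns(String pairs, List<String> seqs) {
        int n = pairs.length();
        boolean[] keep = new boolean[n];
        for (int c = 0; c < n; c++) {
            char p = pairs.charAt(c);
            boolean empty = (p == '.' || p == '-');
            for (String s : seqs) {
                if (s.charAt(c) != '-') {
                    empty = false; // some sequence has a base here
                    break;
                }
            }
            keep[c] = !empty;
        }
        List<String> out = new ArrayList<>();
        out.add(filter(pairs, keep));
        for (String s : seqs) {
            out.add(filter(s, keep));
        }
        return out;
    }

    private static String filter(String line, boolean[] keep) {
        StringBuilder sb = new StringBuilder();
        for (int c = 0; c < keep.length; c++) {
            if (keep[c]) {
                sb.append(line.charAt(c));
            }
        }
        return sb.toString();
    }
}
```

A column that holds a gap in every sequence but a pairing letter in the indicator line is kept, matching the description above.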
| Del Seq by Reg Size |
This removes sequences from the alignment based upon the
number of bases within a region. The user can choose to remove
all sequences in which the length of stem C1 is less than 3,
for example, or greater than 12.
|
| Partition By Stem |
This allows the user to divide the data set in 2 based upon the
size distribution of a stem length. The dialog allows one to select
the stem of interest, then the "Select By Histogram" generates a
histogram of the stem length distribution. A left mouse click
selects the partition value.
The user has a choice of outputs. In one case the data is divided
into 2 sets of sequences, one above and one below the selected
partition length for the selected stem. In the other case, whether
a given sequence is above or below the selected partition length
for that stem is indicated by text added to the standard label
line, for example ^(D1>4.6)=1. This tag indicates that for stem
D1, the sequence that follows has a length greater than 4.6.
This may be followed by a phylogenetic parse of the results.
|
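If you want to read the partition tags in your own scripts, a tag such as ^(D1>4.6)=1 can be picked out of a label line with a regular expression. The sketch below is based only on the two example tags shown in this help file; the exact position of the tag within the label line, and the class and field names, are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PartitionTag {
    // Matches tags of the form ^(NAME>VALUE)=0 or ^(NAME>VALUE)=1
    private static final Pattern TAG =
        Pattern.compile("\\^\\(([A-Za-z0-9]+)>([0-9]+(?:\\.[0-9]+)?)\\)=([01])");

    public final String region;    // stem or loop name, e.g. "D1" or "c1D1"
    public final double threshold; // the partition length
    public final boolean above;    // true when the tag ends in =1

    private PartitionTag(String region, double threshold, boolean above) {
        this.region = region;
        this.threshold = threshold;
        this.above = above;
    }

    /** Returns the first partition tag found in the line, or null if none. */
    public static PartitionTag parse(String labelLine) {
        Matcher m = TAG.matcher(labelLine);
        if (!m.find()) {
            return null;
        }
        return new PartitionTag(m.group(1),
                                Double.parseDouble(m.group(2)),
                                m.group(3).equals("1"));
    }
}
```

For a line containing ^(D1>4.6)=1, parse returns region "D1", threshold 4.6 and above set to true.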
| Partition By Loop |
This allows the user to divide the data set in 2 based upon the
size distribution of a loop length. The dialog allows one to select
the loop of interest, then the "Select By Histogram" generates a
histogram of the loop length distribution. A left mouse click
selects the partition value.
The user has a choice of outputs. In one case the data is divided
into 2 sets of sequences, one above and one below the selected
partition length for the selected loop. In the other case, whether
a given sequence is above or below the selected partition length
for that loop is indicated by text added to the standard label
line, for example ^(c1D1>3.6)=0. This tag indicates that for
loop c1D1, the sequence that follows has a length less than
3.6.
This may be followed by a phylogenetic parse of the results.
|
|
|
Phylogenetic Tools
|
| Phylogenetic Stem/Loop Parse |
This works with the Partition By Stem and Loop functions,
and requires input of a phylogenetic tree in Newick format that
relates the RNA sequences to each other.
The function reads partition information written in the
comment line, then displays the evolutionary pattern of the
stem or loop on top of the phylogenetic tree, based upon a simple
parsimony reconstruction.
|
|
Format Details & Notes
|
Pasta Format
|
The pasta format is built upon a fasta sequence
alignment with the addition of one or two lines of pairing
indicators to indicate the RNA secondary structure:
; is a comment line
>pairs (the next line contains pairing indicator letters)
....AAAA....BBBB....aaaa....bbbb....AAAA..aaaa.
>index (the next line contains index/subscript numbers)
....1111....1111....1111....1111....2222..2222.
>sequence label 1
GCUCAACCCAGUCAUUUGCCGGUUC---AAUGGCUAAACCCCGGUUG
>sequence label 2
UCGCAACCC--UCAUUUCGCGGUUCCAGAAUGGAUCAACCGCGGUUU
The pairing indicators are upper and lower case letters in the
">pairs" and numbers in the ">index" line. Regions that pair
with each other are indicated by corresponding upper and lower case
letters, while the numbers are used as subscripts to allow more
than one pairing region using the same alphabetic letter. The "."
is used for space between pairing indicators, while "-" is used to
indicate an alignment gap or structural bulge in the sequences.
The base in each sequence in the column beneath the first
A1 will pair with the corresponding base in the column
beneath the last a1. The base beneath the last B1
will pair with the base beneath the first b1. In the
example shown, pairing regions A1 a1 and B1 b1
form a pseudoknot.
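The pairing rule just described (first column of A1 with last column of a1, and so on inward) can be sketched in Java. This is an illustration rather than RNApasta's own parser: it assumes both a ">pairs" and an ">index" line of equal length are present, that each index is a single digit, and that the two halves of a region have the same number of columns. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PastaPairs {
    /** Returns, for each column, the 0-based column it pairs with, or -1. */
    public static int[] partners(String pairs, String index) {
        if (pairs.length() != index.length()) {
            throw new IllegalArgumentException("pairs and index lines differ in length");
        }
        // Collect the columns of each labelled region, e.g. "A1" or "a1".
        Map<String, List<Integer>> cols = new LinkedHashMap<>();
        for (int c = 0; c < pairs.length(); c++) {
            char letter = pairs.charAt(c);
            if (!Character.isLetter(letter)) {
                continue;
            }
            String label = "" + letter + index.charAt(c);
            cols.computeIfAbsent(label, k -> new ArrayList<>()).add(c);
        }
        int[] partner = new int[pairs.length()];
        Arrays.fill(partner, -1);
        for (Map.Entry<String, List<Integer>> e : cols.entrySet()) {
            String label = e.getKey();
            if (!Character.isUpperCase(label.charAt(0))) {
                continue; // lower-case halves are handled with their upper-case mate
            }
            String mate = Character.toLowerCase(label.charAt(0)) + label.substring(1);
            List<Integer> up = e.getValue();
            List<Integer> down = cols.get(mate);
            if (down == null || down.size() != up.size()) {
                throw new IllegalArgumentException("region " + label + " has no equal-length partner");
            }
            for (int k = 0; k < up.size(); k++) {
                int i = up.get(k);
                int j = down.get(down.size() - 1 - k); // outermost pairs with outermost
                partner[i] = j;
                partner[j] = i;
            }
        }
        return partner;
    }
}
```

Applied to 47-column pairs and index lines laid out as in the example above, column 4 (the first A1) pairs with column 23 (the last a1), and column 15 (the last B1) pairs with column 28 (the first b1), counting columns from 0.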
Stockholm (Rfam) files get converted to Pasta by changing the
<<..>> notation into the AA..aa
notation, as well as resolving the interleaved sequence format.
Stockholm (Rfam) notation is based upon individual basepairs,
whereas pasta labels whole stems. The user may wish to compare
the Stockholm pair structure line with the computed Pasta format
line to ensure the conversion makes sense.
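The conversion from individual basepairs to stems can be sketched as follows. This is a simplified illustration, not RNApasta's own algorithm: it handles only the "<" and ">" characters, labels stems A, B, C and so on in 5' order without index numbers, and starts a new stem whenever two consecutive basepairs are not directly stacked, which corresponds to not allowing double-sided bulges. The names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

final class StockholmToStems {
    /** Converts a <<..>> structure line into AA..aa stem labels. */
    public static String toStemLabels(String ss) {
        int n = ss.length();
        int[] partner = new int[n];
        Arrays.fill(partner, -1);
        Deque<Integer> open = new ArrayDeque<>();
        for (int i = 0; i < n; i++) {
            char c = ss.charAt(i);
            if (c == '<') {
                open.push(i);
            } else if (c == '>') {
                if (open.isEmpty()) {
                    throw new IllegalArgumentException("unmatched '>' at column " + i);
                }
                int j = open.pop();
                partner[j] = i;
                partner[i] = j;
            }
        }
        if (!open.isEmpty()) {
            throw new IllegalArgumentException("unmatched '<' at column " + open.peek());
        }
        char[] out = new char[n];
        Arrays.fill(out, '.');
        int stem = -1;
        int prevOpen = -2;
        int prevClose = -2;
        for (int i = 0; i < n; i++) {
            int j = partner[i];
            if (j <= i) {
                continue; // unpaired column, or the 3' half of a pair
            }
            boolean stacked = (i == prevOpen + 1) && (j == prevClose - 1);
            if (!stacked) {
                stem++; // a break on either side starts a new stem
            }
            if (stem > 25) {
                throw new IllegalArgumentException("more than 26 stems; index numbers would be needed");
            }
            out[i] = (char) ('A' + stem);
            out[j] = (char) ('a' + stem);
            prevOpen = i;
            prevClose = j;
        }
        return new String(out);
    }
}
```

For the structure line ...<<<<..<<...>>...>>>>.. this produces ...AAAA..BB...bb...aaaa.., the form shown in the Load File description for the case where double-sided bulges are not allowed.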
|
|
Interpreted Output
|
RNApasta will write information into the output textbox
in response to the Load File or Interpret functions. This
will include comment lines indicating the detected pseudoknots
and their pairing region indicators. The line with the XXXX
indicates a stem that was detected as the crossing helix of
the pseudoknot.
;.....11111.....1111...11111........1111...
;...............XXXX................xxxx...
;.....AAAAA.....BBBB...aaaaa........bbbb...
This display is useful to see how the program interpreted your
data, particularly if it was a Stockholm (Rfam) file.
|
|
Paired and End Gaps
|
In the paired sequence below, there are several
categories of gaps. Some are artifacts of the process used to
create the alignment, and do not reflect actual pairing within a
single sequence.
CCGUUG-CACAC--
--CAAUCGUGUG-C
A gap:gap pair, such as the next to last pair, may be a result
of alignment, rather than biological structure. End gaps, such
as those on both ends, may also be artifacts of the alignment;
for example, the CC at the beginning should not be counted as
pairing with --, but rather should be scored as part of the
non-paired sequence before the paired region. The internal gap
opposite a C, however, is likely a biologically real gap, associated
with a stem-region that has a bulge. The RNApasta program attempts
to screen out gap pairings induced by the alignment process when
computing pairing statistics.
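One simple way to separate these categories is sketched below. The rule used here is an assumption made for illustration and is not necessarily the rule RNApasta applies: a column with a gap on one side only is called an end gap when it lies outside the outermost fully paired columns, and a bulge otherwise. The class and method names are hypothetical.

```java
final class GapClassifier {
    /**
     * Classifies each column of two paired strands written one above the other:
     * 'P' = both bases present, 'G' = gap:gap, 'E' = end gap, 'B' = bulge.
     */
    public static String classify(String top, String bottom) {
        if (top.length() != bottom.length()) {
            throw new IllegalArgumentException("strands must have equal length");
        }
        int n = top.length();
        int first = -1;
        int last = -1;
        // Find the outermost columns where both strands have a base.
        for (int c = 0; c < n; c++) {
            if (top.charAt(c) != '-' && bottom.charAt(c) != '-') {
                if (first < 0) {
                    first = c;
                }
                last = c;
            }
        }
        StringBuilder out = new StringBuilder(n);
        for (int c = 0; c < n; c++) {
            boolean gapTop = top.charAt(c) == '-';
            boolean gapBottom = bottom.charAt(c) == '-';
            if (gapTop && gapBottom) {
                out.append('G');
            } else if (!gapTop && !gapBottom) {
                out.append('P');
            } else if (first < 0 || c < first || c > last) {
                out.append('E');
            } else {
                out.append('B');
            }
        }
        return out.toString();
    }
}
```

For the two strands shown above, classify returns EEPPPPBPPPPPGE: the leading CC and the trailing C are end gaps, the gap opposite the internal C is a bulge, and the next to last column is a gap:gap pair.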
|
|
Disclaimer
|
NO WARRANTY
|
This software was created in the course of academic and/or
research endeavors and not as a commercial package.
Its present version (which may still be in development) is
distributed for a nominal fee to cover the cost of distribution and
administrative costs, "AS IS, WITH ALL DEFECTS." By using the
software, each user agrees to assume all responsibility for any and
all such use. The author(s) and University of Georgia are not aware
that the software or the use thereof infringe any proprietary right
belonging to a third party. However, NO WARRANTY OR REPRESENTATION
OF ANY KIND, EXPRESS OR IMPLIED, is made about the software,
including without limitation any warranty of title,
noninfringement, merchantability, or fitness for a particular
purpose, by the author(s) or by University of Georgia.
Copyright 2010, Russell L. Malmberg, University of Georgia
|