PRICE Documentation: User Manual
Command-line Arguments
The various command-line arguments for PRICE are described below. Note that brief descriptions, including current
default values, can be accessed at the command line by executing PRICE with the -h or --help flags.
- INPUT FILES:
- Accepted formats are fasta (.fa or .fasta), fastq (.fq, .fastq, or _sequence.txt), or priceq (.pq or .priceq)
- READ FILES:
- NOTE: these flags can be used multiple times in the same command to include multiple read datasets.
- -fp a b c
- (a,b)input file pair, (c)amplicon insert size (including read)
- -fpp a b c d
- (a,b)input file pair, (c)amplicon insert size (including read), (d)required % identity for match (25-100 allowed)
- INITIAL CONTIG FILES:
- NOTE: these flags can be used multiple times in the same command to include multiple initial contig datasets.
- -icf a b c d
- (a)initial contig file, (b)number of addition steps, (c)number of cycles per step, (d)const by which to multiply quality scores
- -picf a b c d e
- (a)num of initial contigs from this file, (b)initial contig file, (c)num addition steps, (d)num cycles per step, (e)const by which to multiply quality scores
- OUTPUT FILES:
- accepted formats are fasta (.fa or .fasta) or priceq (.pq or .priceq)
- -o a
- (a)output file name (.fasta or .priceq)
- -nco a
- (a)num. cycles that pass in between output files being written
- OTHER PARAMS:
- -nc a
- (a)num. of cycles
- Recommendation: if too few cycles were specified, a previously run job can effectively be re-started with very little loss
of information, provided the output of its final cycle was written in .priceq format. The final .priceq
file can be supplied as the initial contig file using the -icf flag.
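
The re-start recommendation above can be sketched as a small Python helper that assembles an argument list from the flags documented on this page. This is only an illustration: the executable name `PRICE`, every file name, and the numeric values passed to -fp and -icf are hypothetical placeholders, not recommended settings.

```python
def build_price_args(executable, read_pairs, initial_contigs, num_cycles, output_file):
    """Assemble an argument list from the -fp, -icf, -nc and -o flags.

    read_pairs:      iterable of (file_a, file_b, insert_size)
    initial_contigs: iterable of (contig_file, addition_steps, cycles_per_step, quality_multiplier)
    """
    args = [executable]
    for file_a, file_b, insert_size in read_pairs:
        args += ["-fp", file_a, file_b, str(insert_size)]
    for contig_file, steps, cycles_per_step, quality_mult in initial_contigs:
        args += ["-icf", contig_file, str(steps), str(cycles_per_step), str(quality_mult)]
    args += ["-nc", str(num_cycles), "-o", output_file]
    return args


# Hypothetical re-start: the .priceq file written by an earlier run is passed
# back in as the initial contig file. All names and numbers are placeholders.
restart_args = build_price_args(
    "PRICE",
    [("reads_1.fq", "reads_2.fq", 300)],
    [("previous_run.priceq", 1, 1, 1)],
    10,
    "restarted_run.priceq",
)
print(" ".join(restart_args))
```

Building the command as a list rather than a single string keeps each argument separate, which is convenient if the list is later handed to a process launcher.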
- -link a
- (a)max. number of contigs that are allowed to replace a read in a contig-edge assembly
- Edge assembly jobs can become unreasonably complex if the sequence into which the contig is being
extended includes a repetitive element. The ability of even a single repeat-derived read to map to
all contigs in the current assembly that contain a copy of that repeat opens the possibility of a
huge number of contigs being brought into a single assembly job, despite the fact that they do not
truly derive from nearby genomic loci. -link provides an opportunity to avoid that situation by
placing a maximum on the number of contigs that are allowed to replace a single repeat-mapping read.
The read itself will be retained in the assembly job, allowing for the assembly to extend into the
local copy of the repetitive element, but irrelevant contigs with inappropriate sequence flanking
the repetitive element will not be included.
- -mol a
- (a)minimum overlap length for mini-assembly
- NOTE: -mol does not affect the parameters for de-Bruijn-graph-based assembly.
- This is the global minimum alignment length for two sequences to be combined into a contig during
contig-edge assembly jobs. Alignments are performed semi-globally, so this is the minimum extent of overlap
that is allowed to exist between two sequences for them to be combined. This value is logarithmically increased
with the number of sequences in an assembly job, starting when that number exceeds that specified by -tol.
For a gapped alignment, the lesser of the overlapping nucleotide counts for the two sequences is compared to this value.
As noted above, if the de Bruijn graph strategy is applied with a k-mer size (-dbk, below) smaller than -mol, short
sequences with less overlap can still be combined through that strategy.
- -tol a
- (a)threshold seq num for scaling overlap for contig-edge assemblies
- NOTE: -tol does not affect the parameters for de-Bruijn-graph-based assembly.
- For contig-edge assembly jobs, this is the number of sequences above which the minimum overlap required for
combining two sequences into a contig will be logarithmically increased. At and below this number of sequences, the
minimum overlap will be equal to that specified by -mol.
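
The interaction between -mol and -tol can be illustrated with a short Python sketch. The text only states that the minimum overlap is "logarithmically increased" above the -tol threshold, so the specific scaling used here (multiplying by 1 + log10 of the ratio of sequence count to threshold) is an assumption made for illustration and may differ from what PRICE actually computes. Only the threshold behavior is taken from the description.

```python
import math


def min_overlap(num_seqs, mol, tol):
    """Illustrative minimum overlap for a contig-edge assembly job.

    At or below `tol` sequences the value is exactly `mol` (as documented).
    Above `tol` the value grows logarithmically; the exact growth function
    is an ASSUMPTION, not taken from the PRICE source.
    """
    if tol < 1:
        raise ValueError("tol must be at least 1")
    if num_seqs <= tol:
        return mol
    # ASSUMED scaling: one extra multiple of `mol` per ten-fold increase over `tol`.
    return math.ceil(mol * (1 + math.log10(num_seqs / tol)))


for n in (10, 20, 200, 2000):
    print(n, min_overlap(n, mol=35, tol=20))
```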
- -mpi a
- (a)minimum % identity for contig-edge assembly
- In a gapped alignment, the aligned strand with the lower percent of matching nucleotides will give rise to the % identity.
Competing alignments will always be selected based on their calculated semi-global scores, but alignments with less than
the minimum percent identity will be excluded. -mpi imposes a global minimum on contig-edge assemblies, but like the
minimum overlap length, this requirement will be scaled as the number of reads increases over that specified by -tpi (see below).
The minimum % ID value will asymptotically approach 100% from the value given by -mpi, decreasing the distance by half for every
log-scale increase to the number of input sequences.
- -tpi a
- (a)threshold seq num for scaling % ID for contig-edge assemblies
- For contig-edge assembly jobs, this is the number of sequences above which the minimum % identity required for
combining two sequences into a contig will be drawn closer to 100% (see -mpi above). At and below this number of
sequences, the minimum % identity will be equal to that specified by -mpi.
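
The -mpi/-tpi scaling described above (halving the remaining distance to 100% for every log-scale increase in the number of sequences) might be written as follows. This is a sketch under two assumptions that the text does not confirm: that "log-scale" means base 10, and that the increase is measured relative to the -tpi threshold.

```python
import math


def min_percent_identity(num_seqs, mpi, tpi):
    """Illustrative minimum % identity for a contig-edge assembly job.

    At or below `tpi` sequences the value is exactly `mpi` (as documented).
    Above it, the gap to 100% is halved per ten-fold increase in sequence
    count over `tpi` (base 10 and the reference point are ASSUMPTIONS).
    """
    if tpi < 1:
        raise ValueError("tpi must be at least 1")
    if num_seqs <= tpi:
        return float(mpi)
    log_steps = math.log10(num_seqs / tpi)
    return 100.0 - (100.0 - mpi) * 0.5 ** log_steps


for n in (20, 200, 2000):
    print(n, min_percent_identity(n, mpi=80, tpi=20))
```

Under these assumptions the value approaches but never reaches 100%, which matches the "asymptotically approach" wording.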
- -MPI a, -TPI a
- same as -mpi and -tpi above, but for meta-assembly
- NOTE: there is no minimum overlap value for meta-assembly. Meta-assembly only collapses highly-redundant
sequences that overlap entirely (or nearly entirely).
- -trim a b
- (a)number of cycles after which trimming is applied, (b)min. coverage level, (optional: min. length after trim).
If (a) == 0, initial contigs will be trimmed whenever they are added.
- -target a b [c d]
- limit output contigs to those with matches to input initial contigs at the end of each cycle.
(a) % identity to an input initial contig to count as a match (ungapped); (b)num cycles to skip
before applying this filter. [c and d are optional, but must both be provided if either is]
After target filtering has begun, target-filtered/-unfiltered cycles will alternate with (c)
filtered cycles followed by (d) unfiltered cycles.
- This feature was inspired by target virus assembly jobs using metagenomic datasets generated through random hexamer
priming of RNA templates. The tendency of RT to switch templates generated chimeric paired-ends, which would
allow jobs seeded with viral sequences to tangentially begin assembling contigs from other metagenome components.
Even with more reliable library prep methods, low-frequency chimeric amplicons can spawn such off-target assemblies,
as can repetitive genomic sequence or incorrect read mapping. This feature eliminates contigs at the end of each cycle
that do not retain identity to the seed sequences. Cycles can be skipped initially and/or periodically through the
PRICE run using args (b), (c), and (d). This allows short contigs to spawn the assembly of larger contigs that will not be able to
immediately close the gap between them and their parent contig.
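
The cycle schedule implied by the (b), (c), and (d) arguments of -target can be sketched in Python as below. The function name and the 0-based cycle indexing are assumptions for illustration. The sketch only reproduces the documented pattern: skip (b) cycles, then alternate (c) filtered cycles with (d) unfiltered cycles, or filter every cycle when (c) and (d) are omitted.

```python
def target_filter_schedule(num_cycles, skip, filtered_run=None, unfiltered_run=None):
    """Return one boolean per cycle: True if target filtering runs at its end.

    skip:           argument (b), cycles to skip before filtering starts
    filtered_run:   optional argument (c)
    unfiltered_run: optional argument (d)
    """
    if (filtered_run is None) != (unfiltered_run is None):
        raise ValueError("c and d must both be provided if either is")
    if filtered_run is not None and (filtered_run < 1 or unfiltered_run < 0):
        raise ValueError("c must be >= 1 and d must be >= 0")

    schedule = []
    for cycle in range(num_cycles):  # 0-based cycle index (assumption)
        if cycle < skip:
            schedule.append(False)
        elif filtered_run is None:
            schedule.append(True)
        else:
            position = (cycle - skip) % (filtered_run + unfiltered_run)
            schedule.append(position < filtered_run)
    return schedule


# Skip 2 cycles, then alternate 2 filtered cycles with 1 unfiltered cycle.
print(target_filter_schedule(8, skip=2, filtered_run=2, unfiltered_run=1))
```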
- -targetF a b [c d]
- the same as -target, but now matches to all reads in the input set will be specified, not just
the ones that have been introduced up to that point (this is FullFile mode).
- -dbmax a
- (a) the maximum length sequence that will be fed into de Bruijn assembly
(recommended: max paired-end read length)
- The de Bruijn graph approach to genome assembly is highly efficient. However, when k-mers are repeated in
a genome, or when k-mers are palindromic, the graph representation loses information about the overall
structure of the genome. PRICE applies the de Bruijn graph approach only to local contig-edge assembly jobs,
thereby limiting the opportunity for such errors to cases of very local repeats and, by creating strand-specific
assembly jobs using paired-end topology to orient sequences, avoiding the confusion caused by palindromic sequences.
In order for PRICE to not dismantle the structure of the larger contigs that are also included in contig-edge
assembly jobs, an upper limit must be placed on the length of sequence that will be subjected to de Bruijn graph
representation. If -dbmax is set to the length of the input reads, then they will be efficiently collapsed
by this method, while the larger contigs will be retained as they are for pairwise alignment and collapse with one
another and with the contig(s) generated by the de Bruijn graph method.
- -dbk a
- (a) the k-mer size for de Bruijn assembly (recommended: keep less than the read length)
- The de Bruijn assembly strategy is only applied during the contig-edge assemblies, and then it is only applied
to short sequences (see -dbmax). The single-stranded nature of the contig-edge assembly jobs allows for the k-mer size
to be an even number without introducing errors from palindromic k-mers.
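
As background for the remark about even k-mer sizes: a k-mer can equal its own reverse complement only when k is even, which is why even values are a concern when both strands are represented together. The following Python sketch (not part of PRICE; the function names are illustrative) counts such k-mers by brute force for small k.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_rc_palindromes(k):
    """Count k-mers that are identical to their own reverse complement."""
    return sum(
        1
        for kmer in map("".join, product("ACGT", repeat=k))
        if kmer == reverse_complement(kmer)
    )


for k in (2, 3, 4, 5):
    print(k, count_rc_palindromes(k))
```

For odd k the count is zero, because the middle base would have to be its own complement. For even k there are 4^(k/2) such k-mers.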
- -dbms a
- (a) the minimum number of sequences to which de Bruijn assembly will be applied
- The de Bruijn graph is a graph of nucleotide k-mers. Therefore, k-mers containing non-canonical nucleotide characters
(like N, which is often used in raw sequence files to represent a nucleotide whose identity was ambiguous) cannot be included
in the de Bruijn graph. As a result, a single N in an input sequence will prevent up to (k-mer size) k-mers from being entered into the
de Bruijn graph. In areas of redundant coverage, other sequences will be able to provide the path through the de Bruijn graph
to span the error. But in places of low coverage, a single error could prevent two sequences capable of forming a very high-quality
alignment from being combined. Places of low coverage can be identified as contig-edge assembly jobs with small numbers of short
(read-sized) sequences. -dbms allows one to bypass de Bruijn-based assembly of short sequences when very few such sequences are
present.
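
The effect of an ambiguous base on k-mer extraction, and the kind of decision that -dbmax and -dbms describe, can be illustrated with the Python sketch below. The function names are hypothetical, and the comparisons used (length at most -dbmax, count of at least -dbms) are assumed readings of "maximum length" and "minimum number" rather than confirmed PRICE behavior.

```python
def usable_kmers(seq, k):
    """Return the k-mers of `seq` that contain only A, C, G or T (upper case)."""
    canonical = set("ACGT")
    return [
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= canonical
    ]


def use_de_bruijn(seqs, dbmax, dbms):
    """Decide whether a job has enough short sequences for de Bruijn assembly.

    ASSUMPTION: sequences of length <= dbmax count as short, and de Bruijn
    assembly is applied only when at least dbms of them are present.
    """
    short = [s for s in seqs if len(s) <= dbmax]
    return len(short) >= dbms


clean = "ACGTACGTACGTACGTACGT"           # 20 nt, no ambiguous bases
one_n = clean[:10] + "N" + clean[11:]    # same read with a single N in the middle
print(len(usable_kmers(clean, 5)), len(usable_kmers(one_n, 5)))
```

With k = 5, the 20-nt read yields 16 k-mers, and the single internal N removes the 5 k-mers that span it. An N near the end of a read removes fewer, since fewer k-mers cover that position.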
- -lenf a b
- filters out contigs shorter than (a) nt at the end of every cycle, after skipping (b) cycles.
- Recommended use is to set this to the read length or higher. De Bruijn-graph assembly can yield very short sequences (paths that
traverse too few k-mers to cover even a read length), but those are not eliminated unless a minimum contig length is specified
with -lenf.
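
A minimal Python sketch of the -lenf behavior described above is shown here. The function name and the 0-based cycle index are assumptions for illustration. The sketch keeps contigs whose length is at least (a) once (b) cycles have been skipped.

```python
def length_filter(contigs, cycle_index, min_len, skip_cycles):
    """Apply a -lenf style filter at the end of one cycle.

    contigs:     list of contig sequences (strings)
    cycle_index: 0-based index of the cycle that just finished (assumption)
    min_len:     argument (a); contigs shorter than this are removed
    skip_cycles: argument (b); no filtering during the first (b) cycles
    """
    if cycle_index < skip_cycles:
        return list(contigs)
    return [c for c in contigs if len(c) >= min_len]


contigs = ["ACGT", "ACGTACGTAC", "ACGTACGTACGTACGT"]
print(length_filter(contigs, cycle_index=0, min_len=10, skip_cycles=2))
print(length_filter(contigs, cycle_index=2, min_len=10, skip_cycles=2))
```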
- -maxHp a
- filters out a pair of reads if either read has a homopolymer tract >(a) nucleotides in length.
- This feature was inspired by mRNA transcript assembly jobs, for which reaching the poly-A tail at the end of the message could
spell disaster without it.
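
The -maxHp rule can be sketched in Python as follows. The function names are illustrative, and case-insensitive comparison of bases is an assumption. A read pair is kept only if neither read contains a run of identical nucleotides longer than (a).

```python
from itertools import groupby


def longest_homopolymer(seq):
    """Length of the longest run of one repeated nucleotide in `seq`."""
    return max((sum(1 for _ in run) for _, run in groupby(seq.upper())), default=0)


def keep_pair(read_a, read_b, max_hp):
    """Return True if neither read has a homopolymer tract longer than max_hp."""
    return longest_homopolymer(read_a) <= max_hp and longest_homopolymer(read_b) <= max_hp


poly_a_read = "ACGT" + "A" * 12 + "CG"   # contains a 12-nt poly-A tract
normal_read = "ACGTACGTACGT"
print(keep_pair(poly_a_read, normal_read, max_hp=10))
print(keep_pair(poly_a_read, normal_read, max_hp=12))
```

Because the comparison is strictly "greater than (a)", a tract of exactly (a) nucleotides does not trigger the filter.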
- COMPUTATIONAL EFFICIENCY:
- -a x
- (x)num threads to use
- Many aspects of PRICE are threaded for multi-core CPUs. Thread use will rise and fall as data from parallel operations
is re-synchronized in the main thread. Threading will generally be at its most efficient during mapping (dropping while files are
being read) and during contig-edge assembly. Thread use will generally be the most uneven during meta-assembly (or at the beginning
of contig-edge assembly if there are a small number of very large contig-edge assembly jobs).
- -mtpf a
- (a)max threads per file
- This variable allows reading from a single file to be threaded. It is not expected to improve performance when reading
files stored on hard disk drives but may enhance performance on solid-state drives (not tested).
- USER INTERFACE:
- -log a
- determines the type of output and can make the output verbose (lots of time stamp tags)
- (a) = c: concise stdout (default)
- (a) = n: no stdout
- (a) = v: verbose stdout
- -logf a
- (a)the name of an output file to which verbose log info will be written (doesn't change stdout format)
- Recommendation: use option c (concise/default) output for stdout for viewing and, if desired, use -logf to create
a supplemental verbose record of your run. While you may want to mine the verbose file for details
about the run, it will generally not be a good interface for keeping track of the job status.
- -h, or --help
- user interface info.
- NOTE: no job will be run if this flag is used, regardless of which other flags are used.