PRICE was designed to address the challenge of assembling viral genomes that
comprise a small minority of the reads within ultra-deep, short-read, shotgun metagenomic
datasets. PRICE has already enabled the discovery of several novel virus genomes from such complex
datasets, and it is also being applied to the de novo assembly of large individual genomes.
||Complete File Set
||Bug fix for PriceSeqFilter: segfault when executed with single-read input corrected;
Minor documentation update.
||A new executable, PriceSeqFilter, for filtering input data prior to executing an assembly run using
the same criteria available through the PriceTI assembler command line;
Bug fix: PriceTI interface previously terminated if negative numbers were provided as arguments, erroneously
interpreting them as invalid flags. This has been corrected.
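The negative-number fix above comes down to how a command-line parser tells a flag from a value. The following is a minimal Python sketch of that distinction, not PRICE's actual C++ implementation; the helper name `looks_like_flag` and the number pattern are assumptions made for illustration.

```python
import re

# Pattern for a plain decimal number with optional sign and exponent.
# This is an illustrative assumption, not PRICE's own parsing rule.
_NUMBER = re.compile(r"[-+]?(\d+\.?\d*|\.\d+)([eE][-+]?\d+)?")


def looks_like_flag(token: str) -> bool:
    """Return True if the token should be treated as a flag.

    A token counts as a flag only if it starts with '-', has at least one
    more character, and does not parse as a number. This keeps values such
    as '-5' or '-0.25' from being rejected as invalid flags.
    """
    if len(token) < 2 or not token.startswith("-"):
        return False
    return _NUMBER.fullmatch(token) is None
```

With this check, `-target` is classified as a flag while `-5` is passed through as a numeric argument.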
||Updated documentation, including file format specifications;
New sample job recommended command;
Stdout log no longer reports the addition of zero contigs;
Reduced memory use during mapping;
Special instructions for installation on Mac OS X (see README.txt or "Installation" in documentation);
Improved error messages for incorrect number of arguments used for a flag;
Improved error messages for the specification of input files that don't exist;
Improved error messages for inappropriate command-line input when numbers (or integer numbers) are required;
Improved error messages for invalid score characters in _sequence.txt (Illumina) input files;
Tolerates an expanded set of whitespace characters in input files;
Optimized memory use and efficiency during the 1st mapping step;
Requirement of >half (as opposed to >=half) of linking reads to connect two contigs for them to be combined into a single assembly job;
Prints version number at each call.
||Bug fix: several bugs that contributed to inverted-strand misassemblies,
including the replacement of legitimate linkages between adjacent contigs with
reverse-orientation linkages in the AssemblyJobCreator class, and the flipping of some
sequences prior to their collapse with redundant contigs in the AssemblyJobSubset class;
Bug fix: small memory leak when using the -badf flag;
Bug fix: the paired-ends of reads filtered by -badf/-repmask should not be blocked
from mapping to contigs, but were in the previous version;
New feature: ability to mask paired-end reads in which one read contains a di-nucleotide
repeat stretch using -maxDi (similar to the homopolymer filter -maxHp);
New feature: quality filter flags -rqf and -rnf, to remove paired-end reads for which
at least one of the reads has an unacceptably high number of low-quality or uncalled (N) nucleotides;
New feature: filters for initial contigs to remove sequences matching those in a file
(-icbf), low-quality sequences (-icqf/-icnf), or sequences with homopolymer/dinucleotide
stretches (-icmHp/-icmDi); these are the initial-contig equivalents of the read filters
-badf, -rqf/-rnf, and -maxHp/-maxDi, respectively;
New feature: run logs now include information about the number of initial contigs and
reads that are removed or retained by filters, as well as explicitly the number of
initial contigs gathered from files at each cycle;
Explicit blocking of the same contig being included in a single contig-edge assembly job
in both orientations (a source of palindromic misassemblies);
Fastq nucleotides with less than 50% probability of being correct are automatically converted
Potentially accelerated processing of fastq quality scores.
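To make the 50% threshold above concrete, the short Python sketch below works out which quality scores fall under it. It assumes standard Phred scaling, where the error probability is 10^(-Q/10); whether PRICE applies exactly this scaling to every input format is an assumption, and the function names are illustrative only.

```python
def prob_correct(phred_q: float) -> float:
    """Probability that a base call is correct, assuming Phred scaling."""
    return 1.0 - 10.0 ** (-phred_q / 10.0)


def below_half(phred_q: float) -> bool:
    """True if the base is more likely wrong than right."""
    return prob_correct(phred_q) < 0.5


# Scan a small range of integer scores for those under the threshold.
low_scores = [q for q in range(0, 11) if below_half(q)]
print(low_scores)  # [0, 1, 2, 3]
```

Under this assumption, only scores 0 through 3 correspond to less than a 50% chance of the call being correct; a score of 4 already gives about 60%.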
||Bug fix: memory leak in the verbose log writer class;
Bug fix: improved thread safety during read mapping that should reduce the frequency
of already-infrequent core-dump crashes.
||Bug fix: corrected an illegal read error during gapped alignment scoring/collapsing;
this error would sometimes occur without raising exceptions, but would also sometimes
result in an exception declaring that "conditions after a gap block are not legit".
||Bug fix: corrected error during parsing of the -repMask command args;
Acceleration of contig targeting (substantial acceleration when using -target and
in the AssemblyJobGraph class);
New design feature: verification that at least one of the best-hit matches
for a pair of paired-end reads is to a contig edge window (previously assumed
but not verified);
Reciprocal matches of a read to both strands of a contig are filtered at
an earlier (less time-consuming) step;
Acceleration of assembly by the AssemblyJobSubset class.
||Speed improvements to the various AssemblyJob interface-implementing classes.
||Speed improvements in the ScoredSeqCollectionBwt class that reduce the time for seeding alignments to long sequences.
||Speed improvement for dynamic programming alignment of long sequences.
||Bug fix: corrected an illegal memory write when using -target;
New feature: more versatile and better-specified -trim commands (including -trimB and -trimI for basal/continuous
and initial trimming of contigs, respectively);
Additional threading to increase the efficiency of the assembly job creation steps (in between read-mapping and job-running);
New feature: control of the match/mismatch/initiate gap/extend gap scores for dynamic programming alignments using the
-r, -q, -G, and -E flags (akin to those flags for NCBI BLAST);
For developers: removal of dependencies on other PRICE classes for use of the Assembler programmatic interface.
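As an illustration of what the four alignment scores control, the Python sketch below computes a global alignment score with affine gap costs (the Gotoh recurrence). It is not PRICE's code: it assumes the BLAST-style convention that a gap of length k costs G + k*E, and the default values in the signature are arbitrary examples rather than PRICE or BLAST defaults.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, reward=1, penalty=-2, gap_open=5, gap_extend=2):
    """Best global alignment score of a and b with affine gap costs.

    reward/penalty play the role of -r/-q (match/mismatch scores);
    gap_open/gap_extend play the role of -G/-E (positive costs).
    """
    n, m = len(a), len(b)
    # M: last column aligns a[i-1] with b[j-1]
    # X: last column aligns a[i-1] with a gap
    # Y: last column aligns a gap with b[j-1]
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + i * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + j * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = reward if a[i - 1] == b[j - 1] else penalty
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Opening a new gap pays gap_open plus one extension;
            # continuing an existing gap pays only the extension.
            X[i][j] = max(M[i - 1][j] - gap_open - gap_extend,
                          Y[i - 1][j] - gap_open - gap_extend,
                          X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - gap_open - gap_extend,
                          X[i][j - 1] - gap_open - gap_extend,
                          Y[i][j - 1] - gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])
```

With these example values, aligning "AAAA" to "AA" scores 2 matches minus one gap of length 2 (cost 5 + 2*2), which is -7; raising the gap-open cost makes the alignment favor mismatches over gaps, which is the kind of trade-off these flags expose.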
||Bug fix: -trim command was broken (non-functional) in previous version(s), function now restored;
Bug fix: corrected an illegal memory write when using -spf/-spfp;
New feature: aborts if paired-end files both point to the same file (-fp/-fpp/-mp/-mpp);
New feature: aborts if an output file cannot be written;
New feature: length filter (-lenf) can be applied variably through a run;
Updated documentation, including preliminary developer documentation.
||New feature: false paired-end reads. Use single-direction reads (like 454 or IonTorrent data) as if they
were paired-ends using the -spf or -spfp input flags (see description using --help for more info).
||New feature: repeat detection based on significantly high levels of coverage using the -repmask flag;
Changed default value for -link flag from 5 to 2 (see -link description using --help for more info);
Added support for .fna, .ffn, and .frn as valid fasta file appends.
||Bug fix: Segfaults occurring due to incorrect interpretation of nucleotide scores when reading .fastq/_sequence.txt files.
||Efficiency improvements for the extraction of read information from files,
as well as gapped alignments and assembly of redundant sequences.
||Bug fix: another cause of infrequent segfault during the second read mapping step only due
to a now-corrected race condition when threaded;
Substantial acceleration of ungapped alignment, most notable during the read-mapping steps.
||Bug fix: infrequent segfault during the read mapping steps due to a now-corrected race condition when threaded;
More even balancing of computational load between threads during meta-assembly and in later assembly cycles;
Re-design and simplification of ScoredSeqCollectionBwt class (for the purpose of load balancing);
Some optimization of read-mapping implementation;
Optimization of the AssemblyJobSubset class (more optimally prevents the exploration of alignments that will ultimately
be of insufficient quality).
||Added a second meta-assembly step that occurs post-contig filtering (allows scaled parameters to be adjusted
to reflect the size of the filtered contig set, allowing more dissimilar but nonetheless likely redundant
contigs to be collapsed);
Bug fix: corrected a problem with filtering out N-containing substring seeds for searches to a BWT dataset
(also speed-optimized that operation);
Implemented a new method for obtaining full-sequence alignments from a collection of sequences and applied
it to -target mode (improves both efficiency and completeness of results);
Additional speed optimizations for seeding alignments with substring matches to a BWT dataset.
||Bug fix: -nco previously threw an exception when called, now it is functional;
Multiple output files are now guaranteed to return equal-length sequences in the same order as one another;
Corrected an error in the execution of read mapping that allowed some sub-optimal alignments to persist
when only best-scoring matches were being sought;
Increased the efficiency of several aspects of PRICE, especially methods for getting information about
ScoredSeq objects or Alignments, or copying sequences.
||Bug fix: -reset flag function restored (was broken such that it had no effect);
Support for the input of read files with pairs of reads facing away from one another (typical
of mate-pair libraries) using the new -mp, -mpp, -ms, and -msp flags;
New args for paired-end and mate-paired read files allow them to be used cyclically;
Accelerated retrieval of data from disc during the second mapping and assembly steps of each cycle;
Speed increase for the first mapping step;
Reduced verbosity of the verbose log file: statistics are no longer printed for individual assembly jobs with only one sequence;
Efficiency improvements for gathering sequence information from the ScoredSeqFlip class.
||New -icfNt and -picfNt flags allow initial contigs to be introduced without assembled
contigs being targeted to them in -target mode;
Support for single-file paired-end input (paired ends found as alternating file entries);
Modified behavior from the -badf flag: reads that map to the sequences in the bad file are prevented
from mapping to existing contigs but can still be included in the assembly;
Small efficiency improvements for sequence mapping/alignment.
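The single-file paired-end layout mentioned above (mates stored as alternating entries) can be sketched as follows. This is an illustrative Python helper, not PRICE's reader; the name `pair_alternating` is hypothetical, and it assumes the entries have already been parsed into a sequence of records.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def pair_alternating(entries: Iterable[T]) -> Iterator[Tuple[T, T]]:
    """Group consecutive entries into (read, mate) pairs.

    Entries 1 and 2 form the first pair, entries 3 and 4 the second,
    and so on. An odd number of entries is treated as an error because
    the final read would have no mate.
    """
    it = iter(entries)
    for first in it:
        try:
            second = next(it)
        except StopIteration:
            raise ValueError("odd number of entries: last read has no mate") from None
        yield first, second
```

For example, `list(pair_alternating(["r1/1", "r1/2", "r2/1", "r2/2"]))` yields the two pairs `("r1/1", "r1/2")` and `("r2/1", "r2/2")`.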
||Multiple output files can be written in parallel;
Filtering of reads with matches to sequences in a provided file (-badf command-line arg);
Corrected a bug in AssemblyJobGraph class that caused infrequent segmentation faults;
Corrected a bug in read mapping that previously allowed sub-par mappings to be retained;
Some simplification to the source code module structure.
||Large reduction to the runtime memory footprint, including correction of memory leaks in
the AssemblyJobGraph class and just-in-time loading of sequence/score information from files;
Improvements to thread safety for the reading of sequence information from files into RAM;
Implementation of a literal interpretation of fastq (and _sequence.txt) quality scores into
the internal representation of sequence support scores;
Small updates to the user manual and --help message.
||Correction to targeting during linked contig edge assembly jobs: targeting alignments now allow gaps;
Additional threading of de Bruijn graph assembly, contig edge assembly jobs, and reading paired-end files;
New -reset option for rejuvenation of dead contigs;
Performance optimizations for gapped alignments;
Stronger support for file format variants;
Both reads from each edge-mapping pair are now re-mapped to the entire contig set;
User specification allowed for transient use of paired-read files across only specified extend cycles;
Default -dbms value changed from 5 to 3;
Changed location of internal storage of some default parameters.
||Minor change to avoid spurious palindromic contigs;
Gapped alignments used in -target mode;
Initial user-level documentation provided.