Challenges of data management and analysis from 2nd generation sequencing platforms
Due to the advent of 2nd generation sequencing technologies,
exhaustive multi-genome comparisons of hundreds or thousands
of human genomes will soon become a reality. This will
have profound benefits for personalized disease diagnostics
and therapeutics.
However, the informatics challenges presented by these
technologies are already creating sequence mapping,
assembly and analysis bottlenecks. Because reads generated
from these low-cost, ultra-high-throughput sequencing
solutions are typically 25 bp to 200 bp long, with error rates
as high as 5%, genome assembly and analysis become considerably
more complex.
Synamatix has addressed many of these issues by using
SynaBASE™, which is a scalable, high-throughput
database solution that leverages sequence complexity
and exhaustive word-based searching to yield optimum
results. SynaBASE exhaustively identifies all k-mers
within biological sequences, storing the data as k-mers
structured on the basis of their inter-relationships.
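
The idea of exhaustively enumerating k-mers can be illustrated with a minimal sketch. It is not SynaBASE's actual data structure, which is not described here; it assumes a plain in-memory dictionary, and the names `build_kmer_index` and `map_read_exact` are hypothetical. Every k-mer of a reference is recorded with its start positions, and a read is then located by looking up its first k-mer and verifying the full match.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in `reference` to the 0-based positions where it starts."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return index


def map_read_exact(reference, index, read, k):
    """Return positions where `read` matches `reference` exactly.

    The read's first k-mer is used as a seed; each seed hit is then
    verified against the full read. (Illustrative only: no mismatches
    or gaps are tolerated.)
    """
    if len(read) < k:
        return []
    seed = read[:k]
    return [pos for pos in index.get(seed, [])
            if reference[pos:pos + len(read)] == read]


reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, 4)
print(index["ACGT"])                                  # [0, 4]
print(map_read_exact(reference, index, "ACGTT", 4))   # [4]
```

A dictionary of this kind grows with the number of distinct k-mers, so it only illustrates the lookup principle and says nothing about how a genome-scale index would be stored.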
By using a search application built on top of SynaBASE,
1.68 million 120mer reads were mapped back to the human
genome in 5 hours.
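
For scale, the figures above can be converted into an average mapping rate. The short calculation below uses only the two numbers quoted (1.68 million reads, 5 hours); the function name `mapping_rate` is a hypothetical helper.

```python
def mapping_rate(reads, hours):
    """Average number of reads mapped per second."""
    return reads / (hours * 3600)


rate = mapping_rate(1_680_000, 5)
print(round(rate, 1))  # 93.3
```

This is an average over the whole run and assumes the 5 hours cover mapping only.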
In a similar experiment with 25mer reads, a non-heuristic
search strategy employing a scoring matrix for sequence
quality was used. Mapping was achieved at an average
rate of over 1,000 reads/sec back to a SynaBASE of the
human genome. The gains in sensitivity and performance
of several orders of magnitude that this approach achieves over
conventional tools such as MegaBLAST support the potential technology
fit between 2nd generation sequencers and SynaBASE.
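
One way a quality-aware, non-heuristic search could work is sketched below. The scoring matrix actually used is not given, so this is an assumption-laden illustration: base qualities are assumed to be Phred-scaled, each position contributes plus or minus the confidence in the base call, and every ungapped placement is scored exhaustively. The names `phred_to_error_prob`, `quality_weighted_score` and `best_placement` are hypothetical.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability the base call is wrong."""
    return 10 ** (-q / 10)


def quality_weighted_score(read, qualities, ref_segment):
    """Score an ungapped alignment, weighting each position by base-call confidence.

    A match adds the confidence (1 - error probability); a mismatch
    subtracts it, so mismatches at low-quality bases cost little.
    """
    score = 0.0
    for base, q, ref_base in zip(read, qualities, ref_segment):
        confidence = 1.0 - phred_to_error_prob(q)
        score += confidence if base == ref_base else -confidence
    return score


def best_placement(reference, read, qualities):
    """Score every ungapped placement of `read` (no heuristic pruning).

    Returns (position, score) of the best placement, or None if the
    read is longer than the reference.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        segment = reference[pos:pos + len(read)]
        score = quality_weighted_score(read, qualities, segment)
        if best is None or score > best[1]:
            best = (pos, score)
    return best


# The fourth base of the read is a low-quality call (Q2) that disagrees
# with the reference; the placement at position 4 still scores best.
pos, score = best_placement("ACGTACGTTAGC", "ACGAT", [30, 30, 30, 2, 30])
print(pos)  # 4
```

Scanning every position is exhaustive but slow; it is shown only to make the scoring idea concrete, not to suggest how the reported throughput was reached.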