An NGS C++ Generic Programming Library

The libseq software is a C++11 and C++14 programming library with facilities designed for Next Generation Sequencing (NGS) analysis. It is currently under development and should not be considered production-ready.

The library relies heavily on templates, using static polymorphism and class traits to improve runtime performance. It provides classes for the following NGS concepts:

  • DNA sequence container traits (e.g., strings)
  • Generic DNA sequences with variadic properties
  • Sequence specialization for FASTQ and FASTA properties
  • Reverse-Forward matching sequences
  • Kmer list containers (serial and concurrent)
  • Kmer generators
  • De Bruijn graph generator between diverse sequence classes
  • Specialization for de Bruijn graph (e.g., kmers)
  • Parsers for FASTA and FASTQ
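
To illustrate how container traits and static polymorphism can combine in a k-mer generator, here is a minimal sketch. The names sequence_traits and make_kmers are hypothetical and chosen only for this example; they are not taken from the libseq API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical traits class (an assumption, not libseq's API): it tells
// generic code how to measure and slice a given sequence container.
template <typename Container>
struct sequence_traits;

// Specialization for std::string as the DNA sequence container.
template <>
struct sequence_traits<std::string> {
    static std::size_t length(const std::string& s) { return s.size(); }
    static std::string slice(const std::string& s, std::size_t pos, std::size_t n) {
        return s.substr(pos, n);
    }
};

// Generic k-mer generator. The traits are resolved at compile time,
// so no virtual dispatch is involved (static polymorphism).
template <typename Container, typename Traits = sequence_traits<Container>>
std::vector<Container> make_kmers(const Container& seq, std::size_t k) {
    assert(k > 0);
    std::vector<Container> kmers;
    const std::size_t n = Traits::length(seq);
    if (n < k) return kmers;  // sequence too short: no k-mers
    kmers.reserve(n - k + 1);
    for (std::size_t i = 0; i + k <= n; ++i) {
        kmers.push_back(Traits::slice(seq, i, k));
    }
    return kmers;
}
```

Under these assumptions, make_kmers(std::string("ACGTA"), 3) yields the three k-mers ACG, CGT and GTA. Supporting another container would only require a further sequence_traits specialization.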

The library is fully parametrized via templates. For instance, you can create a graph that matches any given sequence list against another, with custom node and edge weights. As an example, given the list of all reads and the list of all kmers, you can create a graph whose nodes are the reads and the kmers, and whose weighted edges link each kmer to the reads that contain it, with the weight indicating the position of the kmer in the read.
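
The read-to-kmer graph described above can be sketched with standard containers. This is only an illustration under simplifying assumptions: the names Edge, ReadKmerGraph and link_reads_to_kmers are hypothetical, reads are plain std::string values, and the graph is stored as an adjacency map rather than through the library's own templates.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical edge type: links a k-mer to a read, weighted by the
// position at which the k-mer starts in that read.
struct Edge {
    std::size_t read_index;
    std::size_t position;
};

// Adjacency map: each k-mer node points to the reads that contain it.
using ReadKmerGraph = std::map<std::string, std::vector<Edge>>;

inline ReadKmerGraph link_reads_to_kmers(const std::vector<std::string>& reads,
                                         std::size_t k) {
    assert(k > 0);
    ReadKmerGraph graph;
    for (std::size_t r = 0; r < reads.size(); ++r) {
        const std::string& read = reads[r];
        // Slide a window of length k over the read.
        for (std::size_t pos = 0; pos + k <= read.size(); ++pos) {
            graph[read.substr(pos, k)].push_back(Edge{r, pos});
        }
    }
    return graph;
}
```

For the reads ACGT and CGTA with k = 3, the k-mer CGT is linked to both reads: at position 1 in the first and at position 0 in the second.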

We are also implementing cache-oblivious data structures along with out-of-core algorithms to exploit modern architectures and to process data with limited amounts of RAM.

Note: This software library incorporates the SeqDB library by Mark Howison, published in “High-throughput compression of FASTQ data with SeqDB”, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(1): 213–218, 2013. See the BitBucket page for SeqDB for more information.

Download

The source code is available under the BSD license. Requirements:

An A* NGS error correction algorithm

This work proposes an error correction method based on the de Bruijn graph that can run on gigabyte-sized data sets using ordinary desktop computers. The implementation makes extensive use of hashing techniques, and implements an A* algorithm for optimal error correction, minimizing the distance, measured with the Needleman-Wunsch score, between an erroneous read and its candidate replacement. Our approach outperforms other popular methods in terms of both random access memory usage and computing time.
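
As a point of reference for the scoring step only, the following sketch computes a Needleman-Wunsch global alignment score between a read and a candidate replacement. It does not reproduce the A* search or the hashing described above. The function name and the scoring scheme (match +1, mismatch -1, gap -1) are assumptions made for illustration; the parameters used by the actual implementation are not stated here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Needleman-Wunsch global alignment score with a linear gap penalty.
// The default scores are illustrative assumptions, not project values.
inline int needleman_wunsch_score(const std::string& a, const std::string& b,
                                  int match = 1, int mismatch = -1, int gap = -1) {
    // Sanity check on the assumed scoring scheme.
    assert(gap <= 0 && mismatch <= match);

    // Two rows of the dynamic programming table are sufficient.
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) {
        prev[j] = static_cast<int>(j) * gap;  // align a prefix of b to gaps
    }
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = static_cast<int>(i) * gap;  // align a prefix of a to gaps
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            const int up = prev[j] + gap;        // gap in b
            const int left = curr[j - 1] + gap;  // gap in a
            curr[j] = std::max({diag, up, left});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

With these scores, a higher value means a closer match, so a search that minimizes distance would favor the candidate with the highest score: an identical 4-base candidate scores 4, while one with a single substitution scores 2.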

Sequence

To compile and contribute to this project, you need a C++11 compiler, the Boost library, and Intel's Threading Building Blocks (TBB) library.

The source code is available under the BSD license.

High-throughput Error Correction by Oligomers

This work presents a novel error correction algorithm based on k-mer strings with their associated overlap graph, along with an open-source, multi-threaded implementation. The algorithm, named HErCoOl (High-throughput Error Correction by Oligomers), needs minimal tuning: only an overall error rate and, optionally, information about the genome sizes. HErCoOl was compared against other state-of-the-art methods using empirical NGS data obtained with Roche 454 technology, with the benchmarks focused on mixtures of related species. Results show that HErCoOl improves significantly over the current methods, and that its parallelisation scales well with the size of the input for NGS technologies producing long sequence reads, such as Roche 454 or Ion Torrent. HErCoOl provides fast and efficient error correction of NGS data, especially for mixed samples. Its platform-independent, open-source, multi-threaded implementation makes it flexible to use and to integrate into any NGS data analysis software.
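
To make the notion of a k-mer overlap graph concrete, the sketch below links each k-mer to the k-mer that follows it in a read, so that connected k-mers overlap by k - 1 bases. This is one common construction and is given as an assumption; it is not claimed to be the exact graph definition used by HErCoOl, and the names OverlapGraph and build_overlap_graph are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Each k-mer node maps to the set of k-mers observed right after it.
using OverlapGraph = std::map<std::string, std::set<std::string>>;

inline OverlapGraph build_overlap_graph(const std::vector<std::string>& reads,
                                        std::size_t k) {
    assert(k >= 2);  // an overlap of k - 1 bases needs k of at least 2
    OverlapGraph graph;
    for (const std::string& read : reads) {
        for (std::size_t pos = 0; pos + k <= read.size(); ++pos) {
            // operator[] creates the node even if it has no successor.
            std::set<std::string>& successors = graph[read.substr(pos, k)];
            // The next k-mer in the read, if any, shares k - 1 bases.
            if (pos + k + 1 <= read.size()) {
                successors.insert(read.substr(pos + 1, k));
            }
        }
    }
    return graph;
}
```

For the single read ACGTA with k = 3, this produces the chain ACG, CGT, GTA, where the last k-mer has no successor.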

HErCoOl

This is a past project and is no longer developed. The reason is performance: as NGS technologies progress, ever-increasing performance is required.

The source code is available under the GPLv2 license.