An NGS C++ Generic Programming Library

The libseq software is a C++ programming library with facilities designed for Next Generation Sequencing (NGS) analysis. It is currently under development and should not be considered production-ready. The library relies heavily on templates: it uses static polymorphism and class traits so that calls are resolved at compile time, avoiding the runtime cost of virtual dispatch. It provides classes for the following NGS concepts:

  • DNA sequence container traits (e.g., strings)
  • Generic DNA sequences with variadic properties
  • Sequence specialization for FASTQ and FASTA properties
  • Reverse-Forward matching sequences
  • Kmer list containers (serial and concurrent)
  • Kmer generators
  • De Bruijn graph generator between diverse sequence classes
  • Specialization for de Bruijn graph (e.g., kmers)
  • Parsers for FASTA and FASTQ
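
To make the traits-based, statically polymorphic style concrete, here is a minimal sketch of a sequence container trait, a generic sequence class, and a kmer generator. All names in it (sequence_traits, basic_sequence, base_at, generate_kmers) are hypothetical and chosen for illustration only; they are not libseq's actual API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical traits class (not libseq's API): describes how to read
// bases out of a given storage type. Only the primary template is declared;
// each supported container provides a specialization.
template <typename Container>
struct sequence_traits;

// Specialization for sequences stored in a std::string.
template <>
struct sequence_traits<std::string> {
    static std::size_t length(const std::string& s) { return s.size(); }
    static char base_at(const std::string& s, std::size_t i) { return s[i]; }
};

// Hypothetical generic sequence. The storage type and its traits are
// template parameters, so every call below is bound at compile time
// (static polymorphism) rather than through a virtual table.
template <typename Container, typename Traits = sequence_traits<Container> >
class basic_sequence {
public:
    explicit basic_sequence(Container data) : data_(std::move(data)) {}
    std::size_t length() const { return Traits::length(data_); }
    char base_at(std::size_t i) const { return Traits::base_at(data_, i); }
private:
    Container data_;
};

// Hypothetical kmer generator: returns every length-k window of the
// sequence, in order of starting position. Works with any type that
// exposes length() and base_at().
template <typename Sequence>
std::vector<std::string> generate_kmers(const Sequence& seq, std::size_t k) {
    assert(k > 0 && "k must be positive");
    std::vector<std::string> kmers;
    const std::size_t n = seq.length();
    if (n < k) return kmers;  // sequence too short: no kmers
    kmers.reserve(n - k + 1);
    for (std::size_t start = 0; start + k <= n; ++start) {
        std::string kmer;
        kmer.reserve(k);
        for (std::size_t j = 0; j < k; ++j) kmer.push_back(seq.base_at(start + j));
        kmers.push_back(kmer);
    }
    return kmers;
}
```

With these definitions, generate_kmers(basic_sequence<std::string>(std::string("ACGTA")), 3) yields the three kmers ACG, CGT, and GTA. Supporting another storage type in this sketch would only require a new sequence_traits specialization; the generator itself would not change.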

The library is fully parameterized via templates. For instance, you can create a graph that matches any given sequence list against another, with custom node and edge weights. As an example, given the list of all reads and the list of all kmers, you can create a graph whose nodes are the reads and the kmers, and whose weighted edges give the position of each kmer within a read. We are also implementing cache-oblivious data structures and out-of-core algorithms to exploit modern architectures and to process data with limited amounts of RAM.
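
The read-to-kmer example can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: weighted_edge, position_weight, and match_sequences are hypothetical names, sequences are plain std::string values, and matching is done with a naive substring search.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical edge type (not libseq's API): links the read at index
// `source` to the kmer at index `target`, carrying a user-chosen weight.
template <typename Weight>
struct weighted_edge {
    std::size_t source;  // index into the list of reads
    std::size_t target;  // index into the list of kmers
    Weight weight;
};

// Hypothetical weight policy: the edge weight is the match position itself.
struct position_weight {
    std::size_t operator()(std::size_t pos) const { return pos; }
};

// Matches every kmer against every read and emits one edge per occurrence
// (overlapping occurrences included). The weight type and the function
// that computes it are template parameters, so callers can plug in
// their own edge weights.
template <typename Weight, typename WeightFn>
std::vector<weighted_edge<Weight> > match_sequences(
        const std::vector<std::string>& reads,
        const std::vector<std::string>& kmers,
        WeightFn weight_of) {
    std::vector<weighted_edge<Weight> > edges;
    for (std::size_t r = 0; r < reads.size(); ++r) {
        for (std::size_t q = 0; q < kmers.size(); ++q) {
            if (kmers[q].empty()) continue;  // an empty kmer matches nothing useful
            std::size_t pos = reads[r].find(kmers[q]);
            while (pos != std::string::npos) {
                weighted_edge<Weight> e;
                e.source = r;
                e.target = q;
                e.weight = static_cast<Weight>(weight_of(pos));
                edges.push_back(e);
                pos = reads[r].find(kmers[q], pos + 1);  // continue after this hit
            }
        }
    }
    return edges;
}
```

Calling match_sequences<std::size_t>(reads, kmers, position_weight()) returns the edge list of a bipartite read-kmer graph. The naive search costs time proportional to the product of the list sizes and the read lengths; it is used here only to keep the sketch short.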

Note: This software library incorporates the SeqDB library by Mark Howison, published in “High-throughput compression of FASTQ data with SeqDB”, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(1): 213–218, 2013. See the SeqDB page on Bitbucket for more information.