A Best-match Correction Algorithm

Next-generation sequencing (NGS) technologies are marking the foundations for a new paradigm in genomics and transcriptomics. Nowadays is possible to sequence any microbial organism or meta-genomic sample within hours, and to obtain a whole human genome in less than a month. The sequencing prices are decreasing dramatically, opening to actual personalised medicine. NGS technologies however are error- prone, and correcting errors is a challenge due to multiple factors, including the data sizes (gigabyte scale) and the machine- specific, non-at-random, characteristics of errors and error distributions. Several approaches have been proposed, but yet the problem is a challenge, especially when analysing mixtures of (closely related) species, e.g., highly variable viruses infecting in a host as a swarm, like hepatitis C or human immunodeficiency virus.

This work presents a novel error correction algorithm based on k-mer strings with their associated overlap graph, along with an open-source, multi-threaded, implementation. The algorithm, named HErCoOl (High-throughput Error Correction by Oligomers), needs minimal tuning, only an overall error rate and –optionally– information about the genome sizes. HErCoOl was compared against other state-of-the art methods, using empirical NGS data obtained with Roche 454 technology, focusing the benchmarks on mixtures of related species. Results show that HErCoOl improves significantly over the current methods, and the parallelisation scales well with the size of input NGS genome producing long sequence reads, such as Roche 454 or Ion Torrent. HErCoOl provides a fast and efficient error correction of NGS data, especially for mixed samples. Its platform-independent, open-source, multi-threaded implementation assures flexibility for being employed and integrated in any NGS data analysis software.


The source code is available under the GPLv2 license, and runs under the Java Virtual Machine.