
Determination of complete genome sequences is paramount for understanding an organism’s biology and function. Despite the exponential accumulation of sequencing data, the vast majority of genomes deposited in GenBank represent only ‘draft’ or incomplete versions. Even the genomes of relatively small organisms such as bacteria (up to 10 million bases) are usually submitted as draft assemblies 1, and those present as complete genomes are often derived from previously sequenced genomes of very closely related strains of the same species. One example is the large number of complete sequences of different versions of the genome of Campylobacter jejuni strain NCTC 11168 2. Derivation of a complete genome sequence of a distantly related species remains a challenging task.
The problem stems from the lack of a universal and reliable tool that would allow automatic contig assembly, particularly for sequences containing long repeats. Tools for validating the quality of contigs are limited, and undetected errors may also contribute to the problem. The draft genome sequences are produced by automatic de novo assembly of short reads generated by different whole genome sequencing technologies, such as e.g. Automatic de novo assembly of a large number of short reads is relatively cheap and often provides good genome coverage, but usually results in a number of disconnected contigs due to the presence of repetitive sequences. Provided that low-quality reads are removed, the high sequence coverage obtained from short reads can be useful in identifying nucleotide-level variants such as indels or SNPs. Long-read technology, on the other hand, tends to produce low-quality reads with low sequence coverage, although the reads can span repetitive elements. Optical mapping is an additional tool that allows contigs to be arranged on the chromosome and the sizes and positions of gaps to be estimated. A combination of several sequencing and assembly methods, known as a hybrid approach, can potentially lead to a high-quality genome assembly 3. However, for both separate contigs and the final genome sequence, validation of assembly quality remains a problem. Misassembly can affect various downstream applications, including comparative genomic analysis. If misassembled genomes are used as references for the assembly of other (similar) genomes, the errors could be carried over to the sequences being assembled. Although some bioinformatics approaches for identifying assembly errors have been suggested 4, they are normally based on comparison with previously available data. In many cases, however, an independent validation tool is required.
