Genome Completeness Evaluation

BUSCO is a computational tool used to assess the completeness and quality of genome assemblies(23). ASFV is not included in the BUSCO’s database (OrthoDB). Therefore, we developed a completeness evaluation system to generate a BUSCO-like notation.

completeness.py

Description

We generated consensus sequences from genotype I and II isolates. The CDS predicted from input ASFV genome using Prodigal were compared to consensus sequences using BLASTN (e-value 1e-5). This comparison yielded a BUSCO like genome completeness evaluation.

Input

An ASFV genome file.

Arguments

Argument name Required Description
-c No consensus sequences to be use: I or II (default='II')

Example

completeness.py single_fasta/OM966717.1.fasta -c II > OM966717.1_completeness.tsv

Output

A 5-column tsv table, the 5 fields are file name, genome size, gene number by prodigal, completeness evaluation with MGF genes and completeness evaluation without MGF genes. The following table is a partial example of the output.

file_name size prodigal_gene_num with_MGF without_MGF duplicate_genes fragmented_genes missing_genes
OM966717.1.fasta 189125 168 C:99.32%[D:0.0%],F:0.68%,M:0.0%,n:148 C:99.13%[D:0.0%],F:0.87%,M:0.0%,n:115 C122R

The completeness evaluation is a BUSCO like notation, with C:complete [D:duplicated], F:fragmented, M:missing, n:number of genes used. If a consensus CDS term can find a mapping from the predicted gene sequences with an identity larger than 90%, together with a unique mapping length longer than 90%, it can be considered "complete." A consensus CDS term that cannot find a valid mapping (with an identity greater than 30% and a mapping length greater than 30%) in the predicted genes is considered "missing". The other consensus CDS with partial hit is termed "fragmented". The "duplicated" term means that there is more than one 'complete' hit in the predicted gene sequences.