Mapping Assembly
The mapping_assembly.py script runs the entire process from reads to an assembled genome. Three helper scripts are also provided: polish_asfv.py polishes homopolymers, download_asfv_genome.py downloads the latest ASFV genomes from NCBI, and find_near_ref.py selects the ASFV genome with the most mapped reads as the reference genome.
mapping_assembly.py
Description
The task requires ONT reads and a reference. The -r parameter accepts either a single reference sequence (a fasta file), which is used as is, or a directory containing multiple reference sequences. When a directory is given, the genome with the highest read mapping coverage among all genomes in that directory is selected as the reference genome for mapping assembly. A directory "single_fasta" containing 406 genomes is provided on GitHub; you can also download the latest ASFV genomes by running download_asfv_genome.py.

The ONT reads are aligned to the reference with minimap2 using the -a option (SAM output). A consensus sequence is then generated with samtools and polished with medaka.
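The three stages described above (minimap2 alignment, samtools consensus, medaka polishing) can be sketched as a list of commands. This is a minimal illustration, not the actual contents of mapping_assembly.py: the function name `build_pipeline_commands`, the intermediate file names, the `-x map-ont` preset, and the use of `samtools consensus` and `medaka_consensus` are all assumptions. The function only builds the commands and does not run them.

```python
from pathlib import Path


def build_pipeline_commands(reads, ref, out_dir, threads=4, medaka_model=None):
    """Build (but do not run) commands for a minimap2 -> samtools -> medaka run.

    Hypothetical helper: file names and tool options are illustrative
    assumptions, not taken from mapping_assembly.py.
    """
    out = Path(out_dir)
    sam = out / "aln.sam"
    bam = out / "aln.sorted.bam"
    draft = out / "draft.fasta"

    commands = [
        # Align ONT reads to the reference; -a makes minimap2 emit SAM.
        ["minimap2", "-a", "-x", "map-ont", "-t", str(threads),
         "-o", str(sam), str(ref), str(reads)],
        # Coordinate-sort and index the alignments.
        ["samtools", "sort", "-@", str(threads), "-o", str(bam), str(sam)],
        ["samtools", "index", str(bam)],
        # Call a draft consensus sequence from the sorted alignments.
        ["samtools", "consensus", "-f", "fasta", "-o", str(draft), str(bam)],
    ]

    # Polish the draft with medaka; pass a model only if one was given.
    medaka = ["medaka_consensus", "-i", str(reads), "-d", str(draft),
              "-o", str(out / "medaka"), "-t", str(threads)]
    if medaka_model:
        medaka += ["-m", medaka_model]
    commands.append(medaka)
    return commands
```

Each command could then be passed to `subprocess.run(cmd, check=True)` in order, so that a failure in one stage stops the run before the next stage starts.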
Arguments
| Argument name | Required | Description |
|---|---|---|
| -p, --processes | No | number of processes (default = 4) |
| -i, --input | Yes | input ASFV reads (fasta or fastq file) |
| -r, --ref | Yes | a fasta file of an ASFV genome, or a folder containing multiple ASFV genomes (if it is a folder, the program automatically selects the nearest one as the reference) |
| -o, --output | Yes | output file name for the assembled ASFV genome |
| --medaka | No | medaka model |
Example
mapping_assembly.py -p 4 -r single_fasta -i test_data.fasta -o asfv_genome.fasta --medaka r941_min_high_g303
Output
A fasta file of the assembled ASFV genome.
polish_asfv.py
Description
Polish homopolymers with homopolish. We recommend using blastn against NCBI to find the closest ASFV genome that was not sequenced with ONT, and using that genome as the reference.
Arguments
| Argument name | Required | Description |
|---|---|---|
| -i, --input | Yes | fasta file of input ASFV genome |
| -r, --ref | Yes | fasta file of reference ASFV genome |
| -m, --model | Yes | model used in homopolish |
Example
polish_asfv.py -i single_fasta/MN194591.1.fasta -r single_fasta/OR180113.1.fasta -m R9.4.pkl
Output
Polished ASFV genome in the current working directory.
download_asfv_genome.py
Description
Download all ASFV sequences with a length between 160,000 and 250,000 bp from NCBI.
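A length-filtered search like this can be expressed as an NCBI E-utilities query. The sketch below only builds the search URL and makes no network request. The search term, the `retmax` value, and the function name `build_asfv_search_url` are assumptions for illustration; the query that download_asfv_genome.py actually sends may differ.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_asfv_search_url(min_len=160_000, max_len=250_000, retmax=1000):
    """Build an esearch URL for ASFV nucleotide records within a length range.

    Hypothetical helper: the search term is an assumed equivalent of the
    filter described above, not the script's own query.
    """
    # [SLEN] restricts results by sequence length; min:max is a range.
    term = (f'"African swine fever virus"[Organism] '
            f"AND {min_len}:{max_len}[SLEN]")
    params = {"db": "nuccore", "term": term, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"
```

The returned URL yields a list of record IDs, which would then need a separate fetch step to retrieve the fasta sequences.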
Example
download_asfv_genome.py
Output
A directory "single_fasta" containing all ASFV genomes from NCBI in the current working directory.
find_near_ref.py
Description
Select the nearest reference from multiple candidates, that is, the genome with the highest read mapping coverage among all available ASFV genomes. This step is already integrated into mapping_assembly.py; use this script only if you want to run it separately.
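The selection rule can be sketched as follows, assuming the reads were mapped against all candidate genomes at once and summarised with `samtools idxstats`, whose tab-separated columns are reference name, sequence length, mapped reads, and unmapped reads. Ranking by mapped read count and the function name `pick_nearest_reference` are assumptions; find_near_ref.py may compute coverage differently.

```python
def pick_nearest_reference(idxstats_text):
    """Return (name, mapped_reads) for the reference with the most mapped reads.

    Hypothetical helper that parses `samtools idxstats` output. On a tie,
    the reference listed first wins.
    """
    best_name, best_mapped = None, -1
    for line in idxstats_text.splitlines():
        fields = line.split("\t")
        # Skip malformed lines and the "*" row that counts unplaced reads.
        if len(fields) < 4 or fields[0] == "*":
            continue
        mapped = int(fields[2])
        if mapped > best_mapped:
            best_name, best_mapped = fields[0], mapped
    return best_name, best_mapped


# Made-up lengths and counts, for illustration only.
example = (
    "MN194591.1\t190000\t1200\t0\n"
    "OR180113.1\t189000\t3400\t0\n"
    "*\t0\t0\t57\n"
)
print(pick_nearest_reference(example))
```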
Arguments
| Argument name | Required | Description |
|---|---|---|
| -f, --file | Yes | input ASFV reads (fasta or fastq file) |
| -r, --reference | Yes | directory containing the reference files |
| -c, --core | No | number of processes (default = 32) |
| -q, --qscore | No | the qscore used to filter the bam file (default = 0) |
| -m, --mapper | No | the mapper used to map, can be minimap2 or bwa (default="minimap2") |
Example
find_near_ref.py -r ./single_fasta -f read.fastq > near.fasta
Output
A fasta file of the genome with the highest read mapping coverage among all available ASFV genomes.