Tblastn alignments needs to be refined as they do not properly align around exon intron boundaries, group hsps wide apart, also might get aligned to the repeat regions. The approach would be …
- Align against a repeat masked database.
- Repeat filtering with dust
- Run tblastn and specify
- Further post process the alignments and convert to gff3.
Mask genome sequence
An example with pallidum supercontig
1 2 3
Make blast database with masking information
It’s a mapping protocol, so the search is expected to be on the insensitive side. The parameters are going to be selected to make the search as stringent as possible without loosing the top alignments.
1 2 3 4 5
- e-value : 1e-10, we have tried this before and seems a good starting point.
- threshold : 999, Neighborhood threshold value, a high value makes the seeding phase extremely stringent.
- db_soft_mask : Uses soft masking in database generated before by dustmasker.
- max_target_seqs : Maximum no of hits per discoideum protein that will be reported in the output. In this case it will only have top five alignments.
Refine TBLASTN alignments and convert it to GFF3
Run the blast2gbrowsegff3 perl script from Modware Loader distribution.
1 2 3 4 5 6
Because of the nature of the blast statistics, blast aligns wherever it can. As a result, it has multiple alignments in the same region, sometimes the aligned regions are out of order and for translated blast it also extends through stop codons. So, bunch of filtering options are used during the above conversion …
max_intron_length: 3000 bp. It is the maximum distance(in bp) allowed between two consecutive hsps(High scoring pair) of a tblastn hit. Since, blast statistics do not account for splice junctions or exon intron models, it tends to group hsps that are far apart in the genome. This option split hsps(into separate hit groups) that are more than max_intron_length apart. The maximum intron length (
3000 bp) is calculated from the multi exonic gene models(curated and predicted) of various dictyostelids(discoideum,purpureum,fasciculatum and pallidum).
merge_contained: This activates merging of overlapping hsps(
high scoring pair). In this case, the hsps whose start and end falls within another are merged with the larger one. The alignment attributes of larger ones are kept.
remove_stop_codon: In the first step the hsps are regrouped based on their strand and frame of their alignment which means a single hit could be split into maximum possible of
6 groups. In the next step, the genome(
subject) sequence in hsp alignment is checked for extension through stop codon. If found, that hit along with the hsps are discarded.
Overall, the filtering options helps to provide a best possible contiguous alignment with one or more segments(
hsp). The options are tuned for hsps aligned in the exon regions where two or more grouped hsps will flank the intervening introns. Few examples,
Alignment with pallidum
Alignment with purpureum