dictyBase Developers

Solving one problem at a time

Exporting D.discoideum Annotations in GFF3 Format

Prerequisites

Install Modware-Loader from GitHub. With a recent cpanm (>1.6), it can also be installed directly from the GitHub repository:

$_> cpanm git://github.com/dictyBase/Modware-Loader.git@release

Then follow the basic introduction about using Modware-Loader.

Export genome annotations

As mentioned before, annotations are exported in pieces: first gene models (canonical, non-coding, curated and predicted), then alignments and promoters. Exports are done with the modware-export command of Modware-Loader.

‘modware-export subcommands’ (modware-export-commands.txt)
$_> modware-export help

Available commands:

                       commands: list the application's commands
                           help: display a command's help screen

            chado2alignmentgff3:  Export alignment from chado database in GFF3 format
            chado2canonicalgff3:  Export canonical gene models from chado database in GFF3 format
       chado2dictycanonicalgff3: Export GFF3 with canonical gene models of Dictyostelium discoideum
         chado2dictycuratedgff3: Export GFF3 with curated gene models of Dictyostelium discoideum
    chado2dictynoncanonicalgff3: Export GFF3 with sequencing center gene models of Dictyostelium discoideum
  chado2dictynoncanonicalv2gff3:  Export GFF3 with repredicted gene models of Dictyostelium discoideum
       chado2dictynoncodinggff3: Export GFF3 with non coding gene models of Dictyostelium discoideum
                    chado2fasta: Export fasta sequence file from chado database

Common config file

A basic YAML config file is used for all the exports.

‘gff3.yaml’
dsn: dbi:Oracle:database
user: username
password: password
feature_name: 1

All exports are done with the --feature_name option, which writes the name of the reference feature in GFF3 column 1 (the seq_id field).
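Because the config file holds the database password, it may be worth creating it with restrictive permissions. A minimal sketch, reusing the placeholder values from gff3.yaml shown earlier (the DSN, user and password are stand-ins, not real credentials):

```shell
# Write the shared config; the values are placeholders, replace them with your own.
umask 077                      # files created from here on are owner-only
cat > gff3.yaml <<'EOF'
dsn: dbi:Oracle:database
user: username
password: password
feature_name: 1
EOF
chmod 600 gff3.yaml            # explicit, in case the file already existed
```

The quoted heredoc delimiter ('EOF') keeps the shell from expanding anything inside the file body, which matters if a real password contains characters such as $.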

Canonical

‘subcommand to export canonical gff3’ (modware-export-canonical.txt)
$_> modware-export chado2dictycanonicalgff3 [-?chlopu] [long options...]
  -h -? --usage --help   Prints this usage information.
  --reference_id         reference feature name/ID/accession number. In
                         this case,  only all of its associated
                         features will be dumped
  -o --output            Name of the output file,  if absent writes to
                         STDOUT
  --write_sequence       To write the fasta sequence(s) of reference
                         feature(s),  default is true
  --attr --attribute     Additional database attribute
  --pass -p --password   database password
  --feature_name         Output feature name instead of sequence id in
                         the seq_id field,  default is off.
  --dsn                  database DSN
  --schema_debug         Output SQL statements that are executed,
                         default to false
  -u --user              database user
  --log_level            Log level of the logger,  default is error
  -l --logfile           Name of logfile,  default goes to STDERR
  -c --configfile        yaml config file to specify all command line
                         options

It exports complete coding gene models along with contig and reference features. Each gene model is either curated or predicted (sequencing center); where both exist, the curated model takes precedence.

$_> modware-export  chado2dictycanonicalgff3 -c gff3.yaml  -o canonical.gff3

To dump only a particular chromosome (reference feature), pass either its name or ID in the --reference_id option.

$_> modware-export  chado2dictycanonicalgff3 -c gff3.yaml  --reference_id 6 -o canonical6.gff3
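To produce one file per chromosome, the same command can be wrapped in a loop. The sketch below is only an illustration: the list in REFERENCES is an assumption and should be replaced with the reference names or IDs present in your chado database, and export_reference is a hypothetical helper. It prints the commands by default; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Hypothetical list of reference features; adjust to your database.
REFERENCES="${REFERENCES:-1 2 3 4 5 6}"
DRY_RUN="${DRY_RUN:-1}"

export_reference() {
    ref="$1"
    # Build the command as positional parameters so quoting is preserved.
    set -- modware-export chado2dictycanonicalgff3 -c gff3.yaml \
        --reference_id "$ref" -o "canonical${ref}.gff3"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"      # show the command without touching the database
    else
        "$@"
    fi
}

for ref in $REFERENCES; do
    export_reference "$ref"
done
```

Defaulting to a dry run is a deliberate choice here: a full export can take a while, so it helps to review the generated commands first.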

Non-coding

$_> modware-export  chado2dictynoncodinggff3 -c gff3.yaml -o data/noncoding.gff3

Non-canonical

There will be three exports: one for curated models, one for sequencing center models and one for the reprediction pipeline.

‘subcommand to export sequencing center gene models’ (modware-export-noncanonical.txt)
$_> modware-export chado2dictynoncanonicalgff3 [-?chlopu] [long options...]
  -h -? --usage --help        Prints this usage information.
  --reference_id              reference feature name/ID/accession
                              number. In this case,  only all of its
                              associated features will be dumped
  -o --output                 Name of the output file,  if absent
                              writes to STDOUT
  --attr --attribute          Additional database attribute
  --feature_name              Output feature name instead of sequence
                              id in the seq_id field,  default is off.
  --pass -p --password        database password
  --write_sequence_region     write sequence region header in GFF3
                              output,  default if off
  --source                    Name of database/piece of
                              software/algorithm that generated the
                              gene models. By default it is *Sequencing
                              Center*.
  --dsn                       database DSN
  --schema_debug              Output SQL statements that are executed,
                              default to false
  -u --user                   database user
  --log_level                 Log level of the logger,  default is error
  -l --logfile                Name of logfile,  default goes to STDERR
  -c --configfile             yaml config file to specify all command
                             line options

Although we use different subcommands, their options are identical.

$_> modware-export chado2dictynoncanonicalgff3  -c gff3.yaml -o data/noncanonical_seq_center.gff3
$_> modware-export chado2dictynoncanonicalv2gff3  -c gff3.yaml \
             -o data/noncanonical_norepred.gff3
$_> modware-export chado2dictycuratedgff3 -c config/dicty_gff3.yaml -o data/curated.gff3
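Once the files are written, a quick tally of feature types is a cheap sanity check. The helper below is a sketch and not part of Modware-Loader; gff3_type_counts is a hypothetical name. It relies only on the GFF3 layout: nine tab-separated columns with the feature type in column 3, directive and comment lines starting with #, and an optional ##FASTA section at the end.

```shell
# Count features per type (GFF3 column 3) in one exported file.
gff3_type_counts() {
    awk -F '\t' '
        /^##FASTA/ { exit }          # sequence section follows; stop reading
        /^#/ || NF < 9 { next }      # skip directives, comments, blank lines
        { count[$3]++ }
        END { for (t in count) printf "%s\t%d\n", t, count[t] }
    ' "$1" | sort
}
```

It can then be run against any of the exports, for example `gff3_type_counts data/curated.gff3`. An empty result usually means the export wrote no feature lines.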

Alignment

ESTs and a couple of alignments from GenBank datasets.

‘subcommand to export alignment’ (modware-export-alignment.txt)
$_> modware-export chado2alignmentgff3 [-?chlopu] [long options...]
  --write_sequence_region     write sequence region header in GFF3
                              output,  default if off
  -h -? --usage --help        Prints this usage information.
  --feature_name              Output feature name instead of sequence
                              id in the seq_id field,  default is off.
  --rt --reference_type       The SO type of reference feature,
                              default is supercontig
  -o --output                 Name of the output file,  if absent
                              writes to STDOUT
  --feature_type              SO type of alignment features to be
                              exported
  --attr --attribute          Additional database attribute
  --match_type                SO type of alignment feature that will be
                              exported in GFF3, *_match* is appended to
                              the feature_type by default.
  --pass -p --password        database password
  --force_name                Adds the value of GFF3 *ID* attribute to
                              *Name* attribute(if absent),  off by
                              default
  --add_description           If present,  add the GFF3 *Note*
                              attribute. It looks for a feature
                              property with *description* cvterm. Off
                              by default
  --dsn                       database DSN
  --property                  List of additional cvterms which will be
                              used to extract additional feature
                              properties
  --schema_debug              Output SQL statements that are executed,
                              default to false
  -u --user                   database user
  --log_level                 Log level of the logger,  default is error
  -l --logfile                Name of logfile,  default goes to STDERR
  --species                   Name of species
  --genus                     Name of the genus
  -c --configfile             yaml config file to specify all command
                             line options
  --org --organism            Common name of the organism whose genomic
                              features will be exported

$_> modware-export chado2alignmentgff3 -c gff3.yaml --org dicty \
      --reference_type chromosome  --feature_type EST -o data/EST.gff3

$_> modware-export chado2alignmentgff3 -c gff3.yaml --org dicty \
     --reference_type chromosome  --feature_type cDNA_clone -o data/cDNA_clone.gff3

$_> modware-export chado2alignmentgff3 -c gff3.yaml --org dicty -o data/genomic_fragment.gff3 \
     --reference_type chromosome  --feature_type databank_entry \
     --match_type nucleotide_match
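The help text says that, unless --match_type is given, _match is appended to the feature type. Assuming that description is accurate, the EST and cDNA_clone exports would emit EST_match and cDNA_clone_match features, while the databank_entry export overrides the default with nucleotide_match. The shell function below merely restates that rule for illustration; resolve_match_type is a hypothetical helper and not how Modware-Loader implements it.

```shell
# Illustrates the documented default: <feature_type>_match unless overridden.
resolve_match_type() {
    feature_type="$1"
    match_type="${2:-}"                 # optional explicit --match_type value
    if [ -n "$match_type" ]; then
        echo "$match_type"
    else
        echo "${feature_type}_match"
    fi
}

resolve_match_type EST                               # EST_match
resolve_match_type cDNA_clone                        # cDNA_clone_match
resolve_match_type databank_entry nucleotide_match   # nucleotide_match
```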

Misc

And finally, some promoter features.

$_> modware-export chado2alignmentgff3 -c gff3.yaml --org dicty -o data/promoter.gff3 \
    --reference_type chromosome --feature_type promoter --match_type promoter \
    --force_name 1 --add_description 1 --property 'details_url'