dictyBase Developers

Solving one problem at a time

Design Pattern of Chado Database Loaders

Preamble

These are more or less my thoughts on how to structure a bulk loader for chado. The majority of the ideas come from writing the obo2chado loader. obo2chado still lacks the design I am aiming for now, but most of the upcoming loaders will follow it. The plan is to eventually refactor the obo loader into that mold.

Design

Scope and expectation

  • The input would be some sort of flat file.
  • The data will be loaded into a relational backend. The design could definitely be generalized to other backends, but that is not considered at the moment.

Reading data

There should be an object-oriented interface for reading data from flat files. That object is expected to be passed along to other classes. For example, for the obo2chado loader I have used the ONTO-Perl module.
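As a rough illustration of that idea, here is a minimal sketch of a reader object that hides the file format and hands out records one at a time. It is written in Python only to keep the example short; it is not how ONTO-Perl works. The `Term` and `TermReader` names, the two-column tab-delimited layout and the sample identifiers are all assumptions made up for this example.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, TextIO


@dataclass(frozen=True)
class Term:
    """One record from the flat file (hypothetical two-column layout)."""
    identifier: str
    name: str


class TermReader:
    """Wraps an open file handle and yields Term objects one at a time."""

    def __init__(self, handle: TextIO) -> None:
        self._handle = handle

    def __iter__(self) -> Iterator[Term]:
        for row in csv.reader(self._handle, delimiter="\t"):
            # Skip blank lines and comment lines starting with '#'.
            if not row or row[0].startswith("#"):
                continue
            yield Term(identifier=row[0], name=row[1])


# Other classes receive the reader object, never the raw file.
sample = io.StringIO("# id\tname\nX:1\talpha\n\nX:2\tbeta\n")
terms = list(TermReader(sample))
```

Because the rest of the loader only sees `Term` objects, swapping the flat-file format should only require a different reader class.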

Database interaction

Probably one of the most important parts. It’s better to have an ORM that supports multiple backends and bulk loading. For Perl code, I have used BCS (Bio::Chado::Schema), a DBIx::Class layer for the chado database.

Loading in the staging area

This part is supposed to get data from the flat file into temp tables of the RDBMS. To start with, let’s assign a class that will manage everything related to this task. First, let’s figure out what kind of information the class needs in order to perform those tasks. For the sake of understanding, we name it StagingManager.

Staging manager

Attributes

  • schema: Should have an instance of Bio::Chado::Schema, an ORM/database object for all database-centric tasks. If it is an ORM, it should provide access to some bulk-mode operation, or at least to low-level objects for bulk support.
  • chunk_threshold: I kind of threw this in; it will be used for bulk loading in chunks.
  • sqlmanager: Should have an instance of SQL::Library, a class that manages SQL statements. I found this easier to manage than inlining the statements in the class itself: with a growing number of SQL statements, navigating through the code could become cumbersome. It also provides better separation between code and non-code content. For obo2chado, I have used the SQL::Library module, which seems to be a very good choice.
  • logger: An instance of a logger.
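To make the four attributes concrete, here is a minimal sketch of how such a class could hang together. It is a Python illustration, not the Perl code of any existing loader. An in-memory SQLite connection stands in for Bio::Chado::Schema, a plain dictionary of named statements stands in for SQL::Library, and the `temp_term` table, the method names and the sample rows are all hypothetical.

```python
import logging
import sqlite3
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str]


class StagingManager:
    """Buffers rows and flushes them to a temp table in chunks."""

    def __init__(self, schema: sqlite3.Connection, sqlmanager: Dict[str, str],
                 logger: logging.Logger, chunk_threshold: int = 1000) -> None:
        self.schema = schema              # stand-in for the ORM/database object
        self.sqlmanager = sqlmanager      # named SQL kept out of the class body
        self.logger = logger
        self.chunk_threshold = chunk_threshold
        self._cache: List[Row] = []

    def create_tables(self) -> None:
        self.schema.execute(self.sqlmanager["create_temp_term"])

    def add(self, row: Row) -> None:
        # Cache the row; load the whole chunk once the threshold is reached.
        self._cache.append(row)
        if len(self._cache) >= self.chunk_threshold:
            self.flush()

    def flush(self) -> None:
        if not self._cache:
            return
        self.schema.executemany(self.sqlmanager["insert_temp_term"], self._cache)
        self.schema.commit()
        self.logger.debug("flushed %d rows to staging", len(self._cache))
        self._cache.clear()

    def load(self, rows: Iterable[Row]) -> None:
        for row in rows:
            self.add(row)
        self.flush()  # load whatever is left below the threshold


# Hypothetical statements; a real loader would keep these in a separate file.
SQL = {
    "create_temp_term": "CREATE TEMP TABLE temp_term (accession TEXT, name TEXT)",
    "insert_temp_term": "INSERT INTO temp_term (accession, name) VALUES (?, ?)",
}

conn = sqlite3.connect(":memory:")
manager = StagingManager(conn, SQL, logging.getLogger("staging"), chunk_threshold=2)
manager.create_tables()
manager.load([("X:1", "alpha"), ("X:2", "beta"), ("X:3", "gamma")])
staged = conn.execute("SELECT COUNT(*) FROM temp_term").fetchone()[0]
```

With a threshold of 2, the three sample rows go in as one chunk of two plus a final flush of one. Keeping the statements in `SQL` rather than inside the methods is the separation that the sqlmanager attribute is meant to provide.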

Exporting D.discoideum Annotations in GFF3 Format

Prerequisites

Install Modware-Loader from GitHub. With a recent cpanm (>1.6), it can also be installed directly from GitHub:

$_> cpanm git://github.com/dictyBase/Modware-Loader.git@release

Then follow the basic introduction about using Modware-Loader.

Export genome annotations

As mentioned before, annotations are exported in pieces: first gene models (canonical, non-coding, curated and predicted), then alignments and promoters. Exports are done by the modware-export subcommands of Modware-Loader.

‘modware-export subcommands’ (modware-export-commands.txt)
$_> modware-export help

Available commands:

                       commands: list the application's commands
                           help: display a command's help screen

            chado2alignmentgff3:  Export alignment from chado database in GFF3 format
            chado2canonicalgff3:  Export canonical gene models from chado database in GFF3 format
       chado2dictycanonicalgff3: Export GFF3 with canonical gene models of Dictyostelium discoideum
         chado2dictycuratedgff3: Export GFF3 with curated gene models of Dictyostelium discoideum
    chado2dictynoncanonicalgff3: Export GFF3 with sequencing center gene models of Dictyostelium discoideum
  chado2dictynoncanonicalv2gff3:  Export GFF3 with repredicted gene models of Dictyostelium discoideum
       chado2dictynoncodinggff3: Export GFF3 with non coding gene models of Dictyostelium discoideum
                    chado2fasta: Export fasta sequence file from chado database

Common config file

A basic YAML config file to be used for all the exports.

‘gff3.yaml’
dsn: dbi:Oracle:database
user: username
password: password
feature_name: 1

Taming the Dictybase GFF3

dictyBase GFF3 has developed a bunch of rough edges over the years and so does not play well with third-party tools. Here are the issues that we are aware of…

Known Issues

  • Genes with multiple gene models could easily be confused with splice isoforms. They are particularly hard to tell apart for genes with known isoforms. There is also no easy way to identify the primary gene model.