Read the post here to understand the ontology data model.
Overall design
We create a Modware-Loader application module which instantiates
a Loader class and then calls various loading-specific methods on it. The Loader
itself consumes a database-specific engine (a Moose role) to run backend-specific code as
needed.
The application module uses the
BioPortal API to download the latest version of an ontology in
OBO format, and it requires an API key to run.
```perl
package Modware::Load::Command::bioportalobo2chado;
use strict;
use namespace::autoclean;
use Moose;
use BioPortal::WebService;
use Modware::Loader::Ontology;
use OBO::Parser::OBOParser;
extends qw/Modware::Load::Chado/;

has 'apikey' => (
    is            => 'rw',
    isa           => 'Str',
    required      => 1,
    documentation => 'An API key for bioportal'
);

has 'ontology' => (
    is            => 'rw',
    isa           => 'Str',
    required      => 1,
    documentation => 'Name of the ontology for loading in Chado'
);

sub execute {
    my ($self) = @_;
    my $logger = $self->logger;

    # download from bioportal
    my $bioportal  = BioPortal::WebService->new( apikey => $self->apikey );
    my $downloader = $bioportal->download( $self->ontology );
    if ( !$downloader->is_obo_format ) {
        $logger->logcroak( $self->ontology, ' is not available in OBO format' );
    }

    # parse obo and give it to loader
    my $loader   = Modware::Loader::Ontology->new;
    my $ontology = OBO::Parser::OBOParser->new->work( $downloader->filename );
    $loader->set_ontology($ontology);
    $loader->set_schema( $self->schema );
}
```
Here is how the command line looks:
```
$_> perl -Ilib bin/modware-load bioportalobo2chado --usage
modware-load bioportalobo2chado [-?chilpu] [long options...]
    --apikey              An API key for bioportal
    --ontology            Name of the ontology for loading in Chado
    -i --input            Name of the input file, if absent reads from STDIN
    -h -? --usage --help  Prints this usage information.
    --attr --attribute    Additional database attribute
    --pass -p --password  database password
    --dsn                 database DSN
    --schema_debug        Output SQL statements that are executed, default to false
    -u --user             database user
    --log_level           Log level of the logger, default is error
    -c --configfile       yaml config file to specify all command line options
    -l --logfile          Name of logfile, default goes to STDERR
```
The Modware::Load::Chado base class provides the schema and logger options.
```perl
package Modware::Load::Chado;
use strict;

# Other modules:
use namespace::autoclean;
use Moose;
use YAML qw/LoadFile/;
extends qw/MooseX::App::Cmd::Command/;
with 'MooseX::ConfigFromFile';
with 'Modware::Role::Command::WithIO';
with 'Modware::Role::Command::WithBCS';
with 'Modware::Role::Command::WithLogger';

# Module implementation
has '+output'         => ( traits => [qw/NoGetopt/] );
has '+output_handler' => ( traits => [qw/NoGetopt/] );
has '+configfile'     => (
    cmd_aliases   => 'c',
    documentation => 'yaml config file to specify all command line options',
    traits        => [qw/Getopt/]
);

sub get_config_from_file {
    my ( $self, $file ) = @_;
    return LoadFile($file);
}

__PACKAGE__->meta->make_immutable;
1;    # Magic true value required at end of module
```
The loader class integrates database (storage engine) specific methods in the form of a
Moose role. The role is chosen dynamically based on the DSN value provided, and the
backend-specific role is consumed as soon as the schema attribute is set.
```perl
package Modware::Loader::Ontology;
use namespace::autoclean;
use Carp;
use Moose;
use Moose::Util qw/ensure_all_roles/;

has 'schema' => (
    is      => 'rw',
    isa     => 'Bio::Chado::Schema',
    writer  => 'set_schema',
    trigger => sub {
        my ( $self, $schema ) = @_;
        $self->_load_engine($schema);
        $self->_around_connection;
        $self->_check_cvprop_or_die;
    }
);

has 'connect_info' => (
    is     => 'rw',
    isa    => 'Modware::Storage::Connection',
    writer => 'set_connect_info'
);

sub _check_cvprop_or_die {
    my ($self) = @_;
    my $row = $self->schema->resultset('Cv::Cv')
        ->find( { name => 'cv_property' } );
    croak "cv_property ontology is not loaded\n" if !$row;
    $self->set_cvrow( 'cv_property', $row );
}

sub _around_connection {
    my ($self) = @_;
    my $connect_info      = $self->connect_info;
    my $extra_attr        = $connect_info->extra_attribute;
    my $create_statements = $self->create_temp_statements;
    my $drop_statements   = $self->drop_temp_statements;
    push @$create_statements, $extra_attr->{on_connect_do}
        if defined $extra_attr->{on_connect_do};
    push @$drop_statements, $extra_attr->{on_disconnect_do}
        if defined $extra_attr->{on_disconnect_do};
    $self->schema->connection(
        $connect_info->dsn,      $connect_info->user,
        $connect_info->password, $connect_info->attribute,
        {   on_connect_do    => $create_statements,
            on_disconnect_do => $drop_statements
        }
    );
}

sub _load_engine {
    my ( $self, $schema ) = @_;
    $self->meta->make_mutable;
    my $engine = 'Modware::Loader::Role::Ontology::With'
        . ucfirst lc( $schema->storage->sqlt_type );
    ensure_all_roles( $self, $engine );
    $self->meta->make_immutable;
    $self->transform_schema;
}
```
The loader also validates the presence of the cv_property ontology
(_check_cvprop_or_die); it must be preloaded before any other ontology can be loaded.
The _around_connection method resets the storage connection and adds statements for creating and dropping the staging tables.
Methods/attributes the backend role needs to implement
transform_schema
Meant for altering any column name/attributes through the schema object. For example,
for the Oracle backend the synonym column needs to be renamed and the column type for all
the chado prop tables needs to be altered.
create_temp_statements
It creates temporary staging tables for holding the data. It is expected to return an
ArrayRef where each element can be a SQL statement or a code reference. The create
statements should at least create the following temporary tables ….
temp_cvterm
```perl
package Modware::Loader::Role::Ontology::WithSqlite;

sub create_temp_statements {
    return [
        qq{
            CREATE TEMPORARY TABLE temp_cvterm (
                name                varchar(1024) NOT NULL,
                accession           varchar(1024) NOT NULL,
                is_obsolete         integer NOT NULL DEFAULT 0,
                is_relationshiptype integer NOT NULL DEFAULT 0,
                definition          text
            )
        }
    ];
}
```
drop_temp_statements
Similar to create_temp_statements, but only removes the staging tables.
```perl
package Modware::Loader::Role::Ontology::WithSqlite;

sub drop_temp_statements {
    return [ qq{ DROP TABLE temp_cvterm } ];
}
```
cache_threshold
Maximum number of entries that will be held in memory before they are persisted to the staging tables.
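The caching idea can be sketched as follows. This is a minimal, self-contained illustration (the class and method names here are hypothetical, not the actual Modware ones); the flush callback stands in for a bulk INSERT into a staging table such as temp_cvterm.

```perl
#!/usr/bin/env perl
# Hypothetical sketch of a cache_threshold-driven buffer: entries
# accumulate in memory and are handed to a flush callback in bulk
# once the threshold is reached.
use strict;
use warnings;

package TinyCache;

sub new {
    my ( $class, %arg ) = @_;
    return bless {
        threshold => $arg{threshold} // 5000,
        entries   => [],
        on_flush  => $arg{on_flush},    # e.g. a bulk insert into temp_cvterm
    }, $class;
}

sub add {
    my ( $self, $entry ) = @_;
    push @{ $self->{entries} }, $entry;
    $self->flush if @{ $self->{entries} } >= $self->{threshold};
}

sub flush {
    my ($self) = @_;
    return if !@{ $self->{entries} };
    $self->{on_flush}->( $self->{entries} );
    $self->{entries} = [];
}

package main;

my $flushed = 0;
my $cache   = TinyCache->new(
    threshold => 3,
    on_flush  => sub { $flushed += @{ $_[0] } },
);
$cache->add( { name => "term$_" } ) for 1 .. 7;
$cache->flush;            # persist whatever is left over
print "$flushed\n";       # prints 7
```

Batching like this keeps memory bounded while still letting the database do fewer, larger inserts.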
merge_dbxrefs
merge_cvterms
merge_comments
merge_relations
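The post does not spell out what the merge_* methods contain; conceptually they move rows from the staging tables into the real chado tables. Purely as a hypothetical sketch (not the actual Modware code, and simplified: the real cvterm table also needs cv_id and dbxref_id handling), merge_cvterms on the SQLite backend could be an INSERT … SELECT that skips terms already present:

```perl
# Hypothetical sketch: SQL a merge_cvterms implementation might emit.
# Column names follow the temp_cvterm definition shown earlier.
package My::Sketch::WithSqlite;
use strict;
use warnings;

sub merge_cvterms_sql {
    return qq{
        INSERT INTO cvterm ( name, definition, is_obsolete, is_relationshiptype )
        SELECT tmp.name, tmp.definition, tmp.is_obsolete, tmp.is_relationshiptype
          FROM temp_cvterm tmp
         WHERE NOT EXISTS (
               SELECT 1 FROM cvterm c WHERE c.name = tmp.name )
    };
}

1;
```

Doing the merge in SQL, inside the database, is what makes the staging-table approach pay off: the per-row existence checks never leave the server.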
DBIC result classes for staging tables
These result classes map to the temporary staging tables and get registered with the
main schema.
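A minimal sketch of such a result class, assuming DBIx::Class is available (the package name here is hypothetical, not the actual Modware one); the columns mirror the temp_cvterm definition shown earlier:

```perl
# Hypothetical DBIC result class for the temp_cvterm staging table.
package My::Sketch::Result::TempCvterm;
use strict;
use warnings;
use base 'DBIx::Class::Core';

__PACKAGE__->table('temp_cvterm');
__PACKAGE__->add_columns(
    name                => { data_type => 'varchar', size => 1024 },
    accession           => { data_type => 'varchar', size => 1024 },
    is_obsolete         => { data_type => 'integer', default_value => 0 },
    is_relationshiptype => { data_type => 'integer', default_value => 0 },
    definition          => { data_type => 'text', is_nullable => 1 },
);

package main;

# Registering the class at runtime makes
# $schema->resultset('TempCvterm') available alongside the chado tables:
# $schema->register_class( TempCvterm => 'My::Sketch::Result::TempCvterm' );
1;
```

Registering at runtime (rather than listing the class in the schema's load path) fits here because the staging tables only exist for the lifetime of the loader's connection.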
By default, the loader stores the ontology namespace, date, and version (if available).
It proceeds only if a newer version is available. However, the value of the
data-version field is non-uniform, which makes it very hard to parse and compare
reliably. So the creation date is used for comparison instead.
```perl
package Modware::Load::Command::bioportalobo2chado;

sub execute {
    my ($self) = @_;
    my $logger = $self->logger;

    # download from bioportal
    my $bioportal  = BioPortal::WebService->new( apikey => $self->apikey );
    my $downloader = $bioportal->download( $self->ontology );
    if ( !$downloader->is_obo_format ) {
        $logger->logcroak( $self->ontology, ' is not available in OBO format' );
    }

    # parse obo and give it to loader
    my $loader   = Modware::Loader::Ontology->new;
    my $ontology = OBO::Parser::OBOParser->new->work( $downloader->filename );
    $loader->set_ontology($ontology);
    $loader->set_schema( $self->schema );

    # check if it is a new version
    if ( $loader->is_ontology_in_db() ) {
        if ( !$loader->is_ontology_new_version() ) {
            $logger->logcroak(
                'This version of ontology already exist in database');
        }
    }
    $loader->store_metadata;
}
```
The date comparison is done with DateTime objects.
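The idea behind the date comparison can be sketched with the core Time::Piece module (the actual loader uses DateTime; the function names here are illustrative, not Modware's). OBO headers carry a date line in day:month:year hour:minute form, which is parsed and compared against the date stored in the database:

```perl
# Sketch of a creation-date version check, using core Time::Piece.
use strict;
use warnings;
use Time::Piece;

# OBO headers carry a date like "date: 12:08:2012 14:30"
sub parse_obo_date {
    my ($raw) = @_;
    return Time::Piece->strptime( $raw, '%d:%m:%Y %H:%M' );
}

# the ontology file is a new version if its date is later than the db's
sub is_new_version {
    my ( $file_date, $db_date ) = @_;
    return $file_date > $db_date ? 1 : 0;
}

my $in_db   = parse_obo_date('12:08:2012 14:30');
my $in_file = parse_obo_date('20:09:2012 09:00');
print is_new_version( $in_file, $in_db ), "\n";    # prints 1
```

Comparing parsed dates this way sidesteps the free-form data-version strings entirely.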
Here, some of the namespaces are likely to be shared and might already have been created
during other ontology loads, so this step runs in find_or_create mode.
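The find_or_create behaviour in a nutshell: look up a row by its key and create it only if absent. With DBIC the loader would call something like `$schema->resultset('Cv::Cv')->find_or_create({ name => $namespace })`; here is a pure-Perl sketch of the same idea (the hash and function names are illustrative only):

```perl
# Pure-Perl sketch of find_or_create semantics for shared cv namespaces.
use strict;
use warnings;

my %cv_table;     # stands in for the chado cv table, keyed by name
my $next_id = 1;

sub find_or_create_cv {
    my ($name) = @_;

    # return the existing row if present, otherwise create one
    return $cv_table{$name} //= { cv_id => $next_id++, name => $name };
}

# loading two ontologies that share a namespace yields a single row
my $first  = find_or_create_cv('sequence');
my $second = find_or_create_cv('sequence');
print $first->{cv_id} == $second->{cv_id} ? "shared\n" : "duplicated\n";    # prints "shared"
```

This is why repeated loads are safe: a namespace row is created at most once no matter how many ontologies reference it.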
Overall, the loader prepares perl data hashes suitable for insertion and loads them into
the various staging temp tables. The entire process is wrapped in the
prepare_data_for_loading method.
And finally, you call it from the main application and run the load inside a separate transaction:
```perl
package Modware::Load::Command::bioportalobo2chado;

sub execute {
    ....

    # transaction for loading in staging temp tables
    my $guard = $self->schema->txn_scope_guard;
    $loader->prepare_data_for_loading;
    $guard->commit;
    $logger->info( $loader->cvterms_in_staging, ' terms in temp table' );
}
```