Stock Data Export
Why?
The stock at dictyBase.org consists of strains and plasmids. Currently, the data resides in Oracle under a custom schema. The objectives behind this export (and the eventual import) are:
- to bring the data under a standard Chado schema.
- to clean/merge/format data:
  - each entry has multiple kinds of references (PubMed, internal reference, other references)
  - data exists in 2 tables (strain-plasmid & plasmid) that are neither linked nor in sync
  - abbreviations are not linked to their full forms (mutagenesis method)
  - improper spacing & line breaks (some Windows-style)
- to correct the data model for inventories, phenotypes, strain-feature links, etc.
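As an illustration of the spacing and line-break clean-up mentioned in the list above, here is a minimal Python sketch. It is not the code used by the export; `clean_field` is a hypothetical helper, and the assumption is that each value should end up on a single line so it can be written safely into a TAB-delimited file.

```python
import re


def clean_field(value: str) -> str:
    """Collapse stray whitespace in one field (hypothetical helper)."""
    # \s matches spaces, tabs, "\n" and "\r", so Windows-style "\r\n"
    # line breaks are folded into a single space along with everything else.
    return re.sub(r"\s+", " ", value).strip()


if __name__ == "__main__":
    # Made-up value with double spaces and a Windows line break.
    print(clean_field("  chemical  mutagenesis\r\nsecond line "))
```

Collapsing every whitespace run to one space is a deliberately blunt choice; a real clean-up might want to preserve paragraph breaks in long description fields.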
Data
Different kinds of data that are exported as a part of stock are represented in JSON below.
The data is exported in TAB delimited files. Structure of exported data looks like
Rationale & SQL
Inventory
Inventory is a property of the stock (strain/plasmid). In the legacy schema, the inventory information sits in a table of its own. With this export, the inventory will be controlled by ontologies (strain_inventory.obo, plasmid_inventory.obo & storage_condition.obo).
Phenotype
A phenotype is something that is observed. Each strain has a genotype which, when expressed in a particular environment, produces the phenotype. Thus, the data model for a phenotype involves the genotype, the environment and the PubMed reference. It optionally also involves the assay information. The following SQL retrieves the phenotype information:
In the above SQL, phen.name, env.name & assay.name are terms from the ontologies Dicty Phenotypes, Dicty Environment & Dictyostelium Assay, respectively. Read about ontology loading here.
Columns in the exported TAB delimited file are: DBS-ID, Phenotype term, Environment, Assay, PMID, Phenotype note
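To make the column layout concrete, the following Python sketch reads such a file into dictionaries keyed by column. It assumes the file has no header row and that optional columns (environment, assay, note) are left empty when absent; neither detail is stated above. The function name, the key names and the sample row are made up for illustration.

```python
import csv
import io

# Column order as listed above; the key names themselves are illustrative.
PHENOTYPE_COLUMNS = ["dbs_id", "phenotype", "environment", "assay", "pmid", "note"]


def read_phenotypes(handle):
    """Return one dict per non-empty row of a TAB-delimited phenotype file."""
    # QUOTE_NONE keeps any quote characters in the notes as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # Pad short rows so every dict has all six keys.
        fields = fields + [""] * (len(PHENOTYPE_COLUMNS) - len(fields))
        rows.append(dict(zip(PHENOTYPE_COLUMNS, fields)))
    return rows


if __name__ == "__main__":
    # Made-up row: no environment, assay or note.
    sample = io.StringIO("DBS0000001\taberrant aggregation\t\t\t12345678\t\n")
    for row in read_phenotypes(sample):
        print(row["dbs_id"], row["phenotype"], row["pmid"])
```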
Props, Publications & Characteristics
Props are additional pieces of information about the stock (in this case). For strains, we have props like ‘mutagenesis method’, ‘mutant type’ & ‘synonym’. For plasmids, the props are ‘keyword’, ‘depositor’ & ‘synonym’. The exported file has the columns DBS-ID, Prop, Value.
Stocks have associated publications, most of which are identified by PubMed IDs. However, some stocks have unresolvable internal references. With this export, these internal references are cleaned up and brought down to a single standard, PubMed. Redundant/duplicate entries were discarded during the export, and the data is exported as a TAB-delimited file with 2 columns: DBS_ID & PMID.
The characteristics are strain characteristics, which are maintained as an ontology. They are exported as a 2-column TAB-delimited file: DBS_ID & Characteristic Term.
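The removal of redundant publication entries can be pictured with a short Python sketch. This is only an illustration of the idea, not the exporter's actual logic; `unique_publications` is a hypothetical name, and the IDs in the example are invented.

```python
def unique_publications(pairs):
    """Drop duplicate (DBS_ID, PMID) pairs, keeping first-seen order."""
    seen = set()
    unique = []
    for dbs_id, pmid in pairs:
        # Strip stray whitespace so "12345678 " and "12345678" compare equal.
        key = (dbs_id.strip(), pmid.strip())
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


if __name__ == "__main__":
    # Made-up pairs; the second one duplicates the first.
    pairs = [
        ("DBS0000001", "12345678"),
        ("DBS0000001", "12345678 "),
        ("DBS0000002", "12345678"),
    ]
    for dbs_id, pmid in unique_publications(pairs):
        print(f"{dbs_id}\t{pmid}")
```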
Parental strain & Strain-Plasmid
Parental strains, as the name suggests, are the parents of the strain records. There are only a few parents for Dictyostelium discoideum. However, depending on when these parents were submitted to the Dicty Stock Center and by whom, they can have multiple records in the database. As a result, a strain can be linked to multiple entries of the same parent. To resolve this, generic strains will be added: all strains whose parents have multiple IDs will point to a single generic strain. Currently the data is exported in 2 columns: DBS_ID & DBS_ID of the parent.
Some strain-plasmid entries do not correspond to real plasmids in the Dicty Stock Center. When the plasmid entry exists, the 2nd column exported is the DBP_ID; otherwise it is the plasmid_name. This issue is resolved with the stock data import.
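The generic-strain idea can be sketched in Python as a simple lookup. The mapping from duplicate parent records to a generic strain is hypothetical here (the text does not say how it is built or stored), and all IDs are invented.

```python
def to_generic_parents(strain_parent_pairs, generic_for):
    """Point every strain at the generic record for its parent.

    generic_for maps a duplicate parent ID to its generic strain ID;
    parents that are not in the mapping are left unchanged.
    """
    return [
        (strain_id, generic_for.get(parent_id, parent_id))
        for strain_id, parent_id in strain_parent_pairs
    ]


if __name__ == "__main__":
    # Hypothetical: two records of the same parent collapse onto one generic ID.
    generic_for = {"DBS0000002": "DBS0000001", "DBS0000003": "DBS0000001"}
    pairs = [("DBS0000010", "DBS0000002"), ("DBS0000011", "DBS0000003")]
    for strain_id, parent_id in to_generic_parents(pairs, generic_for):
        print(f"{strain_id}\t{parent_id}")
```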
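Because the 2nd column of the strain-plasmid file can hold either a plasmid ID or a free-text plasmid name, a consumer has to tell the two apart. A minimal Python sketch follows; it assumes that plasmid IDs consist of the prefix `DBP` followed by digits, which is an assumption about the ID shape rather than something stated here. The function name and sample rows are made up.

```python
import re

# Assumed ID shape: "DBP" followed only by digits.
DBP_ID = re.compile(r"DBP\d+")


def split_strain_plasmid(rows):
    """Separate rows linked to a plasmid ID from rows holding only a name."""
    linked, unlinked = [], []
    for dbs_id, plasmid in rows:
        target = linked if DBP_ID.fullmatch(plasmid) else unlinked
        target.append((dbs_id, plasmid))
    return linked, unlinked


if __name__ == "__main__":
    # Made-up rows: one real plasmid ID, one bare plasmid name.
    rows = [("DBS0000001", "DBP0000001"), ("DBS0000002", "pExample")]
    linked, unlinked = split_strain_plasmid(rows)
    print("linked:", linked)
    print("name only:", unlinked)
```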
Plasmid sequence
The plasmid sequences are currently served from static files. These sequences have been cleaned by running them through Bio::SeqIO, and some were cleaned manually. The file names are the database IDs of the plasmids. The raw, pre-processed data can be found here. This is the input for the --seq_data_dir parameter of modware-dump dictyplasmid. The data is exported to the sequence subdirectory of the output directory. The file names are DBP-IDs and the extension is the data format (genbank/fasta).
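The naming convention for the exported sequence files can be expressed as a small Python helper. This is a sketch of the convention as described, not code from modware; `sequence_path` is a hypothetical name and the ID in the example is invented.

```python
from pathlib import Path

# The two data formats named above.
FORMATS = {"genbank", "fasta"}


def sequence_path(output_dir, dbp_id, fmt):
    """Build <output_dir>/sequence/<DBP-ID>.<format> for one plasmid."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return Path(output_dir) / "sequence" / f"{dbp_id}.{fmt}"


if __name__ == "__main__":
    print(sequence_path("output", "DBP0000001", "genbank").as_posix())
```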
Command
The data is exported using the modware-dump command. All the modules used by this command can be found under Modware::Dump or Modware::Role::Stock::Export. The command looks like this:
The options common to both commands are:
Running the commands