Stock at dictyBase.org consists of strains & plasmids. Currently, the data resides in Oracle under a custom schema. The objectives behind this export (and eventual import) are:
- to bring the data under a standard Chado schema.
- to clean/merge/format data. Known problems include:
  - each entry has multiple kinds of references (pubmed, internal reference, other references)
  - data exists in 2 tables and is neither linked nor in sync (strain-plasmid & plasmid)
  - abbreviations not linked to full forms (mutagenesis method)
  - improper spacing & linebreaks (some Windows-style)
- to correct the data model for inventories, phenotype, strain-feature links etc.
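The spacing and linebreak cleanup listed above could look roughly like the following sketch. This is not the exporter's actual code: `clean_text` is a hypothetical helper, and collapsing runs of spaces and tabs into a single space is an assumption about what "improper spacing" covers.

```python
import re


def clean_text(value: str) -> str:
    """Normalize line endings and spacing in a free-text field (hypothetical helper)."""
    # Convert Windows-style (\r\n) and bare \r line breaks to \n.
    value = value.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space (assumed rule).
    value = re.sub(r"[ \t]+", " ", value)
    # Trim each line and drop lines that end up empty.
    lines = [line.strip() for line in value.split("\n")]
    return "\n".join(line for line in lines if line)
```

A field that must fit on one line of a TAB delimited file would additionally need its remaining line breaks replaced, which this sketch leaves to the caller.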
Different kinds of data that are exported as a part of stock are represented in JSON below.
The data is exported in TAB delimited files. Structure of exported data looks like
Rationale & SQL
Inventory is a property of the stock (strain/plasmid). In the legacy schema, the inventory information sits in a table of its own. With this export, the inventory will be controlled by ontologies (strain_inventory.obo, plasmid_inventory.obo & storage_condition.obo).
A phenotype is something that is observed. Each strain has a genotype which, when expressed in a certain environment, shows the phenotype. Thus, the data model for phenotype involves the genotype, the environment and the PubMed reference. It optionally also involves the assay information. The following SQL retrieves the phenotype information:
From the above SQL, the environment and assay names (assay.name) are terms from the ontologies Dicty Environment & Dictyostelium Assay, respectively. Read about ontology loading here.
Columns in the exported TAB delimited file are: DBS-ID, Phenotype term, Environment, Assay, PMID, Phenotype note
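As an illustration of how the phenotype file might be consumed, here is a minimal sketch that reads the six columns listed above into named records. It assumes the file has no header row and that the columns appear in exactly the stated order; `PhenotypeRow` and `read_phenotypes` are hypothetical names, not part of the export tooling.

```python
import csv
from typing import Iterator, NamedTuple


class PhenotypeRow(NamedTuple):
    dbs_id: str
    phenotype_term: str
    environment: str
    assay: str
    pmid: str
    note: str


def read_phenotypes(handle) -> Iterator[PhenotypeRow]:
    """Yield one record per line of a TAB delimited phenotype export."""
    # QUOTE_NONE: treat quote characters in notes as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # Pad short rows so missing trailing columns become empty strings.
        fields = (fields + [""] * 6)[:6]
        yield PhenotypeRow(*fields)
```

Padding short rows reflects the text above, where the assay is optional; whether the real export always emits all six fields is not confirmed here.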
Props, Publications & Characteristics
Props are additional information for the stock (in this case). For strains, we have props like ‘mutagenesis method’, ‘mutant type’ & ‘synonym’. For plasmids, the props are ‘keyword’, ‘depositor’ & ‘synonym’. The exported file has three columns: DBS-ID, Prop, Value.
Stocks have associated publications. Most of the publications are PubMed IDs. However, stocks also have some unresolvable internal references. With this export, these internal references are cleaned up and brought down to a single standard, PubMed. While exporting publications, redundant/duplicate entries were thrown out, and the data is exported as a TAB delimited file with 2 columns: DBS_ID & PMID.
The characteristics are strain characteristics, which are maintained as an ontology. Strain characteristics are exported as a 2 column TAB delimited file: DBS_ID & Characteristic Term.
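Dropping duplicate publication entries can be sketched as below. This is only an illustration of the idea, assuming each row is already a (DBS_ID, PMID) pair; `unique_publications` is a hypothetical helper and the real export may deduplicate in SQL instead.

```python
def unique_publications(rows):
    """Return (DBS_ID, PMID) pairs with duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for dbs_id, pmid in rows:
        # Strip stray whitespace so "123 " and "123" count as the same PMID.
        key = (dbs_id.strip(), pmid.strip())
        if key in seen:
            continue
        seen.add(key)
        unique.append(key)
    return unique
```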
Parental strain & Strain-Plasmid
Parental strains, as the name suggests, are parents of the strain records. There are only a few parents for Dictyostelium discoideum. However, depending on when these parents were submitted to the Dicty Stock Center and by whom, they can have multiple records in the database. The issue is that a strain can be linked to multiple entries of the same parent. To resolve this, we will add generic strains: all strains whose parents have multiple IDs will point to only one generic strain. Currently, the data is exported in 2 columns: DBS_ID & DBS_ID of parent.
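The generic-strain idea can be sketched as a simple relinking step. The mapping from duplicate parent records to a generic strain is an assumption here: it would have to be curated by hand, and `relink_parents` and the IDs in the usage note are purely hypothetical.

```python
def relink_parents(strain_parent_pairs, generic_for):
    """Point every strain at one generic parent instead of duplicate parent records.

    strain_parent_pairs: iterable of (DBS_ID, parent DBS_ID) tuples.
    generic_for: dict mapping a duplicate parent ID to its generic strain ID
                 (hypothetical, manually curated).
    """
    relinked = set()
    for strain_id, parent_id in strain_parent_pairs:
        # Parents without a generic replacement are kept unchanged.
        relinked.add((strain_id, generic_for.get(parent_id, parent_id)))
    # A set removes links that collapse to the same generic parent.
    return sorted(relinked)
```

For example, if two made-up parent records `DBS0000010` and `DBS0000011` both map to one generic ID, a strain linked to both ends up with a single parent link.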
In the case of strain-plasmids, there are strain-plasmid entries that are not real plasmids in the Dicty Stock Center. When the plasmid entry exists, the 2nd column exported is the DBP_ID; otherwise it is the plasmid_name. This issue is resolved with the stock data import.
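A consumer of the strain-plasmid file has to tell the two kinds of second-column values apart. The sketch below does so with a regular expression; the ID shape (`DBP` followed by digits) is an assumption, as are the names `DBP_ID_PATTERN` and `classify_plasmid_ref`.

```python
import re

# Assumed shape of a plasmid ID: the prefix "DBP" followed only by digits.
DBP_ID_PATTERN = re.compile(r"^DBP\d+$")


def classify_plasmid_ref(value: str):
    """Return ("dbp_id", value) for a plasmid ID, else ("plasmid_name", value)."""
    value = value.strip()
    kind = "dbp_id" if DBP_ID_PATTERN.match(value) else "plasmid_name"
    return kind, value
```

A plasmid whose name happens to match the ID pattern would be misclassified, so a lookup against the known plasmid IDs would be more robust where that list is available.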
The plasmid sequences are currently served from static files. These sequences have been cleaned, mostly by running them through Bio::SeqIO and in some cases manually. The file names are the database IDs of the plasmids. The raw, pre-processed data can be found here. This is the input for the --seq_data_dir parameter of modware-dump dictyplasmid. The data is exported to the sequence subdirectory of the output directory. The file names are DBP-IDs and the extension is the data format (genbank/fasta).
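Given the naming convention just described, the exported sequences can be indexed by plasmid ID with a few lines of code. This is a sketch only: `index_sequences` is a hypothetical helper, and it assumes the extensions are literally `.genbank` and `.fasta`.

```python
from pathlib import Path


def index_sequences(output_dir):
    """Map each DBP-ID to its sequence file and format in <output_dir>/sequence."""
    index = {}
    for path in sorted(Path(output_dir, "sequence").glob("*")):
        fmt = path.suffix.lstrip(".")
        # Only files named <DBP-ID>.genbank or <DBP-ID>.fasta are considered.
        if path.is_file() and fmt in {"genbank", "fasta"}:
            index[path.stem] = (fmt, path)
    return index
```

If a plasmid were exported in both formats, this sketch keeps only the last one seen; the text does not say whether that case occurs.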
The options common for both commands are
Running the commands