Log files generated by the nightly load
https://curation.pombase.org/dumps/latest_build/logs/ or https://www.pombase.org/nightly_update/logs/
The gene structures are loaded from EMBL format contig files in the Subversion repository
all_warnings.txt
- the main log file containing all load warnings
Each type of warning is also written to one of these files:
- the CV name at the start of a term field of an annotation doesn't match the cv qualifier, e.g. term=sequence feature, transmembrane helices; cv=seq_feat
- a db_xref is missing
- a qualifier is duplicated
- missing or unknown evidence code
Most errors originate from contig files
- there is no feature corresponding to the /systematic_id in a UTR or intron. Check all associated features; the rogue identifier might be on an intron or UTR
- or a CDS has no /systematic_id
- a term is missing from a mapping file
- problems with IDs in /GO annotation
- cases where the cv name in the term doesn't match the cv qualifier, e.g. /controlled_curation="term=sequence feature, transmembrane helices; cv=seq_feat; date=19700101"
- a human or cerevisiae gene from an ortholog annotation isn't in Chado
- the ortholog couldn't be stored in Chado, perhaps because of a duplicate
- feature has /colour=13 but isn't a pseudogene
- general qualifier problems
  - Often 'term' is missing from the beginning of the qualifier
  - E.g. intron SPAC20G4.09.1:intron:1: qualifier not in the form "key=value": "misc, confirmed"
- The modification terms in the contig files are converted to MOD terms. The mapping file is pombe-embl/chado_load_mappings/modification_map.txt; check there is an entry in this file
  - E.g. mRNA SPAC1834.03c.1: can't find new term for methylated lysine in mapping for PSI-MOD: failed to load qualifier 'term=modification, methylated lysine; residue=K20; db_xref=PB_REF:0000001; evidence=ISS; cv=pt_mod; date=20100311' from SPAC1834.03c.1
- more than one term is found by synonym in Chado for a term name in an annotation
- unused?
- cv= has a name that isn't in Chado
- term= has a term name that isn't in Chado
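Several of the warnings above concern qualifiers that are not in the form "key=value". As a rough illustration only, the sketch below splits a qualifier value on ";" and checks each part. The function name `parse_qualifier` and the splitting rule are assumptions made for this example; the real loader may parse qualifiers differently.

```python
def parse_qualifier(value):
    """Split a qualifier such as 'term=...; cv=...' into a dict.

    Hypothetical helper: raises ValueError if any ';'-separated part
    is not in the form key=value.
    """
    fields = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, val = part.partition("=")
        if not sep or not key.strip():
            raise ValueError(f'qualifier not in the form "key=value": "{part}"')
        fields[key.strip()] = val.strip()
    return fields


qualifier = ("term=modification, methylated lysine; residue=K20; "
             "db_xref=PB_REF:0000001; evidence=ISS; cv=pt_mod; date=20100311")
print(parse_qualifier(qualifier)["cv"])
```

A value such as "misc, confirmed" has no "=" and so raises the error, which is the situation where 'term' is missing from the start of the qualifier.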
All these log files have a date and time prefix so the file names look like: log.2021-12-11-23-30-05.biogrid-load-output
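When scripting against these files it can help to split the prefix from the log type. The following is a minimal sketch, assuming the prefix always has the form year-month-day-hour-minute-second as in the example name; `parse_log_name` is a hypothetical helper, not part of the load code.

```python
import re
from datetime import datetime

# Matches names like log.2021-12-11-23-30-05.biogrid-load-output
LOG_NAME_RE = re.compile(r"^log\.(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})\.(.+)$")


def parse_log_name(file_name):
    """Return (timestamp, log_type), or None if the name has no such prefix."""
    match = LOG_NAME_RE.match(file_name)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group(1), "%Y-%m-%d-%H-%M-%S")
    return timestamp, match.group(2)


print(parse_log_name("log.2021-12-11-23-30-05.biogrid-load-output"))
```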
Output from the process that adds missing allele names, where possible, using the gene name and allele description. This process only considers amino_acid_mutation and nucleotide_mutation alleles.
See: https://github.com/pombase/pombase-chado/issues/881
Warnings from reading allele comments from pombe-embl/supporting_files/allele_comments.txt.
Warnings from reading allele synonyms from pombe-embl/supporting_files/allele_synonyms-from-supporting-data.txt.
- This usually means that the primary name and synonyms have been switched. Either remove the entry from the file, or swap the name and synonym.
A table of alleles with type "other".
- a table of the counts of annotations per CV
- three tables of annotation counts by evidence code and CV type, sorted by:
- CV name
- annotation count
- evidence code
- a count of all ontology annotations
- counts of annotations from Canto, by type
- total annotations from Canto
Warnings generated by reading the pombe interactions from BioGRID.
We read BIOGRID-ORGANISM-LATEST.tab2.zip from: https://downloads.thebiogrid.org/Download/BioGRID/Latest-Release/
Warnings from the Compara ortholog predictions from: pombe-embl/orthologs/compara_orths.tsv in SVN.
Warnings from the Canto curation sessions:
To add new allowed extensions see the extension_restrictions configuration setting.
Warnings from pombe-embl/external_data/disease/pombase_disease_associations_mondo_ids.txt.
Annotations that use FYPO terms from: pombe-embl/mini-ontologies/FYPO_qc_do_not_annotate_subsets.obo
Annotations that use FYPO terms from: pombe-embl/supporting_files/FYPO_terms_excluded_from_pombase.txt
Annotations that use GO terms from: pombe-embl/supporting_files/GO_terms_excluded_from_pombase.txt
Warnings generated while exporting from Chado to https://curation.pombase.org/dumps/latest_build/:
- the GO pombe GAF file
- exports/pombase-go-physical-interactions.tsv.gz
- exports/pombase-go-substrates.tsv.gz
- interactions in BioGRID format
- ortholog files:
  - pombase-latest.human-orthologs.txt.gz
  - exports/pombe-human-orthologs-with-systematic-ids.txt.gz
  - exports/pombe-cerevisiae-orthologs-with-systematic-ids.txt.gz
- phenotypes in PHAF format:
  - pombase-latest.phaf.gz
  - pombase-latest.eco.phaf.gz (see https://github.com/pombase/pombase-chado/issues/869)
- modifications, with a file name like: pombase-build-2021-12-10.modifications.gz
The number of annotations using extensions grouped by CV.
Message from the process that fixes incomplete allele names.
If name and description are the same and look like a residue change (eg. "A123K" or "K21A,T23A"), add the gene name as a prefix to the allele name: "abc1-A123K" "abc1-K21A,T23A"
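The rule above can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: `fix_allele_name` is a hypothetical function, and the pattern for "looks like a residue change" (an uppercase letter, digits, an uppercase letter, optionally repeated with commas) is a simplification chosen for the example.

```python
import re

# One residue change such as "A123K": residue, position, residue.
# Accepting any uppercase letter as a residue is a simplification.
RESIDUE_CHANGE = r"[A-Z]\d+[A-Z]"
CHANGE_LIST_RE = re.compile(rf"^{RESIDUE_CHANGE}(,{RESIDUE_CHANGE})*$")


def fix_allele_name(gene_name, allele_name, description):
    """Prefix the gene name when name and description are the same residue change."""
    if allele_name == description and CHANGE_LIST_RE.match(allele_name):
        return f"{gene_name}-{allele_name}"
    return allele_name


print(fix_allele_name("abc1", "A123K", "A123K"))
print(fix_allele_name("abc1", "K21A,T23A", "K21A,T23A"))
```

Names that already carry a prefix, or that differ from the description, are returned unchanged.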
Warnings from loading GAF files from GOA, Panther and others.
Messages from the process that removes GOA annotations if there is another identical annotation.
Output from applying pombe-embl/chado_load_mappings/GO_mapping_to_specific_terms.txt.
Messages from loading the KEGG pathways for pombe.
Warnings from loading: pombe-embl/supporting_files/legacy_go_annotations_from_contigs.txt
Warnings from loading: pombe-embl/external_data/disease/malacards_data_for_chado_mondo_ids.tsv
Warnings from loading: pombe-embl/orthologs/conserved_one_to_one.txt
Warnings from: pombe-embl/orthologs/conserved_multi.txt
Warnings from loading the modification data files from: pombe-embl/external_data/modification_files/
One file for each PHAF file we load from pombe-embl/external_data/phaf_files/chado_load/htp_phafs/ and pombe-embl/external_data/phaf_files/chado_load/ltp_phafs/
A table of which gene is annotated with which protein family.
Counts of qualifiers grouped by CV name.
Warnings from loading qualitative gene expression files from: pombe-embl/external_data/qualitative_gene_expression_data/
Warnings from loading quantitative gene expression files from: pombe-embl/external_data/Quantitative_gene_expression_data/
Warnings and messages from the process (using the pombase-chado-json executable) that creates the data files for the website. See the Nightly update page for more detail.
Warns about:
- missing introns (either no gap or an overlap between two exons)
- genes that are viable and inviable simultaneously
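The missing-intron warning can be pictured with a small sketch. It assumes exons are given as 1-based inclusive (start, end) pairs for a single transcript; the function name `intron_problems` and the message wording are invented for this example and do not reflect the real pombase-chado-json code.

```python
def intron_problems(exons):
    """Report adjacent exon pairs that overlap or leave no gap for an intron.

    exons: list of (start, end) tuples, 1-based inclusive coordinates.
    """
    problems = []
    ordered = sorted(exons)
    for (start_a, end_a), (start_b, end_b) in zip(ordered, ordered[1:]):
        if start_b <= end_a:
            problems.append(
                f"overlap between exons {start_a}..{end_a} and {start_b}..{end_b}")
        elif start_b == end_a + 1:
            problems.append(
                f"no gap between exons {start_a}..{end_a} and {start_b}..{end_b}")
    return problems


print(intron_problems([(1, 100), (101, 200), (190, 300)]))
```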
These are queries that are run for information or quality control. We expect to get output from these queries, in contrast to the Chado checks.
Current queries:
The output appears in log files like: log.2021-12-11-23-30-05.qc_queries
At the end of the load we run "chado_checks" configured in the check_chado section of the main Chado load config file.
These are queries that we hope will one day produce zero results (in contrast to the QC queries). If there are no results the corresponding log file will be empty.
Some warnings will appear in the main Chado check log file: log.2021-12-11-23-30-05.chado_checks
but mostly warnings are split into one file per check:
Check that all alleles are an instance_of gene.
Internal loading check to make sure the database has the right structure for alleles and genes.
Check that we have enough annotations in Chado. This is a coarse check that the count hasn't dropped a lot.
Report annotations with no evidence.
Checks column 17 / gene_product_form_id fields for badly formatted PRO IDs.
- Usually occurs when a community curator has used their own laboratory name designation. Fix by using the standard name format and moving the previously curated name to be a synonym. Multiple synonyms are "|" separated.
Deletion alleles that don't end in "delta".
Warns if the number of annotations from Canto drops unexpectedly.
Two alleles with the same description and gene have different names.
Occurs when:
- An unknown allele that has subsequently been sequenced now has a standard description, and this has been used as the name in subsequent publications.
  - E.g. SPBC119.11c snm1-1 3f99f4c4c350c480,78374d69fcccdd6e A342T pac1-A342T f5403fffd3ecaf81,c242b94e8f530220. In these cases standardize on the most informative description and/or current gene name: pac1-A342T
- An allele has been entered without the gene name prefix (e.g. just A123E not cdc2-A123E)
- An allele has been assigned different names by different laboratories (use the first or most prevalent; discuss with the community if in doubt).
- A generic description has been used, e.g. 'analog sensitive' or 'fusion'. The description needs to be explicit if it can apply to multiple alleles.
  - E.g. move 'analogue sensitive' to the allele comment and add the actual amino acid change, i.e. cdk9as would become cdk9-T120G, OR we need to know if there are multiple as mutants and label them clearly: cdk9-as1, cdk9-as2, cdk9-as3 (the first solution is preferred)
NOTE: FOR ALL, REMEMBER TO FIX IN MULTI-ALLELE GENOTYPES TOO (THERE ARE OFTEN MANY, IDEALLY IN THE FUTURE THERE WILL BE A WAY TO FIX ALL ALLELES IN A SESSION SIMULTANEOUSLY)
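The check for alleles that share a gene and description but have different names amounts to a grouping step. The sketch below works on plain tuples rather than Chado; `find_name_conflicts` and the input layout are assumptions made for illustration.

```python
from collections import defaultdict


def find_name_conflicts(alleles):
    """Group allele names by (gene, description) and keep groups with >1 name.

    alleles: iterable of (gene_id, allele_name, description) tuples.
    """
    names = defaultdict(set)
    for gene_id, name, description in alleles:
        names[(gene_id, description)].add(name)
    return {key: sorted(group) for key, group in names.items() if len(group) > 1}


alleles = [
    ("SPBC119.11c", "snm1-1", "A342T"),
    ("SPBC119.11c", "pac1-A342T", "A342T"),
]
print(find_name_conflicts(alleles))
```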
Check that all allele names start with a gene name or gene synonym.
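In outline, this check is a prefix test against the gene name and each synonym. A minimal sketch, assuming the names are available as plain strings (`has_gene_prefix` is a hypothetical helper; the real check is a query against Chado):

```python
def has_gene_prefix(allele_name, gene_name, synonyms=()):
    """True if the allele name starts with the gene name or any gene synonym."""
    candidates = [gene_name, *synonyms]
    return any(allele_name.startswith(name) for name in candidates if name)


print(has_gene_prefix("cdc2-A123E", "cdc2"))
print(has_gene_prefix("A123E", "cdc2"))
```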
Check for duplicate interactions.
Check that there are enough genes (more than 10000). This includes all organisms. If we have fewer than 10000, something has gone very wrong.
Check the genotype backgrounds and report anything that shouldn't be there.
Check that all phenotype terms have a parent in one of the configured categories. See split_by_parents in the website configuration for fission_yeast_phenotype. Annotations for terms that aren't a descendant of one of the split_by_parents won't be displayed.
Check for two or more alleles with the same name. The columns are:
- allele name
- allele systematic ID
- allele type
- allele description
- canto session(s) - if empty, the allele comes from a PHAF file
All duplicate GO annotation. Columns are:
count, systematic ID, gene name, term name, evidence, PubMed ID, 'with', session ID
Report PRO IDs used for more than one gene.
Check that we have a sensible number of genetic interactions.
A coarse check that the CVs loaded correctly: warns if there aren't a sensible number of is_a cvterm_relationship rows.
Check that we have a sensible number of physical interactions.
Check that gene identifiers used in extensions are valid.
Confirm we have a reasonable number of GO annotations. This check exists mostly to confirm that GO filtering is working.
Warn if an identifier in a with field has an unknown prefix.
Check that the extensions on protein binding annotations have two assayed_using relations.
Check for modifications where the residue doesn't make sense for the term. See: https://github.com/pombase/pombase-chado/issues/1097
Report duplicated orthologs.
Check for cases where two genes have the same name.
No population term should have a has_penetrance relation.
Report non-ascii characters in feature names.
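To see which characters would trigger this report for a given name, something like the following can be used. It is only a sketch: `non_ascii_chars` is a hypothetical helper and the real check runs against the feature names in Chado.

```python
def non_ascii_chars(name):
    """Return the characters in name that fall outside the ASCII range."""
    return [ch for ch in name if ord(ch) > 127]


print(non_ascii_chars("cdc2"))
print(non_ascii_chars("abc1Δ"))
```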
Warn if there is not a sensible number of pombe genes.
Check for relation ranges / annotation properties where the term is missing from the database, e.g. from(...) or column_17(...).
Report all quantitative expression annotations that don't have a count qualifier (like quant_gene_ex_avg_copies_per_cell).
Check that we have the correct number of species distribution terms; too many means there are typos somewhere.
- All errors are in *.config files
Cases where unknown alleles have a name or synonym in common, which should be merged. GitHub issue
Check for annotations that have more than one of low, high or standard temperature as a condition. GitHub issue
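The temperature check can be thought of as a set intersection over an annotation's conditions. In the sketch below the condition names "low temperature", "high temperature" and "standard temperature" are assumed spellings, and `conflicting_temperatures` is a hypothetical function, not the query the load actually runs.

```python
# Assumed condition names; the actual term names may differ.
TEMPERATURE_CONDITIONS = {
    "low temperature",
    "high temperature",
    "standard temperature",
}


def conflicting_temperatures(conditions):
    """Return the temperature conditions if more than one is present, else []."""
    found = TEMPERATURE_CONDITIONS.intersection(conditions)
    return sorted(found) if len(found) > 1 else []


print(conflicting_temperatures(["high temperature", "standard temperature"]))
```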