Log files generated by the nightly load

Location

https://curation.pombase.org/dumps/latest_build/logs/ or https://www.pombase.org/nightly_update/logs/

Contig files

The gene structures are loaded from EMBL format contig files in the Subversion repository

Log files from loading the contig files:

all_warnings.txt - the main log file containing all load warnings

Each type of warning is also written to one of these files:

`cv_name_mismatches.txt`

the CV name at the start of a term field of an annotation doesn't match the cv qualifier. eg. term=sequence feature, transmembrane helices; cv=seq_feat

`db_xref_problems.txt`

a db_xref is missing

`duplicated_sub_qual_problems.txt`

a qualifier is duplicated

`evidence_problems.txt`

missing or unknown evidence code

`feature_warnings.txt`

Most of these errors originate from the contig files.

there is no feature corresponding to the /systematic_id in a UTR or intron
- Check all associated features; the rogue identifier might be on an intron or UTR
a CDS has no /systematic_id

`mapping_problems.txt`

a term is missing from a mapping file

`misc_term_warnings.txt`

problems with IDs in /GO annotation

`mismatches.txt`

cases where the cv name in the term doesn't match the cv qualifier eg. /controlled_curation="term=sequence feature, transmembrane helices; cv=seq_feat; date=19700101"

`ortholog_problems.txt`

a human or cerevisiae gene from an ortholog annotation isn't in Chado
the ortholog couldn't be stored in Chado, perhaps because of a duplicate

`pseudogene_mismatches.txt`

feature has /colour=13 but isn't a pseudogene

`qualifier_problems.txt`

general qualifier problems

Often 'term' is missing from the beginning of the qualifier
- E.g. intron SPAC20G4.09.1:intron:1: qualifier not in the form "key=value": "misc, confirmed"
The modification terms in the contig files are converted to MOD terms. The mapping file is pombe-embl/chado_load_mappings/modification_map.txt; check that there is an entry for the term in this file
- E.g. mRNA SPAC1834.03c.1: can't find new term for methylated lysine in mapping for PSI-MOD: failed to load qualifier 'term=modification, methylated lysine; residue=K20; db_xref=PB_REF:0000001; evidence=ISS; cv=pt_mod; date=20100311' from SPAC1834.03c.1
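The "key=value" form mentioned in these warnings can be illustrated with a small parser. This is a minimal sketch, not the loader's actual code: the function name `split_qualifier` is hypothetical, and the ";" separator is assumed from the example qualifiers shown on this page.

```python
def split_qualifier(qualifier):
    """Split 'key=value; key=value' text into (pairs, bad_parts).

    Assumption: parts are separated by ';' as in the examples above.
    Parts without a key and an '=' are returned in bad_parts.
    """
    pairs = {}
    bad_parts = []
    for part in qualifier.split(";"):
        part = part.strip()
        if not part:
            continue
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = part.partition("=")
        if sep and key.strip():
            pairs[key.strip()] = value.strip()
        else:
            bad_parts.append(part)
    return pairs, bad_parts


good = ("term=modification, methylated lysine; residue=K20; "
        "db_xref=PB_REF:0000001; evidence=ISS; cv=pt_mod; date=20100311")
print(split_qualifier(good))
print(split_qualifier("misc, confirmed"))
```

The second call returns no pairs and one bad part, which corresponds to the missing 'term=' case described above.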

`synonym_match_problems.txt`

more than one term is found by synonym in Chado for a term name in an annotation

`target_problems.txt`

unused?

`unknown_cv_names.txt`

cv= has a name that isn't in Chado

`unknown_term_names.txt`

term= has a term name that isn't in Chado

Log files generated after loading the contig files:

All these log files have a date and time prefix so the file names look like: log.2021-12-11-23-30-05.biogrid-load-output
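When scripting against these files it can help to split a file name into its timestamp and log type. The following is a minimal sketch; the pattern is inferred only from the example name above, and `parse_log_name` is a hypothetical helper, not part of the PomBase code.

```python
import re
from datetime import datetime

# Assumption: names look like log.YYYY-MM-DD-HH-MM-SS.<log type>,
# as in the example above.
LOG_NAME_RE = re.compile(r"^log\.(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})\.(.+)$")


def parse_log_name(file_name):
    """Return (timestamp, log_type), or None if the name doesn't match."""
    match = LOG_NAME_RE.match(file_name)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group(1), "%Y-%m-%d-%H-%M-%S")
    return timestamp, match.group(2)


print(parse_log_name("log.2021-12-11-23-30-05.biogrid-load-output"))
```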

`add-missing-allele-names`

Output from the process that adds missing allele names, where possible, using the gene name and allele description. This process only considers amino_acid_mutation and nucleotide_mutation alleles.

See: https://github.com/pombase/pombase-chado/issues/881

`allele-comments`

Warnings from reading allele comments from pombe-embl/supporting_files/allele_comments.txt.

`allele-synonyms-from-supporting-data`

Warnings from reading allele synonyms from pombe-embl/supporting_files/allele_synonyms-from-supporting-data.txt.

A warning usually means that the primary name and a synonym have been switched. Either remove the entry from the file or swap the name and synonym.

`alleles_of_type_other`

A table of alleles with type "other".

`annotation_counts_by_cv`

a table of the counts of annotations per CV
three tables of annotation counts by evidence code and CV type, sorted by:
- CV name
- annotation count
- evidence code
a count of all ontology annotations
counts of annotations from Canto, by type
total annotations from Canto

`biogrid-load-output`

Warnings generated by reading the pombe interactions from BioGRID.

We read BIOGRID-ORGANISM-LATEST.tab2.zip from: https://downloads.thebiogrid.org/Download/BioGRID/Latest-Release/

`compara-orth-load-output`

Warnings from the Compara ortholog predictions from: pombe-embl/orthologs/compara_orths.tsv in SVN.

`curation-tool-data-load-output`

Warnings from the Canto curation sessions:

Extension not allowed for a CV

To add new allowed extensions see the extension_restrictions configuration setting.

`disease_associations`

Warnings from pombe-embl/external_data/disease/pombase_disease_associations_mondo_ids.txt.

`excluded_fypo_terms`

Annotations that use FYPO terms from: pombe-embl/mini-ontologies/FYPO_qc_do_not_annotate_subsets.obo

`excluded_fypo_terms_softcheck`

Annotations that use FYPO terms from: pombe-embl/supporting_files/FYPO_terms_excluded_from_pombase.txt

`excluded_go_terms_softcheck`

Annotations that use GO terms from: pombe-embl/supporting_files/GO_terms_excluded_from_pombase.txt

`export_warnings`

warnings generated while exporting from Chado to https://curation.pombase.org/dumps/latest_build/:

the GO pombe GAF file
exports/pombase-go-physical-interactions.tsv.gz
exports/pombase-go-substrates.tsv.gz
interactions in BioGRID format
ortholog file: pombase-latest.human-orthologs.txt.gz
exports/pombe-human-orthologs-with-systematic-ids.txt.gz
exports/pombe-cerevisiae-orthologs-with-systematic-ids.txt.gz
phenotypes in PHAF format: pombase-latest.phaf.gz
pombase-latest.eco.phaf.gz (See https://github.com/pombase/pombase-chado/issues/869)
modifications, with a file name like: pombase-build-2021-12-10.modifications.gz

`extension_relation_counts`

The number of annotations using extensions grouped by CV.

`fix-allele-names`

Messages from the process that fixes incomplete allele names.

If name and description are the same and look like a residue change (eg. "A123K" or "K21A,T23A"), add the gene name as a prefix to the allele name: "abc1-A123K" "abc1-K21A,T23A"
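The rule above can be sketched as follows. This is an illustration only: the regular expression for "looks like a residue change" is an assumption based on the two examples given, and `fix_allele_name` is a hypothetical name, not the function used by the load.

```python
import re

# Assumption: a residue change is one or more comma-separated items of the
# form <letter><position><letter>, e.g. "A123K" or "K21A,T23A".
RESIDUE_CHANGE_RE = re.compile(r"^[A-Z]\d+[A-Z](,[A-Z]\d+[A-Z])*$")


def fix_allele_name(gene_name, allele_name, description):
    """Prefix the gene name when the allele name is just a residue change."""
    if allele_name == description and RESIDUE_CHANGE_RE.match(allele_name):
        return f"{gene_name}-{allele_name}"
    # Anything else is left unchanged.
    return allele_name


print(fix_allele_name("abc1", "A123K", "A123K"))
print(fix_allele_name("abc1", "K21A,T23A", "K21A,T23A"))
```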

`gaf-load-output`

Warnings from loading GAF files from GOA, Panther and others.

`go-filter-uniprot-duplicates`

Messages from the process that removes GOA annotations if there is another identical annotation.

`go-term-mapping`

Output from applying pombe-embl/chado_load_mappings/GO_mapping_to_specific_terms.txt.

`kegg-pathway`

Messages from loading the KEGG pathways for pombe.

`legacy_go_from_contigs`

Warnings from loading: pombe-embl/supporting_files/legacy_go_annotations_from_contigs.txt

`malacards_data`

Warnings from loading: pombe-embl/external_data/disease/malacards_data_for_chado_mondo_ids.tsv

`manual-1-1-orths-output`

Warnings from loading: pombe-embl/orthologs/conserved_one_to_one.txt

`manual-multi-orths-output`

Warnings from: pombe-embl/orthologs/conserved_multi.txt

`modification`

Warnings from loading the modification data files from: pombe-embl/external_data/modification_files/

`phenotypes_from_PMID_..._phaf`

One file for each PHAF file we load from pombe-embl/external_data/phaf_files/chado_load/htp_phafs/ and pombe-embl/external_data/phaf_files/chado_load/ltp_phafs/

`protein_family_term_annotation`

A table of which gene is annotated with which protein family.

`qualifier_counts_by_cv`

Counts of qualifiers grouped by CV name.

`qualitative`

Warnings from loading qualitative gene expression files from: pombe-embl/external_data/qualitative_gene_expression_data/

`quantitative`

Warnings from loading quantitative gene expression files from: pombe-embl/external_data/Quantitative_gene_expression_data/

`web-json-write`

Warnings and messages from the process (using the pombase-chado-json executable) that creates the data files for the website. See the Nightly update page for more detail.

Warns about:

missing introns (either no gap or an overlap between two exons)
genes that are viable and inviable simultaneously
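The missing intron condition (no gap or an overlap between two exons) can be illustrated with a short check. This is a sketch under stated assumptions, not the pombase-chado-json implementation: exons are taken to be (start, end) pairs with 1-based inclusive coordinates, and `exon_junction_problems` is a hypothetical name.

```python
def exon_junction_problems(exons):
    """Report adjacent exon pairs that leave no room for an intron.

    exons: list of (start, end) tuples, 1-based inclusive (an assumption).
    """
    problems = []
    ordered = sorted(exons)
    for previous, following in zip(ordered, ordered[1:]):
        if following[0] <= previous[1]:
            # The next exon starts inside the previous one.
            problems.append(("overlap", previous, following))
        elif following[0] == previous[1] + 1:
            # The exons abut, so there is no intron between them.
            problems.append(("no gap", previous, following))
    return problems


print(exon_junction_problems([(1, 100), (101, 200)]))
print(exon_junction_problems([(1, 100), (90, 200)]))
print(exon_junction_problems([(1, 100), (150, 200)]))
```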

QC queries

These are queries that are run for information or quality control. We expect to get output from these queries, in contrast to the Chado checks.

Current queries:

genotype comments with session IDs

The output appears in log files like: log.2021-12-11-23-30-05.qc_queries

Chado checks

At the end of the load we run "chado_checks" configured in the check_chado section of the main Chado load config file.

These are queries that we hope will one day produce zero results (in contrast to the QC queries). If there are no results the corresponding log file will be empty.

Some warnings will appear in the main Chado check log file: log.2021-12-11-23-30-05.chado_checks but mostly warnings are split into one file per check:
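Because a check with no results leaves an empty log file, a quick way to see which checks need attention is to list the non-empty files. The sketch below assumes the per-check files are named like log.2021-12-11-23-30-05.chado_checks.alleles_instance_of_gene and sit in one local directory; both the naming and the helper `non_empty_check_logs` are assumptions for illustration.

```python
from pathlib import Path


def non_empty_check_logs(log_dir):
    """Return the sorted names of chado_checks log files that have output."""
    return sorted(
        path.name
        # Assumed naming: log.<timestamp>.chado_checks[.<check name>]
        for path in Path(log_dir).glob("log.*.chado_checks*")
        if path.is_file() and path.stat().st_size > 0
    )


# Example use, with a hypothetical local copy of the log directory:
# for name in non_empty_check_logs("logs"):
#     print(name)
```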

`chado_checks.alleles_instance_of_gene`

Check that all alleles are an instance_of gene. Internal loading check to make sure the database has the right structure for alleles and genes.

`chado_checks.annotation_count`

Check that we have enough annotations in Chado. This is a coarse check that the count hasn't dropped a lot.

`chado_checks.annotation_with_no_evidence`

Report annotations with no evidence.

`chado_checks.badly_formatted_pro_ids`

Checks column 17 / gene_product_form_id fields for badly formatted PRO IDs.

`chado_checks.badly_named_deletion_alleles`

Deletion alleles that don't end in "delta".

Usually occurs when a community curator has used their own laboratory's name designation. Fix by using the standard name format and moving the previously curated name to be a synonym. Multiple synonyms are "|" separated.
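The naming condition this check reports can be sketched as a simple filter. The dictionary keys, the "deletion" type string and the allele names used here are hypothetical; the real check is a query against Chado.

```python
def badly_named_deletions(alleles):
    """Return names of deletion alleles whose name doesn't end in 'delta'.

    alleles: list of dicts with 'name' and 'type' keys (assumed layout).
    """
    return [
        allele["name"]
        for allele in alleles
        if allele["type"] == "deletion" and not allele["name"].endswith("delta")
    ]


example = [
    {"name": "abc1delta", "type": "deletion"},
    {"name": "abc1-D1", "type": "deletion"},          # would be reported
    {"name": "abc1-A123K", "type": "amino_acid_mutation"},
]
print(badly_named_deletions(example))
```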

`chado_checks.canto_annotations`

Warns if the number of annotations from Canto drops unexpectedly.

`chado_checks.duplicate_allele_descriptions`

Two alleles with the same description and gene have different names.

Occurs when one of the following applies:

An unknown allele that has subsequently been sequenced now has a standard description, and this has been used as the name in subsequent publications.
- E.g. SPBC119.11c snm1-1 3f99f4c4c350c480,78374d69fcccdd6e A342T pac1-A342T f5403fffd3ecaf81,c242b94e8f530220 - in these cases standardize on the most informative description and/or the current gene name (here pac1-A342T)
An allele has been entered without the gene name prefix (e.g. just A123E, not cdc2-A123E)
An allele has been assigned different names by different laboratories (use the first or most prevalent name; discuss with the community if in doubt).
A generic description has been used, e.g. 'analog sensitive' or 'fusion'. The description needs to be explicit if it can apply to multiple alleles.
- E.g. move 'analogue sensitive' to the allele comment and add the actual amino acid change, i.e. cdk9as would become cdk9-T120G, OR we need to know if there are multiple as mutants and label them clearly cdk9-as1, cdk9-as2, cdk9-as3 (the first solution is preferred)

NOTE: FOR ALL, REMEMBER TO FIX IN MULTI-ALLELE GENOTYPES TOO (THERE ARE OFTEN MANY, IDEALLY IN THE FUTURE THERE WILL BE A WAY TO FIX ALL ALLELES IN A SESSION SIMULTANEOUSLY)
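The condition reported by chado_checks.duplicate_allele_descriptions (same gene and description, different names) can be sketched as a grouping step. This is an illustration using plain tuples rather than the Chado query the check actually runs; `duplicate_descriptions` is a hypothetical name.

```python
from collections import defaultdict


def duplicate_descriptions(alleles):
    """Group allele names by (gene, description) and keep the clashes.

    alleles: iterable of (gene_id, allele_name, description) tuples.
    """
    names_by_key = defaultdict(set)
    for gene_id, name, description in alleles:
        names_by_key[(gene_id, description)].add(name)
    return {
        key: sorted(names)
        for key, names in names_by_key.items()
        if len(names) > 1
    }


alleles = [
    ("SPBC119.11c", "snm1-1", "A342T"),
    ("SPBC119.11c", "pac1-A342T", "A342T"),
]
print(duplicate_descriptions(alleles))
```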

`chado_checks.AlleleNotStartingWithGeneName`

Check that all allele names start with a gene name or gene synonym.

`chado_checks.DuplicateInteractions`

Check for duplicate interactions.

`chado_checks.FeatureCount`

Check that there are enough genes (more than 10000). This includes all organisms. If we have fewer than 10000, something has gone very wrong.

`chado_checks.GenotypeBackgrounds`

Check the genotype backgrounds and report anything that shouldn't be there.

`chado_checks.PhenotypesNotInCategory`

Check that all phenotype terms have a parent in one of the configured categories. See split_by_parents in the website configuration for fission_yeast_phenotype. Annotations for terms that aren't a descendant of one of the split_by_parents terms won't be displayed.

`chado_checks.duplicated_allele_names`

Check for two or more alleles with the same name. The columns are:

allele name
allele systematic ID
allele type
allele description
canto session(s) - if empty, the allele comes from a PHAF file

`chado_checks.duplicate_go_annotation`

All duplicate GO annotation. Columns are:

count, systematic ID, gene name, term name, evidence, PubMed ID, 'with', session ID

`chado_checks.duplicate_pro_ids`

Report PRO IDs used for more than one gene.

`chado_checks.enough_genetic_interactions`

Check that we have a sensible number of genetic interactions.

`chado_checks.enough_isa_cvterm_rels`

A coarse check that the CVs loaded correctly: warns if there aren't a sensible number of is_a cvterm_relationship rows.

`chado_checks.enough_physical_interactions`

Check that we have a sensible number of physical interactions.

`chado_checks.extension_relation_genes_exist`

Check that gene identifiers used in extensions are valid.

`chado_checks.go_annotation_count`

Confirm we have a reasonable number of GO annotations. This check exists mostly to confirm that GO filtering is working.

`chado_checks.illegal_with_prefix`

Warn if an identifier in a with field has an unknown prefix.

`chado_checks.missing_assayed_using`

Check that the extensions on protein binding annotations have two assayed_using relations.

`chado_checks.modification_on_wrong_residue`

Check for modifications where the residue doesn't make sense for the term. See: https://github.com/pombase/pombase-chado/issues/1097

`chado_checks.no_duplicate_orthologs`

Report duplicated orthologs.

`chado_checks.no_duplicate_pombe_gene_names`

Check for cases where two genes have the same name.

`chado_checks.no_population_terms_with_penetrance`

No population term should have a has_penetrance relation.

`chado_checks.only_ascii_characters_in_feature_name`

Report non-ascii characters in feature names.
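To find the offending characters in a reported name, something like the following can be used. It is a minimal sketch; `non_ascii_names` is a hypothetical helper and the feature names are made up.

```python
def non_ascii_names(names):
    """Return (name, [non-ASCII characters]) for each name that has any."""
    return [
        (name, [ch for ch in name if not ch.isascii()])
        for name in names
        if not name.isascii()  # str.isascii() needs Python 3.7 or later
    ]


print(non_ascii_names(["cdc2", "abc1Δ"]))
```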

`chado_checks.pombe_genes`

Warn if there is not a sensible number of pombe genes.

`chado_checks.prop_values_with_missing_terms`

Check for relation ranges / annotation properties where the term is missing from the database, eg. from(...) or column_17(...).

`chado_checks.quantitative_annotation_qualifiers`

Report all quantitative expression annotations that don't have a count qualifier (like quant_gene_ex_avg_copies_per_cell).

`chado_checks.species_dist_term_name_typos`

Check that we have the correct number of species distribution terms; too many means there are typos somewhere.

All errors are in *.config files

`chado_checks.unknown_allele_name_synonym_matches`

Cases where unknown alleles have a name or synonym in common, which should be merged. GitHub issue

`chado_checks.inconsistent_temperature_conditions`

Check for annotations that have more than one of low, high or standard temperature as a condition. GitHub issue
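The temperature check can be sketched as a set intersection. The condition names used here are assumptions for illustration (the real check works on the condition terms stored in Chado), and `temperature_conflicts` is a hypothetical name.

```python
# Assumed condition names; the actual term names may differ.
TEMPERATURE_CONDITIONS = {
    "low temperature",
    "high temperature",
    "standard temperature",
}


def temperature_conflicts(conditions):
    """Return the temperature conditions if more than one is present."""
    found = sorted(TEMPERATURE_CONDITIONS.intersection(conditions))
    return found if len(found) > 1 else []


print(temperature_conflicts(["high temperature", "low temperature", "glucose MM"]))
print(temperature_conflicts(["standard temperature", "glucose MM"]))
```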