Log files generated by the nightly load

Kim Rutherford edited this page Jan 16, 2025 · 50 revisions

Location

https://curation.pombase.org/dumps/latest_build/logs/ or https://www.pombase.org/nightly_update/logs/

Contig files

The gene structures are loaded from EMBL format contig files in the Subversion repository.

Log files from loading the contig files:

all_warnings.txt - the main log file containing all load warnings

Each type of warning is also written to one of these files:

cv_name_mismatches.txt

  • the CV name at the start of a term field of an annotation doesn't match the cv qualifier. eg. term=sequence feature, transmembrane helices; cv=seq_feat

db_xref_problems.txt

  • a db_xref is missing

duplicated_sub_qual_problems.txt

  • a qualifier is duplicated

evidence_problems.txt

  • missing or unknown evidence code

feature_warnings.txt

Most errors originate from contig files

  • there is no feature corresponding to the /systematic_id in a UTR or intron
    • Check all associated features; the rogue identifier might be on an intron or UTR
  • or a CDS has no /systematic_id

mapping_problems.txt

  • a term is missing from a mapping file

misc_term_warnings.txt

  • problems with IDs in /GO annotation

mismatches.txt

  • cases where the cv name in the term doesn't match the cv qualifier eg. /controlled_curation="term=sequence feature, transmembrane helices; cv=seq_feat; date=19700101"

ortholog_problems.txt

  • a human or cerevisiae gene from an ortholog annotation isn't in Chado
  • the ortholog couldn't be stored in Chado, perhaps because of a duplicate

pseudogene_mismatches.txt

  • feature has /colour=13 but isn't a pseudogene

qualifier_problems.txt

  • general qualifier problems
  • Often 'term' is missing from the beginning of the qualifier

    • E.g. intron SPAC20G4.09.1:intron:1: qualifier not in the form "key=value": "misc, confirmed"
  • The modification terms in the contig files are converted to MOD terms using the mapping file pombe-embl/chado_load_mappings/modification_map.txt. Check that there is an entry for the term in this file.

    • E.g. mRNA SPAC1834.03c.1: can't find new term for methylated lysine in mapping for PSI-MOD: failed to load qualifier 'term=modification, methylated lysine; residue=K20; db_xref=PB_REF:0000001; evidence=ISS; cv=pt_mod; date=20100311' from SPAC1834.03c.1
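The "key=value" warning above can be reproduced with a small sketch. This is not the loader's own code: it assumes that qualifier parts are separated by ";" and that values never contain ";", and the function name parse_qualifier is hypothetical.

```python
def parse_qualifier(qualifier):
    """Split a 'key=value; key=value' qualifier string into a dict.

    Assumption: parts are separated by ';' and values never contain ';'.
    Raises ValueError for any part that is not in the form key=value.
    """
    fields = {}
    for part in qualifier.split(";"):
        part = part.strip()
        if not part:
            continue  # ignore empty parts, e.g. after a trailing ';'
        key, sep, value = part.partition("=")
        if not sep or not key.strip():
            raise ValueError(f'qualifier not in the form "key=value": "{part}"')
        fields[key.strip()] = value.strip()
    return fields


good = parse_qualifier("term=modification, methylated lysine; residue=K20; cv=pt_mod")
print(good["term"])

try:
    parse_qualifier("misc, confirmed")  # 'term=' is missing from the beginning
except ValueError as err:
    print(err)
```

In the second call the text has no "=" at all, which is the situation reported when 'term' is missing from the beginning of the qualifier.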

synonym_match_problems.txt

  • more than one term is found by synonym in Chado for a term name in an annotation

target_problems.txt

  • unused?

unknown_cv_names.txt

  • cv= has a name that isn't in Chado

unknown_term_names.txt

  • term= has a term name that isn't in Chado

Log files generated after loading the contig files:

All these log files have a date and time prefix so the file names look like: log.2021-12-11-23-30-05.biogrid-load-output
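When scripting against the logs directory it can help to split these names into a timestamp and a log name. The sketch below assumes the prefix is always "log." followed by year-month-day-hour-minute-second, which is inferred only from the example name above; the function name parse_log_file_name is hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from the example file name: "log.", a timestamp, ".", then the log name.
LOG_NAME_RE = re.compile(r"^log\.(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})\.(.+)$")


def parse_log_file_name(file_name):
    """Return (timestamp, log_name), or None if the name doesn't match the pattern."""
    match = LOG_NAME_RE.match(file_name)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group(1), "%Y-%m-%d-%H-%M-%S")
    return timestamp, match.group(2)


print(parse_log_file_name("log.2021-12-11-23-30-05.biogrid-load-output"))
```

Files without the prefix, such as all_warnings.txt from the contig load, return None.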

add-missing-allele-names

Output from the process that adds missing allele names, where possible, using the gene name and allele description. This process only considers amino_acid_mutation and nucleotide_mutation alleles.

See: https://github.com/pombase/pombase-chado/issues/881

allele-comments

Warnings from reading allele comments from pombe-embl/supporting_files/allele_comments.txt.

allele-synonyms-from-supporting-data

Warnings from reading allele synonyms from pombe-embl/supporting_files/allele_synonyms-from-supporting-data.txt.

  • This usually means that the primary name and synonyms have been switched. Either remove the entry from the file or swap the name and synonym.

alleles_of_type_other

A table of alleles with type "other".

annotation_counts_by_cv

  • a table of the counts of annotations per CV
  • three tables of annotation counts by evidence code and CV type, sorted by:
    • CV name
    • annotation count
    • evidence code
  • a count of all ontology annotations
  • counts of annotations from Canto, by type
  • total annotations from Canto

biogrid-load-output

Warnings generated by reading the pombe interactions from BioGRID.

We read BIOGRID-ORGANISM-LATEST.tab2.zip from: https://downloads.thebiogrid.org/Download/BioGRID/Latest-Release/

compara-orth-load-output

Warnings from the Compara ortholog predictions from: pombe-embl/orthologs/compara_orths.tsv in SVN.

curation-tool-data-load-output

Warnings from the Canto curation sessions:

Extension not allowed for a CV

To add new allowed extensions see the extension_restrictions configuration setting.

disease_associations

Warnings from pombe-embl/external_data/disease/pombase_disease_associations_mondo_ids.txt.

excluded_fypo_terms

Annotations that use FYPO terms from: pombe-embl/mini-ontologies/FYPO_qc_do_not_annotate_subsets.obo

excluded_fypo_terms_softcheck

Annotations that use FYPO terms from: pombe-embl/supporting_files/FYPO_terms_excluded_from_pombase.txt

excluded_go_terms_softcheck

Annotations that use GO terms from: pombe-embl/supporting_files/GO_terms_excluded_from_pombase.txt

export_warnings

Warnings generated while exporting from Chado to https://curation.pombase.org/dumps/latest_build/:

  • the GO pombe GAF file
  • exports/pombase-go-physical-interactions.tsv.gz
  • exports/pombase-go-substrates.tsv.gz
  • interactions in BioGRID format
  • ortholog file: pombase-latest.human-orthologs.txt.gz
  • exports/pombe-human-orthologs-with-systematic-ids.txt.gz
  • exports/pombe-cerevisiae-orthologs-with-systematic-ids.txt.gz
  • phenotypes in PHAF format: pombase-latest.phaf.gz
  • pombase-latest.eco.phaf.gz (See https://github.com/pombase/pombase-chado/issues/869)
  • modifications, with a file name like: pombase-build-2021-12-10.modifications.gz

extension_relation_counts

The number of annotations using extensions grouped by CV.

fix-allele-names

Messages from the process that fixes incomplete allele names.

If name and description are the same and look like a residue change (eg. "A123K" or "K21A,T23A"), add the gene name as a prefix to the allele name: "abc1-A123K" "abc1-K21A,T23A"
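The rule above can be sketched as follows. The regular expression is a guess at what "looks like a residue change" means (one uppercase letter, a position, one uppercase letter, optionally repeated with commas); the real process may accept other forms, and the function name fix_allele_name is hypothetical.

```python
import re

# One residue change such as "A123K": letter, position, letter.
# Assumption: the real process may use a different or broader pattern.
RESIDUE_CHANGE = r"[A-Z]\d+[A-Z]"
RESIDUE_CHANGES_RE = re.compile(rf"^{RESIDUE_CHANGE}(,{RESIDUE_CHANGE})*$")


def fix_allele_name(gene_name, allele_name, description):
    """Prefix the gene name if name and description match and look like residue changes."""
    if allele_name == description and RESIDUE_CHANGES_RE.match(allele_name):
        return f"{gene_name}-{allele_name}"
    return allele_name  # leave every other name untouched


print(fix_allele_name("abc1", "A123K", "A123K"))
print(fix_allele_name("abc1", "K21A,T23A", "K21A,T23A"))
```

Names that already differ from the description, or that don't match the pattern, are returned unchanged.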

gaf-load-output

Warnings from loading GAF files from GOA, Panther and others.

go-filter-uniprot-duplicates

Messages from the process that removes GOA annotations if there is another identical annotation.

go-term-mapping

Output from applying pombe-embl/chado_load_mappings/GO_mapping_to_specific_terms.txt.

kegg-pathway

Messages from loading the KEGG pathways for pombe.

legacy_go_from_contigs

Warnings from loading: pombe-embl/supporting_files/legacy_go_annotations_from_contigs.txt

malacards_data

Warnings from loading: pombe-embl/external_data/disease/malacards_data_for_chado_mondo_ids.tsv

manual-1-1-orths-output

Warnings from loading: pombe-embl/orthologs/conserved_one_to_one.txt

manual-multi-orths-output

Warnings from: pombe-embl/orthologs/conserved_multi.txt

modification

Warnings from loading the modification data files from: pombe-embl/external_data/modification_files/

phenotypes_from_PMID_..._phaf

One file for each PHAF file we load from pombe-embl/external_data/phaf_files/chado_load/htp_phafs/ and pombe-embl/external_data/phaf_files/chado_load/ltp_phafs/

protein_family_term_annotation

A table of which gene is annotated with which protein family.

qualifier_counts_by_cv

Counts of qualifiers grouped by CV name.

qualitative

Warnings from loading qualitative gene expression files from: pombe-embl/external_data/qualitative_gene_expression_data/

quantitative

Warnings from loading quantitative gene expression files from: pombe-embl/external_data/Quantitative_gene_expression_data/

web-json-write

Warnings and messages from the process (using the pombase-chado-json executable) that creates the data files for the website. See the Nightly update page for more detail.

Warns about:

  • missing introns (either no gap or an overlap between two exons)
  • genes that are viable and inviable simultaneously
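The missing intron case can be illustrated with a short sketch. It is not the pombase-chado-json implementation; it assumes exons are given as (start, end) pairs with 1-based inclusive coordinates, and the function name exon_junction_problems is hypothetical.

```python
def exon_junction_problems(exons):
    """Report adjacent exon pairs that leave no room for an intron.

    exons: list of (start, end) pairs; 1-based inclusive coordinates are assumed.
    """
    problems = []
    ordered = sorted(exons)
    for (start_a, end_a), (start_b, end_b) in zip(ordered, ordered[1:]):
        if start_b <= end_a:
            problems.append(
                f"overlap between exons {start_a}..{end_a} and {start_b}..{end_b}")
        elif start_b == end_a + 1:
            problems.append(
                f"no gap between exons {start_a}..{end_a} and {start_b}..{end_b}")
    return problems


print(exon_junction_problems([(1, 100), (101, 200)]))
print(exon_junction_problems([(1, 100), (90, 200)]))
```

A transcript whose exons are separated by at least one base returns an empty list.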

QC queries

These are queries that are run for information or quality control. We expect to get output from these queries, in contrast to the Chado checks.

Current queries:

The output appears in log files like: log.2021-12-11-23-30-05.qc_queries

Chado checks

At the end of the load we run "chado_checks" configured in the check_chado section of the main Chado load config file.

These are queries that we hope will one day produce zero results (in contrast to the QC queries). If there are no results the corresponding log file will be empty.

Some warnings appear in the main Chado check log file (for example log.2021-12-11-23-30-05.chado_checks), but most warnings are split into one file per check:

chado_checks.alleles_instance_of_gene

Check that all alleles are an instance_of gene. Internal loading check to make sure the database has the right structure for alleles and genes.

chado_checks.annotation_count

Check that we have enough annotations in Chado. This is a coarse check that the count hasn't dropped a lot.

chado_checks.annotation_with_no_evidence

Report annotations with no evidence.

chado_checks.badly_formatted_pro_ids

Checks column 17 / gene_product_form_id fields for badly formatted PRO IDs.

chado_checks.badly_named_deletion_alleles

Deletion alleles that don't end in "delta".

  • Usually occurs when a community curator has used their own laboratory's name designation. Fix by using the standard name format and moving the previously curated name to be a synonym. Multiple synonyms are "|" separated.

chado_checks.canto_annotations

Warns if the number of annotations from Canto drops unexpectedly.

chado_checks.duplicate_allele_descriptions

Two alleles with the same description and gene have different names.

Occurs when:

  • An unknown allele has subsequently been sequenced and now has a standard description, and this has been used as the name in subsequent publications.
    • E.g. SPBC119.11c snm1-1 3f99f4c4c350c480,78374d69fcccdd6e A342T pac1-A342T f5403fffd3ecaf81,c242b94e8f530220 - in these cases standardize on the most informative description and/or current gene name: pac1-A342T
  • An allele has been entered without the gene name prefix (e.g. just A123E not cdc2-A123E).
  • An allele has been assigned different names by different laboratories (use the first or most prevalent name; discuss with the community if in doubt).
  • A generic description has been used, e.g. 'analog sensitive' or 'fusion'. The description needs to be explicit if it can apply to multiple alleles.
    • E.g. move 'analogue sensitive' to the allele comment and add the actual amino acid change, i.e. cdk9as would become cdk9-T120G, OR we need to know if there are multiple as mutants and label them clearly: cdk9-as1, cdk9-as2, cdk9-as3 (the first solution is preferred)

NOTE: FOR ALL, REMEMBER TO FIX IN MULTI-ALLELE GENOTYPES TOO (THERE ARE OFTEN MANY, IDEALLY IN THE FUTURE THERE WILL BE A WAY TO FIX ALL ALLELES IN A SESSION SIMULTANEOUSLY)

chado_checks.AlleleNotStartingWithGeneName

Check that all allele names start with a gene name or gene synonym.

chado_checks.DuplicateInteractions

Check for duplicate interactions.

chado_checks.FeatureCount

Check that there are enough genes (more than 10000). This includes all organisms. If we have fewer than 10000, something has gone very wrong.

chado_checks.GenotypeBackgrounds

Check the genotype backgrounds and report anything that shouldn't be there.

chado_checks.PhenotypesNotInCategory

Check that all phenotype terms have a parent in one of the configured categories. See split_by_parents in the website configuration for fission_yeast_phenotype. Annotations for terms that aren't a descendant of one of the split_by_parents won't be displayed.

chado_checks.duplicated_allele_names

Check for two or more alleles with the same name. The columns are:

  • allele name
  • allele systematic ID
  • allele type
  • allele description
  • canto session(s) - if empty, the allele comes from a PHAF file

chado_checks.duplicate_go_annotation

All duplicate GO annotation. Columns are:

count, systematic ID, gene name, term name, evidence, PubMed ID, 'with', session ID

chado_checks.duplicate_pro_ids

Report PRO IDs used for more than one gene.

chado_checks.enough_genetic_interactions

Check that we have a sensible number of genetic interactions.

chado_checks.enough_isa_cvterm_rels

A coarse check that the CVs loaded correctly: warns if there aren't a sensible number of is_a cvterm_relationship rows.

chado_checks.enough_physical_interactions

Check that we have a sensible number of physical interactions.

chado_checks.extension_relation_genes_exist

Check that gene identifiers used in extensions are valid.

chado_checks.go_annotation_count

Confirm we have a reasonable number of GO annotations. This check exists mostly to confirm that GO filtering is working.

chado_checks.illegal_with_prefix

Warn if an identifier in a with field has an unknown prefix.

chado_checks.missing_assayed_using

Check that the extensions on protein binding annotations have two assayed_using relations.

chado_checks.modification_on_wrong_residue

Check for modifications where the residue doesn't make sense for the term. See: https://github.com/pombase/pombase-chado/issues/1097

chado_checks.no_duplicate_orthologs

Report duplicated orthologs.

chado_checks.no_duplicate_pombe_gene_names

Check for cases where two genes have the same name.

chado_checks.no_population_terms_with_penetrance

No population term should have a has_penetrance relation.

chado_checks.only_ascii_characters_in_feature_name

Report non-ascii characters in feature names.
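To find the offending characters in a reported name, a sketch like the one below can help. It is not the query used by the check itself, and the function name non_ascii_characters is hypothetical.

```python
def non_ascii_characters(feature_name):
    """Return the characters in a feature name that are outside the ASCII range."""
    return [ch for ch in feature_name if not ch.isascii()]


# A Greek capital delta in place of the word "delta" is reported; plain names are not.
print(non_ascii_characters("abc1\u0394"))
print(non_ascii_characters("cdc2"))
```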

chado_checks.pombe_genes

Warn if there isn't a sensible number of pombe genes.

chado_checks.prop_values_with_missing_terms

Check for relation ranges / annotation properties where the term is missing from the database, eg. from(...) or column_17(...).

chado_checks.quantitative_annotation_qualifiers

Report all quantitative expression annotations that don't have a count qualifier (like quant_gene_ex_avg_copies_per_cell).

chado_checks.species_dist_term_name_typos

Check that we have the correct number of species distribution terms; too many means there are typos somewhere.

  • All errors are in *.config files

chado_checks.unknown_allele_name_synonym_matches

Cases where unknown alleles have a name or synonym in common, which should be merged. GitHub issue

chado_checks.inconsistent_temperature_conditions

Check for annotations that have more than one of low, high or standard temperature as a condition. GitHub issue
