Frequently asked questions

General FAQs
How to cite discoveries made using SpadaHC data?

Please cite the SpadaHC paper:

José M Moreno-Cabrera, Lidia Feliubadaló, Marta Pineda, Patricia Prada-Dacasa, Mireia Ramos-Muntada, Jesús Del Valle, Joan Brunet, Bernat Gel, María Currás-Freixes, Bruna Calsina, Milton E Salazar-Hidalgo, Marta Rodríguez-Balada, Bàrbara Roig, Sara Fernández-Castillejo, Mercedes Durán Domínguez, Mónica Arranz Ledo, Mar Infante Sanz, Adela Castillejo, Estela Dámaso, José L Soto, Montserrat de Miguel, Beatriz Hidalgo Calero, José M Sánchez-Zapardiel, Teresa Ramon Y Cajal, Adriana Lasa, Alexandra Gisbert-Beamud, Anael López-Novo, Clara Ruiz-Ponte, Miriam Potrony, María I Álvarez-Mora, Ana Osorio, Isabel Lorda-Sánchez, Mercedes Robledo, Alberto Cascón, Anna Ruiz, Nino Spataro, Imma Hernan, Emma Borràs, Alejandro Moles-Fernández, Julie Earl, Juan Cadiñanos, Ana B Sánchez-Heras, Anna Bigas, Gabriel Capellá, Conxi Lázaro, The SpadaHC Consortium. SpadaHC: a database to improve the classification of variants in hereditary cancer genes in the Spanish population, Database, Volume 2024, 2024, baae055, https://doi.org/10.1093/database/baae055

Which quality filters are applied to the variant files (VCFs) of individuals?

First, the bioinformatics pipeline applies hard filters to each VCF. Specifically, it excludes variants whose FILTER field is not PASS, whose genotype is 0/0, whose allele balance is lower than 0.2, or whose depth of coverage is lower than the custom threshold defined by the submitter (the minimum allowed threshold is 10).
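As an illustration only, the hard filters listed above could be expressed as a single predicate. This is a minimal sketch, not the SpadaHC pipeline itself: the `Call` record and its field names are hypothetical, and treating a phased `0|0` genotype like `0/0` is an assumption.

```python
from dataclasses import dataclass

MIN_ALLOWED_DEPTH = 10  # lowest depth threshold a submitter may define


@dataclass
class Call:
    """Simplified, hypothetical view of one variant call in a single-sample VCF."""
    filter_field: str      # VCF FILTER column
    genotype: str          # GT value, e.g. "0/1"
    allele_balance: float  # fraction of reads supporting the alternate allele
    depth: int             # read depth at the site


def passes_hard_filters(call: Call, submitter_min_depth: int = MIN_ALLOWED_DEPTH) -> bool:
    """Return True when the call survives all four hard filters."""
    # The submitter threshold can never go below the global minimum of 10.
    min_depth = max(submitter_min_depth, MIN_ALLOWED_DEPTH)
    if call.filter_field != "PASS":
        return False
    # Assumption: a phased homozygous-reference genotype is treated like 0/0.
    if call.genotype.replace("|", "/") == "0/0":
        return False
    if call.allele_balance < 0.2:
        return False
    if call.depth < min_depth:
        return False
    return True
```

For example, `passes_hard_filters(Call("PASS", "0/1", 0.45, 30))` keeps the call, while the same call with a submitter threshold of 40 would be excluded.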

Second, the pipeline performs quality checks to maximize the quality of the data entered into the database. Noisy samples (those with an unusually high number of variants) are detected by applying the upper Tukey’s fence (k=5) to the distribution of the number of variants per sample. Similarly, empty samples (those with an unusually low number of variants) are detected by applying the lower Tukey’s fence (k=4) to the same distribution. Moreover, two samples are identified as duplicates when the mean of the proportions of variants that each sample shares with the other is greater than 0.9. Finally, kinship relationships are detected by calculating the relatedness statistic described in Manichaikul et al., Bioinformatics 2010 and implemented in VCFtools.
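The Tukey's fences check described above can be sketched as follows. This is a hedged illustration: the quartile method (here the "inclusive" method of Python's `statistics.quantiles`) and the use of strict inequalities are assumptions, since the text does not specify them, and the function names are hypothetical.

```python
import statistics


def tukey_fences(counts, k_lower=4.0, k_upper=5.0):
    """Return (lower, upper) fences: Q1 - k_lower*IQR and Q3 + k_upper*IQR."""
    # Assumption: quartiles are computed with the "inclusive" method.
    q1, _, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k_lower * iqr, q3 + k_upper * iqr


def flag_samples(variant_counts):
    """Split sample names into (noisy, empty) given {sample: number of variants}."""
    lower, upper = tukey_fences(list(variant_counts.values()))
    noisy = sorted(s for s, n in variant_counts.items() if n > upper)
    empty = sorted(s for s, n in variant_counts.items() if n < lower)
    return noisy, empty
```

With ten samples where eight carry about 100 variants, one carries 1000 and one carries 5, the sample with 1000 variants falls above the upper fence and the sample with 5 falls below the lower fence.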

Noisy, empty, or duplicated samples are not entered into the database. Related samples are entered into the database but, for each group of related samples, only one is included in the allele frequency calculation.
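The duplicate criterion mentioned in the quality checks (mean of the two sharing proportions greater than 0.9) can be illustrated with a small sketch. Representing each sample as a set of variant keys, and returning 0 when a sample has no variants, are assumptions made for the example.

```python
def mean_shared_fraction(variants_a: set, variants_b: set) -> float:
    """Mean of the fraction of A's variants found in B and of B's variants found in A."""
    if not variants_a or not variants_b:
        return 0.0  # assumption: samples without variants are never called duplicates
    shared = len(variants_a & variants_b)
    return (shared / len(variants_a) + shared / len(variants_b)) / 2


def are_duplicates(variants_a: set, variants_b: set, threshold: float = 0.9) -> bool:
    """Apply the 'greater than 0.9' rule to two samples."""
    return mean_shared_fraction(variants_a, variants_b) > threshold
```

Averaging the two directions keeps the measure symmetric, so a small sample fully contained in a much larger one is not automatically flagged.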

Are variants normalized?

Deletions and insertions can be represented at multiple locations when they appear within repeated regions. In SpadaHC, genomic positions are left-normalized, that is, the most 5' representation is used when referring to DNA. However, HGVS cDNA notation employs 3' normalization, a format commonly used by the clinical community. Hence, HGVSc names are right-normalized, so predicted consequences and in-silico predictors are computed after this normalization.
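To make the idea of left-normalization concrete, here is a minimal sketch of the commonly used left-alignment procedure on a toy reference string. It is not SpadaHC's code: the function name is hypothetical, positions are assumed to be 1-based, and variants that would need to be extended past the first base of the sequence are not handled.

```python
def left_normalize(ref_seq: str, pos: int, ref: str, alt: str):
    """Return the left-aligned (pos, ref, alt) of a variant on ref_seq (1-based pos)."""
    if ref == alt:
        raise ValueError("ref and alt alleles must differ")
    changed = True
    while changed:
        changed = False
        # Drop the last base while both alleles end with the same base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend past the start of the sequence")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases, keeping at least one base in each allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

On the toy sequence `TGCACACAG`, deleting one `CA` unit can be written at position 6 (`ACA` to `A`) or at position 2 (`GCA` to `G`); the function returns the latter, which is the most 5' representation.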

How is the SpadaHC allele frequency (Spada AF) computed?

The allele frequency (AF) is the result of dividing the allele count (AC) by the allele number (AN). AN is calculated considering only the individuals in whom the genomic position is covered (according to the gene panel ROIs file provided by the laboratory that submitted the individual).

Also, when two or more individuals have a kinship relationship, only one of them is considered for the AF calculation. The kinship relationship is detected either from the pedigree provided by the submitting laboratory or by calculating the relatedness statistic (cutoff at 0.25) described in Manichaikul et al., Bioinformatics 2010 and implemented in VCFtools.
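The calculation described above can be sketched as follows. Several details are assumptions made for illustration: each covered individual is assumed to contribute two alleles to AN (diploid position), the `family_id` label is a hypothetical way of grouping related individuals, and keeping the first covered member of each group is an arbitrary choice, since the text does not say which relative is retained.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Individual:
    covered: bool                    # position lies inside the individual's panel ROIs
    alt_allele_count: int            # 0, 1 (heterozygous) or 2 (homozygous)
    family_id: Optional[str] = None  # hypothetical label shared by related individuals


def allele_frequency(individuals: Iterable[Individual]) -> Tuple[int, int, float]:
    """Return (AC, AN, AF) using covered individuals and one member per family."""
    seen_families = set()
    ac = an = 0
    for ind in individuals:
        if not ind.covered:
            continue  # uncovered individuals do not contribute to AN
        if ind.family_id is not None:
            if ind.family_id in seen_families:
                continue  # a relative has already been counted
            seen_families.add(ind.family_id)
        ac += ind.alt_allele_count
        an += 2  # assumption: two alleles per covered individual
    return ac, an, (ac / an if an else 0.0)
```

For instance, with three unrelated individuals (one heterozygous, one without the variant, one not covered) and two heterozygous relatives, the sketch gives AC = 2 and AN = 6.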

Why do some variants have an allele frequency of 0?

SpadaHC stores not only variants found in individuals but also variants classified by laboratories or expert groups. If a variant was not found in any individual, its frequency is 0.

Are the variants listed on SpadaHC germline?

Laboratories are requested to provide individuals' variants (VCF format) detected from non-tumor sample DNA sequencing. Also, laboratories and expert groups can provide variant classifications according to guidelines developed for germline variants. Hence, the vast majority of SpadaHC variants should be germline, although mosaic variants could also be present if they were called with a high enough allele balance (>= 20%).

Which set of transcripts is used for variant annotation?

SpadaHC performs variant annotation on RefSeq transcripts using VEP 104 (GRCh37.p13 assembly): fields like HGVSc, HGVSp or Consequence are predicted using this set of transcripts and the HGVSc annotation is checked using Mutalyzer. Additionally, related LRG transcripts are shown in the transcripts view.

However, VEP produces a wrong HGVSc annotation when there are sequence discrepancies between RefSeq transcripts (annotated on mRNA sequences) and Ensembl transcripts (annotated on the reference genome) (Example: 11 67258391 A/G). In these cases, a symbol is shown to the left of the HGVSc field to note that HGVSc annotation produced by VEP is wrong.

How is the main transcript selected in the Variants section?

In the Variants section, only one transcript is shown per line (the main transcript), although other transcripts can be displayed using the corresponding icon. When a variant is annotated on a single gene, the main transcript shown is the one that matches the first Locus Reference Genomic (LRG) transcript in that gene. If no LRG transcript is available for a certain gene, then the transcript shown is the one labeled as Canonical by the Ensembl Variant Effect Predictor (VEP) tool.

If a variant is annotated on multiple genes, the above criterion is used for selecting the preferred transcripts of each gene. The transcript selected to be shown will be the one with the most severe consequence of the preferred transcripts.
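The selection rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the site's implementation: the `Transcript` record, its `lrg_rank` field (1 for the first LRG transcript of a gene) and the abbreviated `SEVERITY` list are hypothetical; a real ranking would follow VEP's full consequence order.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical, abbreviated ranking (most severe first).
SEVERITY = ["stop_gained", "frameshift_variant", "missense_variant",
            "synonymous_variant", "intron_variant"]


@dataclass
class Transcript:
    transcript_id: str
    gene: str
    consequence: str
    lrg_rank: Optional[int] = None  # position among the gene's LRG transcripts, if any
    canonical: bool = False         # labeled Canonical by VEP


def severity_rank(t: Transcript) -> int:
    """Lower is more severe; unknown consequences rank last."""
    return SEVERITY.index(t.consequence) if t.consequence in SEVERITY else len(SEVERITY)


def preferred_transcript(gene_transcripts: List[Transcript]) -> Optional[Transcript]:
    """First LRG transcript of the gene, else the VEP canonical transcript."""
    with_lrg = [t for t in gene_transcripts if t.lrg_rank is not None]
    if with_lrg:
        return min(with_lrg, key=lambda t: t.lrg_rank)
    canonical = [t for t in gene_transcripts if t.canonical]
    return canonical[0] if canonical else None


def main_transcript(transcripts: List[Transcript]) -> Optional[Transcript]:
    """Pick each gene's preferred transcript, then the most severe among them."""
    by_gene = {}
    for t in transcripts:
        by_gene.setdefault(t.gene, []).append(t)
    preferred = [p for p in map(preferred_transcript, by_gene.values()) if p is not None]
    return min(preferred, key=severity_rank) if preferred else None
```

Note that severity is only compared between the preferred transcripts of different genes; within a gene, the LRG or canonical transcript is chosen regardless of its consequence.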

How does SpadaHC deal with GRCh37 and GRCh38 genome assemblies?

SpadaHC supports exploring variants in both GRCh37 and GRCh38 genome assemblies. To date, all submitted variants in SpadaHC were called using the GRCh37 assembly. SpadaHC obtains the GRCh38 coordinates by, first, calling CrossMap to lift over genome coordinates and, second, checking that the genomic reference base remains the same at the lifted-over position. If either of the two steps fails, the variant will have no representation on the GRCh38 assembly (Example: 12 25368462 C/T).
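The two-step logic above can be outlined as below. This sketch does not call CrossMap: `lift_over` and `grch38_sequence` are hypothetical callables standing in for the liftover step and for a GRCh38 reference lookup, and strand changes between assemblies are ignored for simplicity.

```python
from typing import Callable, Optional, Tuple

Coordinate = Tuple[str, int]  # (chromosome, 1-based position)


def to_grch38(
    chrom: str,
    pos: int,
    ref: str,
    lift_over: Callable[[str, int], Optional[Coordinate]],
    grch38_sequence: Callable[[str, int, int], str],
) -> Optional[Coordinate]:
    """Return the GRCh38 coordinate, or None when either step fails."""
    # Step 1: coordinate liftover (hypothetical stand-in for the CrossMap call).
    lifted = lift_over(chrom, pos)
    if lifted is None:
        return None
    new_chrom, new_pos = lifted
    # Step 2: the reference allele must be unchanged at the lifted-over position.
    if grch38_sequence(new_chrom, new_pos, len(ref)) != ref:
        return None
    return lifted
```

Passing the two lookups as arguments keeps the check independent of any particular liftover tool or reference file format.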

Which parameter values were used to run SpliceAI?

SpliceAI values were obtained by running the SpliceAI model. The parameter values used were:

  • Score type: masked. Splicing changes corresponding to strengthening annotated splice sites and weakening unannotated splice sites are typically much less pathogenic than weakening annotated splice sites and strengthening unannotated splice sites. The masked score option hides the score for such splicing changes and shows 0 instead, which is recommended for variant interpretation (Source).
  • Max distance: 4999. It sets the maximum distance between the variant and the gained/lost splice site.

How was MaxEntScan used to predict low/moderate/high splicing effect?

The splicing effect was computed using the MaxEntScan plugin for VEP. This tool encapsulates the MaxEntScan model and extends its functionality. The algorithm used to assign a low, moderate or high effect is explained in (Shamsani, 2019) (Fig 1B). Note that donor loss and acceptor loss are only calculated for variants spanning the native splice regions: the [-20, 3] interval for the acceptor region and the [-3, 6] interval for the donor region.

Do clinical suspicion terms belong to any known ontology?

Clinical suspicion terms in SpadaHC were obtained from the MedGen ontology, which is explicitly allowed in ClinVar submissions. The terms, their synonyms and their MedGen UIDs, as provided by MedGen, are listed below.

Clinical suspicion term | Synonyms | MedGen UID
Hereditary breast ovarian cancer syndrome | Breast and ovarian cancer; Hereditary breast and ovarian cancer; Hereditary breast and ovarian cancer syndrome; Hereditary breast and ovarian cancer syndrome (HBOC) | 151793
Hereditary gastric cancer | Hereditary cancer of stomach; hereditary cancer of stomach; hereditary gastric cancer | 944393
Hereditary cancer | Familial Cancer; Familial Malignant Neoplasm; Hereditary Cancer; Hereditary Malignant Neoplasm | 232504
Hereditary pheochromocytoma-paraganglioma | Hereditary Paraganglioma-Pheochromocytoma Syndromes; Hereditary Paragangliomas and Pheochromocytomas | 313270
Familial medullary thyroid carcinoma | MTC; MTC, familial; NTRK1-Related Familial Medullary Thyroid Carcinoma; Thyroid cancer, familial medullary | 322311
Gorlin syndrome | Basal cell nevus syndrome; Nevoid Basal Cell Carcinoma Syndrome | 2554
Hereditary nonpolyposis colon cancer | Familial nonpolyposis colon cancer; Hereditary nonpolyposis colorectal cancer; Hereditary Nonpolyposis Colorectal Cancer Syndrome; HNPCC | 232602
Familial multiple polyposis syndrome | Classic familial adenomatous polyposis; Familial adenomatous polyposis; Familial adenomatous polyposis of the colon; Familial intestinal polyposis; Familial multiple polyposis; Familial polyposis; Familial polyposis of the colon; FAP; Hereditary polyposis coli | 46010
Familial cancer of breast | Breast cancer, familial | 87542
Familial melanoma | Hereditary cutaneous melanoma; Hereditary melanoma | 268851
Familial ovarian cancer | Familial cancer of ovary; Familial malignant neoplasm of ovary; familial ovarian cancer; Familial ovarian malignant tumor; familial ovarian malignant tumor; hereditary ovarian cancer | 1803368
Familial prostate carcinoma | Familial prostate cancer; Hereditary prostate cancer | 419810
Familial pancreatic carcinoma | Familial Pancreatic Cancer | 419700
Hereditary renal cell carcinoma | Familial Renal Cancer; familial renal carcinoma; hereditary renal carcinoma; hereditary renal cell cancer; Hereditary Renal Cell Carcinoma; hereditary renal cell carcinoma; hereditary renal cell carcinoma (disease) | 392857
Neurofibromatosis | Multiple Neurofibroma; Multiple Neurofibromas; Neurofibroma, Multiple; Neurofibromas, Multiple; Neurofibromatoses; Neurofibromatosis Syndrome; Neurofibromatosis Syndromes; Syndrome, Neurofibromatosis; Syndromes, Neurofibromatosis | 58149
Multiple endocrine neoplasia, type 1 | Endocrine adenomatosis multiple; MEA I; MEN 1; MEN I; MEN1; Wermer syndrome | 9957
Li-Fraumeni syndrome | LFS; Sarcoma family syndrome of Li and Fraumeni | 88399
Hereditary hyperparathyroidism | Genetic hyperparathyroidism | 1843372