SpadaHC - Spanish variant database for hereditary cancer

Frequently asked questions

General FAQs:

How to cite discoveries made using SpadaHC data?
Which quality filter do the variant files of individuals (VCFs) undergo?
Are variants normalized?
How is the SpadaHC allele frequency (Spada AF) computed?
Why do some variants have an allele frequency of 0?
Are the variants listed on SpadaHC germline?
Which set of transcripts is used for variant annotation?
How is the main transcript selected in the Variants section?
How does SpadaHC deal with GRCh37 and GRCh38 genome assemblies?
Which parameter values were used to run SpliceAI?
How was MaxEntScan used to predict low/moderate/high splicing effect?
Do clinical suspicion terms belong to any known ontology?

General FAQs

How to cite discoveries made using SpadaHC data?

Please cite the SpadaHC paper:

José M Moreno-Cabrera, Lidia Feliubadaló, Marta Pineda, Patricia Prada-Dacasa, Mireia Ramos-Muntada, Jesús Del Valle, Joan Brunet, Bernat Gel, María Currás-Freixes, Bruna Calsina, Milton E Salazar-Hidalgo, Marta Rodríguez-Balada, Bàrbara Roig, Sara Fernández-Castillejo, Mercedes Durán Domínguez, Mónica Arranz Ledo, Mar Infante Sanz, Adela Castillejo, Estela Dámaso, José L Soto, Montserrat de Miguel, Beatriz Hidalgo Calero, José M Sánchez-Zapardiel, Teresa Ramon Y Cajal, Adriana Lasa, Alexandra Gisbert-Beamud, Anael López-Novo, Clara Ruiz-Ponte, Miriam Potrony, María I Álvarez-Mora, Ana Osorio, Isabel Lorda-Sánchez, Mercedes Robledo, Alberto Cascón, Anna Ruiz, Nino Spataro, Imma Hernan, Emma Borràs, Alejandro Moles-Fernández, Julie Earl, Juan Cadiñanos, Ana B Sánchez-Heras, Anna Bigas, Gabriel Capellá, Conxi Lázaro, The SpadaHC Consortium. SpadaHC: a database to improve the classification of variants in hereditary cancer genes in the Spanish population, Database, Volume 2024, 2024, baae055, https://doi.org/10.1093/database/baae055

Which quality filter do the variant files of individuals (VCFs) undergo?

First, the bioinformatics pipeline applies hard filters to each VCF. Specifically, it excludes variants with a FILTER field other than PASS, a genotype equal to 0/0, an allele balance lower than 0.2, or a depth of coverage lower than the custom threshold defined by the submitter (the minimum is 10).
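The four hard filters can be sketched as a single predicate. This is only an illustration under stated assumptions: the `Call` class and its field names are hypothetical, and it assumes that a submitter threshold below 10 is raised to 10. It is not the pipeline's actual code.

```python
from dataclasses import dataclass

MIN_DEPTH_FLOOR = 10  # lowest depth threshold allowed, per the text above


@dataclass
class Call:
    """Hypothetical, simplified view of one variant call in a single-sample VCF."""
    filter_field: str      # VCF FILTER column, e.g. "PASS"
    genotype: str          # e.g. "0/1"
    allele_balance: float  # fraction of reads supporting the alternate allele
    depth: int             # read depth at the site


def passes_hard_filters(call: Call, submitter_min_depth: int = MIN_DEPTH_FLOOR) -> bool:
    """Return True when the call survives all four hard filters."""
    # Assumption: a submitter threshold below the floor is raised to the floor.
    min_depth = max(submitter_min_depth, MIN_DEPTH_FLOOR)
    if call.filter_field != "PASS":
        return False
    if call.genotype == "0/0":
        return False
    if call.allele_balance < 0.2:
        return False
    if call.depth < min_depth:
        return False
    return True
```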

Second, the pipeline performs quality checks to maximize the quality of the data entered into the database. Noisy samples (those with an unusually high number of variants) are detected by applying the upper threshold of Tukey’s fences (k=5) to the distribution of the number of variants per sample. Similarly, empty samples (those with an unusually low number of variants) are detected by applying the lower threshold of Tukey’s fences (k=4) to the same distribution. Moreover, two samples are identified as duplicates when the mean of the proportions of variants that each sample shares with the other is greater than 0.9. Finally, kinship relationships are detected by calculating the relatedness statistic described in Manichaikul et al., Bioinformatics 2010 and implemented in VCFtools.
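As a rough illustration of the noisy/empty check, the sketch below computes Tukey's fences (Q1 - k*IQR and Q3 + k*IQR) over per-sample variant counts. The function names are hypothetical, and the quartile method (`inclusive`) is an assumption, since the text does not say how quartiles are estimated.

```python
import statistics


def tukey_fences(counts, k_lower=4.0, k_upper=5.0):
    """Return (lower, upper) fences: Q1 - k_lower*IQR and Q3 + k_upper*IQR."""
    # Assumption: quartiles use the "inclusive" method of statistics.quantiles.
    q1, _, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k_lower * iqr, q3 + k_upper * iqr


def classify_samples(variant_counts):
    """Map each sample name to 'noisy', 'empty' or 'ok' from its variant count."""
    lower, upper = tukey_fences(list(variant_counts.values()))
    labels = {}
    for sample, n in variant_counts.items():
        if n > upper:
            labels[sample] = "noisy"
        elif n < lower:
            labels[sample] = "empty"
        else:
            labels[sample] = "ok"
    return labels
```

With such large values of k, only samples far outside the bulk of the distribution are flagged, which fits the stated goal of catching rare extremes rather than ordinary variation.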

Noisy, empty, or duplicated samples are not entered into the database. Related samples are entered into the database but, for each group of related samples, only one is included in the allele frequency calculation.
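The duplicate criterion described above can be sketched as follows. The sketch assumes that each sample is reduced to a set of variant keys such as (chromosome, position, ref, alt) and that genotypes are ignored; both are assumptions, and the function names are hypothetical.

```python
def shared_fraction_mean(variants_a, variants_b):
    """Mean of the fraction of A's variants found in B and of B's found in A."""
    a, b = set(variants_a), set(variants_b)
    if not a or not b:
        return 0.0  # assumption: an empty sample is never called a duplicate
    shared = len(a & b)
    return (shared / len(a) + shared / len(b)) / 2


def are_duplicates(variants_a, variants_b, threshold=0.9):
    """Apply the 'greater than 0.9' rule from the text."""
    return shared_fraction_mean(variants_a, variants_b) > threshold
```

Averaging the two directions keeps a small sample that is fully contained in a much larger one from being called a duplicate on that basis alone.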

Are variants normalized?

Deletions and insertions can be represented at multiple locations when they occur within repeated regions. In SpadaHC, genomic positions are left-normalized, that is, the most 5' representation is used when referring to DNA. However, HGVS cDNA notation employs 3' normalization, a format commonly used by the clinical community. Hence, HGVSc names are right-normalized, and predicted consequences and in-silico predictors are computed after this normalization.
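To make left-normalization concrete, here is a minimal sketch of the widely used left-align-and-trim procedure applied to a toy reference string. It is a generic illustration, not SpadaHC's implementation; the function name and the toy sequence are assumptions.

```python
def left_normalize(seq, pos, ref, alt):
    """Left-align and trim a variant; pos is 1-based and seq is the reference."""
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base, which shifts the variant to the left.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, prepend the preceding reference base.
        if not ref or not alt:
            if pos == 1:
                break  # nothing to the left of the first base
            pos -= 1
            base = seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference with a CA repeat: deleting any one CA unit gives the same sequence.
TOY_REF = "GGCACACAT"
print(left_normalize(TOY_REF, 6, "ACA", "A"))  # deletion of the last CA unit
```

In the toy example the deletion written at the last CA unit is moved to the first one, so every equivalent representation collapses to a single genomic record. HGVS 3' normalization shifts in the opposite direction along the transcript.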

How is the SpadaHC allele frequency (Spada AF) computed?

The allele frequency (AF) is the result of dividing the allele count (AC) by the allele number (AN). AN is calculated considering only the individuals in whom the genomic position is covered (according to the gene panel ROIs file provided by the laboratory that submitted the individual).

Also, when two or more individuals are related, only one of them is considered for AF calculation. Kinship is detected either from the pedigree provided by the submitting laboratory or by calculating the relatedness statistic (cutoff at 0.25) described in Manichaikul et al., Bioinformatics 2010 and implemented in VCFtools.
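The calculation can be sketched as below. Several details are assumptions made for illustration only: the `Individual` class and its fields are hypothetical, each covered individual is assumed to contribute two alleles (diploid site), and the representative of a kinship group is simply the first covered member encountered, since the text does not say how that individual is chosen.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Individual:
    name: str
    rois: list  # [(chrom, start, end)], 1-based inclusive panel regions
    alt_counts: dict = field(default_factory=dict)  # variant -> 1 (het) or 2 (hom)
    kin_group: Optional[str] = None  # shared label for related individuals


def is_covered(ind, chrom, pos):
    """True when the position falls inside one of the individual's panel ROIs."""
    return any(c == chrom and start <= pos <= end for c, start, end in ind.rois)


def spada_af(individuals, variant):
    """Return (AF, AC, AN) for variant = (chrom, pos, ref, alt)."""
    chrom, pos, _ref, _alt = variant
    seen_groups = set()
    ac = an = 0
    for ind in individuals:
        if not is_covered(ind, chrom, pos):
            continue  # uncovered individuals do not contribute to AN
        if ind.kin_group is not None:
            if ind.kin_group in seen_groups:
                continue  # a relative has already been counted
            seen_groups.add(ind.kin_group)
        an += 2  # assumption: diploid site
        ac += ind.alt_counts.get(variant, 0)
    return (ac / an if an else 0.0), ac, an
```

Restricting AN to covered individuals matters because laboratories use different gene panels: a variant in a gene sequenced by only a few laboratories would otherwise appear much rarer than it is.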

Why do some variants have an allele frequency of 0?

SpadaHC stores not only variants found in individuals but also variants classified by laboratories or expert groups. If a variant was not found in any individual, its frequency is 0.

Are the variants listed on SpadaHC germline?

Laboratories are requested to provide individuals' variants (VCF format) detected from non-tumor sample DNA sequencing. Also, laboratories and expert groups can provide variant classifications according to guidelines developed for germline variants. Hence, the vast majority of SpadaHC variants should be germline, although mosaic variants could also be present if they were called with a high enough allele balance (>= 20%).

Which set of transcripts is used for variant annotation?

SpadaHC performs variant annotation on RefSeq transcripts using VEP 104 (GRCh37.p13 assembly): fields like HGVSc, HGVSp or Consequence are predicted using this set of transcripts and the HGVSc annotation is checked using Mutalyzer. Additionally, related LRG transcripts are shown in the transcripts view.

However, VEP produces an incorrect HGVSc annotation when there are sequence discrepancies between RefSeq transcripts (annotated on mRNA sequences) and Ensembl transcripts (annotated on the reference genome) (Example: 11 67258391 A/G). In these cases, a symbol is shown to the left of the HGVSc field to indicate that the HGVSc annotation produced by VEP is incorrect.

How is the main transcript selected in the Variants section?

In the Variants section, only one transcript is shown per line (the main transcript), although other transcripts can be displayed using the icon provided for that purpose. When a variant is annotated on a single gene, the main transcript shown is the one that matches the first Locus Reference Genomic (LRG) transcript in that gene. If no LRG transcript is available for a given gene, the transcript shown is the one labeled as Canonical by the Ensembl Variant Effect Predictor (VEP) tool.

If a variant is annotated on multiple genes, the above criterion is used for selecting the preferred transcripts of each gene. The transcript selected to be shown will be the one with the most severe consequence of the preferred transcripts.
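The selection rule can be sketched as follows. The `Transcript` class, the transcript identifiers and the short `SEVERITY` list are hypothetical; the list is only an illustrative subset ordered from most to least severe, not the full ranking used by VEP or SpadaHC.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative subset of consequence terms, most severe first (assumption).
SEVERITY = ["stop_gained", "frameshift_variant", "missense_variant",
            "synonymous_variant", "intron_variant"]
RANK = {term: i for i, term in enumerate(SEVERITY)}


@dataclass
class Transcript:
    transcript_id: str
    gene: str
    consequence: str
    lrg_index: Optional[int] = None  # 1 for the first LRG transcript, else None
    canonical: bool = False          # labeled Canonical by VEP


def preferred_for_gene(transcripts):
    """First LRG transcript if any, otherwise the VEP canonical transcript."""
    lrg = [t for t in transcripts if t.lrg_index is not None]
    if lrg:
        return min(lrg, key=lambda t: t.lrg_index)
    canonical = [t for t in transcripts if t.canonical]
    return canonical[0] if canonical else None


def main_transcript(transcripts):
    """Pick each gene's preferred transcript, then the most severe among them."""
    by_gene = {}
    for t in transcripts:
        by_gene.setdefault(t.gene, []).append(t)
    preferred = [p for p in map(preferred_for_gene, by_gene.values()) if p]
    if not preferred:
        return None
    # Unknown consequence terms are ranked as least severe.
    return min(preferred, key=lambda t: RANK.get(t.consequence, len(SEVERITY)))
```

Note that severity is compared only among the preferred transcripts, so a more severe consequence on a non-preferred transcript of the same gene does not change the choice.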

How does SpadaHC deal with GRCh37 and GRCh38 genome assemblies?

SpadaHC supports exploring variants in both the GRCh37 and GRCh38 genome assemblies. To date, all submitted variants in SpadaHC were called using the GRCh37 assembly. SpadaHC obtains the GRCh38 coordinates by, first, calling CrossMap to lift over the genome coordinates and, second, checking that the genomic reference base remains the same at the lifted-over position. If either of the two steps fails, the variant will have no representation on the GRCh38 assembly (Example: 12 25368462 C/T).
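The two-step logic can be sketched with the liftover and the reference lookup passed in as functions. Both callables are hypothetical stand-ins (the sketch does not call CrossMap itself), the coordinates in the usage example are toy values, and strand changes between assemblies are not handled.

```python
def to_grch38(variant, lift, ref_seq_at):
    """Return the GRCh38 variant, or None when either step fails.

    variant:    (chrom, pos, ref, alt) on GRCh37
    lift:       hypothetical callable (chrom, pos) -> (chrom, pos) or None
    ref_seq_at: hypothetical callable (chrom, pos, length) -> GRCh38 sequence
    """
    chrom, pos, ref, alt = variant
    lifted = lift(chrom, pos)
    if lifted is None:
        return None  # step 1 failed: the position could not be lifted over
    chrom38, pos38 = lifted
    if ref_seq_at(chrom38, pos38, len(ref)) != ref:
        return None  # step 2 failed: the reference allele changed
    return (chrom38, pos38, ref, alt)


# Toy stand-ins for a chain-file lookup and a GRCh38 FASTA lookup.
toy_chain = {("T", 100): ("T", 150), ("T", 200): ("T", 260)}
toy_genome38 = {("T", 150): "C", ("T", 260): "G"}
lift = lambda c, p: toy_chain.get((c, p))
ref_seq_at = lambda c, p, n: toy_genome38.get((c, p), "")[:n]

print(to_grch38(("T", 100, "C", "T"), lift, ref_seq_at))
```

Checking the reference base after liftover guards against positions where the two assemblies differ in sequence, in which case the same ref/alt pair would describe a different change.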

Which parameter values were used to run SpliceAI?

SpliceAI values were obtained by running the SpliceAI model. The parameter values used were:

Score type: masked. Splicing changes corresponding to strengthening annotated splice sites and weakening unannotated splice sites are typically much less pathogenic than weakening annotated splice sites and strengthening unannotated splice sites. The masked score option hides the score for such splicing changes and shows 0 instead, which is recommended for variant interpretation (Source).
Max distance: 4999. It sets the maximum distance between the variant and gained/lost splice site.
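For readers who want to reproduce these settings, the sketch below builds an equivalent invocation of the public `spliceai` command-line tool, where `-D` sets the maximum distance and `-M 1` selects masked scores. It assumes that tool was the one used; the file names and the `grch37` annotation choice are placeholders, not details stated by SpadaHC.

```python
import shlex


def spliceai_command(input_vcf, output_vcf, reference_fasta,
                     annotation="grch37", max_distance=4999, masked=True):
    """Build the argument list for the spliceai CLI with the FAQ's parameters."""
    return [
        "spliceai",
        "-I", input_vcf,          # input VCF (placeholder name)
        "-O", output_vcf,         # output VCF (placeholder name)
        "-R", reference_fasta,    # reference genome FASTA (placeholder name)
        "-A", annotation,         # gene annotation; grch37 is an assumption
        "-D", str(max_distance),  # max distance between variant and splice site
        "-M", "1" if masked else "0",  # 1 = masked scores, 0 = raw scores
    ]


cmd = spliceai_command("input.vcf", "output.vcf", "genome.fa")
print(shlex.join(cmd))
```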

How was MaxEntScan used to predict low/moderate/high splicing effect?

The splicing effect was computed using the MaxEntScan plugin for VEP. This tool encapsulates the MaxEntScan model and extends its functionality. The algorithm used to assign a low, moderate or high effect is explained in (Shamsani, 2019) (Fig 1B). Note that donor loss and acceptor loss are only calculated for variants spanning the native splice regions: the [-20, 3] interval for the acceptor region and the [-3, 6] interval for the donor region.

Do clinical suspicion terms belong to any known ontology?

Clinical suspicion terms in SpadaHC were obtained from the MedGen ontology, which is explicitly allowed in ClinVar submissions. The terms, their synonyms as provided by MedGen, and their MedGen UIDs are listed below.

Clinical suspicion term	Synonyms	MedGen UID
Hereditary breast ovarian cancer syndrome	Breast and ovarian cancer; Hereditary breast and ovarian cancer; Hereditary breast and ovarian cancer syndrome; Hereditary breast and ovarian cancer syndrome (HBOC)	151793
Familial multiple polyposis syndrome	Classic familial adenomatous polyposis; Familial adenomatous polyposis; Familial adenomatous polyposis of the colon; Familial intestinal polyposis; Familial multiple polyposis; Familial polyposis; Familial polyposis of the colon; FAP; Hereditary polyposis coli	46010
Hereditary gastric cancer	Hereditary cancer of stomach; hereditary cancer of stomach; hereditary gastric cancer	1843054
Hereditary pheochromocytoma-paraganglioma	Hereditary Paraganglioma-Pheochromocytoma Syndromes; Hereditary Paragangliomas and Pheochromocytomas	313270
Neurofibromatosis	Multiple Neurofibroma; Multiple Neurofibromas; Neurofibroma, Multiple; Neurofibromas, Multiple; Neurofibromatoses; Neurofibromatosis Syndrome; Neurofibromatosis Syndromes; Syndrome, Neurofibromatosis; Syndromes, Neurofibromatosis	58149
Familial medullary thyroid carcinoma	MTC; MTC, familial; NTRK1-Related Familial Medullary Thyroid Carcinoma; Thyroid cancer, familial medullary	322311
Gorlin syndrome	Basal cell nevus syndrome; Nevoid Basal Cell Carcinoma Syndrome	2554
Hereditary nonpolyposis colon cancer	Familial nonpolyposis colon cancer; Hereditary nonpolyposis colorectal cancer; Hereditary Nonpolyposis Colorectal Cancer Syndrome; HNPCC	232602
Hereditary cancer-predisposing syndrome	Cancer predisposition; Hereditary Cancer Syndrome; Hereditary neoplastic syndrome; Neoplastic Syndromes, Hereditary; PALB2-Related Cancer Susceptibility; Tumor predisposition	14326
Familial cancer of breast	Breast cancer, familial	87542
Familial melanoma	Hereditary cutaneous melanoma; Hereditary melanoma	268851
Familial ovarian cancer	Familial cancer of ovary; Familial malignant neoplasm of ovary; familial ovarian cancer; Familial ovarian malignant tumor; familial ovarian malignant tumor; hereditary ovarian cancer	1803368
Familial prostate cancer	Familial prostate cancer; Hereditary prostate cancer	419810
Familial pancreatic carcinoma	Familial Pancreatic Cancer	419700
Hereditary renal cell carcinoma	Familial Renal Cancer; familial renal carcinoma; hereditary renal carcinoma; hereditary renal cell cancer; Hereditary Renal Cell Carcinoma; hereditary renal cell carcinoma; hereditary renal cell carcinoma (disease)	392857
Multiple endocrine neoplasia, type 1	Endocrine adenomatosis multiple; MEA I; MEN 1; MEN I; MEN1; Wermer syndrome	9957
Li-Fraumeni syndrome	LFS; Sarcoma family syndrome of Li and Fraumeni	88399
Hereditary hyperparathyroidism	Genetic hyperparathyroidism	1843372