Please cite the SpadaHC paper:
José M Moreno-Cabrera, Lidia Feliubadaló, Marta Pineda, Patricia Prada-Dacasa, Mireia Ramos-Muntada,
Jesús Del Valle, Joan Brunet, Bernat Gel, María Currás-Freixes, Bruna Calsina, Milton E Salazar-Hidalgo,
Marta Rodríguez-Balada, Bàrbara Roig, Sara Fernández-Castillejo, Mercedes Durán Domínguez, Mónica Arranz Ledo,
Mar Infante Sanz, Adela Castillejo, Estela Dámaso, José L Soto, Montserrat de Miguel, Beatriz Hidalgo Calero,
José M Sánchez-Zapardiel, Teresa Ramon Y Cajal, Adriana Lasa, Alexandra Gisbert-Beamud, Anael López-Novo,
Clara Ruiz-Ponte, Miriam Potrony, María I Álvarez-Mora, Ana Osorio, Isabel Lorda-Sánchez, Mercedes Robledo,
Alberto Cascón, Anna Ruiz, Nino Spataro, Imma Hernan, Emma Borràs, Alejandro Moles-Fernández, Julie Earl,
Juan Cadiñanos, Ana B Sánchez-Heras, Anna Bigas, Gabriel Capellá, Conxi Lázaro,
The SpadaHC Consortium. SpadaHC: a database to improve the classification of variants in hereditary
cancer genes in the Spanish population, Database, Volume 2024, 2024, baae055,
https://doi.org/10.1093/database/baae055
First, the bioinformatics pipeline applies hard filters to each VCF. Specifically, it excludes variants whose FILTER field is not PASS, whose genotype is 0/0, whose allele balance is lower than 0.2, or whose depth of coverage is lower than the custom threshold defined by the submitter (the minimum is 10).
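The hard filters above can be sketched as a single predicate. This is a minimal illustration, not the pipeline's actual code: the `Call` record, its field names, and the `passes_hard_filters` function are hypothetical, and treating the phased genotype `0|0` like `0/0` is an assumption.

```python
from dataclasses import dataclass

MIN_ALLOWED_DEPTH = 10  # lowest depth threshold a submitter may set (from the text)


@dataclass
class Call:
    """Hypothetical per-sample variant call; field names are illustrative."""
    filter_field: str      # VCF FILTER column
    genotype: str          # e.g. "0/1"
    allele_balance: float  # fraction of reads supporting the variant allele
    depth: int             # depth of coverage at the position


def passes_hard_filters(call: Call, submitter_min_depth: int = MIN_ALLOWED_DEPTH) -> bool:
    """Return True when the call survives all four hard filters."""
    # The submitter threshold can never go below the global minimum of 10.
    min_depth = max(submitter_min_depth, MIN_ALLOWED_DEPTH)
    if call.filter_field != "PASS":
        return False
    # Assumption: a phased homozygous-reference call is excluded as well.
    if call.genotype in ("0/0", "0|0"):
        return False
    if call.allele_balance < 0.2:
        return False
    if call.depth < min_depth:
        return False
    return True
```

For example, `passes_hard_filters(Call("PASS", "0/1", 0.45, 15), submitter_min_depth=20)` is rejected because the depth is below the submitter's own threshold.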
Second, the pipeline performs quality checks on the data before it is entered into the database. Noisy samples, those with an unusually high number of variants, are detected by applying the upper Tukey’s fence (k=5) to the distribution of the number of variants per sample. Similarly, empty samples, those with an unusually low number of variants, are detected by applying the lower Tukey’s fence (k=4) to the same distribution. Moreover, two samples are identified as duplicates when the mean of the percentages of variants that each sample shares with the other is greater than 0.9. Finally, kinship relationships are detected by calculating the relatedness statistic described in Manichaikul et al., Bioinformatics 2010, as implemented in VCFtools.
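As an illustration of the noisy and empty sample checks, Tukey's fences place the thresholds at Q1 - k·IQR and Q3 + k·IQR. The sketch below makes two assumptions the text does not state: quartiles are computed with Python's `statistics.quantiles` using the inclusive method, and a sample is flagged only when it lies strictly beyond a fence. The function names are hypothetical.

```python
import statistics


def tukey_fences(counts, k_lower=4.0, k_upper=5.0):
    """Return (lower_fence, upper_fence) for a list of per-sample variant counts."""
    # Assumption: inclusive quartile definition; other definitions shift the fences slightly.
    q1, _, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k_lower * iqr, q3 + k_upper * iqr


def classify_samples(variant_counts):
    """Split sample IDs into (noisy, empty) sets from a {sample_id: n_variants} dict."""
    lower, upper = tukey_fences(list(variant_counts.values()))
    noisy = {s for s, n in variant_counts.items() if n > upper}
    empty = {s for s, n in variant_counts.items() if n < lower}
    return noisy, empty
```

With k=5 and k=4 the fences are much wider than the usual k=1.5, so only extreme samples are flagged.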
Noisy, empty, or duplicated samples are not entered into the database. Related samples are entered into the database but, for each group of related samples, only one is included in the allele frequency calculation.
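The duplicate check described above (mean of the two sharing percentages greater than 0.9) could be sketched as follows. Variants are represented here as hashable keys in Python sets; that representation, the function names, and the handling of samples with no variants are assumptions made for the example.

```python
def mean_shared_fraction(variants_a: set, variants_b: set) -> float:
    """Mean of the fraction of A's variants found in B and of B's variants found in A."""
    # Assumption: a sample without variants cannot be called a duplicate.
    if not variants_a or not variants_b:
        return 0.0
    shared = len(variants_a & variants_b)
    return (shared / len(variants_a) + shared / len(variants_b)) / 2


def are_duplicates(variants_a: set, variants_b: set, threshold: float = 0.9) -> bool:
    """Two samples are duplicates when the mean shared fraction exceeds the threshold."""
    return mean_shared_fraction(variants_a, variants_b) > threshold
```

Averaging the two directions keeps the measure symmetric, so a small sample that is a subset of a much larger one does not automatically count as its duplicate.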
Deletions and insertions can be represented at multiple locations when they appear within repeated regions. In SpadaHC, genomic positions are left-normalized, that is, the most 5' representation is used when referring to DNA. However, HGVS cDNA notation employs 3' normalization, a format commonly used by the clinical community. Hence, HGVSc names are right-normalized, so predicted consequences and in-silico predictors are computed after this normalization.
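To make the two normalizations concrete, the sketch below shifts a deletion inside a repeat to its leftmost and rightmost equivalent positions on a plain string. It is a simplified illustration with hypothetical names: it handles deletions only, uses 0-based coordinates, and ignores strand (for a gene on the reverse strand, the 3' direction of the transcript runs leftwards on the genome).

```python
def shift_deletion(seq: str, start: int, length: int, direction: str) -> int:
    """Return the 0-based start of the equivalent deletion shifted fully left or right."""
    if direction == "left":
        # Moving one base left is equivalent when the base entering the window
        # equals the base leaving it.
        while start > 0 and seq[start - 1] == seq[start + length - 1]:
            start -= 1
    elif direction == "right":
        while start + length < len(seq) and seq[start] == seq[start + length]:
            start += 1
    else:
        raise ValueError("direction must be 'left' or 'right'")
    return start


def apply_deletion(seq: str, start: int, length: int) -> str:
    """Return the sequence with `length` bases removed from `start`."""
    return seq[:start] + seq[start + length:]
```

In `GGCACACATT`, deleting one `CA` unit gives the same result at starts 2, 4 and 6; the left-normalized representation is 2 and the right-normalized one is 6.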
The allele frequency (AF) is the result of dividing the allele count (AC) by the allele number (AN). AN is calculated considering only the individuals in whom the genomic position is covered (according to the gene panel ROIs file provided by the laboratory that submitted the individual).
Also, when two or more individuals have a kinship relationship, only one individual is considered for AF calculation. The kinship relationship is detected either from the pedigree provided by the submitting laboratory or by calculating the relatedness statistic (cutoff at 0.25) described in Manichaikul et al., Bioinformatics 2010, as implemented in VCFtools.
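One way to turn pairwise relatedness values into groups of related individuals is a union-find pass over the pairs, sketched below. The input format (tuples of two sample IDs and a relatedness value), the function name, the use of `>=` at the 0.25 cutoff, and the choice to merge groups transitively are all assumptions; the text only states the cutoff.

```python
def kinship_groups(pairs, cutoff=0.25):
    """Group samples linked by a relatedness value at or above the cutoff.

    `pairs` is an iterable of (sample_a, sample_b, relatedness) tuples.
    Returns a list of sets, one per group with at least two members.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, phi in pairs:
        root_a, root_b = find(a), find(b)
        # Assumption: relatedness is transitive for grouping purposes.
        if phi >= cutoff and root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for sample in list(parent):
        groups.setdefault(find(sample), set()).add(sample)
    return [g for g in groups.values() if len(g) > 1]
```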
SpadaHC stores not only variants found in individuals but also variants classified by laboratories or expert groups. If a variant was not found in any individual, its frequency is 0.
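The AF rules above can be put together in a short sketch. The `Individual` record and its fields are hypothetical, AN is assumed to add two alleles per counted individual (diploid, autosomal positions), the first covered member of each kinship group is the one kept (the text does not say which relative is chosen), and returning 0 when AN is 0 is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Individual:
    """Hypothetical record for one individual at one genomic position."""
    covered: bool                        # position lies inside the submitted ROIs
    alt_alleles: int                     # copies of the variant allele: 0, 1 or 2
    kinship_group: Optional[str] = None  # label shared by related individuals, if any


def allele_frequency(individuals, ploidy: int = 2):
    """Return (AC, AN, AF) over covered individuals, one per kinship group."""
    seen_groups = set()
    ac = an = 0
    for ind in individuals:
        if not ind.covered:
            continue  # uncovered positions do not contribute to AN
        if ind.kinship_group is not None:
            if ind.kinship_group in seen_groups:
                continue  # a relative was already counted
            seen_groups.add(ind.kinship_group)
        ac += ind.alt_alleles
        an += ploidy
    # A variant seen in no individual has AC = 0 and therefore AF = 0.
    af = ac / an if an else 0.0
    return ac, an, af
```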
Laboratories are requested to provide individuals' variants (VCF format) detected from non-tumor sample DNA sequencing. Also, laboratories and expert groups can provide variant classifications according to guidelines developed for germline variants. Hence, the vast majority of SpadaHC variants should be germline, although mosaic variants could also be present if they were called with a high enough allele balance (>= 20%).
SpadaHC performs variant annotation on RefSeq transcripts using VEP 104 (GRCh37.p13 assembly): fields like HGVSc, HGVSp or Consequence are predicted using this set of transcripts and the HGVSc annotation is checked using Mutalyzer. Additionally, related LRG transcripts are shown in the transcripts view.
However, VEP produces a wrong HGVSc annotation when there are sequence discrepancies between RefSeq transcripts (annotated on mRNA sequences) and Ensembl transcripts (annotated on the reference genome) (Example: 11 67258391 A/G). In these cases, a symbol is shown to the left of the HGVSc field to note that HGVSc annotation produced by VEP is wrong.
In the Variants section, only one transcript is shown per line (the main transcript), although other transcripts can be shown using the corresponding icon. When a variant is annotated on a single gene, the main transcript shown is the one that matches the first Locus Reference Genomic (LRG) transcript in that gene. If no LRG transcript is available for a certain gene, then the transcript shown is the one labeled as Canonical by the Ensembl Variant Effect Predictor (VEP) tool.
If a variant is annotated on multiple genes, the above criterion is used for selecting the preferred transcripts of each gene. The transcript selected to be shown will be the one with the most severe consequence of the preferred transcripts.
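The selection rules above can be sketched in two steps: pick a preferred transcript per gene, then pick the most severe among the preferred ones. The `Transcript` record, the `lrg_order` field, and the short severity list are hypothetical; the list is an illustrative subset ordered from most to least severe, not the full ranking used by VEP or SpadaHC.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative subset, most severe first (assumption, not the complete ranking).
SEVERITY = [
    "splice_acceptor_variant", "splice_donor_variant", "stop_gained",
    "frameshift_variant", "missense_variant", "synonymous_variant", "intron_variant",
]
SEVERITY_RANK = {name: rank for rank, name in enumerate(SEVERITY)}


@dataclass
class Transcript:
    transcript_id: str
    gene: str
    consequence: str
    lrg_order: Optional[int] = None  # 1 for the gene's first LRG transcript; None if no LRG match
    canonical: bool = False          # labeled Canonical by VEP


def preferred_transcript(transcripts):
    """Preferred transcript of one gene: first LRG match, else the canonical one."""
    lrg = [t for t in transcripts if t.lrg_order is not None]
    if lrg:
        return min(lrg, key=lambda t: t.lrg_order)
    return next((t for t in transcripts if t.canonical), None)


def main_transcript(transcripts):
    """Main transcript of a variant annotated on one or more genes."""
    by_gene = {}
    for t in transcripts:
        by_gene.setdefault(t.gene, []).append(t)
    preferred = [p for p in map(preferred_transcript, by_gene.values()) if p is not None]
    if not preferred:
        return None
    # Unknown consequences sort last (least severe) in this sketch.
    return min(preferred, key=lambda t: SEVERITY_RANK.get(t.consequence, len(SEVERITY)))
```

Note that severity only breaks ties between genes: within a gene, a non-preferred transcript with a more severe consequence is never selected.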
SpadaHC supports exploring variants in both GRCh37 and GRCh38 genome assemblies. To date, all submitted variants in SpadaHC were called using the GRCh37 assembly. SpadaHC obtains the GRCh38 coordinates by, first, calling CrossMap to lift over genome coordinates and, second, checking that the genomic reference base remains the same at the lifted-over position. If either of the two steps fails, the variant has no representation on the GRCh38 assembly (Example: 12 25368462 C/T).
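The two-step conversion can be sketched with the liftover and the reference lookup passed in as functions. Both callables are hypothetical stand-ins: `lift` represents whatever wraps the CrossMap call and `fetch_ref` represents a lookup in the GRCh38 reference; neither is a real CrossMap API.

```python
from typing import Callable, Optional, Tuple

Locus = Tuple[str, int]  # (chromosome, 1-based position)


def to_grch38(
    chrom: str,
    pos: int,
    ref: str,
    lift: Callable[[str, int], Optional[Locus]],
    fetch_ref: Callable[[str, int, int], str],
) -> Optional[Locus]:
    """Return the GRCh38 locus, or None when the variant cannot be represented."""
    # Step 1: coordinate liftover; None means the position did not map.
    lifted = lift(chrom, pos)
    if lifted is None:
        return None
    new_chrom, new_pos = lifted
    # Step 2: the reference allele must be unchanged at the new position.
    if fetch_ref(new_chrom, new_pos, len(ref)).upper() != ref.upper():
        return None
    return lifted
```

Passing the two steps as callables keeps the check testable with toy data, without a chain file or a genome on disk.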
SpliceAI values were obtained by running the SpliceAI model. The parameter values used were:
The splicing effect was computed using the MaxEntScan plugin for VEP. This tool encapsulates the MaxEntScan model and extends its functionality. The algorithm used to assign a low, moderate or high effect is explained in Shamsani, 2019 (Fig. 1B). Note that donor loss and acceptor loss are only calculated for variants spanning the native splice regions: the [-20, 3] interval for the acceptor region and the [-3, 6] interval for the donor region.
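The region test in the note above amounts to an interval overlap check, sketched below. Two assumptions are made: positions are expressed relative to the exon-intron junction using the interval bounds exactly as given in the text, and "spanning" is read as overlapping the region in at least one position. The names are hypothetical.

```python
ACCEPTOR_REGION = (-20, 3)  # interval given in the text for the acceptor site
DONOR_REGION = (-3, 6)      # interval given in the text for the donor site


def spans_region(var_start: int, var_end: int, region) -> bool:
    """True when the variant interval [var_start, var_end] overlaps the region."""
    low, high = region
    return var_start <= high and var_end >= low
```

For example, a single-base variant at relative position 7 falls outside the donor region, so donor loss would not be calculated for it under this reading.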
Clinical suspicion terms in SpadaHC were obtained from the MedGen ontology, which is explicitly allowed in ClinVar submissions. The terms, synonyms, and MedGen UIDs provided by MedGen are listed below.
Clinical suspicion term | Synonyms | MedGen UID |
---|---|---|
Hereditary breast ovarian cancer syndrome | Breast and ovarian cancer; Hereditary breast and ovarian cancer; Hereditary breast and ovarian cancer syndrome; Hereditary breast and ovarian cancer syndrome (HBOC) | 151793 |
Hereditary gastric cancer | Hereditary cancer of stomach; hereditary cancer of stomach; hereditary gastric cancer | 944393 |
Hereditary cancer | Familial Cancer; Familial Malignant Neoplasm; Hereditary Cancer; Hereditary Malignant Neoplasm | 232504 |
Hereditary pheochromocytoma-paraganglioma | Hereditary Paraganglioma-Pheochromocytoma Syndromes; Hereditary Paragangliomas and Pheochromocytomas | 313270 |
Familial medullary thyroid carcinoma | MTC; MTC, familial; NTRK1-Related Familial Medullary Thyroid Carcinoma; Thyroid cancer, familial medullary | 322311 |
Gorlin syndrome | Basal cell nevus syndrome; Nevoid Basal Cell Carcinoma Syndrome | 2554 |
Hereditary nonpolyposis colon cancer | Familial nonpolyposis colon cancer; Hereditary nonpolyposis colorectal cancer; Hereditary Nonpolyposis Colorectal Cancer Syndrome; HNPCC | 232602 |
Familial multiple polyposis syndrome | Classic familial adenomatous polyposis; Familial adenomatous polyposis; Familial adenomatous polyposis of the colon; Familial intestinal polyposis; Familial multiple polyposis; Familial polyposis; Familial polyposis of the colon; FAP; Hereditary polyposis coli | 46010 |
Familial cancer of breast | Breast cancer, familial | 87542 |
Familial melanoma | Hereditary cutaneous melanoma; Hereditary melanoma | 268851 |
Familial ovarian cancer | Familial cancer of ovary; Familial malignant neoplasm of ovary; familial ovarian cancer; Familial ovarian malignant tumor; familial ovarian malignant tumor; hereditary ovarian cancer | 1803368 |
Familial prostate carcinoma | Familial prostate cancer; Hereditary prostate cancer | 419810 |
Familial pancreatic carcinoma | Familial Pancreatic Cancer | 419700 |
Hereditary renal cell carcinoma | Familial Renal Cancer; familial renal carcinoma; hereditary renal carcinoma; hereditary renal cell cancer; Hereditary Renal Cell Carcinoma; hereditary renal cell carcinoma; hereditary renal cell carcinoma (disease) | 392857 |
Neurofibromatosis | Multiple Neurofibroma; Multiple Neurofibromas; Neurofibroma, Multiple; Neurofibromas, Multiple; Neurofibromatoses; Neurofibromatosis Syndrome; Neurofibromatosis Syndromes; Syndrome, Neurofibromatosis; Syndromes, Neurofibromatosis | 58149 |
Multiple endocrine neoplasia, type 1 | Endocrine adenomatosis multiple; MEA I; MEN 1; MEN I; MEN1; Wermer syndrome | 9957 |
Li-Fraumeni syndrome | LFS; Sarcoma family syndrome of Li and Fraumeni | 88399 |
Hereditary hyperparathyroidism | Genetic hyperparathyroidism | 1843372 |