@genebe/revel:0.0.1
REVEL - Rare Exome Variant Ensemble Learner
Description
REVEL
REVEL is an ensemble method for predicting the pathogenicity of missense variants based on a combination of scores from 13 individual tools: MutPred, FATHMM v2.3, VEST 3.0, PolyPhen-2, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP++, SiPhy, phyloP, and phastCons. REVEL was trained using recently discovered pathogenic and rare neutral missense variants, excluding those previously used to train its constituent tools. The REVEL score for an individual missense variant can range from 0 to 1, with higher scores reflecting greater likelihood that the variant is disease-causing.
REVEL scores are freely available for non-commercial use. For other uses, please contact Weiva Sieh.
Find out more at https://sites.google.com/site/revelgenomics/
Citation
Ioannidis NM,* Rothstein JH,* Pejaver V, Middha S, McDonnell SK, Baheti S, Musolf A, Li Q, Holzinger E, Karyadi D, Cannon-Albright LA, Teerlink CC, Stanford JL, Isaacs WB, Xu J, Cooney KA, Lange EM, Schleutker J, Carpten JD, Powell IJ, Cussenot O, Cancel-Tassin G, Giles GG, MacInnis RJ, Maier C, Hsieh CL, Wiklund F, Catalona WJ, Foulkes WD, Mandal D, Eeles RA, Kote-Jarai Z, Bustamante CD, Schaid DJ, Hastie T, Ostrander EA, Bailey-Wilson JE, Radivojac P, Thibodeau SN, Whittemore AS, and Sieh W. “REVEL: An ensemble method for predicting the pathogenicity of rare missense variants.” American Journal of Human Genetics 2016; 99(4):877-885. http://dx.doi.org/10.1016/j.ajhg.2016.08.016
Instructions
Download data
wget https://rothsj06.dmz.hpc.mssm.edu/revel-v1.3_all_chromosomes.zip
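The Spark step below reads the extracted CSV from /tmp/revel_with_transcript_ids, so unpack the archive first; a minimal sketch, assuming the zip contains a file with that name:
unzip revel-v1.3_all_chromosomes.zip -d /tmp/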
Process using Apache Spark
val df = { spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", true)
  .option("delimiter", ",")
  .option("nullValue", ".")
  .option("comment", "#")
  .load("/tmp/revel_with_transcript_ids") }
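A quick sanity check after loading (a sketch; the exact schema depends on the REVEL CSV, but the columns referenced by the steps below should be present):
// Expected columns, as referenced by the steps below:
// chr, hg19_pos, grch38_pos, ref, alt, aaref, aaalt, REVEL, Ensembl_transcriptid
df.printSchema()
df.show(5, truncate = false)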
import org.apache.spark.sql.functions._
// Step 1: Rename grch38_pos to pos
val renamedDf = df.withColumnRenamed("grch38_pos", "pos")
// Step 2: Drop unnecessary columns
val reducedDf = renamedDf.drop("hg19_pos", "aaref", "aaalt", "Ensembl_transcriptid")
// Step 3: Drop rows with null values
val nonNullDf = reducedDf.na.drop()
// Step 4: Deduplicate on chr, pos, ref, alt; the per-transcript file can list the same genomic variant on several rows, so keep the maximum REVEL score
val deduplicatedDf = nonNullDf.groupBy("chr", "pos", "ref", "alt").agg(max("REVEL").as("REVEL"))
// Step 5: Cast REVEL to float and rename it to score
val finalDf = deduplicatedDf.withColumn("REVEL", col("REVEL").cast("float")).withColumnRenamed("REVEL","score")
val df = finalDf // rebind the name df for the SPDI conversion below (redefinition works in spark-shell)
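At this point the frame should hold only the VCF-style coordinates and the score; a quick check (a sketch, not part of the original pipeline):
println(df.columns.mkString(", ")) // expect: chr, pos, ref, alt, score
println(df.count()) // one row per distinct chr/pos/ref/alt after deduplication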
// -- convert CPRA to SPDI
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Row}
def toSpdi(chrInput: String, posInput: Int, refInput: String, altInput: String): (String, Int, Int, String) = {
  if (chrInput == null || chrInput.isBlank || posInput < 0) {
    return (null, -1, -1, null)
  }
  var seq = chrInput.stripPrefix("chr")
  if (seq == "MT") seq = "M" // Normalize for human genomes
  var ref = refInput
  var alt = altInput
  val vcfPos = posInput
  // Remove leading common bases
  var leadingCommon = 0
  while (leadingCommon < ref.length && leadingCommon < alt.length &&
         ref.charAt(leadingCommon) == alt.charAt(leadingCommon)) {
    leadingCommon += 1
  }
  ref = ref.substring(leadingCommon)
  alt = alt.substring(leadingCommon)
  val pos = vcfPos - 1 + leadingCommon // Convert to 0-based
  // Remove trailing common bases
  var trailingCommon = 0
  while (ref.length > trailingCommon && alt.length > trailingCommon &&
         ref.charAt(ref.length - 1 - trailingCommon) == alt.charAt(alt.length - 1 - trailingCommon)) {
    trailingCommon += 1
  }
  ref = ref.substring(0, ref.length - trailingCommon)
  alt = alt.substring(0, alt.length - trailingCommon)
  // Compute del as the number of bases in the remaining ref
  val del = ref.length
  val ins = alt
  (seq, pos, del, ins)
}
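A couple of worked calls (illustrative coordinates, not taken from the REVEL data) show what the conversion produces for a SNV and for a deletion:
// SNV: chr1:12345 A>G (1-based VCF) -> contig "1", 0-based position 12344, delete 1 base, insert "G"
assert(toSpdi("chr1", 12345, "A", "G") == ("1", 12344, 1, "G"))
// Deletion: chr7:100 CT>C -> the shared leading C is trimmed, leaving a 1-base deletion at 0-based position 100
assert(toSpdi("chr7", 100, "CT", "C") == ("7", 100, 1, ""))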
val toSpdiUdf = udf((chrInput: String, posInput: Int, refInput: String, altInput: String) => {
  val result = toSpdi(chrInput, posInput, refInput, altInput)
  result // Return a tuple; Spark exposes it as a struct column with fields _1 .. _4
})
// build the SPDI struct column by applying the UDF to the VCF-style columns
val spdiUdf = toSpdiUdf(col("chr"), col("pos"), col("ref"), col("alt"))
// the input DataFrame is named df (rebound above after the cleanup steps)
val dfWithSpdi = { df
  .withColumn("spdi", spdiUdf)
  .withColumn("_seq", col("spdi").getField("_1"))
  .withColumn("_pos", col("spdi").getField("_2").cast("int"))
  .withColumn("_del", col("spdi").getField("_3").cast("int"))
  .withColumn("_ins", col("spdi").getField("_4"))
  .drop("spdi") }
// drop the original columns; the same information is now carried by _seq, _pos, _del, _ins
val finalDf = dfWithSpdi.drop("chr", "pos", "ref", "alt").dropDuplicates("_seq", "_pos", "_del", "_ins")
// repartition by _seq, sort within partitions by _seq, _pos, _del, _ins, and write zstd-compressed Parquet partitioned by _seq
finalDf.repartition($"_seq").sortWithinPartitions($"_seq", $"_pos", $"_del", $"_ins").write.option("compression", "zstd").partitionBy("_seq").parquet("/tmp/revel-38")
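To verify the output before importing, the partitioned Parquet can be read back (a sketch; the path matches the write above):
val check = spark.read.parquet("/tmp/revel-38")
check.printSchema() // expect _pos, _del, _ins, score plus the partition column _seq
check.orderBy("_seq", "_pos").show(5, truncate = false)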
Import using GeneBe: create from Parquet and push
annotation create-from-parquet --input /tmp/revel-38 --owner @genebe --name revel --version 0.0.1 --species homo_sapiens --genome GRCh38 --title "REVEL hg38"
annotation push --id @genebe/revel:0.0.1 --public true
Meta Information
Access:
PUBLIC
Author:
@genebe
URL:
Created:
19 Jan 2025, 17:06:58 UTC
Type:
VARIANT
Genome:
GRCh38
Status:
ACTIVE