@genebe/revel:0.0.1

REVEL - Rare Exome Variant Ensemble Learner

Description


REVEL is an ensemble method for predicting the pathogenicity of missense variants based on a combination of scores from 13 individual tools: MutPred, FATHMM v2.3, VEST 3.0, PolyPhen-2, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP++, SiPhy, phyloP, and phastCons. REVEL was trained using recently discovered pathogenic and rare neutral missense variants, excluding those previously used to train its constituent tools. The REVEL score for an individual missense variant can range from 0 to 1, with higher scores reflecting greater likelihood that the variant is disease-causing.

REVEL scores are freely available for non-commercial use. For other uses, please contact Weiva Sieh.

Find out more at https://sites.google.com/site/revelgenomics/
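
For orientation, once the scores are attached to variants as a numeric column (as in the dataset built under Instructions below), selecting high-scoring candidates is a simple filter. This is a minimal sketch: the DataFrame name annotated and the 0.75 cutoff are illustrative assumptions, not recommendations.

import org.apache.spark.sql.functions.col
// 'annotated' stands for any DataFrame carrying a float REVEL column named "score",
// such as the one produced in the Spark steps below; 0.75 is an arbitrary example cutoff.
val likelyDamaging = annotated.filter(col("score") >= 0.75)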

Citation

Ioannidis NM,* Rothstein JH,* Pejaver V, Middha S, McDonnell SK, Baheti S, Musolf A, Li Q, Holzinger E, Karyadi D, Cannon-Albright LA, Teerlink CC, Stanford JL, Isaacs WB, Xu J, Cooney KA, Lange EM, Schleutker J, Carpten JD, Powell IJ, Cussenot O, Cancel-Tassin G, Giles GG, MacInnis RJ, Maier C, Hsieh CL, Wiklund F, Catalona WJ, Foulkes WD, Mandal D, Eeles RA, Kote-Jarai Z, Bustamante CD, Schaid DJ, Hastie T, Ostrander EA, Bailey-Wilson JE, Radivojac P, Thibodeau SN, Whittemore AS, and Sieh W. “REVEL: An ensemble method for predicting the pathogenicity of rare missense variants.” American Journal of Human Genetics 2016; 99(4):877-885. http://dx.doi.org/10.1016/j.ajhg.2016.08.016

Instructions

Download data

wget https://rothsj06.dmz.hpc.mssm.edu/revel-v1.3_all_chromosomes.zip

Extract the archive; the Spark step below reads the extracted CSV, assumed here to be placed at /tmp/revel_with_transcript_ids.

Process using Apache Spark


val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .option("nullValue", ".")
  .option("comment", "#")
  .load("/tmp/revel_with_transcript_ids")
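// The extracted REVEL v1.3 CSV is expected to carry these columns (all used in the steps below):
//   chr, hg19_pos, grch38_pos, ref, alt, aaref, aaalt, REVEL, Ensembl_transcriptid
// A quick df.printSchema() confirms the layout before transforming.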

import org.apache.spark.sql.functions._

// Step 1: Rename grch38_pos to pos
val renamedDf = df.withColumnRenamed("grch38_pos", "pos")

// Step 2: Drop unnecessary columns
val reducedDf = renamedDf.drop("hg19_pos", "aaref", "aaalt", "Ensembl_transcriptid")

// Step 3: Drop rows with null values (e.g. variants without a GRCh38 position, read as null via nullValue ".")
val nonNullDf = reducedDf.na.drop()

// Step 4: Deduplicate on chr, pos, ref, alt, keeping the maximum REVEL score
// (the same nucleotide change can be scored on multiple transcripts)
val deduplicatedDf = nonNullDf
  .groupBy("chr", "pos", "ref", "alt")
  .agg(max("REVEL").as("REVEL"))

// Step 5: Cast REVEL to float and rename the column to score
val finalDf = deduplicatedDf.withColumn("REVEL", col("REVEL").cast("float")).withColumnRenamed("REVEL", "score")
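// At this point (given the column layout noted above) the dataframe holds: chr, pos, ref, alt, score (float)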

// Reuse the name df for the CPRA-to-SPDI conversion below (re-declaring a val works in spark-shell)
val df = finalDf

// -- Convert CPRA (chr, pos, ref, alt) to SPDI (sequence, position, deletion length, inserted bases)
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Row}

def toSpdi(chrInput: String, posInput: Int, refInput: String, altInput: String): (String, Int, Int, String) = {
    if (chrInput == null || chrInput.isBlank || posInput < 0) {
        return (null, -1, -1, null)
    }

    var seq = chrInput.stripPrefix("chr")
    if (seq == "MT") seq = "M" // Normalize for human genomes

    var ref = refInput
    var alt = altInput
    val vcfPos = posInput

    // Remove leading common bases
    var leadingCommon = 0
    while (leadingCommon < ref.length && leadingCommon < alt.length &&
           ref.charAt(leadingCommon) == alt.charAt(leadingCommon)) {
        leadingCommon += 1
    }
    ref = ref.substring(leadingCommon)
    alt = alt.substring(leadingCommon)
    val pos = vcfPos - 1 + leadingCommon // Convert to 0-based

    // Remove trailing common bases
    var trailingCommon = 0
    while (ref.length > trailingCommon && alt.length > trailingCommon &&
           ref.charAt(ref.length - 1 - trailingCommon) == alt.charAt(alt.length - 1 - trailingCommon)) {
        trailingCommon += 1
    }
    ref = ref.substring(0, ref.length - trailingCommon)
    alt = alt.substring(0, alt.length - trailingCommon)

    // Compute del as the number of bases in the remaining ref
    val del = ref.length
    val ins = alt

    (seq, pos, del, ins)
}
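
// Worked examples of the conversion above (values traced by hand; REVEL itself only contains SNVs,
// the second case just illustrates the common-base trimming):
//   toSpdi("chr1", 12345, "A", "G") == ("1", 12344, 1, "G")  // 0-based position, delete 1 base, insert "G"
//   toSpdi("chr7", 100, "ATG", "A") == ("7", 100, 2, "")     // shared leading "A" trimmed, then a 2-base deletion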

val toSpdiUdf = udf((chrInput: String, posInput: Int, refInput: String, altInput: String) =>
    toSpdi(chrInput, posInput, refInput, altInput) // returns the (seq, pos, del, ins) tuple
)

// Build the SPDI column expression; the tuple returned by the UDF becomes a struct with fields _1.._4
val spdiUdf = toSpdiUdf(col("chr"), col("pos"), col("ref"), col("alt"))

// Apply the conversion to the dataframe prepared above (df) and unpack the struct fields
val dfWithSpdi = { df
  .withColumn("spdi", spdiUdf)
  .withColumn("_seq", col("spdi").getField("_1"))
  .withColumn("_pos", col("spdi").getField("_2").cast("int"))
  .withColumn("_del", col("spdi").getField("_3").cast("int"))
  .withColumn("_ins", col("spdi").getField("_4"))
  .drop("spdi") }

// Drop the original CPRA columns; the same variant is now encoded in _seq, _pos, _del, _ins
val finalDf = dfWithSpdi.drop("chr", "pos", "ref", "alt").dropDuplicates("_seq", "_pos", "_del", "_ins")

// Repartition by _seq, sort within partitions by _seq, _pos, _del, _ins, and write zstd-compressed parquet partitioned by _seq
finalDf
  .repartition($"_seq")
  .sortWithinPartitions($"_seq", $"_pos", $"_del", $"_ins")
  .write
  .option("compression", "zstd")
  .partitionBy("_seq")
  .parquet("/tmp/revel-38")
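
As an optional sanity check (a sketch using the same /tmp/revel-38 path as above), the partitioned output can be read back and inspected in the same spark-shell session:

// Optional: read the partitioned parquet back and look at a few rows
val revelCheck = spark.read.parquet("/tmp/revel-38")
revelCheck.printSchema()  // expect score (float), _pos, _del, _ins, plus the partition column _seq
revelCheck.filter($"_seq" === "1").orderBy($"_pos").show(5, false)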

Import using GeneBe, create from parquet

annotation create-from-parquet --input /tmp/revel-38 --owner @genebe --name revel --version 0.0.1 --species homo_sapiens --genome GRCh38 --title "REVEL hg38"
annotation push --id @genebe/revel:0.0.1 --public true

Meta Information

Access:

PUBLIC

Author:

@genebe

Pull Command:

java -jar genebe.jar annotation pull --id @genebe/revel:0.0.1

Created:

19 Jan 2025, 17:06:58 UTC

Type:

VARIANT

Genome:

GRCh38

Status:

ACTIVE

License:

OTHER

Version:

0.0.1