@genebe/dann_hg19:0.0.1

DANN: Deep Learning-Based Variant Annotation Database

Description

dann_hg19 0.0.1

DANN scores for hg19 / GRCh37. These are precomputed scores for single-nucleotide variants across the whole human genome.

Find out more in:

https://academic.oup.com/bioinformatics/article/31/5/761/2748191

Creation instructions

Data was downloaded from https://cbcl.ics.uci.edu/public_data/DANN/data/DANN_whole_genome_SNVs.tsv.bgz . Because of its large size, the file was preprocessed into the Parquet format using Apache Spark:

// Note: the downloaded .bgz file is BGZF (block gzip), which is valid gzip;
// rename it to .gz so Spark picks the gzip codec from the file extension.
val raw = spark.read
  .option("delimiter", "\t")
  .option("header", "false")
  .option("inferSchema", "true")
  .csv("DANN_whole_genome_SNVs.tsv.gz")

val df = raw
  .withColumnRenamed("_c0", "chr")
  .withColumnRenamed("_c1", "pos")
  .withColumnRenamed("_c2", "ref")
  .withColumnRenamed("_c3", "alt")
  .withColumnRenamed("_c4", "score")

import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Row}

def toSpdi(chrInput: String, posInput: Int, refInput: String, altInput: String): (String, Int, Int, String) = {
    if (chrInput == null || chrInput.isBlank || posInput < 1) { // VCF positions are 1-based
        return (null, -1, -1, null)
    }

    var seq = chrInput.stripPrefix("chr")
    if (seq == "MT") seq = "M" // Normalize for human genomes

    var ref = refInput
    var alt = altInput
    val vcfPos = posInput

    // Remove leading common bases
    var leadingCommon = 0
    while (leadingCommon < ref.length && leadingCommon < alt.length &&
           ref.charAt(leadingCommon) == alt.charAt(leadingCommon)) {
        leadingCommon += 1
    }
    ref = ref.substring(leadingCommon)
    alt = alt.substring(leadingCommon)
    val pos = vcfPos - 1 + leadingCommon // Convert to 0-based

    // Remove trailing common bases
    var trailingCommon = 0
    while (ref.length > trailingCommon && alt.length > trailingCommon &&
           ref.charAt(ref.length - 1 - trailingCommon) == alt.charAt(alt.length - 1 - trailingCommon)) {
        trailingCommon += 1
    }
    ref = ref.substring(0, ref.length - trailingCommon)
    alt = alt.substring(0, alt.length - trailingCommon)

    // Compute del as the number of bases in the remaining ref
    val del = ref.length
    val ins = alt

    (seq, pos, del, ins)
}
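// A quick sanity check of the trimming logic (illustrative inputs, not rows from
// the DANN file). The tuple is SPDI-style: (sequence, 0-based position,
// deletion length, inserted bases).
assert(toSpdi("chr1", 100, "A", "G") == ("1", 99, 1, "G")) // SNV: one base deleted, "G" inserted
assert(toSpdi("chrMT", 50, "TA", "T") == ("M", 50, 1, "")) // 1-bp deletion after trimming the shared leading T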

// Wrap toSpdi in a UDF; Spark exposes the returned tuple as a struct
// with fields _1 .. _4.
val toSpdiUdf = udf((chr: String, pos: Int, ref: String, alt: String) => toSpdi(chr, pos, ref, alt))

// Apply the UDF to the VCF-style columns
val spdiUdf = toSpdiUdf(col("chr"), col("pos"), col("ref"), col("alt"))

// Unpack the struct returned by the UDF into separate SPDI columns
val dfWithSpdi = df
  .withColumn("spdi", spdiUdf)
  .withColumn("_seq", col("spdi").getField("_1"))
  .withColumn("_pos", col("spdi").getField("_2").cast("int"))
  .withColumn("_del", col("spdi").getField("_3").cast("int"))
  .withColumn("_ins", col("spdi").getField("_4"))
  .drop("spdi")

// Drop the original VCF-style columns; the same information now lives in _seq, _pos, _del, _ins.
// dropDuplicates guards against distinct input rows that normalize to the same SPDI.
val finalDf = dfWithSpdi.drop("chr", "pos", "ref", "alt").dropDuplicates("_seq", "_pos", "_del", "_ins")
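
// Sketch of the expected shape at this point; the score type assumes inferSchema
// parsed the DANN score as double (verify against your own run):
finalDf.printSchema()
// root
//  |-- score: double (nullable = true)
//  |-- _seq: string (nullable = true)
//  |-- _pos: integer (nullable = true)
//  |-- _del: integer (nullable = true)
//  |-- _ins: string (nullable = true)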

// Partition by _seq, sort within partitions, and write zstd-compressed Parquet
finalDf
  .repartition($"_seq")
  .sortWithinPartitions($"_seq", $"_pos", $"_del", $"_ins")
  .write
  .option("compression", "zstd")
  .partitionBy("_seq")
  .parquet("/tmp/parquet")

As a result, a Hive-partitioned DANN database in Parquet format is available under /tmp/parquet, with one _seq=<contig> directory per contig.
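To sanity-check the output before importing it, the dataset can be read back in the same Spark shell session (an optional sketch; the contig value "1" is just an illustrative pick):

val check = spark.read.parquet("/tmp/parquet")
// Partition pruning applies here: only the _seq=1 directory is scanned.
check.filter($"_seq" === "1").orderBy($"_pos").show(5)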

It was then imported using:

annotation create-from-parquet --input /tmp/parquet --owner @genebe --name dann_hg19 --version 0.0.1 --species homo_sapiens --genome GRCh37 --title "DANN: Deep Learning-Based Variant Annotation Database"

Meta Information

Access:

PUBLIC

Author:

@genebe

Pull Command:

java -jar genebe.jar annotation pull --id @genebe/dann_hg19:0.0.1

Created:

17 Jan 2025, 13:13:23 UTC

Type:

VARIANT

Genome:

GRCh37

Status:

ACTIVE

License:

NOT_SPECIFIED

Version:

0.0.1