@genebe/dann_hg38:0.0.1
DANN: Deep Learning-Based Variant Annotation Database, GRCh38 version
Description
dann_hg38 0.0.1
DANN scores for hg38 / GRCh38, lifted over from the original hg19 version. The scores cover the whole human genome.
Find out more in:
https://academic.oup.com/bioinformatics/article/31/5/761/2748191
Create instructions
The data was downloaded from https://cbcl.ics.uci.edu/public_data/DANN/data/DANN_whole_genome_SNVs.tsv.bgz and lifted over. Because the file is very large, it was preprocessed with Apache Spark into Parquet format:
// Read the lifted-over, headerless TSV; column types are inferred
val rawDf = spark.read.option("delimiter", "\t").option("header", "false").option("inferSchema", "true").csv("DANN_whole_genome_SNVs_lifted.tsv.gz")
// Give the positional columns meaningful names
val df = rawDf.withColumnRenamed("_c0", "chr").withColumnRenamed("_c1", "pos").withColumnRenamed("_c2", "ref").withColumnRenamed("_c3", "alt").withColumnRenamed("_c4", "score")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Row}
def toSpdi(chrInput: String, posInput: Int, refInput: String, altInput: String): (String, Int, Int, String) = {
  if (chrInput == null || chrInput.isBlank || posInput < 0) {
    return (null, -1, -1, null)
  }
  var seq = chrInput.stripPrefix("chr")
  if (seq == "MT") seq = "M" // Normalize for human genomes
  var ref = refInput
  var alt = altInput
  val vcfPos = posInput
  // Remove leading common bases
  var leadingCommon = 0
  while (leadingCommon < ref.length && leadingCommon < alt.length &&
    ref.charAt(leadingCommon) == alt.charAt(leadingCommon)) {
    leadingCommon += 1
  }
  ref = ref.substring(leadingCommon)
  alt = alt.substring(leadingCommon)
  val pos = vcfPos - 1 + leadingCommon // Convert to 0-based
  // Remove trailing common bases
  var trailingCommon = 0
  while (ref.length > trailingCommon && alt.length > trailingCommon &&
    ref.charAt(ref.length - 1 - trailingCommon) == alt.charAt(alt.length - 1 - trailingCommon)) {
    trailingCommon += 1
  }
  ref = ref.substring(0, ref.length - trailingCommon)
  alt = alt.substring(0, alt.length - trailingCommon)
  // Compute del as the number of bases in the remaining ref
  val del = ref.length
  val ins = alt
  (seq, pos, del, ins)
}
val toSpdiUdf = udf((chrInput: String, posInput: Int, refInput: String, altInput: String) => {
  val result = toSpdi(chrInput, posInput, refInput, altInput)
  result // Return a tuple
})
// Build the column expression that applies the UDF to the input columns
val spdiUdf = toSpdiUdf(col("chr"), col("pos"), col("ref"), col("alt"))
// Assumes the existing DataFrame is named df
val dfWithSpdi = {
  df
    .withColumn("spdi", spdiUdf)
    .withColumn("_seq", col("spdi").getField("_1"))
    .withColumn("_pos", col("spdi").getField("_2").cast("int"))
    .withColumn("_del", col("spdi").getField("_3").cast("int"))
    .withColumn("_ins", col("spdi").getField("_4"))
    .drop("spdi")
}
// Drop the original columns; the same information is carried by _seq, _pos, _del, _ins
val finalDf = dfWithSpdi.drop("chr", "pos", "ref", "alt").dropDuplicates("_seq","_pos","_del","_ins")
// Partition by _seq, sort within each partition by _seq, _pos, _del, _ins, and write zstd-compressed Parquet
finalDf.repartition($"_seq").sortWithinPartitions($"_seq", $"_pos", $"_del", $"_ins").write.option("compression", "zstd").partitionBy("_seq").parquet("/tmp/parquet")
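The coordinate conversion done by toSpdi in the listing above can be checked by hand on a few variants. The sketch below is a compact, Spark-free restatement of the same trimming logic, written only for illustration: the object name SpdiSketch and the example variants are assumptions and are not taken from the DANN data, and the null and blank-input guard of the original function is left out.

```scala
object SpdiSketch {
  // Illustrative restatement of the trimming logic; returns (seq, pos, del, ins).
  // Unlike the original toSpdi, it does not guard against null or blank input.
  def toSpdi(chr: String, vcfPos: Int, ref: String, alt: String): (String, Int, Int, String) = {
    val stripped = chr.stripPrefix("chr")
    val seq = if (stripped == "MT") "M" else stripped

    // Number of leading bases shared by ref and alt.
    val lead = ref.zip(alt).takeWhile { case (r, a) => r == a }.length
    val refRest = ref.drop(lead)
    val altRest = alt.drop(lead)

    // Number of trailing bases shared by what remains after the leading trim.
    val trail = refRest.reverse.zip(altRest.reverse).takeWhile { case (r, a) => r == a }.length

    val del = refRest.length - trail   // count of deleted reference bases
    val ins = altRest.dropRight(trail) // inserted sequence
    (seq, vcfPos - 1 + lead, del, ins) // 1-based VCF position becomes 0-based
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical variants, chosen only to exercise each branch.
    println(toSpdi("chr1", 100, "A", "G"))  // SNV: seq "1", pos 99, del 1, ins "G"
    println(toSpdi("chrMT", 5, "CAT", "C")) // deletion: seq "M", pos 5, del 2, ins ""
    println(toSpdi("2", 10, "A", "ATG"))    // insertion: seq "2", pos 10, del 0, ins "TG"
  }
}
```

For the single-nucleotide variants that DANN scores, only the first case applies: the position shifts by one, del is always 1, and ins is the alternate base.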
The Spark job above writes a Hive-partitioned DANN database in Parquet format to /tmp/parquet.
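"Hive-partitioned" here means that Spark's partitionBy("_seq") writes one subdirectory per distinct _seq value, named column=value. The sketch below only illustrates that naming convention for simple sequence names; the helper partitionDir and the example sequence names are hypothetical, the sequences actually present depend on the lifted-over file, and the Parquet file names inside each directory are generated by Spark and not shown.

```scala
object PartitionLayoutSketch {
  // Directory that partitionBy("_seq") would use for one sequence name.
  // Assumes a plain name with no characters that Spark would escape in a path.
  def partitionDir(root: String, seq: String): String =
    s"${root.stripSuffix("/")}/_seq=$seq"

  def main(args: Array[String]): Unit = {
    // Hypothetical sequence names, for illustration only.
    Seq("1", "X", "M").foreach(seq => println(partitionDir("/tmp/parquet", seq)))
  }
}
```

Because every row for a given sequence lives under its own directory, a reader that filters on _seq only needs to open the matching subdirectory.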
The Parquet output was then imported using:
annotation create-from-parquet --input /tmp/parquet --owner @genebe --name dann_hg38 --version 0.0.1 --species homo_sapiens --genome GRCh38 --title "DANN: Deep Learning-Based Variant Annotation Database, GRCh38 version"
Meta Information
Access:
PUBLIC
Author:
@genebe
Created:
16 Jan 2025, 11:03:02 UTC
Type:
VARIANT
Genome:
GRCh38
Status:
ACTIVE