Creating a Variant Database

GeneBe Hub is a repository for variant databases. If you have data you'd like to share, such as population frequency data, variant scores, pathogenicity assignments, or any other type of annotation, you can use GeneBe Hub to share it. This makes it easy for other bioinformaticians to discover and apply your data.

To create a database, your annotation data must be in one of the following formats: csv (or tsv), vcf, or parquet. If your database is very large (over 1GB), it is recommended to use the parquet format for efficiency. For smaller databases, the vcf format is often easier to use. However, since many databases are already in csv or tsv format, we support those as well.

Below you will find short instructions on how to create your databases. You may also be interested in the Annotation builder scripts on GitHub -- a repository with the scripts that were used to create some of the databases in the @genebe namespace.

Importing a VCF File

Importing a VCF file is easy. Here is an example of importing the ClinVar dump:

java -jar GeneBeClient.jar annotation create-from-vcf \
  --input-vcf /tmp/clinvar.vcf.gz \
  --name clinvar --owner pstawinski --version 0.0.8 \
  --columns CLNDN:TEXT CLNREVSTAT:TEXT CLNSIG ONCDN ONC ONCREVSTAT SCIDN SCI SCIREVSTAT
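
Each entry in --columns names an INFO field from the VCF to import, optionally with a type appended after a colon (as in CLNDN:TEXT above).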

Importing a TSV/CSV File

If your database is a comma-separated file (.csv) or a tab-separated file (.tsv), it is easy to convert it to the GeneBe format. If your database is in Excel format (.xls or .xlsx), you can export it to .tsv and then import it into the GeneBe format.

It's easiest to import a tsv if it has a single header line. The converter will look for:

  • chr, pos, ref, alt columns for databases of type VARIANT, where pos is 1-based, like in VCF.
  • chr, pos columns for databases of type POSITION, where pos is 1-based, like in VCF.
  • chr, start, end columns for databases of type RANGE, where start and end are 0-based (!) like in a BED file: start is inclusive and end is exclusive (see https://en.wikipedia.org/wiki/BED_(file_format)).
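
For example, a minimal tsv for a VARIANT database, with a single hypothetical af annotation column, could look like this (columns separated by tabs):

chr     pos     ref     alt     af
1       12345   A       G       0.0112
X       8861    T       TA      0.0003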

If your file is a tsv, just run:

java -jar GeneBeClient.jar annotation create-from-tsv \
  --input /tmp/blah.tsv --owner pstawinski --name blah --version 0.0.2 \
  --has-header true

If your file is a csv, you have to specify the separator with --separator ,:

java -jar GeneBeClient.jar annotation create-from-tsv \
  --input /tmp/blah.csv --owner pstawinski --name blah --version 0.0.2 \
  --separator , \
  --has-header true

Most of the tsv databases you find in the wild are not that nice. They may lack a header or may have multiple header lines. Or instead of chr, the contig column may be named chrom, #chrom, chr_hg38, etc. That's why you can (see the combined example after this list):

  • provide your header using --header, like --header chr pos ref alt data1 data2
  • change the names of columns in the data using --columns: specify each column as old_name/new_name:column_type, where column_type is one of BOOL, INT8, INT16, INT32, INT64, FLOAT32, FLOAT64, TEXT. So if in your input data the chr column is named contig, write --columns contig/chr:TEXT
  • if you need to skip some rows from the top use --skip number_of_rows
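
For example, a file with two comment lines at the top and a contig column named chrom could be imported like this (the file name, database name, and column names here are hypothetical; the flags are the ones described above):

java -jar GeneBeClient.jar annotation create-from-tsv \
  --input /tmp/scores.tsv --owner pstawinski --name scores --version 0.0.1 \
  --skip 2 \
  --has-header true \
  --columns chrom/chr:TEXT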

If your CSV/TSV file is huge (like CADD scores for the whole genome), convert it to a hive-partitioned parquet file using your favorite programming language and environment, and import the database from the parquet file. You will find more details below, in the section on creating a database from a parquet file.

Creating from a Parquet File

If you are working with a huge database, for example scores for the whole genome, it may be easier to prepare the database using your favorite programming language and environment than to import a tsv or vcf file. What you have to create in this step is a parquet database, hive-partitioned on the _seq column. The format is described in the "Format" part of the help. In summary, for a VARIANT database you need to provide a parquet database with _seq (chromosome without the chr prefix), _pos (0-based position), _del (number of deleted bases) and _ins (inserted bases). This is the well-known SPDI variant representation.
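
As a worked example of that representation: the VCF variant 1:12345 AT>A (a one-base deletion) becomes _seq=1, _pos=12345, _del=1, and _ins set to the empty string. The shared leading base A is trimmed, the 1-based VCF position is converted to a 0-based position pointing at the deleted T, one reference base is deleted, and nothing is inserted.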

As I like to use Apache Spark for processing big files, I usually convert my database to a dataframe containing chr, pos, ref, alt columns (like in VCF), and then I use this function:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}

def processDataFrame(df: DataFrame, outputDir: String): Unit = {
  // Define the converting function
  def toSpdi(chrInput: String, posInput: Int, refInput: String, altInput: String): (String, Int, Int, String) = {
    if (chrInput == null || chrInput.isBlank || posInput < 0) {
      return (null, -1, -1, null)
    }

    var seq = chrInput.stripPrefix("chr")
    if (seq == "MT") seq = "M" // Normalize for human genomes

    var ref = if (refInput == null) "" else refInput
    var alt = if (altInput == null) "" else altInput
    val vcfPos = posInput

    // Remove leading common bases
    var leadingCommon = 0
    while (leadingCommon < ref.length && leadingCommon < alt.length &&
      ref.charAt(leadingCommon) == alt.charAt(leadingCommon)) {
      leadingCommon += 1
    }
    ref = ref.substring(leadingCommon)
    alt = alt.substring(leadingCommon)
    val pos = vcfPos - 1 + leadingCommon // Convert to 0-based

    // Remove trailing common bases
    var trailingCommon = 0
    while (ref.length > trailingCommon && alt.length > trailingCommon &&
      ref.charAt(ref.length - 1 - trailingCommon) == alt.charAt(alt.length - 1 - trailingCommon)) {
      trailingCommon += 1
    }
    ref = ref.substring(0, ref.length - trailingCommon)
    alt = alt.substring(0, alt.length - trailingCommon)

    // Compute del as the number of bases in the remaining ref
    val del = ref.length
    val ins = alt

    (seq, pos, del, ins)
  }

  // Create a UDF from the function
  val toSpdiUdf = udf((chrInput: String, posInput: Int, refInput: String, altInput: String) => {
    val result = toSpdi(chrInput, posInput, refInput, altInput)
    result
  })

  // Apply transformations
  val dfWithSpdi = df
    .withColumn("spdi", toSpdiUdf(col("chr"), col("pos"), col("ref"), col("alt")))
    .withColumn("_seq", col("spdi").getField("_1"))
    .withColumn("_pos", col("spdi").getField("_2").cast("int"))
    .withColumn("_del", col("spdi").getField("_3").cast("int"))
    .withColumn("_ins", col("spdi").getField("_4"))
    .drop("spdi")

  val finalDf = dfWithSpdi
    .drop("chr", "pos", "ref", "alt")
    .dropDuplicates("_seq", "_pos", "_del", "_ins")

  // Write the output; col() is used instead of the $"" interpolator,
  // which would require an import of spark.implicits._
  finalDf
    .repartition(col("_seq"))
    .sortWithinPartitions(col("_seq"), col("_pos"), col("_del"), col("_ins"))
    .write
    .option("compression", "zstd")
    .partitionBy("_seq")
    .parquet(outputDir)
}

// Apply the function to a prepared dataframe df with chr, pos, ref, alt columns:
processDataFrame(df, "/tmp/output_ready_dataframe.parquet")

Then we can import this parquet output as follows:

java -jar GeneBeClient.jar annotation create-from-parquet \
  --input /tmp/output_ready_dataframe.parquet --owner @genebe --name phylop100way --version 0.0.1 \
  --species homo_sapiens --genome GRCh38 \
  --title "Conservation scoring by phyloP" --database-type VARIANT

Publishing a Database

You can publish databases under your personal namespace by using your login as the prefix. For example, if your login is pstawinski, your databases will be named as pstawinski/name:version.
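
For example, a push of the blah database created above to a personal namespace could look like this:

java -jar GeneBeClient.jar annotation push \
  --id pstawinski/blah:0.0.2 --public true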

You can also publish databases as an organization. Organizations allow multiple people to collaborate on building and editing databases. They can also share private databases that are not publicly accessible. You can create a new organization on your profile page. Once created, you can submit a database to the organization's namespace by prefixing the organization name with @. For example, if your organization is named genebe, you can publish a database as follows:

java -jar GeneBeClient.jar annotation push \
  --id @genebe/phylop100way:0.0.1 --public true

After you publish the database, you can edit its description page.