Creating a Variant Database
GeneBe Hub is a repository for variant databases. If you have data you'd like to share, such as population frequency data, variant scores, pathogenicity assignments, or any other type of annotation, you can use GeneBe Hub to share it. This makes it easy for other bioinformaticians to discover and apply your data.
To create a database, your annotation data must be in one of the following formats: csv (or tsv), vcf, or parquet. If your database is very large (over 1 GB), the parquet format is recommended for efficiency. For smaller databases, the vcf format is often easier to use. However, since many databases are already distributed as csv or tsv, those formats are supported as well.
Below you will find short instructions on how to create your databases. You may also be interested in the Annotation builder scripts repository on GitHub, which contains the scripts used to create some of the databases in the @genebe namespace.
Importing a VCF File
Importing a VCF file is easy. Here is an example of importing the ClinVar dump.
java -jar GeneBeClient.jar annotation create-from-vcf \
--input-vcf /tmp/clinvar.vcf.gz \
--name clinvar --owner pstawinski --version 0.0.8 \
--columns CLNDN:TEXT CLNREVSTAT:TEXT CLNSIG ONCDN ONC ONCREVSTAT SCIDN SCI SCIREVSTAT
Importing a TSV/CSV File
If your database is a comma-separated file (.csv) or tab-separated file (.tsv), it is easy to convert it to the GeneBe format. If your database is in Excel format (.xls or .xlsx), it can be easily exported to .tsv and then imported.
It's easiest to import a tsv if it has a single header line. The converter will look for:
- chr, pos, ref, alt columns for databases of type VARIANT, where pos is 1-based, like in VCF
- chr, pos columns for databases of type POSITION, where pos is 1-based, like in VCF
- chr, start, end columns for databases of type RANGE, where start and end are 0-based (!), like in a BED file: start is inclusive and end is exclusive, see https://en.wikipedia.org/wiki/BED_(file_format)
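For example, a minimal VARIANT-type tsv (tab-separated, with hypothetical values and one extra annotation column) could look like this:

chr	pos	ref	alt	score
1	12345	A	G	0.97
2	54321	AT	A	0.12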
If your file is a tsv, just run:
java -jar GeneBeClient.jar annotation create-from-tsv \
--input /tmp/blah.tsv --owner pstawinski --name blah --version 0.0.2 \
--has-header true
If your file is a csv, you have to specify the separator with --separator ,:
java -jar GeneBeClient.jar annotation create-from-tsv \
--input /tmp/blah.csv --owner pstawinski --name blah --version 0.0.2 \
--separator , \
--has-header true
Most of the tsv databases you find in the wild are not that tidy. They may lack a header or have multiple header lines, or the contig column may be named chrom, #chrom, chr_hg38, etc. instead of chr. That is why you can:
- provide your own header using --header, like --header chr pos ref alt data1 data2
- rename columns in the data using --columns: specify a column as old_name/new_name:column_type, where column_type is one of BOOL, INT8, INT16, INT32, INT64, FLOAT32, FLOAT64, TEXT. So if the chr column in your input data is named contig, write --columns contig/chr:TEXT
- skip rows from the top using --skip number_of_rows
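For example, for a hypothetical file with two comment lines above a header whose contig column is named chrom, these options can be combined like this (file name and values are made up; only the flags described above are used):

java -jar GeneBeClient.jar annotation create-from-tsv \
    --input /tmp/scores.tsv --owner pstawinski --name scores --version 0.0.1 \
    --skip 2 \
    --has-header true \
    --columns chrom/chr:TEXT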
If your CSV/TSV file is huge (like CADD scores for the whole genome), convert it to a hive-partitioned parquet file using your favourite programming language and environment, and import the database from the parquet file. You will find more details below, where creating a database from a parquet file is described.
Creating from a Parquet File
If you are working with a huge database, for example scores for the whole genome, it may be easier to prepare the database using your favourite programming language and environment than to import a tsv or vcf file. What you have to create in this step is a parquet database, hive-partitioned on the _seq column. The format is described in the "Format" part of the help. In summary, for the VARIANT database type you need to provide a parquet database with _seq (chromosome without the chr prefix), _pos (0-based position), _del (number of deleted bases), and _ins (inserted bases) columns. This is essentially the well-known SPDI variant representation.
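As a worked example (coordinates are made up), two VCF-style records map to these columns as follows, after trimming the bases common to ref and alt:

VCF 1 12345 A G   ->  _seq=1, _pos=12344, _del=1, _ins=G
VCF 1 12345 TA T  ->  _seq=1, _pos=12345, _del=1, _ins= (empty: a pure deletion of one base)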
As I like to use Apache Spark for processing big files, I usually convert my database to a dataframe containing chr, pos, ref, alt columns (like in VCF), and then I use this function:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}

def processDataFrame(df: DataFrame, outputDir: String): Unit = {
  // Convert a VCF-style record (chr, 1-based pos, ref, alt) to the SPDI-like
  // representation (_seq, 0-based _pos, _del, _ins) by trimming common bases
  def toSpdi(chrInput: String, posInput: Int, refInput: String, altInput: String): (String, Int, Int, String) = {
    if (chrInput == null || chrInput.isBlank || posInput < 0) {
      return (null, -1, -1, null)
    }
    var seq = chrInput.stripPrefix("chr")
    if (seq == "MT") seq = "M" // Normalize for human genomes
    var ref = if (refInput == null) "" else refInput
    var alt = if (altInput == null) "" else altInput
    val vcfPos = posInput

    // Remove leading common bases
    var leadingCommon = 0
    while (leadingCommon < ref.length && leadingCommon < alt.length &&
        ref.charAt(leadingCommon) == alt.charAt(leadingCommon)) {
      leadingCommon += 1
    }
    ref = ref.substring(leadingCommon)
    alt = alt.substring(leadingCommon)
    val pos = vcfPos - 1 + leadingCommon // Convert to 0-based

    // Remove trailing common bases
    var trailingCommon = 0
    while (ref.length > trailingCommon && alt.length > trailingCommon &&
        ref.charAt(ref.length - 1 - trailingCommon) == alt.charAt(alt.length - 1 - trailingCommon)) {
      trailingCommon += 1
    }
    ref = ref.substring(0, ref.length - trailingCommon)
    alt = alt.substring(0, alt.length - trailingCommon)

    // del is the number of deleted bases (the remaining ref), ins the inserted bases
    val del = ref.length
    val ins = alt
    (seq, pos, del, ins)
  }

  // Create a UDF from the function; the returned tuple becomes a struct column
  val toSpdiUdf = udf((chrInput: String, posInput: Int, refInput: String, altInput: String) =>
    toSpdi(chrInput, posInput, refInput, altInput)
  )

  // Apply transformations: unpack the struct into the four SPDI columns
  val dfWithSpdi = df
    .withColumn("spdi", toSpdiUdf(col("chr"), col("pos"), col("ref"), col("alt")))
    .withColumn("_seq", col("spdi").getField("_1"))
    .withColumn("_pos", col("spdi").getField("_2").cast("int"))
    .withColumn("_del", col("spdi").getField("_3").cast("int"))
    .withColumn("_ins", col("spdi").getField("_4"))
    .drop("spdi")

  val finalDf = dfWithSpdi
    .drop("chr", "pos", "ref", "alt")
    .dropDuplicates("_seq", "_pos", "_del", "_ins")

  // Write the output, hive-partitioned on _seq
  finalDf
    .repartition(col("_seq"))
    .sortWithinPartitions(col("_seq"), col("_pos"), col("_del"), col("_ins"))
    .write
    .option("compression", "zstd")
    .partitionBy("_seq")
    .parquet(outputDir)
}
// Apply the function to a prepared dataframe df with chr, pos, ref, alt columns:
processDataFrame(df, "/tmp/output_ready_dataframe.parquet")
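For completeness, here is a minimal sketch of how such an input dataframe could be built from a tsv file; the path, the presence of a header line, and the column names are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("genebe-prep").getOrCreate()

// Read a tab-separated file with a header line; it is assumed to
// contain the chr, pos, ref, alt columns expected by processDataFrame.
val df = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/blah.tsv")
  .withColumn("pos", col("pos").cast("int")) // the UDF expects an integer position

This df can then be passed to processDataFrame as shown above.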
Then we can import this database, for example:
java -jar GeneBeClient.jar annotation create-from-parquet \
--input /tmp/output_ready_dataframe.parquet --owner @genebe --name phylop100way --version 0.0.1 \
--species homo_sapiens --genome GRCh38 \
--title "Conservation scoring by phyloP" --database-type VARIANT
Publishing a Database
You can publish databases under your personal namespace by using your login as the prefix. For example, if your login is pstawinski, your databases will be named pstawinski/name:version.
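For instance, assuming the same push syntax as in the organization example below, pushing a personal database could look like this:

java -jar GeneBeClient.jar annotation push \
    --id pstawinski/blah:0.0.2 --public true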
You can also publish databases as an organization. Organizations allow multiple people to collaborate on building and editing databases. They can also share private databases that are not publicly accessible. You can create a new organization on your profile page. Once created, you can submit a database to the organization's namespace by prefixing the organization name with @. For example, if your organization is named genebe, you can publish a database as follows:
java -jar GeneBeClient.jar annotation push \
--id @genebe/phylop100way:0.0.1 --public true
After you publish the database, you can edit its description page.