GeneBe Database Format

Databases are stored in the Parquet format, which is widely supported in many programming languages.

There are three types of annotations, with two more planned:

  • VARIANT - Describes a single variant, defined by its chromosomal position and the change.
  • POSITION - Describes a single position, defined by its chromosomal position.
  • RANGE - Annotates ranges, for example chr1:5000-10000. A variant is annotated if it overlaps the annotation range.
  • (not yet ready) annotations by genes
  • (not yet ready) annotations by transcripts
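
The overlap rule for RANGE annotations can be illustrated with a small sketch. It assumes both intervals are on the same chromosome and have already been converted to 0-based, half-open coordinates; the coordinate convention of RANGE databases is not stated here, and the function name is hypothetical.

```python
def ranges_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Return True if half-open intervals [a_start, a_end) and
    [b_start, b_end) share at least one position.

    Assumption: both intervals use 0-based, half-open coordinates
    and refer to the same chromosome.
    """
    return a_start < b_end and b_start < a_end


# A two-base variant that reaches into the range is annotated ...
print(ranges_overlap(4999, 5001, 5000, 10000))
# ... while one that ends exactly where the range starts is not.
print(ranges_overlap(4990, 5000, 5000, 10000))
```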

A simple variant in VCF is described by Chromosome, Position (1-based), Reference sequence, and Alternate sequence (CPRA). GeneBe instead uses Sequence, Position (0-based), Deletion (number of deleted bases), and Insertion sequence (SPDI). This SPDI format is similar to the one described in this publication, but there are differences. In GeneBe:

  • Sequence is a chromosome name encoded as 1, 2, ..., X, without the chr prefix. The NC_ notation is not supported.
  • For the mitochondrion, the letter M is used.

The columns for SPDI are named _seq, _pos, _del, and _ins.
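
As an illustration of the mapping from CPRA to the four SPDI columns, here is a minimal sketch. The function name is hypothetical, and the conversion is deliberately naive: it does not trim bases shared by the reference and alternate sequences, because whether GeneBe expects a trimmed representation is not stated here.

```python
def cpra_to_spdi(chrom: str, pos: int, ref: str, alt: str):
    """Convert a VCF-style record to (_seq, _pos, _del, _ins).

    Assumption: no normalization or trimming of shared bases is done.
    """
    # Drop the "chr" prefix and use M (not MT) for the mitochondrion
    seq = chrom[3:] if chrom.startswith("chr") else chrom
    if seq == "MT":
        seq = "M"
    # VCF positions are 1-based, SPDI positions are 0-based;
    # _del is the number of reference bases that are replaced
    return seq, pos - 1, len(ref), alt.upper()


print(cpra_to_spdi("chr1", 100, "A", "G"))  # ('1', 99, 1, 'G')
print(cpra_to_spdi("MT", 5, "AT", "A"))     # ('M', 4, 2, 'A')
```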

Parquet databases for VARIANT and POSITION are Hive-partitioned on the _seq column. Databases are sorted by _seq, _pos, _del, and _ins for VARIANT databases, or by _seq and _pos for POSITION databases.

Databases are mainly intended for human data, but the format is flexible and can be used for other species.

Requirements and limitations

Database of type VARIANT:

  • Must contain the columns _seq (String), _pos (Integer, 32-bit), _del (Integer, 32-bit), _ins (String), and must not include other columns starting with _.
  • _seq must be a chromosome ID without the chr prefix; you should use M for mitochondrion (not MT).
  • Avoid using chromosomes (_seq) other than 1–22, X, Y, and M unless absolutely necessary.
  • Column names must be short strings of [a-z0-9_]+ characters, up to 24 characters long, and must start with a letter.
  • _pos is 0-based (not 1-based, as in VCF), following the SPDI format.
  • _del and _ins correspond to D and I from SPDI. _del is a number, and _ins consists of uppercased bases. _ins must not contain any characters other than ACTG.
  • The _seq, _pos, _del, _ins tuple must be unique within a database of type VARIANT.
  • The _seq, _pos tuple must be unique within a database of type POSITION.
  • Data within a partition in Parquet files should be ordered by _seq, _pos, _del, _ins.
  • Data must be Hive-partitioned on _seq. The file format is Parquet, and there may be one or more Parquet files per partition. You should split data into multiple files if single files are significantly larger than 200 MB.
  • Parquet files must have the .parquet extension.
  • Parquet files must be compressed using widely supported compression algorithms, e.g., snappy or zstd. If in doubt, use zstd.
  • Files must not contain symlinks.
  • Use widely supported versions of Parquet files. Both version 1 and version 2 are acceptable as long as they are widely supported; if in doubt, use version 1.
  • File names must only use characters [a-zA-Z0-9.@-] and must not include the substring ...
  • Carefully select column types in the description.toml file. Prefer int16 over int32 if possible, and int32 over int64. Similarly, float32 is usually sufficient, with no need for float64.
  • Prefer numbers over text. For example, store dbSNP as a number rather than as text rsNUMBER.
  • Aim to keep data small by removing duplicated columns and common prefixes before creating the dataset.
  • Avoid including unnecessary columns in the database. Specifically, do not include chr, pos, ref, or alt columns, as their information is already encoded in SPDI.
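
Some of the rules above can be checked mechanically before publishing a database. The following is a rough sketch, not an official validator: the function names are hypothetical, the rows are assumed to come from a single partition as (_seq, _pos, _del, _ins) tuples, and an empty _ins (a pure deletion) is assumed to be allowed.

```python
import re

# Column names: [a-z0-9_]+, starting with a letter, at most 24 characters
COLUMN_NAME = re.compile(r"[a-z][a-z0-9_]{0,23}")
# _ins: uppercase bases only (assumption: may be empty for pure deletions)
INS_BASES = re.compile(r"[ACGT]*")


def is_valid_column_name(name: str) -> bool:
    return COLUMN_NAME.fullmatch(name) is not None


def check_variant_rows(rows):
    """Return a list of problems found in (_seq, _pos, _del, _ins) rows."""
    problems = []
    previous = None
    for index, row in enumerate(rows):
        if not INS_BASES.fullmatch(row[3]):
            problems.append(f"row {index}: _ins contains characters other than ACGT")
        if previous is not None:
            if row == previous:
                problems.append(f"row {index}: duplicate of the previous row")
            elif row < previous:
                problems.append(f"row {index}: not sorted by _seq, _pos, _del, _ins")
        previous = row
    return problems


rows = [("1", 99, 1, "G"), ("1", 99, 1, "G"), ("1", 50, 1, "acgt")]
for problem in check_variant_rows(rows):
    print(problem)
```

Because duplicates are only compared with the preceding row, the uniqueness check relies on the data being sorted; unsorted input is reported separately.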

Naming Convention

  • Database names must consist of [a-z] letters, [0-9] digits, and a hyphen (-). The name must not start with a digit, must not start or end with a hyphen, and must not contain consecutive hyphens.
  • Database names should be short and expressive. They must follow the pattern [a-z0-9-_], start with a letter, and not include consecutive _ or - characters.
  • Databases are versioned using Semantic Versioning. If the structure of the database changes (e.g., important fields are deleted), increment the major version number. If new columns are added, increment the minor version number. Changes that affect only the database content should increment the patch version number.
  • If the database data naturally follows another versioning system (e.g., ClinVar is versioned by date), use the pre-release part of Semantic Versioning. For example: 1.0.0-20241201.
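
The naming and versioning rules can be expressed as two regular expressions. This is a sketch under stated assumptions: it follows the stricter name rule (lowercase letters, digits, and single hyphens, with no underscores), it uses a simplified Semantic Versioning pattern without build metadata, and the function names are hypothetical.

```python
import re

# Starts with a letter; hyphens only between alphanumeric groups,
# so no leading, trailing, or consecutive hyphens
DB_NAME = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")
# MAJOR.MINOR.PATCH with an optional pre-release part, e.g. 1.0.0-20241201
DB_VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?")


def is_valid_db_name(name: str) -> bool:
    return DB_NAME.fullmatch(name) is not None


def parse_db_version(version: str):
    """Return (major, minor, patch, pre_release); pre_release may be None."""
    match = DB_VERSION.fullmatch(version)
    if match is None:
        raise ValueError(f"not a valid version: {version!r}")
    major, minor, patch, pre_release = match.groups()
    return int(major), int(minor), int(patch), pre_release


print(is_valid_db_name("my-db2"))          # True
print(is_valid_db_name("my--db"))          # False
print(parse_db_version("1.0.0-20241201"))  # (1, 0, 0, '20241201')
```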

Example of a tree structure

If you are not familiar with Hive partitioning, this is an example of the file structure for a VARIANT or POSITION database:

parquet/
├── _seq=1
│   └── part-00014-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=10
│   └── part-00016-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=11
│   └── part-00002-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=12
│   └── part-00018-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=13
│   └── part-00019-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=14
│   └── part-00020-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=15
│   └── part-00001-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=16
│   └── part-00006-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=17
│   └── part-00009-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=18
│   └── part-00008-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=19
│   └── part-00011-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=2
│   └── part-00021-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=20
│   └── part-00015-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=21
│   └── part-00020-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=22
│   └── part-00005-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=3
│   └── part-00003-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=4
│   └── part-00017-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=5
│   └── part-00007-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=6
│   └── part-00010-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=7
│   └── part-00000-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=8
│   └── part-00004-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=9
│   └── part-00013-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=X
│   └── part-00012-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
└── _seq=Y
    └── part-00008-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
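
In this layout the partition value is carried by the directory name (_seq=VALUE) rather than by the files themselves. A small sketch using only the Python standard library shows how the layout can be inspected; the function name is hypothetical.

```python
from pathlib import Path


def list_partitions(parquet_dir):
    """Map each _seq value to the Parquet file names in its partition."""
    prefix = "_seq="
    partitions = {}
    for sub in sorted(Path(parquet_dir).iterdir()):
        # Each partition is a directory named _seq=<chromosome>
        if sub.is_dir() and sub.name.startswith(prefix):
            seq = sub.name[len(prefix):]
            partitions[seq] = sorted(p.name for p in sub.glob("*.parquet"))
    return partitions
```

Calling list_partitions on the parquet directory of a database returns a dictionary keyed by chromosome, for example "1", "X", or "Y" for the tree above.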

How can I open a database?

Databases used in GeneBe are stored in the widely-supported Parquet format. You don't have to use GeneBeClient to work with these databases. If you’re using a Linux system, the databases are stored locally at:

~/.genebe/annotations/{owner}/{database-name}/

On Windows:

C:\Users\USERNAME\AppData\Roaming\genebe\annotations\{owner}\{database-name}\

The data itself is stored in the following directory:

~/.genebe/annotations/{owner}/{database-name}/parquet
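
The locations above can be resolved in a script as follows. This is a sketch with a hypothetical function name. It assumes that the APPDATA environment variable points to the AppData\Roaming directory, that the parquet subdirectory is used on Windows as well, and that other non-Windows systems follow the Linux layout.

```python
import os
import sys
from pathlib import Path


def annotations_dir(owner: str, database_name: str) -> Path:
    """Return the local parquet directory of a downloaded database."""
    if sys.platform.startswith("win"):
        # Assumption: APPDATA is C:\Users\USERNAME\AppData\Roaming
        base = Path(os.environ["APPDATA"]) / "genebe"
    else:
        # Assumption: non-Windows systems use the Linux location
        base = Path.home() / ".genebe"
    return base / "annotations" / owner / database_name / "parquet"


# "some-owner" and "some-database" are placeholders, not real names
print(annotations_dir("some-owner", "some-database"))
```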

Below are several tools and methods you can use to view the contents of a Hive-partitioned Parquet database:

1. Using DuckDB

DuckDB is a lightweight SQL database that supports Parquet files natively. Here’s how to use it:

Steps:

  1. Open a terminal and start DuckDB:

    duckdb
    
  2. Query the Parquet files directly with read_parquet (no import step is needed):

    SELECT * FROM read_parquet('~/.genebe/annotations/{owner}/{database-name}/parquet/**/*.parquet', hive_partitioning = true);
    

2. Using Apache Spark

Apache Spark is a powerful data processing engine that supports Hive-partitioned Parquet files.

Steps:

  1. Start the Spark Shell:
    /path/to/spark/bin/spark-shell
    
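  2. Load the Parquet data (Spark does not expand ~, so build the path from the home directory):
    val home = sys.props("user.home")
    val df = spark.read.parquet(home + "/.genebe/annotations/{owner}/{database-name}/parquet")
    df.show()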
    

3. Using Python with pandas and pyarrow

Python’s pandas library, with pyarrow or fastparquet as a backend, makes it easy to explore Parquet files.

Steps:

  1. Install pandas and pyarrow:
    pip install pandas pyarrow
    
  2. Load and display the data:
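    import os
    import pandas as pd
    
    # Expand ~ explicitly so the directory path resolves regardless of the engine
    parquet_dir = os.path.expanduser("~/.genebe/annotations/{owner}/{database-name}/parquet")
    df = pd.read_parquet(parquet_dir, engine="pyarrow")
    print(df.head())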
    

4. Using Command-Line Tools

Command-line tools like parquet-tools can inspect Parquet files directly.

Steps:

  1. Install parquet-tools (the show subcommand used below is provided by the Python package):
    pip install parquet-tools
    
  2. View the contents of a Parquet file:
    parquet-tools show ~/.genebe/annotations/{owner}/{database-name}/parquet/file.parquet
    

5. Using SQL Database Tools

Many database tools like DBeaver or SQL Workbench support Parquet files through plugins or JDBC drivers.

Steps:

  1. Open your database tool and set up a connection for Parquet files.
  2. Browse to the parquet directory and load the data for visualization.