GeneBe Database Format
Databases are stored in the Parquet format, which is widely supported in many programming languages.
There are several types of annotations:
- VARIANT: Describes a single variant, defined by its chromosomal position and the change.
- POSITION: Describes a single position, defined by its chromosomal position.
- RANGE: Annotates ranges, for example chr1:5000-10000. A variant is annotated if it overlaps the annotation range.
- (not yet ready) annotations by genes
- (not yet ready) annotations by transcripts
While a simple variant in VCF is described using Chromosome, Position (1-based), Reference sequence, and Alternate sequence (CPRA), in GeneBe we use Sequence, Position (0-based), Deletion (number of deleted bases), and Insertion sequence (SPDI). This SPDI format is similar to the one described in this publication, but there are differences. In GeneBe:
- Sequence is a chromosome name encoded as 1, 2, ..., X, without the chr prefix. The NC_ notation is not supported.
- For the mitochondrion, the letter M is used.
The columns for SPDI are named _seq, _pos, _del, and _ins.
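The mapping from a VCF-style record to these four columns can be sketched as follows. This is a minimal illustration, not GeneBe's own implementation: the `Spdi` type and the `cpra_to_spdi` function are hypothetical names, and the sketch assumes a direct field-by-field mapping with no allele trimming or normalization, since the text above does not say whether any is expected.

```python
from typing import NamedTuple


class Spdi(NamedTuple):
    seq: str   # chromosome name without the "chr" prefix
    pos: int   # 0-based position
    del_: int  # number of deleted reference bases
    ins: str   # inserted sequence, uppercase


def cpra_to_spdi(chrom: str, pos: int, ref: str, alt: str) -> Spdi:
    """Map a VCF-style CPRA record (1-based) to GeneBe-style SPDI fields.

    Assumption: no trimming of shared prefix/suffix bases is performed.
    """
    # Drop a leading "chr" prefix, if present.
    seq = chrom[3:] if chrom.lower().startswith("chr") else chrom
    # The mitochondrion is always named "M" (not "MT").
    if seq.upper() in ("M", "MT"):
        seq = "M"
    # VCF positions are 1-based; SPDI positions are 0-based.
    return Spdi(seq, pos - 1, len(ref), alt.upper())


print(cpra_to_spdi("chr1", 100, "A", "G"))
```

With this mapping, a substitution at VCF position 100 lands on `_pos` 99 with `_del` equal to 1.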
Parquet databases for VARIANT and POSITION are Hive-partitioned using the _seq column. Databases are sorted by _seq, _pos, _del, and _ins for VARIANT databases, or by _seq and _pos for POSITION databases.
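As a rough illustration of the partitioning and sort order described above, the sketch below groups in-memory VARIANT rows by `_seq` and sorts each group. The function name `partition_and_sort` and the dict-based row representation are assumptions made for this example; a real pipeline would more likely do this in a Parquet-aware tool.

```python
from collections import defaultdict


def partition_and_sort(rows):
    """Group rows by _seq, then sort each group by (_pos, _del, _ins).

    Within a single partition _seq is constant, so it does not need
    to be part of the sort key.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["_seq"]].append(row)
    for seq_rows in partitions.values():
        seq_rows.sort(key=lambda r: (r["_pos"], r["_del"], r["_ins"]))
    return dict(partitions)


rows = [
    {"_seq": "2", "_pos": 5, "_del": 1, "_ins": "A"},
    {"_seq": "1", "_pos": 10, "_del": 1, "_ins": "T"},
    {"_seq": "1", "_pos": 10, "_del": 1, "_ins": "G"},
    {"_seq": "1", "_pos": 3, "_del": 0, "_ins": "AC"},
]
for seq, seq_rows in partition_and_sort(rows).items():
    print(seq, seq_rows)
```

For a POSITION database the same idea applies with the shorter key `(_pos,)`.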
Databases are mainly intended for human data, but the format is flexible and can be used for other species.
Requirements and limitations
Database of type VARIANT:
- Must contain the columns _seq (String), _pos (Integer, 32-bit), _del (Integer, 32-bit), _ins (String), and must not include other columns starting with _.
- _seq must be a chromosome ID without the chr prefix; you should use M for mitochondrion (not MT).
- Avoid using chromosomes (_seq) other than 1–22, X, Y, and M unless absolutely necessary.
- Column names must be short strings of [a-z0-9_]+ characters, up to 24 characters long, and must start with a letter (the SPDI columns above are the only exception).
- _pos is 0-based (not 1-based, as in VCF), following the SPDI format.
- _del and _ins correspond to D and I from SPDI.
- _del is a number, and _ins consists of uppercased bases.
- _ins must not contain any characters other than ACTG.
- The _seq, _pos, _del, _ins tuple must be unique within a database of type VARIANT.
- The _seq, _pos tuple must be unique within a database of type POSITION.
- Data within a partition in Parquet files should be ordered by _seq, _pos, _del, _ins.
- Data must be Hive-partitioned on _seq. The file format is Parquet, and there may be one or more Parquet files per partition. You should split data into multiple files if single files are significantly larger than 200 MB.
- Parquet files must have the .parquet extension.
- Parquet files must be compressed using widely supported compression algorithms, e.g., snappy or zstd. If in doubt, use zstd.
- Files must not contain symlinks.
- Use widely supported versions of Parquet files. Both version 1 and version 2 are acceptable as long as they are widely supported; if in doubt, use version 1.
- File names must only use characters [a-zA-Z0-9.@-] and must not include the substring "..".
- Carefully select column types in the description.toml file. Prefer int16 over int32 if possible, and int32 over int64. Similarly, float32 is usually sufficient, with no need for float64.
- Prefer numbers over text. For example, store dbSNP as a number rather than as text rsNUMBER.
. - Aim to keep data small by removing duplicated columns and common prefixes before creating the dataset.
- Avoid including unnecessary columns in the database. Specifically, do not include chr, pos, ref, or alt columns, as their information is already encoded in SPDI.
Naming Convention
- Database names must consist of [a-z] letters, [0-9] digits, and a hyphen (-). The name must not start with a digit, must not start or end with a hyphen, and must not contain consecutive hyphens.
- Database names should be short and expressive. They must follow the pattern [a-z0-9-_], start with a letter, and not include consecutive _ or - characters.
- Databases are versioned using the Semantic Versioning system. If the structure of the database changes (e.g., important fields are deleted), you should increment the major version number. If new columns are added, increment the minor version number. Changes to database content should increment the minor version number.
- If the database data naturally follows another versioning system (e.g., ClinVar is versioned by date), use the pre-release part of Semantic Versioning. For example: 1.0.0-20241201.
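The naming and versioning rules can be checked with two regular expressions, sketched below. The names `is_valid_database_name` and `parse_version` are hypothetical. The sketch follows the stricter first naming rule (letters, digits, and single hyphens only), so it rejects underscores; that choice is an assumption. The version pattern is a simplified subset of Semantic Versioning without build metadata.

```python
import re

# A letter first, then alphanumeric runs separated by single hyphens.
DB_NAME_RE = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")

# MAJOR.MINOR.PATCH with an optional "-pre.release" suffix (simplified).
VERSION_RE = re.compile(
    r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-([0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))?"
)


def is_valid_database_name(name: str) -> bool:
    return DB_NAME_RE.fullmatch(name) is not None


def parse_version(version: str):
    """Return (major, minor, patch, pre_release or None)."""
    match = VERSION_RE.fullmatch(version)
    if match is None:
        raise ValueError(f"not a valid version: {version!r}")
    major, minor, patch, pre_release = match.groups()
    return int(major), int(minor), int(patch), pre_release


print(is_valid_database_name("clinvar"))
print(parse_version("1.0.0-20241201"))
```

Keeping the date in the pre-release part means the three numeric fields still describe the database structure, while the suffix tracks the upstream data release.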
Example of a tree structure
If you are not familiar with Hive partitioning, this is an example of the file structure for a VARIANT or POSITION database:
parquet/
├── _seq=1
│ └── part-00014-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=10
│ └── part-00016-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=11
│ └── part-00002-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=12
│ └── part-00018-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=13
│ └── part-00019-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=14
│ └── part-00020-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=15
│ └── part-00001-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=16
│ └── part-00006-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=17
│ └── part-00009-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=18
│ └── part-00008-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=19
│ └── part-00011-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=2
│ └── part-00021-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=20
│ └── part-00015-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=21
│ └── part-00020-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=22
│ └── part-00005-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=3
│ └── part-00003-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=4
│ └── part-00017-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=5
│ └── part-00007-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=6
│ └── part-00010-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=7
│ └── part-00000-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=8
│ └── part-00004-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=9
│ └── part-00013-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
├── _seq=X
│ └── part-00012-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
└── _seq=Y
  └── part-00008-d0cc7179-686c-4d73-9d81-d24272ed0bda.c000.zstd.parquet
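A layout like the one above can be inspected without any Parquet library, because the partition value is encoded in the directory name. The sketch below is only an illustration; `list_partitions` is a hypothetical helper, and the `parquet` path in the usage lines is a placeholder for a real database directory.

```python
from pathlib import Path


def list_partitions(parquet_dir):
    """Return {_seq value: sorted list of Parquet file names}."""
    partitions = {}
    for sub in sorted(Path(parquet_dir).iterdir()):
        # Hive partition directories are named "<column>=<value>".
        if sub.is_dir() and sub.name.startswith("_seq="):
            seq = sub.name.split("=", 1)[1]
            partitions[seq] = sorted(p.name for p in sub.glob("*.parquet"))
    return partitions


if Path("parquet").is_dir():  # placeholder path, adjust to your database
    for seq, files in list_partitions("parquet").items():
        print(seq, len(files))
```

Note that directory names sort as strings, so `_seq=10` comes before `_seq=2`, just as in the tree above.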
How can I open a database?
Databases used in GeneBe are stored in the widely-supported Parquet format. You don't have to use GeneBeClient to work with these databases. If you’re using a Linux system, the databases are stored locally at:
~/.genebe/annotations/{owner}/{database-name}/
On Windows:
C:\Users\USERNAME\AppData\Roaming\genebe\annotations\{owner}\{database-name}\
The data itself is stored in the following directory:
~/.genebe/annotations/{owner}/{database-name}/parquet
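If you need these locations in a script, they can be derived from the platform. The sketch below is an assumption-laden illustration: `database_dir` and `parquet_dir` are hypothetical helpers, the Windows branch assumes the `APPDATA` environment variable points at the `AppData\Roaming` folder, and every non-Windows system is treated like Linux. The owner and database names in the usage line are placeholders.

```python
import os
from pathlib import Path


def database_dir(owner: str, database_name: str) -> Path:
    """Local directory of a downloaded database (layout as described above)."""
    if os.name == "nt":
        # Assumption: APPDATA is C:\Users\USERNAME\AppData\Roaming.
        appdata = os.environ.get("APPDATA")
        base = Path(appdata) if appdata else Path.home() / "AppData" / "Roaming"
        root = base / "genebe"
    else:
        # Assumption: all non-Windows systems use the Linux layout.
        root = Path.home() / ".genebe"
    return root / "annotations" / owner / database_name


def parquet_dir(owner: str, database_name: str) -> Path:
    """Directory that holds the Hive-partitioned Parquet files."""
    return database_dir(owner, database_name) / "parquet"


print(parquet_dir("some-owner", "some-database"))  # placeholder names
```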
Below are several tools and methods you can use to view the contents of a Hive-partitioned Parquet database:
1. Using DuckDB
DuckDB is a lightweight SQL database that supports Parquet files natively. Here’s how to use it:
Steps:
Open a terminal and start DuckDB:
duckdb
Query the Parquet files directly; enabling hive_partitioning exposes the _seq value from the directory names as a column:
SELECT * FROM read_parquet('~/.genebe/annotations/{owner}/{database-name}/parquet/**/*.parquet', hive_partitioning = true) LIMIT 10;
2. Using Apache Spark
Apache Spark is a powerful data processing engine that supports Hive-partitioned Parquet files.
Steps:
- Start the Spark Shell:
/path/to/spark/bin/spark-shell
- Load the Parquet data:
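// Spark does not expand "~", so build the path from the home directory.
val parquetDir = System.getProperty("user.home") + "/.genebe/annotations/{owner}/{database-name}/parquet"
val df = spark.read.parquet(parquetDir)
df.show()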
3. Using Python with pandas and pyarrow
Python’s pandas library, with pyarrow or fastparquet as a backend, makes it easy to explore Parquet files.
Steps:
- Install pandas and pyarrow:
pip install pandas pyarrow
- Load and display the data:
import pandas as pd

# Point at the dataset root so that _seq is read from the partition directories.
parquet_dir = "~/.genebe/annotations/{owner}/{database-name}/parquet"
df = pd.read_parquet(parquet_dir, engine="pyarrow")
print(df.head())
4. Using Command-Line Tools
Command-line tools like parquet-tools can inspect Parquet files directly.
Steps:
- Install parquet-tools:
pip install parquet-tools
- View the contents of a Parquet file:
parquet-tools show ~/.genebe/annotations/{owner}/{database-name}/parquet/_seq=1/file.parquet
5. Using SQL Database Tools
Many database tools like DBeaver or SQL Workbench support Parquet files through plugins or JDBC drivers.
Steps:
- Open your database tool and set up a connection for Parquet files.
- Browse to the parquet directory and load the data for visualization.