Oncodash Knowledge Base

Overview

OncodashKB is a set of modules to create a reproducible Semantic Knowledge Graph (SKG) from existing iterable databases, with the aim of helping to find actionable drugs against cancers.

As of now, our main use case is a pre-clinical study about finding drugs that are actionable on some high-grade serous ovarian cancers.

Under the hood, it uses OntoWeaver to describe how to import data. OntoWeaver uses BioCypher for composing the ontology and for exporting the SKG to a Neo4j database.

Installation

Source Code

The project uses UV. You can install OncodashKB using the commands below:

git clone https://github.com/oncodash/oncodashkb.git
cd oncodashkb
uv sync

UV will create a virtual environment according to your configuration (either centrally or in the project folder). You can activate it by running uv run bash inside the project directory.

If the uv sync command fails, it may be that uv lock has not been run after dependencies were modified in $ONCODASHKB_HOME/pyproject.toml. Try running uv lock to fix the issue.

Requirements for preventing the publication of patient IDs

Please install the pre-commit hooks before committing or pushing anything new:

pre-commit install
pre-commit install --hook-type pre-push
pre-commit install --hook-type commit-msg
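
To double-check that the three hook types above are in place, a small helper can list the ones that are missing. This is only a sketch: the function name missing_hooks is hypothetical, and it assumes the default Git layout in which hooks live in .git/hooks/ (a custom core.hooksPath or a Git worktree would need a different path).

```shell
#!/bin/sh
# Print the name of each hook type that is not installed in a clone.
# Assumption: hooks are stored as .git/hooks/<hook-type> (default Git layout).
# missing_hooks is a hypothetical helper, not part of OncodashKB.
missing_hooks() {
    repo_dir=${1:-.}
    for hook in pre-commit pre-push commit-msg; do
        if [ ! -f "$repo_dir/.git/hooks/$hook" ]; then
            echo "$hook"
        fi
    done
}
```

Calling missing_hooks from the project directory prints nothing when all three hooks are installed.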

Database

Theoretically, OncodashKB can be exported to any graph database supported by BioCypher's backends.

As of now, OncodashKB targets Neo4j, which has some restrictions (most notably, it does not support type hierarchies on edges).

So far, it has been extensively tested with Neo4j 5+, but it should also work with Neo4j 4+.

Set up

The Neo4j "Graph Database Self-Managed" version can be downloaded from the Neo4j website. When using this version, be sure to add its bin/ directory to PATH and PYTHONPATH, and to set JAVA_HOME to the correct version of Java.

Note that the community edition of Neo4j does not support multiple databases, hence the need to set the default database in $NEO4J_HOME/conf/neo4j.conf: initial.dbms.default_database=oncodash (this line is commented out by default, in which case the default database is called neo4j).

The default database does not have to be named oncodash, but its name must match the database name in $ONCODASHKB_HOME/config/biocypher_config.yaml.
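
To see which default database a given neo4j.conf actually selects, the setting can be read back from the file. The sketch below is an illustration under assumptions: configured_default_db is a hypothetical helper, and it only looks at the initial.dbms.default_database key mentioned above, falling back to neo4j when that line is absent or commented out.

```shell
#!/bin/sh
# Print the value of initial.dbms.default_database found in a neo4j.conf file.
# Commented-out lines (starting with '#') are ignored; if no active line is
# found, print "neo4j", the default name mentioned above.
# configured_default_db is a hypothetical helper, not part of OncodashKB.
configured_default_db() {
    conf_file=$1
    value=$(sed -n 's/^[[:space:]]*initial\.dbms\.default_database[[:space:]]*=[[:space:]]*//p' "$conf_file" | tail -n 1)
    echo "${value:-neo4j}"
}
```

For instance, configured_default_db "$NEO4J_HOME/conf/neo4j.conf" prints the name to compare by hand with the one set in biocypher_config.yaml.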

Usage

Quick start guide

The quickest way to build OncodashKB is to call:

./prepare.sh <DECIDER_data_dir> <DECIDER_snapshot_dir> # Checks all the needed data, downloads them if necessary.
./make.sh <DECIDER_snapshot_dir> [config] [debug] # Runs what's needed to build the SKG, and runs a test Cypher query.

Note the optional "debug" argument of the make.sh script, which enables a more verbose log, runs Python without optimization, and stops on any error.

Detailed build

If you need to handle some of the steps yourself, the following sections explain some subtleties.

Dependency on DECIDER data

As of now, OncodashKB depends on data coming from the DECIDER project. If you have an archive of the Eduuni data, unzip it somewhere and pass its path to the prepare.sh script.

The second argument of the prepare script is where you want to put a snapshot of those data (usually, you would use DECIDER_$(date -I)).

This second argument is the directory you will pass to the make.sh script.
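
The way the two arguments chain together can be sketched as a dry run that only prints the commands. The helper names and the /tmp/eduuni path are hypothetical; the date format is the same one that date -I produces, written out explicitly for portability.

```shell
#!/bin/sh
# Build the dated snapshot directory name suggested above,
# in the form DECIDER_YYYY-MM-DD (same format as `date -I`).
# Both functions are hypothetical helpers, not part of OncodashKB.
snapshot_dir() {
    echo "DECIDER_$(date +%Y-%m-%d)"
}

# Print (without running them) the two build commands for a data directory:
# the snapshot directory given to prepare.sh is the one passed to make.sh.
print_build_commands() {
    data_dir=$1
    snap=$(snapshot_dir)
    echo "./prepare.sh $data_dir $snap"
    echo "./make.sh $snap"
}
```

Running print_build_commands /tmp/eduuni shows both calls sharing the same snapshot directory, which is the point to keep in mind when running the scripts by hand.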

So far, OncodashKB scripts are not generalized to avoid dependencies on DECIDER data. However, it should be technically feasible to make OntoWeaver mappings for your own data, and integrate them with the mappings that are referenced here. See the prepare.sh script to see what mappings it downloads.

Configure the output format

You can pass a different BioCypher config file using the [config] argument of the make.sh script. For instance:

./make.sh data/DECIDER_test config/biopathnet.yaml  # This will export the SKG in the BPN format.

Weave database

The make.sh script calls the weave.py script internally.

The weave.py command integrates the data files that you indicate into the SKG. It follows the general form of:

uv run weave.py --database-A <data_file> --database-B <data_file> […]

You can get a list of supported options by running:

uv run weave.py --help

Import the database

Once executed, weave.py prepares a shell script named neo4j-admin-import-call.sh in a timestamped sub-directory of '$ONCODASHKB_HOME/biocypher-out'. The complete path of this file is printed at the end of execution; make.sh captures it with a command substitution:

import_script=$(uv run weave.py […])
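
If you capture the path yourself, it may be worth checking it before going further. The following is a minimal sketch, assuming the captured value is a single path; check_import_script is a hypothetical helper and is not something make.sh is known to do.

```shell
#!/bin/sh
# Succeed only if the captured path points to an existing file named
# neo4j-admin-import-call.sh; otherwise explain the problem on stderr.
# check_import_script is a hypothetical helper, not part of OncodashKB.
check_import_script() {
    script_path=$1
    if [ -z "$script_path" ]; then
        echo "no path captured" >&2
        return 1
    fi
    if [ ! -f "$script_path" ]; then
        echo "not a file: $script_path" >&2
        return 1
    fi
    case $(basename "$script_path") in
        neo4j-admin-import-call.sh) return 0 ;;
        *) echo "unexpected file name: $script_path" >&2; return 1 ;;
    esac
}
```

A call such as check_import_script "$import_script" returns a non-zero status if weave.py failed before printing the path.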

If your Neo4j installation relies on the environment variable 'NEO4J_HOME', you will have to delete the 'bin/' prefix in the import script:

# Before:
version=$(bin/neo4j-admin --version | cut -d '.' -f 1)
# After:
version=$(neo4j-admin --version | cut -d '.' -f 1)
...
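
Deleting the prefix by hand can be replaced by a sed substitution. This is a sketch under assumptions: strip_bin_prefix is a hypothetical helper, and it assumes the prefix appears as a relative bin/neo4j-admin in the generated script. Absolute paths such as /usr/bin/neo4j-admin are left alone.

```shell
#!/bin/sh
# Print the content of an import script with the relative "bin/" prefix
# removed in front of neo4j-admin. The original file is left untouched.
# The prefix is only removed at the start of a line or after a character
# that cannot be part of a path, so absolute paths are preserved.
# strip_bin_prefix is a hypothetical helper, not part of OncodashKB.
strip_bin_prefix() {
    sed -E 's#(^|[^/[:alnum:]_.-])bin/neo4j-admin#\1neo4j-admin#g' "$1"
}
```

Redirecting the output to a new file keeps the generated script available for comparison.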

Before importing the data by calling the import script, make.sh ensures that the Neo4j server is stopped. The import script operates directly on the Neo4j server's data files, loading the extracted graph into them:

sh $import_script # If you captured the path as shown above.
# OR
sh <YOUR_PATH_TO>/neo4j-admin-import-call.sh # To call it directly.

Start the server

You can start the Neo4j server by using either of the commands below.

Neo4j 5+:

neo4j-admin server start

Neo4j 4:

neo4j start

This will give you an HTTP link to the "Neo4j browser" where you can explore your graph from your own Web browser. By default, the link to Neo4j browser is: http://localhost:7474.

Stop the server

You can stop the server by using either of the commands below.

Neo4j 5+:

neo4j-admin server stop

Neo4j 4:

neo4j stop
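
The choice between the two command families can be sketched as a function of the major version. neo4j_server_cmd is a hypothetical helper; the major version is assumed to be obtained the same way as in the import script, with neo4j-admin --version | cut -d '.' -f 1.

```shell
#!/bin/sh
# Print the server command matching an action ("start" or "stop")
# and a Neo4j major version number, following the two cases listed above.
# neo4j_server_cmd is a hypothetical helper, not part of OncodashKB.
neo4j_server_cmd() {
    action=$1
    major=$2
    if [ "$major" -ge 5 ]; then
        echo "neo4j-admin server $action"
    else
        echo "neo4j $action"
    fi
}
```

The function only prints the command, so you can review it before running it.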

OncodashKB Adapters

CGI adapter

Cancer Genome Interpreter (CGI) is a cancer database that contains information about various genetic alterations that can be associated with a patient, along with gene details, samples, disease types, and transcript information.

To launch the CGI adapter, use the --cgi option followed by the path to the CSV file with the data that you want to integrate.

Example of use:

./weave.py --cgi /path_to_file/test_genomics_cgimutation.csv

OncoKB adapter

OncoKB is a cancer database that contains information about various genetic alterations that can be associated with a patient, along with gene details, samples, and disease types, as well as treatment options with FDA and OncoKB evidence levels, and related publications.

To launch the OncoKB adapter, use the --oncokb option followed by the path to the CSV file with the data that you want to integrate.

Example of use:

./weave.py --oncokb /path_to_file/test_genomics_oncokbannotation.csv

Open Targets adapter

Open Targets is a public database that aims to systematically identify and prioritize drug targets for disease treatment. This adapter integrates data about targets, diseases/phenotypes, drugs, and evidence.

The current adapter works with data in Parquet format.

To see which data are needed, check what the prepare.sh script downloads. You can also visit the Open Targets download page and download the needed datasets separately.

As the Open Targets database contains millions of rows, you need to specify the genes of interest (Hugo symbols and Ensembl IDs) in the configuration files, so that only the necessary information is integrated:

  • Hugo symbols in the file oncodashkb/adapters/Hugo_Symbol_genes.conf
  • Ensembl IDs in the file oncodashkb/adapters/Ensembl_genes.conf

Example of use for integrating targets, diseases, drugs, and evidence (only from ChEMBL):

./weave.py \
    --open_targets path_to_OpenTargets/OpenTargets/targets \
    --open_targets_drugs path_to_OpenTargets/OpenTargets/molecule \
    --open_targets_diseases path_to_OpenTargets/OpenTargets/diseases \
    --open_targets_evidences path_to_OpenTargets/OpenTargets/evidence/sourceId\=chembl

Development

When modifying any dependencies in $ONCODASHKB_HOME/pyproject.toml, be sure to run uv lock.

Hints and tips about designing the ontology alignments:

  • Ontologies may be browsed with Protégé.
  • The Biolink model has (a lot of) classes attached to the root Thing. These are actually decommissioned; the classes in use are under entity.

If you operate OncodashKB over sensitive data, you may want to enable Git hooks that check for potential data leaks before committing anything. See the "Installation" section above.

Side steps

To check whether there is some data in your graph database, you can use the command-line client of Neo4j:

cypher-shell -d oncodash -u neo4j "MATCH (n) RETURN n LIMIT 5;"

and you should see 5 nodes.

To visualize [a part of] the graph, you can use neo4j-browser with a similar Cypher query.

Notes:

  • Neo4j-browser may need a specific Node.js version. You can set one up and start the browser with:
    pip install nodeenv
    nodeenv --node=16.10.0 env
    . env/bin/activate
    npm install yarn
    yarn install
    yarn start
  • The Neo4j server disables connections across the network by default. To connect the browser to a server on another machine, edit the server's neo4j.conf so that it listens on the 0.0.0.0 address: server.bolt.listen_address=0.0.0.0:7687
