OncodashKB is a set of modules to create a reproducible Semantic Knowledge Graph (SKG) from existing iterable databases, with the aim of helping to find actionable drugs against cancers.
As of now, our main use case is a pre-clinical study about finding drugs that are actionable on some high-grade serous ovarian cancers.
Under the hood, it uses OntoWeaver to describe how to import data. OntoWeaver in turn uses BioCypher for ontology composition and for exporting the SKG to a Neo4j database.
The project uses UV. You can install OncodashKB using the commands below:
git clone https://github.com/oncodash/oncodashkb.git
cd oncodashkb
uv sync
UV will create a virtual environment according to your configuration (either
centrally or in the project folder). You can activate it by running uv run bash
inside the project directory.
If you have a problem with the uv sync command, it may be that the
uv lock command has not been run after modifying the dependencies in
$ONCODASHKB_HOME/pyproject.toml. Try running uv lock to fix the issue.
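The recovery step described here can be wrapped in a small helper. This is only a sketch: the function name sync_with_relock and the UV_BIN variable are hypothetical and not part of OncodashKB; UV_BIN exists solely so that the command can be overridden.

```shell
# Hypothetical helper: retry `uv sync` once after refreshing the lock file.
# UV_BIN is an assumption of this sketch; it defaults to the `uv` command.
sync_with_relock() {
    uv_bin="${UV_BIN:-uv}"
    if "$uv_bin" sync; then
        return 0
    fi
    echo "uv sync failed; running uv lock and retrying." >&2
    "$uv_bin" lock && "$uv_bin" sync
}

# Usage, from the project directory:
# sync_with_relock
```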
Please install the pre-commit hooks before committing or pushing anything new:
pre-commit install
pre-commit install --hook-type pre-push
pre-commit install --hook-type commit-msg
Theoretically, OncodashKB can be exported to any graph database supported by BioCypher's backends.
As of now, OncodashKB targets Neo4j, which has some restrictions (most notably, it does not support type hierarchies on edges).
So far, it has been extensively tested with Neo4j 5+, but it should also work with Neo4j 4+.
The "Graph Database Self-Managed" version of Neo4j can be downloaded
from the Neo4j website.
When using this version, be sure to add the bin/ directory to PATH and
PYTHONPATH, and to set JAVA_HOME to the correct version of Java.
Note that the Community Edition of Neo4j does not support multiple databases,
hence the need to configure the default database in $NEO4J_HOME/conf/neo4j.conf
to be: initial.dbms.default_database=oncodash (which is commented out by
default, hence the default database will be called neo4j).
Note that the default database does not always need to be named oncodash,
but should match the name of the database in
$ONCODASHKB_HOME/config/biocypher_config.yaml.
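To see which default database a given neo4j.conf will produce, the setting can be read back from the file. The sketch below is illustrative only: the function name default_db_name is hypothetical, and it assumes the setting is written on a single line of the form key=value.

```shell
# Hypothetical helper: print the default database name set in a neo4j.conf.
# Commented-out lines (starting with '#') are ignored; if no active setting
# is found, "neo4j" is printed, which is the default name mentioned above.
default_db_name() {
    conf_file="$1"
    name=$(sed -n 's/^[[:space:]]*initial\.dbms\.default_database[[:space:]]*=//p' "$conf_file" \
        | tail -n 1 | tr -d '[:space:]')
    echo "${name:-neo4j}"
}

# Usage, assuming NEO4J_HOME is set:
# default_db_name "$NEO4J_HOME/conf/neo4j.conf"
```

The printed name can then be compared by eye with the database name in biocypher_config.yaml.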
The quickest way to build OncodashKB is to call:
./prepare.sh <DECIDER_data_dir> <DECIDER_snapshot_dir> # Checks all the needed data and downloads them if necessary.
./make.sh <DECIDER_snapshot_dir> [config] [debug] # Runs what's needed to build the SKG, and runs a test Cypher query.
Note the optional "debug" option of the make.sh script, which enables more
verbose logging and a non-optimized Python run, and stops on any error.
If you need to handle some of the steps yourself, the following sections try to explain some subtleties.
As of now, OncodashKB depends on data coming from the DECIDER project.
If you have an archive of the Eduuni data, unzip it somewhere and pass its path
to the prepare.sh script.
The second argument of the prepare script is where you want to put a snapshot
of those data (usually, you would use DECIDER_$(date -I)).
This second argument is the directory you will pass to the make.sh script.
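As a sketch of how the two arguments fit together, the snippet below builds a dated snapshot name and prints the two calls rather than running them. The DECIDER_DATA_DIR variable and its default location are assumptions made for illustration; adapt them to wherever you unzipped the archive.

```shell
# Build a dated snapshot directory name of the form DECIDER_YYYY-MM-DD.
# `date -I` prints the current date in ISO 8601 format.
snapshot_dir="DECIDER_$(date -I)"

# Hypothetical location of the unzipped Eduuni archive (an assumption).
data_dir="${DECIDER_DATA_DIR:-$HOME/DECIDER_data}"

# Print the two calls instead of running them, so they can be reviewed first.
echo "./prepare.sh \"$data_dir\" \"$snapshot_dir\""
echo "./make.sh \"$snapshot_dir\""
```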
So far, OncodashKB scripts are not generalized to avoid dependencies on
DECIDER data. However, it should be technically feasible to make OntoWeaver
mappings for your own data, and integrate them with the mappings that are
referenced here. See the prepare.sh script to see what mappings it downloads.
You can pass a different BioCypher config file using the [config] argument
of the make.sh script. For instance:
./make.sh data/DECIDER_test config/biopathnet.yaml # This will export the SKG in the BPN format.
The make.sh script calls the weave.py script internally.
The weave.py command will include the data files that you indicate into a part
of the SKG. It follows the general form of:
uv run weave.py --database-A <data_file> --database-B <data_file> […]
You can get a list of supported options by running:
uv run weave.py --help
Once executed, weave.py prepares a shell script named
neo4j-admin-import-call.sh in a timestamped sub-directory in
'$ONCODASHKB_HOME/biocypher-out'. The complete path of this file is printed at
the end of execution; make.sh captures it with a subshell:
import_script=$(uv run weave.py […])
In case your Neo4j installation needs the environment variable 'NEO4J_HOME', you will have to delete the 'bin/' prefix in the import script:
version=$(~~bin/~~neo4j-admin --version | cut -d '.' -f 1)
...
Before importing the data by calling the import script, make.sh ensures that the Neo4j
server is stopped. Executing the import script will connect directly to the
Neo4j server data files, and feed them with the extracted graph:
sh $import_script # If you captured the path as shown above.
# OR
sh <YOUR_PATH_TO>/neo4j-admin-import-call.sh # To call it directly.
You can start the Neo4j server by using either of the commands below.
Neo4j 5+:
neo4j-admin server start
Neo4j 4:
neo4j start
This will give you an HTTP link to the "Neo4j browser" where you can explore
your graph from your own Web browser. By default, the link to Neo4j browser is:
http://localhost:7474.
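The version-dependent server commands can be selected with a small helper that reuses the same major-version extraction as the import script (cut -d '.' -f 1). This is a sketch only: the function name neo4j_server_cmd is hypothetical, and it assumes the version is given as a bare string such as 5.12.0. It covers both the start commands and the matching stop commands.

```shell
# Hypothetical helper: print the server command matching a Neo4j version.
# $1 is the action ("start" or "stop"), $2 a version string such as "5.12.0".
neo4j_server_cmd() {
    action="$1"
    major=$(echo "$2" | cut -d '.' -f 1)
    case "$action" in
        start|stop) ;;
        *) echo "unknown action: $action" >&2; return 1 ;;
    esac
    if [ "$major" -ge 5 ]; then
        # Neo4j 5+ manages the server through neo4j-admin.
        echo "neo4j-admin server $action"
    else
        # Neo4j 4 uses the neo4j command directly.
        echo "neo4j $action"
    fi
}

# Usage, assuming `neo4j-admin --version` prints a bare version number:
# $(neo4j_server_cmd start "$(neo4j-admin --version)")
```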
You can stop the server by using either of the commands below.
Neo4j 5+:
neo4j-admin server stop
Neo4j 4:
neo4j stop
Cancer Genome Interpreter (CGI) is a cancer database that contains information about various genetic alterations that can be associated with the patient, gene details, samples, disease type, and transcript information.
To launch the CGI adapter, use the --cgi option with the path to the CSV file
containing the data that you want to integrate.
Example of use:
./weave.py --cgi /path_to_file/test_genomics_cgimutation.csv
OncoKB is a cancer database that contains information about various genetic alterations that can be associated with the patient, gene details, samples, and disease type, as well as treatment options with FDA, OncoKB evidence levels, and related publications.
To launch the OncoKB adapter, use the --oncokb option with the path to the CSV file
containing the data that you want to integrate.
Example of use:
./weave.py --oncokb /path_to_file/test_genomics_oncokbannotation.csv
Open Targets is a public database that aims to systematically identify and prioritize drug targets for disease treatment. The adapter described here helps to integrate data about targets, diseases/phenotypes, drugs and evidence.
The current adapter works with data in Parquet format.
To download the necessary data, check what the prepare.sh script downloads.
You can also visit
Open Targets' download page
and download the needed datasets separately.
As the Open Targets database contains millions of rows, you need to specify the genes of interest (Hugo symbols and Ensembl IDs) in the configuration files, so that only the necessary information is integrated:
- Hugo symbols in the file
oncodashkb/adapters/Hugo_Symbol_genes.conf
- Ensembl IDs in the file
oncodashkb/adapters/Ensembl_genes.conf
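Before launching the Open Targets adapter, it may help to confirm that both gene lists are in place. The check below is a sketch that makes no assumption about the content of the files: it only verifies that they exist and are non-empty. The function name check_gene_confs is hypothetical.

```shell
# Hypothetical pre-flight check: both gene configuration files must exist
# and be non-empty. $1 is the adapters directory (oncodashkb/adapters by default).
check_gene_confs() {
    adapters_dir="${1:-oncodashkb/adapters}"
    status=0
    for conf in Hugo_Symbol_genes.conf Ensembl_genes.conf; do
        if [ ! -s "$adapters_dir/$conf" ]; then
            echo "Missing or empty: $adapters_dir/$conf" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage, from the root of the repository:
# check_gene_confs && echo "Gene lists found."
```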
Example of use for the integration of targets, diseases, drugs and evidence (only from ChEMBL):
./weave.py --open_targets path_to_OpenTargets/OpenTargets/targets --open_targets_drugs path_to_OpenTargets/OpenTargets/molecule --open_targets_diseases path_to_OpenTargets/OpenTargets/diseases --open_targets_evidences path_to_OpenTargets/OpenTargets/evidence/sourceId\=chembl
When modifying any dependencies in $ONCODASHKB_HOME/pyproject.toml,
be sure to run uv lock.
Hints and tips about designing the ontology alignments:
- Ontologies may be browsed with Protégé.
- The Biolink model
has (a lot of) classes attached at the root
Thing. These are actually decommissioned; the actual classes are under entity.
If you operate OncodashKB over sensitive data, you may want to enable Git hooks that check for potential data leaks before committing anything. See the "installation" section above.
To check whether there is some data in your graph database, you can use the command-line client of Neo4j:
cypher-shell -d oncodash -u neo4j "MATCH (n) RETURN n LIMIT 5;"
and you should see 5 nodes.
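For a scripted check, a count query can be easier to test than a node listing. The sketch below assumes that cypher-shell is run with its --format plain option and prints a header line followed by the count; the function name has_nodes is hypothetical.

```shell
# Hypothetical helper: succeed if the last line read on stdin is a positive integer.
has_nodes() {
    count=$(tail -n 1 | tr -d '[:space:]')
    case "$count" in
        ''|*[!0-9]*) return 1 ;;  # Empty or non-numeric output: treat as failure.
    esac
    [ "$count" -gt 0 ]
}

# Usage (assumes the same database and user as above):
# cypher-shell -d oncodash -u neo4j --format plain \
#     "MATCH (n) RETURN count(n) AS n;" | has_nodes && echo "Graph is not empty."
```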
To visualize [a part of] the graph, you can use neo4j-browser with a similar Cypher query.
Notes:
- Neo4j-browser may need a specific Node.js version;
you can install it with:
pip install nodeenv
nodeenv --node=16.10.0 env
. env/bin/activate
npm install yarn
yarn install
yarn start
- The Neo4j server disables connections across the network by default.
To connect the browser to a server on another machine,
be sure to edit the server's
neo4j.conf with the 0.0.0.0 address:
server.bolt.listen_address=0.0.0.0:7687