Neo4j data sync

This document describes how data are synchronised from the PostgreSQL relational database (our source of truth) into the Neo4j graph database, and how this affects the Linkurious instance running on a different server.

Architecture

Data are stored in an RDBMS (PostgreSQL), so that we can leverage our experience in web development with Django and Django REST Framework.

Graph queries and visualizations are delegated to a Linkurious instance that connects to a Neo4j server.

A docker-compose stack is created on a staging or production server with:

  • postgres (RDBMS)

  • uwsgi (Django app and spooled processes)

  • nginx (static content)

  • solr (text search)

  • redis (cache)

  • neo4j (graph DB)

The Linkurious instance runs on a different machine altogether, and connects to the Neo4j data source through the remote Bolt connector.
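
A quick way to confirm that the Bolt connector is reachable from the Linkurious machine (NEO4J_HOST is a placeholder; 7687 is Neo4j's default Bolt port):

nc -zv NEO4J_HOST 7687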

Synchronisation strategy

The first, simple strategy we put into practice to keep the graph DB synchronised with the RDBMS is to destroy and re-create the whole graph on each run.

Carefully providing alternative IDs for both nodes and edges, and manually fixing the source key in the Linkurious configuration, allows us to do this with reasonable downtime for Linkurious.

These are the steps:

  • fetch CSV data out of PostgreSQL into the Neo4j import path, following the guidelines for neo4j-admin import (see the CSV sketch after this list)

  • launch neo4j-admin import on graph_new.db (no service disruption)

  • mv graph_new.db => graph.db

  • restart Neo4j (15 secs)

  • force Linkurious to re-connect to the Neo4j source (10 secs)

  • force Linkurious to re-index the data (3 mins)
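
For reference, a minimal sketch of the CSV conventions that neo4j-admin import expects, where the id:ID, :START_ID and :END_ID columns carry the alternative IDs mentioned above (file, column and label names here are illustrative; the real ones are produced by extract_neo4j_csv.sql):

nodes.csv (one header line, then one row per node):

id:ID,name,:LABEL
123,Mario Rossi,Person

edges.csv (relationships reference the node IDs):

:START_ID,:END_ID,:TYPE
123,456,MEMBER_OF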

These steps are implemented in compose/update_neo4j_data.sh, and the following scripts in compose/postgres/ are used:

  • extract_neo4j_csv.sh

  • extract_neo4j_csv.sql

Use this on the dev machine to test (the sed command rewrites the PostgreSQL data path to the local installation's location):

cat compose/postgres/extract_neo4j_csv.sql | \
  sed -e 's/\/var\/lib\/postgresql/\/usr\/local\/var\/postgres/' | \
  psql -Uopdm opdm
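
To check that the CSV files were produced (the exact directory is an assumption, mirroring the neo4j_import path used in the containers below):

ls -l /usr/local/var/postgres/neo4j_import/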

Deploy on staging or production

These operations need to be executed only once, to deploy the scripts on the host and its containers, or again whenever the scripts are changed (during development).

Install jq on the host (to parse JSON content on the command line):

apt-get update
apt-get install jq
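
jq is what lets update_neo4j_data.sh read values out of the JSON responses returned by the Linkurious HTTP API. A hypothetical illustration of the pattern (the URL and field name are placeholders, not the actual API):

curl -s "https://LINKURIOUS_HOST/api/status" | jq -r '.status'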

Copy the main shell script, to be launched by cron, onto the host, together with the two extraction scripts. We do this because we choose to execute cron jobs on the host, for simplicity's sake.

scp compose/update_neo4j_data.sh root@OP_HOST:/root/
scp compose/postgres/extract_neo4j_csv.sh root@OP_HOST:/root/
scp compose/postgres/extract_neo4j_csv.sql root@OP_HOST:/root/

Log into the host and copy the files into the postgres container of the stack, making the shell scripts executable:

ssh root@OP_HOST
chmod a+x update_neo4j_data.sh
docker cp extract_neo4j_csv.sql opdm-service_postgres:/var/lib/postgresql/neo4j_import/
docker cp extract_neo4j_csv.sh opdm-service_postgres:/var/lib/postgresql/neo4j_import/
docker exec -it opdm-service_postgres \
  bash -c "cd /var/lib/postgresql/neo4j_import && chmod a+x extract_neo4j_csv.sh"
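
To double-check that the files landed in the container with the right permissions (an optional sanity check, not part of the original procedure):

docker exec opdm-service_postgres ls -l /var/lib/postgresql/neo4j_import/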

Modify .linkurious_env on the host to set the LINKURIOUS-related environment variables needed to run the script. This must be done so that the cron job can work without stopping and asking for the values. Remove read permissions for group and others on .linkurious_env.

ssh root@OP_HOST

echo "
export LINKURIOUS_USER=pippo@topolinia.top
export LINKURIOUS_PASSWORD=password-segretissima
export LINKURIOUS_SOURCE_KEY=12345
" > .linkurious_env
chmod go-r .linkurious_env
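
The script can then load these values non-interactively; presumably it sources the file along these lines (a sketch, not the actual contents of update_neo4j_data.sh):

# load the Linkurious credentials before calling its API
. /root/.linkurious_env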

Executing the script

ssh root@OP_HOST
./update_neo4j_data.sh
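
Since a complete run takes a few minutes (most of it spent in the Linkurious re-indexing, as noted in the step list above), it can be handy to capture a log while watching the output (an optional variation):

./update_neo4j_data.sh 2>&1 | tee update_neo4j_data.log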

Setting up the cronjob

To execute the update task twice a day, once at 5 o'clock in the morning and once at lunch time:

ssh root@OP_HOST
crontab -e

15 5,13  * * * curl https://cronitor.link/[CODE]/run -m 10 ; cd /root && ./update_neo4j_data.sh && curl https://cronitor.link/[CODE]/complete -m 10

To execute the update task hourly (during reasonable working hours):

ssh root@OP_HOST
crontab -e

15 7-23 * * * curl https://cronitor.link/[CODE]/run -m 10 ; cd /root && ./update_neo4j_data.sh && curl https://cronitor.link/[CODE]/complete -m 10

Cronitor is used to set up alerts when something goes wrong: the /run ping marks the start of the job and the /complete ping its successful completion.