Why one database isn't enough for drug safety

When a clinician prescribes a medication, they must answer three different questions at once: Is this drug appropriate for the patient's history? (structured EHR data). Does it interact with their other medications? (graph relationships). What did real patients experience? (adverse event narratives). No single database paradigm handles all three well.

Traditional systems force clinicians to query multiple disconnected tools and mentally synthesise the results - a cognitive burden that contributes to prescribing errors. MediDB collapses this into one query, one report.

Data sources: Synthea synthetic EHR (PostgreSQL) · DrugBank + SIDER drug interactions (Neo4j) · openFDA FAERS adverse events (MongoDB + Qdrant vectors) · BioLORD-2023 clinical embeddings for semantic search.

Four databases, one orchestration layer

Each database is chosen for its paradigm strength. The orchestration layer sends parallel queries, collects results, and merges them into a structured safety report with a risk level, interaction insights, and alternative drug suggestions.
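The parallel fan-out can be sketched with a thread pool. The four fetch functions below are hypothetical stand-ins for the real PostgreSQL/Neo4j/Qdrant/MongoDB clients, not MediDB's actual client API:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real database clients.
def fetch_patient(pid):        return {"source": "postgres", "patient": pid}
def fetch_interactions(drug):  return {"source": "neo4j", "drug": drug}
def fetch_similar_cases(drug): return {"source": "qdrant", "drug": drug}
def fetch_faers(drug):         return {"source": "mongo", "drug": drug}

def fan_out(patient_id, drug_name):
    """Issue the four paradigm-specific queries in parallel and collect results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {
            "patient": pool.submit(fetch_patient, patient_id),
            "interactions": pool.submit(fetch_interactions, drug_name),
            "similar_cases": pool.submit(fetch_similar_cases, drug_name),
            "faers": pool.submit(fetch_faers, drug_name),
        }
        # .result() blocks until each query returns, so the merge step
        # always sees a complete set of inputs.
        return {name: f.result() for name, f in futures.items()}
```

Because the four stores are independent, the slowest query bounds total latency rather than the sum of all four.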

# Multi-step retrieval pipeline
def evaluate_drug_safety(patient_id, drug_name):
    # 1. Patient context from PostgreSQL
    patient = pg_client.get_patient_history(patient_id)
    current_meds = patient["medications"]

    # 2. Interaction graph from Neo4j
    interactions = neo4j_client.find_interactions(drug_name, current_meds)

    # 3. Similar adverse event cases from Qdrant
    query_text = f"{drug_name} {' '.join(patient['conditions'])}"
    embedding = bioLORD.encode(query_text)
    similar_cases = qdrant.search(embedding, top_k=10,
                                  filters={"severity": "serious"})

    # 4. FAERS raw reports from MongoDB for audit trail
    faers_reports = mongo.get_reports(drug_name)

    return build_safety_report(patient, interactions, similar_cases, faers_reports)

Multi-DB Architecture Design

Designed the paradigm selection: PostgreSQL for relational patient queries, Neo4j for drug interaction traversal, Qdrant for semantic case retrieval, MongoDB for FAERS raw storage + audit logs.

Graph Modeling in Neo4j

Modelled DrugBank interaction data as a property graph - drugs as nodes, interactions as typed edges with severity weights. Cypher traversal detects polypharmacy clusters and direct drug-drug conflicts.
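A traversal like the one described might look as follows; the relationship type and property names (`INTERACTS_WITH`, `severity`) are illustrative assumptions, not the actual MediDB schema:

```python
# Cypher sketch: find interactions between a candidate drug and the
# patient's current medications. Schema names are assumptions.
INTERACTION_QUERY = """
MATCH (d:Drug {name: $drug})-[r:INTERACTS_WITH]-(m:Drug)
WHERE m.name IN $current_meds
RETURN m.name AS other_drug, r.severity AS severity
"""

def rank_interactions(records):
    """Order driver records (plain dicts) by severity weight, most severe first."""
    weight = {"major": 3, "moderate": 2, "minor": 1}
    return sorted(records, key=lambda r: weight.get(r["severity"], 0), reverse=True)

# With the neo4j Python driver this would run roughly as:
#   with driver.session() as s:
#       rows = s.run(INTERACTION_QUERY, drug="warfarin", current_meds=meds).data()
#   ranked = rank_interactions(rows)
```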

BioLORD Vector Search

Used BioLORD-2023 (a biomedical sentence transformer) to embed clinical narratives from FAERS. Qdrant's multi-filter search retrieves top-k real-world adverse event cases matching drug, severity, and patient demographics.
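The multi-filter condition can be expressed as a Qdrant-style filter payload. The payload field names (`drug`, `severity`, `age_band`) are illustrative, not MediDB's actual collection schema:

```python
# Build a Qdrant-style filter payload combining drug, severity, and
# (optionally) a patient demographic field. Field names are assumptions.
def build_case_filter(drug, severity, age_band=None):
    must = [
        {"key": "drug", "match": {"value": drug}},
        {"key": "severity", "match": {"value": severity}},
    ]
    if age_band:
        must.append({"key": "age_band", "match": {"value": age_band}})
    return {"must": must}

# A search request against Qdrant's REST API would then combine this
# with the BioLORD query vector:
#   {"vector": embedding, "limit": 10, "filter": build_case_filter(...)}
```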

ETL Pipeline

Three separate ETL jobs: Synthea EHR → PostgreSQL normalised schema; FAERS CSV → MongoDB raw + aggregated; SIDER side-effect data → Neo4j drug-symptom edges. All containerised with Docker Compose.
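The FAERS leg of the pipeline can be sketched as a row-to-document transform. The column names (`primaryid`, `drugname`, `reaction`) are illustrative; the real FAERS export spreads reports across several linked tables:

```python
import csv
import io

def faers_rows_to_docs(csv_text):
    """Transform FAERS-style CSV rows into MongoDB-ready documents.

    Column names are assumptions for illustration.
    """
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        docs.append({
            "_id": row["primaryid"],
            "drug": row["drugname"].strip().lower(),  # normalise before storage
            "reaction": row["reaction"],
        })
    return docs

# pymongo would then bulk-load the documents:
#   db.faers.insert_many(faers_rows_to_docs(csv_text))
```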

Streamlit Clinical UI

Four-tab interface: Safety Check (main report), Patient Data, Drug Knowledge Graph, Evidence Audit. The safety report shows risk level (Low/Medium/High), interaction details, and alternative drug suggestions.

Traceability & Audit

Every safety report is logged to MongoDB with timestamps, source citations, and the specific evidence used - making every recommendation fully traceable and auditable for clinical accountability.
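An audit record of this shape could be assembled as below; the field names are illustrative, and `evidence` is assumed to be a list of (source, identifier) pairs citing exactly what the report used:

```python
from datetime import datetime, timezone

def build_audit_record(report, evidence):
    """Assemble the audit document logged alongside a safety report.

    Field names are assumptions for illustration.
    """
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "risk_level": report["risk_level"],
        # One citation per piece of evidence the recommendation relied on.
        "citations": [{"source": src, "id": ident} for src, ident in evidence],
    }

# Persisted with pymongo:
#   db.audit_log.insert_one(build_audit_record(report, evidence))
```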

Hard problems we solved

Challenge: FAERS data quality

The FDA adverse event dataset has severe inconsistencies - drug name variations, missing fields, duplicate reports. We built a normalisation layer that standardises drug names via RxNorm lookups before storage.
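The shape of that layer can be sketched with a local synonym table; in MediDB the mapping is resolved via RxNorm lookups rather than the hard-coded dictionary below:

```python
# Illustrative synonym table; the real system resolves names via RxNorm.
SYNONYMS = {"tylenol": "acetaminophen", "paracetamol": "acetaminophen"}

def normalise_drug_name(raw):
    """Map a raw FAERS drug string to a canonical lowercase name."""
    name = raw.strip().lower()
    return SYNONYMS.get(name, name)

def dedupe_reports(reports):
    """Drop duplicate reports after normalising the drug name."""
    seen, unique = set(), []
    for r in reports:
        key = (r["case_id"], normalise_drug_name(r["drug"]))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique
```

Normalising before deduplication matters: "Tylenol" and "paracetamol" filed under the same case would otherwise survive as two distinct reports.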

Challenge: Cross-DB result merging

Merging results from four different data models (relational, graph, vector, document) into one coherent report required careful schema design for the orchestration layer's output format.
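One way to pin down such an output contract is a typed report object plus a merge function. The class, field names, and the risk rule below are illustrative assumptions, not MediDB's actual scoring logic:

```python
from dataclasses import dataclass, field

@dataclass
class SafetyReport:
    """Illustrative output contract for the orchestration layer."""
    patient_id: str
    drug: str
    risk_level: str = "Low"
    interactions: list = field(default_factory=list)   # from Neo4j
    similar_cases: list = field(default_factory=list)  # from Qdrant
    evidence_ids: list = field(default_factory=list)   # from MongoDB

def merge_results(patient_id, drug, interactions, cases, faers_ids):
    # Simplified risk rule: any major interaction -> High;
    # any interaction or matching case -> Medium; otherwise Low.
    if any(i["severity"] == "major" for i in interactions):
        risk = "High"
    elif interactions or cases:
        risk = "Medium"
    else:
        risk = "Low"
    return SafetyReport(patient_id, drug, risk, interactions, cases, faers_ids)
```

Fixing the contract up front means each database adapter only has to map its native result shape into one agreed structure.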

Challenge: Embedding quality

Generic sentence embeddings performed poorly on clinical text. BioLORD-2023 (fine-tuned on biomedical literature) gave substantially better semantic similarity results for adverse event retrieval.

Challenge: Docker orchestration

Getting four databases + the Python app to initialise in the correct order with health checks and retry logic required careful Docker Compose dependency management.
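A Compose dependency pattern like the one described might look as follows; the service names, images, and health-check commands are illustrative, not the project's actual configuration:

```yaml
services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 10
  neo4j:
    image: neo4j:5
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:7474 || exit 1"]
      interval: 5s
      retries: 10
  app:
    build: .
    depends_on:
      postgres:
        condition: service_healthy
      neo4j:
        condition: service_healthy
```

The `condition: service_healthy` form makes Compose wait for each database's health check to pass before starting the app, instead of merely waiting for the containers to exist.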

Technologies Used

Component: Tool & Purpose
Relational DB: PostgreSQL - normalised EHR data with Synthea synthetic patient records
Graph DB: Neo4j + Cypher - drug interaction network, polypharmacy detection
Vector DB: Qdrant - semantic similarity search over BioLORD clinical embeddings
Document DB: MongoDB - FAERS adverse event reports + audit log storage
Embeddings: BioLORD-2023 (Sentence Transformers) - biomedical clinical text encoding
UI: Streamlit - interactive clinical decision-support interface
Infrastructure: Docker Compose - containerised multi-database deployment
Data Sources: Synthea, DrugBank, SIDER, openFDA FAERS, RxNorm

Results

  • 4 database paradigms integrated into one query
  • Full audit traceability for every safety report
  • 1 unified report replacing 4 separate tool queries
  • Eliminated the need for clinicians to manually cross-reference multiple systems
  • BioLORD embeddings substantially outperformed generic embeddings for adverse event retrieval
  • Full Docker Compose deployment - any institution can spin up the system in minutes
  • Audit log provides complete provenance for every recommendation made by the system

What I took away

  • Choosing the right database paradigm for each data type is more impactful than optimising within a single paradigm - graph traversal for interactions is far more natural than the equivalent SQL self-joins.
  • Domain-specific embeddings (BioLORD) dramatically outperform general-purpose ones for clinical text retrieval - generic models miss clinical synonyms and abbreviations.
  • Multi-database systems need an explicit orchestration contract defining exactly how results from each source are merged and weighted.
  • Docker Compose health checks and dependency ordering are non-trivial to get right when databases have complex initialisation sequences.