Why one database isn't enough for drug safety
When a clinician prescribes medication they need to simultaneously answer three different questions: Is this drug appropriate for this patient's history? (structured EHR data), Does it interact with their other medications? (graph relationships), and What did real patients experience? (adverse event narratives). No single database paradigm handles all three well.
Traditional systems force clinicians to query multiple disconnected tools and mentally synthesise the results - a cognitive burden that contributes to prescribing errors. MediDB collapses this into one query, one report.
Data sources: Synthea synthetic EHR (PostgreSQL) · DrugBank + SIDER drug interactions (Neo4j) · openFDA FAERS adverse events (MongoDB + Qdrant vectors) · BioLORD-2023 clinical embeddings for semantic search.
Four databases, one orchestration layer
Each database is chosen for its paradigm strength. The orchestration layer sends parallel queries, collects results, and merges them into a structured safety report with a risk level, interaction insights, and alternative drug suggestions.
# Multi-step retrieval pipeline def evaluate_drug_safety(patient_id, drug_name): # 1. Patient context from PostgreSQL patient = pg_client.get_patient_history(patient_id) current_meds = patient["medications"] # 2. Interaction graph from Neo4j interactions = neo4j_client.find_interactions(drug_name, current_meds) # 3. Similar adverse event cases from Qdrant embedding = bioLORD.encode(f"{drug_name} {patient['conditions']}") similar_cases = qdrant.search(embedding, top_k=10, filters={"severity": "serious"}) # 4. FAERS raw reports from MongoDB for audit trail faers_reports = mongo.get_reports(drug_name) return build_safety_report(patient, interactions, similar_cases, faers_reports)
Multi-DB Architecture Design
Designed the paradigm selection: PostgreSQL for relational patient queries, Neo4j for drug interaction traversal, Qdrant for semantic case retrieval, MongoDB for FAERS raw storage + audit logs.
Graph Modeling in Neo4j
Modelled DrugBank interaction data as a property graph - drugs as nodes, interactions as typed edges with severity weights. Cypher traversal detects polypharmacy clusters and direct drug-drug conflicts.
BioLORD Vector Search
Used BioLORD-2023 (a biomedical sentence transformer) to embed clinical narratives from FAERS. Qdrant's multi-filter search retrieves top-k real-world adverse event cases matching drug, severity, and patient demographics.
ETL Pipeline
Three separate ETL jobs: Synthea EHR → PostgreSQL normalised schema; FAERS CSV → MongoDB raw + aggregated; SIDER side-effect data → Neo4j drug-symptom edges. All containerised with Docker Compose.
Streamlit Clinical UI
Four-tab interface: Safety Check (main report), Patient Data, Drug Knowledge Graph, Evidence Audit. The safety report shows risk level (Low/Medium/High), interaction details, and alternative drug suggestions.
Traceability & Audit
Every safety report is logged to MongoDB with timestamps, source citations, and the specific evidence used - making every recommendation fully traceable and auditable for clinical accountability.
Hard problems we solved
FAERS data quality
The FDA adverse event dataset has severe inconsistencies - drug name variations, missing fields, duplicate reports. We built a normalisation layer that standardises drug names via RxNorm lookups before storage.
Cross-DB result merging
Merging results from four different data models (relational, graph, vector, document) into one coherent report required careful schema design for the orchestration layer's output format.
Embedding quality
Generic sentence embeddings performed poorly on clinical text. BioLORD-2023 (fine-tuned on biomedical literature) gave substantially better semantic similarity results for adverse event retrieval.
Docker orchestration
Getting four databases + the Python app to initialise in the correct order with health checks and retry logic required careful Docker Compose dependency management.
Technologies Used
| Component | Tool & Purpose |
|---|---|
| Relational DB | PostgreSQL - normalised EHR data with Synthea synthetic patient records |
| Graph DB | Neo4j + Cypher - drug interaction network, polypharmacy detection |
| Vector DB | Qdrant - semantic similarity search over BioLORD clinical embeddings |
| Document DB | MongoDB - FAERS adverse event reports + audit log storage |
| Embeddings | BioLORD-2023 (Sentence Transformers) - biomedical clinical text encoding |
| UI | Streamlit - interactive clinical decision-support interface |
| Infrastructure | Docker Compose - containerised multi-database deployment |
| Data Sources | Synthea · DrugBank · SIDER · openFDA FAERS · RxNorm |
Results
- Eliminated the need for clinicians to manually cross-reference multiple systems
- BioLORD embeddings substantially outperformed generic embeddings for adverse event retrieval
- Full Docker Compose deployment - any institution can spin up the system in minutes
- Audit log provides complete provenance for every recommendation made by the system
What I took away
- Choosing the right database paradigm for each data type is more impactful than optimising within a single paradigm - graph traversal for interactions is orders of magnitude more natural than SQL self-joins.
- Domain-specific embeddings (BioLORD) dramatically outperform general-purpose ones for clinical text retrieval - generic models miss clinical synonyms and abbreviations.
- Multi-database systems need an explicit orchestration contract defining exactly how results from each source are merged and weighted.
- Docker Compose health checks and dependency ordering are non-trivial to get right when databases have complex initialisation sequences.