
ADR-023: Infrastructure Standardization and Environment Management

Date: 2026-01-09
Status: Accepted
Deciders: Development Team
Related: ADR-004 (Microservices), ADR-013 (Monorepo), DEVELOPMENT_PROCESS.md

Context

During deployment of the simulation service to production (January 2026), we encountered several critical issues stemming from inconsistent infrastructure configuration across environments:

Current Situation

  1. Database Naming Inconsistencies:

    • Development: karmyq_db (database) with karmyq_user (owner)
    • Production: karmyq_prod (database) with karmyq_prod (owner)
    • Staging: Unknown/inconsistent
    • Different naming patterns make scripts, documentation, and automation brittle
  2. Path Differences:

    • Production: /home/ubuntu/karmyq
    • Staging: /home/karmyq/karmyq
    • No consistent pattern across environments
  3. Configuration Management Issues:

    • Environment-specific files (.env.production.users) deleted by git clean -fd
    • No clear separation between version-controlled config and environment secrets
    • Deployment process required manual recreation of credentials
  4. Docker Compose Variations:

    • Different compose files for each environment
    • No standardized approach to environment-specific overrides
    • Configuration drift between environments

Problems Encountered

During simulation service deployment:

  • Lost user credentials file after running git clean -fd (untracked files deleted)
  • Database connection strings hardcoded with environment-specific names
  • Confusion about which paths to use for different servers
  • Manual intervention required to recreate simulated users
  • No automated validation of environment parity

Requirements

  1. Predictability: Deployments should be repeatable and consistent
  2. Safety: Environment-specific secrets should survive git operations
  3. Clarity: Team should immediately understand environment differences
  4. Automation: Reduce manual steps in deployment process
  5. Documentation: Environment setup should be self-documenting

Decision

We will standardize infrastructure configuration across all environments using these principles:

1. Standardized Database Naming

Pattern: karmyq_{env} for database, karmyq_user for owner (consistent across all environments)

  • Development: karmyq_dev / karmyq_user
  • Staging: karmyq_staging / karmyq_user
  • Production: karmyq_production / karmyq_user

Rationale:

  • Single consistent user simplifies permission management
  • Environment suffix makes purpose explicit
  • Follows industry standard pattern (Heroku, AWS RDS, etc.)
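The convention can be captured in a small helper that emits the provisioning SQL for any environment. This is a sketch, not an existing repo script: `provision_sql` and the password placeholder are hypothetical, and the role is created once per server while the database is created once per environment.

```shell
#!/bin/bash
# Sketch: emit provisioning SQL for the karmyq_{env} / karmyq_user convention.
# Hypothetical helper (not a repo script). The :'karmyq_password' placeholder
# is a psql variable, substituted at run time so the password stays out of
# the script and shell history.
provision_sql() {
    local env="$1"
    cat <<EOF
-- Run once per server:
CREATE USER karmyq_user WITH PASSWORD :'karmyq_password';
-- Run once per environment:
CREATE DATABASE karmyq_${env} OWNER karmyq_user;
EOF
}

provision_sql staging
```

Piping the output through `psql -v karmyq_password=...` as a superuser performs the actual substitution.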

2. Standardized Directory Structure

Pattern: /opt/karmyq/{env} for all deployments

  • Development: Local development uses project root
  • Staging: /opt/karmyq/staging
  • Production: /opt/karmyq/production

Rationale:

  • /opt is standard location for optional application software
  • Clear separation of environments on same machine (if needed)
  • Predictable paths for scripts, logs, backups

3. Environment Configuration Management

Strategy: Three-tier configuration system

  1. Version-Controlled Base (.env.example, config.template.ts):

    • Structure and documentation of required variables
    • Safe to commit to git
    • Used for validation and initialization
  2. Environment-Specific Secrets (.env.{env}, .env.{env}.users):

    • Stored in /opt/karmyq/secrets/{env}/
    • Symlinked into application directory
    • NEVER in git working directory (prevents accidental deletion)
    • Backed up separately
  3. Runtime Configuration (loaded at startup):

    • Applications load from symlinked files
    • Fail fast with clear errors if configuration missing
    • Log configuration source (but never values)
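The fail-fast rule can be as simple as refusing to start when a required variable is absent. A shell sketch of the idea (the Node services would perform the equivalent check at process start; `require_env` is a hypothetical helper, not an existing function):

```shell
#!/bin/bash
# Sketch of a fail-fast startup check: report every missing variable with a
# clear error, rather than failing later mid-request. Hypothetical helper.
require_env() {
    local missing=0 name
    for name in "$@"; do
        if [[ -z "${!name:-}" ]]; then
            echo "ERROR: required variable ${name} is not set (is the .env symlink in place?)" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: abort unless the core connections are configured.
# require_env DATABASE_URL REDIS_URL || exit 1
```

Note that the error names the variable but never echoes its value, in line with the "log configuration source but never values" rule.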

Directory Structure:

/opt/karmyq/
├── production/
│   └── karmyq/           # Git repository
│       └── .env.production -> /opt/karmyq/secrets/production/.env
├── staging/
│   └── karmyq/           # Git repository
│       └── .env.staging -> /opt/karmyq/secrets/staging/.env
└── secrets/
    ├── production/
    │   ├── .env
    │   ├── .env.production.users
    │   └── backup/       # Automated backups
    └── staging/
        ├── .env
        ├── .env.staging.users
        └── backup/
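A sketch of what scripts/setup-secrets.sh could do to create this layout. The function form and the overridable base path are illustrative choices, not the committed script:

```shell
#!/bin/bash
# Sketch of scripts/setup-secrets.sh: create the secrets tree with restrictive
# permissions, seeding empty files only where none exist. Illustrative; pass
# a different base path to rehearse the layout outside /opt.
set -euo pipefail

setup_secrets() {
    local base="${1:-/opt/karmyq}" env file
    for env in production staging; do
        install -d -m 700 "${base}/secrets/${env}/backup"
        for file in ".env" ".env.${env}.users"; do
            # touch never clobbers an existing secret file.
            touch "${base}/secrets/${env}/${file}"
            chmod 600 "${base}/secrets/${env}/${file}"
        done
    done
}

# setup_secrets              # real layout under /opt/karmyq
# setup_secrets "$PWD/tmp"   # rehearsal in a scratch directory
```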

4. Deployment Process Standardization

Standardized Deployment Script (scripts/deploy.sh):

#!/bin/bash
# Usage: ./scripts/deploy.sh [staging|production]

set -euo pipefail

ENV="${1:-}"
BASE_DIR="/opt/karmyq/${ENV}"
SECRETS_DIR="/opt/karmyq/secrets/${ENV}"

# Validate environment
if [[ ! "$ENV" =~ ^(staging|production)$ ]]; then
    echo "Usage: $0 [staging|production]"
    exit 1
fi

# Pull latest code
cd "${BASE_DIR}/karmyq"
git pull origin master

# Install dependencies (npm ci removes node_modules and installs exactly from the lockfile)
npm ci

# Build services
npm run build

# Symlink secrets (idempotent)
ln -sf "${SECRETS_DIR}/.env" ".env.${ENV}"
ln -sf "${SECRETS_DIR}/.env.${ENV}.users" ".env.${ENV}.users"

# Restart services
pm2 restart ecosystem.config.js --only simulation-service

# Health check
sleep 5
pm2 logs simulation-service --lines 50 --nostream | grep -q "Simulation service started" || {
    echo "ERROR: Service failed to start"
    exit 1
}

echo "Deployment to ${ENV} completed successfully"

5. Environment Validation

Pre-deployment Checklist (automated in scripts/validate-env.sh):

  • Database name matches pattern karmyq_{env}
  • Database user is karmyq_user
  • Required secrets exist in /opt/karmyq/secrets/{env}/
  • Secrets are symlinked into application directory
  • Docker compose file matches environment
  • PM2 ecosystem config matches environment
  • All required environment variables present
  • Database connection successful
  • Redis connection successful
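Two of these checks, sketched as scripts/validate-env.sh might implement them (function names and the exact symlink layout are illustrative):

```shell
#!/bin/bash
# Sketch of two validate-env.sh checks. Function names are illustrative.

# Database name must follow the karmyq_{env} pattern exactly.
check_db_name() {
    local env="$1" db_name="$2"
    [[ "$db_name" == "karmyq_${env}" ]]
}

# The app-dir env file must be a symlink resolving into the secrets tree.
check_secrets_link() {
    local env="$1" link="$2"
    [[ -L "$link" && "$(readlink "$link")" == "/opt/karmyq/secrets/${env}/"* ]]
}
```

Each check returns nonzero on failure, so the validation script can run them all, collect failures, and print one consolidated report before deployment proceeds.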

6. Documentation Requirements

Each environment must have:

  1. Environment README (docs/environments/{env}/README.md):

    • Server details (hostname, IP, SSH key)
    • Database configuration
    • File locations
    • Access procedures
    • Backup procedures
  2. Runbook (docs/environments/{env}/RUNBOOK.md):

    • Deployment procedure
    • Rollback procedure
    • Common issues and resolutions
    • Monitoring and alerts
    • Emergency contacts

Consequences

Positive Consequences

  1. Predictable Deployments: Same commands work across all environments
  2. Reduced Errors: Standardization eliminates entire classes of environment-specific bugs
  3. Faster Onboarding: New team members can understand environment structure immediately
  4. Safer Operations: Secrets protected from git operations
  5. Better Automation: Consistent structure enables reliable CI/CD
  6. Clearer Debugging: Logs, paths, and configurations follow predictable patterns
  7. Easier Testing: Can validate environment parity programmatically

Negative Consequences

  1. Migration Effort: Need to migrate existing production/staging to new structure
  2. Breaking Changes: Scripts and documentation referencing old paths must be updated
  3. Downtime Required: Database rename and path migration require brief outage
  4. Learning Curve: Team must learn new conventions
  5. Tooling Updates: CI/CD pipelines, backup scripts, monitoring configs need updates

Neutral Consequences

  1. More Explicit Configuration: What was implicit is now explicit (clearer but more verbose)
  2. Shifted Responsibility: Operations become more formal (good for production, potentially overkill for dev)
  3. Different Mental Model: From "servers have different configs" to "environments are standardized instances"

Alternatives Considered

Alternative 1: Environment-Specific Branches

Description: Maintain separate git branches for each environment (dev, staging, production) with environment-specific configuration committed.

Pros:

  • Simple to understand
  • Configuration always in sync with code
  • No external dependencies

Cons:

  • Secrets in version control (security risk)
  • Merge conflicts on every deployment
  • Difficult to keep branches in sync
  • Can't easily promote exact code between environments

Why Rejected: Secrets in git is a non-starter for production systems. Merge overhead increases operational burden significantly.

Alternative 2: Configuration Service (Vault, Consul, etc.)

Description: Use external configuration management service like HashiCorp Vault or AWS Parameter Store to manage all environment configuration.

Pros:

  • Industry standard solution
  • Excellent secret management
  • Audit trails and access control
  • Dynamic secret rotation
  • Multi-environment support built-in

Cons:

  • Additional infrastructure to maintain
  • Increased complexity for small team
  • Network dependency for application startup
  • Learning curve for new tool
  • Cost for hosted solutions

Why Rejected: While technically superior, introduces operational overhead disproportionate to team size and current scale. Good future migration path as team/scale grows.

Alternative 3: Docker Secrets + Swarm

Description: Use Docker Swarm with built-in secrets management.

Pros:

  • Docker-native solution
  • Good secrets management
  • Integrated with orchestration
  • No additional tools

Cons:

  • Locks us into Docker Swarm (vs. other orchestrators)
  • Requires Swarm mode (we're using docker-compose)
  • More complex than current deployment
  • Overkill for single-node deployments

Why Rejected: We're not using container orchestration yet. When we scale to multiple nodes, we'll likely choose Kubernetes over Swarm, making this investment wasted.

Alternative 4: Status Quo + Documentation

Description: Keep current environment-specific configurations but document them thoroughly.

Pros:

  • No migration required
  • No breaking changes
  • Team already familiar with current approach

Cons:

  • Doesn't solve the root problems
  • Documentation becomes outdated
  • Requires constant vigilance
  • Error-prone manual processes remain

Why Rejected: Documentation alone doesn't prevent the errors we've experienced. The problems are structural and require structural solutions.

Implementation Notes

Migration Path

Phase 1: Staging Environment (Week 1)

  1. Create /opt/karmyq/secrets/staging/ directory structure
  2. Backup current .env files
  3. Create new database karmyq_staging alongside karmyq_prod
  4. Migrate data using pg_dump / pg_restore
  5. Update connection strings
  6. Test thoroughly
  7. Switch traffic to new database
  8. Decommission old database after 7 days
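Steps 3–5 above can be sketched as a command builder that prints the copy commands for operator review rather than running them. The source/target names, dump path, and flag choices are assumptions to adapt:

```shell
#!/bin/bash
# Sketch of the data-migration commands for Phase 1, steps 3-5. Printing
# rather than executing lets the operator review before piping to bash.
# Database names, dump location, and flags are illustrative assumptions.
migrate_db_cmds() {
    local src="$1" dst="$2"
    cat <<EOF
pg_dump --format=custom --no-owner --dbname=${src} --file=/tmp/${src}.dump
createdb --owner=karmyq_user ${dst}
pg_restore --no-owner --role=karmyq_user --dbname=${dst} /tmp/${src}.dump
EOF
}

migrate_db_cmds karmyq_prod karmyq_staging
```

`--no-owner` plus `--role=karmyq_user` re-homes all objects under the standardized owner regardless of who owned them in the source database.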

Phase 2: Production Environment (Week 2, after staging validation)

  1. Same steps as staging
  2. Require change approval
  3. Schedule during low-traffic window
  4. Have rollback plan ready
  5. Extended monitoring period

Phase 3: Development Environment (Week 3)

  1. Update docker-compose.yml with new database name
  2. Create migration script for developers
  3. Update documentation
  4. Team meeting to walk through changes

Phase 4: Automation & Validation (Week 4)

  1. Implement deployment scripts
  2. Implement validation scripts
  3. Update CI/CD pipelines
  4. Create environment runbooks
  5. Conduct deployment drill

Files Affected

Infrastructure:

  • infrastructure/docker/docker-compose.yml
  • infrastructure/docker/docker-compose.staging.yml
  • infrastructure/docker/docker-compose.production.yml
  • infrastructure/postgres/init.sql

Configuration:

  • All service .env.example files
  • All service database connection strings
  • PM2 ecosystem configs

Scripts:

  • scripts/deploy.sh (new)
  • scripts/validate-env.sh (new)
  • scripts/setup-secrets.sh (new)
  • Any existing deployment scripts

Documentation:

  • docs/environments/ (new directory)
  • docs/DEPLOYMENT.md
  • README.md (environment setup section)
  • Service READMEs (database connection info)

Rollback Strategy

Each migration step has a rollback procedure:

  1. Database Migration: Keep old database running alongside new for 7 days, can switch connection string back
  2. Path Migration: Old paths preserved until new paths validated (30 days)
  3. Secrets Migration: Backups automated before any changes
  4. Service Updates: Blue-green deployment pattern (old version running until new validated)
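For item 3, the automated backup can be a timestamped copy into the environment's backup/ directory. A sketch (`backup_secrets` is a hypothetical helper; a cron entry or a pre-step in deploy.sh would call it):

```shell
#!/bin/bash
# Sketch: copy an environment's secret files into a timestamped subdirectory
# of its backup/ dir before any migration touches them. Hypothetical helper.
set -euo pipefail

backup_secrets() {
    local dir="$1" stamp dest
    stamp="$(date +%Y%m%d-%H%M%S)"
    dest="${dir}/backup/${stamp}"
    install -d -m 700 "$dest"
    # -p preserves modes/timestamps so restored files keep 600 permissions.
    cp -p "${dir}"/.env* "$dest"/
}

# backup_secrets /opt/karmyq/secrets/production
```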

Testing Strategy

Before declaring migration complete:

  1. Functional Testing: All API endpoints return expected results
  2. Integration Testing: Run full integration test suite
  3. Load Testing: Confirm performance matches pre-migration baseline
  4. Failure Testing: Simulate failures (database down, missing secrets) and verify error handling
  5. Deployment Testing: Perform full deployment from scratch on test environment

Connection to Development Process

This ADR exemplifies our tangent management process, documenting lessons learned:

  1. Recognition: During deployment, we recognized infrastructure inconsistency as a tangent from the primary simulation service work
  2. Documentation: Instead of immediately fixing all environments, we documented the issue as an ADR (this document)
  3. Prioritization: Allows deliberate decision about when to address standardization vs. continuing feature work
  4. Communication: Makes the technical debt visible and provides implementation plan when we prioritize it

This ADR serves as an example of how we handle tangents: recognize them, document them thoroughly, and make conscious decisions about when to address them rather than getting pulled off course during feature development.