# ADR-023: Infrastructure Standardization and Environment Management

- **Date:** 2026-01-09
- **Status:** Accepted
- **Deciders:** Development Team
- **Related:** ADR-004 (Microservices), ADR-013 (Monorepo), DEVELOPMENT_PROCESS.md
## Context
During deployment of the simulation service to production (January 2026), we encountered several critical issues stemming from inconsistent infrastructure configuration across environments:
### Current Situation
- **Database Naming Inconsistencies:**
  - Development: `karmyq_db` (database) with `karmyq_user` (owner)
  - Production: `karmyq_prod` (database) with `karmyq_prod` (owner)
  - Staging: unknown/inconsistent
  - Different naming patterns make scripts, documentation, and automation brittle
- **Path Differences:**
  - Production: `/home/ubuntu/karmyq`
  - Staging: `/home/karmyq/karmyq`
  - No consistent pattern across environments
- **Configuration Management Issues:**
  - Environment-specific files (`.env.production.users`) deleted by `git clean -fd`
  - No clear separation between version-controlled config and environment secrets
  - Deployment process required manual recreation of credentials
- **Docker Compose Variations:**
  - Different compose files for each environment
  - No standardized approach to environment-specific overrides
  - Configuration drift between environments
### Problems Encountered
During simulation service deployment:
- Lost user credentials file after running `git clean -fd` (untracked files deleted)
- Database connection strings hardcoded with environment-specific names
- Confusion about which paths to use for different servers
- Manual intervention required to recreate simulated users
- No automated validation of environment parity
### Requirements
- Predictability: Deployments should be repeatable and consistent
- Safety: Environment-specific secrets should survive git operations
- Clarity: Team should immediately understand environment differences
- Automation: Reduce manual steps in deployment process
- Documentation: Environment setup should be self-documenting
## Decision
We will standardize infrastructure configuration across all environments using these principles:
### 1. Standardized Database Naming

Pattern: `karmyq_{env}` for the database, `karmyq_user` for the owner (consistent across all environments)
- Development: `karmyq_dev` / `karmyq_user`
- Staging: `karmyq_staging` / `karmyq_user`
- Production: `karmyq_production` / `karmyq_user`
Rationale:
- Single consistent user simplifies permission management
- Environment suffix makes purpose explicit
- Follows industry standard pattern (Heroku, AWS RDS, etc.)
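The convention is small enough to capture in a helper that provisioning scripts can share. This is a hedged sketch, not a committed script; the `psql` commands in the trailing comment assume local superuser access and are illustrative only.

```bash
#!/bin/bash
# Sketch: derive the standardized database name for an environment.
set -euo pipefail

DB_OWNER="karmyq_user"   # single owner across all environments

# Print karmyq_{env}, failing loudly on an unknown environment.
db_name_for() {
  case "$1" in
    dev|staging|production) echo "karmyq_$1" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Provisioning would then look like (illustrative, assumes superuser access):
#   psql -U postgres -c "CREATE ROLE ${DB_OWNER} LOGIN;"
#   psql -U postgres -c "CREATE DATABASE $(db_name_for staging) OWNER ${DB_OWNER};"
```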
### 2. Standardized Directory Structure

Pattern: `/opt/karmyq/{env}` for all deployments
- Development: local development uses the project root
- Staging: `/opt/karmyq/staging`
- Production: `/opt/karmyq/production`
Rationale:
- `/opt` is the standard location for optional application software
- Clear separation of environments on the same machine (if needed)
- Predictable paths for scripts, logs, backups
### 3. Environment Configuration Management

Strategy: three-tier configuration system
1. **Version-Controlled Base** (`.env.example`, `config.template.ts`):
   - Structure and documentation of required variables
   - Safe to commit to git
   - Used for validation and initialization

2. **Environment-Specific Secrets** (`.env.{env}`, `.env.{env}.users`):
   - Stored in `/opt/karmyq/secrets/{env}/`
   - Symlinked into the application directory
   - NEVER in the git working directory (prevents accidental deletion)
   - Backed up separately

3. **Runtime Configuration** (loaded at startup):
   - Applications load from the symlinked files
   - Fail fast with clear errors if configuration is missing
   - Log the configuration source (but never values)
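The fail-fast behaviour of the third tier can be sketched as a small loader. The variable names in the usage comment (`DATABASE_URL`, `REDIS_URL`) are illustrative assumptions, not the services' actual configuration keys.

```bash
#!/bin/bash
# Sketch: load an env file and fail fast on missing configuration.
set -euo pipefail

load_config() {
  local env_file="$1"; shift
  # Fail fast if the symlinked file is missing
  [[ -f "$env_file" ]] || { echo "FATAL: config file $env_file missing" >&2; return 1; }
  # Export every variable the file defines
  set -a; source "$env_file"; set +a
  # Fail fast if any required variable is unset or empty
  local var
  for var in "$@"; do
    [[ -n "${!var:-}" ]] || { echo "FATAL: required variable $var not set" >&2; return 1; }
  done
  # Log the configuration source, never the values
  echo "config loaded from $env_file"
}

# Usage: load_config .env.production DATABASE_URL REDIS_URL
```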
Directory Structure:

```
/opt/karmyq/
├── production/
│   └── karmyq/                    # Git repository
│       └── .env.production -> /opt/karmyq/secrets/production/.env
├── staging/
│   └── karmyq/                    # Git repository
│       └── .env.staging -> /opt/karmyq/secrets/staging/.env
└── secrets/
    ├── production/
    │   ├── .env
    │   ├── .env.production.users
    │   └── backup/                # Automated backups
    └── staging/
        ├── .env
        ├── .env.staging.users
        └── backup/
```
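A sketch of what `scripts/setup-secrets.sh` could look like for creating this layout. The function is parameterized on the root directory so it can be exercised outside `/opt`, and the `600` file mode is an assumption.

```bash
#!/bin/bash
# Sketch: create the secrets skeleton for one environment and symlink it
# into the checkout, per the directory layout above.
set -euo pipefail

setup_secrets() {
  local root="$1" env="$2"
  local secrets_dir="${root}/secrets/${env}"
  local app_dir="${root}/${env}/karmyq"

  mkdir -p "${secrets_dir}/backup" "${app_dir}"
  # touch never clobbers an existing secrets file
  touch "${secrets_dir}/.env" "${secrets_dir}/.env.${env}.users"
  chmod 600 "${secrets_dir}/.env" "${secrets_dir}/.env.${env}.users"
  # -sf makes the symlinking idempotent across repeated runs
  ln -sf "${secrets_dir}/.env" "${app_dir}/.env.${env}"
  ln -sf "${secrets_dir}/.env.${env}.users" "${app_dir}/.env.${env}.users"
}

# Usage: setup_secrets /opt/karmyq production
```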
### 4. Deployment Process Standardization

Standardized Deployment Script (`scripts/deploy.sh`):
```bash
#!/bin/bash
# Usage: ./scripts/deploy.sh [staging|production]
set -euo pipefail

ENV="${1:-}"

# Validate environment before using it to build paths
if [[ ! "$ENV" =~ ^(staging|production)$ ]]; then
  echo "Usage: $0 [staging|production]"
  exit 1
fi

BASE_DIR="/opt/karmyq/${ENV}"
SECRETS_DIR="/opt/karmyq/secrets/${ENV}"

# Pull latest code
cd "${BASE_DIR}/karmyq"
git pull origin master

# Clean, reproducible install from package-lock.json
npm ci

# Build services
npm run build

# Symlink secrets (idempotent)
ln -sf "${SECRETS_DIR}/.env" ".env.${ENV}"
ln -sf "${SECRETS_DIR}/.env.${ENV}.users" ".env.${ENV}.users"

# Restart services
pm2 restart ecosystem.config.js --only simulation-service

# Health check
sleep 5
pm2 logs simulation-service --lines 50 --nostream | grep -q "Simulation service started" || {
  echo "ERROR: Service failed to start"
  exit 1
}

echo "Deployment to ${ENV} completed successfully"
```
### 5. Environment Validation

Pre-deployment Checklist (automated in `scripts/validate-env.sh`):
- Database name matches the pattern `karmyq_{env}`
- Database user is `karmyq_user`
- Required secrets exist in `/opt/karmyq/secrets/{env}/`
- Secrets are symlinked into the application directory
- Docker Compose file matches the environment
- PM2 ecosystem config matches the environment
- All required environment variables present
- Database connection successful
- Redis connection successful
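A sketch of the file-system half of `scripts/validate-env.sh`. The connectivity probes appear only as comments because they need live services, and the root directory is parameterized for testing.

```bash
#!/bin/bash
# Sketch: validate that an environment matches the conventions above.
set -euo pipefail

fail() { echo "FAIL: $*" >&2; }

validate_env() {
  local root="$1" env="$2" db_name="$3"
  [[ "$env" =~ ^(staging|production)$ ]] || { fail "unknown environment: $env"; return 1; }
  # Database name matches karmyq_{env}
  [[ "$db_name" == "karmyq_${env}" ]] || { fail "db name $db_name, expected karmyq_${env}"; return 1; }
  # Required secrets exist
  [[ -f "${root}/secrets/${env}/.env" ]] || { fail "missing ${root}/secrets/${env}/.env"; return 1; }
  # Secrets are symlinked into the application directory
  [[ -L "${root}/${env}/karmyq/.env.${env}" ]] || { fail "secrets not symlinked into app dir"; return 1; }
  echo "environment ${env} OK"
  # Connectivity checks would follow (live services required), e.g.:
  #   psql "$DATABASE_URL" -c 'SELECT 1;' >/dev/null
  #   redis-cli -u "$REDIS_URL" ping >/dev/null
}
```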
### 6. Documentation Requirements
Each environment must have:
1. **Environment README** (`docs/environments/{env}/README.md`):
   - Server details (hostname, IP, SSH key)
   - Database configuration
   - File locations
   - Access procedures
   - Backup procedures

2. **Runbook** (`docs/environments/{env}/RUNBOOK.md`):
   - Deployment procedure
   - Rollback procedure
   - Common issues and resolutions
   - Monitoring and alerts
   - Emergency contacts
## Consequences

### Positive Consequences
- Predictable Deployments: Same commands work across all environments
- Reduced Errors: Standardization eliminates entire classes of environment-specific bugs
- Faster Onboarding: New team members can understand environment structure immediately
- Safer Operations: Secrets protected from git operations
- Better Automation: Consistent structure enables reliable CI/CD
- Clearer Debugging: Logs, paths, and configurations follow predictable patterns
- Easier Testing: Can validate environment parity programmatically
### Negative Consequences
- Migration Effort: Need to migrate existing production/staging to new structure
- Breaking Changes: Scripts and documentation referencing old paths must be updated
- Downtime Required: Database rename and path migration require brief outage
- Learning Curve: Team must learn new conventions
- Tooling Updates: CI/CD pipelines, backup scripts, monitoring configs need updates
### Neutral Consequences
- More Explicit Configuration: What was implicit is now explicit (clearer but more verbose)
- Shifted Responsibility: Operations become more formal (good for production, potentially overkill for dev)
- Different Mental Model: From "servers have different configs" to "environments are standardized instances"
## Alternatives Considered

### Alternative 1: Environment-Specific Branches
Description: Maintain separate git branches for each environment (dev, staging, production) with environment-specific configuration committed.
Pros:
- Simple to understand
- Configuration always in sync with code
- No external dependencies
Cons:
- Secrets in version control (security risk)
- Merge conflicts on every deployment
- Difficult to keep branches in sync
- Can't easily promote exact code between environments
Why Rejected: Secrets in git is a non-starter for production systems. Merge overhead increases operational burden significantly.
### Alternative 2: Configuration Service (Vault, Consul, etc.)
Description: Use external configuration management service like HashiCorp Vault or AWS Parameter Store to manage all environment configuration.
Pros:
- Industry standard solution
- Excellent secret management
- Audit trails and access control
- Dynamic secret rotation
- Multi-environment support built-in
Cons:
- Additional infrastructure to maintain
- Increased complexity for small team
- Network dependency for application startup
- Learning curve for new tool
- Cost for hosted solutions
Why Rejected: While technically superior, introduces operational overhead disproportionate to team size and current scale. Good future migration path as team/scale grows.
### Alternative 3: Docker Secrets + Swarm
Description: Use Docker Swarm with built-in secrets management.
Pros:
- Docker-native solution
- Good secrets management
- Integrated with orchestration
- No additional tools
Cons:
- Locks us into Docker Swarm (vs. other orchestrators)
- Requires Swarm mode (we're using docker-compose)
- More complex than current deployment
- Overkill for single-node deployments
Why Rejected: We're not using container orchestration yet. When we scale to multiple nodes, we'll likely choose Kubernetes over Swarm, making this investment wasted.
### Alternative 4: Status Quo + Documentation
Description: Keep current environment-specific configurations but document them thoroughly.
Pros:
- No migration required
- No breaking changes
- Team already familiar with current approach
Cons:
- Doesn't solve the root problems
- Documentation becomes outdated
- Requires constant vigilance
- Error-prone manual processes remain
Why Rejected: Documentation alone doesn't prevent the errors we've experienced. The problems are structural and require structural solutions.
## Implementation Notes

### Migration Path
**Phase 1: Staging Environment (Week 1)**

1. Create the `/opt/karmyq/secrets/staging/` directory structure
2. Back up the current `.env` files
3. Create the new database `karmyq_staging` alongside `karmyq_prod`
4. Migrate data using `pg_dump` / `pg_restore`
5. Update connection strings
6. Test thoroughly
7. Switch traffic to the new database
8. Decommission the old database after 7 days
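The data-migration step could be sketched like this. The role and database names follow the ADR, but the old database name and connection flags are placeholders for whatever the staging server actually runs, and nothing here drops the old database (decommissioning comes later).

```bash
#!/bin/bash
# Sketch: dump the old database and restore it into the standardized one.
set -euo pipefail

# Name the dump file after the source database and the current date
dump_file_for() { echo "${1}_$(date +%Y%m%d).dump"; }

migrate_db() {
  local old_db="$1" new_db="$2" dump
  dump="$(dump_file_for "$old_db")"
  # Custom-format dump: compressed and restorable in parallel
  pg_dump -U postgres -Fc "$old_db" -f "$dump"
  # --no-owner makes restored objects belong to the connecting role,
  # so everything ends up owned by karmyq_user
  pg_restore -U karmyq_user -d "$new_db" --no-owner "$dump"
}

# Usage (during the staging window): migrate_db karmyq_prod karmyq_staging
```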
**Phase 2: Production Environment (Week 2, after staging validation)**
- Same steps as staging
- Require change approval
- Schedule during low-traffic window
- Have rollback plan ready
- Extended monitoring period
**Phase 3: Development Environment (Week 3)**
- Update docker-compose.yml with new database name
- Create migration script for developers
- Update documentation
- Team meeting to walk through changes
**Phase 4: Automation & Validation (Week 4)**
- Implement deployment scripts
- Implement validation scripts
- Update CI/CD pipelines
- Create environment runbooks
- Conduct deployment drill
### Files Affected

**Infrastructure:**
- `infrastructure/docker/docker-compose.yml`
- `infrastructure/docker/docker-compose.staging.yml`
- `infrastructure/docker/docker-compose.production.yml`
- `infrastructure/postgres/init.sql`

**Configuration:**
- All service `.env.example` files
- All service database connection strings
- PM2 ecosystem configs

**Scripts:**
- `scripts/deploy.sh` (new)
- `scripts/validate-env.sh` (new)
- `scripts/setup-secrets.sh` (new)
- Any existing deployment scripts

**Documentation:**
- `docs/environments/` (new directory)
- `docs/DEPLOYMENT.md`
- `README.md` (environment setup section)
- Service READMEs (database connection info)
### Rollback Strategy
Each migration step has a rollback procedure:
- Database Migration: Keep old database running alongside new for 7 days, can switch connection string back
- Path Migration: Old paths preserved until new paths validated (30 days)
- Secrets Migration: Backups automated before any changes
- Service Updates: Blue-green deployment pattern (old version running until new validated)
### Testing Strategy
Before declaring migration complete:
- Functional Testing: All API endpoints return expected results
- Integration Testing: Run full integration test suite
- Load Testing: Confirm performance matches pre-migration baseline
- Failure Testing: Simulate failures (database down, missing secrets) and verify error handling
- Deployment Testing: Perform full deployment from scratch on test environment
## References
- Incident: Simulation service deployment January 2026 (lost credentials, database naming confusion)
- Related Discussions: User feedback: "We probably shouldn't call this db Karmyq_prod and the owner karmyq_prod. This feels like a bad pattern."
## Connection to Development Process

This ADR exemplifies our tangent-management process by documenting lessons learned:
- Recognition: During deployment, we recognized infrastructure inconsistency as a tangent from the primary simulation service work
- Documentation: Instead of immediately fixing all environments, we documented the issue as an ADR (this document)
- Prioritization: Allows deliberate decision about when to address standardization vs. continuing feature work
- Communication: Makes the technical debt visible and provides implementation plan when we prioritize it
This ADR serves as an example of how we handle tangents: recognize them, document them thoroughly, and make conscious decisions about when to address them rather than getting pulled off course during feature development.