# ADR-027: Docker Image Size Optimization (Deferred Technical Debt)
**Status:** Accepted | **Date:** 2026-01-12 | **Deciders:** Development Team | **Related:** ADR-026 (Self-Hosted Docker Registry)
## Context
During implementation of the self-hosted Docker registry (ADR-026), we discovered that our Docker images are significantly larger than optimal:
### Current Image Sizes
| Service | Size | Status |
|---|---|---|
| Frontend | 213MB | ✅ Optimized (multi-stage build) |
| Auth Service | 762MB | ❌ Unoptimized |
| Community Service | 867MB | ❌ Unoptimized (largest) |
| Request Service | 480MB | ❌ Unoptimized |
| Reputation Service | 480MB | ❌ Unoptimized |
| Notification Service | 480MB | ❌ Unoptimized |
| Messaging Service | 324MB | ❌ Unoptimized |
| Feed Service | 465MB | ❌ Unoptimized |
| Cleanup Service | 348MB | ❌ Unoptimized |
| Geocoding Service | 209MB | ❌ Unoptimized |
| Social Graph Service | 497MB | ❌ Unoptimized |
| Total | ~5GB | ❌ Could be ~1.7GB (66% reduction) |
### Root Causes of Large Images

1. **Build Tools Retained (281MB per service)**
   - Python3, make, g++ installed for bcrypt compilation
   - Not removed in the final image (no multi-stage build)

2. **Full node_modules (155MB per service)**
   - Includes devDependencies (TypeScript, testing tools, etc.)
   - Not pruned to production-only dependencies

3. **Source Files Included (~10MB per service)**
   - TypeScript source files (not needed after compilation)
   - Test files and documentation
   - Build artifacts

4. **No Multi-Stage Builds**
   - All intermediate layers remain in the final image
   - Dockerfile pattern: Install → Copy Everything → Run Dev
   - Missing production optimization
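To confirm which layers carry the weight, `docker history` breaks an image down per instruction. A quick sketch (the `auth-service:latest` tag is illustrative; substitute any locally built service image):

```shell
# Show the size each Dockerfile instruction contributed to an image.
# IMAGE is a placeholder tag; point it at whatever is built locally.
IMAGE="${IMAGE:-auth-service:latest}"
if command -v docker >/dev/null 2>&1; then
  docker history --format 'table {{.Size}}\t{{.CreatedBy}}' "$IMAGE" \
    || echo "image $IMAGE not present locally"
else
  echo "docker CLI not found; run on a host where the image exists"
fi
```

The `apk add` and `npm install` layers should dominate the listing if the root-cause analysis above is right.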
### Optimization Potential

With proper multi-stage builds:

```dockerfile
# Current (auth-service): 762MB
FROM node:18-alpine
RUN apk add --no-cache python3 make g++   # 281MB kept!
RUN npm install                           # 155MB with devDeps
COPY . .                                  # Everything copied
CMD ["npm", "run", "dev"]
```

```dockerfile
# Optimized: ~150MB (80% smaller)
FROM node:18-alpine AS builder
RUN apk add --no-cache python3 make g++
WORKDIR /app
COPY package*.json ./
RUN npm ci                 # full install; devDependencies are needed for the build
COPY tsconfig.json ./
COPY src ./src
RUN npm run build
RUN npm prune --omit=dev   # drop devDependencies before the runtime copy

FROM node:18-alpine AS runtime
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
CMD ["node", "dist/index.js"]
```
Potential savings:
- Per service: 480MB → 150MB (~330MB saved)
- Total: 5GB → 1.7GB (~3.3GB / 66% reduction)
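Independent of multi-stage builds, the "everything copied" problem shrinks further with a `.dockerignore` at each service root. A minimal sketch (entries are typical assumptions for these Node/TypeScript services; adjust per repo):

```
# .dockerignore - keep the build context small
node_modules
dist
.git
*.md
**/*.test.ts
coverage
.env*
```

This also speeds up builds by shrinking the context Docker has to send to the daemon.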
## Decision
We will defer Docker image optimization and document it as technical debt for future work.
### Why Defer?

1. **Self-Hosted Registry Solves the Immediate Problem**
   - No storage limits (vs the 500MB GHCR free tier)
   - 5GB of images fits comfortably on the production server
   - Cost is $0 regardless of image size

2. **Limited Impact on Workflow**
   - First push: 5GB takes ~15 min (1.7GB would take ~5 min)
   - Subsequent pushes: layer caching makes both fast
   - Pull times on production: the local network is fast either way
   - Build times: unchanged (image-size optimization doesn't speed up builds)

3. **Optimization Requires Significant Work**
   - 10 backend Dockerfiles need rewriting
   - Each service needs testing after the change
   - Estimated effort: 1-2 hours of Dockerfile work plus per-service testing
   - Risk of breaking production builds

4. **The Current System Works**
   - Images build successfully
   - Services run correctly
   - The deployment pipeline is functional
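The ~15 min vs ~5 min first-push estimates above both imply roughly the same effective upload rate, about 5.6 MB/s (an assumed figure; time a real push to calibrate). A quick arithmetic check:

```shell
# Sanity-check the first-push time estimates. The 5.6 MB/s sustained upload
# rate is an assumption, not a measured value.
total_mb=5000        # current images, ~5GB
optimized_mb=1700    # projected total, ~1.7GB
rate="5.6"           # assumed MB/s

awk -v t="$total_mb" -v o="$optimized_mb" -v r="$rate" 'BEGIN {
  printf "current: ~%.0f min, optimized: ~%.0f min\n", t/r/60, o/r/60
}'
# → current: ~15 min, optimized: ~5 min
```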
### When to Optimize?

Trigger optimization when any of these conditions occurs:

1. **Storage becomes a concern**
   - Registry disk usage exceeds 70%
   - Multiple versions accumulate (5GB × 10 versions = 50GB)

2. **Transfer speed becomes a pain point**
   - Slow deployments due to image size
   - Bandwidth costs increase
   - Team feedback about slow pushes/pulls

3. **Build times need improvement**
   - When combined with other build optimizations
   - As part of CI/CD pipeline improvements

4. **Professional polish is needed**
   - Before open-sourcing
   - For investor demos
   - During a production-readiness audit
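The 70% disk-usage trigger is easy to script. A sketch assuming the registry volume is mounted at `/var/lib/registry` (the path and threshold are assumptions; adjust to the actual deployment):

```shell
# Warn when the registry volume crosses the 70% usage threshold.
REGISTRY_DATA="${REGISTRY_DATA:-/var/lib/registry}"
THRESHOLD=70

# df reports usage for the filesystem backing the path; fall back to 0
# if the path does not exist on this host.
usage=$(df --output=pcent "$REGISTRY_DATA" 2>/dev/null | tail -n 1 | tr -dc '0-9')
usage="${usage:-0}"

if [ "$usage" -ge "$THRESHOLD" ]; then
  echo "WARN: registry volume at ${usage}% - time to revisit ADR-027"
else
  echo "OK: registry volume at ${usage}%"
fi
```

Dropped into cron or a monitoring check, this turns the trigger from a judgment call into an alert.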
## Consequences

### Positive
- ✅ Focus on feature development, not infrastructure optimization
- ✅ Self-hosted registry deployed faster (no optimization delays)
- ✅ No risk of breaking builds with Dockerfile changes
- ✅ Technical debt documented for future planning
### Negative
- ❌ Larger storage footprint (5GB vs 1.7GB per version)
- ❌ Slower first-time pushes/pulls (~15 mins vs ~5 mins)
- ❌ Less professional image sizes
- ❌ Technical debt accumulates if not addressed
### Neutral
- 📦 Storage cost: $0 either way (self-hosted)
- 📦 Subsequent operations: Fast with layer caching regardless
- 📦 Runtime performance: Unaffected by image size
## Implementation Plan (Future)
When optimization is triggered, follow this approach:
### Phase 1: Create Optimized Dockerfile Template

```dockerfile
# Multi-stage production Dockerfile template
FROM node:18-alpine AS builder
RUN apk add --no-cache python3 make g++
WORKDIR /app
COPY package*.json ./
RUN npm ci                 # full install; devDependencies are needed for the build
COPY tsconfig.json ./
COPY src ./src
RUN npm run build
RUN npm prune --omit=dev   # strip devDependencies before the runtime copy

FROM node:18-alpine AS runtime
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
ARG PORT=3000
EXPOSE ${PORT}
CMD ["node", "dist/index.js"]
```
### Phase 2: Optimize Services in Batches
- Start with smallest service (geocoding-service, 209MB)
- Test thoroughly before proceeding
- Roll out to remaining services
- Update documentation
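Phase 2 can be driven by a small loop that rebuilds one service at a time and stops at the first failure, which keeps a bad template change from propagating. A sketch (the registry name, tag, and `services/` layout are assumptions):

```shell
# Rebuild services one at a time with the new multi-stage Dockerfile,
# stopping at the first failure. Registry name and tag are placeholders.
# Glob order is alphabetical; reorder to start with geocoding-service
# per the plan above.
REGISTRY="${REGISTRY:-registry.local}"
TAG="${TAG:-optimized}"

for dir in services/*/; do
  [ -f "$dir/Dockerfile" ] || continue     # skip dirs without a Dockerfile
  svc=$(basename "$dir")
  echo "building $svc"
  if ! docker build -t "$REGISTRY/$svc:$TAG" "$dir"; then
    echo "FAILED: $svc - stopping rollout"
    break
  fi
done
```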
### Phase 3: Measure Impact
- Before/after size comparison
- Push/pull speed comparison
- Storage savings calculation
- Document in ADR amendment
## Alternatives Considered

### Alternative 1: Optimize Immediately

Rejected: delays registry deployment, adds risk, limited immediate benefit.

### Alternative 2: Optimize Only Problem Services

Rejected: inconsistent Dockerfiles; technical debt remains for most services.

### Alternative 3: Never Optimize

Rejected: technical debt would accumulate indefinitely, and professional polish would remain lacking.
## References

- Self-Hosted Registry Documentation
- ADR-026: Self-Hosted Docker Registry
- Docker Multi-Stage Builds
- Frontend Dockerfile - example of an optimized build
- Current backend Dockerfiles: `services/*/Dockerfile` (all need optimization)
## Metrics to Track
When optimization is implemented, measure:
- Image size reduction (target: 66%)
- Push time improvement (target: 50%)
- Pull time improvement (target: 50%)
- Storage savings over time (GB saved × # of versions)
## Notes
- Frontend already uses multi-stage builds (213MB, well optimized)
- Backend services all use same unoptimized pattern
- This is intentional technical debt, not oversight
- Decision may be revisited if conditions change