Optimizing Docker on ARM64: From Emulation Hell to Native Performance

The Discovery

After migrating from DigitalOcean to Oracle Cloud Infrastructure (OCI), I was excited about the powerful Ampere ARM64 processors. Everything seemed to work perfectly—my portfolio loaded fast, containers were healthy, and monitoring showed no issues. But then I decided to dig deeper into performance metrics, and what I found surprised me.

My Docker images were built for AMD64, running on ARM64 hardware through QEMU emulation.

While QEMU is impressive technology that makes cross-architecture execution possible, it is essentially a translator that converts x86 instructions to ARM instructions at run time. This works, but at a cost. This post documents my move to native ARM64 builds and the performance improvements I measured.

The OCI migration itself is covered in a separate post.

How Did This Happen?

During my migration from DigitalOcean (AMD64) to Oracle Cloud (ARM64), I focused on getting everything working quickly. My docker-compose.yml had this line:

services:
  portfolio:
    platform: linux/amd64  # ← Forcing AMD64 on ARM64 = QEMU emulation
    image: jayakrishnakonda/portfolio:latest

This configuration was intentional on DigitalOcean—the GitHub Actions runner builds AMD64 images by default. When I moved to OCI's ARM64 Ampere instances, I kept this config because "it worked." Docker pulled the AMD64 image and transparently used QEMU to make it run.

The problem? Performance was being left on the table.

Understanding the Infrastructure

OCI Migration Context

When I migrated to Oracle Cloud, I faced several challenges:

Architecture mismatch: DigitalOcean used AMD64, OCI offered ARM64 Ampere
Image compatibility: My GitHub Actions workflow only built AMD64 images
Time pressure: I wanted everything up quickly during migration
"It just works" syndrome: QEMU made everything appear fine

I documented the full migration in a separate post, but one thing I didn't address immediately was architecture optimization.

Disk Performance Baseline

Before tackling the CPU architecture, I benchmarked my storage:

Boot Drive (/dev/sda1):

Write: 54.3 MB/s
Read: 26.8 MB/s

Attached Block Storage (/dev/sdb - Docker location):

Write: 74.3 MB/s (37% faster)
Read: 37.4 MB/s (40% faster)

Good news: Docker runs on faster block storage, so I/O wasn't the bottleneck. The real issue was CPU architecture.

The Problem: QEMU Emulation Overhead

Why QEMU emulation matters:

Running AMD64 containers on ARM64 requires real-time instruction translation:

CPU overhead from x86 → ARM translation
Slower container startup times
Increased memory usage
Potential compatibility issues

Verifying the Problem

# System architecture
$ uname -m
aarch64  # ← ARM64 system

# Docker image architecture
$ docker inspect jayakrishnakonda/portfolio:latest | grep Architecture
"Architecture": "amd64"  # ← AMD64 image!

# Container runtime
$ docker exec portfolio uname -m
aarch64  # ← Running through QEMU translation

Everything worked, but I was paying a performance tax on every request.
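The comparison above can be scripted. The following is a minimal sketch, not the tooling used for this post: the helper names `normalize_arch` and `needs_emulation` are made up for illustration, and the image architecture is passed in as a string (for example the value that `docker inspect` reports) rather than queried from Docker.

```python
import platform

# Map kernel-style machine names (as printed by `uname -m`) to Docker's names.
_ARCH_ALIASES = {
    "aarch64": "arm64",
    "arm64": "arm64",
    "x86_64": "amd64",
    "amd64": "amd64",
}


def normalize_arch(name: str) -> str:
    """Return Docker's architecture name for a machine string, lowercased."""
    key = name.strip().lower()
    return _ARCH_ALIASES.get(key, key)


def needs_emulation(host_machine: str, image_arch: str) -> bool:
    """True when the image architecture differs from the host's."""
    return normalize_arch(host_machine) != normalize_arch(image_arch)


if __name__ == "__main__":
    host = platform.machine()  # e.g. "aarch64" on an Ampere instance
    # "amd64" stands in for the value reported by `docker inspect`.
    print(host, "-> emulation needed:", needs_emulation(host, "amd64"))
```

Normalizing both names first matters because the kernel says `aarch64` while Docker says `arm64` for the same architecture.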

Benchmarking: The Baseline

Before making changes, I collected baseline metrics with AMD64+QEMU:

Container Performance (AMD64 + QEMU)

| Metric | Value |
|--------|-------|
| Container startup | 0.236s |
| Cold start (first request) | 278ms |
| 🔥 Warm response time | ~10ms (9.97ms avg) |
| Memory usage (idle) | 82.5 MiB |
| CPU usage (idle) | 0.00% |
| ⏱️ CI/CD build time | ~4 minutes |

Not terrible numbers! But I suspected native ARM64 could do better.
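For readers who want to collect similar numbers, here is one way cold and warm response times could be separated. This is a sketch under assumptions, not the method used for the figures above: `time_requests` is a hypothetical helper, and it treats the first call as the cold start and averages the rest as warm responses.

```python
import time
from statistics import mean
from typing import Callable


def time_requests(fetch: Callable[[], object], count: int = 20) -> dict:
    """Call `fetch` `count` times; report the first call separately from the rest."""
    if count < 2:
        raise ValueError("need at least two requests to separate cold from warm")
    samples_ms = []
    for _ in range(count):
        start = time.perf_counter()
        fetch()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "cold_ms": samples_ms[0],
        "warm_avg_ms": mean(samples_ms[1:]),
        "samples": len(samples_ms),
    }


if __name__ == "__main__":
    # Placeholder workload. Swap in a real HTTP call against your own
    # container (the URL and port are yours to fill in).
    print(time_requests(lambda: sum(range(10_000)), count=5))
```

Passing the request as a callable keeps the timing logic independent of whichever HTTP client you prefer.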

Solution Options

I had three paths forward:

Option 1: Native ARM64 Build

Pros:

Simplest implementation
Fastest builds (no multi-platform overhead)
Best performance for ARM64
Smaller registry footprint

Cons:

Only works on ARM64
Can't deploy to AMD64 without rebuilding
Less flexible for multi-cloud

Best for: Single-architecture deployments

Option 2: Multi-Architecture Build (My Choice)

Pros:

Supports both AMD64 and ARM64
Docker auto-selects correct architecture
Future-proof and portable
Industry standard approach
Works across cloud providers

Cons:

Longer build times (5-10x)
More complex CI/CD setup
Requires Docker Buildx

Best for: Production apps that may deploy anywhere

Option 3: Self-Hosted ARM64 Runner

Pros:

Zero emulation during build
Uses local VPS resources
Complete control
Fastest native builds

Cons:

Must manage runner infrastructure
VPS must be online during builds
Security considerations
More maintenance overhead

Best for: Dedicated build infrastructure

I chose Option 2 for maximum flexibility—build once, deploy anywhere.

Implementation: Multi-Architecture Builds

Step 1: Verify Dockerfile Compatibility

First, I verified all base images and dependencies support ARM64:

# Base image: node:18-alpine is published for both amd64 and arm64
FROM node:18-alpine

# System dependencies: vips-dev, build-base, python3, and git are all
# available for ARM64 in Alpine's repositories
RUN apk add --no-cache \
    vips-dev \
    build-base \
    python3 \
    git

Compatibility verification:

Node.js 18: Full ARM64 support
Next.js 14: Full ARM64 support
sharp (native module): Builds for ARM64 automatically
Alpine packages listed above: Available for ARM64

Result: No Dockerfile changes needed!
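The base image check can also be automated. `docker manifest inspect <image>` prints a JSON manifest list for multi-arch images, and the sketch below reads the platforms out of such a document. The function names are hypothetical and the sample JSON is hand-written for illustration, not real registry output.

```python
import json


def supported_platforms(manifest_json: str) -> set:
    """Return the "os/architecture" pairs listed in a manifest list document."""
    doc = json.loads(manifest_json)
    platforms = set()
    for entry in doc.get("manifests", []):
        plat = entry.get("platform", {})
        os_name, arch = plat.get("os"), plat.get("architecture")
        # Skip entries that report "unknown" (some build attestations do).
        if os_name and arch and "unknown" not in (os_name, arch):
            platforms.add(f"{os_name}/{arch}")
    return platforms


def missing_platforms(manifest_json: str, required: list) -> list:
    """Return the required platforms that the manifest list does not offer."""
    available = supported_platforms(manifest_json)
    return [p for p in required if p not in available]


# Hand-written sample for illustration only, not real registry output.
SAMPLE = json.dumps({
    "manifests": [
        {"platform": {"os": "linux", "architecture": "amd64"}},
        {"platform": {"os": "linux", "architecture": "arm64", "variant": "v8"}},
    ]
})

if __name__ == "__main__":
    print(missing_platforms(SAMPLE, ["linux/amd64", "linux/arm64"]))  # []
```

A single-arch image has no `manifests` key in its manifest, so this sketch returns an empty set for it. That is a useful signal in itself.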

Step 2: Update GitHub Actions Workflow

Modified .github/workflows/build-and-deploy.yml:

- name: 🚀 Build and push Docker image (Multi-Architecture)
  uses: docker/build-push-action@v5
  with:
    context: .
    file: ./Dockerfile
    push: true
    platforms: linux/amd64,linux/arm64  # ← Multi-arch magic
    tags: ${{ steps.meta.outputs.tags }}
    cache-from: type=gha  # ← Speed up subsequent builds
    cache-to: type=gha,mode=max
    build-args: |
      NODE_OPTIONS=--max-old-space-size=4096
      NEXT_PUBLIC_GA_MEASUREMENT_ID=${{ vars.NEXT_PUBLIC_GA_MEASUREMENT_ID }}
    provenance: false
  timeout-minutes: 40  # ← Increased for multi-arch

Key changes:

Added platforms: linux/amd64,linux/arm64
Added GitHub Actions cache for faster builds
Increased timeout from 25 to 40 minutes

Step 3: Remove Platform Override

Modified docker-compose.yml:

services:
  portfolio:
    # Removed: platform: linux/amd64
    # Docker now auto-selects native ARM64 image
    image: jayakrishnakonda/portfolio:latest

That's it! Just remove the platform override.
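To stop a forgotten override from creeping back in, a small check could flag any `platform:` key left in a compose file. This is a rough line-based scan rather than a real YAML parse, and `find_platform_pins` is a name invented for this sketch.

```python
import re

# Matches lines such as "    platform: linux/amd64  # comment".
_PLATFORM_LINE = re.compile(r"^\s*platform:\s*([^\s#]+)")


def find_platform_pins(compose_text: str) -> list:
    """Return (line_number, platform) for each uncommented `platform:` key."""
    pins = []
    for number, line in enumerate(compose_text.splitlines(), start=1):
        match = _PLATFORM_LINE.match(line)
        if match:
            pins.append((number, match.group(1).strip("\"'")))
    return pins


BEFORE = """services:
  portfolio:
    platform: linux/amd64
    image: jayakrishnakonda/portfolio:latest
"""

AFTER = """services:
  portfolio:
    # Removed: platform: linux/amd64
    image: jayakrishnakonda/portfolio:latest
"""

if __name__ == "__main__":
    print(find_platform_pins(BEFORE))  # [(3, 'linux/amd64')]
    print(find_platform_pins(AFTER))   # []
```

Commented-out lines are ignored because the pattern only matches when `platform:` is the first non-whitespace text on the line.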

Step 4: The Build Challenge

First build attempt? Failed. The build timed out after 25 minutes—it was still running (TypeScript type checking) when the timeout hit.

The fix was to increase the timeouts:

Job timeout: 30 → 45 minutes
Build step: 25 → 40 minutes

Second attempt? Success! Build completed in ~38 minutes.

Results: The Performance Transformation

Native ARM64 Performance

After deploying the multi-arch image, I re-ran all benchmarks:

| Metric | AMD64 + QEMU | Native ARM64 | Improvement |
|--------|--------------|--------------|-------------|
| Container Startup | 0.236s | 0.183s | 22% faster ⚡ |
| Cold Start | 278ms | 5.6ms | 98% faster |
| 🔥 Warm Response | 9.97ms | 6.6ms | 34% faster ⚡ |
| Memory (Idle) | 82.5 MiB | 40.45 MiB | 51% less |
| CPU (Idle) | 0.00% | 0.00% | Same |
| Image Size | 527 MB | 527 MB | Same |
| ⏱️ Build Time | ~4 min | ~38 min | 9.5x longer |
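The percentages in the table follow from the raw measurements. As a quick check, the sketch below recomputes them. The helper names are illustrative, and "faster" is read here as the reduction in elapsed time relative to the AMD64 + QEMU value.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100.0


def slowdown_factor(before: float, after: float) -> float:
    """How many times larger `after` is than `before`."""
    return after / before


# (metric, AMD64 + QEMU, native ARM64) taken from the table above.
MEASUREMENTS = [
    ("Container startup (s)", 0.236, 0.183),
    ("Cold start (ms)", 278.0, 5.6),
    ("Warm response (ms)", 9.97, 6.6),
    ("Memory, idle (MiB)", 82.5, 40.45),
]

if __name__ == "__main__":
    for name, before, after in MEASUREMENTS:
        print(f"{name}: {percent_reduction(before, after):.1f}% lower")
    print(f"Build time: {slowdown_factor(4, 38):.1f}x longer")
```

Run as a script, this prints 22.5%, 98.0%, 33.8%, and 51.0% for the four metrics and 9.5x for the build time, matching the rounded figures in the table.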

The Dramatic Improvements

98% Faster Cold Starts

The most shocking improvement: cold start dropped from 278ms to 5.6ms. This is a game-changer for user experience—the first page load is now nearly instantaneous.

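The most likely explanation is QEMU's translation overhead: emulated code has to be translated the first time it runs, and the first request pays that cost. Native ARM64 code skips the translation step entirely.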

34% Faster Response Times

Even warm requests improved significantly: 9.97ms → 6.6ms. While both are fast, this compounds across thousands of requests daily.

51% Memory Reduction

Memory usage dropped from 82.5 MiB to 40.45 MiB—over 50% reduction. This means better resource utilization and the ability to run more containers on the same hardware.

22% Faster Container Startup

Container startup improved from 0.236s to 0.183s. This matters for deployments, scaling events, and recovery from failures.

⏱️ Trade-off: Longer Build Times

The only downside: CI/CD builds increased from 4 minutes to 38 minutes. However, this is a one-time cost per deployment, while runtime performance benefits every single request.

Deployments happen occasionally. The app runs 24/7. The math clearly favors native builds.

Lessons Learned

1. QEMU Works, But Native is Better

QEMU is impressive—my site worked fine with it. Most users probably wouldn't notice. But "works fine" isn't the same as "works optimally."

The benchmarks don't lie:

98% faster cold starts = better user experience
51% less memory = better resource utilization
34% faster responses = compounds over millions of requests

2. The Cold Start Improvement is Surprising

I expected improvements, but 278ms → 5.6ms (a 98% reduction) was stunning. This suggests QEMU pays a significant one-time cost translating code the first time it runs. Once the translated code is cached (warm requests), the gap narrows to 34%, but that first impression matters.

3. Multi-Arch is Worth the Complexity

While single-arch builds are simpler, multi-arch provides:

Testing on local AMD64 development machines
Deploying to different cloud providers
Future infrastructure flexibility

The 9.5x longer build time is a small price to pay.

4. Build Time Trade-off is Acceptable

Yes, builds take 9.5x longer (38 min vs 4 min). But consider:

Deployments: Few times per week
Runtime: 24/7, millions of requests
Performance: Every request is 34% faster

If you deploy 10 times per week, that's 340 extra minutes of build time. But your app serves millions of faster requests. The math favors native builds.
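The arithmetic behind that claim can be made explicit. The sketch below uses the numbers from this post. The function names are made up for illustration, and comparing CI minutes with request latency is only a rough framing, since the two are paid by different parties.

```python
def extra_build_minutes(deploys_per_week: int, single_arch_min: float,
                        multi_arch_min: float) -> float:
    """Extra CI minutes per week spent on multi-arch builds."""
    return deploys_per_week * (multi_arch_min - single_arch_min)


def break_even_requests(extra_minutes: float, before_ms: float,
                        after_ms: float) -> float:
    """Requests needed for the per-request saving to add up to `extra_minutes`."""
    saved_ms = before_ms - after_ms
    if saved_ms <= 0:
        raise ValueError("native responses must be faster than emulated ones")
    return extra_minutes * 60_000 / saved_ms


if __name__ == "__main__":
    extra = extra_build_minutes(10, 4, 38)
    print(f"Extra build time: {extra:.0f} min/week")
    needed = break_even_requests(extra, 9.97, 6.6)
    print(f"Break-even: about {needed / 1e6:.1f} million requests/week")
```

With 10 deploys per week the extra build time is 340 minutes, and the warm-response saving of about 3.4ms per request adds up to the same amount of time after roughly 6 million requests.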

5. GitHub Actions Cache is Essential

Multi-platform builds take longer. Using cache-from: type=gha reduces subsequent build times significantly. Without caching, builds would be even slower.

6. Always Verify Base Image Support

Check Docker Hub for multi-architecture support before choosing base images. Official images (like node:18-alpine) usually support both, but always verify.

7. Don't Skip the Benchmarks

I almost didn't benchmark because "everything worked." The 98% cold start improvement would have remained hidden. Always measure!

Best Practices for ARM64 Deployments

1. Use Multi-Architecture Builds by Default

platforms: linux/amd64,linux/arm64

2. Never Hardcode Platform in Docker Compose

# ❌ Don't do this (unless you have a specific reason)
platform: linux/amd64

# ✅ Let Docker choose the right architecture
image: myapp:latest

3. Increase Your Timeouts

Multi-arch builds can take 5-10x longer (mine went from ~4 to ~38 minutes):

timeout-minutes: 40  # Not 25!

4. Use Build Caching Aggressively

cache-from: type=gha
cache-to: type=gha,mode=max

5. Test on Target Architecture

Even with multi-arch builds, test on actual deployment architecture before production.

6. Document Architecture Decisions

Add comments explaining why certain architectures are chosen:

# Multi-arch image: Docker auto-selects ARM64 on this VPS
image: portfolio:latest

Cost Implications

Oracle Cloud Infrastructure (Ampere A1)

Free Tier: 4 ARM cores, 24GB RAM
Performance: Excellent for ARM64 workloads
Cost: $0/month (within free tier)

GitHub Actions

Multi-arch builds: ~9.5x build time vs single-arch in my case (~4 min → ~38 min)
Free Tier: 2,000 minutes/month for private repos
Cost: $0/month for public repos

Performance vs Cost Trade-off

Native ARM64 builds provide dramatically better performance at zero additional cost when using OCI's free tier and GitHub Actions for public repos.

Conclusion

Migrating to native ARM64 Docker builds was straightforward and delivered game-changing performance improvements:

98% faster cold starts (278ms → 5.6ms)
34% faster warm responses (9.97ms → 6.6ms)
51% less memory (82.5 MiB → 40.45 MiB)
22% faster container startup (0.236s → 0.183s)
Eliminated QEMU emulation overhead
Future-proof multi-architecture support

Key Takeaways

Check your architecture - Don't assume your images match your hardware. Run docker inspect to verify.
Multi-arch is the way - Build once, deploy anywhere. The flexibility is worth the longer build times.
Benchmarking matters - I measured 98% improvement in cold starts—you might see similar gains.
It's easier than you think - Docker Buildx makes multi-arch simple. The changes fit in a single commit.
Performance compounds - A 34% improvement per request adds up to massive savings at scale.

Recommendations

If you're running containers on ARM64 infrastructure (AWS Graviton, Oracle Ampere, Apple Silicon, Raspberry Pi, etc.), verify your images are built for the correct architecture. The performance gains are worth the minimal effort:

For new projects: Start with multi-arch from day one
For existing projects: Add platforms: linux/amd64,linux/arm64 to your build
Increase your timeouts: Multi-arch builds take 5-10x longer
Use build caching: GitHub Actions cache dramatically speeds up subsequent builds

The cold start improvement alone (278ms → 5.6ms) justifies the migration. When you factor in the memory savings and response time improvements, it's a no-brainer for production ARM64 deployments.

Resources

Running ARM64 infrastructure? Have questions about multi-arch builds? Feel free to reach out via email or check out my other infrastructure articles.
