LLM gotchas with Docker and AWS

AI is better at coding than DevOps

Jul 25, 2025

When using LLMs to create and deploy Docker-backed apps on AWS, I’ve run into various problems that gobbled up hours of my life. It turns out that LLMs and agents aren’t yet great at DevOps, and some AWS services don’t log enough information for targeted troubleshooting. Here are three examples.

Wrong chip architecture

DevOps pros know to be cognizant of the chip architecture they’re building for. They also know that a bare docker build command builds an image for the host computer’s architecture by default; on an ARM-based machine such as an Apple Silicon Mac, that means linux/arm64. But this isn’t obvious to coders, and even if you know it, it’s easy to forget when an LLM writes your code.

I got bit by this one recently when deploying GOOGLISH to AWS App Runner. The system log would just hang on the health check.

07-20-2025 02:56:37 PM [AppRunner] Performing health check on protocol `HTTP` [Path: '/health'], [Port: '3000'].

I spent hours troubleshooting my health check endpoint only to throw up my hands and try deploying on a different AWS service, ECS. As soon as I did that, I saw this error:

"StopReason": "CannotPullContainerError: pull image manifest has been retried 7 time(s): image Manifest does not contain descriptor matching platform 'linux/amd64'"

ARRRGGGGGG!

To fix this, change the build command. Instead of this:

docker build -t your-image-name .

Use this:

docker buildx build --platform linux/amd64 -t your-image-name .

Why ECS logs showed the exact error but App Runner logs didn’t is beyond me.
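One way to catch this kind of mismatch before deploying is to compare the platform your machine builds by default with the one the service expects. The sketch below is illustrative only: TARGET_PLATFORM and both function names are hypothetical, and linux/amd64 is taken from the ECS error message.

```shell
#!/bin/sh
# Sketch: warn when a bare `docker build` on this host would not produce
# the platform the deployment target expects.
# TARGET_PLATFORM and both function names are hypothetical.

TARGET_PLATFORM="linux/amd64"   # taken from the ECS error message

# Translate `uname -m` output into Docker's platform naming.
platform_for_machine() {
  case "$1" in
    x86_64|amd64)  echo "linux/amd64" ;;
    arm64|aarch64) echo "linux/arm64" ;;
    *)             echo "unknown" ;;
  esac
}

# Print the OS/architecture recorded in a locally built image.
# Requires Docker; defined here but only called by hand after a build.
image_platform() {
  docker image inspect --format '{{.Os}}/{{.Architecture}}' "$1"
}

host_platform="$(platform_for_machine "$(uname -m)")"
if [ "$host_platform" != "$TARGET_PLATFORM" ]; then
  echo "Default build here targets $host_platform; pass --platform $TARGET_PLATFORM"
fi
```

After a build, running image_platform your-image-name should print linux/amd64 if the --platform flag took effect.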

Missing cURL prevents health checks from running

This one was a total nightmare. I worked with Claude for hours trying to figure out why the health checks wouldn’t run. It was only by dumb luck that I happened upon the right phrasing of the problem for Claude to wake up and say, “Hmmm... maybe you need to install cURL manually on the container.”

This happened to me when working on The Boston Wrongs, a full-stack AI app with containers running on ECS via Fargate.

The root cause of the problem was in the first line of the Dockerfile that Claude wrote for my app:

FROM python:3.12-slim

I never gave it a second thought, but it turns out that Python “slim” doesn’t mean a slim version of Python. It’s a slim version of the OS, Debian, which leaves out a bunch of packages, including cURL.

The fix was to add a line to the Dockerfile to install cURL:

# Install curl for ECS health checks
RUN apt-get update && apt-get install -y curl
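Because ECS runs the health check command inside the container, it can help to confirm that whatever the command calls actually exists in the image. A minimal sketch, assuming a POSIX shell is available in the image; require_tool is a hypothetical helper name and your-image-name is a placeholder.

```shell
# Sketch: fail loudly when a tool the health check depends on is missing.
# `require_tool` is a hypothetical helper, not part of Docker or ECS.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# The same check against a built image (placeholder image name), run by hand:
#   docker run --rm --entrypoint sh your-image-name -c 'command -v curl'
```

The docker run line exits non-zero when curl is absent, so it can double as a pre-deploy check in a build script.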

Missing permissions between AWS services

A typical webapp deployed on AWS will include several distributed components: a backend service to run the container, a frontend service to host the app, Secrets Manager to store API keys, a database, etc.

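When LLMs write CDK code, they sometimes forget to include all necessary permissions for the services to communicate with each other. This can be tricky because there are container-level permissions and service-level permissions. On ECS, for example, the task role covers what your code inside the container can call, while the task execution role covers what ECS itself needs in order to pull the image and inject secrets. Thankfully, these errors usually appear in CloudWatch, but spelunking through CloudWatch can be a real time sink.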
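Before digging through logs, it is sometimes quicker to ask IAM directly whether a role is allowed to do what the app needs. The sketch below wraps the AWS CLI's iam simulate-principal-policy command. The helper name can_role_do is hypothetical, the ARNs are whatever your stack created, and the caller needs permission to run the simulation. It evaluates the role's identity-based policies only, so treat the answer as a hint rather than proof.

```shell
# Sketch: ask IAM whether a role's identity policies allow one action on one resource.
# `can_role_do` is a hypothetical helper; all ARNs are supplied by the caller.
can_role_do() {
  if [ "$#" -ne 3 ]; then
    echo "usage: can_role_do ROLE_ARN ACTION RESOURCE_ARN" >&2
    return 2
  fi
  aws iam simulate-principal-policy \
    --policy-source-arn "$1" \
    --action-names "$2" \
    --resource-arns "$3" \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
    --output text
}
```

For example, can_role_do "$TASK_ROLE_ARN" secretsmanager:GetSecretValue "$SECRET_ARN" prints the action followed by a decision such as allowed or implicitDeny. Both variables are placeholders for values from your own deployment.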

It turns out that AWS’ Log Analyzer MCP is a fantastic tool for querying CloudWatch logs from your favorite AI IDE (I use Cursor). You don’t have to know anything about your log structure; you just need to give it access to your AWS account.

I do find it’s helpful to add a Cursor Rule that gives the agent a bit more info about how to use the MCP in the given project:

You have access to AWS Log Analyzer MCP. This allows you to search AWS CloudWatch logs for this project.

If the user asks for help searching for information in the logs, you need to know the date and time so your search results are relevant. Before invoking the MCP, run `date` in your shell so that you know the date and time.
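If time zones ever cause confusion, the date step in the rule could print UTC in ISO 8601 form instead of local time. This is a suggestion under that assumption, not something the MCP requires, and now_utc is a hypothetical name.

```shell
# Sketch: print the current time in UTC, ISO 8601 style, so a log search
# window is not tied to the local time zone of the machine running the agent.
now_utc() {
  date -u +"%Y-%m-%dT%H:%M:%SZ"
}

now_utc
```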

Onward

These examples don’t cover all the DevOps problems LLMs can invent, but I hope this helps save you some time.

