Hosting Operations

Count App Errors by Minute

You need to see when severe application log lines clustered during an incident.

Command

awk 'tolower($0) ~ /(error|fatal|timeout|exception)/ {minute=substr($1,1,16); count[minute]++} END {for (m in count) print count[m], m}' fixtures/incidents/app.log | sort -nr
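The one-liner is easier to follow when spread across lines. The sketch below keeps the same logic and wraps it in a shell function; the name `count_by_minute` is hypothetical, and it assumes the first field of each log line is an ISO-8601 timestamp, as in the fixture.

```shell
# count_by_minute is a hypothetical wrapper; the awk body matches the one-liner.
count_by_minute() {
  awk '
    # Case-insensitive match on any severe keyword, anywhere in the line.
    tolower($0) ~ /(error|fatal|timeout|exception)/ {
      # First 16 characters of field 1 give YYYY-MM-DDTHH:MM.
      minute = substr($1, 1, 16)
      count[minute]++
    }
    # Print one "count minute" pair per bucket; awk does not order these.
    END {
      for (m in count) print count[m], m
    }
  ' "$1" | sort -nr   # highest count first
}
```

Called as `count_by_minute fixtures/incidents/app.log`, it should print the same lines as the one-liner.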

What changed

Nothing changes. The command only reads the log and counts lines that mention error, fatal, timeout, or exception (case-insensitive), grouped by minute.

Danger

safe

When to use it

Use it when you need to compare an error spike with deploys, restarts, or external alerts.

When not to use it

Do not use it as a metrics replacement; it is a quick log-derived approximation.
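One reason the count is only an approximation: the pattern matches its keywords anywhere in the line, so an INFO line whose message mentions a timeout would be counted too. A possible stricter variant is sketched below. It is not the lesson's command: the function name `count_severe_levels` is hypothetical, it assumes space-separated `key=value` fields with a `level=` field as in the fixture, and it drops the timeout and exception keywords entirely.

```shell
# Hypothetical stricter variant: count only lines whose level field is ERROR or FATAL.
# Assumes space-separated key=value fields, with the timestamp in field 1.
count_severe_levels() {
  awk '
    {
      # Scan the fields after the timestamp for an exact level match.
      for (i = 2; i <= NF; i++) {
        if ($i == "level=ERROR" || $i == "level=FATAL") {
          count[substr($1, 1, 16)]++
          break
        }
      }
    }
    END { for (m in count) print count[m], m }
  ' "$1" | sort -nr
}
```

The trade-off is that severe conditions logged at WARN or INFO level are no longer counted, so which variant fits depends on how the application assigns levels.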

Undo or recovery

No undo needed because the command is read-only.

Expected output

One line per minute that had matching lines: the count, then the minute timestamp (YYYY-MM-DDTHH:MM), highest count first.
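To read the counts as a timeline rather than a ranking, the same output can be sorted on the timestamp column instead. A minimal sketch, with only the sort key changed; the function name `count_by_minute_chrono` is hypothetical.

```shell
# Same awk as the lesson command; sort on field 2 (the minute) for chronological order.
# ISO-8601 timestamps sort correctly as plain text.
count_by_minute_chrono() {
  awk 'tolower($0) ~ /(error|fatal|timeout|exception)/ {minute=substr($1,1,16); count[minute]++} END {for (m in count) print count[m], m}' "$1" | sort -k2,2
}
```

Minutes with no matching lines do not appear at all, so gaps in the timeline mean zero matches, not missing data.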

demo script

Disposable terminal steps

  1. cat fixtures/incidents/app.log
  2. awk 'tolower($0) ~ /(error|fatal|timeout|exception)/ {minute=substr($1,1,16); count[minute]++} END {for (m in count) print count[m], m}' fixtures/incidents/app.log | sort -nr

simulated output

What it looks like

disposable vessel
::fixture-ready::
$ cat fixtures/incidents/app.log
2026-06-25T14:00:01Z level=INFO service=api request_id=req-100 msg=started release=2026.06.25.1
2026-06-25T14:01:14Z level=INFO service=worker request_id=req-101 msg=queue_depth value=18
2026-06-25T14:02:06Z level=WARN service=api request_id=req-102 msg=upstream_slow upstream=db latency_ms=2200
2026-06-25T14:03:08Z level=ERROR service=api request_id=req-103 msg=database_timeout timeout_ms=30000
2026-06-25T14:03:12Z level=ERROR service=api request_id=req-103 msg=retry_failed upstream=db
2026-06-25T14:04:44Z level=INFO service=deploy request_id=req-104 msg=release_switch release=2026.06.25.2
2026-06-25T14:05:10Z level=FATAL service=worker request_id=req-105 msg=job_runner_exit code=137
2026-06-25T14:05:12Z level=INFO service=system request_id=req-106 msg=worker_restarted
2026-06-25T14:06:33Z level=ERROR service=api request_id=req-107 msg=payment_provider_500 provider=demo-pay
2026-06-25T14:07:01Z level=WARN service=api request_id=req-108 msg=token=demoTOKEN123 should_be_redacted
::exit-code::0
$ awk 'tolower($0) ~ /(error|fatal|timeout|exception)/ {minute=substr($1,1,16); count[minute]++} END {for (m in count) print count[m], m}' fixtures/incidents/app.log | sort -nr
2 2026-06-25T14:03
1 2026-06-25T14:06
1 2026-06-25T14:05
::exit-code::0

YouTube Short

Bucket errors by minute.

Count severe app lines by minute. It tells you whether the incident was one sharp spike or a continuing failure.

LinkedIn hook

A minute-by-minute count shows whether an incident is a spike or a drip.

Question: When logs show errors, do you bucket them by time before reading each line?

experiments

A/B tests to run

Metric: shorts_3_second_hold_rate

A: Spike or drip?

B: Bucket errors by minute.