Hosting Operations

Read-only, can be slow

Find Feed Links Missing from the Sitemap

You need to compare feed item links against sitemap entries.

Command

grep -o '<link>https://example.com/[^<]*</link>' public/feed.xml | sed 's#<link>##;s#</link>##' | while read -r url; do grep -q "$url" public/sitemap.xml || echo "$url"; done
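
The one-liner passes each URL to grep as a regular expression and accepts substring matches, so a feed link can be counted as present when the sitemap only holds a longer URL that starts the same way. A stricter sketch follows, assuming every sitemap entry is written as <loc>URL</loc> with no whitespace inside the tag. The function name is hypothetical and the default paths are the ones used in the command above.

```shell
# Hypothetical helper: print feed links that have no exact <loc> entry.
# Arguments (both optional): feed path, sitemap path.
feed_links_missing_strict() {
  feed="${1:-public/feed.xml}"
  sitemap="${2:-public/sitemap.xml}"
  grep -o '<link>https://example.com/[^<]*</link>' "$feed" |
    sed 's#<link>##;s#</link>##' |
    while read -r url; do
      # -F treats the pattern as a fixed string, so "." and "?" are literal.
      # Matching the whole <loc> element rules out prefix matches.
      grep -qF -- "<loc>${url}</loc>" "$sitemap" || echo "$url"
    done
}
```

Call it with no arguments from the site root, or pass the feed and sitemap paths explicitly.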

Before you run this

System impact: Read-only. Can be slow on large feeds or sitemaps, because the sitemap is searched once for every feed link.

When not to use it: Do not use it for feeds with multiple domains without adapting the URL filter.

Expected output

Feed item URLs that do not appear in sitemap.xml.

System impact

Read-only, can be slow. Nothing changes. The command extracts feed links and prints any URL absent from the sitemap.
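
The slowness comes from running one grep over the sitemap per feed link. A single-pass sketch is shown here: it extracts the sitemap URLs once into a temporary list, then filters the feed links against that list with exact, fixed-string matching. It assumes mktemp is available and that sitemap entries use <loc>URL</loc>. The function name is hypothetical.

```shell
# Hypothetical helper: same report as the one-liner, but the sitemap
# is read once instead of once per feed link.
feed_links_missing_fast() {
  feed="${1:-public/feed.xml}"
  sitemap="${2:-public/sitemap.xml}"
  locs=$(mktemp) || return 1
  # Collect every sitemap URL, one per line.
  grep -o '<loc>[^<]*</loc>' "$sitemap" | sed 's#<loc>##;s#</loc>##' > "$locs"
  status=0
  # -v -x -F -f: keep feed URLs that equal no line of the sitemap list.
  grep -o '<link>https://example.com/[^<]*</link>' "$feed" |
    sed 's#<link>##;s#</link>##' |
    grep -vxFf "$locs" || status=$?
  rm -f "$locs"
  # grep exits 1 when nothing is missing; only higher codes are errors.
  [ "$status" -le 1 ]
}
```

The output is still one missing URL per line, so it can replace the one-liner where the feed or sitemap is large.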

On large sites, narrow the URL filter to one section (for example, https://example.com/blog/) so that fewer links are checked.

Recovery / rollback: no state is changed.

When to use it

Use after changing feed generation, permalink formats, or sitemap filters.
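
To run the check automatically after such changes, one option is to wrap it so that it fails when anything is missing. This is a minimal sketch that reuses the command above unchanged. The function name and the message text are hypothetical.

```shell
# Hypothetical build gate: returns 1 and lists the URLs on stderr
# when any feed link is absent from the sitemap.
check_feed_in_sitemap() {
  missing=$(
    grep -o '<link>https://example.com/[^<]*</link>' "${1:-public/feed.xml}" |
      sed 's#<link>##;s#</link>##' |
      while read -r url; do
        grep -q "$url" "${2:-public/sitemap.xml}" || echo "$url"
      done
  )
  if [ -n "$missing" ]; then
    printf 'Feed links missing from sitemap:\n%s\n' "$missing" >&2
    return 1
  fi
}
```

A build script can call check_feed_in_sitemap after generating the site and stop on a non-zero status.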

When not to use it

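Do not use it unchanged on feeds whose links span multiple domains. The filter only matches links under https://example.com/, so links on any other host are skipped silently. Adapt the URL filter first. The pattern also assumes RSS-style <link>URL</link> elements; Atom feeds put the URL in an href attribute, which this pattern does not match.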
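
One way to adapt the URL filter for several hosts is to pass them as an alternation in an extended regular expression. The sketch below only replaces the extraction step. The function name and the second host in the usage note are hypothetical.

```shell
# Hypothetical helper: extract <link> URLs for any host in an
# alternation such as 'example\.com|blog\.example\.com'.
# Escape the dots, otherwise "." matches any character.
feed_links_for_hosts() {
  hosts="$1"
  feed="${2:-public/feed.xml}"
  grep -Eo "<link>https://(${hosts})/[^<]*</link>" "$feed" |
    sed 's#<link>##;s#</link>##'
}
```

For example, feed_links_for_hosts 'example\.com|blog\.example\.com' prints the links for both hosts; pipe that output into the same while-loop check used in the command.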

Watch this command run

Command transcript

This sanitized transcript shows the commands and output shape without exposing host details.

$ grep -o '<link>https://example.com/[^<]*</link>' public/feed.xml | sed 's#<link>##;s#</link>##'

https://example.com/blog/post.html
https://example.com/news/missing.html

$ grep -o '<link>https://example.com/[^<]*</link>' public/feed.xml | sed 's#<link>##;s#</link>##' | while read -r url; do grep -q "$url" public/sitemap.xml || echo "$url"; done

https://example.com/news/missing.html
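
To reproduce this transcript without a real build, a throwaway fixture along these lines should be enough. The file contents are invented for illustration: they contain only the elements the command looks at and are not complete, valid RSS or sitemap documents. The function name is hypothetical.

```shell
# Hypothetical helper: create public/feed.xml and public/sitemap.xml
# in a temporary directory and print that directory's path.
make_feed_fixture() {
  dir=$(mktemp -d) || return 1
  mkdir -p "$dir/public"
  cat > "$dir/public/feed.xml" <<'EOF'
<rss version="2.0"><channel>
<item><link>https://example.com/blog/post.html</link></item>
<item><link>https://example.com/news/missing.html</link></item>
</channel></rss>
EOF
  cat > "$dir/public/sitemap.xml" <<'EOF'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://example.com/blog/post.html</loc></url>
</urlset>
EOF
  echo "$dir"
}
```

Change into the printed directory and run the two commands from the transcript; delete the directory afterwards.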

Next steps

Related commands

Hosting Operations (can be slow)

Find HTML Pages Missing from the Sitemap

A page can exist in the build but never make it into the sitemap.

find public -name '*.html' -print | sed 's#^public#https://example.com#' | while read -r url; do grep -q "$url" public/sitemap.xml || echo "$url"; done

Hosting Operations (can be slow)

Find Pages Missing Canonical Links

Canonical tags are easy to drop when templates branch.

find public -name '*.html' -print | while read -r f; do grep -qi 'rel="canonical"' "$f" || echo "$f"; done

Hosting Operations (can be slow)

Find Pages Missing Meta Descriptions

Missing descriptions are usually a content template problem, not a mystery.

find public -name '*.html' -print | while read -r f; do grep -qi 'name="description"' "$f" || echo "$f"; done

Hosting Operations (can be slow)

Find Pages Missing og:title

Social previews often fail because one template missed Open Graph tags.

find public -name '*.html' -print | while read -r f; do grep -qi 'property="og:title"' "$f" || echo "$f"; done

Web Server Rescue (can be slow)

Find Broken Internal Links in Built HTML

A broken internal link is easiest to catch before it becomes a 404.

grep -Rho --include='*.html' 'href="/[^"]*"' public | sed 's#href="##;s#"##' | while read -r path; do test -e "public${path}" || echo "$path"; done | sort -u

Study mapping

Use this as independent command practice: read the notes, predict the output, then compare it with the example before using a real shell.

  • lpic1:103-gnu-unix-commands
  • lfcs:essential-commands
  • lfcs:operations-deployment
  • lfcs:services-logs
  • linuxplus:automation-scripting
  • linuxplus:provisional
  • risk:read-only

Useful for

  • LPIC-1 style command-line practice
  • LFCS style performance tasks
  • Linux+ style troubleshooting review

Independent study support only. No affiliation, endorsement, exam dumps, or real exam questions.