NAME

Sitemap - Sitemap generator for Raku

DESCRIPTION

Sitemap is a Raku module for generating XML sitemaps, crawling websites, and fetching sitemaps from servers.

Features

  • Generate standard XML sitemaps (sitemaps.org protocol)
  • Image & hreflang link extraction during crawl (configurable via --no-images / --no-hreflang)
  • Automatic video & news extraction during crawl (HTML tags + JSON-LD, --videos / --news flags)
  • Separate news sitemap output with 48-hour stale detection
  • Lastmod extraction from HTTP headers (configurable via --no-lastmod)
  • Priority calculation from crawl depth (configurable via --no-priority)
  • Manual API: Video & News sitemaps via Sitemap::Item::Video / Sitemap::Item::News
  • Output formats: XML, HTML, TXT, RSS 2.0, Atom 1.0, Media RSS (mRSS)
  • Crawl websites while honoring robots.txt
  • Parse existing sitemaps (including recursive parsing of sitemap indexes)
  • Fetch sitemaps from remote servers (with recursive support for sitemap indexes)
  • Automatic gzip decompression of .gz sitemaps
  • Automatic sitemap discovery (robots.txt + fallback to /sitemap.xml)
  • Recursive fetch creates descriptive directories (e.g., example.com-sitemap/)
  • Child sitemaps named with path context (e.g., sitemap-blog.xml, sitemap-news-2024.xml)
  • Gzip compression for output files
  • XSL stylesheet support for human-readable XML
  • Ping search engines (Google, Bing) after generation
  • Build hierarchical site trees with depth-inferred priorities
  • Serialize/deserialize site trees to/from YAML
  • Sitemap index for large sites (>50k URLs)

Note: Image, hreflang, lastmod, and priority data appear only in XML output; the other formats omit them.
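
For orientation, the standard XML output follows the sitemaps.org urlset schema. The sketch below is not the module's implementation: xml-escape and urlset-xml are hypothetical helper names, used only to show the minimal shape of such a file (loc entries only, without lastmod, priority, image, or hreflang data).

```raku
# Hypothetical helpers, not part of the Sitemap module's API.

# Escape the five characters that XML treats specially (single pass).
sub xml-escape(Str $s --> Str) {
    $s.trans(['&', '<', '>', '"', "'"] => ['&amp;', '&lt;', '&gt;', '&quot;', '&apos;'])
}

# Render a list of URLs as a minimal sitemaps.org <urlset> document.
sub urlset-xml(@urls --> Str) {
    my @entries = @urls.map({ "  <url><loc>{ xml-escape($_) }</loc></url>" });
    (
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
        |@entries,
        '</urlset>',
    ).join("\n")
}

say urlset-xml(['https://example.com/', 'https://example.com/search?q=a&page=2']);
```

Escaping matters because query strings commonly contain an ampersand, which is not valid unescaped inside a loc element.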

Dependencies

Installation

zef install Sitemap

Installing from source

cd /path/to/Sitemap
zef install .

Requires Raku 6.d or later.

SYNOPSIS

Command Line

# Crawl a website and generate XML sitemap
sitemap https://example.com

# Crawl with options
sitemap https://example.com -o sitemap.xml -f xml -v

# Generate HTML sitemap
sitemap https://example.com -f html -o sitemap.html

# Generate RSS feed
sitemap https://example.com -f rss -o feed.rss

# Build from URL list file
sitemap build urls.txt -o sitemap.xml

# Parse existing sitemap
sitemap parse existing-sitemap.xml

# Fetch sitemap from server (auto-discover from robots.txt)
sitemap fetch example.com

# Fetch with recursive support (downloads all child sitemaps in index)
sitemap fetch example.com --recursive

# Recursive fetch with descriptive directory naming
# Creates example.com-sitemap/ directory with context-aware filenames
sitemap fetch example.com --recursive -v

# Crawl local directory and generate sitemap
sitemap dir ./my-site

# Crawl directory with custom base URL
sitemap dir ./my-site --base-url https://example.com -v

# Build from JSON or YAML site tree definition (priorities inferred from depth)
sitemap tree site.json -o sitemap.xml
sitemap tree site.yaml -o sitemap.xml

# View the tree structure
sitemap tree site.json --verbose

Library Usage

Sitemap::Builder (Generate Sitemaps)

use Sitemap::Builder;

# Create sitemap from URLs
my $builder = Sitemap::Builder.new;
$builder.add-item('https://example.com/page1', :priority(0.8));
$builder.add-item('https://example.com/page2', :lastmod(DateTime.now));

# Write to file (auto-splits at 50k URLs)
$builder.write: 'sitemap.xml';

# With compression and XSL
my $builder2 = Sitemap::Builder.new(:compress, :xsl-url('/sitemap.xsl'));
$builder2.add-item('https://example.com/page');
$builder2.write: 'sitemap.xml';

# Discover sitemap from robots.txt (with /sitemap.xml fallback)
my $discovered = $builder.discover-sitemap('https://example.com');
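
The discovery rule described above (Sitemap entries in robots.txt, falling back to /sitemap.xml) can be sketched as a pure function over already-downloaded robots.txt text. This is an illustration under assumptions, not the module's code: sitemap-candidates is a hypothetical name and no HTTP request is made here.

```raku
# Hypothetical helper, not part of the Sitemap module's API.
# Returns the sitemap URLs declared in robots.txt, or the conventional
# /sitemap.xml location when none are declared.
sub sitemap-candidates(Str $robots-txt, Str $base-url --> List) {
    my @declared = $robots-txt.lines.map({
        $_ ~~ / ^ \s* :i 'sitemap' \s* ':' \s* (\S+) / ?? ~$0 !! Empty
    });
    return @declared.List if @declared;
    # Fallback: strip a trailing slash, then append the default path.
    ($base-url.subst(/ '/' $ /, '') ~ '/sitemap.xml',)
}

my $robots = "User-agent: *\nDisallow: /private/\nSitemap: https://example.com/sitemap-index.xml\n";
say sitemap-candidates($robots, 'https://example.com');
say sitemap-candidates('User-agent: *', 'https://example.com/');
```

The directive name is matched case-insensitively, since robots.txt files in the wild spell it both "Sitemap:" and "sitemap:".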

Sitemap::Crawler (Crawl Websites)

use Sitemap::Crawler;

# Crawl a website
my $crawler = Sitemap::Crawler.new('https://example.com');

$crawler.on-add: -> $url {
    say "Found: $url";
};

my $builder = $crawler.crawl;
$builder.write: 'sitemap.xml';

# Crawl with automatic video and news extraction
my $crawler2 = Sitemap::Crawler.new(
    'https://example.com',
    :extract-videos,
    :extract-news,
);
my $builder2 = $crawler2.crawl;
$builder2.write: 'sitemap.xml';

# News builder contains extracted NewsArticle objects
# (only populated when extract-news is True)
if $crawler2.news-builder -> $nb {
    $nb.write: 'sitemap-news.xml';
}
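
The 48-hour stale detection listed under Features amounts to an age check on the publication date. A minimal sketch, assuming a helper named is-stale (hypothetical; the module's own check may differ in name and boundary handling):

```raku
# Hypothetical helper; the 48-hour window comes from the feature list,
# the function name and signature are assumptions.
sub is-stale(DateTime $published, DateTime :$now = DateTime.now, Int :$max-age-hours = 48 --> Bool) {
    # Subtracting two DateTime objects yields a Duration in seconds.
    ($now - $published) > $max-age-hours * 3600
}

my $now = DateTime.new('2024-01-03T12:00:00Z');
say is-stale(DateTime.new('2024-01-01T00:00:00Z'), :$now);  # True  (60 hours old)
say is-stale(DateTime.new('2024-01-02T18:00:00Z'), :$now);  # False (18 hours old)
```

Passing :$now explicitly keeps the check deterministic in tests; in normal use the default DateTime.now applies.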

Sitemap::DirScanner (Crawl Local Directories)

use Sitemap::DirScanner;

# scan-dir returns a Pair: builder => news-builder
my $result = scan-dir('./my-site',
    :base-url<https://example.com>,
    :extract-videos,
    :extract-news,
);
my $builder = $result.key;
my $news-builder = $result.value;

$builder.write: 'sitemap.xml';

if $news-builder {
    $news-builder.write: 'sitemap-news.xml';
}

Sitemap::SiteTree (Build from Site Hierarchy)

use Sitemap::SiteTree;

# Build a tree - priorities auto-inferred from depth
my $root = Sitemap::SiteTree.new;
my $blog = $root.add-child('blog', :priority(0.8));
$blog.add-child('first-post');
$blog.add-child('second-post');

my $builder = $root.to-builder('https://example.com');
$builder.write: 'sitemap.xml';
# Priorities: root=1.0, blog=0.8, first-post=0.6, second-post=0.6

# Wire parents from a flat list
my $home  = Sitemap::SiteTree.new(:stub<home>);
my $about = Sitemap::SiteTree.new(:stub<about>, :parent-stub<home>);
Sitemap::SiteTree.wire-parents([$home, $about]);

# Debug tree visualization
say $root.tree;

# Serialize to/from YAML
my $yaml = $root.to-yaml;
"site-tree.yml".IO.spurt: $yaml;

my $reconstructed = Sitemap::SiteTree.from-yaml($yaml);
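
The priorities in the comment above (1.0, 0.8, 0.6) are consistent with a fixed step of 0.2 per tree level. The sketch below assumes exactly that rule plus a lower bound of 0.1; both the step and the floor are assumptions made for illustration, and priority-for-depth is a hypothetical name rather than a SiteTree method.

```raku
# Hypothetical helper: depth 0 is the root. The 0.2 step matches the
# documented example; the 0.1 floor is an assumption.
sub priority-for-depth(Int $depth, Rat :$step = 0.2, Rat :$floor = 0.1 --> Rat) {
    max($floor, 1.0 - $step * $depth)
}

say (0..2).map({ priority-for-depth($_) });  # (1 0.8 0.6)
say priority-for-depth(10);                  # 0.1
```

Rat arithmetic keeps the values exact, so a depth of 2 yields 0.6 rather than a floating-point approximation.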

YAML Serialization

Sitemap::SiteTree and Sitemap::Item support serialization to/from YAML via to-yaml/from-yaml (on SiteTree) and to-hash/from-hash (on both).

use Sitemap::SiteTree;

# Build a tree
my $root = Sitemap::SiteTree.new;
my $blog = $root.add-child('blog', :priority(0.8));
$blog.add-child('first-post');

# Roundtrip through YAML
my $yaml = $root.to-yaml;
my $tree = Sitemap::SiteTree.from-yaml($yaml);

# Rebuild the sitemap
my $builder = $tree.to-builder('https://example.com');
$builder.write: 'sitemap.xml';

Items and sub-resources (images, videos, links, news) can be serialized independently:

use Sitemap::Item;
use YAMLish;

my $item = Sitemap::Item.new(
    url => 'https://example.com/page',
    priority => 0.8,
);
$item.add-image('https://example.com/img.jpg');

my $yaml = save-yaml($item.to-hash);
my $copy = Sitemap::Item.from-hash(load-yaml($yaml));

The YAML format maps directly to the object structure: parent-child relationships are preserved through nested children arrays.

Sitemap::Fetcher (Fetch Sitemaps from Servers)

use Sitemap::Fetcher;

# Fetch a single sitemap
my $xml = fetch('https://example.com/sitemap.xml', :verbose);

# Fetch sitemap with auto-discovery (robots.txt + /sitemap.xml fallback)
my $discovered-xml = fetch('https://example.com', :ssl-verify(False));

# Recursive fetch (downloads all child sitemaps in a sitemap index)
my %result = fetch-recursive(
    'https://example.com/sitemap.xml',
    'xml',
    '',
    :recursive,
    :verbose
);
# %result<index-file> = saved index file
# %result<dir> = directory with child sitemaps
# %result<children> = list of child sitemap files

# Generate output filename with path context
my $filename = output-filename(
    'https://example.com/blog/sitemap.xml',
    'xml'
);
# Returns: sitemap-blog.xml

Sitemap::Item::Video / Sitemap::Item::News (Manual Video and News Data)

For non-crawl workflows, video and news data can be added manually via the builder API:

use Sitemap::Builder;
my $builder = Sitemap::Builder.new;

# Add video sitemap data
$builder.add-item('https://example.com/video-page',
    videos => [
        Sitemap::Item::Video.new(
            thumbnail-loc => 'https://example.com/thumb.jpg',
            title         => 'My Video Title',
            description   => 'A description of the video',
        )
    ]
);

# Add news sitemap data
$builder.add-item('https://example.com/news-article',
    news => [
        Sitemap::Item::News.new(
            publication          => 'Example News',
            publication-language => 'en',
            title                => 'Article Headline',
            publication-date     => DateTime.new('2024-01-01T00:00:00Z'),
        )
    ]
);

Sitemap::JsonLd (Extract JSON-LD Structured Data)

use Sitemap::JsonLd;

# Extract all JSON-LD objects from HTML
my @objects = extract-jsonld($html);

# Filter by type (supports schema.org URL prefixes and @graph)
my @videos = find-by-type(@objects.List, 'VideoObject');
my @news = find-by-type(@objects.List, 'NewsArticle');

# Convenience: extract video/news objects with full field mapping
my @video-objects = extract-video-objects($html);
# Returns: url, thumbnail, title, description, duration (seconds), pub-date

my @news-objects = extract-news-objects($html);
# Returns: publication, language, title, publication-date, stale (Bool)
# Matches: NewsArticle, ReportageNewsArticle, OpinionNewsArticle,
#          ReviewNewsArticle, AnalysisNewsArticle, BackgroundNewsArticle
# Does NOT match: Article, BlogPosting, or other non-news types
# Articles older than 48h are marked stale => True

# Parse ISO 8601 durations (e.g., from VideoObject)
my $seconds = parse-iso8601-duration('PT1H2M3S');  # 3723
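
To make the PT1H2M3S example concrete, here is a small stand-alone sketch of the conversion. It is not the module's parse-iso8601-duration: duration-to-seconds is a hypothetical name, and it handles only the time part (hours, minutes, seconds) with integer values, ignoring date components and fractional seconds.

```raku
# Hypothetical helper covering only PT<n>H<n>M<n>S with integer values.
sub duration-to-seconds(Str $iso --> Int) {
    my %unit = H => 3600, M => 60, S => 1;
    die "Unsupported duration: $iso"
        unless $iso ~~ / ^ 'PT' [ \d+ <[HMS]> ]+ $ /;

    my $total = 0;
    # Each match captures a number and its unit letter.
    for $iso.match(/ (\d+) (<[HMS]>) /, :g) -> $m {
        $total += $m[0].Int * %unit{ $m[1].Str };
    }
    $total
}

say duration-to-seconds('PT1H2M3S');  # 3723
say duration-to-seconds('PT90S');     # 90
```

The first value is 1 × 3600 + 2 × 60 + 3 = 3723, which agrees with the result shown above.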

CLI Options

Crawl Command (sitemap <url>)

  • -o, --output <file> - Output file (default: sitemap.)
  • -f, --format <format> - Output format: xml, html, txt, rss, atom, mrss (default: xml)
  • -v, --verbose - Verbose output
  • -d, --max-depth <n> - Maximum crawl depth
  • -u, --max-urls <n> - Maximum number of URLs
  • -m, --max-urls-per-file <n> - Max URLs per sitemap file (default: 50000)
  • -a, --user-agent <str> - User agent string
  • --no-respect-robots - Don't respect robots.txt (default: respect)
  • --no-images - Don't extract images during crawl (default: extract)
  • --no-hreflang - Don't extract hreflang links during crawl (default: extract)
  • --videos - Extract video metadata from HTML tags and JSON-LD (default: off)
  • --news - Extract NewsArticle JSON-LD and write a separate news sitemap (default: off)
  • --no-lastmod - Don't extract lastmod from HTTP headers (default: extract)
  • --no-priority - Don't calculate priority from crawl depth (default: calculate)
  • --no-verify-ssl - Don't verify SSL certificates (default: verify)
  • --ping - Ping search engines after generating sitemap
  • -z, --compress - Gzip compress output file
  • --no-pretty - Disable pretty printing (XML output only, default: pretty)
  • --xsl <url> - Add XSL stylesheet reference
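
The --max-urls-per-file limit implies that a large URL list is cut into chunks, one per sitemap file, which a sitemap index then references. A rough sketch of the chunking step follows; split-for-index and the sitemap-N.xml file names are assumptions for illustration and may not match what the module writes.

```raku
# Hypothetical helper: split URLs into chunks of at most $max-per-file.
sub split-for-index(@urls, Int :$max-per-file = 50_000 --> List) {
    # :partial keeps the final, shorter chunk instead of dropping it.
    @urls.rotor($max-per-file, :partial).map(*.List).List
}

my @urls   = (1..5).map({ "https://example.com/page-$_" });
my @chunks = split-for-index(@urls, :max-per-file(2));

for @chunks.kv -> $i, @chunk {
    say "sitemap-{ $i + 1 }.xml: { @chunk.elems } URLs";
}
```

With five URLs and a limit of two, this produces three chunks of 2, 2, and 1 URLs; the default of 50,000 matches the per-file maximum in the sitemaps.org protocol.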

Fetch Command (sitemap fetch <url>)

  • -o, --output <file> - Output file (default: preserve original filename)
  • -f, --format <fmt> - Output format: xml (default), html, txt, rss, atom, mrss
  • -r, --recursive - Fetch all sitemaps in a sitemap index
  • --force - Delete existing output dir before recursive fetch
  • --no-verify-ssl - Don't verify SSL certificates
  • -v, --verbose - Verbose output

Note: Recursive fetch creates a descriptive directory (e.g., example.com-sitemap/) with path-aware filenames (e.g., sitemap-blog.xml, sitemap-news-2024.xml).

Build Command (sitemap build <file>)

  • -o, --output <file> - Output file (default: sitemap.)
  • -f, --format <fmt> - Output format: xml, html, txt, rss, atom, mrss (default: xml)
  • -m, --max-urls-per-file <n> - Max URLs per sitemap file (default: 50000)
  • --base-url <url> - Base URL for ping (required with --ping)
  • --ping - Ping search engines after generating
  • -v, --verbose - Verbose output

Parse Command (sitemap parse <file|url>)

  • -r, --recursive - Enable recursive parsing
  • --recursive-depth=<n> - Set recursive depth (0=infinite, 1+=limited)
  • --no-verify-ssl - Don't verify SSL certificates
  • -v, --verbose - Verbose output (show lastmod)
  • -h, --help - Show this help message

Convert Command (sitemap convert <file>)

  • -o, --output <file> - Output file (default: derived from input)
  • -f, --format <fmt> - Output format: xml, html, txt, rss, atom, mrss (default: txt)
  • -v, --verbose - Verbose output
  • -h, --help - Show this help message

Tree Command (sitemap tree <file>)

Build a sitemap from a JSON (.json) or YAML (.yml/.yaml) site tree definition. Priorities are auto-inferred from tree depth.

JSON format:

{
  "base_url": "https://example.com",
  "pages": [
    {"stub": "home", "priority": 1.0},
    {"stub": "about",
     "children": [
       {"stub": "team"}
     ]},
    {"stub": "blog",
     "children": [
       {"stub": "first-post"},
       {"stub": "second-post"}
     ]}
  ]
}

YAML format:

base_url: "https://example.com"
pages:
  - stub: home
    priority: 1.0
  - stub: about
    children:
      - stub: team
  - stub: blog
    children:
      - stub: first-post
      - stub: second-post

Options:

  • -o, --output <file> - Output file (default: sitemap.)
  • -f, --format <fmt> - Output format: xml, html, txt, rss, atom, mrss (default: xml)
  • -z, --compress - Gzip compress output file
  • --no-pretty - Disable pretty printing (XML only)
  • --xsl <url> - Add XSL stylesheet reference
  • --ping - Ping search engines after generating
  • -v, --verbose - Show tree structure during generation

Dir Command (sitemap dir <directory>)

  • --base-url <url> - Base URL for sitemap entries (auto-detect from robots.txt, fallback: file:///)
  • -d, --max-depth <n> - Maximum crawl depth (default: 0 = unlimited)
  • -u, --max-urls <n> - Maximum number of URLs (default: 0 = unlimited)
  • -m, --max-urls-per-file <n> - Max URLs per sitemap file (default: 50000)
  • --no-images - Don't extract images (default: extract)
  • --no-hreflang - Don't extract hreflang links (default: extract)
  • --videos - Extract video metadata from HTML tags and JSON-LD (default: off)
  • --news - Extract NewsArticle JSON-LD and write a separate news sitemap (default: off)
  • --no-priority - Don't calculate priority from depth (default: calculate)
  • --ping - Ping search engines after generating
  • Standard output options also available: -o, -f, -z, --no-pretty, --xsl, -v

AUTHOR

Sasha Abbott <sashaa@disroot.org>

LICENSE

This library is free software; you can redistribute it and/or modify it under CC0.