NAME
Sitemap - Sitemap generator for Raku
DESCRIPTION
Sitemap is a Raku module for generating XML sitemaps, crawling websites, and fetching sitemaps from servers.
Features
- Generate standard XML sitemaps (sitemaps.org protocol)
- Image & hreflang link extraction during crawl (configurable via --no-images / --no-hreflang)
- Automatic video & news extraction during crawl (HTML tags + JSON-LD, --videos/--news flags)
- Separate news sitemap output with 48-hour stale detection
- Lastmod extraction from HTTP headers (configurable via --no-lastmod)
- Priority calculation from crawl depth (configurable via --no-priority)
- Manual API: Video & News sitemaps via Sitemap::Item::Video/Sitemap::Item::News
- Output formats: XML, HTML, TXT, RSS 2.0, Atom 1.0, Media RSS (mRSS)
- Crawl websites with respect for robots.txt
- Parse existing sitemaps (including recursive parsing of sitemap indexes)
- Fetch sitemaps from remote servers (with recursive support for sitemap indexes)
- Automatic gzip decompression of .gz sitemaps
- Automatic sitemap discovery (robots.txt + fallback to /sitemap.xml)
- Recursive fetch creates descriptive directories (e.g., example.com-sitemap/)
- Child sitemaps named with path context (e.g., sitemap-blog.xml, sitemap-news-2024.xml)
- Gzip compression for output files
- XSL stylesheet support for human-readable XML
- Ping search engines (Google, Bing) after generation
- Build hierarchical site trees with depth-inferred priorities
- Serialize/deserialize site trees to/from YAML
- Sitemap index for large sites (>50k URLs)
Note: Image, hreflang, lastmod, and priority data appear only in XML format output.
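The sitemaps.org protocol caps a single sitemap file at 50,000 URLs, which is why larger sites need an index pointing at several files. As a rough illustration of the splitting step only, a URL list can be cut into file-sized batches with core Raku. The helper name `chunk-urls` and the `sitemap-N.xml` naming are hypothetical and not part of this module's API.

```raku
# Hypothetical helper, not part of Sitemap: split a URL list into
# batches of at most $max-per-file entries, one batch per sitemap file.
sub chunk-urls(@urls, Int $max-per-file = 50_000) {
    return @urls.rotor($max-per-file, :partial).map(*.List).List;
}

my @urls = (1..120_000).map({ "https://example.com/page/$_" });
my @chunks = chunk-urls(@urls);

# Each batch would become one sitemap file listed in the index.
for @chunks.kv -> $i, $batch {
    say "sitemap-{$i + 1}.xml: {$batch.elems} URLs";
}
```

The `:partial` flag keeps the final, shorter batch instead of dropping it.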
Dependencies
- Cro::HTTP - HTTP client for crawling and fetching
- URI - URL parsing and resolution
- LibXML - XML parsing
- LibXML::Writer - XML generation (sitemaps, RSS, Atom)
- Compress::Zlib - Gzip compression and decompression
- JSON::Fast - JSON parsing (JSON-LD extraction)
- YAMLish - YAML serialization (SiteTree)
- Syndicate - Feed format generation (RSS, Atom, MRSS)
Installation
zef install Sitemap
Installing from source
cd /path/to/Sitemap
zef install .
Requires Raku 6.d or later.
SYNOPSIS
Command Line
# Crawl a website and generate XML sitemap
sitemap https://example.com
# Crawl with options
sitemap https://example.com -o sitemap.xml -f xml -v
# Generate HTML sitemap
sitemap https://example.com -f html -o sitemap.html
# Generate RSS feed
sitemap https://example.com -f rss -o feed.rss
# Build from URL list file
sitemap build urls.txt -o sitemap.xml
# Parse existing sitemap
sitemap parse existing-sitemap.xml
# Fetch sitemap from server (auto-discover from robots.txt)
sitemap fetch example.com
# Fetch with recursive support (downloads all child sitemaps in index)
sitemap fetch example.com --recursive
# Recursive fetch with descriptive directory naming
# Creates example.com-sitemap/ directory with context-aware filenames
sitemap fetch example.com --recursive -v
# Crawl local directory and generate sitemap
sitemap dir ./my-site
# Crawl directory with custom base URL
sitemap dir ./my-site --base-url https://example.com -v
# Build from JSON or YAML site tree definition (priorities inferred from depth)
sitemap tree site.json -o sitemap.xml
sitemap tree site.yaml -o sitemap.xml
# View the tree structure
sitemap tree site.json --verbose
Library Usage
Sitemap::Builder (Generate Sitemaps)
use Sitemap::Builder;
# Create sitemap from URLs
my $builder = Sitemap::Builder.new;
$builder.add-item('https://example.com/page1', :priority(0.8));
$builder.add-item('https://example.com/page2', :lastmod(DateTime.now));
# Write to file (auto-splits at 50k URLs)
$builder.write: 'sitemap.xml';
# With compression and XSL
my $builder2 = Sitemap::Builder.new(:compress, :xsl-url('/sitemap.xsl'));
$builder2.add-item('https://example.com/page');
$builder2.write: 'sitemap.xml';
# Discover sitemap from robots.txt (with /sitemap.xml fallback)
my $discovered = $builder.discover-sitemap('https://example.com');
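Discovery from robots.txt boils down to reading its `Sitemap:` lines and falling back to `/sitemap.xml` when there are none. The following is a minimal sketch of that idea in core Raku, working on robots.txt text that has already been downloaded. The function name `sitemaps-from-robots` is an assumption for illustration; it is not how `discover-sitemap` is necessarily implemented.

```raku
# Hypothetical helper: list the sitemap URLs declared in a robots.txt body,
# or fall back to <base-url>/sitemap.xml when none are declared.
sub sitemaps-from-robots(Str $robots-txt, Str $base-url) {
    my @found;
    for $robots-txt.lines -> $line {
        # The directive name is case-insensitive: "Sitemap: <url>"
        if $line ~~ / :i ^ \s* 'sitemap' \s* ':' \s* (\S+) / {
            @found.push: ~$0;
        }
    }
    # Strip trailing slashes from the base URL before appending the default path.
    @found.push: $base-url.subst(/ \/+ $ /, '') ~ '/sitemap.xml' unless @found;
    return @found;
}

my $robots = "User-agent: *\nDisallow: /private/\nSitemap: https://example.com/sitemap-index.xml\n";
say sitemaps-from-robots($robots, 'https://example.com');
say sitemaps-from-robots("User-agent: *\n", 'https://example.com/');
```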
Sitemap::Crawler (Crawl Websites)
use Sitemap::Crawler;
# Crawl a website
my $crawler = Sitemap::Crawler.new('https://example.com');
$crawler.on-add: -> $url {
    say "Found: $url";
};
my $builder = $crawler.crawl;
$builder.write: 'sitemap.xml';
# Crawl with automatic video and news extraction
my $crawler2 = Sitemap::Crawler.new(
    'https://example.com',
    :extract-videos,
    :extract-news,
);
my $builder2 = $crawler2.crawl;
$builder2.write: 'sitemap.xml';
# News builder contains extracted NewsArticle objects
# (only populated when extract-news is True)
if $crawler2.news-builder -> $nb {
    $nb.write: 'sitemap-news.xml';
}
Sitemap::DirScanner (Crawl Local Directories)
use Sitemap::DirScanner;
# scan-dir returns a Pair: builder => news-builder
my $result = scan-dir('./my-site',
    :base-url<https://example.com>,
    :extract-videos,
    :extract-news,
);
my $builder = $result.key;
my $news-builder = $result.value;
$builder.write: 'sitemap.xml';
if $news-builder {
    $news-builder.write: 'sitemap-news.xml';
}
Sitemap::SiteTree (Build from Site Hierarchy)
use Sitemap::SiteTree;
# Build a tree - priorities auto-inferred from depth
my $root = Sitemap::SiteTree.new;
my $blog = $root.add-child('blog', :priority(0.8));
$blog.add-child('first-post');
$blog.add-child('second-post');
my $builder = $root.to-builder('https://example.com');
$builder.write: 'sitemap.xml';
# Priorities: root=1.0, blog=0.8, first-post=0.6, second-post=0.6
# Wire parents from a flat list
my $home = Sitemap::SiteTree.new(:stub<home>);
my $about = Sitemap::SiteTree.new(:stub<about>, :parent-stub<home>);
Sitemap::SiteTree.wire-parents([$home, $about]);
# Debug tree visualization
say $root.tree;
# Serialize to/from YAML
my $yaml = $root.to-yaml;
"site-tree.yml".IO.spurt: $yaml;
my $reconstructed = Sitemap::SiteTree.from-yaml($yaml);
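The priorities listed in the comment above (1.0, 0.8, 0.6) are consistent with a rule that subtracts 0.2 per level of depth. A sketch of such a rule follows; the exact formula, the 0.1 floor, and the name `depth-priority` are assumptions made for illustration, and the module may compute priorities differently.

```raku
# Hypothetical rule: start at 1.0 for the root and drop 0.2 per level,
# never going below 0.1. An explicit :priority would override this value.
sub depth-priority(Int $depth --> Rat) {
    return max(0.1, 1.0 - 0.2 * $depth);
}

for 0..5 -> $depth {
    say "depth $depth => { depth-priority($depth) }";
}
```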
YAML Serialization
Sitemap::SiteTree and Sitemap::Item support serialization to/from YAML
via to-yaml/from-yaml (on SiteTree) and to-hash/from-hash (on both).
use Sitemap::SiteTree;
# Build a tree
my $root = Sitemap::SiteTree.new;
my $blog = $root.add-child('blog', :priority(0.8));
$blog.add-child('first-post');
# Roundtrip through YAML
my $yaml = $root.to-yaml;
my $tree = Sitemap::SiteTree.from-yaml($yaml);
# Rebuild the sitemap
my $builder = $tree.to-builder('https://example.com');
$builder.write: 'sitemap.xml';
Items and sub-resources (images, videos, links, news) can be serialized independently:
use Sitemap::Item;
use YAMLish;
my $item = Sitemap::Item.new(
    url => 'https://example.com/page',
    priority => 0.8,
);
$item.add-image('https://example.com/img.jpg');
my $yaml = save-yaml($item.to-hash);
my $copy = Sitemap::Item.from-hash(load-yaml($yaml));
The YAML format maps directly to the object structure - parent-child
relationships are preserved via nested children arrays.
Sitemap::Fetcher (Fetch Sitemaps from Servers)
use Sitemap::Fetcher;
# Fetch a single sitemap
my $xml = fetch('https://example.com/sitemap.xml', :verbose);
# Fetch sitemap with auto-discovery (robots.txt + /sitemap.xml fallback)
my $discovered-xml = fetch('https://example.com', :ssl-verify(False));
# Recursive fetch (downloads all child sitemaps in a sitemap index)
my %result = fetch-recursive(
    'https://example.com/sitemap.xml',
    'xml',
    '',
    :recursive,
    :verbose
);
# %result<index-file> = saved index file
# %result<dir> = directory with child sitemaps
# %result<children> = list of child sitemap files
# Generate output filename with path context
my $filename = output-filename(
    'https://example.com/blog/sitemap.xml',
    'xml'
);
# Returns: sitemap-blog.xml
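To make the "path context" naming concrete, here is one way such a filename could be derived from a sitemap URL using only core Raku. This is a sketch that reproduces the two example names in this README (sitemap-blog.xml, sitemap-news-2024.xml); the helper name `path-context-name` is hypothetical and the real `output-filename` may apply different rules.

```raku
# Hypothetical helper: build "<stem>-<dir>-<dir>.<ext>" from a sitemap URL.
sub path-context-name(Str $url, Str $ext --> Str) {
    # Drop the scheme and host, keeping only the path.
    my $path  = $url.subst(/ ^ \w+ '://' <-[\/]>+ /, '');
    my @parts = $path.split('/', :skip-empty);
    # The last segment is the file; its name before the first dot is the stem.
    my $file  = @parts ?? @parts.pop !! 'sitemap';
    my $stem  = $file.split('.')[0];
    # Remaining segments are the directories that give the file its context.
    return ($stem, |@parts).join('-') ~ ".$ext";
}

say path-context-name('https://example.com/blog/sitemap.xml', 'xml');
say path-context-name('https://example.com/news/2024/sitemap.xml', 'xml');
say path-context-name('https://example.com/sitemap.xml', 'xml');
```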
For non-crawl workflows, Video and News data can be added manually via the builder API:
use Sitemap::Builder;
my $builder = Sitemap::Builder.new;
# Add video sitemap data
$builder.add-item('https://example.com/video-page',
    videos => [
        Sitemap::Item::Video.new(
            thumbnail-loc => 'https://example.com/thumb.jpg',
            title => 'My Video Title',
            description => 'A description of the video',
        )
    ]
);
# Add news sitemap data
$builder.add-item('https://example.com/news-article',
    news => [
        Sitemap::Item::News.new(
            publication => 'Example News',
            publication-language => 'en',
            title => 'Article Headline',
            # DateTime.new needs a full ISO 8601 timestamp, not a bare date
            publication-date => DateTime.new('2024-01-01T00:00:00Z'),
        )
    ]
);
Sitemap::JsonLd (Extract JSON-LD Structured Data)
use Sitemap::JsonLd;
# Extract all JSON-LD objects from HTML
my @objects = extract-jsonld($html);
# Filter by type (supports schema.org URL prefixes and @graph)
my @videos = find-by-type(@objects.List, 'VideoObject');
my @news = find-by-type(@objects.List, 'NewsArticle');
# Convenience: extract video/news objects with full field mapping
my @video-objects = extract-video-objects($html);
# Returns: url, thumbnail, title, description, duration (seconds), pub-date
my @news-objects = extract-news-objects($html);
# Returns: publication, language, title, publication-date, stale (Bool)
# Matches: NewsArticle, ReportageNewsArticle, OpinionNewsArticle,
# ReviewNewsArticle, AnalysisNewsArticle, BackgroundNewsArticle
# Does NOT match: Article, BlogPosting, or other non-news types
# Articles older than 48h are marked stale => True
# Parse ISO 8601 durations (e.g., from VideoObject)
my $seconds = parse-iso8601-duration('PT1H2M3S'); # 3723
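The two calculations mentioned in the comments above, converting an ISO 8601 duration to seconds and flagging articles older than 48 hours, can be sketched in core Raku as follows. The names `duration-to-seconds` and `is-stale` are hypothetical stand-ins, and the duration parser only handles the `PT…H…M…S` time form with whole numbers; the module's own functions may accept more.

```raku
# Hypothetical stand-in for a duration parser: handles only the time part
# of an ISO 8601 duration (hours, minutes, whole seconds), e.g. PT1H2M3S.
sub duration-to-seconds(Str $d --> Int) {
    die "Unsupported duration: $d" unless $d ~~ / ^ 'PT' [ \d+ <[HMS]> ]+ $ /;
    my %factor = H => 3600, M => 60, S => 1;
    my $total = 0;
    for $d.match(/ (\d+) (<[HMS]>) /, :g) -> $m {
        $total += $m[0].Int * %factor{~$m[1]};
    }
    return $total;
}

# Hypothetical stale check: True when the article is more than 48 hours old.
sub is-stale(DateTime $published, DateTime $now = DateTime.now --> Bool) {
    return ($now - $published) > 48 * 3600;
}

say duration-to-seconds('PT1H2M3S');   # 3723
my $now = DateTime.new('2024-01-03T12:00:00Z');
say is-stale(DateTime.new('2024-01-01T00:00:00Z'), $now);   # True  (60 hours old)
say is-stale(DateTime.new('2024-01-02T13:00:00Z'), $now);   # False (23 hours old)
```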
CLI Options
Crawl Command (sitemap <url>)
- -o, --output <file> - Output file (default: sitemap.)
- -f, --format <format> - Output format: xml, html, txt, rss, atom, mrss (default: xml)
- -v, --verbose - Verbose output
- -d, --max-depth <n> - Maximum crawl depth
- -u, --max-urls <n> - Maximum number of URLs
- -m, --max-urls-per-file <n> - Max URLs per sitemap file (default: 50000)
- -a, --user-agent <str> - User agent string
- --no-respect-robots - Don't respect robots.txt (default: respect)
- --no-images - Don't extract images during crawl (default: extract)
- --no-hreflang - Don't extract hreflang links during crawl (default: extract)
- --videos - Extract video metadata from HTML tags and JSON-LD (default: off)
- --news - Extract NewsArticle JSON-LD and write a separate news sitemap (default: off)
- --no-lastmod - Don't extract lastmod from HTTP headers (default: extract)
- --no-priority - Don't calculate priority from crawl depth (default: calculate)
- --no-verify-ssl - Don't verify SSL certificates (default: verify)
- --ping - Ping search engines after generating sitemap
- -z, --compress - Gzip compress output file
- --no-pretty - Disable pretty printing (XML output only, default: pretty)
- --xsl <url> - Add XSL stylesheet reference
Fetch Command (sitemap fetch <url>)
- -o, --output <file> - Output file (default: preserve original filename)
- -f, --format <fmt> - Output format: xml (default), html, txt, rss, atom, mrss
- -r, --recursive - Fetch all sitemaps in a sitemap index
- --force - Delete existing output dir before recursive fetch
- --no-verify-ssl - Don't verify SSL certificates
- -v, --verbose - Verbose output
Note: Recursive fetch creates a descriptive directory (e.g., example.com-sitemap/) with path-aware filenames (e.g., sitemap-blog.xml, sitemap-news-2024.xml).
Build Command (sitemap build <file>)
- -o, --output <file> - Output file (default: sitemap.)
- -f, --format <fmt> - Output format: xml, html, txt, rss, atom, mrss (default: xml)
- -m, --max-urls-per-file <n> - Max URLs per sitemap file (default: 50000)
- --base-url <url> - Base URL for ping (required with --ping)
- --ping - Ping search engines after generating
- -v, --verbose - Verbose output
Parse Command (sitemap parse <file|url>)
- -r, --recursive - Enable recursive parsing
- --recursive-depth=<n> - Set recursive depth (0=infinite, 1+=limited)
- --no-verify-ssl - Don't verify SSL certificates
- -v, --verbose - Verbose output (show lastmod)
- -h, --help - Show this help message
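The depth setting can be pictured as a bounded walk over a sitemap index, where 0 means "keep descending" and any positive value caps the number of levels followed. The sketch below uses a hypothetical in-memory map (`%children-of`) in place of real fetching and XML parsing, and assumes a depth of 1 means "the index plus its direct children"; the CLI's exact counting may differ.

```raku
# Hypothetical stand-in for fetched sitemap indexes: index URL => child sitemaps.
my %children-of =
    'https://example.com/sitemap.xml' => [
        'https://example.com/blog/sitemap.xml',
        'https://example.com/news/sitemap.xml',
    ],
    'https://example.com/news/sitemap.xml' => [
        'https://example.com/news/2024/sitemap.xml',
    ];

# Collect a sitemap and everything below it, honouring a depth limit.
sub collect-sitemaps(Str $url, Int $max-depth = 0, Int $level = 0) {
    my @seen = $url;
    # 0 means unlimited; otherwise stop descending once the limit is reached.
    return @seen if $max-depth > 0 && $level >= $max-depth;
    for @(%children-of{$url} // []) -> $child {
        @seen.append: collect-sitemaps($child, $max-depth, $level + 1);
    }
    return @seen;
}

.say for collect-sitemaps('https://example.com/sitemap.xml');      # unlimited
say collect-sitemaps('https://example.com/sitemap.xml', 1).elems;  # limited to one level
```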
Convert Command (sitemap convert <file>)
- -o, --output <file> - Output file (default: derived from input)
- -f, --format <fmt> - Output format: xml, html, txt, rss, atom, mrss (default: txt)
- -v, --verbose - Verbose output
- -h, --help - Show this help message
Tree Command (sitemap tree <file>)
Build a sitemap from a JSON (.json) or YAML (.yml/.yaml) site tree definition. Priorities are auto-inferred from tree depth.
JSON format:
{
  "base_url": "https://example.com",
  "pages": [
    {"stub": "home", "priority": 1.0},
    {"stub": "about",
     "children": [
       {"stub": "team"}
     ]},
    {"stub": "blog",
     "children": [
       {"stub": "first-post"},
       {"stub": "second-post"}
     ]}
  ]
}
YAML format:
base_url: "https://example.com"
pages:
  - stub: home
    priority: 1.0
  - stub: about
    children:
      - stub: team
  - stub: blog
    children:
      - stub: first-post
      - stub: second-post
- -o, --output <file> - Output file (default: sitemap.)
- -f, --format <fmt> - Output format: xml, html, txt, rss, atom, mrss (default: xml)
- -z, --compress - Gzip compress output file
- --no-pretty - Disable pretty printing (XML only)
- --xsl <url> - Add XSL stylesheet reference
- --ping - Ping search engines after generating
- -v, --verbose - Show tree structure during generation
Dir Command (sitemap dir <directory>)
- --base-url <url> - Base URL for sitemap entries (auto-detect from robots.txt, fallback: file:///)
- -d, --max-depth <n> - Maximum crawl depth (default: 0 = unlimited)
- -u, --max-urls <n> - Maximum number of URLs (default: 0 = unlimited)
- -m, --max-urls-per-file <n> - Max URLs per sitemap file (default: 50000)
- --no-images - Don't extract images (default: extract)
- --no-hreflang - Don't extract hreflang links (default: extract)
- --videos - Extract video metadata from HTML tags and JSON-LD (default: off)
- --news - Extract NewsArticle JSON-LD and write a separate news sitemap (default: off)
- --no-priority - Don't calculate priority from depth (default: calculate)
- --ping - Ping search engines after generating
- Standard output options also available: -o, -f, -z, --no-pretty, --xsl, -v
AUTHOR
Sasha Abbott sashaa@disroot.org
LICENSE
This library is free software; you can redistribute it and/or modify it under CC0.