Overview

Web sources allow you to import content directly from your website into your agent’s knowledge base. This is the most common way to train your agent on existing website content like product pages, documentation, blog posts, and service descriptions.
Start with your most important pages first. Quality matters more than quantity - a focused knowledge base with relevant content performs better than one filled with irrelevant pages.

Discovery Methods

The platform offers four methods to discover and import web content:

Quick Scan

Fast domain mapping that quickly discovers pages across your website.

Deep Scan

Thorough crawling with advanced options for precise control over what gets imported.

Sitemap Import

Import URLs directly from your website’s sitemap.xml file.

Manual Entry

Paste specific URLs when you know exactly which pages to import.

Quick Scan

Quick Scan is the fastest way to discover pages on your website. It uses intelligent domain mapping to find pages without fully crawling each one.

How to Use

  1. Select Quick Scan as your discovery method
  2. Enter your website URL (e.g., https://example.com)
  3. Click Scan Domain
  4. Review the discovered URLs in the pending list
  5. Save the pages you want to your agent’s knowledge base

Advanced Options

By default, Quick Scan will discover unlimited pages. You can set a limit to cap the number of URLs discovered:
  • Unlimited: Discover all available pages
  • Custom limit: Set a specific number (e.g., 100 pages)
This is useful when you have a large website but only need a subset of pages.

Deep Scan

Deep Scan provides thorough crawling with fine-grained control over the crawling process. Use this when you need precise control over which pages are discovered.

How to Use

  1. Select Deep Scan as your discovery method
  2. Enter your starting URL (e.g., https://example.com/docs)
  3. Configure advanced options (optional)
  4. Click Scan Domain
  5. Monitor the crawl progress in real-time
  6. Review and save discovered URLs

Advanced Options

Deep Scan offers several configuration options:
Max Depth

Controls how many levels deep the crawler will follow links.
  • 0: Only the starting URL
  • 1: Starting URL + pages linked from it
  • 2: Starting URL + 2 levels of linked pages
  • 3+: Continues following links to the specified depth
Default: 2 levels. Higher depth values result in more pages but longer crawl times.

Wait Time

Time in milliseconds to wait between requests. This helps avoid overwhelming your server and prevents rate limiting.
Default: 200ms. Increase this value if your server has rate limiting or if you’re experiencing timeout errors.

URL Limit

Maximum number of URLs to discover during the crawl.
  • Unlimited: No cap on discovered URLs
  • Custom limit: Stop after discovering the specified number of pages
Default: 100 URLs

Domain Scope

Controls whether the crawler stays on your domain or follows external links.
  • Same Domain Only: Only crawl pages on the same domain as the starting URL
  • All Domains: Follow links to external websites too
Default: Same Domain Only. Enabling “All Domains” can significantly increase crawl time and may include irrelevant content.

Subpath Restriction

Limit crawling to specific paths on your website. Enter comma-separated paths to restrict the crawler.
Example: /docs, /blog, /products
This would only crawl URLs that contain /docs, /blog, or /products in their path.

URL Filters

Additional filters to exclude unwanted URLs:
  • Skip Social Media: Links to Facebook, Twitter, LinkedIn, etc.
  • Skip File URLs: Links to PDFs, images, downloads, etc.
  • Skip Anchor Links: URLs with # fragments
All filters are enabled by default.
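
The platform’s crawler itself isn’t exposed, but the way these options combine is easier to picture in code. The following is a minimal, hypothetical sketch in Python: the `deep_scan` function and its parameter names are illustrative, not the platform’s API, and of the URL filters only the anchor-link filter is shown.

```python
import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect href values from anchor tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def deep_scan(start_url, max_depth=2, wait_ms=200, url_limit=100,
              same_domain_only=True, subpaths=None):
    """Hypothetical breadth-first crawl honoring the options above."""
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    discovered = []

    while queue and len(discovered) < url_limit:     # URL Limit
        url, depth = queue.popleft()
        discovered.append(url)
        if depth >= max_depth:                       # Max Depth
            continue

        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except OSError:
            continue  # skip pages that fail to load

        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # Skip Anchor Links
            parsed = urlparse(link)
            if same_domain_only and parsed.netloc != start_host:
                continue                             # Same Domain Only
            if subpaths and not any(p in parsed.path for p in subpaths):
                continue                             # Subpath Restriction
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

        time.sleep(wait_ms / 1000)                   # Wait Time

    return discovered
```

Called as `deep_scan("https://example.com/docs", max_depth=2, url_limit=50, subpaths=["/docs"])`, this sketch would stay under /docs, follow links two levels deep, pause 200ms between requests, and stop after 50 URLs.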

Canceling a Crawl

During a Deep Scan, you can click Cancel at any time to stop the crawl. Any URLs discovered up to that point will still be available in your pending list.

Sitemap Import

If your website has a sitemap.xml file, you can import all URLs from it directly. This is often the most reliable method for well-maintained websites.

How to Use

  1. Select Sitemap as your discovery method
  2. Enter your sitemap URL (e.g., https://example.com/sitemap.xml)
  3. Click Import Sitemap
  4. Review the parsed URLs
  5. Save the pages you want

Finding Your Sitemap

Common sitemap locations:
  • https://yoursite.com/sitemap.xml
  • https://yoursite.com/sitemap_index.xml
  • https://yoursite.com/sitemap/sitemap.xml
Check your website’s robots.txt file - it often contains a link to your sitemap:
Sitemap: https://yoursite.com/sitemap.xml
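
If you want to confirm where your sitemap lives before importing, a short script can check robots.txt and the common locations listed above. This is only a convenience sketch using the Python standard library; yoursite.com is a placeholder for your own domain.

```python
import urllib.request

SITE = "https://yoursite.com"  # placeholder: your own domain

# 1. Look for a Sitemap: directive in robots.txt
try:
    with urllib.request.urlopen(f"{SITE}/robots.txt", timeout=10) as resp:
        for line in resp.read().decode("utf-8", errors="ignore").splitlines():
            if line.lower().startswith("sitemap:"):
                print("robots.txt points to:", line.split(":", 1)[1].strip())
except OSError:
    pass  # robots.txt missing or unreachable

# 2. Probe the common sitemap locations
for path in ("/sitemap.xml", "/sitemap_index.xml", "/sitemap/sitemap.xml"):
    try:
        with urllib.request.urlopen(SITE + path, timeout=10):
            print("Found:", SITE + path)
    except OSError:
        pass  # nothing at this path
```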

Nested Sitemaps

The platform automatically handles sitemap index files - sitemaps that reference other sitemaps. When you import a sitemap index, it will:
  1. Detect that it’s an index file
  2. Fetch each nested sitemap automatically
  3. Combine all URLs into a single list
  4. Follow nested sitemaps up to 3 levels deep
If your sitemap has more than 3 levels of nesting, sitemaps beyond that depth may be skipped. This limit helps prevent excessively long import times.
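
To picture how index expansion works, here is a rough standard-library sketch that walks a sitemap or sitemap index and stops at three levels of nesting. The `collect_urls` helper is hypothetical, not the platform’s importer.

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(sitemap_url, depth=0, max_nesting=3):
    """Expand a sitemap or sitemap index recursively, capped at 3 levels."""
    if depth >= max_nesting:
        return []  # deeper sitemaps are skipped, mirroring the nesting limit
    with urllib.request.urlopen(sitemap_url, timeout=10) as resp:
        root = ET.fromstring(resp.read())

    if root.tag.endswith("sitemapindex"):
        # Index file: fetch each nested sitemap and combine the results
        urls = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(collect_urls(loc.text.strip(), depth + 1, max_nesting))
        return urls

    # Regular urlset: return the page URLs directly
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]
```

Calling `collect_urls("https://example.com/sitemap_index.xml")` would return one combined list of page URLs, which is the same end result the import produces.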

Manual URL Entry

When you know exactly which pages you want to import, manual entry is the fastest option.

How to Use

  1. Select Manual as your discovery method
  2. Paste your URLs into the text area (one per line)
  3. Click Add URLs
  4. Review and save

Supported Formats

The manual entry field accepts:
  • Plain URLs (one per line)
  • URLs with or without https:// prefix
  • Pasted HTML content (URLs will be automatically extracted)
Example input:
https://example.com/page-1
https://example.com/page-2
example.com/page-3
www.example.com/page-4

Extracting URLs from HTML

If you copy HTML content (like from a webpage source), the platform will automatically extract all valid URLs from anchor tags and plain text.
Use the Parse from Clipboard button to extract URLs from copied web content containing links.
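
Conceptually the extraction step is straightforward: pull href values from anchor tags, scan the remaining text for bare URLs, and normalize entries pasted without a scheme. The sketch below illustrates that idea with the Python standard library; it is not the platform’s parser.

```python
import re
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect absolute href values from <a> tags and keep the visible text."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.urls.append(value)

    def handle_data(self, data):
        self.text.append(data)


def extract_urls(html_or_text):
    """Return URLs found in anchor tags plus any plain-text URLs."""
    parser = AnchorExtractor()
    parser.feed(html_or_text)
    plain = " ".join(parser.text)
    # Also match bare URLs such as www.example.com/page-4
    pattern = r"(?:https?://|www\.)[\w.-]+(?:/[\w./-]*)?"
    found = parser.urls + re.findall(pattern, plain)
    # Normalize entries pasted without a scheme
    return [u if u.startswith("http") else "https://" + u for u in found]


# Example:
# extract_urls('<a href="https://example.com/page-1">Docs</a> www.example.com/page-4')
```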

Managing Pending Sources

After discovering URLs using any method, they appear in the Pending Sources list where you can review and manage them before saving.

Filtering Pending Sources

You can narrow the pending list with three filters:
  • Search: Find URLs containing specific text
  • Exclude: Remove URLs matching patterns (e.g., /admin, .pdf)
  • Type: Filter by discovery method (Quick Scan, Deep Scan, Sitemap, Manual)
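
If it helps to see the Search and Exclude filters as plain logic, here is a tiny hypothetical sketch; the `filter_pending` function and its parameters are illustrative only.

```python
def filter_pending(urls, search=None, exclude=None):
    """Hypothetical sketch of the Search and Exclude filters."""
    kept = []
    for url in urls:
        if search and search not in url:
            continue  # Search: keep only URLs containing the text
        if exclude and any(pattern in url for pattern in exclude):
            continue  # Exclude: drop URLs matching any pattern
        kept.append(url)
    return kept


pending = [
    "https://example.com/docs/setup",
    "https://example.com/admin/users",
    "https://example.com/files/guide.pdf",
]
print(filter_pending(pending, exclude=["/admin", ".pdf"]))
# ['https://example.com/docs/setup']
```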

Duplicate Detection

The platform automatically detects duplicates:
  • NEW: URL not in your knowledge base
  • Duplicate (in agent): URL already exists in your agent’s sources
  • Duplicate (in pending): Same URL already in your pending list
Duplicates are shown in a separate section and can be cleared with one click.
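
The status assignment can be thought of as a simple lookup against what the agent already has and what is already queued. A minimal hypothetical sketch, assuming the agent’s existing sources are available as a set:

```python
def label_duplicates(discovered, agent_sources):
    """Assign NEW / duplicate statuses to a list of discovered URLs."""
    seen_pending = set()
    labeled = []
    for url in discovered:
        if url in agent_sources:
            status = "Duplicate (in agent)"
        elif url in seen_pending:
            status = "Duplicate (in pending)"
        else:
            status = "NEW"
            seen_pending.add(url)
        labeled.append((url, status))
    return labeled


agent = {"https://example.com/pricing"}
urls = ["https://example.com/pricing", "https://example.com/faq",
        "https://example.com/faq"]
for url, status in label_duplicates(urls, agent):
    print(status, url)
```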

Saving Sources

Once you’ve reviewed your pending URLs:
  1. Use filters to exclude unwanted pages
  2. Click Save to Agent to add them to your knowledge base
  3. Sources will begin processing automatically

Best Practices

  • Begin with your most important pages (product pages, key documentation, FAQs). Test your agent, then add more content as needed.
  • Prefer sitemaps where available: they are maintained by your website, provide the most accurate list of pages, and are faster than crawling.
  • Exclude admin pages, login pages, and irrelevant sections. Use patterns like /admin, /login, /cart in the exclude filter.
  • Deep scans of large websites can take several minutes. The progress indicator shows real-time status.
  • When you update your website content, re-import the affected pages to keep your agent’s knowledge current.

Common Issues

Crawl Times Out

If your crawl times out:
  • Reduce the Max Depth setting
  • Increase the Wait Time between requests
  • Set a lower URL Limit
  • Use Subpath Restriction to focus on specific sections

Sitemap Won’t Load

If sitemap import fails:
  • Verify the sitemap URL is accessible in your browser
  • Check that the sitemap is valid XML (see the sketch after this list)
  • Ensure your server isn’t blocking automated requests
  • Try the direct sitemap URL (not the robots.txt reference)
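
For the validity check, a quick script can confirm that the sitemap both loads and parses as XML before you retry the import. The URL below is a placeholder.

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder: your sitemap

try:
    with urllib.request.urlopen(SITEMAP_URL, timeout=10) as resp:
        body = resp.read()
    root = ET.fromstring(body)  # raises ParseError if the XML is malformed
    print("Valid XML, root element:", root.tag)
except ET.ParseError as err:
    print("Sitemap is not valid XML:", err)
except OSError as err:
    print("Sitemap could not be fetched:", err)
```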

Missing Pages

If expected pages aren’t discovered:
  • Check if pages are linked from your starting URL
  • Increase the Max Depth setting
  • Verify pages aren’t blocked by robots.txt
  • Try using Manual Entry for specific pages

Next Steps