Overview

Web sources allow you to import content directly from your website into your agent’s knowledge base. This is the most common way to train your agent on existing website content like product pages, documentation, blog posts, and service descriptions.
Start with your most important pages first. Quality matters more than quantity - a focused knowledge base with relevant content performs better than one filled with irrelevant pages.

Discovery Methods

The platform offers four methods to discover and import web content:

Quick Scan

Fast domain mapping that quickly discovers pages across your website.

Deep Scan

Thorough crawling with advanced options for precise control over what gets imported.

Sitemap Import

Import URLs directly from your website’s sitemap.xml file.

Manual Entry

Paste specific URLs when you know exactly which pages to import.

Quick Scan

Quick Scan is the fastest way to discover pages on your website. It uses intelligent domain mapping to find pages without fully crawling each one.

How to Use

  1. Select Quick Scan as your discovery method
  2. Enter your website URL (e.g., https://example.com)
  3. Click Scan Domain
  4. Review the discovered URLs in the pending list
  5. Save the pages you want to your agent’s knowledge base

Advanced Options

By default, Quick Scan will discover unlimited pages. You can set a limit to cap the number of URLs discovered:
  • Unlimited: Discover all available pages
  • Custom limit: Set a specific number (e.g., 100 pages)
This is useful when you have a large website but only need a subset of pages.

Deep Scan

Deep Scan provides thorough crawling with fine-grained control over the crawling process. Use this when you need precise control over which pages are discovered.

How to Use

  1. Select Deep Scan as your discovery method
  2. Enter your starting URL (e.g., https://example.com/docs)
  3. Configure advanced options (optional)
  4. Click Scan Domain
  5. Monitor the crawl progress in real-time
  6. Review and save discovered URLs

Advanced Options

Deep Scan offers several configuration options:
Max Depth

Controls how many levels deep the crawler will follow links.
  • 0: Only the starting URL
  • 1: Starting URL + pages linked from it
  • 2: Starting URL + 2 levels of linked pages
  • 3+: Continues following links to the specified depth
Default: 2 levels. Higher depth values result in more pages but longer crawl times.

Wait Time

Time in milliseconds to wait between requests. This helps avoid overwhelming your server and prevents rate limiting.
Default: 200ms. Increase this value if your server has rate limiting or if you’re experiencing timeout errors.

URL Limit

Maximum number of URLs to discover during the crawl.
  • Unlimited: No cap on discovered URLs
  • Custom limit: Stop after discovering the specified number of pages
Default: 100 URLs

Domain Scope

Controls whether the crawler stays on your domain or follows external links.
  • Same Domain Only: Only crawl pages on the same domain as the starting URL
  • All Domains: Follow links to external websites too
Default: Same Domain Only. Enabling “All Domains” can significantly increase crawl time and may include irrelevant content.

Subpath Restriction

Limit crawling to specific paths on your website. Enter comma-separated paths to restrict the crawler.
Example: /docs, /blog, /products
This would only crawl URLs that contain /docs, /blog, or /products in their path.

URL Filters

Additional filters to exclude unwanted URLs:
  • Skip Social Media: Links to Facebook, Twitter, LinkedIn, etc.
  • Skip File URLs: Links to PDFs, images, downloads, etc.
  • Skip Anchor Links: URLs with # fragments
All filters are enabled by default.
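
The platform’s crawler itself isn’t exposed, but the way these options combine is easier to picture in code. The following is a minimal, hypothetical sketch in Python: the `deep_scan` function and its parameter names are illustrative, not the platform’s API, and of the URL filters only the anchor-link filter is shown.

```python
import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect href values from anchor tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def deep_scan(start_url, max_depth=2, wait_ms=200, url_limit=100,
              same_domain_only=True, subpaths=None):
    """Hypothetical breadth-first crawl honoring the options above."""
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    discovered = []

    while queue and len(discovered) < url_limit:     # URL Limit
        url, depth = queue.popleft()
        discovered.append(url)
        if depth >= max_depth:                       # Max Depth
            continue

        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except OSError:
            continue  # skip pages that fail to load

        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # Skip Anchor Links
            parsed = urlparse(link)
            if same_domain_only and parsed.netloc != start_host:
                continue                             # Same Domain Only
            if subpaths and not any(p in parsed.path for p in subpaths):
                continue                             # Subpath Restriction
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

        time.sleep(wait_ms / 1000)                   # Wait Time

    return discovered
```

Called as `deep_scan("https://example.com/docs", max_depth=2, url_limit=50, subpaths=["/docs"])`, this sketch would stay under /docs, follow links two levels deep, pause 200ms between requests, and stop after 50 URLs.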

Canceling a Crawl

During a Deep Scan, you can click Cancel at any time to stop the crawl. Any URLs discovered up to that point will still be available in your pending list.

Sitemap Import

If your website has a sitemap.xml file, you can import all URLs from it directly. This is often the most reliable method for well-maintained websites.

How to Use

  1. Select Sitemap as your discovery method
  2. Enter your sitemap URL (e.g., https://example.com/sitemap.xml)
  3. Click Import Sitemap
  4. Review the parsed URLs
  5. Save the pages you want

Finding Your Sitemap

Common sitemap locations:
  • https://yoursite.com/sitemap.xml
  • https://yoursite.com/sitemap_index.xml
  • https://yoursite.com/sitemap/sitemap.xml
Check your website’s robots.txt file - it often contains a link to your sitemap:
Sitemap: https://yoursite.com/sitemap.xml
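
If you want to confirm where your sitemap lives before importing, a short script can check robots.txt and the common locations listed above. This is only a convenience sketch using the Python standard library; yoursite.com is a placeholder for your own domain.

```python
import urllib.request

SITE = "https://yoursite.com"  # placeholder: your own domain

# 1. Look for a Sitemap: directive in robots.txt
try:
    with urllib.request.urlopen(f"{SITE}/robots.txt", timeout=10) as resp:
        for line in resp.read().decode("utf-8", errors="ignore").splitlines():
            if line.lower().startswith("sitemap:"):
                print("robots.txt points to:", line.split(":", 1)[1].strip())
except OSError:
    pass  # robots.txt missing or unreachable

# 2. Probe the common sitemap locations
for path in ("/sitemap.xml", "/sitemap_index.xml", "/sitemap/sitemap.xml"):
    try:
        with urllib.request.urlopen(SITE + path, timeout=10):
            print("Found:", SITE + path)
    except OSError:
        pass  # nothing at this path
```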

Nested Sitemaps

The platform automatically handles sitemap index files - sitemaps that reference other sitemaps. When you import a sitemap index, it will:
  1. Detect that it’s an index file
  2. Fetch each nested sitemap automatically
  3. Combine all URLs into a single list
  4. Follow nested sitemaps up to 3 levels deep
If your sitemap has more than 3 levels of nesting, sitemaps beyond that depth may be skipped. This limit helps prevent excessively long import times.
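
To picture how index expansion works, here is a rough standard-library sketch that walks a sitemap or sitemap index and stops at three levels of nesting. The `collect_urls` helper is hypothetical, not the platform’s importer.

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(sitemap_url, depth=0, max_nesting=3):
    """Expand a sitemap or sitemap index recursively, capped at 3 levels."""
    if depth >= max_nesting:
        return []  # deeper sitemaps are skipped, mirroring the nesting limit
    with urllib.request.urlopen(sitemap_url, timeout=10) as resp:
        root = ET.fromstring(resp.read())

    if root.tag.endswith("sitemapindex"):
        # Index file: fetch each nested sitemap and combine the results
        urls = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(collect_urls(loc.text.strip(), depth + 1, max_nesting))
        return urls

    # Regular urlset: return the page URLs directly
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]
```

Calling `collect_urls("https://example.com/sitemap_index.xml")` would return one combined list of page URLs, which is the same end result the import produces.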

Manual URL Entry

When you know exactly which pages you want to import, manual entry is the fastest option.

How to Use

  1. Select Manual as your discovery method
  2. Paste your URLs into the text area (one per line)
  3. Click Add URLs
  4. Review and save

Supported Formats

The manual entry field accepts:
  • Plain URLs (one per line)
  • URLs with or without https:// prefix
  • Pasted HTML content (URLs will be automatically extracted)
Example input:
https://example.com/page-1
https://example.com/page-2
example.com/page-3
www.example.com/page-4

Extracting URLs from HTML

If you copy HTML content (like from a webpage source), the platform will automatically extract all valid URLs from anchor tags and plain text.
Use the Parse from Clipboard button to extract URLs from copied web content containing links.
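
Conceptually the extraction step is straightforward: pull href values from anchor tags, scan the remaining text for bare URLs, and normalize entries pasted without a scheme. The sketch below illustrates that idea with the Python standard library; it is not the platform’s parser.

```python
import re
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect absolute href values from <a> tags and keep the visible text."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.urls.append(value)

    def handle_data(self, data):
        self.text.append(data)


def extract_urls(html_or_text):
    """Return URLs found in anchor tags plus any plain-text URLs."""
    parser = AnchorExtractor()
    parser.feed(html_or_text)
    plain = " ".join(parser.text)
    # Also match bare URLs such as www.example.com/page-4
    pattern = r"(?:https?://|www\.)[\w.-]+(?:/[\w./-]*)?"
    found = parser.urls + re.findall(pattern, plain)
    # Normalize entries pasted without a scheme
    return [u if u.startswith("http") else "https://" + u for u in found]


# Example:
# extract_urls('<a href="https://example.com/page-1">Docs</a> www.example.com/page-4')
```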

Managing Pending Sources

After discovering URLs using any method, they appear in the Pending Sources list where you can review and manage them before saving.

Filtering Pending Sources

You can narrow the pending list with three filters:
  • Search: Find URLs containing specific text
  • Exclude: Remove URLs matching patterns (e.g., /admin, .pdf)
  • Type: Filter by discovery method (Quick Scan, Deep Scan, Sitemap, Manual)
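
If it helps to see the Search and Exclude filters as plain logic, here is a tiny hypothetical sketch; the `filter_pending` function and its parameters are illustrative only.

```python
def filter_pending(urls, search=None, exclude=None):
    """Hypothetical sketch of the Search and Exclude filters."""
    kept = []
    for url in urls:
        if search and search not in url:
            continue  # Search: keep only URLs containing the text
        if exclude and any(pattern in url for pattern in exclude):
            continue  # Exclude: drop URLs matching any pattern
        kept.append(url)
    return kept


pending = [
    "https://example.com/docs/setup",
    "https://example.com/admin/users",
    "https://example.com/files/guide.pdf",
]
print(filter_pending(pending, exclude=["/admin", ".pdf"]))
# ['https://example.com/docs/setup']
```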

Duplicate Detection

The platform automatically detects duplicates:
  • NEW: URL not in your knowledge base
  • Duplicate (in agent): URL already exists in your agent’s sources
  • Duplicate (in pending): Same URL already in your pending list
Duplicates are shown in a separate section and can be cleared with one click.
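
The status assignment can be thought of as a simple lookup against what the agent already has and what is already queued. A minimal hypothetical sketch, assuming the agent’s existing sources are available as a set:

```python
def label_duplicates(discovered, agent_sources):
    """Assign NEW / duplicate statuses to a list of discovered URLs."""
    seen_pending = set()
    labeled = []
    for url in discovered:
        if url in agent_sources:
            status = "Duplicate (in agent)"
        elif url in seen_pending:
            status = "Duplicate (in pending)"
        else:
            status = "NEW"
            seen_pending.add(url)
        labeled.append((url, status))
    return labeled


agent = {"https://example.com/pricing"}
urls = ["https://example.com/pricing", "https://example.com/faq",
        "https://example.com/faq"]
for url, status in label_duplicates(urls, agent):
    print(status, url)
```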

Saving Sources

Once you’ve reviewed your pending URLs:
  1. Use filters to exclude unwanted pages
  2. Click Save to Agent to add them to your knowledge base
  3. Sources will begin processing automatically

Best Practices

  • Begin with your most important pages (product pages, key documentation, FAQs). Test your agent, then add more content as needed.
  • Prefer sitemaps where available: they are maintained by your website, provide the most accurate list of pages, and are faster than crawling.
  • Exclude admin pages, login pages, and irrelevant sections. Use patterns like /admin, /login, /cart in the exclude filter.
  • Deep scans of large websites can take several minutes. The progress indicator shows real-time status.
  • When you update your website content, re-import the affected pages to keep your agent’s knowledge current.

Common Issues

Crawl Times Out

If your crawl times out:
  • Reduce the Max Depth setting
  • Increase the Wait Time between requests
  • Set a lower URL Limit
  • Use Subpath Restriction to focus on specific sections

Sitemap Won’t Load

If sitemap import fails:
  • Verify the sitemap URL is accessible in your browser
  • Check that the sitemap is valid XML (see the sketch after this list)
  • Ensure your server isn’t blocking automated requests
  • Try the direct sitemap URL (not the robots.txt reference)
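
For the validity check, a quick script can confirm that the sitemap both loads and parses as XML before you retry the import. The URL below is a placeholder.

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder: your sitemap

try:
    with urllib.request.urlopen(SITEMAP_URL, timeout=10) as resp:
        body = resp.read()
    root = ET.fromstring(body)  # raises ParseError if the XML is malformed
    print("Valid XML, root element:", root.tag)
except ET.ParseError as err:
    print("Sitemap is not valid XML:", err)
except OSError as err:
    print("Sitemap could not be fetched:", err)
```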

Missing Pages

If expected pages aren’t discovered:
  • Check if pages are linked from your starting URL
  • Increase the Max Depth setting
  • Verify pages aren’t blocked by robots.txt
  • Try using Manual Entry for specific pages

Next Steps