Crawl Your Website
Discover the pages on your site and teach your agent from their content.
Before you begin
- The site is public (no login required).
- You know your main site URL (for example: https://example.com).
- You understand that your plan may limit how many knowledge sources you can have and how many crawl credits (pages) you can use each billing period.
What this does
Crawling has two phases:
- Discovery: We scan your site to find pages. You see how many pages were found and how this compares to your plan limits.
- Ingestion: After you confirm, we fetch the content from those pages and add it to your agent’s Knowledge Base.
After Discovery, you’ll see a confirmation screen where you can choose how to crawl:
- Crawl entire website – crawl as many discovered pages as your plan and crawl credits allow.
- Add new pages only – only pages that have never been ingested before are added.
- Refresh existing pages only – re-crawls pages that are already in your Knowledge Base to pick up changes.
(Onboarding and the quick-start wizard always run a full crawl automatically; the options above apply when you use Train Voice Agent → Add entire website in the dashboard.)
Steps
1. Open your agent in the Babelbeez dashboard.
2. Go to Train Voice Agent.
3. Choose Add entire website.
4. Enter your site URL (for example: https://example.com). Click Start.
5. Review the Discovery results:
   - Pages found on your site.
   - Your plan limits (sources per agent and crawl credits).
   - How many pages can be crawled safely with your current plan.
6. Choose how you want to crawl:
   - Crawl entire website – best for your first crawl of a site. We’ll crawl up to the allowed number of pages based on your plan and crawl credits.
   - Add new pages only – only pages that have never been ingested before are added. Use this after you’ve added new sections or blog posts.
   - Refresh existing pages only – re-crawls pages that are already in your Knowledge Base to pick up content changes.
   On a brand-new site, you may only see Crawl entire website until some pages have been ingested.
7. Click Confirm to start Ingestion (learning the content) for the option you chose. You’ll see progress and each page added to the Knowledge Base as it completes.
What you should see
- A confirmation screen after Discovery with the number of pages found and options for how to crawl.
- New sources added to the Knowledge Base list, each with a status that moves to “ready” when complete.
Tips
- Start with your main site URL (https://example.com), not a deep page.
- Use clear internal links on your site so important pages are discovered.
- If you only need a few pages, consider Scrape a single page.
- Use Crawl entire website the first time you ingest a site.
- Use Add new pages only when you’ve added new pages and want to avoid re-processing everything.
- Use Refresh existing pages only after you update important pages (for example, pricing or FAQs) and want your agent to pick up the new wording.
Manage jobs
- Cancel a running crawl
  - Use the cancel option shown during Ingestion to stop the job.
- Delete sources
  - In the Knowledge Base list, remove individual sources you no longer need.
Troubleshooting
- Too many pages found
  - Discovery found more pages than your plan allows. Reduce the scope (link fewer pages), pick a smaller option (for example, Add new pages only), or upgrade your plan. See Plan limits and upgrades.
- Crawl seems stuck
  - Check your internet connection and leave the tab open. Very large sites can take time. If needed, cancel and try again with a smaller scope.
- Pages missing or crawl shows a robots.txt warning
  - Some pages may be blocked by robots.txt or require login. Only public, crawlable pages are included.
  - If your crawl finishes but shows a red warning about robots.txt and no pages were learned, your site’s robots.txt blocked all of the pages we tried to crawl.
  - To fix this:
    - Open https://your-site.com/robots.txt in your browser and look for Disallow rules that block important sections (for example, /, /blog, /services).
    - Update robots.txt so that the pages you want your agent to learn from are allowed to be crawled (a sample robots.txt sketch appears after this list).
    - Run the crawl again, or start from a different URL on your site that is allowed by robots.txt.
    - If you only need a few pages, you can also use Scrape a single page instead of a full crawl.
  - In the quick-start / onboarding wizard, the Next button will stay disabled if nothing could be learned because of robots.txt. Update robots.txt or go back and try a different URL, then run the crawl again.
- A crawl option is disabled or shows fewer pages than Discovery found
  - Your plan limits (sources per agent and crawl credits) may not allow all pages to be processed in that mode. The confirmation screen shows how many pages are allowed for each option.
- “Refresh existing pages only” is not available
  - This option appears once you already have knowledge pages for that site. For a brand-new site, start with Crawl entire website.
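If you are unsure what a workable robots.txt looks like, the sketch below may help. A file whose only rule is Disallow: / under User-agent: * blocks every page and produces the “no pages were learned” warning. This example is only an illustration: the private paths (/admin/, /account/) are hypothetical, and Babelbeez’s exact crawler user-agent isn’t documented here, so the wildcard group is assumed to apply.

```
# Hypothetical robots.txt – swap in your own site’s private paths.
User-agent: *
Disallow: /admin/      # keep back-office pages out of the crawl
Disallow: /account/    # keep logged-in account pages out of the crawl
Allow: /               # everything else (home, /blog, /services, ...) stays crawlable
```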
FAQ
- Will Discovery count against my limits?
  - No. Discovery only finds pages; crawl credits and sources are used after you confirm and Ingestion begins.
- Can I crawl subdomains?
  - Crawling stays within the domain you provide. To include subdomains, run separate crawls for each (for example: https://blog.example.com).
Next steps
- Review the Knowledge Base entries when they show “ready.”
- (Optional) Set Starting Knowledge: click a summarized row and choose “Use as starting knowledge.”
- Test answers in Live Preview.
- Add more pages later with another crawl or Scrape a single page.
