Web crawl

Turn a public URL into an Informly document by fetching the page, extracting the main text, and indexing it alongside your uploaded files.

Web crawl is the fastest way to bring a page on the open web into your Informly library. Instead of downloading a help article or product page and uploading it as a file, you give Informly a URL, and it fetches the page, extracts the text, and indexes it for you. The result is a regular document, with the same status pipeline, the same chunks, and the same chat behavior as anything you upload by hand.

This is the right choice for content that already lives on the web: help center articles, blog posts, public product pages, changelogs, and FAQs.

When to use a crawl instead of a file upload

| Use case | Pick |
| --- | --- |
| Public page on a site you control | Crawl |
| Article you copy-pasted from a doc on disk | File upload |
| Internal page behind a login | File upload or data source |
| Content that changes regularly | Crawl, then reprocess |

Run a crawl

Open the crawler

Go to Documents → Crawl URL.

Paste the URL

Enter the full URL, including https://. Informly fetches the page server-side, so the URL must be reachable from the public internet.

Set visibility

Choose public or private visibility, just like a file upload.

Start the crawl

Click Crawl. Informly fetches the page, extracts the main text, and queues the resulting document for processing.

The new document appears in the library immediately with a Pending badge and moves through the same Pending → Processing → Completed lifecycle as an uploaded file.
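
Because the fetch happens server-side, a URL that only works on your own network will fail. The sketch below is one way to sanity-check a URL before pasting it in. It is not part of Informly: the function name `crawl_url_problems` and the specific rules (for example, treating `.local` and `.internal` hostnames as private) are assumptions made for illustration, and it cannot prove that a DNS name is actually reachable.

```python
import ipaddress
from urllib.parse import urlsplit


def crawl_url_problems(url: str) -> list[str]:
    """Return reasons a URL is unlikely to work for a server-side crawl.

    Hypothetical helper: the rules here are illustrative assumptions,
    not Informly's actual validation.
    """
    parts = urlsplit(url.strip())

    # The docs ask for the full URL, including the scheme.
    if parts.scheme not in ("http", "https"):
        return ["URL must include the scheme, for example https://"]

    host = parts.hostname
    if not host:
        return ["URL has no hostname"]

    problems = []
    if host == "localhost" or host.endswith((".local", ".internal")):
        problems.append("hostname looks internal, not public")
    else:
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            ip = None  # a regular DNS name; nothing more to check offline
        if ip is not None and not ip.is_global:
            problems.append("IP address is not publicly routable")
    return problems


if __name__ == "__main__":
    for candidate in ("https://example.com/help/getting-started",
                      "example.com/help",
                      "http://192.168.1.10/wiki"):
        print(candidate, "->", crawl_url_problems(candidate) or "looks OK")
```

An empty list means the URL passed these offline checks only; the page can still fail to crawl if it is down or requires a login.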

What gets extracted

Informly pulls out the main body content and discards site furniture — navigation, sidebars, footers, cookie banners, and most ads. What's kept is the article-like text a reader would actually want.

If a page renders most of its content with JavaScript after load, the extracted text may be sparse. In that case, copy the rendered text and create the document via paste text instead.
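
To guess in advance whether a page depends on JavaScript, you can count the words present in the raw HTML before any script runs. The following is a rough sketch using only the Python standard library. It is not Informly's extractor: it counts all static text (including navigation), and the 50-word threshold and the function names are arbitrary assumptions.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text that appears outside <script>, <style>, and <noscript>."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def static_word_count(html: str) -> int:
    """Count words visible in the HTML as delivered, before any JavaScript runs."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.parts).split())


def looks_js_rendered(html: str, min_words: int = 50) -> bool:
    """Heuristic: very little static text suggests the content is rendered client-side."""
    return static_word_count(html) < min_words
```

If `looks_js_rendered` returns `True` for the HTML you saved with a plain HTTP client, the crawl will probably produce sparse text, and pasting the rendered text is the safer route.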

Multi-page sites

Informly crawls one URL per request today. To ingest an entire help center or documentation site, run a separate crawl for each page you want indexed. This keeps each page in its own document, which makes citations more precise and reprocessing cheaper.

If you have a long list of URLs to ingest, set aside an afternoon and work through them in batches. Each crawl is fast, and you can leave the library page and come back as the statuses update.
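
Before working through a long list by hand, it can help to remove duplicates and split the list into batches. The sketch below does only that preparation step; it does not call Informly, and the names `normalize` and `plan_batches` and the default batch size of 10 are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lowercase the scheme and host and drop the #fragment so duplicates compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def plan_batches(urls, batch_size=10):
    """Deduplicate URLs (keeping first-seen order) and group them into batches."""
    seen = set()
    unique = []
    for raw in urls:
        if not raw.strip():
            continue  # skip blank lines from a pasted list
        url = normalize(raw)
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]


if __name__ == "__main__":
    pages = [
        "https://example.com/help/a",
        "https://example.com/help/a#intro",  # same page, different anchor
        "https://example.com/help/b",
    ]
    for number, batch in enumerate(plan_batches(pages, batch_size=2), start=1):
        print(f"Batch {number}:")
        for url in batch:
            print("  ", url)
```

Dropping fragments matters because two links that differ only by `#anchor` point to the same page and would otherwise create two identical documents.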

Keeping crawled pages fresh

Crawled documents do not auto-refresh. If the source page changes, the document in Informly stays on the version it had when you crawled it.

To refresh:

Open the document

Find it in the library and click into the detail page.

Click Reprocess

Informly re-fetches the original URL, extracts the latest text, and re-indexes it.

For content that changes weekly or daily, consider connecting a data source instead — those sync automatically.
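
Since nothing refreshes automatically, you may want your own signal for when a reprocess is due. One possible approach, sketched below, is to store a fingerprint of the page text when you crawl and compare it against the current text later. This runs entirely outside Informly; the function names are hypothetical, and fetching the page text is left to whatever tool you already use.

```python
import hashlib


def text_fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace, so reflowed lines do not count as changes."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_reprocess(stored_fingerprint: str, current_text: str) -> bool:
    """Return True when the current page text differs from the version fingerprinted earlier."""
    return text_fingerprint(current_text) != stored_fingerprint


if __name__ == "__main__":
    at_crawl_time = text_fingerprint("Plans start at $10 per month.")
    print(needs_reprocess(at_crawl_time, "Plans  start at $10\nper month."))  # whitespace only
    print(needs_reprocess(at_crawl_time, "Plans start at $12 per month."))   # real change
```

When `needs_reprocess` returns `True`, open the document in Informly and click Reprocess as described above.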

Permissions and good behavior

Only crawl pages you own or have explicit permission to use. Pulling someone else's content into your AI without permission can violate their copyright and their site's terms of service. Stick to your own help center, your own docs, and content you've licensed.

If a site blocks Informly's crawler in robots.txt, or the page requires a login, the crawl fails and the reason is shown on the document detail page.
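
For a site you control, you can check your own robots.txt rules before crawling. The sketch below uses Python's standard `urllib.robotparser`. The user-agent string `InformlyBot` is a placeholder assumption, since this page does not state the name Informly's crawler uses; substitute the real one if you know it.

```python
from urllib.robotparser import RobotFileParser

# Placeholder: the crawler's real user-agent name is not given in these docs.
CRAWLER_AGENT = "InformlyBot"


def allowed_by_robots(robots_txt: str, url: str, agent: str = CRAWLER_AGENT) -> bool:
    """Return True if the given robots.txt content lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    rules = """\
User-agent: InformlyBot
Disallow: /private/

User-agent: *
Disallow:
"""
    print(allowed_by_robots(rules, "https://example.com/help/article"))
    print(allowed_by_robots(rules, "https://example.com/private/notes"))
```

Passing the robots.txt content as a string keeps the check offline; to test a live site, download its `/robots.txt` first and pass the text in.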

When a crawl fails

Common failure modes and fixes:

| Cause | Fix |
| --- | --- |
| URL unreachable or returned an error | Confirm the URL loads in your browser, then retry. |
| Page is behind a login | Use file upload or a data source. |
| JavaScript-rendered content | Copy the rendered text and use paste text. |
| Page extracted but had no useful text | The page is mostly images or interactive elements, so it is not a good fit for a crawl. |
