Crawler
Automatically analyze your product with the Pointer CLI to build comprehensive knowledge.
The Pointer Crawler enables you to automatically gather and analyze content from your product, creating a comprehensive knowledge base for AI-powered features.
Prerequisites
- Node.js version 16 or higher
- Access to Pointer dashboard
Installation
Install the Pointer CLI globally using npm:
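For example, a global install might look like the following; the package name `@pointer/cli` is an assumption for illustration, so use the name given in the official install instructions if it differs:

```bash
# Package name assumed for illustration
npm install -g @pointer/cli
```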
Verify the installation:
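For example, printing the CLI version (the `--version` flag is documented under Global options below) confirms the binary is on your PATH:

```bash
pointer --version
```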
Authentication
Create an API key
Navigate to API Keys
Go to your Keys settings in the Pointer dashboard.
Generate new key
Click Create new key and provide:
- Name: Descriptive identifier (e.g., “CLI Production”)
- Description: Optional context about key usage
- Expiration: Optional expiry date (defaults to never expire)
Copy your secret key
Save the generated key immediately - it won’t be shown again. Keys follow the format:
Configure authentication
Set your secret key using one of these methods:
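For example, export the key as an environment variable or pass it per command with the documented `-s, --secret-key` option (the key value is a placeholder):

```bash
# Option 1: environment variable (recommended)
export POINTER_SECRET_KEY="your-secret-key"

# Option 2: per-command flag (may be recorded in shell history)
pointer scrape --secret-key "your-secret-key"
```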
Environment variables are recommended for security. Command-line options may expose keys in shell history.
Core workflow
Step 1: Initialize your website
Start by adding your website to the crawler configuration:
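For example:

```bash
pointer init
```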
The interactive prompt will guide you through:
- Entering a friendly name for identification
- Providing your website URL
- Confirming the configuration
Step 2: Scrape your content
Begin the automated content collection:
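For example:

```bash
pointer scrape
```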
Choose from interactive options:
- Scraping mode: Headless (fast) or Browser (with authentication)
- Crawl depth: Fast (surface content) or Deep (interactive elements)
- PII protection: Configure sensitivity and redaction settings
The CLI saves your progress automatically. If interrupted, it will offer to resume from where it left off.
Step 3: Upload for analysis
Send your scraped content to Pointer for processing:
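For example:

```bash
pointer upload
```

You can check processing progress afterwards with `pointer status`.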
The CLI will:
- Display a summary of collected data
- Confirm the upload scope
- Transfer content to your knowledge base
Command reference
Primary commands
Command | Description | Authentication |
---|---|---|
pointer init | Add a website to crawl | Required |
pointer scrape | Collect content from configured websites | Required |
pointer upload | Transfer scraped data to Pointer | Required |
pointer status | Check crawl processing status | Required |
pointer list | View local scraped data | Not required |
pointer cleanup | Remove all local data | Not required |
pointer purge | Delete server-side crawl data | Required |
Global options
Available for all commands:
Option | Description |
---|---|
-s, --secret-key <key> | API secret key (overrides environment variable) |
-v, --version | Display CLI version |
--help | Show command help |
Scraping options
Configure `pointer scrape` behavior:
Option | Description | Default |
---|---|---|
--max-pages <number> | Maximum pages to crawl | 200 |
--concurrency <number> | Parallel page processing | 1 |
--fast | Use fast crawl mode | Interactive prompt |
--no-pii-protection | Disable PII detection | PII protection enabled |
--pii-sensitivity <level> | Set detection level (low/medium/high) | Interactive prompt |
--exclude-routes <patterns> | Comma-separated routes to exclude | None |
--include-routes <patterns> | Comma-separated routes to include (whitelist mode) | None |
--bearer-token <token> | Bearer token for API authentication | None |
--headers <json> | Custom headers as JSON string | None |
--cookies-file <path> | Path to cookies JSON file | None |
--log-level <level> | Logging verbosity | info |
Excluding routes
The `--exclude-routes` flag allows you to specify routes that should be excluded from scraping. This is useful for avoiding admin panels, API endpoints, or specific file types (see the example below).
Pattern types:
- Exact match: `/admin` excludes only the exact path
- Wildcard patterns:
  - `/admin/*` excludes all paths starting with `/admin/`
  - `*.pdf` excludes all PDF files
  - `/api/*/docs` excludes paths like `/api/v1/docs` and `/api/v2/docs`
The exclusion check is performed on the URL path only (not the full URL). Patterns are case-sensitive, and the start URL cannot be excluded.
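A sketch of how these patterns look on the command line (the routes are illustrative):

```bash
pointer scrape --exclude-routes "/admin,/admin/*,*.pdf,/api/*/docs"
```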
Including routes (whitelist mode)
The `--include-routes` flag allows you to specify which routes should be included in scraping. When used, only matching routes will be scraped (see the example below).
Include vs exclude logic:
- If `--include-routes` is specified, a URL must match at least one include pattern to be scraped
- If both `--include-routes` and `--exclude-routes` are specified:
  - The URL must match an include pattern
  - The URL must NOT match any exclude pattern
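For example, to limit a crawl to documentation pages while still excluding one subsection (the paths are illustrative):

```bash
pointer scrape --include-routes "/docs/*" --exclude-routes "/docs/internal/*"
```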
Authentication options
Bearer token authentication
Use for APIs that require bearer token authentication:
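For example:

```bash
pointer scrape --bearer-token sk-proj-abc123xyz789
```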
This adds the header: `Authorization: Bearer sk-proj-abc123xyz789`
Custom headers
Add any custom headers required by the target website:
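For example, passing headers as a JSON string (the header names and values are illustrative):

```bash
pointer scrape --headers '{"X-API-Key": "my-api-key", "Accept-Language": "en-US"}'
```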
Cookies file
Load cookies from a JSON file for session-based authentication:
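For example (the file path is illustrative):

```bash
pointer scrape --cookies-file ./cookies.json
```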
The cookies file should be in Playwright’s cookie format:
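A minimal sketch of such a file, written here with a shell heredoc; the cookie name, value, and domain are placeholders:

```bash
cat > cookies.json <<'EOF'
[
  {
    "name": "session_id",
    "value": "your-session-cookie-value",
    "domain": "app.example.com",
    "path": "/",
    "expires": 1767225600,
    "httpOnly": true,
    "secure": true,
    "sameSite": "Lax"
  }
]
EOF
```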
Combined examples
Scraping a protected API documentation
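A possible invocation for this case, combining the documented `--bearer-token`, `--headers`, and `--include-routes` flags (all values are illustrative):

```bash
pointer scrape \
  --bearer-token sk-proj-abc123xyz789 \
  --headers '{"X-API-Version": "2024-01"}' \
  --include-routes "/docs/*,/reference/*"
```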
Scraping an e-commerce site with login
- First, save your cookies after manual login (for example, by exporting them from a browser-mode session)
- Then use the saved cookies for subsequent scrapes, as shown in the sketch below
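A sketch of the second step, reusing the exported cookies file in a headless scrape (the path and excluded routes are illustrative):

```bash
pointer scrape --cookies-file ./shop-cookies.json --exclude-routes "/checkout/*,/account/*"
```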
Scraping with multiple authentication methods
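A sketch combining the documented authentication flags in a single invocation (all values are placeholders):

```bash
pointer scrape \
  --bearer-token "$API_TOKEN" \
  --headers '{"X-Tenant-Id": "acme"}' \
  --cookies-file ./cookies.json
```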
Best practices
Access control with test accounts
The crawler will access your website with the same permissions as a regular user. To prevent unauthorized access to sensitive areas:
- Configure your application to block or redirect admin/private routes for the user account you'll be using
- Create a dedicated test user with limited permissions (no admin access)
- Use `--exclude-routes` to prevent scraping of sensitive areas like `/admin/*` or `/api/*`
- Ensure no sensitive information (API keys, passwords, personal data) is exposed in crawlable content, and enable PII protection at high sensitivity
Use interactive mode
Run commands without options for guided workflows:
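For example, the core workflow can be run entirely with bare commands:

```bash
pointer init
pointer scrape
pointer upload
```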
The CLI provides clear prompts and smart defaults for all operations.
Secure your credentials
- Store API keys in environment variables
- Never commit keys to version control
- Set expiration dates for temporary access
Optimize crawling
- Use browser mode only when authentication is required
- Enable PII protection for user-facing applications
- Monitor crawl status before uploading
Manage your data
- Review scraped content with `pointer list` before uploading
- Use `pointer cleanup` to remove local data after successful uploads
- Keep crawl sessions organized with descriptive website names
Advanced scraping tips
- Test with `--log-level debug` to see which URLs are being included/excluded
- Use `--max-pages 10` first to test your patterns before full scraping
- Save cookies from browser mode for reuse in headless scraping
- Combine authentication methods when sites require multiple forms of auth
- Use exact paths in include/exclude patterns when you need precision
Automation examples
While the CLI is designed for interactive use, automation is supported for CI/CD pipelines:
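A sketch of a non-interactive run using only the flags documented above; it assumes `POINTER_SECRET_KEY` is set in the CI environment and that these options cover the prompts your workflow would otherwise see:

```bash
# Collect content with prompts pre-answered via flags
pointer scrape --fast --max-pages 200 --pii-sensitivity high --log-level info

# Upload the scraped data to Pointer
pointer upload
```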
Use automation options carefully. Interactive mode provides safety confirmations and validation that prevent common errors.
Troubleshooting
Authentication errors
If you encounter authentication issues:
- Verify your API key is valid in the dashboard
- Check that the environment variable is set correctly: `echo $POINTER_SECRET_KEY`
- Ensure the key hasn’t expired
- Confirm you have necessary permissions
Crawling interruptions
The crawler automatically saves progress. If interrupted:
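Re-run the scrape and the CLI will offer to resume from the saved progress:

```bash
pointer scrape
```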
Upload limitations
- Maximum 500 pages per upload (API limit)
- Large crawls are automatically truncated
- Use `--max-pages` to control crawl size upfront
Next steps
After successfully crawling and uploading your content:
- View your enriched knowledge base in the Knowledge section
- Configure AI features to leverage the collected data
- Monitor analytics to understand content usage
- Set up regular crawls to keep knowledge current