Crawl Budget Optimization for E-commerce Sites
For e-commerce sites with 10,000+ URLs, crawl budget is one of the most consequential and most ignored technical SEO levers. The symptom looks like an indexation problem: “Google isn’t indexing our new products” or “Our category pages keep dropping in and out of the index.” The cause is usually upstream: Google is wasting its crawl budget on URLs that shouldn’t exist, leaving genuinely important pages under-crawled and under-indexed.
This guide walks through how crawl budget actually works in 2026, how to identify waste, and the architectural changes that maximize indexation of the URLs that matter.
What crawl budget actually is
Crawl budget is the number of URLs Googlebot will crawl on your site in a given period. Two factors determine it:
1. Crawl capacity limit: how much Googlebot can crawl without overloading your server. Slower servers = lower capacity = less crawl budget.
2. Crawl demand: how much Google wants to crawl your site, based on perceived freshness, importance, and update frequency.
For most sub-10K-page sites, crawl budget isn’t a constraint — Google can crawl everything. Above 10K pages, especially with parameter-rich URLs, crawl budget becomes a real bottleneck.
If you don’t manage it, Google’s algorithm picks what to crawl. That algorithm doesn’t know your business priorities — it allocates by perceived importance via internal links and external signals.
How to know you have a crawl budget problem
Signals you’re hitting crawl budget limits:
1. New products or pages take 7+ days to get indexed. Google sees them in the sitemap but is slow to crawl them.
2. Search Console → Settings → Crawl stats shows total crawl requests sitting at the same daily ceiling, week after week.
3. Significant “Discovered - currently not indexed” or “Crawled - currently not indexed” counts in the Pages report. These mean Google found the URLs but either hasn’t crawled them yet or crawled them and chose not to index them.
4. Server log analysis shows Googlebot spending its time on low-value URLs: faceted search parameter URLs, internal search results, paginated product variations.
5. Important pages show stale “Last crawl” dates in Search Console’s URL Inspection tool.
If 2+ of these apply, crawl budget needs attention.
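Signal 4 is the one you can measure directly. The sketch below assumes access logs in the common "combined" format and sorts Googlebot requests into rough buckets. The bucket names, the `LOW_VALUE_PARAMS` set, and the `googlebot_crawl_share` helper are illustrative choices, not a standard tool. It also trusts the user-agent string, which can be spoofed, so treat the result as an estimate.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

# Pulls the requested URL and the user agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<url>\S+) HTTP/[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Assumption: parameter names taken from the examples in this guide.
LOW_VALUE_PARAMS = {"sort", "view", "per_page", "print", "phpsessid"}


def bucket(url: str) -> str:
    """Assign a URL to a coarse crawl-waste bucket."""
    parts = urlsplit(url)
    params = {key.lower() for key, _ in parse_qsl(parts.query, keep_blank_values=True)}
    if parts.path.startswith("/search"):
        return "internal search"
    if any(key.startswith("utm_") for key in params):
        return "tracking parameters"
    if params & LOW_VALUE_PARAMS:
        return "sort/view/session parameters"
    if params:
        return "other parameters (likely facets)"
    return "clean URL"


def googlebot_crawl_share(lines) -> dict[str, float]:
    """Return the percentage of Googlebot requests that fall in each bucket."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            counts[bucket(match.group("url"))] += 1
    total = sum(counts.values())
    if not total:
        return {}
    return {name: round(100 * n / total, 1) for name, n in counts.most_common()}


# Synthetic sample data, for illustration only.
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def sample_line(url: str, agent: str) -> str:
    return f'66.249.66.1 - - [10/Jan/2026:10:00:00 +0000] "GET {url} HTTP/1.1" 200 5120 "-" "{agent}"'


SAMPLE = [
    sample_line("/shoes?sort=price_asc", GOOGLEBOT),
    sample_line("/shoes?color=blue&sort=price_desc", GOOGLEBOT),
    sample_line("/search?q=blue+shoes", GOOGLEBOT),
    sample_line("/shoes?color=blue", GOOGLEBOT),
    sample_line("/shoes", GOOGLEBOT),
    sample_line("/shoes", "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"),
]

print(googlebot_crawl_share(SAMPLE))
```

On real logs, pass an open file handle instead of `SAMPLE`. If clean URLs account for only a small share of Googlebot requests, the waste described in the next section is probably the cause.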
The biggest sources of crawl waste
For e-commerce specifically:
1. Faceted navigation parameters
Filter combinations: ?color=blue&size=medium&price=20-40 and infinite variants. Each combination is technically a unique URL — and Google may try to crawl all of them.
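A 1,000-product site with 5 filter types averaging 4 options each has 4^5 = 1,024 filter combinations per category, counting only URLs with one option picked in every filter. Allow multi-select, or multiply across categories, and the URL space Google could crawl runs into the millions.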
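The arithmetic is worth running for your own filter set. A small sketch using the 5-filters-by-4-options example above; the three counting modes are assumptions about how a filter UI might behave, and none of them counts parameter-order variants, which inflate the numbers further.

```python
from math import prod

options_per_filter = [4, 4, 4, 4, 4]  # 5 filter types, 4 options each

# Exactly one option chosen in every filter: 4^5.
all_filters_set = prod(options_per_filter)

# Each filter is either unset or has one option chosen: (4 + 1)^5.
single_select = prod(n + 1 for n in options_per_filter)

# Multi-select: any subset of options per filter, including none: (2^4)^5.
multi_select = prod(2 ** n for n in options_per_filter)

print(all_filters_set)  # 1024
print(single_select)    # 3125
print(multi_select)     # 1048576
```

With multi-select filters, a single category already exceeds a million crawlable URL states.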
2. Internal search results
Pages like /search?q=blue+shoes. Each unique search query creates a unique URL. Spider traps if not handled.
3. Sort and view-style parameters
?sort=price_asc, ?view=grid, ?per_page=24. Duplicate content across many parameter variations.
4. Session IDs in URLs
Older PHP sites that still append ?phpsessid=abc123 to links create a new URL for every session.
5. Calendar/archive paginated pages
Date-archive pages with infinite depth: /archive/2019/03/page/47. Often duplicate of main content.
6. Print versions
?print=1 versions of every page. Pure duplicate content.
7. UTM parameters
URLs from marketing campaigns: ?utm_source=newsletter&utm_campaign=spring-sale. Should canonicalize to clean URLs.
8. Soft 404s
URLs that return a 200 status but have no real content, such as out-of-stock product pages with no useful information.
9. Orphaned URLs
Pages Google still discovers through sitemaps, external links, or old redirects, but that nothing on the site links to. Often legacy content.
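Orphaned URLs (item 9) can be found by comparing the URLs you know about with the URLs reachable through internal links. A minimal sketch, assuming you already have a set of known URLs (from sitemaps and server logs) and an internal link graph from a site crawl; `find_orphans` and the sample data are hypothetical.

```python
def find_orphans(known_urls: set[str], internal_links: dict[str, set[str]], home: str) -> set[str]:
    """Return known URLs that cannot be reached by following internal links from `home`."""
    reachable: set[str] = set()
    stack = [home]
    while stack:
        url = stack.pop()
        if url in reachable:
            continue
        reachable.add(url)
        # Queue every page this URL links to.
        stack.extend(internal_links.get(url, ()))
    return known_urls - reachable


# Illustrative data: one old campaign page that nothing links to.
links = {
    "/": {"/shoes/", "/about/"},
    "/shoes/": {"/shoes/blue/", "/"},
}
known = {"/", "/shoes/", "/shoes/blue/", "/about/", "/spring-sale-2019/"}

print(find_orphans(known, links, home="/"))  # {'/spring-sale-2019/'}
```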
The fixes (in order of impact)
Fix 1: Robots.txt for high-volume parameter URLs
For faceted navigation and search parameters that have no SEO value:
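User-agent: *
# Each parameter needs two rules: first position (?x=) and later position (&x=).
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?view=
Disallow: /*&view=
Disallow: /*?per_page=
Disallow: /*&per_page=
Disallow: /search?
Disallow: /*?phpsessid=
Disallow: /*&phpsessid=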
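This tells Google not to crawl these URL patterns. It is the cleanest, most direct fix for low-value parameter URLs. Matching is case-sensitive, so write each parameter exactly as it appears in your URLs (PHP’s default session parameter is uppercase PHPSESSID).
Caveat: robots.txt blocking doesn’t prevent indexation if the URL is linked from elsewhere, and Googlebot can’t see a noindex tag on a page it isn’t allowed to crawl. For URLs already in the index, serve noindex first and add the robots.txt block only after they have dropped out.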
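Wildcard rules are easy to get subtly wrong, so it helps to test them against sample URLs before deploying. Python’s built-in `urllib.robotparser` does not handle `*` wildcards, so the sketch below uses a small hand-written matcher. It covers only Disallow patterns with `*` and a trailing `$`, and ignores Allow rules and longest-match precedence; the function names are illustrative.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    Patterns always match from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path_and_query: str, disallow_patterns: list[str]) -> bool:
    """True if any Disallow pattern matches the path plus query string."""
    return any(robots_pattern_to_regex(p).match(path_and_query) for p in disallow_patterns)


first_position_only = ["/*?sort="]
both_positions = ["/*?sort=", "/*&sort="]

print(is_disallowed("/shoes?sort=price_asc", first_position_only))             # True
print(is_disallowed("/shoes?color=blue&sort=price_asc", first_position_only))  # False
print(is_disallowed("/shoes?color=blue&sort=price_asc", both_positions))       # True
```

The second call shows the gap: a rule written as `/*?sort=` matches only when `sort` is the first parameter, so a `&sort=` rule is needed as well.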
Fix 2: Canonical tags for parameter variations
For parameter URLs that should consolidate to a clean version:
<link rel="canonical" href="https://example.com/category/shoes" />
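On the parameter-laden URL ?color=blue&size=large, the canonical points to the parent URL. It tells Google “these are duplicates; index this one.” Google treats a canonical as a strong hint rather than a directive, so keep internal links and sitemaps pointing at the clean URL too.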
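One way to generate that tag consistently is to derive the canonical URL from the requested URL by dropping parameters that don’t change the content. A sketch, assuming the parameter names used as examples in this guide; the `STRIP_EXACT` list and both helper functions are hypothetical and should be adapted to your own URL scheme.

```python
import html
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters that never deserve their own indexed URL on this site.
STRIP_EXACT = {"sort", "view", "per_page", "print", "phpsessid", "color", "size", "price"}
STRIP_PREFIXES = ("utm_",)


def canonical_url(url: str) -> str:
    """Drop duplicate-creating parameters and the fragment; keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in STRIP_EXACT and not key.lower().startswith(STRIP_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_tag(url: str) -> str:
    """Render the <link rel="canonical"> element for a requested URL."""
    return f'<link rel="canonical" href="{html.escape(canonical_url(url), quote=True)}" />'


print(canonical_tag("https://example.com/category/shoes?color=blue&size=large&utm_source=newsletter"))
```

Parameters not on the strip list, such as a page number, pass through untouched, so paginated URLs keep their own canonical.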
Fix 3: URL parameter handling in Search Console (deprecated)
Note: Google deprecated the URL Parameters tool in 2022. Don’t rely on it. Use robots.txt + canonical instead.
Fix 4: Noindex for pages that must stay crawlable but out of the index
For pages that need to be crawlable (links, internal logic) but shouldn’t be indexed:
<meta name="robots" content="noindex, follow">
Common applications:
- Filter combinations
- Internal search results
- Thank-you pages
- User account pages
Note: noindex still uses crawl budget, because Google has to fetch the page to see the tag. For high-volume URLs that aren’t useful at all, a robots.txt block is better.
Fix 5: Sitemap discipline
Your sitemap should ONLY contain URLs you want indexed:
- Canonical URLs only
- 200 status codes only
- No noindexed URLs
- No redirected URLs
Submit clean sitemaps. In Search Console, filter the Pages report by sitemap to see how many submitted URLs are not indexed and why, then work that count toward zero.
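The four rules above are mechanical enough to check with a script. The sketch below assumes you have already fetched each sitemap URL and recorded its status code, canonical target, and noindex flag (for example from a crawler export); the `PageFacts` record and `audit_sitemap` function are hypothetical names.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


@dataclass
class PageFacts:
    status: int                      # HTTP status returned for the URL
    canonical: Optional[str] = None  # href of rel="canonical", if any
    noindex: bool = False            # True if the page carries a noindex directive


def audit_sitemap(sitemap_xml: str, facts: dict[str, PageFacts]) -> dict[str, list[str]]:
    """Return every sitemap URL that breaks a rule, with the reasons."""
    problems: dict[str, list[str]] = {}
    root = ET.fromstring(sitemap_xml)
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        page = facts.get(url)
        issues = []
        if page is None:
            issues.append("not fetched")
        else:
            if 300 <= page.status < 400:
                issues.append(f"redirects ({page.status})")
            elif page.status != 200:
                issues.append(f"status {page.status}")
            if page.noindex:
                issues.append("noindex")
            if page.canonical and page.canonical != url:
                issues.append(f"canonical points to {page.canonical}")
        if issues:
            problems[url] = issues
    return problems


# Illustrative sitemap and crawl results.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/shoes</loc></url>
  <url><loc>https://example.com/old-shoes</loc></url>
  <url><loc>https://example.com/shoes?sort=price_asc</loc></url>
</urlset>"""

FACTS = {
    "https://example.com/shoes": PageFacts(200, "https://example.com/shoes"),
    "https://example.com/old-shoes": PageFacts(301),
    "https://example.com/shoes?sort=price_asc": PageFacts(200, "https://example.com/shoes"),
}

for url, issues in audit_sitemap(SITEMAP, FACTS).items():
    print(url, "->", ", ".join(issues))
```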
Fix 6: Internal linking pruning
Pages that have no value should also have few internal links. Examples:
- Login pages don’t need links from product pages
- Account pages don’t need site-wide footer links
Reduce internal linking to low-value pages so Google deprioritizes their crawl.
Fix 7: Pagination strategy
For paginated lists (category pages, blog archives):
Modern approach in 2026: serve standard paginated URLs (/category/page/2/), skip rel=next/prev (Google confirmed in 2019 that it no longer uses them), and let every paginated page be indexed, since each one lists different items.
For infinite scroll: provide a paginated fallback. Googlebot doesn’t scroll or click, so items that load only on scroll, with no paginated URL behind them, are invisible to it.
Fix 8: Faceted navigation strategy
For sites with many filters, consider:
Strategy A: AJAX-loaded filters that don’t change URLs. No new URLs created. Crawl budget protected. SEO downside: filter combinations can’t rank.
Strategy B: Static category pages for top filter combinations. Pre-build “Blue Shoes” and “Running Shoes” as their own URLs. Other combinations stay AJAX-only.
Strategy C: All filter URLs indexable but with smart canonicals. Most filter combos canonical to parent; the most-searched combos have their own canonical URLs.
For large e-commerce sites, Strategy B is usually the best balance between crawl budget and SEO opportunity.
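The core of Strategies B and C is a lookup: a short allowlist of filter combinations that get their own URL, with everything else consolidating to the parent category. A sketch under that assumption; the `STATIC_LANDING_PAGES` table, its paths, and `canonical_for` are hypothetical.

```python
# Assumption: combinations with proven search demand, each mapped to a static URL.
STATIC_LANDING_PAGES = {
    ("/shoes/", frozenset({("color", "blue")})): "/shoes/blue/",
    ("/shoes/", frozenset({("type", "running")})): "/shoes/running/",
}


def canonical_for(category_path: str, filters: dict[str, str]) -> str:
    """Return the canonical path for a category view with the given active filters.

    Allowlisted combinations resolve to their static landing page;
    every other combination consolidates to the parent category.
    """
    key = (category_path, frozenset(filters.items()))
    return STATIC_LANDING_PAGES.get(key, category_path)


print(canonical_for("/shoes/", {"color": "blue"}))                # /shoes/blue/
print(canonical_for("/shoes/", {"color": "blue", "size": "10"}))  # /shoes/
print(canonical_for("/shoes/", {}))                               # /shoes/
```

Using a `frozenset` as the key makes the lookup independent of the order in which filters were applied.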
Fix 9: Server response time
Slower servers throttle Google’s crawl rate. Improving server response time directly increases crawl capacity.
Target: Time to First Byte (TTFB) under 600ms for the average page. Most CDN-cached pages should be under 200ms.
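A rough way to spot-check these targets from your own machine is sketched below, using only the standard library. The timing includes DNS lookup, connection setup, and TLS, and reflects your location rather than Googlebot’s, so the average response time in Crawl stats remains the better measure of what Google experiences. The function names and the "fast/ok/slow" labels are illustrative.

```python
import http.client
import time
from urllib.parse import urlsplit


def measure_ttfb_ms(url: str, timeout: float = 10.0) -> float:
    """Time from sending a GET request until the first body byte arrives, in ms."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    start = time.perf_counter()
    try:
        # request() opens the connection, so setup time is part of the measurement.
        conn.request("GET", path, headers={"User-Agent": "ttfb-check/0.1"})
        response = conn.getresponse()  # returns once status line and headers are read
        response.read(1)               # first byte of the body
        return (time.perf_counter() - start) * 1000.0
    finally:
        conn.close()


def rate_ttfb(ms: float) -> str:
    """Compare a measurement with the targets above (200 ms cached, 600 ms average)."""
    if ms < 200:
        return "fast"
    if ms < 600:
        return "ok"
    return "slow"
```

Single measurements are noisy. Calling `measure_ttfb_ms` several times per template (home, category, product) and rating the median with `rate_ttfb` gives a steadier picture.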
Fix 10: 404 vs 410 vs redirect for deleted products
When products are deleted:
- 410 Gone: tells Google “this is permanently removed.” Typically dropped from the index slightly faster than a 404.
- 404 Not Found: standard removal. Slower deindexation but acceptable.
- 301 redirect to relevant alternative: best when there’s a logical successor (newer product, parent category).
Many e-commerce sites accumulate thousands of 404s as products discontinue. Configure 410 for known-permanently-removed SKUs to clean indexation faster.
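The three options reduce to a short rule that can live in the product-retirement workflow. A sketch, assuming a hypothetical `RetiredProduct` record with a flag for permanent removal and an optional successor URL:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetiredProduct:
    sku: str
    permanently_removed: bool
    successor_url: Optional[str] = None  # newer product or parent category, if one fits


def response_for(product: RetiredProduct) -> tuple[int, Optional[str]]:
    """Return (status code, redirect target) for a retired product URL."""
    if product.successor_url:
        return 301, product.successor_url  # a logical successor exists
    if product.permanently_removed:
        return 410, None                   # known to be gone for good
    return 404, None                       # default: plain not-found


print(response_for(RetiredProduct("SKU-1", True, "/shoes/runner-v2/")))  # (301, '/shoes/runner-v2/')
print(response_for(RetiredProduct("SKU-2", True)))                       # (410, None)
print(response_for(RetiredProduct("SKU-3", False)))                      # (404, None)
```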
A 30-day crawl budget audit
Days 1-7: Diagnose.
- Search Console → Settings → Crawl stats. Note daily crawl rate and trend.
- Search Console → Pages → check “Crawled - currently not indexed” and “Discovered - currently not indexed”. Note the counts and the URL patterns.
- If you have server logs, analyze Googlebot hits over 30 days. Identify URL patterns by frequency.
Days 8-15: Quick wins.
- Robots.txt block obvious low-value parameter URLs.
- Canonical clean-up for known duplicates.
- Sitemap clean-up (remove non-canonical, redirected, 404 URLs).
Days 16-22: Faceted navigation strategy.
- Identify top 50 filter combinations users search for.
- Build static URLs for those (or implement canonical to parent for others).
- Either AJAX-only the rest or block via robots.txt.
Days 23-30: Validate.
- Re-check Crawl stats. Crawl rate should reallocate toward important pages.
- Re-check Search Console indexation. Discovered-not-indexed should drop.
- Request indexing for your most important URLs through the URL Inspection tool.
Common crawl budget mistakes
1. Blocking everything that doesn’t convert directly. Some “support” URLs (about, blog, FAQ) build authority. Don’t block aggressively.
2. Robots.txt-blocking pages that you also have indexed. Google won’t deindex blocked pages — they stay in the index without crawl. Use noindex for things you want removed.
3. Treating all parameter URLs as bad. Some filter combinations are searched (“Blue Running Shoes Size 10”) and worth indexing. Strategic, not blanket.
4. Ignoring sitemap quality. A sitemap where half the URLs are noindexed or redirected dilutes Google’s understanding of what matters.
5. Not optimizing TTFB. Slow servers cap crawl capacity. Speed improvements directly increase crawl budget.
6. Letting orphaned pages accumulate. Old campaign landing pages, deprecated features. Audit yearly and remove.
Frequently asked questions
How many URLs is “large” for crawl budget concerns?
Roughly 10,000+ URLs starts to matter. Above 100,000, it’s a primary concern.
Does crawl budget apply to small sites?
Generally no. Sites under 10K URLs have plenty of crawl budget unless server speed is extremely slow.
How do I measure crawl budget?
Search Console → Settings → Crawl stats shows daily crawl rate, average response time, and total crawl requests over time.
Will fixing crawl budget improve rankings?
Indirectly. Better indexation of important pages improves their visibility, which over time improves rankings. Crawl budget alone doesn’t move rankings; indexation does.
Should I worry about crawl budget if I’m using Cloudflare or another CDN?
CDN helps with server response time, which helps with crawl capacity. But CDN doesn’t fix logical issues (parameter explosion, low-value URLs). Both matter.
Crawl budget optimization is unglamorous, technical work that pays back through better indexation of what actually matters. For large e-commerce sites, it’s one of the highest-leverage technical SEO investments available — and one of the most overlooked. The audit takes a week; the cleanup takes a month; the payback compounds for years.