How Do You Block Specific URLs with robots.txt? Stopping Google from Crawling and Configuring the Disallow Directive

阿华AIGC实验室

2026-5-20

Hey there! If you want to stop Google (and other search engines) from crawling specific parts of your site, robots.txt is the right tool for the job. Let's walk through exactly how to use the Disallow directive to block those URLs, with concrete examples for common scenarios.

1. First, the Basics

Your robots.txt file lives in the root directory of your website (e.g., https://yourdomain.com/robots.txt). It uses directives to communicate with web crawlers, and Disallow is the key one for blocking access.

You start by specifying which crawler the rule applies to with User-agent:

Use User-agent: Googlebot to target only Google's crawler.
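Use User-agent: * to target every crawler that honors robots.txt. Note that a crawler follows only the most specific group that names it, so once a User-agent: Googlebot group exists, Googlebot ignores the rules under User-agent: *.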
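To see how group selection plays out, here is a minimal sketch using Python's standard-library urllib.robotparser. The two-group file and the crawler name OtherBot are illustrative assumptions, and this parser only approximates how a real crawler chooses its group.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one Googlebot group and one catch-all group.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private-dashboard.html

User-agent: *
Disallow: /blog/unpublished-drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

dashboard = "https://yourdomain.com/private-dashboard.html"
draft = "https://yourdomain.com/blog/unpublished-drafts/idea.html"

# Googlebot uses only its own group, so the catch-all rule does not apply to it.
print("Googlebot, dashboard:", parser.can_fetch("Googlebot", dashboard))
print("Googlebot, draft:", parser.can_fetch("Googlebot", draft))

# A crawler with no dedicated group falls back to the "*" group.
print("OtherBot, dashboard:", parser.can_fetch("OtherBot", dashboard))
print("OtherBot, draft:", parser.can_fetch("OtherBot", draft))
```

If a rule should apply to Googlebot and to everyone else, it has to be repeated in both groups.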

2. Common Blocking Scenarios with Disallow

Let's cover the most frequent use cases with clear code examples:

Block a single specific URL

If you want to block a single page like https://yourdomain.com/private-dashboard.html, add this to your robots.txt:

User-agent: Googlebot
Disallow: /private-dashboard.html

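The path after Disallow is relative to your site root and must start with /. It is case-sensitive and matched as a prefix, so this rule also blocks URLs such as /private-dashboard.html?tab=2.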

Block an entire directory

To block every page inside a directory (e.g., /blog/unpublished-drafts/ and all its subpages), use:

User-agent: *
Disallow: /blog/unpublished-drafts/

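The trailing slash limits the rule to the directory and everything beneath it. Without it, the rule is a plain prefix match and would also block unrelated paths such as /blog/unpublished-drafts-2024.html.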
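The effect of the trailing slash can be checked with a short sketch built on Python's urllib.robotparser, which applies the same prefix matching. The helper name blocked, the crawler name ExampleBot, and the sample paths are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked(rule: str, path: str) -> bool:
    """Return True if a one-rule robots.txt blocks `path` for any crawler."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {rule}"])
    return not parser.can_fetch("ExampleBot", "https://yourdomain.com" + path)


with_slash = "/blog/unpublished-drafts/"
without_slash = "/blog/unpublished-drafts"

# A page inside the directory is blocked by the directory rule.
print(blocked(with_slash, "/blog/unpublished-drafts/post-1.html"))

# A sibling page that merely shares the prefix is left alone...
print(blocked(with_slash, "/blog/unpublished-drafts-2024.html"))

# ...unless the trailing slash is dropped, which widens the match.
print(blocked(without_slash, "/blog/unpublished-drafts-2024.html"))
```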

Block URLs matching a pattern (with wildcards)

Google supports wildcards for more flexible blocking. Here are useful patterns:

Block all files with a specific extension (e.g., PDFs):
```
User-agent: Googlebot
Disallow: /*.pdf$
```
The $ anchors the match to the end of the URL, so only URLs ending in .pdf are blocked. A URL such as /report.pdf?download=1 does not end in .pdf and stays crawlable.
Block all URLs containing a specific string (e.g., old-archive anywhere in the path):
```
User-agent: *
Disallow: /*old-archive
```
The * matches any run of characters before the target string. A trailing * is unnecessary because every rule is already a prefix match.
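Block URLs carrying a specific query parameter (e.g., any URL with ?tracking= or &tracking=):
```
User-agent: Googlebot
Disallow: /*?tracking=
Disallow: /*&tracking=
```
The first rule matches when tracking is the first parameter in the query string; the second covers URLs where it follows another parameter.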
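The wildcard rules above can be tried out offline with a small matcher. This is a sketch of the * and $ semantics only, written from scratch because Python's urllib.robotparser does plain prefix matching and has no wildcard support. The function name rule_matches and the sample URLs are assumptions, and details such as Allow/Disallow precedence and percent-encoding are left out.

```python
import re


def rule_matches(pattern: str, path_and_query: str) -> bool:
    """Check a robots.txt path pattern against a URL path (plus query string).

    '*' matches any run of characters, a trailing '$' pins the end of the
    URL, and every other character is literal. Without '$' the pattern
    only has to match a prefix of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    regex = body + (r"\Z" if anchored else "")
    return re.match(regex, path_and_query) is not None


# Extension rule: the '$' keeps query-string variants out of the match.
print(rule_matches("/*.pdf$", "/docs/report.pdf"))
print(rule_matches("/*.pdf$", "/docs/report.pdf?download=1"))

# Substring rule: matches wherever the string appears in the path.
print(rule_matches("/*old-archive", "/news/old-archive/2019.html"))

# Query rule: only matches when tracking is the first parameter.
print(rule_matches("/*?tracking=", "/shop/item?tracking=abc"))
print(rule_matches("/*?tracking=", "/shop/item?color=red&tracking=abc"))
```

The last call shows why a pattern starting at ? misses parameters that come later in the query string.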

3. Critical Things to Remember

robots.txt is advisory: Legitimate crawlers like Google will follow it, but malicious bots might ignore it. For truly private content, use password protection or server-side restrictions.
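If you already have content indexed that you want removed, Disallow alone won't delete it from search results, and a blocked URL can still be listed if other pages link to it. To remove a page, add a noindex meta tag (or X-Robots-Tag header) and leave the page crawlable: a Disallow rule on the same URL stops Google from ever seeing the noindex. For quicker, temporary removal, use the Removals tool in Google Search Console.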
Always test your robots.txt rules: In Google Search Console, the robots.txt report shows the file Google last fetched and any parsing problems, and the URL Inspection tool tells you whether a given URL is blocked by robots.txt.
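Alongside Google's tools, a quick local check can catch mistakes before a file goes live. The sketch below compares a draft robots.txt against a list of expectations using Python's urllib.robotparser. The helper find_mismatches, the draft rules, and the URLs are illustrative assumptions, and because this parser ignores * and $ wildcards it is only suitable for plain prefix rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft file containing prefix rules only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private-dashboard.html
Disallow: /blog/unpublished-drafts/
"""


def find_mismatches(robots_txt: str, agent: str, expectations: dict) -> list:
    """Return the URLs whose crawl permission differs from what was expected.

    `expectations` maps each URL to True (should be crawlable) or
    False (should be blocked).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url
        for url, should_allow in expectations.items()
        if parser.can_fetch(agent, url) != should_allow
    ]


expectations = {
    "https://yourdomain.com/": True,
    "https://yourdomain.com/private-dashboard.html": False,
    "https://yourdomain.com/blog/unpublished-drafts/idea.html": False,
    "https://yourdomain.com/blog/published-post.html": True,
}

mismatches = find_mismatches(ROBOTS_TXT, "Googlebot", expectations)
print("Unexpected results:", mismatches)
```

To check the deployed file instead of a draft, the same parser can load it with set_url() followed by read().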

The question behind this article comes from Stack Exchange, where it was asked by Mohan Prajapati.