Questions tagged [robots.txt]

A convention for telling web crawlers which parts of a website they should not crawl.

If a site owner wishes to give instructions to web robots, they must place a text file called robots.txt in the root of the web site hierarchy (e.g. www.example.com/robots.txt). This text file should contain the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the web site. If this file doesn't exist, web robots assume that the site owner wishes to provide no specific instructions.

A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com.
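The format itself is plain text: each group starts with a User-agent line naming a crawler (or * for all crawlers), followed by Disallow (and, in many implementations, Allow) rules giving path prefixes. An illustrative example with hypothetical paths:

```text
# All crawlers: stay out of these directories
User-agent: *
Disallow: /private/
Disallow: /tmp/

# Googlebot matches this more specific group and ignores the * group above
User-agent: Googlebot
Disallow: /no-google/
```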

Source: Wikipedia

93 questions
32
votes
5 answers

How to set robots.txt globally in nginx for all virtual hosts

I am trying to set robots.txt for all virtual hosts under the nginx http server. I was able to do it in Apache by putting the following in the main httpd.conf: SetHandler None Alias /robots.txt…
anup
  • 747
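For the nginx side of this question: nginx does not allow location blocks outside a server context, so a common pattern is to keep the rule in one snippet file and include it from every server block. A sketch with hypothetical paths:

```nginx
# /etc/nginx/snippets/robots.conf (hypothetical shared snippet)
location = /robots.txt {
    alias /var/www/shared/robots.txt;
}
```

Each virtual host then needs only a single line, include /etc/nginx/snippets/robots.conf;, inside its server block.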
23
votes
5 answers

How Can I Encourage Google to Read New robots.txt File?

I just updated my robots.txt file on a new site; Google Webmaster Tools reports it read my robots.txt 10 minutes before my last update. Is there any way I can encourage Google to re-read my robots.txt as soon as possible? UPDATE: Under Site…
qxotk
  • 1,436
14
votes
5 answers

Which bots and spiders should I block in robots.txt?

In order to: increase the security of my website, reduce bandwidth requirements, and prevent email address harvesting.
DaveC
  • 243
10
votes
4 answers

How to create robots.txt file for all domains on Apache server

We have a XAMPP Apache development web server set up with virtual hosts and want to stop search engines from crawling all our sites. This is easily done with a robots.txt file. However, we'd rather not include a disallow robots.txt in every vhost and then…
Mike B
  • 203
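A sketch of the usual Apache approach (paths hypothetical): mod_alias directives placed in the main server configuration are inherited by virtual hosts that don't define their own, so a single Alias outside any VirtualHost can cover every site:

```apache
# In the main httpd.conf, outside any <VirtualHost>
Alias /robots.txt /var/www/shared/robots.txt

<Directory "/var/www/shared">
    Require all granted
</Directory>
```

Require all granted is Apache 2.4 syntax; Apache 2.2 installations use the older Order/Allow directives instead.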
8
votes
3 answers

How do I use robots.txt to disallow crawling for only my subdomains?

If I want my main website to be on search engines, but none of the subdomains to be, should I just put the "disallow all" robots.txt in the directories of the subdomains? If I do, will my main domain still be crawlable?
tkbx
  • 201
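For the disallow-all case asked about here, the file placed in each subdomain's document root would contain just:

```text
User-agent: *
Disallow: /
```

Since a robots.txt applies only to the exact host it is served from, this leaves the main domain's crawlability untouched.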
7
votes
6 answers

What happens if a website does not have a robots.txt file?

If the robots.txt file is missing from the root directory of a website, how is the site treated: not indexed at all, or indexed without any restrictions? Logically it should be the second, in my view. I ask in reference…
Lazer
  • 445
6
votes
4 answers

How do you create a single robots.txt file for all sites on an IIS instance

I want to create a single robots.txt file and have it served for all sites on my IIS (7 in this case) instance. I do not want to have to configure anything on any individual site. How can I do this?
5
votes
1 answer

Nginx robots.txt configuration

I can't seem to properly configure nginx to return robots.txt content. Ideally, I don't need the file and just want to serve text content configured directly in nginx. Here's my config: server { listen 80 default_server; listen [::]:80…
Denys S.
  • 225
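A frequently used sketch for serving robots.txt content straight from the nginx configuration, with no file on disk (the rules shown block all crawling; adjust as needed):

```nginx
location = /robots.txt {
    default_type text/plain;
    return 200 "User-agent: *\nDisallow: /\n";
}
```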
5
votes
6 answers

Blocking yandex.ru bot

I want to block all requests from the yandex.ru search bot. It is very traffic-intensive (2 GB/day). I first blocked one class C IP range, but it seems this bot appears from different IP ranges. For example: spider31.yandex.ru ->…
Ross
  • 268
  • 1
  • 3
  • 9
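Yandex's crawler honours robots.txt, so before resorting to IP-level blocking it is worth trying a simple exclusion first (a sketch; "Yandex" is the user-agent token Yandex documents for its bots):

```text
User-agent: Yandex
Disallow: /
```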
3
votes
3 answers

robots.txt is redirecting to default page

Hullo. Typically, if I type "oneofmysites.com/robots.txt" into my address bar, any browser will display the content of robots.txt; this is pretty standard behaviour. I have just one web server which does not. Instead, robots.txt…
Parapluie
  • 165
3
votes
1 answer

Baidu Spider causing 3Gb of traffic a day - but I do business in China

I'm in a difficult situation: the Baidu spider is hitting my site, causing about 3 GB a day worth of bandwidth. At the same time, I do business in China, so I don't want to just block it. Has anyone else been in a similar situation (with any spider)? Did you…
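For this block-versus-business dilemma, one middle ground is to throttle rather than block, via a Crawl-delay rule. Support for Crawl-delay is crawler-specific and not part of the original robots.txt standard, so treat this as an assumption to verify against Baidu's own documentation:

```text
User-agent: Baiduspider
Crawl-delay: 10
```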
3
votes
3 answers

robots.txt and other .txt returning 404 on IIS?

We have an IIS site running DotNetNuke that we took over from another group. We have added a robots.txt file to the root, but it returns a 404. In fact, any .txt file in the root seems to return a 404. I can't seem to spot where they may have blocked…
3
votes
2 answers

Robots.txt - no follow, no index

Can someone please explain the difference between setting allow and disallow in a robots.txt file and creating nofollow, noindex meta tags? Is it possible to set nofollow and noindex within the robots.txt file? I have looked on…
Ian
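To summarise the distinction the question asks about: robots.txt controls crawling, while noindex and nofollow are meta tags that control indexing and link-following and live in each HTML page's head; the standard robots.txt format has no noindex or nofollow directive. A minimal page-level example:

```html
<!DOCTYPE html>
<html>
<head>
  <!-- Ask robots not to index this page or follow its links -->
  <meta name="robots" content="noindex, nofollow">
  <title>Not for search engines</title>
</head>
<body>...</body>
</html>
```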
3
votes
1 answer

Why is googlebot requesting robots.txt from my SSH server?

I run ossec on my server and periodically I receive a warning like this: Received From: myserver->/var/log/auth.log Rule: 5701 fired (level 8) -> "Possible attack on the ssh server (or version gathering)." Portion of the log(s): Nov 19 14:26:33…
Brian
  • 796
  • 1
  • 7
  • 16
3
votes
3 answers

How to prevent discovery of a secure URL?

If I have a URL that is used for receiving messages, created like so: http://www.mydomain.com/somelonghash123456etcetc, and this URL allows other services to POST messages to it, is it possible for a search engine robot to find it? I don't want…