Friday, October 31, 2008

Re: Search Engine Bots Generating Strange Queries

Hi Mike,

Disallowing that in your robots.txt is a waste of time.

The robots.txt convention did not start with Google. It was proposed by
Martijn Koster in 1994 and is purely voluntary, so crawlers don't have
to follow it. I can also tell you this doesn't sound like Googlebot,
because that bot doesn't generate phantom URIs.
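
To illustrate why robots.txt only helps with well-behaved bots, here is a minimal Python sketch using the standard library's urllib.robotparser. The rules, the /search/ path, and the "ExampleBot" name are made up for the example; the point is only that the check happens on the crawler's side, so a crawler that skips it is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; the /search/ path is invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""


def polite_can_fetch(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if a crawler that honors robots.txt would fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


def rude_can_fetch(url: str) -> bool:
    """A crawler that ignores robots.txt simply never asks."""
    return True
```

A polite crawler would skip http://example.com/search/foo here, while the second function stands in for any bot that never reads the file at all.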

Web crawlers can extract URIs from many different sources, and they
can generate URIs as they see fit. URIs can come from HTML, CSS, SWF,
JavaScript, and form post/get actions. I've even seen crawlers submit
post requests to generate more URIs to crawl.
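
As a rough sketch of the HTML case only, the following Python collects URIs from href, src, and form action attributes and resolves them against the page address. The class and function names are my own, and a real crawler would also look inside CSS, SWF, and JavaScript, which this does not attempt.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect URIs from href, src, and form action attributes."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.uris = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src", "action") and value:
                # Resolve relative references against the page's own URL.
                self.uris.append(urljoin(self.base_url, value))


def extract_uris(html: str, base_url: str) -> list:
    """Return every URI found in the given HTML, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.uris
```

Note that a relative form action such as "search" becomes a full URI on your site, which is one way a crawler ends up requesting paths you never linked to directly.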

Crawlers will also rewrite URIs by removing ids and changing query
strings, send fake cookies, and sometimes rotate their IP addresses.
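
Here is a hedged Python sketch of the kind of URI rewriting I mean. The list of session-style parameter names is an assumption for illustration; each crawler has its own rules, and I am not describing any particular bot.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of session-style parameters; real crawlers use their own lists.
SESSION_PARAMS = {"sid", "phpsessid", "sessionid"}


def rewrite_uri(uri: str) -> str:
    """Drop session-style ids and the fragment, then sort the remaining query."""
    parts = urlsplit(uri)
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    pairs.sort()
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(pairs), ""))
```

With these assumptions, http://example.com/posts?page=2&PHPSESSID=abc123&order=asc#top is requested as http://example.com/posts?order=asc&page=2, a URI your application never actually generated.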

There are no enforced rules for crawlers, no guidelines they have to
follow, and no limits on how long they will crawl or how aggressively
they will request URIs from your server.

You should modify your Routes so that requests for paths you don't want
crawled return a 404.
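
The idea, sketched in Python rather than as actual CakePHP Routes configuration, is to decide on the server which paths get a 404 instead of trusting the crawler. The patterns and the status_for helper below are hypothetical; you would substitute the paths the bot is hitting on your site.

```python
import re

# Hypothetical patterns for paths that should never be served.
BLOCKED_PATTERNS = [
    re.compile(r"^/search/"),
    re.compile(r"^/posts/index/page:\d{4,}"),
]


def status_for(path: str) -> int:
    """Return 404 for blocked paths and 200 for everything else."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(path):
            return 404
    return 200
```

Because the decision is made on your server, it applies to every client, including crawlers that never read robots.txt.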