Sunday, May 27, 2012

Censorship By Google Using Hidden Robots Text File Restrictions


There are various methods through which Google censors and manipulates SERPs, news, blogs and other results. Google relies heavily upon the “crawling methodology” to produce results that suit its own requirements. For instance, Google may record a 503 error or some other such error against a page. However, such an error is at least apparent if you analyse the webmaster tool.
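As a rough illustration, the status a crawler receives can be checked directly. The following is a minimal Python sketch, assuming nothing beyond this blog’s address (taken from this post) and the published Googlebot user-agent string:

from urllib.request import Request, urlopen
from urllib.error import HTTPError

# Request the page the way a crawler would, using Googlebot's user-agent.
url = "http://cyberforensicsofindia.blogspot.com/"
req = Request(url, headers={"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"})
try:
    with urlopen(req) as resp:
        print(resp.status)   # 200 if the page is served normally
except HTTPError as err:
    print(err.code)          # e.g. 503 if the server returns an error

If the page itself answers with a healthy status, a 503 or similar error reported only inside the webmaster tool points to a problem at the crawling end rather than at the site.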

It seems Google has developed another innovative method of censoring posts that it finds controversial. Google appears to be manipulating the robots text file (robots.txt) to block even those posts and sections that are not, by design, supposed to be blocked.

The worst part is that this is done in a clandestine manner, and you cannot do much about it even if you thoroughly analyse the webmaster tool. Today I spent an entire day trying to understand why the post titled “Cyber Forensics and Indian Approach” was censored by Google.

I analysed the webmaster tool and found a message telling me that the “health” of the blog, titled Cyber Forensics in India, is not in good shape. The exact message reads “Severe health issues are found on your site - Check site health”. Upon analysing the problem further, the webmaster tool reported that “some important page is blocked by robots.txt”.

I first followed the link that reads “Is robots.txt blocking important pages?” and it returned the message “The page you are trying to reach does not exist (404)”. I then tried to analyse the important page that had been blocked by the robots.txt file, and it gave me this page.
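Instead of relying solely upon the webmaster tool, one can fetch the robots.txt file directly and see exactly what is served to crawlers. A minimal Python sketch, using this blog’s address:

from urllib.request import urlopen

# Fetch the robots.txt file exactly as a crawler would receive it.
with urlopen("http://cyberforensicsofindia.blogspot.com/robots.txt") as resp:
    print(resp.read().decode("utf-8"))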

Before proceeding further, let us check the standard robots.txt file of Blogspot blogs. For the present blog (it is identical for every other Blogspot blog except for the blog address), it reads as follows:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: http://cyberforensicsofindia.blogspot.com/feeds/posts/default?orderby=UPDATED

It is clear that the only thing disallowed by the Blogspot robots.txt file is the /search directory and its sub-folders. All other directories and their sub-folders are crawlable and accessible, not only to Google’s bots but to the crawling bots of other search engines as well.
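This can also be verified programmatically. The following is a minimal sketch using Python’s standard urllib.robotparser, applying the rules quoted above verbatim; the post path used below is hypothetical and is there for illustration only:

from urllib.robotparser import RobotFileParser

# The Blogspot rules quoted above, verbatim.
rules = """User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

base = "http://cyberforensicsofindia.blogspot.com"
# A /search URL versus an ordinary post URL (the post path is hypothetical).
for path in ("/search/label/censorship",
             "/2012/05/cyber-forensics-and-indian-approach.html"):
    verdict = "allowed" if parser.can_fetch("*", base + path) else "blocked"
    print(path, "->", verdict)

Run as written, only the /search URL is reported as blocked; an ordinary post URL is allowed. The disappearance of a single post therefore cannot be explained by this file.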

Now when we clicked upon the important page supposedly blocked by the robots.txt file of our blog, it took us here. This is absurd on at least two counts. Firstly, the page in question lies under the /search directory, so it is bound to be blocked; there is nothing unnatural about that, and it cannot be termed a “severe health issue” for the blog.

Secondly, there is no entry or record for the censored post at all. There is no error, either in crawling or in indexing. There are no malware issues, and no page-removal requests are involved either.

Clearly, whatever happened to that post happened at Google’s end, and Google owes us an explanation in this regard. We are aware that we are not alone in facing this issue; there are plenty of examples where such issues have arisen and been resolved at Google.

However, we can see no reason for the blocking, filtering, censorship or deindexing of our post. It is time for Google to explain.