AbanteCart Community
General Topics => SEO => Topic started by: ezeeozee on March 25, 2016, 07:54:52 AM
-
Specifically> baiduspider from baidu.com
So annoying.
Apparently it originates from China or thereabouts and it is bombarding my website with thousands of crawls which are eating up my bandwidth.
I want to block it altogether but can't find an easy way to do this. Anyone know an easy way?
On a previous cart I used to use, there was a page within Admin which listed all the spiders/bots/search engines, which had access to the site and you could allow/disallow access with a click.
Is there anything like this within Abantecart? Or an Extension for this?
Is tinkering with the htaccess file the only option?
Any answers to any of the above would be appreciated even if it means pointing me to where this query has already been answered somewhere.
Thank you.
-
Hello.
Try a robots.txt file: https://support.google.com/webmasters/answer/6062608?hl=en
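For Baiduspider specifically, a minimal robots.txt in the site's web root would look like the sketch below. Note that robots.txt is purely advisory — a crawler has to choose to honour it:

```
# Sketch of a robots.txt rule — assumes Baidu's crawler identifies itself
# with its documented "Baiduspider" user-agent token; compliance is voluntary.
User-agent: Baiduspider
Disallow: /
```

This goes in a file named robots.txt at the root of the site (e.g. /robots.txt).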
-
Brilliant, thank you!
-
Doesn't work.
Baiduspider is resistant to attempts to block it through robots.txt and also .htaccess.
Modifying both files has made diddly squat difference to the volume of crawls from Baiduspider.
As a last resort I am trying to block the IP - they use multiple IPs but they all start the same so I have added this to the .htaccess file:
Deny from 180.76.15.
Hopefully, this will work.
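As a sketch, the deny rule above would sit in .htaccess alongside Apache's standard access-control directives. This assumes Apache 2.2-style mod_authz_host syntax; 180.76.15. is just the prefix reported above, and Baidu is known to use other ranges as well:

```
# Apache 2.2 syntax (mod_authz_host) — blocks only this one Baidu IP prefix
Order Allow,Deny
Allow from all
Deny from 180.76.15.

# Apache 2.4 equivalent (mod_authz_core), shown commented out:
# <RequireAll>
#     Require all granted
#     Require not ip 180.76.15
# </RequireAll>
```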
-
Hi, please take a look here and test it. Who knows, it may work for you.
http://webmasters.stackexchange.com/questions/31837/how-to-block-baidu-spiders
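One technique discussed on that page is blocking by the User-Agent header rather than by IP. A sketch using mod_rewrite (this assumes the module is enabled and .htaccess overrides are allowed; a crawler can also spoof its User-Agent, so this is not watertight):

```
RewriteEngine On
# Return 403 Forbidden to any request whose User-Agent contains "Baiduspider"
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F,L]
```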
-
I've tried all the methods listed on that webpage. None blocked Baiduspider.
The final suggestion was to block IP addresses. This is the method I have just implemented as of today.
We'll know if this method is successful in a couple of days.
Update: Code 403 Forbidden hits have risen from a static 21 to 3000. So it is possible that blocking IP access via the .htaccess file is having an impact.
I have still had 80,000 hits from Baiduspider in 2 days however...
-
Add the following code to your robots.txt file:
# Disallow Following crawlers access to site.
User-agent: TurnitinBot/2.1
Disallow: /
User-agent: TurnitinBot
Disallow: /
User-agent: ia_archiver
Disallow: /
User-agent: EmailWolf
Disallow: /
User-agent: EmailSiphon
Disallow: /
User-agent: EmailCollector
Disallow: /
User-agent: AboutUsBot
Disallow: /
User-agent: SurveyBot
Disallow: /
User-agent: robtexbot
Disallow: /
-
Only robots.txt can help you, man.
-
Yes, robots.txt is helpful; it acts as a communication standard between a website and crawlers. The robots.txt file instructs crawlers which URLs they may crawl.
-
The robots.txt file can be used for this purpose. Using a robots.txt file, you can allow or disallow crawlers from crawling your site.
The format of robots.txt is:
User-agent: * (insert the name of the crawler the rule applies to; * means the rule applies to all crawlers.)
Disallow: /filename/ (put the path you don't want crawled.)
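Putting those two directives together, a minimal robots.txt might read as follows (the /private/ path is just a placeholder for whatever directory you want kept out of the index):

```
# Applies to all crawlers; keeps them out of /private/ only
User-agent: *
Disallow: /private/
```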