AbanteCart Community

General Topics => SEO => Topic started by: ezeeozee on March 25, 2016, 07:54:52 AM

Title: Blocking Unwanted Spider Crawls
Post by: ezeeozee on March 25, 2016, 07:54:52 AM
Specifically: Baiduspider from baidu.com.

So annoying.

Apparently it originates from China or thereabouts and it is bombarding my website with thousands of crawls which are eating up my bandwidth.

I want to block it altogether but can't find an easy way to do this. Anyone know an easy way?

On a previous cart I used to use, there was a page within Admin which listed all the spiders/bots/search engines that had access to the site, and you could allow/disallow access with a click.

Is there anything like this within Abantecart? Or an Extension for this?

Is tinkering with the htaccess file the only option?

Any answers to any of the above would be appreciated even if it means pointing me to where this query has already been answered somewhere.

Thank you.
Title: Re: Blocking Unwanted Spider Crawls
Post by: Basara on March 25, 2016, 07:58:23 AM
Hello.

Try a robots.txt file: https://support.google.com/webmasters/answer/6062608?hl=en
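As a quick way to sanity-check robots.txt rules before deploying them, Python's standard `urllib.robotparser` can evaluate them locally. This is a sketch; the rules and the `/index.php` path below are illustrative, not taken from any real site.

```python
# Sketch: verify that a robots.txt rule actually disallows Baiduspider,
# using Python's standard urllib.robotparser.
from urllib.robotparser import RobotFileParser

# Illustrative rules: deny Baiduspider everywhere, allow everyone else.
rules = """\
User-agent: Baiduspider
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Baiduspider is denied; other agents are still allowed.
print(parser.can_fetch("Baiduspider", "/index.php"))  # False
print(parser.can_fetch("Googlebot", "/index.php"))    # True
```

Bear in mind robots.txt is purely advisory: it only works if the crawler chooses to honour it.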
Title: Re: Blocking Unwanted Spider Crawls
Post by: ezeeozee on March 25, 2016, 01:13:06 PM
Brilliant, thank you!
Title: Re: Blocking Unwanted Spider Crawls
Post by: ezeeozee on April 03, 2016, 07:13:46 AM
Doesn't work.

Baiduspider is resistant to attempts to block it through robots.txt and also .htaccess.

Modifying both files has made diddly squat difference to the volume of crawls from Baiduspider.

As a last resort I am trying to block the IP - they use multiple IPs but they all start the same so I have added this to the .htaccess file:

Deny from 180.76.15.

Hopefully, this will work.
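For context, a fuller version of that .htaccess rule might look like the sketch below. The 180.76.15. prefix is just the range observed above, and the directive syntax depends on your Apache version (this assumes Apache with mod_authz_host / mod_authz_core; check which your host runs):

```apache
# Apache 2.2 syntax (mod_authz_host): deny the observed Baiduspider range.
Order Allow,Deny
Allow from all
Deny from 180.76.15.

# Apache 2.4 syntax (mod_authz_core) for the same block would instead be:
# <RequireAll>
#     Require all granted
#     Require not ip 180.76.15
# </RequireAll>
```

A partial address like this matches the whole 180.76.15.0-255 range, but note the crawler may also operate from other ranges.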
Title: Re: Blocking Unwanted Spider Crawls
Post by: yonghan on April 03, 2016, 07:53:07 AM
Hi, please take a look here and test it; it may work for you.

http://webmasters.stackexchange.com/questions/31837/how-to-block-baidu-spiders
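One of the approaches discussed on pages like that one is blocking by User-Agent string in .htaccess. A sketch, assuming mod_rewrite is enabled on the server:

```apache
# Sketch: return 403 Forbidden to any client whose User-Agent
# contains "Baiduspider". Requires mod_rewrite.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F,L]
```

Unlike robots.txt, this does not rely on the crawler's cooperation, though a crawler can still evade it by spoofing its User-Agent.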
Title: Re: Blocking Unwanted Spider Crawls
Post by: ezeeozee on April 03, 2016, 09:07:02 AM
I've tried *all the methods listed on that webpage. None blocked Baiduspider.

*The final suggestion was to block IP addresses. This is the method I have just implemented as of today.

We'll know if this method is successful in a couple of days.

Update: Code 403 Forbidden hits have risen from a static 21 to 3000. So it is possible that blocking IP access via the .htaccess file is having an impact.

I have still had 80,000 hits from Baiduspider in 2 days however...
Title: Re: Blocking Unwanted Spider Crawls
Post by: DevonT65 on April 23, 2016, 02:19:08 PM
Add the following code to your robots.txt:

# Disallow the following crawlers access to the site.

User-agent: TurnitinBot/2.1
Disallow: /

User-agent: TurnitinBot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: EmailWolf
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: AboutUsBot
Disallow: /

User-agent: SurveyBot
Disallow: /

User-agent: robtexbot
Disallow: /
Title: Re: Blocking Unwanted Spider Crawls
Post by: Natashawilliams on January 30, 2017, 07:25:31 AM
Only robots.txt can help you here.
Title: Re: Blocking Unwanted Spider Crawls
Post by: jackluter on April 13, 2017, 05:45:55 AM
Yes, robots.txt is helpful; it acts as a communication standard between a website and crawlers. The robots.txt file instructs crawlers which URLs and pages they may crawl.
Title: Re: Blocking Unwanted Spider Crawls
Post by: marshasarv on April 23, 2018, 01:04:16 AM
The robots.txt file can be used for this purpose. Using a robots.txt file you can allow or disallow crawlers from crawling your site.

The format of robots.txt is:

User-agent: * (insert the name of the search engine you want the rule to apply to. Here * means the rule applies to all search engines.)

Disallow: /filename/ (put the path you don't want to be crawled by the crawlers.)
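Putting that format together, a minimal robots.txt that blocks one bot by name while leaving the rest of the site open might look like this (the /admin/ path is just an illustration, not a real AbanteCart requirement):

```
# Block Baiduspider entirely.
User-agent: Baiduspider
Disallow: /

# All other crawlers: everything allowed except /admin/.
User-agent: *
Disallow: /admin/
```

The file must live at the web root (e.g. example.com/robots.txt) for crawlers to find it.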