Author Topic: Blocking Unwanted Spider Crawls  (Read 1998 times)

Offline ezeeozee

  • Newbie
  • *
  • Posts: 11
  • Karma: +4/-0
    • View Profile
Blocking Unwanted Spider Crawls
« on: March 25, 2016, 07:54:52 AM »
Specifically: Baiduspider, from baidu.com.

So annoying.

Apparently it originates from China or thereabouts and it is bombarding my website with thousands of crawls which are eating up my bandwidth.

I want to block it altogether but can't find an easy way to do this. Anyone know an easy way?

On a previous cart I used, there was a page within Admin listing all the spiders/bots/search engines that had access to the site, and you could allow/disallow access with a click.

Is there anything like this within Abantecart? Or an Extension for this?

Is tinkering with the htaccess file the only option?

Any answers to any of the above would be appreciated even if it means pointing me to where this query has already been answered somewhere.

Thank you.

Offline Basara

  • Administrator
  • Hero Member
  • *****
  • Posts: 2730
  • Karma: +128/-0
    • View Profile
Re: Blocking Unwanted Spider Crawls
« Reply #1 on: March 25, 2016, 07:58:23 AM »

Offline ezeeozee

  • Newbie
  • *
  • Posts: 11
  • Karma: +4/-0
    • View Profile
Re: Blocking Unwanted Spider Crawls
« Reply #2 on: March 25, 2016, 01:13:06 PM »
Brilliant, thank you!

Offline ezeeozee

  • Newbie
  • *
  • Posts: 11
  • Karma: +4/-0
    • View Profile
Re: Blocking Unwanted Spider Crawls
« Reply #3 on: April 03, 2016, 07:13:46 AM »
Doesn't work.

Baiduspider is resistant to attempts to block it through robots.txt and also .htaccess.

Modifying both files has made diddly squat difference to the volume of crawls from Baiduspider.

As a last resort I am trying to block the IP - they use multiple IPs but they all start the same so I have added this to the .htaccess file:

Deny from 180.76.15.

Hopefully, this will work.
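For anyone trying the same thing, here is a slightly fuller .htaccess sketch along these lines (Apache 2.2-style access syntax; mod_setenvif assumed available). It combines the IP prefix above with a User-Agent match, so requests are refused even when Baidu crawls from a different range:

```apache
# Flag any request whose User-Agent claims to be Baiduspider
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot

<Limit GET POST HEAD>
    Order Allow,Deny
    Allow from all
    # The IP prefix mentioned above
    Deny from 180.76.15.
    # Anything flagged by the User-Agent match
    Deny from env=bad_bot
</Limit>
```

On Apache 2.4 the equivalent would use `Require`/`Require not` directives instead of `Order`/`Deny`.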

Online handoyo

  • Sr. Member
  • ****
  • Posts: 386
  • Karma: +83/-1
    • View Profile
Re: Blocking Unwanted Spider Crawls
« Reply #4 on: April 03, 2016, 07:53:07 AM »
Hi, please take a look here and test it. Who knows, it might work for you.

http://webmasters.stackexchange.com/questions/31837/how-to-block-baidu-spiders

Offline ezeeozee

  • Newbie
  • *
  • Posts: 11
  • Karma: +4/-0
    • View Profile
Re: Blocking Unwanted Spider Crawls
« Reply #5 on: April 03, 2016, 09:07:02 AM »
I've tried all the methods listed on that webpage. None blocked Baiduspider.

The final suggestion was to block IP addresses. This is the method I have just implemented as of today.

We'll know if this method is successful in a couple of days.

Update: Code 403 (Forbidden) hits have risen from a steady 21 to 3,000. So it is possible that blocking IP access via the .htaccess file is having an impact.

I have still had 80,000 hits from Baiduspider in 2 days however...
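To pull numbers like these out of a raw Apache access log, a quick grep sketch works. The log lines below are made up for illustration; on a real server you would point the greps at something like /var/log/apache2/access.log:

```shell
# Create a tiny sample access log (made-up lines, for illustration only)
cat > /tmp/access_sample.log <<'EOF'
1.2.3.4 - - [03/Apr/2016:09:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"
180.76.15.10 - - [03/Apr/2016:09:00:01 +0000] "GET /a HTTP/1.1" 403 199 "-" "Baiduspider/2.0"
180.76.15.11 - - [03/Apr/2016:09:00:02 +0000] "GET /b HTTP/1.1" 403 199 "-" "Baiduspider/2.0"
EOF

# Total requests from Baiduspider  -> 2
grep -c "Baiduspider" /tmp/access_sample.log

# How many of those were refused with 403  -> 2
grep "Baiduspider" /tmp/access_sample.log | grep -c ' 403 '
```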
« Last Edit: April 03, 2016, 09:35:45 AM by ezeeozee »

Offline DevonT65

  • Newbie
  • *
  • Posts: 1
  • Karma: +1/-0
    • View Profile
Re: Blocking Unwanted Spider Crawls
« Reply #6 on: April 23, 2016, 02:19:08 PM »
Add the following code to your robots.txt:

# Disallow the following crawlers access to the site.

User-agent: TurnitinBot/2.1
Disallow: /

User-agent: TurnitinBot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: EmailWolf
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: AboutUsBot
Disallow: /

User-agent: SurveyBot
Disallow: /

User-agent: robtexbot
Disallow: /

Offline Natashawilliams

  • Newbie
  • *
  • Posts: 10
  • Karma: +1/-0
  • I am an HR Professional
    • View Profile
Re: Blocking Unwanted Spider Crawls
« Reply #7 on: January 30, 2017, 07:25:31 AM »
Only robots.txt can help you, man.

Offline jackluter

  • Newbie
  • *
  • Posts: 3
  • Karma: +0/-0
  • Hi am jack
    • View Profile
Re: Blocking Unwanted Spider Crawls
« Reply #8 on: April 13, 2017, 05:45:55 AM »
Yes, robots.txt is helpful: it acts as a communication standard between a website and crawlers. The robots.txt file instructs crawlers which URLs and pages they may crawl.
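For completeness, the standard robots.txt entry for Baidu's crawler is below. As noted earlier in this thread, Baiduspider does not always honor it, so treat it as a polite request rather than a hard block:

```
User-agent: Baiduspider
Disallow: /
```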