The Lazy Admin Blog

Dealing with aggressive bot scanners

September 14, 2020 Apache, Litespeed, Wordpress

Google is your friend, but its crawler can sometimes be aggressive enough to consume significant resources on your web servers, and occasionally even crash them. With static sites, or sites that sit behind a decent cache, this shouldn’t be an issue. With complex online stores (mostly Magento / WooCommerce), however, the crawler can get stuck in an “infinite loop” of indexing search results. This happens when a product can be filtered by several attributes. Take a printer, for example: you can search by Make, Model, Type, and Color, and each of those has at least 20 entries. Four filters with 20 values each already allow 20 × 20 × 20 × 20 = 160,000 combinations, and the crawler treats every combination as a separate URL. The product itself becomes almost “invisible” to Google, because it gets indexed once for each available option, and with 1000+ products that’s mission impossible.

If we take one printer as an example (productID 10 in our DB) and filter only by color, Google will see several different search URLs for the same product ID (10):

mysites.com/?productID=10&printer=hp&type=laser&model=A17&color=red
mysites.com/?productID=10&printer=hp&type=laser&model=A17&color=blue
mysites.com/?productID=10&printer=hp&type=laser&model=A17&color=yellow

Adding more filters just makes it worse, again for the same product ID:

mysites.com/?productID=10&printer=hp&type=laser&model=A17&color=red
mysites.com/?productID=10&printer=hp&type=laser&model=A17&color=blue
mysites.com/?productID=10&printer=hp&type=laser&model=A17&color=yellow

mysites.com/?productID=10&printer=hp&type=jet&model=A17&color=red
mysites.com/?productID=10&printer=hp&type=jet&model=A17&color=blue
mysites.com/?productID=10&printer=hp&type=jet&model=A17&color=yellow

On top of these, the page will most likely contain an “Add to cart”, “Add to wishlist”, or “Products per page” control. So if I have 1000 products and a request simply adds “&products-per-page=999” to the URL, a single page load pulls almost the whole catalog and immediately overloads the database.

There are some “webmaster tools” tips & tricks for controlling the crawl rate; here are some references:

https://support.google.com/webmasters/answer/48620
https://support.google.com/webmasters/answer/6080548
https://www.bing.com/webmaster/help/crawl-control-55a30302

We will show how to use a “quick & dirty” blocker on the web server side. It will most likely affect your products’ SEO / indexing, so use it at your own risk.
It should still allow Google to index the product itself, but without the filters that we choose to block.

Allowed – mysites.com/?productID=10
Blocked – mysites.com/?productID=10&products-per-page=999

First solution: HTACCESS Blocking

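RewriteEngine On
# Match the listed crawlers anywhere in the User-Agent, case-insensitively
RewriteCond %{HTTP_USER_AGENT} ^.*(Googlebot|msnbot|bingbot).*$ [NC]
# Match any query string that contains "="
RewriteCond %{QUERY_STRING} (\=)
RewriteRule .* - [F,L]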

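Here we deny any request from the bots listed in the first line to any URI whose query string contains “=”, which in practice means any request that passes a parameter. Note that this also matches mysites.com/?productID=10 itself.
We can modify the HTTP_USER_AGENT or QUERY_STRING patterns as needed to block more specific requests, for example by listing only the parameter names that should stay out of the index.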
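
A narrower variant, as a minimal sketch: the rule above matches every query string that contains “=”, plain ?productID=10 requests included. Assuming the goal is to block only the costly parameters, the pattern below reuses the parameter names from the ModSecurity rule in this post. That list is an assumption and should be adjusted to the parameters your store actually exposes.

```apache
RewriteEngine On
# Same crawler list as above, matched case-insensitively anywhere in the User-Agent.
RewriteCond %{HTTP_USER_AGENT} (Googlebot|msnbot|bingbot) [NC]
# Assumed parameter list: block only when one of these names appears in the query string.
RewriteCond %{QUERY_STRING} (add_to_wishlist|add-to-cart|orderby|products-per-page|product_count) [NC]
# "-" leaves the URL unchanged; [F] answers with 403 Forbidden.
RewriteRule ^ - [F]
```

With this pattern, a crawler request for mysites.com/?productID=10 should pass through, while mysites.com/?productID=10&products-per-page=999 should get a 403.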

Second solution: ModSecurity rule

SecRule REQUEST_HEADERS:User-Agent "@rx (Googlebot|msnbot|bingbot)" "id:999918,chain,status:403,deny,log,msg:'HIGH SCAN FROM GOOGLE(POST)'"
SecRule REQUEST_METHOD "POST" "chain"
SecRule QUERY_STRING "@rx (add_to_wishlist|add-to-cart|orderby|products-per-page|product_count)"

SecRule REQUEST_HEADERS:User-Agent "@rx (Googlebot|msnbot|bingbot)" "id:999919,chain,status:403,deny,log,msg:'HIGH SCAN FROM GOOGLE(GET)'"
SecRule REQUEST_METHOD "GET" "chain"
SecRule QUERY_STRING "@rx (add_to_wishlist|add-to-cart|orderby|products-per-page|product_count)"

In this example, we used two rules for flexibility, but they can be combined if needed.
As in the htaccess case, we can adjust the User-Agent list in the first line of each rule and the QUERY_STRING list in the third one.
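
If a single rule is preferred, the two rules above could be merged along these lines. This is a sketch rather than a tested drop-in: the rule id 999920 and the message text are placeholders chosen for illustration, and the id must not collide with any rule already loaded.

```apache
# One chained rule covering both methods: all three conditions must match before the deny fires.
SecRule REQUEST_HEADERS:User-Agent "@rx (Googlebot|msnbot|bingbot)" "id:999920,chain,status:403,deny,log,msg:'HIGH SCAN FROM GOOGLE(GET/POST)'"
SecRule REQUEST_METHOD "@rx ^(GET|POST)$" "chain"
SecRule QUERY_STRING "@rx (add_to_wishlist|add-to-cart|orderby|products-per-page|product_count)"
```

Keeping the rules separate, as in the example above, gives each method its own id and log message, which makes it easier to tell GET and POST hits apart in the logs.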

Tags: google, htaccess, modsecurity