Google is your friend, but sometimes its crawler can be a little aggressive, consuming resources from your webservers (and sometimes even crashing them). With static sites, or at least sites behind a decent cache, this shouldn’t be an issue, but with complex online stores (mostly Magento / WooCommerce) you can sometimes get yourself into an “infinite loop” of search indexing. This “infinite loop” can occur when a product has multiple attributes to filter by. Take a printer, for example: you can filter by Make, Model, Type and Color, and each of those has at least 20 entries. With so many filter combinations, the product becomes almost “invisible” to Google, since the crawler indexes it once for every available option, and with 1000+ products that’s mission impossible.
If we take one printer as an example (product ID 10 in our DB) and use only color for filtering, Google will see multiple different search URLs for the same product ID (10) –
mysites.com/?productID=10&printer=hp&type=laser&model=A17&color=red
mysites.com/?productID=10&printer=hp&type=laser&model=A17&color=blue
mysites.com/?productID=10&printer=hp&type=laser&model=A17&color=yellow
Adding more filters only makes it worse; again, for the same product ID –
mysites.com/?productID=10&printer=hp&type=laser&model=A17&color=red
mysites.com/?productID=10&printer=hp&type=laser&model=A17&color=blue
mysites.com/?productID=10&printer=hp&type=laser&model=A17&color=yellow
mysites.com/?productID=10&printer=hp&type=jet&model=A17&color=red
mysites.com/?productID=10&printer=hp&type=jet&model=A17&color=blue
mysites.com/?productID=10&printer=hp&type=jet&model=A17&color=yellow
On top of this, the page will most likely contain an “Add to cart”, “Add to wishlist” or “Products per page” control, so with 1000 products in the store, simply appending “&products-per-page=999” to the URL immediately overloads the database.
There are some “webmaster tools” tips & tricks on how to control the crawl rate; here are some references –
https://support.google.com/webmasters/answer/48620
https://support.google.com/webmasters/answer/6080548
https://www.bing.com/webmaster/help/crawl-control-55a30302
Here we will show a “quick & dirty” webserver-side blocker. It will most likely affect your products’ SEO / indexing, so use it at your own risk.
This should still allow Google to index the product itself, while blocking the filter parameters that we choose.
Allowed – mysites.com/?productID=10
Blocked – mysites.com/?productID=10&products-per-page=999
First solution: .htaccess blocking
# Deny any query-string request from the listed bots (requires mod_rewrite;
# RewriteEngine On may already be set elsewhere in your .htaccess)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*(Googlebot|msnbot|bingbot).*$ [NC]
RewriteCond %{QUERY_STRING} (\=)
RewriteRule .* - [F,L]
Here we deny any request from the bots listed in the first RewriteCond to any URL whose query string contains “=”, which covers every filtered URL.
We can modify the HTTP_USER_AGENT or QUERY_STRING lists as needed to block more specific requests, as in the sketch below.
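For example, here is a minimal sketch of a more targeted variant: it only denies the listed bots when the query string contains one of the “heavy” parameters also used in the ModSecurity rules below. The parameter names are assumptions taken from that list and should be adapted to your store:

RewriteEngine On
# Match the crawlers we want to restrict
RewriteCond %{HTTP_USER_AGENT} (Googlebot|msnbot|bingbot) [NC]
# ...but only when the query string contains one of these resource-heavy parameters (adjust to your store)
RewriteCond %{QUERY_STRING} (add_to_wishlist|add-to-cart|orderby|products-per-page|product_count) [NC]
RewriteRule .* - [F,L]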
Second solution: ModSecurity rule
# Block POST requests with the listed query-string parameters from the listed bots
SecRule REQUEST_HEADERS:User-Agent "@rx (Googlebot|msnbot|bingbot)" "id:999918,chain,status:403,deny,log,msg:'HIGH SCAN FROM GOOGLE(POST)'"
SecRule REQUEST_METHOD "POST" "chain"
SecRule QUERY_STRING "@rx (add_to_wishlist|add-to-cart|orderby|products-per-page|product_count)"

# Same as above, but for GET requests
SecRule REQUEST_HEADERS:User-Agent "@rx (Googlebot|msnbot|bingbot)" "id:999919,chain,status:403,deny,log,msg:'HIGH SCAN FROM GOOGLE(GET)'"
SecRule REQUEST_METHOD "GET" "chain"
SecRule QUERY_STRING "@rx (add_to_wishlist|add-to-cart|orderby|products-per-page|product_count)"
In this example, we used two rules for flexibility, but we can combine them if needed.
As in the .htaccess case, we can adjust the User-Agent list in the first rule of each chain and the QUERY_STRING pattern in the last one.
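As a rough sketch of the combined version mentioned above (my own variant, not from the original rules): dropping the REQUEST_METHOD check lets a single chain cover both GET and POST. The rule id 999920 is an arbitrary placeholder; pick one that is free in your rule set.

SecRule REQUEST_HEADERS:User-Agent "@rx (Googlebot|msnbot|bingbot)" "id:999920,chain,status:403,deny,log,msg:'HIGH SCAN FROM GOOGLE (GET/POST)'"
SecRule QUERY_STRING "@rx (add_to_wishlist|add-to-cart|orderby|products-per-page|product_count)"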