spider – Giant Geek Blog

ads.txt file

Crawlers expect to find many files in well-known locations on websites; one such file is ads.txt. Even if you don’t run paid advertisements, crawlers may still request this file, which leads to HTTP 404 errors in your logs. To prevent the error and signal that no advertising sellers are authorized for your site, you can add the file with placeholder values as follows:

In the root of your website, create a new file with the name ads.txt.

#ads.txt - no DIRECT or RESELLER
www.example.com, placeholder, DIRECT, placeholder 
# NONE

NOTE: If you ever do use an advertiser, they will generally inform you as to changes to make to this file.
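
Each data line in the file is a comma-separated record: domain, account ID, relationship (DIRECT or RESELLER), and an optional fourth ID. If you want to sanity-check the file before deploying it, something like the following might do. This is only a minimal sketch; the class name AdsTxtRecord and its fields are made up for illustration, and it ignores other line types that the ads.txt specification allows.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper: parses one ads.txt data record into its fields. */
final class AdsTxtRecord {
    final String domain;
    final String accountId;
    final String relationship;    // DIRECT or RESELLER
    final String certAuthorityId; // optional fourth field, null when absent

    private AdsTxtRecord(String domain, String accountId,
                         String relationship, String certAuthorityId) {
        this.domain = domain;
        this.accountId = accountId;
        this.relationship = relationship;
        this.certAuthorityId = certAuthorityId;
    }

    /** Returns empty for blank lines, comment lines, and anything malformed. */
    static Optional<AdsTxtRecord> parse(String line) {
        // Everything after '#' is a comment.
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
        if (data.isEmpty()) {
            return Optional.empty();
        }
        String[] fields = data.split(",");
        if (fields.length < 3 || fields.length > 4) {
            return Optional.empty();
        }
        String relationship = fields[2].trim().toUpperCase(Locale.ROOT);
        if (!relationship.equals("DIRECT") && !relationship.equals("RESELLER")) {
            return Optional.empty();
        }
        String cert = fields.length == 4 ? fields[3].trim() : null;
        return Optional.of(
                new AdsTxtRecord(fields[0].trim(), fields[1].trim(), relationship, cert));
    }
}
```

Feeding it the three lines of the placeholder file should yield one record and two empty results (the comment lines).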


Java User-Agent detector and caching

It’s often important for a server-side application to understand the client platform. There are two common methods used for this.

On the client itself, “capabilities” can be tested.
Unfortunately, the server cannot easily test these, and as such must usually rely upon the HTTP Header information, notably “User-Agent”.

A User-Agent string for a common desktop browser typically looks like the example below; developers can usually determine the platform from it without a lot of work.
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; InfoPath.3)"
Identifying robots and mobile platforms is, unfortunately, a lot more difficult because of the variations. Libraries such as the one described below simplify this work by exposing the commonly expected attributes as standard Java objects.
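
To see why hand-rolled detection gets difficult, here is a deliberately naive sketch that classifies a User-Agent by looking for a few substrings. The class name, the categories and the keyword lists are all assumptions picked for illustration, not taken from any library; real traffic contains many strings that such a list misses or misclassifies.

```java
import java.util.Locale;

/** Hypothetical, deliberately naive User-Agent classifier (illustration only). */
final class NaiveUserAgentSniffer {

    enum Kind { ROBOT, MOBILE, DESKTOP, UNKNOWN }

    static Kind classify(String userAgent) {
        if (userAgent == null || userAgent.trim().isEmpty()) {
            return Kind.UNKNOWN;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        // Robots are checked first because some of them also mention a platform.
        if (ua.contains("bot") || ua.contains("spider") || ua.contains("crawler")) {
            return Kind.ROBOT;
        }
        if (ua.contains("mobile") || ua.contains("android") || ua.contains("iphone")) {
            return Kind.MOBILE;
        }
        if (ua.contains("windows nt") || ua.contains("macintosh") || ua.contains("x11")) {
            return Kind.DESKTOP;
        }
        return Kind.UNKNOWN;
    }
}
```

The MSIE 8 string above comes out as DESKTOP, but any robot that does not announce itself with one of the three keywords slips through, which is the kind of gap a maintained library is meant to cover.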

With Maven, the dependencies are all resolved with the following POM addition:
<dependency>
    <groupId>net.sf.uadetector</groupId>
    <artifactId>uadetector-resources</artifactId>
    <version>2014.10</version>
</dependency>

/* Get an UserAgentStringParser and analyze the requesting client */
final UserAgentStringParser parser = UADetectorServiceFactory.getResourceModuleParser();
final ReadableUserAgent agent = parser.parse(request.getHeader("User-Agent"));

out.append("You're a '");
out.append(agent.getName());
out.append("' on '");
out.append(agent.getOperatingSystem().getName());
out.append("'.");

As the website documentation notes, running this query for each request uses valuable server resources, so it’s best to cache the responses to minimize the impact!

http://uadetector.sourceforge.net/usage.html#improve_performance

NOTE: The caching example on the website is hard to copy and paste; here’s a cleaner copy.

/*
 * COPYRIGHT.
 */
package com.example.cache;

import java.util.concurrent.TimeUnit;

import net.sf.uadetector.ReadableUserAgent;
import net.sf.uadetector.UserAgentStringParser;
import net.sf.uadetector.service.UADetectorServiceFactory;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

/**
 * Caching User Agent parser
 *
 * @see http://uadetector.sourceforge.net/usage.html#improve_performance
 * @author Scott Fredrickson [skotfred]
 * @since 2015jan28
 * @version 2015jan28
 */
public final class CachedUserAgentStringParser implements UserAgentStringParser {

    private final UserAgentStringParser parser = UADetectorServiceFactory.getCachingAndUpdatingParser();

    private static final int CACHE_MAX_SIZE = 100;

    private static final int CACHE_MAX_HOURS = 2;

    /**
     * Limited to 100 elements for 2 hours!
     */
    private final Cache<String, ReadableUserAgent> cache = CacheBuilder.newBuilder()
            .maximumSize(CACHE_MAX_SIZE)
            .expireAfterWrite(CACHE_MAX_HOURS, TimeUnit.HOURS)
            .build();

    /**
     * @return {@code String}
     */
    @Override
    public String getDataVersion() {
        return parser.getDataVersion();
    }

    /**
     * @param userAgentString {@code String}
     * @return {@link ReadableUserAgent}
     */
    @Override
    public ReadableUserAgent parse(final String userAgentString) {
        ReadableUserAgent result = cache.getIfPresent(userAgentString);
        if (result == null) {
            result = parser.parse(userAgentString);
            cache.put(userAgentString, result);
        }
        return result;
    }

    /**
     *
     */
    @Override
    public void shutdown() {
        parser.shutdown();
    }
}
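
The parse method above follows a simple look-up-then-load pattern: check the cache, call the expensive parser only on a miss, and store the result. If you want to try that pattern without pulling in Guava, here is a minimal sketch using only the JDK. The class name LruMemoizer is made up for illustration, and unlike the Guava cache above it only limits the number of entries; it has no time-based expiry.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical size-bounded cache in front of an expensive function. */
final class LruMemoizer<K, V> {

    private final Function<K, V> loader;
    private final Map<K, V> cache;

    LruMemoizer(final int maxSize, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder=true keeps entries in least-recently-used order;
        // removeEldestEntry drops the oldest one once maxSize is exceeded.
        this.cache = Collections.synchronizedMap(
                new LinkedHashMap<K, V>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                        return size() > maxSize;
                    }
                });
    }

    /** Returns the cached value, computing and storing it on a miss. */
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key);
            cache.put(key, value);
        }
        return value;
    }

    int size() {
        return cache.size();
    }
}
```

As in the class above, two threads that miss on the same key at the same moment may both call the loader; for an idempotent parser that only costs a little duplicate work.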


Prevent Robots from indexing portions of content

Yahoo! initially introduced a CSS class that can be used to notify robots/spiders that a specific section or fragment of content should not be included for search purposes.

class="robots-nocontent"


robots-nocontent

SEO is always a tricky matter because it’s always changing. Way back in 2007, Yahoo! added a means to ‘hide’ specific content on your page from its spider with the use of a CSS class that can be applied anywhere on your page. True, this can be abused, but it is generally a good way to keep common content such as navigation and/or ads out of the index. Unfortunately, only Yahoo! supports this.

class="robots-nocontent"

Seeding the search engines for free…

It gets harder each day to do this, but here are a few resources to initiate your search engine listings.

MSN Search
Google
Accoona
dmoz.org

Technical:
Netcraft
Alexa (info.txt)
geoURL
geoTags (not available as of May 2006?)

Paid?:
Yahoo! / AllTheWeb

ROBOTS.TXT

I’ve been asked about this file on many projects I’ve worked on. It resides in the root of the website and has no external references to it; however, there are usually a lot of requests for it in the server logs (or “404 Not Found” errors if it doesn’t exist).

Additionally, automated security audit software will often indicate that this file itself is a possible security problem as it can expose hidden areas of your website (more on this later).

Here’s what it’s all about….

ROBOTS.TXT is used by spiders and robots, primarily so that they can index your website for search engines (which is usually a good thing). However, there are times when you don’t want this to occur. Some spiders/robots can be too aggressive on your servers, consuming precious bandwidth and CPU, or they can dig too deep into your content. As such, you might want to control their access.

The Robots Exclusion Protocol sets out several ways to accomplish this goal. Of course, this only works if the spider complies with the convention.
1. ROBOTS.TXT can be used to limit the access:

Example that blocks only the images, JavaScript and CSS folders:

#robots.txt - for info see http://www.robotstxt.org/wc/robots.html
User-agent: *
Disallow: /images/
Disallow: /js/
Disallow: /css/

2. A <meta> tag on each webpage indicating spider actions to take.

<html>
  <head>
    <title>example</title>
    <meta name="robots" content="INDEX, FOLLOW, ALL" />
  </head>
  <body>
    ...
  </body>
</html>

Values (there are a few others, but these are the most common):
INDEX – index this page
NOINDEX – do not index this page
FOLLOW – follow links from this page
NOFOLLOW – do not follow links from this page
ALL – same as INDEX, FOLLOW
NONE – same as NOINDEX, NOFOLLOW
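
To make the values above concrete, here is a minimal sketch of how a well-behaved robot might turn the content attribute into two yes/no decisions. The class name RobotsMeta is hypothetical, and two behaviours are assumptions on my part rather than anything stated above: a page with no restriction is treated as INDEX, FOLLOW, and when values conflict the restrictive one wins.

```java
import java.util.Locale;

/** Hypothetical interpretation of a robots meta tag's content attribute. */
final class RobotsMeta {
    final boolean index;
    final boolean follow;

    private RobotsMeta(boolean index, boolean follow) {
        this.index = index;
        this.follow = follow;
    }

    static RobotsMeta parse(String content) {
        // Assumed defaults: index and follow unless told otherwise.
        boolean index = true;
        boolean follow = true;
        for (String token : content.split(",")) {
            switch (token.trim().toUpperCase(Locale.ROOT)) {
                case "NOINDEX":
                    index = false;
                    break;
                case "NOFOLLOW":
                    follow = false;
                    break;
                case "NONE":
                    index = false;
                    follow = false;
                    break;
                default:
                    // INDEX, FOLLOW, ALL and unknown values keep the defaults.
                    break;
            }
        }
        return new RobotsMeta(index, follow);
    }
}
```

With this reading, the content value from the example page ("INDEX, FOLLOW, ALL") leaves both flags on, while "NOINDEX, FOLLOW" keeps the page out of the index but still lets the spider follow its links.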

In most cases, a spider/robot will first request the ROBOTS.TXT file, and then start indexing the site. You can exclude all or specific spiders from individual files or directories.
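
From the spider’s side, that first request boils down to reading the Disallow lines that apply to it and checking each URL path against them before fetching. A minimal sketch of that check follows. The class name SimpleRobotsTxt is made up, and it makes several simplifying assumptions: it only honours the "User-agent: *" group, treats each Disallow value as a plain path prefix, and ignores Allow lines and wildcards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical, simplified robots.txt reader for the "User-agent: *" group. */
final class SimpleRobotsTxt {

    private final List<String> disallowedPrefixes = new ArrayList<>();

    SimpleRobotsTxt(String robotsTxt) {
        boolean inWildcardGroup = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            // Strip comments, which start with '#'.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                disallowedPrefixes.add(value);
            }
        }
    }

    /** True when no collected Disallow prefix matches the start of the path. */
    boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

Given the example file from earlier, a path such as /images/logo.png would be refused while /index.html would be allowed.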

As for the earlier bit on security: since this file is available to anyone, you should never list sensitive areas of your website in it, as that would give anyone an easy way to find those areas.
