Website testing with SortSite

SortSite is a popular desktop application for testing websites for broken links, browser compatibility, accessibility issues, and common spelling errors. It is also available as a hosted web application known as "OnDemand".

You can generate a free sample test of your website at:
http://try.powermapper.com/Demo/SortSite

“msapplication-config” and browserconfig.xml

Windows 8 / MSIE 11 introduced Tiles; as a result, server administrators may have started seeing HTTP 404 errors in their logs as the browser looks for a "browserconfig.xml" file at the root of each website domain. If you intend to use this file, you should read Microsoft's documentation on how to make the best use of it. Others may simply wish to stop the error from adding "noise" to their log files.

To remove the error, add the following to your pages; alternatively, you can set the 'content' attribute to the URL of your browserconfig file:

<meta name="msapplication-config" content="none" />

Alternatively, you can place an empty /browserconfig.xml at the root of each domain on your web server.
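
If you take that route, a minimal well-formed skeleton (rather than a zero-byte file) keeps strict XML parsers happy; this is just a sketch, not something the spec requires:

<?xml version="1.0" encoding="utf-8"?>
<browserconfig>
  <msapplication/>
</browserconfig>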

A common example of how to use this file is below:

<?xml version="1.0" encoding="utf-8"?>
<browserconfig>
<msapplication>
<tile>
<square70x70logo src="/mstile-70x70.png"/>
<square150x150logo src="/mstile-150x150.png"/>
<wide310x150logo src="/mstile-310x150.png"/>
<square310x310logo src="/mstile-310x310.png"/>
<TileColor>#8bc53f</TileColor>
<TileImage src="/mstile-150x150.png" />
</tile>
</msapplication>
</browserconfig>

Custom 404 Page for Tomcat web applications

This is a relatively common requirement in JSP-based applications, and it requires an understanding of the deployment configuration. It is further complicated if you run Apache HTTPD in front of the Apache Tomcat server, because you then need to know which of the two servers actually handles each request; a sketch of that split is shown below.
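
For example, assuming a typical mod_proxy setup where HTTPD forwards /app to Tomcat on localhost:8080 (the context path and port are assumptions for illustration), Tomcat's error pages only apply to the proxied requests, and HTTPD needs its own 404 handler for everything else:

# httpd.conf sketch - requires mod_proxy and mod_proxy_http
ProxyPass        /app http://localhost:8080/app
ProxyPassReverse /app http://localhost:8080/app

# Requests under /app reach Tomcat, so the web.xml <error-page> below applies to them;
# requests served directly by HTTPD need their own error handler:
ErrorDocument 404 /404.html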

For this example, we will use the standard 404 error, but you can also intercept other errors for custom pages.

  1. Create 404.jsp:

    <% final SimpleDateFormat simpleDate = new SimpleDateFormat("EE MMM dd yyyy hh:mm:ss aa zzz");
    final String dttm = simpleDate.format(new Date()); %>
    <html>
    <title>404 Not Found</title>
    <ul>
    <li>Time: <%= dttm %></li>
    <li>User-Agent: <%= request.getHeader("User-Agent") %></li>
    <li>Server: <%= request.getServerName() %></li>
    <li>Request: <%= request.getRequestURI() %></li>
    <li>Remote: <%= request.getRemoteAddr() %></li>
    <li>Referer: <%= request.getHeader("Referer") %></li>
    </ul>
    </html>
  2. In WEB-INF/web.xml, add the following (NOTE: the placement of this element within the file matters, but that is outside the scope of this post):

    <error-page>
    <error-code>404</error-code>
    <location>/404.jsp</location>
    </error-page>
  3. You might want to force the HTTP header to return something other than a '404' status code; otherwise MSIE will replace the page with its own unstyled 'friendly error message' unless the user has turned that default setting off. Unfortunately, this also means that search engines might index pages that should not exist. A sketch of this approach follows below.
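
A minimal sketch of that trade-off, added at the very top of 404.jsp (this is my assumption about how you might do it, not part of the original example; remember it makes the "missing" page look like a normal one to crawlers):

<%-- return 200 OK instead of 404 so MSIE renders this page rather than its own --%>
<% response.setStatus(HttpServletResponse.SC_OK); %>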

ROBOTS.TXT

I’ve been asked about this file on many projects I’ve worked on. It resides in the root of the website and nothing links to it externally; however, there are usually a lot of requests for it in the server logs (or "404 Not Found" errors if it doesn’t exist).

Additionally, automated security audit software will often indicate that this file itself is a possible security problem as it can expose hidden areas of your website (more on this later).

Here’s what it’s all about….

ROBOTS.TXT is read by spiders and robots, primarily so that they can index your website for search engines (which is usually a good thing). However… there are times when you don’t want this to occur. Some spiders/robots can be too aggressive on your servers, consuming precious bandwidth and CPU, or they can dig too deep into your content. In those cases you might want to control their access.

The Robots Exclusion Protocol sets out a couple of ways to accomplish this goal; of course, the spider must voluntarily comply with the convention.
1. ROBOTS.TXT can be used to limit access:

An example that blocks crawling of only the images, javascript, and css folders:

#robots.txt - for info see http://www.robotstxt.org/wc/robots.html
User-agent: *
Disallow: /images/
Disallow: /js/
Disallow: /css/

2. A <meta> tag on each web page indicating which actions the spider should take.

<html>
<head>
<title>example</title>
<meta name="robots" content="INDEX, FOLLOW, ALL" />
</head>
<body>
...
</body>
</html>

Values (there are a few others, but these are the most common):
INDEX – index this page
NOINDEX – do not index this page
FOLLOW – follow links from this page
NOFOLLOW – do not follow links from this page
ALL – same as INDEX, FOLLOW
NONE – same as NOINDEX, NOFOLLOW

In most cases, a spider/robot will request the ROBOTS.TXT file first and then start indexing the site. You can exclude all spiders, or specific ones, from individual files or directories; an example follows.
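
As a sketch (the bot name and path here are made-up placeholders; real crawlers publish their User-agent strings), a ROBOTS.TXT that blocks one specific spider from one directory while leaving the rest of the site open could look like:

# block only the hypothetical "ExampleBot" from the /reports/ directory
User-agent: ExampleBot
Disallow: /reports/

# all other spiders may crawl everything
User-agent: *
Disallow: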

As for the earlier note on security: since this file is available to anyone, you should never list sensitive areas of your website in it, as that would give an attacker an easy map to those areas.
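
To make that concrete, here is a hypothetical anti-pattern (the path is invented); the Disallow line publicly advertises the very area it is trying to hide, so protect such areas with authentication instead:

# anti-pattern: do NOT do this for genuinely sensitive content
User-agent: *
Disallow: /internal-admin/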