January 7, 2008

Is Page Size Important?

When it comes to search engine optimization, does size matter? Most SEO gurus recommend pages be kept under 40K, while others say up to 100K is fine. I thought it might be interesting to do a quick and dirty study to see if page size was important.

First I needed keywords. To be relevant to the market we are in, I chose the first ten keyword phrases from a list of keywords that rose in value in the first nine months of 2007. They all had AdSense values of $2 to $6. I put each phrase in quotes and searched Google. For each of the top ten results I noted the size of the file, and whether or not the keyword phrase was in the URL and/or title of the page. That gave me 100 values for page size of top-ranked pages.
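The title and URL check could be sketched in PHP along these lines. This is only an illustration: the sample phrase, the sample titles, the helper names normalize_text() and phrase_in(), and the matching rule (ignore case and punctuation) are all assumptions, not the exact procedure used for the study.

```php
<?php
// Hypothetical helpers: the matching rule here is an assumption, not the study's method.

// Lower-case the text and collapse every run of non-alphanumeric characters
// into a single space, so "Silent-Movie_Stars" compares equal to "silent movie stars".
function normalize_text(string $text): string
{
    $text = strtolower($text);
    $text = preg_replace('/[^a-z0-9]+/', ' ', $text);
    return trim($text);
}

// True when the whole phrase appears, in order, inside a page title or URL.
function phrase_in(string $phrase, string $haystack): bool
{
    return strpos(normalize_text($haystack), normalize_text($phrase)) !== false;
}

$phrase = 'silent movie stars'; // made-up phrase for illustration
var_dump(phrase_in($phrase, 'Biographies of Silent Movie Stars'));
var_dump(phrase_in($phrase, 'http://www.example.com/silent-movie-stars.html'));
var_dump(phrase_in($phrase, 'Movie stars of the silent era'));
```

The first two calls report a match and the third does not, because the words are there but not as the exact phrase.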

Next, I took the same ten keyword phrases, and searched again, this time recording the results from page 10 of the Google results. What do you guess the results showed? Well, not to keep you in suspense, here are the figures I came up with from this little study:

Results from the tenth page of search results were, on average, about 6% smaller than those on the first page of results. Average page size for the 100 page one results was 36.7K, while page ten results averaged 34.5K. Page size for the page one results ranged from just 1K (a title to a flash presentation) to 117K, while for the page ten results the range was 4K to 134K. Given the small sample size, a 6% difference may be due to chance, but more likely it reflects a small but significant preference for larger pages as one of the 200+ factors considered when ranking a page.
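For anyone who wants to check the arithmetic, here is how the 6% figure falls out of the two averages. The variable names are mine; the numbers are the ones reported above.

```php
<?php
// Average page sizes from the study, in kilobytes.
$pageOneAvg = 36.7;
$pageTenAvg = 34.5;

// Relative difference, measured against the page one average.
$percentSmaller = ($pageOneAvg - $pageTenAvg) / $pageOneAvg * 100;

printf("Page ten results were %.1f%% smaller on average\n", $percentSmaller);
// prints: Page ten results were 6.0% smaller on average
```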

Looking at the results for keywords, I was mildly surprised to find that exactly the same percentage of sites in both groups, 16%, had the keyword in the URL (generally the page name). For page titles, however, there was a substantial difference: 66% of the page one results had the exact keyword phrase in the page title, while only 53% of the page ten results did.

I’m working on a program to conduct some statistical keyword research, so keep visiting this blog to see what comes of it. Meanwhile, I wouldn’t worry too much about page size, within reasonable limits. It is probably better, overall, to have lots of smaller pages than fewer larger ones, but just do whatever suits the data you have to present, and tweak it if necessary, based on results.

Putting keywords in the title of your pages is basic. Having them in the URL is not so important, though it can’t hurt. Larger pages are beneficial so long as you can produce as many of them as you would smaller pages. If not, opt for the greater number, since it allows you to target more keywords.

December 10, 2007

Random Elements

If you have a database of information that is larger than you want to display on your site at one time, you will probably want to show a random selection that changes periodically. This is often the case when you are using some part of a large database to provide extra content for one site, while on another site it might be the main content and used in full. At least some of the sites in your Web Empire should have dynamic content updated automatically. This can be very beneficial for your search engine ranking, if done properly.

The search engines love sites that are frequently updated. Blogs that get a new page every day are ranked higher than the same data would be on a static site — there is almost an implicit assumption that the site will continue to be updated and enlarged, and hence deserves higher ranking.

However, if a search engine spider comes to your site and records different information each time, even just minutes apart, it will recognize it as dynamic content, and much of the potential benefit is lost. Also, when users notice that they can see a different joke or some other content element on each visit, they will keep refreshing the page just to read that part of it, distorting your visitor statistics.

Likewise, if you have a part of a page with ‘today in history’ or similar content that updates daily, the search engines are presumably programmed to recognize that, and give it no extra rank (or very little) for freshness. The exception to this is when the URL to the content itself changes daily — so the search engine sees it as new content, and adds it to the index, giving you more apparent pages than a visitor actually sees.

An example of this technique might be drawn from a database of biographical sketches of silent-movie screen stars. If you feature a link to just one on your site, but it changes daily, with a URL like:

www.mysite.com/bio=Charlie_Chaplin

one day and

www.mysite.com/bio=Buster_Keaton

the next, the search engine is more likely to visit frequently, and both those URLs will be in the search engine results.

Of course, you wouldn’t use a page like that on a site devoted to silent movie stars, but you could use it on a site about silent movies, or the roaring 20s, or other related subjects. On another site the biographical sketch may be part of the data for each star, which would also include a filmography, quotes, photo and other content. You do not want to duplicate the pages on your other site and risk irrelevancy due to duplicate content screening.
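The daily-changing link could be sketched like this. Note that this version simply steps through the list in order, one name per day, rather than picking at random. The $stars list (including the third name) and the featured_bio_url() function are hypothetical; on a real site the names would come from your database.

```php
<?php
// Hypothetical list; on a real site these would come from the database.
$stars = ['Charlie_Chaplin', 'Buster_Keaton', 'Harold_Lloyd'];

// Return the featured link for a given moment. The result changes once per day.
function featured_bio_url(array $stars, int $timestamp): string
{
    $dayNumber = intdiv($timestamp, 86400);   // whole days since the UNIX epoch
    $index = $dayNumber % count($stars);      // step through the list, one per day
    return 'www.mysite.com/bio=' . $stars[$index];
}

echo featured_bio_url($stars, time()), "\n";
```

Passing the timestamp in, instead of calling time() inside the function, makes it easy to check what the link will be tomorrow or next week.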

When content is inserted into your page, instead of being accessed by its own URL, it is better to update that part of your content every week or two, preferably at irregular intervals. This looks much more like a site being manually updated, and earns the highest freshness rating you can get without actually needing to add new material.

To use this technique with PHP, you just need to generate a couple of random numbers. Normally, PHP seeds its pseudo-random generator for you (traditionally from the server’s clock), and the rand() function returns a random number in the range you specify:

rand(5,10)

This would return any of the numbers 5, 6, 7, 8, 9, or 10, apparently at random. There is no need to seed the random number generator, but you can, and therein lies the secret of controlling how often your ‘random’ content is updated. The main use for the ‘seed’ is to let you get the same random sequence over and over, something scientists often need when randomizing results across multiple trials. We can use this feature to ensure that content gets updated at whatever frequency we want. The trick is to seed the random generator with a number that changes only as often as we want the content to change. We do this by dividing the current time by the length of time the content should remain unchanged. So, for example, you might use:

$seed = floor(time()/86400);
srand($seed);
$recordnum=rand(1,$maxrecords);

The srand() function seeds the random number generator. It takes an integer value, so we use floor() to round the result of our division down to a whole number. The time() function returns a ten-digit integer representing the current UNIX time on the server, in seconds. The number 86400 is the number of seconds in one day. Our $seed value stays the same for one day and changes at midnight UTC, so this code would return a new $recordnum each day, between one and the maximum number of records available, which you have presumably already set in $maxrecords.
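Wrapped up as a function, the same idea might look like the sketch below. The name pick_record() and its parameters are my own invention. It uses mt_srand() and mt_rand(), which on PHP 7.1 and later share a generator with srand() and rand(), and it re-seeds afterwards so that any other random numbers on the page stay unpredictable.

```php
<?php
// Pick a record number that stays fixed for $period seconds, then changes.
// pick_record() and its parameters are illustrative names, not a library API.
function pick_record(int $maxrecords, int $period = 86400, ?int $now = null): int
{
    $now = $now ?? time();
    $seed = intdiv($now, $period);        // same value for the whole period
    mt_srand($seed);                      // a fixed seed gives a repeatable sequence
    $recordnum = mt_rand(1, $maxrecords); // first number of that sequence
    mt_srand();                           // re-seed randomly for any later callers
    return $recordnum;
}

$today = pick_record(500); // 500 is a placeholder for your own record count
echo "Showing record $today\n";
```

Calling pick_record(500, 7 * 86400) would hold the same record for a week instead of a day.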

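To have your update occur every seven to ten days, at irregular intervals, we add another random number. That number must itself come from a seeded generator, or it would change on every page load and drag the content along with it. Here the interval is held steady for 30 days at a time (the content can also change when a 30-day block rolls over), and $recordnum is then picked with rand() as before:

srand(floor(time()/(30*86400))); // same seed for 30 days, so $num does not change per page load
$num=rand(7,10) * 86400;
$seed=floor(time()/$num);
srand($seed);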

Now your Web Empire has frequently updated information, with no further effort on your part — you can go on to work on your next website to add to your Empire. Visitors and search engines will see the data changing fairly frequently, but not so often or so regularly as to appear like it is inserted dynamic content — though that is exactly what it is.

December 3, 2007

Adding a Sitemap to a Wordpress Blog

One thing I’ve never liked about blogs is the failure to include a single list of all posts by title. The archive function returns full results, and you have to click through page after page to see what is there. If the titles are informative, they give clue enough to what each post is about, so why not have a simple list of all posts?

I looked briefly at existing WordPress plugins to see if someone had something simple, but they tended toward XML sitemaps that were Google compliant, or fancy versions of the existing archive function. I just wanted a simple linked list of all titles, so those were overkill. Rather than waste even more time searching, I just wrote a simple script. Here is the main file, which I call as a PHP include:

<ul>
<?php
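// Table name: swap wp_ for whatever prefix your blog uses.
$ptable='wp_posts';
// The old mysql_* functions were removed in PHP 7, so query through WordPress's $wpdb object.
global $wpdb;
$sql="select ID,post_title from $ptable where post_type='post' and post_status='publish'";
$rows=$wpdb->get_results($sql);
if ($rows)
{
  foreach ($rows as $row)
  {
    // Escape the title and URL so stray characters cannot break the markup.
    $ptitle=esc_html($row->post_title);
    $purl=esc_url(get_permalink($row->ID));
    echo("<li><a href=\"$purl\">$ptitle</a></li>\n");
  }
}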
?>
</ul>

Just change the $ptable variable to begin with whatever prefix your blog uses, and call this with an include from your theme. Exactly how you call it is difficult to explain, because it changes from one theme to the next. First, decide where you are going to put it. On some blogs I created a new page called ‘Site Index’ or something similar, and called the sitemap when that page was displayed. On this site I have just added it to the existing ‘about’ page; the link is at the upper right corner of every page.

To have the sitemap code included only on that page, I added an if() clause to the theme. Some themes have a ‘page’ file, and it goes there. Others don’t have that, and it needs to go in the ‘index’ file. No doubt there are themes that use other files, so you may need to experiment. Then, in the theme file, find which part of the code displays the content, and replace it with your if() clause, keeping the existing code in the else branch. Suppose you saved the sitemap file as map.php and the page you want to include it on is called ‘Sitemap’. You would include something like this:

if(is_page('Sitemap')){ include('map.php');}else{ [existing content code] }

A sitemap helps ensure that every page on your blog gets indexed by the search engines, and also provides a convenient way for new readers to become familiar with your content.

November 19, 2007

Duplicate Layout Penalty

There is a lot of talk on the forums dedicated to SEO and Internet business issues about something called a ‘duplicate content penalty’. Several people maintain that there is no such thing, or that it should be referred to as a duplicate content filter.

Well, there is very definitely a duplicate content penalty. And there is a duplicate content filter. I have discovered that there is even a duplicate layout penalty. I googled ‘duplicate layout penalty’ and got zero results, so nobody is talking about that, but I will.

First, there are several types of duplicate content, and various penalties and filters apply. And the criteria by which the penalties or filters are applied are neither clear nor consistent. The worst-case example of duplicate content is a mirror site. Make a second copy of your site, and you will find one or both removed from the search engine listings. But sites with a valid need for mirror sites, like sourceforge.net or php.net, do not get penalized. Different mirrors often have different pagerank, but that may be due to differences in incoming links, server response time, and other ranking factors that have nothing to do with content.

The next level of duplicate content is the reprint article site. If all of the pages on your site consist of free reprint articles, you will likely be penalized with low pagerank. But some sites with nothing but reprint articles fare well — articlecity.com has pr6 rank for their home page.

So clearly, duplicate content is a FACTOR in page rank, but not the only consideration. I looked at some one-year-old free reprint articles. In a typical case, one was listed in Google results, though if you clicked the link to see filtered-out results, there were three listings. Yahoo, on the other hand, showed 14 copies of the same article. Interestingly, the sites listed by Yahoo all were indexed by Google, usually with thousands of pages. So even though a site was indexed, the results might not appear in a Google search, even when duplicate listings were requested. One home page for that sample article had a pr3 rank, but most were pr0 or un-ranked.

Perhaps I’ll take the time to do a more detailed study of those factors in the future, but for now, let’s just say the situation is complex enough that most people are just guessing when they talk about the duplicate content penalty. So when we move on to the topic of this post, duplicate layout penalty, we are on even shakier ground, but I’m personally convinced it exists.

Here is the scenario. I made a topic-specific website for a particular U.S. locality. It got a lot of traffic, and better yet, made a lot of money through AdSense. There is demand for that kind of information, and it pays off, so I created another website for the same topic but a different locality. All the data was different, but being lazy I used the exact same page layout, with just a different database. This continued through about five locations, each on different IPS servers. When they had aged a bit, so I could expect them to ‘catch up’ with the first site in traffic and earnings, I noticed they were not doing so well. More specifically, the 2nd and 3rd sites were rated one pagerank lower than the first, and the 4th and 5th were yet another pagerank lower.

Lower pagerank = lower traffic. Google has over half of the web-search market, so their pagerank is important. Here, I made sites, all equally useful, all with similar incoming links, and of six sites, one ranks pr4, two pr3, two pr2 and one pr0.

Well, maybe it was coincidence. But I remembered another disappointing set of sites that started out well, then didn’t live up to the first example — and I found the same pattern. It wasn’t just that the older sites ranked better, but at one year old, the first site had a big advantage, and the order in which sites were added predicted relative rank, with later sites ranking lower.

So why, I can hear skeptics ask, don’t newer blogs using WordPress rank poorly? There are tens of thousands (or hundreds of thousands?) of sites using the same basic layout. My only guess is that it is the same reason that thousands of news sites do not get hit with duplicate content penalties/filtering. Either the folks at Google are so brilliant they developed a ranking program that can detect actual value, or (more likely) they programmed in certain exceptions to their ‘rules’.