February 9, 2008

Effect of Post Frequency on Pagerank

I haven’t posted in a while; I’ve just been too busy. Then I happened across a blog that seemed to have some good material in it, but it had a pagerank of zero. Odd, I thought. Then I noticed it had not had a new post added in over a year. I wondered if that was why it ranked so dismally.

So I did a quick mini-study to see if there was a correlation between frequency of posting on blogs and their pagerank. I looked at 51 blogs that were each at least six months old (to overcome any sandbox effect) and assigned them to one of five frequency ranks, based on posts over the past three months. Then I recorded the current Google pagerank for each site.

The five ranks are defined as:

  1. abandoned — no posts for last three months
  2. infrequent — less than one post per week for past three months
  3. occasional — one to three posts per week over the past three months
  4. regular — four to seven posts per week for the past three months
  5. frequent — more than seven posts per week for past three months

Here are the results. I was surprised both by the distribution of sites into these frequencies, and the average pagerank for each group:

  1. abandoned — n=2, avg 4.5, mean 5
  2. infrequent — n=12, avg 4.6, mean 5
  3. occasional — n=10, avg 3.5, mean 4
  4. regular — n=16, avg 4.5, mean 5
  5. frequent — n=11, avg 4.1, mean 5

The first surprise was that I came across so few abandoned blogs. My method of selecting blogs was probably responsible for that; I will detail it below. I was also surprised that more bloggers posted regularly or frequently (27 of 51, or 53%) than infrequently or occasionally (22 of 51, or 43%), though again the selection of blogs probably biased that.

The main point, the distribution of pagerank vs frequency, shows no consistent effect. The average pagerank for all 51 blogs was 4.23, and the mean 5. Only the occasional frequency differs substantially from that, and I’m at a loss to explain why that would be so — maybe it is a sampling error, since the selection method was somewhat less than random.
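As a rough cross-check of those figures, the overall average can be recomputed from the group table as a weighted average. This is only a sketch in Python: the counts and group averages are copied from the results list above, the variable names are illustrative, and because the group averages are already rounded the result may differ from the quoted overall average in the last digit.

```python
# Group sizes and average pagerank, copied from the results list above.
groups = {
    "abandoned": (2, 4.5),
    "infrequent": (12, 4.6),
    "occasional": (10, 3.5),
    "regular": (16, 4.5),
    "frequent": (11, 4.1),
}

total_blogs = sum(n for n, _ in groups.values())

# Weighted average: each group's average counts once per blog in the group.
overall_avg = sum(n * avg for n, avg in groups.values()) / total_blogs

# Share of blogs posting at least four times a week vs. less often
# (abandoned blogs are left out of both shares, as in the text).
busy_share = (groups["regular"][0] + groups["frequent"][0]) / total_blogs
slow_share = (groups["infrequent"][0] + groups["occasional"][0]) / total_blogs

print(f"blogs: {total_blogs}")
print(f"overall average pagerank: {overall_avg:.2f}")
print(f"regular or frequent: {busy_share:.0%}")
print(f"infrequent or occasional: {slow_share:.0%}")
```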

I also noticed that one of the two abandoned blogs was ranked PR6, even though it hadn’t had a new post for the past year and a half (the last post was dated June 2006). So I think it is safe to assume that posting frequency has no effect on a site’s pagerank.

To find blogs I first logged in to the Wordpress site and visited one of my own blogs there. When you are logged in, they present your blog in a frame with a bar at the top that includes a link to some seemingly random Wordpress blog, so the first 21 sites were selected by repeatedly clicking that link. As mentioned above, only sites older than six months were included in the study.

I began to notice, however, that all of the blogs had posts within the last week or so; apparently they only include blogs with recent posts in that selection. So next I went to Google and looked at ten more Wordpress blogs, using the search term: site:wordpress.com blog and beginning my search at page ten of the results, so as not to include the most popular blogs. I added ten more records that way, but even that seemed to be turning up fairly high ranked blogs, so next I jumped to page 50 of results, and added ten more blogs.

Then, just to see if it made any difference, I added ten more sites using the search term blog but not specifying a site, and again began at page 50 of the results. Even that far down the list, PR7 blogs made up almost 1/3 of those last ten results — so obviously I biased the results toward higher ranking blogs when I used Google results to select sites. The first 21 sites from Wordpress were biased toward blogs with recent posts, but the average pagerank for those sites was just 3, and the mean 4.

In conclusion then, the frequency with which you post has little or no effect on the pagerank of your blog. Concentrate on providing good content, post often enough to build up a good-sized site, and go for quality over quantity so that you can attract those all-important links. That is what I’ll be doing here from now on, which means you can expect only occasional posts here.

January 28, 2008

Get Your Query Results Indexed

Although I recommend you always put as much of your content into databases as possible, I do not suggest you rely on user queries to generate most of your web content. That can be SEO death, since the search engines do not submit queries; they just follow links.

Sometimes, however, you need to have some part of your website generate pages in response to user queries. You may have a large database that you do not want to allow scrapers to scarf up in its entirety, for example. So you design a query form that lets users view parts of the data in response to their query, but you limit the output to a fixed number of responses, or else limit wild-card use in the queries, so they can’t get everything.

Basically, all of the database is available, but you would need to know exactly what is in it in order to retrieve all of the content. Well, if the scrapers knew everything in the database beforehand, they wouldn’t need to scrape it. So in practice, you are protecting the integrity of your database without restricting clients from seeking specific data. They just can’t browse through all of the information.
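A minimal sketch of that kind of restricted query follows, in Python with SQLite. Everything specific in it is an assumption for illustration: the `items` table, the cap of 20 rows, the three-character minimum, and the function names are hypothetical, and a real site would adapt the idea to its own database and language.

```python
import sqlite3

MAX_RESULTS = 20      # assumed cap on rows returned per query
MIN_TERM_LENGTH = 3   # assumed minimum, so "a" or "" cannot match everything


def escape_like(term):
    """Escape LIKE wildcards so the user's % and _ are matched literally."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def limited_search(conn, term, limit=MAX_RESULTS):
    """Return at most `limit` titles containing `term`, with no user wildcards."""
    term = term.strip()
    if len(term) < MIN_TERM_LENGTH:
        return []
    pattern = "%" + escape_like(term) + "%"
    rows = conn.execute(
        "SELECT title FROM items WHERE title LIKE ? ESCAPE '\\' "
        "ORDER BY title LIMIT ?",
        (pattern, limit),
    )
    return [row[0] for row in rows]


# Small in-memory demo with a hypothetical `items` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (title TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?)",
    [(f"widget {i:03d}",) for i in range(50)],
)

print(len(limited_search(conn, "widget")))  # capped at MAX_RESULTS
print(limited_search(conn, "%%%"))          # wildcards are matched literally
```

The cap and the wildcard escaping work together: a visitor who knows what to ask for gets an answer, but no single query returns the whole table.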

But how, then, do you make this information available for the search engines? Well, as stated above, search engines follow links, so give them links. Put a list of ‘recent searches’ on your search page, and you turn a ‘content-less’ form page into a valuable links page. Of course, you link each search term to your own site’s search script so that clicking it gives the search results for that term, using a GET query format so that the search term goes in the URL.
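The ‘recent searches’ list could be generated along these lines. This is a sketch only: the `/search` path, the `q` parameter name, and the function name are hypothetical stand-ins for whatever your own search script uses, and the example terms are just placeholders.

```python
from html import escape
from urllib.parse import urlencode

SEARCH_PATH = "/search"   # hypothetical URL of the site's search script
QUERY_PARAM = "q"         # hypothetical name of the GET parameter


def recent_search_links(terms, max_links=10):
    """Build an HTML list linking each recent search term to its results page."""
    items = []
    for term in terms[:max_links]:
        # The term goes into the URL itself, so a crawler can follow the link.
        url = SEARCH_PATH + "?" + urlencode({QUERY_PARAM: term})
        items.append(f'<li><a href="{escape(url)}">{escape(term)}</a></li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


html_block = recent_search_links(["home improvement loan", "low interest credit cards"])
print(html_block)
```

Escaping both the URL and the link text matters here, since the terms come straight from visitors.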

The search engines follow those links, and now you have additional pages added gradually and naturally to the search engine index, making it look like the site is continually growing — always a good sign to search engines. When search terms ‘fall off’ the list, the search engines hardly notice — the results page is in their index, and shows up when they re-visit, so they don’t remove it from the index.

January 21, 2008

Keywords and Linking

Ruud Hein over at Search Engine People has begun a series of posts explaining How Search Really Works, in a very basic and understandable way. I recommend it for any beginners out there. As of now, there are only two posts, so it has a long way to go if he plans to cover the topic thoroughly, but those first two have him off to a good start.

The above link is to the first post; the second can be found here at Keyword Links. That is an aspect of SEO that the beginner often misses by focusing on the elements on the web page itself to the exclusion of all else. There are ways to influence how people describe your content, thus maximizing keywords in the incoming links. But better yet is building your own Web Empire, where you can send hundreds, even thousands of links to your own sites, with complete control over which pages get links and the text in that link. That is one of the beauties of having multiple websites.

This is a good example of why you should put your website content into a database and use modular design. Suppose you want to have site-wide links from Site A to Site B. You can use a simple line of PHP code in one of the components of your site to display a static link to Site B, but then every link has the same text, and links to the same page, both of which weaken their overall value.

Instead, call a function on Site A that returns the number of pages on the site, which will correspond to the counter value in the content database. Then call another that looks up the keyword (or the first keyword, for use in the META tag) or the page title (which should include the keyword) for one of the pages on your site, chosen at random but changing only every week or month. (See my earlier post on Random Elements on how to do that if you don’t already know.) Now you have lots of different links, including ‘deep linking’ that has relevant keywords in the link text.
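One way to get a choice that is random but stable for a week is to seed the random generator with the week number. The sketch below is in Python rather than PHP and is not necessarily the method from the Random Elements post; the `PAGE_KEYWORDS` table, the `site-b.example` URL, and the function names are all hypothetical.

```python
import datetime
import random
from html import escape

# Hypothetical stand-in for the content database on the target site:
# page number -> keyword used as link text.
PAGE_KEYWORDS = {1: "blue widgets", 2: "widget repair", 3: "used widgets"}


def pick_page(page_count, source_page, today=None):
    """Pick a target page number that stays fixed for one ISO week."""
    today = today or datetime.date.today()
    iso_year, iso_week, _ = today.isocalendar()
    # Seeding with source page + week gives each linking page its own pick,
    # which only changes when the week number changes.
    rng = random.Random(f"{source_page}-{iso_year}-{iso_week}")
    return rng.randint(1, page_count)


def weekly_link(source_page, today=None, base_url="http://site-b.example/page/"):
    """Build a keyword-text link to the page picked for this week."""
    page = pick_page(len(PAGE_KEYWORDS), source_page, today)
    return f'<a href="{base_url}{page}">{escape(PAGE_KEYWORDS[page])}</a>'


print(weekly_link("about", datetime.date(2008, 1, 21)))
```

Swapping the week number for the month number would give a link that changes monthly instead.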

January 10, 2008

Interpreting SEO Data

Now that I have a minute to spare, I’d like to elaborate on my previous post, where I warned newbies not to accept all of Jonathan Leger’s conclusions. The particular post that made me question his ability to reason correctly was the one called Does PageRank Really Matter for Ranking in Google? (I have only read three or four posts on his blog so far, so one bad example out of four is a fairly poor average. Maybe it is the only bad post, though; I’ll have to read more to weigh in on that question.) His own data clearly shows PageRank does matter, yet he concludes it doesn’t. How can that be?

His data shows these average PageRank values for each of the top ten result positions, averaged over 500 keywords:

1. 6.722
2. 6.866
3. 6.292
4. 6.234
5. 5.968
6. 5.88
7. 5.73
8. 5.662
9. 5.656
10. 5.604

So how would that look if PageRank really didn’t matter? It would be a random distribution, so the overall averages would tend to be about the same for each position. I also suspect the numbers would be much lower, since the average PageRank for a random selection of pages would include far fewer high-ranked pages. I think an average around 2 or 3 would be much more probable, but regardless of what the actual number would be, it would be about the same for all positions, just varying slightly at random around the mean.
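To put a rough number on that reasoning, you can measure how strongly average PageRank tracks result position in the figures listed above. The sketch below uses a plain Pearson correlation, which is my choice of measure rather than anything from the original study. If PageRank played no role, the correlation would be expected to hover near zero; a value close to -1 means average PageRank falls steadily as you move down the page.

```python
from math import sqrt

# Average PageRank by result position, copied from the list above.
avg_pagerank = [6.722, 6.866, 6.292, 6.234, 5.968, 5.88, 5.73, 5.662, 5.656, 5.604]
positions = list(range(1, len(avg_pagerank) + 1))


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(positions, avg_pagerank)

# Count how many adjacent positions step downward in average PageRank.
drops = sum(1 for a, b in zip(avg_pagerank, avg_pagerank[1:]) if a > b)

print(f"correlation between position and average PageRank: {r:.2f}")
print(f"downward steps: {drops} of {len(avg_pagerank) - 1}")
```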

So how did Jonathan arrive at the brilliant conclusion that PageRank does not matter? He found that about one-third of the time a higher ranked page followed one of lower value. And 14% of the time it was a substantial difference (3 points of PageRank or more).

So what does that tell us? It clearly demonstrates that PageRank is not the only factor that goes into the choice of what order result pages will be listed in, but that is a far cry from saying PageRank doesn’t matter at all! In fact, it shows that most of the time (about 2/3 of the time) any two adjacent first-page results will be in PageRank order, with the higher ranked page on top. His own data refutes his conclusion.

Still, I like the fact that Jonathan actually tests things to see what effect they have. That is a step up from most Internet marketing advice, which is based on the consensus of guesses found on the forums.

January 8, 2008

Another Look

Today I took another look at the results of the ‘quicky’ study I described yesterday, after reading an interesting article by Bart Kosko on the superiority of the median as a descriptive measure of data that varies from the typical bell curve distribution.

While figuring out the statistics yesterday, I had noticed that of the 20 cases I measured mean and median for (ten keywords, page one and page ten results), 19 had a median lower than the mean (or average). That suggests to me that the distribution of site size is skewed toward a few very large values rather than following the typical bell curve, so it may be a good situation in which to consider the medians, rather than averages.
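A small illustration of why a few large values pull the mean above the median. The numbers here are made up for the example (arbitrary size units), not taken from the study.

```python
from statistics import mean, median

# Hypothetical site sizes: mostly small, with one very large site.
site_sizes = [8, 10, 12, 15, 18, 20, 25, 30, 45, 400]

# The single large site drags the mean up; the median barely notices it.
print(f"mean:   {mean(site_sizes):.1f}")
print(f"median: {median(site_sizes):.1f}")
```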

One measure of how likely measured results are to be real, rather than random variation, is how many of the measured observations agree with the averaged-out conclusion. In this study we have ten keywords. For how many of those was it true that the average page one results were larger than those for page ten? I figured that out, and found that 70% of the keywords had a greater average page size in the page one results.
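That kind of agreement count can be turned into a rough probability with a simple sign test. The sketch below assumes each keyword is independent and, if there were no real effect, equally likely to go either way; that assumption is added here for illustration and was not checked in the study. The function name is hypothetical.

```python
from math import comb


def chance_of_at_least(k, n):
    """Probability of k or more 'heads' in n fair coin flips."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


p = chance_of_at_least(7, 10)
print(f"chance of 7 or more agreeing keywords out of 10: {p:.3f}")
```

The smaller that probability, the harder it is to put the agreement down to luck alone; with only ten keywords it does not get very small.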

Doing the same comparison with median values, rather than means, I found 80% of those were larger for the page one results, further emphasizing the probable correctness of the results. When I looked at the average median value for page one results, the value was 29.1, while the average median for page ten search results was 25.8. So, while the averages suggested page ten results were six percent smaller than page one results, the median results show page ten results were 12% smaller.
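The percentage comparison can be reproduced from the two average medians quoted above. This sketch uses the rounded values as printed, so a small difference from the percentage in the text is likely down to rounding; the variable names are illustrative.

```python
page_one_median = 29.1   # average median for page one results, from the text
page_ten_median = 25.8   # average median for page ten results, from the text

# Shortfall of page ten relative to page one, as a percentage.
shortfall = (page_one_median - page_ten_median) / page_one_median * 100
print(f"page ten results are about {shortfall:.1f}% smaller")
```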

One biasing factor I forgot to mention originally is that half of the keywords I used were from a single industry, finance: things like ‘home improvement loan’ and ‘low interest credit cards’. As I said, the original list of keywords came from those increasing in price for AdSense, so they reflect the current condition of the American economy, where monetary concerns dominate. I need to repeat this study with a more random selection of keywords…