December 18, 2007

Other blogs are saying

I found a site today that lists the top 100 blogs about making money on-line:

45n5

Actually there are nearly 300 blogs listed as of now, and a link to suggest more. Even though this site is only a month old, I wrote Mark at 45n5 to have it added; with luck maybe I can get the bottom spot for a while. The page doesn’t describe how sites are ranked, and so far I haven’t had time to search the entire blog for posts on the subject, but it seems to be some combination of Google PageRank, Technorati rank and Alexa rank. He also has a link-back ‘widget’ to show your blog’s rank there, a great example of link baiting. I’ll certainly add one when my site gets listed.

Over on DoshDosh they posted a page of interest to readers of this blog:

Building an Online Empire: 16 Types of Websites You Can Create for Profit

Some of the suggestions are too much of a long-term time commitment for my tastes, but if your goal is to create your own online job/business, they will work. I’d rather my Web Empire be self-sustaining, with no more than an occasional tweak from me, though I’m a long way from that point now. Although I’ve been making a living online for over ten years, it is only recently (OK, about a month) that I’ve decided this job I created for myself needs to have an early retirement as its goal.

Most of the suggestions on the DoshDosh list can be implemented using hands-off techniques, however, like hiring help to do the day-to-day work, and others are low-maintenance sites to begin with. Let your own interests, knowledge and abilities guide you to the types of sites that are right for your Web Empire.

December 17, 2007

Use 301 Re-Directs

If you have read the earlier posts on this blog you will recall that I described using a 301 redirect to overcome a hacker’s attempt to steal my content and page-rank in the post called Site Hijacking (part 2). That technique worked smashingly, as I got any traffic that went to his site (and still do) and now the offending site is entirely removed from the search engine results.

Today, I’m going to talk about the more typical use for a 301 redirect, when you actually have moved content. There are many reasons to do this. I have one site that began in 1993 — before you could even have commercial content on websites (I didn’t actually have my own website, but controlled a directory within a larger site). Sometimes I find it best to move some of the content on that legacy site to one of my newer websites.

I certainly don’t want to lose any visitors though, and there are links all over the place to that content, so I use a 301 redirect to ensure that both search engines and visitors go to the new location. The procedure is simple.

First, I put the content on the new site. I keep the same filename and title and most of the content remains unchanged, though the layout is usually different on the new site.

Next, I replace the old page with a PHP header similar to that from my earlier post:

<?php header("Location: http://www.newsite.com/page.htm", TRUE, 301); exit; ?>

Of course I have a statement in the .htaccess file that ensures PHP gets parsed, since these old files were usually plain HTML:

AddType application/x-httpd-php .php .htm

If I used .html I could add that to the above line as well, but I’m lazy and never type an extra character I can avoid — all my HTML files end in .htm
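
When several pages move at once, the same header call can be driven from a small lookup table instead of being typed into each file. The following is only a minimal sketch under stated assumptions: the paths in $moved are placeholders, redirect_target is a made-up helper name, and the script is assumed to be included at the top of each old page.

```php
<?php
// Hypothetical map of old paths to new URLs; the entries are placeholders.
$moved = array(
    '/articles/page.htm'  => 'http://www.newsite.com/page.htm',
    '/articles/other.htm' => 'http://www.newsite.com/other.htm',
);

// Return the new URL for a moved page, or null if the page has not moved.
function redirect_target(string $path, array $map): ?string
{
    return isset($map[$path]) ? $map[$path] : null;
}

// Only send the redirect when running under a web server.
if (PHP_SAPI !== 'cli' && isset($_SERVER['REQUEST_URI'])) {
    $path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
    $target = is_string($path) ? redirect_target($path, $moved) : null;
    if ($target !== null) {
        header('Location: ' . $target, true, 301); // 301 = moved permanently
        exit;
    }
}
```

Pages that are not in the table fall through and display normally, so the include is harmless on pages that have not moved.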

Next, I remove all the links to that page from my sites. Some I point to the new location, others I point to other web pages in my Web Empire that need more traffic. To make sure I get all the links in my empire, I use the link:url search in Google and the SEOquake link to a similar feature in Yahoo. (Yahoo keeps changing the syntax, and it is easier to search through this tool, which is updated frequently, than to hunt for the correct search syntax.) If I recognize any of the links from outside my Web Empire as old friends, I’ll drop them an email asking them to change their links. Similar emails to strangers are just deleted as spam, so they are not worth the time and effort to write.

If the page or pages get much traffic, I sometimes add a ‘referer recorder’ to the original page, before the redirect. (Yes, referer is mis-spelled, ‘taint my fault; look it up if you don’t know why.) That is a simple MySQL database table with two fields, a counter and a text field for the referer. Since I am only interested in where visitors are coming from, and not the frequency for particular referers, I make the referer field a unique key. Then before the redirect I add this snippet of PHP code:

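<?php
error_reporting(0);        // never show error messages to visitors
include(open_db($db));     // my open-database code
// The referer is absent on direct visits and is sent by the visitor's
// browser, so check that it exists and escape it before using it in SQL.
$serv = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$sql = "insert into refer values('','" . mysql_real_escape_string($serv) . "')";
if ($serv) { mysql_query($sql); }   // a duplicate referer fails silently on the unique key
?>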

I don’t want users to see any error messages, so I set reporting to zero. The include file is of course my open database code. $serv returns the referer (if any) and the subsequent code saves it (there is no need to specify the counter value as it is set to auto-increment).

After a month or two I check that table to see if any of the incoming visitors who are being redirected come from within my Web Empire — if so, I change the link. I’ll empty the table and check again in a few more months. When (and if) the table remains empty for months on end, I know it is now safe to remove the original page redirects and this table.
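
The check for referers from inside the empire can be sketched as a small filter over the rows pulled from that table. This is an illustration only: internal_referers is a made-up name, and the list of your own hosts is something you would supply yourself.

```php
<?php
// Return the referers whose host belongs to one of my own domains.
// $ownHosts is a hypothetical list; substitute the domains in your empire.
function internal_referers(array $referers, array $ownHosts): array
{
    $internal = array();
    foreach ($referers as $url) {
        $host = parse_url($url, PHP_URL_HOST);
        if (!is_string($host)) {
            continue; // skip anything that does not parse as a URL with a host
        }
        $host = strtolower($host);
        if (strpos($host, 'www.') === 0) {
            $host = substr($host, 4); // treat www.example.com and example.com alike
        }
        if (in_array($host, $ownHosts, true)) {
            $internal[] = $url;
        }
    }
    return $internal;
}
```

Anything the function returns is a page of your own that still links to the old location and needs its link changed.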

December 14, 2007

Scraping and Being Scraped

Scraping is simply using a software program to extract data from a website. So long as the data is not copyrighted, or is not published by the person doing the scraping, it is legal. There are free programs available to scrape information from websites, or you can easily write your own if you understand PHP or another server-side scripting language.

Search engines are scrapers, but unlike most scrapers they take in whole web pages and web sites, rather than extracting out specific data. In fact, they usually provide users access to (i.e. ‘publish’) a ‘cached’ version of a website on their server — which is clearly copyright infringement, but they provide such an essential service that it is overlooked. Can you imagine trying to find anything on-line without using any search-engine? So they provide an opt-out method for people to avoid having some part (or all) of their site included in the search engine results (the robots.txt file), and no one objects.

Don’t try to imitate search engines and publish entire pages you have scraped (copied) or you will likely find yourself being sued. It is possible, however, to scrape specific information from a site without infringing on copyrights. For example, you can easily scrape the current price of a specific stock from any of the hundreds of sites publishing that information, and put it on your own site. Information is not copyrightable. You are not copying their form of expression, the actual HTML code they use to display that information. Of course whenever they update their site with new layout or coding it is likely to ‘break’ your scraper, so check for valid data before displaying anything on your website.
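
To make the ‘check for valid data’ step concrete, here is a minimal sketch of pulling one price out of a page that has already been fetched. The id="last-price" marker and the function name extract_price are assumptions made up for this example; every real site marks up its prices differently.

```php
<?php
// Pull a price out of scraped HTML and return it as a float, or null if
// nothing that looks like a valid price is found.
// The id="last-price" marker is hypothetical; adjust it for the real page.
function extract_price(string $html): ?float
{
    $pattern = '/id="last-price"[^>]*>\s*\$?([0-9][0-9,]*(?:\.[0-9]+)?)\s*</';
    if (!preg_match($pattern, $html, $m)) {
        return null; // layout changed or data missing: display nothing
    }
    $price = (float) str_replace(',', '', $m[1]); // drop thousands separators
    return $price > 0 ? $price : null;
}
```

When the function returns null, the safe choice is to show the last good value or nothing at all rather than whatever the broken scrape produced.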

But what if you have a database-driven site that is composed of public domain content? How do you keep others from scraping and using your content on their own site? One thing you can do is hire a programmer to write tracking code that looks at the behavior of a visitor, and denies access if it appears to be a program, such as retrieving dozens of pages in one minute, or any of several other tell-tale clues that scrapers typically exhibit. The only problem is, a determined hacker behind the scraper can overcome any detection method you use.
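
The simplest of those tell-tale clues, too many pages in one minute, can be sketched as follows. This is an illustration rather than a complete tracker: the function name and the limits (30 pages in 60 seconds) are assumptions, and how you collect each visitor’s request times is left out.

```php
<?php
// Decide whether a visitor looks like a program, based only on request rate.
// $times holds Unix timestamps of that visitor's recent page requests.
// The default limits are assumptions to tune for your own site.
function looks_like_scraper(array $times, int $now, int $maxPages = 30, int $window = 60): bool
{
    $recent = 0;
    foreach ($times as $t) {
        // Count only requests that fall inside the last $window seconds.
        if ($t > $now - $window && $t <= $now) {
            $recent++;
        }
    }
    return $recent > $maxPages;
}
```

A real visitor rarely reads more than a few pages a minute, so the threshold can be generous; search engine crawlers you want to keep would need to be exempted separately.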

Another technique is to simply not put links to everything in your database at any one time. Have a randomly selected sub-set of database records that are shown to all the visitors for a week or two. Then change the records selected for display. The others will still display if they have been book-marked, but there are no active links to them. The search engines will find them and keep them in their index, because they continue to ‘work’, but a scraper will only get a sub-set of your real data. Of course if the programmer behind the scraper realizes the data is incomplete they can easily get the whole database just like the search engines will, by repeated visits. But most site-wide scrapes are hit-and-run events; the people behind them don’t have enough interest in your site to visit it more than once.

This technique is only appropriate in certain cases, with particular types of data, but when it can be used it is fairly effective. The scraper gets a sub-set of your data, throws up a website with that, but never approaches your rank in the search engines because their site is smaller, has duplicate content, etc.
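
One way to pick a sub-set that stays fixed for a couple of weeks and then changes on its own is to order the records by a hash of the record ID and the current period number. This is a sketch of one possible approach, not the only one; visible_ids is a made-up name and the 14-day period is an assumption.

```php
<?php
// Choose which record IDs get linked during the current period.
// Hashing each ID together with the period number gives a selection that
// stays the same for the whole period and reshuffles when the period changes.
function visible_ids(array $ids, int $count, int $now, int $periodDays = 14): array
{
    $period = intdiv($now, $periodDays * 86400); // whole periods since the Unix epoch
    $keyed = array();
    foreach ($ids as $id) {
        $keyed[md5($period . ':' . $id)] = $id;
    }
    ksort($keyed, SORT_STRING); // order by hash, which is effectively random
    return array_slice(array_values($keyed), 0, $count);
}
```

Because the selection is computed rather than stored, every visitor sees the same links during a period and no extra database table is needed.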

An even better way to beat the competition is to have a database that never stops growing. If you add new content all the time, the scraper can never ‘catch up’ with you. Also, you can insert some original material in with the public-domain information. You have copyright to the original material; make it clear that your site uses a combination of public domain and original material, and that will discourage copying by honest scrapers. Of course the crooks are another matter…

December 12, 2007

Copyright Basics

As promised in the preceding post, I will try to explain here the basics of copyright in the U.S.A. with the usual cop-out ‘I am not a lawyer and nothing I say should be considered legal advice.’

Copyright applies to things like written texts and photographs. Sound recordings are a special and complex situation prior to 1972, but I won’t even go into that here. Search the web for information specific to sound recording copyright if that applies to you.

Copyright applies to ‘creative works’ and protects the form of expression, not facts or information. Titles and names cannot be copyrighted. Simple compilations like the phone book cannot be copyrighted.

Copyright is not the only law that might apply to content for your web site. You should also consider things like privacy laws, trademarks and patents if any of those apply. As I said yesterday, if you allow content from users, then consider whether you may be liable for charges of libel or illegal content.

Anything you write is automatically copyrighted, and remains protected for 70 years after your death. Plenty of time for you to be forgotten so your work gets lost for all time.

Luckily, earlier copyright laws were not so over-protective. Several laws were passed having different effects on the public domain. The law requiring the © character or the word copyright to be used was changed by the Copyright Act of 1976, which took effect on January 1, 1978, so basically anything published before 1978 without a copyright notice is (usually) public domain. The exceptions are when the notice was accidentally left off some prints and subsequently corrected, which the original law requiring the notice allowed for. Also note that nothing in the law specifies that (c) can be substituted for the c in a circle, but I’ve never heard of anyone pressing that as a defense in court.

An important word in the preceding paragraph is ‘published’ — note that it is not the creation date, but publication date that is important. If a photograph, for example, was taken before that date, but not published — or published after 1977, it may be copyrighted even if it lacks a copyright notice. Be very careful about relying on the lack of copyright notice if you do not know the date something was published.

So what else is in the public domain in 2007? Anything published in the USA in 1963 or earlier where the owner failed to renew the copyright. The vast majority of copyrighted material was never renewed. Best-selling novels, movies and famous photographers’ images are the most common exceptions. But you never know for sure until you check the copyright office records. Some of these are being put on-line to make that easier. Also, anything published before 1923 is now public domain.

Again, note that the preceding paragraph refers only to published material. Photographs, for example, may not have been published, and fall under much more protective later legislation. The same goes for documents, journals or other unpublished material. If the author (or photographer) is known, the work is protected for 70 years after the death of the author, so if the author died before 1937 the work is public domain. If the author is unknown (anonymous) or corporate, the item is protected for 120 years from the date of creation, so it must have been created before 1887 to be public domain!
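
The two cut-off years above are just subtraction from the current year, so they move forward every January. A tiny sketch of that arithmetic, using only the 70-year and 120-year terms from this post (the function name is made up for illustration):

```php
<?php
// Cut-off years for UNPUBLISHED works, using only the terms described above:
// life of the author plus 70 years, or 120 years from creation when the
// author is unknown or corporate.
function unpublished_cutoffs(int $currentYear): array
{
    return array(
        'author_died_before' => $currentYear - 70,
        'created_before'     => $currentYear - 120,
    );
}
```

Calling unpublished_cutoffs(2007) gives 1937 and 1887, the years quoted above.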

The big caveat to this ridiculous situation is that in order to be sued for copyright infringement, the person suing would have to PROVE who the author of the anonymous work was, and that they were alive within the past 70 years, and that the person suing had inherited the rights to the deceased author’s work. Could happen I suppose, but what are the odds?

For works published outside the USA, different rules sometimes apply, but again most works produced before 1923 that were published are public domain. Works published in Afghanistan, Eritrea, Ethiopia, Iran, Iraq, or San Marino are not protected at all in the USA, even new works. Confused yet?

Make your best effort to build your Web Empire on sturdy legal grounds, and avoid copyright problems. If it is not an obvious case of infringement, and you are in one of the ‘fringe’ areas, most copyright owners will just notify you to cease and desist; it is rarely worth the cost to sue.

December 11, 2007

Free Content

If you are going to build a large Web Empire, you will need lots of content. Of course some of it you will write yourself, or hire others to write. Some of it you will create in other ways, such as taking photographs to be added to your sites. Some you may buy. But there always comes a time when you think, Isn’t there some free information out there for me to use?

Well there is, but you need to be careful. There are three types of free information you can use. One is public domain texts and photos. Second is user generated content. Third is what I call free licensed content — content you are permitted to use, but under certain restrictions (typically a link-back or other acknowledgment).

Public domain does not always mean old! Certainly, one way copyrighted material enters the public domain is through age, but there are other types of public domain documents. The U.S. government has, by law, declared much of the material it produces public domain. The exceptions are material protected under secrecy laws or privacy protection laws, and information from outside sources (such as contractors) included in a government document, which may be protected by copyright. Even with these exceptions, the U.S. government produces and publishes vast quantities of material that we are free to use.

Material that has entered the public domain through age (copyright expired) is also a rich source of content, some of it not so old as you think. Most sources say it is ‘safe’ to use anything produced before 1923. That is NOT TRUE. Some material as recent as 1963 can be in the public domain through expired copyright, and things as old as 1887 might be protected by copyright. I’ll explain some of the complexities of copyright in another post.

Our second class of free content is user generated. These are most commonly things like forum sites, where users post comments and discuss topics; submission sites, such as photograph sites, wikis, or software sites, where users are allowed to submit content; or social networking sites. Some types of games include material submitted by the players in their content too. Some blogs are open to submissions from the public, and others allow public comments on posts. For all these sites, you provide the infrastructure (programming and site layout) and users provide the content. Generally, you need to ‘seed’ the site with some content to attract others. The biggest problems with this type of site are spam and legal liability. They require constant monitoring to ensure quality. You had better have a means of removing photographs rapidly if someone posts kiddy-porn to your photograph site. Some wiki sites rely on users to edit out bad content, but you can’t rely on that if your site is for-profit. People don’t like working for free when others profit from it.

The third type of free content is licensed. These can be formal licenses, like the various open content licenses (OPL, Creative Commons, GNU, etc.), or they may be informal, such as ‘you can reprint this article if the links remain intact’ that you see in free article repositories. It can be difficult to find information of this sort that isn’t already used so widely that you will have a duplicate content problem. With photos that people submit to a site like Flickr you might find images marked public-domain, only to find out the person posting them had no rights to begin with: they didn’t take the picture but copied it out of a magazine or from another website.

So all of these sources have some merit, but in each case you need to be careful to avoid the pitfalls mentioned. Duplicate content is probably the biggest drawback of free content, but there is a way to get around that. Annotated content, in which you intersperse your comments with direct quotes from a public domain source, allows you to produce a semi-unique page that has some of your content and some from the public domain. Of course you can quote even copyrighted material, but ‘fair use’ dictates that those quotes be relatively short and constitute only a small percentage of the original work. If you quote from an expired copyright work, you can include the entire document, just so long as you include enough of your comments to avoid the duplicate content problem.

So how much of the content needs to be yours? That’s the $64,000 question that SEO experts have been arguing over for years. I’m currently doing some testing to see for myself if I can come up with a reasonable figure, but those results are several months away from completion. Most say you are safe with 50%; my feeling is you can get away with less, maybe 25%, but I’ll need to wait until my test results are in to give a more confident answer to that.
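
Whatever figure you settle on, it helps to measure an annotated page against it. Here is a minimal sketch that counts words; counting words is only an assumption about what matters, since nobody outside the search engines knows how they actually measure duplicate content, and the function name is made up for illustration.

```php
<?php
// Percentage of the words on a page that are your own comments rather than
// quoted material. Word counts are a rough stand-in for "how much is yours".
function own_content_percent(string $ownText, string $quotedText): float
{
    $own = str_word_count($ownText);
    $quoted = str_word_count($quotedText);
    $total = $own + $quoted;
    return $total === 0 ? 0.0 : 100.0 * $own / $total;
}
```

A page with 300 words of your comments wrapped around 900 quoted words comes out at 25%, right at the lower figure guessed at above.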