If you are going to build a large Web Empire, you will need lots of content. Of course, some of it you will write yourself or hire others to write. Some of it you will create in other ways, such as taking photographs to add to your sites. Some of it you may buy. But there always comes a time when you think, isn’t there some free information out there for me to use?

Well, there is, but you need to be careful. There are three types of free information you can use. The first is public domain texts and photos. The second is user-generated content. The third is what I call free licensed content: content you are permitted to use, but under certain restrictions (typically a link-back or other acknowledgment).

Public domain does not always mean old! Certainly, one way copyrighted material enters the public domain is through age, but there are other types of public domain documents. The U.S. government has, by law, declared much of the material it produces public domain. The exceptions are material protected under secrecy or privacy laws, and information from outside sources (such as contractors) that is included in a government document, which may still be protected by copyright. Even with these exceptions, the U.S. government produces and publishes vast quantities of material that we are free to use.

Material that has entered the public domain through age (copyright expired) is also a rich source of content, and some of it is not as old as you might think. Most sources say it is ‘safe’ to use anything produced before 1923. That is NOT TRUE. Some material as recent as 1963 can be in the public domain through expired copyright, and things as old as 1887 might still be protected by copyright. I’ll explain some of the complexities of copyright in another post.

Our second class of free content is user-generated. The most common examples are forum sites, where users post comments and discuss topics; submission sites, such as photograph sites, wikis, or software sites, where users are allowed to submit content; and social networking sites. Some types of games include material submitted by the players too. Some blogs are open to submissions from the public, and others allow public comments on posts. For all these sites, you provide the infrastructure (programming and site layout) and users provide the content. Generally, you need to ‘seed’ the site with some content to attract others. The biggest problems with this type of site are spam and legal liability. These sites require constant monitoring to ensure quality. You had better have a means of removing photographs rapidly if someone posts illegal images, such as child pornography, to your photograph site. Some wiki sites rely on users to edit out bad content, but you can’t rely on that if your site is for-profit. People don’t like working for free when others profit from it.

The third type of free content is licensed. The license can be formal, like the various open content licenses (OPL, Creative Commons, GNU, etc.), or informal, such as the ‘you can reprint this article if the links remain intact’ notice that you see in free article repositories. It can be difficult to find content of this sort that isn’t already used so widely that you will have a duplicate content problem. With photos that people submit to a site like Flickr, you might find images marked public domain and later discover that the person who posted them had no rights to begin with: they didn’t take the picture but copied it out of a magazine or from another website.

So all of these sources have some merit, but in each case you need to be careful to avoid the pitfalls mentioned. Duplicate content is probably the biggest drawback of free content, but there is a way to get around it. Annotated content, in which you intersperse your comments with direct quotes from a public domain source, allows you to produce a semi-unique page that has some of your content and some from the public domain. Of course, you can quote even copyrighted material, but ‘fair use’ dictates that those quotes be relatively short and constitute only a small percentage of the original work. If you quote from a work whose copyright has expired, you can include the entire document, as long as you add enough of your own comments to avoid the duplicate content problem.

So how much of the content needs to be yours? That’s the $64,000 question that SEO experts have been arguing over for years. I’m currently doing some testing to see for myself if I can come up with a reasonable figure, but those results are several months away from completion. Most say you are safe with 50%. My feeling is that you can get away with less, maybe 25%, but I’ll need to wait until my test results are in to give a more confident answer.
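
In the meantime, if you want to keep an eye on the ratio for your own annotated pages, here is a rough Python sketch. It assumes the share is measured by simple word count, which is only my working assumption (nobody outside the search engines knows how they really measure it), and the function name `original_share` and the "mine"/"quoted" labels are made up for this example.

```python
def original_share(segments):
    """Return the fraction of words on a page that are your own commentary.

    `segments` is a list of (source, text) pairs. The labels "mine" and
    "quoted" are hypothetical names chosen for this sketch.
    Word count is an assumed metric, not a known search engine rule.
    """
    mine = sum(len(text.split()) for source, text in segments if source == "mine")
    total = sum(len(text.split()) for _, text in segments)
    # An empty page has no original content, so report 0.0 instead of dividing by zero.
    return mine / total if total else 0.0


# A tiny annotated page: public domain quotes with one comment in between.
page = [
    ("quoted", "Four score and seven years ago our fathers brought forth"),
    ("mine", "Lincoln opens by dating the nation from 1776"),
    ("quoted", "on this continent a new nation"),
]

share = original_share(page)
print(f"{share:.0%} of the words are original commentary")

# Compare against the two rules of thumb mentioned above.
print("Meets the 50% rule of thumb:", share >= 0.50)
print("Meets my 25% guess:", share >= 0.25)
```

Tag each chunk of a page as you write it, and you can check the whole site against whichever threshold you decide to trust.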