January 14, 2008

Best Source for Public Domain Material

Your Web Empire should include at least one ‘anchor’ site with really great content you create yourself. But you just will not have time to write content for all of your sites, especially when you have several dozen of them. Sometimes you really need content that will not require much work on your part.

Public domain material fits that bill. I don’t know how many websites consist of material from the Gutenberg collection of texts, but I am sure it is substantial, as that is one of the oldest and best-known collections of public domain texts. It is not a great source for that very reason — any content from there is certain to be categorized as duplicate content when you use it.

There have been a couple of much-publicized efforts to scan books, most notably those by Microsoft and Google. These books are available as PDF image scans, but they are also run through OCR software, and the resulting texts are available. Depending on the quality of the image obtained, some of these OCR scans are surprisingly accurate. The best way to access these books, as well as some from other projects (including Gutenberg), is to use the Internet Archive text collection.

This collection is huge enough, and growing fast enough, that even Google has not fully indexed it. Use the search function to find texts in the subject area relevant to your website, then download a few in the text format, and see if the OCR scan is accurate. If it is, choose a line of 40 or 50 characters from the text, and search for it (within quotes) on Google. Often you will get ‘no results found’, indicating the text is not in the Google database. Run the full text through a spell-check to catch any OCR errors. Add it to your site, and when Google indexes it, yours will be the authority site for that text.
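The ‘pick a line and search it in quotes’ step can be sketched in a few lines of Python. This is only a minimal sketch: the function names are my own, and it just prepares the quoted phrase for you to paste into the search box. It does not query Google for you.

```python
import random
from urllib.parse import quote_plus

def pick_probe_line(text, min_len=40, max_len=50, seed=None):
    """Return a 40-50 character snippet cut on word boundaries, or None."""
    rng = random.Random(seed)
    candidates = []
    for line in text.splitlines():
        snippet = ""
        for word in line.split():
            trial = (snippet + " " + word).strip()
            if len(trial) > max_len:
                break
            snippet = trial
        # Keep only snippets that landed inside the target length range.
        if min_len <= len(snippet) <= max_len:
            candidates.append(snippet)
    return rng.choice(candidates) if candidates else None

def exact_phrase_query(snippet):
    """Wrap the snippet in quotes and URL-encode it for an exact-phrase search."""
    return quote_plus('"' + snippet + '"')
```

Try a few different snippets from different parts of the text. A single OCR error inside the snippet will produce a false ‘no results found’.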


There are also images, sound recordings, and video clips in the Internet Archive database. Just use the drop-down menu to select other formats and search for sound or images to supplement your textual material. You can build an entire site around public domain material that has not been indexed in the search engines!

December 20, 2007

Gathering Content

Content comes in many ‘flavors’: it can be text, images, audio/video, Flash, PDF or various other formats. It can be housed in a database, or in individual files. If you want your website to be indexed by the search engines so that they can send you free traffic, it should go heavy on the text. Images are OK so long as they include captions, and your HTML image tag should have both the ‘alt’ and ‘title’ attributes filled in, preferably with different text in each (though saying essentially the same thing, using different words).
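If your pages are generated by a script, a small helper can make sure no image goes out without both attributes. Here is a rough sketch in Python. The function name and the rule that the two texts must differ are my own assumptions, not a search engine requirement.

```python
from html import escape

def image_tag(src, alt, title):
    """Build an <img> tag with distinct, HTML-escaped alt and title text."""
    if not alt.strip() or not title.strip():
        raise ValueError("both alt and title must be filled in")
    if alt.strip().lower() == title.strip().lower():
        raise ValueError("alt and title should use different wording")
    return '<img src="{}" alt="{}" title="{}">'.format(
        escape(src, quote=True),
        escape(alt, quote=True),
        escape(title, quote=True),
    )
```

For example, image_tag("cat.jpg", "A cat asleep", "Sleeping cat") gives a tag with both attributes filled in and worded differently.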

Audio/video (a/v) formats are not indexed by search engines. Flash is not indexed. PDF is indexed by Google, but some of the smaller search engines cannot read these files. Also, many users will see your document is PDF and click on the ‘view as html’ link Google provides. They never actually visit your site, and the rendered version is usually far inferior to what you have in the formatted PDF file, giving visitors a poor (and false) impression of your site.

Podcasts and similar a/v presentations may be easier for some people to produce, and they are popular for some sites, but you miss out on a lot of free search engine traffic, and need to make up for that with other means of promotion. Ideally you should have a complete transcript of the audio, or better yet that plus a verbal description of the video portion, but that can be a tremendous amount of work to produce. At the very minimum, if you want to include a/v material, include a good description of the content.

So in most cases, we will be using primarily text material for our content, or graphics with text descriptions. Where do you get that material? I have already mentioned public domain content in the post Free Content. Finding public domain material that has not already been published online is one way to use this material without running into duplicate content problems.

Writing all of your own material is time-consuming, but it doesn’t cost you money, so for the budget-conscious beginner it may be the preferred choice. Just be sure to double-check your spelling and grammar. It does not have to be great prose, worthy of a Hemingway, but you need to be able to get your point across to your audience.

Another way to get content is to pay for written articles. There are a few unscrupulous sorts who try to fob off copied material as their own, after just changing a few words here and there … so be careful who you deal with.

Creating a database is one of the best ways to produce content. Select some aspect of your subject area, and produce a database by copying information — not by copying copyrighted material, like descriptions or reviews, but by copying things like names, statistics, feature lists, options, etc. If you have an automobile site, for example, you can create a database by make, model and year, showing which options were available, which engine sizes, etc., etc. Always try to put more in your database than you need for your site — that way in the future, when producing a related site, you will have more material to choose from if you want to extract part of that database onto your new site, without duplicating the information on your original site.
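To make the automobile example concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table layout, column names, and sample rows are placeholders I made up for illustration. Your own fields will depend on your subject area.

```python
import sqlite3

def build_car_db(rows):
    """Create an in-memory database of cars keyed by make, model, year and engine."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE cars (
               make TEXT, model TEXT, year INTEGER,
               engine_liters REAL, options TEXT,
               PRIMARY KEY (make, model, year, engine_liters))"""
    )
    conn.executemany("INSERT INTO cars VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn

def engines_for(conn, make, model, year):
    """List the engine sizes recorded for one make, model and year."""
    cur = conn.execute(
        "SELECT engine_liters FROM cars "
        "WHERE make = ? AND model = ? AND year = ? ORDER BY engine_liters",
        (make, model, year),
    )
    return [row[0] for row in cur.fetchall()]

# Made-up placeholder rows, not real vehicle data.
SAMPLE_ROWS = [
    ("ExampleMake", "ModelA", 1999, 2.0, "sunroof, air conditioning"),
    ("ExampleMake", "ModelA", 1999, 1.6, "air conditioning"),
    ("ExampleMake", "ModelB", 2001, 3.0, "leather seats"),
]
```

Extra columns that you do not display today cost almost nothing to store, and they are what you will draw on when you build a related site later.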

You can also buy databases full of content. Since they are for sale, they will not be unique, so you should not use one database as the basis for a website, but it is a great way to supplement other material. Also, by selecting a few fields from one database, and a few others from a related database, you can create a unique mix that looks different enough from what others do with the same data that you won’t get penalized for duplicate content. Also, by selecting a sub-set of the records from a database, you can randomly rotate some material, giving your site an appearance of freshness. Remember, you are building a Web Empire, not just a website.

December 14, 2007

Scraping and Being Scraped

Scraping is simply using a software program to extract data from a website. So long as the data is not copyrighted, or is not published by the person doing the scraping, it is legal. There are free programs available to scrape information from websites, or you can easily write your own if you understand PHP or another server-side scripting language.

Search engines are scrapers, but unlike most scrapers they take in whole web pages and web sites, rather than extracting specific data. In fact, they usually provide users access to (i.e. ‘publish’) a ‘cached’ version of a website on their server. That is arguably copyright infringement, but they provide such an essential service that it is overlooked. Can you imagine trying to find anything on-line without using a search engine? So they provide an opt-out method for people to avoid having some part (or all) of their site included in the search engine results (the robots.txt file), and no one objects.
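If you write your own scraper, it is polite to honor the same opt-out file. Python’s standard library includes a robots.txt parser, so a check can be as short as the sketch below. The helper name, the ‘ExampleBot’ user agent, and the example.com URLs are placeholders of mine.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# A sample robots.txt that opts one directory out for every crawler.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
```

With the sample file, allowed(SAMPLE_ROBOTS, "ExampleBot", "http://example.com/private/page.html") is False, while the same call for http://example.com/index.html is True.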

Don’t try to imitate search engines and publish entire pages you have scraped (copied) or you will likely find yourself being sued. It is possible, however, to scrape specific information from a site without infringing on copyrights. For example, you can easily scrape the current price of a specific stock from any of the hundreds of sites publishing that information, and put it on your own site. Information is not copyrightable. You are not copying their form of expression, that is, the actual HTML code they use to display that information. Of course whenever they update their site with a new layout or coding it is likely to ‘break’ your scraper, so check for valid data before displaying anything on your website.
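Here is a rough sketch of what ‘check for valid data’ can look like. The HTML pattern (an element with id="last-price") is purely hypothetical, since every site marks up its prices differently, and the sanity range is an arbitrary assumption. The point is that the function returns None, rather than garbage, when the layout changes.

```python
import re

# Hypothetical markup: <span id="last-price">$1,234.56</span>
PRICE_PATTERN = re.compile(
    r'id="last-price"[^>]*>\s*\$?([0-9][0-9,]*\.[0-9]{2})\s*<'
)

def extract_price(html, low=0.01, high=100000.0):
    """Return the scraped price as a float, or None if it is missing or implausible."""
    match = PRICE_PATTERN.search(html)
    if match is None:
        return None  # layout changed, or the page is not what we expected
    try:
        price = float(match.group(1).replace(",", ""))
    except ValueError:
        return None
    return price if low <= price <= high else None
```

When the function returns None, show the last good value (or nothing at all) instead of publishing whatever the broken scraper produced.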

But what if you have a database-driven site that is composed of public domain content? How do you keep others from scraping and using your content on their own site? One thing you can do is hire a programmer to write tracking code that looks at the behavior of a visitor, and denies access if it appears to be a program, for example by retrieving dozens of pages in one minute, or showing any of several other tell-tale clues that scrapers typically exhibit. The only problem is, a determined hacker behind the scraper can overcome any detection method you use.
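The ‘dozens of pages in one minute’ test is the easiest of these clues to code. Below is a minimal sketch of a sliding-window counter. The class name, the limit of 24 hits per 60 seconds, and the idea of keying visitors by some client id (such as an IP address) are all assumptions of mine, and a real site would tune them.

```python
from collections import defaultdict, deque

class RateWatcher:
    """Flag a client that requests too many pages within a time window."""

    def __init__(self, max_hits=24, window_seconds=60.0):
        self.max_hits = max_hits
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client id -> recent request times

    def looks_automated(self, client_id, now):
        """Record a request at time `now` (in seconds) and report if the limit is exceeded."""
        recent = self.hits[client_id]
        recent.append(now)
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        return len(recent) > self.max_hits
```

Remember that the search engine spiders also fetch pages quickly, so you would want to exempt the crawlers you care about before denying access to anyone.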

Another technique is to simply not put links to everything in your database at any one time. Have a randomly selected sub-set of database records that are shown to all the visitors for a week or two. Then change the records selected for display. The others will still display if they have been bookmarked, but there are no active links to them. The search engines will find them and keep them in their index, because they continue to ‘work’, but a scraper will only get a sub-set of your real data. Of course if the programmer behind the scraper realizes the data is incomplete they can easily get the whole database just like the search engines will, by repeated visits. But most site-wide scrapes are hit-and-run events. The scrapers don’t have enough interest in your site to visit it more than once.
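One simple way to get ‘the same random sub-set for everyone, changed every week or two’ is to seed the random number generator with the number of the current period. A minimal sketch, with a function name and a 14-day default that are my own choices:

```python
import datetime
import random

def visible_ids(all_ids, show_count, today, period_days=14):
    """Pick the record ids to link to during the period that contains `today`."""
    # Every date inside the same period gives the same seed, so every
    # visitor (and every page view) sees the same selection.
    period = today.toordinal() // period_days
    rng = random.Random(period)
    ordered = sorted(all_ids)
    return sorted(rng.sample(ordered, min(show_count, len(ordered))))
```

Call it with datetime.date.today() when you build your index pages. Records that are not in the list keep their own URLs, so bookmarks and search engine listings continue to work.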

This technique is only appropriate in certain cases, with particular types of data, but when it can be used it is fairly effective. The scraper gets a sub-set of your data, throws up a website with that, but never approaches your rank in the search engines because their site is smaller, has duplicate content, etc.

An even better way to beat the competition is to have a database that never stops growing. By adding new content all the time, the scraper can never ‘catch up’ with you. Also, you can insert some original material in with the public-domain information. You have copyright to the original material; make it clear that your site uses a combination of public domain and original material — that will discourage copying by honest scrapers. Of course the crooks are another matter…

December 12, 2007

Copyright Basics

As promised in the preceding post, I will try to explain here the basics of copyright in the U.S.A. with the usual cop-out ‘I am not a lawyer and nothing I say should be considered legal advice.’

Copyright applies to things like written texts and photographs. Sound recordings are a special and complex situation prior to 1972, but I won’t even go into that here. Search the web for information specific to sound recording copyright if that applies to you.

Copyright applies to ‘creative works’ and protects the form of expression, not facts or information. Titles and names cannot be copyrighted. Simple compilations like the phone book cannot be copyrighted.

Copyright is not the only law that might apply to content for your web site. You should also consider things like privacy laws, trademarks and patents if any of those apply. As I said yesterday, if you allow content from users, then consider whether you may be liable for libel or illegal content.

Anything you write is automatically copyrighted, and remains protected for 70 years after your death. Plenty of time for you to be forgotten so your work gets lost for all time.

Luckily, earlier copyright laws were not so over-protective. Several laws were passed having different effects on the public domain. In 1977 the law requiring that the © character or the word ‘copyright’ be used was changed, so basically anything published prior to that date without a copyright notice is (usually) public domain. The exceptions are when the notice was accidentally left off some prints and subsequently corrected … which the original law requiring the notice allowed for. Also note that nothing in the law specifies that (c) can be substituted for the c in a circle, but I’ve never heard of anyone pressing that as a defense in court.

An important word in the preceding paragraph is ‘published’ — note that it is not the creation date, but publication date that is important. If a photograph, for example, was taken before that date, but not published — or published after 1977, it may be copyrighted even if it lacks a copyright notice. Be very careful about relying on the lack of copyright notice if you do not know the date something was published.

So what else is in the public domain in 2007? Anything published in the USA before 1963 where they failed to renew the copyright. The vast majority of copyrighted material was never renewed. Best-selling novels, movies and famous photographers’ images are the most common exceptions. But you never know for sure until you check the copyright office records. Some of these are being put on-line to make that easier. Also, anything published before 1923 is now public domain.

Again, note that the preceding paragraph refers only to published material. Photographs, for example, may not have been published, and fall under much more protective later legislation. The same goes for documents, journals or other unpublished material. If the author (or photographer) is known, they are protected for 70 years after the death of the author, so if the author died before 1937 the work is public domain. If the author is unknown (anonymous) or corporate, the item is protected for 120 years from the date of creation, so it must have been created before 1887 to be public domain!
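The rules of thumb in this post boil down to a handful of date comparisons, which can be written out as a short Python function. This is only a sketch of the simplified rules exactly as I have stated them here, for U.S. works as of 2007. The function name and parameters are made up for illustration, and it is no substitute for checking the copyright office records.

```python
from typing import Optional

def likely_public_domain(published: Optional[int] = None,
                         had_notice: bool = True,
                         renewed: bool = True,
                         author_death: Optional[int] = None,
                         created: Optional[int] = None,
                         year_now: int = 2007) -> bool:
    """Apply this post's simplified U.S. rules of thumb (not legal advice)."""
    if published is not None:
        if published < 1923:
            return True   # published before 1923
        if published < 1963 and not renewed:
            return True   # copyright was never renewed
        if published < 1977 and not had_notice:
            return True   # published without a copyright notice
        return False
    # Unpublished material falls under the later, more protective rules.
    if author_death is not None:
        return author_death < year_now - 70   # before 1937 when year_now is 2007
    if created is not None:
        return created < year_now - 120       # before 1887 when year_now is 2007
    return False  # not enough information, so assume it is protected
```

Note that the function answers False whenever it lacks the facts to decide, which is the safe way to err.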

The big caveat to this ridiculous situation is that in order to be sued for copyright infringement, the person suing would have to PROVE who the author of the anonymous work was, and that they were alive within the past 70 years, and that the person suing had inherited the rights to the deceased author’s work. Could happen I suppose, but what are the odds?

For works published outside the USA, different rules sometimes apply, but again most works produced before 1923 that were published are public domain. Works published in Afghanistan, Eritrea, Ethiopia, Iran, Iraq, or San Marino are not protected at all in the USA, even new works. Confused yet?

Make your best effort to build your Web Empire on sturdy legal grounds, and avoid copyright problems. If it is not an obvious case of infringement, and you are in one of the ‘fringe’ areas, most copyright owners will just notify you to cease and desist; it is rarely worth the cost to sue.

December 11, 2007

Free Content

If you are going to build a large Web Empire, you will need lots of content. Of course some of it you will write yourself, or hire others to write. Some of it you will create in other ways, such as taking photographs to be added to your sites. Some of it you may buy. But there always comes a time when you think, ‘Isn’t there some free information out there for me to use?’

Well there is, but you need to be careful. There are three types of free information you can use. One is public domain texts and photos. Second is user generated content. Third is what I call free licensed content — content you are permitted to use, but under certain restrictions (typically a link-back or other acknowledgment).

Public domain does not always mean old! Certainly, one way copyrighted material enters the public domain is through age, but there are other types of public domain documents. The U.S. government has, by law, declared much of the material they produce public domain. Exceptions are material protected under secrecy laws, privacy protection laws, or when a government document includes information from outside sources (such as contractors), the included information may be protected by copyright. Even with these exceptions, the U.S. government produces and publishes vast quantities of material that we are free to use.

Material that has entered the public domain through age (copyright expired) is also a rich source of content, and some of it is not as old as you think. Most sources say it is ‘safe’ to use anything produced before 1923. That is NOT TRUE. Some material as recent as 1963 can be in the public domain through expired copyright, and things as old as 1887 might be protected by copyright. I’ll explain some of the complexities of copyright in another post.

Our second class of free content is user generated. These are most commonly things like forum sites, where users post comments and discuss topics; submission sites, such as photograph sites, wikis, or software sites, where users are allowed to submit content; or social networking sites. Some types of games include material submitted by the players in their content too. Some blogs are open to submissions from the public, and others allow public comments on posts. For all these sites, you provide the infrastructure (programming and site layout) and users provide the content. Generally, you need to ‘seed’ the site with some content to attract others. The biggest problems with this type of site are spam and legal liability. They require constant monitoring to ensure quality. You had better have a means of removing photographs rapidly if someone posts illegal images to your photograph site. Some wiki sites rely on users to edit out bad content, but you can’t rely on that if your site is for-profit. People don’t like working for free when others profit from it.

The third type of free content is licensed. These can be formal licenses, like the various open content licenses (OPL, Creative Commons, GNU, etc.), or they may be informal, such as the ‘you can reprint this article if the links remain intact’ notice that you see in free article repositories. It can be difficult to find information of this sort that isn’t already used so widely that you will have a duplicate content problem. With photos that people submit to a site like Flickr, you might find images marked public domain, only to find out the person posting them had no rights to begin with: they didn’t take the picture but copied it out of a magazine or from another website.

So all of these sources have some merit, but in each case you need to be careful to avoid the pitfalls mentioned. Duplicate content is probably the biggest drawback of free content, but there is a way to get around that. Annotated content, in which you intersperse your comments with direct quotes from a public domain source, allows you to produce a semi-unique page that has some of your content and some from the public domain. Of course you can quote even copyrighted material, but ‘fair use’ dictates that those quotes be relatively short and constitute only a small percentage of the original work. If you quote from an expired copyright work, you can include the entire document, just so long as you include enough of your comments to avoid the duplicate content problem.

So how much of the content needs to be yours? That’s the $64,000 question that SEO experts have been arguing over for years. I’m currently doing some testing to see for myself if I can come up with a reasonable figure, but those results are several months away from completion. Most say you are safe with 50%. My feeling is you can get away with less, maybe 25%, but I’ll need to wait until my test results are in to give a more confident answer to that.
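Whatever the right figure turns out to be, you can at least measure where an annotated page stands. Here is a minimal sketch that counts words. Measuring by word count is my own assumption, since nobody outside the search engines knows how they actually weigh duplicate content.

```python
def original_share(segments):
    """Return the fraction of words on a page that are your own.

    `segments` is a list of (text, is_original) pairs in page order,
    where is_original is True for your comments and False for quoted material.
    """
    own = sum(len(text.split()) for text, mine in segments if mine)
    total = sum(len(text.split()) for text, _ in segments)
    return own / total if total else 0.0
```

A page with 3 words of your commentary and 9 quoted words scores 0.25, right at the lower figure I guessed at above.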