Scraping and Being Scraped
Scraping is simply using a software program to extract data from a website. As long as the data is not copyrighted, or the person doing the scraping does not republish it, it is generally legal. There are free programs available to scrape information from websites, or you can easily write your own if you understand PHP or another server-side scripting language.
Search engines are scrapers, but unlike most scrapers they take in whole web pages and web sites rather than extracting specific data. In fact, they usually give users access to (i.e. ‘publish’) a ‘cached’ version of a website on their own servers. That is arguably copyright infringement, but search engines provide such an essential service that it is overlooked. Can you imagine trying to find anything online without a search engine? They also provide an opt-out method, the robots.txt file, for people who want to keep some part (or all) of their site out of the search results, and no one objects.
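As a rough illustration of the robots.txt opt-out, here is a minimal Python sketch that uses the standard library's urllib.robotparser to check whether a crawler may fetch a URL. The robots.txt content, the paths, and the bot names are all made up for the example; a real site's file will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (paths and bot name are invented).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Any crawler may fetch ordinary pages but not /private/.
allowed_public = parser.can_fetch("SomeCrawler", "https://example.com/index.html")
allowed_private = parser.can_fetch("SomeCrawler", "https://example.com/private/data.html")

# ExampleBot has been opted out of the whole site.
allowed_examplebot = parser.can_fetch("ExampleBot", "https://example.com/index.html")
```

Keep in mind that robots.txt is only a request. Well-behaved crawlers honor it, but nothing forces a scraper to read it.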
Don’t try to imitate search engines and publish entire pages you have scraped (copied), or you will likely find yourself being sued. It is possible, however, to scrape specific information from a site without infringing on copyrights. For example, you can easily scrape the current price of a specific stock from any of the hundreds of sites publishing that information, and put it on your own site. Facts such as a stock price are not copyrightable, and you are not copying the site's form of expression, which is the actual HTML code it uses to display that information. Of course, whenever the site is updated with a new layout or new coding, the change is likely to ‘break’ your scraper, so check for valid data before displaying anything on your website.
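The advice to check for valid data before displaying it could look something like the following Python sketch. The sample markup, the last-price id, and the sanity bounds are assumptions invented for the example; a real quote page will be laid out differently and needs its own pattern.

```python
import re
from decimal import Decimal, InvalidOperation

# Hypothetical markup; a real quote page will look different.
SAMPLE_HTML = '<div class="quote"><span id="last-price">$1,234.56</span></div>'

# Assumed pattern: the price sits in an element with id="last-price".
PRICE_PATTERN = re.compile(r'id="last-price"[^>]*>\s*\$?([0-9][0-9,]*\.?[0-9]*)\s*<')


def extract_price(html):
    """Return the price as a Decimal, or None if the page no longer matches."""
    match = PRICE_PATTERN.search(html)
    if match is None:
        return None  # layout changed, so the scraper is 'broken'
    try:
        price = Decimal(match.group(1).replace(",", ""))
    except InvalidOperation:
        return None
    # Arbitrary sanity bounds (an assumption): reject zero or absurd values.
    if not (Decimal("0") < price < Decimal("1000000")):
        return None
    return price


def display_text(html):
    """Show the price only when the scraped value passed validation."""
    price = extract_price(html)
    return f"${price}" if price is not None else "Price unavailable"
```

When the source site changes its layout, extract_price returns None and the page shows a placeholder instead of garbage.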
But what if you have a database-driven site that is composed of public domain content? How do you keep others from scraping your content and using it on their own site? One thing you can do is hire a programmer to write tracking code that looks at the behavior of each visitor and denies access if the visitor appears to be a program, for example because it retrieves dozens of pages in one minute or shows any of several other tell-tale signs that scrapers typically exhibit. The only problem is that a determined hacker behind the scraper can overcome any detection method you use.
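One of the simplest behavioral checks, counting page requests per visitor within a time window, might be sketched in Python as follows. The RateTracker class, the threshold of 24 requests per 60 seconds, and the use of an opaque visitor id (such as an IP address) are all assumptions for illustration, not a complete detection system.

```python
import time
from collections import defaultdict, deque


class RateTracker:
    """Flags visitors who request too many pages within a sliding window."""

    def __init__(self, max_requests=24, window_seconds=60.0):
        self.max_requests = max_requests      # assumed threshold
        self.window_seconds = window_seconds  # assumed window length
        self._hits = defaultdict(deque)       # visitor id -> request timestamps

    def looks_automated(self, visitor_id, now=None):
        """Record one request and report whether the visitor exceeds the limit."""
        if now is None:
            now = time.monotonic()
        hits = self._hits[visitor_id]
        hits.append(now)
        # Discard timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests
```

A request handler would call looks_automated(visitor_id) on every page view and deny access when it returns True. A scraper that slows down or rotates addresses will slip past this, which is the limitation noted above.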
Another technique is simply to avoid linking to everything in your database at any one time. Choose a random subset of database records and show it to all visitors for a week or two, then change the records selected for display. The other records will still display if they have been bookmarked, but there are no active links to them. The search engines will find them and keep them in their index because they continue to ‘work’, but a scraper will only get a subset of your real data. Of course, if the programmer behind the scraper realizes the data is incomplete, they can get the whole database through repeated visits, just as the search engines do. But most site-wide scrapes are hit-and-run events: the people behind them don’t have enough interest in your site to visit it more than once.
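The rotating subset can be chosen without storing any extra state by seeding a random generator with the current period number, so every visitor sees the same selection until the period rolls over. The Python sketch below assumes integer record ids, a 14-day period, and a secret string that keeps the selection from being guessed; all of these names and values are hypothetical.

```python
import random

SECONDS_PER_DAY = 86400


def visible_record_ids(all_ids, sample_size, now_seconds,
                       period_days=14, secret="change-me"):
    """Return the record ids that get active links during the current period."""
    # Every timestamp inside the same period maps to the same period number.
    period = int(now_seconds // (period_days * SECONDS_PER_DAY))
    # Seeding with the period makes the choice repeatable for all visitors.
    rng = random.Random(f"{secret}:{period}")
    ids = sorted(all_ids)
    return sorted(rng.sample(ids, min(sample_size, len(ids))))
```

Calling visible_record_ids(ids, 100, time.time()) when building index pages would link to 100 records for two weeks and then switch to a different selection. Pages for the unlinked records should still be served when requested directly.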
This technique is only appropriate in certain cases, with particular types of data, but when it can be used it is fairly effective. The scraper gets a subset of your data and throws up a website with it, but that site never approaches your rank in the search engines because it is smaller, has duplicate content, etc.
An even better way to beat the competition is to have a database that never stops growing. If you add new content all the time, the scraper can never ‘catch up’ with you. You can also mix some original material in with the public domain information. You hold the copyright to the original material, so make it clear that your site uses a combination of public domain and original material. That will discourage copying by honest scrapers. Of course the crooks are another matter…