Currently I’m working on data entry software for census records. There are a lot of variables to consider. How much information should be included? Which facts are to be recorded? How closely is the data to be source-referenced (i.e. page/folio, or the exact line for each entry)? Are there shortcuts that can make the indexing process easier or more effective?
First, I don’t understand why anyone would want to make a complete transcription of census data, such as some Genweb sites have attempted. Seems like a massive waste of time to me. If the transcription is complete, you don’t need to check the original, right? Well, no, it does not work that way. Genealogists should always confirm the data in the original anyhow — so what do you gain?
The only rational reason for a complete transcript of any original record is that it provides a back-up if the original is destroyed or is not accessible. Well, that certainly does not apply to US Federal Census records. There are so many microfilm and digital copies around that only a global disaster could wipe them all out, and in that case the transcriptions would probably go too, and there would be nobody left to care.
So what we are talking about here is indexing. Indexes are vital for making information accessible. But how much detail is enough? We want people to be able to determine which of the dozens of John Smith listings in a city is the John Smith they seek. Obviously, year of birth is one of the main identifiers, and that is available on Federal censuses from 1850 forward, though in most years it must be calculated from the person's age. Those same censuses also list the birthplace of each individual (at least the state or, in the case of the foreign-born, the country), and that is a big help in distinguishing similarly-named individuals too. But it is really the family relationships that are usually the clincher in distinguishing individuals.
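As a rough sketch of that birth-year calculation (not the actual program; the function name is made up for illustration), the arithmetic looks something like this in Python. It assumes the age recorded is the age in completed years on census day, so the person could have been born in either of two calendar years, depending on whether the birthday fell before or after census day.

```python
from typing import Tuple


def estimate_birth_year(census_year: int, age: int) -> Tuple[int, int]:
    """Return the (earliest, latest) possible birth year.

    Hypothetical helper: assumes `age` is the age in completed years
    on census day. Someone aged 30 in 1850 was born in 1819 or 1820.
    """
    if age < 0:
        raise ValueError("age cannot be negative")
    latest = census_year - age
    return (latest - 1, latest)


earliest, latest = estimate_birth_year(1850, 30)
print(f"born about {earliest}-{latest}")
```

An index that wants a single year would most likely store the later of the two (census year minus age) and let the search allow a year or so either way, since ages on censuses are often approximate anyhow.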
With our Rec2Gen software, relational information is available: click on an individual's name, look at the relatives, friends and associates sections of the full report, and the entire household is listed. So obviously our data entry software needs to enter the data for entire households together.
Another problem I have with current indexing practice is this notion that you have to copy exactly what the original record shows. If you are citing information in your compiled genealogy, or a genealogy report of some kind, then yes, you need to report exactly what is shown in the record. But an INDEX is not a RECORD. It is just a finding aid. Making changes that make it easier for people to find the entries they seek just makes sense. Why should people have to check both the Wi and Wm parts of any alphabetized list looking for William? And while common abbreviations can be accounted for in computerized search programs, it is impossible to include every possible variation. Why not just index it as William, and when the researcher views the original record they will record Wm or Willm, or whatever they find?
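To make the idea concrete, here is a minimal Python sketch of indexing under the full name. The lookup table and the function name are assumptions for illustration only, and the table holds just a handful of common abbreviations; a real one would be much longer and would still never be complete.

```python
# Hypothetical lookup table: abbreviation (lowercase, no period) -> full name.
ABBREVIATIONS = {
    "wm": "William",
    "willm": "William",
    "jno": "John",
    "jas": "James",
    "chas": "Charles",
    "thos": "Thomas",
    "geo": "George",
}


def index_given_name(as_written: str) -> str:
    """Return the name to file the entry under in the index.

    Known abbreviations are expanded; anything else is kept as
    written (minus surrounding spaces and a trailing period).
    """
    cleaned = as_written.strip().rstrip(".")
    return ABBREVIATIONS.get(cleaned.lower(), cleaned)


for written in ["Wm.", "Willm", "Jno", "Mary"]:
    print(written, "->", index_given_name(written))
```

The expansion happens once, at data entry, so the person searching the index only ever has to look under William.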
So, getting back to my programming, the indexing program for 1850+ censuses will simply include name, birth year (calculated if need be) and birthplace. Data will be entered for a complete household, even if it happens to span more than one page or folio. That allows for the most rapid data entry, lets me write just one program for many different census years, and still provides easy access to the most important identifying characteristics. It will also allow members to enter just their own family from the censuses.
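Purely as an illustration of that design, and not the real program, a household-at-a-time entry record might be sketched in Python like this. The class names, field names and the page reference are all hypothetical; the only things taken from the plan above are the three indexed facts (name, birth year, birthplace) and the rule that a household is entered as one unit.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IndexedPerson:
    name: str
    birth_year: int
    birthplace: str


@dataclass
class Household:
    census_year: int
    first_page: str  # page/folio where the household begins (assumed field)
    members: List[IndexedPerson] = field(default_factory=list)

    def add_member(self, name: str, birthplace: str,
                   age: Optional[int] = None,
                   birth_year: Optional[int] = None) -> None:
        """Add one person; calculate the birth year from age if need be."""
        if birth_year is None:
            if age is None:
                raise ValueError("either age or birth_year is required")
            birth_year = self.census_year - age
        self.members.append(IndexedPerson(name, birth_year, birthplace))


# One household entered together, wherever its lines fall on the pages.
home = Household(census_year=1850, first_page="12")
home.add_member("John Smith", "Ohio", age=40)
home.add_member("Mary Smith", "Ireland", age=38)
for person in home.members:
    print(person.name, person.birth_year, person.birthplace)
```

Because the census year is just a field on the household, the same record layout would serve for any of the 1850+ censuses.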