Mine All (Data) Mine – Drew Smith
As genealogical researchers, we are awash in data. Our computer hard drives and 3.5″ disks are full of data. The Internet is full of data. Our file cabinets are full of data (data that isn’t even in digital form yet). Every now and again, we get this sneaking suspicion that we already possess the answers to some of our questions, but that those answers are buried somewhere amid the vast quantities of data we already own or have access to.
The situation reminds me of a science-fiction story I once saw broadcast on television, in which a rich man gives all of his money to the devil so that he can go back in time and purchase property he knows contains a huge reservoir of fossil fuels. Unfortunately, he fails to realize that the technology of that time period is incapable of accessing the energy resources. Like that unfortunate time traveler, we find ourselves rich with data, but without the technology to benefit from it.
During the past decade, information scientists have been studying and developing a technology that may soon prove to be of enormous value to genealogical researchers. Known as “data mining,” it is capable of analyzing our data and looking for patterns that might provide the answers we seek. Data mining, also known as “knowledge discovery in databases” (KDD), involves a process that could, in the very near future, be built into the genealogy research software we already use.
The first step in data mining is the collection of data from a variety of sources. Sources currently available only in print or microfilm form would be converted to digital information. Data is retrieved from government databases (such as the Social Security Death Index), news wires (think of the number of news stories and obituaries that may contain information you need), the Web (think of how much is added there every day), genealogy message boards and mailing lists, your personal e-mail, and any other potential source of genealogical information.
The second step is for the data to be “cleaned.” As you might imagine, raw data from the sources previously mentioned needs to be standardized (date formats, abbreviations for locations, etc.). Next, the data must be stored. The repository for this large collection of data is referred to as a “data warehouse.”
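To make the cleaning step concrete, here is a minimal sketch in Python of how raw records might be standardized before going into a data warehouse. The record fields, date layouts, and abbreviation table are all hypothetical illustrations, not part of any real genealogy product:

```python
from datetime import datetime

# Hypothetical raw records gathered from mixed sources; the field
# names here are illustrative, not a real genealogy schema.
raw_records = [
    {"name": "John Smith", "date": "3/14/1852", "place": "Hamilton Co., OH"},
    {"name": "Mary Smith", "date": "1852-03-20", "place": "Hamilton County, Ohio"},
]

# A few common date layouts this sketch knows how to parse.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%d %b %Y")

def normalize_date(text):
    """Try each known layout and emit one standard YYYY-MM-DD form."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return text  # leave unparseable dates untouched for manual review

# Expand location abbreviations; "Co." must come before "OH" so the
# county abbreviation is expanded first.
PLACE_ABBREVIATIONS = {"Co.": "County", "OH": "Ohio"}

def normalize_place(text):
    """Replace known abbreviations with their full forms."""
    for abbr, full in PLACE_ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return text

cleaned = [
    {**r, "date": normalize_date(r["date"]), "place": normalize_place(r["place"])}
    for r in raw_records
]
```

After this pass, both records carry the same date format and the same spelled-out place name, so pattern-matching software can treat them as comparable.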
Once our genealogical data has been warehoused, we then need software that can begin to sift through it looking for patterns we have defined. For example, we might ask the software to look for a pattern that involves a cluster of people moving from location A to location B during a certain time frame. By doing this, we might identify a group of people who migrated with our own ancestors: a group that contains relatives we have not yet identified. Another example would be to use the software to look for naming patterns among our ancestors and among people who lived in the same places during the same times.
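The migration-cluster query described above can be sketched in a few lines of Python. The records, the five-year window, and the minimum group size are all assumptions chosen for illustration:

```python
from collections import defaultdict

# Hypothetical warehouse rows: (person, origin, destination, year of move).
moves = [
    ("John Smith",  "Virginia", "Kentucky", 1795),
    ("Mary Smith",  "Virginia", "Kentucky", 1796),
    ("Thomas Reed", "Virginia", "Kentucky", 1794),
    ("Anna Boone",  "Maryland", "Ohio",     1801),
]

def migration_clusters(moves, window=5, min_size=2):
    """Group moves by (origin, destination) route and keep only routes
    where at least `min_size` people moved within `window` years."""
    routes = defaultdict(list)
    for person, src, dst, year in moves:
        routes[(src, dst)].append((year, person))
    clusters = {}
    for route, entries in routes.items():
        entries.sort()  # order each route's moves by year
        years = [year for year, _ in entries]
        if len(entries) >= min_size and years[-1] - years[0] <= window:
            clusters[route] = [person for _, person in entries]
    return clusters

clusters = migration_clusters(moves)
# The Virginia-to-Kentucky route qualifies (three people within three
# years); the single Maryland-to-Ohio move does not.
```

A researcher who already knows one name in such a cluster could then examine the others as candidate relatives or traveling companions.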
As you can see, these questions are not the typical questions we can ask our genealogical database software today. But the cost of data storage continues to drop, making it feasible for us to create our own huge data warehouses for our research purposes. The software to analyze data for patterns is already being developed and used for financial and scientific purposes. Can it really be that far in the future before we see the same type of software available to us?
Drew Smith is an instructor with the School of Library and Information Science at the University of South Florida in Tampa. He is also a regular contributor to the quarterly journal Genealogical Computing, where he writes the “Cybrarian” column. He can be reached at drewsmith@aol.com.