Reimagining the Future of Historical Research, One Box at a Time

Oct 11, 2024

Imagine the frustration of a scholar who, after traveling thousands of miles to the National Archives in College Park, must wait hours to access a single box that may or may not contain the written documents they’re seeking.

These boxes can easily contain 100 documents, and for many of them few or none of the hard-copy documents have been digitized. This all-too-common scenario in archival research, wherein researchers must deduce on their own what content is in the boxes they requested, can feel like searching for a historical needle in a very large haystack.

Researchers at the University of Maryland, with assistance from a trio of information experts from Japan, are ready to offer digital solutions to this dilemma, developing information retrieval tools that will rely on machine learning algorithms and sophisticated cameras to better identify and categorize important historical records.

Douglas W. Oard, a professor in the College of Information and the University of Maryland Institute for Advanced Computer Studies (UMIACS), is leading the project. An expert in information retrieval, Oard envisions an advanced system in which scholars can enter their requests in natural language even before arriving at the Archives. With the help of the system, they will then be able to work with Archives staff to more easily identify the boxes most likely to contain content relevant to their query.

This is a massive undertaking, Oard says: the paper documents currently held in the National Archives’ 43 facilities total well over 10 billion pages, of which only about 2% are digitized. The key, he explains, will be combining his team’s collective expertise in information retrieval, digital curation, data mining, computer vision, pattern recognition and artificial intelligence (AI) to develop an archival query platform that is efficient, accurate and user-friendly.

First, Oard will seek to leverage powerful AI tools to map the physical layout of the National Archives in College Park, using machine learning algorithms to assist in predicting the contents of undigitized boxes based on what is known about neighboring records.

“We need to develop a system capable of understanding how the Archives is organized,” Oard explains. “This way, for example, we might infer that the box in the middle is likely to contain materials related to both the box on its left and the box on its right.”

Additional input will come from Tokinori Suzuki, an assistant professor of informatics at Kyushu University in Japan, who currently has a visiting appointment at UMIACS. Supported by a $132,000 grant from the Japan Society for the Promotion of Science, Suzuki will bring invaluable expertise in developing technology for expanding the range of metadata that can describe what an archive contains, and where in the archive that data can be found.

Suzuki’s novel approach to this challenge is based on mining academic literature that has been published online for references to materials held by archival institutions. By analyzing these references, Suzuki and two of his colleagues from Kyushu University will gather relevant information about documents that historians have already seen, but that have not yet been individually described by the archive that holds them.

Other assistance on the project comes from College of Information faculty members Diana Marsh, an assistant professor of archives and digital curation, and Katrina Fenlon, an assistant professor specializing in digital collections.

David Doermann, a longtime UMIACS research scientist who is now chair of the Department of Computer Science and Engineering at the University at Buffalo (SUNY), is working on further expanding how Oard’s team can learn more about what people may have already seen in an archive. Doermann is developing a sophisticated camera system that works like the human eye, shifting its gaze and adjusting its focus to see documents with the acuity needed to digitize them in real time.

The computer vision software in the cameras will register the content of historical documents being viewed by scholars, literally looking over their shoulder (with permission) as they work. The software will then add that data to the repository of information on which Oard’s team can base its predictions of where unseen documents might be found.

Despite the hurdles ahead—due in no small part to the sheer volume of data involved—the research team believes their work holds profound promise.

“The project poses new challenges, but the potential to transform access to historical documents is what drives us,” Oard says. “We’re not just building technology—we’re working to enrich the future of historical research, one box at a time.”

—Story by Melissa Brachfeld, UMIACS communications group