The HathiTrust Research Center
Introduction to HTRC — Stephen Downie
WCSA/Collection Building — Jacob Jett and Pip Willcox
Feature Extraction — Peter Organisciak
HTRC Bookworm — Loretta Auvil
Thursday, October 23, 14:00
While large corpora of digitized books provide unprecedented opportunities for novel modes of scholarly inquiry, through computational analysis, into the cultural legacy of humankind as preserved in print media, corpora that include in-copyright books, which cannot be made available to scholars in bulk for computational analysis, present a special challenge. A second challenge, especially for inquiry in the humanities, is that the interface should provide scholars with tools for distant reading as well as close reading. This panel will consist of four papers that present a complementary set of approaches to these problems, providing scholarly users with interfaces to the text data along collection-centric, document-centric, and word/phrase-centric threads of inquiry into the corpus.
The first paper will present the underlying cyberinfrastructural approach to this problem, embodied in the non-consumptive model of scholarly research: rather than the data being downloaded, the user interacts with the data through an interface that brings the algorithms to the data, instead of bringing the data to the algorithms on the user's side. The second paper will describe the collection-centric interface to the corpus via the notion of the workset (a customized subset of the corpus that a scholar gathers, curates, and sustains through version management), and will conceptually describe a data model that can be operationalized into an architecture and interface to support worksets. The third and fourth papers will describe the HTRC Bookworm and HTRC Feature Extraction initiatives, which represent word/phrase-centric and document-centric approaches, respectively, to interfacing with the corpus: the former enables the discovery of chronological trends for specific words or phrases, at different grain sizes, within worksets; the latter treats each document as a bag of words and generates frequency counts for textual features of interest on a per-page basis.
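The two corpus-facing ideas above can be made concrete with a minimal sketch: per-page bag-of-words counts (the Feature Extraction view of a document) and a simple yearly relative-frequency trend for one word across a small workset (the Bookworm view). This is an illustration under assumed data shapes, not the HTRC implementation or API; the function names `page_token_counts` and `word_trend`, the naive tokenizer, and the `(year, pages)` workset layout are all hypothetical.

```python
# Illustrative sketch only: assumed data shapes, not the HTRC API.
from collections import Counter
import re

def page_token_counts(pages):
    """Per-page bag-of-words frequency counts for a document,
    given as a list of page texts (naive lowercase tokenizer)."""
    return [Counter(re.findall(r"[a-z]+", p.lower())) for p in pages]

def word_trend(workset, word):
    """Yearly relative frequency of `word` across a workset, where
    each workset item is a (publication_year, page_texts) pair."""
    totals, hits = Counter(), Counter()
    for year, pages in workset:
        for counts in page_token_counts(pages):
            totals[year] += sum(counts.values())  # all tokens that year
            hits[year] += counts[word]            # occurrences of `word`
    return {y: hits[y] / totals[y] for y in sorted(totals)}

# Hypothetical two-volume workset
workset = [
    (1851, ["Call me Ishmael.", "The whale, the whale!"]),
    (1859, ["It was the best of times."]),
]
trend = word_trend(workset, "whale")
```

Note that only the counts, never the page texts themselves, would be exposed to the scholar under the non-consumptive constraint.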
Overall, the suite of approaches described above constitutes a coherent strategy for enabling and supporting, under the constraint of non-consumptivity, scholarly reading of, and interaction with, large corpora. The panel will draw on work done at the HathiTrust Research Center (HTRC), where these interfaces are at different stages of deployment, prototyping, and conceptualization.