nuṣūṣ Project
Emirhan Kabataş
Digital humanities continues to gain momentum, with new studies emerging from every field of the humanities. As digitalization has made interdisciplinarity a practical necessity for the humanities, text-based studies are becoming increasingly important. The digitization of texts in various formats since the early days of computing has eliminated the main problems of circulating scholarly knowledge and accessing primary sources, problems that date from a time before the idea of digital humanities had emerged. However, a character-level representation of digitized texts has become a necessity in both scholarly and other professional fields, because images of a written text, in whatever form, are not sufficient for its content to be processed effectively in a digital environment. OCR (Optical Character Recognition) technology, developed to overcome this problem, has been a turning point for text- and document-based work in many fields. Thanks to OCR, anyone working with a text can access every single character in it. This access has eliminated many fundamental problems and sources of wasted time, such as the need to re-enter certain data and the impossibility of searching within a text.
Text-based resources, among the most important types of sources in the humanities, have thus become much more convenient for scholars. However, although researchers could use functions such as in-text search on individual texts, the available technology was still insufficient for those analyzing a large number of texts within a shared context. At this point, the idea of gathering various texts into a "corpus" defined by a specific context emerged, creating an important opportunity for researchers who want to examine and analyze a large number of texts using specific filters.
The nuṣūṣ project, developed by Antonio Musto of New York University, Giovanni DiRusso of Harvard University, and Jeremy Farrell of Emory University, is one of the projects that grew out of this idea of a "corpus" (Figure 2). nuṣūṣ can be described as a digitized corpus of Arabic texts designed to fill gaps in the existing digital corpora. Although the project was initially conceived as a collection centered on Sufi texts, it has expanded to include a number of theological, philosophical, and Christian theological texts.
As of July 2023, the nuṣūṣ database contains 91 texts by 34 authors, totaling 1,083,050 words and 4,707 pages. The texts in this database are digitized using eScriptorium, a digital paleography tool using the Kraken OCR engine developed by OpenITI. The project website, www.nusus.net, consists of five tabs: Home, Corpus Info, Browse Corpus, Search in Corpus, and About. The Home tab contains an introductory text as well as information about updates and changes to the corpus. The Corpus Info tab contains technical details and statistics about the corpus. The Browse Corpus tab (Image 3) allows users to browse the metadata of the works in the corpus and perform simple searches; the links in this section lead to information such as author biographies and details about the works. Search in Corpus (Image 1), one of the most functional parts of the project, allows for detailed corpus searches with specific filters. The final tab, About, contains information about the project developers as well as copyright notices.
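To make the idea of a filtered corpus search more concrete, the following is a minimal sketch in Python. It is purely illustrative: the `Text` record, its fields, the `search` function, and the sample data (short transliterated placeholder strings) are all hypothetical and do not reflect how the nuṣūṣ website is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Text:
    # Hypothetical metadata fields, chosen only for illustration.
    title: str
    author: str
    genre: str
    body: str


def search(corpus, term, author=None, genre=None):
    """Return (title, count) pairs for texts that pass the optional
    author/genre filters and contain the search term at least once."""
    hits = []
    for text in corpus:
        if author is not None and text.author != author:
            continue
        if genre is not None and text.genre != genre:
            continue
        count = text.body.count(term)  # plain substring count
        if count:
            hits.append((text.title, count))
    return hits


# Placeholder corpus; the bodies are invented transliterated snippets.
corpus = [
    Text("Text A", "Author 1", "sufism", "qalb wa-nafs wa-qalb"),
    Text("Text B", "Author 2", "theology", "aql wa-qalb"),
    Text("Text C", "Author 1", "theology", "aql wa-nazar"),
]

all_hits = search(corpus, "qalb")
theology_hits = search(corpus, "qalb", genre="theology")
```

Here the unfiltered search returns hits from every matching text, while the `genre` filter narrows the same query to a subset of the corpus, which is the kind of operation that searching individual files one by one cannot offer.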
nuṣūṣ, which offers a meaningful contribution to researchers from many fields and disciplines, is still under development and can be enriched in the future with the addition of many more sources and texts. The word nuṣūṣ itself is defined in the dictionary as "clear and definite judgments whose veracity is beyond doubt" (Kubbealtı Lügatı). The project website can be accessed at www.nusus.net