
ISBN 978-88-942535-7-3

Copyright ©2023 AIUCD


Associazione per l’Informatica Umanistica e la Cultura Digitale

This volume and all contributions are released under the Creative Commons Attribution Share-Alike 4.0 International license
(CC-BY-SA 4.0). All other rights retained by the legal owners.

Edited by: Carbé, Emmanuela; Lo Piccolo, Gabriele; Valenti, Alessia; Stella, Francesco (2023). La memoria digitale: forme del
testo e organizzazione della conoscenza. Atti del XII Convegno Annuale AIUCD, Siena: Università degli Studi di Siena

All links were visited on 22nd May 2023, unless otherwise indicated.

Please notify the publisher of any omissions or errors found, so that they can be rectified.
aiucd.segreteria [at] aiucd.org
All the papers published in this volume have received favourable reviews by experts in the field of DH, through an anonymous
double-blind peer review process under the responsibility of the AIUCD 2023 Scientific Committee.

The AIUCD 2023 Conference Program is available online
http://www.aiucd2023.unisi.it

Scientific Committee

Nicola Barbuti
Marina Buzzoni
Emmanuela Carbé (co-chair)
Fabio Ciracì
Fabio Ciotti
Angelo Mario Del Grosso
Maurizio Lana
Monica Monachini
Paolo Monella
Roberto Rosselli Del Turco
Gino Roncaglia
Francesco Stella (chair)
Francesca Tomasi

Organizing Committee
Elisabetta Bartoli
Paola Bellomi
Monica Bianchini
Silvia Calamai
Riccardo Castellana
Rosalba Nodari
Antonio Rizzo
Enrico Zanini

Conference Secretariat: Gabriele Lo Piccolo, Francesca Pietrini;


Giulia Bassi, Silvia Cappa, Chiara Cauzzi, Martina Corti, Elena Crocicchia, Anna Guadagnoli, Giada Giannetti, Bogdan Groza, Alessia
Luvisotto, Paola Mocella, Pietro Orlandi, Martina Paccara, Elisa Petri, Maria Grazia Schiaroli

Thanks are due to the Press, Institutional Communication and Digital Printing Office, the Research, Libraries,
Internationalization and Third Mission Office, and the Cultural Events and Conference Support unit of the University of Siena

Technical support: MCM Service, Santa Chiara Lab, Presidio San Niccolò

Organizing institutions

AIUCD;

Università degli Studi di Siena: Department of Philology and Criticism of Ancient and Modern Literatures (DFCLAM), Centro
interuniversitario di Studi Comparati I Deug-Su, Centro Interdipartimentale di Ricerca Franco Fortini in “Storia della tradizione
culturale del Novecento”, Santa Chiara Lab, in collaboration with the Departments of Information Engineering and Mathematical
Sciences (DIISM), of Social, Political and Cognitive Sciences (DISPOC), and of Historical Sciences and Cultural Heritage (DSSBC);

CLARIN-IT.

Under the patronage of: Journal of the Text Encoding Initiative


Track chairs

Archives, digital editions, knowledge organization
Marina Buzzoni, Paolo Monella, Roberto Rosselli Del Turco

Computational text analysis
Fabio Ciotti, Rachele Sprugnoli

Artificial Intelligence and models applied to cultural heritage
Monica Bianchini, Federico Boschetti

Preservation of memory and digital heritage
Nicola Barbuti, Maurizio Lana

Workshop
Francesco Stella, Emmanuela Carbé

List of reviewers

Stefano Allegrezza, Cristiano Amendola, Paolo Andreini, Laura Antonietti, Luca Bandirali, Sofia Baroncini, Elisabetta
Bartoli, Stefano Bazzaco, Andrea Bellandi, Paola Bellomi, Benedetta Bessi, Andrea Bolioli, Luca Bombardieri, Simone
Bonechi, Alice Borgna, Flavia Bruni, Paolo Buono, Dino Buzzetti †, Silvia Calamai, Anna Cappellotto, Giuliana Capriolo,
Vittore Casarosa, Riccardo Castellana, Paola Castellucci, Simona Chiodo, Fabio Ciracì, Elisa Corrò, Elisa Cugliana, Fabio
Cusimano, Christian D’Agata, Elisa D’Argenio, Vincenza D’Urso, Stefano Dall’Aglio, Marilena Daquino, Angelo Mario
Del Grosso, Antonio Di Silvestro, Diego Mantoan, Dominique Brunato, Dominique Longrée, Edmondo Grassi, Elena
Spadini, Giulia Fabbris, Pierluigi Feliciati, Paolo Fioretti, Franz Fischer, Greta Franzini, Francesca Frontini, Daniele Fusi,
Simone Giusti, Marco Grasso, Fabiana Guernaccini, Alessandro Iannella, Benedetta Iavarone, Alessandro Lenci, Eleonora
Litta, Agnese Macchiarelli, Marco Maggini, Elisabetta Magnanti, Francesco Mambrini, Tiziana Mancinelli, Anna Maria
Marras, Cristina Marras, Luca Martinelli, Stefano Melacci, Federico Meschini, Alessio Miaschi, Andrea Micheletti,
Giovanni Morrone, Rosalba Nodari, Giuseppe Palazzolo, Niccolò Pancino, Fiammetta Papi, Enrico Pasini, Marco
Passarotti, Giulia Pedonese, Igor Pizzirusso, Federico Ponchio, Francesca Pratesi, Alessia Lucia Prete, Simone Rebora,
Giulia Renda, Gino Roncaglia, Irene Russo, Enrica Salvatori, Eva Sassolini, Daniele Silvi, Daria Spampinato, Linda
Spinazzè, Francesco Stella, Matteo Tiezzi, Francesca Tomasi, Sara Tonelli, Gennaro Vessio, Paul Gabriele Weston,
Michelangelo Zaccarello, Patrizia Zambrano, Marco Zappatore, Andrea Zugarini

Digital Accrocchio: a computational image searching tool for social history
Tiago Luís Gil
University of Brasilia, Brazil – tiagoluisgil@gmail.com
ABSTRACT
Social history is fundamentally an attempt to include ordinary people in history. This inclusion, however, has always been
challenging because these people are scarcely represented in the sources, which has given this type of historical research
a markedly "inductivist" or "empiricist" character. How, then, can the story of ordinary people be told without combing
through countless sources for the few surviving traces of a handful of subalterns? This research addresses that problem by
studying the cotton spinners and weavers of colonial Brazil between 1790 and 1810. Besides being carried out by subaltern
people, this activity was partly forbidden by mercantilist policies. Despite this, its broad and significant existence is
recorded in a set of about 12,000 handwritten documents held in archives. To carry out this research, we use Python and
OpenCV to build a tool that can scan thousands of digitised documents in search of these (so far) unknown women.

KEYWORDS
Computational research tools; social history; fabric production

1. INTRODUCTION
For centuries, history was the story of “great” men: kings, generals, and presidents. In the 1950s, social history arose to
speak of peasants, serfs, and workers. This task required many sources, because ordinary people have always been weakly
represented in documents. In the 1950s, researchers began using punched cards to collect small fragments of information
about each individual and assemble a bigger picture.[1,2] In the 1980s, this research began to use computers, and several
methods and programs were created as part of that effort.[3,4,5] This proposal starts from a specific research problem and
develops digital solutions tailored to its needs, in tune with those initiatives from the 1980s. The case study is the
Captaincy of São Paulo (in present-day Brazil) between the end of the 18th century and the beginning of the 19th.

2. THE CONTEXT OF SPINNERS AND SOURCES FOR THEIR RESEARCH


At the end of the 18th century and the beginning of the 19th, spinning and weaving were widespread in several towns of the
Captaincy of São Paulo, in Brazil. The work was done by women, primarily poor women and widows, from the humblest half of
the population. Cloth was produced jointly by spinners and weavers, who occupied about 15% of the houses in the localities
of the Paraíba valley. It was the second most common activity in those localities, surpassed only by agriculture.[6]
The production of cloth was prohibited in Brazil by a Decree of 1795 that, on the argument of avoiding the neglect of
agriculture, aimed to reserve the colonial consumer market for Portuguese middlemen dealing in European fabrics, at a time
when England was emerging as the world's workshop. Despite the general prohibition, there was room for weaving, since the
production of thick fabrics for clothing enslaved people was permitted. There is no way to be sure (nor was there at the
time) whether free people also used these fabrics, and it was in this interstice that production continued.[7]
Weaving was not effectively combated, despite the prohibition. However, because of the Decree, which historians took
exceptionally seriously, historiography ignored the theme entirely. Historiography in turn guided the work of archivists,
who never included terms such as weaving, spinning, and fabric production in their cataloguing sheets, which hinders any
research on the subject.
The documents available are the so-called nominative lists of inhabitants, a kind of census carried out for economic
purposes that recorded the occupation of each household. The registers describe the entire community house by house, naming
the residents, giving the age and marital status of each, and distinguishing free from enslaved persons. Finally, the main
economic activity of each house is described, often in detail. The demographic historian Louis Henry addressed this
particular type of document in the Brazilian editions of his famous "Manuel de Dépouillement..."[8]
The lists, however, form a very voluminous set of data that challenges the researcher. The classic works of social history
ended up being restricted to a single city or community, so that it was possible to know all the anonymous people in that
locality. Such was the research of Goubert and Ladurie, to give two classic examples, and it was also in community contexts
that much microhistorical research was carried out. This was the necessary delimitation, but also the only one within the
reach of a single researcher, even with the help of many assistants, as was the case with Ladurie.[1,9,10]
The research at hand intends to widen the scope of this approach somewhat. Instead of one community, it covers four:
Guaratinguetá, Pindamonhangaba, Taubaté and Lorena, all neighbouring and all with a strong emphasis on cloth production.
As observed so far, the towns exchanged many people among themselves through marriages, cronyism, relocations, and trade in
enslaved people, much more than with other localities, which indicates strong integration. This integration was also
fundamental in guaranteeing the spinners a buyer for their yarn, since their poverty left them no room for uncertainty. In
other words, including four communities is not a mere increase in documentary evidence; it is intended to reveal a
geography of cloth production, which is only visible when the villages are taken together.
The question that arises is the volume of data that this four-village approach implies. Spinners are identified as such
only in the lists, but they also appear as godmothers and mothers in baptismal records, as wives and witnesses in marriage
registers, and as buyers and sellers in notarial records. All this, which would already be a great deal for a single
locality, becomes an almost unmanageable amount of data for all the localities together, and it calls into question the
viability of a single historian's research. It is therefore not a matter of speeding up work or processing more documents,
but of making possible a piece of research whose question demands little information spread over thousands of pages. For
this, we resort to programming.

3. THE SOLUTION FOUND: A HOME-MADE TOOL


A document image management system was created for the villages studied in this research proposal. The system organizes
the images and, according to the visual morphology of each document (whether in table form, as in the Nominative Lists, or
in the form of text "nests", as in the baptisms), separates them into records: house by house in the case of the Lists,
baptism by baptism in the case of the parish registers. These records are then stored as binary images in an SQLite
database (available within a Python installation), to be retrieved by a search system that identifies handwritten image
patterns when required. This tool is not HTR software but a more straightforward solution that uses only a few text samples
to search for specific words, without the need for training via artificial intelligence: a mere word searcher, or an image
“ctrl + F”.
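
The storage step described above can be sketched as follows. This is a minimal illustration rather than the project's actual schema: the table name `fragments`, its columns, and the helper functions are all assumptions introduced here, and only Python's standard `sqlite3` module is used.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open (or create) a database holding one row per segmented record.

    The schema is hypothetical; the paper does not describe its tables.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS fragments (
               id INTEGER PRIMARY KEY,
               village TEXT NOT NULL,
               source_type TEXT NOT NULL,  -- e.g. 'list' or 'baptism'
               page TEXT NOT NULL,         -- identifier of the scanned page
               position INTEGER NOT NULL,  -- order of the record on the page
               image BLOB NOT NULL         -- encoded image bytes (e.g. PNG)
           )"""
    )
    return conn


def add_fragment(conn, village, source_type, page, position, image_bytes):
    """Store one record image and return its database id."""
    cur = conn.execute(
        "INSERT INTO fragments (village, source_type, page, position, image) "
        "VALUES (?, ?, ?, ?, ?)",
        (village, source_type, page, position, image_bytes),
    )
    conn.commit()
    return cur.lastrowid


def fragments_for(conn, village, source_type):
    """Return (id, page, position, image_bytes) tuples in reading order."""
    rows = conn.execute(
        "SELECT id, page, position, image FROM fragments "
        "WHERE village = ? AND source_type = ? ORDER BY page, position",
        (village, source_type),
    )
    return [(i, page, pos, bytes(img)) for i, page, pos, img in rows]
```

Keeping each house or baptism as its own row means a later search can iterate over records rather than whole pages, and a hit can be traced back to its village, page and position.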

Figures 1, 2 and 3. Source segmentation process. Figure 1. Original image. Figure 2. Original image with cut-off point.
Figure 3. Cut image.

The available HTR programs perform well at the tasks they were designed for, but they would be impractical in this
research. First, their segment recognition proved insufficient to handle the morphology of documents laid out as registers:
it could not separate baptismal records from one another, nor the rows of a table. For this, the tools of OpenCV (an image
processing library available in many languages, including Python) provided very efficient answers with reasonably small and
easy-to-manage code. The OpenCV library recognizes the table rows in the horizontal direction and establishes a cut-off
point (Figure 2, middle column, with the pink line highlighted). This point is used to fragment the image row by row or, in
this case, house by house. The image recognition system then scans each fragment in search of specific terms.
Second, HTR tools require a minimum amount of training text (5000 words, for example), for which they promise an accuracy
of over 90%. However, such high percentages depend on abundant "secondary" words (prepositions, articles, etc.), whereas
many words that carry much of the text's meaning are often transcribed incorrectly, because they are less common or because
of machine learning overfitting.[11,12] With the approach described here, a single example of the desired word is enough to
find very close versions, although more examples increase the accuracy of the search. A single image fragment (or
"template", corresponding to a word), with a good calibration of OpenCV's matchTemplate tool, found 100% of similar
occurrences at the cost of 30% false positives, or 95% of similar occurrences without false positives. Another library, MTM
MatchTemplates, which allows several "templates" (variations of the same word) to be used simultaneously, reached 100%
success with only 15% false positives; its cost is the need to supply three or four slightly different images of the same
word, which is easy to do. In all cases, a sample of 310 images was used. A traditional HTR tool would require training with
thousands of words and would still not solve all these problems.

Figures 4, 5, 6, 7, 8 and 9. Searching system and templates.

Figure 4. Search term: Cotton (“algodão”).

Figure 5. Original fragment taken from the image (columns labelled 1 to 6).

Figure 6. Fragment with detected text.

Figure 7. Image processing illustration.

Figure 8. “Cotton” (“algodão”) variations in the same manuscript.

A source such as the nominative lists illustrates this well. These documents take the form of a table: the first column is
an identifying number, the second contains the names of the household members, and the following columns give their age
(third column), marital status (fourth column) and colour (fifth column). The sixth column contains information on the
productive activities of the household. The first column holds only numbers; the second holds only names, which follow
specific patterns and have distinctive characteristics. It is not just random text. Similarly, columns 3, 4 and 5 are filled
only with letters and numbers, which is easy to tell the computer via Python. These are unambiguous and simple rules.
Finally, column 6 may contain more word variation, but it is also highly regular, since many people did the same things and
the census takers repeated the same textual formulas. This well-known regularity can be easily conveyed to the computer with
the aid of Python and the OpenCV library. Furthermore, the handwriting often changes every 50 pages or so (as there were
several census takers), and each change would require new HTR training. The system developed in this research saves labour
by dispensing with training and makes good use of the materials produced.
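
The column structure can be turned into a simple rule, for example by restricting the word search to the sixth column of each row fragment. The sketch below assumes fixed column boundaries expressed as fractions of the row width. The values in `COLUMN_EDGES` are invented for illustration and would have to be measured on the actual lists.

```python
# Hypothetical left/right edges of the six columns, as fractions of the row
# width: number, names, age, marital status, colour, activity.
COLUMN_EDGES = [0.00, 0.06, 0.45, 0.52, 0.60, 0.67, 1.00]


def crop_column(row_image, column, edges=COLUMN_EDGES):
    """Return the part of a row fragment belonging to a 1-based column.

    `row_image` is an image array indexed as [y, x], such as one row cut
    from a page.
    """
    if not 1 <= column < len(edges):
        raise ValueError(f"column must be between 1 and {len(edges) - 1}")
    width = row_image.shape[1]
    left = int(round(edges[column - 1] * width))
    right = int(round(edges[column] * width))
    return row_image[:, left:right]
```

Searching for a term such as "algodão" only in `crop_column(row, 6)` would avoid hits in the names column, at the price of depending on the boundaries being roughly stable across pages.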

4. CONCLUSION
Different research projects make different demands on technological resources, and the craft of research can benefit from
digital resources. However, large, robust solutions do not always bring direct benefits to every approach, and the needs of
each project often call for ad hoc answers. What is presented here is a specific solution for a very particular set of data:
images of documents with their own characteristics. Research focused on the history of ordinary people requires adequate
treatment of the available material and the necessary data. Training an HTR model through machine learning would not be
feasible given the amount of text available (the handwriting changes often, and a minimum of 5000 words is frequently
unattainable) and the characteristics of this text (names, ages and certain economic activities). In this sense, the tools
presented proved quite opportune and allowed the research to continue without forgoing the digital resources that can help
make it viable. Finally, it seems relevant to present this example as a digital craft solution, tailored to the research and
respectful of the research problem, which imposes no digital restrictions on the theory that guides the research.

REFERENCES
[1] Ginzburg, Carlo. Il formaggio e i vermi. Il cosmo di un mugnaio del ’500. Adelphi, 2019.
[2] Goubert, Pierre. Cent Mille Provinciaux au XVIIe Siècle. Beauvais et le Beauvaisis de 1600 à 1730. Paris: Flammarion, 1968.
[3] Thaller, Manfred. “Methods and techniques of historical computation”. In History and Computing, edited by Peter Denley and Deian
Hopkin. Manchester: Manchester University Press, 1987.
[4] Genet, Jean-Philippe. “The PROSOP system”. In History and Computing, edited by Peter Denley and Deian Hopkin. Manchester:
Manchester University Press, 1987.
[5] Harrison, Sarah, Charles Jardine, Jessica King, Tim King, and Alan Macfarlane. “Reconstructing Historical Communities by
Computer”. Current Anthropology 20, no. 4 (1979): 808–9.
[6] Listas Nominativas de Habitantes de Pindamonhangaba, Taubaté, Lorena e Guaratinguetá. 1800-1805. Arquivo do Estado de São
Paulo.
[7] Novais, Fernando A. Portugal e Brasil na crise do antigo sistema colonial (1777-1808). São Paulo: HUCITEC, 1989.
[8] Henry, Louis. Técnicas de análise em demografia histórica. Curitiba: UFPR, 1977.
[9] Ladurie, Emmanuel Le Roy. Montaillou, village occitan de 1294 à 1324. Paris: Folio, 2008.
[10] Levi, Giovanni. Centro e periferia di uno stato assoluto. Torino: Rosenberg & Sellier, 1985.
[11] Scheltjens, Werner. “The Feasibility of Machine-Learning Based Workflows for Editing Serial Historical Sources: First Results of
the Schenkenschans Customs Registers Project”. Otto-Friedrich-Universität, 2023. https://doi.org/10.20378/irb-58260
[12] Perdiki, Elpida. "Review of 'Transkribus: Reviewing HTR training on (Greek) manuscripts'." RIDE 15 (2022). DOI:
10.18716/ride.a.15.6. Accessed: 01.03.2023.

