The social sciences and humanities are being fundamentally reshaped by the widespread adoption of digitization, which makes it possible to collect data and make them available in massive quantities. The major challenge raised by this transition is to study and understand these data in order to distil new theories based on the quantitative and qualitative characterization of the phenomena under study.
When these data are linguistic, natural language processing (NLP) is an essential tool for carrying out this type of study. In history in particular, large corpora are becoming available (newspapers, administrative notes, literature, ...) which, if processed with relevant methods, would make it possible to trace major events back to their source, or to understand why clusters of early signs pointing to such events did or did not ultimately lead to them.
My dissertation project is part of the ERC ENP-China project, which studies the transformation of Chinese elites in the 19th and 20th centuries. For this project, historians need to work through more than 100 years of newspapers in order to find relationships between the people or organizations concerned and their involvement in historical events. These two types of linguistic objects, named entities and events, have been widely studied in NLP, but mostly with contemporary applications as targets and often in relation to different issues and questions. Beyond these primitive objects, the work of a historian requires a global vision of how history unfolds, built from the details mentioned in the source data. The objective of this project is to study the generation of event graphs from large sets of language data, as a support for historical research. These graphs will present the evolution of historical events in the context of the entities that took part in them, and will be linked to the source documents from which the existence of these events can be inferred.