ERTIM PhD students
ERTIM is a research team at the crossroads of multilingual engineering and natural language processing. Its main aims are application-oriented research in text semantics, the development of methodologies for the engineering of multilingual texts and digital documents, and the production of multilingual resources (lexical, terminological, textual, didactic).
Within the ERTIM team, doctoral research lies at the crossroads of the laboratory's traditional concerns: multilingualism, linguistic engineering, and electronic documents across different media (web, mobile, etc.).
Most doctoral students currently working on their theses are funded by partner companies (ARISEM, AMI Software) through CIFRE grants, research contracts or ATER-type funding.
Listed below are the theses currently in progress within the ERTIM team, as well as those already defended:
Automatic transcription of Armenian: phonetic and phonological issues
Samuel Chakmakjian
Expected date of defense: 2024
Research co-directors: Damien Nouvel and Anaïd Donabedian (SEDYL)
Summary:
Despite major global advances in artificial intelligence and automatic speech recognition for some of the world's languages, there is no widespread speech recognition model or readily available speech recognition software for Armenian. This project addresses the need to systematize the first link in the NLP chain for spoken data: phonetic input.
Our first objective is to provide a phonetic description based on instrumental studies of modern Armenian with all its parameters of variation, which can be exploited by computational linguists and language engineers working on the automatic processing of Armenian corpora.
The second objective of this project is to complete the chain, moving from a phonological transcription to a transcription in Armenian orthography. In doing so, it will be necessary to determine whether it is possible to take a common phonological model for the two main variants of the language. The possibility of such unification remains a major question in the field of Armenian linguistics that has yet to be resolved.
Building on the wealth of oral corpora (collected and produced by SeDyL, Labex EFL, IRISA, EANC), we aim to link experimental and theoretical research in phonetics and phonology of Armenian, and to provide and test a model with algorithms and neural networks (Hidden Markov Model, wav2vec). Our results will be of importance both for Armenian linguistics and also for practical applications.
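As a rough illustration of the wav2vec approach mentioned above, the sketch below runs CTC decoding with a pretrained wav2vec 2.0 checkpoint via the Hugging Face transformers library; the English checkpoint and the audio file name are placeholders, since a model fine-tuned on Armenian data would be needed in practice.

```python
# A minimal sketch of CTC decoding with a pretrained wav2vec 2.0 model.
# The checkpoint and the audio file name are placeholder assumptions.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform, sample_rate = torchaudio.load("utterance.wav")  # hypothetical file
waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, frames, vocab)

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))  # decoded character sequence
```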
Automatic classification of consumers according to their personality and values expressed from spontaneous discourse derived from social networks in the perfume field
Boyu Niu
Expected date of defense: 2023
Research Director: Frédérique Segond
Summary:
This CIFRE thesis will be carried out within ERTIM and the Consumer & Sensory Innovation team at International Flavors & Fragrances Inc. (IFF), under the supervision of Dr Frédérique Segond.
During the thesis, we will implement a system capable of detecting personality values in the spontaneous discourse of fragrance consumers and classifying them. The values are inspired by Schwartz's studies (1996, 2003, 2006). To implement this system, we will use techniques from natural language processing (NLP), drawing on psycholinguistics and sociolinguistics. After implementation, we will run an evaluation campaign on the system's performance before deploying it so that it is operational for the company.
More concretely, the NLP sub-domains we will draw on include stylistic analysis, sentiment analysis, sarcasm detection, topic extraction, etc.
To begin with, we will study IFF's knowledge about consumers: would it be possible to encode this knowledge in NLP tools so that it can be applied to natural language texts?
It is also an opportunity to ask whether Schwartz's personality values can be detected in natural language, whether their linguistic realization corresponds to their description, and whether we might uncover new values, more or less specific to the world of perfume, in the course of our research.
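As a hedged sketch of the classification task described (not IFF's actual system), the snippet below trains a tiny TF-IDF plus logistic-regression classifier mapping consumer comments to Schwartz-style value labels; the comments and labels are invented for illustration.

```python
# A toy classifier mapping comments to Schwartz-style value labels.
# Training data and labels are invented; a real system would be trained
# on annotated consumer discourse.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

comments = [
    "I only wear niche fragrances nobody else has",
    "I always buy the same trusted classic perfume",
]
values = ["stimulation", "security"]  # two of Schwartz's value categories

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(comments, values)
print(clf.predict(["I always buy the same classic perfume"]))  # likely 'security'
```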
News detection on social networks
Yizhou Xu
Expected date of defense: 2023
Research co-directors: Frédérique Segond and Kata Gabor
Summary:
In the age of information explosion, Internet users faced with the enormous volume of textual data available online must browse piles of similar pages on the same subject to find any "new information", which highlights the need to detect and extract novelties automatically. Novelty detection involves retrieving elements that have not appeared before and that are unknown and original with respect to predetermined references.
Social networks such as Twitter and Facebook are becoming increasingly important as major sources of these novelties: users share, discuss and follow news on these platforms, and companies use them to launch new products. Automatically detecting novelties on social networks is thus an essential task for monitoring and analysis systems in many fields, among them economic intelligence and defense and security intelligence.
This thesis, carried out at Bertin IT, therefore aims to develop tools and methods for automatically detecting and extracting novelties in texts originating from social networks. In this study, we will address different aspects of this task (new entities, new relationships, new events) and propose solutions for different application scenarios (economic intelligence, and intelligence in the defense and security field).
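A minimal sketch of the novelty-detection framing given above, under the assumption that novelty can be approximated as low similarity to previously seen reference documents; the texts and the threshold are illustrative.

```python
# A toy novelty detector: a candidate post is flagged as novel when its
# maximum TF-IDF cosine similarity to the reference documents falls
# below a threshold. Texts and threshold are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

references = [
    "Company X launches its new smartphone",
    "Company X smartphone release announced today",
]
candidate = "Company Y recalls thousands of electric cars"

vectorizer = TfidfVectorizer().fit(references + [candidate])
ref_vecs = vectorizer.transform(references)
cand_vec = vectorizer.transform([candidate])

max_sim = cosine_similarity(cand_vec, ref_vecs).max()
print("novel" if max_sim < 0.2 else "already known", round(float(max_sim), 3))
```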
Developing NLP tools for a variety of Quechua
Johanna Cordova
Expected date of defense: 2023
Research co-directors: Damien Nouvel and César Itier
Summary:
Quechua languages form the Amerindian language family with the largest number of native speakers. In Peru, according to the 2017 census, 13.9% of the population have Quechua as their first language, and around 20% speak it. Yet it is almost totally absent from digital usage. In natural language processing (NLP), it is an under-resourced language, with a strong disparity of resources depending on the variety of Quechua considered. The aim of this thesis is to develop a set of fundamental tools for the automatic processing of a variety of Central Quechua, Ancash Quechua, spoken by around 400,000 people but endangered according to the UNESCO classification. The work involves three stages: collecting oral and written corpora and digitizing the resources available in this variety (dictionaries, collections of tales and stories), implementing a morphological analyzer, and building a treebank. The resources developed will be put to use in applications such as a spell-checker and/or an aligner for parallel Quechua-Spanish corpora. In a global context of promoting indigenous languages, and at a time when ambitious language-rights policies are being deployed in the Andean countries, the presence of Quechua in technology would be an important lever for reinforcing its practice and facilitating its teaching.
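To give a flavor of what a morphological analyzer for an agglutinative language like Quechua does, here is a deliberately toy suffix-stripping sketch; the four-suffix inventory is a drastic simplification, not the analyzer the thesis will build.

```python
# A toy suffix-stripping analyzer for an agglutinative language. The
# suffix inventory is a drastic simplification of Quechua morphology.
SUFFIXES = {
    "kuna": "PL",    # plural
    "ta": "ACC",     # accusative
    "manta": "ABL",  # ablative
    "wan": "COM",    # comitative/instrumental
}

def analyze(word: str) -> list[str]:
    """Greedily strip known suffixes from the right, longest first."""
    tags: list[str] = []
    changed = True
    while changed:
        changed = False
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) > len(suffix):
                tags.insert(0, SUFFIXES[suffix])
                word = word[: -len(suffix)]
                changed = True
                break
    return [word] + tags

print(analyze("wasikunata"))  # ['wasi', 'PL', 'ACC']: "houses" as object
```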
Automated analysis of naming processes in corpora: contributions of entity recognition and coreference for discourse analysis
Manon Cassier
Expected date of defense: 2021
Research co-directors: Julien Longhi and Damien Nouvel
Summary:
The thesis is set in the perspective of work on the automated interpretation of complex discourse phenomena that current discourse analysis (DA) methods cannot capture, in data derived from transcripts of political interviews. It focuses on the linguistic mechanism of "nomination", in connection with the concepts of naming, designation and referencing. Building on theoretical and descriptive work on discourse phenomena, the aim is to prototype, implement, experiment with and validate approaches for detecting and characterizing nominations, in conjunction with the processing developed by the TALAD project's NLP teams. In particular, the outputs of named entity recognition and coreference resolution will be exploited to determine their contribution to an experimental system focusing on nominations. Feedback will be given on each NLP treatment in order to evaluate its contribution to nomination recognition, with a view to integration with traditional DA tools. One of the challenges of this thesis is also to propose a classification system for the company Reticular in order to qualify different actors in political life. Reticular is interested in qualifying actors as "doctrine designers", "popularizers", "opinion relays" (sometimes "influencers"), "new converts", "fans", or even "supporters". In this way, we will draw on the markers formally identified by NLP techniques to help characterize actors not by what they "say" but by their "(way of) saying".
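As a hedged illustration of exploiting coreference output to study "ways of saying", the sketch below collects the different nominations used for the same referent from hypothetical coreference chains; the chains are invented, not TALAD output.

```python
# Profiling referents by the nominations used for them, starting from
# (hypothetical) coreference chains such as an NER + coreference
# pipeline might produce.
from collections import Counter

chains = {
    "E1": ["the president", "the head of state", "the president"],
    "E2": ["the opposition", "the protesters"],
}

for entity, mentions in chains.items():
    profile = Counter(m.lower() for m in mentions)
    print(entity, dict(profile))
```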
Detecting influential people in selected social media
Kevin Deturck
Expected date of defense: 2020
Research supervisors: Mathieu Valette, Frédérique Segond and Damien Nouvel
Summary:
In this thesis, we will develop a theoretical framework for automatically identifying influential people in social media, based on how they manifest themselves both in their interactions with other users and in the particular traits of their profiles. Broadly, approaches to influence detection are distinguished by the type of data they address: structured or unstructured. The theoretical framework chosen for our thesis has the particularity of combining these two types of data, in an attempt to obtain the best complementarity and build the most effective system. We will translate the general traits of the influencer into discursive markers, which require the analysis of unstructured data such as text, and into structural characteristics, which call upon structured data such as metadata.
Our thesis is set in the context of NLP for French, and the corpora already available are in French, so we will work mainly on this language. However, we will ensure that the models implemented can be adapted to a multilingual context, since social media mechanisms remain the same whatever the language of the messages.
Our work can be integrated into two projects already underway at Viseo Technologies: one deals with enriching a CRM (Customer Relationship Management) tool by adding the most influential consumers; the other aims to detect the recruitment of young people by jihadists, adding a politico-social dimension to the project's commercial application.
Social media are crucial for the dynamism of interactions between their users and therefore the influence that can be exerted. Our project will provide a better understanding of the mechanisms for transmitting information on these media.
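A minimal sketch of the proposed combination of structured and unstructured cues: graph centrality over an interaction network plus simple discourse markers. The marker list, the equal weighting and the toy data are all assumptions.

```python
# Combining a structural cue (PageRank over an interaction graph) with a
# discursive cue (marker counts in posts). Markers, weights and data are
# illustrative assumptions.
import networkx as nx

G = nx.DiGraph()  # edge u -> v: user u replies to or shares user v
G.add_edges_from([("a", "b"), ("c", "b"), ("d", "b"), ("b", "a")])

posts = {"a": "I think, maybe...", "b": "Follow my advice: you must act now."}
MARKERS = ("you must", "my advice", "follow")  # assumed influence markers

centrality = nx.pagerank(G)

def marker_score(user: str) -> float:
    text = posts.get(user, "").lower()
    return sum(m in text for m in MARKERS) / len(MARKERS)

scores = {u: 0.5 * centrality[u] + 0.5 * marker_score(u) for u in G}
print(max(scores, key=scores.get))  # most influential under this toy score
```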
Developing a localized terminology to help Hindi-, Urdu- and Punjabi-speaking migrants access the law
Bénédicte Diot-Parvaz
Thesis defended November 30, 2019
Research co-directors: Annie Montaut and Mathieu Valette
Summary:
Ignorance of the law is no excuse. Yet the law, a discipline with an abstruse reputation, is often difficult to understand, especially for a migrant in a country whose language and cultural codes they have not mastered. To guarantee the rights of those subject to the law, the French state provides interpreters and translators for defendants, plaintiffs and victims who do not understand French, in order to integrate them into the judicial process. Law is a technical field that requires a dual level of interpretation: heuristic (interpretation of texts) and sociolinguistic (from one linguistic and cultural system to another), the latter requiring adaptation of the message to facilitate understanding by the public. This thesis project follows on from a professional master's degree in TRM and a research master's degree in language sciences, and targets Hindi-, Urdu- and Punjabi-speaking communities with a view to making the law accessible to them and facilitating their integration. Indeed, while South Asian populations are familiar with a legal context marked by common law (the Anglo-Saxon system) and by customary law specific to each religion for personal law (family law in general), many concepts of French law seem unintelligible to them. However, some countries with a strong migratory tradition, including Canada, have developed resources and techniques for terminological localization. By collecting and studying legal corpora and adapting terminology to take into account the socio-cultural and linguistic factors at play among these migrant populations, this project aims to produce a dictionary that will serve as an interface between the migrant populations concerned and social workers.
Lexical frequency and readability of L2 texts: a comparative study of Burmese and English texts
Jennifer Lewis-Wong
Expected date of defense: 2020
Research co-directors: San San Hnin Tun and Mathieu Valette
Summary:
For learners and teachers alike, information on the lexical frequency of the words in a text (its lexical profile) makes it possible to assess the relative difficulty of the vocabulary it contains. This information can be used to calculate a readability index, providing a practical means of automatically selecting texts that match the learner's L2 language skills. We propose to examine the contribution of lexical frequency to the evaluation of text difficulty for Burmese. We will first test the method on a corpus of English texts already classified by difficulty level, so as to have a basis for comparison, before applying it to Burmese texts. This will enable us to develop not only a lexical frequency list for Burmese, but also a tool that provides both the lexical profile of a Burmese text and a readability index indicating its level of difficulty.
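As a hedged sketch of the lexical-profile idea, the snippet below measures what share of a text's tokens falls within frequency bands of a reference list and derives a crude readability index from it; the word list and band cut-offs are illustrative.

```python
# Computing a lexical profile: the share of a text's tokens covered by
# frequency bands of a reference list. Word list and bands are illustrative.
freq_list = ["the", "of", "to", "read", "text", "difficult"]  # rank-ordered
bands = {"top3": set(freq_list[:3]), "top6": set(freq_list[:6])}

def lexical_profile(tokens: list[str]) -> dict[str, float]:
    return {
        band: sum(t in vocab for t in tokens) / len(tokens)
        for band, vocab in bands.items()
    }

tokens = "the text is difficult to read".split()
profile = lexical_profile(tokens)
# Crude index: the larger the share of high-frequency words, the easier
# the text is assumed to be.
print(profile, "index:", round(profile["top3"], 2))
```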
Development of linguistic methods for opinion mining in Chinese (for Systran's XXX application)
Liyun YAN
Expected date of defense: 2020
Research supervisor: Mathieu Valette
Summary:
Opinion mining is of interest to both academic research and industry. Its application to Chinese appears necessary in view of the growing masses of data on the Internet and the insufficiency of current research on this language compared with, for example, European languages. In the business context, the aim of opinion mining is to develop applications with which companies or customers can obtain a synthetic analysis of Internet users' comments, enabling them to identify users' subjective states relative to events, objects, people, etc.
Based on the state of the art, I plan to adopt methodologies that have proved their worth in existing research and to innovate in terms of linguistic methods, in line with the work on semantics carried out at ERTIM in particular. Through a variety of experiments, the validated solution will be integrated into an application of Systran, the company where I am completing my thesis. The experimental corpus consists of comments from the Booking website, which provides travel, hotel and rental services in 41 languages, including Chinese.
The first stage of my work will involve refining the research program and building and standardizing the corpora. The second year will be devoted to developing a method, or combination of methods, for linguistic rule-based opinion mining, and the third year to writing the thesis. In parallel, I will develop an industrial application based on the validated methods.
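As an illustration of the linguistic-rule approach (not Systran's system), here is a toy polarity scorer for Chinese comments with a tiny lexicon and a one-token negation rule, both of which are assumptions.

```python
# A toy rule-based polarity scorer for Chinese comments; the lexicon and
# the one-token negation rule are illustrative assumptions.
POSITIVE = {"好", "干净", "方便"}  # good, clean, convenient
NEGATIVE = {"差", "脏", "吵"}      # bad, dirty, noisy
NEGATORS = {"不", "没"}            # negation markers

def polarity(tokens: list[str]) -> int:
    score = 0
    for i, tok in enumerate(tokens):
        sign = 1 if tok in POSITIVE else -1 if tok in NEGATIVE else 0
        if sign and i > 0 and tokens[i - 1] in NEGATORS:
            sign = -sign  # negation flips polarity
        score += sign
    return score

print(polarity(["房间", "很", "干净"]))  # "the room is very clean" -> 1
print(polarity(["位置", "不", "方便"]))  # "the location is not convenient" -> -1
```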
Textual analysis of ecological discourse corpora related to wu mai (pollution fog) in China using text mining methods
Qinran DANG
Expected date of defense: 2020
Research supervisor: Mathieu Valette
Summary:
As a result of environmental degradation in China linked to industrial activity and economic expansion, the word wù mái (pollution haze) has been omnipresent since 2008 on websites, in the press, and on social networks, forums and blogs. China's air pollution problem has attracted the attention not only of Chinese institutions and media, but also of the Western press. Our project is to analyze ecological discourses in a large and varied corpus, in order to identify the diversity of ideological positions and their expression. Comparisons will be made according to the type of site (institutional, media, informal), on the one hand, and according to the ideological context (Chinese or Western), on the other. The analytical methods involve statistical analysis of textual data (textometry) and rest on a theoretical background articulating textual semantics and discourse analysis.
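A minimal sketch of the kind of textometric comparison involved: the log-likelihood specificity of a word in one subcorpus against another (Dunning's G2, a standard measure in textual statistics); the counts below are made up.

```python
# Log-likelihood specificity of a word in one subcorpus versus another
# (Dunning's G2). Counts and corpus sizes are invented figures.
import math

def log_likelihood(a: int, b: int, n1: int, n2: int) -> float:
    """a, b: occurrences of the word in corpora of n1 and n2 tokens."""
    e1 = n1 * (a + b) / (n1 + n2)  # expected count in corpus 1
    e2 = n2 * (a + b) / (n1 + n2)  # expected count in corpus 2
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# e.g. 120 occurrences in 10,000 tokens of media texts versus 30 in
# 12,000 tokens of institutional texts.
print(round(log_likelihood(120, 30, 10_000, 12_000), 1))
```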
Methodology for semi-automated textual analysis of passenger discourse for the qualification of multimodal travel
Amélie MARTIN
Research co-directors: Frédérique Segond and Mathieu Valette
Summary:
The passenger transport sector now seeks to offer increasingly refined and personalized services based on better knowledge of its customers. Customers increasingly express themselves freely on the web, but also via more traditional channels such as complaints and open-ended survey questions. In particular, they describe their itineraries, whether daily or occasional, unimodal or intermodal, combining conventional modes of transport with emerging ones (such as car-sharing, bike-sharing, etc.), and sometimes specify their feelings and opinions about these journeys.
This thesis (carried out at SNCF) therefore aims to propose a strategy for semi-automated qualitative analysis of the representation of travelers' travel chains based on these discourses. The aim is to use approaches from information retrieval, knowledge engineering, corpus semantics and discourse analysis to first reconstruct and understand individuals' itineraries, and then to understand their motivations, preferences and travel habits on the basis of this initial analysis. This methodology could be integrated into an SNCF decision-support tool to evaluate, dynamically adapt and personalize the multimodal transport offer as well as door-to-door mobility services.
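As a hedged sketch of the itinerary-reconstruction step, the snippet below extracts travel legs from a free-text comment with a simple pattern; the pattern and the sentence are illustrative, not the SNCF methodology.

```python
# Extracting travel legs (origin, destination, mode) from a free-text
# comment with a simple pattern. Pattern and sentence are illustrative.
import re

comment = "I go from Lyon to Paris by train, then to La Défense by metro."
legs = re.findall(r"(?:from (\w[\w ]*?) )?to ([\w ]+?) by (\w+)", comment)
for origin, destination, mode in legs:
    print(origin or "(previous stop)", "->", destination, f"[{mode}]")
```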
Text readability and automatic search for pedagogical content: the case of Hindi and Armenian
Satenik MKHITARYAN
Research supervisor: Mathieu VALETTE
Summary:
This thesis aims to design a readability formula to facilitate the elaboration of pedagogical content intended for reading. Reading has a special place in language learning: numerous studies have shown that reading in a foreign language favors its acquisition and, in particular, improves reading comprehension. But reading can fail to achieve its pedagogical objective if the chosen texts are either too easy or too difficult. It is therefore crucial that a text be adapted to the learner's level, which is not always the case, and selecting text resources according to learner level is often complex and time-consuming. For this reason, many researchers have tried to find ways of making the task less burdensome for teachers. Readability measurement is a practical and effective way of assessing text difficulty: François (1993) summarizes readability as "a field that studies how to associate texts with a category of readers, according to the lexical, syntactic, coherence and cohesion aspects present in these texts".
This thesis will therefore create an online research platform incorporating a readability formula, with two major functionalities: evaluating the difficulty level of a given text, and searching for documents online with automatic classification by level.
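A toy sketch of the second functionality, classifying documents by level through thresholding a readability score; the score used here (mean word length) and the thresholds are placeholders for the formula the thesis will design.

```python
# Binning documents into difficulty levels by thresholding a readability
# score. The score and thresholds are placeholder assumptions.
def readability_score(text: str) -> float:
    words = text.split()
    return sum(len(w) for w in words) / max(len(words), 1)

def level(text: str) -> str:
    s = readability_score(text)
    return "beginner" if s < 4 else "intermediate" if s < 6 else "advanced"

print(level("the cat sat on the mat"))                          # beginner
print(level("epistemological considerations notwithstanding"))  # advanced
```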
Text mining methods for characterizing political opinions: application to the analysis of communication strategies on social networks in Tunisia
Asma ZAMITI
Expected date of defense: 2020
Research supervisor: Mathieu VALETTE
Summary:
This thesis studies a web corpus in Tunisian with the aim of identifying the implementation and evolution of the communication strategy of the Islamist party Ennahdha after the Tunisian revolution of 2011. Our project has two key objectives:
- NLP for Tunisian, an under-resourced and uncodified language whose written form on social networks is highly variable (arabizi, Arabic script, borrowings, etc.). Tunisian remains little studied despite the growing amount of data available, thanks in particular to the growth of social networks; its automatic processing is still in its infancy, and NLP publications on the subject remain scarce (one of the difficulties involved is illustrated in the sketch after this list);
- the tool-assisted analysis of Tunisian political discourse: the case of the Ennahdha party. Tunisia's leading political force after the 2011 elections, the Islamist party recorded a sharp decline in the 2014 legislative elections, due in particular to protest votes after three years of turbulent governance. The party nevertheless stands out for its meticulous communication strategy, both in campaigning and in responding to controversy. It is this political discourse on the web, and in particular on the Facebook social network, that we wish to study qualitatively and quantitatively.
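As a hedged illustration of the first objective's difficulty, normalizing arabizi toward Arabic script, the sketch below applies a tiny character mapping; the mapping is a fragment of common arabizi conventions, not a full transliteration scheme.

```python
# A toy normalization of arabizi (Tunisian written with Latin letters and
# digits) toward Arabic script. The mapping is a small fragment of common
# arabizi conventions, not a full transliteration scheme.
ARABIZI = {"3": "ع", "7": "ح", "9": "ق", "ch": "ش", "kh": "خ"}

def normalize(word: str) -> str:
    # Replace multi-letter sequences before single characters.
    for latin, arabic in sorted(ARABIZI.items(), key=lambda kv: -len(kv[0])):
        word = word.replace(latin, arabic)
    return word

print(normalize("3aslema"))  # a Tunisian greeting written in arabizi
```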
Methods and tools for the automatic processing of Vietnamese - application in digital humanities: behavioral mining on the social web
Océane Hô Dinh
Thesis defended December 22, 2017
Research supervisor: Mathieu Valette
Summary:
This thesis proposes to adapt and develop methods and tools for the automatic processing of Vietnamese, an under-resourced language, for data-mining applications on data extracted from Internet discussion forums.
The aim is to use corpus linguistics to equip the study of contemporary societies, in order to apprehend the most recent societal changes as made perceptible by information and communication technologies (ICT).
With regard to the applicative framework, we situate ourselves in the context of a developing country opening up to globalization, whose society is evolving rapidly, and we study how Vietnamese youth appropriate ICT as new means of expression and information sharing, highlighting the tensions they experience between deeply rooted traditions and an attractive modernity. To this end, the theme of HIV/AIDS was chosen for the many social issues it touches on (health and social questions, generational conflicts, changing mores, etc.) and for the different types of discourse that take hold of it.
Chinese language extraction of spatio-temporalized actions performed by persons or organizations
Zhen Wang
Thesis defended on June 9, 2016
Research supervisor: Pierre Zweigenbaum
Summary:
The final objective of this thesis is the extraction, from Chinese texts on the web, of actions having as agent and/or object a named entity of type person or organization (Chinese or not). Wherever possible, a precise place (geolocatable) and a precise time (date, hour) are associated with each action. To do this, we need to identify and extract the parts of the Chinese character string corresponding to proper nouns or dates, and to type these entities as persons, places, organizations, numerical quantities, or dates/times. Within the same text, the same entity must be identified from one occurrence to the next, even when it is written in different ways, which also requires resolving anaphora. Next, we need to identify the entity as a particular person, organization or place. To do this, we will draw on external knowledge (gazetteers of places, structured encyclopedic knowledge, etc.), put into the form of ontologies. In addition, the knowledge associated with the entities across the various texts should make it possible, on the one hand, to complete information on facts recounted in different places and, on the other, to distinguish activities that cannot have been carried out by the same person (e.g. actions at the same time in very distant places).
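A minimal sketch of one subtask described above, grouping variant written forms of the same entity; the alias table stands in for the external knowledge (gazetteers, encyclopedic resources) the thesis mentions, and its entries are illustrative.

```python
# Grouping variant written forms of the same entity via an alias table,
# a stand-in for external knowledge resources. Entries are illustrative.
ALIASES = {
    "联合国": "United Nations",
    "UN": "United Nations",
    "U.N.": "United Nations",
}

mentions = ["联合国", "UN", "World Bank"]
for mention in mentions:
    print(mention, "->", ALIASES.get(mention, mention))
```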
Acquisition of verbal predicative patterns in Japanese
Pierre MARCHAL
Thesis defended October 15, 2015
Research supervisor: Thierry POIBEAU
Summary:
Acquiring knowledge about verbal constructions is an important issue for automatic language processing, but also for lexicography, which aims to document new linguistic usages. This task poses many technical and theoretical challenges. In this thesis, we focus on two fundamental aspects of verb description: the notion of lexical entry and the distinction between arguments and circumstantials. Following previous studies in automatic language processing and linguistics, we assume that there is a continuum between homonyms and monosemes; similarly, we hypothesize that there is no clear-cut distinction between arguments and circumstantials. We propose a complete processing chain for the acquisition of verbal predicative patterns in Japanese from an unlabeled corpus of journalistic texts. This processing chain integrates the notion of argumentality into the process of creating lexical entries and implements a model of these two continua. The resulting resource was the subject of a qualitative comparative evaluation, which highlighted the difficulty linguistic resources have in describing new data, thereby arguing for a lexicology within the epistemological framework of corpus linguistics.
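As a toy illustration of the acquisition step, the sketch below aggregates case-marked dependents per verb from (noun, particle, verb) triples; a real chain would obtain such triples from a parsed corpus, and these examples are invented.

```python
# Aggregating case-marked dependents per verb from (noun, particle, verb)
# triples, the raw material from which predicative patterns are built.
# Triples are invented; a real chain would take them from a parser.
from collections import defaultdict

triples = [
    ("学生", "が", "読む"),    # student-NOM read
    ("本", "を", "読む"),      # book-ACC read
    ("図書館", "で", "読む"),  # library-LOC read
]

patterns: dict = defaultdict(lambda: defaultdict(set))
for noun, particle, verb in triples:
    patterns[verb][particle].add(noun)

for verb, slots in patterns.items():
    print(verb, {p: sorted(ns) for p, ns in slots.items()})
```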
Text semantics and language-culture didactics: Application to a corpus of journalistic and political discourses in modern and contemporary Arabic
Nadia Makouar
Thesis defended in 2014
Research co-directors: Mathieu Valette and Driss El-Khattab
Summary:
Today, research in foreign language didactics agrees that authentic language materials are indispensable for accessing the reality and socio-cultural representations conveyed by the language in question.
Being able to read and understand the language in this type of content is also made easier by information and communication technologies, in particular by tools for the targeted exploration of texts and assisted interpretation. Using a journalistic and political corpus in modern and contemporary Arabic and the theoretical tools of François Rastier's text semantics, the aim of this thesis is to propose avenues for the didactic exploitation of the Arabic language-culture and for semantic access to digital texts via textometry software.
The aim is to evaluate these pedagogical proposals and thus give intermediate- and advanced-level students the opportunity to use content and tools to improve their learning and their competence in written comprehension and production of the Arabic language-culture, within the broader perspective of a didactics of texts.
Moving from unstructured to structured data: entity relationship extraction from corpora
Mani Ezzat
Thesis defended May 06, 2014
Research supervisor: Thierry POIBEAU
Summary:
The growth of data available on the Internet has considerably changed the field of language processing. Systems that until recently processed a few isolated sentences now have to cope with a deluge of varied documents. Since the MUC (Message Understanding Conference) evaluations of the early 1990s, a great deal of work has focused on a type of unit known as the named entity. Named entities generally correspond to proper nouns (personal names, place names, etc.). The current state of the art shows satisfactory mastery of the recognition of isolated sequences, particularly named entities and technical terms. These elements are important for indexing texts and helping analysts to understand them. However, such sequences only become fully meaningful when they are linked together. For example, it is interesting to know that a text contains occurrences of the words Google and YouTube; but the analysis becomes much more interesting if the system is able to detect a relationship between these two elements, and even to type it as an acquisition relationship (Google having bought YouTube).
The Infom@gic project, within the Cap Digital competitiveness cluster, has explored various techniques for recognizing named entities. The task is far from completely mastered: performance varies considerably depending on the type of entity considered, the genre of the text to be analyzed and the granularity of the types involved. It is nonetheless sufficiently robust to allow us to go further, towards relation detection.
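As a hedged sketch of relation detection between recognized entities, the snippet below applies a single hand-written pattern to the Google/YouTube example; real systems use much richer pattern sets or learned models.

```python
# Pattern-based detection of an acquisition relation between two
# recognized entities; the single hand-written pattern is illustrative.
import re

PATTERNS = {
    "acquisition": re.compile(r"(\w+) (?:bought|acquired) (\w+)"),
}

sentence = "Google bought YouTube in 2006."
for relation, pattern in PATTERNS.items():
    for subject, obj in pattern.findall(sentence):
        print(f"{relation}({subject}, {obj})")  # acquisition(Google, YouTube)
```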
Evolution of buzz on the internet - identification, analysis, modeling and representation in a watch context
Aurélien LAUF
Thesis defended October 14, 2014
Research supervisor: Mathieu Valette
Summary:
Set in the context of information monitoring on the Internet, this thesis aims to develop tools and methods for identifying, analyzing, modeling and representing the path of information circulating on the Internet (buzz). These methods draw in particular on corpus linguistics and graph theory.
The aim is to trace the primary sources, as well as the necessary and sufficient sources, of an item of information, to identify sub-themes and discourse communities, and to analyze the semantic differences that may appear between these sources throughout the information's life cycle.
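A minimal graph-theoretic sketch of the source-tracing idea: pages citing one another about the same story, with primary sources identified as nodes that cite nobody; the URLs and edges are illustrative.

```python
# Pages citing one another about the same story; primary sources are the
# nodes with no outgoing citation edge. URLs and edges are illustrative.
import networkx as nx

G = nx.DiGraph()  # edge u -> v: page u cites page v
G.add_edges_from([
    ("blog.example/a", "news.example/original"),
    ("forum.example/t1", "blog.example/a"),
    ("forum.example/t2", "news.example/original"),
])

primary = [n for n in G if G.out_degree(n) == 0]
print("primary sources:", primary)  # ['news.example/original']
```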
Interactive and unsupervised lexicon extraction in contemporary Chinese applied to the constitution of linguistic resources in a specialized domain
Gaël Patin
Thesis defended January 31, 2013
Research supervisor: Pierre Zweigenbaum
Summary:
Lexicons are indispensable resources for information retrieval systems. They can significantly improve the results of automatic linguistic analysis processes (morpho-syntactic tagging, semantic interpretation, indexing) in specific domains. However, the creation of lexicons faces two types of difficulty: some are pragmatic, such as the cost of their creation or their reusability, which are of great importance for industrial implementation; others are theoretical, such as the definition of the lexical unit across languages or the characterization of the lexical particularities of a specialized corpus, which are essential for the relevance and validity of the results. This confrontation between economic and qualitative interests is a recurring problem in the business world, and applied scientific research must be able to propose solutions that meet this dual requirement. This study proposes an element of response to the problem of lexical identification in a specialized corpus in contemporary Chinese, via a system for classifying candidate lexies (lexical units). It focuses in particular on the case of contemporary Chinese, a language for which few lexical resources are available.
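As a toy sketch of candidate-lexie generation, the snippet below collects recurrent character n-grams from a tiny unsegmented Chinese sample as candidate lexical units to be classified afterwards; the corpus and frequency threshold are illustrative.

```python
# Collecting recurrent character n-grams from a tiny unsegmented Chinese
# sample as candidate lexical units ("candidate lexies"). The corpus and
# the frequency threshold are illustrative.
from collections import Counter

text = "语言资源语言处理资源建设语言资源建设"

def char_ngrams(s: str, n: int):
    return (s[i : i + n] for i in range(len(s) - n + 1))

candidates = Counter()
for n in (2, 3, 4):
    candidates.update(char_ngrams(text, n))

# Keep n-grams seen at least twice as candidate lexical units.
print(sorted(c for c, freq in candidates.items() if freq >= 2))
```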
Semantic characterization of subjectivity in texts for information retrieval
Egle Eensoo
Research supervisor: Mathieu Valette
Summary:
For a long time, information retrieval focused on thematic content to define the information content of documents and determine their relevance to an information request. From the 2000s onwards, this field of application has been confronted with a new need: the extraction of subjective information (opinions, feelings, etc.).
In this thesis, we focus on the textual phenomena that contribute to the interpretation of subjective information. More specifically, we attempt to identify complex cues at several levels (lexical, morphosyntactic, argumentative, structural) that make it possible to characterize the subjectivity of texts of different themes and genres. In particular, we rely on the notion of genre, which makes it possible to characterize texts by taking into account the norms of their elaboration, norms that have a direct impact on their interpretation and that are rarely taken into account by methods which assume that the expression of opinions is detached from its conditions of enunciation. For example, even if the aim is simply to qualify an opinion as "positive" or "negative" with regard to an object, we consider that film reviews (comments by Internet users) cannot be treated in the same way as forum posts discussing a societal issue. Our aim is to elucidate the role these complex cues play, at each level, in the interpretation of subjective information, and to model them in such a way that they can be extracted automatically.
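A minimal sketch of combining cues from several levels into a subjectivity profile, in the spirit of the multi-level approach described above; the cue inventories are illustrative assumptions.

```python
# Combining cues from several levels (lexical, enunciative, structural)
# into a subjectivity profile; the cue inventories are illustrative.
SUBJECTIVE_LEXICON = {"great", "awful", "boring", "love"}

def subjectivity_cues(text: str) -> dict[str, float]:
    tokens = text.lower().replace("!", " !").replace(",", "").split()
    n = max(len(tokens), 1)
    return {
        "lexical": sum(t in SUBJECTIVE_LEXICON for t in tokens) / n,
        "first_person": sum(t in {"i", "my", "me"} for t in tokens) / n,
        "exclamation": text.count("!") / n,
    }

print(subjectivity_cues("I love this film, the actors are great!"))
```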
Measuring social distance in experience narratives from discussion forums (likely to evolve)
Jugurtha Aït-Hamlat
Thesis started in 2011
Research supervisor: Mathieu Valette
Summary:
Textual forms that emerged with Web 2.0 today represent a highly prized source of data in various fields (text mining, opinion/sentiment analysis, business intelligence, etc.). For the most part, they embody the expression of subjectivity, giving rise to productions known as "egodocuments". Within this textual genre, we are interested in "narratives of experience" from discussion forums. The aim will be to develop an analysis method for detecting affinities between narrators within the framework of a social web application.
Based on the postulate that similarity between two narratives can be considered in terms of distance and using corpus linguistics tools, the objective of the thesis will be to formalize semantic relationships giving rise to profiles of comparable narrators.
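As a hedged sketch of the distance postulate, the snippet below compares experience narratives as TF-IDF vectors, reading small cosine distance as narrator affinity; the narratives are invented examples.

```python
# Comparing experience narratives as TF-IDF vectors, with small cosine
# distance read as narrator affinity. The narratives are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

narratives = [
    "After my diagnosis I started running every morning",
    "Since the diagnosis I run each morning to cope",
    "My new phone arrived yesterday and the screen is great",
]

vectors = TfidfVectorizer().fit_transform(narratives)
dist = cosine_distances(vectors)
# Narratives 0 and 1 should be closer to each other than 0 and 2.
print(round(dist[0, 1], 2), "<", round(dist[0, 2], 2))
```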