LaCAS-IA project among the SESAME 2024 winners
The "Soutien aux Équipes Scientifiques pour l'Acquisition de Moyens Expérimentaux (SESAME)" (Support for Scientific Teams to Acquire Experimental Resources) scheme co-finances the scientific equipment needed by public research laboratories in the Paris region to carry out large-scale projects. Out of 34 applications received, the LaCAS-IA project is one of the 12 winners.
LaCAS-IA aims to integrate AI into the LaCAS platform (created in part through an earlier SESAME 2020 project) to automate metadata harvesting and classification, train language models on rare languages, and offer advanced processing and search tools.
Technical aspects of the LaCAS-IA project
This funding will enable the acquisition of graphics processing units (GPUs) and storage arrays to increase computing and data management capacity, the project's two major technical focuses.
Optimum data storage
Projects like LaCAS require the management of large volumes of data (linguistic corpora, databases of texts or audio recordings, etc.). The newly acquired storage arrays will thus enable:
- High-capacity storage of large datasets with fast access
- High-quality storage of audiovisual and visual data
- Development of specific APIs (programming interfaces), enabling, for example, automatic transcription and indexing of interviews in the audiovisual stream
- Backup and recovery of data, avoiding irreparable losses to research and ensuring project continuity
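As an illustration of the backup point above, here is a minimal sketch of a copy-then-verify step in Python. It is not the LaCAS team's procedure: the function names `sha256_of` and `backup_and_verify` are hypothetical, and the choice of SHA-256 checksums is an assumption made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, backup_dir):
    """Copy `source` into `backup_dir` and check that the copy is byte-identical.

    Hypothetical helper: a real deployment would more likely rely on the
    storage array's own replication and snapshot features.
    """
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copies file contents and metadata
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"checksum mismatch after copying {source} to {target}")
    return target
```

Comparing checksums after the copy is what turns a plain file copy into a backup that can be trusted at recovery time, since silent corruption is detected immediately rather than when the data is needed.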
Process automation
Modern language processing models, used in domains such as machine translation, speech recognition, semantic analysis, and natural language generation, require significant computational resources. GPUs can significantly reduce the time needed to train these models and run predictions.
By combining GPUs for computation and storage arrays for data management, the LaCAS team will be able to train deep learning models and analyze large databases in real time. The aim is to produce accurate results faster, accelerate the research process, and improve overall project efficiency.
Processes that can be automated:
- Metadata harvesting from open archives and national and European public repositories
- Metadata classification in LaCAS data
- Content translation by dedicated models
- Video transcription and subtitling by voice recognition
- Linear indexing of video streams by image recognition
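The first two items in this list can be sketched in a few lines of Python. The example below assumes that the repositories expose their metadata through OAI-PMH with Dublin Core records, a common convention for open archives but not something the project description specifies. The sample response, the `parse_records` and `classify` functions, and the keyword table are all hypothetical, and the keyword lookup merely stands in for the trained classifier the project envisions.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical ListRecords response, inlined so the sketch runs offline.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Oral narratives, field recordings</dc:title>
          <dc:language>fr</dc:language>
          <dc:subject>linguistics</dc:subject>
          <dc:subject>oral tradition</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

# Hypothetical keyword-to-category table; a trained model would replace it.
CATEGORY_KEYWORDS = {
    "linguistics": "Language sciences",
    "oral tradition": "Anthropology",
}


def parse_records(xml_text):
    """Extract identifier, title, language and subjects from each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:
            continue  # records without a Dublin Core payload are skipped
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": dc.findtext("dc:title", default="", namespaces=NS),
            "language": dc.findtext("dc:language", default="", namespaces=NS),
            "subjects": [s.text for s in dc.findall("dc:subject", NS) if s.text],
        })
    return records


def classify(record):
    """Map a record's subject keywords to categories (rule-based placeholder)."""
    return sorted({
        CATEGORY_KEYWORDS[s.lower()]
        for s in record["subjects"]
        if s.lower() in CATEGORY_KEYWORDS
    })
```

In a real pipeline the XML would come from a repository's harvesting endpoint rather than an inline string, and `classify` would call a model trained on LaCAS data; the structure of the two steps would stay the same.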
LaCAS-IA project policy guidelines
Beyond strengthening the credibility of the LaCAS project in a highly competitive field (AI and NLP), these technical improvements contribute decisively to the project's policy ambitions. Open science and the preservation of rare languages are its two central priorities; they distinguish the project from similar scientific or technological initiatives and make it a key player in promoting area studies in France.
The objectives of open science
Centralized storage arrays allow data resources to be shared more easily between researchers and collaborators, improving international cooperation and the development of new research based on open corpora.
The GPUs, meanwhile, will enable the development of complex models (such as deep learning models for natural language processing), which can then be released as open source tools. These models could be used and improved by the scientific community, reinforcing the virtuous circle of open science.
In the long term, the LaCAS-IA project should make it possible to offer a vast corpus of data and tools to a wider public (civil society, companies, etc.), and open access to information and knowledge about different areas of the world.
Preserving rare languages
Large language models (LLMs) are emerging as powerful catalysts in the preservation and study of rare languages. These artificial intelligence tools, capable of processing and generating human language with remarkable accuracy, offer a glimmer of hope for the world's 2,500 or so endangered languages.
GPUs will make it possible to train models capable of working on languages for which little data exists. Algorithms for the automatic processing of low-resource languages require high computing power to handle the linguistic peculiarities of these languages.
In the context of language preservation, it is often necessary to process not only textual data, but also audio and video recordings (interviews, conversations, oral narratives). GPUs are particularly well-suited to the analysis of this type of multimodal data, facilitating the automatic transcription, annotation and analysis of oral data for endangered languages.