A new website for the Pangloss collection

4 November 2022
  • LACITO

As part of the CNRS Com'Lab program, the Pangloss Collection, supported by LACITO (Inalco-Sorbonne Nouvelle Paris 3-CNRS), has completely redesigned its web interface to facilitate interdisciplinary collaboration between the humanities and social sciences and natural language processing. The new site also gives the general public free access to its text and audio corpora.
Children in Papua New Guinea playing in a river © Sylvain Loiseau

Of the 7,000 languages spoken today, half have fewer than 10,000 speakers and a quarter have fewer than 1,000. These small language communities most often live in rural environments, as here in a rainforest in Papua New Guinea.

The Pangloss Collection

The Pangloss Collection is an open archive of endangered and under-documented languages run by LACITO (Laboratoire de langues et civilisations à tradition orale), a multidisciplinary research laboratory (linguistics and anthropology) that studies languages with an oral tradition through fieldwork in a range of linguistic and cultural areas.

"The fruit of over twenty years' work by specialized CNRS researchers and engineers", the collection plays a major role in safeguarding the world's linguistic heritage. It has grown over the years, with contributions from researchers around the world documenting rare and endangered languages before they disappear. It welcomes corpora from researchers at institutions in France and other countries (Canada, USA, Germany, Netherlands, Vietnam, China, Singapore, Turkey, and others).

As of 2020, the collection contained 3,500 audio and video documents (around 780 hours of recordings in over 170 languages) collected during fieldwork on every continent. Half of the recordings are transcribed and annotated, so that any listener can follow what they are hearing.

A collaborative, open-access tool

This bilingual French-English site provides access to the corpora, with their annotations and videos, through either an interactive map or an alphabetical list of languages. It also provides access to the Lexica collection of multimedia dictionaries.

For professional users (ethnologists, translators, linguists and natural language processing specialists), the site provides a dedicated space with tools for consulting and exploiting corpora and for depositing new resources. Pangloss Labs offers tools to facilitate research on parallel corpora and tools for automatic language recognition.

The originality of the Pangloss Collection is that it is freely accessible, without any form of restriction, and that it offers both multimedia evidence (audio and video recordings) and interlinear transcriptions (morpheme by morpheme) of entire texts. Because of this ease of access to transcribed data, the Pangloss Collection is used in numerous scientific publications. (Sylvain Loiseau, Itinéraires n° 6)

Around twenty partners contribute to the archive, including the Institut des langues rares, Inalco, Bulac, EPHE, Sorbonne-Nouvelle, various CNRS laboratories, France Archives and Huma-Num.
Pangloss is one of 37 collections now hosted by the Cocoon platform, which is dedicated to scientific research and mediation. It also participates in two international networks: the Open Language Archives Community (OLAC), a global virtual library of languages, and DELAMAN, a network of digital archives of endangered languages and musical traditions.