GenAI
NLP
EMNLP Findings

A Parallel Corpus for Vietnamese Central-Northern Dialect Text Transfer

January 26, 2024

The Vietnamese language embodies dialectal variants closely attached to the nation’s three macro-regions: the Northern, Central and Southern regions. As the northern dialect forms the basis of the standard language, it’s considered the prestige dialect. While the northern dialect differs from the remaining two in certain aspects, it almost shares an identical lexicon with the southern dialect, making the textual attributes nearly interchangeable. In contrast, the central dialect possesses a number of unique vocabularies and is less mutually intelligible to the standard dialect. Through preliminary experiments, we observe that current NLP models do not possess understandings of the Vietnamese central dialect text, which most likely originates from the lack of resources. To facilitate research on this domain, we introduce a new parallel corpus for Vietnamese central-northern dialect text transfer. Via exhaustive benchmarking, we discover monolingual language models’ superiority over their multilingual counterparts on the dialect transfer task. We further demonstrate that fine-tuned transfer models can seamlessly improve the performance of existing NLP systems on the central dialect domain with dedicated results in translation and text-image retrieval tasks.

Overall

< 1 minute

Thang Le, Luu Anh Tuan

Share Article

Related publications

GenAI
NLP
LREC-COLING
June 28, 2024

Nhu Vo, Dat Quoc Nguyen, Dung D. Le, Massimo Piccardi, Wray Buntine

GenAI
NLP
Findings of ACL
June 28, 2024

Minh-Vuong Nguyen, Linhao Luo, Fatemeh Shiri, Dinh Phung, Yuan-Fang Li, Thuy-Trang Vu, Gholamreza Haffari

GenAI
NLP
Findings of ACL
June 28, 2024

Tinh Son Luong, Thanh-Thien Le, Linh Van Ngo, and Thien Huu Nguyen

GenAI
NLP
ACL Top Tier
June 28, 2024

Trinh Pham*, Khoi M. Le*, Luu Anh Tuan