SeaLLMs – Large Language Models for Southeast Asia
Xuan Phi Nguyen
Phi is a senior research engineer at DAMO Academy, Alibaba Group in Singapore, where he works on multilinguality in large language models and translation technologies, with the goal of democratizing AI for under-represented communities. Prior to that, he completed his PhD in Artificial Intelligence at Nanyang Technological University (NTU) in Singapore. During his PhD, Phi also completed research internships at Salesforce AI Research and Facebook AI Research (FAIR) in 2019, 2021, and 2022. He has published several research papers at machine learning and natural language processing conferences, including ICLR-20/22, NeurIPS-20/22, ICML-21, ACL-20/21, ICASSP-23, and EMNLP-23. Phi is also the recipient of the 2021 Singapore Data Science Consortium (SDSC) Dissertation Research Award, the NeurIPS Scholar Award, and the A*STAR Computing and Information Science (ACIS) Scholarship.
Despite the remarkable achievements of large language models (LLMs) across a variety of tasks, they exhibit a linguistic bias that favors high-resource languages, such as English, often at the expense of low-resource and regional languages. To address this imbalance, we introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages. SeaLLMs are built upon Llama-2 and further advanced through continued pre-training, specialized instruction tuning, and alignment tuning. Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform ChatGPT-3.5 by large margins in non-Latin SEA languages such as Thai, Khmer, Lao, and Burmese, while remaining lightweight and cost-effective to operate.
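For readers who want to try such a model before the talk, the sketch below shows how one might load and query a SeaLLM chat model with the Hugging Face transformers library. This is a minimal sketch under stated assumptions, not code from the talk: the repository ID SeaLLMs/SeaLLM-13b-chat and the presence of a built-in chat template are assumptions, so check the official release for the actual names.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub repository ID; the abstract does not name one.
model_id = "SeaLLMs/SeaLLM-13b-chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision keeps the 13B model cheap to serve
    device_map="auto",
)

# A question in Thai ("Which country's capital is Bangkok?"), one of the
# non-Latin SEA languages the abstract highlights.
messages = [{"role": "user", "content": "กรุงเทพมหานครเป็นเมืองหลวงของประเทศใด"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))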