Generative AI and LLMs

Generative AI (GenAI) and Large Language Models (LLMs) represent recent advancements in artificial intelligence that have revolutionized content creation and production. Moving beyond traditional AI’s focus on content analysis, these technologies enable the generation of original and innovative content. Powered by rapid advancements in deep learning, with technologies like Transformers and Diffusion Models at the forefront, these systems can encapsulate human knowledge and imitate human creativity.

GenAI and LLMs have the capability to produce art, music, text, and even lifelike images that are nearly indistinguishable from those created by humans. This has given rise to an expanding realm of AI-generated content, with profound implications across all fields, including entertainment, education, and healthcare. They hold the promise of pushing the boundaries of what is possible in ways we are just beginning to imagine.

Generative Modeling

At VinAI, our efforts span many aspects of GenAI and LLMs, driven by the demand for solid foundation models and high-impact applications. Our dual focus on algorithmic advancements and engineering excellence aims to enhance AI models so that they produce trustworthy content of higher quality at lower cost. Read on to learn more about our major research thrusts and contributions.

Foundation Models and Datasets

Providing free access to foundation models is a key part of our commitment to the AI community. This commitment stems from our understanding that not everyone has privileged access to expertise, training data, and computational resources to train their own foundation models, and we are in a unique position to lead this endeavor to democratize foundation models. Our consistent efforts in developing and releasing foundation models, such as PhoBERT and BERTweet, have benefited the wider community, as evidenced by millions of downloads so far. Recently, we have also open-sourced a state-of-the-art large generative model series, PhoGPT, for Vietnamese, and demonstrated its superior performance compared to previous open-source models.
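
As a concrete illustration of this open access, the minimal sketch below loads PhoBERT through the Hugging Face transformers library, assuming the publicly listed model ID "vinai/phobert-base"; exact IDs and usage details may differ across model versions.

```python
# Minimal sketch: loading the released PhoBERT checkpoint from the Hugging Face
# Hub. PhoBERT expects word-segmented Vietnamese input (syllables of a word
# joined by "_"), as in the example sentence below.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModel.from_pretrained("vinai/phobert-base")

sentence = "Chúng_tôi là những nghiên_cứu_viên ."  # "We are researchers."

inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # contextual token embeddings

print(features.shape)  # (1, sequence_length, 768) for the base model
```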

The publicly available datasets for Vietnamese remain inadequate for training models that produce human-quality content, so we have led the effort to advance the state of the art by creating and disseminating high-quality, large-scale datasets. For English-Vietnamese translation, for example, PhoMT is a trailblazing high-quality, large-scale parallel dataset for text translation containing 3.02 million sentence pairs, while PhoST comprises 508 hours of audio and 331,000 (audio, transcript, target-language text) triplets for speech translation.
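
For illustration only, the sketch below shows how records from these two kinds of datasets might be represented in code; the field names, paths, and example values are hypothetical and do not reflect the official release formats.

```python
# Illustrative record structures for a parallel text pair (PhoMT-style) and a
# speech-translation triplet (PhoST-style). Field names and example values are
# assumptions for exposition, not the actual release format.
from dataclasses import dataclass

@dataclass
class SentencePair:
    english: str      # source sentence
    vietnamese: str   # human-quality translation

@dataclass
class SpeechTranslationTriplet:
    audio_path: str   # path to the source-language audio clip
    transcript: str   # transcript of the audio
    translation: str  # target-language (Vietnamese) text

pair = SentencePair(
    english="Hello, how are you?",
    vietnamese="Xin chào, bạn khỏe không?",
)
triplet = SpeechTranslationTriplet(
    audio_path="clips/example_0001.wav",
    transcript="Hello, how are you?",
    translation="Xin chào, bạn khỏe không?",
)
```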

Generative Quality Advancement

Text and Speech Translation and Generation

We have been working on various language modeling tasks and have contributed several text and speech models specialized for Vietnamese. One example is PhoBERT, a pre-trained language model for Vietnamese that advances the state of the art on multiple downstream NLP tasks such as dependency parsing and intent detection; the PhoBERT models are publicly available and have been downloaded over 100,000 times. On the speech side, FlowVocoder is an innovative autoregressive neural vocoder with a small memory footprint that generates high-fidelity audio in real time. In text-to-speech, we developed XPhoneBERT, the first multilingual model pre-trained to learn phoneme representations for downstream TTS tasks.
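
As a hedged illustration of how a released model such as PhoBERT plugs into a downstream task like intent detection, the sketch below attaches a standard Hugging Face classification head; the label set and example utterance are placeholders, and the head would still need fine-tuning on labeled data before its predictions mean anything.

```python
# Minimal sketch: PhoBERT with a sequence-classification head for a downstream
# task such as intent detection. The number of classes and the utterance are
# hypothetical; the classifier weights are newly initialized and untrained.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/phobert-base",
    num_labels=5,  # hypothetical number of intent classes
)

# Word-segmented Vietnamese utterance: "I want to book a flight to Hanoi."
utterance = "Tôi muốn đặt vé máy_bay đi Hà_Nội"
batch = tokenizer(utterance, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits
predicted_intent = logits.argmax(dim=-1).item()
```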

Image Manipulation and Generation

We have been working on various generative computer vision tasks, including image manipulation, enhancement, and translation. Our contributions include HyperInverter for StyleGAN inversion, QC-StyleGAN for quality-controlled image generation, PSENet for self-enhancement of images captured in extreme lighting conditions, and HyperCUT for image deblurring. We have also developed novel applications such as CPM, which enables personalized appearances through virtual makeup transfer, and Neural Scene Decoration, a new method that transforms an empty scene and an object layout into a realistically furnished scene photograph.

Efficiency Optimization

At VinAI, we understand that the efficiency of GenAI models significantly influences their capital and operational costs, so we have committed considerable research and development resources to optimizing these models. Our initiatives include refining the architecture and reducing the inference time of the Transformer modules at the core of numerous foundation models. Our algorithmic innovations for faster image generation are exemplified by our novel wavelet diffusion models, which run several times faster than conventional diffusion models. We have also tackled the cost of training and fine-tuning large language models, deriving techniques that allow domain-specific fine-tuning within a single day. Furthermore, we strive for engineering excellence, applying comprehensive pipeline optimization techniques to adapt image generation and large language models to devices with constrained computing power, such as smartphones.
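
To convey the intuition behind the wavelet approach, the conceptual sketch below (using the PyWavelets library, and not our actual implementation) shows how a single 2D Haar decomposition splits an image into four half-resolution subbands, so a diffusion model operating in the wavelet domain works on much smaller spatial grids.

```python
# Rough illustration of the idea behind wavelet-domain diffusion: one level of
# a 2D Haar transform turns an HxW image into four (H/2)x(W/2) subbands, so the
# denoising network processes far fewer spatial positions per channel.
import numpy as np
import pywt

image = np.random.rand(256, 256).astype(np.float32)  # stand-in for one channel

# Low-frequency approximation plus horizontal/vertical/diagonal detail subbands.
low, (horizontal, vertical, diagonal) = pywt.dwt2(image, "haar")
print(low.shape)  # (128, 128): each subband holds a quarter of the pixels

# The transform is invertible, so samples generated in the wavelet domain can be
# mapped back to pixel space without loss.
reconstructed = pywt.idwt2((low, (horizontal, vertical, diagonal)), "haar")
assert np.allclose(reconstructed, image, atol=1e-5)
```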

Reliability and Trustworthiness

At VinAI, we take the reliability and trustworthiness of GenAI models and their generated content seriously. We remain vigilant about the potential dangers of GenAI, proactively investigating risks that might compromise AI model integrity, establishing best practices for development and use, and devising defensive strategies. To assess the reliability of generative models, we developed TISE, an extensive metric suite for Text-to-Image Synthesis Evaluation, and used it to benchmark state-of-the-art methods. To prevent generative models from being misused for the unauthorized manipulation of personal images, which could fuel fake news or offensive content, we introduced Anti-DreamBooth, a defense mechanism that adds imperceptible noise to personal photos so that generative models fine-tuned on them produce degraded output.
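
As a conceptual sketch of the imperceptibility constraint behind such protective perturbations (and not the Anti-DreamBooth optimization itself), the snippet below keeps any added noise within a small L-infinity budget so the protected photo remains visually identical to the original; the budget value is a common illustrative choice, not a documented setting.

```python
# Conceptual sketch of the imperceptibility constraint: protective noise is kept
# within a small L-infinity budget around the original image. Illustrative only.
import torch

def project_to_budget(original: torch.Tensor, perturbed: torch.Tensor,
                      epsilon: float = 8 / 255) -> torch.Tensor:
    """Clamp a perturbed image so it stays within epsilon of the original."""
    noise = torch.clamp(perturbed - original, -epsilon, epsilon)
    return torch.clamp(original + noise, 0.0, 1.0)

# Example: a candidate perturbation is projected back into the budget.
image = torch.rand(3, 512, 512)                     # original photo in [0, 1]
candidate = image + 0.05 * torch.randn_like(image)  # proposed protected image
protected = project_to_budget(image, candidate)
assert (protected - image).abs().max() <= 8 / 255 + 1e-6
```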