Alibaba Released Babel: An Open Multilingual Large Language Model (LLM) Serving Over 90% of Global Speakers

Researchers from DAMO Academy at Alibaba Group introduced Babel, a multilingual LLM that covers the top 25 most spoken languages, reaching over 90% of global speakers and bridging the coverage gap left by English-centric models. Babel employs a unique layer extension technique to expand model capacity without compromising performance. The team introduced two variants: Babel-9B, optimized for efficient inference and fine-tuning, and Babel-83B, which establishes a new benchmark in multilingual NLP. Unlike previous models, Babel includes widely spoken but often overlooked languages such as Bengali, Urdu, Swahili, and Javanese. The researchers also focused on data quality, implementing a rigorous pipeline that curates high-quality training datasets from multiple sources.

Babel’s architecture differs from conventional multilingual LLMs in its structured layer extension approach. Rather than relying on continued pretraining, which requires extensive computational resources, the team increased the model’s parameter count through controlled expansion, integrating additional layers strategically to maximize performance while preserving computational efficiency. Babel-9B balances speed and multilingual comprehension, making it suitable for research and localized deployment, whereas Babel-83B extends those capabilities to match commercial models.

The training process incorporated extensive data cleaning, using an LLM-based quality classifier to filter and refine training content. The dataset was sourced from diverse origins, including Wikipedia, news articles, textbooks, and structured multilingual corpora such as MADLAD-400 and CulturaX.

Read the full article: https://lnkd.in/g2m9-UiA
Paper: https://lnkd.in/gu8FG7dq
Model on Hugging Face: https://lnkd.in/gwze_gm3
GitHub Page: https://lnkd.in/gyd2UNFg
Project Page: https://lnkd.in/gNYhmtct
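To make the layer-extension idea concrete, here is a minimal sketch of growing a decoder-only model by duplicating existing layers instead of pretraining from scratch. It assumes a Llama-style architecture from Hugging Face transformers; the `extend_layers` helper, the insertion interval, and the base checkpoint are illustrative assumptions, not Babel's actual recipe.

```python
# Sketch: structured layer extension by copying decoder layers in place.
import copy
import torch.nn as nn
from transformers import AutoModelForCausalLM

def extend_layers(model, insert_every=4):
    """Insert a copy of every `insert_every`-th decoder layer after itself."""
    new_layers = []
    for i, layer in enumerate(model.model.layers):
        new_layers.append(layer)
        if (i + 1) % insert_every == 0:
            # The duplicate starts as an exact copy and would be fine-tuned,
            # so the extended model initially stays close to the original.
            new_layers.append(copy.deepcopy(layer))
    # Renumber attention layer indices so KV caching stays consistent.
    for idx, layer in enumerate(new_layers):
        if hasattr(layer.self_attn, "layer_idx"):
            layer.self_attn.layer_idx = idx
    model.model.layers = nn.ModuleList(new_layers)
    model.config.num_hidden_layers = len(new_layers)
    return model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder base
extended = extend_layers(base, insert_every=4)  # e.g. 32 layers -> 40
```

The appeal of this kind of expansion, as the post describes, is that only the inserted layers need substantial training, which is far cheaper than continued pretraining of the whole model.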
Cross-lingual NLP Solutions
Summary
Cross-lingual NLP solutions use advanced language technologies to enable computers to understand, generate, and interact in multiple languages, aiming to bridge communication gaps and support diverse global users. These innovations are transforming how language models handle multilingual data, making them more inclusive and accurate across many spoken languages.
- Expand language coverage: Choose NLP models and datasets that support a wide range of languages to ensure broader communication and accessibility for users worldwide.
- Improve data quality: Use rigorous data-cleaning and filtering methods to train models, ensuring that the language output is reliable and culturally appropriate for different regions (see the filtering sketch after this list).
- Diversify training tasks: Incorporate a variety of conversation styles and tasks when fine-tuning models, so they perform well in real-world scenarios and complex interactions in multiple languages.
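As a concrete illustration of the data-quality point above, here is a minimal sketch of an LLM-based quality filter in the spirit of the Babel pipeline: a judge model scores each document and only high-scoring ones are kept. The prompt, judge model, and threshold are illustrative assumptions, not any paper's actual classifier.

```python
# Sketch: keep only documents an LLM judge scores highly.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

PROMPT = (
    "Rate the following text for training-data quality on a scale of 1-5, "
    "considering fluency, informativeness, and formatting. "
    "Reply with a single digit.\n\nText:\n{doc}"
)

def keep_document(doc: str, threshold: int = 4) -> bool:
    """Return True if the judge scores the document at or above threshold."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any instruction-following judge model works
        messages=[{"role": "user", "content": PROMPT.format(doc=doc[:4000])}],
        temperature=0,
    )
    try:
        return int(resp.choices[0].message.content.strip()[0]) >= threshold
    except (ValueError, IndexError):
        return False  # unparseable score: discard conservatively

corpus = ["A well-formed paragraph about photosynthesis.", "asdf 123 ???"]
cleaned = [d for d in corpus if keep_document(d)]
```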
🌟 Excited to share our latest research on enhancing multilingual capabilities in large language models! 🌟

Introducing SPHINX, a novel multilingual synthetic instruction tuning dataset created to address the performance gap in non-English languages. By translating instruction-response pairs from English into 50 languages, we achieved impressive results.

In our study, fine-tuning PHI-3-SMALL and MISTRAL-7B on SPHINX led to significant performance improvements, surpassing other multilingual datasets on standard benchmarks. Incorporating N-shot examples further boosted performance, showcasing the effectiveness and efficiency of SPHINX.

This advancement marks a significant step toward making large language models more inclusive and effective across diverse languages. Our research highlights the importance of sample efficiency and diversity while minimizing dataset creation costs.

Excited for further discussions and collaborations in NLP, Multilingual AI, Machine Learning, and Artificial Intelligence! 🚀

Link to the paper: https://lnkd.in/g5CP9EZc

Sanchit Ahuja Kumar Tanmay Hardik Chauhan Barun Patra Vishrav Chaudhary Monojit Choudhury Arindam Mitra Luciano Del Corro Tejas Indulal Dhamecha Ahmed Awadallah Sunayana Sitaram

#NLP #MultilingualAI #MachineLearning #ArtificialIntelligence #Research #Innovation
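A rough sketch of the core SPHINX recipe described above: translate an English instruction-response pair into another language with an off-the-shelf translation model, then repeat over many target languages. The NLLB checkpoint and language codes are assumed for illustration; the paper's actual translation setup may differ.

```python
# Sketch: machine-translate an English instruction-response pair.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",  # illustrative choice
    src_lang="eng_Latn",
    tgt_lang="hin_Deva",  # Hindi; loop over ~50 target codes in practice
)

def translate_pair(pair: dict) -> dict:
    """Translate both sides of an instruction-response pair."""
    return {
        "instruction": translator(pair["instruction"])[0]["translation_text"],
        "response": translator(pair["response"])[0]["translation_text"],
    }

seed = {
    "instruction": "Summarize the paragraph below in one sentence.",
    "response": "The paragraph describes how rainfall patterns shift with altitude.",
}
print(translate_pair(seed))
```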
Always excited about research in the multilingual space that can help transfer LLMs' amazing capabilities to other languages! Here's a new IFT dataset that supports 70 languages and is fully synthetic 🙀 😎

M2Lingual is a fully synthetic, multilingual, multi-turn Instruction Fine-Tuning (IFT) dataset comprising 182K evenly distributed instruction-response pairs across 70 languages, resulting in competitive performance across multilingual evaluation benchmarks and MT-Bench.

⛳ IFT plays a super important role in ensuring these LLMs can effectively follow instructions across various tasks. However, the authors note that existing datasets have the following issues:
👉 Limited multilingual support: current IFT datasets are predominantly focused on English.
👉 Single-turn conversations: many existing datasets are not multi-turn, limiting coverage of extended dialogue.
👉 Task diversity: there's a shortage of datasets covering a wide range of NLP tasks in multilingual settings.

⛳ Here's how the dataset addresses these issues:
👉 M2Lingual spans 70 languages and includes 182K instruction-response pairs across 17 diverse NLP tasks, ensuring LLMs are trained on a broad spectrum of language understanding and generation challenges.
👉 Guided by a popular taxonomy called Evol, M2Lingual leverages seed samples from human-generated datasets to create machine-generated instructions. This approach ensures a balanced representation across languages and tasks.
👉 Unlike many existing datasets, M2Lingual incorporates a multi-turn component in its instruction-response pairs, enabling LLMs to handle complex, extended conversations across different languages effectively.

The authors note that LLMs fine-tuned with M2Lingual consistently outperform those trained on existing multilingual datasets across various evaluation benchmarks, demonstrating superior performance in multilingual task scenarios!

Link to the paper: https://lnkd.in/euvMPedT
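For intuition, here is a minimal sketch of Evol-style, multi-turn data generation of the kind the post describes: a human-written seed is evolved into a harder instruction, answered, and extended with a follow-up turn. The prompts and model name are illustrative assumptions; M2Lingual defines its own Evol taxonomy and generation pipeline in the paper.

```python
# Sketch: evolve a seed instruction, then build a two-turn conversation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def chat(messages, temperature=0.7):
    """One completion call against a generator model (illustrative choice)."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=temperature,
    )
    return resp.choices[0].message.content.strip()

def evolve_seed(seed: str) -> list[dict]:
    """Evolve a seed instruction, answer it, then add a follow-up turn."""
    # Evol step: make the seed instruction harder but still answerable.
    instruction = chat([{
        "role": "user",
        "content": "Rewrite this instruction with one added constraint, "
                   "keeping it answerable:\n" + seed,
    }])
    convo = [{"role": "user", "content": instruction}]
    convo.append({"role": "assistant", "content": chat(convo)})
    # Multi-turn step: generate a natural follow-up question, then answer it.
    followup = chat([{
        "role": "user",
        "content": "Write one natural follow-up user question to this "
                   "exchange:\n" + str(convo),
    }])
    convo.append({"role": "user", "content": followup})
    convo.append({"role": "assistant", "content": chat(convo)})
    return convo

example = evolve_seed("List three uses of bamboo.")
```

Run at scale over seed pools in many languages, this kind of loop yields evenly distributed, multi-turn instruction-response pairs of the sort the dataset advertises.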