Tuesday, May 28, 2024

Microsoft Unveils Novel AI for Text Embeddings with Synthetic Data

Microsoft researchers have introduced a method that produces high-quality text embeddings using synthetic data and fewer than 1,000 training steps. The approach is reported to outperform existing methods that require vast datasets and complex training pipelines.

Text embeddings are used extensively in Natural Language Processing (NLP). They are vector representations of natural language that encode the semantic information contained in text. Their applications include information retrieval, question answering, semantic textual similarity, bitext mining, and item recommendation. In information retrieval (IR), the first retrieval stage typically uses text embeddings with approximate nearest neighbor search to efficiently retrieve a small set of candidate documents from a large corpus.
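To make the first retrieval stage concrete, here is a minimal sketch of embedding-based candidate retrieval. It uses exact (brute-force) cosine-similarity search as a stand-in for the approximate nearest neighbor methods used at scale. The vectors, the function names `cosine_similarity` and `top_k`, and the value of `k` are all hypothetical and chosen only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Hypothetical 3-dimensional embeddings; real models emit far more dimensions.
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.0, 0.0]

# Indices of the two documents whose embeddings point closest to the query.
candidates = top_k(query_vec, doc_vecs, k=2)
```

A brute-force scan like this compares the query against every document, which is why large corpora rely on approximate indexes instead.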

Retrieval Augmented Generation (RAG), a recent paradigm that lets Large Language Models (LLMs) access dynamic external knowledge without changing model parameters, likewise relies heavily on embedding-based retrieval. Text embeddings also play a crucial role in attributing the source of generated text, which improves the interpretability and reliability of LLMs.

Prior research has shown that weighted averages of pre-trained word embeddings provide a strong baseline for measuring semantic similarity. These techniques, however, cannot fully capture the rich contextual information in natural language. With the introduction of pre-trained language models, methods such as Sentence-BERT and SimCSE emerged.
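The weighted-average baseline mentioned above can be sketched in a few lines. This is an illustration only: the two-dimensional word vectors, the per-word weights, and the function name `weighted_average_embedding` are hypothetical, and real systems derive the weights from statistics such as word frequency.

```python
def weighted_average_embedding(tokens, word_vecs, weights):
    """Weighted mean of the vectors of known tokens; unknown tokens are skipped."""
    dim = len(next(iter(word_vecs.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in word_vecs:
            continue
        w = weights.get(tok, 1.0)  # default weight for words without an entry
        total = [t + w * v for t, v in zip(total, word_vecs[tok])]
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing contributed
    return [t / weight_sum for t in total]

# Hypothetical word vectors and weights; "the" is down-weighted to zero.
word_vecs = {"the": [1.0, 1.0], "cat": [0.0, 2.0], "sat": [2.0, 0.0]}
weights = {"the": 0.0, "cat": 1.0, "sat": 1.0}

sentence_vec = weighted_average_embedding(["the", "cat", "sat"], word_vecs, weights)
reordered_vec = weighted_average_embedding(["sat", "cat", "the"], word_vecs, weights)
```

Because averaging ignores word order, `sentence_vec` and `reordered_vec` are identical, which illustrates why such methods miss contextual information.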

To learn text embeddings, these methods fine-tune models like BERT on Natural Language Inference (NLI) datasets. State-of-the-art techniques such as E5 and BGE use more sophisticated multi-stage training paradigms: they first pre-train on weakly supervised text pairs and then fine-tune on labeled datasets to improve robustness and performance.

In recent research, a team from Microsoft Corporation has presented a novel and simple method for producing high-quality text embeddings. The approach achieves strong results using only synthetic data and fewer than 1,000 training steps. This contrasts with existing methods, which rely on multi-stage pre-training over billions of weakly supervised text pairs followed by fine-tuning on a limited number of labeled datasets. The main difference is that the new method does not depend on labor-intensive training pipelines or manually collected datasets, which are often limited in task diversity and language coverage.

The method uses proprietary Large Language Models to generate a wide range of synthetic data for text embedding tasks across around 100 languages. Instead of relying on complex pre-training stages, it fine-tunes open-source decoder-only LLMs on the generated synthetic data using a standard contrastive loss.
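The article does not spell out the loss, so the following is only a sketch of one common contrastive objective, InfoNCE with in-batch negatives, and not necessarily the exact formulation the researchers used. The function names, the toy two-dimensional vectors, and the temperature value of 0.05 are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    doc_vecs[i] is the positive for query_vecs[i]; every other document
    in the batch serves as a negative for that query.
    """
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [cosine(q, d) / temperature for d in doc_vecs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Negative log-softmax probability of the positive document.
        total += log_denominator - logits[i]
    return total / len(query_vecs)

# Hypothetical embeddings: in the first batch each query sits next to its
# positive document; in the second batch the positives are swapped.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce_loss(queries, aligned_docs)
loss_swapped = info_nce_loss(queries, swapped_docs)
```

The loss is low when each query is closest to its own positive and high when it is closer to another document in the batch, which is the signal that fine-tuning minimizes.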

The team ran experiments to validate the approach. Without using any labeled data, the model achieves strong results on highly competitive text embedding benchmarks. When fine-tuned on a combination of synthetic and labeled data, it sets new state-of-the-art results on the BEIR and MTEB benchmarks.

Proprietary LLMs such as GPT-4 were used to produce diverse synthetic data, including multilingual instructions. By building on the strong language understanding capabilities of the Mistral model, the method achieves strong performance in nearly all task categories of the highly competitive MTEB benchmark.

In conclusion, this study shows that using LLMs can significantly improve the quality of text embeddings. Its training procedure largely eliminates the need for intermediate pre-training and is more streamlined and efficient than current multi-stage systems.
