Embeddings

What are Embeddings?

Embeddings are a form of data representation where elements, such as words, phrases, or entire documents, are converted into vectors of floating-point numbers. This method effectively transforms textual or categorical data into a format that can be processed by machine learning models, particularly in natural language processing tasks.

Why We Use Embeddings

Embeddings are used in various applications due to their ability to capture semantic relationships and patterns in data:

  • Search: In search applications, embeddings rank results by relevance to a query string. They help in understanding the context and semantics of the query, matching it with the most relevant content; a minimal search sketch follows this list.

  • Clustering: Embeddings are used to group text strings by similarity. They allow algorithms to recognize and group similar texts, facilitating better organization and analysis of data.

  • Recommendations: In recommendation systems, embeddings suggest items with related text strings. They help in identifying items that are contextually or thematically similar to user preferences.

  • Anomaly Detection: Embeddings assist in identifying outliers that bear little relatedness to the rest of the data. They can highlight unusual or rare items within a dataset by measuring their dissimilarity to the majority.

  • Diversity Measurement: By analyzing the distribution of pairwise similarities, embeddings help in assessing how diverse a dataset is in terms of its contents and representations.

  • Classification: Embeddings classify text strings by their most similar label. They enable algorithms to categorize texts based on their closeness to predefined labels or categories.
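
For example, search boils down to ranking documents by the cosine similarity between a query vector and pre-computed document vectors. The sketch below uses made-up 3-dimensional vectors to stay readable; real embedding models produce hundreds or thousands of dimensions, and the titles and numbers are purely illustrative.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 = same direction (very similar), 0.0 = orthogonal (unrelated).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical pre-computed embeddings (in practice, produced by an embedding model).
doc_vectors = {
    "How to reset a password": np.array([0.12, 0.87, 0.03]),
    "Quarterly sales report":  np.array([0.91, 0.05, 0.10]),
    "Account recovery steps":  np.array([0.15, 0.82, 0.07]),
}
query_vector = np.array([0.10, 0.90, 0.05])  # embedding of "I forgot my login"

# Rank documents by semantic similarity to the query.
ranked = sorted(doc_vectors.items(),
                key=lambda item: cosine_similarity(query_vector, item[1]),
                reverse=True)
for title, vec in ranked:
    print(f"{cosine_similarity(query_vector, vec):.3f}  {title}")
```

The same similarity measure underpins the other use cases above: clustering groups vectors that lie close together, classification picks the nearest label vector, and anomaly detection flags vectors far from all the others.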

Embeddings in Practice: "text-embedding-ada-002"

"Text-embedding-ada-002" from OpenAI is an example of an embedding model that creates a multi-dimensional vector for each text input. The number of dimensions per vector can vary, but typically, a higher number of dimensions (e.g., hundreds or thousands) allows for a more nuanced and detailed representation of the text.

  • Number of Dimensions: The high dimensionality allows many aspects and nuances of the text to be captured, supporting a more accurate and detailed representation.

  • Benefits in Vector Stores: When used in vector stores, these high-dimensional embeddings facilitate efficient similarity searches and complex NLP tasks. The vector store can quickly compare these embeddings to find the most relevant matches based on semantic similarity.

In essence, embeddings like "text-embedding-ada-002" transform text into a rich, multi-dimensional vector space where semantic relationships can be quantified and leveraged for various AI-driven applications.

When an LLM is used, its vector conversion engine (the embedding model) is used to prepare a vector database.

The same model can also be used to transform requests from human users or other systems. In this way, we ensure that requests remain consistent with the stored vectors, using the same dimensions and the same vector space.
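
Here is a minimal sketch of this consistency requirement, assuming the openai v1 client and numpy are available; the documents, the query, and the embed helper are hypothetical.

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "text-embedding-ada-002"

def embed(text: str) -> np.ndarray:
    # Hypothetical helper: every vector, indexed or queried, comes from the same model.
    resp = client.embeddings.create(model=MODEL, input=text)
    return np.array(resp.data[0].embedding)

# Indexing: embed the documents once and keep the vectors.
documents = [
    "Reset your password from the login page.",
    "Quarterly sales grew this year.",
]
index = [(doc, embed(doc)) for doc in documents]

# Querying: embed the request with the *same* model, so the query vector
# lives in the same 1,536-dimensional space as the indexed vectors.
query = embed("I forgot my login")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

best_doc, _ = max(index, key=lambda pair: cosine(query, pair[1]))
print(best_doc)
```

Swapping models between indexing and querying would compare vectors from unrelated spaces (and possibly of different sizes), so the similarity scores would be meaningless.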

An AI application can therefore use several LLMs, depending on the tasks required. Each interaction will use a vector format (embeddings) specific to the corresponding model, since vectors produced by different models are not interchangeable.