docs: Update image and text embeddings documentation #449

Merged
merged 3 commits on Jul 18, 2025
@@ -66,10 +66,6 @@ A string that specifies the location of the tokenizer JSON file.

To run the model, you can use the `forward` method. It accepts one argument, which is a string representing the text you want to embed. The function returns a promise, which can resolve either to an error or an array of numbers representing the embedding.

:::info
The returned embedding vector is normalized, meaning that its length is equal to 1. This allows for easier comparison of vectors using cosine similarity, just calculate the dot product of two vectors to get the cosine similarity score.
:::

## Example

```typescript
@@ -82,6 +78,13 @@ import {
const dotProduct = (a: number[], b: number[]) =>
a.reduce((sum, val, i) => sum + val * b[i], 0);

const cosineSimilarity = (a: number[], b: number[]) => {
const dot = dotProduct(a, b);
const normA = Math.sqrt(dotProduct(a, a));
const normB = Math.sqrt(dotProduct(b, b));
return dot / (normA * normB);
};

function App() {
const model = useTextEmbeddings({
modelSource: ALL_MINILM_L6_V2,
@@ -94,8 +97,10 @@ function App() {
const helloWorldEmbedding = await model.forward('Hello World!');
const goodMorningEmbedding = await model.forward('Good Morning!');

// The embeddings are normalized, so we can use dot product to calculate cosine similarity
const similarity = dotProduct(helloWorldEmbedding, goodMorningEmbedding);
const similarity = cosineSimilarity(
helloWorldEmbedding,
goodMorningEmbedding
);

console.log(`Cosine similarity: ${similarity}`);
} catch (error) {
@@ -108,17 +113,22 @@ function App() {

## Supported models

| Model | Language | Max Tokens | Embedding Dimensions | Description |
| ----------------------------------------------------------------------------------------------------- | :------: | :--------: | :------------------: | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | English | 256 | 384 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
| [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | English | 384 | 768 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | English | 511 | 384 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
| [multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) | English | 512 | 768 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
| Model | Language | Max Tokens | Embedding Dimensions | Description |
| ----------------------------------------------------------------------------------------------------- | :------: | :--------: | :------------------: | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | English | 254 | 384 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
| [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | English | 382 | 768 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | English | 509 | 384 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
| [multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) | English | 510 | 768 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
| [clip-vit-base-patch32-text](https://huggingface.co/openai/clip-vit-base-patch32) | English | 74 | 512 | CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. CLIP embeds images and text into the same vector space, which makes it possible to find similar images and to implement image search. This is the text encoder part of the CLIP model. To embed images, check out [clip-vit-base-patch32-image](../02-computer-vision/useImageEmbeddings.md#supported-models). |

**`Max Tokens`** - the maximum number of tokens that can be processed by the model. If the input text exceeds this limit, it will be truncated.

**`Embedding Dimensions`** - the size of the output embedding vector. This is the number of dimensions in the vector representation of the input text.

:::info
For the supported models, the returned embedding vector is normalized, meaning that its length is equal to 1. This allows for easier comparison of vectors using cosine similarity: just calculate the dot product of two vectors to get the cosine similarity score.
:::

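For instance, because the embeddings above are unit length, the `cosineSimilarity` helper from the example collapses to a plain dot product. The snippet below is a minimal sketch of that shortcut; the embedding variables are assumed to come from `model.forward` calls like the ones shown earlier.

```typescript
// Minimal sketch: for normalized (unit-length) embeddings the norms are 1,
// so the dot product alone already equals the cosine similarity.
const dotProduct = (a: number[], b: number[]) =>
  a.reduce((sum, val, i) => sum + val * b[i], 0);

// Assuming `helloWorldEmbedding` and `goodMorningEmbedding` were produced by
// `model.forward(...)` as in the example above:
// const similarity = dotProduct(helloWorldEmbedding, goodMorningEmbedding);
```
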
## Benchmarks

### Model size
@@ -129,7 +139,7 @@
| ALL_MPNET_BASE_V2 | 438 |
| MULTI_QA_MINILM_L6_COS_V1 | 91 |
| MULTI_QA_MPNET_BASE_DOT_V1 | 438 |
| CLIP_TEXT_ENCODER | 254 |
| CLIP_VIT_BASE_PATCH32_TEXT | 254 |

### Memory usage

@@ -139,7 +149,7 @@
| ALL_MPNET_BASE_V2 | 520 | 470 |
| MULTI_QA_MINILM_L6_COS_V1 | 160 | 225 |
| MULTI_QA_MPNET_BASE_DOT_V1 | 540 | 500 |
| CLIP_TEXT_ENCODER | 275 | 250 |
| CLIP_VIT_BASE_PATCH32_TEXT | 275 | 250 |

### Inference time

@@ -153,4 +163,4 @@ Times presented in the tables are measured as consecutive runs of the model. Ini
| ALL_MPNET_BASE_V2 | 352 | 423 | 478 | 521 | 527 |
| MULTI_QA_MINILM_L6_COS_V1 | 135 | 166 | 180 | 158 | 165 |
| MULTI_QA_MPNET_BASE_DOT_V1 | 503 | 598 | 680 | 694 | 743 |
| CLIP_TEXT_ENCODER | 35 | 48 | 49 | 40 | - |
| CLIP_VIT_BASE_PATCH32_TEXT | 35 | 48 | 49 | 40 | - |
48 changes: 26 additions & 22 deletions docs/docs/02-hooks/02-computer-vision/useImageEmbeddings.md
@@ -27,12 +27,10 @@ It is recommended to use models provided by us, which are available at our [Hugg
```typescript
import {
useImageEmbeddings,
CLIP_VIT_BASE_PATCH_32_IMAGE_ENCODER_MODEL,
CLIP_VIT_BASE_PATCH32_IMAGE,
} from 'react-native-executorch';

const model = useImageEmbeddings({
modelSource: CLIP_VIT_BASE_PATCH_32_IMAGE_ENCODER_MODEL,
});
const model = useImageEmbeddings(CLIP_VIT_BASE_PATCH32_IMAGE);

try {
const imageEmbedding = await model.forward('https://url-to-image.jpg');
@@ -62,23 +60,25 @@ A string that specifies the location of the model binary. For more information,

To run the model, you can use the `forward` method. It accepts one argument, which is a URI/URL to an image you want to encode. The function returns a promise, which can resolve either to an error or an array of numbers representing the embedding.

:::info
The returned embedding vector is normalized, meaning that its length is equal to 1. This allows for easier comparison of vectors using cosine similarity, just calculate the dot product of two vectors to get the cosine similarity score.
:::

## Example

```typescript
const dotProduct = (a: number[], b: number[]) =>
a.reduce((sum, val, i) => sum + val * b[i], 0);

const cosineSimilarity = (a: number[], b: number[]) => {
const dot = dotProduct(a, b);
const normA = Math.sqrt(dotProduct(a, a));
const normB = Math.sqrt(dotProduct(b, b));
return dot / (normA * normB);
};

try {
// we assume you've provided catImage and dogImage
const catImageEmbedding = await model.forward(catImage);
const dogImageEmbedding = await model.forward(dogImage);

// The embeddings are normalized, so we can use dot product to calculate cosine similarity
const similarity = dotProduct(catImageEmbedding, dogImageEmbedding);
const similarity = cosineSimilarity(catImageEmbedding, dogImageEmbedding);

console.log(`Cosine similarity: ${similarity}`);
} catch (error) {
@@ -88,34 +88,38 @@ try {

## Supported models

| Model | Language | Image size | Embedding Dimensions | Description |
| ------------------------------------------------------------------------------------------ | :------: | :--------: | :------------------: | -------------------------------------------------------------- |
| [clip-vit-base-patch32-image-encoder](https://huggingface.co/openai/clip-vit-base-patch32) | English | 224 x 224 | 512 | Trained using contrastive learning for image search use cases. |
| Model | Language | Image size | Embedding Dimensions | Description |
| ---------------------------------------------------------------------------------- | :------: | :--------: | :------------------: | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [clip-vit-base-patch32-image](https://huggingface.co/openai/clip-vit-base-patch32) | English | 224 x 224 | 512 | CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. CLIP embeds images and text into the same vector space, which makes it possible to find similar images and to implement image search. This is the image encoder part of the CLIP model. To embed text, check out [clip-vit-base-patch32-text](../01-natural-language-processing/useTextEmbeddings.md#supported-models). |

**`Image size`** - the size of an image that the model takes as an input. Resize will happen automatically.

**`Embedding Dimensions`** - the size of the output embedding vector. This is the number of dimensions in the vector representation of the input image.

:::info
For the supported models, the returned embedding vector is normalized, meaning that its length is equal to 1. This allows for easier comparison of vectors using cosine similarity: just calculate the dot product of two vectors to get the cosine similarity score.
:::

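Since the CLIP text and image encoders listed above share one embedding space, a text query can be scored directly against image embeddings. The sketch below illustrates a simple image search; `textModel` and `imageModel` are assumed to already exist, created with `useTextEmbeddings` (CLIP text encoder) and `useImageEmbeddings` (CLIP image encoder) as shown in their respective docs. It is an illustration, not a prescribed API.

```typescript
// Sketch: cross-modal similarity with the CLIP text and image encoders.
// Assumes `textModel` comes from useTextEmbeddings with the CLIP text model
// and `imageModel` comes from useImageEmbeddings with the CLIP image model.
const dotProduct = (a: number[], b: number[]) =>
  a.reduce((sum, val, i) => sum + val * b[i], 0);

const rankImagesByText = async (query: string, imageUris: string[]) => {
  const queryEmbedding = await textModel.forward(query);

  const scored = await Promise.all(
    imageUris.map(async (uri) => {
      const imageEmbedding = await imageModel.forward(uri);
      // Embeddings are normalized, so the dot product is the cosine similarity.
      return { uri, score: dotProduct(queryEmbedding, imageEmbedding) };
    })
  );

  // Higher score means the image matches the text query more closely.
  return scored.sort((a, b) => b.score - a.score);
};
```
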
## Benchmarks

### Model size

| Model | XNNPACK [MB] |
| ---------------------- | :----------: |
| CLIP_VIT_BASE_PATCH_32 | 352 |
| Model | XNNPACK [MB] |
| --------------------------- | :----------: |
| CLIP_VIT_BASE_PATCH32_IMAGE | 352 |

### Memory usage

| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
| ---------------------- | :--------------------: | :----------------: |
| CLIP_VIT_BASE_PATCH_32 | 324 | 347 |
| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
| --------------------------- | :--------------------: | :----------------: |
| CLIP_VIT_BASE_PATCH32_IMAGE | 324 | 347 |

### Inference time

:::warning warning
Times presented in the tables are measured as consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization. Performance also heavily depends on the image size, because resizing is an expensive operation, especially on low-end devices.
:::

| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] |
| ---------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: |
| CLIP_VIT_BASE_PATCH_32 | 104 | 120 | 280 | 265 |
| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] |
| --------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: |
| CLIP_VIT_BASE_PATCH32_IMAGE | 104 | 120 | 280 | 265 |
@@ -45,7 +45,3 @@ To load the model, use the `load` method. It accepts the `modelSource` which is
## Running the model

It accepts one argument, which is a URI/URL to an image you want to encode. The function returns a promise, which can resolve either to an error or an array of numbers representing the embedding.
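
A rough end-to-end sketch is shown below. Only the `load` and `forward` methods are taken from the text above; the module class name and the instantiation pattern are assumptions for illustration and may differ from the actual API.

```typescript
import {
  // The class name below is an assumption for illustration; check the
  // library exports for the exact module name.
  ImageEmbeddingsModule,
  CLIP_VIT_BASE_PATCH32_IMAGE,
} from 'react-native-executorch';

const model = new ImageEmbeddingsModule();

try {
  // `load` accepts the model source; here we assume the CLIP image constant
  // can be passed directly as that source.
  await model.load(CLIP_VIT_BASE_PATCH32_IMAGE);

  // `forward` takes a URI/URL to an image and resolves to the embedding.
  const embedding = await model.forward('https://url-to-image.jpg');
  console.log(`Embedding dimensions: ${embedding.length}`);
} catch (error) {
  console.error(error);
}
```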

:::info
The returned embedding vector is normalized, meaning that its length is equal to 1. This allows for easier comparison of vectors using cosine similarity, just calculate the dot product of two vectors to get the cosine similarity score.
:::