diff --git a/autovec-tutorial/__frontmatter__.md b/autovec-tutorial/__frontmatter__.md new file mode 100644 index 00000000..433ea4c8 --- /dev/null +++ b/autovec-tutorial/__frontmatter__.md @@ -0,0 +1,18 @@ +--- +# frontmatter +path: "/tutorial-couchbase-autovectorization-langchain" +title: Auto-Vectorization with Couchbase Capella AI Services and LangChain +short_title: Auto-Vectorization with Couchbase and LangChain +description: + - Learn how to use Couchbase Capella's AI Services auto-vectorization feature to automatically convert your data into vector embeddings. + - This tutorial demonstrates how to set up automated embedding generation workflows and perform semantic search using LangChain. +content_type: tutorial +filter: sdk +technology: + - Artificial Intelligence +tags: + - LangChain +sdk_language: + - python +length: 20 Mins +--- diff --git a/autovec-tutorial/autovec_langchain.ipynb b/autovec-tutorial/autovec_langchain.ipynb new file mode 100644 index 00000000..fac08faa --- /dev/null +++ b/autovec-tutorial/autovec_langchain.ipynb @@ -0,0 +1,382 @@ +{ + "cells": [
+ { + "cell_type": "markdown", + "id": "44480f12-3bd0-4fe9-9493-25bd6a2712bb", + "metadata": {}, + "source": [ + "# Auto-Vectorization Using Couchbase Capella AI Services\n", + "\n", + "This tutorial demonstrates how to use Couchbase Capella's AI Services auto-vectorization feature to automatically convert your data into vector embeddings and perform semantic search using LangChain.\n" + ] + },
+ { + "cell_type": "markdown", + "id": "502eb13e", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "# 1. Create and Deploy an Operational Cluster on Capella\n", + " To get started with Couchbase Capella, create an account and use it to deploy a cluster. For detailed steps, follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", + " ### Couchbase Capella Configuration\n", + " When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met:\n", + " * Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the travel-sample bucket (Read and Write) used in the application.\n", + " * [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the cluster from the IP address on which the application is running." + ] + },
+ { + "cell_type": "markdown", + "id": "4369c925-adbc-4c7d-9ea6-04ff020cb1a6", + "metadata": {}, + "source": [ + "# 2. Data Upload and Preparation\n", + "\n", + "There are several ways to insert data into the cluster; see the [sample-data import](https://docs.couchbase.com/cloud/clusters/data-service/import-data-documents.html#import-sample-data) guide. This tutorial uses the travel-sample dataset.\n", + "\n", + "Once the data upload is complete, follow the next steps to vectorize the fields you need. If you prefer to insert documents programmatically, a minimal Python SDK sketch follows below." + ] + },
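+ { + "cell_type": "markdown", + "id": "data-upload-sdk-note", + "metadata": {}, + "source": [ + "For reference, documents can also be inserted with the Couchbase Python SDK instead of the UI import. The cell below is a minimal sketch, not part of the main flow: it assumes the `cluster` connection object created later in section 5, and the document key and fields are hypothetical placeholders.\n" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "id": "data-upload-sdk-demo", + "metadata": {}, + "outputs": [], + "source": [ + "# Minimal sketch: programmatic document insert with the Python SDK (optional).\n", + "# Assumes `cluster` from the connection setup shown in section 5.\n", + "bucket = cluster.bucket(\"travel-sample\")\n", + "collection = bucket.scope(\"inventory\").collection(\"hotel\")\n", + "\n", + "# Hypothetical document; once the workflow in section 4 is deployed,\n", + "# it will pick up documents like this and add the vector field.\n", + "doc = {\n", + "    \"title\": \"Example Hotel\",\n", + "    \"address\": \"1 Example Street\",\n", + "    \"description\": \"A placeholder hotel document used only as an illustration.\"\n", + "}\n", + "collection.upsert(\"hotel_example_1\", doc)" + ] + },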
+ { + "cell_type": "markdown", + "id": "7e3afd3f-9949-4f5e-b96a-1aac1a3aea29", + "metadata": {}, + "source": [ + "# 3. Deploying the Model\n", + "Before we can create embeddings for the documents, we need to deploy a model that will generate them.\n", + "## 3.1 Selecting the Model\n", + "1. Navigate to the \"AI Services\" tab, select \"Models\" and click on \"Deploy New Model\".\n", + " \n", + " \n", + "\n", + "2. Enter the model name and choose the model that you want to deploy. Then select the infrastructure and region where the model will be deployed.\n", + " \n", + " \n", + "\n", + "## 3.2 Access Control for the Model\n", + "\n", + "1. After deploying the model, go to the \"Models\" tab in AI Services and click on \"Setup Access\".\n", + "\n", + " \n", + "\n", + "2. Enter your API key name, expiration time and the IP address from which you will access the model.\n", + "\n", + " \n", + "\n", + "3. Download your API key; it will be used later to authenticate requests to the model.\n", + "\n", + " " + ] + },
+ { + "cell_type": "markdown", + "id": "daaf6525-d4e6-45fb-8839-fc7c20081675", + "metadata": {}, + "source": [ + "# 4. Deploying the AutoVectorization Workflow\n", + "\n", + "This step creates the embeddings/vectors. To proceed with the vectorization process, follow the steps below:\n", + "\n", + "1. Go to the `AI Services` tab, click on `Workflows`, and then click on `Create New Workflow`.\n", + "\n", + " \n", + " \n", + "2. Start your workflow deployment by giving it a name and selecting where your data will be provided to the auto-vectorization service. There are currently three options: `pre-processed data (JSON format) from Capella`, `pre-processed data (JSON format) from external sources (S3 buckets)` and `unstructured data from external sources (S3 buckets)`. For this tutorial, we will choose the first option, pre-processed data from Capella.\n", + "\n", + " \n", + "\n", + "3. Select the `cluster`, `bucket`, `scope` and `collection` containing the documents you want vectorized.\n", + "\n", + " \n", + "\n", + "4. Field Mapping tells the AutoVectorization service which data will be converted to embeddings.\n", + "\n", + " There are two options:\n", + "\n", + " - All source fields - Converts all fields inside the document into a single vector field.\n", + " \n", + " \n", + "\n", + " - Custom source fields - Converts specific fields chosen by the user into a single vector field. In the image below, we have chosen `address`, `description` and `id` as the fields to be converted into a vector named `vec_addr_descr_id_mapping`.\n", + " \n", + " \n", + " \n", + "5. After choosing the type of mapping, create an index on the new vector embedding field. Index creation can be skipped, but this is not recommended, as vector search will not work without it.\n", + "\n", + " \n", + "\n", + "6. The screenshot below summarizes the whole configuration described above. Click `Next` afterwards, as shown below.\n", + "\n", + " \n", + "\n", + "7. Select the model that will be used to create the embeddings. There are two options: a `Capella-based` model and an `external model`.\n", + " \n", + " \n", + "\n", + " - For this tutorial, a Capella-based embedding model is used, as can be seen in the image above. API credentials can be uploaded using the file downloaded in step 3.2, or they can be entered manually.\n", + " - A choice between private and insecure networking is also available.\n", + " - Clicking `Next` will land you on the final page of the workflow.\n", + "\n", + "\n", + "8. The `Workflow Summary` displays all the necessary details of the workflow, including `Data Source`, `Model Service` and `Billing Overview`, as shown in the image below.\n", + "\n", + " \n", + "\n", + "9. Hurray, the workflow is deployed! In the `Workflows` tab you can now see the deployed workflow and check the status of its runs.\n", + "\n", + " \n", + "\n", + "After this step, your vector embeddings for the selected fields should be ready, and you can check them out in the Capella UI. In the next step, we will demonstrate how to use the generated vectors to perform vector search.\n", + "\n" + ] + },
+ { + "cell_type": "markdown", + "id": "e50204a4", + "metadata": {}, + "source": [ + "# 5. Vector Search\n", + "\n", + "The following code cells implement semantic vector search against the embeddings generated by the AutoVectorization workflow." + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "id": "9d38e3de", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install couchbase langchain-couchbase langchain-openai" + ] + },
+ { + "cell_type": "markdown", + "id": "a1854af3", + "metadata": {}, + "source": [ + "`couchbase - Version: 4.4.0` \\\n", + "`langchain-couchbase - Version: 0.4.0` \\\n", + "`langchain-openai - Version: 0.3.34`\n", + "\n", + "Now execute the cells in order to run the vector similarity search.\n", + "\n", + "# Importing Required Packages" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "id": "30955126-0053-4cec-9dec-e4c05a8de7c3", + "metadata": {}, + "outputs": [], + "source": [ + "from datetime import timedelta  # Needed for the wait_until_ready timeout below\n", + "\n", + "from couchbase.cluster import Cluster\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.options import ClusterOptions\n", + "\n", + "from langchain_openai import OpenAIEmbeddings\n", + "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore" + ] + },
+ { + "cell_type": "markdown", + "id": "e5be1f01", + "metadata": {}, + "source": [ + "# Cluster Connection Setup\n", + " - Defines the secure connection string and user credentials, and creates a `Cluster` object.\n", + " - Applies the `wan_development` profile to tune timeouts for higher-latency networks.\n", + " - TLS verification can be disabled with `ClusterOptions(auth, tls_verify='none')` ONLY for quick local testing; this is not recommended in production.\n", + " - The optional cell after the connection code verifies that the workflow actually wrote embeddings into the documents." + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "id": "7e4c9e8d", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = \"CLUSTER_CONNECTION_STRING\" # Replace this with Connection String\n", + "username = \"YOUR_USERNAME\" # Replace this with your username\n", + "password = \"YOUR_PASSWORD\" # Replace this with your password\n", + "auth = PasswordAuthenticator(username, password)\n", + "\n", + "options = ClusterOptions(auth)\n", + "options.apply_profile(\"wan_development\")  # Tune timeouts for higher-latency networks\n", + "cluster = Cluster(endpoint, options)\n", + "\n", + "# Wait until the cluster is ready before issuing requests\n", + "cluster.wait_until_ready(timedelta(seconds=5))" + ] + },
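+ { + "cell_type": "markdown", + "id": "verify-embeddings-note", + "metadata": {}, + "source": [ + "# Verifying the Generated Embeddings (Optional)\n", + "Before wiring up LangChain, it can help to confirm that the workflow wrote vectors into the documents. The cell below is a minimal sketch using this tutorial's field and collection names; adjust them to your workflow's mapping. It assumes the collection has a primary (or other suitable) index so the SQL++ query can run.\n" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "id": "verify-embeddings-demo", + "metadata": {}, + "outputs": [], + "source": [ + "# Optional sanity check: count documents carrying the generated vector field\n", + "# and report the embedding dimension.\n", + "check_query = \"\"\"\n", + "SELECT COUNT(*) AS docs_with_vectors,\n", + "       MAX(ARRAY_LENGTH(h.vec_addr_descr_id)) AS dimension\n", + "FROM `travel-sample`.inventory.hotel AS h\n", + "WHERE h.vec_addr_descr_id IS NOT MISSING\n", + "\"\"\"\n", + "for row in cluster.query(check_query):\n", + "    print(row)" + ] + },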
+ { + "cell_type": "markdown", + "id": "bbeb8a4f", + "metadata": {}, + "source": [ + "# Selection of Buckets / Scope / Collection / Index / Embedder\n", + " - Sets the bucket, scope, and collection where the documents (with vector fields) live.\n", + " - Specifies the Capella Search index name created (or selected) in step 4.5.\n", + " - `embedder` instantiates the NVIDIA embedding model that will transform the user's natural language query into a vector at search time.\n", + " - `openai_api_key` is the API key created in step 3.2.\n", + " - `openai_api_base` is the Capella Model Services endpoint found in the Models section.\n", + " - An optional sanity-check cell follows the embedder definition below.\n", + "\n", + "`Note: append /v1 to the Capella AI endpoint if it is not already shown in the UI.`" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "id": "799b2efc", + "metadata": {}, + "outputs": [], + "source": [ + "bucket_name = \"travel-sample\"\n", + "scope_name = \"inventory\"\n", + "collection_name = \"hotel\"\n", + "\n", + "# Name of the search index created in step 4.5; it is also visible in the Search tab of the cluster.\n", + "# Note that indexes created by the AutoVectorization workflow follow the naming convention\n", + "# hybrid_workflowname_index_fieldname, where fieldname is the name of the field being indexed.\n", + "index_name = \"hybrid_autovec_workflow_vec_addr_descr_id\"\n", + "\n", + "# Capella Model Services expose an OpenAI-compatible API, so LangChain's OpenAIEmbeddings class works with them.\n", + "embedder = OpenAIEmbeddings(\n", + "    model=\"nvidia/llama-3.2-nv-embedqa-1b-v2\", # The model that will create the embedding of the query.\n", + "    openai_api_key=\"CAPELLA_MODEL_KEY\",\n", + "    openai_api_base=\"CAPELLA_MODEL_ENDPOINT/v1\",\n", + "    check_embedding_ctx_length=False,\n", + "    tiktoken_enabled=False,\n", + ")" + ] + },
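+ { + "cell_type": "markdown", + "id": "embedder-check-note", + "metadata": {}, + "source": [ + "# Testing the Embedder (Optional)\n", + "As a quick check that the API key and endpoint are correct, embed a short test string directly. This is a sketch with an arbitrary query text; the printed dimension should match the dimension of the vectors generated by the workflow, since both use the same model.\n" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "id": "embedder-check-demo", + "metadata": {}, + "outputs": [], + "source": [ + "# Quick connectivity check against the Capella model endpoint.\n", + "test_vector = embedder.embed_query(\"test query\")\n", + "print(f\"Embedding dimension: {len(test_vector)}\")" + ] + },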
+ { + "cell_type": "markdown", + "id": "fda36710", + "metadata": {}, + "source": [ + "# VectorStore Construction\n", + " - Creates a `CouchbaseSearchVectorStore` instance that:\n", + " * Knows where to read documents (`bucket/scope/collection`).\n", + " * Knows the embedding field (the vector produced by the AutoVectorization workflow).\n", + " * Uses the provided embedder to embed queries on demand.\n", + " - If your AutoVectorization workflow produced a different vector field name, update `embedding_key` accordingly.\n", + " - If you mapped multiple fields into a single vector, you can choose any representative field for `text_key`, or modify the VectorStore wrapper to concatenate fields." + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "id": "50b85f78", + "metadata": {}, + "outputs": [], + "source": [ + "vector_store = CouchbaseSearchVectorStore(\n", + "    cluster=cluster,\n", + "    bucket_name=bucket_name,\n", + "    scope_name=scope_name,\n", + "    collection_name=collection_name,\n", + "    embedding=embedder,\n", + "    index_name=index_name,\n", + "    text_key=\"address\", # Your document's text field\n", + "    embedding_key=\"vec_addr_descr_id\" # The field in which your vector (embedding) is stored in the cluster.\n", + ")" + ] + },
+ { + "cell_type": "markdown", + "id": "be207963", + "metadata": {}, + "source": [ + "# Performing a Similarity Search\n", + " - Defines a natural language query (e.g., \"Woodhead Road\").\n", + " - Calls `similarity_search(k=3)` to retrieve the top 3 most semantically similar documents.\n", + " - Prints ranked results, extracting a `title` (if present) and the chosen `text_key` (here `address`).\n", + " - Change `query` to any descriptive phrase (e.g., \"beach resort\", \"airport hotel near NYC\").\n", + " - Adjust `k` for more or fewer results." + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "id": "177fd6d5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1. Glossop — Address: Woodhead Road\n", + "2. Glossop — Address: 28 Woodhead Road\n", + "3. Hadrian's Wall — Address: Greenhead, Brampton, Cumbria, CA8 7HB\n" + ] + } + ], + "source": [ + "query = \"Woodhead Road\"\n", + "results = vector_store.similarity_search(query, k=3)\n", + "\n", + "# Print out the top-k results\n", + "for rank, doc in enumerate(results, start=1):\n", + "    title = doc.metadata.get(\"title\", \"\")\n", + "    address_text = doc.page_content\n", + "    print(f\"{rank}. {title} — Address: {address_text}\")" + ] + },
+ { + "cell_type": "markdown", + "id": "f9e0d863", + "metadata": {}, + "source": [ + "## 6. Results and Interpretation\n", + "\n", + "As we can see, 3 (or `k`) ranked results are printed in the output.\n", + "\n", + "### What Each Part Means\n", + "- Leading number (1, 2, 3): The result rank (1 = most similar to your query).\n", + "- Title: Pulled from `doc.metadata.get(\"title\", \"\")`. If your documents don't contain a `title` field, the title will be empty.\n", + "- Address text: This is the value of the field you configured as `text_key` (in this tutorial: `address`). It represents the human-readable content we chose to display.\n", + "\n", + "### How the Ranking Works\n", + "1. Your natural language query (e.g., `\"Woodhead Road\"`) is embedded using the NVIDIA model (`nvidia/llama-3.2-nv-embedqa-1b-v2`).\n", + "2. The vector store compares the query embedding to stored document embeddings in the field you configured (`embedding_key = \"vec_addr_descr_id\"`).\n", + "3. Results are sorted by vector similarity. Higher similarity = closer semantic meaning.\n", + "\n", + "\n", + "> Your vector search pipeline is working if the returned documents feel meaningfully related to your natural language query—even when exact keywords do not match. Feel free to experiment with increasingly descriptive queries to observe the semantic power of the embeddings. If you also want the raw similarity scores, or a LangChain retriever for RAG pipelines, see the optional sketch below." + ] + },
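+ { + "cell_type": "markdown", + "id": "going-further-note", + "metadata": {}, + "source": [ + "## 7. Going Further (Optional)\n", + "\n", + "Two common next steps, sketched below under the same tutorial setup: `similarity_search_with_score` returns each document together with its similarity score, and `as_retriever` wraps the vector store as a standard LangChain retriever for use in RAG chains. Both are standard LangChain vector store methods; the query strings shown are arbitrary examples.\n" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "id": "going-further-demo", + "metadata": {}, + "outputs": [], + "source": [ + "# Retrieve documents together with their similarity scores.\n", + "for doc, score in vector_store.similarity_search_with_score(\"Woodhead Road\", k=3):\n", + "    print(f\"score={score:.4f}  address={doc.page_content}\")\n", + "\n", + "# Wrap the vector store as a LangChain retriever for use in RAG pipelines.\n", + "retriever = vector_store.as_retriever(search_kwargs={\"k\": 3})\n", + "docs = retriever.invoke(\"beach resort\")\n", + "print(f\"Retriever returned {len(docs)} documents\")" + ] + },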
+ { + "cell_type": "markdown", + "id": "54b9ee43", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "autovec", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/autovec-tutorial/img/Access_control.png b/autovec-tutorial/img/Access_control.png new file mode 100644 index 00000000..149dcd0b Binary files /dev/null and b/autovec-tutorial/img/Access_control.png differ diff --git a/autovec-tutorial/img/Create_auto_vec.png b/autovec-tutorial/img/Create_auto_vec.png new file mode 100644 index 00000000..61baeae7 Binary files /dev/null and b/autovec-tutorial/img/Create_auto_vec.png differ diff --git a/autovec-tutorial/img/Select_embedding_model.png b/autovec-tutorial/img/Select_embedding_model.png new file mode 100644 index 00000000..a4236f2d Binary files /dev/null and b/autovec-tutorial/img/Select_embedding_model.png differ diff --git a/autovec-tutorial/img/cluster_cloud_config.png b/autovec-tutorial/img/cluster_cloud_config.png new file mode 100644 index 00000000..c478a833 Binary files /dev/null and b/autovec-tutorial/img/cluster_cloud_config.png differ diff --git a/autovec-tutorial/img/cluster_no_nodes.png b/autovec-tutorial/img/cluster_no_nodes.png new file mode 100644 index 00000000..8a09de47 Binary files /dev/null and b/autovec-tutorial/img/cluster_no_nodes.png differ diff --git a/autovec-tutorial/img/create_cluster.png b/autovec-tutorial/img/create_cluster.png new file mode 100644 index 00000000..8af4219b Binary files /dev/null and b/autovec-tutorial/img/create_cluster.png differ diff --git a/autovec-tutorial/img/deploying_model.png b/autovec-tutorial/img/deploying_model.png new file mode 100644 index 00000000..5b830341 Binary files /dev/null and b/autovec-tutorial/img/deploying_model.png differ diff --git a/autovec-tutorial/img/download_api_key_details.png b/autovec-tutorial/img/download_api_key_details.png new file mode 100644 index 00000000..8ee7dc82 Binary files /dev/null and b/autovec-tutorial/img/download_api_key_details.png differ diff --git a/autovec-tutorial/img/import_sd.png b/autovec-tutorial/img/import_sd.png new file mode 100644 index 00000000..e6d1a664 Binary files /dev/null and b/autovec-tutorial/img/import_sd.png differ diff --git a/autovec-tutorial/img/imported_data_hotel.png b/autovec-tutorial/img/imported_data_hotel.png new file mode 100644 index 00000000..1aeb7f80 Binary files /dev/null and b/autovec-tutorial/img/imported_data_hotel.png differ diff --git a/autovec-tutorial/img/importing_model.png b/autovec-tutorial/img/importing_model.png new file mode 100644 index 00000000..41e80e92 Binary files /dev/null and b/autovec-tutorial/img/importing_model.png differ diff --git a/autovec-tutorial/img/login.png b/autovec-tutorial/img/login.png new file mode 100644 index 00000000..30e8b1e2 Binary files /dev/null and b/autovec-tutorial/img/login.png differ diff --git a/autovec-tutorial/img/login_.png b/autovec-tutorial/img/login_.png new file mode 100644 index 00000000..e1711271 Binary files /dev/null and b/autovec-tutorial/img/login_.png differ diff --git a/autovec-tutorial/img/model_api_key_form.png b/autovec-tutorial/img/model_api_key_form.png new file mode 100644 index 00000000..0713a53c Binary files /dev/null and
b/autovec-tutorial/img/model_api_key_form.png differ diff --git a/autovec-tutorial/img/model_setup_access.png b/autovec-tutorial/img/model_setup_access.png new file mode 100644 index 00000000..91dfae79 Binary files /dev/null and b/autovec-tutorial/img/model_setup_access.png differ diff --git a/autovec-tutorial/img/node_select_cluster_opt.png b/autovec-tutorial/img/node_select_cluster_opt.png new file mode 100644 index 00000000..a15a0f77 Binary files /dev/null and b/autovec-tutorial/img/node_select_cluster_opt.png differ diff --git a/autovec-tutorial/img/password_cluster.png b/autovec-tutorial/img/password_cluster.png new file mode 100644 index 00000000..85ad736d Binary files /dev/null and b/autovec-tutorial/img/password_cluster.png differ diff --git a/autovec-tutorial/img/select_cluster.png b/autovec-tutorial/img/select_cluster.png new file mode 100644 index 00000000..381439fe Binary files /dev/null and b/autovec-tutorial/img/select_cluster.png differ diff --git a/autovec-tutorial/img/setup_access.png b/autovec-tutorial/img/setup_access.png new file mode 100644 index 00000000..08bf9643 Binary files /dev/null and b/autovec-tutorial/img/setup_access.png differ diff --git a/autovec-tutorial/img/start_workflow.png b/autovec-tutorial/img/start_workflow.png new file mode 100644 index 00000000..23ce813a Binary files /dev/null and b/autovec-tutorial/img/start_workflow.png differ diff --git a/autovec-tutorial/img/vector_all_field_mapping.png b/autovec-tutorial/img/vector_all_field_mapping.png new file mode 100644 index 00000000..8800ac88 Binary files /dev/null and b/autovec-tutorial/img/vector_all_field_mapping.png differ diff --git a/autovec-tutorial/img/vector_custom_field_mapping.png b/autovec-tutorial/img/vector_custom_field_mapping.png new file mode 100644 index 00000000..519c4756 Binary files /dev/null and b/autovec-tutorial/img/vector_custom_field_mapping.png differ diff --git a/autovec-tutorial/img/vector_data_source.png b/autovec-tutorial/img/vector_data_source.png new file mode 100644 index 00000000..f9db7e46 Binary files /dev/null and b/autovec-tutorial/img/vector_data_source.png differ diff --git a/autovec-tutorial/img/vector_field_mapping.png b/autovec-tutorial/img/vector_field_mapping.png new file mode 100644 index 00000000..dfdeacf3 Binary files /dev/null and b/autovec-tutorial/img/vector_field_mapping.png differ diff --git a/autovec-tutorial/img/vector_index.png b/autovec-tutorial/img/vector_index.png new file mode 100644 index 00000000..b52dd9ab Binary files /dev/null and b/autovec-tutorial/img/vector_index.png differ diff --git a/autovec-tutorial/img/vector_index_page.png b/autovec-tutorial/img/vector_index_page.png new file mode 100644 index 00000000..3fa8da93 Binary files /dev/null and b/autovec-tutorial/img/vector_index_page.png differ diff --git a/autovec-tutorial/img/workflow.png b/autovec-tutorial/img/workflow.png new file mode 100644 index 00000000..fcf8a0c6 Binary files /dev/null and b/autovec-tutorial/img/workflow.png differ diff --git a/autovec-tutorial/img/workflow_deployed.png b/autovec-tutorial/img/workflow_deployed.png new file mode 100644 index 00000000..224dcfa1 Binary files /dev/null and b/autovec-tutorial/img/workflow_deployed.png differ diff --git a/autovec-tutorial/img/workflow_summary.png b/autovec-tutorial/img/workflow_summary.png new file mode 100644 index 00000000..f3810c13 Binary files /dev/null and b/autovec-tutorial/img/workflow_summary.png differ