Comparing text using embeddings
In this tutorial, you’ll learn how to create a Python application that compares two input texts using semantic embeddings and Levenshtein similarity. The application communicates with the API server to generate embeddings and calculates similarity metrics programmatically.
Embeddings are numerical representations of the meaning of a string of text. Texts with similar meanings produce numerically similar embeddings, and the closer two embeddings are, the more closely related the topics, even if the words themselves differ. For example, the words “vision” and “sight” might have embeddings that are numerically close due to their similar meanings.
Prerequisites
- Before you begin, ensure that you have the conda package manager installed on your machine. You can install `conda` using either Anaconda Distribution or Miniconda.
- You must have a `sentence-similarity` type model downloaded onto your local machine.
Setting up your environment
When working on a new conda project, it is recommended that you create a new environment for development. Follow these steps to set up an environment for your embedding application:
1. Open Anaconda Prompt (Terminal on macOS/Linux).
   This terminal can be opened from within an IDE (JupyterLab, PyCharm, VSCode, Spyder), if preferred.
2. Create the conda environment for developing your embedding application and install the packages you’ll need by running the following command:
3. Activate your newly created conda environment by running the following command:
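The two commands above might look like the following (a sketch; the environment name `content-compare` matches the one used later in this tutorial, while the package list and channel are assumptions based on the libraries this application uses):

```shell
# Create an environment with the packages the comparator needs
# (requests, numpy, and python-levenshtein are assumptions based on this tutorial's code)
conda create --name content-compare -c conda-forge requests numpy python-levenshtein

# Activate the environment
conda activate content-compare
```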
For more information and best practices for managing environments, see Environments.
Building the text comparator
Below, you’ll find the necessary code snippets to build your text comparator, with explanations for each step to help you understand how the application works. The text comparator combines two methods for comparing text: semantic similarity using embeddings, and structural similarity using Levenshtein distance.
Semantic similarity tells us how close the meanings of two texts are, while Levenshtein distance looks at how similar the actual characters are by counting the edits needed to turn one string into the other. Together, these methods help us understand how similar two text strings are—whether they look alike, mean the same thing, or both.
Using your preferred IDE, create a new file and name it `similarian.py`.
Importing libraries
The application we are building requires libraries to handle HTTP requests, numerical operations, and string similarity calculations.
Add the following lines of code to the top of your `similarian.py` file:
Setting the base_url
For the application to programmatically process text inputs, run server health checks, generate embeddings, and perform other actions, it must be structured to interact with the API server and its endpoints.
The URLs for these API endpoints are constructed by combining a `base_url` with a specific `/endpoint` for each function. The `base_url` is formed from the Server Address and Server Port specified in Anaconda AI Navigator, like this: `http://<SERVER_ADDRESS>:<SERVER_PORT>`.
Set the `base_url` to point to the default server address by adding the following line to your file.
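For example, with the defaults shown in Anaconda AI Navigator (the port `8080` here is an assumption; use the Server Port your instance displays):

```python
# Default server address and port; adjust the port to match Anaconda AI Navigator
base_url = "http://localhost:8080"
```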
`localhost` and `127.0.0.1` are semantically identical.
Adding the API calls
AI Navigator utilizes llama.cpp’s specifications for interacting with the API server’s `/embedding` endpoint. The API server is also compatible with OpenAI’s `/embeddings` API specifications.
To enable your application to communicate with the API server, you must implement functions that make API calls in a way that the server can understand.
GET /health
Before sending any requests to the server, it’s a good idea to verify that the server is operational. This function sends a GET request to the `/health` endpoint and returns a JSON response that tells you the server’s status.
Add the following lines to your `similarian.py` file:
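A minimal sketch of the health check, assuming the `requests` library and the `base_url` defined earlier (the function name `check_server_health` is illustrative):

```python
import requests

def check_server_health(base_url):
    """Return the server's status JSON from the GET /health endpoint."""
    response = requests.get(f"{base_url}/health")
    response.raise_for_status()  # surface HTTP errors early
    return response.json()
```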
POST /embedding
To interact with a `sentence-similarity` model, you must have a function that hits the server’s `/embedding` endpoint. This function processes input text and returns its vector representation (embedding).
Add the following lines to your `similarian.py` file:
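A sketch of the embedding call, assuming llama.cpp’s request shape: a JSON body with a `content` field and a response containing an `embedding` array. Both field names are assumptions about your server version; check its logs or documentation if they differ.

```python
import requests

def get_embedding(text, base_url):
    """Return the embedding vector for `text` from the POST /embedding endpoint."""
    response = requests.post(f"{base_url}/embedding", json={"content": text})
    response.raise_for_status()
    return response.json()["embedding"]
```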
Constructing the functions
Now that we have added the API calls to communicate with the API server, we’ll need to construct the core functionality of our application: comparing two strings of text. This involves measuring their semantic (meaning-based) and structural (character-based) similarities.
compare_texts
This function takes the two text inputs from the `main` function and calculates the semantic and structural similarity scores.
Add the following lines to your `similarian.py` file:
main
The `main` function ties the rest of the functions together and handles user input. It takes two inputs from the user and displays the results from the similarity calculations.
Add the following lines to your `similarian.py` file:
Interacting with the API server
With your text comparator constructed, it’s time to compare some text!
1. Open Anaconda AI Navigator and load a model into the API server.
   This must be a `sentence-similarity` type model!
2. Leave the Server Address and Server Port at the default values and click Start.
3. Open a terminal and navigate to the directory where you stored your `similarian.py` file. Make sure you are still in your `content-compare` conda environment.
4. Initiate the text comparator by running the following command:
   You’ll need to run this command every time you want to run the text comparator.
5. Enter a string of text and press Enter (Windows)/Return (Mac).
6. Enter a string of text that you want to compare to the previous string and press Enter (Windows)/Return (Mac) again.
7. View the Anaconda AI Navigator API server logs. If everything is set up correctly, the server logs will populate with traffic from the application, starting with a health check.
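Initiating the comparator (the command referenced above) simply runs the script with Python:

```shell
python similarian.py
```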
Here is an example of interacting with the text comparator, assuming you’ve previously navigated to the directory containing your `similarian.py` file:
Comparing sentences
Here are some examples that you can use to get a better understanding and feel for how text comparisons work:
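As an illustration of what to look for, sentence pairs can score high on one measure and low on the other. The standalone sketch below (a pure-Python edit distance, independent of the server; the sentence pairs are hypothetical) shows the structural side; semantic scores depend on the model you load:

```python
def levenshtein_distance(a, b):
    # Classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

pairs = [
    ("The cat sat on the mat.", "A cat was sitting on the mat."),  # similar meaning and wording
    ("I love programming.", "Coding is my passion."),              # similar meaning, different wording
]
for a, b in pairs:
    ratio = 1 - levenshtein_distance(a, b) / max(len(a), len(b))
    print(f"structural similarity {ratio:.2f}: {a!r} vs {b!r}")
```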
Next steps
You can continue to develop and extend this text comparator to tackle more advanced use cases, such as implementing a database to store embeddings for efficient comparisons at scale, allowing you to build tools like duplicate content detectors, recommendation systems, or document clustering applications.
Or, if you’re finished with this project, you can delete the file and clean up your conda environment by running the following commands:
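The cleanup might look like the following (a sketch; `content-compare` is the environment name used earlier in this tutorial):

```shell
# Deactivate and delete the conda environment
conda deactivate
conda remove --name content-compare --all

# Delete the application file (use `del similarian.py` in Windows Command Prompt)
rm similarian.py
```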