Build a powerful AI-Powered Blogpost Recommendation Engine with Weaviate

13.09.2023 · Valentin Neher
Cloud Machine Learning Artificial Intelligence Database Hands-on Python

For this TechUp, I took a closer look at Weaviate, an open-source vector database, and used it hands-on to build a blogpost recommendation system for our TechHub.

What is a vector database?

In a vector database, vectors are stored alongside traditional forms of data (strings, integers, etc.). These vectors are representations of the information stored in the database that can be better understood by computers.

For example, if we have a collection of sentences that we want to store, we could create a multi-dimensional vector representing each sentence. Each dimension of this vector captures a different characteristic of the sentence, for example the frequency of words that can be assigned to a certain category such as “food”. Such a vector is called an embedding; more about this in a moment.
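
As a toy illustration (hand-picked values, not the output of a real model), such a vector might look like this:

# Hypothetical 3-dimensional embedding; the dimensions stand for "food", "technology" and "sports"
sentence = "The pasta at this restaurant is fantastic"
embedding = [0.92, 0.03, 0.01]  # clearly about food, barely about technology or sports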

But what can you do with such a vector anyway? The idea is that vectors representing, for example, sentences with related content are closer together in the vector space. These distances can also be determined very quickly with high-dimensional vectors thanks to efficient mathematical methods, which makes vector databases attractive for search functions and deep learning.

By default, Weaviate uses cosine distance here, which is derived from the angle between two vectors: the smaller the angle, the more similar their direction and the smaller the distance.
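
To make this concrete, here is a minimal numpy sketch of the idea (an illustration, not Weaviate's internal implementation):

import numpy as np

def cosine_distance(a, b):
    # Cosine similarity is the cosine of the angle between the two vectors;
    # cosine distance is 1 - similarity, so 0 means "same direction".
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - similarity

print(cosine_distance(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.2929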

TLDR

The main advantage of a vector database is that it allows fast and accurate similarity search and data retrieval, based on the vector distance or similarity between individual data elements. Instead of querying the database for exact matches or predefined criteria, as with traditional methods, a vector database can find the most similar or most relevant data based on its semantic or contextual meaning.

What is text embedding and how is it generated?

Text embeddings are vectors created by projecting a text (semantic space) onto a vector space (number space). The idea is to condense the features of a given text and map them into a vector. Such embeddings are generated using machine learning models that have been trained on large amounts of text data. There are several techniques for generating text embeddings, but one of the most common and effective methods is to use so-called “word embedding” or “sentence embedding” models. Here I explain what is behind each:

  1. Word Embeddings
    • Word embeddings are number vectors that assign a unique representation to each word in a text corpus.
    • A frequently used method for generating word embeddings is Word2Vec. It uses a neural network to learn the vector representations for words. Word2Vec uses the context of words by analysing neighbouring words in a sentence or document. This embeds similar words in a similar numerical space.
    • Another popular model is GloVe (Global Vectors for Word Representation). It uses statistical information from global text statistics to capture semantic relationships between words.
  2. Sentence Embeddings
    • Sentence embeddings are similar to word embeddings, but here complete sentences or paragraphs are converted into number vectors.
    • A commonly used approach for generating sentence embeddings is the encoder-decoder model, in particular bidirectional recurrent encoders (e.g. LSTM or GRU). These learn to encode a variable number of words into a vector containing the meaning of the sentence.
    • Transformer models, such as the well-known “BERT” (Bidirectional Encoder Representations from Transformers), are also very effective in generating sentence embeddings. These models use attention mechanisms to generate contextual embeddings (see the sketch after this list).
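
As a minimal sketch of sentence embeddings in code (assuming the sentence-transformers package and the all-MiniLM-L6-v2 model, which we will later use via Weaviate):

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load a pre-trained sentence embedding model (384-dimensional vectors)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = [
    "Kubernetes makes it easy to deploy containers.",
    "Container orchestration simplifies deployments.",
    "I had pasta for dinner yesterday.",
]
embeddings = model.encode(sentences)

# Related sentences end up closer together in vector space
print(cos_sim(embeddings[0], embeddings[1]))  # comparatively high
print(cos_sim(embeddings[0], embeddings[2]))  # comparatively low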

Simple example architecture

Figure: Simple example architecture of a vector database (Source: sanity.io)

Feeding data

When data is fed into the database, it first passes through the embedding model, which returns a vector representation of the data. The original data is then stored in the database along with the corresponding vector.

Query data

Data can be queried in a variety of ways. In addition to exact matches based on fixed input parameters, as is common in traditional databases, we can also enter arbitrary text, which is likewise turned into a vector by the embedding model and compared with the stored entries (semantic search). The distances from this query vector to nearby vectors are then determined and the closest results are returned to the application.
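
Schematically, both paths can be sketched like this (a toy in-memory stand-in for illustration; embed stands for any embedding model, and a real vector database replaces the exhaustive scan with an approximate index such as HNSW, which Weaviate uses):

import numpy as np

database = []  # each entry: (original_data, vector)

def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def insert(text, embed):
    # Feeding data: embed first, then store data and vector together
    database.append((text, embed(text)))

def query(text, embed, k=3):
    # Querying: embed the query text, then rank all entries by vector distance
    q = embed(text)
    ranked = sorted(database, key=lambda entry: cosine_distance(q, entry[1]))
    return [data for data, _ in ranked[:k]]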

How can Weaviate be used?

Hands-On Blogpost Recommendation with Weaviate

To get to know Weaviate better, I had just started its Getting Started guide, but quickly realised the potential to use it for something that had been on my radar for a while: TechUp recommendations, not just based on the tags of the current TechUp, but intelligent and playful ones, ideally based on the whole text.

Since the Hugging Face embedding model used here only accepts a maximum of 256 word pieces and was trained on single sentences, we do not provide the whole text for creating the vectors, but only title, tags, url and description.
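
As a quick sanity check (a hypothetical helper; d stands for one of the blog post dictionaries we will fetch below), one could verify that the combined fields stay well below that limit:

def embedding_input(d):
    # Combine exactly the fields we hand to the vectorizer
    return " ".join([d["title"], " ".join(d["tags"]), d["url"], d["description"]])

# e.g. len(embedding_input(d).split()) should stay well below 256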

Setup

For the local setup I proceeded as in Weaviate's Quickstart. In this tutorial I use Python 3.11 in a Jupyter Notebook. First we need to install the Weaviate client.

pip install weaviate-client

Now we can create a free sandbox instance of Weaviate, get its API key and URL, and connect to the instance. We also import the requests library for HTTP requests, which we will need later.

import weaviate
import json
import requests

client = weaviate.Client(
    url = "https://some-endpoint.weaviate.network", # Replace with your endpoint
    auth_client_secret=weaviate.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY"), # Replace w/ your Weaviate instance API key
    additional_headers = {
        "X-HuggingFace-Api-Key": "YOUR-HUGGINGFACE-API-KEY" # Replace with your Hugging Face API key (needed for the text2vec-huggingface vectorizer)
    }
)

Import data

Now we create a class called Techup that will host our data.

class_obj = {
    "class": "Techup",
    "vectorizer": "text2vec-huggingface",  # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" also.
    "moduleConfig": {
        "text2vec-huggingface": {
            "model": "sentence-transformers/all-MiniLM-L6-v2",  # Can be any public or private Hugging Face model.
            "options": {
                "waitForModel": True
            }
        }
    }
}

client.schema.create_class(class_obj)

In this case, we use this model from HuggingFace to create the embeddings: sentence-transformers/all-MiniLM-L6-v2.

Next, let’s get some data that we want to use. In this case, the goal is to recommend other blogposts based on a given blogpost.

url = 'http://172.31.40.188:8983/solr/b-nova-techhub/select?q=lang:"de"&rows=1000'
resp = requests.get(url)
response = json.loads(resp.content)
data = response['response']["docs"]

If we look at one of these blogposts with print(data[2]), we get, for example:

{
	'article:publishedTime': '2020-09-02',
	'b-nova:categories': ['Mobile:purple'],
	'b-nova:slug': 'angular-b-nova-to-do-list-tutorial',
	'b-nova:user': 'ttrapp',
	'description': "Together, let's learn about Angular and implement a useful b-nova To Do List",
	'keywords': ['b-nova', 'blog', 'techup', 'techhub', 'mobile', 'angular', 'typescript'],
	'lang': 'en',
	'tags': ['Angular', 'TypeScript'],
	'title': 'Angular b-nova To Do List Tutorial',
	'url': 'https://b-nova.com/home/content/angular-b-nova-to-do-list-tutorial',
	'article': 'actual blog post content',
	'_version_': 1749033919800410112
}

So we now have our data stored in data. For each TechUp, we want to import title, tags, url and description into the Techup class of our Weaviate instance. To do this, we run the following code in our Jupyter Notebook:

with client.batch(
    batch_size=100
) as batch:
    # Batch import all blog posts
    for i, d in enumerate(data):
        print(f"importing blogpost: {i+1}")

        properties = {
            "title": d["title"],
            "tags": d["tags"],
            "url": d["url"],
            "description": d["description"]
        }

        batch.add_data_object(
            properties,
            "Techup",
        )

Ideally, we now see the import run through successfully. Next, let's run a database query against Weaviate.

Semantic search allows us to search Weaviate for entries that are related to the given input text, but do not necessarily contain it word for word.

With .get() we specify the Techup class and that we only want the url of each blog entry. With .with_near_text() we define our search query "Serverless", for which we want the best-matching blog posts (this can also be a sentence or more instead of a single word). As already seen, this search query is first turned into an embedding (vector), and the vector distance between it and all other vectors is then calculated. .with_limit() sets the number of results we want back, and with .with_additional() we request the ID of each entry as well as its vector distance to the query "Serverless". The whole thing then looks like this:

response = (
    client.query
    .get("Techup", ["url"])
    .with_near_text({
        "``concepts: ["serverless"]
    })
    .with_limit(4)
    .with_additional(["distance", "id"])
    .do()
)

And we get this response back when we run print(json.dumps(response, indent=2)).

{
  "data": {
    "Get": {
      "Techup": [
        {
          "_additional": {
            "distance": 0.3552966,
            "id": "f52524dd-aa02-4830-a352-7b3b29298051"
          },
          "url": "https://b-nova.com/home/content/serverless-faas-payg-what-is-that-actually"
        },
        {
          "_additional": {
            "distance": 0.39143014,
            "id": "58392291-0a74-4c39-b8b3-a7e8284ba3c9"
          },
          "url": "https://b-nova.com/home/content/a-journey-into-the-asynchronous-cloud-world-with-serverless-patterns"
        },
        {
          "_additional": {
            "distance": 0.4448306,
            "id": "6c314dd4-696b-4dd4-a635-30e63924f27f"
          },
          "url": "https://b-nova.com/home/content/serverless-development-and-deployment-cdk"
        },
        {
          "_additional": {
            "distance": 0.46283734,
            "id": "b75e443f-c706-4d22-9ced-437022593574"
          },
          "url": "https://b-nova.com/home/content/serverless-on-kubernetes-with-knative"
        }
      ]
    }
  }
}

Blog recommendations

Now I want to get five matching blog posts based on a given blog post URL. To do this, I have defined the following two methods:

def getRecommendations(url, n):
    id = getID(url)

    response = (
        client.query
        .get("Techup", ["url"])
        .with_near_object({
            "id": id
        })
        .with_limit(n + 1)
        .with_additional(["distance"])
        .do()
    )

    closeArticles = response["data"]['Get']['Techup']

    # Extract URLs and drop the first entry (the queried article itself)
    urls = [entry['url'] for entry in closeArticles]
    urls = urls[1:]

    print(*urls, sep='\n')

def getID(url):
    where_filter = {
        "path": ["url"],
        "operator": "Equal",
        "valueText": url,
    }

    response = (
        client.query
        .get("Techup", ["title", "url"])
        .with_limit(10)
        .with_additional(["id"])
        .with_where(where_filter)
        .do()
    )
    id = response["data"]['Get']['Techup'][0]['_additional']['id']
    return id

To get the object ID of a blog post, I defined a filter in getID() that looks for exactly the entry in Weaviate whose url property matches the given URL. This is a small workaround, as it would also be possible to assign an ID to each object when sending it to Weaviate; we could then address the object directly, without the code in getID().
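
As a sketch of that alternative: the Weaviate Python client provides a generate_uuid5() helper that derives a deterministic UUID from any identifier, so the (unique) URL could serve as the ID during the batch import shown earlier:

from weaviate.util import generate_uuid5

# During the batch import: derive a stable, reproducible UUID from the URL
batch.add_data_object(
    properties,
    "Techup",
    uuid=generate_uuid5(d["url"]),
)

# Later, getID(url) collapses to a pure function call, no query needed:
# id = generate_uuid5(url)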

Now we can call getRecommendations() with the URL of one of our blogposts and the number of results we want to get back:

getRecommendations("https://b-nova.com/home/content/functional-programming-and-actor-model-with-elixir-and-the-beam", 5)
https://b-nova.com/home/content/alchemy-elixir-and-scalable-distributed-systems
https://b-nova.com/home/content/phoenix-framework-the-killer-app-from-elixir
https://b-nova.com/home/content/practically-on-the-go-with-kotlin
https://b-nova.com/home/content/ambassador-developer-and-devops-experience
https://b-nova.com/home/content/how-you-can-introduce-a-we-celebrate-failure-culture-with-chaos-engineering-into-your-daily-business

It works! 🤩

Visualisation of the data

The high-dimensional vectors assigned to each object in our database can also be projected onto the two-dimensional plane, among other things. This allows us to visualise the relationship between our database entries.

We get the data we need for this from our Weaviate instance by passing in the featureProjection clause together with the dimensions setting:

additional_clause = {
  "featureProjection": [
    "vector"
  ]
}

additional_setting = {
  "dimensions": 2
}

query_result = (
  client.query
  .get("Techup", "title")
  .with_additional(
    (additional_clause, additional_setting)
  )
  .do()
)
print(query_result)

Now, for each blog post, the JSON response contains a 2D vector, for example: {'_additional': {'featureProjection': {'vector': [81.65559, -31.689371]}}, 'title': 'Increase your productivity with Alfred'}.

To visualise this, I had ChatGPT 3.5 generate some code that renders an interactive scatter plot:

import pandas as pd
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool

# Enable inline plotting for Jupyter Notebook
output_notebook()

data = query_result

# Extracting x and y values along with the titles
x_values = [item['_additional']['featureProjection']['vector'][0] for item in data['data']['Get']['Techup']]
y_values = [item['_additional']['featureProjection']['vector'][1] for item in data['data']['Get']['Techup']]
titles = [item['title'] for item in data['data']['Get']['Techup']]

# Create a pandas DataFrame
df = pd.DataFrame({'x': x_values, 'y': y_values, 'title': titles})

# Create a Bokeh figure
p = figure(width=800, height=500, title='Vector Scatter Plot',
           tools='pan,box_zoom,reset,save,hover')

# Add scatter plot and hover tooltips
scatter = p.scatter(x='x', y='y', source=df, size=10, color='blue', legend_label='Data Points')
hover = HoverTool(tooltips=[('Title', '@title')], renderers=[scatter])
p.add_tools(hover)

# Customise the plot
p.xaxis.axis_label = 'X Values'
p.yaxis.axis_label = 'Y Values'
p.grid.visible = True

# Show the interactive plot
show(p)

Conclusion

All in all, I was pleasantly surprised by Weaviate. The focus on developer experience is noticeable, because even while I was working through Getting Started, it made me want to do more with it right away. Taking advantage of vector databases has become extremely easy thanks to providers like Weaviate - there are of course other players on the market that you should also take a closer look at before making a decision, but Weaviate is already very attractive due to the fact that it is open source.

Now I have fed the title of this blog post to our specially built Recommendation Engine and recommend the following TechUps for further reading (we only have one other TechUp on databases, so there is no great correlation here):

Materialize

SvelteKit

GitHub Copilot

Try it out and stay tuned! 🚀

This TechUp has been automatically translated by our Markdown Translator. 🔥

Valentin Neher

Valentin Neher - content artist, tech subscriber, keyboard enthusiast. Valentin is our bright cheerful computer science student who, as a representative of Gen-Z, knows how to present our collective expertise in the right channels. He knows how the land lies and shows no hesitation when it comes to repositioning b-nova in terms of social media.