Data set
Of course, the first thing I needed was a data set of song lyrics. Luckily, I found one on Kaggle! This dataset is licensed under a Creative Commons license (CC0: public domain).
This data set contains approximately 60K song lyrics along with the title and artist name. 60K may not cover all the songs you like, but I think it's a good starting point for LyRec.
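import pandas as pd

# Load the Kaggle lyrics data set; root_dir is the folder that holds the CSV
songs_df = pd.read_csv(f"{root_dir}/spotify_millsongdata.csv")
songs_df = songs_df.drop(columns=["link"])
songs_df["song_id"] = songs_df.index + 1  # 1-based ID for each song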
I did not need to do any pre-processing of this data. I just removed the link column and added an ID for each song.
Models
I needed to select two LLMs: one to compute the embeddings and another to generate the song summaries. Choosing the right LLM for your task can be a bit tricky due to the sheer number of them! It's a good idea to check a leaderboard to find the current best. For the embedding model, I reviewed the MTEB leaderboard hosted on Hugging Face.
I was looking for a smaller model (obviously!) without compromising too much on accuracy; therefore, I chose gte-Qwen2-1.5B-instruct.
import torch
from sentence_transformers import SentenceTransformer

# Load the embedding model in half precision to reduce memory use
model = SentenceTransformer(
    "Alibaba-NLP/gte-Qwen2-1.5B-instruct",
    model_kwargs={"torch_dtype": torch.float16}
)
For the summarizer, I only needed a reasonably small instruction-following LLM, so I chose Gemma-2-2b-it. In my experience, it is one of the best small models so far.
import torch
from transformers import pipeline

# Text-generation pipeline for the summarizer (expects a CUDA GPU)
pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)
Precalculate embeddings
Computing the lyrics embeddings was pretty straightforward. I just used the .encode(…) method with a batch_size of 32 for faster processing.
import numpy as np

song_lyrics = songs_df["text"].values

# Encode all lyrics in batches of 32
lyrics_embeddings = model.encode(
    song_lyrics,
    batch_size=32,
    show_progress_bar=True
)
np.save(f"{root_dir}/60k_song_lyrics_embeddings.npy", lyrics_embeddings)
At this point, I stored these embeddings in a .npy file. I could have used a more structured format, but this worked for me.
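If a slightly more structured format is preferred, one option is NumPy's .npz container, which can keep the song IDs next to the vectors in a single file. The sketch below is only an illustration and not what LyRec does: the file name, the array keys, and the tiny random stand-in embeddings are all assumptions.

```python
import os
import tempfile

import numpy as np

# Stand-in data: 5 fake "songs" with 8-dimensional embeddings (assumption for the demo)
rng = np.random.default_rng(0)
demo_embeddings = rng.standard_normal((5, 8)).astype(np.float32)
demo_song_ids = np.arange(1, 6)  # 1-based IDs, mirroring song_id above

# Save IDs and vectors together so they cannot drift out of alignment
demo_path = os.path.join(tempfile.mkdtemp(), "demo_lyrics_embeddings.npz")
np.savez(demo_path, song_ids=demo_song_ids, embeddings=demo_embeddings)

# Load them back and check the round trip
with np.load(demo_path) as data:
    loaded_ids = data["song_ids"]
    loaded_embeddings = data["embeddings"]

print(loaded_ids.shape, loaded_embeddings.shape)
```

A plain .npy file is still perfectly fine as long as the row order of the array always matches the row order of the DataFrame.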
As for the summary embeddings, I first needed to generate the summaries. I had to make sure each summary captured the emotion and theme of the song without being too long. So, I came up with the following prompt for Gemma-2.
You are an expert song summarizer. \
You will be given the full lyrics to a song. \
Your task is to write a concise, cohesive summary that \
captures the central emotion, overarching theme, and \
narrative arc of the song in 150 words.

{song_lyrics}
Here is the code snippet for summary generation. For simplicity, sequential processing is shown below. I have included the batch version in the GitHub repository.
from tqdm import tqdm

tqdm.pandas()  # enables .progress_apply on pandas objects

def get_summary(song_lyrics):
    messages = [
        {"role": "user",
         "content": f'''You are an expert song summarizer. \
You will be given the full lyrics to a song. \
Your task is to write a concise, cohesive summary that \
captures the central emotion, overarching theme, and \
narrative arc of the song in 150 words.\n\n{song_lyrics}'''},
    ]
    outputs = pipe(messages, max_new_tokens=256)
    # The last message in the returned conversation is the model's reply
    assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
    return assistant_response

songs_df["summary"] = songs_df["text"].progress_apply(get_summary)
As expected, this step took the longest. Luckily, it only has to be done once, plus whenever we want to update the database with new songs.
Then, I computed and stored the embeddings like last time.
song_summary = songs_df["summary"].values

# Encode all summaries in batches of 32
summary_embeddings = model.encode(
    song_summary,
    batch_size=32,
    show_progress_bar=True
)
np.save(f"{root_dir}/60k_song_summary_embeddings.npy", summary_embeddings)
Vector Search
Once the embeddings were ready, it was time to implement semantic search based on embedding similarity. There are many amazing open-source vector databases available for this job. I decided to use a simple one called FAISS (Facebook AI Similarity Search). It only takes two lines to add the embeddings to the database. First, we create a FAISS index. Here, we must specify the similarity metric we want to use for the search and the dimension of the vectors. I used the dot product (inner product) as the similarity measure. Next, we add the embeddings to the index.
Note: Our database is small enough to perform an exhaustive search using the dot product. For larger databases, it is recommended to perform an approximate nearest neighbor (ANN) search. FAISS has support for that.
import faiss

# Flat (exhaustive) inner-product index over the lyrics embeddings
lyrics_embeddings = np.load(f"{root_dir}/60k_song_lyrics_embeddings.npy")
lyrics_index = faiss.IndexFlatIP(lyrics_embeddings.shape[1])
lyrics_index.add(lyrics_embeddings.astype(np.float32))

# Same for the summary embeddings
summary_embeddings = np.load(f"{root_dir}/60k_song_summary_embeddings.npy")
summary_index = faiss.IndexFlatIP(summary_embeddings.shape[1])
summary_index.add(summary_embeddings.astype(np.float32))
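To make the exhaustive search concrete, here is a minimal NumPy sketch of what a flat inner-product index computes conceptually: the dot product of the query with every stored vector, followed by a top-k selection. The function name brute_force_ip_search and the toy vectors are made up for illustration; FAISS does the same job far more efficiently.

```python
import numpy as np

def brute_force_ip_search(embeddings, query, k):
    """Return (scores, ids) of the k rows with the largest inner product."""
    scores = embeddings @ query          # one dot product per stored vector
    top_ids = np.argsort(-scores)[:k]    # row indices sorted by descending score
    return scores[top_ids], top_ids

# Toy database of four 3-dimensional vectors (illustrative values only)
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
], dtype=np.float32)
toy_query = np.array([1.0, 0.2, 0.0], dtype=np.float32)

toy_scores, toy_ids = brute_force_ip_search(toy_embeddings, toy_query, k=2)
print(toy_ids, toy_scores)
```

The cost grows linearly with the number of stored vectors, which is why this approach is fine for about 60K songs but becomes a bottleneck for much larger collections.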
To find the most similar songs given a query, we first generate the query embedding and then call the .search(…) method on the index. Essentially, this method computes the similarity between the query and every entry in our database and returns the top k entries and the corresponding scores. Here is the code that performs a semantic search over the lyrics embeddings.
query_lyrics = 'Imagine the last song you fell in love with'
query_embedding = model.encode(f'''Instruct: Given the lyrics, \
retrieve relevant songs\nQuery: {query_lyrics}''')
query_embedding = query_embedding.reshape(1, -1).astype(np.float32)
lyrics_scores, lyrics_ids = lyrics_index.search(query_embedding, 10)
Notice that I added a simple instruction prompt to the query. This is recommended for this model. The same applies to the summary embeddings.
query_description = 'Describe the type of song you wanna listen to'
query_embedding = model.encode(f'''Instruct: Given a description, \
retrieve relevant songs\nQuery: {query_description}''')
query_embedding = query_embedding.reshape(1, -1).astype(np.float32)
summary_scores, summary_ids = summary_index.search(query_embedding, 10)
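Since the same prompt pattern appears in every query, it can help to build it in one place. The helper below is a hypothetical convenience, not part of LyRec or of any library; it simply reproduces the "Instruct: … / Query: …" pattern used in the lyrics query above. The second example query is invented for the demo.

```python
def build_instructed_query(task, query):
    """Format a query in the 'Instruct: <task>\\nQuery: <query>' pattern."""
    return f"Instruct: {task}\nQuery: {query}"

# Task descriptions taken from the snippets above
LYRICS_TASK = "Given the lyrics, retrieve relevant songs"
DESCRIPTION_TASK = "Given a description, retrieve relevant songs"

lyrics_prompt = build_instructed_query(
    LYRICS_TASK, "Imagine the last song you fell in love with"
)
description_prompt = build_instructed_query(
    DESCRIPTION_TASK, "a slow, sad song about moving away"  # made-up example query
)
print(lyrics_prompt)
```

The resulting string is what would be passed to model.encode(…); only queries get this prefix, while the stored lyrics and summaries were encoded as plain text.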
Pro Tip: How do you do a sanity check?
Simply use any database entry as the query and check that the search returns that same entry as the top-scoring result!
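A sketch of that sanity check with plain NumPy and random stand-in vectors is shown below. It assumes the embeddings are L2-normalized; with unnormalized vectors and inner-product scoring, a longer vector can outscore the entry itself, so the check may fail even when nothing is wrong.

```python
import numpy as np

rng = np.random.default_rng(42)

# Random stand-in "database" of 100 vectors, L2-normalized (assumption for the demo)
db = rng.standard_normal((100, 16)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

entry_id = 7                      # any entry in the database
self_query = db[entry_id]         # use the stored vector itself as the query

scores = db @ self_query          # inner product with every entry
best_id = int(np.argmax(scores))  # position of the top-1 result

print(best_id == entry_id, float(scores[best_id]))
```

With the real data, the equivalent check would pass one row of lyrics_embeddings to lyrics_index.search(…) and compare the first returned ID with that row's position.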
Feature Implementation
At this stage, I had the basic components of LyRec. Now it was time to put them together. Do you remember the three goals I set for myself at the beginning? This is how I implemented them.
To keep things organized, I created a class called LyRec that would have a method for each feature. The first two features are quite simple to implement.
The .get_songs_with_similar_lyrics(…) method takes a song (lyrics) and an integer k as input and returns a list of the k most similar songs based on lyrics similarity. Each item in the list is a dictionary containing the artist name, song title, and lyrics.
Similarly, .get_songs_with_similar_description(…) takes free-form text and an integer k as input and returns a list of the k most similar songs according to the description.
Here is the relevant code snippet.
class LyRec:
    def __init__(self, songs_df, lyrics_index, summary_index, embedding_model):
        self.songs_df = songs_df
        self.lyrics_index = lyrics_index
        self.summary_index = summary_index
        self.embedding_model = embedding_model

    def get_records_from_id(self, song_ids):
        songs = []
        for _id in song_ids:
            # FAISS IDs are 0-based row positions, while song_id is 1-based
            songs.extend(self.songs_df[self.songs_df["song_id"] == _id + 1].to_dict(orient='records'))
        return songs

    def get_songs_with_similar_lyrics(self, query_lyrics, k=10):
        query_embedding = self.embedding_model.encode(
            f"Instruct: Given the lyrics, retrieve relevant songs\n Query: {query_lyrics}"
        ).reshape(1, -1).astype(np.float32)
        scores, song_ids = self.lyrics_index.search(query_embedding, k)
        return self.get_records_from_id(song_ids[0])

    def get_songs_with_similar_description(self, query_description, k=10):
        query_embedding = self.embedding_model.encode(
            f"Instruct: Given a description, retrieve relevant songs\n Query: {query_description}"
        ).reshape(1, -1).astype(np.float32)
        scores, song_ids = self.summary_index.search(query_embedding, k)
        return self.get_records_from_id(song_ids[0])
The final feature was a bit trickier to implement. Remember that we must first retrieve the top songs based on the lyrics and then re-rank them based on the textual description. The first retrieval was easy. For the second, we only need to consider the top-scoring songs from the first step. I decided to create a temporary FAISS index of those songs and then search it for the songs with the highest summary similarity scores. Here is my implementation.
    def get_songs_with_similar_lyrics_and_description(self, query_lyrics, query_description, k=10):
        query_lyrics_embedding = self.embedding_model.encode(
            f"Instruct: Given the lyrics, retrieve relevant songs\n Query: {query_lyrics}"
        ).reshape(1, -1).astype(np.float32)

        # Stage 1: retrieve the top 500 candidates by lyrics similarity
        scores, song_ids = self.lyrics_index.search(query_lyrics_embedding, 500)
        top_k_indices = song_ids[0]

        # Collect the stored summary embeddings of those candidates
        summary_candidates = []
        for idx in top_k_indices:
            emb = self.summary_index.reconstruct(int(idx))
            summary_candidates.append(emb)
        summary_candidates = np.array(summary_candidates, dtype=np.float32)

        # Stage 2: temporary index that holds only the candidates
        temp_index = faiss.IndexFlatIP(summary_candidates.shape[1])
        temp_index.add(summary_candidates)

        query_description_embedding = self.embedding_model.encode(
            f"Instruct: Given a description, retrieve relevant songs\n Query: {query_description}"
        ).reshape(1, -1).astype(np.float32)
        scores, temp_ids = temp_index.search(query_description_embedding, k)

        # Map positions in the temporary index back to the original song positions
        final_song_ids = [top_k_indices[i] for i in temp_ids[0]]
        return self.get_records_from_id(final_song_ids)
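The re-ranking step can also be pictured without a temporary index: take the summary vectors of the candidates, score them against the description query, and map the winners back to their original IDs. The NumPy sketch below uses toy vectors and a hypothetical rerank_candidates helper; it illustrates the idea and is not the implementation used in LyRec.

```python
import numpy as np

def rerank_candidates(summary_embeddings, candidate_ids, description_query, k):
    """Re-rank candidate rows by inner product with the description query."""
    candidate_ids = np.asarray(candidate_ids)
    candidate_vectors = summary_embeddings[candidate_ids]  # (n_candidates, dim)
    scores = candidate_vectors @ description_query         # one score per candidate
    order = np.argsort(-scores)[:k]                        # positions within the candidate list
    return candidate_ids[order], scores[order]             # map back to original row IDs

# Toy summary embeddings for five songs (illustrative values only)
toy_summary_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [0.8, 0.6],
    [-1.0, 0.0],
], dtype=np.float32)
stage_one_ids = [4, 2, 3]  # pretend these came from the lyrics search
toy_description_query = np.array([1.0, 0.0], dtype=np.float32)

final_ids, final_scores = rerank_candidates(
    toy_summary_embeddings, stage_one_ids, toy_description_query, k=2
)
print(final_ids, final_scores)
```

Note that song 0 would score highest on the description alone, but it is never considered because it was not among the lyrics-based candidates. That is exactly the behavior the two-stage design is meant to produce.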
Voilà! Finally, LyRec is ready. You can find the full implementation in this repository. Please leave a star if you find this helpful!