Who this is for:
- Anyone interested in DIY recommender systems
- Engineers interested in basic PyTorch ranking models
- Coffee nerds

Who this is not for:
- Anyone who wants to copy and paste code into their production system
- Anyone who wanted a TensorFlow model
Imagine you are sitting on your couch with friends or family. You have your favorite game console, streaming service, or music app open, and each item is a shining gem of possibility, tailored to you. But those personalized results are for the solo version of you; they don't reflect who you are when surrounded by this particular mix of others.
This project really started with coffee. I love roasting my own green coffee from Sweet Maria's (no affiliation), since it offers a wide variety of delicious possibilities. Colombian? Java? Kenyan Peaberry? Each description is more enticing than the last. It's hard to choose even for myself as an individual. What happens when you buy green coffee for your family or guests?
I wanted to create a learning-to-rank (LTR) model that could potentially solve this coffee conundrum. For this project, I started by building a simple TensorFlow Ranking project to predict user-pair rankings of different coffees. I had some experience with TFR, so it seemed like a natural fit.
However, I realized I had never built a ranking model from scratch before. So I set out to build a bare-bones PyTorch ranking model to see if I could put one together and learn something in the process. This is obviously not intended for a production system, and I took plenty of shortcuts along the way, but it has been a great pedagogical experience.
Our overall objective is the following:
- Develop a ranking model that learns the preferences of a pair of users
- Apply it to predict the ranked order of a list of k items
What signal might exist in the combination of user and item characteristics that would produce a good set of recommendations for that pair of users?
To gather this data, I had to do some painstaking research: trying amazing coffees with my wife. Each of us then rated them on a 10-point scale. The target value is simply the sum of our two scores (20 points maximum). The model's objective is to learn to rank the coffees that both of us will enjoy, not just one member of the pair. The contextual data we will use is the following:
- Ages of both users in the pair
- User IDs, to be converted into embeddings
SweetMarias.com provides a wealth of item data:
- Origin of the coffee
- Processing and cultivation notes
- Tasting descriptions
- Professional grading scores (100-point scale)
So each training example will have the user data as contextual information, concatenated with each item's feature set.
TensorFlow Ranking models are typically trained on data in ELWC format: ExampleListWithContext. You can think of it as a dictionary with two keys: CONTEXT and EXAMPLES (a list). Each EXAMPLE is a dictionary of features for one item you want to rank.
For example, let's say I'm looking for a new coffee to try and I'm presented with a candidate pool of k=10 coffee varieties. An ELWC would consist of the context/user information along with a list of 10 items, each with its own set of features.
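To make the shape of an ELWC concrete, here is a minimal sketch of one record as a plain Python dict. The real format is a protobuf defined by TensorFlow Ranking, and every feature name below is illustrative, not the actual schema:

```python
# Conceptual sketch of an ExampleListWithContext (ELWC) record as a plain
# Python dict. The real ELWC is a protobuf in TensorFlow Ranking; these
# field names are made up for illustration.
k = 10  # size of the candidate list

elwc = {
    "context": {            # shared user-pair features
        "user_a_age": 41,
        "user_b_age": 39,
        "user_a_id": 0,
        "user_b_id": 1,
    },
    "examples": [           # one feature dict per candidate coffee
        {
            "item_id": i,
            "expert_score": 90.0,          # 100-point scale
            "description": "floral, citrus, cocoa",
        }
        for i in range(k)
    ],
}
```

The nesting is the important part: one shared context block, then a list of per-item feature dicts that all get scored together.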
Since I was no longer using TensorFlow Ranking, I built my own hacky list-building/ranking portion of this project. I took random samples of k items for which we have scores and added them to a list. The first coffees I tried went into a training set, and the later examples became a small validation set for evaluating the model.
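That hacky list-building step can be sketched roughly like this; the pool of rated coffees and the helper names are invented for illustration:

```python
import random

random.seed(0)

# Hypothetical rated pool: (coffee_name, combined couple score out of 20).
rated = [(f"coffee_{i}", random.randint(8, 20)) for i in range(16)]

def build_lists(pool, k=5, n_lists=100):
    """Sample n_lists random lists of k rated items, LTR-style:
    each list pairs item names with their relevance labels."""
    lists = []
    for _ in range(n_lists):
        sample = random.sample(pool, k)
        items = [name for name, _ in sample]
        labels = [score for _, score in sample]
        lists.append((items, labels))
    return lists

train_lists = build_lists(rated, k=5, n_lists=100)
```

Sampling with replacement across lists (but not within a list) is a cheap way to stretch 16 rated coffees into many training lists.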
In this toy example, we have a fairly rich data set. For context, we know the users' ages and can learn their respective preferences. Through subsequent layers in the LTR model, these contextual features can be compared and contrasted. Does one user like dark, fruity flavors, while the other enjoys invigorating citrus notes in their cup?
For item features, we have a generous helping of rich, descriptive text from each coffee's tasting notes, origin, and so on. More on this later, but the general idea is that we can capture the meaning of these descriptions and relate them to the context data (the user pair). Finally, we have a few numerical features, like the expert grading score per item, which should bear some resemblance to reality.
There has been an amazing shift in text embeddings since I started in the machine learning industry. Gone are the GloVe and Word2Vec models I used to rely on to capture some semantic meaning from a word or phrase. If you visit https://huggingface.co/blog/mteb, you can easily compare the best and latest embedding models for various purposes.
For the sake of simplicity and familiarity, we will use https://huggingface.co/BAAI/bge-base-en-v1.5 embeddings to project our text features into something an LTR model can digest. Specifically, we will use them for the product descriptions and names that Sweet Maria's provides.
We also need to convert all of our user and item ID values into an embedding space. PyTorch handles this beautifully with embedding layers.
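A minimal example of the ID-to-vector conversion with torch.nn.Embedding; the sizes here are arbitrary:

```python
import torch
import torch.nn as nn

# Map small integer user/item IDs into dense, learnable embedding spaces.
num_users, num_items, emb_dim = 2, 16, 8
user_emb = nn.Embedding(num_users, emb_dim)
item_emb = nn.Embedding(num_items, emb_dim)

user_ids = torch.tensor([0, 1])          # both raters in the pair
item_ids = torch.tensor([[3, 7, 11]])    # one list of 3 candidate coffees

u = user_emb(user_ids)    # shape: (2, 8)
it = item_emb(item_ids)   # shape: (1, 3, 8)
```

The lookup is just an index into a trainable weight matrix, so the IDs must be integers in `[0, num_embeddings)`.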
Finally, we scale our float features with a simple RobustScaler. All of this can happen inside our Torch Dataset class, which then gets fed into a DataLoader for training. The trick is to separate out the different identifiers that will be passed to the forward() call in PyTorch. This article by Offir Inbar really saved me some time by doing just that!
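A stripped-down sketch of what such a Dataset might look like, with invented feature names; note how the IDs stay separate from the scaled floats so they can be routed into embedding layers later:

```python
import numpy as np
import torch
from sklearn.preprocessing import RobustScaler
from torch.utils.data import DataLoader, Dataset

class CoffeeDataset(Dataset):
    """Toy Dataset: scales float features with RobustScaler and keeps
    integer IDs separate so forward() can send them to nn.Embedding."""

    def __init__(self, item_ids, float_feats, labels):
        self.item_ids = torch.as_tensor(item_ids, dtype=torch.long)
        self.scaler = RobustScaler()
        scaled = self.scaler.fit_transform(np.asarray(float_feats))
        self.float_feats = torch.as_tensor(scaled, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # IDs and floats returned separately: IDs feed embeddings,
        # floats get concatenated downstream.
        return self.item_ids[idx], self.float_feats[idx], self.labels[idx]

ds = CoffeeDataset(
    item_ids=[0, 1, 2, 3],
    float_feats=[[90.0], [86.5], [92.0], [88.0]],  # e.g. expert grading score
    labels=[15, 12, 18, 14],                       # combined couple ratings
)
loader = DataLoader(ds, batch_size=2, shuffle=True)
```

(In a real pipeline you would fit the scaler on the training split only, then reuse it for validation.)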
The only tricky part of the Torch training was making sure the two user embeddings (one for each rater) and the k coffees in each list had the correct embeddings and dimensions to pass through our neural network. With some adjustments, I was able to get something working:
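Here is one way the forward() pass could be wired up. This is my own minimal sketch, not the repository's actual model, and all the dimensions are made up:

```python
import torch
import torch.nn as nn

class PairwiseRanker(nn.Module):
    """Minimal sketch: concatenate both user embeddings with each item's
    embedding and float features, then score every item in the list."""

    def __init__(self, num_users=2, num_items=16, emb_dim=8, n_float=1):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, emb_dim)
        self.item_emb = nn.Embedding(num_items, emb_dim)
        in_dim = 2 * emb_dim + emb_dim + n_float  # both users + item + floats
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, user_ids, item_ids, float_feats):
        # user_ids: (batch, 2); item_ids: (batch, k); float_feats: (batch, k, n_float)
        b, k = item_ids.shape
        users = self.user_emb(user_ids).reshape(b, -1)      # (b, 2*emb_dim)
        users = users.unsqueeze(1).expand(b, k, -1)         # broadcast over the list
        items = self.item_emb(item_ids)                     # (b, k, emb_dim)
        x = torch.cat([users, items, float_feats], dim=-1)  # (b, k, in_dim)
        return self.mlp(x).squeeze(-1)                      # (b, k) ranking scores

model = PairwiseRanker()
scores = model(
    torch.tensor([[0, 1]]),       # the user pair
    torch.tensor([[3, 7, 11]]),   # a list of k=3 coffees
    torch.randn(1, 3, 1),         # scaled float features
)
```

The key shape trick is expanding the shared user-pair vector across the k items before concatenation, so one score per list item comes out.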
This step puts each training example into a single concatenated vector of all features.
With so few data points (only 16 coffees were rated), it can be difficult to train a robust NN model. I often build a simple sklearn model side by side to compare the results. Are we really learning anything?
Using the same data preparation, I built a multiclass LogisticRegression classifier and then extracted its .predict_proba() scores to use as our rankings. What would the metrics say about the performance of these two models?
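A sketch of that baseline on synthetic data: treat the combined rating as a class label, then collapse .predict_proba() into one score per coffee. Using the expected rating under the predicted distribution is my own choice here, one plausible way to turn class probabilities into a ranking:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical features (e.g. grading score plus compressed text embedding)
X = rng.normal(size=(16, 4))
# Combined couple ratings as integer class labels (sum of two 10-pt scores)
y = rng.integers(10, 21, size=16)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Collapse class probabilities into a single ranking score per coffee:
# the expected rating under the predicted distribution.
proba = clf.predict_proba(X)        # (n_samples, n_classes)
expected = proba @ clf.classes_     # (n_samples,)
ranking = np.argsort(-expected)     # indices of coffees, best first
```

Another option would be ranking by the probability of the top class alone; with ordinal labels like these, the expectation uses more of the distribution.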
For metrics, I chose to track two:
- Precision of the top pick (`k=1`)
- NDCG
The goal, of course, is to rank these coffees correctly, so NDCG fits very well here. However, I suspected that the LogReg model might struggle with the ranking aspect, so I thought I'd include top-1 precision as well. Sometimes you just want one really good cup of coffee and don't need a full ranking!
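Both metrics are easy to compute with scikit-learn. In this invented toy example, the model's ordering happens to match the ideal ordering exactly, so NDCG comes out to 1.0:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# True combined ratings and model scores for one validation list of k=5 coffees
true_relevance = np.asarray([[18, 12, 15, 9, 16]])
model_scores = np.asarray([[0.9, 0.2, 0.5, 0.1, 0.7]])

# ndcg_score expects 2D arrays: (n_lists, n_items)
ndcg = ndcg_score(true_relevance, model_scores)  # 1.0 here: orders agree

# Precision@1: did the model's top pick match the truly best coffee?
p_at_1 = float(np.argmax(model_scores) == np.argmax(true_relevance))
```

With real predictions the orderings will disagree somewhere and NDCG drops below 1, penalizing mistakes near the top of the list most heavily.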
Without any significant investment in parameter tuning, I achieved very similar results between the two models. SKLearn had a slightly worse NDCG on the (small) validation set (0.9581 vs. 0.950), but similar precision. I suspect that with some hyperparameter tuning on both the PyTorch model and the LogReg model, the results would remain very similar given so little data. But at least they generally agree!
I have a new 16-pound batch of coffee to start rating and adding to the model, and I deliberately included some lesser-known varieties in the mix. I hope to clean up the repository a bit and make it less hacky. I also need to add prediction for unseen coffees so I can figure out what to buy in my next order!
One thing to keep in mind: if you are building a recommender for production, it is usually a good idea to use an actual library built for ranking. TensorFlow Ranking, XGBoost, LambdaRank, etc. are accepted in the industry and have many of the rough edges already worked out.
Please check out the repository here, and let me know if you spot any errors! I hope this inspires you to train your own user-pair ranking model.