Vector Embeddings for Movie Recommendations: The Heartbeat of This One Ltd.
- Meghana Kumar
- Feb 20
- 10 min read
A Plucky Research Engineer is added to the team (spoiler!)
This One Ltd was assembling its dream team for its venture into the personalized recommendations space. It was a startup at the sapling stage, with fewer than six team members, but it had a stellar cast: Martin Gould as captain of the ship, an Oxford PhD with fierce enthusiasm and just as kind a heart, and Eric Humphrey, a rockstar in the ML community, quickly roped in and soon to be a recurring character as this story unfolds. One person each manned the front-end, back-end, product, and operations, and this ensemble seemed sufficient at the time for building out MVPs, proofs-of-concept, and general fun.
I spied an old job advert for a Research Scientist role (Eric's role!) outlining the exciting opportunity to design the heartbeat of the company from scratch, in the very exciting field of recommendation systems.
The core work would be to go away to the virtual laboratory and come up with the very algorithms that would make the product work - everything from the whiteboard stage of conception to making it a living, breathing, tangible thing that others could get their hands on.
I audaciously messaged the CEO even though the advert asked for a fair bit more experience than I had at the time (only two years, fresh out of my Masters). As luck would have it, Martin agreed to meet with me, and there was such palpable excitement, exchange of ideas, and natural fit for working together that they decided to create a Research Engineer role for me as Eric's trusty deputy. It was a very easy yes from me.
Day 0: Coffee Chat with the CEO
It was a sunny London afternoon when I first met Martin in their modest Fenchurch Street WeWork. The whiteboard behind him was covered in user journey diagrams and what looked like mathematical formulas for similarity metrics. "Think about how broken movie discovery is right now," he said, sketching out a typical Netflix homepage. "Users spend an average of 17.8 minutes scrolling through options across different platforms, often giving up without watching anything or picking something very mediocre. We're going to fix that."
The challenge he laid out was both terrifying and irresistible: build a recommendation engine so personally attuned to user taste that it could earn their trust in just five recommendations.
No existing user data, no pre-built infrastructure, and a six-week deadline to prove the concept.
"If we get those first five recommendations wrong," Martin emphasized, "users won't give us a second chance."
Trust would be the north star metric. The experience had to be magical; if it wasn't, there was no way we'd convince anyone not to just watch whatever Netflix had on their home screen.
Into The Laboratory: Modelling "Taste" Space
Evaluation Metrics
"If we can't measure it, we can't improve it." Eric said, as I settled into my first technical deep-dive.
Huddled around a whiteboard, we broke down what "trust" meant quantitatively. The top five recommendations a user sees on screen are critical - make or break - and the earlier they see a movie that intrigues them, the better. We settled on these three metrics to optimize trust; a sketch of how they can be computed follows the list.
NDCG@5 (Normalized Discounted Cumulative Gain) would measure how well we ranked recommendations, with higher weights for top positions since first impressions matter most.
Precision@5 would tell us the raw percentage of recommendations that resonated.
Session Success Rate: the percentage of sessions where users connected with at least 3 out of 5 recommendations, either saving them to their watchlist or marking them as already loved.
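To make that concrete, here is a minimal sketch of how these three metrics can be computed for a single session, assuming binary relevance labels for the five recommendations shown (the function names are illustrative, not our production code):

```python
import numpy as np

def ndcg_at_5(relevances):
    """NDCG@5: discounted gain of the ranked list vs. the ideal ordering."""
    rel = np.asarray(relevances[:5], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))  # positions 1..5
    dcg = float(np.sum(rel * discounts))
    ideal = np.sort(rel)[::-1]
    idcg = float(np.sum(ideal * discounts))
    return dcg / idcg if idcg > 0 else 0.0

def precision_at_5(relevances):
    """Precision@5: fraction of the five recommendations that resonated."""
    rel = relevances[:5]
    return sum(1 for r in rel if r > 0) / len(rel)

def session_success(relevances, threshold=3):
    """A session counts as a success if at least 3 of 5 recommendations landed."""
    return sum(1 for r in relevances[:5] if r > 0) >= threshold

# Example: the user saved/loved the 1st, 3rd and 4th recommendations.
rels = [1, 0, 1, 1, 0]
print(ndcg_at_5(rels), precision_at_5(rels), session_success(rels))
```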
We then put our heads together to set up a leaderboard where we could easily iterate on the different experiments from our research ventures. Our first task was establishing a baseline. We grabbed the MovieLens dataset - a standard benchmark in recommendation systems - and implemented a basic content-based filtering approach and a collaborative-filtering approach. The numbers were humbling: NDCG@5 of 0.15, Precision@5 of 0.22, and a Session Success Rate of just 15%.
"Well," Eric grinned, looking at our freshly created leaderboard, "there's nowhere to go but up. Now let's divide and conquer. May the best model win!"
Vector Embeddings
That's when we dove into the real question: how could we do better than traditional recommendation approaches? The problem with content-based filtering was clear - movies aren't just their genres and keywords. Collaborative filtering seemed promising, but as we dug deeper into how it actually worked, we realized there might be a more innovative approach.
To appreciate why we chose word2vec, we need to understand how traditional collaborative filtering works. In its simplest form, collaborative filtering creates a massive, sparse matrix where each row is a user and each column is a movie. If Alice rates "The Dark Knight" 5 stars, we put a 5 in that cell, leaving most cells empty since no user has watched every movie. The system then tries to fill in these gaps by finding patterns - if users who like "The Dark Knight" tend to also like "Inception", it might recommend Inception to someone who just rated Dark Knight highly.
"But there's a deeper insight here," Eric pointed out. "Matrix factorization, which is at the heart of modern collaborative filtering, is essentially trying to learn lower-dimensional representations - or embeddings - of both users and items. When you multiply these embeddings together, you're trying to reconstruct the original rating matrix."
This was our 'aha' moment. If collaborative filtering was already trying to learn embeddings, why not use more sophisticated embedding techniques from NLP? word2vec learns word embeddings by predicting which words appear in similar contexts. We could adapt this to learn movie embeddings by predicting which movies appear in similar viewing contexts.
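The adaptation is roughly this: treat each user's history of loved movies as a "sentence" of movie IDs and let word2vec learn which movies co-occur in similar tastes. A minimal sketch with gensim, where the movie IDs and hyperparameters are purely illustrative rather than our actual training setup:

```python
from gensim.models import Word2Vec

# Each "sentence" is one user's sequence of positively-rated movies.
user_histories = [
    ["the_dark_knight", "inception", "interstellar", "memento"],
    ["the_notebook", "la_la_land", "about_time", "inception"],
    ["memento", "the_prestige", "inception", "shutter_island"],
    # ... one list per user
]

model = Word2Vec(
    sentences=user_histories,
    vector_size=64,   # dimensionality of the taste space
    window=10,        # how broad a "viewing context" to consider
    min_count=1,      # keep rare titles in this toy example
    sg=1,             # skip-gram: predict context movies from a target movie
    epochs=50,
)

# Movies watched in similar contexts end up close together.
print(model.wv.most_similar("inception", topn=3))
```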

As the idea gained traction, I could suddenly see all the possibilities it could unlock. If we built our own vector embedding space, where each movie is represented as a multi-dimensional vector trained on users' taste, we would effectively have a mathematically rich and dense "Taste Space". In this space, similar movies would sit closer to each other, and we could search and rank within it using cosine similarity. The recommendation implementation would become very elegant.
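That elegance is easiest to see in code. A minimal sketch, assuming movie_vectors is a dict of the learned embeddings and the user's taste vector is the mean of their loved movies (both simplifying assumptions for illustration):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(user_loved_ids, movie_vectors, top_k=5):
    """Rank the catalogue by cosine similarity to the user's taste centroid."""
    loved = set(user_loved_ids)
    taste = np.mean([movie_vectors[m] for m in user_loved_ids], axis=0)
    scored = [(m, cosine(taste, v)) for m, v in movie_vectors.items() if m not in loved]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]
```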
Iterations: From Baseline to Breakthrough
Moving from concept to reality meant facing our first hard truth: our initial MovieLens dataset wasn't going to cut it. "We need richer signals," we plotted. This kicked off our first major iteration - building and monitoring a Letterboxd scraper that would eventually collect interaction data from over 100,000 users.
I took ownership of expanding and refining our data pipeline. Each training example became a careful balance - should we count only 4.5+ star ratings as positive signals? What about Letterboxd's "heart" reactions? We needed to include watchlist "saves" as positive signals too.
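In practice, the positive-signal question boils down to a labelling function over scraped interactions. A hedged sketch of the kind of rule we converged on, where the thresholds and field names are illustrative rather than our exact pipeline:

```python
def is_positive(interaction):
    """Decide whether a scraped interaction counts as a positive signal."""
    if interaction.get("rating") is not None and interaction["rating"] >= 4.5:
        return True
    if interaction.get("hearted"):       # Letterboxd "heart" reaction
        return True
    if interaction.get("watchlisted"):   # saved for later also counts
        return True
    return False

def build_history(user_interactions):
    """One training 'sentence' per user: their positively-signalled movies, in order."""
    return [i["movie_id"] for i in user_interactions if is_positive(i)]
```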
The technical iterations were methodical. We experimented with different word2vec hyperparameters - larger context windows to capture broader taste patterns, varying embedding dimensions to balance expressiveness with computational efficiency. Negative sampling, which we thought would help, actually degraded performance.
We also implemented a partial credit system in our evaluation metrics, giving some weight to recommending movies rated 4.0, while reserving full credit for those rated 4.5+.
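Concretely, that just means swapping binary relevance for graded gains before computing NDCG. A small sketch of the grading, where the 0.5 partial-credit weight is illustrative of the idea rather than the exact value we shipped:

```python
def graded_relevance(rating):
    """Full credit for 4.5+, partial credit for 4.0, nothing below that."""
    if rating is None:
        return 0.0
    if rating >= 4.5:
        return 1.0
    if rating >= 4.0:
        return 0.5
    return 0.0

# These graded values feed straight into the ndcg_at_5 sketch shown earlier.
rels = [graded_relevance(r) for r in [5.0, 3.5, 4.0, None, 4.5]]
```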
The breakthrough wasn't a single dramatic moment, but rather the culmination of several carefully orchestrated improvements. The results spoke for themselves - our metrics doubled, with NDCG@5 jumping from 0.17 to 0.35. The improvement came from three key elements working in concert: the expanded dataset providing a more authentic picture of movie preferences, a refined evaluation system with partial credit that better reflected real user satisfaction, and optimized hyperparameters that helped capture more nuanced relationships in our taste space. It wasn't just better numbers - we had built something that could genuinely help people find movies they'd love.
After the quantitative validation, we gained the trust of the rest of the team to productionize this core model as the product's heartbeat. We then started testing it with users and got the qualitative validation soon after as well. It was six weeks well spent, and we had successfully built the core experience of our product.
Now it was time to build juicy features on top of it.
Clustering your Taste: Vibes
"I love both Inception and The Notebook," a user commented during early testing, "but never at the same time." Our taste space was revealing something fascinating - a single user's favorite movies weren't forming a single tight cluster. Instead, they created multiple distinct groupings based on the emotional context in which people watch them, their mood. And whenever they start a recommendation session, they didn't want their entire taste showing up, they wanted to be able to access a certain mood and start a recommendation session through that lens.
I was personally inspired by this problem statement and proposed implementing clustering algorithms on the vector embedding space to create different Vibes in the user's taste. I experimented with four clustering algorithms:
K-means: Gave us clear, distinct clusters but required pre-defining the number of vibes
Hierarchical Clustering: Helped us understand the natural hierarchy of taste patterns
DBSCAN: Particularly useful for identifying outliers and natural cluster boundaries
BIRCH: A breakthrough when scaling up - its ability to handle large datasets incrementally made it particularly valuable as our user base grew.
I evaluated cluster quality using standard metrics like silhouette score, Davies-Bouldin index, and inertia. We ended up supporting both K-means and BIRCH in production - K-means for its interpretable clusters and BIRCH for its scalability with growing data.
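A condensed sketch of that comparison for a single user's loved-movie vectors, assuming `movie_vecs` is an array of their embeddings with enough points to cluster (the candidate cluster counts are illustrative):

```python
from sklearn.cluster import KMeans, Birch
from sklearn.metrics import silhouette_score, davies_bouldin_score

def best_vibes(movie_vecs, k_candidates=(2, 3, 4, 5, 6)):
    """Pick the number of vibes that maximises silhouette score."""
    best = None
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(movie_vecs)
        sil = silhouette_score(movie_vecs, labels)
        db = davies_bouldin_score(movie_vecs, labels)   # lower is better
        if best is None or sil > best[1]:
            best = (k, sil, db, labels)
    return best

def birch_vibes(movie_vecs, k):
    """BIRCH alternative: fits incrementally, useful as the user base grows."""
    return Birch(n_clusters=k).fit_predict(movie_vecs)
```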

The technical validation of our clusters was just the beginning. When we rolled out Taste Vibes to our users, the impact was immediate and striking. What we'd discovered in the data resonated deeply with how people actually thought about movies. Users could now fluidly move between their edge-of-seat collection and their feel-good favorites, each backed by those carefully validated patterns we'd uncovered.
Cut and Rank: This Time It's Personal
While Vibes were a hit, we noticed an interesting challenge: recommending purely based on vibe clusters meant sometimes losing sight of individual user preferences. A session would now start from the centroid of the chosen vibe in the vector embedding space, but it missed the subtleties of the user's actual taste.
That's when I proposed Cut and Rank - a two-step recommendation approach that would preserve both context and personalization. First, we'd "cut" the movie space to only include films within the chosen vibe (based on the closest 100 movies to that vibe's centroid). Then, we'd "rank" these movies based on the user's personal taste vector, effectively creating a personalized view of each vibe.
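A minimal sketch of those two steps, assuming the vibe centroid, the user's taste vector, and the movie embeddings all live in the same space (the helper names and cut size below are illustrative):

```python
import numpy as np

def cut_and_rank(vibe_centroid, user_taste_vec, movie_vectors, cut_size=100, top_k=5):
    """Cut: keep the movies closest to the vibe. Rank: reorder them by personal taste."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Step 1: cut the catalogue down to the vibe's neighbourhood.
    by_vibe = sorted(movie_vectors,
                     key=lambda m: cosine(vibe_centroid, movie_vectors[m]),
                     reverse=True)
    candidates = by_vibe[:cut_size]

    # Step 2: rank that shortlist against the individual user's taste vector.
    return sorted(candidates,
                  key=lambda m: cosine(user_taste_vec, movie_vectors[m]),
                  reverse=True)[:top_k]
```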
The results were compelling. My thriller vibe would end up different to your thriller vibe. Someone who enjoyed psychological thrillers like "Gone Girl" would see more cerebral suggestions, while a fan of action-thrillers like "Mission Impossible" would get more adrenaline-pumping options. We'd managed to make each vibe feel both focused and personally relevant. Because this time, it's personal.
Breaking Boundaries: The Multi-Entity Recommender
What if your journey didn't have to end with movies? After the success of our movie recommendation vertical, I saw an opportunity to push the boundaries further into other verticals. The insight was simple but powerful: people who love "The Lord of the Rings" movies might also love Tolkien's books, or fans of "Sherlock" might enjoy both the TV series and Arthur Conan Doyle's original stories.
I built a proof-of-concept by identifying users who had linked profiles across Letterboxd, Goodreads, and IMDB. By creating unified embeddings that captured relationships between movies, books, and TV shows, we could map everything into the same taste embedding space. This meant our entire recommendation architecture - Vibes, Cut and Rank - could now work across different types of media, each entity would be represented by its own vector embedding, and the architecture would seamlessly scale.
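One way to sketch that unified space is simply to namespace the IDs by media type and train on mixed interaction histories from the linked profiles, so movies, books and shows become neighbours when the same people love them. The IDs and training call below are illustrative, not the actual proof-of-concept code:

```python
from gensim.models import Word2Vec

# Histories from users with linked Letterboxd / Goodreads / IMDB profiles,
# with the entity type folded into the token so every medium shares one vocabulary.
histories = [
    ["movie:lotr_fellowship", "book:the_hobbit", "movie:lotr_two_towers", "book:the_silmarillion"],
    ["tv:sherlock", "book:a_study_in_scarlet", "movie:knives_out"],
    ["movie:blade_runner", "book:neuromancer", "book:do_androids_dream", "tv:altered_carbon"],
]

model = Word2Vec(sentences=histories, vector_size=64, window=10, min_count=1, sg=1, epochs=50)

# Cross-media neighbours: starting from a movie, books and shows can surface too.
print(model.wv.most_similar("movie:blade_runner", topn=5))
```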

The implications were exciting. A user could start with "Blade Runner" and discover not just similar movies, but also cyberpunk novels that captured the same noir atmosphere. Or someone who loved "Pride and Prejudice" could seamlessly explore both film adaptations and similar literary classics. Each recommendation session could weave between different media types while maintaining that crucial personal relevance we'd worked so hard to achieve.
Find a Friend: Connecting Through Taste
While exploring user behaviors, we noticed something fascinating: users were sharing their recommendations and discussing movies in our comment sections. This sparked an idea - if our taste space could find movies you'd love, why not find people you'd click with? There's so much we could do by owning our own vector embedding space.
The implementation was elegant in its simplicity. Since we already had user taste centroids in our vector space, we could use the same distance metrics we'd perfected for movie recommendations to find users with similar taste profiles. The process worked like this:
Take a user's taste centroid
Find the k-nearest neighbors in user taste space
Filter for active users who'd opted into the feature
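A minimal sketch of those three steps, assuming user taste centroids are kept in a dict and opt-in status in a set (both assumptions for illustration):

```python
import numpy as np

def find_friends(user_id, taste_centroids, opted_in, k=10):
    """k-nearest neighbours in user taste space, filtered to opted-in users."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    me = taste_centroids[user_id]
    others = [(uid, cosine(me, vec)) for uid, vec in taste_centroids.items()
              if uid != user_id and uid in opted_in]
    return sorted(others, key=lambda x: x[1], reverse=True)[:k]
```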
Engine Room: From Research to Production
With our core recommendation engine validated, the next challenge was making it production-ready. I built an internal Python library that served as the interface between our ML models and the engineering team. The library exposed clean APIs for our most crucial operations, such as fetching recommendations for a user's taste. These endpoints ran on Google Cloud, allowing for scalable model serving and easy updates. The filtering capability proved particularly valuable - engineers could easily exclude movies based on user history or specific criteria, making the recommendations more practical and personalized.
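From the engineering team's point of view, the library surface looked roughly like this. This is a hedged sketch; the class and method names are my illustration of the shape of the interface, not the actual internal API:

```python
from typing import Iterable, List, Optional

class RecommendationClient:
    """Thin wrapper the engineering team imported; it hides the deployed model endpoints."""

    def __init__(self, model_version: str = "taste-v2"):  # hypothetical version tag
        self.model_version = model_version

    def recommend(self, user_id: str, vibe: Optional[str] = None,
                  exclude: Iterable[str] = (), top_k: int = 5) -> List[str]:
        """Fetch ranked recommendations, optionally scoped to a vibe,
        with watched or otherwise excluded titles filtered out."""
        excluded = set(exclude)
        candidates = self._fetch_ranked(user_id, vibe)   # remote call to the model endpoint
        return [m for m in candidates if m not in excluded][:top_k]

    def _fetch_ranked(self, user_id: str, vibe: Optional[str]) -> List[str]:
        raise NotImplementedError("calls the deployed Google Cloud model endpoint")
```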
Each model version was deployed as a separate endpoint on Google Cloud, allowing for:
A/B testing between model versions
Gradual rollout of updates
Easy rollback if issues were detected
The engineering team could interact with our models through these endpoints while we maintained control over the ML infrastructure. We implemented extensive logging to track model performance and user interactions, which fed back into our training pipeline for continuous improvement.
Fondly Looking Back
My time at This One was transformative in ways that went beyond technical achievements. What started with a coffee chat about fixing movie recommendations turned into fourteen months of creative problem-solving and innovation. Each challenge - from doubling our recommendation metrics to building the multi-entity recommender - pushed me to grow both as an engineer and a researcher.
Working with Eric and Martin created an environment where innovative ideas were not just welcomed but expected. While the technical accomplishments were significant, what I value most was the freedom to explore and create. Every idea was met with enthusiasm and constructive collaboration. We weren't just building features; we were unfurling a vision. And we did it with a fantastic team. Though we've all moved on since then, I still look back on it fondly; it was a truly cherished experience from start to finish.
Eric's finessed commitment to razor-sharp technical excellence, merged with Martin's empathetic embrace of creative thinking, is a combination that has shaped not just how I solve problems, but how I envision what's possible in every project since.