Gen-RecSys Paper Read
I’m using this blog to work through the Gen-RecSys paper, titled “A Review of Modern Recommender Systems Using Generative Models”.
Over the past couple of years, there’s been a lot of buzz about using GenAI in recommendation problems. Honestly, I’m skeptical. LLMs - i.e. autoregressive next-token predictors - work very differently from recommender systems, which are fundamentally about matching users to items. Going in, my expectation is that Gen-RecSys ideas either don’t really work, or are existing methods repackaged as “GenAI”.
That said, I love being proved wrong, and I’m genuinely curious to learn more about attempts to apply GenAI to recommendation systems.

1 - Introduction
The introduction frames Gen-RecSys as a paradigm shift from traditional RS (which they call “narrow experts” focused on user-item interactions) to systems that can model and sample from complex data distributions across multiple modalities (text, images, video, user interactions).
The paper identifies two primary modes of applying generative models:
- Directly trained models (like VAE-CF), trained from scratch on user-item interactions.
- Pretrained models (like LLMs) that leverage emergent capabilities for zero/few-shot learning, fine-tuning, RAG, feature extraction, and multimodal approaches.

The authors position LLMs like ChatGPT and Gemini as offering “emergent capabilities” including reasoning, in-context learning, and access to open-world knowledge, which they suggest enables enhanced personalization, conversational interfaces, and explanation generation.
I’m very skeptical of the “emergent capabilities” claim. I think this is already potentially spurious when it comes to LLMs, but in my view is stretched even further when applied to personalisation, which by definition will rely on user information outside of the training data.
2 - Generative Models
Types of generative model:
- Auto-Encoder: Either BERT-like models learning from masked views of inputs, or VAEs learning latent embeddings of interactions.
- Auto-Regressive: Sequential models using transformer or RNN like architectures, predicting the next item in a sequence.
- GANs: Generator/discriminator setup, mostly used for generating training examples and sampling negatives. It’s not clear to me how this works in practice, and I doubt these methods are commonly used; GANs are basically dead now anyway.
- Diffusion: Learn to recover interactions from corrupted copies of the historical interaction data.
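To make the auto-encoder idea concrete, here’s a toy denoising-autoencoder sketch in plain NumPy: reconstruct a user’s interaction vector from a corrupted copy, then score unseen items by their reconstructed values. The interaction matrix, single linear layer, and hyperparameters are all made up for illustration; real methods like VAE-CF use proper encoder/decoder networks and a variational objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy implicit-feedback matrix (5 users x 6 items), invented for illustration.
X = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 1],
], dtype=float)

n_items = X.shape[1]
W = rng.normal(scale=0.1, size=(n_items, n_items))  # single linear "encoder-decoder"

lr, drop = 0.1, 0.3
for _ in range(500):
    # Denoising objective: corrupt inputs by randomly dropping interactions,
    # then reconstruct the uncorrupted vector.
    mask = rng.random(X.shape) > drop
    X_corrupt = X * mask
    X_hat = X_corrupt @ W
    grad = X_corrupt.T @ (X_hat - X) / len(X)
    W -= lr * grad

def recommend(user_vec, k=1):
    """Score all items by reconstruction; exclude already-seen items."""
    scores = user_vec @ W
    scores = np.where(user_vec > 0, -np.inf, scores)
    return np.argsort(scores)[::-1][:k]

print(recommend(X[0], k=2))  # top unseen items for user 0
```

The same corrupt-and-reconstruct shape is roughly what the diffusion methods do too, just with many noise levels instead of one.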
3 - LLMs
Types of recommendation strategies:
- Encoder-only. Encode item information (title, description, reviews etc.) and query information (user preferences, conversational history etc.), then perform similarity search. This technique has the benefit of not requiring explicit training and of leveraging well-established vector-store search techniques, although I imagine it would require a lot of manual tuning around which specific information to use. Some techniques also involve a prediction head on top of the embeddings that explicitly learns user-item interactions.
- Direct Prompting (described as “LLM-based Generative Recommendation”). This involves putting user information and item details directly in the prompt and asking the LLM to choose items. This can either be zero-shot, where the LLM already has the relevant domain knowledge, or fine-tuned.
- RAG. Generate output based on retrieved information relevant to the query. Can be used as a ranking layer after initial retrieval from a traditional recommender system, although it appears that most methods are more review focused.
- Feature Extraction. Using embeddings from LLMs as inputs to traditional user/item recommender systems.
- Conversational. Make recommendations based on a conversational interaction with the user, sometimes also leveraging existing recommendations.
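The encoder-only strategy is easy to sketch. Below, a bag-of-words counter stands in for an LLM text encoder (in practice you’d use a sentence-embedding model); the item texts and query are invented for illustration. Items are embedded from their title/description, the query from stated user preferences, and candidates are ranked by cosine similarity.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in for an LLM text encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical item "documents" built from title + description.
items = {
    "A": "wireless noise cancelling headphones for travel",
    "B": "mechanical keyboard with rgb lighting for gaming",
    "C": "lightweight running shoes with cushioned sole",
}
item_vecs = {k: embed(v) for k, v in items.items()}

# Query built from the user's stated preferences / conversation history.
query = embed("looking for headphones to use on long flights")

ranked = sorted(items, key=lambda k: cosine(query, item_vecs[k]), reverse=True)
print(ranked[0])  # prints "A"
```

Swapping the toy `embed` for a real encoder and the dict for a vector store gives the production version, which is partly why this approach is attractive: no recommendation-specific training required.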
4 - Generative Multimodal Recommendation Systems
This area of recommendation is broader than traditional user/item systems, focusing on combining multiple modalities (image, text, video and interactions). It also covers types of recommendation other than next-item prediction, e.g. virtual try-on.
This isn’t my focus so I’ve skimmed this section.
5 - Evaluating for Impact and Harm
Discusses existing evaluation methods and their shortcomings, e.g. discrepancies between online and offline performance, and the difficulty of building general benchmarks given the narrowness and specificity of candidate pools. Gen-RecSys techniques may be better suited to general benchmarks as they’re more likely to be transferable, but such benchmarks are still difficult to get right.
Gen-RecSys methods may have broader social and ethical harms as they combine potential harmfulness of recommender systems and that of LLMs. For example, they can have societal biases, be susceptible to manipulation, and push users to harmful content areas through hyper-personalisation.
Conclusion
This paper was much more insightful than I expected. I hadn’t appreciated some of the ways LLMs could be applied to recommendations—particularly how prompting an LLM with user and item information can form query and candidate embeddings, much like an ML model learning user/item representations. I think this approach could have interesting applications.
That said, I still think some areas discussed in this paper should be kept separate from core recommendation work. I’m not convinced that auto-encoder and diffusion approaches are particularly useful in recommendation domains. I also think broadening recommendations beyond typical item prediction, e.g. virtual try-on, is unhelpful. Recommendation systems are hugely complex as it is, and lumping in solutions to different problems muddies the water.
Future Reading
Generative Model Approaches:
- BERT4Rec. A commonly used encoder method.
- IRGAN. Not a high priority, as GANs are not a common approach.
- DiffRec. Diffusion-based; an interesting approach, but I’m not sure how practical it is.
- Conversational Recommendation Systems. A RAG-based approach. To me it doesn’t appear generally applicable to all recommendations; it seems more review-specific.