Factorization Machines A Theoretical Introduction

Source: http://jxieeducation.com/2016-06-26/Factorization-Machines-A-Theoretical-Introduction/

Factorization Machines

This post is going to focus on explaining what factorization machines are and why they are important. The future posts will provide practical modeling examples and a numpy clone implementation of factorization machines.

What problem do Factorization Machines solve?

TL;DR: FMs are a combination of linear regression and matrix factorization that models sparse feature interactions but in linear time.

Normally when we think of linear regression, we think of this formula.


In the formula above, the run time is O(n)

where n is the number of features. When we consider quadratic feature interactions, the complexity increases to O(n^2), in the formula below.


Now consider a very sparse set of features, the runtime blows up. In most cases, we instead model a very limited set of feature interactions to manage the complexity.

How do Factorization Machines solve the problem?

In the recommendation problem space, we have historically dealt with the sparsity problem with a well documented technique called (non-negative) matrix factorization.

matrix factorization diagram

We factorize sparse user item matrix (r) R^{UxI}

into a user matrix (u) R^{UxK} and an item matrix (i) ^R^{IxK}, where K<<U and K<<I


User (u_i)’s preference for item i_j can be approximated by u_ii_j

Factorization Machines takes inspiration from matrix factorization, and models the feature iteractions like using latent vectors of size K. As a result, every sparse feature f_i has a corresponding latent vector v_i. And two feature’s interactions are modelled as v_iv_j

Factorization Machines Math


Instead of modeling feature interactions explicitly, factorization machines uses the dot product of features’ interactions. The model learns v_i implicitly during training via techniques like gradient descent.

The intuition is that each feature will learn a dense encoding, with the property that high positive correlation between two features have a high dot product value and vise versa.

Of course, the latent vectors can only encode so much. There is an expected hit on accuracy compared to using conventional quadratic interactions. The benefit is that this model will run in linear time.

FM complexity

For the full proof, see lemma 3.1 in Rendle’s paper.


Since pic_36 can be precomputed, the complexity is reduced to O(KN).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s