Factorization Machines: A Theoretical Introduction

Source: http://jxieeducation.com/2016-06-26/Factorization-Machines-A-Theoretical-Introduction/

Factorization Machines

This post focuses on explaining what factorization machines are and why they are important. Future posts will provide practical modeling examples and a numpy clone implementation of factorization machines.

What problem do Factorization Machines solve?

TL;DR: FMs are a combination of linear regression and matrix factorization that models sparse feature interactions in linear time.

Normally, when we think of linear regression, we think of this formula:

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i$$

In the formula above, the run time is $O(n)$, where $n$ is the number of features. When we also model quadratic feature interactions, as in the formula below, the complexity increases to $O(n^2)$.

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} w_{ij}\, x_i x_j$$

Now consider a very sparse, high-dimensional feature set: the runtime blows up, since there is one weight $w_{ij}$ for every pair of features. In most cases, we instead model only a very limited set of feature interactions to manage the complexity.
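To make the difference concrete, here is a minimal numpy sketch (my own illustration, with made-up sizes and random weights) of a linear score next to one with explicit pairwise weights; the pairwise version needs an n-by-n weight matrix and a double loop over feature pairs.

import numpy as np

# Toy sketch (illustrative only): score a single example x with a linear model
# and with explicit pairwise interaction weights.
n = 100
rng = np.random.default_rng(0)
x = rng.random(n)

w0 = 0.1                      # bias
w = rng.normal(size=n)        # n linear weights       -> O(n) work
W = rng.normal(size=(n, n))   # n*n pairwise weights   -> O(n^2) work

linear_score = w0 + w @ x

quadratic_score = linear_score
for i in range(n):
    for j in range(i + 1, n):
        quadratic_score += W[i, j] * x[i] * x[j]  # one weight per feature pair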

How do Factorization Machines solve the problem?

In the recommendation problem space, we have historically dealt with the sparsity problem using a well-documented technique called (non-negative) matrix factorization.

[Diagram: matrix factorization of the user-item matrix]

We factorize the sparse user-item matrix $r \in \mathbb{R}^{U \times I}$ into a user matrix $u \in \mathbb{R}^{U \times K}$ and an item matrix $i \in \mathbb{R}^{I \times K}$, where $K \ll U$ and $K \ll I$.

User $u_i$'s preference for item $i_j$ can then be approximated by the dot product $u_i \cdot i_j$.
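As a rough illustration (a toy sketch with random factors, not code from the post), the factorization and the dot-product prediction look like this in numpy:

import numpy as np

# Toy matrix factorization sketch: the sparse U x I matrix r is approximated
# by two low-rank factors of width K.
U, I, K = 500, 2000, 16
rng = np.random.default_rng(0)

user_factors = rng.normal(scale=0.1, size=(U, K))  # u: one length-K vector per user
item_factors = rng.normal(scale=0.1, size=(I, K))  # i: one length-K vector per item

# User 3's predicted preference for item 42 is a dot product u_3 . i_42.
pred = user_factors[3] @ item_factors[42]

# Reconstructing r ~ u i^T needs only (U + I) * K parameters instead of U * I.
approx_r = user_factors @ item_factors.T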

Factorization Machines take inspiration from matrix factorization and model feature interactions using latent vectors of size $K$. As a result, every sparse feature $f_i$ has a corresponding latent vector $v_i$, and the interaction between two features is modelled as the dot product $v_i \cdot v_j$.

Factorization Machines Math

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j$$

Instead of learning a separate weight for each feature interaction, factorization machines model the interaction between features $i$ and $j$ as the dot product of their latent vectors, $\langle v_i, v_j \rangle$. The model learns each $v_i$ implicitly during training via techniques like gradient descent.

The intuition is that each feature learns a dense encoding, with the property that two features with a high positive correlation end up with latent vectors that have a high dot product, and vice versa.

Of course, the latent vectors can only encode so much, so there is an expected hit to accuracy compared to modeling every quadratic interaction with its own weight. The benefit is that this model runs in linear time.
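Transcribing the FM equation above directly into numpy gives something like the following sketch (function and variable names are my own, not the post's); note the double loop over feature pairs, which is exactly what the next section removes.

import numpy as np

def fm_score_naive(x, w0, w, V):
    """Naive FM score: x is (n,), w0 a scalar, w is (n,), V is (n, K) latent vectors."""
    score = w0 + w @ x
    n = len(x)
    # Pairwise term: <v_i, v_j> x_i x_j over all pairs, costing O(K * n^2).
    for i in range(n):
        for j in range(i + 1, n):
            score += (V[i] @ V[j]) * x[i] * x[j]
    return score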

FM complexity

For the full proof, see Lemma 3.1 in Rendle's paper.

$$\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j = \frac{1}{2}\sum_{f=1}^{K}\left(\left(\sum_{i=1}^{n} v_{i,f}\, x_i\right)^2 - \sum_{i=1}^{n} v_{i,f}^2\, x_i^2\right)$$

Since $\sum_{i=1}^{n} v_{i,f}\, x_i$ can be precomputed for each factor $f$, the complexity is reduced to $O(Kn)$.
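In numpy, the reformulated pairwise term becomes a couple of vectorized sums over the features, one pair of sums per factor (again a sketch with my own names, not the post's code); it should agree with the naive version above up to floating-point error.

import numpy as np

def fm_score_linear(x, w0, w, V):
    """Linear-time FM score using the reformulation above: O(K * n) instead of O(K * n^2)."""
    s = V.T @ x                   # shape (K,): sum_i v_{i,f} x_i for each factor f
    s_sq = (V ** 2).T @ (x ** 2)  # shape (K,): sum_i v_{i,f}^2 x_i^2
    pairwise = 0.5 * np.sum(s ** 2 - s_sq)
    return w0 + w @ x + pairwise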

Installing Apache Spark on Ubuntu 16.04

Source: https://www.santoshsrinivas.com/installing-apache-spark-on-ubuntu-16-04/

The following installation steps worked for me on Ubuntu 16.04:

  • Download Apache Spark
  • Unzip and move Spark
cd ~/Downloads/  
tar xzvf spark-2.0.1-bin-hadoop2.7.tgz  
mv spark-2.0.1-bin-hadoop2.7/ spark  
sudo mv spark/ /usr/lib/  
  • Install SBT

As mentioned at sbt – Download

echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list  
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823  
sudo apt-get update  
sudo apt-get install sbt  
  • Make sure Java is installed

If not, install Java:

sudo apt-add-repository ppa:webupd8team/java  
sudo apt-get update  
sudo apt-get install oracle-java8-installer  
  • Configure Spark
cd /usr/lib/spark/conf/  
cp spark-env.sh.template spark-env.sh  
vi spark-env.sh  

Add the following lines

JAVA_HOME=/usr/lib/jvm/java-8-oracle  
SPARK_WORKER_MEMORY=4g  
  • Configure IPv6

Basically, disable IPv6 by editing the file with sudo vi /etc/sysctl.conf and adding the lines below:

net.ipv6.conf.all.disable_ipv6 = 1  
net.ipv6.conf.default.disable_ipv6 = 1  
net.ipv6.conf.lo.disable_ipv6 = 1  
  • Configure .bashrc

I modified .bashrc in Sublime Text using subl ~/.bashrc and added the following lines

export JAVA_HOME=/usr/lib/jvm/java-8-oracle  
export SBT_HOME=/usr/share/sbt-launcher-packaging/bin/sbt-launch.jar  
export SPARK_HOME=/usr/lib/spark  
export PATH=$PATH:$JAVA_HOME/bin  
export PATH=$PATH:$SBT_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin  
  • Configure fish (Optional – But I love the fish shell)

Modify config.fish using subl ~/.config/fish/config.fish and add the following lines

#Credit: http://fishshell.com/docs/current/tutorial.html#tut_startup

set -x PATH $PATH /usr/lib/spark  
set -x PATH $PATH /usr/lib/spark/bin  
set -x PATH $PATH /usr/lib/spark/sbin  
  • Test Spark (Should work both in fish and bash)

Run pyspark (available in /usr/lib/spark/bin/) and test it out. For example:

>>> a = 5
>>> b = 3
>>> a+b
8  
>>> print("Welcome to Spark")
Welcome to Spark  
## type Ctrl-d to exit

Also try the built-in examples by running run-example org.apache.spark.examples.SparkPi.
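To confirm that Spark itself (and not just the Python prompt) is working, you can also run a small job in the pyspark shell, which pre-defines a SparkContext as sc. Something along these lines (my own quick check, not part of the original steps):

# Inside pyspark, a SparkContext is already available as `sc`.
rdd = sc.parallelize(range(100))
print(rdd.count())  # expected: 100
print(rdd.sum())    # expected: 4950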

That’s it! You are ready to rock on using Apache Spark!

IPython notebook and Spark setup for Windows 10

From: https://nerdsrule.co/2016/06/15/ipython-notebook-and-spark-setup-for-windows-10/

Install the Java JDK

You can download and install it from Oracle here. Once installed, I created a new folder called Java in Program Files and moved the JDK folder into it. Copy the path and set JAVA_HOME within your system environment variables, then add %JAVA_HOME%\bin to the Path variable. To get there, click Start, then Settings, and then search for "environment variables". This should bring up 'Edit your system environment variables', which should look like this:

[Screenshot: Edit your system environment variables]

Installing and setting up Python and IPython

For simplicity I downloaded and installed Anaconda with the Python 2.7 version from Continuum Analytics (free) using the built-in install wizard. Once installed, you need to do the same thing you did with Java: name the variable PYTHONPATH, where my path was C:\User\cconnell\Anaconda2. You will also need to add %PYTHONPATH% to the Path variable.

Installing Spark

I am using 1.6.0 (I like to go back at least one version) from Apache Spark here. Once unzipped, do the same thing as before: set a SPARK_HOME variable to where your Spark folder is, and then add %SPARK_HOME%\bin to the Path variable.

Installing Hadoop binaries

Download and unzip the Hadoop common binaries and set a HADOOP_HOME variable to that location.
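Before wiring everything together, a quick way to confirm the variables above are visible (my own optional check, not from the original walkthrough) is to print them from Python:

import os

# Optional sanity check: the variables set in the steps above should all resolve.
for var in ("JAVA_HOME", "PYTHONPATH", "SPARK_HOME", "HADOOP_HOME"):
    print(var, "=", os.environ.get(var, "<not set>"))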

Getting everything to work together

  1. Go into your spark/conf folder and rename log4j.properties.template to log4j.properties
  2. Open log4j.properties in a text editor and change log4j.rootCategory from INFO to WARN
  3. Add two new environment variables as before: set PYSPARK_DRIVER_PYTHON to jupyter (edited from 'ipython' in the screenshot) and PYSPARK_DRIVER_PYTHON_OPTS to notebook

[Screenshot: PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables]

Now, to launch your PySpark notebook, just type pyspark from the console and Jupyter will automatically launch in your browser.
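Once the notebook opens, a quick sanity check (again my own snippet, not from the original post) is to run a small job against the SparkContext that pyspark creates as sc:

# Run this in a notebook cell; `sc` is created for you by pyspark.
print(sc.version)  # the Spark version you installed, e.g. 1.6.0
rdd = sc.parallelize(range(1000))
print(rdd.filter(lambda n: n % 2 == 0).count())  # expected: 500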