1 Authors

Mu Li, Ziqi Liu, Alexander J. Smola, Yu-Xiang Wang
Computer Science, Carnegie Mellon University

Received a Best Paper Honorable Mention at WSDM 2016

DMLC (xgboost, mxnet, minerva, parameter server...)

2 Presenter

Yuta Kashino, BakFoo, Inc. @yutakashino

Observational Cosmology
Python / Zope
Realtime Data Platform for Enterprise

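3 In a nutshell

The paper proposes DiFacto, a Factorization Machine with improved memory efficiency that is trained in a distributed fashion
Compared with conventional Factorization Machines such as LibFM, it converges faster and can handle larger problems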

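4 Background and motivation

Recommendation and prediction (regression, classification, ranking)
Linear models scale to large amounts of data
Factorization Machines are nonlinear models that also capture interactions between features, so they are more expressive. However, they cannot handle large amounts of data
- Conventional Factorization Machines such as libFM can handle at most around $10^9$ features.
- Realistic data such as the Criteo CTR dataset can have around $10^{11}$ features, or require a 100-dimensional embedding matrix. In that case a conventional Factorization Machine cannot be used.
How can a factorization model handle data at the scale that linear models do?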
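To get a feel for these numbers, here is a rough back-of-the-envelope sketch. It assumes the embedding matrix is stored densely as 32-bit floats; the function name and that storage assumption are illustrative and not taken from the paper.

```python
def embedding_bytes(n_features, k, bytes_per_value=4):
    """Size in bytes of a dense n_features x k embedding matrix."""
    return n_features * k * bytes_per_value

# 10^9 features with a 100-dimensional embedding, float32
print(embedding_bytes(10**9, 100) / 1e9, "GB")    # 400.0 GB
# 10^11 features with the same embedding
print(embedding_bytes(10**11, 100) / 1e12, "TB")  # 40.0 TB
```

Under these assumptions the embedding matrix alone is far beyond the memory of a single machine, which is the motivation for both the memory reduction and the distribution.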

5 Factorization Machine

Feature vectors for a Factorization Machine: each transaction becomes one row, with all of its features concatenated into a single vector

Screen Shot 2016-03-15 at 21.47.45.png

Extend the linear model with a term involving the embedding matrix $V$
Estimate the linear weights $w$ and the embedding matrix $V$ with SGD or a similar method

$\begin{align} U &= \{Alice (A), Bob (B), Charlie (C), . . .\} \\ I &= \{Titanic (TI), Notting Hill (NH), Star Wars (SW), Star Trek (ST), . . .\} \\ S &= \{(A, TI, 2010-1, 5),(A, NH, 2010-2, 3),(A, SW, 2010-4, 1), (B, SW, 2009-5, 4),(B, ST, 2009-8, 5), (C, TI, 2009-9, 1),(C, SW, 2009-12, 5)\} \end{align}$
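As a minimal sketch of how the transactions $S$ turn into feature vectors, the following one-hot encodes the user and the item of each transaction into a single sparse row. The helper names `build_index` and `encode` are hypothetical, and the time field is ignored for brevity; a real encoding would usually include it and other features as well.

```python
# Transactions S from the example: (user, item, time, rating)
S = [
    ("A", "TI", "2010-1", 5), ("A", "NH", "2010-2", 3), ("A", "SW", "2010-4", 1),
    ("B", "SW", "2009-5", 4), ("B", "ST", "2009-8", 5),
    ("C", "TI", "2009-9", 1), ("C", "SW", "2009-12", 5),
]

def build_index(S):
    """Assign one column per user and one per item, in order of first appearance."""
    index = {}
    for user, item, _, _ in S:
        index.setdefault(("user", user), len(index))
        index.setdefault(("item", item), len(index))
    return index

def encode(S, index):
    """Return (sparse rows, targets); each row maps column index -> value."""
    rows = [{index[("user", u)]: 1.0, index[("item", i)]: 1.0} for u, i, _, _ in S]
    targets = [float(r) for _, _, _, r in S]
    return rows, targets

index = build_index(S)
rows, y = encode(S, index)
```

Each row has exactly two nonzero entries (one user column, one item column), which is why sparse storage matters once the number of features becomes large.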

5.1 Relation to linear models

Screen Shot 2016-03-15 at 22.03.59.png
Screen Shot 2016-03-19 at 8.45.04 AM.png

https://github.com/coreylynch/pyFM/blob/master/pyfm/pylibfm.py
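The relation to the linear model can be sketched in a few lines of NumPy. This assumes the standard second-order Factorization Machine, $\hat{y}(x) = w_0 + \langle w, x \rangle + \sum_{i<j} \langle V_i, V_j \rangle x_i x_j$; the function names are illustrative and are not taken from pyFM or libFM.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM: w0 + <w, x> + sum_{i<j} <V_i, V_j> x_i x_j.

    The pairwise term uses the identity
      sum_{i<j} <V_i, V_j> x_i x_j
        = 0.5 * sum_f ((V^T x)_f^2 - ((V*V)^T (x*x))_f),
    which costs O(n k) instead of O(n^2 k).
    """
    linear = w0 + w @ x
    s = V.T @ x                   # shape (k,)
    s2 = (V ** 2).T @ (x ** 2)    # shape (k,)
    return linear + 0.5 * np.sum(s ** 2 - s2)

def fm_predict_naive(x, w0, w, V):
    """Direct double loop over feature pairs, for checking only."""
    n = len(x)
    pair = sum(V[i] @ V[j] * x[i] * x[j]
               for i in range(n) for j in range(i + 1, n))
    return w0 + w @ x + pair
```

With $V = 0$ the pairwise term vanishes and the prediction is exactly that of the linear model, so the FM is a strict extension of it.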

5.2 Implementations

libFM (C++), the original implementation https://github.com/srendle/libfm
fastFM (C + Cython), feature-rich and fast https://github.com/ibayer/fastFM
pyFM (Python/Numpy), a simple implementation https://github.com/coreylynch/pyFM

./libFM -task r -train ml1m-train.libfm -test ml1m-test.libfm -dim '1,1,8'

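6 The DiFacto model

6.1 Highlights

Improvements over conventional Factorization Machines

Memory Adaptive Constraints: reduce the memory used by the embedding matrix V
Sparse Regularization: set ineffective w to 0
Frequency Adaptive Regularization: regularization whose strength adapts to feature frequency
Distributed learning: using a Parameter Server, the updates of the weights V and w run on the servers and the gradient computation is distributed across the workers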

6.2 Memory Adaptive Constraints

Frequency threshold

Screen Shot 2016-03-13 at 18.29.44.png
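A minimal sketch of the frequency-threshold idea, assuming the rule is "a feature only gets an embedding if it occurs at least `threshold` times in the data". The function name is hypothetical, and whether the comparison is strict is an assumption; the exact constraint is the one shown in the figure above.

```python
import numpy as np

def apply_frequency_threshold(V, counts, threshold):
    """Zero the embedding rows of features seen fewer than `threshold` times."""
    keep = counts >= threshold        # boolean mask, shape (n,)
    return V * keep[:, None], keep

V = np.ones((4, 2))
counts = np.array([100, 3, 16, 0])    # occurrences of each feature
V_masked, keep = apply_frequency_threshold(V, counts, threshold=16)
# Rows 0 and 2 keep their embeddings; rows 1 and 3 are zero and need no storage.
```

Because real feature counts are heavily skewed, most features fall below the threshold, so only a small fraction of the rows of V has to be stored.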

6.3 Sparse Regularization

$l_1$ shrinkage: introduce something like the l1 regularization of linear models into the FM as well

Screen Shot 2016-03-13 at 18.29.59.png
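For intuition, $l_1$ shrinkage is commonly implemented with the soft-thresholding operator, sketched below. This is the textbook proximal step for an $l_1$ penalty, shown as an assumption about the general technique; the exact update used in DiFacto is the one in the figure above.

```python
import numpy as np

def soft_threshold(w, lam):
    """Shrink each weight toward zero by lam; weights with |w_i| <= lam become 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.5, -0.05, 2.0, 0.0])
w_new = soft_threshold(w, lam=0.1)
# The second entry is small enough to be set exactly to zero.
```

Weights that end up exactly zero do not need to be stored or communicated, which is what makes the regularization "sparse".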

6.4 Frequency Adaptive Regularization

Screen Shot 2016-03-19 at 12.55.35 AM.png

6.5 So the optimization problem becomes...

Screen Shot 2016-03-19 at 12.53.54 AM.png

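7 Distributed learning

Using a Parameter Server, the updates of the weights V and w run on the servers and the gradient computation is distributed across the workers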

Screen Shot 2016-03-13 at 18.33.08.png

Screen Shot 2016-03-13 at 18.33.34.png

7.1 gradient

Screen Shot 2016-03-19 at 12.50.57 AM.png
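As a rough sketch of what a worker computes, the following gives the gradient of the standard second-order FM output with respect to $w$ and $V$ for a single example (bias term omitted). The gradient of the loss is this multiplied by the derivative of the loss with respect to the prediction. The function names are illustrative, and the expressions in the figure above remain the authoritative ones.

```python
import numpy as np

def fm_predict(x, w, V):
    """FM output without bias: <w, x> + sum_{i<j} <V_i, V_j> x_i x_j."""
    s = V.T @ x
    return w @ x + 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))

def fm_grad(x, w, V):
    """Gradient of the FM output with respect to w and V for one example."""
    s = V.T @ x                                       # shape (k,)
    grad_w = x.copy()                                 # d y / d w_i = x_i
    # d y / d V_if = x_i * s_f - V_if * x_i^2
    grad_V = np.outer(x, s) - V * (x ** 2)[:, None]   # shape (n, k)
    return grad_w, grad_V
```

Both gradients are zero for every feature with $x_i = 0$, so a worker only needs to pull and push the parameters of features that appear in its minibatch.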

7.2 update V

Screen Shot 2016-03-19 at 12.51.22 AM.png

7.3 update w

Screen Shot 2016-03-19 at 12.51.31 AM.png

7.4 Convergence analysis

Screen Shot 2016-03-13 at 18.33.44.png

Screen Shot 2016-03-13 at 18.33.52.png

7.5 Implementation of distributed learning

https://github.com/dmlc/difacto

# Start: create one scheduler node, m worker nodes and n server nodes over multiple machines.

# Scheduler node:
Assume the data is partitioned into s parts p1, ..., ps
for t = 1 to T do
    Work packages P = {p1, ..., ps}
    Accomplished packages A = ∅
    while P ≠ ∅ do
        switch detected event from worker i do
            case idle
                Pick p ∈ P \ A and assign p to worker i
                A = A ∪ {p}
            case finished p
                P = P \ {p}
            case dead or timeout
                A = A \ {p}
    end while
end for

# Worker i:
Receive command "processing p" from the scheduler
while read a minibatch from p do
    Pull w_i and V_i from server nodes for all features i that appear in this minibatch
    Compute the gradient based on (16) and (17)
    Push gradient back to servers
end while

# Server i:
if received gradient from a worker then
    update w and V by using (20) and (19)
end if
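The scheduler part of the pseudocode above can be sketched as a small single-process class for one data pass. This is only an illustration of the bookkeeping: the class and method names are hypothetical, there is no networking, and it is not the code in the dmlc/difacto repository.

```python
class Scheduler:
    """Tracks work packages for one data pass, following the pseudocode above."""

    def __init__(self, packages):
        self.P = set(packages)   # packages not yet finished
        self.A = set()           # packages that have been assigned
        self.owner = {}          # worker id -> package it is processing

    def on_idle(self, worker):
        """Assign an unassigned package to an idle worker, or None if none is left."""
        candidates = self.P - self.A
        if not candidates:
            return None
        p = min(candidates)      # deterministic pick; any choice would do
        self.A.add(p)
        self.owner[worker] = p
        return p

    def on_finished(self, worker):
        """The worker finished its package: remove it from P."""
        p = self.owner.pop(worker)
        self.P.discard(p)

    def on_dead_or_timeout(self, worker):
        """The worker failed: release its package so another worker can take it."""
        p = self.owner.pop(worker, None)
        if p is not None:
            self.A.discard(p)

    def done(self):
        return not self.P


s = Scheduler(["p1", "p2"])
s.on_idle(0)               # worker 0 gets "p1"
s.on_idle(1)               # worker 1 gets "p2"
s.on_dead_or_timeout(0)    # "p1" becomes available again
s.on_finished(1)
s.on_idle(1)               # worker 1 picks up "p1"
s.on_finished(1)
```

The point of keeping A separate from P is fault tolerance: a package only leaves P once some worker reports it finished, so work held by a dead or slow worker is simply handed out again.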

8 Using DiFacto in practice

build/difacto data_in=data/gisette_scale val_data=data/gisette_scale.t lr=.02 V_dim=2 V_lr=.001

Screen Shot 2016-03-19 at 7.34.15 AM.png
Screen Shot 2016-03-19 at 7.40.10 AM.png

tracker/dmlc_local.py -n 2 -s 1 bin/difacto.dmlc learn/difacto/guide/train.conf

train.conf

train_data = "data/train-part_[0-1].*"
val_data = "data/train-part_2.*"
data_format = "libsvm"
model_out = "model/criteo"
embedding {
  dim = 16
  threshold = 16
  lambda_l2 = 0.0001
}
lambda_l1 = 4
lr_eta = .01
max_data_pass = 1
minibatch = 1000
early_stop = 1

9 Experimental results

9.1 Adaptive Memory

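Model size, convergence time, and accuracy on Criteo2 and CTR2 as the dimension $k$ grows
Compared: no memory adaptation vs. frequency threshold vs. frequency threshold + $l_1$ shrinkage
Large effect on memory: a 300x reduction at $k=64$
Elapsed time per iteration: shorter when $k$ is large (20% improvement on CTR2)
Accuracy: almost unchanged on Criteo2, slightly improved on CTR2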

Screen Shot 2016-03-13 at 17.25.55.png

9.2 Fixed-point Compression

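What happens when the gradients and model parameters, which are 32-bit floating point by default, are "compressed" into lower-precision integers to reduce the network load?

(a) Higher compression lowers the network load (as expected)
(b) Accuracy: unchanged on CTR2. On Criteo2, accuracy drops by about 6% at the highest compression level, which is to be expected, but lower compression does not necessarily give higher accuracy.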
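A minimal sketch of this kind of fixed-point compression, assuming a simple scheme that scales each array by its largest absolute value and rounds to the nearest integer. The function names are hypothetical, and the scheme actually used in DiFacto may differ, for example in how rounding is done.

```python
import numpy as np

def compress(values, n_bytes=1):
    """Quantize float32 values to signed integers of n_bytes each."""
    dtype = {1: np.int8, 2: np.int16}[n_bytes]
    max_int = np.iinfo(dtype).max
    max_abs = float(np.max(np.abs(values)))
    scale = max_abs / max_int if max_abs > 0 else 1.0
    q = np.round(values / scale).astype(dtype)
    return q, scale

def decompress(q, scale):
    """Map the integers back to approximate float32 values."""
    return q.astype(np.float32) * scale

g = np.array([0.5, -0.25, 0.001, 0.0], dtype=np.float32)
q, scale = compress(g, n_bytes=1)
g_hat = decompress(q, scale)
# q uses 1 byte per value instead of 4; the error per value is about scale / 2 at most.
```

Fewer bytes mean less network traffic but a coarser grid: in this example the small entry 0.001 is rounded to zero, which is the kind of precision loss that can hurt accuracy at the highest compression level.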

Screen Shot 2016-03-13 at 17.39.57.png

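9.3 Comparison with LibFM

Convergence speed on Criteo1 and CTR1
LibFM could not run on the large datasets Criteo2 and CTR2
LibFM vs. DiFacto with 1 worker vs. DiFacto with 10 workers
With a single worker, LibFM is sometimes better than DiFacto, but with more workers DiFacto wins by a wide margin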

Screen Shot 2016-03-13 at 17.39.49.png

10 Conclusion

DiFacto adds adaptive memory and frequency adaptive regularization to the Factorization Machine and distributes training with a Parameter Server, which lets it handle large problems quickly.

DiFacto — Distributed Factorization Machines