Zhang Hanbo @ AdaComp, NUS. | A Brief Introduction to Lua Torch


Abstract
Preliminaries
Installation
nn package
torch-rnn package
nngraph package

Abstract

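Torch¹ (not PyTorch) is an ancient deep learning framework that was very widely used around 2016. Its biggest drawback is arguably that it is built on Lua. Because Lua is a niche language, the learning cost is much higher than for PyTorch. In addition, since Torch only supports Lua, the debugging experience is a far cry from what pdb offers.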

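Still, as one of the first deep learning frameworks built on a scripting language, Torch once made researchers' lives much easier. Judging by the name, PyTorch also looks like a tribute to Torch (the real reason is that PyTorch and Torch initially shared part of their low-level C source code and differed only in the scripting language supported on top). I recently needed to reproduce DenseCap, which only has a Torch implementation, so I picked Torch up again. Looking at it now, Torch's design still holds up and is perfectly usable, but compared with dynamic computation graphs it is admittedly less convenient.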

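The main purpose of this post is to record my own review of Torch. Specifically, I give a brief introduction to the common techniques for building networks in Torch, its variables and models, and the nngraph package, which adds computation-graph support. Compared with the official tutorials, this post is more of a study note, meant to help me quickly recall the details of Torch development. Reading time: 30 minutes.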

Preliminaries

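You should meet at least one of the following:

You have looked at Torch before, or have some experience developing with it.
You know some other deep learning framework and have a grasp of basic Lua syntax.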

Installation

Installation guide: http://torch.ch/docs/getting-started.html#_

Note that Torch does not support CUDA versions later than 9.2, so it cannot be installed on RTX 20XX-series or newer GPUs.

nn package

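All network models in Torch live in the nn package. It includes:

Module: an abstract class, the parent of all official and user-defined models.
Containers: mainly the three classes Sequential, Parallel, and Concat.
Activation Functions: the commonly used activation functions.
Simple Layers: simple layers such as the linear layer, BatchNorm, and so on.
Table layers: ConcatTable, ParallelTable, JoinTable, MapTable, SplitTable, and others.
Convolution layers: the commonly used convolution layers.
Criterions: the commonly used loss functions.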

Module

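The parent class of all modules. Its most important methods are:

output = updateOutput(input): the forward pass.
gradInput = updateGradInput(input, gradOutput): computes the gradient with respect to the input during backpropagation; one link in the chain rule.
accGradParameters(input, gradOutput, scale): accumulates the gradients with respect to the module's parameters during backpropagation; it can be called several times when there is more than one loss.
zeroGradParameters(): resets the parameter gradients to zero.
updateParameters(learningRate): updates the parameters from the current parameter gradients and the given learning rate.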
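To see how these methods fit together, here is a minimal sketch of a hand-written gradient-descent loop on a toy regression problem. The layer sizes, learning rate, step count, and random data are arbitrary choices of mine for illustration only.

```lua
require 'nn'

torch.manualSeed(0)
local model = nn.Linear(4, 1)
local criterion = nn.MSECriterion()
local x = torch.randn(8, 4)        -- 8 toy samples with 4 features each
local target = torch.randn(8, 1)

local lossBefore = criterion:forward(model:updateOutput(x), target)

for step = 1, 50 do
   model:zeroGradParameters()                      -- clear accumulated gradients
   local output = model:updateOutput(x)            -- forward pass
   criterion:forward(output, target)
   local gradOutput = criterion:backward(output, target)
   model:updateGradInput(x, gradOutput)            -- gradient w.r.t. the input
   model:accGradParameters(x, gradOutput, 1)       -- gradient w.r.t. weight and bias
   model:updateParameters(0.01)                    -- plain SGD step
end

local lossAfter = criterion:forward(model:updateOutput(x), target)
print(lossBefore, lossAfter)
```

In real code the optim package is normally used instead of calling updateParameters by hand; the loop above is only meant to show the order of the calls.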

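As in PyTorch, Module also has forward and backward methods, but overriding either of them is discouraged. forward calls updateOutput to run the forward pass. backward calls updateGradInput and accGradParameters to compute the gradients with respect to the module's input and parameters, and returns the input gradient so that the chain rule can be applied to the preceding layers.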
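As a sketch of what a custom layer looks like in practice, the snippet below defines a parameter-free module by overriding updateOutput and updateGradInput only, leaving forward and backward untouched. The class name nn.ScaleBy and its factor argument are hypothetical names I made up for this note; they are not part of the nn package.

```lua
require 'nn'

-- Hypothetical layer for illustration: multiplies its input by a constant.
local ScaleBy, parent = torch.class('nn.ScaleBy', 'nn.Module')

function ScaleBy:__init(factor)
   parent.__init(self)            -- sets up self.output and self.gradInput
   self.factor = factor
end

function ScaleBy:updateOutput(input)
   self.output:resizeAs(input):copy(input):mul(self.factor)
   return self.output
end

function ScaleBy:updateGradInput(input, gradOutput)
   -- d(factor * x) / dx = factor, so the incoming gradient is scaled the same way
   self.gradInput:resizeAs(gradOutput):copy(gradOutput):mul(self.factor)
   return self.gradInput
end

local layer = nn.ScaleBy(2)
local x = torch.Tensor({1, 2, 3})
local y = layer:forward(x)                    -- calls updateOutput
local gx = layer:backward(x, torch.ones(3))   -- calls updateGradInput, then accGradParameters
print(y, gx)
```

Since the layer has no parameters, the default accGradParameters inherited from nn.Module (which does nothing) is enough.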

Containers

Sequential

One in, one out, similar to nn.Sequential() in PyTorch. Structure:

input -> nn.Sequential()[1] -> nn.Sequential()[2] -> ... -> output
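A minimal sketch of the chain above, with layer sizes picked arbitrarily for illustration:

```lua
require 'nn'

local net = nn.Sequential()
net:add(nn.Linear(10, 5))
net:add(nn.Tanh())
net:add(nn.Linear(5, 2))

local out = net:forward(torch.randn(10))   -- 10 -> 5 -> 5 -> 2
print(out:size())
print(net:get(1))                          -- submodules are indexed from 1
```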

Parallel(indim, outdim)

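Not used very often. This layer first splits the input tensor along indim, processes each slice with the corresponding sub-module, and finally concatenates the results along outdim.

Take nn.Parallel(1, 2) as an example. If it contains n sub-modules, the size of the input tensor along dimension 1 must equal n, for example:

input = torch.randn(n, 100)

During the forward pass, input is first split into: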

(input_1, ..., input_n)

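where each slice is 100-dimensional. The n sub-modules are then called on these n inputs respectively, and the outputs are concatenated along dimension 2.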

input_1 -> nn.Parallel(1,2)[1] -> output_1
input_2 -> nn.Parallel(1,2)[2] -> output_2     concat on dim 2
...                                        ----------------------->  output
input_n -> nn.Parallel(1,2)[n] -> output_n

If the outputs have no dimension 2, or the outputs of the sub-modules do not match in size along the other dimensions, an error is raised.
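Here is a small sketch of nn.Parallel(1, 2) with n = 3. Each branch ends with nn.View(5, 1) so that its output actually has a dimension 2 to concatenate along; the branch layout and the sizes are my own choices for illustration.

```lua
require 'nn'

local n = 3
local par = nn.Parallel(1, 2)      -- split along dim 1, concatenate along dim 2
for i = 1, n do
   local branch = nn.Sequential()
   branch:add(nn.Linear(100, 5))   -- each slice is a 100-dim vector
   branch:add(nn.View(5, 1))       -- 5 -> 5 x 1, so that dim 2 exists
   par:add(branch)
end

local input = torch.randn(n, 100)
local output = par:forward(input)
print(output:size())               -- n columns, one per branch
```

Without the nn.View step each branch would return a 1-D tensor, which is exactly the error case described above.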

Concat(dim)

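One in, many out: every sub-module receives the same input, and the outputs are then concatenated into a single tensor along dim. Structure: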

                     -> output_1      concat
input -> nn.Concat() -> output_2 ---------------> output
                     ...
                     -> output_n
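A minimal sketch of nn.Concat, with arbitrary layer sizes: both branches see the same 5-dimensional input, and their outputs are joined along dimension 1.

```lua
require 'nn'

local cat = nn.Concat(1)
cat:add(nn.Linear(5, 3))
cat:add(nn.Linear(5, 7))

local out = cat:forward(torch.randn(5))
print(out:size())   -- 3 + 7 elements
```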

Table Layers

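Layers that operate on tables. These layers take a Lua table as input, produce one as output, or both. The commonly used ones include: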

ParallelTable

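Many in, many out. Both the input and the output are tables. The layer contains several sub-modules, each of which processes the element at the corresponding position of the input table, and the results are returned as a table. Despite the similar name, it works completely differently from nn.Parallel: this module does not split or concatenate its inputs or outputs in any way: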

input_table = { input_1,  -> nn.ParallelTable()[1] -> { output_1, 
                input_2,  -> nn.ParallelTable()[2] ->   output_2,
                input_3 } -> nn.ParallelTable()[3] ->   output_3 } = output_table
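A minimal sketch with two branches, where the input sizes are chosen arbitrarily: the i-th sub-module only ever sees the i-th table entry.

```lua
require 'nn'

local pt = nn.ParallelTable()
pt:add(nn.Linear(5, 2))    -- handles input[1]
pt:add(nn.Linear(10, 3))   -- handles input[2]

local out = pt:forward({torch.randn(5), torch.randn(10)})
print(#out, out[1]:size(1), out[2]:size(1))
```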

ConcatTable

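One in, many out. The input is a tensor or a table.

If the input is a tensor, it is the shared input of all contained sub-modules. The output is a table holding the output of each sub-module: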

input ->   nn.ConcatTable()[1] -> { output_1,
      ->   nn.ConcatTable()[2] ->   output_2,
      ...                           ...
      ->   nn.ConcatTable()[n] ->   output_n } = output_table

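If the input is a table, the whole table is passed to every sub-module in the same way. The output is still a table with one entry per sub-module, and each entry may itself be a table, so the result can be a table of tables. Nested input tables are supported:

input_table = { input_1,   -> nn.ConcatTable()[1] -> { output_1,
                input_2,   -> nn.ConcatTable()[2] ->   output_2,
                ...           ...                      ...
                input_n }  -> nn.ConcatTable()[m] ->   output_m }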
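The sketch below tries both kinds of input. In each case every sub-module receives the complete input and the output table has one entry per sub-module. The choice of sub-modules (nn.Linear, nn.SelectTable, nn.CAddTable) is mine, picked only to make the results easy to check.

```lua
require 'nn'

-- Tensor input: both Linear layers see the same 5-dim vector.
local ct = nn.ConcatTable()
ct:add(nn.Linear(5, 2))
ct:add(nn.Linear(5, 3))
local out = ct:forward(torch.randn(5))

-- Table input: both sub-modules see the whole table {a, b}.
local both = nn.ConcatTable()
both:add(nn.SelectTable(1))   -- picks a
both:add(nn.CAddTable())      -- computes a + b
local a, b = torch.ones(3), torch.ones(3) * 2
local res = both:forward({a, b})

print(#out, #res)
print(res[1], res[2])
```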

JoinTable(dimension, nInputDims)

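Many in, one out. The input is a table of tensors; the output is the concatenation of all input tensors along dimension. nInputDims specifies the number of dimensions each input tensor is expected to have for a single sample; this parameter exists to support mini-batches.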

input_table = { input_1,                           concat along dim d
                input_2,   ---> nn.JoinTable(d) ------------------------> output
                ...     
                input_n }
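To make the role of nInputDims concrete, here is a small sketch with arbitrary sizes. With nn.JoinTable(1, 1), single samples are 1-D and are joined along dimension 1; when the inputs carry an extra leading batch dimension, the join moves to dimension 2.

```lua
require 'nn'

local join = nn.JoinTable(1, 1)

-- Single sample: two vectors of length 3 and 5 -> one vector of length 8.
local single = join:forward({torch.randn(3), torch.randn(5)}):clone()

-- Mini-batch of 4: inputs are 4x3 and 4x5, joined per sample -> 4x8.
local batch = join:forward({torch.randn(4, 3), torch.randn(4, 5)}):clone()

print(single:size(), batch:size())
```

The clone() calls are there because a module reuses its output buffer between forward calls.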

CAddTable

Many in, one out. The input is a table of tensors; the output is the element-wise sum of all input tensors.

input_table = { input_1, ---> nn.CAddTable() -> output = input_1 + input_2 + ... + input_n
                input_2,
                ...     
                input_n }
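A quick check of the element-wise sum, using constant tensors so the result is obvious:

```lua
require 'nn'

local add = nn.CAddTable()
local sum = add:forward({torch.ones(3), torch.ones(3) * 2, torch.ones(3) * 3})
print(sum)   -- every element is 1 + 2 + 3
```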

For the others, see: https://github.com/torch/nn/blob/master/doc/table.md#nn.TableLayers

In short, table layers manipulate data in a more direct way, without complicated or redundant tensor operations, so they are used more often than containers.

Simple demos

require 'nn'

mlp = nn.Sequential()       -- Create a network that takes a Tensor as input
mlp:add(nn.SplitTable(2))
c = nn.ParallelTable()      -- The two Tensor slices go through two different Linear
c:add(nn.Linear(10, 3))     -- Layers in Parallel
c:add(nn.Linear(10, 7))
mlp:add(c)                  -- Outputting a table with 2 elements
p = nn.ParallelTable()      -- These tables go through two more linear layers separately
p:add(nn.Linear(3, 2))
p:add(nn.Linear(7, 1))
mlp:add(p)
mlp:add(nn.JoinTable(1))    -- Finally, the tables are joined together and output.

pred = mlp:forward(torch.randn(10, 2))
print(pred)

Simple Layers

Simple layers are constructed and used much like their PyTorch counterparts. Some common ones are listed here as examples.

--- Fully connected
nn.Linear(inputDimension, outputDimension, [bias = true])
--- Max
nn.Max(dimension, nInputDim)
--- Min
nn.Min(dimension, nInputDim)
--- Mean
nn.Mean(dimension, nInputDim)
--- Sum
nn.Sum(dimension, nInputDim, sizeAverage, squeeze)
--- Cosine similarity
nn.Cosine(inputSize,outputSize)
--- Skip Connection
nn.Identity()
--- Reshape
nn.Reshape(dimension1, dimension2, ... [, batchMode])
nn.View(sizes)
--- Remove singleton dimensions
nn.Squeeze([dim, numInputDims])
--- Add a singleton dimension
nn.Unsqueeze(pos [, numInputDims])
--- Swap dimensions
nn.Transpose({dim1, dim2} [, {dim3, dim4}, ...])
--- Dropout
nn.Dropout()
--- BatchNorm
nn.BatchNormalization(N [, eps] [, momentum] [,affine])
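As an example of why nn.Identity is listed as a skip connection, here is a sketch of a residual-style block built from nn.ConcatTable, nn.Identity, and nn.CAddTable. The block layout and the size 8 are my own choices for illustration.

```lua
require 'nn'

local lin = nn.Linear(8, 8)
local block = nn.Sequential()
block:add(nn.ConcatTable():add(lin):add(nn.Identity()))   -- { lin(x), x }
block:add(nn.CAddTable())                                 -- lin(x) + x

local x = torch.randn(8)
local y = block:forward(x)
print(y:size())

-- Sanity check: with weight and bias zeroed, only the skip path remains.
lin.weight:zero()
lin.bias:zero()
local yZero = block:forward(x):clone()
print((yZero - x):abs():max())
```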

Convolution Layers

Convolution layers are constructed and used much like their PyTorch counterparts. Some common ones are listed here as examples.

--- Standard convolution
nn.SpatialConvolution(nInputPlane, nOutputPlane, kW, kH, [dW], [dH], [padW], [padH])
--- Dilated convolution
nn.SpatialDilatedConvolution(nInputPlane, nOutputPlane, kW, kH, [dW], [dH], [padW], [padH], [dilationW], [dilationH])
--- Max pooling
nn.SpatialMaxPooling(kW, kH [, dW, dH, padW, padH])
--- Dilated max pooling
nn.SpatialDilatedMaxPooling(kW, kH [, dW, dH, padW, padH, dilationW, dilationH])
--- Adaptive max pooling (output size W x H; global pooling when W = H = 1)
nn.SpatialAdaptiveMaxPooling(W, H)
--- Adaptive average pooling (output size W x H; global pooling when W = H = 1)
nn.SpatialAdaptiveAveragePooling(W, H)
--- Sub-sampling
nn.SpatialSubSampling(nInputPlane, kW, kH, [dW], [dH])
--- Bilinear upsampling
nn.SpatialUpSamplingBilinear(scale)
nn.SpatialUpSamplingBilinear({oheight=H, owidth=W})
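A small sketch that chains a few of these layers and tracks the shapes. Note that kernel, stride, and padding arguments come in width-then-height order. The channel counts and the 32x32 input are arbitrary.

```lua
require 'nn'

local net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 16, 3, 3, 1, 1, 1, 1))   -- 3x3 kernel, stride 1, pad 1
net:add(nn.ReLU())
net:add(nn.SpatialMaxPooling(2, 2, 2, 2))                 -- halves width and height
net:add(nn.SpatialAdaptiveMaxPooling(1, 1))               -- 1x1 output, i.e. global pooling

local img = torch.randn(3, 32, 32)                        -- planes x height x width
local feat = net:forward(img):clone()                     -- 16 x 1 x 1
local pooled = net:get(3).output                          -- 16 x 16 x 16
print(feat:size(), pooled:size())
```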

torch-rnn package

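torch-rnn is a package of recurrent networks for Torch written by Justin Johnson, the author of DenseCap. It contains some mainstream recurrent architectures such as RNN and LSTM, and is used much like the PyTorch equivalents. Two commonly used modules are listed here.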

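Vanilla RNN:

--- h[t] = tanh(Wh h[t-1] + Wx x[t] + b)
nn.VanillaRNN(inputDim, hiddenDim)

--- Usage example: the initial hidden state h0 can be passed in explicitly or left at its default.
rnn = nn.VanillaRNN(inputDim, hiddenDim)
h = rnn:forward({h0, x})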
grad_h0, grad_x = unpack(rnn:backward({h0, x}, grad_h))

h = rnn:forward(x)
grad_x = rnn:backward(x, grad_h)

LSTM:

nn.LSTM(inputDim, hiddenDim)

--- Usage example: both the cell state and the hidden state can be left at their defaults.
lstm = nn.LSTM(inputDim, hiddenDim)
h = lstm:forward({c0, h0, x})
grad_c0, grad_h0, grad_x = unpack(lstm:backward({c0, h0, x}, grad_h))

h = lstm:forward({h0, x})
grad_h0, grad_x = unpack(lstm:backward({h0, x}, grad_h))

h = lstm:forward(x)
grad_x = lstm:backward(x, grad_h)

nngraph

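nngraph is Torch's computation-graph package. With nngraph, Torch gains functionality similar to static computation graphs (as in early TensorFlow).

An input node of the graph is defined as:

input_node = nn.Module_name(Parameters)()

An intermediate node is defined as:

next_node = nn.Module_name(Parameters)(previous_nodes)

Once all nodes are defined, the whole graph is initialized with nn.gModule: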

nn.gModule({input_nodes}, {output_nodes})
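A minimal end-to-end sketch of this pattern. Using nn.Identity()() as the input node is a common idiom; the layer sizes are arbitrary choices of mine.

```lua
require 'nn'
require 'nngraph'

local x = nn.Identity()()                  -- input node: called with no parent
local h = nn.Tanh()(nn.Linear(4, 3)(x))    -- intermediate nodes: called on their parent node
local y = nn.Linear(3, 2)(h)
local net = nn.gModule({x}, {y})           -- wrap the graph into an ordinary nn.Module

local out = net:forward(torch.randn(4))
print(out:size())
```

The resulting gModule behaves like any other module, so it can be placed inside containers or trained with the usual forward and backward calls.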

Visualizing the graph:

graph.dot(mlp.fg, 'MLP')

--- Save to file (writes myMLP.dot and myMLP.svg)
graph.dot(mlp.fg, 'MLP', 'myMLP')

For more (adding annotations to the drawing, debugging, and so on), see https://github.com/torch/nngraph

Simple demos

--- h1: input node, h2: output node
--- h1 -> nn.Linear(20,10) -> nn.Tanh() -> nn.Linear(10,10) -> nn.Tanh() -> nn.Linear(10,1) -> h2
h1 = nn.Linear(20, 10)()
h2 = nn.Linear(10, 1)(nn.Tanh()(nn.Linear(10, 10)(nn.Tanh()(h1))))
mlp = nn.gModule({h1}, {h2})

--- Forward and backward pass example
x = torch.rand(20)
dx = torch.rand(1)
mlp:updateOutput(x)
mlp:updateGradInput(x, dx)
mlp:accGradParameters(x, dx)

The graph above can also be defined in the following equivalent way:

h1 = - nn.Linear(20,10)
h2 = h1
     - nn.Tanh()
     - nn.Linear(10,10)
     - nn.Tanh()
     - nn.Linear(10, 1)
mlp = nn.gModule({h1}, {h2})

An example with two input nodes, where the input to the graph is a table with one entry per input node:

h1 = nn.Linear(20, 20)()
h2 = nn.Linear(10, 10)()
hh1 = nn.Linear(20, 1)(nn.Tanh()(h1))
hh2 = nn.Linear(10, 1)(nn.Tanh()(h2))
madd = nn.CAddTable()({hh1, hh2})
oA = nn.Sigmoid()(madd)
oB = nn.Tanh()(madd)
gmod = nn.gModule({h1, h2}, {oA, oB})

x1 = torch.rand(20)
x2 = torch.rand(10)

gmod:updateOutput({x1, x2})
gmod:updateGradInput({x1, x2}, {torch.rand(1), torch.rand(1)})

Footnote

¹ http://torch.ch/
