[toc]

This note will probably keep being updated, since I often need to lift code from PyTorch and translate it into TensorFlow 😑, so here is a record of the differences.

1. The `groups` parameter of torch `nn.Conv2d`

In torch, `groups` controls the connections between inputs and outputs; both `in_channels` and `out_channels` must be divisible by `groups`.

- `groups=1`: the conventional convolution.
- `groups=2`: equivalent to putting two conv layers side by side, each seeing half of the input channels and producing half of the output channels, with the two outputs concatenated afterwards (see the sketch right after this list).
- `groups=in_channels`: each input channel gets its own filter.
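To make the `groups=2` case concrete, here is a minimal sketch of my own (toy channel counts, not from any real model) that checks a grouped conv against two half-sized convs whose outputs are concatenated:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 4, 8, 8)

# grouped conv with groups=2: 4 input channels -> 6 output channels
conv_g = nn.Conv2d(4, 6, kernel_size=3, padding=1, groups=2, bias=False)

# two half-sized convs, one per half of the input channels
conv_a = nn.Conv2d(2, 3, kernel_size=3, padding=1, bias=False)
conv_b = nn.Conv2d(2, 3, kernel_size=3, padding=1, bias=False)
conv_a.weight.data.copy_(conv_g.weight[:3])   # first group's filters
conv_b.weight.data.copy_(conv_g.weight[3:])   # second group's filters

y_grouped = conv_g(x)
y_split = torch.cat([conv_a(x[:, :2]), conv_b(x[:, 2:])], dim=1)
print(torch.allclose(y_grouped, y_split, atol=1e-6))  # True
```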

Equivalent code for the depthwise case (`groups=in_channels`):

# PyTorch depthwise convolution (groups == in_channels)
nn.Conv2d(expand_size, expand_size, kernel_size=kernel_size,
          stride=stride, padding=kernel_size // 2, groups=expand_size, bias=False)

# Keras equivalent
kl.DepthwiseConv2D(kernel_size=kernel_size,
                   strides=stride, padding='same', use_bias=False)

NOTE:

The convolution kernel produced by pytorch here has shape = [out_channel, 1, kh, kw], while the kernel produced by tflite has shape = [1, kh, kw, out_channel].
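If you need to port pretrained depthwise weights across, a transpose is enough to map between the two layouts. Below is a minimal sketch of my own (`C`, `k` and the 16×16 input are made-up sizes; I'm assuming the Keras depthwise kernel is stored as (kh, kw, in_channels, depth_multiplier)):

```python
import numpy as np
import torch
import torch.nn as nn
import tensorflow as tf
from tensorflow.keras import layers as kl

C, k = 8, 3
torch_conv = nn.Conv2d(C, C, kernel_size=k, padding=k // 2, groups=C, bias=False)
w_torch = torch_conv.weight.detach().numpy()            # (C, 1, kh, kw)

keras_conv = kl.DepthwiseConv2D(kernel_size=k, strides=1, padding='same', use_bias=False)
keras_conv.build((None, 16, 16, C))
# map (C, 1, kh, kw) -> (kh, kw, C, 1)
keras_conv.set_weights([w_torch.transpose(2, 3, 0, 1)])

x = np.random.rand(1, C, 16, 16).astype(np.float32)     # NCHW for torch
y_torch = torch_conv(torch.from_numpy(x)).detach().numpy()
y_tf = keras_conv(tf.constant(x.transpose(0, 2, 3, 1))).numpy()  # NHWC for keras
print(np.allclose(y_torch.transpose(0, 2, 3, 1), y_tf, atol=1e-5))  # True
```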

2. `nn.AdaptiveAvgPool2d` vs `kl.GlobalAveragePooling2D`

`nn.AdaptiveAvgPool2d(1)` is the same as `kl.GlobalAveragePooling2D()`, but note that torch keeps the output 4-dimensional, while tensorflow does not keep those dimensions.

Equivalent code:

x = nn.AdaptiveAvgPool2d(1)(x)
# -----------------------------
pool = kl.GlobalAveragePooling2D()
x = k.backend.expand_dims(k.backend.expand_dims(pool(x), 1), 1)

Of course, directly modifying `GlobalAveragePooling2D` to take a `keepdims=True` argument also works.
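For a quick numerical sanity check of the two snippets above, a sketch of my own (toy shapes; assuming NHWC on the TF side and NCHW on the torch side):

```python
import numpy as np
import torch
import torch.nn as nn
import tensorflow as tf
from tensorflow import keras as k
from tensorflow.keras import layers as kl

x = np.random.rand(2, 5, 5, 3).astype(np.float32)      # NHWC

# torch keeps the spatial dims -> (2, 3, 1, 1)
y_torch = nn.AdaptiveAvgPool2d(1)(torch.from_numpy(x.transpose(0, 3, 1, 2)))

# keras collapses to (2, 3), so expand back to (2, 1, 1, 3)
pool = kl.GlobalAveragePooling2D()
y_tf = k.backend.expand_dims(k.backend.expand_dims(pool(tf.constant(x)), 1), 1)

print(y_torch.shape)   # torch.Size([2, 3, 1, 1])
print(y_tf.shape)      # (2, 1, 1, 3)
print(np.allclose(y_torch.numpy().transpose(0, 2, 3, 1), y_tf.numpy()))  # True
```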

3. `tf.contrib.layers.layer_norm` vs `tf.keras.layers.LayerNormalization` vs `nn.LayerNorm`

tf.contrib.layers.layer_norm

TF's legacy code is honestly a pain. In `tf.contrib.layers.layer_norm`, for an input tensor of shape (4, 10, 10, 3), normalization is performed over (h, w, c), but by default its affine parameters only apply to c:

x = tf.reshape(tf.range(4 * 3 * 10 * 10, dtype=tf.float32), (4, 10, 10, 3))
xout = tf_contrib.layers.layer_norm(x,
                                    center=True, scale=True,
                                    scope='layer_norm')
# observed inside the op:
# mean.shape  = (4, 1, 1, 1)
# gamma.shape = (3,)
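For reference, the behaviour described above can be written out by hand in numpy. This is only a sketch of my reading of it, not the contrib implementation itself (in particular the epsilon here is arbitrary):

```python
import numpy as np

def contrib_style_layer_norm(x, gamma, beta, eps=1e-6):
    """Statistics over (h, w, c); affine parameters broadcast over the last axis only."""
    mean = x.mean(axis=(1, 2, 3), keepdims=True)   # shape (n, 1, 1, 1)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return x_hat * gamma + beta                    # gamma/beta: shape (c,)

x = np.arange(4 * 10 * 10 * 3, dtype=np.float32).reshape(4, 10, 10, 3)
gamma = np.ones(3, dtype=np.float32)
beta = np.zeros(3, dtype=np.float32)
out = contrib_style_layer_norm(x, gamma, beta)
print(out.shape)   # (4, 10, 10, 3)
```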

tf.keras.layers.LayerNormalization

`tf.keras.layers.LayerNormalization` genuinely confuses me. In principle it should normalize over (h, w, c) with affine parameters on c, but the normalization statistics it produces have 400 = 4 × 10 × 10 entries, which is odd. Its default feature axis is -1, yet it doesn't look like it's doing what a LayerNorm should do; it looks as if the batch dimension is being normalized as well. Still, when I test the final output, the result matches expectations... I genuinely don't get it.

inputs_np = tf.convert_to_tensor(
    np.arange(4 * 3 * 10 * 10).reshape((4, 10, 10, 3)), dtype=tf.float32)
inputs = k.Input((10, 10, 3), batch_size=None)
lm = k.layers.LayerNormalization()
lm.weights  # still empty here; the layer is built on the first call
lm_out = lm(inputs)
md = k.Model(inputs, lm_out)
# observed while stepping through the layer:
# scale.shape  (3,)
# mean.shape   (400, 1)

lm_out_np = md(inputs_np)
lm_out_np = lm_out_np.numpy()
np.mean(lm_out_np[0, ...])  # -3.8146972e-08
np.var(lm_out_np[0, ...])   # 0.9985023
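My current reading of the (400, 1) mean: with the default axis=-1, the statistics are computed over the channel axis only, once per (n, h, w) position, which gives 4 × 10 × 10 = 400 locations, and that also explains why the per-sample output still ends up with mean ≈ 0 and variance ≈ 1 for this particular input. A numpy sketch of that interpretation (not keras' actual code; I'm assuming keras' default epsilon of 1e-3):

```python
import numpy as np

x = np.arange(4 * 10 * 10 * 3, dtype=np.float32).reshape(4, 10, 10, 3)

# axis=-1: one mean/variance per (n, h, w) location, i.e. 4*10*10 = 400 of them
mean = x.mean(axis=-1, keepdims=True)       # (4, 10, 10, 1)
var = x.var(axis=-1, keepdims=True)
x_hat = (x - mean) / np.sqrt(var + 1e-3)    # assuming keras' default epsilon

print(mean.size)                            # 400
print(np.mean(x_hat[0]), np.var(x_hat[0]))  # close to 0 and 1
```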
nn.LayerNorm

`nn.LayerNorm` normalizes over (c, h, w), and its affine parameters cover c, h, w. The really annoying problem is that it has no way to reproduce the old TF behaviour, i.e. affine parameters over c only; turning on the affine transform makes the parameter count huge...

inputs = torch.tensor(np.arange(4 * 3 * 10 * 10).reshape((4, 3, 10, 10)), dtype=torch.float32)
lm = nn.LayerNorm([3, 10, 10], elementwise_affine=True)
ln_out = lm(inputs)
lm.weight.shape # torch.Size([3, 10, 10])

I kept digging into its source. In aten/src/ATen/native/layer_norm.h, the input dimensions are split into M*N; with our setup above that is M=4 and N=3*10*10. The CUDA code in aten/src/ATen/native/cuda/layer_norm_kernel.cu then uses RowwiseMomentsCUDAKernel to compute the mean and variance:

template <typename T>
void LayerNormKernelImplInternal(
    const Tensor& X,
    const Tensor& gamma,
    const Tensor& beta,
    int64_t M,
    int64_t N,
    T eps,
    Tensor* Y,
    Tensor* mean,
    Tensor* rstd) {
  DCHECK_EQ(X.numel(), M * N);
  DCHECK(!gamma.defined() || gamma.numel() == N);
  DCHECK(!beta.defined() || beta.numel() == N);
  const T* X_data = X.data_ptr<T>();
  const T* gamma_data = gamma.defined() ? gamma.data_ptr<T>() : nullptr;
  const T* beta_data = beta.defined() ? beta.data_ptr<T>() : nullptr;
  T* Y_data = Y->data_ptr<T>();
  T* mean_data = mean->data_ptr<T>();
  T* rstd_data = rstd->data_ptr<T>();
  cudaStream_t cuda_stream = at::cuda::getCurrentCUDAStream();
  RowwiseMomentsCUDAKernel<T>
      <<<M, cuda_utils::kCUDABlockReduceNumThreads, 0, cuda_stream>>>(
          N, eps, X_data, mean_data, rstd_data);
  LayerNormForwardCUDAKernel<T><<<M, kCUDANumThreads, 0, cuda_stream>>>(
      N, X_data, mean_data, rstd_data, gamma_data, beta_data, Y_data);
  AT_CUDA_CHECK(cudaGetLastError());
}
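To make the M*N split concrete: for our (4, 3, 10, 10) input, M=4 and N=3*10*10=300, and each of the M rows gets its own mean and rstd. A quick numpy check of my own against nn.LayerNorm (assuming the default eps=1e-5):

```python
import numpy as np
import torch
import torch.nn as nn

x = np.arange(4 * 3 * 10 * 10, dtype=np.float32).reshape(4, 3, 10, 10)

# mirror of the M*N split: M = 4 rows, N = 3*10*10 elements per row
rows = x.reshape(4, -1)                        # (M, N)
mean = rows.mean(axis=1, keepdims=True)
rstd = 1.0 / np.sqrt(rows.var(axis=1, keepdims=True) + 1e-5)
y_manual = ((rows - mean) * rstd).reshape(x.shape)

y_torch = nn.LayerNorm([3, 10, 10], elementwise_affine=False)(torch.from_numpy(x))
print(np.allclose(y_manual, y_torch.numpy(), atol=1e-5))  # True
```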

Next let's look at group norm. Given `group`, it splits the model input into N, C, HxW. In aten/src/ATen/native/cuda/group_norm_kernel.cu, when group=1 we have D = C/G = C and N×G = N, i.e. group=1 is equivalent to layer norm, and in that case its learnable parameters have size C, so it can be used as an equivalent of tf.contrib.layers.layer_norm:

template <typename T>
void GroupNormKernelImplInternal(
    const Tensor& X,
    const Tensor& gamma,
    const Tensor& beta,
    int64_t N,
    int64_t C,
    int64_t HxW,
    int64_t group,
    T eps,
    Tensor* Y,
    Tensor* mean,
    Tensor* rstd) {
  using T_ACC = acc_type<T, true>;
  TORCH_CHECK(X.numel() == N * C * HxW);
  TORCH_CHECK(!gamma.defined() || gamma.numel() == C);
  TORCH_CHECK(!beta.defined() || beta.numel() == C);
  if (N == 0) {
    return;
  }
  const int64_t G = group;
  const int64_t D = C / G;
  const T* X_data = X.data_ptr<T>();
  const T* gamma_data = gamma.defined() ? gamma.data_ptr<T>() : nullptr;
  const T* beta_data = beta.defined() ? beta.data_ptr<T>() : nullptr;
  T* Y_data = Y->data_ptr<T>();
  T* mean_data = mean->data_ptr<T>();
  T* rstd_data = rstd->data_ptr<T>();
  const auto kAccType = X.scalar_type() == kHalf ? kFloat : X.scalar_type();
  Tensor a = at::empty({N, C}, X.options().dtype(kAccType));
  Tensor b = at::empty({N, C}, X.options().dtype(kAccType));
  T_ACC* a_data = a.data_ptr<T_ACC>();
  T_ACC* b_data = b.data_ptr<T_ACC>();
  cudaStream_t cuda_stream = at::cuda::getCurrentCUDAStream();
  RowwiseMomentsCUDAKernel<T>
      <<<N * G, cuda_utils::kCUDABlockReduceNumThreads, 0, cuda_stream>>>(
          D * HxW, eps, X_data, mean_data, rstd_data);
  int64_t B = (N * C + kCUDANumThreads - 1) / kCUDANumThreads;
  ComputeFusedParamsCUDAKernel<T><<<B, kCUDANumThreads, 0, cuda_stream>>>(
      N, C, G, mean_data, rstd_data, gamma_data, beta_data, a_data, b_data);
  if (HxW < kCUDANumThreads) {
    B = (N * C * HxW + kCUDANumThreads - 1) / kCUDANumThreads;
    GroupNormForwardSimpleCUDAKernel<T><<<B, kCUDANumThreads, 0, cuda_stream>>>(
        N, C, HxW, X_data, a_data, b_data, Y_data);
  } else {
    GroupNormForwardCUDAKernel<T><<<N * C, kCUDANumThreads, 0, cuda_stream>>>(
        HxW, X_data, a_data, b_data, Y_data);
  }
  AT_CUDA_CHECK(cudaGetLastError());
}
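So, as a practical takeaway, nn.GroupNorm(num_groups=1, ...) gives the "normalize over (c, h, w), affine parameters of size C" behaviour, which is what tf.contrib.layers.layer_norm did. A small sanity check of my own (default eps, freshly initialized affine so it is still the identity):

```python
import numpy as np
import torch
import torch.nn as nn

x = np.arange(4 * 3 * 10 * 10, dtype=np.float32).reshape(4, 3, 10, 10)

# GroupNorm with a single group: statistics over (c, h, w), affine params of size C
gn = nn.GroupNorm(num_groups=1, num_channels=3, affine=True)
ln = nn.LayerNorm([3, 10, 10], elementwise_affine=False)

y_gn = gn(torch.from_numpy(x))
y_ln = ln(torch.from_numpy(x))
print(gn.weight.shape)                        # torch.Size([3])
print(torch.allclose(y_gn, y_ln, atol=1e-5))  # True while gn's affine is the identity
```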