2.3.1 Release Note

1. Important Updates

Version 2.3.1 builds on version 2.3: it fixes known issues and adds precompiled packages that support CUDA 11.6.

2. Training Framework (distributed included)

(1) Function Optimization

API

Modify the paddle.nn.initializer.KaimingUniform and paddle.nn.initializer.KaimingNormal initializers to support multiple types of activation functions. (#43721, #43827)
Optimize data prefetching in paddle.io.DataLoader so that prefetch_factor sets how much prefetched data is cached. This avoids IO blocking when reading large blocks of data. (#43674)

New dynamic graph execution mechanism

Modify the initialization method of optional type Tensor in the new dynamic graph API logic to prevent data exceptions caused by early destruction. (#42561)

New static graph executor

Defer initialization of the thread pools in the executor, to avoid creating thread pools for programs that execute only once (e.g., save, load, startup_program). (#43768)
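The deferred-initialization pattern described above can be sketched generically. This is not Paddle's executor: the class LazyPoolExecutor and its methods are hypothetical, and the sketch only shows that a pool is created on first parallel use, while one-shot work never pays for it.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyPoolExecutor:
    """Hypothetical executor that creates its thread pool on first parallel use."""

    def __init__(self, max_workers=4):
        self._max_workers = max_workers
        self._pool = None  # no threads are allocated at construction time

    @property
    def pool_created(self):
        return self._pool is not None

    def run_once(self, fn, *args):
        # One-shot work runs inline and never touches the pool.
        return fn(*args)

    def run_parallel(self, fn, items):
        # The pool is built only when parallel work is actually requested.
        if self._pool is None:
            self._pool = ThreadPoolExecutor(max_workers=self._max_workers)
        return list(self._pool.map(fn, items))

    def close(self):
        if self._pool is not None:
            self._pool.shutdown()
            self._pool = None
```

A caller that only invokes run_once never allocates worker threads, which mirrors the saving for programs such as save, load, or startup_program.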

Mixed precision training

Disable the state_dict hook in set_state_dict of paddle.nn.Layer. (#43407)

Distributed training

Support tensor model parallelism in paddle.incubate.nn.functional.fused_attention and paddle.incubate.nn.functional.fused_feedforward. (#43505)

Others

Adjust the print format of framework operator kernel strings to facilitate automated splitting and parsing. (#42931)
Update the model quantization API to support round-to-nearest, ties-to-even rounding and the quantization range [-128, 127]. (#43829)
Support AMP mixed precision training in quantization-aware training. (#43689)
Add a progress bar at the start of quantization-aware training, so that quantization initialization progress is easy to check. Skip the scale op when collecting out_threshold statistics to speed up initialization. (#43454)
Support conv and bn fusion in dynamic graph quantization training. Support setting skip_tensor_list in static graph offline quantization to skip quantization for certain layers. (#43301)
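For the rounding mode mentioned in the quantization item above, a minimal sketch of round-to-nearest, ties-to-even with clamping to [-128, 127] follows. The function quantize_int8 and its scale parameter are hypothetical and do not reflect how the Paddle quantization API computes or applies scales.

```python
def quantize_int8(x, scale):
    """Quantize a float to an int8 value: ties-to-even rounding, then clamping."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    # Python's built-in round() rounds exact halves to the nearest even integer.
    q = round(x / scale)
    # Clamp into the signed 8-bit range [-128, 127].
    return max(-128, min(127, q))
```

Under ties-to-even, 1.5 and 2.5 both map to 2, so halfway cases do not drift systematically upward as they would with round-half-up.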

(2) Performance Optimization

Optimize the paddle.incubate.nn.functional.fused_attention and paddle.incubate.nn.functional.fused_feedforward operators. Add the add_residual attribute to control whether the residual is added in the last step. CAE model performance improves by 7.7%. (#43719)
Optimize the linspace operator. Initialize the three input Tensors start, stop, and num on CPU to avoid GPU -> CPU copies in the operator. SOLOv2 model performance improves by 6%. (#43746)

(3) Bug Fix

API

Fix the occasional error reported by paddle.io.DataLoader when return_list=True, caused by a multi-thread conflict. (#43691)
Fix the error where the to method of paddle.nn.Layer reports that NoneType has no device attribute when the layer has a parameter of type None. (#43597)
Fix wrong results of cumsum op for some shapes. (#42500, #43777)
Fix the bug that, in static graph mode, Tensor.__getitem__ with a bool index produces an output of dimension 0 during network construction. (#43246)
Fix the exception raised when paddle.slice and paddle.strided_slice handle negative arguments. (#43432)
Fix the abnormal assignment result of set_value op when the slice step is negative. (#43694)
Fix the bug that the C++ copy interface cannot copy between devices in multi-card setups. (#43728)
Fix the inference-time problem caused by attribute naming in paddle.incubate.nn.functional.fused_attention and paddle.incubate.nn.functional.fused_feedforward. (#43505)
Fix an exception in ConditionalBlockGrad op when processing a Tensor that does not require grad. (#43034)
Fix the device memory increase caused by the backward speed optimization of the C++ einsum op, and enable this backward optimization by default. (#43397)
Fix the bug that multi-process data loading with paddle.io.DataLoader on a single card is not reproducible when the random seed is fixed. (#43702)
Fix the CUDNN_STATUS_NOT_SUPPORT error triggered by softmax op when the Tensor has more than 2G elements. (#43719)
Fix the bug that trace op Event strings do not distinguish between operators, which makes performance analysis inconvenient. (#42789)
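The slice-related fixes above concern negative arguments and negative steps. The release note does not spell out the corrected semantics. Assuming they follow Python's slicing conventions, the sketch below shows how negative start, stop, and step values resolve to concrete indices along one axis. The helper resolve_slice is hypothetical and uses only the built-in slice.indices.

```python
def resolve_slice(length, start, stop, step=1):
    """Return the concrete indices selected along an axis of the given length."""
    # slice.indices() maps negative or missing bounds to in-range values.
    begin, end, stride = slice(start, stop, step).indices(length)
    return list(range(begin, end, stride))
```

For an axis of length 5, resolve_slice(5, -3, -1) selects indices 2 and 3, and resolve_slice(5, -1, None, -2) walks backwards from the last element.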

Others

Fix the device memory overflow caused by repeated deepcopy and save in dynamic-to-static conversion. (#43141)
Fix the wrong device id in multi-card scenarios, introduced by the upgrade of the PlaceType used in custom operators. (#43830)
Optimize the paddle.profiler.Profiler timeline visualization logic: events customized in Python scripts are now shown in the Python folding layer instead of the C++ folding layer. (#42790)

3. Deployment Direction (Paddle Inference)

(1) New Features

New functions

Add PaddleSlim quantized model support to the ONNX Runtime backend on CPU. (#43774, #43796)

(2) Underlying Optimization

CPU performance optimization

Remove gpu_cpu_reshape2_matmul_fuse_pass from the EnableMkldnn configuration to fix the ResNet50 performance degradation. (#43750)

GPU performance optimization

Add TensorRT converter support for bilinear_interp_v2. (#43618)
Add matmul_scale_fuse_pass and multihead_matmul_fuse_pass_v3 to the GPU passes, with unit tests. (#43765)
Add support for deferred initialization of GPU handles. (#43661)

(3) Bug Fix

Framework and API fixes

Fix the compilation error when building together with Paddle-Lite XPU. (#43178)
Fix the false triggering of the ERNIE 3.0 pass. (#43948)
Fix the bug that the int8 quantization attribute cannot be read in multihead op. (#43020)

Backend capability fixes

Fix the crash of the elementwise_mul and matmul ops in MKLDNN during quantized inference. (#43725)
Fix a bug where TensorRT subgraph serialization files are repeatedly generated for the same model during inference. (#42945, #42633)
Fix a conflict between the ONNX Runtime backend and externally used protobuf. (#43159, #43742)
Fix an inference error in the Python prediction library when using the ONNX Runtime backend with multiple inputs. (#43621)

4. Environment Adaptation

Compile and install

Complete verification and adaptation of CUDA 11.6, and release CUDA 11.6 precompiled packages on the official website. (#43935, #44005)
Fix a cub error when compiling with CUDA 11.6 on Windows. (#43935, #44005)
Fix the long compilation time of elementwise and reduce ops. (#43202, #42779, #43205)

New hardware adaptation

Cambricon MLU supports PaddlePaddle Profiler. (#42115)
GraphCore IPU supports displaying compilation progress. (#42078)
