
The accurate estimation of a 3D human pose is of great importance in many fields, such as human–computer interaction, motion recognition and autonomous driving. In view of the difficulty of obtaining 3D ground-truth labels for 3D pose estimation datasets, we take 2D images as the research object in this paper and propose a self-supervised 3D pose estimation model called Pose ResNet. ResNet50 is used as the backbone network to extract features. First, a convolutional block attention module (CBAM) is introduced to refine the selection of significant pixels. Then, a waterfall atrous spatial pooling (WASP) module captures multi-scale contextual information from the extracted features and enlarges the receptive field. Finally, the features are fed into a deconvolution network to produce a volumetric heat map, which is processed by a soft argmax function to obtain the joint coordinates. In addition to the two learning strategies of transfer learning and synthetic occlusion, a self-supervised training method is used, in which 3D labels constructed by an epipolar geometry transformation supervise the training of the network.
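
To make the final regression step concrete, the sketch below shows how joint coordinates can be read out of a volumetric heat map with a differentiable soft argmax. The tensor shapes and the function name soft_argmax_3d are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def soft_argmax_3d(heatmaps: torch.Tensor) -> torch.Tensor:
    """Differentiable argmax over volumetric heat maps.

    heatmaps: (B, J, D, H, W) tensor, one volume per joint.
    Returns:  (B, J, 3) tensor of expected (x, y, z) voxel coordinates.
    """
    b, j, d, h, w = heatmaps.shape
    # Normalise each joint volume into a probability distribution.
    probs = F.softmax(heatmaps.view(b, j, -1), dim=-1).view(b, j, d, h, w)

    # Coordinate grids along each axis.
    zs = torch.arange(d, dtype=probs.dtype, device=probs.device)
    ys = torch.arange(h, dtype=probs.dtype, device=probs.device)
    xs = torch.arange(w, dtype=probs.dtype, device=probs.device)

    # Expected coordinate = sum of probability * coordinate over the volume,
    # computed via the marginal distribution along each axis.
    z = (probs.sum(dim=(3, 4)) * zs).sum(dim=-1)
    y = (probs.sum(dim=(2, 4)) * ys).sum(dim=-1)
    x = (probs.sum(dim=(2, 3)) * xs).sum(dim=-1)
    return torch.stack([x, y, z], dim=-1)
```

For example, a batch of heat maps of shape (2, 17, 64, 64, 64) yields a (2, 17, 3) tensor of voxel coordinates, which can then be mapped back to metric space. Because the operation is differentiable, the coordinate loss can be backpropagated through the deconvolution network and the backbone.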

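The abstract does not spell out how the epipolar geometry transformation yields 3D labels. One common realisation of this kind of multi-view self-supervision is to triangulate 2D joint estimates from two calibrated views and treat the result as pseudo ground truth; the sketch below illustrates that idea with direct linear transform (DLT) triangulation. The function name and the assumption of known projection matrices are hypothetical and not taken from the paper.

```python
import numpy as np

def triangulate_joints(pts1: np.ndarray, pts2: np.ndarray,
                       P1: np.ndarray, P2: np.ndarray) -> np.ndarray:
    """Triangulate pseudo 3D labels from two calibrated views (DLT).

    pts1, pts2: (J, 2) arrays of 2D joint estimates in view 1 and view 2.
    P1, P2:     (3, 4) camera projection matrices.
    Returns:    (J, 3) array of 3D joint positions usable as training targets.
    """
    joints_3d = []
    for (x1, y1), (x2, y2) in zip(pts1, pts2):
        # Each view contributes two linear constraints on the homogeneous point.
        A = np.stack([
            x1 * P1[2] - P1[0],
            y1 * P1[2] - P1[1],
            x2 * P2[2] - P2[0],
            y2 * P2[2] - P2[1],
        ])
        # The least-squares solution is the right singular vector of A
        # associated with the smallest singular value.
        _, _, vt = np.linalg.svd(A)
        X = vt[-1]
        joints_3d.append(X[:3] / X[3])
    return np.asarray(joints_3d)
```

In a setup of this kind, the triangulated joints would play the role of the constructed 3D labels that supervise the single-view network during training.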