Introduction
Person re-identification, a key technology in intelligent surveillance systems, is typically framed as an image retrieval problem. It is essential for surveillance in public places, for instance when locating criminal suspects, and it also applies to intelligent security, epidemiological investigation, and intelligent transportation. Through round-the-clock monitoring, the technology can help prevent crimes such as theft and robbery, locate missing persons, and assist intelligent transportation systems in automatically dispatching people, vehicles, and roads.
When monitoring large amounts of data, traditional manual processing is inefficient and costly. Person re-identification alleviates these problems by quickly locating and tracking targets, which saves labor costs, improves detection accuracy, and gives the technology high application value in intelligent monitoring systems.
Person re-identification aims to search for a target person across surveillance videos captured at different locations and times. Due to factors such as technological limitations, most current research on person re-identification assumes that the target's clothes remain unchanged (Huang et al., 2018; Jin et al., 2022; Li et al., 2018) and therefore uses the color, texture, and other features of the clothes as discriminative cues. However, clothes changing is unavoidable when re-identifying a person over an extended period, and it also occurs in some short-term scenarios; for example, suspects often change clothes to evade identification and tracking. Existing methods no longer apply in clothes-changing scenarios because people wearing similar clothes may be wrongly matched. To address this issue, this article studies clothes-changing person re-identification.
To avoid interference from clothes, some clothes-changing re-identification methods feed additional modalities alongside the input image (Chao et al., 2019; Chen et al., 2021; Qian et al., 2020; Shu et al., 2021; Yang et al., 2019), including three-dimensional (3D) shapes, skeletons, and contours (Chao et al., 2019; Chen et al., 2021; Qian et al., 2020). However, these methods often require additional models to capture the multimodal information, which in turn increases model complexity. In fact, the original images contain rich clothing-independent information that remains largely underutilized.
This article aims to better mine clothing-independent information from the image. It adds a two-level attention module to the model that acts on the features extracted by the backbone network along the spatial and channel dimensions, respectively, producing a multi-scale, fine-grained attention map. By focusing on person-related features, the module captures the semantic information of persons across channels and spatial locations more effectively and suppresses the influence of irrelevant background. To counter the influence of clothes features, this article also adds a clothes classification branch and suppresses the model's sensitivity to clothes features by training this branch. Experiments on popular datasets show that the proposed method is competitive (Shu et al., 2021; Yang et al., 2019).
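To make the two-level design concrete, the sketch below shows one common way to realize sequential channel and spatial attention over a backbone feature map in PyTorch. This is an illustrative CBAM-style implementation, not the article's exact module: the reduction ratio, kernel size, and the ordering of the two branches are assumptions.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Reweights feature channels via a squeeze-and-excitation bottleneck (assumed design)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global spatial average per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights  # per-channel gating


class SpatialAttention(nn.Module):
    """Highlights person-related locations with a conv over channel-pooled maps (assumed design)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = x.mean(dim=1, keepdim=True)       # (B, 1, H, W)
        max_map, _ = x.max(dim=1, keepdim=True)     # (B, 1, H, W)
        weights = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * weights  # per-location gating


class TwoLevelAttention(nn.Module):
    """Applies channel attention, then spatial attention, to backbone features."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.spatial_att(self.channel_att(x))


if __name__ == "__main__":
    feats = torch.randn(2, 64, 24, 12)  # toy backbone feature map (B, C, H, W)
    refined = TwoLevelAttention(64)(feats)
    print(refined.shape)  # attention preserves the feature map shape
```

The output of such a module keeps the backbone feature shape, so it can be dropped between backbone stages without changing the downstream identity and clothes classification heads.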