In the field of computer vision, network architecture is critical to task performance. The Vision Graph Neural Network (ViG) has shown remarkable results on a variety of vision tasks owing to its unique characteristics. However, the lack of multi-scale information in ViG limits its expressive capability. To address this challenge, we propose the Graph Pyramid Pooling Transformer (GPPT), which enhances model performance by introducing multi-scale feature learning. The core advantage of GPPT is its ability to effectively capture and fuse feature information at different scales. Specifically, it first generates multi-level pooled graphs using a graph pyramid pooling structure. Next, it encodes the features at each scale with a weight-shared Graph Convolutional Network (GCN). Then, it strengthens information exchange across scales through a cross-scale feature fusion mechanism. Finally, it captures long-range node dependencies with a transformer module. Experimental results demonstrate that GPPT achieves strong performance across various visual scenes, including image classification and object detection, highlighting its generality and effectiveness.
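The abstract outlines a four-step pipeline (pyramid pooling, weight-shared GCN, cross-scale fusion, transformer). The PyTorch snippet below is a minimal sketch of how such a block could be wired together; all names (`GPPTBlock`, `SharedGCN`, `knn_adjacency`) and design details (k-NN graph construction, average pooling for coarsening, summation for cross-scale fusion) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a GPPT-style block. Shapes, pooling, and fusion choices are
# assumptions for illustration only; they are not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


def knn_adjacency(x, k=8):
    """Build a row-normalized k-NN adjacency from node features x: (B, N, C)."""
    dist = torch.cdist(x, x)                       # (B, N, N) pairwise distances
    idx = dist.topk(k + 1, largest=False).indices  # self + k nearest neighbors
    adj = torch.zeros_like(dist)
    adj.scatter_(-1, idx, 1.0)
    return adj / adj.sum(-1, keepdim=True)         # simple mean aggregation


class SharedGCN(nn.Module):
    """One GCN layer whose weights are reused at every pyramid scale."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        return F.gelu(self.proj(adj @ x))          # aggregate neighbors, then transform


class GPPTBlock(nn.Module):
    """Pyramid pooling -> weight-shared GCN -> cross-scale fusion -> transformer."""
    def __init__(self, dim, scales=(1, 2, 4), heads=4):
        super().__init__()
        self.scales = scales
        self.gcn = SharedGCN(dim)
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True
        )

    def forward(self, x):                          # x: (B, N, C) graph node features
        B, N, C = x.shape
        fused = 0
        for s in self.scales:
            # 1) pyramid pooling: average-pool the N nodes into N // s coarse nodes
            pooled = F.adaptive_avg_pool1d(x.transpose(1, 2), N // s).transpose(1, 2)
            # 2) weight-shared GCN encodes features on the coarsened graph
            out = self.gcn(pooled, knn_adjacency(pooled))
            # 3) cross-scale fusion: upsample back to N nodes and accumulate
            out = F.interpolate(out.transpose(1, 2), size=N, mode="linear",
                                align_corners=False).transpose(1, 2)
            fused = fused + out
        # 4) transformer captures long-range dependencies among the fused nodes
        return self.attn(fused + x)


if __name__ == "__main__":
    feats = torch.randn(2, 196, 64)                # e.g. 14x14 patch nodes, 64-dim
    print(GPPTBlock(64)(feats).shape)              # torch.Size([2, 196, 64])
```

One natural design choice shown here is reusing a single `SharedGCN` instance across scales, which keeps the parameter count independent of the number of pyramid levels; the actual coarsening and fusion operators in GPPT may differ.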
[Neucom] GPPT: Graph Pyramid Pooling Transformer for Visual Scene
2025-06-06 19:56:22
Research