Real-world images often have a highly imbalanced density of content. Some areas are very uniform, for example large patches of blue sky, while other areas are scattered with many small objects. However, the successive grid downsampling strategy commonly used in convolutional deep networks treats all areas equally. As a result, small objects are represented in too few spatial locations, leading to worse performance on tasks such as segmentation. Intuitively, retaining more of the pixels that represent small objects during downsampling preserves important information. To achieve this, we propose AutoFocusFormer (AFF), a local-attention transformer image recognition backbone that performs adaptive downsampling by learning to retain the pixels most important for the task. Since adaptive downsampling generates a set of pixels irregularly distributed on the image plane, we abandon the classic grid structure. Instead, we develop a novel point-based local attention block, facilitated by a balanced clustering module and a learnable neighborhood merging module, which yields representations for our point-based versions of state-of-the-art segmentation heads. Experiments show that AFF improves significantly over baseline models of similar sizes.
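To make the core idea of adaptive downsampling concrete, the following is a minimal, illustrative sketch (not the paper's actual implementation, which merges neighborhoods rather than hard-selecting tokens): each pixel/token receives a learned importance score, and only the highest-scoring fraction is retained, together with its 2-D coordinates, since the surviving points no longer lie on a regular grid. The class name AdaptiveDownsample and the parameter keep_ratio are hypothetical.

```python
import torch
import torch.nn as nn


class AdaptiveDownsample(nn.Module):
    """Illustrative sketch: keep the top `keep_ratio` fraction of tokens by learned importance."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # learned per-token importance score
        self.keep_ratio = keep_ratio

    def forward(self, feats: torch.Tensor, pos: torch.Tensor):
        # feats: (B, N, C) token features; pos: (B, N, 2) pixel coordinates
        B, N, _ = feats.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.score(feats).squeeze(-1)              # (B, N)
        idx = scores.topk(k, dim=1).indices                 # indices of retained tokens
        batch = torch.arange(B, device=feats.device).unsqueeze(1)
        # Retained tokens are irregularly located, so their coordinates
        # must be carried along for downstream point-based attention.
        return feats[batch, idx], pos[batch, idx], scores


if __name__ == "__main__":
    ds = AdaptiveDownsample(dim=64, keep_ratio=0.25)
    x = torch.randn(2, 196, 64)   # e.g., a 14x14 grid of tokens, flattened
    xy = torch.rand(2, 196, 2)    # their 2-D image-plane coordinates
    f, p, s = ds(x, xy)
    print(f.shape, p.shape)       # torch.Size([2, 49, 64]) torch.Size([2, 49, 2])
```

Unlike uniform grid pooling, a learned selection of this kind can keep many points on small, detail-rich objects while dropping points in uniform regions such as sky; the resulting irregular point set is what motivates the balanced clustering and neighborhood merging modules described above.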