Can Transformers predict new syllogisms by composing established ones? More generally, what kinds of targets can these models learn from scratch? Recent work shows that Transformers can be Turing-complete in terms of expressivity, but this does not address learnability. This paper proposes the notion of distribution locality to capture when weak learning is efficiently achievable by regular Transformers: the locality of a distribution measures the least number of tokens that, in addition to the token histogram, are needed to correlate nontrivially with the target. As shown experimentally, and theoretically under additional assumptions, distributions with high locality cannot be learned efficiently. In particular, syllogisms cannot be composed into long chains. Furthermore, we show that (i) an agnostic scratchpad cannot help break the locality barrier, (ii) an educated scratchpad can help if it breaks the locality at each step, and (iii) a notion of 'inductive scratchpad' can break the locality and also improve out-of-distribution generalization, e.g., generalizing to almost double the input size on some arithmetic tasks.
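The locality notion described above can be sketched as a displayed definition; the notation here ($f$, $\mu$, $S$, $g$, $\mathrm{hist}$, and the polynomial correlation threshold) is our own illustrative choice and is not fixed by the abstract:

```latex
% Hedged sketch, not the paper's exact definition: the locality of a target f
% under an input distribution mu over length-n token sequences is the least
% size of a set S of token positions such that those tokens, together with
% the token histogram, correlate non-trivially with the target.
\[
\mathrm{loc}(f,\mu)
\;=\;
\min\Bigl\{\, |S| \;:\;
\bigl|\operatorname{Corr}_{x\sim\mu}\bigl(f(x),\, g(x_S,\,\mathrm{hist}(x))\bigr)\bigr|
\;\ge\; n^{-O(1)}
\ \text{for some function } g \,\Bigr\}.
\]
```

Under this reading, a high-locality target is one for which no small set of tokens (beyond the histogram) carries a detectable signal, which is the regime the abstract claims regular Transformers cannot learn efficiently.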