Sleep staging is a clinically important task for diagnosing various sleep disorders, but performing it at scale remains challenging, in part because it requires clinical expertise. Deep learning models can accomplish the task, but only at the expense of large labeled datasets, which are infeasible to acquire at scale. While self-supervised learning (SSL) can mitigate this need, recent studies on SSL for sleep staging show that performance gains saturate after training with labeled data from only tens of subjects, and therefore cannot match the peak performance attainable with larger datasets. We hypothesize that this rapid saturation stems from applying a pre-training scheme that pre-trains only part of the architecture, i.e., the feature encoder but not the temporal encoder. We therefore propose adopting an architecture that tightly couples feature and temporal encoding, together with a suitable pre-training scheme that pre-trains the entire model. On a sample sleep staging dataset, we find that the proposed scheme offers performance improvements that do not saturate with the size of the labeled training set (e.g., a 3-5% improvement in balanced accuracy across low- to high-labeled-data regimes), resulting in significant reductions in the amount of labeled training data needed for high performance (e.g., by 800 subjects). Based on our findings, we recommend adopting this SSL paradigm in future work on SSL for sleep staging.