Inspired by advances in foundation models for language and vision, we explore the use of transformers and large-scale pretraining for biosignals. In this study, we aim to design a general-purpose architecture for biosignals that can be trained jointly on multiple modalities and easily adapted to new modalities or tasks. The proposed model is designed with three key features: (i) a frequency-aware architecture that efficiently captures local and global information in biosignals by leveraging global filters in the frequency domain; (ii) a channel-independent design that shares encoder weights across channels via general-purpose or modality-specific filters; and (iii) a modality-combining transformer capable of effectively fusing an arbitrary number of modalities. We demonstrate the robustness of the proposed architecture on multiple biosignal datasets, showing that it not only performs better than single-modality models but also outperforms them on transfer learning tasks.
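To make feature (i) concrete, global filtering in the frequency domain is commonly realized as an FFT along the time axis, a learnable complex-valued elementwise filter, and an inverse FFT, applied to each channel independently in line with feature (ii). The following is a minimal, hypothetical PyTorch sketch of that idea, not the paper's actual implementation; the class name GlobalFilterLayer and all shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GlobalFilterLayer(nn.Module):
    """Hypothetical sketch of a frequency-domain global filter:
    FFT -> learnable complex elementwise filter -> inverse FFT.
    Operates on one channel at a time (channel-independent design)."""

    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        # One learnable complex filter (stored as real/imag pairs) per
        # frequency bin and feature dimension; rfft over a length-L
        # signal produces L // 2 + 1 frequency bins.
        self.filter = nn.Parameter(torch.randn(seq_len // 2 + 1, dim, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) for a single biosignal channel
        freq = torch.fft.rfft(x, dim=1)                   # complex spectrum
        freq = freq * torch.view_as_complex(self.filter)  # global mixing across time
        return torch.fft.irfft(freq, n=x.size(1), dim=1)  # back to time domain

if __name__ == "__main__":
    layer = GlobalFilterLayer(seq_len=256, dim=64)
    x = torch.randn(8, 256, 64)  # e.g. 8 windows of a single-channel signal
    print(layer(x).shape)        # torch.Size([8, 256, 64])
```

Because the filter multiplies every frequency bin of the whole sequence, each output timestep depends on all input timesteps, which is what lets such a layer capture both local and global structure without attention over the raw samples.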