The process of human speech production involves coordinated respiratory action. Speech is typically produced by forcing air out of the lungs, with the resulting airflow modulated by the vocal tract, and such phonation is interspersed with moments of inhalation to refill the lungs. Respiratory rate (RR), the number of breaths an individual takes per minute, is a vital sign used to assess overall health, fitness, and general well-being. Existing approaches to measuring RR require specialized equipment or training. Studies have shown that machine learning algorithms can estimate RR from biosensor signals. Speech-based RR estimation could therefore offer an effective way to measure this vital sign without requiring any specialized equipment or sensors. This work investigates a machine learning approach to estimate RR from speech segments obtained from subjects speaking into a nearby conversational microphone device. Data were collected from N=26 individuals, with ground-truth RR obtained via commercial-grade chest straps and subsequently corrected manually for errors. A Convolutional Long Short-Term Memory (Conv-LSTM) network is proposed to estimate the respiration time series from the speech signal. We show that pre-trained representations obtained from a base model, such as WAV2VEC2, can be used to estimate the respiration time series with low mean squared error and a high correlation coefficient compared to the baseline. The model-predicted time series can then be used to estimate RR with a low mean absolute error (MAE) of ≈ 1.6 breaths/min.
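To make the described pipeline concrete, the following is a minimal PyTorch sketch of the two stages the abstract names: a Conv-LSTM that maps per-frame speech representations to a respiration time series, and a post-hoc step that converts that waveform to RR. Everything here is illustrative, not the authors' implementation: the 768-dim feature size (as in the WAV2VEC2 base model), the 50 Hz frame rate, the layer sizes, and the zero-crossing rule for counting breaths are all assumptions.

```python
import numpy as np
import torch
import torch.nn as nn


class ConvLSTM(nn.Module):
    """Conv-LSTM regressor: per-frame speech features -> per-frame respiration value."""

    def __init__(self, feat_dim=768, hidden=128):
        super().__init__()
        # 1-D convolution over the feature sequence (WAV2VEC2 base emits 768-dim
        # frames at roughly 50 Hz; both numbers are assumptions here).
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # one respiration sample per frame

    def forward(self, x):                                 # x: (batch, time, feat_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, hidden)
        h, _ = self.lstm(h)                               # (batch, time, 2*hidden)
        return self.head(h).squeeze(-1)                   # (batch, time) waveform


def rr_breaths_per_min(waveform, frame_rate_hz=50.0):
    """Count breath cycles as rising zero crossings of the mean-removed waveform
    (an illustrative stand-in for whatever peak detection the paper uses)."""
    w = np.asarray(waveform) - np.mean(waveform)
    cycles = np.sum((w[:-1] < 0) & (w[1:] >= 0))
    minutes = len(w) / frame_rate_hz / 60.0
    return cycles / minutes


# Example: a 30-second utterance of random "features" (1500 frames at 50 Hz).
model = ConvLSTM()
resp = model(torch.randn(1, 1500, 768)).squeeze(0).detach().numpy()
print(f"estimated RR: {rr_breaths_per_min(resp):.1f} breaths/min")
```

A regression loss such as per-frame MSE against the chest-strap respiration signal would train this model; the reported correlation coefficient and the ≈ 1.6 breaths/min MAE would then be computed on the held-out waveforms and their derived RR values, respectively.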