When working on data science projects, a fundamental component to set up is the data collection pipeline. Real-world machine learning differs from Kaggle-like problems mainly because the data is not static: we need to scrape websites, pull data from APIs, and so on. This way of collecting data may seem chaotic, and it is! That's why we need to structure our code following best practices, to bring some order to the mess.
Once you have identified the sources you want to collect data from, you need to gather it in a structured way before storing it in your database. For example, you could decide that to train your LLM, each record needs three fields: author, content, and link.
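To make this concrete, here is a minimal sketch of such a record as a Python dataclass. The class name `Document` and the field types are assumptions for illustration, not a fixed schema:

```python
from dataclasses import dataclass

# Hypothetical record schema: one entry per collected item.
@dataclass
class Document:
    author: str   # who wrote the piece
    content: str  # the raw text we will later clean and use for training
    link: str     # URL of the original source

doc = Document(
    author="Jane Doe",
    content="An article about MLOps best practices...",
    link="https://example.com/mlops-article",
)
```

Having a single, explicit record type like this keeps every scraper and API client in the project producing data of the same shape.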
You could simply download the data and then write ad-hoc SQL queries to store and retrieve it from your database. More commonly, though, you will want to implement all the queries you need as CRUD operations. CRUD stands for create, read, update, and delete: the four basic functions of persistent storage.
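As a sketch of what that looks like in practice, here is a tiny `DocumentStore` class built on Python's standard-library `sqlite3`. The class, table, and method names are assumptions for illustration; a real project might use an ORM or a different database entirely:

```python
import sqlite3

# Minimal CRUD wrapper around SQLite; the table mirrors the
# three-field (author, content, link) schema above.
class DocumentStore:
    def __init__(self, path: str = "documents.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS documents ("
            "id INTEGER PRIMARY KEY, author TEXT, content TEXT, link TEXT)"
        )

    def create(self, author: str, content: str, link: str) -> int:
        # C: insert a new record and return its generated id
        cur = self.conn.execute(
            "INSERT INTO documents (author, content, link) VALUES (?, ?, ?)",
            (author, content, link),
        )
        self.conn.commit()
        return cur.lastrowid

    def read(self, doc_id: int):
        # R: fetch one record by id (None if it does not exist)
        cur = self.conn.execute(
            "SELECT author, content, link FROM documents WHERE id = ?",
            (doc_id,),
        )
        return cur.fetchone()

    def update(self, doc_id: int, content: str) -> None:
        # U: replace the content of an existing record
        self.conn.execute(
            "UPDATE documents SET content = ? WHERE id = ?", (content, doc_id)
        )
        self.conn.commit()

    def delete(self, doc_id: int) -> None:
        # D: remove a record by id
        self.conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
        self.conn.commit()
```

With this in place, storing and retrieving a scraped item becomes `doc_id = store.create(doc.author, doc.content, doc.link)` followed by `store.read(doc_id)`, and every part of the pipeline goes through the same four well-defined operations.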