Monitoring and extracting trends from web content has become essential for market research, content creation, and staying ahead in your field. In this tutorial, we provide a practical guide to building your own trend-finding tool with Python. Without external APIs or complex configurations, you will learn how to scrape publicly accessible websites, apply powerful NLP (natural language processing) techniques such as sentiment analysis and topic modeling, and visualize emerging trends using dynamic word clouds.
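Before running the steps below, make sure the libraries used throughout the tutorial are available. A minimal setup sketch, assuming a fresh Colab or local Python environment (adjust to your own setup as needed):

# Install the dependencies used in this tutorial (one-time setup)
!pip install requests beautifulsoup4 nltk textblob scikit-learn wordcloud matplotlib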
import requests
from bs4 import BeautifulSoup

# List of URLs to scrape
urls = ["https://en.wikipedia.org/wiki/Natural_language_processing",
        "https://en.wikipedia.org/wiki/Machine_learning"]

collected_texts = []  # to store text from each page

for url in urls:
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract all paragraph text
        paragraphs = [p.get_text() for p in soup.find_all('p')]
        page_text = " ".join(paragraphs)
        collected_texts.append(page_text.strip())
    else:
        print(f"Failed to retrieve {url}")
With the code fragment above, we demonstrate a straightforward way to scrape textual data from publicly accessible websites using Python's requests and BeautifulSoup libraries. It fetches the content of the specified URLs, extracts the paragraphs from the HTML, and prepares them for further NLP analysis by combining the text data into structured strings.
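As a quick sanity check, not part of the original walkthrough, you can confirm how much text was actually collected before moving on. A minimal sketch, assuming the scraping loop above has run:

# Inspect what was collected (illustrative check)
for i, text in enumerate(collected_texts, 1):
    print(f"Document {i}: {len(text)} characters, preview: {text[:80]}...")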
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
cleaned_texts = []

for text in collected_texts:
    # Remove non-alphabetical characters and lower the text
    text = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    # Remove stopwords
    words = [w for w in text.split() if w not in stop_words]
    cleaned_texts.append(" ".join(words))
Next, we clean the scraped text by lowercasing it, removing punctuation and special characters, and filtering out common English stopwords using NLTK. This preprocessing ensures the text data is clean, focused, and ready for meaningful NLP analysis.
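To see what this preprocessing actually does, you can run the same steps on a short sample sentence. A minimal sketch; the example string is ours, not taken from the scraped pages:

# Illustrative example on a made-up sentence (assumes stop_words is defined as above)
sample = "NLP, in 2024, powers chatbots & search engines!"
sample_clean = re.sub(r'[^A-Za-z\s]', ' ', sample).lower()
print([w for w in sample_clean.split() if w not in stop_words])
# Expected output (roughly): ['nlp', 'powers', 'chatbots', 'search', 'engines']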
from collections import Counter
# Combine all texts into one if analyzing overall trends:
all_text = " ".join(cleaned_texts)
word_counts = Counter(all_text.split())
common_words = word_counts.most_common(10) # top 10 frequent words
print("Top 10 keywords:", common_words)
Now, we compute word frequencies from the cleaned text data and identify the 10 most frequent keywords. This highlights dominant trends and recurring topics across the collected documents, providing immediate insight into which themes are popular or significant within the scraped content.
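If you also want keyword trends per document rather than across the whole corpus, the same Counter approach applies document by document. A minimal sketch, offered as an optional extension rather than part of the original tutorial:

# Top keywords per individual document (illustrative extension)
for i, text in enumerate(cleaned_texts, 1):
    doc_counts = Counter(text.split())
    print(f"Document {i} top 5 keywords:", doc_counts.most_common(5))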
!pip install textblob
from textblob import TextBlob

for i, text in enumerate(cleaned_texts, 1):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:
        sentiment = "Positive"
    elif polarity < -0.1:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    print(f"Document {i} Sentiment: {sentiment} (polarity={polarity:.2f})")
We perform sentiment analysis on each cleaned text document using TextBlob, a Python library built on top of NLTK. It evaluates the overall emotional tone of each document, positive, negative, or neutral, and prints the sentiment along with a numerical polarity score, providing a quick indication of the general mood or attitude within the text data.
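TextBlob's sentiment object also exposes a subjectivity score (0 = objective, 1 = subjective), which can complement polarity when judging whether a trend is being reported factually or opinionated about. A minimal sketch, as an optional extension:

# Polarity plus subjectivity for each document (illustrative extension)
for i, text in enumerate(cleaned_texts, 1):
    sent = TextBlob(text).sentiment
    print(f"Document {i}: polarity={sent.polarity:.2f}, subjectivity={sent.subjectivity:.2f}")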
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Adjust these parameters
vectorizer = CountVectorizer(max_df=1.0, min_df=1, stop_words="english")
doc_term_matrix = vectorizer.fit_transform(cleaned_texts)

# Fit LDA to find topics (for instance, 3 topics)
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(doc_term_matrix)

feature_names = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx + 1}: ", [feature_names[i] for i in topic.argsort()[:-11:-1]])
Then, we apply Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, to discover underlying themes in the text corpus. We first transform the cleaned texts into a numerical document-term matrix using scikit-learn's CountVectorizer, then fit an LDA model to identify the primary topics. The output lists the top keywords for each discovered topic, concisely summarizing the key concepts in the collected data.
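Beyond the top keywords per topic, it can be useful to see how strongly each scraped document is associated with each topic. A minimal sketch using the same fitted model; lda.transform returns the per-document topic proportions:

# Per-document topic distribution (illustrative extension)
doc_topic = lda.transform(doc_term_matrix)
for i, dist in enumerate(doc_topic, 1):
    top_topic = dist.argmax() + 1
    print(f"Document {i} topic mix: {dist.round(2)} (dominant topic: {top_topic})")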
# Assuming you have your scraped text stored in collected_texts
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import re

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Preprocess and clean the text:
cleaned_texts = []
for text in collected_texts:
    text = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    words = [w for w in text.split() if w not in stop_words]
    cleaned_texts.append(" ".join(words))

# Generate combined text
combined_text = " ".join(cleaned_texts)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white", colormap='viridis').generate(combined_text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Word Cloud of Scraped Text", fontsize=16)
plt.show()
Finally, we generate a word cloud visualization that displays the most prominent keywords from the combined, cleaned text data. By visually emphasizing the most frequent and relevant terms, this approach allows intuitive exploration of the main trends and topics in the collected web content.
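If you want to keep the visualization for a report or dashboard rather than only displaying it inline, the WordCloud object can also be written straight to an image file. A minimal sketch; the filename here is an arbitrary choice of ours:

# Save the word cloud to disk ("trend_wordcloud.png" is an illustrative filename)
wordcloud.to_file("trend_wordcloud.png")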
Word cloud output generated from the scraped pages
In conclusion, we have successfully built a robust and interactive trend-finding tool. This exercise provided hands-on experience with web scraping, NLP analysis, topic modeling, and intuitive visualization using word clouds. With this powerful yet straightforward approach, you can continuously track industry trends, gain valuable insights from social and blog content, and make informed decisions based on real-time data.
Here is the Colab Notebook.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a broad audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.