The rapid growth of web content makes it challenging to extract and summarize relevant information efficiently. In this tutorial, we demonstrate how to leverage Firecrawl for web scraping and process the extracted data using AI models such as Google Gemini. By integrating these tools in Google Colab, we create an end-to-end workflow that scrapes web pages, retrieves meaningful content, and generates concise summaries using state-of-the-art language models. Whether you want to automate research, extract insights from articles, or build AI-powered applications, this tutorial provides a robust and adaptable solution.
!pip install google-generativeai firecrawl-py
First, we install the two essential libraries needed for this tutorial. google-generativeai provides access to the Google Gemini API for AI-powered text generation, while firecrawl-py enables web scraping by fetching web-page content in a structured format.
import os
from getpass import getpass
# Input your API keys (they will be hidden as you type)
os.environ("FIRECRAWL_API_KEY") = getpass("Enter your Firecrawl API key: ")
Next, we securely set the Firecrawl API key as an environment variable in Google Colab. getpass() prompts the user for the key without echoing it, preserving confidentiality. Storing the key in os.environ makes it available to Firecrawl's scraping functions throughout the session.
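The prompt-and-store pattern above can be wrapped in a small helper so that re-running a cell does not prompt again once the key is set. This is a hypothetical convenience function of our own, not part of either library:

```python
import os
from getpass import getpass

def get_api_key(var_name: str) -> str:
    """Return an API key from the environment, prompting once if absent.

    The key is cached in os.environ so later cells can reuse it.
    """
    key = os.environ.get(var_name)
    if not key:
        key = getpass(f"Enter your {var_name}: ")
        os.environ[var_name] = key
    return key
```

Calling `get_api_key("FIRECRAWL_API_KEY")` then works the same way in a fresh session (prompting) and in an already-configured one (no prompt).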
from firecrawl import FirecrawlApp
firecrawl_app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
target_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
result = firecrawl_app.scrape_url(target_url)
page_content = result.get("markdown", "")
print("Scraped content length:", len(page_content))
We initialize Firecrawl by creating a FirecrawlApp instance with the stored API key. We then scrape a specified web page (here, the Wikipedia article on the Python programming language) and extract its content in Markdown format. Finally, we print the length of the scraped content, which lets us verify successful retrieval before further processing.
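Scraped Markdown often carries navigation links and runs of blank lines. Before sending it to a language model, a light cleanup pass can help; the following is a minimal, hypothetical helper (not part of firecrawl-py) sketching one way to do it:

```python
import re

def clean_markdown(text: str, max_chars: int = 4000) -> str:
    """Lightly normalize scraped Markdown before sending it to an LLM."""
    # Replace Markdown links [label](url) with just the label text.
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Truncate at a word boundary to stay within a rough character budget.
    if len(text) > max_chars:
        text = text[:max_chars].rsplit(" ", 1)[0]
    return text.strip()
```

You could then pass `clean_markdown(page_content)` to the summarization step instead of the raw Markdown.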
import google.generativeai as genai
from getpass import getpass
# Securely input your Gemini API Key
GEMINI_API_KEY = getpass("Enter your Google Gemini API Key: ")
genai.configure(api_key=GEMINI_API_KEY)
We securely initialize the Google Gemini API by capturing the API key with getpass(), preventing it from appearing in plain text. The genai.configure(api_key=GEMINI_API_KEY) call sets up the API client, enabling interaction with Google Gemini for text generation and summarization tasks. This ensures secure authentication before making requests to the AI model.
for model in genai.list_models():
    print(model.name)
We iterate over the models available through the Google Gemini API using genai.list_models() and print their names. This helps users verify which models are accessible with their API key and select an appropriate one for tasks such as text generation or summarization. If a given model is not listed, this step helps with debugging and choosing an alternative.
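Not every listed model supports text generation; the model records returned by genai.list_models() expose a supported_generation_methods field that can be filtered on. Below is a minimal sketch of that filtering logic using stand-in records instead of a live API call, so nothing here depends on a key:

```python
from types import SimpleNamespace

def text_capable(models):
    """Return names of models that advertise the generateContent method."""
    return [
        m.name
        for m in models
        if "generateContent" in m.supported_generation_methods
    ]

# Stand-in records mimicking the shape of what genai.list_models() yields.
catalog = [
    SimpleNamespace(name="models/gemini-1.5-pro",
                    supported_generation_methods=["generateContent", "countTokens"]),
    SimpleNamespace(name="models/embedding-001",
                    supported_generation_methods=["embedContent"]),
]
```

In a live session you would call `text_capable(genai.list_models())` to narrow the printed list to usable chat/summarization models.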
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(f"Summarize this:\n\n{page_content[:4000]}")
print("Summary:\n", response.text)
Finally, we initialize the Gemini 1.5 Pro model with genai.GenerativeModel("gemini-1.5-pro") and send a request to summarize the scraped content. We limit the input to the first 4,000 characters to stay within API limits. The model processes the request and returns a concise summary, which we then print, giving a structured, AI-generated overview of the extracted web page.
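Truncating to 4,000 characters discards most of a long article. An alternative (a sketch of our own, not part of the tutorial's code) is a map-reduce style pass: split the text into chunks, summarize each, then summarize the summaries. The pure-Python chunking step might look like this:

```python
def chunk_text(text: str, max_chars: int = 4000):
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    chunks = []
    while text:
        if len(text) <= max_chars:
            chunks.append(text)
            break
        cut = text.rfind("\n\n", 0, max_chars)  # try a paragraph boundary
        if cut <= 0:
            cut = max_chars  # fall back to a hard cut mid-paragraph
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    return chunks

# Each chunk could then be passed to model.generate_content in turn,
# and the per-chunk summaries combined and summarized once more.
```

This keeps every request under the character budget while still covering the whole page.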
In conclusion, by combining Firecrawl and Google Gemini, we have created an automated pipeline that scrapes web content and generates meaningful summaries with minimal effort. This tutorial demonstrated an AI-powered workflow that remains flexible in the face of API availability and quota limits. Whether you are working on NLP, research automation, or content aggregation, this approach enables efficient data extraction and summarization at scale.
Here is the Colab notebook.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a broad audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.