The rapid growth of web content presents a challenge for efficiently extracting and summarizing relevant information. In this tutorial, we demonstrate how to leverage Firecrawl for web scraping and process the extracted data using AI models like Google Gemini. By integrating these tools in Google Colab, we create an end-to-end workflow that scrapes web pages, retrieves meaningful content, and generates concise summaries using state-of-the-art language models. Whether you want to automate research, extract insights from articles, or build AI-powered applications, this tutorial provides a robust and adaptable solution.
!pip install google-generativeai firecrawl-py
First, we install google-generativeai and firecrawl-py, the two essential libraries required for this tutorial. google-generativeai provides access to Google's Gemini API for AI-powered text generation, while firecrawl-py enables web scraping by fetching content from web pages in a structured format.
import os
from getpass import getpass
# Enter your API keys (they will be hidden as you type)
os.environ["FIRECRAWL_API_KEY"] = getpass("Enter your Firecrawl API key: ")
Then we securely set the Firecrawl API key as an environment variable in Google Colab. Using getpass() prompts the user for the API key without displaying it, ensuring confidentiality. Storing the key in os.environ allows seamless authentication for Firecrawl's web scraping capabilities throughout the session.
from firecrawl import FirecrawlApp
firecrawl_app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
target_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
result = firecrawl_app.scrape_url(target_url)
page_content = result.get("markdown", "")
print("Scraped content length:", len(page_content))
We initialize Firecrawl by creating a FirecrawlApp instance with the stored API key. We then scrape the content of a specified webpage (in this case, Wikipedia's Python programming language page) and extract the data in Markdown format. Finally, we print the length of the scraped content, allowing us to verify successful retrieval before further processing.
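Before passing the scraped text to a model, it can help to confirm the scrape actually returned markdown. The snippet below is a minimal, hypothetical guard; `page_content` here is a stand-in string, whereas in the notebook it comes from the scrape_url call above.

```python
# Hypothetical guard for the scrape step. `page_content` is a stand-in string
# here; in the notebook it comes from firecrawl_app.scrape_url() above.
page_content = "# Python (programming language)\n\nPython is a high-level, general-purpose language."

if not page_content:
    raise RuntimeError("Scrape returned no markdown; check the URL or API key.")

# Preview the first 120 characters to sanity-check the extraction.
preview = page_content[:120]
print(preview)
```

A quick preview like this catches empty or malformed responses before any Gemini quota is spent.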
import google.generativeai as genai
from getpass import getpass
# Securely enter your Gemini API Key
GEMINI_API_KEY = getpass("Enter your Google Gemini API Key: ")
genai.configure(api_key=GEMINI_API_KEY)
We initialize the Google Gemini API by securely capturing the API key with getpass(), preventing it from being displayed in plain text. The genai.configure(api_key=GEMINI_API_KEY) call sets up the API client, allowing seamless interaction with Google's Gemini AI for text generation and summarization tasks. This ensures secure authentication before making requests to the model.
for model in genai.list_models():
    print(model.name)
We iterate through the models available in the Google Gemini API using genai.list_models() and print their names. This helps verify which models are accessible with your API key and select a suitable one for tasks like text generation or summarization. If a model is not found, this step aids in debugging and choosing an alternative.
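The listing can also be filtered programmatically rather than read by eye. The sketch below uses illustrative stand-in data; in the notebook, each entry would come from genai.list_models(), whose items expose a name and the generation methods they support.

```python
# Hedged sketch: choose the first model that advertises generateContent.
# `available` is illustrative stand-in data, not a real API response; in the
# notebook these entries would come from genai.list_models().
available = [
    {"name": "models/embedding-001", "methods": ["embedContent"]},
    {"name": "models/gemini-1.5-pro", "methods": ["generateContent"]},
]

chosen = next(m["name"] for m in available if "generateContent" in m["methods"])
print(chosen)  # models/gemini-1.5-pro
```

Selecting by capability rather than hard-coding a model name makes the notebook more resilient to quota limits or model deprecations.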
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(f"Summarize this:\n\n{page_content[:4000]}")
print("Summary:\n", response.text)
Finally, we initialize the Gemini 1.5 Pro model with genai.GenerativeModel("gemini-1.5-pro") and send a request to generate a summary of the scraped content. The input text is limited to 4,000 characters to stay within API constraints. The model processes the request and returns a concise summary, which is then printed, providing a structured, AI-generated overview of the extracted webpage content.
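For pages longer than the 4,000-character cutoff, one common workaround is to summarize the text in chunks. The helper below is a hypothetical sketch, not part of the Gemini API; each chunk could be passed to generate_content separately and the partial summaries combined.

```python
# Hedged sketch: split text longer than the 4,000-character limit into chunks,
# each of which could be summarized in its own request. chunk_text is a
# hypothetical helper, not part of google-generativeai.
def chunk_text(text, size=4000):
    return [text[i:i + size] for i in range(0, len(text), size)]

sample = "x" * 9500  # stand-in for page_content
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [4000, 4000, 1500]
```

Chunking trades one large request for several smaller ones, which also helps stay within per-request token limits.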
In conclusion, by combining Firecrawl and Google Gemini, we have created an automated pipeline that scrapes web content and generates meaningful summaries with minimal effort. This tutorial showcased multiple AI-powered options, allowing flexibility based on API availability and quota constraints. Whether you're working on NLP applications, research automation, or content aggregation, this approach enables efficient data extraction and summarization at scale.
Here is the Colab Notebook.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.