In this tutorial, we will learn how to build an interactive multimodal image captioning application using the Google Colab platform, Salesforce's powerful BLIP model, and Streamlit for an intuitive web interface. Multimodal models, which combine image and text processing capabilities, have become increasingly important in AI applications, enabling tasks such as image captioning, visual question answering, and more. This step-by-step guide aims for a smooth setup, clearly addresses common difficulties, and demonstrates how to integrate and deploy advanced AI solutions, even without extensive experience.
!pip install transformers torch torchvision streamlit Pillow pyngrok
First, we install transformers, torch, torchvision, streamlit, Pillow, and pyngrok, all the dependencies needed to build a multimodal image captioning application. These are Transformers (for the BLIP model), Torch and Torchvision (for deep learning and image processing), Streamlit (to create the user interface), Pillow (to handle image files), and pyngrok (to expose the application online through ngrok).
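As a quick sanity check after the install, the following sketch reports which of these packages are present and at what version. It relies only on the standard library's `importlib.metadata`; the helper name `installed_versions` and the `REQUIRED` list are our own illustrative names, not part of any of the libraries above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the install command above.
REQUIRED = ["transformers", "torch", "torchvision", "streamlit", "Pillow", "pyngrok"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as missing can be reinstalled before moving on to the application code.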
%%writefile app.py
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration
import streamlit as st
from PIL import Image

# Use the GPU when Colab provides one, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

@st.cache_resource
def load_model():
    # Cached so the weights are loaded once, not on every Streamlit rerun
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(device)
    return processor, model

processor, model = load_model()

st.title("Image Captioning with BLIP")
uploaded_file = st.file_uploader("Upload your image:", type=("jpg", "jpeg", "png"))

if uploaded_file is not None:
    image = Image.open(uploaded_file).convert("RGB")
    st.image(image, caption="Uploaded Image", use_column_width=True)

    if st.button("Generate Caption"):
        inputs = processor(image, return_tensors="pt").to(device)
        outputs = model.generate(**inputs)
        # outputs is a batch of token-ID sequences; decode the first one
        caption = processor.decode(outputs[0], skip_special_tokens=True)
        st.markdown(f"### **Caption:** {caption}")
Next, we create a Streamlit-based multimodal image captioning application powered by the BLIP model. The code first loads BlipProcessor and BlipForConditionalGeneration from Hugging Face, allowing the model to process images and generate captions. The Streamlit user interface lets users upload an image, displays it, and generates a caption at the click of a button. The `@st.cache_resource` decorator ensures the model is loaded only once, and CUDA is used when available for faster processing.
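The captioning step can also be factored into a plain function, which makes it easier to reuse outside the button handler. The sketch below is one possible arrangement, not the tutorial's own code: the function name `generate_caption` and its `prompt` and `max_new_tokens` parameters are our additions. It assumes an already-loaded BLIP processor and model are passed in, and it uses the optional text prefix that the BLIP processor accepts for conditional captioning.

```python
def generate_caption(image, processor, model, device="cpu", prompt=None, max_new_tokens=30):
    """Return a caption for a PIL image using an already-loaded BLIP processor and model.

    prompt is an optional text prefix (for example "a photograph of") that the
    generated caption will continue; with prompt=None the caption is unconditional.
    """
    if prompt is None:
        inputs = processor(image, return_tensors="pt")
    else:
        inputs = processor(image, prompt, return_tensors="pt")

    # Move the input tensors to the same device as the model
    inputs = inputs.to(device)

    # max_new_tokens caps the caption length (30 is an arbitrary illustrative value)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # generate() returns a batch of token-ID sequences; decode the first one
    return processor.decode(outputs[0], skip_special_tokens=True).strip()
```

Inside `app.py`, the button handler could then call `generate_caption(image, processor, model, device)` and pass the result to `st.markdown`.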
from pyngrok import ngrok
NGROK_TOKEN = "use your own NGROK token here"
ngrok.set_auth_token(NGROK_TOKEN)
public_url = ngrok.connect(8501)
print("Your Streamlit app is available at:", public_url)
# run streamlit app
!streamlit run app.py &>/dev/null &
Finally, we set up a publicly accessible Streamlit application that runs on Google Colab using ngrok. The code does the following:
- Authenticates ngrok using your personal token (`NGROK_TOKEN`) to create a secure tunnel.
- Exposes the Streamlit application running on port `8501` at an external URL through `ngrok.connect(8501)`.
- Prints the public URL, which can be used to access the application in any browser.
- Starts the Streamlit application (`app.py`) in the background.
This method lets you interact remotely with your image captioning application, even though Google Colab does not provide direct web hosting.
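One variation on these steps is to start Streamlit first and open the tunnel only once the server is actually accepting connections, so the public URL never points at a port that is not yet listening. The sketch below illustrates that ordering under a few assumptions: the helper names `wait_for_port` and `launch` are ours, the token is read from an environment variable we have chosen to call `NGROK_TOKEN`, and the timeout values are arbitrary.

```python
import os
import socket
import subprocess
import time


def wait_for_port(port, host="127.0.0.1", timeout=30.0, interval=0.5):
    """Return True once something accepts TCP connections on host:port, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; pause briefly and retry
            time.sleep(interval)
    return False


def launch(app_path="app.py", port=8501):
    """Start Streamlit, wait until it is reachable, then open an ngrok tunnel."""
    # Imported here so wait_for_port can be used without pyngrok installed
    from pyngrok import ngrok

    # Assumption: the token is stored in an environment variable named NGROK_TOKEN
    ngrok.set_auth_token(os.environ["NGROK_TOKEN"])

    # Run Streamlit in the background, discarding its console output
    server = subprocess.Popen(
        ["streamlit", "run", app_path, "--server.port", str(port), "--server.headless", "true"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if not wait_for_port(port):
        server.terminate()
        raise RuntimeError(f"Streamlit did not start on port {port}")

    tunnel = ngrok.connect(port)
    return server, tunnel.public_url
```

In a Colab cell, `server, url = launch()` followed by `print(url)` would give the same result as the cell above; calling `server.terminate()` stops the app when you are done.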
In conclusion, we have successfully created and deployed a multimodal image captioning application powered by Salesforce's BLIP and Streamlit, hosted securely through ngrok from a Google Colab environment. This practical exercise demonstrated how easily sophisticated machine learning models can be integrated into easy-to-use interfaces, and it provides a basis for exploring and customizing further multimodal applications.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a broad audience. The platform has more than 2 million monthly views, illustrating its popularity among readers.