Multilingual Audio-Enhanced Visual Detection System

Overview

This project captures images in real time, identifies objects with machine learning, and generates a description of what it sees. The description is translated into a selected language and converted to audio for playback.

It uses OpenAI's CLIP model for object identification, the Google Gemini API for description generation, and gTTS for text-to-speech output.
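
CLIP identifies objects by embedding the image and a set of candidate labels into the same vector space and picking the label whose embedding is most similar to the image's. A minimal sketch of that scoring step with toy vectors (the real system would get these embeddings from the CLIP model itself):

```python
from math import sqrt

def pick_label(image_emb, text_embs, labels):
    """Return the label whose text embedding has the highest cosine
    similarity to the image embedding -- the core of CLIP's zero-shot
    classification. Embeddings here are plain lists of floats."""
    def normalize(v):
        n = sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    img = normalize(image_emb)
    best = max(
        range(len(labels)),
        key=lambda i: sum(a * b for a, b in zip(normalize(text_embs[i]), img)),
    )
    return labels[best]
```

With real CLIP embeddings the same argmax-over-similarities logic selects the detected object's name.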

How Does It Work?

The system combines multiple technologies to achieve real-time multilingual detection and description:

  • Capture: The user captures an image using a webcam.
  • Detection: The CLIP model identifies objects in the image.
  • Description: The Gemini API generates a detailed text description.
  • Translation: The text is translated into the selected language using Google Translator.
  • Audio: The translated text is converted into speech using gTTS and played back.
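
The five steps above form a linear pipeline. A minimal sketch of that flow, with each stage injected as a function so the structure is clear without the heavy dependencies (in the real system these would be OpenCV capture, CLIP, the Gemini API, Google Translator, and gTTS; all names here are illustrative):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DetectionPipeline:
    capture: Callable[[], bytes]          # e.g. a webcam frame via OpenCV
    detect: Callable[[bytes], str]        # e.g. a CLIP zero-shot label
    describe: Callable[[str], str]        # e.g. Gemini text generation
    translate: Callable[[str, str], str]  # e.g. Google Translator
    speak: Callable[[str, str], bytes]    # e.g. gTTS audio bytes

    def run(self, lang: str) -> bytes:
        """Run capture -> detect -> describe -> translate -> speak."""
        image = self.capture()
        label = self.detect(image)
        description = self.describe(label)
        translated = self.translate(description, lang)
        return self.speak(translated, lang)
```

Keeping the stages as injected functions also makes each step easy to test or swap (for example, replacing gTTS with another TTS engine) without touching the rest of the flow.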

Supported Languages

  • English
  • Kannada
  • Tamil
  • Malayalam
  • Hindi
  • Punjabi
  • Bengali
  • Gujarati
  • Marathi
  • Odia
  • Assamese
  • Urdu
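
Both the translation and TTS steps take a language code rather than a language name, so a mapping like the following (codes are the standard ISO 639-1 codes these languages use in Google's APIs; gTTS voice coverage for some of them may vary, which `gtts.lang.tts_langs()` can report at runtime) lets one user selection drive both stages:

```python
# Supported language names mapped to their ISO 639-1 codes.
LANG_CODES = {
    "English": "en", "Kannada": "kn", "Tamil": "ta", "Malayalam": "ml",
    "Hindi": "hi", "Punjabi": "pa", "Bengali": "bn", "Gujarati": "gu",
    "Marathi": "mr", "Odia": "or", "Assamese": "as", "Urdu": "ur",
}
```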

Key Features

Limitations
