Multilingual Audio-Enhanced Visual Detection System

Overview

This project captures images in real time, identifies objects with machine learning, and generates a description of what it sees. The description is translated into a selected language and converted to audio for playback.

It uses OpenAI's CLIP model for object identification, the Google Gemini API for description generation, and gTTS for text-to-speech output.
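
CLIP identifies objects by embedding the image and a set of candidate labels into the same vector space and picking the label whose embedding is most similar to the image's. A minimal sketch of that scoring step with toy vectors (the real system would get these embeddings from the CLIP model itself):

```python
from math import sqrt

def pick_label(image_emb, text_embs, labels):
    """Return the label whose text embedding has the highest cosine
    similarity to the image embedding -- the core of CLIP's zero-shot
    classification. Embeddings here are plain lists of floats."""
    def normalize(v):
        n = sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    img = normalize(image_emb)
    best = max(
        range(len(labels)),
        key=lambda i: sum(a * b for a, b in zip(normalize(text_embs[i]), img)),
    )
    return labels[best]
```

With real CLIP embeddings the same argmax-over-similarities logic selects the detected object's name.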

How Does It Work?

The system combines multiple technologies to achieve real-time multilingual detection and description:

  • Capture: The user captures an image using a webcam.
  • Detection: The CLIP model identifies objects in the image.
  • Description: The Gemini API generates a detailed text description.
  • Translation: The text is translated into the selected language using Google Translator.
  • Audio: The translated text is converted into speech using gTTS and played back.
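
The five steps above form a linear pipeline. A minimal sketch of that flow, with each stage injected as a function so the structure is clear without the heavy dependencies (in the real system these would be OpenCV capture, CLIP, the Gemini API, Google Translator, and gTTS; all names here are illustrative):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DetectionPipeline:
    capture: Callable[[], bytes]          # e.g. a webcam frame via OpenCV
    detect: Callable[[bytes], str]        # e.g. a CLIP zero-shot label
    describe: Callable[[str], str]        # e.g. Gemini text generation
    translate: Callable[[str, str], str]  # e.g. Google Translator
    speak: Callable[[str, str], bytes]    # e.g. gTTS audio bytes

    def run(self, lang: str) -> bytes:
        """Run capture -> detect -> describe -> translate -> speak."""
        image = self.capture()
        label = self.detect(image)
        description = self.describe(label)
        translated = self.translate(description, lang)
        return self.speak(translated, lang)
```

Keeping the stages as injected functions also makes each step easy to test or swap (for example, replacing gTTS with another TTS engine) without touching the rest of the flow.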

Supported Languages

  • English
  • Kannada
  • Tamil
  • Malayalam
  • Hindi
  • Punjabi
  • Bengali
  • Gujarati
  • Marathi
  • Odia
  • Assamese
  • Urdu
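
Both the translation and TTS steps take a language code rather than a language name, so a mapping like the following (codes are the standard ISO 639-1 codes these languages use in Google's APIs; gTTS voice coverage for some of them may vary, which `gtts.lang.tts_langs()` can report at runtime) lets one user selection drive both stages:

```python
# Supported language names mapped to their ISO 639-1 codes.
LANG_CODES = {
    "English": "en", "Kannada": "kn", "Tamil": "ta", "Malayalam": "ml",
    "Hindi": "hi", "Punjabi": "pa", "Bengali": "bn", "Gujarati": "gu",
    "Marathi": "mr", "Odia": "or", "Assamese": "as", "Urdu": "ur",
}
```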

Key Features

Limitations
