Overview
This project captures images in real time, identifies objects using machine learning, and generates a description of them. The description is translated into a selected language and converted into audio for playback.
It builds on OpenAI's CLIP model for zero-shot object recognition, the Google Gemini API for content generation, and gTTS for audio output.
How Does It Work?
The system combines multiple technologies to achieve real-time multilingual detection and description:
- Capture: The user captures an image using a webcam.
- Detection: The CLIP model identifies objects in the image.
- Description: The Gemini API generates a detailed text description.
- Translation: The text is translated into the selected language using Google Translate.
- Audio: The translated text is converted into speech using gTTS and played back.
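The steps above form a simple linear pipeline. A minimal sketch of that flow, with each stage injected as a function (the stage names are illustrative stand-ins; the real project would back them with a webcam capture, CLIP, the Gemini API, a translator, and gTTS):

```python
def run_pipeline(image, detect, describe, translate, speak, lang="hi"):
    """Run one captured frame through every stage and return the translated text."""
    label = detect(image)                      # e.g. CLIP zero-shot label
    description = describe(label)              # e.g. Gemini text generation
    translated = translate(description, lang)  # e.g. Google Translate
    speak(translated, lang)                    # e.g. gTTS synthesis + playback
    return translated

# Stub stages to show the data flow without any network calls:
out = run_pipeline(
    image=b"raw-frame-bytes",
    detect=lambda img: "dog",
    describe=lambda label: f"A {label} is in front of the camera.",
    translate=lambda text, lang: f"[{lang}] {text}",
    speak=lambda text, lang: None,  # would call gTTS(text, lang=lang).save(...)
)
print(out)  # [hi] A dog is in front of the camera.
```

Injecting the stages this way also makes each step easy to swap or test in isolation, which matters when several of them depend on remote APIs.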
Supported Languages
- English
- Kannada
- Tamil
- Malayalam
- Hindi
- Punjabi
- Bengali
- Gujarati
- Marathi
- Odia
- Assamese
- Urdu
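Translation and TTS libraries typically identify these languages by ISO 639-1 codes rather than display names. A sketch of such a mapping (the dict name is hypothetical, and TTS support for some Indian languages may vary by library version):

```python
# Display name -> ISO 639-1 code, as commonly accepted by gTTS and
# translation APIs. Coverage for e.g. Odia and Assamese should be
# verified against the installed library version.
LANGUAGE_CODES = {
    "English": "en",
    "Kannada": "kn",
    "Tamil": "ta",
    "Malayalam": "ml",
    "Hindi": "hi",
    "Punjabi": "pa",
    "Bengali": "bn",
    "Gujarati": "gu",
    "Marathi": "mr",
    "Odia": "or",
    "Assamese": "as",
    "Urdu": "ur",
}

# A UI language selector can then resolve the user's choice:
code = LANGUAGE_CODES.get("Hindi", "en")  # -> "hi"
```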
Key Features
- Real-time object detection and description.
- Translation into multiple languages.
- Audio playback of the translated description.
- Accessible and user-friendly interface.
Limitations
- Response time depends on system performance and API latency.
- Translation quality may vary for complex sentences.
- Requires an internet connection for API calls and translation.
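Because the pipeline depends on remote APIs, transient network failures and latency spikes are worth guarding against. One simple mitigation, sketched here as a generic retry wrapper (not part of the project as described):

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn(), retrying on failure with a fixed delay between attempts."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # in practice, catch the API's specific error types
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error

# e.g. translated = with_retries(lambda: translate(description, "hi"))
```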