Live Captioning for Android devices

Captioning makes video and audio content accessible. While captions primarily benefit people who are deaf or hard of hearing, they are useful to everyone: on the train, in meetings, in bed, or while kids are sleeping, many of us prefer to watch video without audio. Studies show that captions can increase the time users spend watching a video by up to 40%.

We have developed Live Caption, a new Android feature that captions media as it plays on your phone. Captioning happens in real time and entirely on-device, without using network resources, which preserves privacy and lowers latency. The feature is currently available on the Pixel 4 and Pixel 4 XL; it will come to Pixel 3 models in the future and will be made widely available on other Android devices.

Live Captioning for Accuracy and Efficiency

Live Caption runs on a combination of three on-device deep learning models: a recurrent neural network (RNN) sequence transduction model for speech recognition (RNN-T), a text-based recurrent neural network model for unspoken punctuation, and a convolutional neural network (CNN) model for sound event classification. Live Caption combines the signals from the three models into a single caption track, in which sound event tags, like [APPLAUSE] and [MUSIC], appear without interrupting the flow of speech recognition results, and punctuation is predicted while the text is updated in parallel.
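The merge described above can be sketched as follows. This is a simplified illustration, not the actual Live Caption pipeline: the `Event` type and `merge_caption_track` function are hypothetical stand-ins for how sound tags can be interleaved with ASR hypotheses without interrupting an in-progress speech result.

```python
from dataclasses import dataclass

# Hypothetical event type; the real Live Caption pipeline is not public.
@dataclass
class Event:
    timestamp: float
    kind: str    # "speech" (ASR hypothesis) or "sound" (classifier tag)
    text: str

def merge_caption_track(events):
    """Merge ASR hypotheses and sound-event tags into one caption track.

    Sound tags like [MUSIC] are emitted as standalone lines, so they never
    interrupt an in-progress speech hypothesis; a new speech hypothesis
    instead replaces the previous partial result for the same utterance.
    """
    track = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.kind == "sound":
            track.append(ev.text)       # e.g. "[APPLAUSE]" as its own line
        elif track and not track[-1].startswith("["):
            track[-1] = ev.text         # refined partial hypothesis
        else:
            track.append(ev.text)       # new utterance after a sound tag
    return track

captions = merge_caption_track([
    Event(0.0, "speech", "hello"),
    Event(0.4, "speech", "hello world"),   # refines the previous partial
    Event(1.2, "sound", "[APPLAUSE]"),
    Event(2.0, "speech", "thank you"),
])
# captions == ["hello world", "[APPLAUSE]", "thank you"]
```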

For sound event recognition, Live Caption builds on our previous work in sound event detection, using a Sound Recognition model built on top of the AudioSet dataset. In addition to generating popular sound-effect labels, the Sound Recognition model is used to detect periods of speech. To minimize memory and battery usage, the full automatic speech recognition (ASR) RNN-T engine runs only during speech periods. For example, when music is detected and speech is not present in the audio stream, the [MUSIC] label appears on screen and the ASR model is unloaded. The ASR model is loaded back into memory only when speech is present in the audio stream again.

To be most useful, Live Caption should be able to run continuously for long periods. To achieve this, its ASR model is optimized for edge devices using several techniques, such as neural connection pruning, which reduces power consumption by 50% compared to the full-sized speech model. Despite being significantly more energy efficient, the model still performs well across a variety of use cases, including captioning videos, recognizing short queries, and handling narrowband telephony speech, while also remaining robust to background noise.

The text-based punctuation model was adapted to run continuously on-device by using a smaller architecture than its cloud equivalent, then quantizing and serializing it with the TensorFlow Lite runtime. As a caption forms, speech recognition results are updated several times per second. To save computational resources and provide a smooth user experience, punctuation prediction is performed only on the tail of the text, from the most recently recognized sentence; if the next ASR update does not change that text, the previously punctuated result is retained and reused.
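The speech-gated loading strategy above can be sketched in a few lines. This is a minimal illustration under assumed interfaces: `GatedASR`, `load_asr`, and the frame labels are hypothetical stand-ins, not the real RNN-T engine or AudioSet classifier.

```python
class GatedASR:
    """Keep the heavyweight ASR model in memory only while speech is detected.

    `load_asr` is a hypothetical factory returning a callable ASR model;
    in the real system this would be the on-device RNN-T engine.
    """

    def __init__(self, load_asr):
        self._load_asr = load_asr
        self._asr = None  # unloaded by default, saving memory and battery

    def process(self, frame_label, frame):
        if frame_label == "speech":
            if self._asr is None:      # load lazily when speech starts
                self._asr = self._load_asr()
            return self._asr(frame)
        # Non-speech audio: free the ASR model and show a sound tag instead.
        self._asr = None
        return f"[{frame_label.upper()}]"
```

For example, with a toy "model" `lambda: (lambda f: f.upper())`, processing a `"music"` frame yields `"[MUSIC]"` while the ASR stays unloaded, and a subsequent `"speech"` frame triggers the lazy load.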

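The reuse of punctuation results on an unchanged tail can likewise be sketched with a small cache. `TailPunctuator` and the `punctuate` callable are hypothetical stand-ins for the on-device punctuation model; the point is only that identical tail text is never re-punctuated even when ASR emits several updates per second.

```python
class TailPunctuator:
    """Punctuate only the tail of the transcript, caching prior results.

    `punctuate` is a hypothetical stand-in for the on-device punctuation
    model. If an ASR update leaves the tail text unchanged, the cached
    punctuated result is reused instead of re-running the model.
    """

    def __init__(self, punctuate):
        self._punctuate = punctuate
        self._cache = {}   # raw tail text -> punctuated text
        self.calls = 0     # model invocations, for illustration only

    def update(self, tail_text):
        if tail_text not in self._cache:
            self.calls += 1
            self._cache[tail_text] = self._punctuate(tail_text)
        return self._cache[tail_text]
```

With a toy model such as `lambda t: t.capitalize() + "."`, calling `update("hello world")` twice returns `"Hello world."` both times while invoking the model only once.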
Expectations and Planning

Live Caption is currently available only in English on the Pixel 4, but we look forward to bringing this feature to other languages and other Android devices. We are also working on refining caption formatting to improve the perceived accuracy and coherence of the captions, particularly for multi-speaker content.
