Project Proposal
Jenna Gottschalk, Hechen Li, Ahmed Rabbani, Ahindrila Saha, Sai Yang
Background
According to the World Health Organization, more than 5% of the world's population lives with disabling hearing loss. Sign language recognition is an indispensable foundation for machine translation systems that bridge the communication gap between the hearing community and people with hearing and speech impairments.
Problem definition
Sign language translation takes an image containing a sign language gesture and outputs the corresponding meaning as text. It can be divided into two major steps: feature extraction and classification. Earlier approaches used hand-crafted features such as PCA and kurtosis position [1], paired with Hidden Markov Models [1] or SVMs [2] for classification. With the advance of deep learning, CNN-based models [3] now generally outperform these earlier methods. However, CNN-based models learn directly from raw RGB pixels, which makes them sensitive to background clutter and lighting. Some methods instead use richer skeleton features [4], but obtaining those features typically requires users to wear tracking devices. We therefore want to explore an approach that needs only an image as input: extract hand skeletons from images with the state-of-the-art OpenPose framework [5], and then classify the resulting skeleton features with a deep learning model.
Methodology
Our project consists of two stages: preprocessing the image data and building machine learning models.
In the data preprocessing stage, we will use the ASL Alphabet dataset of 87,000 labeled images and the ASL Alphabet Test dataset of 870 labeled images with noisy backgrounds. We will extract features from the images with OpenPose [5]: given an input frame, OpenPose detects the hand and outputs the positions of 21 keypoints on it. We will use the coordinates of these keypoints as features. Although we could learn from the pixel data directly, this state-of-the-art pose detection framework gives us more structured and meaningful features, which should improve model performance. We will then apply Principal Component Analysis (PCA) to project the extracted features onto a lower-dimensional space.
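A minimal sketch of how this preprocessing step could look, assuming OpenPose has already been run with hand keypoint detection enabled and has written one JSON file per image (the directory layout, wrist-relative normalization, and number of PCA components are assumptions we would tune):

import json
from pathlib import Path

import numpy as np
from sklearn.decomposition import PCA

def load_hand_keypoints(json_path):
    """Read one OpenPose JSON output and return a flat (x, y) feature vector
    for the 21 right-hand keypoints, normalized relative to the wrist."""
    with open(json_path) as f:
        data = json.load(f)
    if not data["people"]:
        return None  # OpenPose found no person/hand in this image
    # hand_right_keypoints_2d is a flat list of (x, y, confidence) triples
    kp = np.array(data["people"][0]["hand_right_keypoints_2d"]).reshape(21, 3)
    xy = kp[:, :2] - kp[0, :2]        # translate so the wrist (keypoint 0) is the origin
    scale = np.linalg.norm(xy, axis=1).max()
    if scale > 0:
        xy = xy / scale               # make features invariant to hand size / distance
    return xy.flatten()               # 42-dimensional feature vector

# Placeholder path for the per-image OpenPose outputs
features = []
for p in sorted(Path("openpose_out/").glob("*_keypoints.json")):
    vec = load_hand_keypoints(p)
    if vec is not None:
        features.append(vec)
X = np.stack(features)

# Project the 42-dimensional keypoint features onto a lower-dimensional space
pca = PCA(n_components=10)            # assumed number of components
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_.cumsum())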
In the modeling stage, we will perform both supervised and unsupervised learning. The supervised task is to recognize American Sign Language hand gestures: using the extracted features, we will build a deep learning classifier that labels an unseen gesture with high accuracy. We will train and compare different model configurations and validate them on the ASL Alphabet Test dataset.
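As an illustration of the supervised stage, a small feed-forward classifier on the keypoint features could look like the sketch below. The layer sizes, the 29-class label set of the ASL Alphabet dataset (A-Z plus "space", "delete", "nothing"), and the train/validation split are assumptions; X_reduced and the integer labels y are taken to come from the preprocessing step above.

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras

NUM_CLASSES = 29  # ASL Alphabet: A-Z plus "space", "delete", "nothing"

# X_reduced and integer labels y are assumed outputs of the preprocessing stage
X_train, X_val, y_train, y_val = train_test_split(
    X_reduced, y, test_size=0.2, stratify=y, random_state=42)

model = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=50, batch_size=64)

The held-out ASL Alphabet Test set would then be passed to model.evaluate for the final accuracy comparison between configurations.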
For the unsupervised clustering task, we will use the extracted features to compare gestures across different sign languages. We will cluster the sign language images with K-means and DBSCAN to explore similarities and differences among gestures from different sign languages.
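A rough sketch of this clustering comparison, assuming X_all stacks keypoint features from several sign language alphabets (the cluster count, eps, and min_samples values are placeholders to be tuned):

from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# X_all is assumed to stack keypoint features from several sign languages
X_scaled = StandardScaler().fit_transform(X_all)

kmeans = KMeans(n_clusters=26, n_init=10, random_state=42)  # assumed cluster count
km_labels = kmeans.fit_predict(X_scaled)
print("K-means silhouette:", silhouette_score(X_scaled, km_labels))

dbscan = DBSCAN(eps=0.8, min_samples=10)  # assumed density parameters
db_labels = dbscan.fit_predict(X_scaled)
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
print("DBSCAN clusters found:", n_clusters)

Clusters that mix labels from different sign languages would point to shared hand shapes, which is exactly what we want to examine.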
Potential Results
As a result of our project, we hope to build a classifier that recognizes sign language gestures with high accuracy and a clustering model that identifies shared characteristics among hand gestures across different sign languages. We may find that different sign languages share a set of common gestures, even though the same gesture can carry a different meaning in each language. If time permits, we also want to develop an application that translates gestures from a webcam video stream in real time.
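If we get to the real-time application, the webcam loop might look roughly like this; extract_features is a hypothetical wrapper around OpenPose plus PCA, and model is the trained classifier from the sketch above.

import cv2
import numpy as np

LABELS = [chr(ord("A") + i) for i in range(26)] + ["space", "delete", "nothing"]

cap = cv2.VideoCapture(0)                 # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    feats = extract_features(frame)       # hypothetical OpenPose + PCA wrapper
    if feats is not None:
        probs = model.predict(feats[np.newaxis, :], verbose=0)[0]
        label = LABELS[int(np.argmax(probs))]
        cv2.putText(frame, label, (10, 40),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
    cv2.imshow("ASL translator (prototype)", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()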
Project Timeline
References
[1] M. M. Zaki and S. I. Shaheen, "Sign language recognition using a combination of new vision based features," Pattern Recognition Letters, vol. 32, pp. 572-577, 2011.
[2] N. M. Kakoty and M. D. Sharma, "Recognition of sign language alphabets and numbers based on hand kinematics using a data glove," Procedia Computer Science, vol. 133, pp. 55-62, 2018.
[3] S. Sharma and S. Singh, "Vision-based hand gesture recognition using deep learning for the interpretation of sign language," Expert Systems with Applications, vol. 182, 2021.
[4] Q. Xiao, M. Qin, and Y. Yin, "Skeleton-based Chinese sign language recognition and generation for bidirectional communication between deaf and hearing people," Neural Networks, vol. 125, pp. 41-55, 2020.
[5] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields," arXiv:1812.08008, 2018.