Teaching Machines to Understand Sign Language
Abstract
Can AI understand human language? In the future, AI could provide emergency interpreting services in hospitals when human translators aren't available. But can current AI algorithms understand non-verbal languages like sign language? In this science project, you will test whether AI can learn sign language gestures or phrases and whether it could be used for interpretation.
Summary
- Prerequisites: None
- Material availability: Readily available
- Safety: No issues
Objective
In this science project, you will test whether AI can learn sign language with action recognition.
Introduction
Have you ever wondered how we learn a language, and whether Artificial Intelligence (AI) can understand one? Language is an essential way for humans to connect and communicate. All languages share common building blocks, yet each has its own structure. We also communicate in different ways, such as speaking, writing or illustrating, and non-verbal communication like gestures. For example, sign language is a way to communicate solely with gestures.
Figure 1. This image displays the complete American Sign Language (ASL) manual alphabet, illustrating the distinct handshape for each letter from A to Z. It is important to remember that ASL is just one of many sign languages used worldwide, with different countries and regions often having their own unique sign languages.
Our brains can learn new languages, and an entire field of research, called neurolinguistics, is dedicated to how our brains process and form language. Many languages are dynamic; terms can take on new meanings with new generations. As AI tools are adopted more broadly, an important question will be whether AI can understand languages and aid our ability to communicate across cultures and languages. If it can, this could make the world a more accessible place for people who are nonverbal or deaf.
In healthcare settings, many patients and families cannot access a translator in a reasonable timeframe. Often, friends or family members are asked to step in as interpreters, but they may not always be available or able to enter a facility, as happened during the COVID-19 pandemic, when healthcare institutions restricted visitors for public safety. Proper communication in these environments is essential for informed consent before treatment and care. This is a significant problem: a lack of access to medical interpreters in a healthcare setting can lead to decreased compliance and worse health outcomes, and it has widened disparities in access to quality healthcare. So how could we fill this gap in healthcare equity? Some computer interpreting systems, also known as computer-assisted interpreting tools, are already used in healthcare settings and everyday life. However, these systems are not continually updated as languages change, and they lack the depth of understanding that human interpreters have. What if we could teach machines to rapidly learn and evolve with languages using AI, and provide these machines to patients to improve healthcare outcomes?
In this science project, you will teach a machine to learn sign language using AI. One real-world application of AI in communication is automatic captioning, which provides real-time text interpretation of spoken language. However, these systems often struggle with gesture-based communication, such as sign language, which is widely used in the deaf and hard-of-hearing community. Can AI help bridge this gap and make sign language translation more accessible?
In this project, you will use MediaPipe, a framework that can detect keypoints on the human body, including the hands, face, and pose. By tracking the position and motion of these keypoints over time, MediaPipe provides the raw input needed for recognizing dynamic gestures. To interpret these gestures as specific words or phrases, you will build an action recognition model using a Long Short-Term Memory (LSTM) neural network. LSTMs are a type of recurrent neural network (RNN) well-suited for analyzing sequences, like the series of movements involved in signing.
You can watch the LSTM videos listed in the Bibliography to learn more about how they work.
This dynamic approach allows the model to learn how different gesture sequences correspond to specific terms. Ultimately, the goal is to explore whether AI can accurately interpret sign language and potentially support real-time translation in settings such as healthcare.
Terms and Concepts
- Artificial Intelligence (AI)
- Language
- Sign language
- Neurolinguistics
- Translator
- Healthcare
- Interpreter
- Informed consent
- Computer interpreting systems
- MediaPipe
- Keypoints
- Action recognition
- Long Short-Term Memory (LSTM)
- Neural network
- Recurrent Neural Network (RNN)
- Label map
- Multilabel confusion matrix
- Accuracy
Questions
- What is language, and why is it important?
- What is the study of how our brain interprets language?
- What are the issues with current computer-interpretive systems for translation?
- What is informed consent? What role do interpreters or computer-interpretive systems play in this process?
- How can AI be used to improve communication in healthcare?
- What algorithm can be used to teach AI sign language (a non-verbal language)?
Bibliography
The code is based on this project:
- nicknochnack. (2021, June). ActionDetectionforSignLanguage. GitHub. Retrieved May 21, 2025.
Resources on ASL (American Sign Language):
- National Association of the Deaf. (n.d.). Learning American Sign Language. Retrieved May 21, 2025.
- Sign Language 101. (n.d.). ASL Dictionary. Retrieved May 21, 2025.
More about MediaPipe:
- GoogleAI. (n.d.). MediaPipe Solutions guide. GoogleAI for Developers. Retrieved May 21, 2025.
Learn more about why we split datasets into train and test:
- Turp, Misra. (2023, February). Why do we split data into train test and validation sets? YouTube. Retrieved May 21, 2025.
Learn more about LSTMs:
- IBM Technology. (2021, November). What is LSTM (Long Short Term Memory)? YouTube. Retrieved May 21, 2025.
- StatQuest by Josh Starmer. (2022, November). Long Short-Term Memory (LSTM), Clearly Explained. YouTube. Retrieved May 21, 2025.
Learn more about how to read confusion matrices:
- Kundu, Rohit. (2022, September). Confusion Matrix: How To Use It & Interpret Results [Examples]. V7. Retrieved May 21, 2025.
Materials and Equipment
- Computer with Internet access
- Phone or camera with the ability to record videos
Experimental Procedure
Setting Up the Google Colab Environment
- You will need a Google account. If you do not have one, make one when prompted.
- Download the sign_language_detection.ipynb file from Science Buddies. This is the code you will need to process your data.
- Within your Google Drive, click on 'MyDrive,' then create a new folder and rename it sign_language_detection. Inside the folder, upload the sign_language_detection.ipynb file.
- Double-click on the sign_language_detection.ipynb file. This should automatically open in Google Colab. You will need to do the following in the notebook:
- Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find in those sections.
Install Dependencies
- Run all the code blocks under this section. This will download various dependencies such as TensorFlow, OpenCV, and MediaPipe (a sketch of what a typical install cell contains appears after this list).
- TensorFlow: A tool that helps computers learn from data, often used to build and train machine learning models.
- OpenCV: A library that helps with working on images and videos, like reading video files or drawing on frames.
- MediaPipe: A tool made by Google that can find things like hands, faces, and body positions in videos or images.
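The notebook contains the exact install commands, but a minimal sketch of what such a cell might look like is shown below. The pip package names (tensorflow, opencv-python, mediapipe, scikit-learn) are assumptions based on the libraries this project uses, not a copy of the notebook's cell.

```python
# A minimal sketch of a Colab install cell (package names are assumptions,
# not the exact contents of the Science Buddies notebook).
!pip install tensorflow opencv-python mediapipe scikit-learn
```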
Importing Libraries
- Run all the code blocks in this section to ensure we can access all the functions needed for this project and the files stored on your Google Drive.
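For reference, the imports for this project might look something like the sketch below; the notebook's actual import list may differ. Mounting Google Drive is the standard Colab call for reaching files stored in MyDrive.

```python
# A sketch of typical imports for this project (the notebook's actual list may differ).
import os
import numpy as np
import cv2                # OpenCV: read video files and draw on frames
import mediapipe as mp    # MediaPipe: detect hand, face, and pose keypoints
import tensorflow as tf   # TensorFlow/Keras: build and train the LSTM model

# Mount Google Drive so the notebook can reach the sign_language_detection folder
from google.colab import drive
drive.mount('/content/drive')
```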
Detecting Keypoints Using MP Holistic
- (Code Block 2A) This code block sets up tools to find and show body parts like the face, hands, and body in images or videos using MediaPipe. It includes functions to detect these parts and draw them on the image. Run this code block.
- (Code Block 2B) This step tests whether the code can successfully read and process videos. In your sign_language_detection folder on Google Drive, upload a video of yourself waving your hands. You only need to show your upper body, and your hands should be visible at some point so that the MediaPipe model can detect them. The code will display individual video frames with keypoints drawn on them (a simplified sketch of this detect-and-draw step appears after this list). Make sure to:
- Replace the filename under the #TODO comment with the exact name of your video (including the file extension).
- Re-run the Importing Libraries section if the uploaded video is not being found.
Extracting Keypoint Values
- (Code Block 3A) This code block combines the points detected by MediaPipe (body pose, face, and hands) into one long list of numbers. If a part is not found in a frame, it fills in zeros so the list always stays the same size. This makes the data easier to use for machine learning. Run this code block.
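The 1662 features per frame mentioned later in this procedure come from flattening every detected landmark into one vector. A sketch of how such a function might look (the name extract_keypoints and the exact layout are assumptions consistent with the project this code is based on):

```python
import numpy as np

def extract_keypoints(results):
    """Flatten MediaPipe Holistic results into one fixed-length vector.
    Parts that were not detected are filled with zeros so every frame
    produces the same number of values."""
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility]
                      for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))      # 132 values
    face = (np.array([[lm.x, lm.y, lm.z]
                      for lm in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))     # 1404 values
    lh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))   # 63 values
    rh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))  # 63 values
    return np.concatenate([pose, face, lh, rh])                   # 1662 values total
```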
Setting Up Folders for Collection
- (Code Block 4A) This code block sets up folders to store keypoint data for each sign language gesture. Under the #TODO comment, enter the names of the gestures for which you want to collect data. By default, the code includes 'hello', 'thank_you', and 'see_you_later', but you can add as many gestures as you'd like.
- Note: After running this code block, two new folders called MP_Data and videos will be created inside your sign_language_detection folder on Google Drive. Inside both MP_Data and videos, you will see subfolders named after each gesture you listed (e.g., 'hello', 'thank_you', 'see_you_later'). A simplified sketch of this folder setup appears after this list.
- Now it is time to collect 30 video samples for each gesture. You can film with a phone or a webcam. If you are unfamiliar with sign language, you can look up sign language tutorials on YouTube or find other online resources. Ensure the footage focuses solely on the gesture; avoid including actions like reaching toward the camera to stop the recording. If you are filming alone, trim out these parts. Having a second person press the record button can make this process easier and result in cleaner videos. Try to make each video slightly different to help the model learn better. For example, you can:
- Move your hands at different speeds while signing.
- Vary your position in the frame (closer to or farther from the camera).
- Use different hands if the gesture allows for it (e.g., left vs. right).
- Add small body movements like shifting your weight or turning slightly.
- Change facial expressions (since face landmarks are also captured).
- Upload your videos to the correct subfolder inside the videos folder. Then, re-run the "Importing Libraries" section to ensure your notebook can access the newly uploaded videos.
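To picture how Code Block 4A organizes things on Google Drive, here is a simplified sketch. The Drive mount path and the per-video numbered subfolders inside MP_Data are assumptions; the gesture names are the defaults listed above.

```python
import os

# Assumed Colab path to your project folder on Google Drive
BASE = '/content/drive/MyDrive/sign_language_detection'
DATA_PATH = os.path.join(BASE, 'MP_Data')   # keypoint data goes here
VIDEO_PATH = os.path.join(BASE, 'videos')   # your recorded videos go here

actions = ['hello', 'thank_you', 'see_you_later']  # default gestures
no_sequences = 30                                  # 30 video samples per gesture

for action in actions:
    os.makedirs(os.path.join(VIDEO_PATH, action), exist_ok=True)
    for sequence in range(no_sequences):
        # One numbered subfolder per video sample to hold its per-frame .npy files
        os.makedirs(os.path.join(DATA_PATH, action, str(sequence)), exist_ok=True)
```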
Collecting Keypoint Values for Training and Testing
- (Code Block 5A) This code block reads each video, grabs 30 frames, extracts keypoints from each, and saves them for training your sign language model. Run this code block.
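To make this concrete, here is a simplified sketch of the kind of loop Code Block 5A performs: open each video, sample 30 frames spread across it, run MediaPipe on each frame, and save the keypoints as .npy files. The variable names, frame-sampling strategy, and reuse of extract_keypoints() from the earlier sketch are assumptions; the notebook's actual code may differ.

```python
import os
import cv2
import numpy as np
import mediapipe as mp

# Assumes DATA_PATH, VIDEO_PATH, actions, and extract_keypoints() are defined
# as in the earlier sketches.
sequence_length = 30  # frames kept per video

with mp.solutions.holistic.Holistic(min_detection_confidence=0.5,
                                    min_tracking_confidence=0.5) as holistic:
    for action in actions:
        video_files = sorted(os.listdir(os.path.join(VIDEO_PATH, action)))
        for sequence, video_file in enumerate(video_files):
            cap = cv2.VideoCapture(os.path.join(VIDEO_PATH, action, video_file))
            total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
            # Pick 30 frame indices spread evenly across the whole video
            frame_ids = np.linspace(0, total_frames - 1, sequence_length).astype(int)
            for frame_num, frame_id in enumerate(frame_ids):
                cap.set(cv2.CAP_PROP_POS_FRAMES, int(frame_id))
                ret, frame = cap.read()
                if not ret:
                    continue
                results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                keypoints = extract_keypoints(results)
                np.save(os.path.join(DATA_PATH, action, str(sequence), str(frame_num)),
                        keypoints)
            cap.release()
```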
Preprocessing Data and Creating Labels and Features
- (Code Block 6A) This code block creates a label map, a dictionary that assigns a unique number to each gesture name in your dataset. We do this because machines work with numbers rather than words like 'hello' and 'thank_you'. Run this code block.
- (Code Block 6B) This code block reads all your saved .npy files (which contain keypoints for each frame), groups them into 30-frame sequences (each representing one video), and stores them in a list called sequences. It also stores the corresponding numeric label for each sequence in a list called labels, so your model knows which gesture each sequence represents. We do this to organize our data before giving it to our model for training. Run this code block.
- (Code Block 6C) Splitting data into training and testing sets is important in machine learning. It helps you see how well your model works on new data. Watch the video listed in the Bibliography to learn more about why we split datasets. Pay attention to the sizes of X_train, X_test, y_train, and y_test. If you see the X_train size as (85, 30, 1662), that means there are 85 training samples (videos), each sample has 30 frames (we extracted 30 frames per video), and each frame contains 1662 features, which come from flattening all of the detected keypoints (pose, face, left hand, right hand) into one long array.
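A condensed sketch of what Code Blocks 6A through 6C do, assuming the folder layout and variable names from the earlier sketches (the notebook's exact code and test-set size may differ):

```python
import os
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# Assumes DATA_PATH, actions, and sequence_length are defined as sketched earlier.

# 6A: label map, e.g. {'hello': 0, 'thank_you': 1, 'see_you_later': 2}
label_map = {label: num for num, label in enumerate(actions)}

# 6B: group the saved per-frame .npy files into 30-frame sequences
sequences, labels = [], []
for action in actions:
    for sequence in sorted(os.listdir(os.path.join(DATA_PATH, action)), key=int):
        window = [np.load(os.path.join(DATA_PATH, action, sequence, f"{frame_num}.npy"))
                  for frame_num in range(sequence_length)]
        sequences.append(window)
        labels.append(label_map[action])

X = np.array(sequences)                  # shape: (number of videos, 30, 1662)
y = to_categorical(labels).astype(int)   # one-hot encoded gesture labels

# 6C: hold out a small test set the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```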
Building and Training an LSTM Neural Network
- (Code Block 7A) This code block builds a neural network to classify sign language gestures based on 30-frame video sequences of body keypoints. It uses LSTM (Long Short-Term Memory) layers to learn patterns over time, followed by dense layers to refine the prediction. Run this code block.
- LSTM is a type of neural network layer that is great at learning from sequences, like video frames or time series data. It can remember important patterns across time, making it useful for tasks where order and context matter.
- (Code Block 7B) Compiling a model sets up how the model will learn during training. This step is necessary before training and ensures the model can improve its predictions and evaluate its accuracy. Run this code block.
- (Code Block 7C) Using the training data, this code block trains the model to recognize sign language gestures. It runs 200 times over the dataset to help the model learn. Run this code block.
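For orientation, here is a sketch of the kind of LSTM network Code Blocks 7A through 7C build, compile, and train. The layer sizes are illustrative rather than the notebook's exact architecture; the input shape (30, 1662) matches the 30 frames and 1662 features described above, and the 200 epochs match the training step.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# 7A: LSTM layers read the 30-frame sequence of 1662 keypoint features,
# and dense layers refine the result into one probability per gesture.
model = Sequential([
    LSTM(64, return_sequences=True, activation='relu', input_shape=(30, 1662)),
    LSTM(128, return_sequences=True, activation='relu'),
    LSTM(64, return_sequences=False, activation='relu'),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(len(actions), activation='softmax'),  # assumes actions is defined as earlier
])

# 7B: set up how the model learns and how its accuracy is tracked
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])

# 7C: train for 200 passes (epochs) over the training data
model.fit(X_train, y_train, epochs=200)
```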
Making Predictions
- (Code Block 8A) This code block uses the trained model to predict the test data. It will output the number of predictions it has made. Run this code block.
- (Code Block 8B) This code block displays the predicted gesture for a specific test sample; by default, it is set to the 4th sample (index 3). You can change the index to any number from 0 up to one less than the total number of predictions shown in the previous code block to view the predicted gesture for different test samples.
- (Code Block 8C) This code block retrieves the true label (actual gesture) for the same test sample; by default, it is set to the 4th sample (index 3). If you changed the index in the previous code block, change it to the same value here. Compare whether the predicted value matches the actual value.
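Conceptually, Code Blocks 8A through 8C do something like the following. np.argmax picks the gesture with the highest predicted probability, and the actions list (from the earlier sketches) turns that index back into a gesture name.

```python
import numpy as np

# 8A: predict probabilities for every test sample
res = model.predict(X_test)
print(len(res), 'predictions made')

# 8B: predicted gesture for one test sample (index 3 = the 4th sample)
print('Predicted:', actions[np.argmax(res[3])])

# 8C: the true (actual) gesture for the same sample
print('Actual:', actions[np.argmax(y_test[3])])
```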
Saving Weights
- (Code Block 9A) This code block saves the trained machine learning model (the LSTM model) to a file so you can reuse it later without needing to retrain it. Run this code block.
- (Code Block 9B) This code block loads a previously saved model from the file specified by model_path. Run this code block to use the model again.
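Saving and reloading the trained model with Keras might look like the sketch below; the file name and model_path value are placeholders, not necessarily the ones used in the notebook.

```python
from tensorflow.keras.models import load_model

# 9A: save the trained model so it can be reused without retraining
model_path = '/content/drive/MyDrive/sign_language_detection/action.h5'  # placeholder path
model.save(model_path)

# 9B: later, load the saved model back from the same file
model = load_model(model_path)
```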
Evaluating Using a Confusion Matrix and Accuracy
- (Code Block 10A) This code block uses the trained model to make predictions and then compares them to the correct answers. It turns the predictions and actual labels into numbers so they can be easily compared and checked for accuracy. Run this code block.
- (Code Block 10B) This code block calculates a multilabel confusion matrix, which helps evaluate how well the model performs for each individual class (in this case, each sign or gesture). Instead of creating a single confusion matrix for the entire model, it generates a separate confusion matrix for each class. This allows you to see how often the model correctly identifies a specific gesture versus how often it confuses it with others. To learn more about interpreting confusion matrices, see the resource listed in the Bibliography.
- (Code Block 10C) This code block calculates the accuracy of the model.
- Accuracy is a performance metric that measures how often a classification model correctly predicts the labels:
Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
- In other words, accuracy tells you what percentage of the total predictions made by the model were correct.
- If a model makes 5 predictions and 4 of them are correct:
Accuracy = 4/5 = 0.8, or 80%.
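The evaluation steps in Code Blocks 10A through 10C correspond roughly to the scikit-learn calls below (variable names assume the train/test split sketched earlier):

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix, accuracy_score

# 10A: turn predicted probabilities and one-hot labels into class numbers
yhat = np.argmax(model.predict(X_test), axis=1)
ytrue = np.argmax(y_test, axis=1)

# 10B: one 2x2 confusion matrix per gesture (true/false positives and negatives)
print(multilabel_confusion_matrix(ytrue, yhat))

# 10C: overall accuracy = correct predictions / total predictions
print(accuracy_score(ytrue, yhat))
```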
Testing with Videos
- Now, it is time to test how well your model recognizes multiple signs in a row. Record a single video of yourself performing the gestures your model was trained on, but do them randomly and use different hands if applicable to add variety.
- Once you are done recording, upload the video to the test_videos folder in your Google Drive.
- (Code Block 11A) Next, re-run the "Importing Libraries" section of your notebook so the video becomes accessible in your code. Under the #TODO comment, replace the text defining the test_video variable with the exact name of your video file. Run the code block, and when it finishes, it will generate a new video called output_video.mp4 in the same test_videos folder. In this output, you will see:
- A blue bar at the top showing the model's detected gesture.
- The probabilities of each possible gesture displayed below the blue bar.
Watch the output video and see whether your model recognized your signs accurately.
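Under the hood, recognizing signs in a longer video typically works by sliding a 30-frame window of keypoints across the video and predicting on each window. The sketch below shows the core idea only; the notebook's Code Block 11A also draws the blue bar and probability overlay onto the output video, which is omitted here.

```python
import cv2
import numpy as np
import mediapipe as mp

# Assumes model, actions, extract_keypoints(), and test_video are defined as earlier.
sequence = []  # rolling buffer of the most recent 30 frames of keypoints

cap = cv2.VideoCapture(test_video)
with mp.solutions.holistic.Holistic(min_detection_confidence=0.5,
                                    min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        sequence.append(extract_keypoints(results))
        sequence = sequence[-30:]  # keep only the last 30 frames
        if len(sequence) == 30:
            probs = model.predict(np.expand_dims(sequence, axis=0))[0]
            print(actions[np.argmax(probs)], probs)  # detected gesture + probabilities
cap.release()
```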
(Optional) Improving Your Model
If the results are not what you desire, here are some suggestions on how you can improve your model:
- Collect More Diverse Training Data
- Include multiple people performing the signs to improve generalization.
- Perform signs with both left and right hands, and change the angle of your elbow if applicable.
- Increase Training Data Quality
- Trim gesture clips accurately to avoid noisy data.
- Increase the amount of data you have.
- Note: If you decide to add more data, you would need to delete the MP_Data folder and increase the variable no_sequences to the number of videos you want to take.
Ask an Expert
Variations
- Get the model to run in real-time using input from a webcam instead of pre-recorded videos.
- Update your algorithm to understand medical terminology and test it with a small group of people pretending to be patients with medical issues you use in the training dataset. Can the algorithm understand sign language well enough to communicate effectively in this setting?
- Can you train a model that can recognize the difference between American Sign Language (ASL) and Spanish Sign Language (LSE), or even other sign languages worldwide?
- Tune the LSTM model:
- (Code Block 7A) Add more LSTM layers or increase the number of units in each layer.
- (Code Block 7A) Try adding Dropout layers to prevent overfitting.
- (Code Block 4A) Experiment with sequence_length: how many frames the LSTM looks at per prediction.
- Try a more advanced model:
- Replace or combine the LSTM with models like GRU, Transformer, etc. for better temporal learning.
- Improve feature engineering:
- Instead of raw keypoints, compute velocities (movement over time) or angles between joints to capture gesture dynamics (see the sketch after this list).
- Hyperparameter tuning:
- Use grid search or random search to find the best values for learning rate, batch size, number of epochs, etc.
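As one example of the feature-engineering idea above, per-frame velocities can be approximated as the difference between consecutive frames of keypoints. This is an illustrative sketch, not part of the notebook:

```python
import numpy as np

def add_velocities(sequence):
    """Append frame-to-frame keypoint differences (velocities) to each frame.

    sequence: array of shape (30, 1662) holding raw keypoints for one video.
    Returns an array of shape (30, 3324): raw keypoints plus their velocities.
    (Illustrative only; the notebook does not include this step.)
    """
    # Prepend the first frame so the output keeps 30 rows; frame 0 gets zero velocity.
    velocities = np.diff(sequence, axis=0, prepend=sequence[:1])
    return np.concatenate([sequence, velocities], axis=1)
```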
Careers
If you like this project, you might enjoy exploring these related careers: