
Teaching Machines to Understand Sign Language

Abstract

Can AI understand human language? In the future, AI could aid emergency interpreting services in hospitals when human interpreters aren't available. But can current AI algorithms understand non-verbal languages like sign language? In this science project, you will test whether AI can learn sign language gestures or phrases to see if it can be used for interpretation.

Summary

Areas of Science
Difficulty
Method
Time Required: Short (2-5 days)
Prerequisites: None.
Material Availability: Readily available.
Cost: Very Low (under $20)
Safety: No issues.

Credits
Science Buddies is committed to creating content authored by scientists and educators. Learn more about our process and how we use AI.

Objective

In this science project, you will test whether AI can learn sign language with action recognition.

Introduction

Have you ever wondered how we learn a language, and whether Artificial Intelligence (AI) can understand it? Language is an essential way for humans to connect and communicate. Languages share common features, yet each has its own structure. We also communicate in different ways, such as speaking, writing or illustrating, and non-verbal communication like gestures. For example, sign language is one way to communicate solely with gestures.

Image credit: Jazz Davis / Creative Commons Attribution-Share Alike 4.0 International

Figure 1. The complete American Sign Language (ASL) manual alphabet, illustrating the distinct handshape for each letter from A to Z. It is important to remember that ASL is just one of many sign languages used worldwide; different countries and regions often have their own unique sign languages.

Our brains can learn new languages, and an entire field of research, called neurolinguistics, is dedicated to how our brains process and form language. Many languages are dynamic; for example, terms gain new meanings with each generation. As AI tools are adopted more broadly, an important question will be whether AI can understand languages and aid our ability to communicate across cultures and languages. If it can, this could make the world a more accessible place for those who are nonverbal or deaf.

In healthcare settings, many patients and families cannot access an interpreter in a reasonable timeframe. Often, friends or family members are asked to step in as interpreters, but they may not always be available or allowed into a facility, as happened during the COVID-19 pandemic lockdowns at healthcare institutions. Proper communication in these environments is essential for informed consent before treatments and care. This is a significant problem, since a lack of access to medical interpreters in a healthcare setting can lead to decreased compliance and worse health outcomes, and it has worsened disparities in access to quality healthcare. So, how could we fill this gap in healthcare equity? Some computer interpreting systems, also known as computer-assisted interpreting tools, are already used in healthcare settings and everyday life. However, these systems do not continuously update as languages change, and they lack the depth of understanding that human interpreters have. What if we could teach machines to rapidly learn and evolve with languages using AI, and provide these machines to patients to improve healthcare outcomes?

In this science project, you will teach a machine to learn sign language using AI. One real-world application of AI in communication is automatic captioning, which provides real-time text interpretation of spoken language. However, these systems often struggle with gesture-based communication, such as sign language, which is widely used in the deaf and hard-of-hearing community. Can AI help bridge this gap and make sign language translation more accessible?

In this project, you will use MediaPipe, a framework that can detect keypoints on the human body, including the hands, face, and pose. By tracking the position and motion of these keypoints over time, MediaPipe provides the raw input needed for recognizing dynamic gestures. To interpret these gestures as specific words or phrases, you will build an action recognition model using a Long Short-Term Memory (LSTM) neural network. LSTMs are a type of recurrent neural network (RNN) well-suited for analyzing sequences, like the series of movements involved in signing.

You can watch this video to learn more about LSTM:

This dynamic approach allows the model to learn how different gesture sequences correspond to specific terms. Ultimately, the goal is to explore whether AI can accurately interpret sign language and potentially support real-time translation in settings such as healthcare. 
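
To make the data flow concrete, the sketch below (not code from the project notebook) shows the array shapes this kind of pipeline produces: MediaPipe yields one keypoint vector per video frame, and a fixed number of frames per clip is stacked into the sequence the LSTM classifier consumes. The frame count (30) and feature count (1662) match the values used later in the Experimental Procedure.

```python
import numpy as np

FRAMES_PER_CLIP = 30       # frames sampled from each gesture video (procedure default)
FEATURES_PER_FRAME = 1662  # pose (33x4) + face (468x3) + two hands (21x3 each)

# One gesture clip becomes a (30, 1662) array of keypoints over time...
clip = np.zeros((FRAMES_PER_CLIP, FEATURES_PER_FRAME))

# ...and an LSTM classifier consumes batches of such sequences, shaped (N, 30, 1662).
batch = np.stack([clip])
print(batch.shape)  # (1, 30, 1662)
```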

Terms and Concepts

Questions

Bibliography

The code is based on this project:

Resources on ASL (American Sign Language):

More about MediaPipe:

Learn more about why we split datasets into train and test:

Learn more about LSTMs:

Learn more about how to read confusion matrices:

Materials and Equipment

Experimental Procedure

This project follows the Engineering Design Process. Confirm with your teacher if this is acceptable for your project, and review the steps before you begin.

Setting Up the Google Colab Environment

  1. You will need a Google account. If you do not have one, make one when prompted.
  2. Download the sign_language_detection.ipynb file from Science Buddies. This is the code you will need to process your data.
  3. Within your Google Drive, click on ‘MyDrive,’ then create a new folder and rename it sign_language_detection. Inside the folder, upload the sign_language_detection.ipynb file.
  4. Double-click on the sign_language_detection.ipynb file. This should automatically open in Google Colab. You will need to do the following in the notebook:
    1. Read the Troubleshooting Tips and How to Use This Notebook sections and follow the instructions you find there.

Install Dependencies

  1. Run all the code blocks under this section. This will download various dependencies such as TensorFlow, OpenCV, and MediaPipe. (An example install cell is sketched after this list.)
    1. TensorFlow – A tool that helps computers learn from data, often used to build and train machine learning models.
    2. OpenCV – A library that helps with working on images and videos, like reading video files or drawing on frames.
    3. MediaPipe – A tool made by Google that can find things like hands, faces, and body positions in videos or images.
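
For reference, a Colab install cell for these three libraries might look like the sketch below; these are the standard PyPI package names, and the actual notebook may pin specific versions.

```python
# Hypothetical Colab install cell; the notebook's own commands may differ.
!pip install tensorflow opencv-python mediapipe
```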

Importing Libraries

  1. Run all the code blocks in this section to ensure we can access all the functions needed for this project and the files stored on your Google Drive.

Detecting Keypoints Using MP Holistic

  1. (Code Block 2A) This code block sets up tools to find and show body parts such as the face, hands, and pose in images or videos using MediaPipe. It includes functions to detect these parts and draw them on the image. Run this code block. (A minimal sketch of this kind of setup follows this list.)
  2. (Code Block 2B) This step tests whether the code can successfully read and process videos. In your sign_language_detection folder on Google Drive, upload a video of yourself waving your hands. You only need to show your upper body, and your hands should be visible at some point so that the MediaPipe model can detect them. The code will display individual video frames with keypoints drawn on them. Make sure to:
    1. Replace the filename under the #TODO comment with the exact name of your video (including the file extension).
    2. Re-run the Importing Libraries section if the uploaded video is not being found.
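
For orientation, a minimal version of this kind of setup is sketched below. It uses MediaPipe's Holistic solution to detect pose, face, and hand landmarks in one video frame and draws them with OpenCV; the notebook's actual functions may be organized differently, and the video file name here is hypothetical.

```python
import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic        # the Holistic landmark model
mp_drawing = mp.solutions.drawing_utils    # helpers for drawing landmarks

cap = cv2.VideoCapture('my_wave_video.mp4')   # hypothetical file name
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    ok, frame = cap.read()
    if ok:
        # MediaPipe expects RGB images, while OpenCV reads frames as BGR.
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        # Draw whatever was detected back onto the frame.
        mp_drawing.draw_landmarks(frame, results.pose_landmarks,
                                  mp_holistic.POSE_CONNECTIONS)
        mp_drawing.draw_landmarks(frame, results.left_hand_landmarks,
                                  mp_holistic.HAND_CONNECTIONS)
        mp_drawing.draw_landmarks(frame, results.right_hand_landmarks,
                                  mp_holistic.HAND_CONNECTIONS)
cap.release()
```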

Extracting Keypoint Values

  1. (Code Block 3A) This code block flattens the points detected by MediaPipe (body pose, face, and hands) into one long list of numbers. If a part is not found in a frame, it fills in zeros so the list always stays the same size. This makes it easier to use the data for machine learning. Run this code block. (See the sketch below.)
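
The sketch below shows one way such a flattening function can look, consistent with the 1662-value description used later in this procedure (pose 33x4, face 468x3, and 21x3 per hand); the notebook's actual function may differ in the details.

```python
import numpy as np

def extract_keypoints(results):
    """Flatten MediaPipe Holistic results into one 1662-value array."""
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility]
                      for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    face = (np.array([[lm.x, lm.y, lm.z]
                      for lm in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    left_hand = (np.array([[lm.x, lm.y, lm.z]
                           for lm in results.left_hand_landmarks.landmark]).flatten()
                 if results.left_hand_landmarks else np.zeros(21 * 3))
    right_hand = (np.array([[lm.x, lm.y, lm.z]
                            for lm in results.right_hand_landmarks.landmark]).flatten()
                  if results.right_hand_landmarks else np.zeros(21 * 3))
    # 132 + 1404 + 63 + 63 = 1662 values per frame
    return np.concatenate([pose, face, left_hand, right_hand])
```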

Setting Up Folders for Collection

  1. (Code Block 4A) This code block sets up folders to store keypoint data for each sign language gesture. Under the #TODO comment, enter the names of the gestures for which you want to collect data. By default, the code includes 'hello', 'thank_you', and 'see_you_later', but you can add as many gestures as you’d like. 
    1. Note: After running this code block, two new folders called MP_Data and videos will be created inside your sign_language_detection folder on Google Drive. Inside both MP_Data and videos, you will see subfolders named after each gesture you listed (e.g., 'hello', 'thank_you', 'see_you_later'). A sketch of this folder setup follows this list.
  2. Now it is time to collect 30 video samples for each gesture. You can film with a phone or a webcam. If you are unfamiliar with sign language, you can look up sign language tutorials on YouTube or find other online resources. Ensure the footage focuses solely on the gesture; avoid including actions like reaching toward the camera to stop the recording. If you are filming alone, trim out these parts. Having a second person press the record button can make this process easier and result in cleaner videos. Try to make each video slightly different to help the model learn better. For example, you can:
    1. Move your hands at different speeds while signing.
    2. Vary your position in the frame (closer to or farther from the camera).
    3. Use different hands if the gesture allows for it (e.g., left vs. right).
    4. Add small body movements like shifting your weight or turning slightly.
    5. Change facial expressions (since face landmarks are also captured).
  3. Upload your videos to the correct subfolder inside the videos folder. Then, re-run the “Importing Libraries” section to ensure your notebook can access the newly uploaded videos.
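
As a rough illustration of what Code Block 4A sets up, the folder creation could look something like the sketch below (paths are relative to your sign_language_detection folder; the notebook's exact layout may differ).

```python
import os

actions = ['hello', 'thank_you', 'see_you_later']   # gestures listed under the #TODO

DATA_PATH = 'MP_Data'    # extracted keypoint arrays will be stored here
VIDEO_PATH = 'videos'    # raw gesture recordings go here

for action in actions:
    # One subfolder per gesture in each top-level folder.
    os.makedirs(os.path.join(DATA_PATH, action), exist_ok=True)
    os.makedirs(os.path.join(VIDEO_PATH, action), exist_ok=True)
```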

Collecting Keypoint Values for Training and Testing

  1. (Code Block 5A) This code block reads each video, grabs 30 frames, extracts keypoints from each, and saves them for training your sign language model. Run this code block.
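
Conceptually, Code Block 5A does something like the sketch below for every clip. It relies on a MediaPipe Holistic object and the extract_keypoints helper sketched earlier; how the notebook selects its 30 frames and names its .npy files may differ.

```python
import cv2
import numpy as np

def keypoints_from_video(video_file, holistic, frames_per_video=30):
    """Read a clip and return up to 30 frames' worth of flattened keypoints."""
    cap = cv2.VideoCapture(video_file)
    frames = []
    while len(frames) < frames_per_video:
        ok, frame = cap.read()
        if not ok:
            break
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        frames.append(extract_keypoints(results))   # one 1662-value vector per frame
    cap.release()
    return np.array(frames)   # shape (30, 1662) when the clip is long enough

# Each sequence can then be saved for training, for example:
# np.save(os.path.join(DATA_PATH, action, 'clip_0.npy'), keypoints_from_video(...))
```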

Preprocessing Data and Creating Labels and Features

  1. (Code Block 6A) This code block creates a label map – a dictionary that assigns a unique number to each gesture name in your dataset. We do this because machines understand numbers better than words like 'hello' and 'thank_you'. Run this code block. 
  2. (Code Block 6B) This code block reads all your saved .npy files (which contain keypoints for each frame), groups them into 30-frame sequences (representing one video), and stores them in a list called sequences. It also stores the corresponding numeric label for each sequence in a list called labels, so your model knows which gesture each sequence represents. We do this to organize our data before giving it to our model for training. Run this code block.
  3. (Code Block 6C) Splitting data into training and testing sets is important in machine learning. It helps to see how well your model works on new data. Watch this video to learn more about why we split datasets. Pay attention to the sizes of X_train, X_test, y_train, and y_test. If you see the X_train size as (85, 30, 1662), that means there are 85 training samples (videos), each sample has 30 frames (we extracted 30 frames per video), and each frame contains 1662 features, which come from flattening all of the detected keypoints (pose, face, left hand, right hand) into one long array.
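
Put together, the preprocessing in Code Blocks 6A-6C amounts to something like the sketch below. Stand-in data is used so the snippet runs on its own, and the test_size and one-hot label encoding are assumptions rather than the notebook's exact settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

actions = ['hello', 'thank_you', 'see_you_later']
label_map = {label: num for num, label in enumerate(actions)}   # e.g. {'hello': 0, ...}

# Stand-in data: 90 clips (30 per gesture), each a (30, 1662) keypoint sequence.
sequences = [np.zeros((30, 1662)) for _ in range(90)]
labels = [i % len(actions) for i in range(90)]

X = np.array(sequences)                  # shape (90, 30, 1662)
y = to_categorical(labels).astype(int)   # one-hot labels, shape (90, 3)

# Hold out a small test set so the model is evaluated on clips it never saw.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05)
print(X_train.shape)   # roughly (85, 30, 1662), matching the example above
```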

Building and Training an LSTM Neural Network

  1. (Code Block 7A) This code block builds a neural network to classify sign language gestures based on 30-frame video sequences of body keypoints. It uses Long Short-Term Memory (LSTM) layers to learn patterns over time, followed by dense layers to refine the prediction. Run this code block. (One possible architecture is sketched after this list.)
    1. LSTM is a type of neural network layer that is great at learning from sequences, like video frames or time series data. It can remember important patterns across time, making it useful for tasks where order and context matter.
  2. (Code Block 7B) Compiling a model sets up how the model will learn during training. This step is necessary before training and ensures the model can improve its predictions and evaluate its accuracy. Run this code block.
  3. (Code Block 7C) Using the training data, this code block trains the model to recognize sign language gestures. It makes 200 passes (epochs) over the dataset to help the model learn. Run this code block.
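
One plausible version of the architecture and training calls in Code Blocks 7A-7C is sketched below; the exact layer sizes, optimizer, and metrics in the notebook may differ.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

num_classes = 3   # one output per gesture; use len(actions) in practice

# Stacked LSTM layers learn patterns across the 30-frame sequences; dense
# layers then refine those features into a probability for each gesture.
model = Sequential([
    LSTM(64, return_sequences=True, activation='relu', input_shape=(30, 1662)),
    LSTM(128, return_sequences=True, activation='relu'),
    LSTM(64, return_sequences=False, activation='relu'),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(num_classes, activation='softmax'),
])

# Compiling chooses how the model learns (optimizer), what it minimizes (loss),
# and what it reports during training (metrics).
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])

# Training makes 200 passes (epochs) over the training data, as described above.
# model.fit(X_train, y_train, epochs=200)
```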

Making Predictions

  1. (Code Block 8A) This code block uses the trained model to predict the test data. It will output the number of predictions it has made. Run this code block.
  2. (Code Block 8B) This code block displays the predicted gesture for a specific test sample; by default, it is set to the 4th sample (index 3). You can change the index to any number from 0 up to one less than the total number of predictions shown in the previous code block to view the predicted gesture for different test samples.
  3. (Code Block 8C) This code block retrieves the true label (actual gesture) for the same test sample; by default, it is set to the 4th sample (index 3). If you changed the index in the previous code block, change it to the same value here. Compare whether the predicted gesture matches the actual one, as in the sketch after this list.
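
Code Blocks 8A-8C essentially convert rows of gesture probabilities into gesture names. A self-contained illustration with stand-in probabilities (not real model output) is sketched below:

```python
import numpy as np

actions = ['hello', 'thank_you', 'see_you_later']

# Stand-in for model.predict(X_test): one row of class probabilities per test clip.
res = np.array([[0.10, 0.85, 0.05],
                [0.70, 0.20, 0.10]])

sample = 1                                         # any index from 0 to len(res) - 1
predicted = actions[int(np.argmax(res[sample]))]   # highest-probability gesture
print(predicted)                                   # -> 'hello'

# The true label for the same sample (from y_test) would be decoded the same way,
# e.g. actions[int(np.argmax(y_test[sample]))], and compared with `predicted`.
```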

Saving Weights

  1. (Code Block 9A) This code block saves the trained machine learning model (the LSTM model) to a file so you can reuse it later without needing to retrain it. Run this code block.
  2. (Code Block 9B) This code block loads a previously saved model from the file specified by model_path. Run this code block to use the model again.
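
The save-and-reload round trip works roughly like the sketch below. A tiny stand-in model is used so the snippet runs on its own, and the file name is illustrative rather than the notebook's actual model_path.

```python
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense

# Stand-in model; in the project this would be the trained LSTM model.
model = Sequential([Dense(3, activation='softmax', input_shape=(4,))])
model.compile(optimizer='adam', loss='categorical_crossentropy')

model_path = 'sign_language_model.h5'   # illustrative file name
model.save(model_path)                  # write architecture + weights to disk
model = load_model(model_path)          # later: restore without retraining
```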

Evaluating Using a Confusion Matrix and Accuracy

  1. (Code Block 10A) This code block uses the trained model to make predictions and then compares them to the correct answers. It turns the predictions and actual labels into numbers so they can be easily compared and checked for accuracy. Run this code block.
  2. (Code Block 10B) This code block calculates a multilabel confusion matrix, which helps evaluate how well the model performs for each individual class (in this case, each sign or gesture). Instead of creating a single confusion matrix for the entire model, it generates a separate confusion matrix for each class. This lets you see how often the model correctly identifies a specific gesture versus how often it confuses it with others. To learn more about interpreting confusion matrices, see the resource in the Bibliography. (A small worked example follows this list.)
  3. (Code Block 10C) This code block calculates the accuracy of the model.
    1. Accuracy is a performance metric that measures how often a classification model correctly predicts the labels.
      Accuracy = (Number of Correct Predictions) / (Total Number of Predictions) 
    2. In other words, accuracy tells you what percentage of the total predictions made by the model were correct.
    3. If a model makes 5 predictions and 4 of them are correct:
      Accuracy = 4/5 = 0.8 or 80%.
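
A small worked example of both metrics, using stand-in labels rather than real model output, is sketched below:

```python
from sklearn.metrics import accuracy_score, multilabel_confusion_matrix

# Stand-in labels: 0 = 'hello', 1 = 'thank_you', 2 = 'see_you_later'
y_true = [0, 1, 2, 1, 0]
y_pred = [0, 1, 2, 0, 0]   # one 'thank_you' clip was mistaken for 'hello'

# One 2x2 confusion matrix per gesture: rows are actual, columns are predicted.
print(multilabel_confusion_matrix(y_true, y_pred))

# 4 correct predictions out of 5 -> 0.8, matching the example above.
print(accuracy_score(y_true, y_pred))
```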

Testing with Videos

  1. Now, it is time to test how well your model recognizes multiple signs in a row. Record a single video of yourself performing the gestures your model was trained on, but do them randomly and use different hands if applicable to add variety.
  2. Once you are done recording, upload the video to the test_videos folder in your Google Drive.
  3. (Code Block 11A) Next, re-run the “Importing Libraries” section of your notebook so the video becomes accessible in your code. Under the #TODO comment, replace the text defining the test_video variable with the exact name of your video file. Run the code block, and when it finishes, it will generate a new video called output_video.mp4 in the same test_videos folder. In this output, you will see the following (a sketch of the idea behind this block follows the list):
    1. A blue bar at the top showing the model’s detected gesture.
    2. The probabilities of each possible gesture displayed below the blue bar.
      Watch the output video and see whether your model recognized your signs accurately.
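
Under the hood, a block like 11A typically keeps a rolling window of the most recent 30 frames of keypoints and predicts a gesture once the window is full. This is an assumption about Code Block 11A's internals; the sketch uses stand-in keypoints and a fake predictor so it runs on its own.

```python
import numpy as np

actions = ['hello', 'thank_you', 'see_you_later']
frames_per_window = 30

def fake_predict(window):
    # Stand-in for model.predict on a (1, 30, 1662) batch: random probabilities.
    p = np.random.rand(len(actions))
    return p / p.sum()

window = []
for frame_keypoints in np.zeros((100, 1662)):     # stand-in keypoints for 100 frames
    window.append(frame_keypoints)
    window = window[-frames_per_window:]          # keep only the latest 30 frames
    if len(window) == frames_per_window:
        probs = fake_predict(np.array(window))
        gesture = actions[int(np.argmax(probs))]  # label shown in the blue bar
```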

(Optional) Improving Your Model

If the results are not what you desire, here are some suggestions on how you can improve your model:

  1. Collect More Diverse Training Data
    1. Include multiple people performing the signs to improve generalization.
    2. Perform signs with both left and right hands, and change the angle of your elbow if applicable.
  2. Increase Training Data Quality
    1. Trim gesture clips accurately to avoid noisy data.
    2. Increase the amount of data you have.
    3. Note: If you decide to add more data, you would need to delete the MP_Data folder and increase the variable no_sequences to the number of videos you want to take. 

Ask an Expert

Do you have specific questions about your science project? Our team of volunteer scientists can help. Our Experts won't do the work for you, but they will make suggestions, offer guidance, and help you troubleshoot.

Variations

  • Get the model to run in real-time using input from a webcam instead of pre-recorded videos.
  • Update your algorithm to understand medical terminology and test it with a small group of people pretending to be patients with the medical issues you used in the training dataset. Can the algorithm understand sign language well enough to communicate effectively in this setting?
  • Can you train a model that can recognize the difference between American Sign Language (ASL) and Spanish Sign Language (LSE), or even other sign languages worldwide?
  • Tune the LSTM model:
    • (Code Block 7A) Add more LSTM layers or increase the number of units in each layer.
    • (Code Block 7A) Try adding Dropout layers to prevent overfitting (see the sketch after this list).
    • (Code Block 4A) Experiment with sequence_length–how many frames the LSTM looks at per prediction.
  • Try a more advanced model:
    • Replace or combine the LSTM with models such as a GRU or a Transformer for better temporal learning.
  • Improve feature engineering:
    • Instead of raw keypoints, compute velocities (movement over time) or angles between joints to capture gesture dynamics.
  • Hyperparameter tuning:
    • Use grid search or random search to find the best values for learning rate, batch size, number of epochs, etc.
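
As an example of the Dropout variation, here is one way the Code Block 7A architecture could be modified; the layer sizes and dropout rates are illustrative, not the notebook's settings.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

model = Sequential([
    LSTM(64, return_sequences=True, activation='relu', input_shape=(30, 1662)),
    Dropout(0.2),                      # randomly drop 20% of units during training
    LSTM(64, return_sequences=False, activation='relu'),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dense(3, activation='softmax'),    # one output per gesture
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
```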

Careers

If you like this project, you might enjoy exploring these related careers:

Career Profile
Many aspects of peoples' daily lives can be summarized using data, from what is the most popular new video game to where people like to go for a summer vacation. Data scientists (sometimes called data analysts) are experts at organizing and analyzing large sets of data (often called "big data"). By doing this, data scientists make conclusions that help other people or companies. For example, data scientists could help a video game company make a more profitable video game based on players'… Read more
Career Profile
What if you couldn't tell someone what you needed or wanted? Or you couldn't understand what other people around you were saying? Can you imagine how frustrating that would be? Communication is vital to our lives as human beings. Language allows us to express our daily experiences, needs, wants, ideas, and dreams—even our jokes! Without it, we are isolated. Speech-language pathologists are the therapists who assess, diagnose, and treat communicative disorders related to speech, language,… Read more


Cite This Page

General citation information is provided here. Be sure to check the formatting, including capitalization, for the method you are using and update your citation, as needed.

MLA Style

Ngo, Tracey, and Laura Ohl. "Teaching Machines to Understand Sign Language." Science Buddies, 10 June 2025, https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p028/artificial-intelligence/sign_language_detection. Accessed 16 June 2025.

APA Style

Ngo, T., & Ohl, L. (2025, June 10). Teaching Machines to Understand Sign Language. Retrieved from https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p028/artificial-intelligence/sign_language_detection


Last edit date: 2025-06-10