Using Machine Learning to Code Occupational Surveillance Data: A Cooperative Effort between NIOSH and the Harvard Computer Society – Tech for Social Good Program

Posted on by Gavin Lifrieri and Suzanne Marsh, MPA


The National Institute for Occupational Safety and Health (NIOSH) depends on surveillance data collected through the occupational supplement to the National Electronic Injury Surveillance System (NEISS-Work) to study and understand nonfatal occupational injuries. Collected through an interagency agreement with the Consumer Product Safety Commission, NEISS-Work captures hospital emergency department-treated occupational injuries to paid, self-employed, and volunteer workers. NEISS-Work includes demographic characteristics, nature of injury, incident characteristics, and employment information.

To obtain useful results from these data and target prevention efforts, standardized industry codes must be assigned to identify workers in high-risk industries. The employment information obtained through NEISS-Work is in the form of unstructured text fields, and industry codes are assigned manually by a trained coder. Based on prior experience, NIOSH anticipated that a machine learning algorithm could improve the coding process. For example, a prior study found that a computer program took less than 3 hours to finish coding that would have taken 4.5 years to do manually.

Through an arrangement between NIOSH and the Harvard Computer Society Tech for Social Good (T4SG) program, NIOSH asked T4SG to create an “auto-encoder” that would use machine learning based on previously coded datasets to assign industry codes to new datasets.

The Output

After determining the feasibility of the project, students from T4SG constructed and trained the algorithm. The final version of the auto-encoder was a model that attempts to classify each data point into one of 257 industry codes using information that includes the employer’s name and a text field describing business type. The model also reports its estimated confidence in each prediction as a probability. The students from T4SG created reproducible modules of code so that new training data can easily be used to re-train the model as they become available. This ability to re-train on new data will be necessary as new business types and employer names are reported to NEISS-Work.
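To make the idea concrete, here is a minimal sketch of a text classifier that assigns an industry code and reports a probability for it. It is not the T4SG model, whose design is not described here; it swaps in a toy multinomial naive Bayes classifier written with the Python standard library only. The class name `NaiveBayesCoder`, the three industry labels, and the training rows are all hypothetical and are not real NEISS-Work codes or records.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase free text and split it into word tokens."""
    return text.lower().split()


class NaiveBayesCoder:
    """Toy multinomial naive Bayes classifier (illustrative stand-in only)."""

    def fit(self, texts, codes):
        # Count how often each code appears and which words appear under it.
        self.code_counts = Counter(codes)
        self.word_counts = defaultdict(Counter)
        for text, code in zip(texts, codes):
            self.word_counts[code].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_proba(self, text):
        """Return a dict mapping each code to its estimated probability."""
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total = sum(self.code_counts.values())
        log_scores = {}
        for code, n in self.code_counts.items():
            counts = self.word_counts[code]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for t in tokens:
                # Add-one smoothing so unseen words do not zero out a code.
                score += math.log((counts[t] + 1) / denom)
            log_scores[code] = score
        # Normalize log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        exps = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exps.values())
        return {c: v / z for c, v in exps.items()}

    def predict(self, text):
        """Return the most likely code and the model's confidence in it."""
        probs = self.predict_proba(text)
        best = max(probs, key=probs.get)
        return best, probs[best]


# Hypothetical previously coded records: (employer/business text, code).
TRAIN = [
    ("acme builders construction company", "CONSTRUCTION"),
    ("city roofing contractor", "CONSTRUCTION"),
    ("general hospital nursing", "HEALTHCARE"),
    ("county nursing home", "HEALTHCARE"),
    ("corner grocery store", "RETAIL"),
    ("discount clothing store", "RETAIL"),
]

coder = NaiveBayesCoder().fit([t for t, _ in TRAIN], [c for _, c in TRAIN])
code, confidence = coder.predict("roofing company")
print(code, round(confidence, 2))
```

Re-training in this sketch amounts to calling `fit` again with the enlarged set of coded records, which mirrors the re-training step described above in spirit only.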

The Impact

The goal of the project was to reduce the number of cases that had to be manually coded by 20-30%. Based on initial tests, the machine learning algorithm was able to code up to 60% of the records with a high degree of reliability, surpassing the original target. The accuracy will be tested on additional data to evaluate the long-term success of the model. New or additional training could further improve the model’s performance.
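One plausible way to turn per-record probabilities into a reduced manual workload is to accept predictions above a confidence cutoff and send the rest to a human coder. The sketch below illustrates that routing under stated assumptions: the 0.9 threshold, the function names, and the sample predictions are hypothetical, and the cutoff NIOSH actually used is not given here.

```python
def route_records(predictions, threshold=0.9):
    """Split (record_id, code, confidence) tuples into two queues.

    Records at or above the threshold are auto-coded; the rest go to
    manual review. The threshold value is an assumption for illustration.
    """
    auto, manual = [], []
    for record_id, code, confidence in predictions:
        target = auto if confidence >= threshold else manual
        target.append((record_id, code, confidence))
    return auto, manual


def auto_coded_share(predictions, threshold=0.9):
    """Fraction of records that would not need manual coding."""
    if not predictions:
        return 0.0
    auto, _ = route_records(predictions, threshold)
    return len(auto) / len(predictions)


# Hypothetical model output: (record id, predicted code, confidence).
SAMPLE = [
    (1, "CONSTRUCTION", 0.97),
    (2, "HEALTHCARE", 0.55),
    (3, "RETAIL", 0.93),
    (4, "HEALTHCARE", 0.91),
    (5, "RETAIL", 0.40),
]

auto, manual = route_records(SAMPLE)
share = auto_coded_share(SAMPLE)
print(f"auto-coded: {len(auto)}, manual: {len(manual)}, share: {share:.0%}")
```

Raising the threshold trades coverage for reliability: fewer records are auto-coded, but those that are carry higher estimated confidence.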

Through the collaboration with Harvard Computer Society T4SG, the Harvard computer science students applied their knowledge and skills in artificial intelligence (AI) and machine learning and helped NIOSH code a natural language database. Their contribution reduces the time it takes for NIOSH to make the information available for analysis, which will ultimately benefit workers’ safety and health. These AI and machine learning techniques hold much promise for aiding complex data analysis in the future of work.

Please comment below on ways that you have used AI or machine learning in your work to advance occupational safety and health.


Gavin Lifrieri is a student at Harvard and a member of the Harvard Computer Society Tech for Social Good, a student group that partners with nonprofits, government agencies, and social enterprises to amplify their impact through technology and empower student leaders to use technology to tackle the world’s big problems. More info is on their website at:

Suzanne Marsh, MPA, is a Team Lead in the Surveillance and Field Investigations Branch in the NIOSH Division of Safety Research.





Page last reviewed: August 19, 2021
Page last updated: August 19, 2021