Using Machine Learning to Code Occupational Surveillance Data: A Cooperative Effort between NIOSH and the Harvard Computer Society – Tech for Social Good Program
The National Institute for Occupational Safety and Health (NIOSH) depends on surveillance data collected through the occupational supplement to the National Electronic Injury Surveillance System (NEISS-Work) to study and understand nonfatal occupational injuries. Collected through an interagency agreement with the Consumer Product Safety Commission, NEISS-Work captures hospital emergency department-treated occupational injuries to paid, self-employed, and volunteer workers. NEISS-Work includes demographic characteristics, nature of injury, incident characteristics, and employment information.
To obtain useful results from these data and target prevention efforts, standardized industry codes must be assigned to identify workers in high-risk industries. The employment information obtained through NEISS-Work is in the form of unstructured text fields, and industry codes are assigned manually by a trained coder. Based on prior experience, NIOSH anticipated that a machine learning algorithm could improve the coding process. For example, a prior study found that a computer program took less than 3 hours to finish what would have taken 4.5 years to code manually.
Through an arrangement between NIOSH and the Harvard Computer Society Tech for Social Good (T4SG) program, NIOSH asked T4SG to create an “auto-encoder” that would use machine learning based on previously coded datasets to assign industry codes to new datasets.
After determining the feasibility of the project, students from T4SG constructed and trained the algorithm. The final version of the auto-encoder was a model that attempted to classify data points into one of 257 industry codes using information including the employer’s name and a text field describing business type. The model also calculates its estimated confidence in each prediction as a probability. The students from T4SG created reproducible modules of code so that, as new training data become available, they can easily be used to re-train the model. This ability to re-train on new data will be necessary as new business types and employer names are reported to NEISS-Work.
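The post does not say which algorithm the students used, so the following is only a minimal sketch of the general idea: a toy naive Bayes text classifier, written with the Python standard library, that maps free-text employment fields to a code and reports a probability for its prediction. The class name `NaiveBayesCoder`, the example records, and the placeholder codes `CONSTRUCTION` and `HEALTHCARE` are all hypothetical; the real model distinguished 257 industry codes and was trained on previously coded NEISS-Work data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real system would normalize further."""
    return text.lower().split()


class NaiveBayesCoder:
    """Toy multinomial naive Bayes text classifier (illustrative only)."""

    def fit(self, texts, codes):
        # Count records per code and word occurrences per code.
        self.code_counts = Counter(codes)
        self.word_counts = defaultdict(Counter)
        for text, code in zip(texts, codes):
            self.word_counts[code].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_proba(self, text):
        """Return {code: probability} for one record, using add-one smoothing."""
        total_docs = sum(self.code_counts.values())
        log_scores = {}
        for code, n_docs in self.code_counts.items():
            counts = self.word_counts[code]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            log_scores[code] = score
        # Turn log scores into probabilities, shifting by the max for stability.
        top = max(log_scores.values())
        exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {c: v / norm for c, v in exp_scores.items()}

    def predict(self, text):
        """Return the most probable code and its estimated probability."""
        probs = self.predict_proba(text)
        best = max(probs, key=probs.get)
        return best, probs[best]


# Made-up training records: employer name plus business-type text.
train_texts = [
    "acme builders residential construction",
    "smith roofing roofing contractor",
    "city hospital general medical hospital",
    "valley clinic outpatient medical clinic",
]
train_codes = ["CONSTRUCTION", "CONSTRUCTION", "HEALTHCARE", "HEALTHCARE"]

model = NaiveBayesCoder().fit(train_texts, train_codes)
code, confidence = model.predict("county medical hospital")
print(code, round(confidence, 3))
```

In this sketch, re-training amounts to calling `fit` again on the enlarged set of coded records, which mirrors the idea of folding in new employer names and business types as they are reported.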
The goal of the project was to reduce the number of cases that had to be manually coded by 20-30%. Based on initial tests, the machine learning algorithm was able to code up to 60% of the records with a high degree of reliability, surpassing the original target. The accuracy will be tested on additional data to evaluate the long-term success of the model. New or additional training could further improve the model’s performance.
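The post does not describe how records were selected for automatic coding. One common approach, sketched here purely as an assumption, is to accept a prediction only when its estimated probability clears a cutoff and to send everything else to a trained coder. The function name `route_by_confidence`, the 0.9 threshold, and the sample predictions are hypothetical and are not NIOSH results.

```python
def route_by_confidence(predictions, threshold=0.9):
    """Split (record_id, code, probability) tuples into auto-coded and manual-review lists."""
    auto_coded, manual_review = [], []
    for record_id, code, prob in predictions:
        if prob >= threshold:
            auto_coded.append((record_id, code))  # accept the model's code
        else:
            manual_review.append(record_id)  # leave for a trained coder
    return auto_coded, manual_review


# Made-up model outputs: (record id, predicted code, estimated probability).
predictions = [
    (1, "CONSTRUCTION", 0.97),
    (2, "HEALTHCARE", 0.62),
    (3, "RETAIL", 0.55),
    (4, "HEALTHCARE", 0.99),
    (5, "CONSTRUCTION", 0.41),
]

auto_coded, manual_review = route_by_confidence(predictions, threshold=0.9)
share_auto = len(auto_coded) / len(predictions)
print(f"auto-coded: {share_auto:.0%}, manual review: {manual_review}")
```

Raising the threshold trades coverage for reliability: fewer records are coded automatically, but those that are should be wrong less often. The right cutoff would have to be chosen by checking accuracy on records that human coders have already coded.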
Through the collaboration with Harvard Computer Society T4SG, the Harvard computer science students applied their knowledge and skills in artificial intelligence (AI) and machine learning and helped NIOSH code a natural language database. Their contribution reduces the time it takes for NIOSH to make the information available for analysis, which will ultimately benefit workers’ safety and health. These AI and machine learning techniques hold much promise for the future of work as an aid to complex data analysis.
Please comment below on ways that you have used AI or machine learning in your work to advance occupational safety and health.
Gavin Lifrieri is a student at Harvard and a member of the Harvard Computer Society Tech for Social Good, a student group that partners with nonprofits, government agencies, and social enterprises to amplify their impact through technology and empower student leaders to use technology to tackle the world’s big problems. More info is on their website at: https://socialgood.hcs.harvard.edu/.
Suzanne Marsh, MPA, is a Team Lead in the Surveillance and Field Investigations Branch in the NIOSH Division of Safety Research.