Music Concert Program Optical Character Recognition

Scrape the websites of professional symphony orchestras to collect historic concert program documents. Perform optical character recognition (OCR) on these images of text to convert them to plain text, and save the results in a database. After collecting the data, apply natural language processing and machine learning algorithms to perform sentiment analysis on the dataset.


  1. Identify and adapt existing English-language OCR tools for use in this program.
  2. Scrape selected websites for program notes covering specific historic concerts.
  3. Convert images to parsable text. 
  4. Preprocess text using NLTK and other NLP tools.
  5. Create and maintain a database of images and converted text.
  6. (Optional) Perform machine learning sentiment analysis on the collected dataset. 
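
As a rough illustration of how steps 3 through 5 might fit together, the sketch below cleans raw OCR output, tokenizes it, and stores it in a SQLite database. Everything named here is an assumption made for illustration and not part of the project specification: the `program_pages` table, the helper functions, and the choice of SQLite. The OCR engine is passed in as a plain callable so the sketch runs with the standard library alone, and the regex tokenizer is only a stand-in for an NLTK tokenizer.

```python
import re
import sqlite3
from typing import Callable, List


def clean_ocr_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", joined).strip()


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens (hypothetical stand-in for an NLTK tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) table that pairs each image with its text."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS program_pages (
               id INTEGER PRIMARY KEY,
               image_path TEXT UNIQUE NOT NULL,
               ocr_text TEXT NOT NULL
           )"""
    )


def store_page(
    conn: sqlite3.Connection,
    image_path: str,
    ocr: Callable[[str], str],
) -> str:
    """Run the supplied OCR callable on one image and save the cleaned text."""
    text = clean_ocr_text(ocr(image_path))
    conn.execute(
        "INSERT OR REPLACE INTO program_pages (image_path, ocr_text) "
        "VALUES (?, ?)",
        (image_path, text),
    )
    conn.commit()
    return text


if __name__ == "__main__":
    # A fake OCR function keeps the example self-contained.
    def fake_ocr(path: str) -> str:
        return "The sym-\nphony  opens\nquietly."

    conn = sqlite3.connect(":memory:")
    init_db(conn)
    page_text = store_page(conn, "page1.png", fake_ocr)
    print(page_text)
    print(tokenize(page_text))
```

In practice the `ocr` argument could wrap an existing engine; one possibility, assuming Tesseract and the pytesseract and Pillow packages are installed, is `lambda p: pytesseract.image_to_string(Image.open(p))`. Keeping the engine behind a callable makes it easy to compare OCR tools without touching the storage code.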


Researchers in computational musicology want to apply modern natural language processing techniques to historic documents describing musical pieces and specific performances. Although some organizations, such as the New York Philharmonic, provide publicly available archives of documents, these documents are not in a format that machine learning tools can readily use. The Soundbender lab wants a tool that leverages existing OCR tools to convert images of concert program notes into parsable text for natural language analysis.


Minimum Qualifications:
  • CS 290 (web dev)
  • CS 340 (database)
  • interest in Classical music

Preferred Qualifications:
  • experience with web scraping (e.g., Beautiful Soup)
  • experience with optical character recognition or NLP
  • CS 493 (cloud)


Project Partner:

Patrick Donnelly


No Agreement Required

Number Groups:


Project Status:

Accepting Applicants
