Developing a rule-based NLP application (2017)

July 17-19, 2017
(Instructors: Wendy Chapman, Kelly Peterson, Jianlin Shi)

A hypothetical healthcare delivery system has asked us to determine whether use of opioids is associated with increased risk of community-acquired pneumonia in adults. We have two-and-a-half days to identify patients with pneumonia so that the health services team can assess the risk based on pharmacy data. There is not enough time to manually read through the thousands of charts, so we will apply natural language processing for the task. Student teams in this course will collaboratively build an NLP application for identifying patients with pneumonia from a publicly available dataset. You will apply existing tools to annotate a reference standard and to create a computable knowledge base (OWL ontology) of conditions in the CDC pneumonia case definition. You will then use the pyConText library to build rule-based NLP tools to identify those conditions in clinical reports and determine which patients have pneumonia. We will evaluate the predictive performance of the tools and discuss the costs and benefits of rule-based NLP.

Students completing this course will be able to:

  • Explain the costs and benefits of using NLP for cohort identification
  • Develop annotation guidelines
  • Annotate a reference standard set using existing annotation tools
  • Develop a knowledge base that will drive information extraction from clinical text
  • Build a rule-based NLP tool using the Python library pyConText
  • Evaluate performance of an NLP tool
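
As a rough illustration of the last objective, document-level performance can be summarized by comparing the tool's labels with the reference standard. The sketch below is a minimal example, not course material: the function name `evaluate`, the document ids, and the labels are all made up for illustration.

```python
def evaluate(gold, predicted):
    """Compare document-level labels (True = pneumonia), keyed by document id.

    Documents missing from `predicted` are treated as predicted negative.
    """
    tp = sum(1 for d in gold if gold[d] and predicted.get(d, False))
    fp = sum(1 for d in gold if not gold[d] and predicted.get(d, False))
    fn = sum(1 for d in gold if gold[d] and not predicted.get(d, False))
    tn = sum(1 for d in gold if not gold[d] and not predicted.get(d, False))

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical reference standard and tool output for five documents.
gold = {"doc1": True, "doc2": True, "doc3": False, "doc4": False, "doc5": True}
pred = {"doc1": True, "doc2": False, "doc3": True, "doc4": False, "doc5": True}

result = evaluate(gold, pred)
print(result)
```

Recall (sensitivity) matters most when the goal is not to miss pneumonia cases; precision (positive predictive value) matters most when every flagged chart costs reviewer time.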


You will be working in teams, so the following prerequisites apply to the skills of an entire team:

  • Some experience programming with Python
  • Clinical domain expertise
  • Organizational abilities




Topic Product
Monday 9-11 am


Notes (just brainstorming here):

Split into teams: physical findings team, symptom team, CXR (chest X-ray) findings team? Use only one note type (discharge summary?), so that patient = document.

  1. CDC case definition – create guidelines for annotating each part (document and mention) (use spreadsheet to list concepts and generate draft guidelines)
  2. Annotate some cases at document level using eHOST – development reference set
  3. Generate a domain ontology from the spreadsheet → generate pyConText files from ontology
  4. pyConText – learn how to build an app using the library
  5. Create rule-based classifier from pyConText’s output (using spreadsheet)
  6. Run pyConText over the development reference set
  7. Evaluate performance at document level
  8. Error evaluation and tweaking of lexicons
  9. Run over test set (pre-annotated by us)
  10. Discuss results
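
To make steps 3 to 6 more concrete, here is a minimal sketch of the general idea in plain Python: a lexicon of target concepts, a list of negation triggers, and a rule that turns mention-level findings into a document-level label. It does not use pyConText or reproduce its API. The lexicon, the trigger list, and the classification rule are placeholders invented for illustration, and the rule is not the CDC case definition. Negation handling is deliberately crude: a trigger anywhere earlier in the same sentence negates the mention.

```python
import re

# Hypothetical lexicon (concept -> surface forms); in the course this would
# be generated from the spreadsheet / ontology.
TARGETS = {
    "pneumonia": ["pneumonia", "pna"],
    "infiltrate": ["infiltrate", "consolidation"],
    "fever": ["fever", "febrile"],
}

# Hypothetical negation triggers.
NEGATION_TRIGGERS = ["no", "denies", "without", "negative for"]


def find_mentions(text):
    """Return (concept, negated) pairs for every target term found in the text."""
    mentions = []
    # Naive sentence split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text.lower()):
        for concept, terms in TARGETS.items():
            for term in terms:
                pattern = r"\b" + re.escape(term) + r"\b"
                for match in re.finditer(pattern, sentence):
                    preceding = sentence[:match.start()]
                    # Simplification: any trigger earlier in the sentence negates.
                    negated = any(
                        re.search(r"\b" + re.escape(trigger) + r"\b", preceding)
                        for trigger in NEGATION_TRIGGERS
                    )
                    mentions.append((concept, negated))
    return mentions


def classify_document(text):
    """Placeholder rule: affirmed pneumonia, or affirmed infiltrate plus fever."""
    affirmed = {concept for concept, negated in find_mentions(text) if not negated}
    return "pneumonia" in affirmed or {"infiltrate", "fever"} <= affirmed


# Two made-up notes, not taken from the course dataset.
note_positive = "Patient is febrile. Chest x-ray shows right lower lobe consolidation."
note_negative = "No evidence of pneumonia. Patient denies fever."

print(classify_document(note_positive))
print(classify_document(note_negative))
```

Error analysis (step 8) then amounts to reading the documents the rule gets wrong and adjusting the lexicon, the triggers, or the rule itself.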