Information Extraction from Narrative Pathology Reports on Melanoma (2008)
Description
This pilot study sought to demonstrate that key elements of one type of pathology report (melanoma) could be defined in a structured format. It also sought to demonstrate that hard copy pathology reports could be accurately annotated by trained linguists to indicate the correct structured report elements to support the automatic extraction of those elements.
Grant Recipient
University of Sydney
Aims and Objectives
No specific aims or objectives were included in the report, although the executive summary noted:
An analysis has been conducted on pathology reports of melanoma with the purpose of determining if human annotators can reliably identify the contents appropriate for devising methods for the automatic extraction of pathology concepts needed to populate structured reports.
The report’s authors stated all project objectives were achieved, although these objectives were not included in the report.
Findings
Linguists annotate a large corpus* of pathology reports more consistently than pathologists do.
Agreement among linguists is consistently higher than agreement among pathologists, or between a linguist and a pathologist.
Linguists miss about 6% of the tags annotated by the pathologists, and the pathologists miss about three to five times as many of the tags assigned by the linguists.
The findings suggest that once linguists are trained to understand the linguistic features and extent of pathology concepts, they can reliably annotate information to aid the automatic extraction of these concepts in structured pathology reports.
The study showed that only three tags were interpretative and the other 19 tags were directly defined by content. This indicates that the content-defined tags may be reliably extracted by automatic computation, although it will take further research to establish that this is the case.
The study also highlighted problematic areas of annotating such a large corpus.
Further refinement of the training process should significantly improve the linguists' performance, resulting in far fewer misses and an error rate below 3%.
*corpus/corpora: a collection of naturally occurring language text chosen to characterise a state or variety of language
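The agreement and miss-rate findings above can be illustrated with a minimal sketch. The report summary does not say which agreement measure the study used, so the exact-match F1 score and the miss-rate definition below are assumptions, and the annotations, offsets and tag names are invented for illustration only.

```python
from typing import Set, Tuple

# One annotation: (report_id, start_offset, end_offset, tag).
# All values used below are made up; they are not taken from the study.
Annotation = Tuple[str, int, int, str]


def miss_rate(reference: Set[Annotation], candidate: Set[Annotation]) -> float:
    """Fraction of the reference annotator's tags that the candidate did not produce."""
    if not reference:
        return 0.0
    return len(reference - candidate) / len(reference)


def f1_agreement(a: Set[Annotation], b: Set[Annotation]) -> float:
    """Symmetric exact-match agreement (F1) between two annotators."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return 2 * shared / (len(a) + len(b))


# Hypothetical annotations of a single report by two annotators.
linguist = {
    ("r1", 0, 8, "diagnosis"),
    ("r1", 20, 26, "thickness"),
    ("r1", 40, 47, "site"),
}
pathologist = {
    ("r1", 0, 8, "diagnosis"),
    ("r1", 20, 26, "thickness"),
    ("r1", 60, 70, "margin"),
    ("r1", 80, 85, "ulceration"),
}

# Tags the linguist missed, relative to the pathologist, and the reverse.
linguist_misses = miss_rate(pathologist, linguist)
pathologist_misses = miss_rate(linguist, pathologist)
agreement = f1_agreement(linguist, pathologist)
```

Here `miss_rate(pathologist, linguist)` treats the pathologist's tags as the reference, which mirrors how the finding about linguists missing tags is phrased; swapping the arguments gives the pathologist's miss rate against the linguist.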
Recommendation
Further research is required to establish the reliability with which these concepts can be defined and therefore annotated correctly.
Key Project Learnings
More structured meetings between linguists and pathologists would have been beneficial to revise the concept set before annotation of all reports.
A more structured schedule of concept tag-set revisions would have reduced confusion and improved tag development.
The study identified issues about the original pathology reports that could prove useful in the education of professional pathologists for their report writing.
Follow on Initiatives and Projects
A second stage of this project was undertaken to develop software to automatically compile the synoptic reports (Automatic Compilation of Synoptic Reports from Narrative Pathology Reports (Stage 2), 2010, submitted with the title: The Pathology Reporter).
Automatic Compilation of Synoptic Reports from Narrative Pathology Reports (Stage 2) (2010 Submitted with the title: The Pathology Reporter)
Description
This study sought to develop a software system that automatically extracts information from pathology reports for three types of cancer and displays it in a synoptic reporting format. It was designed around three report categories: melanoma, skin cancer and lymphoma.
Grant Recipient
University of Sydney
Aims and Objectives
These were not specifically outlined in the report; however, the following points were extracted from the introduction:
to complete the construction of an information extraction engine for melanoma
to annotate skin cancer and lymphoma reports
to build an extraction engine for skin cancer and lymphoma
to create a website where anyone could use the extraction engine for testing their own reports.
These aims and objectives were achieved by this project.
Outcome
The website where anyone could use the extraction engine for testing their own reports has been operating since February 2010.
Findings
Lymphoma reports were more difficult to annotate and more complex in their description of particular features, resulting in slightly lower accuracy for the automatic processing.
The results highlighted problematic areas of annotating pathology corpora*.
The work on recognising SNOMED CT codes for diagnoses, and on populating the structured report for melanoma, shows that this approach has the potential to improve the automatic processing of prose reports.
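As a rough illustration of what populating a structured report from annotated text might involve, the sketch below fills a small record from tagged character spans. It is not the project's extraction engine: the class, the function names, the tag names and the lookup table are hypothetical, and the code value is a placeholder rather than a real SNOMED CT identifier.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# A tagged span: (start_offset, end_offset, tag). Tag names are hypothetical.
Span = Tuple[int, int, str]


@dataclass
class SynopticReport:
    diagnosis: Optional[str] = None
    diagnosis_code: Optional[str] = None
    fields: Dict[str, str] = field(default_factory=dict)


# Hypothetical lookup table; the value is a placeholder, not a real SNOMED CT code.
CODE_LOOKUP = {"malignant melanoma": "PLACEHOLDER-CODE"}


def span_of(text: str, phrase: str, tag: str) -> Span:
    """Build a span for the first occurrence of phrase (stands in for an annotator)."""
    start = text.index(phrase)
    return (start, start + len(phrase), tag)


def populate(text: str, spans: List[Span]) -> SynopticReport:
    """Copy each tagged span into the matching slot of the structured report."""
    report = SynopticReport()
    for start, end, tag in spans:
        value = text[start:end].strip()
        if tag == "diagnosis":
            report.diagnosis = value
            report.diagnosis_code = CODE_LOOKUP.get(value.lower())
        else:
            report.fields[tag] = value
    return report


# Invented example sentence, not taken from any real pathology report.
text = "Diagnosis: malignant melanoma. Breslow thickness: 1.2 mm."
spans = [
    span_of(text, "malignant melanoma", "diagnosis"),
    span_of(text, "1.2 mm", "breslow_thickness"),
]
report = populate(text, spans)
```

Keeping the diagnosis code separate from the free-text fields reflects the distinction the finding draws between recognising codes for diagnoses and populating the rest of the structured report.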
Recommendation
It will take further research to establish the reliability with which these concepts can be defined and therefore annotated correctly.
Key Project Learning
The gains from this approach are likely to come from disease reports that have a high frequency and are most amenable to producing structured reports.
Follow on Initiatives and Projects
The third stage of this project is currently being undertaken (Automatic Population of Synoptic Reports from Narrative Pathology Reports (Stage 3)).
Areas for Future Consideration
Larger corpora* need to be collected for disease categories that are more complex in their pathology testing and descriptions.
*corpus/corpora: a collection of naturally occurring language text chosen to characterise a state or variety of language