The ORCHID dataset
The ORal Cancer Histology Dataset: A comprehensive histopathology dataset for ML in Oral Cancer grade and stage detection.
Nisha Chaudhary [1], Arpita Rai [2], Aakash Madhav Rao [3], Md Imam Faizan [1], Jeyaseelan Augustine [4], Akhilanand Chaurasia [5], Deepika Mishra [6], Akhilesh Chandra [7], Varnit Chauhan [1], Rintu Kutum [3,8], Tanveer Ahmad [1*]
Affiliations:
1. Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India
2. Rajendra Institute of Medical Sciences, Ranchi, Jharkhand, India
3. Department of Computer Science, Ashoka University, Haryana, India
4. Maulana Azad Institute of Dental Sciences, New Delhi, India
5. King George Medical University, Lucknow, Uttar Pradesh, India
6. All India Institute of Medical Sciences, New Delhi, India
7. Banaras Hindu University, Uttar Pradesh, India
8. Trivedi School of Biosciences, Ashoka University, Haryana, India
*. Corresponding Author
Abstract:
Oral cancer is a global health challenge, and accurate histopathological interpretation of oral cancer tissue samples remains difficult. Early diagnosis is particularly challenging because experienced pathologists are scarce and diagnoses vary between observers. Applying artificial intelligence (deep learning algorithms) to oral cancer histology images is promising for rapid diagnosis, but building AI models requires a high-quality annotated dataset. We present ORCHID (ORal Cancer Histology Image Database), a specialized database generated to advance research in AI-based histology image analytics of oral cancer and precancer. The ORCHID database is an extensive multicenter collection of 300,000 image patches covering various oral cancer and precancer categories, such as oral submucous fibrosis (OSMF) and oral squamous cell carcinoma (OSCC). It also contains grade-level sub-classifications for OSCC, including well-differentiated (WD), moderately-differentiated (MD), and poorly-differentiated (PD). The database is intended to support the creation and validation of innovative AI-based rapid diagnostics for OSMF and OSCC, along with their subtypes.
The code for the project can be found here.
My role:
I organized and harmonized the data into a modelling-ready format, carried out the technical validation, and benchmarked various models to evaluate performance on the dataset.
License
This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
Below is the manuscript, currently submitted to Scientific Data (Nature) and available on medRxiv: