The ORCHID dataset | Aakash M. Rao

Nisha Chaudhary [1], Arpita Rai [2], Aakash Madhav Rao [3], Md Imam Faizan [1], Jeyaseelan Augustine [4], Akhilanand Chaurasia [5], Deepika Mishra [6], Akhilesh Chandra [7], Varnit Chauhan [1], Rintu Kutum [3,8], Tanveer Ahmad [1*]

Affiliations:

1. Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India
2. Rajendra Institute of Medical Sciences, Ranchi, Jharkhand, India
3. Department of Computer Science, Ashoka University, Haryana, India
4. Maulana Azad Institute of Dental Sciences, New Delhi, India
5. King George Medical University, Lucknow, Uttar Pradesh, India
6. All India Institute of Medical Sciences, New Delhi, India
7. Banaras Hindu University, Uttar Pradesh, India
8. Trivedi School of Biosciences, Ashoka University, Haryana, India
*. Corresponding Author

Abstract:

Oral cancer is a global health challenge, and accurate histopathological interpretation of oral cancer tissue samples remains difficult. Early diagnosis is particularly challenging because of a shortage of experienced pathologists and inter-observer variability in diagnosis. Applying artificial intelligence (deep learning algorithms) to oral cancer histology images is promising for rapid diagnosis, but building AI models requires a high-quality annotated dataset. We present ORCHID (ORal Cancer Histology Image Database), a specialized database generated to advance research in AI-based histology image analytics of oral cancer and precancer. The ORCHID database is an extensive multicenter collection of 300,000 image patches covering oral cancer and precancer categories such as oral submucous fibrosis (OSMF) and oral squamous cell carcinoma (OSCC). It also contains grade-level sub-classifications for OSCC, such as well-differentiated (WD), moderately-differentiated (MD), and poorly-differentiated (PD). The database is intended to support the creation and validation of AI-based rapid diagnostics for OSMF and OSCC, along with their subtypes.

The code for the project can be found here.

My role:

I organized and harmonized the data into a format that is ready for modelling. I also carried out the technical validation and benchmarked various models to evaluate how well they perform on the dataset.

License

This work has been licensed under the Creative Commons CC BY-NC 4.0 license.

Below is the work currently submitted to Scientific Data (Nature) and available on medRxiv: