The value of artificial intelligence for detection and grading of prostate cancer in human prostatectomy specimens: a validation study



The Gleason grading system is central to diagnosing prostate cancer in pathology images. However, grading varies significantly among pathologists, with possible negative clinical impact. Artificial intelligence methods can support the pathologist and improve Gleason grade classification. Consequently, our purpose is to construct a Convolutional Neural Network (CNN) and evaluate its potential to classify Gleason patterns.


The methodology included 6982 image patches with cancer, extracted from radical prostatectomy specimens previously analyzed by an expert uropathologist. A CNN was constructed to classify the corresponding Gleason pattern. The evaluation computed the three-class confusion matrix and, from it, the precision, sensitivity, and specificity, as well as the overall accuracy. Additionally, three-fold cross-validation was performed to strengthen the evaluation, allowing better interpretation and reducing possible bias.


The overall accuracy reached 98% for the training and validation stage, and 94% for the test phase. Considering the test samples, the true positive ratio between pathologist and computer method was 85%, 93%, and 96% for specific Gleason patterns. Finally, precision, sensitivity, and specificity reached values up to 97%.


The CNN model presented and evaluated showed high accuracy, specifically for neighboring patterns and for critical Gleason patterns. The outcomes are in line with, and complement, others in the literature. The promising results surpassed the inter-pathologist concordance reported in classical studies, evidencing the potential of this novel technology in daily clinical practice.


Prostate cancer is the second most frequent cancer and the fifth leading cause of cancer death among men worldwide. Globocan 2020 data count 1,414,259 new diagnoses and 375,304 deaths in a single year. It is known that asymptomatic, early-stage, well-treated disease is associated with up to 98% long-term survival according to guidelines [1,2,3].

Pathological diagnosis still diverges between specialists, even when they use the same classical Gleason grading (GG) system [4,5,6,7]. Tumor grading is a cornerstone of cancer therapy guidance, and its heterogeneity raises concerns among practitioners worldwide [8,9,10] and a need for novel tools. GG, described elsewhere [4], is a purely visual analysis, which throws the spotlight on the standardization capability of artificial intelligence, with growing evidence in the deep learning field [11, 12].

Since 1998, the Convolutional Neural Network (CNN) has become an established and popular technique for image classification [13]. Since then, different topologies have been constructed and evaluated for many applications. Some groups have presented data on high-performance CNN classification of Gleason grading standards [13,14,15,16,17,18]. However, in every context, specific limitations may be found in data organization, computational cost, and scope or focus of evaluation. These limitations may compromise accurate interpretation in specific contexts. Consequently, further publications in the medical literature are still needed to complement, strengthen, and support the potential of CNNs and their continuous evolution for this application. Therefore, the purpose of this study is to construct and evaluate a deep learning model to grade the relevant Gleason patterns.



This study’s hypothesis is that a CNN system could efficiently grade the relevant GG patterns. Accordingly, we combined computer engineering with high-standard pathological reports to construct, train, and evaluate a specific CNN system to classify G3, G4, and G5.


The study design is divided into two main workflows, Clinical Actions and Computational Actions, each with corresponding procedures.

Clinical actions

The Laboratory of Medical Investigations of the Medical School of the University of Sao Paulo (FMUSP) collected 32 previously reported radical prostatectomy specimens. They were stained with hematoxylin and eosin (H&E) and scanned with an Aperio® microscope. The slide images were analyzed, and Gleason patterns 3, 4, and 5 were delineated by the corresponding specialists. Additionally, images from “the Prostate Cancer Grade Assessment (PANDA) Challenge” were added to the dataset, providing a richer dataset to improve the performance, robustness, and generalization capacity of the model. PANDA includes two open-access datasets: Karolinska Institute (images divided into background, benign and cancerous tissue) and Radboud University Medical Center (13 images divided into background, stroma, benign tissue, and Gleason patterns 3, 4, and 5). To ensure methodological similarity, only Radboud images were used to enlarge the initial training sample. All the samples underwent the same screening process previously presented [19].

Computational actions

The computational procedures consist of the Patch Extraction Step and the Deep Learning Step (Fig. 1).

Fig. 1 The steps of the study design with their corresponding illustrations. First, the Clinical Actions, resulting in the marked images. Second, the Computational Actions, showing the two main steps: the Patch Extraction Step and the Deep Learning Step

Patch extraction step

This step consists of building a dataset of extracted patches: small sample images of 256 × 256 pixels, taken at 20× magnification from the previously marked regions. The magnification and patch size were chosen because they are adequate for identifying clinical elements individually and in combination. With these parameters, we obtained a total of 6982 patches (5036 from FMUSP prostatectomy samples and 1946 from the PANDA Challenge dataset). As a result, patches of Gleason patterns 3, 4, and 5 were obtained and can be identified by their corresponding slide (Fig. 2).
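The tiling described above can be illustrated with a minimal sketch. It assumes the expert annotation is available as a 2D label mask (0 for unmarked pixels; 3, 4, or 5 for the delineated Gleason pattern), that tiles do not overlap, and that a tile is kept only when one label covers most of it. The function name `extract_patch_coords` and the `min_coverage` threshold are hypothetical and are not taken from the study.

```python
from collections import Counter

PATCH_SIZE = 256  # pixels per side, as stated in the text


def extract_patch_coords(mask, patch_size=PATCH_SIZE, min_coverage=0.9):
    """Return (top, left, label) for each tile dominated by one annotated label.

    mask: 2D list of ints, 0 for unmarked pixels and 3, 4 or 5 for Gleason patterns.
    min_coverage: hypothetical threshold, not a value reported in the paper.
    """
    height, width = len(mask), len(mask[0])
    patches = []
    # Slide a non-overlapping window over the annotated mask (an assumption).
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            counts = Counter(
                mask[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            )
            label, n = counts.most_common(1)[0]
            # Keep the tile only if an annotated pattern covers enough of it.
            if label != 0 and n / (patch_size * patch_size) >= min_coverage:
                patches.append((top, left, label))
    return patches
```

For a toy 4 × 8 mask whose left half is labelled 3, calling `extract_patch_coords(mask, patch_size=4)` keeps only the left tile. In practice the same coordinates would be used to crop the 20× slide image.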

Fig. 2 Illustration of patch extraction, with each patch connected to its corresponding slide and Gleason grade. Specifically, patches from slides SA, SB, and SC were set aside for the cross-validation process, in which they were alternately used as training, validation, and test data

Deep learning step

This step included constructing the topology from a combination of multiple blocks. Previous architectures, their characteristics, and important elements were considered to establish the proposed structure. Several experiments were performed using features of complex neural nets, combined blocks, and learning methods, resulting in a high-performing architecture for this purpose, as shown in Fig. 3. The network starts with two convolutional layers containing 32 filters with a 5 × 5 kernel and 64 filters with a 5 × 5 kernel, respectively. The number of filters is related to the diversity of features extracted from the input: the more filters, the more complementary features are extracted and considered to support the decision. The batch normalization layer standardizes the output values of the corresponding layer, decreasing the chances of value range saturation. Max pooling decreases the feature matrix dimension, allowing only the strongest activations to proceed; the first sequence ends in a dropout layer. The other sequences (second and third) work similarly, except for the number of filters (64 and 128 in the second, 128 and 256 in the third). Lastly, information goes through the fully connected layer containing 512 neurons in the hidden layer, followed by batch normalization and a dropout of 0.5. SoftMax was used as an activation function, and RMSprop was chosen as the optimizer for training; thus, the neural net outputs the images classified as Gleason patterns 3, 4, and 5.
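To give a sense of the size of the convolutional part of this topology, the following sketch counts the trainable parameters of the six convolutional layers. It assumes a 3-channel RGB input, 5 × 5 kernels in all three sequences, and one bias per filter; none of these details is stated explicitly for every layer, so the totals are illustrative only.

```python
KERNEL = 5  # 5 x 5 kernels, assumed to apply to all three sequences

# (input_channels, output_filters) for each convolutional layer.
# The 3-channel RGB input is an assumption.
CONV_LAYERS = [
    (3, 32), (32, 64),       # first sequence
    (64, 64), (64, 128),     # second sequence
    (128, 128), (128, 256),  # third sequence
]


def conv_params(c_in, c_out, kernel=KERNEL):
    """Weights plus one bias per filter for a 2D convolution."""
    return (kernel * kernel * c_in + 1) * c_out


def total_conv_params(layers=CONV_LAYERS):
    """Sum the parameter counts over all convolutional layers."""
    return sum(conv_params(c_in, c_out) for c_in, c_out in layers)


if __name__ == "__main__":
    for c_in, c_out in CONV_LAYERS:
        print(f"{c_in:>3} -> {c_out:>3} filters: {conv_params(c_in, c_out):>7} parameters")
    print("total:", total_conv_params())
```

Under these assumptions, the first layer has 2432 parameters and the six layers together have 1,590,272. The fully connected layer is not counted, because its size depends on padding and pooling settings that the text does not specify.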

Fig. 3 Proposed CNN architecture configuration

The extracted patch images (Fig. 2) were applied to the described architecture (Fig. 3) for training and evaluation. The training, validation, and test groups were separated using ratios of 80%, 10%, and 10%, respectively. In addition, to obtain the most from our image set, we carried out a three-fold cross-validation, as shown in Fig. 4.

Fig. 4 Illustration of how the image slides were separated for the cross-validation process. Slides SA, SB, and SC alternate as the source of patches for validation and test, or complete the rest of the training data, generating the final classification according to Gleason grades (G3, G4, and G5)
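The slide-wise rotation illustrated in Fig. 4 can be sketched as follows. The sketch assumes that, in each fold, one of the three slides supplies the validation patches, the next one supplies the test patches, and all other slides (including the third held-out slide) supply training patches. The exact rotation order and the function name `make_folds` are assumptions for illustration.

```python
HELD_OUT = ["SA", "SB", "SC"]  # slide names from Figs. 2 and 4


def make_folds(patches_by_slide, held_out=HELD_OUT):
    """Return one dict per fold with 'train', 'validation' and 'test' patch lists.

    patches_by_slide: dict mapping slide id -> list of patch identifiers.
    In fold k, held_out[k] supplies validation patches and the next held-out
    slide supplies test patches (an assumed rotation order); every other
    slide goes to training.
    """
    folds = []
    for k, val_slide in enumerate(held_out):
        test_slide = held_out[(k + 1) % len(held_out)]
        train = [
            patch
            for slide, patches in patches_by_slide.items()
            if slide not in (val_slide, test_slide)
            for patch in patches
        ]
        folds.append({
            "train": train,
            "validation": list(patches_by_slide[val_slide]),
            "test": list(patches_by_slide[test_slide]),
        })
    return folds
```

Because whole slides are assigned to a single group, no patch from the validation or test slide can appear in the training data of the same fold.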

Specifically, this cross-validation took patches from slides SA, SB, and SC (Figs. 2 and 4) to be individually used as validation and test sets, while the patches from the remaining slides completed the training data. This prevented patches from the same patient and slide from being present in more than one of the training, validation, and test groups. It therefore provided wider context variation and a composed outcome that is more reliable and less biased, supporting better interpretation. Slides SA, SB, and SC were chosen because each had the most balanced distribution of patches for Gleason 3, 4, and 5. The distribution and number of patches used for each k of the k-fold can be seen in Table 1. Finally, to improve model accuracy and robustness and to minimize potential overfitting, data augmentation was performed before the patches were applied to the neural net; specifically, this process included random rotations, brightness changes, and zoom.
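The three augmentation operations named above can be sketched in plain Python on a square, single-channel patch stored as a list of rows. This is a simplified illustration: rotations are restricted to multiples of 90 degrees, and the brightness and zoom ranges are hypothetical values, since the study does not report its augmentation parameters.

```python
import random


def rotate90(img, times):
    """Rotate a 2D list clockwise by 90 degrees, `times` times."""
    for _ in range(times % 4):
        img = [list(row) for row in zip(*img[::-1])]
    return img


def adjust_brightness(img, factor):
    """Scale pixel values by `factor`, clipping to the 0-255 range."""
    return [[max(0, min(255, round(v * factor))) for v in row] for row in img]


def zoom_in(img, factor):
    """Crop the centre by `factor` (>= 1) and resize back with nearest neighbour."""
    size = len(img)  # assumes a square patch
    crop = max(1, int(size / factor))
    offset = (size - crop) // 2
    return [
        [img[offset + (r * crop) // size][offset + (c * crop) // size]
         for c in range(size)]
        for r in range(size)
    ]


def augment(img, rng=random):
    """Apply one random rotation, brightness change and zoom to a square patch."""
    img = rotate90(img, rng.randrange(4))
    img = adjust_brightness(img, rng.uniform(0.8, 1.2))  # hypothetical range
    return zoom_in(img, rng.uniform(1.0, 1.2))           # hypothetical range
```

Passing a seeded generator, for example `augment(patch, random.Random(0))`, makes the augmentation reproducible, which helps when comparing runs across folds.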

Table 1 Dataset separation considering, approximately, 80%, 10%, 10% ratio, for training, validation, and test of each corresponding k, respectively

The evaluation occurred in two steps for each k of the proposed k-fold cross-validation (Fig. 4). The first was the training and validation step, and the second was the test step, in which typical and additional performance parameters were computed for better interpretation. The training and validation step evaluates the potential of the network to learn the current application; thus, accuracy and loss were computed to validate this step. Once high accuracy and low loss were achieved, the corresponding trained CNN topology was saved and submitted to the test step. The test step corroborates the previously obtained accuracy and measures robustness and the potential to generalize the classification. During the test step, the corresponding image samples were applied to the trained CNN topology to be classified. The result was the corresponding confusion matrix, from which precision, sensitivity, and specificity were computed to interpret possible consequences.


Training and validation step

As can be seen in Fig. 5, across the three k-folds, the training and validation curves converged in terms of accuracy and loss. For training and validation, the accuracy reached about 98%, and the loss varied between about 1% and 2.5%. In addition, the differences between the curves in Fig. 5 indicate no considerable underfitting or overfitting. Considering the proposed context, in which slides from different patients were used for training and validation and two different image sources were used to train the topology, the learning potential of the network is demonstrated. The total training time was estimated at 1200 s using one of our laboratory computers (Intel Core i7 3.50 GHz, NVIDIA GeForce GTX 1060 graphics, 16 GB RAM, 2 TB hard disk).

Fig. 5 Training and validation curves for each of the three cross-validation folds; the blue curve represents training patches and the orange curve represents validation patches. a Accuracy for the first cross-validation. b Loss for the first cross-validation. c Accuracy for the second cross-validation. d Loss for the second cross-validation. e Accuracy for the third cross-validation. f Loss for the third cross-validation

Test step

After the training and validation of each k, the corresponding trained model was subjected to the test step (Fig. 4), using the corresponding test data of each cross-validation fold (Table 1). For each k, the test data had never been seen by the model; hence, the network was blinded to every set of test images to prevent bias. The confusion matrix of each k-fold is presented in Fig. 6a, b, and c; additionally, the composition of all k-fold results is presented in Fig. 6d. From the resulting confusion matrices, we obtained the pertinent metrics of efficacy: in addition to the general accuracy, precision, sensitivity, and specificity were computed considering each class in turn as the target, therefore measuring the potential of the model to separate each class. Table 2 summarizes the findings. Accuracy of around 95% was achieved on the test data (Table 2). Additionally, values ranging from above 80% to close to 98% were achieved for precision, sensitivity, and specificity across the different Gleason classes (Table 2). Considering the blind test applied, the classification potential of the network was evidenced.
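The per-class metrics described above follow directly from a confusion matrix when each class is treated in turn as the target (one-vs-rest). The sketch below assumes rows hold the pathologist's label and columns hold the model's prediction; the function name is hypothetical, and the matrix in the usage note is an invented toy example, not the data of Fig. 6.

```python
def per_class_metrics(cm):
    """Compute one-vs-rest metrics from a square confusion matrix.

    cm[i][j] = number of patches of true class i predicted as class j.
    Returns (list of per-class dicts, overall accuracy).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]                                # target class, correctly found
        fn = sum(cm[k]) - tp                         # target class, missed
        fp = sum(cm[i][k] for i in range(n)) - tp    # other classes, called target
        tn = total - tp - fn - fp                    # other classes, correctly rejected
        metrics.append({
            "precision": tp / (tp + fp),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        })
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return metrics, accuracy
```

For the toy matrix `[[8, 2, 0], [1, 9, 0], [0, 1, 9]]`, the first class has a sensitivity of 0.80 and a specificity of 0.95. The sketch does not guard against a class that is never predicted or never present, which would cause a division by zero.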

Fig. 6 Resulting confusion matrices of the corresponding sets of test patches. a, b, and c Confusion matrices for k = 1, 2, and 3, respectively. d Confusion matrix with the composed result of the three k-folds

Table 2 Precision, Sensitivity, Specificity, and Accuracy for each of the corresponding k

Specifically, the lower performance values for Gleason 3 compared with patterns 4 and 5 (approximately 81% and 87% for precision and sensitivity, respectively) can be explained by the lower number of samples for this class. Considering the way the patches were obtained, Gleason 3 is lower grade, more similar to noncancer tissue, and has lower volume in prostate specimens, resulting in less material for analysis. The consequences of this difference in the number of samples can also be seen in the lower specificity of the other two classes compared with Gleason 3 (Table 2). Considering the clinical relevance of patterns 4 and 5 and the possible non-relevance of pattern 3 findings, this disadvantage may have minimal clinical relevance.


Inter-pathologist grading discordance is known to be a relevant issue in prostate cancer treatment, with numerous clinical consequences. Classical studies show that inter-pathologist concordance varies between 51% and 78%, with a greater effect on the differentiation of patterns 3 and 4 [8,9,10]. Artificial intelligence (AI) is growing in importance among novel technologies in diagnostic procedures, mainly those involving pattern recognition, and is widely used in pathology and radiology [20,21,22].

Multiple recent reports on AI in prostate cancer pathology show the relevance of this theme [13, 14, 23, 24]. In a recent paper from the Karolinska Institute, Ström et al. [23] obtained 96–99% accuracy in benign-malignant differentiation and a Gleason grade concordance kappa of 0.62. Bulten et al. [25] used a dataset of 5759 prostate biopsies and reached a kappa agreement of 91.8% with pathologist reports. Patches were used with sufficient zoom to find fundamental structures, to classify samples as cancer or non-cancer, and to provide GG differentiation. Tolkach et al. [17] separated cancer and non-cancer (stroma, glandular, and non-glandular benign tissue) patches, obtaining more than 1.67 million patches.
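Because several of the studies above report agreement as a kappa statistic rather than as accuracy, a short sketch of the unweighted Cohen's kappa may help in comparing the two. It assumes a square agreement matrix between two raters (for example, pathologist versus model). The cited studies may have used weighted variants, so this is an illustration of the basic statistic only, and the function name is hypothetical.

```python
def cohen_kappa(cm):
    """Unweighted Cohen's kappa from a square agreement matrix.

    cm[i][j] = number of cases rated class i by rater A and class j by rater B.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    # Observed agreement: fraction of cases on the diagonal.
    observed = sum(cm[k][k] for k in range(n)) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(
        sum(cm[k]) * sum(cm[i][k] for i in range(n)) for k in range(n)
    ) / (total * total)
    return (observed - expected) / (1 - expected)
```

For example, the matrix `[[20, 5], [10, 15]]` has 70% raw agreement but a kappa of 0.4, because half of that agreement would be expected by chance.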

The studies above have greatly contributed to the current knowledge of AI application in this field. Nonetheless, investigations and alternative approaches with new topologies must continue to be carried out to complement this knowledge. We have thus developed a topology with tuned parameters regarding the number of filters, kernel sizes, layer sequence, number of hidden layers, activation function, and optimizer. Our proposed investigation and implementation complement and support the studies carried out in different contexts. The promising results, showing the performance in grading G3, G4, and G5, are in line with the literature, hence reinforcing the high potential of AI methods for this classification. Additionally, our model offers an alternative to be used and evolved, contributing to the growing knowledge and evidence in this field.

As a limitation, we note the limited number of samples for the Gleason 3 pattern. However, Fig. 6 demonstrates that the misclassification between classes, in numbers and percentages, is still statistically encouraging. Furthermore, most misclassifications are between neighboring patterns 3 and 4 (only 1 patch of Gleason 3 was classified as Gleason 5), with minimal potential for clinical repercussion. Moreover, since patches represent small portions of a large area, these few misclassifications have minimal significance for the classification of the whole pathological area of the slide.

Future work will focus on gathering additional collaborators and evaluating, in parallel, different promising topologies on the same dataset. With a dataset with wider variance, we will be able to measure the performance differences among topologies. Finally, we will construct a mosaic from classified patches, creating heat map images, to provide a classification of whole digital slides.


Pathology is a cornerstone of the intervention discussion between practitioner and patient in current customized prostate cancer care, which involves novel therapies (active surveillance, focal therapy) [26, 27] and classic radical ones (radiation therapy and radical prostatectomy) [28]. Artificial intelligence has demonstrated great potential in helping pathology pattern recognition with high accuracy. Our proposed CNN model adds evidence to support this potential and provides a new alternative to be used and evolved, following the trend towards clinical usage in daily medical practice, and consequently raising the standards of pattern recognition to optimize clinical decisions and enable the best therapeutic results.



AI: Artificial Intelligence

CNN: Convolutional Neural Network

GG: Gleason grading scale

H&E: Hematoxylin and eosin

Initials of the uropathology specialist’s name

PANDA: The Prostate Cancer Grade Assessment

CAPES: Coordination for the Improvement of Higher Education Personnel

FMUSP: University of São Paulo School of Medicine

UNIFESP: Federal University of São Paulo


  1. Sung H, Ferlay J, Siegel RL, et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin. 2021;71(3):209–49.

  2. Sanda MG, Cadeddu JA, Kirkby E, et al. Clinically Localized Prostate Cancer: AUA/ASTRO/SUO Guideline. Part I: Risk Stratification, Shared Decision Making, and Care Options. J Urol. 2018;199(3):683–90.

  3. Sanda MG, Cadeddu JA, Kirkby E, et al. Clinically Localized Prostate Cancer: AUA/ASTRO/SUO Guideline. Part II: Recommended Approaches and Details of Specific Care Options. J Urol. 2018;199(4):990–7.

  4. Epstein JI, Egevad L, Amin MB, et al. The 2014 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma: Definition of Grading Patterns and Proposal for a New Grading System. Am J Surg Pathol. 2016;40(2):244–52.

  5. Gleason DF, Mellinger GT. Prediction of prognosis for prostatic adenocarcinoma by combined histological grading and clinical staging. J Urol. 1974;111(1):58–64.

  6. Gleason DF, Mellinger GT, Group VACUR. Prediction of prognosis for prostatic adenocarcinoma by combined histological grading and clinical staging. 1974. J Urol. 2002;167(2 Pt 2):953–8 (discussion 959).

  7. Gleason DF, Mellinger GT, Group VACUR. Prediction of Prognosis for Prostatic Adenocarcinoma by Combined Histological Grading and Clinical Staging. J Urol. 2017;197(2S):S134–9.

  8. Ozkan TA, Eruyar AT, Cebeci OO, Memik O, Ozcan L, Kuskonmaz I. Interobserver variability in Gleason histological grading of prostate cancer. Scand J Urol. 2016;50(6):420–4.

  9. Meliti A, Sadimin E, Diolombi M, Khani F, Epstein JI. Accuracy of Grading Gleason Score 7 Prostatic Adenocarcinoma on Needle Biopsy: Influence of Percent Pattern 4 and Other Histological Factors. Prostate. 2017;77(6):681–5.

  10. Sadimin ET, Khani F, Diolombi M, Meliti A, Epstein JI. Interobserver Reproducibility of Percent Gleason Pattern 4 in Prostatic Adenocarcinoma on Prostate Biopsies. Am J Surg Pathol. 2016;40(12):1686–92.

  11. Monica M, Vadladi VK, Karuna G, Sowmya P. Comprehensive study of pathology image analysis using a deep learning algorithm. Mater Today Proc. 2020.

  12. Regnier-Coudert O, McCall J, Lothian R, Lam T, McClinton S, N’Dow J. Machine learning for the improved pathological staging of prostate cancer: A performance comparison on a range of classifiers. Artif Intell Med. 2012;55(1):25–35.

  13. Li Y, Huang M, Zhang Y, et al. Automated Gleason Grading and Gleason Pattern Region Segmentation Based on Deep Learning for Pathological Images of Prostate Cancer. IEEE Access. 2020;8:117714–25.

  14. Nagpal K, Foote D, Liu Y, et al. Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer. NPJ Digit Med. 2019;2(1):48.

  15. Pantanowitz L, Quiroga-Garza GM, Bien L, et al. An artificial intelligence algorithm for prostate cancer diagnosis in whole slide images of core needle biopsies: a blinded clinical validation and deployment study. Lancet Digit Health. 2020;2(8):e407–16.

  16. Mun Y, Paik I, Shin S-J, Kwak T-Y, Chang H. Yet Another Automated Gleason Grading System (YAAGGS) by weakly supervised deep learning. NPJ Digit Med. 2021;4(1):99.

  17. Tolkach Y, Dohmgörgen T, Toma M, Kristiansen G. High-accuracy prostate cancer pathology using deep learning. Nat Mach Intell. 2020;2(7):411–8.

  18. Hayashi Y. New unified insights on deep learning in radiological and pathological images: Beyond quantitative performances to qualitative interpretation. Inf in Med Unlocked. 2020;19:100329.

  19. Kudo MS, de Souza VMG, de Souza AG, et al. The potential of convolutional neural network diagnosing prostate cancer. Res Biomed Eng. 2021;37(1):25–31.

  20. Krizhevsky A, Sutskever I, Hinton GE. ImageNet Classification with Deep Convolutional Neural Networks. Adv Neural Inf Process Syst. 2012;25:1090–8.

  21. Poplin R, Varadarajan AV, Blumer K, et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018;2:158–64.

  22. Cireşan DC, Giusti A, Gambardella LM, Schmidhuber J. Mitosis detection in breast cancer histology images with deep neural networks. Med Image Comput Comput Assist Interv. 2013;16(Pt 2):411–8.

  23. Strom P, Kartasalo K, Olsson H. Artificial intelligence for diagnosis and grading of prostate cancer in biopsies: a population-based, diagnostic study. Lancet Oncol. 2020;21(2):E70–E70.

  24. Arvaniti E, Fricker KS, Moret M, et al. Automated Gleason grading of prostate cancer tissue microarrays via deep learning. Sci Rep. 2018;8(1):12054.

  25. Bulten W, Pinckaers H, van Boven H, et al. Automated deep-learning system for Gleason grading of prostate cancer using biopsies: a diagnostic study. Lancet Oncol. 2020;21(2):233–41.

  26. Carlsson S, Benfante N, Alvim R, et al. Long-Term Outcomes of Active Surveillance for Prostate Cancer: The Memorial Sloan Kettering Cancer Center Experience. J Urol. 2020;203(6):1122–7.

  27. Sivaraman A, Barret E. Focal Therapy for Prostate Cancer: An "À la Carte" Approach. Eur Urol. 2016;69(6):973–5.

  28. Wilt TJ, Ullman KE, Linskens EJ, et al. Therapies for clinically localized prostate cancer: a comparative effectiveness review. J Urol. 2021;205(4):967–76.


Institute of Science and Technology of the Federal University of São Paulo (ICT-UNIFESP), LAPIS – Laboratory of Image and Signal Processing of ICT-UNIFESP. University of Sao Paulo Medical School, Medical Investigation Laboratory number 55 (LIM-55), Urology Discipline of the Surgery Department, the University Ethics Committee that approved this study (approval number: 3.004.858). CAPES—Coordination for the Improvement of Higher Education Personnel, and CNPq—National Council of Scientific and Technological Development. 


CNPq: National Council of Scientific and Technological Development, Brazil.

CAPES: Coordination for the Improvement of Higher Education Personnel.

Author information

Authors and Affiliations



Study concept and design: Souza, Leite, Moraes. Acquisition of data: Kudo, Souza, Estivallet, Amorim, Leite. Analysis and interpretation: Kudo, Souza, Kim, Moraes. Drafting of the manuscript: Kudo, Estivallet, Amorim. Critical revision of the manuscript for important intellectual content: Souza, Moraes. Statistical analysis: Kudo, Estivallet, Amorim, Moraes. Administrative, technical, and material support: Kudo, Estivallet, Amorim, Moraes. Supervision: Kim, Moraes. Other (pathology assessment): Leite. The author(s) read and approved the final manuscript.

Corresponding author

Correspondence to Vinicius Meneguette Gomes de Souza.

Ethics declarations

Ethics approval and consent to participate

FMUSP: 99896918.3.0000.0065

UNIFESP: 8657261119

Competing interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Kudo, M.S., Gomes de Souza, V.M., Estivallet, C.L.N. et al. The value of artificial intelligence for detection and grading of prostate cancer in human prostatectomy specimens: a validation study. Patient Saf Surg 16, 36 (2022).
