Automated Acne Severity Scoring at Nolla

At Nolla Health, we've developed an automated acne severity scoring system using Vision Transformers to provide accurate, quantitative monitoring during treatment in Nolla Acne. This system captures facial features and predicts acne severity on a standardized numerical scale, functioning like a continuous tracking tool for skin health.

Clinical Application

Objective acne severity measurement has significant clinical value. Manual scoring is time-intensive and subject to inter-rater variability, making it difficult to track subtle changes over time. Automated scoring enables consistent progress monitoring, helps clinicians quickly identify treatment response (or lack thereof), and provides patients with tangible feedback during their care journey.

This is particularly valuable in telehealth settings, where providers can review quantitative trends and images asynchronously without requiring subjective in-person assessment.

By standardizing measurement, we can better evaluate treatment effectiveness and make data-driven adjustments to care plans.

Methodology

We consolidated a comprehensive dataset of approximately 90,000 images (and growing), labeled by trained experts using established scoring methods: the Investigator Global Assessment (IGA) and the Global Acne Grading System (GAGS).

The IGA is one of the most widely used clinical scales for acne assessment. The FDA-recommended 5-point scale (0-4) provides a simplified global assessment:

0 = Clear: No inflammatory or non-inflammatory lesions
1 = Almost Clear: Rare lesions, mostly non-inflammatory with no more than one papule
2 = Mild: Some non-inflammatory lesions, few inflammatory lesions
3 = Moderate: Many non-inflammatory lesions, some inflammatory lesions
4 = Severe: Many inflammatory and non-inflammatory lesions, potential nodules

Published studies have reported moderate-to-substantial inter-rater reliability for IGA, with weighted kappa values ranging from 0.52 to 0.75 (Tan, 2008; FDA Guidance Document, 2018). Weighted kappa measures agreement between raters on an ordinal scale, penalizing large disagreements more heavily than adjacent-grade ones; a value of 1 indicates perfect agreement and 0 indicates chance-level agreement. The IGA's simplicity makes it well-suited for interpretable automated assessments.
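As an illustration of how such an agreement statistic can be computed, here is a minimal sketch of Cohen's weighted kappa for two raters on the 0-4 IGA scale. The function name, its parameters, and the choice between linear and quadratic weights are assumptions made for this example; the cited studies may have used a different weighting scheme.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, num_grades=5, power=2):
    """Cohen's weighted kappa for two raters on an ordinal 0..num_grades-1 scale.

    power=1 gives linear weights, power=2 gives quadratic weights
    (the weighting is an assumption; the source does not specify one).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)

    # Observed joint counts and each rater's marginal counts.
    observed = Counter(zip(rater_a, rater_b))
    marginal_a = Counter(rater_a)
    marginal_b = Counter(rater_b)

    max_distance = (num_grades - 1) ** power
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(num_grades):
        for j in range(num_grades):
            # Disagreement weight grows with the distance between grades.
            weight = abs(i - j) ** power / max_distance
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * marginal_a[i] * marginal_b[j] / (n * n)

    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when expected disagreement is zero")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical ratings: the raters disagree by one grade on the last image.
kappa = weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 3])
print(f"weighted kappa = {kappa:.3f}")
```

Identical ratings give a kappa of 1, while systematically opposite ratings drive it negative.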

The Global Acne Grading System (GAGS) takes a different approach, dividing the face, chest, and upper back into six regions (forehead, right cheek, left cheek, nose, chin, chest/back) and assigning scores based on lesion type and location (Doshi et al., 1997):

Area: Each region is weighted by surface area
Oiliness: Each region is weighted by sebum production
Lesion scores: No lesion=0, Comedone=1, Papule=2, Pustule=3, Nodule=4
Regional factors: Forehead=2, Right cheek=2, Left cheek=2, Nose=1, Chin=1, Chest/back=3

The total GAGS score ranges from 0 to 44 and can be normalized to a 0-4 scale for harmonization with IGA.
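The arithmetic behind the 0 to 44 range can be sketched as follows. This example assumes, following Doshi et al. (1997), that each region's score is its regional factor multiplied by the grade of the most severe lesion type found there. The linear rescaling to 0-4 is also an assumption, since the exact normalization is not specified here, and all function and variable names are hypothetical.

```python
# Regional factors and lesion grades as listed above.
REGION_FACTORS = {
    "forehead": 2,
    "right_cheek": 2,
    "left_cheek": 2,
    "nose": 1,
    "chin": 1,
    "chest_back": 3,
}
LESION_GRADES = {"none": 0, "comedone": 1, "papule": 2, "pustule": 3, "nodule": 4}

# Maximum total: sum of factors (11) times the highest lesion grade (4) = 44.
MAX_GAGS = sum(REGION_FACTORS.values()) * max(LESION_GRADES.values())


def gags_total(most_severe_lesion):
    """Sum factor * grade over regions; regions not listed count as 'none'."""
    unknown = set(most_severe_lesion) - set(REGION_FACTORS)
    if unknown:
        raise ValueError(f"unknown regions: {sorted(unknown)}")
    return sum(
        factor * LESION_GRADES[most_severe_lesion.get(region, "none")]
        for region, factor in REGION_FACTORS.items()
    )


def normalize_to_iga_range(total):
    """Linear rescaling from 0..44 to 0..4 (an assumed mapping)."""
    return 4.0 * total / MAX_GAGS


# Hypothetical assessment: most severe lesion type observed in each region.
example = {
    "forehead": "papule",       # 2 * 2 = 4
    "right_cheek": "pustule",   # 2 * 3 = 6
    "left_cheek": "comedone",   # 2 * 1 = 2
    "nose": "comedone",         # 1 * 1 = 1
    "chin": "none",             # 1 * 0 = 0
    "chest_back": "papule",     # 3 * 2 = 6
}
total = gags_total(example)
print(total, round(normalize_to_iga_range(total), 2))
```

A linear rescaling keeps the ordering of scores but does not guarantee that a normalized GAGS value matches the IGA grade a clinician would assign, so any harmonization of this kind should be validated against paired labels.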

Building on Prior Research

To provide optimal acne care, reliable quantitative monitoring tools are essential for tracking baseline severity and treatment progress. We reviewed the previously published semi-automated and automated acne severity scoring methods (approximately 47 distinct models described in the literature) to understand performance gaps and develop next-generation systems. These include various deep learning approaches:

Convolutional Neural Networks for detection, segmentation, regression, and classification
Vision Transformers for detection, segmentation, regression, and classification
Hybrid methods for detection, segmentation, regression, and classification

Meta-analyses and our own investigation identified critical gaps: (1) lack of large-scale, diverse, standardized high-resolution datasets, (2) low model robustness to variations in illumination, positioning, or proximity typical in real-world use, and (3) lack of verifiable and interpretable outputs. We've designed our system to address these limitations.

Our high-resolution, open-device dataset of approximately 90,000 images represents a nearly two-orders-of-magnitude (roughly 100x) increase over previous medical image datasets of comparable quality and scope. We also created a supplementary dataset for facial acne and relevant features, including semantic segmentation labels. Using these datasets, we trained Vision Transformer models to output severity scores (a single regression endpoint) and semantic segmentation masks.

Current Performance

As of October 6, 2025, our acne scoring regression model achieves an R² of 0.935 (93.5%) on a held-out test set when compared against professional raters' scores. Our semantic segmentation models complement this score with detailed masks of facial acne and relevant features. To our knowledge, both models are best-in-class, and they continue to improve as we expand our dataset and incorporate new architectures.
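For readers unfamiliar with the metric, the sketch below shows one common way to compute R², the coefficient of determination, between model predictions and rater scores. It assumes that this definition is the one behind the reported figure (rather than, say, a squared Pearson correlation). The function name and sample values are hypothetical and are not drawn from our test set.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean rater score.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true scores are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical rater scores and model predictions on a 0-4 scale.
rater_scores = [0, 1, 2, 3, 4]
model_scores = [0.2, 0.9, 2.1, 3.3, 3.8]
print(f"R^2 = {r_squared(rater_scores, model_scores):.3f}")
```

Under this definition, a value of 0.935 means the model's predictions leave 6.5% of the variance in rater scores unexplained, relative to always predicting the mean score.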

References

J. K. Tan, Current measures for the evaluation of acne severity. Expert Review of Dermatology 3, 595–603 (2008).
U.S. Food and Drug Administration, Acne Vulgaris: Establishing Effectiveness of Drugs Intended for Treatment. Guidance for Industry (2018).
A. Doshi, A. Zaheer, M. J. Stiller, A comparison of current acne grading systems and proposal of a novel system. International Journal of Dermatology 36, 416–418 (1997).
D. O. Traini, G. Palmisano, C. Guerriero, K. Peris, Artificial Intelligence in the Assessment and Grading of Acne Vulgaris: A Systematic Review. Journal of Personalized Medicine 15, 238 (2025).