Expert-Level Detection of Epilepsy Markers in EEG on Short and Long Timescales

  • Li, Jun B.Eng.
  • Goldenholz, Daniel M. M.D., Ph.D.
  • Alkofer, Moritz M.Sc.
  • Sun, Chenxi Ph.D.
  • Nascimento, Fabio A. M.D.
  • Halford, Jonathan J. M.D.
  • Dean, Brian C. Ph.D.
  • Galanti, Mattia M.Sc.
  • Struck, Aaron F. M.D.
  • Greenblatt, Adam S. M.D.
  • Lam, Alice D. M.D., Ph.D.
  • Herlopian, Aline M.D.
  • Nwankwo, Chinasa M.D.
  • Weber, Dan D.O.
  • Maus, Douglas M.D., Ph.D.
  • Haider, Hiba A. M.D.
  • Karakis, Ioannis M.D., Ph.D.
  • Yoo, Ji Yeoun M.D.
  • Ng, Marcus C. M.D.
  • Selioutski, Olga D.O.
  • Taraschenko, Olga M.D., Ph.D.
  • Osman, Gamaleldin M.D.
  • Katyal, Roohi M.B.B.S.
  • Schmitt, Sarah E. M.D.
  • Benbadis, Selim M.D.
  • Cash, Sydney S. M.D., Ph.D.
  • Tatum, William O. D.O.
  • Sheikh, Zubeda M.D.
  • Kong, Wan Yee M.D., M.Sc.
  • Bayas, Grace B.Sc.
  • Turley, Niels B.Sc.
  • Hong, Shenda Ph.D.
  • Westover, M. Brandon M.D., Ph.D.
  • Jing, Jin Ph.D.
NEJM AI 2(7), July 2025. | DOI: 10.1056/AIoa2401221

Abstract

Background

Epileptiform discharges, or spikes, in electroencephalogram (EEG) recordings are essential for diagnosing epilepsy and localizing seizure origins. Artificial intelligence (AI) offers a promising approach to automating their detection, but current models are often hindered by artifact-related false positives and restricted to either event-level or EEG-level classification, which limits their clinical utility.

Methods

We developed SpikeNet2, a deep-learning model based on a residual network architecture, and enhanced it with hard-negative mining to reduce false positives. Our study analyzed 17,812 EEG recordings from 13,523 patients across multiple institutions, including Massachusetts General Brigham (MGB) hospitals. Data from the Human Epilepsy Project (HEP) and SCORE-AI (SAI) were also included. A total of 32,433 event-level samples, labeled by experts, were used for training and evaluation. Performance was assessed using the area under the receiver operating characteristic curve (AUROC), the area under the precision–recall curve (AUPRC), calibration error, and a modified area under the curve (mAUC) metric. The model’s generalizability was evaluated using external datasets.
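As an illustration of two of the metrics named above, the following is a minimal sketch of how AUROC and a calibration error could be computed from binary labels and model scores. It is not the study's evaluation code. The function names are hypothetical, the AUROC uses the standard rank-based (Mann-Whitney) formulation, and the calibration measure shown is the common binned expected calibration error, which is an assumption because the abstract does not state which calibration error was used. The mAUC metric is not defined in the abstract and is not sketched here.

```python
def auroc(labels, scores):
    """Rank-based AUROC (Mann-Whitney U); tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the group of items tied with pairs[i].
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def expected_calibration_error(labels, probs, n_bins=10):
    """Binned ECE (assumed definition): weighted mean of
    |observed positive rate - mean predicted probability| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    total = len(labels)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        frac_pos = sum(y for y, _ in b) / len(b)
        mean_prob = sum(p for _, p in b) / len(b)
        ece += len(b) / total * abs(frac_pos - mean_prob)
    return ece


# Toy data for illustration only; these are not values from the study.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)
toy_ece = expected_calibration_error(toy_labels, toy_scores)
```

In this sketch a score is treated as the predicted probability that a sample contains a spike, so the same values feed both functions. Discrimination (AUROC) and calibration are separate properties: a model can rank samples well while still producing poorly calibrated probabilities.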

Results

SpikeNet2 demonstrated strong performance in event-level spike detection, achieving an AUROC of 0.973 and an AUPRC of 0.995, with 44% of experts surpassing the model on the MGB dataset. In external validation, the model achieved an AUROC of 0.942 and an AUPRC of 0.948 on the HEP dataset. For EEG-level classification, SpikeNet2 recorded an AUROC of 0.958 and an AUPRC of 0.959 on the MGB dataset, an AUROC of 0.888 and an AUPRC of 0.823 on the HEP dataset, and an AUROC of 0.995 and an AUPRC of 0.991 on the SAI dataset, with 32% of experts outperforming the model. The false-positive rate was reduced to an average of nine spikes per hour.

Conclusions

SpikeNet2 offers expert-level accuracy in both event-level spike detection and EEG-level classification, while significantly reducing false positives. Its dual functionality and robust performance across diverse datasets make it a promising tool for clinical and telemedicine applications, particularly in resource-limited settings. (Funded by the National Institutes of Health and others.)

Copyright © 2025 Massachusetts Medical Society. All rights reserved.