FAS Research Computing - Slurm down – Incident details

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE


Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
Documentation: https://docs.rc.fas.harvard.edu | Account Portal: https://portal.rc.fas.harvard.edu
Email: rchelp@rc.fas.harvard.edu | Support Hours



Slurm down

Resolved
Major outage
Started 2 days ago. Lasted about 2 hours.

Affected

Cannon Cluster

Major outage from 12:11 AM to 2:20 AM

SLURM Scheduler - Cannon

Major outage from 12:11 AM to 2:20 AM

Cannon Compute Cluster (Holyoke)

Major outage from 12:11 AM to 2:20 AM

Boston Compute Nodes

Major outage from 12:11 AM to 2:20 AM

GPU nodes (Holyoke)

Major outage from 12:11 AM to 2:20 AM

seas_compute

Major outage from 12:11 AM to 2:20 AM

Updates
  • Update
    This incident has been resolved.
  • Resolved
    The rogue job has been found and removed. The scheduler is running normally again and all partitions are open.
  • Investigating

    The Slurm scheduler is currently down and no new jobs can be scheduled.

    We are currently investigating this incident and will provide updates.