FAS Research Computing - Notice history

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
Documentation: https://docs.rc.fas.harvard.edu | Account Portal https://portal.rc.fas.harvard.edu
Email: rchelp@rc.fas.harvard.edu | Support Hours


The colors shown in the bars below were chosen to increase visibility for color-blind visitors.
For higher contrast, switch to light mode at the bottom of this page if the background is dark and colors are muted.

SLURM Scheduler - Cannon - Operational

Cannon Compute Cluster (Holyoke) - Operational

Boston Compute Nodes - Operational

GPU nodes (Holyoke) - Operational

seas_compute - Operational

SLURM Scheduler - FASSE - Operational

FASSE Compute Cluster (Holyoke) - Operational

Kempner Cluster CPU - Operational

Kempner Cluster GPU - Operational

FASSE login nodes - Operational

Cannon Open OnDemand/VDI - Operational

FASSE Open OnDemand/VDI - Operational

Netscratch (Global Scratch) - Operational

Home Directory Storage - Boston - Operational

Tape (Tier 3) - Operational

Holylabs - Operational

Isilon Storage Holyoke (Tier 1) - Operational

Holystore01 (Tier 0) - Operational

HolyLFS04 (Tier 0) - Operational

HolyLFS05 (Tier 0) - Operational

HolyLFS06 (Tier 0) - Operational

Holyoke Tier 2 NFS (new) - Operational

Holyoke Specialty Storage - Operational

holECS - Operational

Isilon Storage Boston (Tier 1) - Operational

BosLFS02 (Tier 0) - Operational

Boston Tier 2 NFS (new) - Operational

CEPH Storage Boston (Tier 2) - Operational

Boston Specialty Storage - Operational

bosECS - Operational

Samba Cluster - Operational

Globus Data Transfer - Operational

Notice history

May 2026

Monthly maintenance May 4th 2026 9am-1pm
Scheduled for May 04, 2026 at 9:00 AM – 1:00 PM
  • Planned
    May 04, 2026 at 1:00 PM

    FASRC monthly maintenance will take place on May 4th 2026. Our maintenance tasks should be completed between 9am and 1pm.

    NOTICES:

    • Annual data center power downtime: The annual downtime at MGHPCC will take place June 15 - June 18. This year's downtime will be one day longer than usual. More details will be sent to all users next month.

    • Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at https://www.rc.fas.harvard.edu/upcoming-training/

    • Status Page: You can subscribe to our status page to receive notifications of maintenance, incidents, and their resolution at https://status.rc.fas.harvard.edu/ (click Get Updates for options).

    MAINTENANCE TASKS

    Cannon cluster will be paused during this maintenance: YES
    FASSE cluster will be paused during this maintenance: YES

    • Slurm 25.11.5 Upgrade

      • Audience: All cluster users

      • Impact: Jobs will be paused during the upgrade

    • Reboot remaining stuck nodes from power outage

      • Audience: N/A

      • Impact: No visible impact to user

    • Two-Factor/OpenAuth (two-factor.rc.fas.harvard.edu) replacement

      • Audience: All account holders

      • Impact: The server will be unavailable during maintenance. You will be unable to obtain a new or replacement OpenAuth token during this period.

    • Domain controller replacement

      • Audience: Internal

      • Impact: End users should not see any impact

    • OOD/Open OnDemand reboots

      • Audience: All OOD users; the OOD head nodes will be rebooted

      • Impact: Running sessions will not be affected

    • Login node reboots

      • Audience: All login node users

      • Impact: Login nodes will reboot during the maintenance window

    • Netscratch 90-day retention cleanup

      • Audience: All netscratch users

      • Impact: Files older than 90 days will be removed per our scratch policy. Please note that this cleanup can happen at any time, not just during maintenance.
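
    For reference, one way to see which of your files fall outside the 90-day window is a standard `find` invocation. This is a minimal sketch, not an official FASRC tool; it is demonstrated on a temporary directory, and on the cluster you would point `SCRATCH` at your lab's actual netscratch path (the path is an assumption).

```shell
# Minimal sketch: list files not modified in the last 90 days -- the
# candidates for removal under the scratch retention policy.
# SCRATCH here is a throwaway temp directory for demonstration; on the
# cluster, set it to your lab's netscratch directory instead.
SCRATCH=$(mktemp -d)
touch -d "120 days ago" "$SCRATCH/old_results.dat"   # outside the 90-day window
touch "$SCRATCH/fresh_results.dat"                   # recently modified
find "$SCRATCH" -type f -mtime +90 -print            # prints only old_results.dat
```

    Note that `-mtime +90` matches on modification time, which is what the cleanup policy keys on; reading a file does not reset it.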

    Thank you,
    FAS Research Computing
    https://docs.rc.fas.harvard.edu/
    https://www.rc.fas.harvard.edu/

Apr 2026

Login and OOD node access restricted due to serious security issue - No ETA
  • Resolved

    The cluster has been rebooted and all nodes, including login and OOD, have been patched.

    The scheduler is re-opened and jobs which were preempted/requeued have priority for re-scheduling.

    Some non-standard, lab-owned nodes may still require patching. The owners of these machines may be contacted about this.

    Thank you for your patience. This is a global issue and is being addressed at centers everywhere.

  • Update

    To mitigate this exploit we will need to restart ALL nodes on the cluster.

    This will begin at 1PM and run until all nodes have restarted (no ETA).

    This will mean any un-finished jobs will be terminated. There is no way to avoid this.

    We will then be validating the fix before re-opening the login nodes, OOD nodes, and scheduler.

    Next steps and updates will be posted here.

  • Update

    We are developing a plan of attack to mitigate this exploit. This is a very serious issue and we are treating it as such. Thank you for your understanding.

    We are currently awaiting further information from the Red Hat/Fedora/Rocky community, but are building a plan in the meantime with the information we have. More details to follow as we can share them.

    If you need to access storage (except scratch and home directories), Globus is still online and available. But again, login nodes and OOD are not available.

  • Identified

    Due to a serious in-the-wild exploit which can compromise Fedora-based Linux distributions including Rocky, which is used on the cluster, we need to restrict access. All login and OOD nodes are shut down until a fix can be put in place. Jobs running on the cluster will continue running.

    No ETA. There is no fix at this time. We will update our status page in the morning once we have more information or a fix to roll out.

    This is a serious exploit and we do not take this measure lightly. Please follow this status page for updates and eventual resolution.

Website security maintenance (www.rc and docs.rc) 4-28-26 1pm
  • Completed
    April 28, 2026 at 5:16 PM

    Website maintenance has completed successfully.

  • In progress
    April 28, 2026 at 5:00 PM

    Maintenance is now in progress.

  • Planned
    April 28, 2026 at 5:00 PM

    Security updates are required for www.rc.fas.harvard.edu and docs.rc.fas.harvard.edu.
    This work will take place today between 1pm and 2pm.
    Both sites will be down for very short periods during the updates.

Mar 2026

Scheduler is degraded
  • Resolved

    This incident has been resolved. The scheduler is running normally.

  • Investigating

    The scheduler is in a degraded state due to thrashing.
    We are actively working to resolve this problem.

Network issues - Cluster degraded
  • Resolved

    This incident has been resolved by draining and rebooting any nodes with stuck mounts.

  • Monitoring

    At this time we are unaware of any holy-isilon problems other than the effect this had on cluster nodes/running jobs. We will update should we identify any data storage concerns.

  • Identified

    Mounts to Holyoke Isilon (specifically /n/sw) are broken on numerous nodes across the cluster. We have a check rolling out to find these nodes so we can remediate them individually. Until remediated the cluster will be in a degraded state. Running jobs may randomly die or fail as they hit nodes that have stale mounts.

    It will be risky to run jobs for the next hour and then, after that point, the cluster will have a large number of nodes closed waiting for them to drain so we can reboot them and fix the mounts.

  • Investigating

    A network issue affecting storage critical to the cluster is causing instability. The cluster is currently in a degraded state as a result. We are looking into the problem. Updates to follow.
