FAS Research Computing - Notice history

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu | Email: rchelp@rc.fas.harvard.edu




SLURM Scheduler - Cannon - Operational

Cannon Compute Cluster (Holyoke) - Operational

Boston Compute Nodes - Operational

GPU nodes (Holyoke) - Operational

seas_compute - Operational


SLURM Scheduler - FASSE - Operational

FASSE Compute Cluster (Holyoke) - Operational


Kempner Cluster CPU - Operational

Kempner Cluster GPU - Operational


Login Nodes - Boston - Operational

Login Nodes - Holyoke - Operational

FASSE login nodes - Operational


Cannon Open OnDemand/VDI - Operational

FASSE Open OnDemand/VDI - Operational


Netscratch (Global Scratch) - Operational

Home Directory Storage - Boston - Operational

Tape - (Tier 3) - Operational

Holylabs - Operational

Isilon Storage Holyoke (Tier 1) - Operational

Holystore01 (Tier 0) - Operational

HolyLFS04 (Tier 0) - Operational

HolyLFS05 (Tier 0) - Operational

HolyLFS06 (Tier 0) - Operational

Holyoke Tier 2 NFS (new) - Operational

Holyoke Specialty Storage - Operational

holECS - Operational

Isilon Storage Boston (Tier 1) - Operational

BosLFS02 (Tier 0) - Operational

Boston Tier 2 NFS (new) - Operational

CEPH Storage Boston (Tier 2) - Operational

Boston Specialty Storage - Operational

bosECS - Operational

Samba Cluster - Operational

Globus Data Transfer - Operational

Notice history

May 2025

MGHPCC power work 5/21 - 5/23 - Some partitions will be at half capacity
Scheduled for May 21, 2025 at 11:00 AM – May 23, 2025 at 7:00 PM (2 days)
  • Planned
    May 21, 2025 at 11:00 AM

    The MGHPCC Holyoke data center will be performing power work on May 21st-23rd. This work will take out one half (or one 'side') of the power capacity for certain rows/racks, including our compute rows. Because of our power draw, one side is not enough power to keep each full rack running.

    As such, we will be adding a reservation to idle half of the nodes in the partitions listed below. The reservation will cause those nodes to drain as running jobs complete and will prevent new jobs from being scheduled on them if those jobs cannot finish before the outage. This allows us to idle and power down the nodes prior to the work and avoid a potential blackout/brownout on those racks.

    This will mean that these partitions will be up and available, but that half the nodes from each will be down (assuming an even number of nodes).

    This work is part of an ongoing power capacity upgrade at MGHPCC. We expect this will be the last power work needed; the facility will then provide enough additional power for future expansion, as well as adding overhead for the current load.
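
    For those who want to see how an affected partition is behaving once the reservation is in place, here is a minimal sketch, assuming the standard Slurm client tools (sinfo, scontrol) are available on a login node; the partition name used is just one example from the list below:

      # check_power_work_drain.py - rough sketch, not an FASRC-provided tool.
      # Shows the node-state summary and active reservations for one partition
      # ahead of the 5/21-5/23 power work. Assumes Slurm client tools are on PATH.
      import subprocess

      partition = "seas_compute"  # example; substitute your own partition

      # Drained/draining nodes will show up in the state summary.
      print(subprocess.run(["sinfo", "-p", partition, "--summarize"],
                           capture_output=True, text=True, check=True).stdout)

      # A maintenance reservation covering these nodes is what causes them
      # to drain as running jobs finish.
      print(subprocess.run(["scontrol", "show", "reservation"],
                           capture_output=True, text=True, check=True).stdout)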

    The affected partitions are:

    • arguelles_delgado

    • bigmem_intermediate

    • blackhole_gpu

    • eddy

    • gershman

    • hejazi

    • hernquist

    • hoekstra

    • huce_ice

    • iaifi_gpu

    • iaifi_gpu_requeue

    • iaifi_priority

    • jshapiro

    • jshapiro_priority

    • kempner

    • kempner_requeue

    • kempner_h100

    • kempner_h100_priority

    • kempner_h100_priority2

    • kovac

    • kozinsky

    • kozinsky_gpu

    • kozinsky_requeue

    • ortegahernandez_ice

    • rivas

    • seas_compute

    • seas_gpu

    • siag_combo

    • siag_gpu

    • sur

    • zhuang

FASRC monthly maintenance Monday May 5th, 2025 from 9am-1pm
Scheduled for May 05, 2025 at 1:00 PM – 5:00 PM (about 4 hours)
  • Planned
    May 05, 2025 at 1:00 PM

    FASRC monthly maintenance will take place Monday May 5th, 2025 from 9am-1pm

    NOTICES

    • MGHPCC power work May 21-23 - Some partitions will be at half capacity. Details on our Status Page

    • Annual Holyoke/MGHPCC power downtime will take place June 2-4. Details on our website and status page.

    MAINTENANCE TASKS
    Cannon cluster will be paused during this maintenance?: YES
    FASSE cluster will be paused during this maintenance?: YES

    • Slurm Upgrade to 24.11.4

      • Audience: All cluster users

      • Impact: Jobs and the scheduler will be paused during this upgrade

    • Infiniband subnet expansion

      • Audience: Most FASRC compute and storage resources

      • Impact: Brief network interruptions, but the scheduler will already be paused

    • OOD node reboots

      • Audience: OOD (Open OnDemand/VDI) users

      • Impact: OOD nodes will be rebooted during this maintenance window

    • Login node reboots

      • Audience: Anyone logged into a FASRC Cannon or FASSE login node

      • Impact: All login nodes will be rebooted at the end of this maintenance window.

    • Netscratch cleanup ( https://docs.rc.fas.harvard.edu/kb/policy-scratch/ )

      • Audience: Cluster users

      • Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.
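
    For reference, here is a rough sketch of how you might list your own files that fall under the 90-day policy before cleanup runs (the path below is a placeholder; substitute your group's netscratch directory):

      # list_old_netscratch_files.py - rough sketch, not an FASRC-provided tool.
      # Prints files not modified within the last 90 days under a directory,
      # mirroring the retention policy described above.
      import time
      from pathlib import Path

      scratch = Path("/n/netscratch/your_lab/your_user")  # placeholder path
      cutoff = time.time() - 90 * 24 * 3600               # 90 days ago

      for path in scratch.rglob("*"):
          try:
              if path.is_file() and path.stat().st_mtime < cutoff:
                  print(path)
          except OSError:
              pass  # skip entries that disappear or cannot be stat'd mid-scan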

    Thank you,
    FAS Research Computing
    https://docs.rc.fas.harvard.edu/
    https://www.rc.fas.harvard.edu/

Apr 2025

Login nodes temporarily down
  • Resolved

    Cannon boslogin and FASSE login nodes are back up and operational.

    All holylogin nodes are still down for repair, please see our posted incident for more updates: https://status.rc.fas.harvard.edu/cm97gyay90013dturk7fxg5pb

    We apologize for the unexpected disruption.

  • Investigating

    Due to a configuration error, all cluster login nodes are rebooting and are temporarily unavailable. Please save any work immediately.

holylogin[05-08] down
  • Resolved
    Hardware has been repaired and holyoke login nodes are back online. Thanks for your patience.
  • Monitoring

    Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.

    • holyoke login nodes (holylogin05-08) are down for hardware repair

    • Only Boston login nodes are available (i.e., boslogin[05-08])

    If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.

    As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.

    If you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or (once they are back in service) holylogin.rc.fas.harvard.edu (Holyoke).

    See also: Command line access with Terminal (login nodes) – FASRC DOCS

  • Resolved

    This incident was posted by mistake.

    holylogin01-04 were replaced by holylogin05-08 some time back.

    As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.

    Or if you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or holylogin.rc.fas.harvard.edu (Holyoke).

    See also: Command line access with Terminal (login nodes) – FASRC DOCS

  • Investigating

    Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.

    Audience:

    • All cluster users

    Impact:

    • All holylogin servers will be down until further notice

    • Only Boston login nodes are available (i.e., boslogin[05-08])

    If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node (a quick lookup sketch follows below).

    Updates to follow as we have them.
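
    For the curious, here is a quick way to see which addresses the shared login hostnames currently resolve to from your machine (a generic DNS lookup sketch, not an FASRC-specific tool):

      # resolve_login_nodes.py - sketch using standard DNS lookups only.
      import socket

      for name in ("login.rc.fas.harvard.edu", "boslogin.rc.fas.harvard.edu"):
          infos = socket.getaddrinfo(name, 22, proto=socket.IPPROTO_TCP)
          addrs = sorted({info[4][0] for info in infos})
          print(f"{name} -> {', '.join(addrs)}")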

Mar 2025

FASRC monthly maintenance - Monday March 3rd, 2025 from 9am-1pm
  • Update
    March 03, 2025 at 6:00 PM
    Maintenance has completed successfully
  • Completed
    March 03, 2025 at 6:00 PM
    Maintenance has completed successfully
  • In progress
    March 03, 2025 at 2:00 PM
    Maintenance is now in progress
  • Planned
    March 03, 2025 at 2:00 PM

    PLEASE NOTE - New time window going forward - 9am-1pm

    FASRC monthly maintenance will take place Monday March 3rd, 2025 from 9am-1pm

    NOTICES

    • Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at https://www.rc.fas.harvard.edu/upcoming-training/

    • Status Page: You can subscribe to our status page to receive notifications of maintenance, incidents, and their resolution at https://status.rc.fas.harvard.edu/ (click Get Updates for options).

    • Upcoming holidays: Memorial Day - Monday, May 26


    MAINTENANCE TASKS
    Cannon cluster will be paused during this maintenance?: YES
    FASSE cluster will be paused during this maintenance?: YES

    • Slurm Upgrade to 24.11.2 - Crucial Update

      • Audience: All cluster users

      • Impact: Jobs and the scheduler will be paused during this upgrade (a quick version check is sketched after this task list)

    • Open OnDemand (OOD) reboots

      • Audience: All OOD users

      • Impact: All Open OnDemand (aka OOD/VDI/RCOOD) nodes will be rebooted

    • Login node reboots

      • Audience: Anyone logged into a FASRC Cannon or FASSE login node

      • Impact: Login nodes will be rebooted during this maintenance window

    • bos-Isilon firmware updates

      • Audience: bos-isilon users

      • Impact: No noticeable impact for storage users

    • Netscratch retention/cleanup ( https://docs.rc.fas.harvard.edu/kb/policy-scratch/ )

      • Audience: Cluster users

      • Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.
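
    As a follow-up to the Slurm upgrade task above, here is a minimal sketch for confirming the Slurm version after maintenance (assumes the Slurm client tools are on your PATH on a login node):

      # check_slurm_version.py - rough sketch, not an FASRC-provided tool.
      import subprocess

      # Client tool version (should report 24.11.x after the upgrade).
      print(subprocess.run(["sinfo", "--version"],
                           capture_output=True, text=True, check=True).stdout.strip())

      # Controller-side version, as reported in the scheduler configuration.
      config = subprocess.run(["scontrol", "show", "config"],
                              capture_output=True, text=True, check=True).stdout
      print([line for line in config.splitlines() if "SLURM_VERSION" in line])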

    Thank you,
    FAS Research Computing
    https://www.rc.fas.harvard.edu/
    https://docs.rc.fas.harvard.edu/
