Notice history

Operational

Jun 2025

Resolved
June 20, 2025 at 6:27 PM
Resolved
June 20, 2025 at 6:27 PM
Coldfront is back up. This incident has been resolved.
Investigating
June 20, 2025 at 4:58 PM
Investigating
June 20, 2025 at 4:58 PM
Coldfront may be inaccessible. We are currently investigating this incident.

Resolved
June 16, 2025 at 8:26 PM
Resolved
June 16, 2025 at 8:26 PM
Allowed one week for the message to propagate. Closing this incident.
Identified
June 09, 2025 at 2:34 PM
Identified
June 09, 2025 at 2:34 PM
While attempting to correct the over-quota/extra data issue on holylabs, an error in the sync command caused the deletion of newly created files since the re-open of the cluster (6/5/25 9AM) for 54 lab directories. We see no evidence that any other lab directories were affected.

Due to the large nature of the original cleanup and the error being discovered after the fact, regretfully these deleted files cannot be recovered.
A list follows of affected /n/holylabs lab directories. If your lab is not on that list, then it is not identified as being affected but this error:
acc_lab
alvarez_lab
avillar_lab
barnett_lab
bertoldi_lab
bol_lab
brenner_lab
cgolden_lab
charbonneau_lab
charrison_lab
chetty_lab
cnelya_lab
dam_lab
doshi-velez_lab
eisenstein_lab
enos_lab
eps_preceptors
glassman_lab
hanson_lab
hekstra_lab
holbrook_lab
iaifi_lab
idreos_lab
iebecker_lab
imai_lab
jacobsen_lab
jialiu_lab
junweil_lab
kaxiras_lab
kdbrantley_lab
kempner_dev
king_lab
kiyoul_lab
konkle_lab
koumoutsakos_lab
kozinsky_lab
kramer_lab
maustern_lab
nliu_lab
pallais_lab
park_lab
pierce_lab
protopapas_lab
pslade_lab
shro_lab
sitanc_lab
smousavih_lab
sneel_lab
snyder_lab
sompolinsky_lab
tamano_lab
ylei_lab
zickler_lab
Investigating
June 09, 2025 at 2:07 PM
Investigating
June 09, 2025 at 2:07 PM
We are currently investigating an issue on holylabs where some labs have noticed newly created files are missing.

We will update this incident with more info as soon as possible.

Resolved
June 06, 2025 at 3:50 PM
Resolved
June 06, 2025 at 3:50 PM
During the downtime holylabs was migrated to our new Vast filesystem. During this migration data was synced from from the old system to the new and then a final sync was done at powerup while the systems were not available for use by end users. We have since found an issue for some labs that will present itself in one of two ways (or both, for some): 1 - You are able to use your holylabs lab directory but you notice files that were previously deleted 2- You cannot use your holylabs lab directory because of over quota errors Our storage team is working to resolve this issue by temporarily suspending quotas on holylabs to prevent over-quota errors and also to remove the data that should not have remained after the final sync(s) so as to get lab directories back below quota. This work has already begun but will may take several days to complete. Updates will be posted here.

June 5-6 MGHPCC pod 7c cooling updates - See partition list below

Completed
June 05, 2025 at 7:03 PM
Completed
June 05, 2025 at 7:03 PM
The work on row 7c is complete. Returning idled nodes to normal service.
In progress
June 05, 2025 at 11:00 AM
In progress
June 05, 2025 at 11:00 AM
Maintenance is now in progress
Planned
June 05, 2025 at 11:00 AM
Planned
June 05, 2025 at 11:00 AM
There will be additional scheduled maintenance at MGHPCC between June 5th and 6th.
As part of the work during the MGHPCC Outage, one of the Cooling Distribution Unit (CDU) in Pod 7c will be replaced. This will allow for future expansion into this space.
This work will run from Thursday Jun 5th until the evening of Friday June 6th. This means nodes whose names begin with holy7c02, 04, 06, 08, 10, 12 will not come back online after the outage and will remain down until this CDU update is complete.
This impacts the following partitions. If you are using one of those partitions please use the public sapphire partition while your equipment is being serviced. These nodes will be returned to service once the CDU work is complete:
- blackhole
- blackhole_priority
- davies
- desai
- eddy
- huce_cascade
- huce_cascade_priority
- huttenhower
- jacobsen2
- janson
- janson_cascade
- ke
- lukin
- nguyen
- seas_compute
- shared
- tambe
- vishwanath
- whipple
- xlin

2025 MGHPCC power downtime June 2-4, 2025

Completed
June 05, 2025 at 1:00 PM
Completed
June 05, 2025 at 1:00 PM
Maintenance has completed successfully
In progress
June 02, 2025 at 1:00 PM
In progress
June 02, 2025 at 1:00 PM
Maintenance is now in progress
Planned
June 02, 2025 at 1:00 PM
Planned
June 02, 2025 at 1:00 PM
The yearly power downtime at our Holyoke data center, MGHPCC, has been scheduled.
This year's power downtime will take place on Tuesday June 3, 2025.
This will require FASRC to begin shutdown of our systems beginning at 9AM on Monday, June 2nd.
We have worked to reduce the total outage time this year.
We will begin power-up on Wednesday June 4th with an expected return to full service by 9AM Thursday June 5th.
- Monday June 2nd - Power-down begins at 9AM
- Tuesday June 3rd - Power out at MGHPCC
- Wednesday June 4th - Maintenance tasks and then power-up begins
- Thursday June 5th - Expected return to full service by 9AM
Maintenance:
During this downtime, Holylabs (/n/holylabs) will move to new hardware.
Starfish, Coldfront, and the Portal will be unavailable during the downtime.
For more details including a graphical timeline, please see: https://www.rc.fas.harvard.edu/events/2025-mghpcc-power-downtime/
Updates will be posted here on our status page: https://status.rc.fas.harvard.edu/
Note that you can subscribe to receive updates as they happen. On the status page, click Get Updates.
Notices and reminders will also be sent to all users via our mailing lists.

May 2025

Resolved
May 30, 2025 at 7:59 PM
Resolved
May 30, 2025 at 7:59 PM
This incident has been resolved.
Investigating
May 30, 2025 at 4:33 PM
Investigating
May 30, 2025 at 4:33 PM
Coldfront is undergoing maintenance and may experience service disruption in the course of this process
ETA for resolution is end of business on Friday.

Starfish upgrade Thursday, May 29th from 5PM-6PM

Completed
May 30, 2025 at 2:27 PM
Completed
May 30, 2025 at 2:27 PM
Maintenance has completed successfully.
Planned
May 29, 2025 at 9:00 PM
Planned
May 29, 2025 at 9:00 PM
Starfish upgrade Thursday, May 29th from 5PM-6PM. Starfish will be unavailable during that time

Resolved
May 27, 2025 at 11:51 PM
Resolved
May 27, 2025 at 11:51 PM
maintenance complete
Identified
May 27, 2025 at 6:28 PM
Identified
May 27, 2025 at 6:28 PM
Coldfront is in need of database maintenance. Expect return to service by 5PM.

MGHPCC power work 5/21 - 5/23 - Some partitions will be at half capacity

Completed
May 23, 2025 at 7:00 PM
Completed
May 23, 2025 at 7:00 PM
Maintenance has completed successfully
In progress
May 21, 2025 at 11:00 AM
In progress
May 21, 2025 at 11:00 AM
Maintenance is now in progress
Planned
May 21, 2025 at 11:00 AM
Planned
May 21, 2025 at 11:00 AM
The MGHPCC Holyoke data center will be performing power work on May 21st -23rd. This work will take out one half (or one 'side') of the power capacity for certain rows/racks including our compute rows. Because of our power draw, one side is not enough power to keep each full rack running.
As such, we will be adding a reservation to idle half the nodes in the partitions listed below. A reservation will cause nodes to drain as jobs complete and stop scheduling new jobs on those nodes if they cannot be completed before the outage. This will allow us to idle and power down those nodes prior to the work and avoid potential blackout/brownout on those racks.
This will mean that these partitions will be up and available, but that half the nodes from each will be down (assuming an even number of nodes).
This work is part of an on-going power capacity upgrade at MGHPCC. We expect this will be the last power work needed and the facility will then provide enough additional power for future expansion as well adding overhead for the current load.
The affected partitions are:
- arguelles_delgado
- bigmem_intermediate
- blackhole_gpu
- eddy gershman
- hejazi
- hernquist
- hoekstra
- huce_ice
- iaifi_gpu
- iaifi_gpu_requeue
- iaifi_priority
- jshapiro
- jshapiro_priority
- kempner
- kempner_requeue
- kempner_h100
- kempner_h100_priority
- kempner_h100_priority2
- kovac kozinsky
- kozinsky_gpu
- kozinsky_requeue
- ortegahernandez_ice
- rivas
- seas_compute
- seas_gpu
- siag_combo
- siag_gpu
- sur
- zhuang

Resolved
May 06, 2025 at 2:30 PM
Resolved
May 06, 2025 at 2:30 PM
We have reverted the changes to the Jupyter OOD app while we work on further development.
This incident is resolved.
Identified
May 06, 2025 at 1:56 PM
Identified
May 06, 2025 at 1:56 PM
The Jupyter app for OOD is failing to start for certain workflows that load older versions of python or modules. We are in the process of reverting Jupyter to fix the bug.

Apr 2025

Resolved
April 28, 2025 at 8:36 PM
Resolved
April 28, 2025 at 8:36 PM
Starfish is once again accepting logins.
Investigating
April 28, 2025 at 2:25 PM
Investigating
April 28, 2025 at 2:25 PM
Starfish is not allowing login due to a database issue. We have contacted the vendor and expect to resolve this soon.

Resolved
April 27, 2025 at 12:41 PM
Resolved
April 27, 2025 at 12:41 PM
Most nodes have been returned to service, both Cannon and FASSE are operable. Remaining down nodes will be remediated by FASRC staff over the next several days.
Investigating
April 27, 2025 at 11:02 AM
Investigating
April 27, 2025 at 11:02 AM
At 1:05am EDT MGHPCC lost power to compute due to cooling failure. The chiller has been restored and power is back as of 5:43am. As a result of this all compute is powered off and all running jobs were requeued. FASRC staff is triaging the down hardware and will bring it back online when it is ready. Jobs will start once FASRC staff have confirmed that everything is in good state.

Resolved
April 16, 2025 at 2:40 PM
Resolved
April 16, 2025 at 2:40 PM
holylabs is back up
Identified
April 16, 2025 at 2:27 PM
Identified
April 16, 2025 at 2:27 PM
holylabs must be rebooted to resolved a stability/funciton issue.

Resolved
April 14, 2025 at 5:18 PM
Resolved
April 14, 2025 at 5:18 PM
Cannon boslogin and FASSE login nodes are back up and operational.
All holylogin nodes are still down for repair, please see our posted incident for more updates: https://status.rc.fas.harvard.edu/cm97gyay90013dturk7fxg5pb
We apologize for the unexpected disruption.
Investigating
April 14, 2025 at 4:58 PM
Investigating
April 14, 2025 at 4:58 PM
Due to a configuration error, all cluster login nodes are rebooting and are temporarily unavailable. Please save any work immediately.

Resolved
April 14, 2025 at 6:48 PM
Resolved
April 14, 2025 at 6:48 PM
Hardware has been repaired and holyoke login nodes are back online. Thanks for your patience.
Monitoring
April 09, 2025 at 3:15 PM
Monitoring
April 09, 2025 at 3:15 PM
Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.
- holyoke login nodes (holylogin05-08) are down for hardware repair
- Only Boston login nodes available (ie, boslogin[05-08])
If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.
As always, the best method for obtaining a login node is usinglogin.rc.fas.harvard.edu which will pick a node for you.
If you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or (once they are back in service) holylogin.rc.fas.harvard.edu (Holyoke).
See also: Command line access with Terminal (login nodes) – FASRC DOCS
Resolved
April 09, 2025 at 3:01 PM
Resolved
April 09, 2025 at 3:01 PM
This incident was posted by mistake.

holylogin01-04 were replaced by holylogin05-08 some time back.

As always, the best method for obtaining a login node is usinglogin.rc.fas.harvard.edu which will pick a node for you.
Or if you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or holylogin.rc.fas.harvard.edu (Holyoke).
See also: Command line access with Terminal (login nodes) – FASRC DOCS
Investigating
April 07, 2025 at 7:33 PM
Investigating
April 07, 2025 at 7:33 PM
Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.
Audience:
- All cluster users
Impact:
- All holylogin** servers will be down till further notice
- Only Boston login nodes available (ie, boslogin[05-08])
If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.
Updates to follow as we have them.

Apr 2025 to Jun 2025

FAS Research Computing - Notice history

Notice history

Jun 2025

May 2025

Apr 2025