<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>FAS Research Computing Status - Incident history</title>
    <link>https://status.rc.fas.harvard.edu</link>
    <description>FAS Research Computing</description>
    <pubDate>Sun, 19 Apr 2026 11:55:47 +0000</pubDate>
    
<item>
  <title>login.rc.fas.harvard.edu is responding normally</title>
  <description>
    Type: Incident
    

    Affected Components: Login Nodes - Holyoke, Login Nodes - Boston, Login Nodes, login.rc.fas.harvard.edu
    Apr 19, 11:55:47 GMT+0 - Investigating - login.rc.fas.harvard.edu is not responding normally. This incident was automatically created. Apr 19, 12:55:25 GMT+0 - Resolved - [login.rc.fas.harvard.edu](http://login.rc.fas.harvard.edu) is responding normally. This incident was automatically resolved.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    
    <p><strong>Affected Components:</strong> Login Nodes - Holyoke, Login Nodes - Boston, Login Nodes, login.rc.fas.harvard.edu</p>
    <p><small>Apr <var data-var='date'> 19</var>, <var data-var='time'>11:55:47</var> GMT+0</small><br><strong>Investigating</strong> -
  login.rc.fas.harvard.edu is not responding normally. This incident was automatically created.</p>
<p><small>Apr <var data-var='date'> 19</var>, <var data-var='time'>12:55:25</var> GMT+0</small><br><strong>Resolved</strong> -
  <a href="http://login.rc.fas.harvard.edu">login.rc.fas.harvard.edu</a> is responding normally. This incident was automatically resolved.</p>
]]>
  </content:encoded>
  <pubDate>Sun, 19 Apr 2026 11:55:47 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmo5pma7u0n7mjomagp7omaed</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmo5pma7u0n7mjomagp7omaed</guid>
</item>
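<!-- Every item in this feed shares the structure shown above: title, description,
     content:encoded, pubDate, and a guid that uniquely identifies the incident.
     A minimal sketch of polling such a feed for new incidents with Python's
     standard library; the feed URL and the five-minute interval are assumptions,
     not details published by the status page.

import time
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://status.rc.fas.harvard.edu/feed.rss"  # assumed endpoint

def fetch_items(url=FEED_URL):
    # Yield one dict per <item> element in the RSS document.
    with urllib.request.urlopen(url, timeout=30) as resp:
        root = ET.fromstring(resp.read())
    for item in root.iter("item"):
        yield {tag: item.findtext(tag)
               for tag in ("title", "link", "guid", "pubDate")}

def poll(interval_s=300):
    # Print each incident once, keyed on its guid.
    seen = set()
    while True:
        for it in fetch_items():
            if it["guid"] not in seen:
                seen.add(it["guid"])
                print(it["pubDate"], it["title"], it["link"])
        time.sleep(interval_s)
-->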

<item>
  <title>Authentication outage</title>
  <description>
    Type: Incident
    Duration: 7 hours and 30 minutes

    Affected Components: Authentication
    Apr 19, 06:44:52 GMT+0 - Investigating - Authentication issues with openauth/radius. This incident was created by an automated monitoring service. Apr 19, 14:14:53 GMT+0 - Resolved - Openauth/radius is now operational. This update was created by an automated monitoring service. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 7 hours and 30 minutes</p>
    <p><strong>Affected Components:</strong> Authentication</p>
    <p><small>Apr <var data-var='date'> 19</var>, <var data-var='time'>06:44:52</var> GMT+0</small><br><strong>Investigating</strong> -
  Authentication issues with openauth/radius. This incident was created by an automated monitoring service.</p>
<p><small>Apr <var data-var='date'> 19</var>, <var data-var='time'>14:14:53</var> GMT+0</small><br><strong>Resolved</strong> -
  Openauth/radius is now operational. This update was created by an automated monitoring service.</p>
]]>
  </content:encoded>
  <pubDate>Sun, 19 Apr 2026 06:44:52 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmo5eigd3025we79u8go90tos</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmo5eigd3025we79u8go90tos</guid>
</item>

<item>
  <title>MGHPCC Power Loss</title>
  <description>
    Type: Incident
    Duration: 8 hours and 54 minutes

    Affected Components: FASSE Compute Cluster (Holyoke), FASRC Two-Factor (OpenAuth), HolyLFS04 (Tier 0), GPU nodes (Holyoke), Infiniband - Holyoke/MGHPCC, SLURM Scheduler - FASSE, Holyoke Tier 2 NFS (new), Holylabs, Holyoke Firewall, Network - Holyoke/MGHPCC, HolyLFS06 (Tier 0), Holyoke/MGHPCC Data Center, Cannon Compute Cluster (Holyoke), seas_compute, Cannon Open OnDemand/VDI, FASSE login nodes, Kempner Cluster GPU, Kempner Cluster CPU, FASSE Open OnDemand/VDI, Virtual Infrastructure - Holyoke, Holystore01 (Tier 0), Login Nodes - Holyoke, NESE (NorthEast Storage Exchange), SLURM Scheduler - Cannon, Isilon Storage Holyoke (Tier 1), holECS, Holyoke Specialty Storage, Login Nodes - Boston, HolyLFS05 (Tier 0), Netscratch (Global Scratch), Tape - (Tier 3), Boston Compute Nodes
    Apr 19, 06:18:00 GMT+0 - Investigating - At 2:18am on April 19th MGHPCC (our Holyoke datacenter) lost cooling, which caused the entire facility to shut down. This caused the loss of all jobs that were running. Storage and data on that storage should be safe. The facility is working on restoring cooling and power. Unfortunately we do not have an ETA. Apr 19, 13:06:16 GMT+0 - Identified - Power was fully restored to MGHPCC at 7:39am on April 19th. FASRC staff has restored functionality to most systems except for FASSE Open OnDemand. All other services are up and operating normally. If you continue to see issues with any system that is marked operational, please let us know. We will deal with any non-urgent requests during normal working hours. Apr 19, 15:12:24 GMT+0 - Resolved - All services, including FASSE OOD, should be functional at this time. If you continue to see issues with any system that is marked operational, please let us know by sending an email to [rchelp@rc.fas.harvard.edu](mailto:rchelp@rc.fas.harvard.edu)

We will deal with any non-urgent requests during normal working hours.

This incident has been resolved. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 8 hours and 54 minutes</p>
    <p><strong>Affected Components:</strong> FASSE Compute Cluster (Holyoke), FASRC Two-Factor (OpenAuth), HolyLFS04 (Tier 0), GPU nodes (Holyoke), Infiniband - Holyoke/MGHPCC, SLURM Scheduler - FASSE, Holyoke Tier 2 NFS (new), Holylabs, Holyoke Firewall, Network - Holyoke/MGHPCC, HolyLFS06 (Tier 0), Holyoke/MGHPCC Data Center, Cannon Compute Cluster (Holyoke), seas_compute, Cannon Open OnDemand/VDI, FASSE login nodes, Kempner Cluster GPU, Kempner Cluster CPU, FASSE Open OnDemand/VDI, Virtual Infrastructure - Holyoke, Holystore01 (Tier 0), Login Nodes - Holyoke, NESE (NorthEast Storage Exchange), SLURM Scheduler - Cannon, Isilon Storage Holyoke (Tier 1), holECS, Holyoke Specialty Storage, Login Nodes - Boston, HolyLFS05 (Tier 0), Netscratch (Global Scratch), Tape - (Tier 3), Boston Compute Nodes</p>
    <p><small>Apr <var data-var='date'> 19</var>, <var data-var='time'>06:18:00</var> GMT+0</small><br><strong>Investigating</strong> -
  At 2:18am on April 19th MGHPCC (our Holyoke datacenter) lost cooling, which caused the entire facility to shut down. This caused the loss of all jobs that were running. Storage and data on that storage should be safe. The facility is working on restoring cooling and power. Unfortunately we do not have an ETA.</p>
<p><small>Apr <var data-var='date'> 19</var>, <var data-var='time'>13:06:16</var> GMT+0</small><br><strong>Identified</strong> -
  Power was fully restored to MGHPCC at 7:39am on April 19th. FASRC staff has restored functionality to most systems except for FASSE Open OnDemand. All other services are up and operating normally. If you continue to see issues with any system that is marked operational, please let us know. We will deal with any non-urgent requests during normal working hours.</p>
<p><small>Apr <var data-var='date'> 19</var>, <var data-var='time'>15:12:24</var> GMT+0</small><br><strong>Resolved</strong> -
  All services, including FASSE OOD, should be functional at this time. If you continue to see issues with any system that is marked operational, please let us know by sending an email to <a href="mailto:rchelp@rc.fas.harvard.edu">rchelp@rc.fas.harvard.edu</a>

We will deal with any non-urgent requests during normal working hours.

This incident has been resolved.</p>
]]>
  </content:encoded>
  <pubDate>Sun, 19 Apr 2026 06:18:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmo5mx8uq0wy7113vg0ey0q9t</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmo5mx8uq0wy7113vg0ey0q9t</guid>
</item>

<item>
  <title>Starfish down</title>
  <description>
    Type: Incident
    Duration: 1 day and 48 minutes

    Affected Components: Starfish
    Apr 13, 16:30:00 GMT+0 - Investigating - Starfish is currently unavailable due to a network card issue. Updates to come.

We are currently investigating this incident. Apr 14, 14:17:21 GMT+0 - Identified - Staff will be at the datacenter today to check on the physical status of the server. Updates to come. 

We are continuing to work on a fix for this incident. Apr 14, 17:17:30 GMT+0 - Resolved - The network card has been replaced, and Starfish is back up. 

This incident has been resolved. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 day and 48 minutes</p>
    <p><strong>Affected Components:</strong> Starfish</p>
    <p><small>Apr <var data-var='date'> 13</var>, <var data-var='time'>16:30:00</var> GMT+0</small><br><strong>Investigating</strong> -
  Starfish is currently unavailable due to a network card issue. Updates to come.

We are currently investigating this incident.</p>
<p><small>Apr <var data-var='date'> 14</var>, <var data-var='time'>14:17:21</var> GMT+0</small><br><strong>Identified</strong> -
  Staff will be at the datacenter today to check on the physical status of the server. Updates to come.

We are continuing to work on a fix for this incident.</p>
<p><small>Apr <var data-var='date'> 14</var>, <var data-var='time'>17:17:30</var> GMT+0</small><br><strong>Resolved</strong> -
  The network card has been replaced, and Starfish is back up.

This incident has been resolved.</p>
]]>
  </content:encoded>
  <pubDate>Mon, 13 Apr 2026 16:30:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmnyotr4d003l6i6q561of3ku</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmnyotr4d003l6i6q561of3ku</guid>
</item>

<item>
  <title>Coldfront is down.</title>
  <description>
    Type: Incident
    Duration: 43 minutes

    Affected Components: Coldfront
    Apr 8, 13:31:31 GMT+0 - Investigating - Coldfront logins are producing an error message. We are currently investigating this incident. Apr 8, 14:14:35 GMT+0 - Resolved - Coldfront is back up and accepting logins. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 43 minutes</p>
    <p><strong>Affected Components:</strong> Coldfront</p>
    <p><small>Apr <var data-var='date'> 8</var>, <var data-var='time'>13:31:31</var> GMT+0</small><br><strong>Investigating</strong> -
  Coldfront logins are producing an error message. We are currently investigating this incident.</p>
<p><small>Apr <var data-var='date'> 8</var>, <var data-var='time'>14:14:35</var> GMT+0</small><br><strong>Resolved</strong> -
  Coldfront is back up and accepting logins.</p>
]]>
  </content:encoded>
  <pubDate>Wed, 8 Apr 2026 13:31:31 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmnq3714o000vs7adgfjwe485</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmnq3714o000vs7adgfjwe485</guid>
</item>

<item>
  <title>Starfish dashboard inaccessible</title>
  <description>
    Type: Incident
    Duration: 1 hour and 3 minutes

    Affected Components: Starfish
    Apr 6, 14:27:54 GMT+0 - Investigating - The Starfish dashboard is inaccessible. We are looking into the issue. Apr 6, 15:31:02 GMT+0 - Resolved - Starfish has resolved the issue and the dashboard is once again available. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 3 minutes</p>
    <p><strong>Affected Components:</strong> Starfish</p>
    <p><small>Apr <var data-var='date'> 6</var>, <var data-var='time'>14:27:54</var> GMT+0</small><br><strong>Investigating</strong> -
  The Starfish dashboard is inaccessible. We are looking into the issue.</p>
<p><small>Apr <var data-var='date'> 6</var>, <var data-var='time'>15:31:02</var> GMT+0</small><br><strong>Resolved</strong> -
  Starfish has resolved the issue and the dashboard is once again available.</p>
]]>
  </content:encoded>
  <pubDate>Mon, 6 Apr 2026 14:27:54 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmnnabux30qmbq6fm9rbvrvx8</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmnnabux30qmbq6fm9rbvrvx8</guid>
</item>

<item>
  <title>FASRC monthly maintenance April 6th 2026 9am-1pm</title>
  <description>
    Type: Maintenance
    Duration: 4 hours

    Affected Components: FASRC Two-Factor (OpenAuth), Cannon Open OnDemand/VDI, FASSE login nodes, FASSE Open OnDemand/VDI, Login Nodes - Holyoke, Login Nodes - Boston, Netscratch (Global Scratch), Login Nodes, VDI/OpenOnDemand, login.rc.fas.harvard.edu
    Apr 6, 13:00:01 GMT+0 - Identified - Maintenance is now in progress Apr 6, 17:00:00 GMT+0 - Completed - Maintenance has completed successfully Apr 6, 13:00:00 GMT+0 - Identified - FASRC monthly maintenance will take place on April 6th 2026. Our maintenance tasks should be completed between 9am-1pm.

**NOTICES:**

* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;
* Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* We&#039;d love to hear success stories about your or your lab&#039;s use of FASRC. Submit your story [here](https://www.rc.fas.harvard.edu/user-stories/).

**MAINTENANCE TASKS**

Cannon cluster will be paused during this maintenance?: **NO**  
FASSE cluster will be paused during this maintenance?: **NO**

* [two-factor.rc.fas.harvard.edu](http://two-factor.rc.fas.harvard.edu) [OpenAuth](https://docs.rc.fas.harvard.edu/kb/openauth/) cut-over to new server  
   * Audience: New accounts or anyone requesting an OpenAuth token  
   * Impact: two-factor will be unavailable while moving to a new server
* RStudio Server (Open OnDemand)  
   * Audience: RStudio Server users on Cannon and FASSE  
   * Impact: We will be decommissioning some versions of RStudio Server so we can properly maintain all production versions. Versions to be decommissioned:  
         * R 4.1.3 (Bioconductor 3.14, RStudio 2022.02.0)  
         * R 4.1.0 (Bioconductor 3.13, RStudio 1.4.1717)  
         * R 4.0.3 (Bioconductor 3.12, RStudio 1.3.1093)  
         * R 4.0.0 (Bioconductor 3.11, RStudio 1.3.1093)  
   * If you use one of these versions, we recommend replacing it with the most recent version, R 4.4.2 (Bioconductor 3.20, RStudio 2024.12.0). You must reinstall previously installed libraries.
* Domain controller replacement  
   * Audience: Internal  
   * Impact: End users should not see any impact
* OOD/Open OnDemand reboots  
   * Audience: All OOD users, reboot of the head nodes  
   * Impact: Running sessions will _not_ be affected
* Login node reboots  
   * Audience: All login node users  
   * Impact: Login nodes will reboot during the maintenance window
* Netscratch 90-day retention cleanup  
   * Audience: All netscratch users  
   * Impact: Files older than 90 days will be removed per our [scratch policy](https://docs.rc.fas.harvard.edu/kb/policy-scratch/). Please note that this cleanup can happen at any time, not just during maintenance.

Thank you,  
FAS Research Computing  
&lt;https://docs.rc.fas.harvard.edu/&gt;  
&lt;https://www.rc.fas.harvard.edu/&gt; 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 4 hours</p>
    <p><strong>Affected Components:</strong> FASRC Two-Factor (OpenAuth), Cannon Open OnDemand/VDI, FASSE login nodes, FASSE Open OnDemand/VDI, Login Nodes - Holyoke, Login Nodes - Boston, Netscratch (Global Scratch), Login Nodes, VDI/OpenOnDemand, login.rc.fas.harvard.edu</p>
    <p><small>Apr <var data-var='date'> 6</var>, <var data-var='time'>13:00:01</var> GMT+0</small><br><strong>Identified</strong> -
  Maintenance is now in progress.</p>
<p><small>Apr <var data-var='date'> 6</var>, <var data-var='time'>17:00:00</var> GMT+0</small><br><strong>Completed</strong> -
  Maintenance has completed successfully.</p>
<p><small>Apr <var data-var='date'> 6</var>, <var data-var='time'>13:00:00</var> GMT+0</small><br><strong>Identified</strong> -
  FASRC monthly maintenance will take place on April 6th 2026. Our maintenance tasks should be completed between 9am-1pm.

**NOTICES:**

* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;
* Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* We&#039;d love to hear success stories about your or your lab&#039;s use of FASRC. Submit your story [here](https://www.rc.fas.harvard.edu/user-stories/).

**MAINTENANCE TASKS**

Cannon cluster will be paused during this maintenance?: **NO**  
FASSE cluster will be paused during this maintenance?: **NO**

* [two-factor.rc.fas.harvard.edu](http://two-factor.rc.fas.harvard.edu) [OpenAuth](https://docs.rc.fas.harvard.edu/kb/openauth/) cut-over to new server  
   * Audience: New accounts or anyone requesting an OpenAuth token  
   * Impact: two-factor will be unavailable while moving to a new server
* RStudio Server (Open OnDemand)  
   * Audience: RStudio Server users on Cannon and FASSE  
   * Impact: We will be decommissioning some versions of RStudio Server so we can properly maintain all production versions. Versions to be decommissioned:  
         * R 4.1.3 (Bioconductor 3.14, RStudio 2022.02.0)  
         * R 4.1.0 (Bioconductor 3.13, RStudio 1.4.1717)  
         * R 4.0.3 (Bioconductor 3.12, RStudio 1.3.1093)  
         * R 4.0.0 (Bioconductor 3.11, RStudio 1.3.1093)  
   * If you use one of these versions, we recommend replacing it with the most recent version, R 4.4.2 (Bioconductor 3.20, RStudio 2024.12.0). You must reinstall previously installed libraries.
* Domain controller replacement  
   * Audience: Internal  
   * Impact: End users should not see any impact
* OOD/Open OnDemand reboots  
   * Audience: All OOD users, reboot of the head nodes  
   * Impact: Running sessions will _not_ be affected
* Login node reboots  
   * Audience: All login node users  
   * Impact: Login nodes will reboot during the maintenance window
* Netscratch 90-day retention cleanup  
   * Audience: All netscratch users  
   * Impact: Files older than 90 days will be removed per our [scratch policy](https://docs.rc.fas.harvard.edu/kb/policy-scratch/). Please note that this cleanup can happen at any time, not just during maintenance.

Thank you,  
FAS Research Computing  
&lt;https://docs.rc.fas.harvard.edu/&gt;  
&lt;https://www.rc.fas.harvard.edu/&gt;.</p>
]]>
  </content:encoded>
  <pubDate>Mon, 6 Apr 2026 13:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmn6ac2960e2v140x1wt0rf80</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmn6ac2960e2v140x1wt0rf80</guid>
</item>
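<!-- The Netscratch retention task in the maintenance item above removes files
     older than 90 days. A minimal sketch of such a purge, assuming a plain
     mtime criterion and the mount point /n/netscratch (both assumptions;
     FASRC's actual tooling and policy live at
     https://docs.rc.fas.harvard.edu/kb/policy-scratch/):

import os
import time

NETSCRATCH = "/n/netscratch"   # assumed mount point
MAX_AGE_S = 90 * 24 * 3600     # 90-day retention window from the notice above

def purge(root=NETSCRATCH, dry_run=True):
    cutoff = time.time() - MAX_AGE_S
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so a symlink is judged by its own mtime,
                # not by the file it points at
                if os.lstat(path).st_mtime < cutoff:
                    print("would remove" if dry_run else "removing", path)
                    if not dry_run:
                        os.remove(path)
            except OSError:
                pass  # file vanished or permission denied; skip it

purge()  # dry run: prints candidates without deleting anything
-->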

<item>
  <title>Scheduler is degraded</title>
  <description>
    Type: Incident
    Duration: 14 hours and 56 minutes

    Affected Components: GPU nodes (Holyoke), Cannon Compute Cluster (Holyoke), seas_compute, Kempner Cluster GPU, Kempner Cluster CPU, SLURM Scheduler - Cannon, Boston Compute Nodes, Cannon Cluster, Kempner Cluster
    Apr 1, 12:11:15 GMT+0 - Resolved - This incident has been resolved. The scheduler is running normally. Mar 31, 21:15:24 GMT+0 - Investigating - The scheduler is in a degraded state due to [thrashing](https://en.wikipedia.org/wiki/Thrashing%5F%28computer%5Fscience%29).  
We are actively working to resolve this problem. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 14 hours and 56 minutes</p>
    <p><strong>Affected Components:</strong> GPU nodes (Holyoke), Cannon Compute Cluster (Holyoke), seas_compute, Kempner Cluster GPU, Kempner Cluster CPU, SLURM Scheduler - Cannon, Boston Compute Nodes, Cannon Cluster, Kempner Cluster</p>
    <p><small>Apr <var data-var='date'> 1</var>, <var data-var='time'>12:11:15</var> GMT+0</small><br><strong>Resolved</strong> -
  This incident has been resolved. The scheduler is running normally.</p>
<p><small>Mar <var data-var='date'> 31</var>, <var data-var='time'>21:15:24</var> GMT+0</small><br><strong>Investigating</strong> -
  The scheduler is in a degraded state due to <a href="https://en.wikipedia.org/wiki/Thrashing%5F%28computer%5Fscience%29">thrashing</a>.
We are actively working to resolve this problem.</p>
]]>
  </content:encoded>
  <pubDate>Tue, 31 Mar 2026 21:15:24 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmnf48s6u02wctz2d7zg0wjmn</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmnf48s6u02wctz2d7zg0wjmn</guid>
</item>

<item>
  <title>two-factor.rc.fas.harvard.edu (openauth) error</title>
  <description>
    Type: Incident
    Duration: 1 hour and 12 minutes

    Affected Components: FASRC Two-Factor (OpenAuth)
    Mar 31, 15:32:20 GMT+0 - Investigating - We are currently investigating this incident. Requesting a new token or re-requesting your token from two-factor is not currently working.  Mar 31, 16:44:08 GMT+0 - Resolved - This incident has been resolved. two-factor.rc.fas.harvard.edu is working normally again. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 12 minutes</p>
    <p><strong>Affected Components:</strong> FASRC Two-Factor (OpenAuth)</p>
    <p><small>Mar <var data-var='date'> 31</var>, <var data-var='time'>15:32:20</var> GMT+0</small><br><strong>Investigating</strong> -
  We are currently investigating this incident. Requesting a new token or re-requesting your token from two-factor is not currently working.</p>
<p><small>Mar <var data-var='date'> 31</var>, <var data-var='time'>16:44:08</var> GMT+0</small><br><strong>Resolved</strong> -
  This incident has been resolved. two-factor.rc.fas.harvard.edu is working normally again.</p>
]]>
  </content:encoded>
  <pubDate>Tue, 31 Mar 2026 15:32:20 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmnerzlwh0g8hr8lzi8oq5mr2</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmnerzlwh0g8hr8lzi8oq5mr2</guid>
</item>

<item>
  <title>The web front end to two-factor.rc.fas.harvard.edu is currently not allowing logins, and generating new tokens is currently unavailable</title>
  <description>
    Type: Incident
    Duration: 6 days and 30 minutes

    Affected Components: FASRC Two-Factor (OpenAuth)
    Mar 25, 14:30:00 GMT+0 - Investigating - We are currently investigating this incident. Mar 31, 15:00:25 GMT+0 - Resolved - This incident has been resolved. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 6 days and 30 minutes</p>
    <p><strong>Affected Components:</strong> FASRC Two-Factor (OpenAuth)</p>
    <p><small>Mar <var data-var='date'> 25</var>, <var data-var='time'>14:30:00</var> GMT+0</small><br><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Mar <var data-var='date'> 31</var>, <var data-var='time'>15:00:25</var> GMT+0</small><br><strong>Resolved</strong> -
  This incident has been resolved.</p>
]]>
  </content:encoded>
  <pubDate>Wed, 25 Mar 2026 14:30:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmnes00kk0cz8e9ndj1i7o3xk</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmnes00kk0cz8e9ndj1i7o3xk</guid>
</item>

<item>
  <title>Network issues - Cluster degraded</title>
  <description>
    Type: Incident
    Duration: 5 days, 7 hours and 7 minutes

    Affected Components: GPU nodes (Holyoke), Network - Holyoke/MGHPCC, Cannon Compute Cluster (Holyoke), seas_compute, SLURM Scheduler - Cannon, Isilon Storage Holyoke (Tier 1), Boston Compute Nodes, Cannon Cluster
    Mar 25, 14:10:34 GMT+0 - Identified - Mounts to Holyoke Isilon (specifically /n/sw) are broken on numerous nodes across the cluster. We have a check rolling out to find these nodes so we can remediate them individually. Until remediated the cluster will be in a degraded state. Running jobs may randomly die or fail as they hit nodes that have stale mounts.

It will be risky to run jobs for the next hour and then, after that point, the cluster will have a large number of nodes closed waiting for them to drain so we can reboot them and fix the mounts. Mar 25, 13:34:01 GMT+0 - Investigating - A network issue affecting storage critical to the cluster is causing instability. The cluster is currently in a degraded state as a result. We are looking into the problem. Updates to follow. Mar 25, 14:31:18 GMT+0 - Monitoring - Mounts to Holyoke Isilon (specifically /n/sw) are broken on numerous nodes across the cluster. We have a check rolling out to find these nodes so we can remediate them individually. Until remediated the cluster will be in a degraded state. Running jobs may randomly die or fail as they hit nodes that have stale mounts.

It will be risky to run jobs for the next hour and then, after that point, the cluster will have a large number of nodes closed waiting for them to drain so we can reboot them and fix the mounts.

At this time we are unaware of any holy-isilon problems other than the effect this had on cluster nodes/running jobs. We will update should we identify any data storage concerns. Mar 30, 20:41:25 GMT+0 - Resolved - This incident has been resolved by draining and rebooting any nodes with stuck mounts. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 5 days, 7 hours and 7 minutes</p>
    <p><strong>Affected Components:</strong> GPU nodes (Holyoke), Network - Holyoke/MGHPCC, Cannon Compute Cluster (Holyoke), seas_compute, SLURM Scheduler - Cannon, Isilon Storage Holyoke (Tier 1), Boston Compute Nodes, Cannon Cluster</p>
    <p><small>Mar <var data-var='date'> 25</var>, <var data-var='time'>14:10:34</var> GMT+0</small><br><strong>Identified</strong> -
  Mounts to Holyoke Isilon (specifically /n/sw) are broken on numerous nodes across the cluster. We have a check rolling out to find these nodes so we can remediate them individually. Until remediated the cluster will be in a degraded state. Running jobs may randomly die or fail as they hit nodes that have stale mounts.

It will be risky to run jobs for the next hour and then, after that point, the cluster will have a large number of nodes closed waiting for them to drain so we can reboot them and fix the mounts.</p>
<p><small>Mar <var data-var='date'> 25</var>, <var data-var='time'>13:34:01</var> GMT+0</small><br><strong>Investigating</strong> -
  A network issue affecting storage critical to the cluster is causing instability. The cluster is currently in a degraded state as a result. We are looking into the problem. Updates to follow.</p>
<p><small>Mar <var data-var='date'> 25</var>, <var data-var='time'>14:31:18</var> GMT+0</small><br><strong>Monitoring</strong> -
  Mounts to Holyoke Isilon (specifically /n/sw) are broken on numerous nodes across the cluster. We have a check rolling out to find these nodes so we can remediate them individually. Until remediated the cluster will be in a degraded state. Running jobs may randomly die or fail as they hit nodes that have stale mounts.

It will be risky to run jobs for the next hour and then, after that point, the cluster will have a large number of nodes closed waiting for them to drain so we can reboot them and fix the mounts.

At this time we are unaware of any holy-isilon problems other than the effect this had on cluster nodes/running jobs. We will update should we identify any data storage concerns.</p>
<p><small>Mar <var data-var='date'> 30</var>, <var data-var='time'>20:41:25</var> GMT+0</small><br><strong>Resolved</strong> -
  This incident has been resolved by draining and rebooting any nodes with stuck mounts.</p>
]]>
  </content:encoded>
  <pubDate>Wed, 25 Mar 2026 13:34:01 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmn634bsn0co6fzrz7gjvlmtv</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmn634bsn0co6fzrz7gjvlmtv</guid>
</item>
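<!-- The incident above mentions a check rolled out to find nodes with stale
     /n/sw mounts. FASRC's actual check is not shown in the feed; a minimal
     sketch of one common approach, probing the mount in a child process with
     a timeout so a hung NFS mount is reported instead of hanging the probe:

import subprocess

MOUNTS = ["/n/sw"]  # the mount named in the incident above

def mount_is_stale(path, timeout_s=10):
    # stat(1) on a stale NFS mount typically blocks; the timeout turns
    # that hang into a detectable failure.
    try:
        subprocess.run(
            ["stat", "-t", path],
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        return False
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return True

for m in MOUNTS:
    print(m, "STALE" if mount_is_stale(m) else "ok")
-->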

<item>
  <title>ColdFront is down.</title>
  <description>
    Type: Incident
    Duration: 1 hour and 54 minutes

    Affected Components: Coldfront
    Mar 19, 12:58:36 GMT+0 - Identified - ColdFront is down. We are working to bring it back up. The instance got replaced last night, but it had trouble configuring itself on the way up again. Mar 19, 14:52:54 GMT+0 - Resolved - ColdFront is back up. Thank you for your patience.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 54 minutes</p>
    <p><strong>Affected Components:</strong> Coldfront</p>
    <p><small>Mar <var data-var='date'> 19</var>, <var data-var='time'>12:58:36</var> GMT+0</small><br><strong>Identified</strong> -
  ColdFront is down. We are working to bring it back up. The instance got replaced last night, but it had trouble configuring itself on the way up again.</p>
<p><small>Mar <var data-var='date'> 19</var>, <var data-var='time'>14:52:54</var> GMT+0</small><br><strong>Resolved</strong> -
  ColdFront is back up. Thank you for your patience.</p>
]]>
  </content:encoded>
  <pubDate>Thu, 19 Mar 2026 12:58:36 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmmxh7njd00qve87a87odxjyz</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmmxh7njd00qve87a87odxjyz</guid>
</item>

<item>
  <title>Key access issue to CSBN, HERS, FIINE, Portal Approve (p3approve)</title>
  <description>
    Type: Incident
    Duration: 5 days, 21 hours and 59 minutes

    Affected Components: portal.rc.fas.harvard.edu
    Mar 13, 20:35:07 GMT+0 - Investigating - We are currently investigating this incident. This only affects specific services. Users of CSBN, HERS, FIINE, Portal Approve (p3approve) may be affected. Email coming from these systems may also be delayed.  
No ETA Mar 19, 18:33:57 GMT+0 - Resolved - This incident has been resolved. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 5 days, 21 hours and 59 minutes</p>
    <p><strong>Affected Components:</strong> portal.rc.fas.harvard.edu</p>
    <p><small>Mar <var data-var='date'> 13</var>, <var data-var='time'>20:35:07</var> GMT+0</small><br><strong>Investigating</strong> -
  We are currently investigating this incident. This only affects specific services. Users of CSBN, HERS, FIINE, Portal Approve (p3approve) may be affected. Email coming from these systems may also be delayed.
No ETA.</p>
<p><small>Mar <var data-var='date'> 19</var>, <var data-var='time'>18:33:57</var> GMT+0</small><br><strong>Resolved</strong> -
  This incident has been resolved.</p>
]]>
  </content:encoded>
  <pubDate>Fri, 13 Mar 2026 20:35:07 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmmpcvot800j93wkll4ba6gwu</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmmpcvot800j93wkll4ba6gwu</guid>
</item>

<item>
  <title>Starfish dashboard unavailable</title>
  <description>
    Type: Incident
    Duration: 1 hour and 16 minutes

    Affected Components: Starfish
    Mar 2, 16:22:42 GMT+0 - Investigating - The Starfish dashboard is not responding. We are currently investigating this issue with the vendor. Mar 2, 17:38:26 GMT+0 - Resolved - This incident has been resolved. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 16 minutes</p>
    <p><strong>Affected Components:</strong> Starfish</p>
    <p><small>Mar <var data-var='date'> 2</var>, <var data-var='time'>16:22:42</var> GMT+0</small><br><strong>Investigating</strong> -
  The Starfish dashboard is not responding. We are currently investigating this issue with the vendor.</p>
<p><small>Mar <var data-var='date'> 2</var>, <var data-var='time'>17:38:26</var> GMT+0</small><br><strong>Resolved</strong> -
  This incident has been resolved.</p>
]]>
  </content:encoded>
  <pubDate>Mon, 2 Mar 2026 16:22:42 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmm9e0oz109onj5z9yxz1dcjr</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmm9e0oz109onj5z9yxz1dcjr</guid>
</item>

<item>
  <title>FASRC monthly maintenance Monday March 2nd, 2026 9am-1pm</title>
  <description>
    Type: Maintenance
    Duration: 4 hours

    Affected Components: Login Nodes - Holyoke, FASSE Compute Cluster (Holyoke), SLURM Scheduler - Cannon, GPU nodes (Holyoke), SLURM Scheduler - FASSE, Cannon Compute Cluster (Holyoke), seas_compute, Cannon Open OnDemand/VDI, FASSE login nodes, Kempner Cluster CPU, Kempner Cluster GPU, FASSE Open OnDemand/VDI, Login Nodes - Boston, Netscratch (Global Scratch), Boston Compute Nodes, Login Nodes, Cannon Cluster, VDI/OpenOnDemand, Kempner Cluster, FASSE Cluster, login.rc.fas.harvard.edu
    Mar 2, 14:00:01 GMT+0 - Identified - Maintenance is now in progress Mar 2, 14:00:00 GMT+0 - Identified - Monthly maintenance will take place on Monday March 2nd, 2026. Our maintenance tasks should be completed between 9am-1pm.

**NOTICES:**

* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;
* Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* We&#039;d love to hear success stories about your or your lab&#039;s use of FASRC. Submit your story [here](https://www.rc.fas.harvard.edu/user-stories/).

**MAINTENANCE TASKS**

Cannon cluster will be paused during this maintenance?: **YES**  
FASSE cluster will be paused during this maintenance?: **YES**

* Slurm scheduler update  
   * Audience: All cluster users  
   * Impact: Jobs will be paused during maintenance
* OOD node reboots  
   * Audience: All Open OnDemand users  
   * Impact: OOD nodes will reboot during the maintenance window
* Login node reboots  
   * Audience: All login node users  
   * Impact: Login nodes will reboot during the maintenance window
* Netscratch retention purge  
   * Audience: All users of Netscratch  
   * Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.

Thank you,  
FAS Research Computing  
&lt;https://docs.rc.fas.harvard.edu/&gt;  
&lt;https://www.rc.fas.harvard.edu/&gt; Mar 2, 18:00:00 GMT+0 - Completed - Maintenance has completed successfully
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 4 hours</p>
    <p><strong>Affected Components:</strong> Login Nodes - Holyoke, FASSE Compute Cluster (Holyoke), SLURM Scheduler - Cannon, GPU nodes (Holyoke), SLURM Scheduler - FASSE, Cannon Compute Cluster (Holyoke), seas_compute, Cannon Open OnDemand/VDI, FASSE login nodes, Kempner Cluster CPU, Kempner Cluster GPU, FASSE Open OnDemand/VDI, Login Nodes - Boston, Netscratch (Global Scratch), Boston Compute Nodes, Login Nodes, Cannon Cluster, VDI/OpenOnDemand, Kempner Cluster, FASSE Cluster, login.rc.fas.harvard.edu</p>
    <p><small>Mar <var data-var='date'> 2</var>, <var data-var='time'>14:00:01</var> GMT+0</small><br><strong>Identified</strong> -
  Maintenance is now in progress.</p>
<p><small>Mar <var data-var='date'> 2</var>, <var data-var='time'>14:00:00</var> GMT+0</small><br><strong>Identified</strong> -
  Monthly maintenance will take place on Monday March 2nd, 2026. Our maintenance tasks should be completed between 9am-1pm.

**NOTICES:**

* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;
* Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* We&#039;d love to hear success stories about your or your lab&#039;s use of FASRC. Submit your story [here](https://www.rc.fas.harvard.edu/user-stories/).

**MAINTENANCE TASKS**

Cannon cluster will be paused during this maintenance?: **YES**  
FASSE cluster will be paused during this maintenance?: **YES**

* Slurm scheduler update  
   * Audience: All cluster users  
   * Impact: Jobs will be paused during maintenance
* OOD node reboots  
   * Audience: All Open OnDemand users  
   * Impact: OOD nodes will reboot during the maintenance window
* Login node reboots  
   * Audience: All login node users  
   * Impact: Login nodes will reboot during the maintenance window
* Netscratch retention purge  
   * Audience: All users of Netscratch  
   * Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.

Thank you,  
FAS Research Computing  
&lt;https://docs.rc.fas.harvard.edu/&gt;  
&lt;https://www.rc.fas.harvard.edu/&gt;.</p>
<p><small>Mar <var data-var='date'> 2</var>, <var data-var='time'>18:00:00</var> GMT+0</small><br><strong>Completed</strong> -
  Maintenance has completed successfully.</p>
]]>
  </content:encoded>
  <pubDate>Mon, 2 Mar 2026 14:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmls7rae90625fg9xy9b3jc5m</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmls7rae90625fg9xy9b3jc5m</guid>
</item>

<item>
  <title>Tape outage</title>
  <description>
    Type: Incident
    Duration: 4 days, 17 hours and 42 minutes

    Affected Components: NESE (NorthEast Storage Exchange), Tape - (Tier 3)
    Mar 4, 15:09:37 GMT+0 - Resolved - This incident has been resolved. Normal tape operations are restored. Feb 27, 21:27:09 GMT+0 - Investigating - NESE Tape service will be down or operating with degraded service (no store and recall) Friday from 12 Noon EST until as late as Monday, 2 March at 9 AM.  
  
SUMMARY OF ISSUE:  
  
NESE Tape service is currently not able to store or recall files to and from tape due to vendor firmware issues in the IBM TS4500 tape library. The issue is related to the library robotics and cartridge database and we do NOT expect any data loss from this issue.  
  
The problem is apparently due to an issue with the inventory database related to a recent firmware update. This database can be scrubbed and reconstructed by the library, which will scan the bar code labels on all the cartridges to rebuild the inventory. Association of files in Globus to tapes is handled separately from the tape library and is not affected by the firmware update. Mar 2, 14:03:01 GMT+0 - Identified - NESE Tape Service is still working with IBM technical support on restoring the inventory. The expected downtime is extended until Tuesday March 3rd, 9am.  
Apologies for the inconvenience. Mar 3, 14:04:46 GMT+0 - Monitoring - The tape library outage is further extended to Wednesday March 4th at 9am awaiting a hardware replacement part due today. Data can still be uploaded to lab collections via Globus, but be mindful of the 10 TB buffer file limit. The outage affects storage and recall from tape.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 4 days, 17 hours and 42 minutes</p>
    <p><strong>Affected Components:</strong> NESE (NorthEast Storage Exchange), Tape - (Tier 3)</p>
    <p><small>Mar <var data-var='date'> 4</var>, <var data-var='time'>15:09:37</var> GMT+0</small><br><strong>Resolved</strong> -
  This incident has been resolved. Normal tape operations are restored.</p>
<p><small>Feb <var data-var='date'> 27</var>, <var data-var='time'>21:27:09</var> GMT+0</small><br><strong>Investigating</strong> -
  NESE Tape service will be down or operating with degraded service (no store and recall) Friday from 12 Noon EST until as late as Monday, 2 March at 9 AM.  
  
SUMMARY OF ISSUE:  
  
NESE Tape service is currently not able to store or recall files to and from tape due to vendor firmware issues in the IBM TS4500 tape library. The issue is related to the library robotics and cartridge database and we do NOT expect any data loss from this issue.  
  
The problem is apparently due to an issue with the inventory database related to a recent firmware update. This database can be scrubbed and reconstructed by the library, which will scan the bar code labels on all the cartridges to rebuild the inventory. Association of files in Globus to tapes is handled separately from the tape library and is not affected by the firmware update.</p>
<p><small>Mar <var data-var='date'> 2</var>, <var data-var='time'>14:03:01</var> GMT+0</small><br><strong>Identified</strong> -
  NESE Tape Service is still working with IBM technical support on restoring the inventory. The expected downtime is extended until Tuesday March 3rd, 9am.
Apologies for the inconvenience.</p>
<p><small>Mar <var data-var='date'> 3</var>, <var data-var='time'>14:04:46</var> GMT+0</small><br><strong>Monitoring</strong> -
  The tape library outage is further extended to Wednesday March 4th at 9am awaiting a hardware replacement part due today. Data can still be uploaded to lab collections via Globus, but be mindful of the 10 TB buffer file limit. The outage affects storage and recall from tape.</p>
]]>
  </content:encoded>
  <pubDate>Fri, 27 Feb 2026 21:27:09 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmm5ekmnr0020fzjvxkjv1iox</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmm5ekmnr0020fzjvxkjv1iox</guid>
</item>

<item>
  <title>Starfish dashboard is unavailable</title>
  <description>
    Type: Incident
    Duration: 1 day, 7 hours and 50 minutes

    Affected Components: Starfish
    Feb 27, 22:04:05 GMT+0 - Resolved - This incident has been resolved. The Starfish dashboard is available. Feb 26, 14:13:35 GMT+0 - Investigating - The Starfish dashboard is unavailable. We are currently investigating this issue with Starfish.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 day, 7 hours and 50 minutes</p>
    <p><strong>Affected Components:</strong> Starfish</p>
    <p><small>Feb <var data-var='date'> 27</var>, <var data-var='time'>22:04:05</var> GMT+0</small><br><strong>Resolved</strong> -
  This incident has been resolved. The Starfish dashboard is available.</p>
<p><small>Feb <var data-var='date'> 26</var>, <var data-var='time'>14:13:35</var> GMT+0</small><br><strong>Investigating</strong> -
  The Starfish dashboard is unavailable. We are currently investigating this issue with Starfish.</p>
]]>
  </content:encoded>
  <pubDate>Thu, 26 Feb 2026 14:13:35 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmm3jn9o500pvn3vlsrovaqif</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmm3jn9o500pvn3vlsrovaqif</guid>
</item>

<item>
  <title>Starfish maintenance Feb 25, 2026 all day</title>
  <description>
    Type: Maintenance
    Duration: 1 day

    Affected Components: Starfish
    Feb 25, 14:00:00 GMT+0 - Identified - Starfish will be unavailable from Wednesday, February 25th at 9AM until Thursday, February 26th at 9AM, for routine maintenance. The online dashboard will be inaccessible during this time. Feb 26, 14:00:00 GMT+0 - Completed - Maintenance has completed successfully Feb 25, 14:00:01 GMT+0 - Identified - Maintenance is now in progress
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 1 day</p>
    <p><strong>Affected Components:</strong> Starfish</p>
    <p><small>Feb <var data-var='date'> 25</var>, <var data-var='time'>14:00:00</var> GMT+0</small><br><strong>Identified</strong> -
  Starfish will be unavailable from Wednesday, February 25th at 9AM until Thursday, February 26th at 9AM, for routine maintenance. The online dashboard will be inaccessible during this time.</p>
<p><small>Feb <var data-var='date'> 26</var>, <var data-var='time'>14:00:00</var> GMT+0</small><br><strong>Completed</strong> -
  Maintenance has completed successfully.</p>
<p><small>Feb <var data-var='date'> 25</var>, <var data-var='time'>14:00:01</var> GMT+0</small><br><strong>Identified</strong> -
  Maintenance is now in progress.</p>
]]>
  </content:encoded>
  <pubDate>Wed, 25 Feb 2026 14:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmlvay24p0nlme0oeqgve2zzk</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmlvay24p0nlme0oeqgve2zzk</guid>
</item>

<item>
  <title>Authentication outage</title>
  <description>
    Type: Incident
    Duration: 5 minutes

    Affected Components: Authentication
    Feb 24, 15:39:56 GMT+0 - Investigating - Authentication issues with openauth/radius. This incident was created by an automated monitoring service. Feb 24, 15:44:57 GMT+0 - Resolved - Openauth/radius is now operational. This update was created by an automated monitoring service. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 5 minutes</p>
    <p><strong>Affected Components:</strong> Authentication</p>
    <p><small>Feb <var data-var='date'> 24</var>, <var data-var='time'>15:39:56</var> GMT+0</small><br><strong>Investigating</strong> -
  Authentication issues with openauth/radius. This incident was created by an automated monitoring service.</p>
<p><small>Feb <var data-var='date'> 24</var>, <var data-var='time'>15:44:57</var> GMT+0</small><br><strong>Resolved</strong> -
  Openauth/radius is now operational. This update was created by an automated monitoring service.</p>
]]>
  </content:encoded>
  <pubDate>Tue, 24 Feb 2026 15:39:56 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmm0ruk3v0117ca6jfq7m0o6l</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmm0ruk3v0117ca6jfq7m0o6l</guid>
</item>

<item>
  <title>NESE tape maintenance Feb 19th 2026</title>
  <description>
    Type: Maintenance
    Duration: 9 hours

    Affected Components: NESE (NorthEast Storage Exchange)
    Feb 19, 13:00:01 GMT+0 - Identified - Maintenance is now in progress Feb 19, 13:00:00 GMT+0 - Identified - From our partners at NESE. Details follow:

We are installing four new tape frames, which will bring the tape system raw storage capacity to 253 petabytes.

**Service Affected:** NESE Tape Service

**Maintenance Window:** 8:00 AM - 5:00 PM (EST)

* The tape service will be unavailable.
* All upgrade activities are expected to be completed on the same day.

NOTES:

* Monitor the MGHPCC Slack #nese channel for status updates and announcements
* Monitor &lt;https://nese.instatus.com/&gt; for real-time updates on progress

Subscribe to &lt;https://nese.instatus.com/subscribe/email&gt; for updates and announcements Feb 19, 22:00:00 GMT+0 - Completed - Maintenance has completed successfully 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 9 hours</p>
    <p><strong>Affected Components:</strong> NESE (NorthEast Storage Exchange)</p>
    <p><small>Feb <var data-var='date'> 19</var>, <var data-var='time'>13:00:01</var> GMT+0</small><br><strong>Identified</strong> -
  Maintenance is now in progress.</p>
<p><small>Feb <var data-var='date'> 19</var>, <var data-var='time'>13:00:00</var> GMT+0</small><br><strong>Identified</strong> -
  From our partners at NESE. Details follow:

We are installing four new tape frames, which will bring the tape system raw storage capacity to 253 petabytes.

**Service Affected:** NESE Tape Service

**Maintenance Window:** 8:00 AM - 5:00 PM (EST)

* The tape service will be unavailable.
* All upgrade activities are expected to be completed on the same day.

NOTES:

* Monitor the MGHPCC Slack #nese channel for status updates and announcements
* Monitor &lt;https://nese.instatus.com/&gt; for real-time updates on progress

Subscribe to &lt;https://nese.instatus.com/subscribe/email&gt; for updates and announcements.</p>
<p><small>Feb <var data-var='date'> 19</var>, <var data-var='time'>22:00:00</var> GMT+0</small><br><strong>Completed</strong> -
  Maintenance has completed successfully.</p>
]]>
  </content:encoded>
  <pubDate>Thu, 19 Feb 2026 13:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmkx2dbd201zt5svdsbm0pm92</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmkx2dbd201zt5svdsbm0pm92</guid>
</item>

<item>
  <title>OOD inaccessible</title>
  <description>
    Type: Incident
    Duration: 3 hours and 30 minutes

    Affected Components: Cannon Open OnDemand/VDI, FASSE Open OnDemand/VDI, VDI/OpenOnDemand
    Feb 11, 19:45:06 GMT+0 - Resolved - This incident has been resolved and OOD is working normally. Feb 11, 16:15:00 GMT+0 - Investigating - OpenOnDemand for both Cannon and FASSE may be inaccessible for some users. Errors may include: 

&quot;Error -- can&#039;t find user for &lt;username&gt;&quot;

&quot;502 proxy errors&quot;

For users who are able to access OOD, performance may be degraded or sessions may get stuck.

We are currently investigating the root causes of this incident. Updates to follow.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 3 hours and 30 minutes</p>
    <p><strong>Affected Components:</strong> Cannon Open OnDemand/VDI, FASSE Open OnDemand/VDI, VDI/OpenOnDemand</p>
    <p><small>Feb <var data-var='date'> 11</var>, <var data-var='time'>19:45:06</var> GMT+0</small><br><strong>Resolved</strong> -
  This incident has been resolved and OOD is working normally.</p>
<p><small>Feb <var data-var='date'> 11</var>, <var data-var='time'>16:15:00</var> GMT+0</small><br><strong>Investigating</strong> -
  OpenOnDemand for both Cannon and FASSE may be inaccessible for some users. Errors may include:

&quot;Error -- can&#039;t find user for &lt;username&gt;&quot;

&quot;502 proxy errors&quot;

For users who are able to access OOD, performance may be degraded or sessions may get stuck.

We are currently investigating the root causes of this incident. Updates to follow.</p>
]]>
  </content:encoded>
  <pubDate>Wed, 11 Feb 2026 16:15:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmlidbff700e2v2z4sn8zkflz</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmlidbff700e2v2z4sn8zkflz</guid>
</item>

<item>
  <title>Security updates needed for www.rc.fas.harvard.edu and docs.rc.fas.harvard.edu</title>
  <description>
    Type: Maintenance
    Duration: 8 minutes

    Affected Components: docs.rc.fas.harvard.edu, www.rc.fas.harvard.edu
    Feb 9, 21:10:00 GMT+0 - Identified - Security updates will require a brief interruption for our primary websites [www.rc.fas.harvard.edu](http://www.rc.fas.harvard.edu) and [docs.rc.fas.harvard.edu](http://docs.rc.fas.harvard.edu)

We will endeavour to keep this update as short as possible. Each site may be unavailable for a few minutes. Feb 9, 21:10:01 GMT+0 - Identified - Maintenance is now in progress Feb 9, 21:17:51 GMT+0 - Completed - Maintenance has completed successfully. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 8 minutes</p>
    <p><strong>Affected Components:</strong> docs.rc.fas.harvard.edu, www.rc.fas.harvard.edu</p>
    <p><small>Feb <var data-var='date'> 9</var>, <var data-var='time'>21:10:00</var> GMT+0</small><br><strong>Identified</strong> -
  Security updates will require a brief interruption for our primary websites <a href="http://www.rc.fas.harvard.edu">www.rc.fas.harvard.edu</a> and <a href="http://docs.rc.fas.harvard.edu">docs.rc.fas.harvard.edu</a>

We will endeavour to keep this update as short as possible. Each site may be unavailable for a few minutes.</p>
<p><small>Feb <var data-var='date'> 9</var>, <var data-var='time'>21:10:01</var> GMT+0</small><br><strong>Identified</strong> -
  Maintenance is now in progress.</p>
<p><small>Feb <var data-var='date'> 9</var>, <var data-var='time'>21:17:51</var> GMT+0</small><br><strong>Completed</strong> -
  Maintenance has completed successfully.</p>
]]>
  </content:encoded>
  <pubDate>Mon, 9 Feb 2026 21:10:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmlfnw5qt0xla10jy3plyxdx0</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmlfnw5qt0xla10jy3plyxdx0</guid>
</item>

<item>
  <title>License server issue</title>
  <description>
    Type: Incident
    Duration: 18 minutes

    Affected Components: Software &amp; Modules, FIINE billing portal, License Servers, Starfish, FASRC Downloads Site, Citrix, Grafana Cloud (FASRC), docs.rc.fas.harvard.edu, portal.rc.fas.harvard.edu, Spinal, Bauer MiniLIMS, www.rc.fas.harvard.edu, FASRC Offsite Hosting, FASRC Ticket System (ServiceNow), Coldfront, Websites &amp; Tools
    Feb 9, 18:54:59 GMT+0 - Investigating - New sessions of Matlab are hanging. 

We are currently investigating this incident. Feb 9, 19:04:52 GMT+0 - Identified - The affected software includes:

Matlab

Mathematica

Gurobi

We are continuing to work on a fix for this incident. Feb 9, 19:13:20 GMT+0 - Resolved - The license server is back up, and all software should be performing as expected. 

This incident has been resolved. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 18 minutes</p>
    <p><strong>Affected Components:</strong> Software &amp; Modules, FIINE billing portal, License Servers, Starfish, FASRC Downloads Site, Citrix, Grafana Cloud (FASRC), docs.rc.fas.harvard.edu, portal.rc.fas.harvard.edu, Spinal, Bauer MiniLIMS, www.rc.fas.harvard.edu, FASRC Offsite Hosting, FASRC Ticket System (ServiceNow), Coldfront, Websites &amp; Tools</p>
    <p><small>Feb <var data-var='date'> 9</var>, <var data-var='time'>18:54:59</var> GMT+0</small><br><strong>Investigating</strong> -
  New sessions of Matlab are hanging.

We are currently investigating this incident..&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 9&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;19:04:52&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  The affected software includes: 

Matlab

Mathematica

Gurobi

We are continuing to work on a fix for this incident..&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 9&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;19:13:20&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  The license server is back up, and all software should be performing as expected. 

This incident has been resolved..&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 9 Feb 2026 18:54:59 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmlfj7meu0zfk5p7vmpixtiwd</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmlfj7meu0zfk5p7vmpixtiwd</guid>
</item>

<item>
  <title>NESE tape maintenance Feb 9th 2026</title>
  <description>
    Type: Maintenance
    Duration: 9 hours

    Affected Components: NESE (NorthEast Storage Exchange)
    Feb 9, 13:00:00 GMT+0 - Identified - From our partners at NESE. Details follow:

As part of the tape front-end file caching system upgrade, we will be installing a new IBM Storage Scale System 6000\. We will provide an additional update when the software integration and data transfer from the current IBM Elastic Storage System 5000 are performed.

**Service Affected:** NESE Tape Service

**Maintenance Window: No Downtime expected**

NOTES:

* Monitor the MGHPCC Slack #nese channel for status updates and announcements
* Monitor &lt;https://nese.instatus.com/&gt; for real-time updates on progress
* Subscribe to &lt;https://nese.instatus.com/subscribe/email&gt; for updates and announcements Feb 9, 13:00:01 GMT+0 - Identified - Maintenance is now in progress Feb 9, 22:00:00 GMT+0 - Completed - Maintenance has completed successfully 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 9 hours</p>
    <p><strong>Affected Components:</strong> </p>
    &lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 9&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  From our partners at NESE. Details follow:

In the process of the tape front-end file caching system upgrade, we will be installing a new IBM Storage Scale System 6000\. We will provide an additional update when the software integration and data transfer from the current IBM Elastic Storage System 5000 are performed.

**Service Affected:** NESE Tape Service

**Maintenance Window: No Downtime expected**

NOTES:

* Monitor the MGHPCC Slack #nese channel for status updates and announcements
* Monitor &lt;https://nese.instatus.com/&gt; for real-time updates on progress
* Subscribe to &lt;https://nese.instatus.com/subscribe/email&gt; for updates and announcements.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 9&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 9&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;22:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 9 Feb 2026 13:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmkx2d5pn01lsytjor57r552r</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmkx2d5pn01lsytjor57r552r</guid>
</item>

<item>
  <title>Grafana Cloud (FASRC) is down</title>
  <description>
    Type: Incident
    

    Affected Components: Grafana Cloud (FASRC)
    Feb 6, 20:44:11 GMT+0 - Investigating - Grafana Cloud (FASRC) is down at the moment. This incident was automatically created by Instatus monitoring. Feb 6, 20:48:52 GMT+0 - Resolved - Grafana Cloud (FASRC) is back up. This incident was automatically resolved by Instatus monitoring. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    
    <p><strong>Affected Components:</strong> </p>
    &lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 6&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;20:44:11&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  Grafana Cloud (FASRC) is down at the moment. This incident was automatically created by Instatus monitoring..&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 6&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;20:48:52&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  Grafana Cloud (FASRC) is back up. This incident was automatically resolved by Instatus monitoring..&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Fri, 6 Feb 2026 20:44:10 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmlbcsgmd000u3t5jsy7daltl</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmlbcsgmd000u3t5jsy7daltl</guid>
</item>

<item>
  <title>FASRC monthly maintenance Monday February 2nd, 2026 9am-1pm</title>
  <description>
    Type: Maintenance
    Duration: 4 hours

    Affected Components: Cannon Open OnDemand/VDI, Login Nodes - Holyoke, , , FASSE Compute Cluster (Holyoke), GPU nodes (Holyoke), , SLURM Scheduler - FASSE, , SLURM Scheduler - Cannon, Cannon Compute Cluster (Holyoke), , FASSE Open OnDemand/VDI, seas_compute, FASSE login nodes, Kempner Cluster CPU, Kempner Cluster GPU, Login Nodes - Boston, Netscratch (Global Scratch), , Boston Compute Nodes, 
Login Nodes → 
Cannon Cluster → 
VDI/OpenOnDemand → 
Kempner Cluster → 
FASSE Cluster → 
login.rc.fas.harvard.edu →
    Feb 2, 14:00:00 GMT+0 - Identified - Monthly maintenance will take place on Monday February 2nd, 2026\. Our maintenance tasks should be completed between 9am-1pm.

**NOTICES:**

* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;
* Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* We&#039;d love to hear success stories about your or your lab&#039;s use of FASRC. Submit your story [here](https://www.rc.fas.harvard.edu/user-stories/).

**MAINTENANCE TASKS**

Cannon cluster will be paused during this maintenance?: **YES**  
FASSE cluster will be paused during this maintenance?: **YES**

* MaxTime change  
   * Audience: Cluster users  
   * Impact: To improve scheduling efficiency and stability, we will set a maximum run time of 3 days on all partitions that currently have MaxTime set to UNLIMITED. The unrestricted partition will be set to 365 days. Partitions that already have MaxTime set will retain their current setting. Partition owners wishing to set a different MaxTime for their partition should contact FASRC. Note that we do not guarantee uptime, so users should use checkpointing to save state in case of node failure. (A brief command-line example follows this task list.)
* Slurm upgrade to 25.11.2  
   * Audience: All cluster users  
   * Impact: Jobs will be paused during maintenance
* OOD node reboots  
   * Audience: All Open OnDemand users  
   * Impact: OOD nodes will reboot during the maintenance window
* Login node reboots  
   * Audience: All login node users  
   * Impact: Login nodes will reboot during the maintenance window
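
As a reference for the MaxTime change above, here is a minimal sketch of how a user might check partition time limits and request a run time from the command line. The partition and script names are placeholders for illustration, not FASRC-specific guidance:

```bash
# Show each partition and its time limit (the TIMELIMIT column
# reflects the partition's MaxTime setting).
sinfo --format="%P %l"

# Request an explicit run time under a 3-day cap when submitting;
# "shared" and "my_job.sh" are placeholder names.
sbatch --partition=shared --time=3-00:00:00 my_job.sh
```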

Thank you,  
FAS Research Computing  
&lt;https://docs.rc.fas.harvard.edu/&gt;  
[https://www.rc.fas.harvard.edu/](https://www.rc.fas.harvard.edu/) Feb 2, 14:00:01 GMT+0 - Identified - Maintenance is now in progress Feb 2, 18:00:00 GMT+0 - Completed - Maintenance has completed successfully 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 4 hours</p>
    <p><strong>Affected Components:</strong> , , , , , , , , , , , , , , , , , , , , </p>
    &lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 2&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Monthly maintenance will take place on Monday February 2nd, 2026\. Our maintenance tasks should be completed between 9am-1pm.

**NOTICES:**

* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;
* Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* We&#039;d love to hear success stories about your or your lab&#039;s use of FASRC. Submit your story [here](https://www.rc.fas.harvard.edu/user-stories/).

**MAINTENANCE TASKS**

Cannon cluster will be paused during this maintenance?: **YES**  
FASSE cluster will be paused during this maintenance?: **YES**

* MaxTime change  
   * Audience: Cluster users  
   * Impact: To improve scheduling efficiency and stability, we will set a maximum run time of 3 days on all partitions that currently have MaxTime set to UNLIMITED. The unrestricted partition will be set to 365 days. Partitions that already have MaxTime set will retain their current setting. Partition owners wishing to set a different MaxTime for their partition should contact FASRC. Note that we do not guarantee uptime, so users should use checkpointing to save state in case of node failure.
* Slurm upgrade to 25.11.2  
   * Audience: All cluster users  
   * Impact: Jobs will be paused during maintenance
* OOD node reboots  
   * Audience: All Open OnDemand users  
   * Impact: OOD nodes will reboot during the maintenance window
* Login node reboots  
   * Audience: All login node users  
   * Impact: Login nodes will reboot during the maintenance window

Thank you,  
FAS Research Computing  
&lt;https://docs.rc.fas.harvard.edu/&gt;  
[https://www.rc.fas.harvard.edu/](https://www.rc.fas.harvard.edu/).&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 2&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Feb &lt;var data-var=&#039;date&#039;&gt; 2&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;18:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 2 Feb 2026 14:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmkvgvpg4095lzxnkkb2spait</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmkvgvpg4095lzxnkkb2spait</guid>
</item>

<item>
  <title>Coldfront down</title>
  <description>
    Type: Incident
    Duration: 59 minutes

    Affected Components: Coldfront
Jan 22, 16:04:46 GMT+0 - Resolved - Coldfront is operational. Thank you for your patience. Jan 22, 15:06:02 GMT+0 - Investigating - We are currently investigating an issue with Coldfront. No ETA. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 59 minutes</p>
    <p><strong>Affected Components:</strong> </p>
    &lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 22&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;16:04:46&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  Coldfront is operational. Thank you for your patience..&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 22&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;15:06:02&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating an issue with Coldfront. No ETA..&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Thu, 22 Jan 2026 15:06:02 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmkpl3ux008buvdviozj30q33</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmkpl3ux008buvdviozj30q33</guid>
</item>

<item>
  <title>holystore01 down</title>
  <description>
    Type: Incident
    Duration: 1 hour and 1 minute

    Affected Components: Holystore01 (Tier 0)
Jan 21, 16:28:00 GMT+0 - Investigating - The filesystem holystore01 is experiencing a network failure and is in a bad state. 

Some files on holystore01 may not be accessible while this is ongoing. We are working to restore access, and apologize for the inconvenience.  Jan 21, 17:29:01 GMT+0 - Resolved - holystore01 is back up and usable. 

This incident has been resolved. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 1 minute</p>
    <p><strong>Affected Components:</strong> </p>
    &lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 21&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;16:28:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  The filesystem holystore01 is experiencing a network failure and is in a bad state. 

Some files on holystore01 may not be accessible while this is ongoing. We are working to restore access, and apologize for the inconvenience. .&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 21&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;17:29:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  holystore01 is back up and usable. 

This incident has been resolved..&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Wed, 21 Jan 2026 16:28:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmko8lfbe001zmrieugg7o6iy</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmko8lfbe001zmrieugg7o6iy</guid>
</item>

<item>
  <title>Grafana Cloud is down</title>
  <description>
    Type: Incident
    

    
Jan 15, 15:34:57 GMT+0 - Investigating - Grafana Cloud is down at the moment. This incident was automatically created by Instatus monitoring. Jan 15, 15:46:56 GMT+0 - Resolved - Grafana Cloud is back up. This incident was automatically resolved by Instatus monitoring. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    
    
    &lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 15&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;15:34:57&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
   Grafana Cloud is down at the moment. This incident was automatically created by Instatus monitoring..&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 15&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;15:46:56&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
   Grafana Cloud is back up. This incident was automatically resolved by Instatus monitoring..&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Thu, 15 Jan 2026 15:34:57 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmkfm22ra00gx7uwlws14tkm0</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmkfm22ra00gx7uwlws14tkm0</guid>
</item>

<item>
  <title>holystore01 is wedging. We are rebooting.</title>
  <description>
    Type: Incident
    Duration: 14 minutes

    Affected Components: Holystore01 (Tier 0)
    Jan 13, 14:57:56 GMT+0 - Identified - holystore01 is wedging. We are rebooting. Jan 13, 15:12:13 GMT+0 - Resolved - This incident has been resolved. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 14 minutes</p>
    <p><strong>Affected Components:</strong> </p>
    &lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 13&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:57:56&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  holystore01 is wedging. We are rebooting..&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 13&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;15:12:13&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved..&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Tue, 13 Jan 2026 14:57:56 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmkcpurir09emrh2dgobpkwsc</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmkcpurir09emrh2dgobpkwsc</guid>
</item>

<item>
  <title>FASRC monthly maintenance Monday January 12th, 2026 9am-1pm</title>
  <description>
    Type: Maintenance
    Duration: 4 hours

    Affected Components: , , FASSE Compute Cluster (Holyoke), SLURM Scheduler - Cannon, GPU nodes (Holyoke), , SLURM Scheduler - FASSE, , Cannon Compute Cluster (Holyoke), , Cannon Open OnDemand/VDI, FASSE Open OnDemand/VDI, seas_compute, Login Nodes - Holyoke, FASSE login nodes, Kempner Cluster CPU, Kempner Cluster GPU, Login Nodes - Boston, , Boston Compute Nodes, 
Login Nodes → 
Cannon Cluster → 
VDI/OpenOnDemand → 
Kempner Cluster → 
FASSE Cluster → 
login.rc.fas.harvard.edu →
    Jan 12, 14:00:00 GMT+0 - Identified - Monthly maintenance will take place on January 12th, 2026\. Our maintenance tasks should be completed between 9am-1pm.

**NOTICES:**

* Changes to SEAS partitions, please see tasks below.
* Changes to job age priority weighting, please see tasks below.
* Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* We&#039;d love to hear success stories about your or your lab&#039;s use of FASRC. Submit your story [here](https://www.rc.fas.harvard.edu/user-stories/).

**MAINTENANCE TASKS**

Cannon cluster will be paused during this maintenance?: **YES**  
FASSE cluster will be paused during this maintenance?: **YES**

* Slurm upgrade to 25.11.1  
   * Audience: All cluster users (Cannon and FASSE)  
   * Impact: Jobs will be paused during maintenance
* In conjunction with SEAS we will modify seas\_gpu and seas\_compute time limits  
   * Audience: SEAS users  
   * Impact:  
   seas\_gpu: will be set to 2 days maximum  
   seas\_compute: will be set to 3 days maximum  
   Existing pending jobs longer than these limits will be reduced to 2-day or 3-day run times, depending on partition.
* Job Age Priority Weight Change  
   * Audience: Cluster users  
   * Impact: We will be adjusting the weight applied to the priority earned by jobs by virtue of their age. Currently job priority is made up of two factors: Fairshare and Job Age. The Job Age factor is currently set such that jobs gain priority over 3 days, with a maximum priority equivalent to jobs with Fairshare of 0.5\. This keeps low fairshare jobs from languishing at the bottom of the queue. With the current settings, though, users with low fairshare can gain a significant advantage over users with higher relative fairshare. To remedy this we will be adjusting the Job Age weight to cap out at an equivalent Fairshare of 0.1\. This will still allow jobs with 0 fairshare to gain priority and thus not languish, while letting fairshare govern a wider range of higher priority jobs. (See the `sprio` example after this task list.)
* Login node reboots  
   * Audience: All login node users  
   * Impact: Login nodes will reboot during the maintenance window
* Open OnDemand (OOD) node reboots  
   * Audience: All OOD users  
   * Impact: OOD nodes will reboot during the maintenance window
* Netscratch retention will run  
   * Audience: All cluster netscratch users  
   * Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.
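
As context for the Job Age priority change above, here is a minimal sketch of how a user might inspect these priority factors from a login node; the exact columns shown depend on the site's Slurm configuration:

```bash
# Show the priority breakdown for pending jobs; the AGE and
# FAIRSHARE columns are the two weighted components described above.
sprio --long

# Show normalized fairshare values per account and user.
sshare
```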

Thank you,  
FAS Research Computing  
&lt;https://docs.rc.fas.harvard.edu/&gt;  
[https://www.rc.fas.harvard.edu/](https://www.rc.fas.harvard.edu/) Jan 12, 14:00:01 GMT+0 - Identified - Maintenance is now in progress Jan 12, 18:00:00 GMT+0 - Completed - Maintenance has completed successfully 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 4 hours</p>
    <p><strong>Affected Components:</strong> , , , , , , , , , , , , , , , , , , , </p>
    &lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 12&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Monthly maintenance will take place on January 12th, 2026\. Our maintenance tasks should be completed between 9am-1pm.

**NOTICES:**

* Changes to SEAS partitions, please see tasks below.
* Changes to job age priority weighting, please see tasks below.
* Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* We&#039;d love to hear success stories about your or your lab&#039;s use of FASRC. Submit your story [here](https://www.rc.fas.harvard.edu/user-stories/).

**MAINTENANCE TASKS**

Cannon cluster will be paused during this maintenance?: **YES**  
FASSE cluster will be paused during this maintenance?: **YES**

* Slurm upgrade to 25.11.1  
   * Audience: All cluster users (Cannon and FASSE)  
   * Impact: Jobs will be paused during maintenance
* In conjunction with SEAS we will modify seas\_gpu and seas\_compute time limits  
   * Audience: SEAS users  
   * Impact:  
   seas\_gpu: will be set to 2 days maximum  
   seas\_compute: will be set to 3 days maximum  
   Existing pending jobs longer than these limits will be reduced to 2-day or 3-day run times, depending on partition.
* Job Age Priority Weight Change  
   * Audience: Cluster users  
   * Impact: We will be adjusting the weight applied to the priority earned by jobs by virtue of their age. Currently job priority is made up of two factors: Fairshare and Job Age. The Job Age factor is currently set such that jobs gain priority over 3 days, with a maximum priority equivalent to jobs with Fairshare of 0.5\. This keeps low fairshare jobs from languishing at the bottom of the queue. With the current settings, though, users with low fairshare can gain a significant advantage over users with higher relative fairshare. To remedy this we will be adjusting the Job Age weight to cap out at an equivalent Fairshare of 0.1\. This will still allow jobs with 0 fairshare to gain priority and thus not languish, while letting fairshare govern a wider range of higher priority jobs.
* Login node reboots  
   * Audience: All login node users  
   * Impact: Login nodes will reboot during the maintenance window
* Open OnDemand (OOD) node reboots  
   * Audience: All OOD users  
   * Impact: OOD nodes will reboot during the maintenance window
* Netscratch retention will run  
   * Audience: All cluster netscratch users  
   * Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.

Thank you,  
FAS Research Computing  
&lt;https://docs.rc.fas.harvard.edu/&gt;  
[https://www.rc.fas.harvard.edu/](https://www.rc.fas.harvard.edu/).&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 12&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 12&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;18:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 12 Jan 2026 14:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmk1ivije002e49jtjj5n83yl</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmk1ivije002e49jtjj5n83yl</guid>
</item>

<item>
  <title>fasselogin01 reboot</title>
  <description>
    Type: Incident
    Duration: 26 minutes

    Affected Components: FASSE login nodes
    Jan 9, 16:32:56 GMT+0 - Identified - fasselogin01 will be rebooted at 11:45 to fix some mounts Jan 9, 16:58:42 GMT+0 - Resolved - fasselogin01 is back with the correct mounts. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 26 minutes</p>
    <p><strong>Affected Components:</strong> </p>
    &lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 9&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;16:32:56&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  fasselogin01 will be rebooted at 11:45 to fix some mounts.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 9&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;16:58:42&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  fasselogin01 is back with the correct mounts..&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Fri, 9 Jan 2026 16:32:56 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmk73hin4089gm6zlauy9ilxl</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmk73hin4089gm6zlauy9ilxl</guid>
</item>

<item>
  <title>Slurm down</title>
  <description>
    Type: Incident
    Duration: 2 hours and 7 minutes

    Affected Components: NESE (NorthEast Storage Exchange), Holyoke Tier 2 NFS (new), Holyoke-Boston fiber link (long path), Globus Data Transfer, GPU nodes (Holyoke), FASSE Compute Cluster (Holyoke), Isilon Storage Holyoke (Tier 1), Starfish, Boston Tier 2 NFS (new), Infiniband - Holyoke/MGHPCC, HolyLFS04 (Tier 0), SLURM Scheduler - FASSE, Web Proxies, Network - Holyoke/MGHPCC, Virtual Infrastructure - Holyoke, Holyoke-Boston fiber link (short path), Infiniband - Boston, HolyLFS06 (Tier 0), bosECS, Boston Specialty Storage, Authentication, Boston Data Center, Samba Cluster, Software &amp; Modules, Holyoke/MGHPCC Data Center, Cannon Compute Cluster (Holyoke), Network - Boston, CEPH Storage Boston (Tier 2), Cannon Open OnDemand/VDI, Cambridge firewall and other redundancy, SLURM Scheduler - Cannon, Holylabs, FASSE Open OnDemand/VDI, FIINE billing portal, License Servers, seas_compute, Login Nodes - Holyoke, FASSE login nodes, Kempner Cluster CPU, Kempner Cluster GPU, holECS, Holyoke Specialty Storage, FASRC Two-Factor (OpenAuth), FASRC Downloads Site, Citrix, Login Nodes - Boston, HolyLFS05 (Tier 0), Virtual Infrastructure - Boston, Network - Cambridge, FASRC VPN (Cambridge) , FASRC VPN (Boston), Holystore01 (Tier 0), Grafana Cloud (FASRC), BosLFS02 (Tier 0), Isilon Storage Boston (Tier 1), Home Directory Storage - Boston, docs.rc.fas.harvard.edu, portal.rc.fas.harvard.edu, Spinal, Bauer MiniLIMS, www.rc.fas.harvard.edu, Netscratch (Global Scratch), Tape - (Tier 3), FASRC Offsite Hosting, FASRC Ticket System (ServiceNow), Boston Compute Nodes, Harvard DNS System, Coldfront, Holyoke Firewall
    Jan 9, 00:11:25 GMT+0 - Investigating - The Slurm scheduler is currently down and no new jobs are able to be scheduled. 

We are currently investigating this incident and will provide updates.  Jan 9, 02:18:39 GMT+0 - Resolved - The rogue job has been found and removed. The scheduler is running normally again and all partitions are open. Jan 9, 02:20:18 GMT+0 - Resolved - This incident has been resolved. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 2 hours and 7 minutes</p>
    <p><strong>Affected Components:</strong> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , </p>
    &lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 9&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;00:11:25&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  The Slurm scheduler is currently down and no new jobs are able to be scheduled. 

We are currently investigating this incident and will provide updates. .&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 9&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;02:18:39&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  The rogue job has been found and removed. The scheduler is running normally again and all partitions are open..&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 9&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;02:20:18&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved..&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Fri, 9 Jan 2026 00:11:25 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/incident/cmk64fa0902qk10oyoj43y58n</link>
  <guid>https://status.rc.fas.harvard.edu/incident/cmk64fa0902qk10oyoj43y58n</guid>
</item>

<item>
  <title>Monthly Maintenance and MGHPCC Power Work - Dec. 8, 2025 6am-6pm</title>
  <description>
    Type: Maintenance
    Duration: 12 hours

    Affected Components: Login Nodes - Holyoke, , GPU nodes (Holyoke), FASSE Compute Cluster (Holyoke), Isilon Storage Holyoke (Tier 1), , SLURM Scheduler - FASSE, Virtual Infrastructure - Holyoke, , Cannon Compute Cluster (Holyoke), , Cannon Open OnDemand/VDI, SLURM Scheduler - Cannon, FASSE Open OnDemand/VDI, License Servers, seas_compute, , FASSE login nodes, Kempner Cluster CPU, Kempner Cluster GPU, Login Nodes - Boston, Virtual Infrastructure - Boston, Isilon Storage Boston (Tier 1), , Boston Compute Nodes, 
Login Nodes → 
VDI/OpenOnDemand → 
Kempner Cluster → 
FASSE Cluster → 
Cannon Cluster → 
login.rc.fas.harvard.edu →
    Dec 8, 11:00:00 GMT+0 - Identified - Monthly maintenance will take place on December 8th. Our maintenance tasks should be completed between 9am-1pm. However: 

_Additionally_, MGHPCC will be performing power upgrades on the odd side of Row 8A, where much of our compute resides. This is the final upgrade for this row. The current estimate for this work is a 12-hour window, 6am-6pm.

A list of the affected partitions is provided at the bottom of this notice. The nodes in those partitions will be drained prior to the work and will be powered down. Once the work is completed, those nodes will be returned to service. 

**Notices:**

* New FASSE partition `fasse_gpu_h200`. This partition has 2 H200 nodes and a 3-day limit. It is available now.
* 11/26 - 11/28 are university holidays (Thanksgiving). No on-site support; FASRC staff will return on 12/1.
* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;
* Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* We&#039;d love to hear success stories about your or your lab&#039;s use of FASRC. Submit your story [here](https://www.rc.fas.harvard.edu/user-stories/).

**MAINTENANCE TASKS**

Cannon cluster will be paused during this maintenance?: **PARTIAL OUTAGE/YES**  
FASSE cluster will be paused during this maintenance?: **PARTIAL OUTAGE/YES**

* Power work on Row 8A odd  
   * Audience: Users of the partitions listed below  
   * Impact: These nodes and partitions will be fully or partially down all day
* OneFS (Isilon) upgrade  
   * Audience: All Isilon (Tier 1) shares  
   * Impact: Some VMs will be impacted including Cannon OOD, CBScentral, MCZapps/MCZbase, Portal, and Rclic1 (license server)
* Slurm upgrade to 25.05.5  
   * Audience: All cluster users  
   * Impact: Jobs will be paused during maintenance
* Login node reboots  
   * Audience: All login node users  
   * Impact: Login nodes will reboot during the maintenance window

**Impacted Cannon Partitions (Full or Partial Outage):**

* arguelles\_delgado\_gpu\_a100
* arguelles\_delgado\_gpu\_mixed
* bigmem\_intermediate
* blackhole\_gpu
* eddy
* gershman
* gpu\_requeue
* hejazi
* hernquist\_ice
* hoekstra
* huce\_ice
* iaifi\_gpu
* iaifi\_gpu\_priority
* iaifi\_gpu\_requeue
* itc\_gpu
* jshapiro
* kempner
* kempner\_dev
* kempner\_priority
* kempner\_h100
* kempner\_h100\_priority
* kempner\_h100\_priority2
* kempner\_h100\_priority3
* kempner\_interactive
* kempner\_requeue
* kovac
* kozinsky
* kozinsky\_gpu
* kozinsky\_priority
* kozinsky\_requeue
* murphy\_ice
* ortegahernandez\_ice
* rivas
* seas\_compute
* seas\_gpu
* serial\_requeue
* siag\_combo
* siag\_gpu
* sur
* zhuang Dec 8, 11:00:01 GMT+0 - Identified - Maintenance is now in progress Dec 8, 23:00:00 GMT+0 - Completed - Maintenance has completed successfully 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 12 hours</p>
    <p><strong>Affected Components:</strong> , , , , , , , , , , , , , , , , , , , , , , , , </p>
    &lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 8&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;11:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Monthly maintenance will take place on December 8th. Our maintenance tasks should be completed between 9am-1pm. However: 

_Additionally_, MGHPCC will be performing power upgrades on the odd side of Row 8A, where much of our compute resides. This is the final upgrade for this row. The current estimate for this work is a 12-hour window, 6am-6pm.

A list of the affected partitions is provided at the bottom of this notice. The nodes in those partitions will be drained prior to the work and will be powered down. Once the work is completed, those nodes will be returned to service. 

**Notices:**

* New FASSE partition `fasse_gpu_h200`. This partition has 2 H200 nodes and a 3-day limit. It is available now.
* 11/26 - 11/28 are university holidays (Thanksgiving). No on-site support; FASRC staff will return on 12/1.
* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;
* Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* We&#039;d love to hear success stories about your or your lab&#039;s use of FASRC. Submit your story [here](https://www.rc.fas.harvard.edu/user-stories/).

**MAINTENANCE TASKS**

Cannon cluster will be paused during this maintenance?: **PARTIAL OUTAGE/YES**  
FASSE cluster will be paused during this maintenance?: **PARTIAL OUTAGE/YES**

* Power work on Row 8A odd  
   * Audience: Users of the partitions listed below  
   * Impact: These nodes and partitions will be fully or partially down all day
* OneFS (Isilon) upgrade  
   * Audience: All Isilon (Tier 1) shares  
   * Impact: Some VMs will be impacted including Cannon OOD, CBScentral, MCZapps/MCZbase, Portal, and Rclic1 (license server)
* Slurm upgrade to 25.05.5  
   * Audience: All cluster users  
   * Impact: Jobs will be paused during maintenance
* Login node reboots  
   * Audience: All login node users  
   * Impact: Login nodes will reboot during the maintenance window

**Impacted Cannon Partitions (Full or Partial Outage):**

* arguelles\_delgado\_gpu\_a100
* arguelles\_delgado\_gpu\_mixed
* bigmem\_intermediate
* blackhole\_gpu
* eddy
* gershman
* gpu\_requeue
* hejazi
* hernquist\_ice
* hoekstra
* huce\_ice
* iaifi\_gpu
* iaifi\_gpu\_priority
* iaifi\_gpu\_requeue
* itc\_gpu
* jshapiro
* kempner
* kempner\_dev
* kempner\_priority
* kempner\_h100
* kempner\_h100\_priority
* kempner\_h100\_priority2
* kempner\_h100\_priority3
* kempner\_interactive
* kempner\_requeue
* kovac
* kozinsky
* kozinsky\_gpu
* kozinsky\_priority
* kozinsky\_requeue
* murphy\_ice
* ortegahernandez\_ice
* rivas
* seas\_compute
* seas\_gpu
* serial\_requeue
* siag\_combo
* siag\_gpu
* sur
* zhuang.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 8&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;11:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 8&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;23:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 8 Dec 2025 11:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmi4yi1m400u3o9chvqic8p37</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmi4yi1m400u3o9chvqic8p37</guid>
</item>

<item>
  <title>holylfs04 migrations</title>
  <description>
    Type: Maintenance
    Duration: 4 days, 1 hour and 10 minutes

    Affected Components: HolyLFS04 (Tier 0)
    Dec 5, 14:00:00 GMT+0 - Identified - The holylfs04 migration to holylfs06 has begun. All holylfs04 folders will be **read-only** for the duration of the migration, from **Friday, December 5th at 9AM until end of day on Monday, December 8th.** 

All labs with holylfs04 have been informed via email; please email [rdm@rc.fas.harvard.edu](mailto:rdm@rc.fas.harvard.edu) if you have any questions. Dec 9, 15:09:30 GMT+0 - Completed - Maintenance has completed successfully. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 4 days, 1 hour and 10 minutes</p>
    <p><strong>Affected Components:</strong> </p>
    &lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 5&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  The holylfs04 migration to holylfs06 has begun. All holylfs04 folders will be **read-only** for the duration of the migration, from **Friday, December 5th at 9AM until end of day on Monday, December 8th.** 

All labs with holylfs04 have been informed via email; please email [rdm@rc.fas.harvard.edu](mailto:rdm@rc.fas.harvard.edu) if you have any questions..&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 9&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;15:09:30&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully..&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Fri, 5 Dec 2025 14:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmisykt3v0ayak7rmnsl4btnt</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmisykt3v0ayak7rmnsl4btnt</guid>
</item>

<item>
  <title>NESE tape system maintenance 12/1/25-12/5/25</title>
  <description>
    Type: Maintenance
    Duration: 5 days

    Affected Components: NESE (NorthEast Storage Exchange), Tape - (Tier 3)
Dec 1, 11:00:00 GMT+0 - Identified - NESE, the Northeast Storage Exchange at MGHPCC, which supplies the Tier3 tape service used by FASRC, will be offline for system maintenance Dec 1st - 5th. Performance-affecting maintenance will continue until Dec 12th. Please see below for details.

WHO: Any lab that has moved or is moving data to tape.

IMPACT: No access 12/1/25 - 12/5/25\. Reduced performance 12/5/25 - 12/12/25.

&gt; NESE tape system maintenance and major software upgrade is scheduled to begin on December 1, 2025\. As a result, the NESE Tape service will be offline from December 1 to December 5.
&gt; 
&gt; Starting December 8 through December 12, the service will be back online with reduced performance. All maintenance activities are planned to conclude on December 12, 2025.
&gt; 
&gt; * Monitor: &lt;https://nese.instatus.com/&gt; for real-time updates on progress
&gt; * Subscribe to &lt;https://nese.instatus.com/subscribe/email&gt; for updates and announcements Dec 1, 11:00:01 GMT+0 - Identified - Maintenance is now in progress Dec 6, 11:00:00 GMT+0 - Completed - Maintenance has completed successfully 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 5 days</p>
    <p><strong>Affected Components:</strong> , </p>
    &lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 1&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;11:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  NESE, the Northeast Storage Exchange at MGHPCC, which supplies the Tier3 tape service used by FASRC, will be offline for system maintenance Dec 1st - 5th. Performance-affecting maintenance will continue until Dec 12th. Please see below for details.

WHO: Any lab that has moved or is moving data to tape.

IMPACT: No access 12/1/25 - 12/5/25\. Reduced performance 12/5/25 - 12/12/25.

&gt; NESE tape system maintenance and major software upgrade is scheduled to begin on December 1, 2025\. As a result, the NESE Tape service will be offline from December 1 to December 5.
&gt; 
&gt; Starting December 8 through December 12, the service will be back online with reduced performance. All maintenance activities are planned to conclude on December 12, 2025.
&gt; 
&gt; * Monitor: &lt;https://nese.instatus.com/&gt; for real-time updates on progress
&gt; * Subscribe to &lt;https://nese.instatus.com/subscribe/email&gt; for updates and announcements.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 1&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;11:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 6&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;11:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 1 Dec 2025 11:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmhjab8ow00xbfmtmh0g1qmos</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmhjab8ow00xbfmtmh0g1qmos</guid>
</item>

<item>
  <title>Starfish dashboard maintenance Nov. 14th 5-6PM</title>
  <description>
    Type: Maintenance
    Duration: 1 hour

    Affected Components: Starfish
    Nov 14, 22:00:01 GMT+0 - Identified - Maintenance is now in progress Nov 14, 23:00:00 GMT+0 - Completed - Maintenance has completed successfully Nov 14, 22:00:00 GMT+0 - Identified - There is a planned upgrade of the Starfish dashboard scheduled for Friday November 14th starting at 5PM.   
The dashboard will be down for an hour while the upgrade is performed. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 1 hour</p>
    <p><strong>Affected Components:</strong> </p>
    &lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 14&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;22:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 14&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;23:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 14&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;22:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  There is a planned upgrade of the Starfish dashboard scheduled for Friday November 14th starting at 5PM.   
The dashboard will be down for an hour while the upgrade is performed..&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Fri, 14 Nov 2025 22:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmhnyg8g701g4yv5cvuwz2hjz</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmhnyg8g701g4yv5cvuwz2hjz</guid>
</item>

<item>
  <title>Monthly Maintenance and MGHPCC Power Work - Nov. 3, 2025 6am-6pm</title>
  <description>
    Type: Maintenance
    Duration: 9 hours and 4 minutes

    Affected Components: , , SLURM Scheduler - FASSE, , Kempner Cluster CPU, Cannon Compute Cluster (Holyoke), , Cannon Open OnDemand/VDI, SLURM Scheduler - Cannon, FASSE Open OnDemand/VDI, seas_compute, Kempner Cluster GPU, Login Nodes - Holyoke, , GPU nodes (Holyoke), FASSE Compute Cluster (Holyoke), FASSE login nodes, Login Nodes - Boston, Netscratch (Global Scratch), Boston Compute Nodes, 
Login Nodes → 
VDI/OpenOnDemand → 
Kempner Cluster → 
FASSE Cluster → 
Cannon Cluster →
Nov 3, 11:00:00 GMT+0 - Identified - Monthly maintenance will take place on November 3rd. Additionally, MGHPCC will be performing power upgrades on the even side of Row 8A, where much of our compute resides. A further upgrade will take place Dec. 8th on the odd side.

A list of the affected partitions is provided at the bottom of this notice. The nodes in those partitions will be drained prior to the work and will be powered down. Once the work is completed, those nodes will be returned to service. Current estimate is a 12 hour window. We will adjust as we know more.

**MAINTENANCE TASKS**  
Cannon cluster will be paused during this maintenance?: **PARTIAL OUTAGE/YES**  
FASSE cluster will be paused during this maintenance?: **PARTIAL OUTAGE/YES**

* Power work on Row 8A Even  
   * Audience: Users of the partitions listed below  
   * Impact: These nodes and partitions will be fully or partially down all day
* Slurm upgrade to 25.05.4  
   * Audience: All cluster users  
   * Impact: Jobs will be paused during maintenance
* Block repo.anaconda.com cluster wide  
   * Audience: Anyone attempting to use repo.anaconda.com  
   * Impact: This change should not impact your Python workflow on the cluster. But if it does, consider using the open-source channel, `conda-forge`, through the Miniforge distribution to install Python packages. This can be done by following our instructions at &lt;https://docs.rc.fas.harvard.edu/kb/python-package-installation/&gt; (a short example also follows this task list)
* Change Slurm User to Local User  
   * Audience: All cluster users  
   * Impact: Behind the scenes. No impact to users
* Login node reboots (morning)  
   * Audience: Anyone logged into a FASRC Cannon or FASSE login node  
   * Impact: All login nodes will be rebooted during this maintenance window
* Netscratch cleanup ( &lt;https://docs.rc.fas.harvard.edu/kb/policy-scratch/&gt; )  
   * Audience: Cluster users  
   * Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.
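
As a companion to the repo.anaconda.com block above, here is a minimal sketch of switching to the conda-forge channel. These are standard conda commands and the package name is only an illustration; see the FASRC docs link above for the supported workflow:

```bash
# Make conda-forge the preferred channel instead of repo.anaconda.com.
conda config --add channels conda-forge
conda config --set channel_priority strict

# Example install pulled from conda-forge; "numpy" is illustrative.
conda install numpy
```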

**AFFECTED PARTITIONS** 
Nov. 3, 2025 - All Day Power Work  
Partial or Full Outage Applies to:

arguelles\_delgado\_h100

bigmem

dvorkin

eddy

enos

gpu

gpu\_h200

gpu\_requeue

hsph

hsph\_gpu

intermediate

itc\_cluster

joonholee

jshapiro

kempner\_dev

kempner\_eng

kempner\_requeue

mweber\_compute

mweber\_gpu

olveczky\_sapphire

sapphire

seas\_compute

seas\_gpu

serial\_requeue

yao

yao\_gpu

yao\_priority

test Nov 3, 11:00:01 GMT+0 - Identified - Maintenance is now in progress Nov 3, 20:04:13 GMT+0 - Completed - Maintenance has completed successfully including power work at MGHPCC.

A reminder that additional all-day power work will take place on Dec 8th, along with our maintenance from 9am-1pm 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 9 hours and 4 minutes</p>
    <p><strong>Affected Components:</strong> , , , , , , , , , , , , , , , , , , , </p>
    &lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 3&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;11:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Monthly maintenance will take place on November 3rd. Additionally, MGHPCC will be performing power upgrades on the even side of Row 8A, where much of our compute resides. A further upgrade will take place Dec. 8th on the odd side.

A list of the affected partitions is provided at the bottom of this notice. The nodes in those partitions will be drained prior to the work and will be powered down. Once the work is completed, those nodes will be returned to service. Current estimate is a 12 hour window. We will adjust as we know more.

**MAINTENANCE TASKS**  
Cannon cluster will be paused during this maintenance?: **PARTIAL OUTAGE/YES**  
FASSE cluster will be paused during this maintenance?: **PARTIAL OUTAGE/YES**

* Power work on Row 8A Even  
   * Audience: Users of the partitions listed below  
   * Impact: These nodes and partitions will be fully or partially down all day
* Slurm upgrade to 25.05.4  
   * Audience: All cluster users  
   * Impact: Jobs will be paused during maintenance
* Block repo.anaconda.com cluster wide  
   * Audience: Anyone attempting to use repo.anaconda.com  
   * Impact: This change should not impact your Python workflow on the cluster. But if it does, consider using the open-source channel, `conda-forge`, through the Miniforge distribution to install Python packages. This can be done by following our instructions at &lt;https://docs.rc.fas.harvard.edu/kb/python-package-installation/&gt;
* Change Slurm User to Local User  
   * Audience: All cluster users  
   * Impact: Behind the scenes. No impact to users
* Login node reboots (morning)  
   * Audience: Anyone logged into a FASRC Cannon or FASSE login node  
   * Impact: All login nodes will be rebooted during this maintenance window
* Netscratch cleanup ( &lt;https://docs.rc.fas.harvard.edu/kb/policy-scratch/&gt; )  
   * Audience: Cluster users  
   * Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.

**AFFECTED PARTITIONS** 
Nov. 3, 2025 - All Day Power Work  
Partial or Full Outage Applies to:

arguelles\_delgado\_h100

bigmem

dvorkin

eddy

enos

gpu

gpu\_h200

gpu\_requeue

hsph

hsph\_gpu

intermediate

itc\_cluster

joonholee

jshapiro

kempner\_dev

kempner\_eng

kempner\_requeue

mweber\_compute

mweber\_gpu

olveczky\_sapphire

sapphire

seas\_compute

seas\_gpu

serial\_requeue

yao

yao\_gpu

yao\_priority

test.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 3&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;11:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 3&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;20:04:13&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully including power work at MGHPCC.

A reminder that additional all-day power work will take place on Dec 8th, along with our maintenance from 9am-1pm.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 3 Nov 2025 11:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmggy7q9801vdtke67ycm4dxq</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmggy7q9801vdtke67ycm4dxq</guid>
</item>

<item>
  <title>FASRC monthly maintenance Monday October 6th, 2025 9am-1pm</title>
  <description>
    Type: Maintenance
    Duration: 4 hours

    Affected Components: , Network - Cambridge, Network - Boston, Network - Holyoke/MGHPCC, Login Nodes - Boston, Netscratch (Global Scratch), Login Nodes - Holyoke, FASSE login nodes, 
Login Nodes →
    Oct 6, 13:00:01 GMT+0 - Identified - Maintenance is now in progress Oct 6, 13:00:00 GMT+0 - Identified - FASRC monthly maintenance will take place Monday October 6th, 2025 from 9am-1pm

**NOTICES**

* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;
* Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* Upcoming holidays: Columbus / Indigenous Peoples’ Day - October 13

**MAINTENANCE TASKS**  
Cannon cluster will be paused during this maintenance?: **NO**  
FASSE cluster will be paused during this maintenance?: **NO**

* DNS server reboots  
   * Audience: All FASRC services  
   * Impact: Rolling reboot should have no impact
* Login node reboots  
   * Audience: Anyone logged into a FASRC Cannon or FASSE login node  
   * Impact: All login nodes will be rebooted during this maintenance window
* Netscratch cleanup ( &lt;https://docs.rc.fas.harvard.edu/kb/policy-scratch/&gt; )  
   * Audience: Cluster users  
   * Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window. (A brief example of spotting at-risk files follows this list.)
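
For reference, here is a minimal sketch of how a user might list files that are likely candidates for retention cleanup. The netscratch path is a placeholder (check the scratch policy page above for the actual mount point), and the retention job may key off a different timestamp than mtime:

```bash
# List your files not modified in the last 90 days; these are the
# ones most likely to be removed by retention cleanup.
find /n/netscratch/$USER -type f -mtime +90 -print
```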

Thank you,  
FAS Research Computing  
&lt;https://docs.rc.fas.harvard.edu/&gt;  
&lt;https://www.rc.fas.harvard.edu/&gt; Oct 6, 17:00:00 GMT+0 - Completed - Maintenance has completed successfully 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 4 hours</p>
    <p><strong>Affected Components:</strong> , , , , , , , </p>
    &lt;p&gt;&lt;small&gt;Oct &lt;var data-var=&#039;date&#039;&gt; 6&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Oct &lt;var data-var=&#039;date&#039;&gt; 6&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  FASRC monthly maintenance will take place Monday October 6th, 2025 from 9am-1pm

**NOTICES**

* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;
* Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* Upcoming holidays: Columbus / Indigenous Peoples’ Day - October 13

**MAINTENANCE TASKS**  
Cannon cluster will be paused during this maintenance?: **NO**  
FASSE cluster will be paused during this maintenance?: **NO**

* DNS server reboots  
   * Audience: All FASRC services  
   * Impact: Rolling reboot should have no impact
* Login node reboots  
   * Audience: Anyone logged into a FASRC Cannon or FASSE login node  
   * Impact: All login nodes will be rebooted during this maintenance window
* Netscratch cleanup ( &lt;https://docs.rc.fas.harvard.edu/kb/policy-scratch/&gt; )  
   * Audience: Cluster users  
   * Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.

Thank you,  
FAS Research Computing  
&lt;https://docs.rc.fas.harvard.edu/&gt;  
&lt;https://www.rc.fas.harvard.edu/&gt;.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Oct &lt;var data-var=&#039;date&#039;&gt; 6&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;17:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 6 Oct 2025 13:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmg1cda8s098z4gs4ylropic7</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmg1cda8s098z4gs4ylropic7</guid>
</item>

<item>
  <title>VPN concentrator rolling updates overnight</title>
  <description>
    Type: Maintenance
    

    
Sep 30, 07:00:00 GMT+0 - Completed - Maintenance has completed successfully Sep 30, 06:00:00 GMT+0 - Completed - Networking will be patching the VPN concentrators overnight. This will be done in a rolling order so that one or more are always online.   
  
This may cause active VPN connections to drop, but they can be re-connected shortly after. ETA is one hour total. Sep 30, 06:00:00 GMT+0 - Identified - Networking will be patching the VPN concentrators overnight. This will be done in a rolling order so that one or more are always online.   
  
This may cause active VPN connections to drop, but they can be re-connected shortly after. ETA is one hour total. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    
    
    &lt;p&gt;&lt;small&gt;Sep &lt;var data-var=&#039;date&#039;&gt; 30&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;07:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Sep &lt;var data-var=&#039;date&#039;&gt; 30&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;06:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Networking will be patching the VPN concentrators overnight. This will be done in a rolling order so that one or more are always online.   
  
This may cause active VPN connections to drop, but they can be re-connected shortly after. ETA is one hour total.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Sep &lt;var data-var=&#039;date&#039;&gt; 30&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;06:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Networking will be patching the VPN concentrators overnight. This will be done in a rolling order so that one or more are always online.   
  
This may cause active VPN connections to drop, but they can be re-connected shortly after. ETA is one hour total.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Tue, 30 Sep 2025 06:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmg5awjiw04n710wcsl3zi0gv</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmg5awjiw04n710wcsl3zi0gv</guid>
</item>

<item>
  <title>FASRC monthly maintenance Monday September 8th, 2025 9am-1pm</title>
  <description>
    Type: Maintenance
    Duration: 4 hours

    Affected Components: Kempner Cluster CPU, Kempner Cluster GPU, , Cannon Compute Cluster (Holyoke), Boston Compute Nodes, Netscratch (Global Scratch), , , SLURM Scheduler - FASSE, , GPU nodes (Holyoke), Login Nodes - Boston, FASSE login nodes, Login Nodes - Holyoke, , FASSE Open OnDemand/VDI, Cannon Open OnDemand/VDI, SLURM Scheduler - Cannon, FASSE Compute Cluster (Holyoke), seas_compute, 
Login Nodes → 
Cannon Cluster → 
FASSE Cluster → 
VDI/OpenOnDemand → 
Kempner Cluster →
    Sep 8, 13:00:01 GMT+0 - Identified - Maintenance is now in progress Sep 8, 17:00:00 GMT+0 - Completed - Maintenance has completed successfully Sep 8, 13:00:00 GMT+0 - Identified - FASRC monthly maintenance will take place Monday September 8th, 2025 from 9am-1pm

**NOTICES**

* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;
* Status Page: You can subscribe to our status page to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* Upcoming holidays: Labor Day, Monday September 1st

**MAINTENANCE TASKS**  
Cannon cluster will be paused during this maintenance?: **YES**  
FASSE cluster will be paused during this maintenance?: **YES**

* Slurm Upgrade to 25.05.2  
   * Audience: All cluster users  
   * Impact: Jobs and the scheduler will be paused during this upgrade
* Domain controller work  
   * Audience: Internal network  
   * Impact: No impact expected
* Login node reboots  
   * Audience: Anyone logged into a FASRC Cannon or FASSE login node  
   * Impact: All login nodes will be rebooted during this maintenance window
* Netscratch cleanup ( &lt;https://docs.rc.fas.harvard.edu/kb/policy-scratch/&gt; )  
   * Audience: Cluster users  
   * Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.

Thank you,  
FAS Research Computing  
&lt;https://docs.rc.fas.harvard.edu/&gt;  
&lt;https://www.rc.fas.harvard.edu/&gt; 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 4 hours</p>
    <p><strong>Affected Components:</strong> , , , , , , , , , , , , , , , , , , , </p>
    &lt;p&gt;&lt;small&gt;Sep &lt;var data-var=&#039;date&#039;&gt; 8&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Sep &lt;var data-var=&#039;date&#039;&gt; 8&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;17:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Sep &lt;var data-var=&#039;date&#039;&gt; 8&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  FASRC monthly maintenance will take place Monday September 8th, 2025 from 9am-1pm

**NOTICES**

* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;
* Status Page: You can subscribe to our status page to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* Upcoming holidays: Labor Day, Monday September 1st

**MAINTENANCE TASKS**  
Cannon cluster will be paused during this maintenance?: **YES**  
FASSE cluster will be paused during this maintenance?: **YES**

* Slurm Upgrade to 25.05.2  
   * Audience: All cluster users  
   * Impact: Jobs and the scheduler will be paused during this upgrade
* Domain controller work  
   * Audience: Internal network  
   * Impact: No impact expected
* Login node reboots  
   * Audience: Anyone logged into a FASRC Cannon or FASSE login node  
   * Impact: All login nodes will be rebooted during this maintenance window
* Netscratch cleanup ( &lt;https://docs.rc.fas.harvard.edu/kb/policy-scratch/&gt; )  
   * Audience: Cluster users  
   * Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.

Thank you,  
FAS Research Computing  
&lt;https://docs.rc.fas.harvard.edu/&gt;  
&lt;https://www.rc.fas.harvard.edu/&gt;.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 8 Sep 2025 13:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmevo0dad0012rwy6qv5mv34j</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmevo0dad0012rwy6qv5mv34j</guid>
</item>

<item>
  <title>Starfish upgrade Wednesday, August 20th 5PM-7PM</title>
  <description>
    Type: Maintenance
    Duration: 2 hours

    Affected Components: Starfish
    Aug 20, 21:00:01 GMT+0 - Identified - Maintenance is now in progress Aug 20, 21:00:00 GMT+0 - Identified - Starfish will be performing an upgrade on Wednesday, August 20th from 5PM-7PM. The web interface will be unavailable during that timeframe. Aug 20, 23:00:00 GMT+0 - Completed - Maintenance has completed successfully 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 2 hours</p>
    <p><strong>Affected Components:</strong> </p>
    &lt;p&gt;&lt;small&gt;Aug &lt;var data-var=&#039;date&#039;&gt; 20&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;21:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Aug &lt;var data-var=&#039;date&#039;&gt; 20&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;21:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Starfish will be performing an upgrade on Wednesday, August 20th from 5PM-7PM. The web interface will be unavailable during that timeframe.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Aug &lt;var data-var=&#039;date&#039;&gt; 20&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;23:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Wed, 20 Aug 2025 21:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmehjmf9q000dam879fafq3vq</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmehjmf9q000dam879fafq3vq</guid>
</item>

<item>
  <title>SEAS: seas_gpu partition GPU upgrades 8/11 - 8/14</title>
  <description>
    Type: Maintenance
    Duration: 4 days, 9 hours and 39 minutes

    Affected Components: seas_compute
    Aug 11, 10:00:00 GMT+0 - Identified - Between 8/11/25 6AM - 8/14/25 5PM FASRC will be upgrading 14 of the H100 GPU nodes in the `seas_gpu` partition to H200 GPUs. This will also affect `mweber_gpu`.

A reservation has been set which will drain the nodes of jobs prior to the maintenance. The SEAS GPU partition will be running at 75% capacity during these updates. FASRC has hundreds of GPUs, so users should feel free to utilize `gpu_requeue` if needed for their jobs.

Affected nodes: 

`mweber_gpu` nodes (13):

```
holygpu8a[18204,18301-18304,18401-18404,18501-18502,18601-18602]
```

`seas_gpu` nodes (14):

```
holygpu8a[16101-16104,16201-16204,16301-16304,16401-16402]
```

Please reach out to [rchelp@rc.fas.harvard.edu](mailto:rchelp@rc.fas.harvard.edu) if you have any questions or concerns. Aug 15, 14:13:24 GMT+0 - Identified - Maintenance is still in progress - imaging of the new H200 nodes is ongoing. Current ETA is end of day Friday. For further questions, please contact rchelp@rc.fas.harvard.edu Aug 15, 19:39:27 GMT+0 - Completed - Maintenance has completed successfully. SEAS H200 nodes have been imaged and are back in service.  Aug 11, 10:00:01 GMT+0 - Identified - Maintenance is now in progress 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 4 days, 9 hours and 39 minutes</p>
    <p><strong>Affected Components:</strong> </p>
    &lt;p&gt;&lt;small&gt;Aug &lt;var data-var=&#039;date&#039;&gt; 11&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;10:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Between 8/11/25 6AM - 8/14/25 5PM FASRC will be upgrading 14 of the H100 GPU nodes in the `seas_gpu` partition to H200 GPUs. This will also affect `mweber_gpu`.

A reservation has been set which will drain the nodes of jobs prior to the maintenance. The SEAS GPU partition will be running at 75% capacity during these updates. FASRC has hundreds of GPUs, so users should feel free to utilize `gpu_requeue` if needed for their jobs.

Affected nodes: 

`mweber_gpu` nodes (13):

```
holygpu8a[18204,18301-18304,18401-18404,18501-18502,18601-18602]
```

`seas_gpu` nodes (14):

```
holygpu8a[16101-16104,16201-16204,16301-16304,16401-16402]
```
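
The bracketed lists above use Slurm-style compact hostlist notation. As a quick way to expand them (and sanity-check the node counts), here is a minimal Python sketch; it handles only the simple prefix[a-b,c] form shown here, not zero-padded or nested ranges.

```
import re

def expand_hostlist(spec):
    """Expand e.g. 'holygpu8a[16101-16104,16201]' into individual hostnames."""
    m = re.fullmatch(r"(\w+)\[([\d,-]+)\]", spec)
    if not m:
        return [spec]  # no bracket group; already a single hostname
    prefix, body = m.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            hosts.extend(prefix + str(i) for i in range(int(lo), int(hi) + 1))
        else:
            hosts.append(prefix + part)
    return hosts

nodes = expand_hostlist("holygpu8a[16101-16104,16201-16204,16301-16304,16401-16402]")
print(len(nodes))  # 14, matching the seas_gpu count above
```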

Please reach out to [rchelp@rc.fas.harvard.edu](mailto:rchelp@rc.fas.harvard.edu) if you have any questions or concerns.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Aug &lt;var data-var=&#039;date&#039;&gt; 15&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:13:24&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is still in progress - imaging of the new H200 nodes is ongoing. Current ETA is end of day Friday. For further questions, please contact rchelp@rc.fas.harvard.edu.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Aug &lt;var data-var=&#039;date&#039;&gt; 15&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;19:39:27&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully. SEAS H200 nodes have been imaged and are back in service.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Aug &lt;var data-var=&#039;date&#039;&gt; 11&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;10:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 11 Aug 2025 10:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cme1un2le0626okcc24iokqze</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cme1un2le0626okcc24iokqze</guid>
</item>

<item>
  <title>Monthly maintenance August 4, 2025 9am-1pm</title>
  <description>
    Type: Maintenance
    Duration: 4 hours

    Affected Components: , Login Nodes - Boston, Netscratch (Global Scratch), FASSE login nodes, Login Nodes - Holyoke, 
Login Nodes →
    Aug 4, 13:00:01 GMT+0 - Identified - Maintenance is now in progress Aug 4, 17:00:00 GMT+0 - Completed - Maintenance has completed successfully Aug 4, 13:00:00 GMT+0 - Identified - &gt; _Important note about changes to the gpu\_test, test, and remoteviz partitions_
&gt; 
&gt; On Monday July 28th we will make the following changes. These changes are necessary in order to reduce congestion on these partitions.
&gt; 
&gt; gpu\_test will have a 2-job limit per user - Reminder: these partitions are for testing and debugging, not for production work.
&gt; 
&gt; gpu\_test, test, and remoteviz partitions will no longer be available for multi-partition submission

FASRC monthly maintenance will take place Monday August 4th, 2025 from 9am-1pm

**MONTHLY NOTICES**

* Do you have a success story about your use of the FASRC clusters or services? We&#039;d love to hear it and post it on our [new User Stories page.](https://www.rc.fas.harvard.edu/user-stories/)
* New Quota tool available. Type quota -h to see the full instructions for usage or visit the usage doc.
* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;

**MAINTENANCE TASKS**

Cannon cluster will be paused during this maintenance?: **NO**  
FASSE cluster will be paused during this maintenance?: **NO**

* Please see note above about partition changes  
   * Those changes will happen on July 28th
* FASRC websites (www.rc and docs.rc) will be updated  
   * Audience: Anyone browsing our websites  
   * Impact: The sites will be periodically unavailable during reboots
* Login node reboots  
   * Audience: Anyone logged into a FASRC Cannon or FASSE login node  
   * Impact: All login nodes will reboot during this maintenance window
* Netscratch cleanup (&lt;https://docs.rc.fas.harvard.edu/kb/policy-scratch/&gt;)  
   * Audience: Cluster users  
   * Impact: Files older than 90 days will be removed.

Please note that retention cleanup can and does run at any time, not just during the maintenance window.

Thank you,

FAS Research Computing

&lt;https://docs.rc.fas.harvard.edu/&gt;

&lt;https://www.rc.fas.harvard.edu/&gt; 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 4 hours</p>
    <p><strong>Affected Components:</strong> , , , , </p>
    &lt;p&gt;&lt;small&gt;Aug &lt;var data-var=&#039;date&#039;&gt; 4&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Aug &lt;var data-var=&#039;date&#039;&gt; 4&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;17:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Aug &lt;var data-var=&#039;date&#039;&gt; 4&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  &gt; _Important note about changes to the gpu\_test, test, and remoteviz partitions_
&gt; 
&gt; On Monday July 28th we will make the following changes. These changes are necessary in order to reduce congestion on these partitions.
&gt; 
&gt; gpu\_test will have a 2-job limit per user - Reminder: these partitions are for testing and debugging, not for production work.
&gt; 
&gt; gpu\_test, test, and remoteviz partitions will no longer be available for multi-partition submission

FASRC monthly maintenance will take place Monday August 4th, 2025 from 9am-1pm

**MONTHLY NOTICES**

* Do you have a success story about your use of the FASRC clusters or services? We&#039;d love to hear it and post it on our [new User Stories page.](https://www.rc.fas.harvard.edu/user-stories/)
* New Quota tool available. Type quota -h to see the full instructions for usage or visit the usage doc.
* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;

**MAINTENANCE TASKS**

Cannon cluster will be paused during this maintenance?: **NO**  
FASSE cluster will be paused during this maintenance?: **NO**

* Please see note above about partition changes  
   * Those changes will happen on July 28th
* FASRC websites (www.rc and docs.rc) will be updated  
   * Audience: Anyone browsing our websites  
   * Impact: The sites will be periodically unavailable during reboots
* Login node reboots  
   * Audience: Anyone logged into a FASRC Cannon or FASSE login node  
   * Impact: All login nodes will reboot during this maintenance window
* Netscratch cleanup (&lt;https://docs.rc.fas.harvard.edu/kb/policy-scratch/&gt;)  
   * Audience: Cluster users  
   * Impact: Files older than 90 days will be removed.

Please note that retention cleanup can and does run at any time, not just during the maintenance window.

Thank you,

FAS Research Computing

&lt;https://docs.rc.fas.harvard.edu/&gt;

&lt;https://www.rc.fas.harvard.edu/&gt;.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 4 Aug 2025 13:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmdrj0l39001qxxr70vtdmbos</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmdrj0l39001qxxr70vtdmbos</guid>
</item>

<item>
  <title>FASRC away for all-hands meeting 7/17</title>
  <description>
    Type: Maintenance
    Duration: 8 hours

    Affected Components: FASRC Ticket System (ServiceNow)
    Jul 17, 13:00:01 GMT+0 - Identified - Maintenance is now in progress Jul 17, 13:00:00 GMT+0 - Identified - FASRC staff will be attending an all-hands meeting all day Thursday 7/17/25.

Ticket response will be delayed. Jul 17, 21:00:00 GMT+0 - Completed - Maintenance has completed successfully 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 8 hours</p>
    <p><strong>Affected Components:</strong> </p>
    &lt;p&gt;&lt;small&gt;Jul &lt;var data-var=&#039;date&#039;&gt; 17&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jul &lt;var data-var=&#039;date&#039;&gt; 17&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  FASRC staff will be attending an all-hands meeting all day Thursday 7/17/25.

Ticket response will be delayed.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jul &lt;var data-var=&#039;date&#039;&gt; 17&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;21:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Thu, 17 Jul 2025 13:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmd530cd302i63cj7eggitvvl</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmd530cd302i63cj7eggitvvl</guid>
</item>

<item>
  <title>FASRC Monthly maintenance July 7, 2025 9AM-1PM</title>
  <description>
    Type: Maintenance
    Duration: 4 hours

    Affected Components: , , , GPU nodes (Holyoke), Cannon Open OnDemand/VDI, SLURM Scheduler - Cannon, FASSE login nodes, seas_compute, FASSE Open OnDemand/VDI, Kempner Cluster CPU, , Cannon Compute Cluster (Holyoke), FASSE Compute Cluster (Holyoke), Login Nodes - Holyoke, Kempner Cluster GPU, Netscratch (Global Scratch), , SLURM Scheduler - FASSE, Boston Compute Nodes, Login Nodes - Boston, 
Kempner Cluster → 
Cannon Cluster → 
VDI/OpenOnDemand → 
FASSE Cluster → 
Login Nodes →
    Jul 7, 17:00:00 GMT+0 - Completed - Maintenance has completed successfully Jul 7, 13:00:00 GMT+0 - Identified - FASRC monthly maintenance will take place Monday July 7th, 2025 from 9am-1pm

**NOTICES**

* New Quota tool available (/usr/local/sbin/quota) - Works on _all_ filesystem types (home directory, lustre, isilon, netscratch, etc.)  
Type `quota -h` to see the full instructions for usage or visit [the usage doc](https://docs.rc.fas.harvard.edu/kb/checking-quota-and-usage/).
* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;
* Status Page: You can subscribe to our status page to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* Upcoming holidays: Juneteenth - Thur. June 19 / Independence Day - Fri. July 4

**MAINTENANCE TASKS**  
Cannon cluster will be paused during this maintenance?: **YES**  
FASSE cluster will be paused during this maintenance?: **YES**

* Slurm Upgrade to 24.11.5  
   * Audience: All cluster users  
   * Impact: Jobs and the scheduler will be paused during this upgrade
* Login node OS upgrades  
   * Audience: Anyone logged into a FASRC Cannon or FASSE login node  
   * Impact: All login nodes will be upgraded and unavailable during this maintenance window
* Start of cluster OS upgrades - July 7-10  
   * Audience: All cluster users  
   * Impact: **Over 4 days,** July 7 through 10, we will upgrade the OS on 25% of the cluster each day. During that time, total capacity will be reduced across the cluster by 1/4 each day. This will require draining each sub-set of nodes ahead of time.
* Netscratch cleanup ( &lt;https://docs.rc.fas.harvard.edu/kb/policy-scratch/&gt; )  
   * Audience: Cluster users  
   * Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.

Thank you,  
FAS Research Computing  
&lt;https://docs.rc.fas.harvard.edu/&gt;  
&lt;https://www.rc.fas.harvard.edu/&gt; Jul 7, 13:00:01 GMT+0 - Identified - Maintenance is now in progress 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 4 hours</p>
    <p><strong>Affected Components:</strong> , , , , , , , , , , , , , , , , , , , </p>
    &lt;p&gt;&lt;small&gt;Jul &lt;var data-var=&#039;date&#039;&gt; 7&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;17:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jul &lt;var data-var=&#039;date&#039;&gt; 7&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  FASRC monthly maintenance will take place Monday July 7th, 2025 from 9am-1pm

**NOTICES**

* New Quota tool available (/usr/local/sbin/quota) - Works on _all_ filesystem types (home directory, lustre, isilon, netscratch, etc.)  
Type `quota -h` to see the full instructions for usage or visit [the usage doc](https://docs.rc.fas.harvard.edu/kb/checking-quota-and-usage/).
* Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at &lt;https://www.rc.fas.harvard.edu/upcoming-training/&gt;
* Status Page: You can subscribe to our status page to receive notifications of maintenance, incidents, and their resolution at &lt;https://status.rc.fas.harvard.edu/&gt; (click Get Updates for options).
* Upcoming holidays: Juneteenth - Thur. June 19 / Independence Day - Fri. July 4

**MAINTENANCE TASKS**  
Cannon cluster will be paused during this maintenance?: **YES**  
FASSE cluster will be paused during this maintenance?: **YES**

* Slurm Upgrade to 24.11.5  
   * Audience: All cluster users  
   * Impact: Jobs and the scheduler will be paused during this upgrade
* Login node OS upgrades  
   * Audience: Anyone logged into a FASRC Cannon or FASSE login node  
   * Impact: All login nodes will be upgraded and unavailable during this maintenance window
* Start of cluster OS upgrades - July 7-10  
   * Audience: All cluster users  
   * Impact: **Over 4 days,** July 7 through 10, we will upgrade the OS on 25% of the cluster each day. During that time, total capacity will be reduced across the cluster by 1/4 each day. This will require draining each sub-set of nodes ahead of time.
* Netscratch cleanup ( &lt;https://docs.rc.fas.harvard.edu/kb/policy-scratch/&gt; )  
   * Audience: Cluster users  
   * Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.

Thank you,  
FAS Research Computing  
&lt;https://docs.rc.fas.harvard.edu/&gt;  
&lt;https://www.rc.fas.harvard.edu/&gt;.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jul &lt;var data-var=&#039;date&#039;&gt; 7&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 7 Jul 2025 13:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmc0yfhrq0012c3y9c28s35se</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmc0yfhrq0012c3y9c28s35se</guid>
</item>

<item>
  <title>Rolling cluster OS upgrades July 7 - 10</title>
  <description>
    Type: Maintenance
    Duration: 4 days, 3 hours and 3 minutes

    Affected Components: , FASSE Open OnDemand/VDI, , GPU nodes (Holyoke), Cannon Open OnDemand/VDI, Kempner Cluster CPU, Kempner Cluster GPU, seas_compute, FASSE Compute Cluster (Holyoke), Cannon Compute Cluster (Holyoke), Boston Compute Nodes, , 
FASSE Cluster → 
VDI/OpenOnDemand → 
Cannon Cluster →
    Jul 7, 13:00:01 GMT+0 - Identified - Cannon rolling upgrades are in progress. Not all nodes are available.

&lt;https://www.rc.fas.harvard.edu/blog/2025-compute-os-upgrade/&gt; Jul 7, 13:00:01 GMT+0 - Identified - **UPDATE:** 7/7/25 6M FASSE is operational.

~~Please be aware that FASSE jobs cannot be launched at this time due to the upgrades.~~ 
~~We will return all FASSE nodes to normal services as soon as possible.~~

&lt;https://www.rc.fas.harvard.edu/blog/2025-compute-os-upgrade/&gt; Jul 7, 13:00:00 GMT+0 - Identified - Cluster OS upgrades - July 7-10

* Audience: All cluster users
* Impact: **Over 4 days**, July 7 through 10, we will upgrade the OS on **25%** of the cluster each day.  
During that time, total capacity will be reduced across the cluster by 1/4 each day.  
This will require draining each sub-set of nodes ahead of time.

Work begins during the July 7th maintenance (login nodes will be upgraded during the 7/7 maintenance window) and will continue through July 10th.

Additional details and a breakdown of each phase: [2025 Compute OS Upgrade](https://www.rc.fas.harvard.edu/blog/2025-compute-os-upgrade/) Jul 11, 16:02:45 GMT+0 - Completed - All upgrades are complete. A small number of nodes need clean-up, but the cluster is back to normal operation with all nodes running Rocky 8.10. Thanks for your patience. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 4 days, 3 hours and 3 minutes</p>
    <p><strong>Affected Components:</strong> , , , , , , , , , , , </p>
    &lt;p&gt;&lt;small&gt;Jul &lt;var data-var=&#039;date&#039;&gt; 7&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Cannon rolling upgrades are in progress. Not all nodes are available.

&lt;https://www.rc.fas.harvard.edu/blog/2025-compute-os-upgrade/&gt;.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jul &lt;var data-var=&#039;date&#039;&gt; 7&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  **UPDATE:** 7/7/25 6M FASSE is operational.

~~Please be aware that FASSE jobs cannot be launched at this time due to the upgrades.~~ 
~~We will return all FASSE nodes to normal services as soon as possible.~~

&lt;https://www.rc.fas.harvard.edu/blog/2025-compute-os-upgrade/&gt;.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jul &lt;var data-var=&#039;date&#039;&gt; 7&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Cluster OS upgrades - July 7-10

* Audience: All cluster users
* Impact: **Over 4 days**, July 7 through 10, we will upgrade the OS on **25%** of the cluster each day.  
During that time, total capacity will be reduced across the cluster by 1/4 each day.  
This will require draining each sub-set of nodes ahead of time.

Work begins during the July 7th maintenance (login nodes will be upgraded during the 7/7 maintenance window) and will continue through July 10th.

Additional details and a breakdown of each phase: [2025 Compute OS Upgrade](https://www.rc.fas.harvard.edu/blog/2025-compute-os-upgrade/).&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jul &lt;var data-var=&#039;date&#039;&gt; 11&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;16:02:45&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  All upgrades are complete. A small number of nodes need clean-up, but the cluster is back to normal operation with all nodes running Rocky 8.10. Thanks for your patience.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 7 Jul 2025 13:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmc0yoo9m001qc3y986idbaqt</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmc0yoo9m001qc3y986idbaqt</guid>
</item>

<item>
  <title>June 5-6 MGHPCC pod 7c cooling updates - See partition list below</title>
  <description>
    Type: Maintenance
    Duration: 8 hours and 3 minutes

    Affected Components: seas_compute, Cannon Compute Cluster (Holyoke)
    Jun 5, 11:00:01 GMT+0 - Identified - Maintenance is now in progress Jun 5, 19:03:03 GMT+0 - Completed - The work on row 7c is complete. Returning idled nodes to normal service. Jun 5, 11:00:00 GMT+0 - Identified - There will be additional scheduled maintenance at MGHPCC between June 5th and 6th. 

As part of the work during the MGHPCC Outage, one of the Cooling Distribution Units (CDUs) in Pod 7c will be replaced. This will allow for future expansion into this space. 

This work will run from Thursday June 5th until the evening of Friday June 6th. This means nodes whose names begin with holy7c02, 04, 06, 08, 10, 12 _will not_ come back online after the outage and will remain down until this CDU update is complete.

This impacts the following partitions. If you are using one of these partitions, please use the public sapphire partition while your equipment is being serviced. These nodes will be returned to service once the CDU work is complete: 

* blackhole
* blackhole\_priority
* davies
* desai
* eddy
* huce\_cascade
* huce\_cascade\_priority
* huttenhower
* jacobsen2
* janson
* janson\_cascade
* ke
* lukin
* nguyen
* seas\_compute
* shared
* tambe
* vishwanath
* whipple
* xlin 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 8 hours and 3 minutes</p>
    <p><strong>Affected Components:</strong> , </p>
    &lt;p&gt;&lt;small&gt;Jun &lt;var data-var=&#039;date&#039;&gt; 5&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;11:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jun &lt;var data-var=&#039;date&#039;&gt; 5&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;19:03:03&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  The work on row 7c is complete. Returning idled nodes to normal service.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jun &lt;var data-var=&#039;date&#039;&gt; 5&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;11:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  There will be additional scheduled maintenance at MGHPCC between June 5th and 6th. 

As part of the work during the MGHPCC Outage, one of the Cooling Distribution Units (CDUs) in Pod 7c will be replaced. This will allow for future expansion into this space. 

This work will run from Thursday June 5th until the evening of Friday June 6th. This means nodes whose names begin with holy7c02, 04, 06, 08, 10, 12 _will not_ come back online after the outage and will remain down until this CDU update is complete.

This impacts the following partitions. If you are using one of these partitions, please use the public sapphire partition while your equipment is being serviced. These nodes will be returned to service once the CDU work is complete: 

* blackhole
* blackhole\_priority
* davies
* desai
* eddy
* huce\_cascade
* huce\_cascade\_priority
* huttenhower
* jacobsen2
* janson
* janson\_cascade
* ke
* lukin
* nguyen
* seas\_compute
* shared
* tambe
* vishwanath
* whipple
* xlin.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Thu, 5 Jun 2025 11:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmae54uxs002g2c3ys3k5b51r</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmae54uxs002g2c3ys3k5b51r</guid>
</item>

<item>
  <title>2025 MGHPCC power downtime June 2-4, 2025 </title>
  <description>
    Type: Maintenance
    Duration: 3 days

    Affected Components: SLURM Scheduler - Cannon, Holyoke/MGHPCC Data Center, Software &amp; Modules, NESE (NorthEast Storage Exchange), Holystore01 (Tier 0), FASSE login nodes, FASSE Open OnDemand/VDI, seas_compute, Kempner Cluster CPU, Isilon Storage Holyoke (Tier 1), Holyoke Firewall, Holyoke Tier 2 NFS (new), HolyLFS05 (Tier 0), Globus Data Transfer, Netscratch (Global Scratch), GPU nodes (Holyoke), Infiniband - Holyoke/MGHPCC, HolyLFS04 (Tier 0), SLURM Scheduler - FASSE, Boston Compute Nodes, Samba Cluster, Virtual Infrastructure - Holyoke, Network - Holyoke/MGHPCC, License Servers, Cannon Compute Cluster (Holyoke), FASSE Compute Cluster (Holyoke), Holylabs, Cannon Open OnDemand/VDI, Login Nodes - Holyoke, Kempner Cluster GPU, holECS, Holyoke Specialty Storage, Login Nodes - Boston, HolyLFS06 (Tier 0), Tape - (Tier 3)
    Jun 2, 13:00:00 GMT+0 - Identified - The yearly power downtime at our Holyoke data center, MGHPCC, has been scheduled.   
This year&#039;s power downtime will take place on Tuesday June 3, 2025. 

This will require FASRC to begin shutdown of our systems beginning at _9AM on Monday, June 2nd_.   
We have worked to reduce the total outage time this year.  
We will begin power-up on Wednesday June 4th with an expected return to full service by _9AM Thursday June 5th_.

* **Monday June 2nd** - Power-down begins at 9AM
* **Tuesday June 3rd** - Power out at MGHPCC
* **Wednesday June 4th** - Maintenance tasks and then power-up begins
* **Thursday June 5th** - Expected return to full service by 9AM

**Maintenance:**  
During this downtime, Holylabs (/n/holylabs) will move to new hardware.  
Starfish, Coldfront, and the Portal will be unavailable during the downtime.

For more details including a graphical timeline, please see: &lt;https://www.rc.fas.harvard.edu/events/2025-mghpcc-power-downtime/&gt;

**Updates will be posted here on our status page:** &lt;https://status.rc.fas.harvard.edu/&gt;   
Note that you can subscribe to receive updates as they happen. On the status page, click Get Updates.

Notices and reminders will also be sent to all users via our mailing lists. Jun 5, 13:00:00 GMT+0 - Completed - Maintenance has completed successfully Jun 2, 13:00:01 GMT+0 - Identified - Maintenance is now in progress 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 3 days</p>
    <p><strong>Affected Components:</strong> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , </p>
    &lt;p&gt;&lt;small&gt;Jun &lt;var data-var=&#039;date&#039;&gt; 2&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  The yearly power downtime at our Holyoke data center, MGHPCC, has been scheduled.   
This year&#039;s power downtime will take place on Tuesday June 3, 2025. 

This will require FASRC to begin shutdown of our systems beginning at _9AM on Monday, June 2nd_.   
We have worked to reduce the total outage time this year.  
We will begin power-up on Wednesday June 4th with an expected return to full service by _9AM Thursday June 5th_.

* **Monday June 2nd** - Power-down begins at 9AM
* **Tuesday June 3rd** - Power out at MGHPCC
* **Wednesday June 4th** - Maintenance tasks and then power-up begins
* **Thursday June 5th** - Expected return to full service by 9AM

**Maintenance:**  
During this downtime, Holylabs (/n/holylabs) will move to new hardware.  
Starfish, Coldfront, and the Portal will be unavailable during the downtime.

For more details including a graphical timeline, please see: &lt;https://www.rc.fas.harvard.edu/events/2025-mghpcc-power-downtime/&gt;

**Updates will be posted here on our status page:** &lt;https://status.rc.fas.harvard.edu/&gt;   
Note that you can subscribe to receive updates as they happen. On the status page, click Get Updates.

Notices and reminders will also be sent to all users via our mailing lists.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jun &lt;var data-var=&#039;date&#039;&gt; 5&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jun &lt;var data-var=&#039;date&#039;&gt; 2&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;13:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Mon, 2 Jun 2025 13:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cm8rjnw7s0004awhssij2h7j6</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cm8rjnw7s0004awhssij2h7j6</guid>
</item>

<item>
  <title>Starfish upgrade Thursday, May 29th from 5PM-6PM</title>
  <description>
    Type: Maintenance
    Duration: 17 hours and 27 minutes

    Affected Components: Starfish
    May 29, 21:00:00 GMT+0 - Identified - Starfish upgrade Thursday, May 29th from 5PM-6PM. Starfish will be unavailable during that time May 30, 14:27:29 GMT+0 - Completed - Maintenance has completed successfully. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 17 hours and 27 minutes</p>
    <p><strong>Affected Components:</strong> </p>
    &lt;p&gt;&lt;small&gt;May &lt;var data-var=&#039;date&#039;&gt; 29&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;21:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Starfish upgrade Thursday, May 29th from 5PM-6PM. Starfish will be unavailable during that time.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;May &lt;var data-var=&#039;date&#039;&gt; 30&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:27:29&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Thu, 29 May 2025 21:00:00 +0000</pubDate>
  <link>https://status.rc.fas.harvard.edu/maintenance/cmay295pr0023t63j9cts25br</link>
  <guid>https://status.rc.fas.harvard.edu/maintenance/cmay295pr0023t63j9cts25br</guid>
</item>

  </channel>
  </rss>