Notice history

Under maintenance

Jun 2022

Globus.org maintenance Saturday June 18, 9am-10am

Completed
June 18, 2022 at 2:00 PM
Maintenance has completed successfully
Update
June 18, 2022 at 1:00 PM
Maintenance is now in progress
In progress
June 18, 2022 at 1:00 PM
Maintenance is now in progress
Planned
June 18, 2022 at 1:00 PM
Globus has informed us that they will be performing maintenance on Saturday, June 18th, from 9am to 10am.

During the downtime, the following will be impacted:
- All services in the Globus ecosystem that use Auth, including Flows, Search, and Timer, will be unavailable.
- Users will be unable to initiate new transfers. Transfers that are in flight when the downtime begins will resume at the last checkpoint when the services are restored at the end of the maintenance period.
- Users will not be able to log into applications that rely on Globus Auth, including the Globus webapp and command line interface.
- All third-party applications and services that rely on Globus Auth will be impacted, and users will not be able to log in or authenticate.
- Users will be unable to initiate new flow runs. Flow runs that attempt to start a new step during the downtime will fail. We recommend that users check the state of their flow runs after services have resumed.

Resolved
June 30, 2022 at 2:45 PM
This issue is currently not fixable on the login nodes. CVMFS can still be used on compute nodes, but not on login nodes. We are in contact with OSG and will implement a workaround if one becomes available. This incident is now closed.
Investigating
June 17, 2022 at 3:01 PM
CVMFS is available from compute nodes, but not from login nodes. This will only impact users who need this filesystem (OpenScienceGrid).

We will investigate this incident further next week.

Resolved
June 08, 2022 at 1:31 AM
Websites & Tools - SPINAL (FAS Informatics) is now operational! This update was created by an automated monitoring service.
Investigating
June 08, 2022 at 1:19 AM
Websites & Tools - SPINAL (FAS Informatics) cannot be accessed at the moment. This incident was created by an automated monitoring service.

Resolved
June 07, 2022 at 6:10 PM
A problem with the outgoing mail system the Portal uses to send approval and account emails has been found and fixed.

You may now receive delayed notices. If you receive an approval email and find nothing to approve when you follow the link, the delay is the reason. Apologies for the inconvenience. Additional monitoring is being put into place.

Resolved
June 01, 2022 at 2:51 PM
This issue with the ECS has been resolved.
Identified
June 01, 2022 at 2:45 PM
We are looking at a certificate issue with bosecs and holyecs. Replacement certs should be available to us shortly.

May 2022

Resolved
May 29, 2022 at 2:01 AM
Websites & Tools - FASRC Documentation (docs.rc) is now operational! This update was created by an automated monitoring service.
Investigating
May 29, 2022 at 1:22 AM
Websites & Tools - FASRC Documentation (docs.rc) cannot be accessed at the moment. This incident was created by an automated monitoring service.

MGHPCC/Holyoke May 2022 power downtime

Completed
May 26, 2022 at 3:49 PM
Data center maintenance has completed successfully. The cluster and all resources are back in service.
In progress
May 23, 2022 at 10:00 PM
Maintenance is now in progress
Planned
May 23, 2022 at 10:00 PM
The annual MGHPCC data center power shutdown will occur from 6pm May 23 to noon May 26:
- Power-down of the cluster will begin at 6PM on May 23rd
- Power will be out that night and through the following day, May 24th
- Maintenance to FASRC infrastructure will occur on May 25th
- Power-up ETA and return to service is noon on May 26th
While this outage impacts all services and resources in the MGHPCC/Holyoke data center, please be aware that this can have a knock-on effect for some Boston services as well.

LOGIN & VDI: Boston and Holyoke login and VDI will be affected for the duration of the downtime.

OFFICE HOURS: Office hours will not be held on Wednesday, May 25th.

BOSTON DATA CENTER: Boston storage may be affected at various times on May 25th. Any additional Boston outages will be noted here.

Resolved
May 23, 2022 at 7:14 PM
Websites & Tools - MiniLIMs (FAS Informatics) is now operational! This update was created by an automated monitoring service.
Investigating
May 23, 2022 at 7:01 PM
Websites & Tools - MiniLIMs (FAS Informatics) cannot be accessed at the moment. This incident was created by an automated monitoring service.

Resolved
May 16, 2022 at 9:34 PM
Websites & Tools - SPINAL (FAS Informatics) is now operational! This update was created by an automated monitoring service.
Investigating
May 16, 2022 at 9:22 PM
Websites & Tools - SPINAL (FAS Informatics) cannot be accessed at the moment. This incident was created by an automated monitoring service.

Resolved
May 06, 2022 at 4:17 PM
The scheduler and node states appear to be stable. Thank you for your patience and understanding.

Please note that the intermittent deadlock issue is still not resolved, but we are actively monitoring that and intervening as necessary until we receive a solution.
Monitoring
May 06, 2022 at 3:20 PM
The patch has been deployed and the scheduler restarted. Paused jobs are resuming.

Any jobs that did not start or were stuck may have been flushed, so please check any pending jobs you may have had.

Related doc: https://docs.rc.fas.harvard.edu/kb/running-jobs/.
Identified
May 06, 2022 at 2:16 PM
We tested the patch on our test cluster before releasing it. We are now deploying it to the cluster. Thanks for your patience.

UPDATE: jobs suspended, scheduler down for patching, nodes are updating.
Investigating
May 06, 2022 at 1:47 PM
The Slurm emergency security patch introduced a bug which is causing many of our nodes to be set to 'not responding'. The vendor has already identified the issue and issued another patch.

We are deploying this patch after testing. Jobs will be paused and the scheduler and cluster will be unavailable while deploying the patch. Watch here for updates.

Apr 2022

Resolved
April 19, 2022 at 3:56 PM
boslogin04 is back up.
Identified
April 19, 2022 at 3:45 PM
boslogin04 needs to be rebooted in order to address an underlying issue.

Resolved
April 25, 2022 at 2:15 PM
Please contact us if you see any lingering node issues.
Identified
April 14, 2022 at 12:00 PM
Following the MGHPCC datacenter outage on 4/12/22, some nodes still have filesystem mount issues that affect some labs. We are actively draining and rebooting these nodes to bring them back into service.

Resolved
April 12, 2022 at 10:33 PM
All cooling, including the water cooling for water-cooled compute nodes, is back online. All partitions are open for jobs. Some compute nodes in various partitions may still require individual attention, so not every compute node is back online, but we will work to bring them all online in the coming hours.
Monitoring
April 12, 2022 at 9:35 PM
Most storage in Holyoke is back up.

The Slurm scheduler is back up and accepting jobs. However, most public partitions are down as the water cooling systems for those compute racks require in-person attention. RC staff are already en route to the datacenter to address this.

The Academic Cluster is back up.
Investigating
April 12, 2022 at 7:13 PM
A cooling failure caused temperatures in the MGHPCC datacenter to exceed the safe range of operation for many systems, causing them to power down to prevent permanent damage.

The cooling issue has been resolved and we are beginning to power systems back on. Expect outages on various systems until all of them are back in service.
Monitoring
April 12, 2022 at 7:13 PM
A cooling failure caused temperatures in the MGHPCC datacenter to exceed the safe range of operation for many systems, causing them to power down to prevent permanent damage.

The cooling issue has been resolved and we are beginning to power systems back on. Expect outages on various systems until all of them are back in service.

Resolved
April 09, 2022 at 10:00 AM
Websites & Tools - MiniLIMs (FAS Informatics) is now operational! This update was created by an automated monitoring service.
Investigating
April 09, 2022 at 9:17 AM
Websites & Tools - MiniLIMs (FAS Informatics) cannot be accessed at the moment. This incident was created by an automated monitoring service.

Resolved
April 09, 2022 at 10:04 AM
Websites & Tools - SPINAL (FAS Informatics) is now operational! This update was created by an automated monitoring service.
Investigating
April 09, 2022 at 8:37 AM
Websites & Tools - SPINAL (FAS Informatics) cannot be accessed at the moment. This incident was created by an automated monitoring service.

Apr 2022 to Jun 2022

FAS Research Computing - Notice history
