Incident Report
Incident Name: Chainlink Arbitrum Mainnet Outage
Report Date: June 7, 2023
Incident Dates: 2023-06-07T18:00 - 2023-06-07T19:30
Incident Duration: ~90 minutes
Incident Description:
The Arbitrum Chainlink node stopped serving requests because its RPC connection timed out. On further inspection of the RPC connections, the RPC nodes were neither producing blocks correctly nor syncing with the chain.
The network issue appeared to originate from the relayer not working properly. An RPC node upstream of the relayer kept producing blocks, but all self-hosted or non-foundation RPC nodes were downstream of the relayer, which led to the out-of-sync issue.
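The out-of-sync condition described above can be illustrated with a minimal sketch that compares the latest block height reported by a downstream RPC node against a reference node. This is not the team's actual monitoring. The function names, the 50-block lag threshold, and any endpoint URLs passed in are assumptions made for illustration. The only external interface used is the standard eth_blockNumber JSON-RPC method.

```python
import json
import urllib.request


def block_number_request(request_id=1):
    """Build the JSON-RPC payload for the standard eth_blockNumber call."""
    return {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": request_id}


def parse_block_number(response):
    """Convert the hex-encoded 'result' field of a JSON-RPC response to an int."""
    return int(response["result"], 16)


def fetch_block_number(rpc_url, timeout=5.0):
    """Ask one RPC endpoint for its latest block height.

    rpc_url is supplied by the caller; no real endpoint is assumed here.
    A timeout surfaces as an exception, which the caller should treat as unhealthy.
    """
    body = json.dumps(block_number_request()).encode("utf-8")
    req = urllib.request.Request(
        rpc_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_block_number(json.load(resp))


def is_out_of_sync(local_height, reference_height, max_lag_blocks=50):
    """Return True when the local node trails the reference by more than max_lag_blocks.

    The default of 50 blocks is an illustrative threshold, not a recommended value.
    """
    return reference_height - local_height > max_lag_blocks
```

In use, fetch_block_number would be called once against a self-hosted endpoint and once against a reference endpoint, and the two heights passed to is_out_of_sync. Keeping the comparison separate from the network call makes the check easy to test without a live node.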
Root Cause:
The Arbitrum network relayer went offline because the Arbitrum Foundation had not funded it with ETH for gas. With a balance of 0 ETH, the relayer could not process transactions.
See the Arbitrum Foundation's full incident report:
https://arbitrumfoundation.notion.site/arbitrumfoundation/June-7-2023-Batch-Poster-Outage-d49c50df42864c7b83521fd7aa5897f2
Resolution:
The team had to wait for the Arbitrum network to resolve the root cause on its end.
Preventative Measures:
None
Follow-Up:
A review of the incident with the team members to share lessons learned and identify opportunities for improvement.
Updating documentation on incident response procedures for incidents that originate directly in the underlying network.
Conducting a post-incident review to evaluate the effectiveness of any preventative measures put in place.
Impact:
The incident caused approximately 90 minutes of downtime for one (1) Arbitrum Mainnet Chainlink node service. Transactions were processed intermittently whenever the blockFarms RPC endpoints could sync properly, but service was degraded overall for the full 90-minute window.
Incident Owner: Matthew Gladson (Team Lead)