Approach for Troubleshooting Oracle Database Unavailability in a Production Environment

Quickstart

When an Oracle Database becomes unavailable in a production (PRD) environment, troubleshooting should follow a structured and non-destructive approach to preserve evidence and avoid data loss.

1. Capture the Current State Before Any Restart

Before restarting Oracle services or rebooting the server, record the current state of the system and database components. This information may be lost after a restart.

  • Identify running Oracle and cluster-related processes:
    ps -ef | grep pmon
    ps -ef | grep d.bin   # Clusterware environments
    systemctl status dbora.service
  • In high-availability setups, verify the Data Guard configuration:
    dgmgrl sys/<password>
    SHOW CONFIGURATION;
    SHOW DATABASE <db_unique_name>;
  • In Clusterware environments, check the status of Oracle resources:
    crsctl status resource -t
  • For ASM-based storage, confirm disk group availability:
    asmcmd lsdg
    asmcmd lsdsk
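The capture step can be scripted so nothing is forgotten under pressure. A minimal sketch (the output directory layout is an example; Clusterware/ASM commands are skipped when not installed):

```shell
#!/bin/bash
# Snapshot process and cluster state into a timestamped directory
# BEFORE any restart, since this evidence is lost afterwards.
capture_state() {
  local outdir="${1:-/tmp/ora_state_$(date +%Y%m%d_%H%M%S)}"
  mkdir -p "$outdir"
  # bracket trick ([p]mon) keeps grep from matching itself
  ps -ef | grep -i '[p]mon'  > "$outdir/pmon_processes.txt"
  ps -ef | grep '[d]\.bin'   > "$outdir/clusterware_daemons.txt"
  command -v systemctl >/dev/null && \
    systemctl status dbora.service > "$outdir/dbora_service.txt" 2>&1
  command -v crsctl >/dev/null && \
    crsctl status resource -t      > "$outdir/crs_resources.txt" 2>&1
  command -v asmcmd >/dev/null && \
    asmcmd lsdg                    > "$outdir/asm_diskgroups.txt" 2>&1
  echo "$outdir"
}
```

Run it as the oracle (or grid) user and attach the resulting directory to the incident record.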

2. Attempt Database Connectivity

Try connecting to the database using SQL*Plus to capture any error messages:

sqlplus / as sysdba

Record all error outputs for further analysis.

3. Review Oracle Alert Logs

Perform an initial analysis of the Oracle alert log to understand the nature of the failure.

  • Check the most recent entries in alert_<SID>.log.
  • Search backward from the end of the file for termination markers or Oracle errors:
    Instance terminated
    ORA-
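The backward search can be sketched as a small function: find the last "Instance terminated" marker and report the first ORA- error leading up to it (the earliest error usually indicates the root cause). The pattern is a simplification; real alert logs may need a wider regex.

```shell
# Print the first ORA-NNNNN error preceding the last
# 'Instance terminated' marker in an alert log.
first_ora_before_termination() {
  local logfile="$1"
  local term_line
  # line number of the last termination marker (empty if none)
  term_line=$(grep -n 'Instance terminated' "$logfile" | tail -1 | cut -d: -f1)
  if [ -n "$term_line" ]; then
    head -n "$term_line" "$logfile" | grep -o 'ORA-[0-9]\{5\}.*' | head -1
  else
    grep -o 'ORA-[0-9]\{5\}.*' "$logfile" | head -1
  fi
}
```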

4. Check Operating System Logs

If the database instance terminated unexpectedly, inspect OS-level logs around the same timestamp to identify external causes such as resource exhaustion or storage failures.

  • System logs: /var/log/messages
  • Kernel messages: dmesg -T
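A quick keyword scan helps spot the usual external causes. This is a sketch; the keyword list is an assumption and should be extended for the local environment:

```shell
# Grep OS logs for common external causes of instance termination:
# OOM killer, I/O errors, read-only remounts, filesystem errors.
scan_os_log() {
  local logfile="${1:-/var/log/messages}"
  grep -Ei 'out of memory|oom-killer|i/o error|read-only file system|ext4-fs error|xfs.*error' "$logfile"
}
```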

5. Verify Backup Availability

Before attempting recovery or restart actions, confirm that valid database backups are available.

  • Check RMAN backup status: RMAN> LIST BACKUP;

6. Restore or Start Services in a Controlled Manner

  • Attempt to start the primary database and verify availability.
  • If the primary database starts successfully, validate:
      • Listener status
      • Application connectivity
  • In Data Guard environments, ensure redo log transport and apply services are functioning correctly on the standby database.

7. Perform Root Cause Analysis

Analyze collected logs, system outputs, and configuration states to determine the root cause of the database outage. This analysis should consider:

  • Oracle internal errors
  • OS-level resource constraints
  • Storage or ASM issues
  • Clusterware or Data Guard events

Analyzing Oracle Errors and Log Files

Effective troubleshooting of Oracle Database issues requires a systematic review of Oracle and operating system logs. Oracle often reports multiple errors for a single failure; therefore, identifying the first error in the sequence is critical, as it usually indicates the root cause.


1. Oracle Log Files to Review

General Oracle Logs

  • Database Alert Log
    $ORACLE_BASE/diag/rdbms/<db_unique_name>/<SID>/trace/alert_<SID>.log
    The alert log provides high-level database events and typically references additional trace files that should also be reviewed.
  • Listener Log
    $ORACLE_BASE/diag/tnslsnr/<hostname>/listener/trace/listener.log
    This log is useful for diagnosing connection-related issues and listener failures.

Note: Oracle may log multiple related errors. Always focus on the earliest ORA error reported to determine the underlying cause.


2. High Availability (Data Guard) Logs

  • Data Guard Broker Log
    $ORACLE_BASE/diag/rdbms/<db_unique_name>/<SID>/trace/drc<db_unique_name>.log
    Review this log when issues involve role transitions, redo transport, or Data Guard configuration problems.

3. Clusterware Logs

  • Oracle High Availability Services Daemon (OHASD) Log
    $GRID_HOME/log/<hostname>/ohasd
    /var/log/oracleohasd
  • Clusterware Alert Log
    $ORACLE_BASE/diag/crs/<hostname>/crs/trace/alert.log

These logs are essential for identifying Clusterware startup failures, resource fencing, or node-level issues.


4. ASM Logs

  • ASM Alert Log
    $ORACLE_BASE/diag/asm/+asm/<ASM_SID>/trace/alert_<ASM_SID>.log
  • Oracle ASM System Logs /var/log/oracleasm

Review these logs when database issues may be related to storage availability, disk groups, or ASM instance failures.


5. Correlate with Operating System Logs

If Oracle processes terminated unexpectedly (for example, due to memory exhaustion or I/O failures), correlate Oracle logs with OS-level diagnostics:

  • System logs: /var/log/messages
  • Kernel messages: dmesg -T
  • System monitoring and resource checks: free -m
    top
    df -h
    mount
  • Historical monitoring data (e.g., Munin graphs), if available
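The resource checks above can be turned into numeric helpers so thresholds can be scripted; a minimal sketch (the 90%/low-memory limits are up to the operator):

```shell
# Print the use% of a mount point (default: /) as a bare number.
disk_usage_pct() {
  df -P "${1:-/}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}
# Print available memory in MB (from /proc/meminfo, Linux only).
mem_available_mb() {
  awk '/MemAvailable/ { printf "%d\n", $2 / 1024 }' /proc/meminfo
}
```

Example check: `[ "$(disk_usage_pct /u01)" -lt 90 ] || echo "WARN: /u01 nearly full"`.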

6. Oracle Error Code Reference

Oracle maintains official documentation for error codes, including possible causes and recommended actions:

Oracle Error Messages Documentation:
https://docs.oracle.com/en/database/oracle/oracle-database/19/errmg/index.html

Examples (log entry → error message documentation):

  • Alert log:
    2022-02-01T16:42:19.462373+00:00
    PMON (ospid: 12382): terminating the instance due to ORA error 471
    Documentation:
    ORA-00471: DBWR process terminated with error
    Cause: The database writer process died.
    Action: Warm start instance.
  • Alert log:
    2022-01-25T15:57:29.050729+00:00
    ORA-00020: maximum number of processes (300) exceeded
    ORA-20 errors will not be written to the alert log for …
    Documentation:
    ORA-00020: maximum number of processes (string) exceeded
    Cause: All process state objects are in use.
    Action: Increase the value of the PROCESSES initialization parameter.
  • Listener log:
    25-JAN-2022 16:38:04 * (CONNECT_DATA=(SERVICE_NAME=ZONOS.NPT.COM)…
    TNS-12528: TNS:listener: all appropriate instances are blocking new connections
    Documentation:
    TNS-12528: TNS:listener: all appropriate instances are blocking new connections
    Cause: All instances supporting the service requested by the client reported that they were blocking the new connections. This condition may be temporary, such as at instance startup.
    Action: Attempt the connection again. If error persists, then contact the administrator to check the status of the instances.

Oracle Knowledge Base

Requires Oracle Support Login

https://support.oracle.com/ — go to the Knowledge tab, select the Oracle product, and enter search patterns.

Interactive Quick Reference

Oracle provides an Interactive Quick Reference for 12c (not available for 19c, but largely still applicable) which can help when trying to find the relevant views, background processes, and file locations.

DB/Cluster Status (Is Oracle up and accepts connections?)

Check systemd status:

systemctl status dbora.service (standalone)
systemctl status oracle-ohasd.service (clusterware)

Check if pmon (Oracle process monitor, which monitors all other Oracle background processes) is running:

$ ps -ef | grep pmon
oracle    5502     1  0 Jan25 ?        00:01:47 ora_pmon_ACPDB
Clusterware (grid layer):
$ ps -ef | grep pmon
grid      82263      1  0 Jan05 ?        00:03:48 asm_pmon_+ASM
oracle    82362      1  0 Jan05 ?        00:06:37 ora_pmon_DB2

Try to connect & check DB state and open-mode:

$ sqlplus /nolog => if that fails, $ORACLE_HOME may be unavailable, e.g. due to a lost or read-only mounted disk
SQL> conn sys as sysdba
Enter password: 
Connected.
or:
$ sqlplus / as sysdba

SQL> select user from dual;
USER
--------------------------------------------------------------------------------
SYS

SQL> SELECT INSTANCE_NAME, STATUS, DATABASE_STATUS FROM V$INSTANCE;
INSTANCE_NAME     STATUS       DATABASE_STATUS
---------------- ------------ -----------------
ACPDB             OPEN          ACTIVE
SQL> select name, open_mode from v$database;
NAME      OPEN_MODE
--------- --------------------
ACPDB      READ WRITE


Clusterware only: check cluster status. “Target” state should match “State” in output!

crsctl status resource -t
Example:
[grid@myhost_db1 ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DGDATA.dg
               ONLINE  ONLINE       myhost_db1           STABLE
ora.DGFRA.dg
               ONLINE  ONLINE       myhost_db1           STABLE
ora.DGREDO.dg
               ONLINE  ONLINE       myhost_db1           STABLE
ora.LISTENER.lsnr
               ONLINE  ONLINE       myhost_db1           STABLE
ora.asm
               ONLINE  ONLINE       myhost_db1           Started,STABLE
ora.ons
               OFFLINE OFFLINE      myhost_db1           STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
      1        ONLINE  ONLINE       myhost_db1           STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.mydbr.db
      1        ONLINE  ONLINE       myhost_db1           Open,Readonly,HOME=/
                                                             u01/app/oracle/produ
                                                             ct/19.0.0/mydb,STAB
                                                             LE
ora.evmd
      1        ONLINE  ONLINE       myhost_db1           STABLE
--------------------------------------------------------------------------------
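A Target/State mismatch can be detected automatically by parsing the `crsctl stat res -t` output. A sketch, assuming the tabular output format shown above (pipe the command output into the function):

```shell
# Flag resources whose Target and State disagree in
# 'crsctl stat res -t' output read from stdin, e.g.
# Target=ONLINE / State=OFFLINE marks a resource that failed to start.
crs_mismatches() {
  awk '
    /^ora\./                  { name = $1; next }   # resource name line
    $1 ~ /^(ONLINE|OFFLINE)$/ { t = $1; s = $2 }    # local resource row
    $1 ~ /^[0-9]+$/           { t = $2; s = $3 }    # cluster resource row
    t != "" && t != s { print name ": Target=" t " State=" s }
    { t = ""; s = "" }
  '
}
```

Usage: `crsctl stat res -t | crs_mismatches` — no output means all resources match their target state.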

Check Oracle alert log for records which indicate that Oracle was terminated and not restarted (e.g. search from last line for ‘Instance terminated’ and check records before and after)

Check listener status:

$ lsnrctl status

Clean shutdown & restart all Oracle/grid infrastructure processes

Clusterware

Stop/start grid + Oracle High Availability Service (as user grid):

$ crsctl stop res -all
$ crsctl stop has
$ crsctl start has
$ crsctl start res -all

Stop/start database/cluster services using srvctl

When stopping or starting the database with the sqlplus shutdown/startup commands, Clusterware might not pick up the current status of the database. Use srvctl commands to stop/start instead (as user oracle!):

Examples:

Status (stop/start) ora.itsdbtst2.db database service:

[oracle@it1dba2 ~]$ srvctl status database -d itsdbtst2
Database is running.

Status (stop/start) of the ora.itsdbtst2.itsdbtst.cucunet.svc service (for failover of zonos applications; expected to be ONLINE on primary and OFFLINE on standby):

[oracle@it1dba2 ~]$ srvctl status service -d itsdbtst2 -s itsdbtst.cucunet
Service itsdbtst.cucunet is not running.

Issue: Services connecting to read-only standby DB

Examples:

Jan 23 17:37:09 it1coa1 zonos-process-engine_10121[1206]: SQL Error: 16000, SQLState: 72000
Jan 23 17:37:09 it1coa1 zonos-process-engine_10121[1206]: ORA-16000: database or pluggable database open for read-only access

Jan 23 17:41:25 it1apa1 zonos-control-panel_10121[92800]:                LEFT JOIN DEVICE_TEMPLATE ON DEVICE_TEMPLATE.ID = FILTERED_DEVICE.DEVICE_TEMPLATE_ID]; SQL state [72000]; error code [16000]; ORA-16000: database or pluggable database open for read-only access
Jan 23 17:41:25 it1apa1 zonos-control-panel_10121[92800]: ; nested exception is java.sql.SQLException: ORA-16000: database or pluggable database open for read-only access

Check current status of custom ora.service cluster service (ora.*.svc) as user grid.

The expected current state is OFFLINE on standby. During failover this service is relocated (started on new primary, stopped on old primary=new standby).

$ crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
...
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
      1        ONLINE  ONLINE       it1dba1                  STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.evmd
      1        ONLINE  ONLINE       it1dba1                  STABLE
ora.itsdbtst.db
      1        ONLINE  ONLINE       it1dba1                  Open,Readonly,HOME=/
                                                             u01/app/oracle/produ
                                                             ct/19.0.0/dbhome_1,S
                                                             TABLE
ora.itsdbtst.itsdbtst.cucunet.svc
      1        ONLINE  OFFLINE                               STABLE
--------------------------------------------------------------------------------
$ crsctl stat res -v
...
NAME=ora.itsdbtst2.itsdbtst.cucunet.svc
TYPE=ora.service.type
LAST_SERVER=it1dba2
STATE=OFFLINE
TARGET=...

Stop service manually:

$ srvctl status service -d itsdbtst -s itsdbtst.cucunet
Service itsdbtst.cucunet is running
$ srvctl stop service -d itsdbtst -s itsdbtst.cucunet

Related logs: $ORACLE_BASE/diag/crs/…/crs/trace/…

RAC

OHAS RAC: crsctl enable crs => the cluster manager (Cluster Ready Services) needs to be enabled in RAC

As user grid:

$ crsctl stop crs (stop the entire CRS (GI) stack on each node)
$ crsctl stop cluster ("-all" to stop crs on all nodes)
$ crsctl start crs
$ crsctl start cluster ("-all" to start crs on all nodes)

Start terminated Oracle primary instance step by step

Starting stage by stage tracks at which point Oracle fails and which error it returns:

SQL> shutdown immediate;  (roll back open transactions, close + shut down. Last resort: "shutdown abort" => uncommitted transactions are not rolled back until crash recovery on restart)
SQL> startup nomount; (allocate SGA, read init file, start background processes)
SQL> ALTER DATABASE MOUNT; (open controlfile, locate redo logs + datafiles)
SQL> ALTER DATABASE OPEN; (open datafiles and redo logs)
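The staged sequence can be generated and piped into SQL*Plus in one go. A sketch — WHENEVER SQLERROR EXIT is intended to stop the script at the first failing ALTER, so the last message shows the failing stage:

```shell
# Emit the staged startup sequence for piping into sqlplus,
# so the run stops at the first stage that raises an error.
staged_startup_sql() {
  cat <<'EOF'
WHENEVER SQLERROR EXIT SQL.SQLCODE
STARTUP NOMOUNT;
ALTER DATABASE MOUNT;
ALTER DATABASE OPEN;
SELECT instance_name, status FROM v$instance;
EXIT
EOF
}
# usage (as user oracle): staged_startup_sql | sqlplus / as sysdba
```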

check DB state and open-mode:

SQL> SELECT instance_name, status FROM v$instance;
INSTANCE_NAME     STATUS
---------------- ------------
ACPDB             OPEN

Start standby database(s) and check/start redo apply

Startup:

SQL> STARTUP MOUNT;
SQL> ALTER DATABASE OPEN READ ONLY;
Database altered.

Start Redo apply (MRP process):

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;

Check recovery mode, MRP process and applied archives on Standby:

SQL> select RECOVERY_MODE from V$ARCHIVE_DEST_STATUS ;

RECOVERY_MODE
----------------------------------
MANAGED REAL TIME APPLY
...


SQL> select process,status,sequence#, group#, thread# from v$managed_standby;


"PROCESS","STATUS","SEQUENCE#","GROUP#","THREAD#"
...
"MRP0","APPLYING_LOG",8744,"N/A",1

SQL> select sequence#, first_time, next_time, STATUS from v$archived_log order by first_time desc,sequence# FETCH FIRST 5 ROWS ONLY;

"SEQUENCE#","FIRST_TIME","NEXT_TIME","STATUS"
8743,"2022-01-26 12:40:00","2022-01-26 14:15:06","A"

Check if Primary is in sync with Standby:

see  Verify log shipping from Primary to Standby

ASM

Commands to check availability of ASM disks:

oracleasm listdisks:

[grid@myhost_db1 ~]$ oracleasm listdisks
SDB1
SDB2
SDD1

asmcmd lsdsk/lsdg (list disks/diskgroups):

[grid@myhost_db1 ~]$ asmcmd lsdsk
Path
/dev/oracleasm/disks/SDB1
/dev/oracleasm/disks/SDB2
/dev/oracleasm/disks/SDD1
 
[grid@myhost_db1 ~]$ asmcmd lsdg
State    Type    Rebal  Sector  Logical_Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512             512   4096  1048576    962560   954575                0          954575              0             N  DGDATA/
MOUNTED  EXTERN  N         512             512   4096  1048576    295729   283793                0          283793              0             N  DGFRA/
MOUNTED  EXTERN  N         512             512   4096  1048576     30718    30666                0           30666              0             N  DGREDO/

The oracleasm querydisk command (Oracle ASMLib driver) lists the major and minor numbers that can be used to match the physical device:

[grid@myhost_db1 ~]$ oracleasm querydisk -d SDD1
Disk "SDD1" is a valid ASM disk on device [8,65]
[grid@myhost_db1 ~]$ oracleasm querydisk -d SDB2
Disk "SDB2" is a valid ASM disk on device [8,18]
[grid@myhost_db1 ~]$ oracleasm querydisk -d SDB1
Disk "SDB1" is a valid ASM disk on device [8,17]
[grid@myhost_db1 ~]$ ls -ltr /dev/sd*
brw-rw---- 1 root disk 8, 64 Dec 23 21:01 /dev/sde
brw-rw---- 1 root disk 8, 65 Dec 23 21:01 /dev/sde1
brw-rw---- 1 root disk 8, 16 Jan  3 21:28 /dev/sdb
brw-rw---- 1 root disk 8, 17 Jan  3 21:28 /dev/sdb1
brw-rw---- 1 root disk 8, 18 Jan  3 21:28 /dev/sdb2
...
SDB1 => /dev/sdb1
SDB2 => /dev/sdb2
SDD1 => /dev/sde1
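Extracting the major:minor pair from the querydisk output can be scripted; a small sketch for matching against `ls -l /dev/sd*` or /proc/partitions:

```shell
# Turn 'oracleasm querydisk -d <disk>' output into a MAJ:MIN pair,
# e.g. 'Disk "SDB1" is a valid ASM disk on device [8,17]' -> 8:17
querydisk_majmin() {
  sed -n 's/.*\[\([0-9]*\),\([0-9]*\)\].*/\1:\2/p'
}
# usage: oracleasm querydisk -d SDB1 | querydisk_majmin
```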

Alternatively, identify ASM devices via major:minor numbers:

# ls -ltr /dev/oracleasm/disks/
# cat /proc/partitions
# lsblk

Check which devices are identified by ASMLib as ASM devices:

# blkid |grep asm

Query ASM tables in database:

SQL> SELECT * FROM V$ASM_DISKGROUP;
 
"GROUP_NUMBER","NAME","SECTOR_SIZE","LOGICAL_SECTOR_SIZE","BLOCK_SIZE","ALLOCATION_UNIT_SIZE","STATE","TYPE","TOTAL_MB","FREE_MB","HOT_USED_MB","COLD_USED_MB","REQUIRED_MIRROR_FREE_MB","USABLE_FILE_MB","OFFLINE_DISKS","COMPATIBILITY","DATABASE_COMPATIBILITY","VOTING_FILES","CON_ID"
0,"DGREDO",0,0,0,0,"DISMOUNTED",,0,0,0,0,0,0,0,"0.0.0.0.0","0.0.0.0.0","N",0
1,"DGDATA",512,512,4096,1048576,"CONNECTED","EXTERN",962560,954585,0,7975,0,954585,0,"12.1.0.0.0","10.1.0.0.0","N",0
2,"DGFRA",512,512,4096,1048576,"CONNECTED","EXTERN",295729,283736,0,11993,0,283736,0,"11.2.0.2.0","10.1.0.0.0","N",0
SQL> select GROUP_NUMBER, DISK_NUMBER, NAME, LABEL, PATH, CREATE_DATE, MOUNT_DATE, MOUNT_STATUS, STATE, HEADER_STATUS, MODE_STATUS, OS_MB,TOTAL_MB from V$ASM_DISK;
 
"GROUP_NUMBER","DISK_NUMBER","NAME","LABEL","PATH","CREATE_DATE","MOUNT_DATE","MOUNT_STATUS","STATE","HEADER_STATUS","MODE_STATUS","OS_MB","TOTAL_MB"
3,0,"DGREDO_0000","SDD1","/dev/oracleasm/disks/SDD1","24-APR-19","03-JAN-22","CACHED","NORMAL","MEMBER","ONLINE",30718,30718
2,0,"DGFRA_0000","SDB2","/dev/oracleasm/disks/SDB2","24-APR-19","23-DEC-21","CACHED","NORMAL","MEMBER","ONLINE",295729,295729
1,0,"DGDATA_0000","SDB1","/dev/oracleasm/disks/SDB1","24-APR-19","23-DEC-21","CACHED","NORMAL","MEMBER","ONLINE",962560,962560

show ASM config:

$ oracleasm configure

DataGuard

Summary + Status

DGMGRL> SHOW CONFIGURATION [LAG] [VERBOSE];
DGMGRL> SHOW DATABASE [VERBOSE] <dbname>
DGMGRL> VALIDATE DATABASE [VERBOSE] <dbname> (e.g. before performing switchover)

Transport and Apply Lag Information

DGMGRL> show configuration lag
Configuration - acpdb_dg_config
  Protection Mode: MaxPerformance
  Members:
  mydba1 - Primary database
    mydba2 - (*) Physical standby database 
              Transport Lag:      0 seconds (computed 1 second ago)
              Apply Lag:          0 seconds (computed 1 second ago)
Fast-Start Failover: Enabled in Potential Data Loss Mode
Configuration Status:
SUCCESS   (status updated 51 seconds ago)
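For monitoring, the lag values can be pulled out of the `show configuration lag` output shown above. A sketch, assuming the "Transport Lag: N seconds" / "Apply Lag: N seconds" line format:

```shell
# Extract transport/apply lag seconds from
# 'dgmgrl ... "show configuration lag"' output read from stdin.
dg_lag_seconds() {
  awk '/(Transport|Apply) Lag:/ { print $1, $2, $3 }'
}
# usage: dgmgrl sys/<password> "show configuration lag" | dg_lag_seconds
```

The printed numbers can then be compared against an alerting threshold.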

Information about all registered observers in DG broker

DGMGRL> show observer
Configuration - acpdb_dg_config
  Primary:            mydba1
  Active Target:      mydba2
Observer "mykaa3"(19.5.0.0.0) - Master
  Host Name:                    mykaa3
  Last Ping to Primary:         1 second ago
  Last Ping to Target:          4 seconds ago

Stop/start observer

On observer host as user oracle:

[oracle@mykaa3 ~]$ cd /home/oracle/scripts/
[oracle@mykaa3 scripts]$ ./observer.sh start|stop|restart

Observer issue: Observer fails to start due to additionally created observers

(created by the “start observer” command in Data Guard Manager on a DB host)

Example:

## 2 additional observers (both registered as "myhost_db1") running. "myhost_adm1" fails to start.
DGMGRL> show observer
Configuration - dst_dg_config
  Primary:            mydb
  Active Target:      mydb2
Observer "myhost_db1"(19.5.0.0.0) - Master
  Host Name:                    myhost_db1
  Last Ping to Primary:         5346 seconds ago
  Last Ping to Target:          4467 seconds ago
Observer "myhost_adm1"(19.3.0.0.0) - Backup
  Host Name:                    myhost_adm1
  Last Ping to Primary:         (unknown)
  Last Ping to Target:          (unknown)
Observer "myhost_db1" - Backup
  Host Name:                    myhost_db1
  Last Ping to Primary:         (unknown)
  Last Ping to Target:          (unknown)
## trying to get rid of the manually created observers
DGMGRL> stop observer "myhost_db1"
Error: ORA-16878: master observer cannot be stopped
Failed.
DGMGRL> stop observer "myhost_db1"
Observer stopped.
DGMGRL> SET MASTEROBSERVER TO "myhost_adm1"
Succeeded.
DGMGRL> show observer
Configuration - dst_dg_config
  Primary:            mydb
  Active Target:      mydb2
Observer "myhost_adm1"(19.3.0.0.0) - Master
  Host Name:                    myhost_adm1
  Last Ping to Primary:         (unknown)
  Last Ping to Target:          (unknown)
Observer "myhost_db1"(19.5.0.0.0) - Backup
  Host Name:                    myhost_db1
  Last Ping to Primary:         5581 seconds ago
  Last Ping to Target:          3 seconds ago
DGMGRL> stop observer "myhost_db1"
Observer stopped.
DGMGRL> show observer
Configuration - dst_dg_config
  Primary:            mydb
  Active Target:      mydb2
Observer "myhost_adm1"(19.3.0.0.0) - Master
  Host Name:                    myhost_adm1
  Last Ping to Primary:         (unknown)
  Last Ping to Target:          (unknown)
## Start original observer on adm host 
# su - oracle
$ cd scripts/
$ ./observer.sh start
$ tail  ../log/observer.log
...
Observer 'myhost_adm1' started
[W000 2022-06-21T16:57:33.131+08:00] Observer trace level is set to USER
[W000 2022-06-21T16:57:33.131+08:00] Try to connect to the primary.
[W000 2022-06-21T16:57:33.131+08:00] Try to connect to the primary mydb_dgb.
[W000 2022-06-21T16:57:33.147+08:00] The standby mydb2 is ready to be a FSFO target
[W000 2022-06-21T16:57:34.147+08:00] Connection to the primary restored!
[W000 2022-06-21T16:57:38.157+08:00] Disconnecting from database mydb_dgb.
## Check observer state on DB hosts
DGMGRL> show observer
Configuration - dst_dg_config
  Primary:            mydb
  Active Target:      mydb2
Observer "myhost_adm1"(19.3.0.0.0) - Master
  Host Name:                    myhost_adm1
  Last Ping to Primary:         2 seconds ago
  Last Ping to Target:          1 second ago

FSFO config

DGMGRL> SHOW FAST_START FAILOVER
Fast-Start Failover: Enabled in Potential Data Loss Mode
  Protection Mode:    MaxPerformance
  Lag Limit:          30 seconds
  Threshold:          30 seconds
  Active Target:      mydba2
  Potential Targets:  "mydba2"
    mydba2    valid
  Observer:           mykaa3
  Shutdown Primary:   TRUE
  Auto-reinstate:     TRUE
  Observer Reconnect: (none)
  Observer Override:  FALSE
Configurable Failover Conditions
  Health Conditions:
    Corrupted Controlfile          YES
    Corrupted Dictionary           YES
    Inaccessible Logfile            NO
    Stuck Archiver                  NO
    Datafile Write Errors          YES
  Oracle Error Conditions:
    (none)
DGMGRL> DISABLE FAST_START FAILOVER       ## e.g. to failover manually to DR system

DGMGRL> ENABLE FAST_START FAILOVER        ## e.g. to switch back to PRD

Is Oracle supposed to switch to a standby DB if primary DB fails? Why did that fail?

Check the cluster configuration in Data Guard (as user oracle: ‘dgmgrl sys/<passwd>’):

  • ‘show configuration’ output shows an asterisk before the standby DB entry, together with the Fast-Start Failover settings
  • ‘show database verbose primary_dbname’ reports a failover target database in the FastStartFailoverTarget property

Example 1: Failover enabled

Oracle is supposed to check state and switch to the target db defined in FastStartFailoverTarget in case of primary failure

DGMGRL> show configuration
...
  Members:
  mydba1 - Primary database
    mydba2 - (*) Physical standby database 

Fast-Start Failover: Enabled in Potential Data Loss Mode

DGMGRL> show database verbose mydba1
...
  Properties:
    ...
    FastStartFailoverTarget         = 'mydba2'

Example 2: Failover disabled

The standby DB is just a DR database. If the primary fails, a manual failover needs to be performed.

DGMGRL> show configuration;
...
  Members:
  cl1db1 - Primary database
    cl2db1 - Physical standby database

Fast-Start Failover:  Disabled

DGMGRL> show database verbose cl1db1;
...
  Properties:
    ...
    FastStartFailoverTarget         = ''

Switchover & Failover

Switchover => role reversal between the primary database and one of its standby databases – primary db transitions to standby role, standby db transitions to primary

  • Guarantees no data loss
  • Switchover requires the primary db to be available!

Switchover for scheduled (quick) maintenance (OS/HW updates, …) :

see Switch Oracle DB for maintenance

When planning to switch the DB for a longer period:

Disable Fast-Start Failover before switching (to prevent the observer from initiating a failover back to the original primary DB):

DGMGRL> show FAST_START FAILOVER;
DGMGRL> disable FAST_START FAILOVER;

The configuration change is propagated to the primary and all standby databases present in the Data Guard configuration. If the network connection is broken, try ‘disable fast_start failover force’.

Failover  => role transition in which one of the standby databases is transitioned to the primary role after the primary db crashed (standby database is opened read/write).

Failover is usually initiated by Observer/DataGuard in case of emergency (see FastStartFailoverTarget property) but can be also initiated manually.

  • Failover to standby cannot be reversed, so the decision to fail over manually should be made only during a real disaster or severe outage
  • May or may not result in data loss, depending on the protection mode in effect at the time of the failover
  • Failover does not care about the state or availability of the primary db; it simply transforms a previous standby into the primary db
  • As long as the original primary database can be mounted and its files are intact, it can be quickly reinstated and resynchronized as a standby database using Flashback

Failover when the primary database is unavailable/corrupted/… :

Network connectivity check

DGMGRL> VALIDATE NETWORK CONFIGURATION FOR all;
Connecting to instance "ACPDB" on database "mydba1" ...
Connected to "mydba1"
Checking connectivity from instance "ACPDB" on database "mydba1" to instance "ACPDB" on database "mydba2"...
Succeeded.
Connecting to instance "ACPDB" on database "mydba2" ...
Connected to "mydba2"
Checking connectivity from instance "ACPDB" on database "mydba2" to instance "ACPDB" on database "mydba1"...
Succeeded.

Misc

Query v$resource_limit to check whether the database has reached a configured limit:

SQL> col RESOURCE_NAME format  A30;
SQL> col INITIAL_ALLOCATION format  A20;
SQL> col LIMIT_VALUE format  A20;
SQL> set linesize 200;
SQL> select * from v$resource_limit;  (in RAC: select * from gv$resource_limit => for all instances)

RESOURCE_NAME                  CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION    LIMIT_VALUE         CON_ID
------------------------------ ------------------- --------------- -------------------- -------------------- ----------
processes                                       76              80                  400               400              0
sessions                                        92              97                  624               624              0
enqueue_locks                                   39             113                 7420              7420              0
enqueue_resources                               30              30                 2956         UNLIMITED              0
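Rows close to their limit can be flagged automatically from this output. A sketch assuming the column layout above; the 90% threshold is an arbitrary example, and UNLIMITED rows are skipped:

```shell
# Flag v$resource_limit rows (read from stdin) whose MAX_UTILIZATION
# reached 90% of LIMIT_VALUE; headers and UNLIMITED rows are ignored.
near_limit() {
  awk '$5 ~ /^[0-9]+$/ && $3 + 0 >= 0.9 * $5 { print $1, $3 "/" $5 }'
}
```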
