Abstract
Parallel Concurrent Processing Failover uses two mechanisms to detect a failure: Dead Connection Detection, and detection of a failure by the process monitor for the Concurrent Managers, otherwise known as PMON (note that this is not the PMON from the database), which was introduced with Patch 6495206. Load balancing of the Concurrent Managers is critical if you expect parallel concurrent processing to function after the failover to the remaining node(s).
This paper reviews Concurrent Manager basics before we discuss the topics of failover and load balancing. One of the key
components used by Concurrent Processing is Generic Service Management. The use of GSM with multiple nodes and
seeded GSM services is discussed. Administering Concurrent Managers, managing control across nodes, starting and
stopping the Concurrent Managers, and managing concurrent log files are skills needed to understand the configuration of
Parallel Concurrent Processing failover and load balancing.
There are a number of ways that an E-Business Suite environment might be configured for failover:
• Database
• Fast Connection Failover (FCF)
• Transparent Application Failover (TAF)
• Parallel Concurrent Processing Failover
• Concurrent Manager Failover
This paper will discuss Parallel Concurrent Processing Failover, ICM Failover, CRM Failover, and Concurrent Manager
Failover. We’ll leave the discussion of Database Failover, Fast Connection Failover and Transparent Application Failover for
another time.
The paper concludes with a discussion of load balancing and the issues that must be considered to properly configure an E-Business Suite environment to take advantage of Oracle’s load balancing features.
Concurrent Processing
Most user interactions with Oracle Applications data are conducted via the HTML interface or the Forms interface. However,
reporting and interface programs may need to run periodically or on an ad hoc basis. As these programs may require a large
number of computations, they are run in the background at a time, and with a priority, such that the work of interactive users
is not impeded. Such programs are run on the Concurrent Processing server and run under Concurrent Managers.
When a request is submitted to run a Concurrent Program through an Oracle Applications form or through Oracle Applications Manager (OAM), a row is inserted into the FND_CONCURRENT_REQUESTS table that specifies the program to be run. Concurrent Managers read the requests from the table and start the appropriate Concurrent Programs.
Concurrent Managers may have limits on the Concurrent Programs that can be run and the times that they can be started. Concurrent Requests have priorities, statuses, and log and out files in $APPLCSF.
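As a minimal illustration of how requests are tracked (a sketch assuming the standard APPS schema; REQUEST_ID, PHASE_CODE, and STATUS_CODE are well-known columns of FND_CONCURRENT_REQUESTS), pending and running requests can be listed with:
sqlplus -s apps/<apps_password> <<'EOF'
-- Phase codes: P = Pending, R = Running, C = Completed, I = Inactive
SELECT request_id, phase_code, status_code, requested_start_date
FROM   fnd_concurrent_requests
WHERE  phase_code IN ('P', 'R')
ORDER  BY requested_start_date;
EOF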
Definitions
The following are some acronyms that we will use throughout this paper:
• CP => Concurrent Processing
• DCD => Dead Connection Detection
• ICM => Internal Concurrent Manager
• IM => Internal Monitor
• CRM => Conflict Resolution Manager
• PCP => Parallel Concurrent Processing
• PMON => Process Monitor for ICM
Concurrent Requests
Figure 1 shows an example of the Concurrent Manager Requests screen.
Figure 1
Concurrent Managers
Figure 3 shows the Concurrent Manager Administer screen. Oracle seeds a number of Concurrent Managers and assigns
Concurrent Programs to those managers. Your Applications System Administrator can also define custom managers and
assign Concurrent Programs to those managers.
Figure 3
Figure 4 shows the different types of Concurrent Managers, their Service Instance, and their Program Name. Your
Applications System Administrator can adjust the Concurrent Managers and Transaction Managers, but the other types of
managers must be left alone.
[Diagram: Forms Server, Java/JInitiator interface, and Reports Server]
• The Internal Concurrent Manager (ICM) starts, sets the number of active processes, monitors, and terminates all
other concurrent processes through requests made to the Service Manager, including restarting any failed processes.
• The ICM also starts, stops, and restarts the Service Manager for each node.
• The ICM will perform process migration during an instance or node failure.
• The ICM will be active on a single node. This is also true in a Parallel Concurrent Processing environment, where
the ICM will be active on at least one node at all times.
• The ICM really does not have any scheduling responsibilities. It has NOTHING to do with scheduling requests, or
deciding which manager will run a particular request. The function of the ICM is to run 'queue control' requests;
requests to startup or shutdown other managers.
• The ICM is responsible for startup and shutdown of the whole concurrent processing facility, and it monitors the
other managers periodically, and restarts them if they should go down. It can also take over the Conflict Resolution
Manager's job, and resolve incompatibilities.
• If the ICM itself should go down, requests will continue to run normally, except for 'queue control' requests. Your
Applications System Administrator can restart the ICM by running the 'startmgr' command; there is no need to kill
the other managers first.
Figure 5
In Release 11i, if there is more than one possible Secondary Node and the Primary Node fails, PCP will fail over to any node that is available. Specifying a Secondary Node limits failover to that node only. An available node is any node in the FND_NODES table, except AUTHENTICATION, whose status is set to ‘Y’.
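A quick way to see node availability (a sketch; FND_NODES is the table named above, and SUPPORT_CP marks concurrent processing nodes):
sqlplus -s apps/<apps_password> <<'EOF'
-- 'Y' in STATUS means the node is available to PCP
SELECT node_name, status, support_cp
FROM   fnd_nodes
WHERE  node_name <> 'AUTHENTICATION';
EOF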
Figure 6
In Figure 6, the TCP connection to RH9 has been disconnected and it shows a status of ‘N’.
Service Manager
(FNDSM process) - Communicates with the Internal Concurrent Manager, Concurrent Manager, and non-Manager Service
processes.
• The Service Manager (SM) spawns and terminates manager and service processes (these could be Forms, Apache
Listeners, Metrics or Reports Server, and any other process controlled through Generic Service Management).
• When the ICM terminates, the SM that resides on the same node with the ICM will also terminate.
• The SM is “chained” to the ICM. The SM will only reinitialize after termination when there is a function it needs to
perform (start, or stop a process), so there may be periods of time when the SM is not active, and this would be
normal.
• All processes initialized by the SM inherit the same environment as the SM.
• The SM’s environment is set by the APPSORA.env file, and the gsmstart.sh script.
• The TWO_TASK setting used by the SM to connect to a RAC instance must match the instance_name from
GV$INSTANCE.
• The apps_<sid> listener must be active on each Concurrent Processing node to support the Service
Manager connection to the local instance.
• There should be a Service Manager active on each node where a Concurrent or non-Manager service process will
reside.
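One way to check this on a Concurrent Processing node (a sketch; the comparison of TWO_TASK against GV$INSTANCE follows the rule above):
echo "TWO_TASK is set to: $TWO_TASK"
sqlplus -s apps/<apps_password> <<'EOF'
-- TWO_TASK must match one of these instance names
SELECT inst_id, instance_name, host_name FROM gv$instance;
EOF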
The following Concurrent Manager log excerpt shows what is reported when a Service Manager cannot be contacted:
Could not contact Service Manager FNDSM_RH8_VIS. The TNS alias could not be located, the listener
process on RH3 could not be contacted, or the listener failed to spawn the Service Manager process.
Found dead process: spid=(962754), cpid=(2259578), Service Instance=(1045)
CONC-SM TNS FAIL
Call to PingProcess failed for WFMAILER
CONC-SM TNS FAIL
Call to StopProcess failed for WFMAILER
CONC-SM TNS FAIL
Call to PingProcess failed for FNDCPGSC
CONC-SM TNS FAIL
Call to StopProcess failed for FNDOPP
CONC-SM TNS FAIL
Call to PingProcess failed for OAMGCS
CONC-SM TNS FAIL
Call to StopProcess failed for OAMGCS
Found dead process: spid=(716870), cpid=(2259580), Service Instance=(2009)
Found dead process: spid=(1442020), cpid=(2259579), Service Instance=(2010)
Internal Monitor
(FNDIMON process) - Communicates with the Internal Concurrent Manager.
Standard Manager
• The Standard Manager is a worker process that initiates and executes client requests on behalf of Applications batch and OLTP clients.
Figure 7
You can also see the Concurrent Managers from the OAM web page:
Figure 8
In Figure 9, the Standard Manager is active on RH9, even though no Primary Node is defined:
Figure 9
Since no Secondary Node is defined, the Standard Manager will not failover.
Notice in Figure 8 that the Work Shifts definition now includes Failover Processes, which specify the number of processes that will run when the Standard Manager fails over to the Secondary Node.
Transaction Manager
Transaction Managers communicate with the Service Manager and with any user process initiated on behalf of Forms or a Standard Manager request, as illustrated in Figure 10.
Figure 10
Note that between Release 11i and Release 12, the way that Transaction Managers work has changed:
Transaction Managers allow a client to make a request for a program to be run on the server immediately. The client then waits
for the program to complete and can receive program results from the server. As the client and server are two separate database
sessions, the communication between them for Release 11i has been handled using the DBMS_PIPE package.
Unfortunately the DBMS_PIPE package does not extend to communications between sessions on different RAC instances. On
an Applications instance using RAC, the client and server are very likely to be on different instances, causing transactions to
time out for long periods or fail completely. The current workaround is to manually set up Transaction Managers to connect to
all RAC instances, which not only takes up additional resources, but may also require additional middle-tier hardware or a
complicated configuration that is difficult to maintain.
In Release 12, the Transaction Managers use the AQ mechanism; on RAC, the Transaction Managers can work connected to either instance. This greatly simplifies the configuration and reduces the complexity for RAC administrators. A Profile Option has been introduced to allow users to switch between the two transports, DBMS_PIPE or AQ.
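The current transport can be checked from SQL*Plus (a sketch; CONC_TM_TRANSPORT_TYPE is assumed to be the internal name of the ‘Concurrent: TM Transport Type’ Profile Option):
sqlplus -s apps/<apps_password> <<'EOF'
-- Expected values: PIPE or QUEUE
SELECT fnd_profile.value('CONC_TM_TRANSPORT_TYPE') AS tm_transport FROM dual;
EOF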
[Flow diagram: Client and Server process flows for the AQ-based Transaction Managers — the client places a request message on the AQ and waits (with a timeout) to retrieve results from the return queue; the server process receives requests, checks for shutdown, runs the Concurrent Processor, and places results on the return queue]
Here we see the Client and Server Process flows for the AQ Transaction Managers.
2. Shut down all the database instances cleanly in the RAC environment, using the command:
SQL> shutdown immediate;
Set the following init.ora parameter on each instance before restarting:
_lm_global_posts=TRUE
6. Navigate to Profile > System and change the Profile Option ‘Concurrent: TM Transport Type' to ‘QUEUE', and verify that the Transaction Manager works across the RAC instances. ATG RUP3 (4334965) or higher provides an option to use AQs in place of Pipes. See Note 240818.1.
8. Pipes are more efficient, but require a Transaction Manager to be running on each database instance.
9. Navigate to the Concurrent > Manager > Define screen, and set up the Primary and Secondary Node names for the Transaction Managers.
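A sketch of setting the hidden parameter mentioned above (assumes an spfile is in use; hidden parameters must be double-quoted):
sqlplus "/ as sysdba" <<'EOF'
-- Applies to all RAC instances on the next restart
ALTER SYSTEM SET "_lm_global_posts" = TRUE SCOPE = SPFILE SID = '*';
EOF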
Conflict Resolution Manager
If a program is identified as Run Alone, then the Conflict Resolution Manager prevents the Concurrent Managers from starting other programs in the same conflict domain.
When a program lists other programs as being incompatible with it, the Conflict Resolution Manager prevents the program
from starting until any incompatible programs in the same domain have completed running.
If a Concurrent Program cannot run on any Concurrent Manager, perhaps because it has been assigned to a Concurrent
Manager that is disabled, then the Concurrent Request will stack up in the Conflict Resolution Manager.
When a Concurrent Program is started, Concurrent Managers read the request information from the FND Concurrent Request
tables. The Conflict Resolution Manager checks Concurrent Program definitions for incompatibility rules.
TO ENABLE/DISABLE THE CONFLICT RESOLUTION MANAGER
Scheduler/Prereleaser Manager
This manager is intended to implement Advanced Schedules. Its job is to determine when a scheduled request is ready to run. Advanced Schedules were not fully implemented in Release 11.0. They are implemented in Release 11i, but are not widely used by the various Applications modules.
General Ledger uses FNDSCH for financial schedules based on different calendars and period types. It is then possible to schedule AutoAllocation sets, Recurring Journals, MassAllocations, Budget Formulas, and MassBudgets to run according to the General Ledger schedules that have been defined.
If financial schedules in GL are not being used, then it is not a problem to deactivate this manager.
Internal Concurrent Manager Failover Definition
Release 11i
Define Primary and Secondary Nodes in Release 11i
Figure 11
By not specifying a Secondary Node, the ICM can fail over to any node that is available. Consider a system that has three or more concurrent processing nodes and two nodes go down, including primary node RH3. If the Secondary Node were specified, there would be a chance the Secondary Node would not be available. This capability, to fail over to an un-named Secondary Node, is available for all managers in 11i. In Release 12 this works differently.
Release 12
In Release 12, for failover to function properly, both primary and secondary nodes must be specified. Most managers won’t
start if a primary node is not assigned. However, a few managers, for example, the Internal Concurrent Manager, and the
Conflict Resolution Manager will start on any available node. If a secondary node is not defined, the manager will not
failover.
Figure 12
Figure 13
GENERIC SERVICES
Generic Services include the Internal Concurrent Manager and Conflict Resolution Manager.
Figure 14
REQUEST PROCESSING MANAGERS
Request Processing Managers include the Standard Manager and other Concurrent Managers.
Figure 15
GENERIC SERVICE MANAGEMENT
An E-Business Suite system depends on a variety of services, such as Forms Listeners, HTTP Servers, Concurrent Managers,
and Workflow Mailers. These services are composed of one or more processes. In the past, many of these processes had to
be individually started and monitored by the Applications System Administrator. Management of these processes is
complicated, since these services can be distributed across multiple host machines.
The introduction of Generic Service Management in Release 11i helped simplify the management of these processes by
providing a fault tolerant service framework and a central management console built into Oracle Applications Manager
(OAM).
Service Management is an extension of Concurrent Processing, and provides a framework for managing processes on
multiple host machines. With Service Management, virtually any application tier service can be integrated into this
framework.
Figure 16
Figure 16 shows that beginning with Release 11i, services such as the Oracle Forms Listener, Oracle Reports Server, Apache
Web listener, and Oracle Workflow Mailer can be run under Service Management.
With Service Management, the Internal Concurrent Manager (ICM) manages the various service processes across multiple
hosts. On each host, a Service Manager acts on behalf of the ICM, allowing the ICM to monitor and control service processes
on that host. Applications System Administrators can then configure, monitor, and control services through a management console that communicates with the ICM. Figure 17 shows the Oracle Applications Manager (OAM) screen that an Applications System Administrator can use to manage the Concurrent Managers.
Figure 17
Service Management provides a fault tolerant system. If a service process exits unexpectedly, the ICM will automatically
attempt to restart the process. If a host fails, the ICM may start the affected service processes on a secondary host. The ICM
itself is monitored and kept alive by Internal Monitor processes located on various hosts.
TEST – KILL SERVICES TO SEE IF GSM RESTARTS THEM
In this example, we will kill the FNDSM process and the FNDCRM process to see if Generic Service Management correctly restarts them:
Kill FNDSM
Kill FNDCRM
In each case, both of these services were restarted before I could even enter the grep command to find the corresponding process.
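The test itself can be run with ordinary shell commands (a sketch; the <pid> values come from the first ps listing):
# Locate the Service Manager and Conflict Resolution Manager processes
ps -ef | grep -E 'FNDSM|FNDCRM' | grep -v grep
# Kill each one, then list again; GSM should already have respawned them
kill -9 <FNDSM_pid> <FNDCRM_pid>
ps -ef | grep -E 'FNDSM|FNDCRM' | grep -v grep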
Figure 18 shows that the entire set of system services may be started or stopped with a single action.
Figure 18
• Forms Listener
• Metrics Server
• Metrics Client
• Reports Server
• Apache Listener
• Linux users should not activate the Reports Server under GSM
These services, once seeded, may be managed under GSM and controlled via the Oracle Applications Manager.
FNDSVCRG is an executable introduced as a part of the Seeded GSM Services. It provides improved coordination between
the GSM monitoring of these services and their command-line control scripts.
The $FND_TOP/bin/FNDSVCRG executable is triggered from the concurrent processing control script before and after the
script starts or stops the service. FNDSVCRG connects to the database and validates the configuration of the Seeded GSM
Service.
If a service is not enabled to be managed under GSM, the FNDSVCRG executable does nothing and exits. The script then
continues to perform its normal start/stop actions.
If a service is enabled for GSM management, the FNDSVCRG executable will update the service information in the database
including the environment context, the current service log file location, and the current state of the service.
VERIFY GSM
• To verify that GSM is working, start the Concurrent Managers.
• Once GSM is enabled, the ICM uses Service Managers to start all Concurrent Managers and activated services.
• If the ICM successfully starts the managers, then GSM has been configured properly.
• If managers and/or services fail to start, errors should appear in the ICM log file.
Each Service Manager maintains its own log file named FNDSMxxxx.mgr, located in the same directory as the Concurrent
Manager log files. It is useful to examine these log files when there are problems starting services. If you cannot locate the
Service Manager log file, it is likely that the Service Managers are not starting properly and there is a configuration issue that
needs troubleshooting.
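A sketch for locating and inspecting these files (assumes the usual $APPLCSF/$APPLLOG layout described later in this paper):
# Most recently written Service Manager logs
ls -lt $APPLCSF/$APPLLOG/FNDSM*.mgr | head
# Check the latest one for startup errors
tail -50 `ls -t $APPLCSF/$APPLLOG/FNDSM*.mgr | head -1`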
Parallel Concurrent Processing
APPLDCP Profile Option
Starting with Release 11.5.10 (FND.H), the APPLDCP environment variable is ignored. Release 12 GSM requires the value of APPLDCP to be set to “ON”. The value is hard-coded in afpcsq.lpc version 115.35, thereby ignoring the value of APPLDCP.
According to Oracle’s ATG Development in Note 753678.1:
“As of file "afpcsq.lpc" version 115.35 or higher, APPLDCP is internally hard-coded to "ON" when the Generic Service
Management (GSM) is enabled--"keeping in mind, use of the GSM is required".
In short, at "afpcsq.lpc" version 115.35 or higher with the GSM enabled, the setting of the APPLDCP environment variable is
ignored--this is the "default behavior on all Release 12 releases."
NOTE: As per ARU, "Patch 11i.FND.H" (3262159) and "Oracle Applications Release 11.5.10" (3140000) contains
"afpcsq.lpc" version 115.37.”
Parallel Concurrent Processing
• In a Release 11i or Release 12 environment with Parallel Concurrent Processing enabled, the Primary Node
assignment is optional for the Internal Concurrent Manager.
• The Internal Concurrent Manager can be started from any of the nodes (host machines) identified as concurrent
processing server enabled.
• In the absence of a Primary Node assignment for the Internal Concurrent Manager, the Internal Concurrent Manager
will stay on the node (host machine) where it was started.
• If a Primary Node is assigned, the Internal Concurrent Manager will migrate to that node if it was started on a
different node.
• If the node on which the Internal Concurrent Manager is currently running becomes unavailable, the Internal
Concurrent Manager will be restarted on an alternate concurrent processing node.
• If a Primary Node is not assigned, the Internal Concurrent Manager will continue to operate on the node where it
was restarted.
• If a Primary Node has been assigned to the Internal Concurrent Manager, then it will be migrated back to that node
whenever the node becomes available.
Release 11i Parallel Concurrent Processing
• In releases before Release 11i, there must be an assigned Primary and Secondary Node for each Concurrent Manager.
• In Release 11i, Primary and Secondary Nodes need not be explicitly assigned. However, you can assign Primary and Secondary Nodes for directed load and failover capabilities.
Nodes for directed load and failover capabilities.
• In Release 11i, with three or more nodes in the concurrent processing tier, it is recommended to not specify the
Secondary Node for failover. This is because the specified Secondary Node may not be available when the Primary
Node goes down.
• By not specifying the Secondary Node, GSM can find an available node with Concurrent Processing services that
can be used during failover.
Release 12 Parallel Concurrent Processing
• With Release 12, if a Secondary Node is not specified, the processes will not failover as they do in Release 11i. This
is a critical difference between Release 11i and Release 12.
Parallel Concurrent Processing
Parallel concurrent processing allows distribution of Concurrent Managers across multiple nodes. Benefits are improved
performance, availability and scalability (load balancing).
Parallel Concurrent Processing (PCP) is activated along with Generic Service Management (GSM); it cannot be activated independently of GSM. With parallel concurrent processing implemented with GSM, the Internal Concurrent Manager (ICM) tries to assign valid nodes for Concurrent Managers and other service instances.
There should be only one ICM and one CRM at any given time. However, the ICM and CRM can be configured to run on several of the nodes.
Concurrent Managers migrate to the surviving node when one of the concurrent nodes goes down.
Only one Internal Monitor Process can be active on a single node. You decide which nodes have an Internal Monitor Process
when you configure your system. You can also assign each Internal Monitor Process a Primary and a Secondary Node to
ensure failover protection.
Internal Monitor Processes, like Concurrent Managers, have assigned work shifts, and are activated and deactivated by the
Internal Concurrent Manager. However, automatic activation of PCP does not additionally require that Primary Nodes be
assigned for all Concurrent Managers and other GSM-managed services. If no Primary Node is assigned for a service
instance, the Internal Concurrent Manager (ICM) assigns a valid Concurrent Processing Server Node as the Target Node. In
general, this node will be the same node where the Internal Concurrent Manager is running.
In the case where the ICM is not on a Concurrent Processing Server Node, the ICM chooses an active Concurrent Processing
Server Node in the system. If a Concurrent Processing Server Node is not available, a Target Node will not be assigned.
If a Concurrent Manager does have an assigned Primary Node, it will only try to start up on that node; if the Primary Node is
down, it will look for its assigned Secondary Node, if one exists.
If both the Primary and Secondary Nodes are unavailable, the Concurrent Manager will not start (the ICM will not look for
another node on which to start the Concurrent Manager). This strategy prevents overloading any node in the case of failover.
The Concurrent Managers are aware of many aspects of the system state when they start up. When an ICM successfully starts
up, it checks the TNS listeners and database instances on all remote nodes. If an instance is down, the affected managers and
services switch to their Secondary Nodes.
Processes managed under GSM will only start on nodes that are in Online mode. If a node is changed from Online to Offline,
the processes on that node will be shut down and switched to a Secondary Node if possible.
Concurrent processing provides database instance-sensitive failover capabilities. When an instance is down, all managers
connecting to it switch to a secondary middle-tier node.
However, if you prefer to handle instance failover separately from such middle-tier failover (for example, using the TNS
connection-time failover mechanism instead), use the Profile Option Concurrent:PCP Instance Check. When this Profile
Option is set to OFF, Parallel Concurrent Processing will not provide database instance failover support; however, it will
continue to provide middle-tier node failover support when a node goes down.
For the Internal Concurrent Manager, you assign the Primary Node only.
To Set Up PCP with RAC
The following assumes a two-node RAC cluster, where node1 is known as vip1 and node2 is known as vip2:
1. Check the configuration files tnsnames.ora and listener.ora located under the 8.0.6 ORACLE_HOME at $ORACLE_HOME/network/admin/<context>. Ensure that you have information for all the other concurrent nodes in the FNDSM and FNDFS entries.
3. Log in to Oracle E-Business Suite Release 11i as SYSADMIN and choose the System Administrator Responsibility.
Navigate to the Install > Nodes screen, and ensure that each node in the cluster is registered.
4. Verify that the Internal Monitor for each node is defined properly, with the correct Primary and Secondary Node
specifications and work shift details.
5. Confirm that the Internal Monitor manager is activated from Concurrent > Manager > Administer, activating the manager as required. For example, Internal Monitor: Host2 might have the Primary Node as vip2 and the Secondary Node as vip1.
6. On all Concurrent Processing nodes, set the $APPLCSF environment variable to point to a log directory on a shared
file system.
7. On all Concurrent Processing nodes, set the $APPLPTMP environment variable to the value of the
UTL_FILE_DIR entry in the init.ora file on the database nodes. This value should be a directory on a shared
file system.
8. Do not use a load-balanced TNS entry for the value of s_cp_twotask. The request may hang if the sessions are load balanced: Worker 1, connected to DB Instance 1, places a message in the pipe and expects Worker 2 (which is connected to DB Instance 2) to consume the message. However, Worker 2 never gets the message, since pipes are instance-private. (See Optimizing the E-Business Suite with Real Application Clusters (RAC) by Ahmed Alomari.)
Setting the Profile Option Concurrent: PCP Instance Check to 'ON' means that Concurrent Managers will fail over to a secondary application tier node if the database instance to which they are connected goes down.
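A sketch of the environment settings from steps 6 and 7 above (the directory paths are placeholders; APPLPTMP must match the database's UTL_FILE_DIR):
# On each Concurrent Processing node
export APPLCSF=/shared/applcsf          # shared log/out directory (step 6)
export APPLPTMP=/shared/utl_file_dir    # shared temp directory (step 7)
# Verify the database value that APPLPTMP must match
sqlplus -s apps/<apps_password> <<'EOF'
SELECT value FROM v$parameter WHERE name = 'utl_file_dir';
EOF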
DCD is initiated on the server when a connection is established. At this time SQL*Net reads the SQL*Net parameter files
and sets a timer to generate an alarm. The timer interval is set by providing a non-zero value in minutes for the
SQLNET.EXPIRE_TIME parameter in the sqlnet.ora file.
When the timer expires, SQL*Net on the server sends a "probe" packet to the client. The probe is an empty SQL*Net packet
and does not represent any form of SQL*Net level data, but it creates data traffic on the underlying protocol.
If the client end of the connection is still active, the probe is discarded, and the timer mechanism is reset. If the client has
terminated abnormally, the server will receive an error from the send call issued for the probe, and SQL*Net on the server
will signal the operating system to release the connection's resources.
TCP/IP, for example, is a connection-oriented protocol, and as such, the protocol will implement some level of packet
timeout and retransmission in an effort to guarantee the safe and sequenced order of data packets. If a timely
acknowledgement is not received in response to the probe packet, the TCP/IP stack will retransmit the packet some number
of times before timing out. After TCP/IP gives up, then SQL*Net receives notification that the probe failed.
This is a server feature only. The client may be running any supported
SQL*Net V2 release.
DCD is much more resource-intensive than similar mechanisms at the protocol level.
With DCD enabled, if the connection is idle for the duration of the time interval specified in minutes by the
SQLNET.EXPIRE_TIME parameter, the Server-side process sends a small 10-byte packet to the client. This packet is sent
using TCP/IP.
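For reference, a minimal server-side sqlnet.ora entry (a sketch; the one-minute interval is an illustrative value, not a recommendation):
# sqlnet.ora on the database server
SQLNET.EXPIRE_TIME = 1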
• Both the Internal Concurrent Manager and the Internal Monitor can use the DCD functionality of the network (TCP SQL*Net).
• The ICM is a client process connected to a DCD-enabled DB dedicated server process.
• The ICM holds the named PL/SQL Lock, the “ICM lock”.
• The IM is continuously trying to check whether it can get the same named PL/SQL Lock.
• As soon as the “ICM lock” is released by the DB / DCD, FNDIMON pings the ICM node, and the IM deduces that the ICM has crashed.
o If the ping succeeds, we conclude that the ICM is fine. Obviously, the ICM can be down even if TCP is working, so this is bad logic that can lead to false positives.
o If the ping fails, we further check if it has been over four PMON cycles since the ICM updated the WORK_START column in the FND_CONCURRENT_QUEUES table.
o If it has been more than four PMON cycles, we conclude that the ICM is dead.
• The DCD comes into the picture here after the ICM has crashed and the database needs to identify that the ICM is gone.
• The database needs to clean up the dedicated server process resource corresponding to the ICM client process.
If the client-side connection is still connected and responsive, the client sends a response packet back to the database server, resetting the timer, and another probe will be sent when the next interval expires (assuming no other activity on the connection). After 200 seconds of no response, TCP sends the first of 2 probes, 20 seconds apart. Then TCP notifies SQL*Net of the failure, and SQL*Net removes the offending connection.
tcp_retries1 (default: 3) — The number of times TCP will attempt to retransmit a packet on an established connection normally, without the extra effort of getting the network layers involved.
tcp_retries2 (default: 15) — The maximum number of times a TCP packet is retransmitted in the established state before giving up.
tcp_syn_retries (default: 5) — The maximum number of times initial SYNs for an active TCP connection attempt will be retransmitted. The default value of 5 corresponds to approximately 180 seconds.
Now let’s consider an example where the following TCP parameters are changed from their default values:
tcp_retries1 = 2
tcp_retries2 = 2
tcp_syn_retries = 2
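On Linux these can be inspected and changed with sysctl (a sketch; run as root, and add the settings to /etc/sysctl.conf to make them persistent):
# Inspect the current values
sysctl net.ipv4.tcp_retries1 net.ipv4.tcp_retries2 net.ipv4.tcp_syn_retries
# Apply the test values used above
sysctl -w net.ipv4.tcp_retries1=2
sysctl -w net.ipv4.tcp_retries2=2
sysctl -w net.ipv4.tcp_syn_retries=2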
In this example, the time to initialize the PCP failover was an average of 8 seconds after changing these TCP parameters.
We found the following Linux parameters listed in Metalink Note 249213.1:
net.ipv4.tcp_keepalive_time 3000
net.ipv4.tcp_retries2 5
net.ipv4.tcp_syn_retries 1
By changing some of these parameters, the timeout period was reduced to about 20 seconds.
“Therefore it keeps trying every 3200ms until a magic interval occurs and it stops. On Sun this interval is tcp_ip_abort_cinterval and defaults to 3 minutes (180000ms).” (Note 249213.1)
Six seconds is very close to the time measured during tests with tcp_syn_retries and tcp_retries2 set to 2. The measured
average was 8 seconds.
Multiple measurements at 5 seconds recorded no change in connection status. However, one failover was initiated at a
measured time of 6 seconds.
When configured correctly, Keepalive enables dead connections to be discovered and closed more quickly, freeing resources
used on the server more quickly.
At the time of this writing, client-side SQL*Net connections do not enable keepalive for TCP connections by default. However, it is possible to enable this by adding the ENABLE=BROKEN parameter to the SQL*Net connect descriptor in the tnsnames.ora file.
**WARNING** Keepalive intervals can typically be set to 2 hours or more (i.e., it can take more than 2 hours to notice a dead server even if keepalive is enabled). To make keepalive useful for PCP and TAF, the keepalive interval needs to be reduced to a smaller value (such as 2 minutes).
If there are a lot of IDLE connections on your network, then reducing keepalive can increase network traffic significantly.
Sample TNS alias to enable keepalive (notice the ENABLE=BROKEN clause):
VIS_BALANCE =
  (DESCRIPTION =
    (ENABLE = BROKEN)
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = rh8)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = rh6)(PORT = 1521))
    )
    (CONNECT_DATA = (SERVICE_NAME = VIS))
  )
3. ICM Process Monitor (PMON) – once TCP fails, this method, introduced with Patch 6495206, takes 2 minutes
• If the “ICM lock” is not available, FNDIMON will now ping the node of the ICM.
• If the ping succeeds, we conclude that the ICM is fine.
• If the ping fails, we further check if it has been over four PMON cycles since the ICM updated the WORK_START
column of the FND_CONCURRENT_QUEUES table.
• If it has been more than four PMON cycles we conclude that the ICM is dead.
Release 11i only uses PMON if patch 6495206 has been applied. The PMON method is included in Release 12.
DEFAULT PMON SETTINGS
Figure 19 shows the Oracle Application Manager screen with the PMON settings for this instance:
Figure 19
• When a database connection is possible, the Reviver will restart Concurrent Processing.
• Concurrent Processing can be started / stopped when the network or database is down.
• This should reduce processing down time, because Concurrent Processing restarts as soon as possible.
• This should reduce the Applications System Administrator’s workload, since there is no longer a need to take the extra step of restarting the Concurrent Managers.
Of the first three methods, in Release 11i, the method that recognizes the failure first depends on the timeout settings of each
method.
• sqlnet.inbound_connect_timeout (server)
• sqlnet.send_timeout (client and/or server)
• sqlnet.recv_timeout (client and/or server)
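These parameters also live in sqlnet.ora; a sketch with illustrative values (the 60-second settings are assumptions, not recommendations):
# sqlnet.ora — illustrative values only
# Server side:
SQLNET.INBOUND_CONNECT_TIMEOUT = 60
# Client and/or server:
SQLNET.SEND_TIMEOUT = 60
SQLNET.RECV_TIMEOUT = 60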
This method should provide automated recovery for Concurrent Managers after network or database failures. When a
network failure occurs on a concurrent processing node, resulting in a loss of database connectivity, all Concurrent Managers
running on that node will eventually be forced to shut down.
In cases where multiple Concurrent Processing nodes are being used, and these other nodes retain their database connection,
the managers will migrate to the working nodes. In the case where only a single Concurrent Processing node is being used, or
when all Concurrent Processing nodes lose their database connection (for example if the database node suffers a network
failure), all running Concurrent Managers on the entire instance will be forced to shut down.
Without this feature, when the network comes back up, the managers must be restarted manually, as there is no automatic
restart facility. This can lead to lost productivity between the time the network is restored and when the managers are
restarted.
With this new feature, the Concurrent Managers will restart automatically as soon as connectivity is restored. To achieve this,
when a connection failure situation arises, a new monitor process, the Reviver, is started. This process will remain alive until
it is able to obtain a database connection and restart Concurrent Processing.
In addition, this allows the Applications System Administrator to maintain control over Concurrent Processing even when a network or database failure has brought down Concurrent Processing. When the connection is down, an administrator can still start CP using the adcmctl.sh script, and by doing so will start a Reviver process.
When Concurrent Processing is down and a Reviver process is actively waiting to restart Concurrent Processing, the
adcmctl.sh script can be used to stop Concurrent Processing, as it will detect the Reviver and shut it down.
There is no additional setup required to use Connection Failure Recovery. If you wish to disable Connection Failure
Recovery you can do so by setting the Concurrent Processing Reviver Process context file variable to “Disabled”.
REVIVER
[Flow diagram: ICM and Reviver process flows — when the ICM loses its database connection it spawns the Reviver; the Reviver loops attempting to get a database connection, sleeping between attempts, then kills the previous database session and does not exit until a new ICM has started]
As part of its shutdown process, the ICM will detect that it is being forced to shut down due to losing its database connection. This is done by looking for the specific error messages ORA-3113, ORA-3114, or ORA-1041. If one of these errors is detected:
• The ICM will assume that it has lost its database connection and will spawn the reviver process.
The ICM will pass the Apps username/password to the script using a secure protocol, along with the Oracle session id of the current ICM process. When the script starts, it will attempt to make a database connection using sqlplus. If unsuccessful, it will sleep for 30 seconds before trying again. It will continue this until it either successfully makes a connection or it receives a signal to shut itself down.
When it successfully makes a connection, it will first kill the old ICM database session to make sure any locks are released,
then start a new ICM using the normal startmgr script. It then checks to make sure an ICM is successfully running; it will
not exit until a new ICM is running.
Once the ICM is restarted, it will start up any other managers that had been shut down and normal processing will resume.
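The retry loop described above can be pictured with a small shell sketch (hypothetical; this is not the reviver script Oracle ships):
while :; do
  if echo 'SELECT 1 FROM dual;' | sqlplus -s apps/<apps_password> > /dev/null 2>&1; then
    break    # connection restored; the real script would now kill the old
  fi         # ICM session and restart the managers via startmgr
  sleep 30   # wait 30 seconds between attempts, as described above
done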
PCP Failover
Failover is the process of migrating the Concurrent Managers from the Primary Node to the Secondary Node because of a
concurrent processing tier failure or listener failure.
Failback is when the Primary Node becomes available again and the Concurrent Managers need to migrate back to their
original Primary Node.
Primary Node = HOST1 – The Managers assigned to the Primary Node are ICM (FNDLIBR-cpmgr), and FNDCRM
Secondary node = HOST2 – The Manager assigned to the Secondary Node is Standard Manager (FNDLIBR)
When HOST1 becomes unavailable (this means TCP is no longer working), both the ICM and FNDCRM are migrated to
HOST2. This can be seen from the Administer Concurrent Manager screen in the System Administrator Responsibility.
The $APPLCSF/log/.mgr logfile shows that HOST1 is being added to the unavailable list.
On HOST2, after the PMON cycle, FNDICM, FNDCRM, and FNDLIBR are now migrated and running.
FNDSM is not a persistent process; FNDIMON is a persistent process local to each node.
Be aware that if a TCP failure is not detected, failover will not occur. The following excerpt from a Concurrent Manager log
shows the case where a failure is detected:
The PingProcess calls at the end of this log continue until the concurrent manager processes resume, or a TCP failure is detected and failover begins.
ICM Failover in Release 11i
• ICM and IM use the DCD functionality of the Network (TCP sqlnet).
• ICM is a client process connected to a DCD enabled DB dedicated server process.
• ICM holds the named PL/SQL Lock, the “ICM lock”.
• IM is continuously trying to check whether it can get the same named PL/SQL Lock.
• As soon as the “ICM lock” is released by the DB / DCD from the ICM crash, FNDIMON pings the ICM node, and
the IM deduces that the ICM has crashed.
• The DCD comes into play after the ICM has crashed and the DB needs to identify that the ICM is gone.
• Then, the DB needs to clean up the dedicated server process resource corresponding to the ICM client process.
• If the “ICM lock” is not available, FNDIMON will now ping the node of the ICM.
• If the ping succeeds, we conclude that the ICM is fine.
o Obviously, the ICM can be down even if TCP is working, so this is bad logic.
• If the ping fails, we further check if it has been over four PMON cycles since the ICM updated the WORK_START column in the FND_CONCURRENT_QUEUES table.
• If it has been more than four PMON cycles, we conclude that the ICM is dead.
• Failover is triggered when the node running the ICM goes down.
• The ICM going down leads to the connected database server process clearing its resources (including the named PL/SQL lock).
• In turn, the database server process cleanup is dependent on the DCD mechanism of the network (SQL*Net).
• That is, SQL*Net determines through the DCD mechanism that the connected client has closed down, and triggers the database server process cleanup.
11i PCP Failure
The following steps occur in the order indicated:
• TCP Failure
• ICM Lock is released, FNDIMON pings ICM node, if ping fails, check PMON
• PMON detects a “dead process”, crashed ICM
• reviver.sh
• DCD
R12 PCP Failure
• TCP Failure
• PMON detects a “dead process”
• ICM Shutdown
o Look for error messages ORA-3113, ORA-3114 or ORA-1041
• reviver.sh
• DCD
Test PCP Failover Components
This test explores the effect of the DCD, PMON, and TCP failover methods.
Variables: sqlnet.expire_time, PMON sleep and number of cycles, and the following TCP keepalive parameters:
• tcp_keepalive_time
• tcp_keepalive_intvl
• tcp_keepalive_probes
• tcp_retries1 (default: 3, new value 2)
• tcp_retries2 (default: 15, new value 2)
• tcp_syn_retries (default: 5, new value 2)
Failover /    Expire_time  PMON     PMON    tcp_KA  tcp_KA  tcp_KA  tcp_      tcp_      tcp_syn_
Failback      (minutes)    Sleep    Cycles  time    intvl   probes  retries1  retries2  retries
(seconds)
241 / -       1            30 secs  4       200     20      2       3         15        5
250 / 50      5            30 secs  4       200     20      2       3         15        5
262 / 100     10           30 secs  4       200     20      2       3         15        5
300 / 75      1            15 secs  2       200     20      2       3         15        5
285 / 35      10           30 secs  4       1000    60      10      3         15        5
8 / 105       1            30 secs  4       1000    60      10      2         2         2
10 / 42       1            30 secs  4       200     20      2       2         2         2
7 / 40        10           30 secs  4       200     20      2       2         2         2
6 / 34        1            15 secs  2       200     20      2       2         2         2
Test the Failover and Failback of Parallel Concurrent Processing
In Figure 20, Oracle Applications Manager (OAM) shows the details of the Internal Concurrent Manager (ICM) activated on RH9:
Figure 20
In Figure 21, the ICM, CRM and Standard Managers all have their Primary Node as RH9.
Figure 21
In Figure 22, we can see that the Standard Manager is configured to failover to the Secondary Node RH7:
Figure 22
The concurrent manager log shows the ICM terminating when RH9 goes down:
Review concurrent manager log file for more detailed information. : 12-JAN-2009 15:22:55 -
The VIS_0112@VIS internal concurrent manager has terminated with status 1 - giving up.
[Diagram: nodes RH7 and RH9, each with a SQL*Net client, connecting to the database and its listener]
In Figure 23, OAM shows that node RH9 is down, as well as all the application services on RH9.
Figure 23
In Figure 24, the Conflict Resolution Manager is shown as down.
Figure 24
The ICM tries to restart the CRM and other failed processes, but cannot.
If we run the command ps -ef | grep applvis, we can see defunct processes:
The CRM and two other FNDLIBR processes are shutting down, but the FNDSM is still running. The ICM is still running in another FNDLIBR process, shown below:
RH9 is shown as down, TCP is disconnected, and the Internal Manager has failed over to RH7, as shown in Figure 25:
Figure 25
Figure 26
In Figure 27, the Concurrent Managers have started processing Concurrent Requests on the Secondary Node, RH7:
Figure 27
Figure 28 shows the Oracle Applications Manager screens with RH7 activated:
Figure 28
It is important to note that, unlike Release 11i, Release 12 doesn’t fail over a manager if there is no Secondary Node defined.
In Figure 29, only the Session History Cleanup, Standard Manager and WMS Task Archiving Manager have Secondary
Nodes defined. In this case, the Primary Node is RH9 and the Secondary Node is RH7.
Figure 29
ICM Failover
Figure 30 shows the Internal Manager process migrating back to the Primary Node, RH9.
Figure 30
In Figure 31, the Internal Manager is up for RH9 and the Conflict Resolution Manager is starting up on RH9:
Figure 31
Figure 32
Failover of the ICM and CRM from RH9 to RH7 is complete. In the next section the TCP connection is restored and the failback from RH7 to RH9 is documented.
Start of Failback
End of Failback
Figure 33
Target Nodes
Using the Services Instances page in Oracle Applications Manager (OAM) or the Administer Concurrent Managers form,
you can view the Target Node for each Concurrent Manager in a parallel concurrent processing environment.
The Target Node is the node on which the processes associated with a Concurrent Manager should run. It can be the node that is explicitly defined as the Concurrent Manager's Primary Node in the Concurrent Managers window, or the node assigned by the Internal Concurrent Manager if no Primary Node is defined.
Figure 34
If you have defined Primary and Secondary Nodes for a manager, then when its Primary Node and ORACLE instance are
available, the Target Node is set to the Primary Node. Otherwise, the Target Node is set to the manager's Secondary Node (if
that node and its ORACLE instance are available). During process migration, processes migrate from their current node to
the Target Node.
Control Across Nodes
Using the Application Services category on the Site Map page in Oracle Applications Manager or the Administer Concurrent
Managers form, it is possible to start, stop, abort, restart, and monitor Concurrent Managers and Internal Monitor Processes
running on multiple nodes from any node in your parallel concurrent processing environment.
Figure 35
Figure 36 shows that it is not necessary to log onto a node to control concurrent processing on it. It is possible to terminate the Internal Concurrent Manager or any other Concurrent Manager from any node in your parallel concurrent processing environment using Oracle Applications Manager:
Figure 36
Figure 37
Start up parallel concurrent processing by running the adcmctl.sh script from the operating system prompt, as shown
below:
The Internal Concurrent Manager starts up on the node where the adcmctl.sh script is run. If it is assigned to a different node, the ICM will migrate to its Primary Node when that node is available.
After the Internal Concurrent Manager starts up, it starts all the Internal Monitor Processes and all the Concurrent Managers.
It attempts to start Internal Monitor Processes and Concurrent Managers on their Primary Nodes, and resorts to a Secondary
Node only if a Primary Node is unavailable.
logfile=/VIS/logs/apps/log/VIS_0815.mgr
PRINTER=noprint
mailto=VIS
restart=N
diag=Y
sleep=15
pmon=4
quesiz=1 (default)
Figure 38
In Figure 39, the defaults for the PMON settings are initially displayed:
Figure 39
Figure 40 shows that you can change the Sleep Interval to 15 seconds and keep the PMON cycles at 4. This should recognize
a failure 1 minute after TCP finds a “dead peer”.
Figure 40
Once you’ve saved your changes, Figure 41 shows a screen that confirms that you made changes:
Figure 41
Make sure the PMON changes are made in the $FND_TOP/bin/batchmgr.sh file.
# FILENAME
# batchmgr
# DESCRIPTION
# fire up Internal Concurrent Manager process
# USAGE
# batchmgr arg1=val1 arg2=val2 ...
#
# Parameters may be sent via the environment.
#
# ARGUMENTS DEFAULT
# [appmgr|sysmgr]=username/password
# [sleep=sleep_seconds] 15
# [mgrname=manager_name] icm
# [logfile=log_filename] $FND_TOP/$APPLLOG/$mgrname.mgr
# [restart=N|mim minutes between restarts] N
# [mailto="user1 user2..."] current user
# [PRINTER=printer_name]
# [pmon=iterations] 4
# [quesiz=pmon_iterations] 1
# [diag=Y|N] N
#
# SYSMGR holds the Oracle user as whom the manager should run
# and its password.
#
# SLEEP holds the number of seconds that the manager should wait
# between checks for new requests.
#
# MGRNAME is the name of the manager for locking and log purposes.
#
# LOGFILE is a filename in which the manager's own log is stored.
#
# RESTART is set to N if the manager should not restart itself after
# a crash. Otherwise, it is an integer number of minutes. The
# manager will attempt a restart after an abnormal termination
# if the past invocation lasted for at least RESTART minutes.
#
# MAILTO is a list of users who should receive mail whenever
# the manager terminates.
##
# PMON is the duration of time between process monitor
# checks (checks for failed workers). The unit of time
# is concurrent manager iterations (request table checks).
#
# QUESIZ is the duration of time between worker quantity
# checks (checks for number of active workers). The unit
# of time is process monitor checks.
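One way to confirm the values the script is actually using (a sketch; the parameter names follow the header above):
grep -nE '(sleep|pmon|quesiz)=' $FND_TOP/bin/batchmgr.sh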
Concurrent Processing is typically started from the command line by using one of these start scripts, startmgr.sh or
adcmctl.sh:
startmgr.sh
• Schema logon is passed using sysmgr parameter
• Apps logon may be passed using appmgr parameter
• Apps user must have System Administrator responsibility
The startmgr.sh script accepts the schema logon when passed as the sysmgr parameter. It will now also accept an Applications user sign-on via the appmgr parameter. Note that the Applications user must have the System Administrator responsibility in order to be able to successfully start Concurrent Processing.
adcmctl.sh
The adcmctl.sh script is more commonly used. It will accept a single username/password combination. There is a context
file variable that determines whether this script expects a schema logon or an Applications logon. By default the schema
logon is expected.
To start using the Application Sign On instead, edit the context file variable Concurrent Processing Password Type and set its
value to AppsUser. Then run autoconfig to regenerate the adcmctl.sh script. The script will then begin to expect an
Applications Username and Password.
For this example we will use the Concurrent Program FNDSCARU, the schema logon apps/appspass, and the Applications User logon User/UserPass. Previously, to submit a request to run FNDSCARU using CONCSUB, you would run the CONCSUB program from the command line as shown here.
Now you can choose to authenticate instead using an Applications username and password. To do so, in place of the schema
logon you should specify Apps:User as shown here. This indicates that an Applications User Sign On will be used. Then for
the Applications username parameter you should append the corresponding password. If you pass the Apps:User parameter
but do not supply a password for your specified Applications username, you will be prompted to enter the password.
Functional Security is enforced for Request Submission. After the Applications username and password is authenticated,
CONCSUB will verify that the user has the appropriate permission to submit the Concurrent Request. If the security check
fails, an error message will be printed to the screen.
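A sketch of the two invocations described above (the responsibility key SYSADMIN / "System Administrator" and the exact argument order are assumptions based on common CONCSUB usage, not taken from the original screenshots):
# Traditional schema logon
CONCSUB apps/appspass SYSADMIN "System Administrator" SYSADMIN \
        WAIT=N CONCURRENT FND FNDSCARU
# Applications User sign-on: Apps:User replaces the schema logon, and the
# password is appended to the Applications username
CONCSUB Apps:User SYSADMIN "System Administrator" User/UserPass \
        WAIT=N CONCURRENT FND FNDSCARU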
Shutting Down Managers
You shut down parallel concurrent processing by issuing a "Stop" command in the OAM Service Instances page or a
"Deactivate" command in the Administer Concurrent Managers form. All Concurrent Managers and Internal Monitor
processes are shut down before the Internal Concurrent Manager shuts down.
After the failover test, sometimes the services would not fail back to RH9. Figure 42 shows the OAM Dashboard and indicates that RH9 and the applications services are unavailable. Remember, the test pulls the TCP cable from the host.
Figure 42
In order to restart the services on RH9, first stop all the services on RH9 with:
adstpall.sh apps/apps (sometimes a kill -9 -1 is necessary as the APPLMGR user)
By stopping the services, GSM is able to restart the services, except for Concurrent Processing, which was stopped.
Figure 43
Figure 44
Then:
• Log files go to: /d01/oracle/VIS/inst/apps/VIS_rh9/logs
• Out files go to: /d01/oracle/VIS/inst/apps/VIS_rh9/out
If $APPLCSF is not set, the files are placed under the product top of the application associated with the request. For example, a PO report would go under $PO_TOP/$APPLLOG and $PO_TOP/$APPLOUT.
All these directories must exist and have the correct permissions.
All concurrent requests produce a log file, but not necessarily an output file.
Concurrent Manager log files follow the same convention, and will be found in the $APPLLOG directory.
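A sketch for finding recent request logs and outputs (assumes the common l<request_id>.req / o<request_id>.out naming convention):
# Most recent request log and output files under the shared $APPLCSF
ls -lt $APPLCSF/$APPLLOG/l*.req | head
ls -lt $APPLCSF/$APPLOUT/o*.out | head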
Concurrent Processing Tables
Major tables that contain information about concurrent processing include FND_CONCURRENT_REQUESTS, FND_CONCURRENT_QUEUES, FND_CONCURRENT_PROGRAMS, and FND_CONCURRENT_PROCESSES.
Load Balancing
Parallel Concurrent Processing has many benefits. Key among these is its capability to provide failover in case of node
failure. When a node fails, the processes that were running on that node are restarted on Secondary Nodes (as defined by the System Administrator). This helps maintain throughput and keep the business running during node failures.
However, a resource-intensive node (one with many processes) may inadvertently overtax the system when it fails over. A Secondary Node may not be able to handle its normal workload plus the additional burden of managers/processes from a failed node.
If too many processes are running on the Secondary Node when the Primary Node fails over, the Secondary Node may not have the capacity to process the requests from the additional Concurrent Managers.
Release 12 introduces Failover Sensitive Workshifts. This enhancement allows the System Administrator to configure how
many processes failover for each workshift. With this added control, Applications System Administrators are able to enjoy
the benefits of PCP failover while reducing the risk of performance issues through overloaded resources.
Figure 45
Processing capabilities during failover may be severely degraded on the remaining hosts, unless failover processes are
restricted.
A host may be considered underutilized if its CPU utilization is less than 70%. A typical production environment may have two application tiers, each running Apache, Forms, and Concurrent Processing. Each node supports half the JSP and Forms users and half the Concurrent Requests, and has 70% average CPU utilization.
Release 11i has no mechanism for decreasing the number of processes a manager can run during a failover. It is clearly not possible to process 140% of the workload on the single remaining apps tier.
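To make the arithmetic concrete (using the utilization figures above):
2 nodes x 70% CPU = 140% of one node's capacity
1 surviving node = 100% capacity, leaving 40% of the workload unserved
To absorb a full failover: 2 nodes x 50% = 100%, or 2 x 35% = 70% with headroom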
Figure 46
Clearly, in order to keep a Release 11i or Release 12 system running during a failover, there are two choices:
• Run the servers at 35% or less utilization
• Reduce the number of processes that are allowed during failover
Figure 47
Conversely, if a failover occurs from node 1 to node 2, we may want to reduce the failover processes; however, this doesn’t work. The “failover processes” setting takes effect only if the node fails.
Figure 48
Figure 49
By defining specialized managers, it’s possible to direct concurrent requests to a specific concurrent processing node by defining the Primary/Secondary Node. Specialization rules allow requests to be excluded from managers and included in the appropriate manager at the Application level.
Related module requests should be directed to a specialized Concurrent Manager. This manager can have a Primary concurrent processing node that will use SQL*Net to direct the database traffic to a related node in the RAC cluster.
Quick note: It seems a little silly to go to all the trouble to create the RAC cluster and then figure out ways to direct traffic to
a specific node. Why not just get a bigger, monolithic, SMP machine for the database server?
For a more complete, serious discussion, please refer to Optimizing the E-Business Suite with Real Application Clusters
(RAC) by Ahmed Alomari.
References
249213.1 - Performance problems with Failover when TCP Network goes down
364171.1 - TAF Session Hangs, Select Fails To Complete W/ Loss Of NIC: Tune TCP Keepalive
211362.1 - Process Monitor Session Cycle Repeats Too Frequently
291201.1 - How To Remove a Dead Connection to the Target Database
362135.1 - Configuring Oracle Applications Release 11i with Oracle10g Release 2 Real Application Clusters and
Automatic Storage Management
Optimizing the E-Business Suite with Real Application Clusters (RAC) - Ahmed Alomari
240818.1 - Concurrent Processing: Transaction Manager Setup and Configuration Requirement in an 11i RAC
Environment
R12 ATG - Concurrent Processing Functional Overview – Aaron Weisberg
210062.1 - Generic Service Management (GSM) in Oracle Applications 11i
271090.1 - Parallel Concurrent Processing Failover/Failback Expectations
241370.1 - Concurrent Manager Setup and Configuration Requirements in an 11i RAC Environment
602899.1 - Some More Facts On How to Activate Parallel Concurrent Processing