Before deploying a JSLEE service in a production environment (such as an operator's live network), developers should: make sure it fulfills all functional requirements, run it through a series of tests in an environment similar to production, and prepare common deployment procedures.
Read before executing
This article presents tasks that development and test teams should consider before signing off a service for deployment. These may be especially useful when preparing development and test plans.
This article assumes that developers and testers have a good knowledge of JSLEE and Rhino. Its scope is to discuss and suggest best practices that may be overlooked or undervalued, rather than to comprehensively list all points to consider before deploying a service. For example, it does not address service functional behavior or any other service specifics.
Pre-deployment considerations in developing a service include making sure you:
Validate input — Have your service validate any input before accepting or using it. Otherwise, the input's effect on the service can be unpredictable, potentially compromising the normal behavior of other services and of the SLEE itself. (Input can be invalid for many reasons, including an erroneous or malicious state in an external component or in the layer underneath — a resource adaptor in this context.)
Implement tracing — Implement tracing (or logging when not associated with a JSLEE component) as an essential process to record the normal behavior of a service, and keep track of problems to analyze later.
Manage transactions — Understand the concept of transactions managed by the SLEE, and the standard set of transactions associated with executing a JSLEE service. These are particularly useful for dealing with failure scenarios, such as a resource or network failure. Implement the necessary methods and logic to fail gracefully, recover when possible from the failing state, and free acquired resources.
Implement service management — Implement service management by using the facilities and features provided by the SLEE for administrators to manage and solve problems involving the service.
To validate input, follow these best practices:
Make sure the service validates the values of all events and inputs
Unless the resource adaptor contract clearly states the allowed values of the events it fires, service developers should not make any assumptions about what they may be. Therefore, services should validate received events for non-existent and null values, as well as the type of each value.
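As a sketch, a service might centralize such checks in a validator it runs before acting on an event. The `CallEvent` fields and the numeric-address rule below are hypothetical stand-ins for whatever the resource adaptor actually delivers:

```java
public class EventValidator {
    // Hypothetical event for illustration; real events come from the
    // resource adaptor's own event classes.
    public static class CallEvent {
        public final String callingParty;
        public final String calledParty;
        public CallEvent(String callingParty, String calledParty) {
            this.callingParty = callingParty;
            this.calledParty = calledParty;
        }
    }

    /** Returns true only when every required field is present and well-formed. */
    public static boolean isValid(CallEvent event) {
        if (event == null) return false;
        if (event.callingParty == null || event.calledParty == null) return false;
        // Reject empty or non-numeric addresses rather than assuming the
        // resource adaptor guarantees their format.
        return event.callingParty.matches("\\+?[0-9]+")
            && event.calledParty.matches("\\+?[0-9]+");
    }
}
```

Rejecting an invalid event at this single point keeps the assumption about allowed values in one place, rather than scattered through event handlers.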
Make sure the service validates the context of all received events
An unexpected event received out of order can have various critical and unpredictable consequences, such as a service's state machine transitioning to an invalid state. Therefore, services should validate that each event is received in the correct order — and transition to a "failure state" if not. In the "failure state", the session should gracefully finish (see "Manage Transactions" below).
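A minimal sketch of this pattern, with hypothetical session states and events, where any out-of-order event transitions the session to a failure state instead of continuing unpredictably:

```java
public class SessionStateMachine {
    public enum State { IDLE, RINGING, CONNECTED, FAILED, FINISHED }
    public enum Event { INVITE, ANSWER, BYE }

    private State state = State.IDLE;

    public State getState() { return state; }

    /** Applies an event; any event received out of order moves the session
     *  to FAILED so it can be finished gracefully. */
    public State on(Event event) {
        switch (state) {
            case IDLE:      state = (event == Event.INVITE) ? State.RINGING   : State.FAILED; break;
            case RINGING:   state = (event == Event.ANSWER) ? State.CONNECTED : State.FAILED; break;
            case CONNECTED: state = (event == Event.BYE)    ? State.FINISHED  : State.FAILED; break;
            default:        state = State.FAILED; // no events expected after a terminal state
        }
        return state;
    }
}
```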
To implement tracing, follow these best practices:
Assign correct tracing levels
Assigning wrong tracing levels can create logs that are too verbose or that list messages out of context. A service in production, when processing an event, should trace between one and three messages at Info level. The Warning and Severe levels should always be activated in production, and should be used to trace non-critical and critical errors, respectively. The remaining levels should be deactivated in production, and should be used to trace configuration and debug information.
Guard trace calls
Not guarding trace calls can have a considerable impact on performance, especially when the service is rich in debugging trace messages. Guard all trace calls made at levels that are not active by default in the production environment. As a best practice, guard both the calls and the message construction for traces at the levels Config, Fine, Finer, and Finest.
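The guard pattern can be sketched as follows. The `Tracer` class below is a stand-in, not the real SLEE tracing facility (though `javax.slee.facilities.Tracer` offers similar level checks); a counter is added only to make the avoided message construction observable:

```java
public class GuardedTracing {
    /** Minimal stand-in for a SLEE tracer with a per-level enabled flag. */
    public static class Tracer {
        private final boolean finestEnabled;
        public int messagesBuilt = 0; // instrumentation for this sketch only
        public Tracer(boolean finestEnabled) { this.finestEnabled = finestEnabled; }
        public boolean isFinestEnabled() { return finestEnabled; }
        public void finest(String msg) { /* would write to the log */ }
    }

    /** The guard ensures the (potentially expensive) message string is only
     *  built when the Finest level is actually active. */
    public static void traceEvent(Tracer tracer, Object event, String sessionId) {
        if (tracer.isFinestEnabled()) {
            tracer.messagesBuilt++;
            tracer.finest("session=" + sessionId + " received event " + event);
        }
    }
}
```

With the guard in place, a production system running with Finest disabled never pays the cost of concatenating the message.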
Contextualize trace messages
Trace messages that do not expose the current context are not useful. When possible and necessary, include in the message the received event, the state of the service, the requested action, and any other information that may help explain the current state of the service and of that session.
To manage transactions, follow these best practices:
Validate the implementation of the method sbbExceptionThrown
The SLEE calls this method if an SBB throws an uncaught exception. SBB developers should implement this method to trace the context in which the exception was thrown, analyze the exception, and prepare the transaction to be rolled back.
Validate the implementation of the method sbbRolledBack
The SLEE calls this method if the service or the SLEE marks the transaction to be rolled back. SBB developers should implement this method to trace the rollback, try to recover to a correct state, and free or end any resources acquired or started during the transaction. In various scenarios, the SLEE activity should be ended or the SBB entity should detach itself from the activity (so the SLEE can remove the SBB entity).
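A sketch of both callbacks, using minimal stand-in types rather than the real `javax.slee` interfaces (the actual signatures take an `ActivityContextInterface`, and the rollback callback receives a `RolledBackContext`); the stand-ins only make the expected effects testable:

```java
public class TransactionCallbacks {
    /** Stand-in for SbbContext, which lets an SBB mark its transaction for rollback. */
    public static class SbbContext {
        public boolean rollbackOnly = false;
        public void setRollbackOnly() { rollbackOnly = true; }
    }
    /** Stand-in for an activity context the SBB entity is attached to. */
    public static class ActivityContext {
        public boolean attached = true;
        public void detach() { attached = false; }
    }

    private final SbbContext context;
    public TransactionCallbacks(SbbContext context) { this.context = context; }

    /** Trace the failure context and prepare the transaction to roll back. */
    public void sbbExceptionThrown(Exception e, Object event, ActivityContext aci) {
        // ... trace e, event, and current state here ...
        context.setRollbackOnly();
    }

    /** On rollback: trace it, free resources, and detach from the activity
     *  so the SLEE can remove the SBB entity. */
    public void sbbRolledBack(ActivityContext aci) {
        if (aci != null) aci.detach();
    }
}
```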
Implement service management
To implement service management, follow these best practices:
Create usage parameters
Usage parameters associate counters with actions a service performs, such as forwarding or rejecting a call. They are useful for statistics and understanding a service's usage. By counting a service's internal actions, administrators can keep track of its deployment stability and performance; for example, noting how long an external component takes to answer a request or how many retries the service has to send.
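As a sketch, a usage-parameters interface typically declares one increment method per counter. The counter names below are hypothetical, and in the SLEE the implementation is generated from the deployment descriptor rather than hand-written; an in-memory version is shown only to make the counting behavior concrete:

```java
public class UsageParametersDemo {
    /** SLEE-style usage-parameters interface: one increment method per counter. */
    public interface CallUsageParameters {
        void incrementCallsForwarded(long delta);
        void incrementCallsRejected(long delta);
    }

    /** Hand-written stand-in for the SLEE-generated implementation. */
    public static class InMemoryUsage implements CallUsageParameters {
        public long callsForwarded, callsRejected;
        public void incrementCallsForwarded(long delta) { callsForwarded += delta; }
        public void incrementCallsRejected(long delta) { callsRejected += delta; }
    }
}
```

An SBB would call the increment methods at the point where it performs the corresponding action, and administrators would read the counters through the management interface.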
Create alarms
Tracing doesn't prompt immediate action from administrators — it's better suited to errors that can be analyzed later. Alarms, however, alert administrators to issues needing immediate action, such as when an external component is not returning expected responses or when the service cannot perform an action without administrator intervention.
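The alarm life cycle (raise on failure, clear on recovery) can be sketched with a stand-in facility; this is not the real SLEE AlarmFacility API, just an illustration of the pattern of keeping an ID for each raised alarm so it can be cleared once the condition resolves:

```java
import java.util.HashMap;
import java.util.Map;

public class AlarmSketch {
    /** Stand-in alarm facility: raising returns an alarm ID used later to
     *  clear the alarm when the underlying condition is resolved. */
    public static class AlarmFacility {
        private final Map<String, String> active = new HashMap<>();
        private int next = 0;

        public String raiseAlarm(String type, String message) {
            String id = "alarm-" + (next++);
            active.put(id, type + ": " + message);
            return id;
        }
        public boolean clearAlarm(String id) { return active.remove(id) != null; }
        public int activeCount() { return active.size(); }
    }
}
```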
Create rate-limiting endpoints
Rhino SLEE lets service developers implement custom rate-limiting endpoints. These let you configure the rate at which events or messages can be received or sent (and other actions performed) in the SLEE. Rate limiting can prevent overloading the SLEE or external components. Implement custom endpoints so you can configure, at runtime, the rate at which your service performs those actions.
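Rhino's own rate-limiting API is not shown here; as a generic illustration of the underlying idea, a token bucket limits the rate at which actions are allowed while tolerating short bursts up to its capacity:

```java
public class TokenBucket {
    private final double capacity;    // maximum burst size, in tokens
    private final double refillPerMs; // configured rate, converted to tokens/ms
    private double tokens;
    private long lastMs;

    public TokenBucket(double capacity, double refillPerSecond, long nowMs) {
        this.capacity = capacity;
        this.refillPerMs = refillPerSecond / 1000.0;
        this.tokens = capacity;
        this.lastMs = nowMs;
    }

    /** Returns true if the action is within the configured rate; the caller
     *  passes the current time so the logic stays deterministic and testable. */
    public synchronized boolean tryAcquire(long nowMs) {
        tokens = Math.min(capacity, tokens + (nowMs - lastMs) * refillPerMs);
        lastMs = nowMs;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false; // over the limit: reject, queue, or shed the work
    }
}
```

Exposing `capacity` and `refillPerSecond` through the service's configuration gives administrators the runtime control the text describes.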
Create a management client
The Rhino SLEE's management interface includes access to a service's usage parameters and alarms, as well as various operations in the SLEE. The SLEE exposes all management operations through its management interface, which you can also integrate with the operations and management platform of your network (as required — otherwise administrators can just use the Rhino command-line console and the Rhino web console).
Pre-deployment considerations in testing a service include making sure you:
Use real-world data — Perform functional and acceptance tests with real-world data (or data as close to real-world as you can get).
Check performance — Use performance and benchmarking tests to validate how fast the SLEE works, and what resources you need (such as CPU type and nodes) to handle each required load and latency.
Check failure — Test the platform deployment configuration and the service for all kinds of possible failures, of a SLEE node and of components that the SLEE integrates with. These tests validate session and service availability requirements, as well as whether the service handles the error scenario correctly.
Verify interoperability — Perform all testing using the exact same components as production systems, especially when using components external to the SLEE. (A common pitfall is performing development and integration tests with slightly different versions of components in production.)
Use real-world data
To test the service with real-world data, follow this best practice:
Perform tests with real-world data
A common mistake is to use data created specifically for functional and acceptance tests, such as subscribers' profiles. Although there is nothing wrong with this approach (using well-known scenarios), you should make sure that all tests pass with real-world data. Use as much real-world data as possible to emulate the production environment.
To check performance, follow these best practices:
Perform a stress test
Stress tests verify how much load the system can handle and how fast it is. They are also particularly useful for tuning the service (and the SLEE), and for discovering fast-failing resources, memory leaks, and race conditions. Perform stress tests by simulating loads in the system, and include ramp-ups and ramp-downs of traffic (for instance, to simulate busy hours).
Perform a long-running test
The main goal of long-running tests is to validate execution of a service over a long period of time. To do this, generate a load in the SLEE over several days or even weeks. This test can be particularly useful to detect slow-failure resources and memory leaks in the service.
To check failure, follow these best practices:
Check graceful recovery of Rhino node failure
The service and the Rhino cluster should recover gracefully from a Rhino node failure (such as a CPU or network problem). If the service is running in fault-tolerant mode, the SLEE replicates its sessions to other nodes, so only sessions being set up should be lost. When not running in fault-tolerant mode, but running in high-availability mode, the service should remain available on the remaining nodes. Simulate a failure in a Rhino node or in the network while creating different levels of load, to validate that the service behaves as expected, without performance degradation, and that the SLEE redirects new sessions to running nodes.
Check graceful recovery of external component failure
Session and service availability may depend on the availability of external components that the platform integrates with. The service and the platform should handle and try to recover from any external failure by, for example, failing over gracefully to another component and/or generating an alarm. While creating different levels of load, simulate the failure of external components. The service should handle the failure and remain available if possible, or it should become available again when the external component recovers. If applicable, the platform or the service should also fire an alarm. This test is particularly useful to validate the service-transactional model and make sure resources are freed in error scenarios.
To verify interoperability, follow this best practice:
Use OpenCloud simulators
OpenCloud simulators tend to use the same code base as production versions, so integration problems when using them are uncommon.
Pre-deployment considerations in deploying a service include making sure you:
Implement deployment processes — Define and develop a set of common deployment processes. Things to take into account include packaging and deployment automation. It should be possible to automate processes such as deployments and upgrades.
Implement deployment processes
To implement deployment processes, follow these best practices:
Create the packaging structure
As a convention, define each SLEE component in its own deployment unit, so you can deploy and undeploy them individually. Use standard tools to create a service's deployment units. Also, make sure build scripts are version agnostic, so the versions of each component can be easily set.
Automate regular processes
Production environments have common sets of processes performed regularly. Automating them makes tasks execute faster and avoids human error. These include offline and online upgrades and downgrades of some or all SLEE components associated with a service. Provisioning processes and configuration changes should also be automated. Use tools such as Ant scripts to automate and validate these processes.