Building a scalable Web-based Call Center CTI Solution
My project was part of our customer’s effort to replace all of its enterprise applications with web applications based on a standardized technology stack. In this strategic move, the call center integration was a crucial step. As it turned out, the technical design of the new call center telephony solution was quite challenging. Not only did we learn a lot about CTI (computer telephony integration); we also had to make the system scalable enough to handle more than 1000 call center agents.
The call center agents were to use mostly the standard web applications, with an additional telephony control that allows them to accept incoming calls, disconnect calls, or make consultation calls to other agents or supervisors.
An incoming call
Let’s have a look at the most important use case first: an incoming call.
The following diagram gives an overview of the flow of events before the agent’s telephone rings:
The incoming call of the customer is handled by the PBX (Private Branch Exchange). When the agent finally takes the call, a lot of information about the customer has already been collected. In most cases, the caller will already have gone through an interactive voice response system that has collected his account number and verified his PIN (omitted from the picture above).
This is what the agent screen might look like after the agent has taken the call:
The box on the left contains the telephony controls. They are embedded in an iframe and allow the agent to disconnect the call or place consultation calls to other agents. The telephony controls send commands to the gateway (and by extension to the PBX) and receive asynchronous events.
CSTA as a Model for our Protocol
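CSTA (Computer Supported Telecommunications Applications) is an ECMA standard for computer telephony integration that distinguishes between services and events. Services are commands to the telephone system. An outgoing call (from any device within the domain of the switch) is initiated by a Make Call Service. But there are also services like Set Agent State. Events are the asynchronous notifications that travel in the other direction.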
CSTA is extremely comprehensive; we used only a small selection of its services and events. It is also easy to extend — the transfer of non-standardized key/value pairs within the data part of services and events is explicitly provided for.
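To make the service/event split and the key/value extension point concrete, here is a minimal sketch of what a CSTA-inspired message model could look like. The JSON encoding, the field names (`params`, `privateData`) and the sample values are assumptions for illustration only; they are not the project's actual wire format, and real CSTA is specified in ASN.1 and XML.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Service:
    """A command sent towards the telephone system (hypothetical shape)."""
    name: str                      # e.g. "MakeCall" or "SetAgentState"
    device: str                    # the agent's extension
    params: dict = field(default_factory=dict)
    private_data: dict = field(default_factory=dict)  # non-standardized key/value pairs

    def to_json(self) -> str:
        return json.dumps({
            "service": self.name,
            "device": self.device,
            "params": self.params,
            "privateData": self.private_data,
        }, sort_keys=True)


@dataclass
class Event:
    """An asynchronous notification coming from the telephone system."""
    name: str
    device: str
    private_data: dict = field(default_factory=dict)


def parse_event(raw: str) -> Event:
    obj = json.loads(raw)
    # privateData is optional; standard events may not carry any extras.
    return Event(obj["event"], obj["device"], obj.get("privateData", {}))


make_call = Service("MakeCall", "4711", {"calledNumber": "4712"})
delivered = parse_event(
    '{"event": "Delivered", "device": "4711",'
    ' "privateData": {"accountNumber": "12345"}}'
)
```

Keeping the extra data in a separate map means the standard part of a message stays recognizable to anyone who knows CSTA, while customer-specific values such as an account number collected by the IVR can travel alongside it.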
The customer’s technology framework requirements were:
- Internet Explorer 8 as the browser for the call center agents
- Wicket as web application framework for the call center application
- Tomcat 7 as the web application server for both the call center web app and the gateways.
The technical requirements were minimal latency, high throughput, and high availability. For latency, an average delay below 150 ms was required, i.e. a value slightly below the attention threshold. For the call center callers, very low latency is not crucial, since most callers will have waited in the queue for minutes rather than seconds to reach a free agent anyway. But the new web application should, if at all possible, not worsen the ergonomics for the call center agents. In the end, this wasn’t a problem: during tests under moderate load, latencies below 80 ms were achieved.
High availability is an obvious requirement: if a call center with about 1000 agents fails, there will be many unhappy customers. On an unlucky day the failure will even be reported in the news. We solved the problem by designing for redundant server components and a low-latency failover protocol. The actual web application uses Tomcat’s built-in clustering mechanism. We couldn’t reuse this for the telephony gateway, because the relevant state is distributed across the switches anyway.
The gateway has two essential reliability requirements:
- Commands to the telephone system have to be retried quickly if a gateway fails.
- Telephony events must not be lost.
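If events arrive over more than one path so that none is lost, the receiver has to recognize duplicates. The article does not say how this was done; the following is one possible sketch, assuming every event carries a unique identifier (the `event_id` values and the `EventDeduplicator` class are hypothetical).

```python
from collections import OrderedDict


class EventDeduplicator:
    """Remembers recently seen event ids and rejects repeats."""

    def __init__(self, window: int = 1024):
        self._seen = OrderedDict()
        self._window = window      # how many ids to remember

    def accept(self, event_id: str) -> bool:
        """Return True the first time an id is seen, False afterwards."""
        if event_id in self._seen:
            return False
        self._seen[event_id] = True
        if len(self._seen) > self._window:
            self._seen.popitem(last=False)   # forget the oldest id
        return True


dedup = EventDeduplicator()
# The same two events, delivered by two gateways in different order.
stream = [("gw-a", "evt-1"), ("gw-b", "evt-1"), ("gw-b", "evt-2"), ("gw-a", "evt-2")]
handled = [event_id for _, event_id in stream if dedup.accept(event_id)]
```

The bounded window keeps memory use constant; it only has to be large enough to cover the maximum delay between the two deliveries of the same event.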
The functional requirements were straightforward:
- Incoming and outgoing calls (simple call control)
- Call forwarding (single-step/two-step transfer)
- Forwarding to the IVR (Interactive Voice Response) — including customer dependent data — as well as routing back to the same agent that originally took the call
- Setting and displaying the agent status
The architecture consists of several interconnected systems as shown in the diagram below:
- Telephony-related systems (left): the gateway (a server-side web application running in Tomcat) and the PBX,
- Call center web application (right): Wicket-based web application and its database(s).
Sending Server Events to the Browsers
For redundancy, every client connects to both gateways and keeps the TCP connection open. This means that every application server (Tomcat) of the gateways has to hold nearly 1000 open connections. We use the AIO interface of Tomcat 7, so all these connections can be processed by a single thread. This greatly reduces memory requirements and scheduling overhead.
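The idea of one thread serving many mostly idle connections can be illustrated with an event loop. The sketch below is only an analogy in Python's asyncio, not Tomcat's API: each queue stands in for one open browser connection, and the function names are made up for the example.

```python
import asyncio
import threading


async def hold_connection(queue, received, threads):
    # Stands in for one idle browser connection waiting for its next event.
    event = await queue.get()
    received.append(event)
    threads.add(threading.get_ident())   # record which thread did the work


async def push_to_all(n_clients):
    queues = [asyncio.Queue() for _ in range(n_clients)]
    received, threads = [], set()
    tasks = [
        asyncio.create_task(hold_connection(q, received, threads))
        for q in queues
    ]
    await asyncio.sleep(0)               # let every "connection" start waiting
    for q in queues:                     # deliver one event per connection
        q.put_nowait("Delivered")
    await asyncio.gather(*tasks)
    return len(received), len(threads)


events_delivered, threads_used = asyncio.run(push_to_all(1000))
```

A waiting connection costs only a small task object here, rather than a blocked thread with its own stack, which is where the savings in memory and scheduling overhead come from.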
Server-sent events (a.k.a. server push) were recently standardized as part of HTML5 in the EventSource interface. Another convenient way to implement bidirectional communication is WebSockets. But we couldn’t use either of these because of the legacy browsers; we were glad that we didn’t have to support IE6 and could rely on at least IE8. So we implemented a COMET variant, which essentially consists of long-running XMLHttpRequests through which events are sent as chunked responses.
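For readers unfamiliar with chunked responses: in HTTP/1.1 chunked transfer encoding, each chunk is its length in hexadecimal, a CRLF, the payload, and another CRLF, and a zero-length chunk ends the response. The sketch below frames and unframes events this way. The JSON payloads are placeholders, and the decoder assumes it is handed complete chunks, which a real streaming client cannot rely on.

```python
def encode_chunk(payload: bytes) -> bytes:
    """Frame one payload as an HTTP/1.1 chunk: hex length, CRLF, data, CRLF."""
    return b"%x\r\n" % len(payload) + payload + b"\r\n"


LAST_CHUNK = b"0\r\n\r\n"   # zero-length chunk that terminates the response


def decode_chunks(stream: bytes) -> list:
    """Split a buffer of complete chunks back into payloads."""
    chunks = []
    pos = 0
    while pos < len(stream):
        line_end = stream.index(b"\r\n", pos)
        size = int(stream[pos:line_end], 16)
        if size == 0:                     # terminating chunk reached
            break
        start = line_end + 2
        chunks.append(stream[start:start + size])
        pos = start + size + 2            # skip payload and trailing CRLF
    return chunks


wire = (
    encode_chunk(b'{"event":"Delivered"}')
    + encode_chunk(b'{"event":"Established"}')
    + LAST_CHUNK
)
events = decode_chunks(wire)
```

On a long-running COMET connection the terminating chunk is simply never sent while the connection is healthy, so the browser keeps receiving one chunk per event.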
Cross-Domain COMET with IE8
Mozilla Firefox, Safari and Google Chrome all support the CORS (Cross-Origin Resource Sharing) specification of the W3C. IE8 supports it as well; however, with IE8 one must use XDomainRequest instead of XMLHttpRequest, and the API is slightly different. There is also a subtle buffering bug in IE8 that makes it necessary to send 2 KB of fill characters on every new COMET connection to ensure that the next event is received by the application immediately.
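A sketch of how the start of such a cross-domain COMET response might be assembled: the response head carries the `Access-Control-Allow-Origin` header that cross-origin requests require, and the first chunk consists of 2 KB of padding. Using spaces as fill characters, the exact header set and the example origin are assumptions; the client would have to ignore the padding when parsing.

```python
PADDING_BYTES = 2048   # 2 KB of fill characters, as described above


def comet_response_head(origin: str) -> bytes:
    """Build the status line and headers for a chunked cross-origin response."""
    lines = [
        "HTTP/1.1 200 OK",
        "Content-Type: text/plain; charset=utf-8",
        f"Access-Control-Allow-Origin: {origin}",
        "Cache-Control: no-cache",
        "Transfer-Encoding: chunked",
    ]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def padding_chunk(size: int = PADDING_BYTES) -> bytes:
    """First chunk on every new connection: filler the client must skip."""
    body = b" " * size                    # assumption: spaces as fill characters
    return b"%x\r\n" % len(body) + body + b"\r\n"


# Hypothetical origin of the call center web application.
prelude = comet_response_head("https://callcenter.example") + padding_chunk()
```

Sending the padding once per connection costs 2 KB of bandwidth at connection setup only, which is negligible next to the benefit of events arriving without delay afterwards.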
Redundant Gateways and PBX
Each browser keeps two connections to two different gateways. One is active, and the other is a hot standby. When the connection to the active gateway is broken, the hot standby gateway is immediately activated. If necessary, the last failed command will be retried. As the hot standby gateway has been sending events the whole time as well, it is guaranteed that no event is lost. After this failover, the connection to the failed gateway is retried. Once it is re-established, the previously failed gateway becomes the hot standby gateway.
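The role swap and the command retry can be sketched as a small client-side controller. This is a simplified model under stated assumptions: `FailoverClient`, `FakeGateway` and `GatewayDown` are invented names, reconnecting to the failed gateway is not modeled, and the real client ran as JavaScript in the browser.

```python
class GatewayDown(Exception):
    """Raised when a command cannot be delivered to a gateway."""


class FakeGateway:
    """Test double standing in for a gateway connection."""

    def __init__(self, name: str, up: bool = True):
        self.name = name
        self.up = up
        self.received = []

    def send(self, command: str) -> str:
        if not self.up:
            raise GatewayDown(self.name)
        self.received.append(command)
        return f"{self.name}:ok"


class FailoverClient:
    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def send(self, command: str) -> str:
        try:
            return self.active.send(command)
        except GatewayDown:
            # Promote the hot standby and retry the failed command once.
            # The failed gateway takes the standby role until it reconnects.
            self.active, self.standby = self.standby, self.active
            return self.active.send(command)


gw_a = FakeGateway("gw-a", up=False)   # the active gateway has just failed
gw_b = FakeGateway("gw-b")
client = FailoverClient(gw_a, gw_b)
result = client.send("MakeCall")
```

A blind retry is only safe if the command did not already reach the telephone system before the connection broke, so a real implementation would need some way to tell whether the retry is actually necessary.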
The PBX (a Genesys installation) itself is also redundant. The fallback on this level is hidden by the Genesys API and the gateway doesn’t have to handle it.
Our solution was tested with three different methods:
- A simulator that implements the gateway’s HTTP services and simulates a single agent telephone (with a Swing GUI), and
- Load tests.
Writing the simulator was a substantial effort, but it helped in two ways:
- It made development without the telephony hardware possible.
- It made it easy to test scenarios that were not reliably testable with real hardware (like deliberate race conditions).
In an ideal world, the load tests would have been performed with an external load test tool. We didn’t have one available, so we wrote our own load test generator using the CSTA API to generate and receive calls.
Our solution is lightweight, conceptually simple and scalable. The simplicity is the result of two development iterations and rather long design phases.
The decision to use CSTA as the blueprint for the communication protocol worked well, too. It was helpful that we did not have to re-invent two-step transfer for the umpteenth time. Also, the CSTA vocabulary (which goes down to the text in the log messages) can be understood by personnel that are familiar with CTI.
In the Footsteps of Arnold Schwarzenegger
Call center applications always remind me of a slightly silly movie starring Arnold Schwarzenegger as an undercover agent and Jamie Lee Curtis as his unsuspecting wife. His cover story for her is that he is doing something with IT and in one scene she inquires about his day at work. He starts telling her enthusiastically and quite elaborately about a call center integration — and she nearly falls asleep.
I, however, think the combination of a call center and a web application is technically quite fascinating.