PRISM: Linux Cluster Computing for the National Stock Exchange of India

March 26, 2001
Liz Coolbaugh
Shuvam Misra from Starcom Software Private Limited came from Mumbai, India, to give a talk on a Linux cluster computing solution developed by his company for the National Stock Exchange of India (NSE). The project was meant to demonstrate the use of Linux for a real business application requiring heavy floating-point computation at very high performance levels.

The NSE is one of the world's largest stock exchanges in terms of trading volume. Real-time throughput is required to analyze trade risk factors. Also critical is high reliability under high pressure (both in trading volumes and in human emotions): if the stock exchange's computing system goes down when stock prices are going down, a riot is likely to result.

India's NSE is fully computerised, with no open-outcry ring. It was commissioned in 1994 and was the first stock exchange in India to aggressively promote nation-wide trading. More than 2000 traders come online daily over a VSAT network. Volume runs around 300,000 trades per day, with over 10,000 securities listed; 2000 to 3000 securities are traded regularly (at least once every 8-10 days). In addition, half of the day's total trades happen in the last 30 minutes (apparently common to most stock exchanges).

NSE's current system is sized for 100 trades/second, but it needs to be able to scale to at least 500 trades/sec, preferably 1000 trades/sec, in the future.

Additional requirements for the NSE of India include risk analysis, fault tolerance, a good real-time GUI, interaction between the current price and the risk analysis (which is not done on all stock exchanges), and a two-way connection to the live-trading system. For example, if a broker goes past their acceptable risk limits, that broker's account can be disabled in real time. They want sufficient speed to guarantee that, if one trade crosses the limit, a second trade a second later will be rejected.
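
The talk did not go into the actual interface between PRISM and the trading engine, but the idea behind the two-way hook is simple enough to sketch. In the C fragment below, all names are illustrative and a print statement stands in for whatever disable message the trading engine really accepts; each settled trade triggers a recomputation of the broker's exposure and an immediate disable when the limit is crossed:

    /* Hedged sketch of the per-trade risk hook described above.  The real
     * PRISM/NSE interfaces are not public; everything here is illustrative. */
    #include <stdio.h>

    struct broker_account {
        int    id;
        double var_limit;     /* maximum acceptable Value at Risk */
        double current_var;   /* recomputed after every trade     */
        int    disabled;
    };

    /* Stand-in for whatever disable message the trading engine accepts. */
    static void send_disable(int broker_id)
    {
        printf("broker %d disabled: next trade will be rejected\n", broker_id);
    }

    static void on_trade_settled(struct broker_account *b, double new_var)
    {
        b->current_var = new_var;
        if (!b->disabled && b->current_var > b->var_limit) {
            b->disabled = 1;
            send_disable(b->id);
        }
    }

    int main(void)
    {
        struct broker_account b = { .id = 42, .var_limit = 5e6,
                                    .current_var = 0, .disabled = 0 };
        on_trade_settled(&b, 4.2e6);   /* within limit: nothing happens   */
        on_trade_settled(&b, 6.8e6);   /* crosses limit: account disabled */
        return 0;
    }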

VaR (Value at Risk) is the risk analysis measure that has been defined and approved by economists. It requires a great deal of raw computing performance: a single VaR calculation may involve 5000 to 15,000 iterations, depending on the accuracy required, which works out to a few million floating-point operations (MFLOPs) per calculation.

Multiply that by hundreds of trades per second and GFLOPs of sustained performance are required to support real-time VaR calculation.
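
The talk did not spell out which VaR method NSE uses, but a toy Monte Carlo calculation (a single position with normally distributed returns, purely illustrative) gives a feel for where the floating-point load comes from: thousands of simulated paths per calculation, repeated for every trade, is how the requirement climbs into GFLOPs:

    /* Minimal Monte Carlo Value-at-Risk sketch, illustrative only; this is
     * not PRISM's code.  5,000-15,000 iterations at a few hundred operations
     * each puts one VaR calculation in the MFLOPs range. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* 99% one-day VaR of a position, simulating normally distributed returns. */
    double monte_carlo_var(double position_value, double daily_vol, int iterations)
    {
        double *pnl = malloc(iterations * sizeof *pnl);
        for (int i = 0; i < iterations; i++) {
            /* Box-Muller transform: two uniforms -> one standard normal. */
            double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
            double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
            double z  = sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
            pnl[i] = position_value * daily_vol * z;   /* simulated P&L */
        }
        qsort(pnl, iterations, sizeof *pnl, cmp_double);
        double var = -pnl[iterations / 100];           /* 1st-percentile loss */
        free(pnl);
        return var;
    }

    int main(void)
    {
        /* e.g. a position worth 1e6 with 2% daily volatility, 10,000 paths */
        printf("VaR ~ %.0f\n", monte_carlo_var(1e6, 0.02, 10000));
        return 0;
    }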

The performance requirements dictated the use of a supercomputer, but unfortunately they did not have the budget. No stock exchange in the world today implements what they have done: real-time risk analysis with VaR. Only five or six stock exchanges in the world have higher trading rates, and none of them has a risk analysis engine that can handle their volumes, so this is a world first. Some do risk analysis checking, but not something that an economist would be happy with; they generally only do static limit checking.

The solution: use a cluster of inexpensive Intel Linux machines, because the problem is highly parallelizable. Use PVM or MPI as the clustering infrastructure over a switched 100Mbps Ethernet network (they verified that this was adequate). Find the non-parallel components, such as data sharing, and use multiprocessor hardware to handle those.

In this case, the customer was extremely risk-averse and cautious. The Linux option seemed like the riskier choice compared to a supercomputer costing 20 times as much, because the supercomputer is "proven"; it has been in use for many years.

Starcom Software took on the cost, saying that if it didn't work, the customer wouldn't be charged.

PRISM

PRISM is an engine for doing the VaR calculations. It does not replace the entire NSE system; it solely handles this one compute-intensive step.

The design of PRISM includes one large processor called "mother" plus a number of "children". For the test bed, each child was an Intel Pentium III machine. They found the best price/performance came from dual-processor Pentium III systems, which are ideal for getting maximum throughput for minimum money. Not a large amount of RAM is required (128 or 256MB); the problem involved requires pure computation speed, bus speed, and networking speed.

The systems boot from hard disk, then do not use the local disk afterwards.

"mother" is a dual-processor Alpha machine. This appeared to be the best for raw computing performance.

A separate machine is used to receive data from proprietary systems, check it, and turn it into PVM messages. "mother" is the potential bottleneck; the children are not, they are just blind servers. They got 10-12ms per computation from each child.

The software allows only 3x12ms for an action to complete before marking the child dead and reassigning the task.

They evaluated both PVM and MPI. They liked PVM, but chose MPI due to current industry support and the performance tuning available for this high-stress application. The design was tested with four or five old Pentiums and clocked 50+ trades/sec; 100 trades/sec was trivial with four dual-processor Intel "children".
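
The actual PRISM protocol and message formats were not described in the talk, but the mother/children pattern with its 3x12ms timeout can be sketched with plain MPI point-to-point calls. In this illustrative version, rank 0 plays "mother": it hands a toy VaR task to each child and polls for the answer until the deadline passes; the task contents, message tags and dummy kernel are invented for the example:

    /* Hedged sketch of the mother/children dispatch described above; not
     * PRISM's code.  Build with mpicc, run with e.g. mpirun -np 5 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    #define TAG_TASK   1
    #define TAG_RESULT 2
    #define TIMEOUT_S  0.036               /* 3 x 12ms allowed per computation */

    /* Poll (busy-wait, for simplicity) for a child's answer until the deadline. */
    static int wait_for_result(int child, double *var_out)
    {
        double deadline = MPI_Wtime() + TIMEOUT_S;
        int flag = 0;
        MPI_Status st;

        while (MPI_Wtime() < deadline) {
            MPI_Iprobe(child, TAG_RESULT, MPI_COMM_WORLD, &flag, &st);
            if (flag) {
                MPI_Recv(var_out, 1, MPI_DOUBLE, child, TAG_RESULT,
                         MPI_COMM_WORLD, &st);
                return 1;                  /* child answered in time */
            }
        }
        return 0;                          /* child is considered dead */
    }

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                   /* "mother": dispatch and collect */
            double task[2] = { 1e6, 0.02 };    /* toy position value + volatility */
            double var;

            for (int child = 1; child < size; child++)
                MPI_Send(task, 2, MPI_DOUBLE, child, TAG_TASK, MPI_COMM_WORLD);

            for (int child = 1; child < size; child++) {
                if (wait_for_result(child, &var))
                    printf("child %d: VaR %.0f\n", child, var);
                else
                    /* in production the task would be reassigned to a live child */
                    fprintf(stderr, "child %d timed out, marked dead\n", child);
            }
        } else {                           /* blind "child" server */
            double task[2], var;
            MPI_Status st;

            MPI_Recv(task, 2, MPI_DOUBLE, 0, TAG_TASK, MPI_COMM_WORLD, &st);
            var = task[0] * task[1] * 2.33;    /* stand-in for the real VaR kernel */
            MPI_Send(&var, 1, MPI_DOUBLE, 0, TAG_RESULT, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }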

They believe the system can be scaled to 1000 trades/sec; the software development is done and the hardware costs are trivial. "mother", which runs on a Unix Alpha machine, is the most expensive component. A Java-based UI is provided so that a stock trading specialist (rather than a computer administrator) can control the system.

The Linux children need no management, so no "Unix skills" are needed. Only the IP address has to be configured to install a replacement child.

The system has proven very stable in production. It went into production a year ago.

The Future

Shuvam commented, "We would like to replace the hardware fault tolerance and high performance trading system with a cluster as well" (as opposed to just the risk analysis engine).

"Next, we want to get into the business of matching orders, since trading systems are inherently parallelisable. A mainframe is a stupid solution when a cluster is possible. A cluster could provide an extremely scalable trading system that could handle 100 times the volume of the current system"

NSE is currently using Stratus mainframes. Shuvam was not at all confident that management would choose to replace that system with a cluster-based system any time in the near future. However, management is interested due to current IT costs, and now that PRISM exists, Starcom has more credibility. In the meantime, the project was a very enjoyable experience.

"We understand management's skepticism and risk aversion. The bottom-line, eventually, is to have a solution that works. The project cost for testing this was very low, without any risk", Shuvam said.

PRISM is in place for real-time risk analysis. If it goes down, trading can continue without real-time risk analysis; it is not a core system.
