Efficient use of multithreading

AmiBroker 5.50 fully supports multithreading (parallel execution on all CPU cores) in both charting and the New Analysis window. This greatly enhances speed of operation and improves the responsiveness of the application, as worker AFL execution threads do not block the user interface. For example, on a 4-core Intel i7 that can run up to 8 threads, it can run up to 8 times faster than the old Analysis window. The exact speed-up depends on the complexity of the formula (the more complex it is, the more speed-up is possible) and on the amount of data processed (RAM access may not be as fast as the CPU, which limits possible speed gains).

This chapter describes how to avoid pitfalls that can affect multithreaded performance.

Understanding how multithreading is implemented

It is important to understand one simple rule first - in AmiBroker one thread can run one operation on one symbol's data:

1 operation * 1 symbol = 1 thread

An operation is displaying a single chart pane or running a scan, exploration, backtest or optimization. The consequences are as follows: a single chart pane always uses one thread. Likewise, a single backtest or optimization running on one symbol uses one thread only.

But a chart that consists of 3 panes uses 3 threads, even though they all operate on the same symbol. So we can also write:

N operations * 1 symbol = N threads

We can also run a single operation (like scan/exploration/backtest/optimization) on multiple symbols, then

1 operation * N symbols = N threads

Of course you can also run multiple Analysis windows, each of them running on multiple symbols, or run multiple charts on multiple symbols, then

P operations * N symbols = ( P * N ) threads

It is also important to understand that some operations consist not only of the AFL execution part but also of some extra processing and/or user-interface work. In such cases only the AFL execution can be done with multiple threads. This has consequences for Individual Backtest mode, which are described in detail later.

Note: In version 5.70 there is one exception to this rule: the new multi-threaded individual optimization, which allows a single-symbol optimization to run using multiple threads.

Limits

The number of threads that are actually launched depends on your CPU and the edition of AmiBroker you are using. Standard Edition has a limit of 2 (two) threads per Analysis window. Professional Edition has a limit of 32 threads per Analysis window. In addition to this limit, AmiBroker detects how many logical processors are reported by Windows and will not run more threads per single Analysis window than the number of logical processors. For example, a single Intel i7 920 CPU is recognized as 8 logical processors (4 cores x 2 hyperthreading).
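The rules and limits above can be combined into a rough estimate of how many worker threads one Analysis window will use. The Python sketch below is only an illustrative model of what this article states, not AmiBroker code: the name `analysis_threads` and its parameters are hypothetical, and the model ignores the 5.70 individual-optimization exception.

```python
# Illustrative model only: these names are not part of AmiBroker or AFL.
EDITION_THREAD_LIMIT = {"standard": 2, "professional": 32}  # per Analysis window


def analysis_threads(symbols, logical_processors, edition="professional"):
    """Estimate worker threads used by one Analysis window.

    One operation on N symbols can use up to N threads, capped by the
    edition limit and by the number of logical processors reported by Windows.
    """
    if symbols < 1 or logical_processors < 1:
        raise ValueError("symbols and logical_processors must be >= 1")
    return min(symbols, logical_processors, EDITION_THREAD_LIMIT[edition])


# Intel i7 920: 4 cores x 2 hyperthreads = 8 logical processors
print(analysis_threads(500, 8))              # Professional, 500-symbol scan
print(analysis_threads(500, 8, "standard"))  # Standard Edition cap
print(analysis_threads(1, 8))                # single-symbol backtest
```

Under these assumptions the three calls give 8, 2 and 1 threads respectively.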

Common pitfalls

The following areas of AFL programming require some attention if you want to write multithreading-friendly AFL formulas:

Avoiding the use of OLE / CreateObject
Reducing use of AddToComposite / Foreign to minimum
Efficient and correct use of static variables
Implementing pre-processing / initialisation in the Analysis window
Accessing ~~~Equity symbol

Generally speaking, an AFL formula can run at full speed only if it does not access any shared resources. Any attempt to access a shared resource may result in formula execution waiting for the semaphore/critical section that protects the shared resource from simultaneous modification.

1. Avoiding the use of OLE / CreateObject

AmiBroker fully supports calling OLE objects from the AFL formula level, and it is still safe to use, but there are technical reasons to advise against using OLE. The foremost reason is that OLE is slow, especially when it is not called from the "owner" thread.

OLE was developed by Microsoft back in the 1990s, in the 16-bit days. It is old technology and it effectively prevents threads from running at full speed, as all OLE calls must be served by the one and only user-interface thread. For more details see this article: http://blogs.msdn.com/b/oldnewthing/archive/2008/04/24/8420242.aspx

For this reason, you should strictly avoid using OLE / CreateObject in your formulas if at all possible.

If you fail to do so, performance will suffer. Any call to OLE from a worker thread posts a message to a hidden OLE window and waits for the main application UI thread to handle the request. If multiple threads do the same, performance easily degrades to single-thread level, because all OLE calls are handled by the main UI thread anyway.

Not only that. Threads waiting for OLE can easily deadlock when the OLE server is busy with some other work. AmiBroker contains some hi-tech patented code that checks for such an OLE deadlock condition and is able to unlock from it, but it may take up to 10 seconds to unlock. Even worse, OLE calls made from a non-UI thread suffer from the overhead of messaging and marshaling and can be as much as 30 times slower than when they are called from the main UI thread of the same process. To avoid all those troubles, avoid using OLE if at all possible.

For example instead of using OLE to do RefreshAll like this:

AB = CreateObject("Broker.Application"); // AVOID THIS
AB.RefreshAll(); // AVOID THIS

Use AmiBroker's native RequestTimedRefresh function, which is orders of magnitude faster and does not cause any problems. If you want to refresh the UI after a Scan/Analysis/Backtest, use SetOption("RefreshWhenCompleted", True ).

Keep in mind that in most cases the refresh is completely automatic (for example after AddToComposite) and does not require any extra coding at all.

If you use OLE to read Analysis filter settings (such as watch list number), like this:

AB = CreateObject("Broker.Application"); // AVOID THIS
AA = AB.Analysis; // AVOID THIS
wlnum = AA.Filter( 0, "watchlist" ); // AVOID THIS

you should replace the OLE calls with a simple, native call to GetOption, which allows you to read the Analysis filter settings in a multithreading-friendly manner. For example, to read the Filter Include watch list number use:

wlnum = GetOption("FilterIncludeWatchlist"); // PROPER WAY

For more information about the supported filter settings fields see the GetOption function reference page.

Also note that the AB.Analysis OLE object always refers to the OLD Automatic Analysis window. This has the side effect of launching/displaying the old Automatic Analysis window whenever you use AB.Analysis in your code. As explained above, all calls to OLE should be removed from your formulas if you want to run them in the New multithreaded Analysis window. Accessing the new Analysis via OLE is allowed only from external programs / scripts. To access the new Analysis from an external program you need to use the AnalysisDocs/AnalysisDoc objects, as described in the OLE Automation interface document.

2. Reducing use of AddToComposite / Foreign to minimum

Any access to a symbol other than the "current" one from the formula level involves a global lock (critical section) and therefore may impact performance. For this reason it is recommended to reduce the use of AddToComposite/Foreign functions and to use static variables wherever possible.

3. Efficient and correct use of static variables

Access to static variables is fast, thread-safe and atomic at the level of a single StaticVarSet/StaticVarGet call. It means that an entire array is read/written atomically, so no other thread will read/write that array while another thread is in the middle of updating it.

However, care must be taken if you write multiple static variables at once. Generally speaking, when you write static variables as part of a multi-symbol Analysis scan/exploration/backtest/optimization, you should do the writing (StaticVarSet) in the very first step, using Status("stocknum")==0 as described below. This is the recommended way of doing things:

if( Status("stocknum") == 0 )
{
    // do all static variable writing/initialization here
}

Doing all initialization/writes to static variables that way provides the best performance, and subsequent reads (StaticVarGet) are perfectly safe and fast. You should avoid making things complex when it is possible to follow the simple and effective rule of one writer - multiple readers. As long as only one thread writes and many threads just read static variables, you are safe and you don't need to worry about synchronization.

For advanced formula writers only:
If you, for some reason, need to write multiple static variables that are shared and accessed from multiple threads at the same time, and you must ensure that all updates are atomic, then you need to protect the regions of your formula that update multiple static variables with a semaphore or critical section. For best performance you should group all reads/writes in one section like this:

if( _TryEnterCS( "mysemaphore" ) ) // see StaticVarCompareExchange function for implementation
{
    // you are inside critical section now
    // do all static var writing/reading here - no other thread will interfere here
    _LeaveCS();
}
else
{
    _TRACE("Unable to enter CS");
}

The implementation of both semaphore and critical section in AFL is shown in the examples to StaticVarCompareExchange function.

4. Implementing pre-processing / initialisation in the Analysis window

Sometimes there is a need to do some initialization or some time-consuming calculation before all the other work is done. To allow for that processing without other threads interfering with the outcome you can use the following if clause:

if( Status("stocknum") == 0 )
{
    // initialization / pre-processing code
}

AmiBroker detects such a statement and runs the very first symbol in one thread only, waits for completion, and only then launches all the other threads. This allows things like setting up static variables for use in further processing, etc. Caveat: the above statement must NOT be placed inside an #include file.

5. Accessing ~~~Equity symbol

Using Foreign("~~~Equity", "C" ) makes sense only to display a chart of the equity of a backtest that has completed. It is important to understand that the new Analysis window supports multiple instances, and therefore it cannot use any shared equity symbol, because if it did, multiple running backtests would interfere with each other. So New Analysis has a local, private instance of all equity data that is used during backtesting, and only AFTER backtesting is complete does it copy the ready-to-use equity data to the ~~~Equity symbol. This means that if you call Foreign("~~~Equity", "C" ) from within the formula that is currently being backtested, you will receive the previous backtest's equity, not the current one.

To access current equity, you need to use the custom backtester interface. It has an "Equity" property in the backtester object that holds the current account equity. If you need equity as an array there are two choices. You can either collect the values this way:

SetOption("UseCustomBacktestProc", True );

if( Status("action") == actionPortfolio )
{
    bo = GetBacktesterObject();

    bo.PreProcess(); // Initialize backtester

    PortEquity = Null; // will keep portfolio equity values

    for( bar = 0; bar < BarCount; bar++ )
    {
        bo.ProcessTradeSignals( bar );

        // store current equity value into array element
        PortEquity[ bar ] = bo.Equity;
    }

    bo.PostProcess(); // Finalize backtester

    // AT THIS POINT PortEquity contains the ARRAY of equity values
}

Or you can use the EquityArray property added to the Backtester object in v5.50.1:

if( Status("action") == actionPortfolio )
{
    bo = GetBacktesterObject();
    bo.Backtest();
    AddToComposite( bo.EquityArray, // get portfolio Equity array in one call
                    "~~~MY_EQUITY_COPY", "X",
                    atcFlagDeleteValues | atcFlagEnableInPortfolio );
}

Please note that the values are filled in during the backtest and all values are valid only after the backtest is complete (as in the above example). If you read EquityArray in the middle of a backtest, it will contain equity only up to the given bar. Avoid abusing this property, as it is costly in terms of RAM/CPU (however, it is less costly than Foreign).

Both ways presented will access the local, current copy of equity in New Analysis (unlike Foreign, which accesses the global symbol values from the previous backtest).

Single-symbol operations run in one thread

As explained at the beginning of the article, any operation such as a scan, exploration, backtest, optimization or walk-forward test that is done on a single symbol can only use one thread. For that reason there is almost no speed advantage compared to running the same code in old versions of AmiBroker.

Update as of 5.70: This version has a new "Individual Optimize" functionality that allows a single-symbol optimization to run using multiple threads, albeit with some limitations: only exhaustive optimization is supported and no custom backtester is supported. This is for two reasons: a) smart optimization engines need the result of the previous step to decide which parameter combination to choose for the next step; b) the second phase of the backtest talks to the UI and OLE (custom backtester) and as such cannot be run from a non-UI thread (see below for the details).

Individual Backtest can only be run in one thread

The most important thing to understand is that the Individual Backtest is a portfolio-level backtest run on just ONE symbol. Even if you run it on a watch list, it still executes things sequentially: a single backtest on a single symbol at a time, then moving on to the next symbol in the watch list. Why this is so is described below.

Both portfolio-level and individual backtests consist of the very same two phases:
I. running your formula and collecting signals
II. actual backtest that may involve second run of your formula (custom backtester)

Phase I runs the formula on each symbol in the list and it can be multi-threaded (if there is more than one symbol in the list).

Phase II, which processes the signals collected in Phase I, generates the report and displays the results, is done only once per backtest.
It cannot be multi-threaded because:
a) it talks to User Interface (UI)
b) it uses OLE/COM to allow you to run custom backtester.

Neither OLE nor UI access can be done from a worker (non-user-interface) thread. Even worse, OLE/UI + multithreading equals death, see:
http://blogs.msdn.com/b/oldnewthing/archive/2008/04/24/8420242.aspx

Usually, in the case of multi-symbol portfolios, Phase I takes 95% of the time needed to run a portfolio backtest, so once you run Phase I in multiple threads, you get very good scalability, as only 5% is not multi-threaded.
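The effect of the 95% / 5% split can be estimated with the standard Amdahl's law formula, which this article does not name but which describes exactly this situation. The Python sketch below is a simplified model under that assumption; `amdahl_speedup` is a hypothetical helper, and real results also depend on memory bandwidth and formula complexity.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Overall speed-up when only `parallel_fraction` of the work scales with threads."""
    if not 0.0 <= parallel_fraction <= 1.0 or threads < 1:
        raise ValueError("need 0 <= parallel_fraction <= 1 and threads >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


# Phase I = 95% of the run time (figure quoted above), Phase II = 5% serial
for threads in (1, 2, 4, 8, 32):
    print(threads, round(amdahl_speedup(0.95, threads), 2))
```

Under this model 8 threads give roughly a 5.9x overall speed-up, and the serial 5% caps the gain below 20x no matter how many threads are added. With a single thread the speed-up is 1, i.e. none.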

Since an individual backtest runs on ONE symbol, the only phase that can be run in multiple threads, i.e. Phase I, consists of just one run, and as such is run in one thread.

For Phase II to run from multiple threads, it would have to NOT talk to the UI and NOT use COM/OLE (no custom backtester).

As a result, Individual Backtest can NOT be any faster than in the old Automatic Analysis.

Doing the math & reasonable expectations

Some users live in fantasy land and think that they can throw, say, 100GB of data at the program and it will be processed fast because "they have the latest hardware". This is dead wrong. What you will get is a crash. While 64-bit Windows removes the 2GB per-application virtual address space barrier, it is not true that there are no limits anymore.

Unfortunately, even people with a technical background forget to do the basic math and have unreasonable expectations. The first and foremost thing people miss is the huge difference in access speed that comes with data size. The term "Random Access Memory" in the past (say, back in 1990) meant that accessing data takes the same amount of time regardless of location. That is NO LONGER the case. There are huge differences in access speed depending on where the data is located. For example, an Intel i7 920 in a triple-channel configuration accesses L1 cached data at 52GB/second, L2 cached data at 30GB/second (almost 2x slower!), L3 cached data at 24GB/second and regular RAM at 11GB/second. It means that L1 cached data access is about 5 times faster than RAM access. Things get even more dramatic if you run out of RAM and the system has to go to the disk. With most modern SSD disks we are talking about just 200MB/sec (0.2GB/sec). That is roughly 55 times slower than RAM and about 260 times slower than L1 cache (around two orders of magnitude). And that assumes zero latency (seek). In the real world, disk access can be 10000 times slower than RAM.

Now do yourself a favour and do the math. Divide 100GB by the 0.2GB/second SSD disk speed. What do you get? 500 seconds, more than eight minutes just to read the data. Now, are you aware that an application that does not process messages for just a few seconds is considered "not responding" by Windows? What does that mean? It means that even in the 64-bit world, any Windows application will have trouble processing data sets that exceed 5GB, just because of the raw disk read speed, which in the best case does not exceed 200MB/sec (and is usually much worse). Attempting to backtest such absurd amounts of data on a high-end PC will just lead to a crash, because timeouts will be reached, Windows will struggle to process messages and you will overrun system buffers. And it has nothing to do with software. It is just a brutal math lesson that some forgot. The first and most important rule for getting more speed is to limit your data size, so that it at least fits in RAM.
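The arithmetic above is easy to reproduce. The Python sketch below simply divides a data size by the bandwidth figures quoted in this section; the helper names are hypothetical, and the model assumes a single best-case sequential read with no seek latency. The cache rows are a bandwidth comparison only, since 100GB obviously does not fit in any cache.

```python
# Sequential bandwidth figures quoted in the text, in GB per second
BANDWIDTH_GB_PER_S = {
    "L1 cache": 52.0,
    "L2 cache": 30.0,
    "L3 cache": 24.0,
    "RAM": 11.0,
    "SSD": 0.2,
}


def read_time_seconds(data_gb, medium):
    """Best-case time to stream `data_gb` gigabytes once (no seek latency)."""
    return data_gb / BANDWIDTH_GB_PER_S[medium]


def slowdown(slower, faster):
    """How many times slower `slower` is than `faster`."""
    return BANDWIDTH_GB_PER_S[faster] / BANDWIDTH_GB_PER_S[slower]


for medium in BANDWIDTH_GB_PER_S:
    print(f"{medium:8s} {read_time_seconds(100, medium):8.1f} s for 100GB")

print(f"SSD vs RAM: {slowdown('SSD', 'RAM'):.0f}x slower")
```

For 100GB this gives about 500 seconds from the SSD versus roughly 9 seconds from RAM, which is why keeping the data set small enough to fit in RAM matters so much.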