Software for Machine Learning – The Machine Learning community in Europe

HFT Tutorials for trading applications

Most of the information on the web about configuring the Linux kernel is not targeted at trading applications, so I thought a little tutorial might be useful.

Isolcpus is the common name for configuring Linux to exclusively dedicate a CPU core to a certain process, such as a trading system executable. It helps minimize random latency spikes that are the result of context switches from the kernel scheduler multiplexing processes on a core. It comes from the name of the kernel parameter “isolcpus,” but it typically also refers to the other steps in assigning the core to the process.

To follow along you will need to be running Ubuntu 10.10 on a 2+ core machine.

The Isolcpus tutorial

First run the program “top” by just typing “top” into a terminal, then hit “f” and then “j” and then ENTER, which displays the core assigned to each process in column “P”. See how processes are distributed on both cores, 0 and 1:

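We are going to isolate core 1 so that the scheduler keeps other processes off it and the trading system can run there without being context-switched out. (Hardware interrupts are routed separately from the scheduler, so isolcpus by itself does not keep them off the core.)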

First we need to make the bootloader, GRUB2, display during startup so that we can use it to pass a parameter to the kernel before it boots. Comment out the two GRUB_HIDDEN_* lines in GRUB’s config file like the following:

sudo gedit /etc/default/grub
...
GRUB_DEFAULT=0
#GRUB_HIDDEN_TIMEOUT=0
#GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=10
...

Now restart your computer and you will see the bootloader pop up listing your choices for the OS. Based on my configuration, I highlight the first one on the list as the one I’ll be using. When it’s highlighted, hit “e” to edit the kernel parameters. Move the cursor to the end of the second to last line, hit the spacebar, and then add “isolcpus=1”.

So the line you modified should now look something like (it will be wrapped):

linux    /boot/vmlinuz-2.6.35-25-generic root=UUID=0b3b0a1b-49b1-4d60-9de1-dd7854ee8028 ro   quiet splash isolcpus=1

You are done here now so hit CTRL-x to continue booting. Run “top” again with the same column enabled as before. You should now see something like:

Sort by column P by hitting SHIFT-> a lot until it’s sorting by the rightmost column, and then hitting SHIFT-< once to sort by the second to last column, P, which shows which core each process resides on. You should see very few on core 1. The ones that are there have to do with OS utilities like maintaining the filesystem and take up basically no resources.
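Another way to confirm that the parameter took effect is to read back the command line the running kernel was booted with, which Linux exposes in /proc/cmdline. The following is a minimal Python sketch of that check, not part of the original tutorial. The helper names parse_cpu_list and isolated_cores are made up for illustration, and the parser assumes a plain CPU list such as "1", "1,3" or "2-4" after isolcpus=.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as "1", "1,3" or "2-4" into sorted core numbers."""
    cores = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range like "2-4" covers both endpoints.
            low, high = part.split("-", 1)
            cores.update(range(int(low), int(high) + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


def isolated_cores(cmdline):
    """Return the cores named by an isolcpus= parameter, or [] if it is absent."""
    for token in cmdline.split():
        if token.startswith("isolcpus="):
            return parse_cpu_list(token[len("isolcpus="):])
    return []


# The kernel line from this tutorial, with the UUID shortened for the example.
example = "root=UUID=0b3b0a1b ro quiet splash isolcpus=1"
print(isolated_cores(example))
```

On the machine itself you would pass the contents of /proc/cmdline instead of the example string, for instance isolated_cores(open("/proc/cmdline").read()).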

To run a program on the isolated core, use the command “taskset”. For example if your executable is named “a.out”, then you would call:

taskset 2 ./a.out

taskset will run your program on core 1. The number corresponds to a bitmask saying which CPU cores the process is allowed to use. 2 => 10 => core 1 but not core 0. 3 => 11 => core 1 or core 0. etc. This is called setting the CPU affinity of the process.
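To make the bitmask arithmetic concrete, here is a small Python sketch that converts between a mask and the list of cores it allows. The function names are invented for this example. Keep in mind that taskset reads the mask as hexadecimal, so the digits only coincide with decimal for small values. The last few lines use os.sched_getaffinity, which Python provides on Linux, to print the affinity of the running interpreter.

```python
import os


def mask_to_cores(mask):
    """Return the core numbers whose bits are set in an affinity bitmask."""
    cores = []
    core = 0
    while mask:
        if mask & 1:
            cores.append(core)
        mask >>= 1
        core += 1
    return cores


def cores_to_mask(cores):
    """Build the bitmask that allows exactly the given cores."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


print(mask_to_cores(2))  # [1]
print(mask_to_cores(3))  # [0, 1]

# os.sched_getaffinity is not available on every platform, so check first.
if hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)  # 0 means the calling process
    print("allowed cores:", sorted(allowed), "mask:", hex(cores_to_mask(allowed)))
```

Running this script under "taskset 2" on the isolated machine should report core 1 only.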

You can have a C program print the cores it has been authorized to use with the following code snippet:

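#define _GNU_SOURCE   /* exposes sched_getaffinity and the CPU_* macros */
#include <sched.h>
#include <stdio.h>

void print_cpuaffinity() {
   cpu_set_t set;
   unsigned long mask = 0;
   if (sched_getaffinity(0, sizeof(set), &set) < 0) { /* pid 0 = this process */
      perror("sched_getaffinity");
      return;
   }
   /* fold the allowed cores into a bitmask: bit n set => core n allowed */
   for (int cpu = 0; cpu < (int)(8 * sizeof(mask)); cpu++)
      if (CPU_ISSET(cpu, &set)) mask |= 1UL << cpu;
   printf("my affinity mask is: %08lx\n", mask);
}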

Another way to check that “taskset” worked is to watch in “top” for your program to appear and see if in column P it is assigned to core 1.

Basically isolcpus=1 makes it so that the default affinity for a new process is 1 => 01 instead of 3 => 11.

After doing all this, you should find that your trading process runs more consistently, with far fewer unexpected “hangs” caused by the scheduler.

Some other interesting commands for probing your computer’s internals are:

cat /proc/cpuinfo
ntpq -c rl
gcc -v
uname -a
/lib/libc.so.6
cat /proc/interrupts
service irqbalance status

The signals of linear regression

There are some problems with linear regression. It can only capture first-order relationships, but when the signal to noise ratio is .05:1, then there’s not much point in worrying about that. Another problem is that it’s slow if you do it the typical way. Most people will just use Matlab’s backslash operator \, or R’s lm() function. This itself isn’t slow, in fact it’s one of the most highly optimized algorithms out there. The problem is when you’re backtesting and you have to recompute the entire linear regression at every single time step. It’s especially bad when you’re testing multiple values for parameters and have to run it a bunch of times, once for each parameter.

Because this is slow, people resort to hacks like letting the algo look into the future and calculate the regression coefficients on the whole dataset at once, or they avoid crossvalidating parameters (since it’s so slow) and instead overfit them by hand.

A better solution is to use an online algorithm for calculating the regression coefficients. Online algorithms take advantage of the fact that all but one data point (the new observation) is the same as the last time you calculated the regression. The updated regression can be expressed mathematically as a combination of the old coefficients and just the single new point. If you have N data points total, this means the work per update drops from O(N) to O(1). Now there is no penalty in doing a fully-realistic backtest.

Another good thing about the online setting is that it’s more amenable to adaptive algorithms for machine learning. It’s one of the most under-emphasized facts about linear regression that weighting the points correctly is crucial. First of all you should typically weight recent points more than old points, and secondly you must weight points inversely to their volatility/variance. The first is more of a stylized fact about financial modeling. The second is a basic requirement to unbias linear regression when variables are heteroscedastic.

Here’s the difference between an exponentially weighted linear regression and one that’s not:

(This is a plot of the adaptiveness of the regression coefficients to changes in the true input-output relationship. The exponentially-weighted estimates are the empty circles and the equal-weighted estimates are the plus-filled circles. Clearly the equal-weighted model has a lag time in reaching the true coefficients, although it is more stable.)

Online linear regression is very simple. I think it’s simpler than the typical matrix pseudo-inverse batch formulation.

The online formulation

1. Start with prior estimates of the covariance of the signal and returns and the variance of the signal. These don’t need to be accurate, hopefully within a factor of 10, although 0 is also an ok general prior estimate. Also choose an amount to weight new points. weight=.5 will make new points a lot more influential than weight=.05

2. When you get a new data point, update the covariance with the formula:

covariance += weight*(signal*return - covariance)

Note that we are assuming signal and return have a mean over time of 0, which we can do since returns cannot be far from mean 0.

Update the variance of the signal with the formula:

variance_signal += weight*(signal*signal - variance_signal)

3. Calculate the new regression coefficient:

coef = covariance/variance_signal
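The three steps above can be sketched in a few lines of Python. This is an illustrative translation of the formulas, not the code behind the plot. The class name OnlineRegression is made up, the priors default to 0 as suggested in step 1, and returning 0.0 while the variance estimate is still zero is an assumption added here to avoid dividing by zero.

```python
class OnlineRegression:
    """Exponentially weighted one-variable regression, assuming zero means."""

    def __init__(self, weight, covariance=0.0, variance_signal=0.0):
        self.weight = weight                    # influence of each new point
        self.covariance = covariance            # prior covariance of signal and return
        self.variance_signal = variance_signal  # prior variance of the signal

    def update(self, signal, ret):
        """Fold in one (signal, return) observation and return the coefficient."""
        # Step 2: move each running estimate a fraction `weight` toward the new point.
        self.covariance += self.weight * (signal * ret - self.covariance)
        self.variance_signal += self.weight * (signal * signal - self.variance_signal)
        # Step 3: the coefficient is the ratio of the two estimates.
        if self.variance_signal == 0.0:
            return 0.0  # assumption: no information yet, so report 0
        return self.covariance / self.variance_signal


model = OnlineRegression(weight=0.05)
coef = 0.0
for signal in [0.5, -1.0, 2.0, 1.5, -0.25]:
    # Noise-free toy data: the return is always twice the signal.
    coef = model.update(signal, 2.0 * signal)
print(coef)
```

Each update costs a handful of multiplications regardless of how much history has been seen, which is what makes recomputing the coefficient at every backtest step cheap.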

Here’s the R code that generated the plot above. It also demonstrates the equality of this custom online algo formulation and R’s canonical weighted lm() implementation. To show the effect of adaptiveness, the signal (x) is a sine function and the returns (y) are a noisy sine function that is only > 0, so for half the phase the returns are white noise and uncorrelated to the signal. (One issue with this example is that the mean of the returns is nonzero; however, it still works well enough):

points = 100
window = 20

in_noise = 0.0
out_noise = 0.1
trend = as.matrix(sin((1:points)/10))
# when trend <= trend_visible_above, the y-series will be pure white noise
trend_visible_above = 0
x = trend+rnorm(points, m=0, sd=in_noise)
y = trend*(trend>trend_visible_above)+rnorm(points, m=0, sd=out_noise)

weights = rep(1/window,window)
coef_flat = sapply(1:(points-window), function(t)lm.wfit(as.matrix(x[t:(t+window-1)]), as.matrix(y[t:(t+window-1)]), weights)$coefficients[[1]])

decay = 1/2

weights = rev(cumprod(rep(decay,window)))
coef_batch = sapply(1:(points-window), function(t)lm.wfit(as.matrix(x[t:(t+window-1)]), as.matrix(y[t:(t+window-1)]), weights)$coefficients[[1]])

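# initialize exponential covariance and variance (the prior estimates)
ecov_xy = .5
evar_x = .5
# one online update: fold point t into the running estimates, return the coefficient
online_linreg = function(t){
  evar_x <<- evar_x + decay*(x[t]*x[t] - evar_x)
  ecov_xy <<- ecov_xy + decay*(x[t]*y[t] - ecov_xy)
  ecov_xy/evar_x
}
coef_online = sapply(1:points, online_linreg)[window:points]

# exponentially-weighted online coefficients (empty circles)
plot(coef_online, main="Exponentially-Weighted vs Flat Regression Adaptiveness", xlab="Time Index", ylab="Regression Coefficient")
# equal-weighted rolling-window coefficients
points(coef_flat, pch=10)
# same as coef_online
#points(coef_batch, pch=14)
# true coefficients: 1 while the trend is visible in y, 0 otherwise
points(trend[window:points]>0, pch=20)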

Machine Learning: Regression with stochastic volatility

I had gotten there by a long search that had gone from machine learning, to fast Kalman filters, to Bayesian conjugate linear regression, to representing uncertainty in the covariance using an inverse Wishart prior, to making it time-varying, and allowing heteroscedasticity. I was thinking it over with my koukio friends, and this paper had all the pieces of an algorithm I think is well suited to forecasting intraday returns from signals.

The features I like about machine learning and the algorithm:

Analytic: does not require sampling which is very slow. Just uses a couple of matrix multiplications.
Online: updates parameter values at every single timestep. Offline/batch learning algorithms are typically too slow to retrain at every timestep in a backtest, which forces you to make sacrifices.
Multivariate: can’t do without this.
Adaptive regression coefficients: signals are weighted higher or lower depending on recent performance.
Completely forgetful: every part is adaptive, unlike some learners that are partially adaptive and partially sticky.
Adaptive variance: standard linear regression is biased if the inputs are heteroscedastic unless you reweight input points by the inverse of their variance.
Adaptive input correlations: in case signals become collinear.
Estimates prediction error: outputs the estimated variance of the prediction so you can choose how much you want to trust it. This estimate is interpretable unlike some of the heuristic approaches to extracting confidence intervals from other learners.
Interpretable internals: every internal variable is understandable. Looking at the internals clearly explains what the model has learned.
Uni- or multi-variate: specializes or generalizes naturally. The only modification required is to increase or decrease the number of parameters, the values don’t need to be changed. 10 input variables works as well as 2 which works as well as 1.
Interpretable parameters: parameters can easily be set based on a-priori knowledge, not exclusively by crossvalidation. The author re-parameterized the model to make it as intuitive as possible – the parameters are basically exponential discount factors.
Minimal parameters: it has just the right number of parameters to constitute a useful family but no extra parameters that are rarely useful or unintuitive.
Objective priors: incorrect starting values for internal variables do not bias predictions for a long “burn-in” period.
Emphasis on first-order effects: alpha is a first-order effect. Higher order effects are nice but alpha is so hard to detect that they are not worth the extra parameters and wasted datapoints.
Bayesian: with a Bayesian model you understand all the assumptions going into your model.

Best Programming Language for Trading Systems?

Currently, I’m working on Learning Machine’s submission for Max Dama’s QuantCup. That involves optimising a “price-time priority limit order matching engine”. More simply, it means ‘making a system which matches buy and sell orders really fast’. *
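For readers who have not met the term, here is a toy Python sketch of what “price-time priority” means. It is only an illustration of the matching rule, not the QuantCup reference code and nowhere near a fast implementation. The class name, the order ids and the convention that a trade executes at the resting order’s price are all assumptions made for the example.

```python
import heapq
import itertools


class OrderBook:
    """Toy limit order book: best price first, earliest arrival first on ties."""

    def __init__(self):
        self._bids = []  # heap of (-price, arrival, order): highest bid on top
        self._asks = []  # heap of (price, arrival, order): lowest ask on top
        self._arrival = itertools.count()  # arrival counter breaks price ties

    def submit(self, order_id, side, price, qty):
        """Match an incoming limit order and rest whatever is left over.

        Returns the trades as (resting_order_id, price, quantity) tuples.
        """
        trades = []
        if side == "buy":
            own, opposite = self._bids, self._asks
        else:
            own, opposite = self._asks, self._bids
        while qty > 0 and opposite:
            resting = opposite[0][2]
            if side == "buy" and resting["price"] > price:
                break  # cheapest ask is above our limit
            if side == "sell" and resting["price"] < price:
                break  # highest bid is below our limit
            fill = min(qty, resting["qty"])
            trades.append((resting["id"], resting["price"], fill))
            qty -= fill
            resting["qty"] -= fill
            if resting["qty"] == 0:
                heapq.heappop(opposite)  # fully filled, remove from the book
        if qty > 0:
            key = -price if side == "buy" else price
            order = {"id": order_id, "price": price, "qty": qty}
            heapq.heappush(own, (key, next(self._arrival), order))
        return trades


book = OrderBook()
book.submit("A", "sell", 101, 5)
book.submit("B", "sell", 100, 5)   # better price than A
book.submit("C", "sell", 100, 5)   # same price as B, but arrived later
trades = book.submit("D", "buy", 101, 12)
print(trades)  # [('B', 100, 5), ('C', 100, 5), ('A', 101, 2)]
```

The buy order is filled against the cheapest sell orders first (price priority), and between B and C, which ask the same price, the earlier one is filled first (time priority).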

As per the competition rules, I’m programming our entry in the C programming language. But when it comes to our own system, I’m probably going to write it in something different.

Why? I want a language which balances ease of programming with speedy end results. Though compiled C is very fast, it isn’t an object-oriented (‘OO’) language, which means it’s harder to represent the concepts I’m coding about in a way which seems natural to humans.

The four most widely used OO languages out there are C++, C#, Java and Python, and in them, I’m quite happily able to implement pretty much anything within the capacity of my intelligence (we’re screwed – ed). So which one did I pick?

Machine learning technologies

Python

Straight off the bat, I knew Python was unsuitable. While the language makes it easy to pump out code at a ridiculous pace, it is terrifically slow (unless you write a library in C – but then that’s C, not Python). That particularly holds true for large scale projects.

Another consideration was the OO syntax in the language: I just don’t like it. It’s always felt tacked-on and feeble. Python is primarily a scripting language, I guess.

Having said that, Python is my language of choice for scraping data off the web and for simple model testing, so I may well come back to it later for a different purpose.

Java

Java was another candidate that was quickly crossed off our list. Why? Because as far as I know, ~~Java doesn’t allow external functions to be called without piping a string into a program~~ (if I’m wrong about this, let us know via the comments below!) [Turns out I was indeed wrong, see http://java.sun.com/docs/books/jni/html/jniTOC.html]. Another issue is the existence of C#. Pretty much the same language, but with a superset of Java’s features (i.e. does everything Java does, and more). And it has better handling of the datetime type (important!).

C++

Of the four languages listed here, I’m least comfortable in C++. I thus figured that Learning Machine would be a great way to extend my knowledge of the language.

At first, C++ seemed perfect: solid OO implementation, a fast, compiled language, the ability to write Assembly language and C straight into a program, and great IDEs (I’m a fan of Visual Studio – university students can download it free through Microsoft’s DreamSpark program). C++ was so perfect, in fact that I started programming in it right away.

However, as soon as I got the basic class structure down pat, it hit me: the compiler. Spending thirty minutes debugging a simple error such as missing a type cast is not an efficient use of my time, particularly when trying to do university study alongside programming for Learning Machine.

C# (C Sharp)

Of the four languages considered, one was left: C#. An almost perfect language, it has all the advantages of C++ (bar its speed) and offers a huge standard library, with even more libraries available on the internet. It even lets you call external functions, and use pointers – features which place it in a class above Java. Not only that, but Microsoft seem to focus their documentation heavily on the language and their IDE, which smooths the ride somewhat.

Have I missed anything? Should I have included OCaml? Objective C? Erlang? Let us know in the comments! (I’m seriously considering writing some external functions in OCaml…)

* This matching task would normally be done within the exchange itself, but for speed reasons it’s also done within many high frequency trading firms so they can see the most up to date version of the order book and make orders accordingly.

Why does value investing work?

Rational, intelligent investors should not accept that value investing works simply because it has been quite successful over the decades. How do we quantify the value of a company or a brand like Coca Cola, Apple, Vondom or Zara?

Why should we buy “undervalued” assets and expect to sell them at “fair value” in the future? And what process, if any, causes the price to revert back to this “fair value”?

In this article, James Miao tackles these problems by creatively applying economic principles.

Investor strategy and machine learning tools

Firstly, let’s define value investing as a strategy where the investor buys (sells) undervalued (overvalued) cash flow generating assets, especially equities, and holds the security until the market price reflects the true intrinsic value.

Define the intrinsic value of an example company as the present value of all future cashflows, discounted at some rate (typically the weighted average cost of capital).

This precise definition tells us that there is no difference between the traditional styles of “value” and “growth”. It doesn’t matter if you chase above-average growth or underpriced companies, because what really matters is the present value of the future cashflows. This idea fits in perfectly with Warren Buffett’s idea of “intrinsic value”.

Define the observed or market value of the company as the latest tradeable market price.

We can re-write the current market price in terms of implied cash flows discounted at the same rate as the intrinsic value. These implied cash flows are the market’s interpretation of future cash flows and we might account for these just like depreciation/amortisation, given their unreal nature.

Define the estimated value of the company as an individual investor’s “best guess” at the intrinsic value of the company.

This intrinsic value is not directly observable, but can be estimated. Investors might estimate the company’s value by performing fundamental analysis – taking in factors such as industry growth, competitors, and competitive advantage and so on. This resulting estimate is modelled as a random variable, reflecting the investor’s degree of uncertainty. Indeed, if the investor estimates the value incorrectly, he stands to lose a lot of money!

Crucially, we can expect that investors’ estimates of value are accurate on average: as long as individual estimation errors are unbiased and not all made in the same direction, they tend to cancel out across many investors.

Now that we have dispensed with all the necessary definitions, we can proceed to the argument.

Suppose we know with certainty that the intrinsic value of the company is higher than the observed value.

How can we profit from this difference? The only way to do this without risk is to buy the stock and keep receiving cashflows until the company ceases business – much like holding a bond to maturity.

The value of the profit today is the difference between the present values of all future actual and implied cash flows. Equivalently, profit = intrinsic value – market value.
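A short Python sketch may help make these definitions concrete. All the numbers below are hypothetical: a company that pays cash flows for three years and then ceases business, a 10% discount rate, and a set of lower “implied” cash flows standing in for the market price. The function name present_value is also just chosen for the example.

```python
def present_value(cash_flows, rate):
    """Discount cash flows received at the end of years 1, 2, ... back to today."""
    return sum(cf / (1.0 + rate) ** year
               for year, cf in enumerate(cash_flows, start=1))


rate = 0.10                                # hypothetical discount rate
actual_cash_flows = [10.0, 10.0, 110.0]    # hypothetical real cash flows
implied_cash_flows = [8.0, 8.0, 108.0]     # hypothetical flows implied by the market price

intrinsic_value = present_value(actual_cash_flows, rate)
market_value = present_value(implied_cash_flows, rate)
profit = intrinsic_value - market_value

# The same profit, computed as the present value of the year-by-year difference.
gap = [a - m for a, m in zip(actual_cash_flows, implied_cash_flows)]
profit_from_gap = present_value(gap, rate)

print(round(intrinsic_value, 2), round(market_value, 2), round(profit, 2))
```

Because discounting is linear, the two ways of computing the profit agree: the difference of the present values equals the present value of the differences.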

The arrows and the cash flows

These arrows represent cash flows. The difference between the black (real cashflows) and red (implied or “accounting” cashflows) arrows is the investor’s profit (green arrows). The investor realises this profit over time.

This transaction is similar to a riskless arbitrage in that we have “bought” the real cashflows and hedged them for a profit by “selling” the implied, unreal cashflows for a cheaper price.

Assume that all investors have the same expectations, risk aversion, wealth, transaction costs and market access. Additionally, assume that they all hold the same market portfolio (i.e. they all hold the ASX200 index), that they reside in a tax regime that allows full franking of dividends, and that they have infinite patience (i.e. they can hold their shares from now until the company goes bust).

An investor expects to gain a profit equal to the difference between the estimated value and the market value of the company. Note that the investor is infinitely patient, so this profit does not include any resale values – the investor is content to recoup his investment from dividends.

What is his risk? His risk is purely the uncertainty arising from estimating the intrinsic value. This is because we cannot predict the future with certainty, especially with limited public and perhaps private information. Note that there is no resale price risk because investors are assumed to hold shares till the cessation of the company.

The investor now considers his risk and expected profit and decides whether to buy the security. He will now only buy if the security offers a better risk adjusted return than his currently existing portfolio (ignore diversification benefit, as he already holds the market portfolio). For example, if the market portfolio has a risk adjusted return of 10% and the security offers 15%, then he will buy.

All investors, whom we assume to be perfectly identical, will rush in to buy the security. This drives up the price until it reaches somewhere close to, but never at, the estimated value.

The price and risk: The trading software

The price is driven up by buying until it reaches the risk threshold, at which investors believe the expected return is not worth the risk and stop buying.

In reality, prices don’t plateau like this – this result is due to our assumptions. This shouldn’t obscure the main idea that investors’ individual search for higher returns will increase demand and thus force the price to revert back to the estimated price. The trading software can help the investors and the value companies.

Now what happens when we relax some assumptions?

Remember that we assumed all investors have to hold shares until the company ceases operating? Due to the time value of money, the resale risk of the shares will become insignificant as time passes. Hence, the investor can hold the shares for a reasonable timeframe, say 5-10 years, and still gain a profit approximately equal to the profit if held to the end of the company’s life. This approximation effect means that our argument remains valid.

Taking the axe to the assumption of homogeneous expectations, what happens if investors take different views of values? The path of the price as it reaches the estimated price is then determined by a tug-of-war between sellers and buyers.

At first, the undervalued asset has few sellers and many buyers. The law of supply and demand then pushes the current price up to the market’s average estimate of value, where an equilibrium forms.

This process is a more accurate depiction of the mechanism by which value investors are able to cause prices to revert back to fair value (Remember that the market’s estimated value should be very close to the real intrinsic value on average).

Notice there is a “no man’s land” between the buyers and sellers due to the effect of the risk thresholds. Investors in this region will not buy or sell because the expected return from making a transaction is not worth the risk, thereby leaving a “power vacuum” in which prices may not be determined on the basis of value alone. After we relax the assumption of investors having homogeneous risk aversion, it can be shown that traders such as momentum traders have ample opportunity to dictate the market’s movements.

Furthermore, the remaining assumptions can be removed without changing much of these mechanics. If we remove the assumption of equal wealth and use a highly skewed wealth distribution dominated by several large institutions, then the distribution of estimates will merely be more peaked at certain values. The introduction of trading costs only widens the “no man’s land” between buyers and sellers. Information asymmetry means that some investors will have less estimation risk than others, giving them an edge in the market. Since our argument remains strong even after the removal of all these assumptions, we should be reasonably confident that there is a sound fundamental reason behind the success of value investing.

Ultimately, the reason why value investment has worked for so long is human greed. Greed ensures that investors will always search for higher returns, adjusted for risk of course. This constant search for higher returns drives investors to compete for the best investment opportunities, which in turn drives up the price of undervalued assets back to their fundamental, intrinsic value.

So rest assured value investors, your investment approach will continue to be successful for as long as human greed exists.

Machine learning and .NET technology initiative

It sounds like a simple question, but the answer isn’t so simple. .NET is a technology initiative, a computing vision, a business strategy, a development platform, a way to deliver services on the Web and probably a lot more things besides. Got all that?

ZDNet.com’s Tech Update calls .NET “the ambitious, bet-the-company initiative that aspires to weave a fabric of XML-enabled software and services across the Internet.”

And what does that mean? Even if you know that XML stands for Extensible Markup Language, you may need more explanation. Microsoft describes .NET this way.

Microsoft and Machine Learning

“Microsoft® .NET is the Microsoft XML Web services platform. XML Web services allow applications to communicate and share data over the Internet, regardless of operating system, device, or programming language. The Microsoft .NET platform delivers what developers need to create XML Web services and stitch them together. The benefit to individuals is seamless, compelling experiences.”

Other benefits (according to Microsoft) would be:

“By using the Internet to enable software applications to more easily work together, Microsoft® .NET promises easier integration within and between businesses, while creating opportunities to more meaningfully connect with consumers. With the tools of the .NET platform, businesses can realize improvements in the time and cost associated with developing and maintaining their business applications, as well as benefiting from empowering employees with the ability to act on vital information anywhere, from any smart device.”

ZDNet.com Tech Update puts it this way:

“. . . (Microsoft) .Net products and services (are) designed to bring business computing onto the Web–as a means by which companies can eliminate time and technology barriers between customers, partners and employees.”

And further adds:

“The .Net strategy ties together nearly all of the software giant’s products, services, Web sites and development efforts. It includes a new blueprint for how software should be designed; a set of products for building that software; and .Net My Services, an initial set of Microsoft-hosted services. Later this year, the company plans to offer content, shopping, banking, entertainment and other Internet services through a variety of devices, all linked to its Passport authentication service.”

There’s lots more to .NET – a programming model (the .NET framework), a development environment (VisualStudio.NET), developer tools (like ASP.NET), programming languages (C#, VisualBasic.NET) and other elements (smart clients, .NET servers, etc) – but those are beyond the scope of this discussion.