Algorithmic Trading Courses

For those who would like some organized introductory materials on algorithmic trading, I have designed an algorithmic trading course using materials from published papers. The course has been successfully conducted at two universities and has also been used for in-house training at several companies. You may find the lecture notes here.

I am running this algorithmic trading course again in July with the NTU-SGX program.

In addition, my friend, Ernest Chan, will be conducting another algorithmic trading course in August. His course focuses on pairs trading and uses Matlab as the backtesting tool. Reuters will also be presenting their databases. Information can be found here.

The Right (and Wrong) Way to Run an Algorithmic Trading Group

I would like to share with you the unique vision that Numerical Method Inc. has about running an algorithmic trading group. To get an edge over competing funds, we emphasize 1) the research process and 2) technology, rather than hiring more intelligent people.

Currently, the majority of quant funds run like cheap arcade booths: the traders are given workstations and data, and they do whatever it takes to crank out strategies. The only contributing factor to profit is luck – luck in finding the right people and/or luck in finding the right strategies. I had a conversation with an executive from a large financial organization two years ago when they started to build an algorithmic trading group. He said, “Haksun, we need to hire some very smart people to be better than Renaissance.” I repeatedly hear something similar from various portfolio managers and hedge fund owners.

Staffing “very smart” people on this ad-hoc, unmethodical, non-scientific search for profitable trading strategies is merely a lottery in disguise.

  1. The main reason is that there is no necessary link between “very smart” people and “very profitable” trading strategies. At best, smart people make profitable strategies more likely; they never guarantee them.
  2. It is very difficult to hire the “very smart” people because
    1. They are difficult to identify among the numerous people who merely appear smart.
    2. The competition for them is fierce. The best examples are the fight between Microsoft and Google over Dr. Lee Kai-Fu, and the dispute between Renaissance Technologies and Millennium Partners over two former employees.
  3. The very best people are driven by passion rather than money, e.g., Gates, Jobs. They cannot be hired.

The key to success in running an algorithmic trading group, as in warfare, lies in process (tactics) and technology (weapons), not in star traders. Analogously, knights, despite being elite warriors, were superseded by cheap infantry; German tanks, despite being the best engineered in WWII, were overwhelmed by the sheer number of cheap USSR tanks. The better traders do not get their profitable strategies from higher beings. Speaking as an AI scientist, they are “better” only because their search covers a bigger space (more knowledge) and runs faster (more efficiently). We have created a process and a technology that enable a good trader to be just as profitable as the “very smart” traders.

Process

In terms of the algorithmic trading research process, there is usually very little standardization even within one firm, let alone across the industry. Most quant fund houses do not invest in building research technology (with the notable exception of BlackRock).

This is best illustrated by the languages they use for backtesting. With 6 traders, they could be using Matlab, R, VBA, C++, Java, and C#. The first consequence of this everyone-uses-their-favorite-language approach is that there is absolutely no code sharing. The firm writes the same volatility calculation function 6 times. If trader A comes up with a new way of measuring volatility, trader B cannot leverage it. Trader C cannot quickly prototype a new trading idea by combining the mean-reverting signal from trader D with the trend-following signal from trader E. Productivity is very low.
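To make the point concrete, here is a minimal sketch (not actual library code; the class and method names are hypothetical) of the kind of shared utility a common research library would contain, so the volatility calculation is written once, tested once, and reused by everyone.

    import java.util.List;

    /** A hypothetical shared utility: every trader calls the same, tested implementation. */
    public final class Volatility {

        private Volatility() {
        }

        /** Annualized volatility computed from a series of periodic (e.g., daily) returns. */
        public static double annualized(List<Double> returns, int periodsPerYear) {
            int n = returns.size();
            if (n < 2) {
                throw new IllegalArgumentException("need at least 2 returns");
            }

            double mean = 0.0;
            for (double r : returns) {
                mean += r;
            }
            mean /= n;

            double sumSq = 0.0;
            for (double r : returns) {
                sumSq += (r - mean) * (r - mean);
            }
            double periodVol = Math.sqrt(sumSq / (n - 1)); // sample standard deviation of returns

            return periodVol * Math.sqrt(periodsPerYear);  // scale to an annualized figure
        }
    }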

Management is unable to compare strategies from two traders because the 6 traders all make different “simplifications” in backtesting. For instance, they clean the same data set in different ways; they use their own “proprietary” bid-ask, execution, and (market vs. limit) order models for computing historical P&L; and they make different assumptions about liquidity and market impact. Mainly to simplify coding, they make all sorts of guesses about the details of executing their strategies. Management does not have the time to question every single detail in their backtesting, hence the lack of understanding and confidence; it simply resorts to “trusting” the reports.

Worse still, while algorithmic traders may be excellent mathematicians, they are usually poor programmers. I have yet to see an algorithmic trader who understands modern programming concepts such as interface vs. inheritance, memory models, thread safety, design patterns, testing, and software engineering. They usually produce code that is long, unstructured, and poorly documented. Such code is bound to have bugs. Therefore, management cannot trust their backtesting results and performance reports.

Our solution to systematize the trading research process has three steps. First, we mandate that all traders do their backtesting in the same language, e.g., Java. Second, we mandate that all traders contribute their code to a research library. Third, the firm invests in a common backtesting infrastructure by expanding this research library. The advantages are as follows.

  1. The traders can focus on what they are supposedly good at – generating innovative trading strategies. They no longer bother with the IT grunt work.
  2. They can quickly prototype a trading idea by putting together components, e.g., signals, filters, and modules, from the research library (see the sketch after this list).
  3. They can share code with colleagues and be understood because all conform to the same standard.
  4. They can compare strategies because the performance measures are computed from simulations making the same assumptions.
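To illustrate point 2, here is a hedged sketch of what composing a strategy from library components might look like. The Signal interface and the three signal classes are hypothetical stand-ins, not actual library classes, and the signal logic is deliberately toy-like.

    /** A hypothetical signal abstraction in the shared research library. */
    interface Signal {
        /** Returns a value in [-1, +1]: -1 = strong sell, +1 = strong buy. */
        double value(double[] prices);
    }

    /** Trader D's mean-reversion signal, reused as-is from the library. */
    class MeanReversionSignal implements Signal {
        public double value(double[] prices) {
            // toy logic: price below its 20-bar average suggests reversion upward
            int n = Math.min(20, prices.length);
            double avg = 0.0;
            for (int i = prices.length - n; i < prices.length; ++i) {
                avg += prices[i];
            }
            avg /= n;
            double last = prices[prices.length - 1];
            return Math.max(-1.0, Math.min(1.0, 10.0 * (avg - last) / avg));
        }
    }

    /** Trader E's trend-following signal, also reused as-is. */
    class TrendFollowingSignal implements Signal {
        public double value(double[] prices) {
            // toy logic: sign of the price change over the last 20 bars
            int n = Math.min(20, prices.length);
            return Math.signum(prices[prices.length - 1] - prices[prices.length - n]);
        }
    }

    /** Trader C prototypes a new idea simply by averaging the two existing signals. */
    class CombinedSignal implements Signal {
        private final Signal[] components;

        CombinedSignal(Signal... components) {
            this.components = components;
        }

        public double value(double[] prices) {
            double sum = 0.0;
            for (Signal s : components) {
                sum += s.value(prices);
            }
            return sum / components.length; // equal-weighted combination
        }
    }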

Over time, as the algorithmic trading firm invests in, creates, and expands the research library and backtesting system, this IT infrastructure becomes its most valuable asset. Suppose a star trader takes 3 months to test a good idea and make it profitable. With this standardized process and technology, a good trader can rapidly prototype 30 strategies in 3 days, in parallel, on a cluster of computers. The profitability of the firm then depends not on hiring Einstein but on good, hardworking people leveraging the infrastructure.
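As a rough sketch of how dozens of strategy variants could be backtested in parallel, the code below fans the parameter combinations out over a thread pool. The Backtester interface and its run method are hypothetical placeholders for whatever simulation engine the firm standardizes on.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelBacktest {

        /** Hypothetical backtester: runs one strategy variant and returns its P&L. */
        public interface Backtester {
            double run(int lookback, double entryThreshold) throws Exception;
        }

        /** Submits every parameter combination to a thread pool and collects the P&Ls. */
        public static List<Double> runAll(final Backtester backtester,
                                          int[] lookbacks,
                                          double[] thresholds)
                throws InterruptedException, ExecutionException {
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            try {
                List<Future<Double>> futures = new ArrayList<Future<Double>>();
                for (final int lookback : lookbacks) {
                    for (final double threshold : thresholds) {
                        futures.add(pool.submit(new Callable<Double>() {
                            public Double call() throws Exception {
                                return backtester.run(lookback, threshold);
                            }
                        }));
                    }
                }

                List<Double> pnls = new ArrayList<Double>();
                for (Future<Double> f : futures) {
                    pnls.add(f.get()); // blocks until that simulation is done
                }
                return pnls;
            } finally {
                pool.shutdown();
            }
        }
    }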

Technology

There are many vendors that sell backtesting tools. However, IMHO, these backtesters are no more than augmented for-loops – looping over historical data. Some better ones come with various features, e.g., importing data, cleaning data, statistics and signal computation, optimization, and real-time paper trading. Some even go one step further and provide brokerage services for real trading. The major problem with all these backtesters is that you cannot code, hence cannot simulate, true quantitative trading strategies with them. Suppose you want to replicate Elliott and Malcolm’s pairs trading strategy: you will need stochastic differential equations (SDE), the EM algorithm, maximum likelihood estimation (MLE), and the Kalman filter (KF). Most commercial backtesters are built for amateurs and simply cannot do it.
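To give a flavor of why a glorified for-loop is not enough: a strategy in the spirit of Elliott and Malcolm’s model treats the observed spread as a noisy observation of a hidden mean-reverting state and filters it. Below is a minimal, self-contained sketch of a scalar Kalman filter for such a state; the parameter names are generic, and the EM/MLE calibration step is deliberately omitted.

    /**
     * Minimal scalar Kalman filter for a hidden mean-reverting AR(1) state x:
     *   x_k = a * x_{k-1} + b + w_k,  w_k ~ N(0, q)   (state transition)
     *   y_k = x_k + v_k,              v_k ~ N(0, r)   (observed, noisy spread)
     * The parameters a, b, q, r would normally be calibrated by EM/MLE; here they are given.
     */
    public class ScalarKalmanFilter {

        private final double a, b, q, r;
        private double x; // current state estimate
        private double p; // current estimate variance

        public ScalarKalmanFilter(double a, double b, double q, double r, double x0, double p0) {
            this.a = a;
            this.b = b;
            this.q = q;
            this.r = r;
            this.x = x0;
            this.p = p0;
        }

        /** One predict-update step given a newly observed spread y; returns the filtered estimate. */
        public double step(double y) {
            // predict
            double xPred = a * x + b;
            double pPred = a * a * p + q;

            // update
            double k = pPred / (pPred + r); // Kalman gain
            x = xPred + k * (y - xPred);
            p = (1.0 - k) * pPred;

            return x;
        }

        /** A trading rule might compare y to this filtered estimate, e.g., trade when |y - x| is large. */
        public double estimate() {
            return x;
        }
    }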

There are a few more professional backtesters that provide links to Matlab or R for data analytics. I have a lot of complaints about doing data analysis in Matlab/R. Many traders use them only because they cannot code. (It is extremely rare to find someone who is both mathematically talented and able to code.) Matlab/R are for amateurs; VBA is a joke. The problems are:

  1. They are very slow. These scripting languages are interpreted line-by-line. They are not built for parallel computing. (OK, I know Matlab/R can do parallel computing, but who uses it in practice? Which traders understand immutability well enough to write good parallel code? See the sketch after this list.)
  2. They do not handle large amounts of data well. How do you handle two years’ worth of EUR/USD tick-by-tick data in Matlab/R?
  3. There are no modern software engineering tools built for Matlab/R. How do you know your code is correct? Most traders simply assume their code is correct because they are not trained programmers.
  4. The code cannot be debugged easily. OK, Matlab comes with a toy debugger somewhat better than gdb, but it does not compare to NetBeans, Eclipse, or IntelliJ IDEA.
  5. How do you debug a large Matlab/R application with 50 pages of scripts anyway? I usually just give up.
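Point 1 mentions immutability; as a concrete (illustrative, not library) example, the class below is immutable, so instances can be shared across threads in a parallel backtest without any locking.

    /**
     * An immutable tick: all fields are final and there are no setters, so instances
     * can be shared freely across threads without synchronization.
     */
    public final class Tick {

        private final long timestamp; // milliseconds since the epoch
        private final double bid;
        private final double ask;

        public Tick(long timestamp, double bid, double ask) {
            this.timestamp = timestamp;
            this.bid = bid;
            this.ask = ask;
        }

        public long timestamp() {
            return timestamp;
        }

        public double bid() {
            return bid;
        }

        public double ask() {
            return ask;
        }

        public double mid() {
            return (bid + ask) / 2.0;
        }

        /** "Modification" returns a new object instead of mutating this one. */
        public Tick withBidAsk(double newBid, double newAsk) {
            return new Tick(timestamp, newBid, newAsk);
        }
    }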

You can tell that Matlab/R is not fit for financial modelling simply by observing that no serious person, bank, or fund does option pricing in Matlab/R. My former employers did not price our portfolios using Matlab/R on my workstation; they deployed a few thousand computers in a grid to run the pricing code overnight! Likewise, we need a cluster of computers to run our trading models over many different scenarios with many different parameters before we are confident enough to bet a few hundred million on them.

Algo Quant is a first attempt by Numerical Method Inc. to solve this technology problem. It is an embodiment of the systematic algorithmic trading research process discussed above: a suite of algorithmic trading research tools (with source code) for trading idea development, quick prototyping, data preparation, in-sample calibration, out-of-sample backtesting, and even automatic trading strategy generation. Algo Quant is backed by SuanShu, a powerful modern library of mathematics, statistics, and data mining. For more information, please check out this.

If you are a passionate quant/trader/programmer who shares our vision to revolutionize the algorithmic trading industry, please join us! If you are a hedge fund that wants to hire us to implement this scientific trading research process in-house, please contact us.

Gain Capital FX rate data

For most non-professional FX traders, one hurdle to starting your own algorithmic trading research is obtaining tick-by-tick data. Gain Capital Group has been very kind and does the community a great service by providing historical rate data on its website: http://ratedata.gaincapital.com/. The data look OK.
Unfortunately, one serious problem that catches the eye is the very ad-hoc format of the data files. For example:
  • The newer csv files store data by week (e.g., in year 2010) while some older csv files store data by quarter (e.g., in year 2000).
  • Some zipped files contain folders while others contain csv files.
  • Some zipped files contain other zipped files.
  • Some csv files have headers while others don’t.
  • The column ordering is not the same across all csv files. Some csv files even have missing columns.
  • The timestamp formats are not all the same.
  • Worst of all, the csv files have different encodings!
To process these “raw” zipped data files into a useful and, more importantly, usable format for research, I recommend the following steps (a simplified sketch follows the list).
  1. First, unzip the raw zipped files recursively until all plain csv files are stored in one single folder. This handles the cases of zipped files inside zipped files and folders inside folders.
  2. Read each csv file, row by row, using an appropriate parser (depending on the encoding, format, headers, etc.).
  3. Group the rows of the same date together and write them out as one csv file, e.g., AUDCAD.2007-10-7.csv.zip.
  4. Zip the csv file for easy archive.
  5. Write an adapter to read the processed zipped csv files and convert the data into whatever format your backtesting tool reads.
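Before listing the actual Algo Quant classes, here is a stripped-down, hypothetical sketch of steps 2 and 3, i.e., parsing rows and grouping them by calendar date. The column index and timestamp format assumed here are illustrative only; the real classes handle the many format variations described above.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    /** Hypothetical sketch: read one csv of ticks and bucket the rows by calendar date. */
    public class GroupTicksByDate {

        public static TreeMap<String, List<String>> group(File csv) throws IOException {
            TreeMap<String, List<String>> byDate = new TreeMap<String, List<String>>();
            BufferedReader reader = new BufferedReader(new FileReader(csv));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(",");
                    // assumption for this sketch: the timestamp is the 2nd column, "yyyy-MM-dd ..."
                    if (fields.length < 2 || !fields[1].matches("\\d{4}-\\d{2}-\\d{2}.*")) {
                        continue; // skip headers and malformed rows
                    }
                    String date = fields[1].substring(0, 10); // e.g., "2007-10-07"
                    List<String> rows = byDate.get(date);
                    if (rows == null) {
                        rows = new ArrayList<String>();
                        byDate.put(date, rows);
                    }
                    rows.add(line);
                }
            } finally {
                reader.close();
            }
            return byDate; // each entry is then written out (and zipped) as one daily file
        }
    }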
I have written some code to automate these steps, and they are now available in Algo Quant for free. Specifically, the important Java classes are:
  1. GainCapitalFxCsvFilesExtractor.java: unzip all csv files into one folder
  2. GainCapitalFXRawFile.java: read the Gain Capital csv files of various formats, save and zip the data by dates.
  3. GainCapitalFXRawFile.java: read the zipped csv data files.
  4. GainCapitalFXCacheFactory.java: the data feed/adapter to read the zipped csv data files into the Algo Quant backtesting system.
Here is a sample usage:
To unzip the raw zipped files from Gain Capital, we do
    public void test_0010() {
        GainCapitalFxCsvFilesExtracter extracter = new GainCapitalFxCsvFilesExtracter();
        String outputDirName = "./log/tmpGCcsv";
        String tmpDirName = "./log/tmpDir";
        int nCSV = extracter.extract("./test-data/gaincapital/2000", outputDirName,
            tmpDirName);
        assertEquals(18, nCSV);

        // clean up
        NMIO.deleteDir(new File(outputDirName));
        NMIO.deleteDir(new File(tmpDirName));
    }
To extract the rows from the raw files, then group, save, and zip them by date, we do
    public void test_0030() throws IOException {
        String input = "C:/Temp/GCcsv";
        String output = "C:/Temp/byDates";

        // create the output directory if it does not already exist
        File outputDir = new File(output);
        if (!outputDir.exists()) {
            outputDir.mkdirs();
        }

        File inputDir = new File(input);

        //reading the file names
        File[] csvs = inputDir.listFiles(new FilenameFilter() {

            public boolean accept(File dir, String name) {
                return name.matches(".*[.]csv");
            }
        });

        //these pairs are already processed before, e.g., the program crashed
        Set<String> done = new HashSet<String>();
//        done.add("AUD_USD");
//        done.add("CHF_JPY");

        Set<String> pairs = new HashSet<String>();
        for (File csv : csvs) {
            String pair = csv.getName().substring(0, 7);
            if (pair.matches("..._...")) {
                if (!done.contains(pair)) {
                    pairs.add(pair);
                }
            }
        }

        GainCapitalFxDailyCsvZipFileConverter converter =
                new GainCapitalFxDailyCsvZipFileConverter();
        for (String pair : pairs) {
            String ccy1 = pair.substring(0, 3);
            String ccy2 = pair.substring(4, 7);
            System.out.println(String.format("working on %s/%s", ccy1, ccy2));
            converter.process(
                    ccy1, ccy2,
                    input,
                    String.format("%s/%s%s", output, ccy1, ccy2));
        }
    }
Have fun using the Gain Capital FX data for your own trading research!

The Demand for Good Financial Engineers

It is very difficult to find people who can produce effective code for mathematical models. In general, mathematicians do not write good code, and computer programmers do not understand the (advanced) mathematics. Yet a computer language is the only language we have for turning an idea into a product that people can actually use. I foresee that the truly valuable talents are those who have creative ideas and are able to implement them. They are the most sought-after in many engineering disciplines, e.g., finance.

Speaking from personal experience, I have been working exclusively in the financial industry, algorithmic trading in particular. This field is where mathematics and computer science meet. As an algo trader, my job is to develop mathematical models and write computer software for automatic execution. In addition, I lead a team of mathematicians who design trading models and a team of programmers who build these systems. From these years of experience, I find it exceedingly difficult to hire the right candidates for the algorithmic trading industry.

I have hired some very good statisticians and mathematicians from the top schools. They produce good research and design very sophisticated mathematical models. Unfortunately, they are not able to program their models to a professional standard, ready to be used by other people. Often, programmers need to translate their Matlab/R prototypes into C++/C#/Java. On the other hand, it is unrealistic to expect the programmers to develop these mathematical models, simply because they do not have the training. In summary, my dilemma is: mathematicians cannot differentiate between inheritance and interface; programmers do not know about hidden Markov models.

It is very surprising to me that, given the prevalence of computer programming in industries such as finance and biochemistry, universities have not placed enough emphasis on properly training their students. In my opinion, in a scientific corporation, programming skill is as essential as speaking English: it is a way to communicate ideas (models) in a form that other people can actually use, i.e., a product. Our school curricula have a number of drawbacks.

First, programming courses are not mandatory for many students, even science students. For instance, statistics students can graduate without ever taking a course in computer programming. They might very well be proficient in Matlab, R, and other specialized software, but these are not real programming languages and have little use in an industrial production environment. The students are still not trained in object-oriented concepts, debugging skills, software engineering principles, team collaboration, or the quick adoption of new tools and technologies.

Second, the professors, instructors, or lecturers teaching the programming courses usually have little industrial programming experience. Speaking from personal experience, I thought I was a good programmer when I graduated with a PhD in computer science, after spending many years programming for my thesis and homework. It turned out that I was very naïve and ignorant; my first job proved that I knew nothing about industrial programming. Looking back, the professors teaching programming in universities are probably at the stage where I once was. Most have never delivered a real product (not hands-on, anyway).

Therefore, in this technology-driven era, the truly valuable talents are those who have creative ideas and are able to implement them. I foresee that, gradually, schools will begin to recognize that good programming skills are as essential on the job as good communication skills. I am looking forward to changes in academic curricula: more emphasis placed on computer programming training across all majors. At the least, all engineering students should be proficient in one modern programming language. Equally important, these programming courses should be taught by experienced professionals rather than academics who are trained to write journal papers.

In conclusion, I would like to see universities introduce a new course on numerical programming: a course that teaches science students (not just computer science students) how to code mathematical models. That includes a modern programming language, software engineering methodology, debugging, algorithm design and analysis, effective implementation, and design patterns.

What is missing in our computer science curriculum?

From my years of experience recruiting computer science students, it seems that universities fail to train students who can build good software. When I hire freshly graduated students, I often need to rewrite their code (sometimes from scratch) before putting it in production. Most graduates know what inheritance and interface are, but few of them can properly choose between the two and defend their choice. Most graduates have taken classes in “Operating Systems” and thus can talk a lot about threads and scheduling, but few have actually read books like the one by Doug Lea; they cannot be entrusted to write multi-threaded code. Most graduates memorize the visited nodes when writing a function to detect a cycle in a linked list, but few of them are aware of Floyd’s cycle-finding algorithm (sketched below). The funniest thing is that a lot of them think they are writing object-oriented code because they have replaced ‘struct’ in C with ‘class’ in C++. They have yet to read a book on design patterns.
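For reference, Floyd’s cycle-finding algorithm detects a cycle in O(n) time and O(1) extra space by running two pointers at different speeds; here is a short sketch with a minimal, made-up Node class.

    /** A minimal singly-linked list node for the sketch. */
    class Node {
        int value;
        Node next;
    }

    class CycleDetector {

        /**
         * Floyd's cycle-finding algorithm ("tortoise and hare"):
         * O(n) time, O(1) extra space, no set of visited nodes required.
         */
        static boolean hasCycle(Node head) {
            Node slow = head; // advances one step per iteration
            Node fast = head; // advances two steps per iteration
            while (fast != null && fast.next != null) {
                slow = slow.next;
                fast = fast.next.next;
                if (slow == fast) {
                    return true; // the hare has lapped the tortoise: there is a cycle
                }
            }
            return false; // the hare reached the end of the list: no cycle
        }
    }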

In other words, these graduates cannot use programming languages effectively (ineffective). They cannot communicate their ideas to colleagues through their code (incomprehensible). They do not evaluate the space and time complexities of their algorithms (inefficient). They produce spaghetti code that is so difficult to modify (unmaintainable) that we have this motto in the industry: if it ain’t broke, don’t fix it. It is next to impossible to add new requirements without rewriting from scratch (inextensible). Often when I do a code review, the code looks like scribble to me; I have no way to tell whether it is right or wrong (incorrect).

I would like to point out that computer programming is a real profession that requires more than merely reading “C++/Java/C# for Dummies”. We need programmers to know more than just the programming constructs and some irrelevant theories they never use on the job. Given a project, a competent programmer should be able to understand the problem, design algorithms that are efficient in space and time, evaluate different implementations, write legible code, and provide systematic evidence that the code works. It is very unfortunate that universities fail to produce students who qualify.

Why don’t universities produce good programmers? The major problem lies in the curriculum. Speaking from personal experience, I managed to finish all the B.S. computer science requirements and obtain an M.S. and a Ph.D. in this field without ever taking a course in software engineering. In fact, I have never seen a university course whose purpose is to teach students how to write good programs, not at the top-tier “research” universities anyway. Somehow the schools assume everyone can read a book such as “C++/Java/C# for Dummies” on their own. Consequently, most fresh graduates write dummy code.

Yes, we do programming projects for courses like Data Structures, Operating Systems, Artificial Intelligence, etc. But these assignments aim to enhance the students’ knowledge of the particular topics; the code is never graded on how professionally it is written. Take Operating Systems as an example: the assignments aim to enhance the students’ knowledge of threads and OS scheduling algorithms, but they are never designed to train students in writing good multi-threaded programs. The textbooks used in class teach about mutexes and semaphores, but they do not teach how to use them effectively. How many students, after taking an OS class, can design a thread-safe stack? I bet most of them know what threads are and what a stack is, but few can design an efficient and effective thread-safe stack class (a sketch of one follows).
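As a concrete benchmark of what I mean, here is a minimal thread-safe stack sketch. A production version would also need systematic tests, documented blocking behavior, and perhaps a lock-free design, but even this simple version is beyond many fresh graduates.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.NoSuchElementException;

    /** A minimal thread-safe stack: every public method synchronizes on one internal lock. */
    public class ThreadSafeStack<T> {

        private final Deque<T> items = new ArrayDeque<T>();
        private final Object lock = new Object();

        public void push(T item) {
            synchronized (lock) {
                items.push(item);
            }
        }

        /** Removes and returns the top element; throws if the stack is empty. */
        public T pop() {
            synchronized (lock) {
                if (items.isEmpty()) {
                    throw new NoSuchElementException("stack is empty");
                }
                return items.pop();
            }
        }

        public int size() {
            synchronized (lock) {
                return items.size();
            }
        }
    }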

The second problem is the faculty. At a university, most faculty members hold doctorate degrees, and the undergraduate courses are usually taught by newly graduated doctorate holders. These doctorate holders are trained to publish journal papers. Many never deliver an industrial application in their lifetime (not hands-on, anyway). Not only do they not know how to write good code, they also cannot appreciate how important it is to produce good code. For instance, in a Data Structures course, a professor teaches about stacks, but he is unlikely to have implemented a thread-safe stack class for professional use. Without ever working in industry, it is very difficult for a professor to understand the details and issues associated with using a thread-safe stack. It would be even more difficult for him to appreciate how important it is to provide systematic evidence that his class works and is ready for a third party to use.

One may argue that “Computer Science” is about science, so we study Artificial Intelligence, Operating Systems, etc.; science is about research and innovation, not about programming. My dilemma is: if we cannot go to computer science schools to find qualified programmers, where should we turn? And what exactly is the use of university-trained computer science graduates if they cannot write good code? After all, not everyone becomes a scientist. Most of them are engineers who turn scientific ideas into real products that benefit our societies.