For most non-professional FX traders, one hurdle to starting your own algorithmic trading research is getting tick-by-tick data. Gain Capital Group has been very kind and does the community a great service by providing historical rate data on their website: http://ratedata.gaincapital.com/. The data look OK.
Unfortunately, one serious problem that immediately catches the eye is the very ad hoc format of the data files. For example,
• The newer csv files store data by weeks (e.g., in year 2010), while some older csv files store data by quarters (e.g., in year 2000).
• Some zipped files contain folders while others contain csv files.
• Some zipped files contain other zipped files.
• Some csv files have headers while others don’t.
• The ordering of the columns is not the same across csv files. Some csv files even have missing columns.
• The timestamp formats are not all the same.
• Worst of all, the csv files use different encodings!
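To get a feel for the timestamp problem alone, a robust reader has to try several layouts before giving up. A minimal sketch in Python (the candidate formats below are illustrative, not an exhaustive list of what Gain Capital actually uses):

```python
from datetime import datetime

# candidate layouts seen in different vintages of the files (illustrative only)
FORMATS = [
    "%Y-%m-%d %H:%M:%S.%f",  # with sub-seconds
    "%Y-%m-%d %H:%M:%S",     # without sub-seconds
    "%m/%d/%Y %H:%M:%S",     # US-style dates
]

def parse_timestamp(s):
    """Try each known layout in turn; raise if none matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognized timestamp format: %r" % s)
```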
In order to process these “raw” zipped data files into a useful and, more importantly, usable format for research, I recommend the following steps.
1. First unzip the raw zipped files recursively until all plain csv files are stored in one single folder. This handles zipped files nested inside zipped files, and folders inside folders.
2. Read each csv file using an appropriate parser (depending on the encoding, format, headers, etc.), row by row, line by line.
3. Group the rows of the same date together and write them out as one csv file, e.g., AUDCAD.2007-10-7.csv.zip.
4. Zip the csv file for easy archiving.
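Step 1, the recursive unzip, can be sketched as follows in Python (a simplified illustration, not AlgoQuant's actual implementation; name collisions between nested archives are not handled):

```python
import os
import shutil
import zipfile

def extract_all_csv(zip_path, out_dir):
    """Recursively unzip zip_path, flattening every .csv into out_dir.

    Nested zip files are extracted in turn; returns the number of csv
    files extracted. Folder structure inside the archives is discarded."""
    n = 0
    os.makedirs(out_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            name = os.path.basename(info.filename)
            target = os.path.join(out_dir, name)
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            if name.lower().endswith(".zip"):
                n += extract_all_csv(target, out_dir)  # zip inside zip
                os.remove(target)  # discard the intermediate archive
            elif name.lower().endswith(".csv"):
                n += 1
    return n
```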
I have written some code to automate these steps, and they are now available in Algo Quant for free. Specifically, the important Java classes are:
1. GainCapitalFxCsvFilesExtractor.java: unzip all csv files into one folder
2. GainCapitalFxDailyCsvZipFileConverter.java: read the Gain Capital csv files of various formats, then save and zip the data by dates.
3. GainCapitalFXRawFile.java: read the zipped csv data files.
4. GainCapitalFXCacheFactory.java: the data feed/adapter to read the zipped csv data files into the Algo Quant backtesting system.
Here is a sample usage:
To unzip the raw zipped files from Gain Capital, we do
    public void test_0010() {
        GainCapitalFxCsvFilesExtractor extractor = new GainCapitalFxCsvFilesExtractor();
        String outputDirName = "./log/tmpGCcsv";
        String tmpDirName = "./log/tmpDir";
        int nCSV = extractor.extract("./test-data/gaincapital/2000", outputDirName, tmpDirName);
        assertEquals(18, nCSV);

        // clean up
        NMIO.deleteDir(new File(outputDirName));
        NMIO.deleteDir(new File(tmpDirName));
    }
To extract the rows from the raw files, then group, save, and zip them by dates, we do
    public void test_0030() throws IOException {
        String input = "C:/Temp/GCcsv";
        String output = "C:/Temp/byDates";

        File outputDir = new File(output);
        if (!outputDir.canWrite()) {
            outputDir.mkdir();
        }

        File inputDir = new File(input);

        File[] csvs = inputDir.listFiles(new FilenameFilter() {

            public boolean accept(File dir, String name) {
                return name.matches(".*[.]csv");
            }
        });

        // these pairs were already processed before, e.g., before the program crashed
        Set<String> done = new HashSet<String>();

        Set<String> pairs = new HashSet<String>();
        for (File csv : csvs) {
            String pair = csv.getName().substring(0, 7);
            if (pair.matches("..._...")) {
                if (!done.contains(pair)) {
                    pairs.add(pair); // collect the currency pairs to process
                }
            }
        }

        GainCapitalFxDailyCsvZipFileConverter converter =
                new GainCapitalFxDailyCsvZipFileConverter();
        for (String pair : pairs) {
            String ccy1 = pair.substring(0, 3);
            String ccy2 = pair.substring(4, 7);
            System.out.println(String.format("working on %s/%s", ccy1, ccy2));
            converter.process(
                    ccy1, ccy2,
                    input,
                    String.format("%s/%s%s", output, ccy1, ccy2));
        }
    }
Have fun using the Gain Capital FX data for your own trading research!

1. This is extremely useful for my own research. I have been looking for this for a long time. Thank you!

2. I’ve done almost the same thing (but in Python – much easier). Instead of writing back to CSV files, I wrote to binary files, which are much faster to read back, since you don’t have to parse the date again.

By the way, would you like to read your data for, e.g., examination and verification, in csv rather than binary files?
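A minimal sketch of what such a binary layout could look like, using Python's struct module: each tick is stored as fixed-width epoch-microseconds plus bid/ask doubles, so reading back needs no date parsing. The field layout here is my own illustration, not the commenter's actual code.

```python
import struct
from datetime import datetime, timedelta, timezone

# one tick = epoch-microseconds (int64) + bid + ask (float64 each); my own layout
TICK = struct.Struct("<qdd")
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def write_ticks(path, ticks):
    """ticks: iterable of (aware-UTC datetime, bid, ask) tuples."""
    with open(path, "wb") as f:
        for ts, bid, ask in ticks:
            micros = (ts - EPOCH) // timedelta(microseconds=1)  # exact integer
            f.write(TICK.pack(micros, bid, ask))

def read_ticks(path):
    """Read the ticks back without any date-string parsing."""
    with open(path, "rb") as f:
        data = f.read()
    for micros, bid, ask in TICK.iter_unpack(data):
        yield EPOCH + timedelta(microseconds=micros), bid, ask
```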

4. This was very useful indeed.

For those like me coming from the .NET world, here is a step-by-step guide:

1 – Don’t bother trying to compile the classes presented on this page in a standalone Java project. It won’t work. You NEED to download AlgoQuant directly (see the link to the software in the article above), then download NetBeans (a Java IDE – I tried to use Eclipse, my preferred Java IDE, but had a hard time configuring all the jars, libs, packages, and stuff…).
There is an nbproject folder that will automatically be recognized by NetBeans when you open the algoquant folder (via “Open Project”), and you’ll have the right structure with everything compiling from the start.

2 – Keep the folder structure as is. Yes, you need that publicmodels folder too, at the same level as the algoquant folder, in order for everything to compile.

3 – Download a trial license (this is a bit annoying, but fair enough). You have to register on the forum (I hope you did not skip math class in junior high 😛 ),
and then navigate to Numerical Method Forum > General Category > General Discussion > Numerical Method Trial License (otherwise you might just end up at an error page). Copy the numericalmethod.lic file to the algoquant folder and replace the old one (otherwise, when you try to execute anything, you will get an outdated/invalid license error).

4 – Go to com.numericalmethod.algoquant.data.historicaldata.gaincapital in the NetBeans project navigation pane, open the test classes you want,
change the folder names in the test methods you want to run, mark them static,
and create a main method to call them, i.e.:
    public static void main(String[] args) {
        test_xxxx();
    }

5 – Right-click on the test file you need and click “Run File” (or just press Shift+F6).
Voilà!

6 – Note:
Step A is GainCapitalFxCsvFilesExtractorTest:
    public static void main(String[] args) {
        test_1000();
    }

Step B is GainCapitalFxDailyCsvZipFileConverterTest:
    public static void main(String[] args) {
        try {
            test_0030();
        } catch (Exception ex) {
            ex.printStackTrace(); // don't swallow the exception silently
        }
    }

The “Week1.2.csv” naming is VERY unhandy (after step A), so just go to step B and watch the magic after pointing to a freshly created, clean data directory (the output string in test_0030()): the data are grouped into FX pair folders, each one containing one file per day. Hurray!

Note that trailing zeros are stripped in the output format (i.e., 14.537000 turns into 14.53), and the timestamp is a bit odd (2007-01-02 00:02:24:0000 America/New_York), but you can change this easily.
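If you want to change that timestamp yourself, a tiny helper like this would do (the input layout is taken from the example above):

```python
def reformat_ts(s):
    """Turn '2007-01-02 00:02:24:0000 America/New_York' into a more
    conventional form: replace the odd ':' sub-second separator with '.'
    and keep the time-zone name."""
    stamp, zone = s.rsplit(" ", 1)        # split off the zone name
    head, frac = stamp.rsplit(":", 1)     # split off the sub-second part
    return "%s.%s %s" % (head, frac, zone)
```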

Conclusion:
This tool is very precious and a perfect starter that you can later customize to fit your needs, saving a lot of time so you can devote your precious fingers to coding something more value-adding than data preparation.
I might suggest using a binary format rather than the zip format, as it turns out to be really quicker in the backtesting process.

Peace.

ML – AMC

5. ML’s step-by-step guide is very helpful! Thank you for writing it up.
Let’s push them to provide the binary format to make the process faster.

Note: they recently got rid of “publicmodels”, so only one folder (“algoquant”) is needed.

6. ML,

Thank you very much for reporting a problem with AlgoQuant on processing Gain Capital data.

It appears that Gain Capital has recently changed its timestamp for the sub-second part from millisecond (3 digits) to nanosecond (9 digits). AlgoQuant version 0.0.5 and before will fail to process the most recent raw data file.

For instance, this new format will fail:
0000000000,D,EUR/USD,2011-08-28 17:00:10.727000000,1.1,1.2
0000000001,D,EUR/USD,2011-08-28 17:00:11.073000000,1.3,1.4

0000000002,D,EUR/USD,2011-08-28 17:01:17.963000000,1.5,1.6
0000000003,D,EUR/USD,2011-08-28 17:01:20.027000000,1.7,1.8

(Note the nanosecond representation in the data; it used to be milliseconds.)
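A parser tolerant of both sub-second widths might normalize the fraction to six digits first. A sketch in Python (illustrative only; AlgoQuant's actual fix is in Java and may differ):

```python
from datetime import datetime

def parse_gc_timestamp(s):
    """Parse timestamps with millisecond (3-digit) or nanosecond (9-digit)
    fractions. Python's datetime keeps only microseconds, so the fraction
    is padded or truncated to 6 digits before parsing."""
    if "." in s:
        head, frac = s.rsplit(".", 1)
        frac = frac.ljust(6, "0")[:6]  # ms -> pad, ns -> truncate
        return datetime.strptime(head + "." + frac, "%Y-%m-%d %H:%M:%S.%f")
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
```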

Based on your feedback, we have modified AlgoQuant to account for the new timestamp format.

We now also throw an exception when there is an unrecognized (new) timestamp format in case Gain Capital changes the format again.

Hope this helps.

Thanks,
Haksun

7. It looks like Gain Capital has recently added sub-seconds to the time stamps in the FX rate data.

8. Can anyone please explain the ltid column?
What time is 1379074668, and how do I parse it?

9. “ltid” is probably just their internal identifier; it is not a timestamp and is not used in the analysis. AlgoQuant simply parses it as a “long”.

10. Nice Haksun —

Thanks for the down and dirty algo method.

Special thanks to JL for the .NET clarification– 🙂

Thanks and regards,
Alex N

11. Hi Prof. Li, I was in your class conducted at NTU in Apr 2012. I am at the moment a big fan of Python, and it is very easy to read the Gain Capital data with it. Code snippet:

# import modules
import pandas as pd
from datetime import datetime

# define a parser for the more recent timestamp format
def gain_parser(dt_str):
    try:
        # drop the last 3 of the 9 sub-second digits so %f (microseconds) can parse it
        return datetime.strptime(dt_str[:-3], '%Y-%m-%d %H:%M:%S.%f')
    except ValueError:
        return datetime.strptime(dt_str, '%Y-%m-%d %H:%M:%S')

# parse the data

# preview data

———————————————————
RateDateTime
2013-03-03 17:00:08.673000 2632942494 D USD/JPY 93.634 93.654
2013-03-03 17:00:09.017000 2632942497 D USD/JPY 93.636 93.656
2013-03-03 17:00:09.360000 2632942498 D USD/JPY 93.634 93.654
2013-03-03 17:00:09.707000 2632942503 D USD/JPY 93.632 93.652
2013-03-03 17:00:10.393000 2632942516 D USD/JPY 93.628 93.648
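The elided “parse the data” step might look like the following. The column names (lTid, RateDateTime, RateBid, RateAsk) are my guesses from the file headers, and recent pandas can parse the 9-digit fraction directly, which makes the custom parser optional:

```python
import io
import pandas as pd

# a few sample rows in the raw layout (column names are my assumption)
sample = io.StringIO(
    "lTid,cDealable,CurrencyPair,RateDateTime,RateBid,RateAsk\n"
    "2632942494,D,USD/JPY,2013-03-03 17:00:08.673000000,93.634,93.654\n"
    "2632942497,D,USD/JPY,2013-03-03 17:00:09.017000000,93.636,93.656\n"
)

# parse the data: let pandas handle the timestamp (nanosecond precision is native)
df = pd.read_csv(sample)
df["RateDateTime"] = pd.to_datetime(df["RateDateTime"])
df = df.set_index("RateDateTime")
```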

12. Thank you, Qitian, for the information.

13. Haksun,

Can you tell me what timestamp GAINDATA uses? Is it NY time? And if it is, does it include daylight saving time? I am trying to convert GAINDATA timestamps to UTC.

14. They are NY time with daylight savings, I believe.
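If so, converting to UTC with DST handled is straightforward with Python's zoneinfo (Python 3.9+); for example:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")

def ny_to_utc(naive_ts):
    """Interpret a naive timestamp as New York wall-clock time and convert
    it to UTC. DST transitions are handled by the time-zone database."""
    return naive_ts.replace(tzinfo=NY).astimezone(timezone.utc)
```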

15. Haksun,

thanks for the useful code to parse these data files.
Do you know if the sell/buy prices they provide are the indicative or executable prices?

16. Unfortunately, I do not know the source of their data.