FIDAL - Financial Data Access Library

ASCII Data Source

1.0 Introduction

2.0 Pre-defined ASCII File Format

3.0 User-defined ASCII File Format

4.0 FD_AddDataSource Parameters Details

1.0 Introduction

ASCII files are probably the simplest way to store stock market data. These files can easily be generated, converted, or maintained with off-the-shelf software. Most commercial data providers include a conversion tool that can export to at least an ASCII format.

Letting FIDAL use your ASCII files is straightforward. You indicate the directory in which the files can be found, and you can use wildcards to include multiple files in one call.

You also need to specify the format in which your data is stored. Most users can simply use one of the pre-defined file formats. See 2.0 Pre-defined ASCII File Format.

For more advanced users, a field definition string can describe a custom file format. It also offers advanced capabilities such as extracting fields from the filename or even the directory path. See 3.0 User-defined ASCII File Format.

A very small ASCII database is provided with the software package for experimentation.

2.0 Pre-defined ASCII File Format

To use a pre-defined format, you must first know in which order the data is stored in your ASCII files. You can then add the files, one by one, or many at the same time with wildcards. Adding a file or a directory to the unified database is done with FD_AddDataSource.

Here is an example adding all files from the "my_data" directory into the "US.NASDAQ.STOCK" category:

FD_AddDataSourceParam param;

/* Start from a zeroed structure so that every unused field is NULL/0. */
memset( &param, 0, sizeof( FD_AddDataSourceParam ) );
param.id = FD_ASCII_FILE;
param.location = "c:\\my_data\\*.txt";  /* backslashes are doubled in C */
param.category = "US.NASDAQ.STOCK";
param.info = FD_DOHLCV;

FD_AddDataSource( unifiedDatabase, &param );

The location field (param.location) is very flexible. It can designate:

- A specific file. Example: "c:\FIDAL\database\myfile.dat"

- All files in a directory. Example: "c:\FIDAL\database\"

- Files selected with the MS-DOS style wildcards "?" and "*", in both filenames and directories. Example: "c:\FIDAL\database\*\*.txt" includes all ".txt" files in all subdirectories under the database directory. Being able to use wildcards in the path gives you a lot of flexibility in how your database is organized.

(Note: In a C string literal, each '\' must be written as '\\' so that it is not read as the start of an escape sequence such as '\n'.)
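
To make the note above concrete, here is a minimal sketch showing one of the wildcard paths written as a C string literal. The names example_location and count_backslashes are illustrative only and are not part of FIDAL.

```c
#include <stddef.h>

/* Illustrative only: the path c:\FIDAL\database\*\*.txt as a C literal.
 * Each backslash of the path is doubled in the source code. */
static const char *const example_location = "c:\\FIDAL\\database\\*\\*.txt";

/* Count the backslash characters actually stored in the string. */
static size_t count_backslashes(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; ++s) {
        if (*s == '\\') {
            ++n;
        }
    }
    return n;
}
```

The literal contains eight backslashes in the source, but the string held in memory has only four, one per path separator.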

By default, the name of the file becomes the "symbol" name in the database. You can refine or extract the symbol name in a different way by using the field capability (see 3.3 Extracting Symbol and Category String). If you have the choice, I strongly suggest keeping it simple: just use the file name as the symbol name.

The order in which the data is stored is indicated by "param.info" (FD_DOHLCV in the above example). See "fidal.h" for the list of pre-defined formats. Examples are: FD_DOHLCV, FD_DOCHLV, FD_DCV ...

As an example, FD_DOHLCV (probably the most common) requires each line to contain the data in the following order:
"Year,Month,Day,Open,High,Low,Close,Volume"

The comma represents a separator. The separator can be any character except a digit or a decimal point '.'. Here are some concrete examples. All the following formats are parsed correctly:

Example 1: This is the CSV format from Microsoft Excel.
95/3/1,73.5,73.875,73.375,73.75,11679
95/3/2,73.875,74.625,73.25,74.375,19947 ...

Example 2: This is a space-delimited format.
95-03-01 73.5 73.875 73.375 73.75 11679
95-03-02 73.875 74.625 73.25 74.375 19947 ...

Example 3: This is an unusual layout, shown only to illustrate the flexibility. In this example the last field (Open Interest) is ignored.

DATE    :  Open | High  |   Low | Close |Volume| O. Int.|
----------------+-------+-------+-------+------+---------
1995/3/1: 73.50 | 73.875| 73.375| 73.75 | 11679|  123   |
1995/3/2: 73.875| 74.625| 73.25 | 74.375| 19947|    0   |

As you can see, the only thing these files have in common is the order in which the data is provided.

Some basic rules make it possible to accept a variety of ASCII files:

- On each line, all leading characters other than digits are silently ignored.

- A line containing no digits is silently discarded if it is the first line of the file.

- Once all the fields are extracted, the remainder of the line is ignored.
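
The rules above can be illustrated with a small stand-alone parser. This is only a sketch of the idea, not FIDAL's actual implementation: the Bar type and the function names are hypothetical, and the year is returned exactly as read (no two-digit expansion).

```c
#include <ctype.h>
#include <stdlib.h>

/* One price bar read in Date-Open-High-Low-Close-Volume order.
 * Hypothetical type for this sketch; it is not FIDAL's own structure. */
typedef struct {
    long   year, month, day;
    double open, high, low, close;
    long   volume;
} Bar;

/* Read the next integer, skipping any non-digit characters before it. */
static int next_int(const char **cursor, long *value)
{
    const char *p = *cursor;
    char *end;

    while (*p != '\0' && !isdigit((unsigned char)*p)) {
        ++p;
    }
    if (*p == '\0') {
        return 0;   /* no digit left on the line */
    }
    *value = strtol(p, &end, 10);
    *cursor = end;
    return 1;
}

/* Read the next real number; it may start with a digit or with '.'. */
static int next_real(const char **cursor, double *value)
{
    const char *p = *cursor;
    char *end;

    while (*p != '\0' && !isdigit((unsigned char)*p) && *p != '.') {
        ++p;
    }
    *value = strtod(p, &end);
    if (end == p) {
        return 0;   /* nothing numeric left on the line */
    }
    *cursor = end;
    return 1;
}

/* Parse one line. Returns 1 on success and 0 if a field is missing,
 * in which case *bar may be partially filled. Whatever follows the
 * volume on the line is ignored. */
static int parse_dohlcv_line(const char *line, Bar *bar)
{
    const char *p = line;

    return next_int(&p, &bar->year)
        && next_int(&p, &bar->month)
        && next_int(&p, &bar->day)
        && next_real(&p, &bar->open)
        && next_real(&p, &bar->high)
        && next_real(&p, &bar->low)
        && next_real(&p, &bar->close)
        && next_int(&p, &bar->volume);
}
```

Because every character that cannot start a number is treated as a separator, the same function accepts the data lines of all three examples, and it returns 0 for a header line that contains no digits.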

From that example you can easily work out all the other pre-defined formats.

If none of the pre-defined formats is convenient (for example, because the date fields are not in the same order), see section 3.0 User-defined ASCII File Format.

3.0 User-defined ASCII File Format

3.1 Describing File Content

When the pre-defined file formats are not applicable, you need to build your own field string. That string describes how each line of the file is going to be interpreted. If you look at "fidal.h" you will see that the pre-defined formats are simply strings specifying the order of the fields (reminder: this field string is the "param.info" of the FD_AddDataSource function).

Example: FD_DOHLCV is the pre-defined string "[Y][M][D][O][H][L][C][V]". If you need a different date order, say Month/Day/Year, use the string "[M][D][Y][O][H][L][C][V]" as the "param.info" parameter when calling FD_AddDataSource.

A field is a small token surrounded by [square brackets]. You can build your string by placing the fields in any order. The following are all valid examples:

"[HOUR][MIN][SEC][O][H][L][C][V]" a variant for intra-day price bars.

"[D][M][Y][C]" for daily data with only a close price (like a mutual fund).

"[M][Y][C][V][OI]" for monthly commodity/futures data with an open interest field.

Here is the complete list of available fields that can be used to describe the content of the file:

Complete list of fields for 'param.info'

Field   Description
Y       Extract an integer of unspecified length to represent the year.
        Values [0 to 10] represent years from 2000 to 2010.
        Values [11 to 99] represent years from 1911 to 1999.
        Values [100 to 9999] represent years from 100 to 9999.
M       Values [01 to 12] are valid. Leading zero is optional (Example: '1' is the same as '01').
D       Values [01 to 31] are valid. Leading zero is optional. FIDAL verifies the consistency with the month specified and rejects the file if an inconsistency is found.
YYYY    Like "Y" but the year must always be 4 digits.
YY      Like "Y" but the year must always be 2 digits.
MM      Like "M" but the month must always be 2 digits.
MMM     3 letter month identifier (English). Not case sensitive.
        Possible values are: Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec.
DD      Like "D" but the day must always be 2 digits.
HOUR    Value from [00 to 23] representing "hours". Leading zero optional.
MIN     Value from [00 to 59] representing "minutes". Leading zero optional.
SEC     Value from [00 to 59] representing "seconds". Leading zero optional.
HH      Like "HOUR" but must always be two digits.
MN      Like "MIN" but must always be two digits. Do not confuse with MM.
SS      Like "SEC" but must always be two digits.
O       Real value representing the open price of the day.
        All the following formats are valid:
        12.1259123   012.125   0.1   .125   0.0   .0   12
        Negative values are not supported. Precision of up to 15 digits is supported.
H       Real value representing the highest price of the day.
L       Real value representing the lowest price of the day.
C       Real value representing the close price of the day.
V       Integer value representing the volume for the day.
OI      Integer value representing the open interest.
-R=n    Skip n real values. The values are unused.
-I=n    Skip n integer values. The values are unused.
-C=n    Skip n characters.
-H=n    Skip n lines at the beginning of the file.
-NDL    Skip all lines that do not start with a digit. The abbreviation stands for "Non-Digit Line".

- Except for -R, -I and -C, a field must not be specified more than once in the field string.

- While processing a file, any error in the input causes the whole file to be ignored. Better safe than sorry: rejecting a file is preferable to using wrong data.

- All "=n" are optional. By default, "=1" is assumed.
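
Two of the date rules in the table lend themselves to a short illustration: the expansion of a short year read by [Y], and the check that a day is consistent with its month. The sketch below is not FIDAL's code. The function names are hypothetical, and the Gregorian leap-year rule is an assumption, since the table does not say how February is handled.

```c
/* Expand a year read by the [Y] field into a full year, following the
 * ranges given in the table. Returns -1 outside the 0..9999 range. */
static int expand_year(int y)
{
    if (y < 0 || y > 9999) {
        return -1;
    }
    if (y <= 10) {
        return 2000 + y;    /* 0..10  -> 2000..2010 */
    }
    if (y <= 99) {
        return 1900 + y;    /* 11..99 -> 1911..1999 */
    }
    return y;               /* 100..9999 are taken literally */
}

/* Number of days in a month (assumption: Gregorian leap-year rule).
 * Returns 0 for a month outside 1..12. */
static int days_in_month(int year, int month)
{
    static const int days[12] = {
        31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31
    };
    int leap = (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);

    if (month < 1 || month > 12) {
        return 0;
    }
    return (month == 2 && leap) ? 29 : days[month - 1];
}

/* Returns 1 when the day exists in the given month, 0 otherwise. */
static int is_consistent_date(int year, int month, int day)
{
    return day >= 1 && day <= days_in_month(year, month);
}
```

For instance, a file line starting with 95/3/1 yields the raw year 95, which expand_year maps to 1995, while a date such as April 31 fails is_consistent_date and would be a reason to reject the file.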

Finally, here is an example of the most complex format I can think of:

"[-H=12][-C=10][-I][-R=2][YYYY][M][DD][-I=1][V][O][C][H][L][OI][HOUR][MN=5]"

In that example, the first 12 lines of the file are skipped. For each line, the first 10 characters are always ignored. The following integer and 2 real values are then ignored as well. Then the date fields are extracted, after which an integer is ignored, and finally all the remaining fields are read. [MN=5] forces the periodicity of the price bars to fall on 5-minute boundaries.
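
To show how such a string decomposes, here is a minimal tokenizer sketch that splits a field string into its bracketed fields and reads the optional "=n" (defaulting to 1). It is an illustration only: the Field type, FIELD_NAME_MAX and split_fields are hypothetical names, and the sketch does not check that a field name is one FIDAL actually knows.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define FIELD_NAME_MAX 8

/* One bracketed field, e.g. "[-R=2]" gives name "-R" and n 2. */
typedef struct {
    char name[FIELD_NAME_MAX];
    int  n;                     /* value after '=', or 1 when omitted */
} Field;

/* Split a field string into at most max_fields entries.
 * Returns the number of fields found, or -1 on a syntax error. */
static int split_fields(const char *s, Field *fields, int max_fields)
{
    int count = 0;

    while (*s != '\0') {
        const char *close, *eq, *name_end;
        size_t len;

        if (*s != '[') {
            return -1;                      /* text outside brackets */
        }
        close = strchr(s, ']');
        if (close == NULL) {
            return -1;                      /* unterminated field */
        }
        ++s;                                /* first character of the name */
        eq = memchr(s, '=', (size_t)(close - s));
        name_end = (eq != NULL) ? eq : close;
        len = (size_t)(name_end - s);
        if (len == 0 || len >= FIELD_NAME_MAX || count >= max_fields) {
            return -1;
        }
        memcpy(fields[count].name, s, len);
        fields[count].name[len] = '\0';
        fields[count].n = 1;                /* "=1" is assumed by default */
        if (eq != NULL) {
            char *end;
            long v;

            if (!isdigit((unsigned char)eq[1])) {
                return -1;
            }
            v = strtol(eq + 1, &end, 10);
            if (end != close || v <= 0 || v > 100000) {
                return -1;
            }
            fields[count].n = (int)v;
        }
        ++count;
        s = close + 1;
    }
    return count;
}
```

Applied to the complex string above, this yields 16 fields, starting with "-H" (n = 12) and ending with "MN" (n = 5); the bare "[-I]" is reported with n = 1.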

3.2 Defining Periodicity

The periodicity is the amount of time between consecutive price bars (daily, monthly, 10 minutes, etc.).

In most cases, FIDAL selects by default the most logical periodicity depending on the date/time fields specified. The rules are the following (in order):

1) If one of the time fields is specified, the data is assumed to be intra-day. By default, the periodicity is determined by the amount of time between the first two price bars in the file. It is also possible to force the periodicity (see below).
2) If a day field is specified ([DD] or [D]), this is daily data.
3) If a month field is specified ([MMM], [MM] or [M]), this is monthly data.
4) If only a year field is specified ([YYYY], [YY] or [Y]), this is yearly data.
5) If no date/time field exists, an error is reported.
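
The ordered rules above amount to a simple priority check. The following sketch expresses them in C under stated assumptions: the enum and function names are hypothetical, field names are passed without brackets or "=n", and for intra-day data the sketch only reports the class of periodicity, not the bar interval itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    PERIOD_ERROR,
    PERIOD_INTRADAY,
    PERIOD_DAILY,
    PERIOD_MONTHLY,
    PERIOD_YEARLY
} DefaultPeriod;

/* Returns 1 if name is one of the count strings in set. */
static int name_in(const char *name, const char *const *set, size_t count)
{
    size_t i;

    for (i = 0; i < count; ++i) {
        if (strcmp(name, set[i]) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Apply rules 1) to 5) to a list of bare field names such as "HOUR". */
static DefaultPeriod default_period(const char *const *fields, size_t count)
{
    static const char *const time_f[]  = { "HOUR", "MIN", "SEC", "HH", "MN", "SS" };
    static const char *const day_f[]   = { "D", "DD" };
    static const char *const month_f[] = { "M", "MM", "MMM" };
    static const char *const year_f[]  = { "Y", "YY", "YYYY" };
    int has_time = 0, has_day = 0, has_month = 0, has_year = 0;
    size_t i;

    for (i = 0; i < count; ++i) {
        has_time  |= name_in(fields[i], time_f, 6);
        has_day   |= name_in(fields[i], day_f, 2);
        has_month |= name_in(fields[i], month_f, 3);
        has_year  |= name_in(fields[i], year_f, 3);
    }
    if (has_time)  return PERIOD_INTRADAY;  /* rule 1 */
    if (has_day)   return PERIOD_DAILY;     /* rule 2 */
    if (has_month) return PERIOD_MONTHLY;   /* rule 3 */
    if (has_year)  return PERIOD_YEARLY;    /* rule 4 */
    return PERIOD_ERROR;                    /* rule 5 */
}
```

With this reading, the field list D, M, Y, C is classified as daily, and M, Y, C, V, OI as monthly, because the first matching rule wins.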

For intra-day data, you can force the periodicity by adding information to ONLY ONE of the time fields:

[HOUR=n] or [HH=n] where 'n' is the increment in hours.
[MIN=n] or [MN=n] where 'n' is the increment in minutes.
[SEC=n] or [SS=n] where 'n' is the increment in seconds.

Forcing intra-day periodicity - Examples

Field string(s)              Resulting periodicity
...[HOUR=1]...               1 hour interval data (no other time field)
...[HH=1]...
...[HH]...

...[HOUR][MIN=1]...          1 minute interval data (no other time field)
...[HOUR][MIN]...
...[HH][MN=1]...

...[HOUR][MIN][SEC]...       1 second interval data (no other time field)
...[HH][MN][SS]...
...[HH][MN][SS=1]...

...[HOUR][MIN=10]...         10 minutes interval data
...[HH][MN=10]...

...[HOUR][MIN][SEC=30]...    30 seconds interval data

...[HOUR=2]...               2 hours interval data

If more than one field specifies a time increment, an error is returned. This is sufficient to support all practical intra-day periodicities. Only increments that fall on natural time boundaries are valid:

Valid intra-day increments

Unit     Valid values
Hour     1, 2, 3, 4, 6, 8, 12
Minute   1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30
Second   1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30
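
A check against the table above can be sketched as a simple lookup. The TimeUnit enum and is_valid_increment are hypothetical names used for illustration; they are not FIDAL functions.

```c
#include <stddef.h>

/* Hypothetical unit selector for this sketch. */
typedef enum { UNIT_HOUR, UNIT_MINUTE, UNIT_SECOND } TimeUnit;

/* Returns 1 when n is listed as a valid increment for the unit. */
static int is_valid_increment(TimeUnit unit, int n)
{
    static const int hour_values[]  = { 1, 2, 3, 4, 6, 8, 12 };
    static const int sixty_values[] = { 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30 };
    const int *values;
    size_t count, i;

    if (unit == UNIT_HOUR) {
        values = hour_values;
        count = sizeof hour_values / sizeof hour_values[0];
    } else {
        /* Minutes and seconds share the same list. */
        values = sixty_values;
        count = sizeof sixty_values / sizeof sixty_values[0];
    }
    for (i = 0; i < count; ++i) {
        if (values[i] == n) {
            return 1;
        }
    }
    return 0;
}
```

The listed values are exactly the divisors of 24 (for hours) and of 60 (for minutes and seconds), excluding 24 and 60 themselves, which is what keeps every bar aligned on a natural boundary.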


3.3 Extracting Symbol and Category String

Some fields can also be extracted from the path (this is the 'param.location' parameter of FD_AddDataSource).

Fields for 'param.location'

Field   Description
CAT     Extract a string representing the "category".
SYM     Extract a string representing the "symbol".

These fields can extract a sub-string at any place in the path. They act like wildcards that are replaced by the extracted value for each ASCII file. Both fields can be extracted simultaneously.

Note 1: When the [CAT] field is NOT specified, the 'param.category' parameter of FD_AddDataSource is used by default. If that parameter is NULL, the default "ZZ.OTHER.OTHER" category is used.

Note 2: When the [SYM] field is NOT specified, the first portion of the filename (before the first '.') is used for each applicable file.
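
The default described in Note 2 can be sketched as follows. The function name default_symbol is hypothetical, and accepting '/' as a path separator in addition to '\' is an assumption of this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Copy the default symbol for a file path into out: the part of the
 * filename that precedes the first '.'.
 * Returns 1 on success, 0 if the name is empty or does not fit. */
static int default_symbol(const char *path, char *out, size_t out_size)
{
    const char *name = path;
    const char *p;
    size_t len;

    /* The filename starts after the last path separator. */
    for (p = path; *p != '\0'; ++p) {
        if (*p == '\\' || *p == '/') {
            name = p + 1;
        }
    }
    len = strcspn(name, ".");       /* length up to the first '.' */
    if (len == 0 || len >= out_size) {
        return 0;
    }
    memcpy(out, name, len);
    out[len] = '\0';
    return 1;
}
```

With this rule, both "c:\db\MSFT.csv" and "c:\db\MSFT.daily.csv" map to the symbol "MSFT".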

Examples of 'param.location'

"c:\db\[CAT]\[SYM].csv"
    This is probably the most common usage of these fields. Each directory immediately under "db" represents a category. Each ".csv" file in these directories represents one symbol, which is added to the unified database under the corresponding category.

    In other words, you can create a "NASDAQ" and an "AMEX" directory and simply put the ASCII files in these directories. All these files are automatically added to the unified database, using the first portion of their names as the symbol name and the exchange (directory) name as the category string.

"c:\db\[SYM].txt"
    Extracts all .txt files and uses the first part of the filename as the symbol. This is equivalent to "c:\db\*.txt".

"c:\db\[SYM]\price.dat"
    Uses the directory name as the symbol name. All directories immediately under "db" are searched for a "price.dat" file.

"c:\db\sym_[SYM]_.dat"
    Takes only the string between "sym_" and the ending "_" as the symbol name.

"c:\db\??[SYM].*"
"c:\db\*\C?[CAT]\[SYM]"
    Fields can be concatenated to '?' wildcards.

"c:\db\*[CAT]\file.dat"
    Although it is technically possible to concatenate a field with a '*', it is basically useless. The '*' is ignored and the [CAT] absorbs all the characters.

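As an illustration of how a field can be combined with literal text and '?' wildcards, here is a simplified matcher for a single path component. It is a sketch under several assumptions and not FIDAL's matching code: the names are hypothetical, the pattern holds exactly one bracketed field, '*' is not handled, the field must capture at least one character, and the comparison is case sensitive.

```c
#include <stddef.h>
#include <string.h>

/* Compare len characters; '?' in the pattern matches any character. */
static int literal_match(const char *pat, const char *text, size_t len)
{
    size_t i;

    for (i = 0; i < len; ++i) {
        if (pat[i] != '?' && pat[i] != text[i]) {
            return 0;
        }
    }
    return 1;
}

/* Match name against a pattern such as "sym_[SYM]_.dat" and copy the
 * text captured by the bracketed field into out.
 * Returns 1 on a match, 0 otherwise. */
static int extract_field(const char *pattern, const char *name,
                         char *out, size_t out_size)
{
    const char *open = strchr(pattern, '[');
    const char *close = (open != NULL) ? strchr(open, ']') : NULL;
    size_t prefix_len, suffix_len, name_len, field_len;

    if (close == NULL) {
        return 0;                   /* no field in the pattern */
    }
    prefix_len = (size_t)(open - pattern);
    suffix_len = strlen(close + 1);
    name_len = strlen(name);
    if (name_len <= prefix_len + suffix_len) {
        return 0;                   /* nothing left for the field */
    }
    field_len = name_len - prefix_len - suffix_len;
    if (field_len >= out_size) {
        return 0;
    }
    /* Text before and after the field must match character by character. */
    if (!literal_match(pattern, name, prefix_len) ||
        !literal_match(close + 1, name + prefix_len + field_len, suffix_len)) {
        return 0;
    }
    memcpy(out, name + prefix_len, field_len);
    out[field_len] = '\0';
    return 1;
}
```

For example, matching "sym_MSFT_.dat" against "sym_[SYM]_.dat" captures "MSFT", and matching the directory name "C1NASDAQ" against "C?[CAT]" captures "NASDAQ".
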
3.4 Extracting Category Sub-Component

In the previous section, we saw how to extract the category using [CAT]. It is also possible to divide the category into 3 components, each extracted from a different portion of the path.

For example, someone may organize their ASCII files in a hierarchical structure like the following:

US <DIR>
    NASDAQ <DIR>
           STOCK <DIR>
              ...put here all files...
           FUND  <DIR>      
              ...put here all files...
    NYSE   <DIR>
           STOCK <DIR>
              ...put here all files...
          

With this example, a param.location with the value "[CATC]\[CATX]\[CATT]\*" includes all the files in these subdirectories in one of the following unified database categories: "US.NASDAQ.STOCK", "US.NASDAQ.FUND" and "US.NYSE.STOCK".

Additional fields for 'param.location'

Field   Description
CATC    Extract a string representing the category country.
CATX    Extract a string representing the category exchange.
CATT    Extract a string representing the category type.

These 3 fields are concatenated with a '.' to form the category as suggested in the category guideline document: <CATC>.<CATX>.<CATT>.

You do not ALWAYS have to extract all 3 sub-components from the path. The defaults are: CATC="ZZ", CATX="OTHER", CATT="OTHER". The defaults can be overridden with the optional FD_AddDataSource parameters (param.country, param.exchange, param.type).

Important: The field [CAT] must not be used when one of the [CATC], [CATX] or [CATT] fields is used.
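
The way the three sub-components and their defaults combine can be sketched as below. The function build_category is a hypothetical helper, not a FIDAL function; a NULL argument stands for a component that was neither extracted from the path nor supplied as a parameter.

```c
#include <stddef.h>
#include <stdio.h>

/* Build "<country>.<exchange>.<type>" into out, substituting the
 * documented defaults for NULL components.
 * Returns 1 on success, 0 if the result does not fit in out. */
static int build_category(const char *country, const char *exchange,
                          const char *type, char *out, size_t out_size)
{
    int written;

    if (country == NULL) {
        country = "ZZ";
    }
    if (exchange == NULL) {
        exchange = "OTHER";
    }
    if (type == NULL) {
        type = "OTHER";
    }
    written = snprintf(out, out_size, "%s.%s.%s", country, exchange, type);
    return written >= 0 && (size_t)written < out_size;
}
```

With all three components missing, the result is "ZZ.OTHER.OTHER", the same default category mentioned in section 3.3.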

Example:
In this example, someone has two directories with stocks from NASDAQ and NYSE and wishes to add all this local data to a unified database while respecting the category guideline. Assuming all the files are in the directories "C:\NASDAQ" and "C:\NYSE", the following will do:

   FD_AddDataSourceParam param;

   memset( &param, 0, sizeof( FD_AddDataSourceParam ) );
   param.id       = FD_ASCII_FILE;
   param.location = "C:\\[CATX]\\*.TXT";  /* backslashes are doubled in C */
   param.info     = FD_DOHLCV;
   param.country  = "US";
   param.type     = "STOCK";

   FD_AddDataSource( unifiedDatabase, &param );

This adds all the .TXT files to the categories "US.NASDAQ.STOCK" and "US.NYSE.STOCK".

4.0 FD_AddDataSource Parameters Details

Here is a quick overview of how each FD_AddDataSource parameter is used for an ASCII data source:

'param.id'
Must be FD_ASCII_FILE.

'param.location'
The path of the files to be included (as explained in the previous sections).

'param.info'
The pre-defined format (like FD_DOHLCV) or a custom field string, as explained in the previous sections.

'param.category'
The default category is "ZZ.OTHER.OTHER", unless it is overridden by this parameter. This parameter is ignored if one of the category fields is used: [CAT], [CATC], [CATX] or [CATT].

'param.country', 'param.exchange', 'param.type'
Redefine the default value for the [CATC], [CATX] or [CATT] field respectively.

'param.username', 'param.password'
Unused. Must be NULL.

'param.symbol'
Unused. Must be NULL.

'param.period'
Unused. Must be NULL.

'param.flags'
For the time being, ASCII data sources are read-only and do not offer more advanced features. This parameter must be FD_NO_FLAGS. However, expect the ASCII file format to be among the first to gain write/update capability in the short term.


Copyright© 2006 TicTacTec LLC. All Rights Reserved. Last Update: 07/21/06