3.1. The tokenise algorithm

The first task of this topic is to write the tokenise algorithm to read and process data from the CSV file. First, create a working file tokenize.cpp for the implementation.

The tokenise function breaks a given string into tokens separated by a given separator character.

For instance, with , as the separator, it breaks a row of the CSV file, 2020/03/17 17:01:24.884492,ETH/BTC,bid,0.02187308,7.44564869, into five tokens: 2020/03/17 17:01:24.884492, ETH/BTC, bid, 0.02187308 and 7.44564869.

I have created starter code with more descriptive comments than the version on Coursera to help you get started.

#include <iostream>
#include <string>
#include <vector>

/** 
 * Tokenizes a CSV line based on the given separator character.
 *
 * @param csvLine The CSV line to be tokenized.
 * @param sep The separator character used to delimit tokens.
 * @return A vector of strings containing the tokens extracted from the CSV line.
 */
std::vector<std::string> tokenize(std::string csvLine, char sep)
{
    // Create an empty vector of strings to store the tokens
    std::vector<std::string> tokens;

    // Declare three variables to keep track of the start and end positions of each token, and the token itself

    // Find the first non-separator character in the input string

    // Loop through the input string, finding each separator character and extracting the token between it and the previous separator

    // Find the next occurrence of the separator character

    // If no separator is found or if start == end, break out of the loop

    // Extract the substring between the current position and the separator

    // Add the token to the vector of tokens

    // Update the starting position for the next iteration

    // Return the vector of tokens
    return tokens; 
}

Copy the above code and try to implement the function on your own before moving on.

After that, you can check with the implementation below.

#include <iostream>
#include <string>
#include <vector>

/**
 * Tokenizes a CSV line based on the given separator character.
 *
 * @param csvLine The CSV line to be tokenized.
 * @param sep The separator character used to delimit tokens.
 * @return A vector of strings containing the tokens extracted from the CSV line.
 */
std::vector<std::string> tokenize(std::string csvLine, char sep)
{
    // Create an empty vector of strings to store the tokens
    std::vector<std::string> tokens;

    // Declare three variables to keep track of the start and end positions of each token, and the token itself
    std::string::size_type start, end;
    std::string token;

    // Find the first non-separator character in the input string
    start = csvLine.find_first_not_of(sep, 0);

    // Loop through the input string, finding each separator character and extracting the token between it and the previous separator
    do
    {
        // Find the next occurrence of the separator character
        end = csvLine.find_first_of(sep, start);

        // If we have run past the end of the string, or the token is empty, break out of the loop
        if (start == csvLine.length() || start == end)
            break;

        // Extract the substring between the current position and the separator
        if (end != std::string::npos)
        {
            token = csvLine.substr(start, end - start);
        }
        else
        {
            // No separator left: the token runs to the end of the string
            token = csvLine.substr(start, csvLine.length() - start);
        }

        // Add the token to the vector of tokens
        tokens.push_back(token);

        // Update the starting position for the next iteration
        start = end + 1;
    } while (end != std::string::npos);

    // Return the vector of tokens
    return tokens;
}

int main()
{
    return 0;
}

Let’s test if the code works as expected. Implement the test code in the main function.

int main()
{
    //
    // Test cases to test functionality in different scenarios
    //
    std::vector<std::string> tokens;
    // tokenize regular strings separated by commas
    std::string s1 = "thing,thing1,thing2";
    // can handle a string that ends with a comma
    std::string s2 = "thing,thing1,thing2,";
    // can handle a string with a single substring
    std::string s3 = "thing";
    // can handle the actual data
    std::string s4 = "2020/03/17 17:01:24.884492,ETH/BTC,bid,0.02187308,7.44564869";

    // Change the argument among s1 to s4 to test if the tokenize function works in each scenario
    tokens = tokenize(s4, ',');

    for (const std::string &t : tokens) {
        std::cout << t << std::endl;
    }
    return 0;
}

Awesome! Now that we’ve got the tokenize function working, let’s move on to the next part: file I/O.

3.2. File I/O

Open a file

Now let’s load the CSV file into the program by modifying the main function. Since tokenize.cpp lives in the same folder as the data, we can pass just the name of the CSV file, i.e. 20200317.csv, as the file path; otherwise, you will need to specify the full path. The std::ifstream constructor opens the file, and the is_open() function checks whether it was opened successfully.

Remove the test code and replace it with the file I/O procedure below:

#include <fstream>
// ...

int main()
{
    std::ifstream csvFile{"20200317.csv"};
    if (csvFile.is_open()){
        std::cout << "File is open" << std::endl;
    } else {
        std::cout << "File is not open" << std::endl;
    }
    return 0;
}