The first task of this topic is to write the tokenize algorithm to read and process data from the CSV file. First, create a temporary file tokenize.cpp for the implementation.

The tokenize function breaks a given string into tokens separated by a given separator character. For instance, with the separator , it breaks this row of the CSV file:

2020/03/17 17:01:24.884492,ETH/BTC,bid,0.02187308,7.44564869

into five tokens:

2020/03/17 17:01:24.884492
ETH/BTC
bid
0.02187308
7.44564869
I created starter code with more descriptive comments than the version on Coursera to help you get started.
#include <iostream>
#include <string>
#include <vector>
/**
 * Tokenizes a CSV line based on the given separator character.
 *
 * @param csvLine The CSV line to be tokenized.
 * @param sep The separator character used to delimit tokens.
 * @return A vector of strings containing the tokens extracted from the CSV line.
 */
std::vector<std::string> tokenize(std::string csvLine, char sep)
{
    // Create an empty vector of strings to store the tokens
    std::vector<std::string> tokens;

    // Declare three variables to keep track of the start and end positions of each token, and the token itself

    // Find the first non-separator character in the input string

    // Loop through the input string, finding each separator character and extracting the token between it and the previous separator

        // Find the next occurrence of the separator character

        // If no separator is found or if start == end, break out of the loop

        // Extract the substring between the current position and the separator

        // Add the token to the vector of tokens

        // Update the starting position for the next iteration

    // Return the vector of tokens
    return tokens;
}
Copy the above code and try to implement the function on your own before moving on. After that, you can check your work against the implementation below.
#include <iostream>
#include <string>
#include <vector>
/**
 * Tokenizes a CSV line based on the given separator character.
 *
 * @param csvLine The CSV line to be tokenized.
 * @param sep The separator character used to delimit tokens.
 * @return A vector of strings containing the tokens extracted from the CSV line.
 */
std::vector<std::string> tokenize(std::string csvLine, char sep)
{
    // Create an empty vector of strings to store the tokens
    std::vector<std::string> tokens;
    // Declare three variables to keep track of the start and end positions of each token, and the token itself.
    // Use std::string::size_type: the search functions below return this unsigned type
    // and signal "not found" with std::string::npos, not a negative number
    std::string::size_type start, end;
    std::string token;
    // Find the first non-separator character in the input string
    start = csvLine.find_first_not_of(sep, 0);
    // Loop through the input string, finding each separator character and extracting the token between it and the previous separator
    do
    {
        // Find the next occurrence of the separator character
        end = csvLine.find_first_of(sep, start);
        // If we have run off the end of the string or the token is empty, break out of the loop
        if (start == csvLine.length() || start == end)
            break;
        // Extract the substring between the current position and the separator
        if (end != std::string::npos)
        {
            token = csvLine.substr(start, end - start);
        }
        else
        {
            // No separator left: take everything up to the end of the line
            token = csvLine.substr(start, csvLine.length() - start);
        }
        // Add the token to the vector of tokens
        tokens.push_back(token);
        // Update the starting position for the next iteration
        start = end + 1;
    } while (end != std::string::npos);
    // Return the vector of tokens
    return tokens;
}
int main()
{
    return 0;
}
Let’s test if the code works as expected. Implement the test code in the main function.
int main()
{
    //
    // Test cases to test functionality in different scenarios
    //
    std::vector<std::string> tokens;
    // tokenize regular strings separated by commas
    std::string s1 = "thing,thing1,thing2";
    // can handle a string that ends with a comma
    std::string s2 = "thing,thing1,thing2,";
    // can handle a string with a single substring
    std::string s3 = "thing";
    // can handle the actual data
    std::string s4 = "2020/03/17 17:01:24.884492,ETH/BTC,bid,0.02187308,7.44564869";
    // Change s1 to s4 and test whether the tokenize function works as expected
    tokens = tokenize(s4, ',');
    for (const std::string &t : tokens)
    {
        std::cout << t << std::endl;
    }
    return 0;
}
Awesome! Now that we’ve got the tokenize function working, let’s move on to the next part: file I/O.

Now let’s load the CSV file into the program by modifying the main function. Since tokenize.cpp is created in the same folder as the data, we can use just the name of the CSV file as the file path, i.e. 20200317.csv. Otherwise, you will need to specify the full path. We construct a std::ifstream to open the file, and call its is_open() function to check whether it opened successfully.

Remove the test code and replace it with the file I/O procedure below:
#include <fstream>
// ...
int main()
{
    std::ifstream csvFile{"20200317.csv"};
    if (csvFile.is_open()){
        std::cout << "File is open" << std::endl;
    } else {
        std::cout << "File is not open" << std::endl;
    }
    return 0;
}