"1. TRIP_ID: (String) It contains an unique identifier for each trip;\n",
"1. CALL_TYPE: (char) It identifies the way used to demand this service. It may contain one of three possible values:\n",
" - ‘A’ if this trip was dispatched from the central;\n",
" - ‘B’ if this trip was demanded directly to a taxi driver on a specific stand;\n",
" - ‘C’ otherwise (i.e. a trip demanded on a random street).\n",
"1. ORIGIN_CALL: (integer) It contains an unique identifier for each phone number which was used to demand, at least, one service. It identifies the trip’s customer if CALL_TYPE=’A’. Otherwise, it assumes a NULL value;\n",
"1. ORIGIN_STAND: (integer): It contains an unique identifier for the taxi stand. It identifies the starting point of the trip if CALL_TYPE=’B’. Otherwise, it assumes a NULL value;\n",
"1. TAXI_ID: (integer): It contains an unique identifier for the taxi driver that performed each trip;\n",
"1. TIMESTAMP: (integer) Unix Timestamp (in seconds). It identifies the trip’s start; \n",
"1. DAYTYPE: (char) It identifies the daytype of the trip’s start. It assumes one of three possible values:\n",
" - ‘B’ if this trip started on a holiday or any other special day (i.e. extending holidays, floating holidays, etc.);\n",
" - ‘C’ if the trip started on a day before a type-B day;\n",
" - ‘A’ otherwise (i.e. a normal day, workday or weekend).\n",
"1. MISSING_DATA: (Boolean) It is FALSE when the GPS data stream is complete and TRUE whenever one (or more) locations are missing\n",
"1. POLYLINE: (String): It contains a list of GPS coordinates (i.e. WGS84 format) mapped as a string. The beginning and the end of the string are identified with brackets (i.e. \\[ and \\], respectively). Each pair of coordinates is also identified by the same brackets as \\[LONGITUDE, LATITUDE\\]. This list contains one pair of coordinates for each 15 seconds of trip. The last list item corresponds to the trip’s destination while the first one represents its start;\n"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"import json\n",
"from datetime import datetime\n",
"from typing import Iterator\n",
"\n",
"enum_mapping = {'A': 1, 'B': 2, 'C': 3}\n",
"\n",
"def load_csv_content() -> Iterator:\n",
" '''Returns a generator for all lines in the csv file with correct field types.'''\n",
" \n",
" with open('input/train.csv') as csv_file:\n",
" reader = csv.reader(csv_file) \n",
"\n",
" headers = [h.lower() for h in next(reader)]\n",
"\n",
" for line in reader:\n",
" # convert line fields to correct type\n",
" for i in range(len(headers)):\n",
" # trip_id AS string\n",
" if i == 0:\n",
" continue\n",
" # call_type, day_type \n",
" if i in [1, 6]:\n",
" line[i] = enum_mapping[line[i]]\n",
" # origin_call, origin_stand, taxi_id AS int\n",
" elif i in [2, 3, 4]:\n",
" line[i] = int(line[i]) if line[i] != \"\" else \"\"\n",
"The SMART pipeline is used to split up the data in multiple layers. Therefore, the csv file is uploaded to the Semantic Linking microservice for layer creation. <br />\n",
"Next, the Role Stage Discovery microservice will cluster the individual layers and splits them into multiple time windows based on the timestamp.\n",
"\n",
"## Define features per cluster\n",
"### \"Local\" features based on single clusters\n",
"- cluster size ($\\#\\ cluster\\ nodes$)\n",
"- cluster standard deviation (variance from cluster mean)\n",
" '''Balances an unbalanced dataset by ignoring elements from the majority label, so that majority-label data size = median of other cluster sizes.'''\n",
" y = Y.tolist()\n",
" counter = collections.Counter(y)\n",
" print(f\"Label Occurrences: Total = {counter}\")\n",
"- relative cluster sizes (list of all cluster sizes as $\\frac{cluster\\ size}{layer\\ size}$)\n",
"- entropy of the layer calculated over all clusters $C$, where $P(c_i)$ is the probability that a node belongs to cluster $c_i$ (e.g. using the relative sizes as $P(c_i)$).\n",
"- euclidean distance from the global cluster center to the cluster center in $t_i$\n",
" cluster_centers: Dict[str, Tuple[float]] = {str(cluster['cluster_label']): calculate_center(cluster['label']) for cluster in clusters if cluster['label'] != 'noise'}\n",
"\n",
" # load time windows \n",
" all_layers: List[Layer] = []\n",
" path_in = f'input/timeslices/{layer_name}'\n",
" for root, _, files in os.walk(path_in):\n",
" for f in files:\n",
" with open(os.path.join(root, f), 'r') as file:\n",
" Loads the metrics training data for an individual layer from disk.\n",
" A single metrics training data point should look like this:\n",
"\n",
" [((relative_cluster_size) ^ M, entropy, (distance_from_global_center) ^ M, (time1, time2)) ^ N, cluster_number, evolution_label]\n",
"\n",
" The first tuple represents metrics from the reference layer in t_i-(N-1).\n",
" The Nth tuple represents metrics from the reference layer in t_i.\n",
" The reference_layer has M clusters in total, this might differ from the number of clusters in layer_name.\n",
" The cluster number identifies the cluster for which the evolution_label holds. \n",
" The label is one of {continuing, shrinking, growing, dissolving, forming} \\ {splitting, merging} and identifies the change for a cluster in the layer layer_name for t_i.\n",
" \n",
" # TODO N is not implemented and fixed to 2\n",
" \"\"\"\n",
" \n",
" with open(f'input/metrics/{layer_name}.json') as file:\n",
" cluster_metrics: List[Cluster] = [Cluster.create_from_dict(e) for e in json.loads(file.read())]\n",
" cluster_ids = {c.cluster_id for c in cluster_metrics}\n",
" cluster_metrics: Dict[Any, Cluster] = {(c.time_window_id, c.cluster_id): c for c in cluster_metrics}\n",
" \n",
" with open(f'input/layer_metrics/{reference_layer}.json') as file:\n",
" layer_metrics: List[Layer] = [Layer.create_from_dict(e) for e in json.loads(file.read())]\n",
" layer_metrics: Dict[Any, Layer] = {l.time_window_id: l for l in layer_metrics}\n",
" '''Balances an unbalanced dataset by ignoring elements from the majority label, so that majority-label data size = median of other cluster sizes.'''\n",
" y = Y.tolist()\n",
" counter = collections.Counter(y)\n",
" print(f\"Label Occurrences: Total = {counter}\")\n",
"\n",
"# TODO what exactly should the classifier predict?\n",
"# all cluster changes, this would mean that cluster information has to be provided\n",
raiseConnectionError(f"Could not fetch nodes for {use_case}//{table}//{layer_name} from semantic-linking microservice, statuscode: {response.status_code}!")
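"\n",
"# Context sketch for the raise above (hypothetical: base_url and the endpoint path\n",
"# are placeholders, not the documented semantic-linking API):\n",
"import requests\n",
"\n",
"def fetch_nodes(base_url: str, use_case: str, table: str, layer_name: str) -> list:\n",
"    response = requests.get(f'{base_url}/{use_case}/{table}/{layer_name}/nodes')\n",
"    if response.status_code != 200:\n",
"        raise ConnectionError(f\"Could not fetch nodes for {use_case}//{table}//{layer_name} from semantic-linking microservice, status code: {response.status_code}!\")\n",
"    return response.json()\n"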