<h1 id="crowded-scene-training-inference-and-useful-tricks">Crowded Scene Training/Inference and Useful Tricks</h1>
<p>Todor Davchev, 2019-12-22, <a href="https://tdavchev.github.io/posts/2019/12/blog-post-4">original post</a>.</p>
<p>Now that we have defined the entire model, we can start training the neural network. Each training step takes one batch and predicts the next position of every agent in it. The result is then compared to the target values through the loss function we defined earlier.</p>
<h2 id="training">Training</h2>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">NUM_EPOCHS</span><span class="p">):</span>
<span class="c1"># Assign the learning rate (decayed acc. to the epoch number)
</span> <span class="n">lstm</span><span class="p">.</span><span class="n">sess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">lstm</span><span class="p">.</span><span class="n">lr</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">learning_rate</span> <span class="o">*</span> <span class="p">(</span><span class="n">DECAY_RATE</span> <span class="o">**</span> <span class="n">e</span><span class="p">)))</span>
<span class="c1"># Reset the pointers in the data loader object
</span> <span class="n">pointer</span> <span class="o">=</span> <span class="n">data_tools</span><span class="p">.</span><span class="n">reset_batch_pointer</span><span class="p">()</span>
<span class="c1"># Get the initial cell state of the LSTM
</span> <span class="n">state</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">.</span><span class="n">sess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">lstm</span><span class="p">.</span><span class="n">initial_state</span><span class="p">)</span>
<span class="c1"># For each batch in this epoch
</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_batches</span><span class="p">):</span>
<span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="c1"># Get the source and target data of the current batch
</span> <span class="c1"># x has the source data, y has the target data
</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pointer</span> <span class="o">=</span> <span class="n">data_tools</span><span class="p">.</span><span class="n">next_batch</span><span class="p">(</span>
<span class="n">loaded_data</span><span class="p">,</span> <span class="n">pointer</span><span class="p">,</span> <span class="n">BATCH_SIZE</span><span class="p">,</span> <span class="n">SEQUENCE_LENGTH</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
<span class="c1"># Feed the source, target data and the initial LSTM state to the model
</span> <span class="n">feed</span> <span class="o">=</span> <span class="p">{</span>
<span class="n">lstm</span><span class="p">.</span><span class="n">input_data</span><span class="p">:</span> <span class="n">x</span><span class="p">[:,</span> <span class="p">:,</span> <span class="mi">1</span><span class="p">:],</span>
<span class="n">lstm</span><span class="p">.</span><span class="n">target_data</span><span class="p">:</span> <span class="n">y</span><span class="p">[:,</span> <span class="p">:,</span> <span class="mi">1</span><span class="p">:],</span>
<span class="n">lstm</span><span class="p">.</span><span class="n">initial_state</span><span class="p">:</span> <span class="n">state</span>
<span class="p">}</span>
<span class="c1"># Fetch the loss of the model on this batch,
</span> <span class="c1"># the final LSTM state from the session
</span> <span class="n">train_loss</span><span class="p">,</span> <span class="n">state</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">.</span><span class="n">sess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span>
<span class="p">[</span><span class="n">lstm</span><span class="p">.</span><span class="n">cost</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">final_state</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">train_op</span><span class="p">],</span> <span class="n">feed</span><span class="p">)</span>
<span class="c1"># Toc
</span> <span class="n">end</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="n">cur_time</span> <span class="o">=</span> <span class="n">end</span> <span class="o">-</span> <span class="n">start</span>
<span class="n">step</span> <span class="o">=</span> <span class="n">e</span> <span class="o">*</span> <span class="n">num_batches</span> <span class="o">+</span> <span class="n">b</span>
<span class="n">avg_time</span> <span class="o">+=</span> <span class="n">cur_time</span>
<span class="n">avg_loss</span> <span class="o">+=</span> <span class="n">train_loss</span>
<span class="c1"># Print epoch, batch, loss and time taken
</span> <span class="k">if</span> <span class="p">(</span><span class="n">step</span><span class="o">%</span><span class="mi">99</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span>
<span class="s">"{}/{} (epoch {}), train_loss = {:.3f}, time/batch = {:.3f}"</span>
<span class="p">.</span><span class="nb">format</span><span class="p">(</span>
<span class="n">step</span><span class="p">,</span>
<span class="n">NUM_EPOCHS</span> <span class="o">*</span> <span class="n">num_batches</span><span class="p">,</span>
<span class="n">e</span><span class="p">,</span>
<span class="n">avg_loss</span><span class="o">/</span><span class="mf">99.0</span><span class="p">,</span> <span class="n">avg_time</span><span class="o">/</span><span class="mf">99.0</span><span class="p">))</span>
<span class="n">avg_time</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">avg_loss</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">SAVE_PATH</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Save parameters after the network is trained
</span><span class="n">lstm</span><span class="p">.</span><span class="n">save_json</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">SAVE_PATH</span><span class="p">,</span> <span class="s">"params.json"</span><span class="p">))</span></code></pre></figure>
<h2 id="inference">Inference</h2>
<p>After training is complete, we can use the model to infer the positions of agents in previously unseen data. When predicting trajectories of this type, we assume the observed agents have already walked for some time and that we have the associated observed annotations. We therefore “preload” the LSTM with the manner of walking of a particular agent, conditioning the predictions on that agent’s manner of walking, direction and behaviour. In addition, we set the batch size to 1 and load the unseen data set.</p>
<p>We sample from the predicted 2D Gaussian distribution in the following way:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">sample_2d_normal</span><span class="p">(</span><span class="n">o_mux</span><span class="p">,</span> <span class="n">o_muy</span><span class="p">,</span> <span class="n">o_sx</span><span class="p">,</span> <span class="n">o_sy</span><span class="p">,</span> <span class="n">o_corr</span><span class="p">):</span>
<span class="s">'''
Function that samples from a multivariate Gaussian
That has the statistics computed by the network.
'''</span>
<span class="n">mean</span> <span class="o">=</span> <span class="p">[</span><span class="n">o_mux</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">o_muy</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]]</span>
<span class="c1"># Extract covariance matrix
</span> <span class="n">cov</span> <span class="o">=</span> <span class="p">[[</span><span class="n">o_sx</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">o_sx</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">o_corr</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">o_sx</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">o_sy</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]],</span> <span class="p">[</span><span class="n">o_corr</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">o_sx</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">o_sy</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">o_sy</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">o_sy</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]]]</span>
<span class="c1"># Sample a point from the multivariate normal distribution
</span> <span class="n">sampled_x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">multivariate_normal</span><span class="p">(</span><span class="n">mean</span><span class="p">,</span> <span class="n">cov</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">sampled_x</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">sampled_x</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span></code></pre></figure>
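<p>The statistics the network returns arrive as <code>(1, 1)</code> arrays, which is why the function indexes them with <code>[0][0]</code>. A quick standalone check of the sampling step (the function is re-stated here in compact form so the snippet runs on its own; the input values are made up for illustration):</p>

```python
import numpy as np

def sample_2d_normal(o_mux, o_muy, o_sx, o_sy, o_corr):
    # Re-stated from the snippet above so this example runs standalone.
    mean = [o_mux[0][0], o_muy[0][0]]
    cov = [[o_sx[0][0] ** 2, o_corr[0][0] * o_sx[0][0] * o_sy[0][0]],
           [o_corr[0][0] * o_sx[0][0] * o_sy[0][0], o_sy[0][0] ** 2]]
    sampled = np.random.multivariate_normal(mean, cov, 1)
    return sampled[0][0], sampled[0][1]

# Toy network outputs, shaped (1, 1) like the tensors fetched from the session.
o_mux, o_muy = np.array([[0.5]]), np.array([[0.3]])
o_sx, o_sy, o_corr = np.array([[0.05]]), np.array([[0.08]]), np.array([[0.2]])

np.random.seed(0)
next_x, next_y = sample_2d_normal(o_mux, o_muy, o_sx, o_sy, o_corr)
print(next_x, next_y)  # a point near (0.5, 0.3)
```

<p>Note that the covariance matrix is built from the predicted standard deviations and correlation, so it is symmetric and positive semi-definite by construction as long as <code>|o_corr| &lt;= 1</code>.</p>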
<p>For a given trajectory, the idea is to simply update the cell state at each step using the final cell state value from our previous prediction.</p>
<p>We then measure the performance of the network through the average displacement error.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_batches</span><span class="p">):</span>
<span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="c1"># Get the source and target data of the current batch
</span> <span class="c1"># x has the source data, y has the target data
</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pointer</span> <span class="o">=</span> \
<span class="n">data_tools</span><span class="p">.</span><span class="n">next_batch</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">pointer</span><span class="p">,</span> <span class="n">BATCH_SIZE</span><span class="p">,</span> <span class="n">SEQUENCE_LENGTH</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
<span class="n">obs_traj</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">,:</span><span class="n">OBSERVED_LENGTH</span><span class="p">,</span> <span class="mi">1</span><span class="p">:]</span>
<span class="k">for</span> <span class="n">position</span> <span class="ow">in</span> <span class="n">obs_traj</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span>
<span class="c1"># Create the input data tensor
</span> <span class="n">input_data_tensor</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">input_data_tensor</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">position</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="c1"># x
</span> <span class="n">input_data_tensor</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">position</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># y
</span>
<span class="c1"># Create the feed dict
</span> <span class="n">feed</span> <span class="o">=</span> <span class="p">{</span><span class="n">lstm</span><span class="p">.</span><span class="n">input_data</span><span class="p">:</span> <span class="n">input_data_tensor</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">initial_state</span><span class="p">:</span> <span class="n">state</span><span class="p">}</span>
<span class="c1"># Get the final state after processing the current position
</span> <span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">.</span><span class="n">sess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="n">lstm</span><span class="p">.</span><span class="n">final_state</span><span class="p">],</span> <span class="n">feed</span><span class="p">)</span>
<span class="n">returned_traj</span> <span class="o">=</span> <span class="n">obs_traj</span>
<span class="n">last_position</span> <span class="o">=</span> <span class="n">obs_traj</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">prev_data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">prev_data</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">last_position</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="c1"># x
</span> <span class="n">prev_data</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">last_position</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># y
</span>
<span class="n">prev_target_data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">obs_traj</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">:],</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
<span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">PREDICTED_LENGTH</span><span class="p">):</span>
<span class="n">feed</span> <span class="o">=</span> <span class="p">{</span>
<span class="n">lstm</span><span class="p">.</span><span class="n">input_data</span><span class="p">:</span> <span class="n">prev_data</span><span class="p">,</span>
<span class="n">lstm</span><span class="p">.</span><span class="n">initial_state</span><span class="p">:</span> <span class="n">state</span><span class="p">,</span>
<span class="n">lstm</span><span class="p">.</span><span class="n">target_data</span><span class="p">:</span> <span class="n">prev_target_data</span><span class="p">}</span>
<span class="p">[</span><span class="n">o_mux</span><span class="p">,</span> <span class="n">o_muy</span><span class="p">,</span> <span class="n">o_sx</span><span class="p">,</span> <span class="n">o_sy</span><span class="p">,</span> <span class="n">o_corr</span><span class="p">,</span> <span class="n">state</span><span class="p">]</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">.</span><span class="n">sess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span>
<span class="p">[</span><span class="n">lstm</span><span class="p">.</span><span class="n">mux</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">muy</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">sx</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">sy</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">corr</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">final_state</span><span class="p">],</span>
<span class="n">feed</span><span class="p">)</span>
<span class="n">next_x</span><span class="p">,</span> <span class="n">next_y</span> <span class="o">=</span> \
<span class="n">distributions</span><span class="p">.</span><span class="n">sample_2d_normal</span><span class="p">(</span><span class="n">o_mux</span><span class="p">,</span> <span class="n">o_muy</span><span class="p">,</span> <span class="n">o_sx</span><span class="p">,</span> <span class="n">o_sy</span><span class="p">,</span> <span class="n">o_corr</span><span class="p">)</span>
<span class="n">returned_traj</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vstack</span><span class="p">((</span><span class="n">returned_traj</span><span class="p">,</span> <span class="p">[</span><span class="n">next_x</span><span class="p">,</span> <span class="n">next_y</span><span class="p">]))</span>
<span class="n">prev_data</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">next_x</span>
<span class="n">prev_data</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">next_y</span>
<span class="n">complete_traj</span> <span class="o">=</span> <span class="n">returned_traj</span>
<span class="n">total_error</span> <span class="o">+=</span> \
<span class="n">distributions</span><span class="p">.</span><span class="n">get_mean_error</span><span class="p">(</span><span class="n">complete_traj</span><span class="p">,</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="p">:,</span> <span class="mi">1</span><span class="p">:],</span> <span class="n">OBSERVED_LENGTH</span><span class="p">)</span>
<span class="k">if</span> <span class="p">(</span><span class="n">b</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="mi">50</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Processed trajectory number : "</span><span class="p">,</span>
<span class="n">b</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="s">"out of "</span><span class="p">,</span> <span class="n">num_batches</span><span class="p">,</span> <span class="s">" trajectories"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Total mean error of the model is "</span><span class="p">,</span> <span class="n">total_error</span><span class="o">/</span><span class="n">num_batches</span><span class="p">)</span></code></pre></figure>
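<p>The helper <code>distributions.get_mean_error</code> is used above but not shown. A minimal sketch of the average displacement error, assuming trajectories are <code>(T, 2)</code> arrays and that only the predicted steps (after the first <code>OBSERVED_LENGTH</code> positions) contribute to the error:</p>

```python
import numpy as np

def get_mean_error(predicted_traj, true_traj, observed_length):
    # Average displacement error: mean Euclidean distance between predicted
    # and ground-truth positions over the predicted steps only (the first
    # `observed_length` positions were given to the model, not predicted).
    pred = np.asarray(predicted_traj)[observed_length:]
    true = np.asarray(true_traj)[observed_length:]
    return np.mean(np.linalg.norm(pred - true, axis=1))

pred = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 1.0]])
true = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
print(get_mean_error(pred, true, observed_length=2))  # (0 + 1) / 2 = 0.5
```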
<h1 id="experimental-results">Experimental Results</h1>
<p>We estimate the quality of the proposed solution in two ways: first, using the mean squared error between the predicted points and the target distribution; second, by visualising the behaviour using OpenCV and Matplotlib.</p>
<p>Predicting 4 steps ahead ($T_{pred}=4$) after observing 4 steps ($T_{obs}=4$), we obtain 563 trajectories in total with an average mean squared error of 0.088. We then visualise the result in the following plot.<img src="https://dev.bg/wp-content/uploads/2018/11/inference.gif" alt="alt text" /></p>
<p>As you can see, the result (in purple) matches the real trajectory (in green) for the first 4 steps, since those were observed. The subsequent predictions (except the first, direct one), however, do not do as well. The reason is that the distribution each step is sampled from is conditioned on the previously predicted position, so errors compound from step to step and lead to this poor behaviour. A few ways to mitigate this are iteratively computing each trajectory while ignoring the empty steps, providing more training data, or providing more meaningful inputs. Iteratively computing each trajectory together with the rest of the trajectories from the same frame sequence is a relatively simple workaround that yields much better solutions: motion is conditioned on the surrounding trajectories, so optimising each trajectory along with its neighbours makes a lot of sense. <a href="https://github.com/yadrimz/Stochastic-Futures-Prediction/blob/master/notebooks/Tutorial%202.ipynb">Tutorial 2</a> from this post’s <a href="https://github.com/yadrimz/Stochastic-Futures-Prediction">GitHub repository</a> shows one way of substantially reducing the error by following this simple trick.</p>
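<p>The compounding effect can be illustrated with a toy simulation (not taken from the post): a predictor that adds zero-mean Gaussian noise at every step drifts like a random walk when its own outputs are fed back in, while a predictor that always restarts from ground truth keeps a constant error level:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1          # per-step prediction noise
steps, trials = 8, 5000

noise = rng.normal(0.0, sigma, size=(trials, steps))

# Autoregressive rollout: each prediction is fed back as the next input,
# so the per-step noise accumulates along the trajectory.
autoregressive_error = np.abs(np.cumsum(noise, axis=1))

# One-step prediction from ground truth: each step starts fresh,
# so the error stays at a single step's noise level.
one_step_error = np.abs(noise)

print(autoregressive_error.mean(axis=0))  # grows roughly like sigma * sqrt(t)
print(one_step_error.mean(axis=0))        # stays roughly flat
```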
<h1 id="conclusion">Conclusion</h1>
<p>The problem of trajectory modelling is relatively old. Principled approaches that predate neural solutions use Kalman filters, social utility functions [10] and factorised, interactive Gaussian processes [11]. Despite the accuracy of some of these methods, they all require considerably more prior knowledge before the trajectories can be modelled. LSTMs, by contrast, extract the behaviour entirely from data, which can be extremely useful when such inductive information is not available; the trade-off is that they require larger amounts of data.</p>
<p>One potential way to improve these results is to utilise rough inductive biases, ideally extracted in an unsupervised manner. This way, we can build a more informed representation that is conditioned on information important to each dataset. One example is utilising both spatial and global dynamics to improve the representations discussed previously in this post. Spatial representations extract static information inherent to the given dataset, such as existing trees or surrounding snow, while global dynamics models the unspoken rules of motion, such as the general direction of travel, the way individual agents overtake one another, or places where agents tend to remain stationary, such as bus stops while waiting for the next bus. [6] models the global social dynamics between agents, while [12] utilises both spatial and global dynamics through features extracted in an unsupervised way. <a href="https://sites.google.com/view/rdb-agents/home">The webpage</a> associated with [12] summarises one way of incorporating inductive biases in a modular fashion that preserves the small data requirements.</p>
<p>As a quick step towards improving the results, we can optimise batches of random sequences of frames instead of batches of random agents. Intuitively, we want to optimise all agents who are moving alongside one another together, which amounts to a more accurate approximation of the considered data. Further, we optimise each agent individually, discard all empty slots and apply a few other small tricks. More information can be found in the associated notebook, <a href="https://github.com/yadrimz/Stochastic-Futures-Prediction/blob/master/notebooks/Tutorial%202.ipynb">Tutorial 2</a>. A sample from the resulting solution can be seen below.</p>
<p><img src="https://drive.google.com/uc?export=view&id=168tMhQOgaecxUeM7WT_0AXw_cmtANIp_" alt="Tutorial 2 result" /></p>
<h1 id="references">References</h1>
<p>[1] Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural computation, 9(8), pp.1735–1780.</p>
<p>[2] Graves, A., Mohamed, A.R. and Hinton, G., 2013, May. Speech recognition with deep recurrent neural networks. In Acoustics, speech and signal processing (ICASSP), 2013 IEEE international conference on (pp. 6645–6649). IEEE.</p>
<p>[3] Bowman, S.R., Gauthier, J., Rastogi, A., Gupta, R., Manning, C.D. and Potts, C., 2016. A fast unified model for parsing and sentence understanding. arXiv preprint arXiv:1603.06021.</p>
<p>[4] Sutskever, I., Vinyals, O. and Le, Q.V., 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104–3112).</p>
<p>[5] Yücel, Z., Zanlungo, F., Ikeda, T., Miyashita, T. and Hagita, N., 2013. Deciphering the crowd: Modeling and identification of pedestrian group motion. Sensors, 13(1), pp.875–897.</p>
<p>[6] Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L. and Savarese, S., 2016. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 961–971).</p>
<p>[7] Lerner, A., Chrysanthou, Y. and Lischinski, D., 2007, September. Crowds by example. In Computer Graphics Forum (Vol. 26, No. 3, pp. 655–664). Oxford, UK: Blackwell Publishing Ltd.</p>
<p>[8] Graves, A., 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.</p>
<p>[9] Kingma, D.P. and Ba, J.L., 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (pp. 1–13).</p>
<p>[10] Yamaguchi, K., Berg, A.C., Ortiz, L.E. and Berg, T.L., 2011, June. Who are you with and where are you going?. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on(pp. 1345–1352). IEEE.</p>
<p>[11] Trautman, P. and Krause, A., 2010, October. Unfreezing the robot: Navigation in dense, interacting crowds. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on (pp. 797–803).</p>
<p>[12] Davchev, T., Burke, M. and Ramamoorthy, S., 2019. Learning Modular Representations for Long-Term Multi-Agent Motion Predictions. arXiv preprint arXiv:1911.13044.</p>
<h1 id="most-basic-stochastic-lstm-for-trajectory-prediction">Most Basic Stochastic LSTM for Trajectory Prediction</h1>
<p>Todor Davchev, 2019-12-15, <a href="https://tdavchev.github.io/posts/2019/12/blog-post-3">original post</a>.</p>
<p>In this blog’s experiments we will utilise the (x, y) coordinate representations mentioned in previous posts as input to the network. Since each of these coordinate representations is associated with a specific agent, and the agents interact with one another, it is important to separate the associated sequences and acknowledge that each prediction depends on the previous sequences observed for a given agent.</p>
<h1 id="methodology---stochastic-lstms">Methodology - Stochastic LSTMs</h1>
<h2 id="implementation-details">Implementation Details</h2>
<p>As we already mentioned, we assume a good understanding of LSTMs. As input to the network we use a sequence of positions of a given agent, and each step is converted to a 128-dimensional feature vector. This conversion happens through a linear operation followed by a nonlinear <a href="https://cs231n.github.io/neural-networks-1/">ReLU (Rectified Linear Unit)</a> activation.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">embed_inputs</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inputs</span><span class="p">,</span> <span class="n">embedding_w</span><span class="p">,</span> <span class="n">embedding_b</span><span class="p">):</span>
<span class="c1"># embed the inputs
</span> <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">name_scope</span><span class="p">(</span><span class="s">"Embed_inputs"</span><span class="p">):</span>
<span class="n">embedded_inputs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">inputs</span><span class="p">:</span>
<span class="c1"># Each x is a 2D tensor of size numPoints x 2
</span> <span class="n">embedded_x</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">embedding_w</span><span class="p">),</span> <span class="n">embedding_b</span><span class="p">))</span>
<span class="n">embedded_inputs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">embedded_x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">embedded_inputs</span></code></pre></figure>
<p>Now that we have a relatively good representation of the input, we can feed it through the LSTM model. To do this, we will use a 128-dimensional hidden cell state. We follow the hyperparameterisation proposed in [6], [12], where the authors chose the values using cross-validation on a synthetic dataset. We use a learning rate of 0.003 with an annealing term of 0.95, optimise with RMS-prop [5], apply L2 regularisation with $\lambda=0.05$ and clip our gradients between -10 and 10.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">NUM_EPOCHS</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">DECAY_RATE</span> <span class="o">=</span> <span class="mf">0.95</span>
<span class="n">GRAD_CLIP</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">LR</span> <span class="o">=</span> <span class="mf">0.003</span>
<span class="n">NUM_UNITS</span> <span class="o">=</span> <span class="mi">128</span>
<span class="n">EMBEDDING</span> <span class="o">=</span> <span class="mi">128</span>
<span class="n">MODE</span> <span class="o">=</span> <span class="s">'train'</span>
<span class="n">SAVE_PATH</span><span class="o">=</span><span class="s">'save'</span>
<span class="n">avg_time</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># used for printing
</span><span class="n">avg_loss</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># used for printing</span></code></pre></figure>
<p>We achieve this by updating the cell state associated with the LSTM at each step as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">lstm_advance</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">embedded_inputs</span><span class="p">,</span> <span class="n">cell</span><span class="p">,</span> <span class="n">scope_name</span><span class="o">=</span><span class="s">"LSTM"</span><span class="p">):</span>
<span class="c1"># advance the lstm cell state with one for each entry
</span> <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="n">scope_name</span><span class="p">)</span> <span class="k">as</span> <span class="n">scope</span><span class="p">:</span>
<span class="n">state</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">initial_state</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">inp</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">embedded_inputs</span><span class="p">):</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
<span class="n">scope</span><span class="p">.</span><span class="n">reuse_variables</span><span class="p">()</span>
<span class="n">output</span><span class="p">,</span> <span class="n">last_state</span> <span class="o">=</span> <span class="n">cell</span><span class="p">(</span><span class="n">inp</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span>
<span class="n">outputs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">output</span><span class="p">)</span>
<span class="k">return</span> <span class="n">outputs</span><span class="p">,</span> <span class="n">last_state</span></code></pre></figure>
<p>Having done this, we can now convert the output from the LSTM into a 5-dimensional output.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">final_layer</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">outputs</span><span class="p">,</span> <span class="n">output_w</span><span class="p">,</span> <span class="n">output_b</span><span class="p">):</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">name_scope</span><span class="p">(</span><span class="s">"Final_layer"</span><span class="p">):</span>
<span class="c1"># Apply the linear layer. Output would be a
</span> <span class="c1"># tensor of shape 1 x output_size
</span> <span class="n">output</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">outputs</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_units</span><span class="p">])</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">xw_plus_b</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="n">output_w</span><span class="p">,</span> <span class="n">output_b</span><span class="p">)</span>
<span class="k">return</span> <span class="n">output</span></code></pre></figure>
<p>We will use this output to define the distribution from which we will sample a predicted position $y_t$, namely a 2D Gaussian distribution with mean $\mu = [\mu_x, \mu_y]$, standard deviation $\sigma = [\sigma_x, \sigma_y]$ and correlation $\rho$, similar to the approach described in [8].</p>
<p><img src="https://latex.codecogs.com/svg.latex?\{x_t,%20y_t\}%20\in%20R%20\times%20R%20\times%20\{0,%201\}" alt="equation" /></p>
<p>Please note that these $x_t$ and $y_t$ are different from the $(x,y)$ coordinates. Here, $x_t$ and $y_t$ denote the input and target of the proposed model at time $t$. We denote the output of the proposed model as $\hat{y_t}$, namely:</p>
<p><img src="https://latex.codecogs.com/svg.latex?\hat{y_t}%20=%20\big{(}\{\mu_t,%20\sigma_t,%20\rho_t\}\big{)}%20=%20b_y%20+%20\sum^{N}_{n=1}W_{h^ny}h^n_t" alt="equation" /></p>
<p>In this case, $b_y$ is the associated bias, $W$ are the parameters of the last feedforward layer and $h$ are the hidden outputs from the LSTM. We ensure that the outputs representing the standard deviations are always positive by passing them through an exponential function, and that the correlation term is scaled between -1 and 1 using $\tanh$.</p>
<p><img src="https://latex.codecogs.com/svg.latex?\mu_t%20=%20\hat{\mu}_t%20\implies%20\mu_t%20\in%20R%20\\%20%20%20%20%20\sigma_t%20=%20exp\big{(}\hat{\sigma}_t%20\big{)}%20\implies%20\sigma_t%20%3E%200%20\\%20%20%20%20%20\rho_t%20=%20tanh(\hat{\rho}_t)%20\implies%20\rho_t%20\in%20(-1,%201)" alt="equation" /></p>
<p>We can then define the probability $p(x_{t+1}\vert y_t)$, conditioned on the previous target $y_t$, as:</p>
<p>$p(x_{t+1} \vert y_t) = N(x_{t+1} \vert \mu_t, \sigma_t, \rho_t)$, for $N(x \vert \mu, \sigma, \rho) = {1\over{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}}exp\big{[}{-Z\over{2(1-\rho^2)}}\big{]}$, with</p>
<p><img src="https://latex.codecogs.com/svg.latex?Z%20=%20{(x_1%20-%20\mu_1)^2\over{\sigma_1^2}}%20+%20{(x_2%20-%20\mu_2)^2\over{\sigma_2^2}}%20-%20{2\rho(x_1-\mu_1)(x_2-\mu_2)\over{\sigma_1\sigma_2}}" alt="equation" />.</p>
<p>With this we obtain a loss function which is exact up to a constant that depends only on the quantisation of the data and does not affect the training of the network.</p>
<p>$\mathcal{L} = \sum^T_{t=1}-log\big{(}N(x_{t+1} \vert \mu_t, \sigma_t, \rho_t)\big{)}$</p>
<p>Further, we can extract the partial derivatives for the five components and obtain:</p>
<p><img src="https://tdavchev.github.io/files/derivation.png" alt="Derivation" /></p>
<p>In code, we can implement the associated loss function with the following two functions:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">get_lossfunc</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">z_mux</span><span class="p">,</span> <span class="n">z_muy</span><span class="p">,</span> <span class="n">z_sx</span><span class="p">,</span> <span class="n">z_sy</span><span class="p">,</span> <span class="n">z_corr</span><span class="p">,</span> <span class="n">x_data</span><span class="p">,</span> <span class="n">y_data</span><span class="p">):</span>
<span class="c1"># Calculate the PDF of the data w.r.t to the distribution
</span> <span class="n">result0</span> <span class="o">=</span> \
<span class="n">distributions</span><span class="p">.</span><span class="n">tf_2d_normal</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">x_data</span><span class="p">,</span> <span class="n">y_data</span><span class="p">,</span> <span class="n">z_mux</span><span class="p">,</span> <span class="n">z_muy</span><span class="p">,</span> <span class="n">z_sx</span><span class="p">,</span> <span class="n">z_sy</span><span class="p">,</span> <span class="n">z_corr</span><span class="p">)</span>
<span class="c1"># For numerical stability purposes as in Vemula (2018)
</span> <span class="n">epsilon</span> <span class="o">=</span> <span class="mf">1e-20</span>
<span class="c1"># Numerical stability
</span> <span class="n">result1</span> <span class="o">=</span> <span class="o">-</span><span class="n">tf</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">result0</span><span class="p">,</span> <span class="n">epsilon</span><span class="p">))</span>
<span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">reduce_sum</span><span class="p">(</span><span class="n">result1</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">tf_2d_normal</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">mux</span><span class="p">,</span> <span class="n">muy</span><span class="p">,</span> <span class="n">sx</span><span class="p">,</span> <span class="n">sy</span><span class="p">,</span> <span class="n">rho</span><span class="p">):</span>
<span class="s">'''
Function that computes a multivariate Gaussian
Equation taken from 24 & 25 in Graves (2013)
'''</span>
<span class="k">with</span> <span class="n">g</span><span class="p">.</span><span class="n">as_default</span><span class="p">():</span>
<span class="c1"># Calculate (x-mux) and (y-muy)
</span> <span class="n">normx</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">subtract</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mux</span><span class="p">)</span>
<span class="n">normy</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">subtract</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">muy</span><span class="p">)</span>
<span class="c1"># Calculate sx*sy
</span> <span class="n">sxsy</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">sx</span><span class="p">,</span> <span class="n">sy</span><span class="p">)</span>
<span class="c1"># Calculate the exponential factor
</span> <span class="n">z</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">divide</span><span class="p">(</span><span class="n">normx</span><span class="p">,</span> <span class="n">sx</span><span class="p">))</span> <span class="o">+</span> \
<span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">divide</span><span class="p">(</span><span class="n">normy</span><span class="p">,</span> <span class="n">sy</span><span class="p">))</span> <span class="o">-</span> \
<span class="mi">2</span><span class="o">*</span><span class="n">tf</span><span class="p">.</span><span class="n">divide</span><span class="p">(</span>
<span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">rho</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">normx</span><span class="p">,</span> <span class="n">normy</span><span class="p">)),</span>
<span class="n">sxsy</span><span class="p">)</span>
<span class="n">negatedRho</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">rho</span><span class="p">)</span>
<span class="c1"># Numerator
</span> <span class="n">result</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">divide</span><span class="p">(</span><span class="o">-</span><span class="n">z</span><span class="p">,</span> <span class="mi">2</span><span class="o">*</span><span class="n">negatedRho</span><span class="p">))</span>
<span class="c1"># Normalization constant
</span> <span class="n">denominator</span> <span class="o">=</span> \
<span class="mi">2</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">pi</span> <span class="o">*</span> <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">sxsy</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">negatedRho</span><span class="p">))</span>
<span class="c1"># Final PDF calculation
</span> <span class="n">result</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">divide</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">denominator</span><span class="p">)</span>
<span class="k">return</span> <span class="n">result</span></code></pre></figure>
<p>Conveniently, Tensorflow computes the derivatives automatically, so we do not need to implement this part ourselves. All that is left is to choose the optimisation routine.</p>
<p>As previously mentioned, we will use L2 regularisation since we want to encourage a single optimal solution (namely the targeted positions $y_t$). This constrains the potential to overfit during training and helps generalisation at little computational cost. In addition, we clip our gradients (by global norm, to 10) to avoid problems associated with exploding gradients. Finally, we will use <a href="https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf">RMS-Prop</a>, an unpublished optimisation algorithm known for normalising by a second moment: a running average of the squared gradients. As an alternative, we could use Adam [9], which normalises the gradients using first and second moments and also corrects for their bias during training. Empirically, however, we found RMS-Prop to work better for this task. We hypothesise that this could be because we are interested in exploiting some of the biases of human motion, as we do not aim to generalise to agents other than human beings. You can find out more about different optimisation algorithms in <a href="https://ruder.io/optimizing-gradient-descent/">S. Ruder’s blog post</a>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">mode</span> <span class="o">!=</span> <span class="n">tf</span><span class="p">.</span><span class="n">contrib</span><span class="p">.</span><span class="n">learn</span><span class="p">.</span><span class="n">ModeKeys</span><span class="p">.</span><span class="n">INFER</span><span class="p">:</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">name_scope</span><span class="p">(</span><span class="s">"Optimization"</span><span class="p">):</span>
<span class="n">lossfunc</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_lossfunc</span><span class="p">(</span><span class="n">o_mux</span><span class="p">,</span> <span class="n">o_muy</span><span class="p">,</span> <span class="n">o_sx</span><span class="p">,</span> <span class="n">o_sy</span><span class="p">,</span> <span class="n">o_corr</span><span class="p">,</span> <span class="n">x_data</span><span class="p">,</span> <span class="n">y_data</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">cost</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">div</span><span class="p">(</span><span class="n">lossfunc</span><span class="p">,</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">batch_size</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">sequence_length</span><span class="p">))</span>
<span class="n">trainable_params</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">trainable_variables</span><span class="p">()</span>
<span class="c1"># apply L2 regularisation
</span> <span class="n">l2</span> <span class="o">=</span> <span class="mf">0.05</span> <span class="o">*</span> <span class="nb">sum</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">l2_loss</span><span class="p">(</span><span class="n">t_param</span><span class="p">)</span> <span class="k">for</span> <span class="n">t_param</span> <span class="ow">in</span> <span class="n">trainable_params</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">cost</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">cost</span> <span class="o">+</span> <span class="n">l2</span>
<span class="n">tf</span><span class="p">.</span><span class="n">summary</span><span class="p">.</span><span class="n">scalar</span><span class="p">(</span><span class="s">'cost'</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">cost</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">gradients</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">gradients</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cost</span><span class="p">,</span> <span class="n">trainable_params</span><span class="p">)</span>
<span class="n">grads</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_global_norm</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">gradients</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">grad_clip</span>
<span class="c1"># Adam might also do a good job as in Graves (2013)
</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">train</span><span class="p">.</span><span class="n">RMSPropOptimizer</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">lr</span><span class="p">)</span>
<span class="c1"># Train operator
</span> <span class="bp">self</span><span class="p">.</span><span class="n">train_op</span> <span class="o">=</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">apply_gradients</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">grads</span><span class="p">,</span> <span class="n">trainable_params</span><span class="p">))</span>
<span class="bp">self</span><span class="p">.</span><span class="n">init</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">global_variables_initializer</span><span class="p">()</span></code></pre></figure>
<p>Finally, we can define the entire model as shown in the section below.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">reset_graph</span><span class="p">()</span>
<span class="n">lstm</span> <span class="o">=</span> <span class="n">BasicLSTM</span><span class="p">(</span><span class="n">batch_size</span><span class="o">=</span><span class="n">BATCH_SIZE</span><span class="p">,</span>
<span class="n">sequence_length</span><span class="o">=</span><span class="n">SEQUENCE_LENGTH</span><span class="p">,</span>
<span class="n">num_units</span><span class="o">=</span><span class="n">NUM_UNITS</span><span class="p">,</span>
<span class="n">embedding_size</span><span class="o">=</span><span class="n">EMBEDDING</span><span class="p">,</span>
<span class="n">learning_rate</span><span class="o">=</span><span class="n">LR</span><span class="p">,</span>
<span class="n">grad_clip</span><span class="o">=</span><span class="n">GRAD_CLIP</span><span class="p">,</span>
<span class="n">mode</span><span class="o">=</span><span class="n">MODE</span><span class="p">)</span></code></pre></figure> Todor Davchev t.davchev@gmail.com In this blog’s experiments we will utilise the mentioned in previous posts (x,y) coordinate representations as input to the network. Since each of these coordinate representations is associated with a specific agent who will interact with each other, it is important to separate the associated sequences and acknowledge that each prediction will be dependent on the previous sequences observed for a given agent. Processing Trajectory Data for Sequence Generation 2019-12-08T00:00:00+00:00 2019-12-08T00:00:00+00:00 https://tdavchev.github.io/posts/2019/12/blog-post-2 <p>Before considering the details around modelling such tasks, we should spend some time to consider the datasets we will use as well as the preprocessing routines we will consider.</p>
<h1 id="datasets-used">Datasets Used</h1>
<p>In this post, we consider four different datasets, namely ETH University and ETH Hotel [5], and Zara1 and Zara2 [7]. Photos of the first two can be seen below.</p>
<h2 id="data-processing">Data Processing</h2>
<p>In these examples we are interested in figuring out the exact pixel location of each individual pedestrian (agent), as well as the associated frames we consider. All four datasets give us annotated positions but differ slightly in representation. Thus, as a first step we ensure they are aligned.</p>
<p>This is common when datasets have been built by different groups and projects and have slight misalignments. All four datasets were recorded at 25 Hz and consist on average of 3000 frames. The ETH datasets are comprised of 750 agents each, while Zara has two scenes, each with 786 agents. All videos include people walking on their own, as well as pedestrians moving in groups in a nonlinear manner. However, some of the videos annotate trajectories in millimetres in a world reference frame, while others record them in pixel coordinates with (0, 0) at the centre of each video frame. We will process all of them, ensuring all positions are represented as pixel positions with (0, 0) placed in the bottom left corner. We will further normalise them between 0 and 1 so that the size of the image or the walkway considered does not bias our solution.</p>
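<p>The min-max normalisation described in the previous paragraph can be sketched as follows (a hypothetical helper for illustration; <code>stats</code> holds the recorded minima and maxima per axis):</p>

```python
import numpy as np

def normalise(points, stats):
    """Min-max normalise (x, y) pixel positions to [0, 1].

    points: array of shape (N, 2); stats: [[x_min, x_max], [y_min, y_max]].
    """
    (x_min, x_max), (y_min, y_max) = stats
    out = np.empty_like(points, dtype=float)
    out[:, 0] = (points[:, 0] - x_min) / (x_max - x_min)
    out[:, 1] = (points[:, 1] - y_min) / (y_max - y_min)
    return out

traj = np.array([[10.0, 20.0], [110.0, 220.0]])
norm = normalise(traj, [[10.0, 110.0], [20.0, 220.0]])
# norm is [[0., 0.], [1., 1.]]: extremes map to the ends of the unit interval
```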
<h2 id="processing-examples">Processing examples</h2>
<p>To simplify this post we will only walk through transforming one of the four datasets. The aim is to clarify how such processing is achieved. The processing of the rest is similar and can be found in the <a href="https://github.com/yadrimz/Stochastic-Futures-Prediction">GitHub repository of this post</a>. This part, however, is not necessary to understand the details around the actual model.</p>
<p><b>ETH Hotel</b> is comprised of positions in a world reference frame, which we want to convert to the local, pixel reference frame. To do this, we are given the required homography.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">world_to_image_frame</span><span class="p">(</span><span class="n">loc</span><span class="p">,</span> <span class="n">Hinv</span><span class="p">):</span>
<span class="s">"""
Given H^-1 and (x, y, z) in world coordinates, returns (u, v, 1) in image
frame coordinates.
"""</span>
<span class="n">loc</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">Hinv</span><span class="p">,</span> <span class="n">loc</span><span class="p">)</span> <span class="c1"># to camera frame
</span> <span class="k">return</span> <span class="n">loc</span><span class="o">/</span><span class="n">loc</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="c1"># to pixels (from millimeters)</span></code></pre></figure>
<p>Those interested in the mathematics behind this conversion can read more about it in <a href="https://www.inf.ed.ac.uk/teaching/courses/cg/lectures/cg3_2013.pdf">Taku Komura’s lecture slides</a>. Further, we normalise the data using the minimum and maximum recorded values, which results in the following method.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">mil_to_pixels</span><span class="p">(</span><span class="n">directory</span><span class="o">=</span><span class="p">[</span><span class="s">"./data/ewap_dataset/seq_hotel"</span><span class="p">]):</span>
<span class="s">'''
Preprocess the frames from the datasets.
Convert values to pixel locations from millimeters
    obtain and store all frame data: the actually used frames (as some are skipped),
    the ids of all pedestrians present at each of those frames, and the sufficient statistics.
'''</span>
<span class="k">def</span> <span class="nf">collect_stats</span><span class="p">(</span><span class="n">agents</span><span class="p">):</span>
<span class="n">x_pos</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">y_pos</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">agent_id</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">agents</span><span class="p">)):</span>
<span class="n">trajectory</span> <span class="o">=</span> <span class="p">[[]</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)]</span>
<span class="n">traj</span> <span class="o">=</span> <span class="n">agents</span><span class="p">[</span><span class="n">agent_id</span><span class="p">]</span>
<span class="k">for</span> <span class="n">step</span> <span class="ow">in</span> <span class="n">traj</span><span class="p">:</span>
<span class="n">x_pos</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">step</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">y_pos</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">step</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="n">x_pos</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">x_pos</span><span class="p">)</span>
<span class="n">y_pos</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">y_pos</span><span class="p">)</span>
<span class="c1"># takes the average over all points through all agents
</span> <span class="k">return</span> <span class="p">[[</span><span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">x_pos</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">x_pos</span><span class="p">)],</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">y_pos</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">y_pos</span><span class="p">)]]</span>
<span class="n">Hfile</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s">"H.txt"</span><span class="p">)</span>
<span class="n">obsfile</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s">"obsmat.txt"</span><span class="p">)</span>
<span class="c1"># Parse homography matrix.
</span> <span class="n">H</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="n">Hfile</span><span class="p">)</span>
<span class="n">Hinv</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">H</span><span class="p">)</span>
<span class="c1"># Parse pedestrian annotations.
</span> <span class="n">frames</span><span class="p">,</span> <span class="n">pedsInFrame</span><span class="p">,</span> <span class="n">agents</span> <span class="o">=</span> <span class="n">parse_annotations</span><span class="p">(</span><span class="n">Hinv</span><span class="p">,</span> <span class="n">obsfile</span><span class="p">)</span>
<span class="c1"># collect mean and std
</span> <span class="n">statistics</span> <span class="o">=</span> <span class="n">collect_stats</span><span class="p">(</span><span class="n">agents</span><span class="p">)</span>
<span class="n">norm_agents</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># collect the id, normalised x and normalised y of each agent's position
</span> <span class="n">pedsWithPos</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">agent</span> <span class="ow">in</span> <span class="n">agents</span><span class="p">:</span>
<span class="n">norm_traj</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">step</span> <span class="ow">in</span> <span class="n">agent</span><span class="p">:</span>
<span class="n">_x</span> <span class="o">=</span> <span class="p">(</span><span class="n">step</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">statistics</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span> <span class="o">/</span> <span class="p">(</span><span class="n">statistics</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">statistics</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span>
<span class="n">_y</span> <span class="o">=</span> <span class="p">(</span><span class="n">step</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">-</span> <span class="n">statistics</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span> <span class="o">/</span> <span class="p">(</span><span class="n">statistics</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">statistics</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span>
<span class="n">norm_traj</span><span class="p">.</span><span class="n">append</span><span class="p">([</span><span class="nb">int</span><span class="p">(</span><span class="n">frames</span><span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">step</span><span class="p">[</span><span class="mi">0</span><span class="p">])]),</span> <span class="n">_x</span><span class="p">,</span> <span class="n">_y</span><span class="p">])</span>
<span class="n">norm_agents</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">norm_traj</span><span class="p">))</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">norm_agents</span><span class="p">),</span> <span class="n">statistics</span><span class="p">,</span> <span class="n">pedsInFrame</span></code></pre></figure>
<p>Lines 8 to 20 find the minimum and maximum values of the agents’ x and y positions. The “obsmat.txt” file contains the annotated data: each row comprises the frame number, the pedestrian id, the position along the x axis in the world frame, the positions along the y and z axes, and the three associated velocities. More information can be found in the README.txt file within the dataset directory. In this post we only use the frame number, the pedestrian’s id and their x and y positions.</p>
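<p>To make the column layout concrete, here is a minimal sketch of extracting just the columns we use. It assumes the common “obsmat” ordering (frame, id, pos_x, pos_z, pos_y, v_x, v_z, v_y), which is also why the parsing code later reads columns 2 and 4, but do check the README.txt of your copy of the dataset; the helper name is illustrative:</p>

```python
import numpy as np

def keep_used_columns(mat):
    # mat rows: [frame, ped_id, pos_x, pos_z, pos_y, v_x, v_z, v_y]
    # keep only the frame, pedestrian id and ground-plane x/y positions
    mat = np.asarray(mat, dtype=float)
    return mat[:, [0, 1, 2, 4]]

# two toy annotation rows in the assumed layout
toy_obsmat = [[0.0, 1.0, 2.0, 0.0, 3.0, 0.1, 0.0, 0.2],
              [0.0, 2.0, 4.0, 0.0, 5.0, 0.0, 0.0, 0.1]]
used = keep_used_columns(toy_obsmat)
```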
<p>Lines 34-41 group the per-frame annotations by pedestrian id, preserving the ordering of each pedestrian’s positions across frames.</p>
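<p>The normalisation above maps each coordinate into [0, 1] using the collected min and max statistics. At inference and visualisation time, predictions must be mapped back to image coordinates, so the inverse transform is also needed. A minimal sketch of both directions (the helper names are illustrative, not from the repository):</p>

```python
import numpy as np

def normalise(xy, stats):
    # stats = [[x_min, x_max], [y_min, y_max]] as returned by collect_stats
    (x0, x1), (y0, y1) = stats
    return np.array([(xy[0] - x0) / (x1 - x0), (xy[1] - y0) / (y1 - y0)])

def denormalise(xy, stats):
    # inverse map, needed to plot predictions in image coordinates
    (x0, x1), (y0, y1) = stats
    return np.array([xy[0] * (x1 - x0) + x0, xy[1] * (y1 - y0) + y0])

stats = [[0.0, 640.0], [0.0, 480.0]]
p = normalise([320.0, 120.0], stats)
q = denormalise(p, stats)
```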
<p>Line 29 calls parse_annotations(), which parses the collected data and converts it to the image reference frame.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">parse_annotations</span><span class="p">(</span><span class="n">Hinv</span><span class="p">,</span> <span class="n">obsmat_txt</span><span class="p">):</span>
<span class="s">'''
Parse the dataset and convert to image frames data.
'''</span>
<span class="n">mat</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="n">obsmat_txt</span><span class="p">)</span>
<span class="n">num_peds</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">mat</span><span class="p">[:,</span><span class="mi">1</span><span class="p">]))</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">peds</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([]).</span><span class="n">reshape</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">4</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_peds</span><span class="p">)]</span> <span class="c1"># maps ped ID -> (t,x,y,z) path
</span>
<span class="n">num_frames</span> <span class="o">=</span> <span class="p">(</span><span class="n">mat</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="s">"int"</span><span class="p">)</span>
<span class="n">num_unique_frames</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="n">mat</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]).</span><span class="n">size</span>
<span class="n">recorded_frames</span> <span class="o">=</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">num_unique_frames</span> <span class="c1"># maps timestep -> (first) frame
</span> <span class="n">peds_in_frame</span> <span class="o">=</span> <span class="p">[[]</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_unique_frames</span><span class="p">)]</span> <span class="c1"># maps timestep -> ped IDs
</span>
<span class="n">frame</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">time</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">mat</span><span class="p">:</span>
<span class="k">if</span> <span class="n">row</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">!=</span> <span class="n">frame</span><span class="p">:</span>
<span class="n">frame</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">time</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="n">recorded_frames</span><span class="p">[</span><span class="n">time</span><span class="p">]</span> <span class="o">=</span> <span class="n">frame</span>
<span class="n">ped</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">peds_in_frame</span><span class="p">[</span><span class="n">time</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">ped</span><span class="p">)</span>
<span class="n">loc</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">row</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="mi">4</span><span class="p">],</span> <span class="mi">1</span><span class="p">])</span>
            <span class="n">loc</span> <span class="o">=</span> <span class="n">to_image_frame</span><span class="p">(</span><span class="n">Hinv</span><span class="p">,</span> <span class="n">loc</span><span class="p">)</span>
<span class="n">loc</span> <span class="o">=</span> <span class="p">[</span><span class="n">time</span><span class="p">,</span> <span class="n">loc</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">loc</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">loc</span><span class="p">[</span><span class="mi">2</span><span class="p">]]</span>
<span class="n">peds</span><span class="p">[</span><span class="n">ped</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vstack</span><span class="p">((</span><span class="n">peds</span><span class="p">[</span><span class="n">ped</span><span class="p">],</span> <span class="n">loc</span><span class="p">))</span>
<span class="k">return</span> <span class="n">recorded_frames</span><span class="p">,</span> <span class="n">peds_in_frame</span><span class="p">,</span> <span class="n">peds</span></code></pre></figure>
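<p>The helper to_image_frame() is only called above. A plausible sketch of what it is assumed to do, applying the inverse homography to a homogeneous world point and renormalising the scale coordinate to land in pixel space, is shown below; the exact signature in the repository may differ:</p>

```python
import numpy as np

def to_image_frame(Hinv, loc):
    # loc is a homogeneous world-frame point [x, y, 1];
    # Hinv maps world coordinates back to pixel coordinates
    loc = np.dot(Hinv, loc)
    return loc / loc[2]  # normalise the homogeneous scale coordinate

# with an identity homography the world and image frames coincide
Hinv = np.eye(3)
pixel = to_image_frame(Hinv, np.array([12.0, 8.0, 1.0]))
```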
<p>We can combine the preprocessing of all datasets in a single function. Ideally, we save the preprocessed data once and load it directly each time we need it. We can then call a function that loads the preprocessed data in the required format. It is useful to split the trajectories into sequences of a pre-chosen length; given a batch size, we can then easily compute the number of batches.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">BATCH_SIZE</span> <span class="o">=</span> <span class="mi">50</span>
<span class="n">SEQUENCE_LENGTH</span> <span class="o">=</span> <span class="mi">8</span>
<span class="n">agents_data</span><span class="p">,</span> <span class="n">dicto</span><span class="p">,</span> <span class="n">dataset_indices</span> <span class="o">=</span> \
<span class="n">data_tools</span><span class="p">.</span><span class="n">preprocess</span><span class="p">(</span><span class="n">training_directories</span><span class="p">)</span>
<span class="n">loaded_data</span><span class="p">,</span> <span class="n">num_batches</span> <span class="o">=</span> \
<span class="n">data_tools</span><span class="p">.</span><span class="n">load_preprocessed</span><span class="p">(</span><span class="n">agents_data</span><span class="p">,</span> <span class="n">BATCH_SIZE</span><span class="p">,</span> <span class="n">SEQUENCE_LENGTH</span><span class="p">)</span></code></pre></figure>
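<p>load_preprocessed() is not shown here, but the batch count it returns can be sketched as follows (count_batches is an illustrative name): a trajectory of length L yields int(L / (sequence_length + 1)) training sequences, where the extra step provides the one-step-shifted targets, and the total is divided by the batch size:</p>

```python
def count_batches(trajectories, batch_size, sequence_length):
    # a trajectory of length L yields L // (sequence_length + 1)
    # usable (source, target) sequences; targets are shifted by one step
    counter = sum(len(traj) // (sequence_length + 1) for traj in trajectories)
    return counter // batch_size

# three toy trajectories of lengths 18, 9 and 27 with sequence length 8
toy_trajectories = [[0] * 18, [0] * 9, [0] * 27]
n = count_batches(toy_trajectories, batch_size=2, sequence_length=8)
```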
<h2 id="batching">Batching</h2>
<p>After obtaining and pre-processing all the information, we need to implement a routine for sampling random batches that ensures each sample is comprised of unbroken sequences. One thing to keep in mind is that some of the sampled trajectories might be shorter than the given sequence length and others might be longer. In the former case we want to avoid such trajectories, while in the latter we want to split them into multiple samples. We do this by defining a function called “next_batch()” that takes as input the associated data, a pointer indicating the trajectory currently being considered, the required batch size, the desired sequence length and a flag specifying whether we are sampling at inference or training time.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">next_batch</span><span class="p">(</span><span class="n">_data</span><span class="p">,</span> <span class="n">pointer</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">sequence_length</span><span class="p">,</span> <span class="n">infer</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
<span class="s">'''
Function to get the next batch of points
'''</span>
<span class="c1"># List of source and target data for the current batch
</span> <span class="n">x_batch</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">y_batch</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># For each sequence in the batch
</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">batch_size</span><span class="p">):</span>
<span class="c1"># Extract the trajectory of the pedestrian pointed out by pointer
</span> <span class="n">traj</span> <span class="o">=</span> <span class="n">_data</span><span class="p">[</span><span class="n">pointer</span><span class="p">]</span>
<span class="c1"># Number of sequences corresponding to his trajectory
</span> <span class="n">n_batch</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">traj</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="p">(</span><span class="n">sequence_length</span><span class="o">+</span><span class="mi">1</span><span class="p">))</span>
<span class="c1"># Randomly sample an index from which his trajectory is to be considered
</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">infer</span><span class="p">:</span>
<span class="n">idx</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">traj</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">sequence_length</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span>
<span class="c1"># Append the trajectory from idx until sequence_length into source and target data
</span> <span class="n">x_batch</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">traj</span><span class="p">[</span><span class="n">idx</span><span class="p">:</span><span class="n">idx</span><span class="o">+</span><span class="n">sequence_length</span><span class="p">,</span> <span class="p">:]))</span>
<span class="n">y_batch</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">traj</span><span class="p">[</span><span class="n">idx</span><span class="o">+</span><span class="mi">1</span><span class="p">:</span><span class="n">idx</span><span class="o">+</span><span class="n">sequence_length</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]))</span>
<span class="k">if</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o"><</span> <span class="p">(</span><span class="mf">1.0</span><span class="o">/</span><span class="nb">float</span><span class="p">(</span><span class="n">n_batch</span><span class="p">)):</span>
<span class="c1"># Adjust sampling probability
</span> <span class="c1"># if this is a long datapoint, sample this data more with
</span> <span class="c1"># higher probability
</span> <span class="n">pointer</span> <span class="o">=</span> <span class="n">tick_batch_pointer</span><span class="p">(</span><span class="n">pointer</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">_data</span><span class="p">))</span>
<span class="k">return</span> <span class="n">x_batch</span><span class="p">,</span> <span class="n">y_batch</span><span class="p">,</span> <span class="n">pointer</span></code></pre></figure>
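<p>next_batch() relies on tick_batch_pointer(), which is not shown. A minimal sketch, assuming it simply advances to the next trajectory and wraps around once every trajectory has been visited:</p>

```python
def tick_batch_pointer(pointer, num_trajectories):
    # advance to the next trajectory, wrapping back to the start
    return (pointer + 1) % num_trajectories

p = tick_batch_pointer(0, 3)  # -> 1
q = tick_batch_pointer(2, 3)  # wraps around -> 0
```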
<p>Now that we have the required data processed and an associated batching function in place, we can focus on building the actual model.</p> Tutorial on Stochastic Trajectory Prediction 2019-12-01T00:00:00+00:00 https://tdavchev.github.io/posts/2019/12/blog-post-1 <p>This tutorial is meant as an introduction to the problem of trajectory generation. It introduces several ways of modelling the motion of agents in pixel space and several ways of preprocessing data. It follows the structure of its associated <a href="https://github.com/yadrimz/Stochastic-Futures-Prediction">GitHub repository</a>. Feel free to skip to the end to see the performance of two basic models.</p>
<h1 id="introduction">Introduction</h1>
<p>Neural solutions such as Long Short-Term Memory networks (LSTMs) [1] have become very common for modelling time-series signals in the past few years. Popular, well-established examples span many applications of AI, namely speech processing [2], language modelling [3] and translation [4], among many others. In particular, the idea behind LSTMs is that they are good at mimicking the hidden dynamics behind a given short sequence.</p>
<p>The overall structure of such networks (see the figure below, taken from <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Chris Olah’s</a> blog post, which is a great introduction to LSTMs along with <a href="https://www.deeplearningbook.org/contents/rnn.html">Chapter 10 of the book “Deep Learning”</a>) allows us to model each step of a given signal while taking into account all previously considered ones. In the context of text modelling, those signals can be all previously observed words in a given sentence, where each step consists of predicting one word by conditioning on all previously seen words. In this setting, LSTMs are good at encapsulating the context behind a given sentence, which makes them a good candidate for predicting its continuation.
<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png" alt="LSTM-Chris Olah" /></p>
<p>Similarly, we can model the dynamics associated with the motion of pedestrians in crowded scenes, the behaviour of agents in different games, moving cars on a busy road etc. In such cases, the task would be to predict where each considered agent will be after some time. This post is an introduction to the use of LSTMs in such problems.</p>
<p>We will focus on predicting the next few 2D positions (x, y) of pedestrians from annotated videos. We assume a good understanding of what LSTMs are and of the difficulties behind extracting meaningful feature representations, as well as basic computer science knowledge, algebra and calculus, and some prior experience with TensorFlow. We conclude with a brief discussion of the pros and cons of the uses of LSTMs introduced here.</p>
<p>Anticipating the position of pedestrians in a given video sequence is often a challenging task, even for humans. Motion is often dictated by unspoken rules that differ across cultures and are, in addition, interpreted differently by different people. For example, when walking in crowded scenes we might aim to avoid collisions with others, but we might also aim to reach someone specific and stop in front of them. In such cases it is relatively difficult for an observer to anticipate the goal of someone walking in the scene over the next 5-10 minutes just by watching, yet we can relatively easily tell where a pedestrian aims to be in the next 5-10 seconds. Regardless, we often maintain a few hypotheses of where a person will be, so modelling such motions using purely deterministic approaches (such as simply using standard LSTMs) can be very hard. An alternative is to train a neural network to model the sufficient statistics of the distribution the next step is sampled from.</p>
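<p>Concretely, instead of regressing the next (x, y) directly, the network can output the five sufficient statistics of a bivariate Gaussian, namely two means, two standard deviations and a correlation, and the next position is then sampled from that distribution. A minimal sketch of such a sampling step (the parameter names are illustrative):</p>

```python
import numpy as np

def sample_next_position(mux, muy, sx, sy, rho, rng=np.random):
    # build the 2x2 covariance matrix from the predicted statistics
    mean = [mux, muy]
    cov = [[sx * sx, rho * sx * sy],
           [rho * sx * sy, sy * sy]]
    x, y = rng.multivariate_normal(mean, cov)
    return x, y

# with near-zero spread the sample collapses onto the predicted mean
x, y = sample_next_position(0.5, 0.25, 1e-6, 1e-6, 0.0)
```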
<h2 id="initialization">Initialization</h2>
<p>We begin by installing TensorFlow and cloning the GitHub code described in this blog post. We then import everything necessary, including all datasets we will use for training, and assign the constant variables needed for training the network.</p>
<p><code class="language-plaintext highlighter-rouge">!pip install tensorflow==1.15.0</code></p>
<p><code class="language-plaintext highlighter-rouge">!git clone https://github.com/yadrimz/Stochastic-Futures-Prediction.git</code></p>
<h1 id="problem-formulation">Problem Formulation</h1>
<p>This work’s focus is on predicting the future motion of an arbitrary number of observed agents (i.e. their behaviour) whose action spaces and objectives are unknown. More specifically, we focus on predicting the two-dimensional motion of agents in video sequences.</p>
<p>We assume we are given a history of two-dimensional position annotations and video frames as a sequence of RGB images.
Each agent $a$ ($a \in [1 \ldots A]$, where $A$ is the maximum number of agents in the video) is represented by a state ($s_t^a$) which comprises xy-coordinates at time t, $s_t^a$ = $(x_t,y_t)_a$.</p>
<p>Given a sequence of $obs$ observed states $S = s_{t-obs}, s_{t-obs+1}, \ldots, s_{t-1}, s_{t}$, we will formulate the prediction as an optimisation process, where the objective is to learn a posterior distribution $P(Y \vert S)$, of multiple agents future trajectories $Y$. Here an individual agent’s future trajectory is defined as ($s_t^a = {s_{t+1}^a, s_{t+2}^a, \ldots, s_{t+pred}^a}$) for $pred$ steps ahead for every agent $a$ found in a given frame at time $t$.</p>
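<p>To make the formulation concrete, the sketch below splits a single agent’s trajectory into the $obs$ observed states $S$ and the $pred$ future states $Y$ (a single-agent, array-shape illustration; the actual pipeline operates on batches of agents):</p>

```python
import numpy as np

def split_trajectory(states, obs, pred):
    # states: (T, 2) array of xy-coordinates for one agent
    assert len(states) >= obs + pred
    S = states[:obs]             # observed states up to time t
    Y = states[obs:obs + pred]   # future trajectory to be predicted
    return S, Y

traj = np.arange(24, dtype=float).reshape(12, 2)  # 12 toy timesteps
S, Y = split_trajectory(traj, obs=8, pred=4)
```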
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="o">%</span><span class="n">tensorflow_version</span> <span class="mf">1.</span><span class="n">x</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">import</span> <span class="nn">utils.data_tools</span> <span class="k">as</span> <span class="n">data_tools</span>
<span class="kn">import</span> <span class="nn">utils.visualisation</span> <span class="k">as</span> <span class="n">visualisation</span>
<span class="kn">import</span> <span class="nn">utils.distributions</span> <span class="k">as</span> <span class="n">distributions</span>
<span class="kn">from</span> <span class="nn">models.lstm</span> <span class="kn">import</span> <span class="n">BasicLSTM</span>
<span class="kn">from</span> <span class="nn">models.lstm</span> <span class="kn">import</span> <span class="n">reset_graph</span></code></pre></figure>