<h1 id="crowded-scene-training-inference-and-useful-tricks">Crowded Scene Training/Inference and Useful Tricks</h1>
<p>Todor Davchev, 2019-12-22, <a href="https://tdavchev.github.io/posts/2019/12/blog-post-4">original post</a>.</p>
<p>Now that we have defined the entire model, we can start training the neural network. Each training step takes one batch and predicts the next position of every agent in it. The result is then compared to the target values through the loss function we defined earlier.</p>
<h2 id="training">Training</h2>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">NUM_EPOCHS</span><span class="p">):</span>
<span class="c1"># Assign the learning rate (decayed acc. to the epoch number)
</span> <span class="n">lstm</span><span class="p">.</span><span class="n">sess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">lstm</span><span class="p">.</span><span class="n">lr</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">learning_rate</span> <span class="o">*</span> <span class="p">(</span><span class="n">DECAY_RATE</span> <span class="o">**</span> <span class="n">e</span><span class="p">)))</span>
<span class="c1"># Reset the pointers in the data loader object
</span> <span class="n">pointer</span> <span class="o">=</span> <span class="n">data_tools</span><span class="p">.</span><span class="n">reset_batch_pointer</span><span class="p">()</span>
<span class="c1"># Get the initial cell state of the LSTM
</span> <span class="n">state</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">.</span><span class="n">sess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">lstm</span><span class="p">.</span><span class="n">initial_state</span><span class="p">)</span>
<span class="c1"># For each batch in this epoch
</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_batches</span><span class="p">):</span>
<span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="c1"># Get the source and target data of the current batch
</span> <span class="c1"># x has the source data, y has the target data
</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pointer</span> <span class="o">=</span> <span class="n">data_tools</span><span class="p">.</span><span class="n">next_batch</span><span class="p">(</span>
<span class="n">loaded_data</span><span class="p">,</span> <span class="n">pointer</span><span class="p">,</span> <span class="n">BATCH_SIZE</span><span class="p">,</span> <span class="n">SEQUENCE_LENGTH</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
<span class="c1"># Feed the source, target data and the initial LSTM state to the model
</span> <span class="n">feed</span> <span class="o">=</span> <span class="p">{</span>
<span class="n">lstm</span><span class="p">.</span><span class="n">input_data</span><span class="p">:</span> <span class="n">x</span><span class="p">[:,</span> <span class="p">:,</span> <span class="mi">1</span><span class="p">:],</span>
<span class="n">lstm</span><span class="p">.</span><span class="n">target_data</span><span class="p">:</span> <span class="n">y</span><span class="p">[:,</span> <span class="p">:,</span> <span class="mi">1</span><span class="p">:],</span>
<span class="n">lstm</span><span class="p">.</span><span class="n">initial_state</span><span class="p">:</span> <span class="n">state</span>
<span class="p">}</span>
<span class="c1"># Fetch the loss of the model on this batch,
</span> <span class="c1"># the final LSTM state from the session
</span> <span class="n">train_loss</span><span class="p">,</span> <span class="n">state</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">.</span><span class="n">sess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span>
<span class="p">[</span><span class="n">lstm</span><span class="p">.</span><span class="n">cost</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">final_state</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">train_op</span><span class="p">],</span> <span class="n">feed</span><span class="p">)</span>
<span class="c1"># Toc
</span> <span class="n">end</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="n">cur_time</span> <span class="o">=</span> <span class="n">end</span> <span class="o">-</span> <span class="n">start</span>
<span class="n">step</span> <span class="o">=</span> <span class="n">e</span> <span class="o">*</span> <span class="n">num_batches</span> <span class="o">+</span> <span class="n">b</span>
<span class="n">avg_time</span> <span class="o">+=</span> <span class="n">cur_time</span>
<span class="n">avg_loss</span> <span class="o">+=</span> <span class="n">train_loss</span>
<span class="c1"># Print epoch, batch, loss and time taken
</span> <span class="k">if</span> <span class="p">(</span><span class="n">step</span><span class="o">%</span><span class="mi">99</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span>
<span class="s">"{}/{} (epoch {}), train_loss = {:.3f}, time/batch = {:.3f}"</span>
<span class="p">.</span><span class="nb">format</span><span class="p">(</span>
<span class="n">step</span><span class="p">,</span>
<span class="n">NUM_EPOCHS</span> <span class="o">*</span> <span class="n">num_batches</span><span class="p">,</span>
<span class="n">e</span><span class="p">,</span>
<span class="n">avg_loss</span><span class="o">/</span><span class="mf">99.0</span><span class="p">,</span> <span class="n">avg_time</span><span class="o">/</span><span class="mf">99.0</span><span class="p">))</span>
<span class="n">avg_time</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">avg_loss</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">SAVE_PATH</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Save parameters after the network is trained
</span><span class="n">lstm</span><span class="p">.</span><span class="n">save_json</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">SAVE_PATH</span><span class="p">,</span> <span class="s">"params.json"</span><span class="p">))</span></code></pre></figure>
<h2 id="inference">Inference</h2>
<p>After training is complete, we can use the model to infer the positions of agents in previously unseen data. When predicting trajectories of this type, we assume the observed agents have already walked for some time and that we have the associated observed annotations. We therefore “preload” the LSTM with the manner of walking of a particular agent, conditioning the predictions on that agent’s manner of walking, direction and behaviour. In addition, we set the batch size to 1 and load the unseen data set.</p>
<p>We sample from the predicted 2D Gaussian distribution in the following way:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">sample_2d_normal</span><span class="p">(</span><span class="n">o_mux</span><span class="p">,</span> <span class="n">o_muy</span><span class="p">,</span> <span class="n">o_sx</span><span class="p">,</span> <span class="n">o_sy</span><span class="p">,</span> <span class="n">o_corr</span><span class="p">):</span>
<span class="s">'''
Function that samples from a multivariate Gaussian
That has the statistics computed by the network.
'''</span>
<span class="n">mean</span> <span class="o">=</span> <span class="p">[</span><span class="n">o_mux</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">o_muy</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]]</span>
<span class="c1"># Extract covariance matrix
</span> <span class="n">cov</span> <span class="o">=</span> <span class="p">[[</span><span class="n">o_sx</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">o_sx</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">o_corr</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">o_sx</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">o_sy</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]],</span> <span class="p">[</span><span class="n">o_corr</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">o_sx</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">o_sy</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">o_sy</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">o_sy</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]]]</span>
<span class="c1"># Sample a point from the multivariate normal distribution
</span> <span class="n">sampled_x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">multivariate_normal</span><span class="p">(</span><span class="n">mean</span><span class="p">,</span> <span class="n">cov</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">sampled_x</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">sampled_x</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span></code></pre></figure>
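<p>The statistics the network returns arrive as <code>(1, 1)</code> arrays, which is why the function indexes them with <code>[0][0]</code>. A quick standalone check of the sampling step (the function is re-stated here in compact form so the snippet runs on its own; the input values are made up for illustration):</p>

```python
import numpy as np

def sample_2d_normal(o_mux, o_muy, o_sx, o_sy, o_corr):
    # Re-stated from the snippet above so this example runs standalone.
    mean = [o_mux[0][0], o_muy[0][0]]
    cov = [[o_sx[0][0] ** 2, o_corr[0][0] * o_sx[0][0] * o_sy[0][0]],
           [o_corr[0][0] * o_sx[0][0] * o_sy[0][0], o_sy[0][0] ** 2]]
    sampled = np.random.multivariate_normal(mean, cov, 1)
    return sampled[0][0], sampled[0][1]

# Toy network outputs, shaped (1, 1) like the tensors fetched from the session.
o_mux, o_muy = np.array([[0.5]]), np.array([[0.3]])
o_sx, o_sy, o_corr = np.array([[0.05]]), np.array([[0.08]]), np.array([[0.2]])

np.random.seed(0)
next_x, next_y = sample_2d_normal(o_mux, o_muy, o_sx, o_sy, o_corr)
print(next_x, next_y)  # a point near (0.5, 0.3)
```

<p>Note that the covariance matrix is built from the predicted standard deviations and correlation, so it is symmetric and positive semi-definite by construction as long as <code>|o_corr| &lt;= 1</code>.</p>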
<p>For a given trajectory, the idea is to simply update the cell state at each step using the final cell state value from our previous prediction.</p>
<p>We then measure the performance of the network through the average displacement error.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_batches</span><span class="p">):</span>
<span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="c1"># Get the source and target data of the current batch
</span> <span class="c1"># x has the source data, y has the target data
</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pointer</span> <span class="o">=</span> \
<span class="n">data_tools</span><span class="p">.</span><span class="n">next_batch</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">pointer</span><span class="p">,</span> <span class="n">BATCH_SIZE</span><span class="p">,</span> <span class="n">SEQUENCE_LENGTH</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
<span class="n">obs_traj</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">,:</span><span class="n">OBSERVED_LENGTH</span><span class="p">,</span> <span class="mi">1</span><span class="p">:]</span>
<span class="k">for</span> <span class="n">position</span> <span class="ow">in</span> <span class="n">obs_traj</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span>
<span class="c1"># Create the input data tensor
</span> <span class="n">input_data_tensor</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">input_data_tensor</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">position</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="c1"># x
</span> <span class="n">input_data_tensor</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">position</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># y
</span>
<span class="c1"># Create the feed dict
</span> <span class="n">feed</span> <span class="o">=</span> <span class="p">{</span><span class="n">lstm</span><span class="p">.</span><span class="n">input_data</span><span class="p">:</span> <span class="n">input_data_tensor</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">initial_state</span><span class="p">:</span> <span class="n">state</span><span class="p">}</span>
<span class="c1"># Get the final state after processing the current position
</span> <span class="p">[</span><span class="n">state</span><span class="p">]</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">.</span><span class="n">sess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="n">lstm</span><span class="p">.</span><span class="n">final_state</span><span class="p">],</span> <span class="n">feed</span><span class="p">)</span>
<span class="n">returned_traj</span> <span class="o">=</span> <span class="n">obs_traj</span>
<span class="n">last_position</span> <span class="o">=</span> <span class="n">obs_traj</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">prev_data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">prev_data</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">last_position</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="c1"># x
</span> <span class="n">prev_data</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">last_position</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># y
</span>
<span class="n">prev_target_data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">obs_traj</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">:],</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
<span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">PREDICTED_LENGTH</span><span class="p">):</span>
<span class="n">feed</span> <span class="o">=</span> <span class="p">{</span>
<span class="n">lstm</span><span class="p">.</span><span class="n">input_data</span><span class="p">:</span> <span class="n">prev_data</span><span class="p">,</span>
<span class="n">lstm</span><span class="p">.</span><span class="n">initial_state</span><span class="p">:</span> <span class="n">state</span><span class="p">,</span>
<span class="n">lstm</span><span class="p">.</span><span class="n">target_data</span><span class="p">:</span> <span class="n">prev_target_data</span><span class="p">}</span>
<span class="p">[</span><span class="n">o_mux</span><span class="p">,</span> <span class="n">o_muy</span><span class="p">,</span> <span class="n">o_sx</span><span class="p">,</span> <span class="n">o_sy</span><span class="p">,</span> <span class="n">o_corr</span><span class="p">,</span> <span class="n">state</span><span class="p">]</span> <span class="o">=</span> <span class="n">lstm</span><span class="p">.</span><span class="n">sess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span>
<span class="p">[</span><span class="n">lstm</span><span class="p">.</span><span class="n">mux</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">muy</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">sx</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">sy</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">corr</span><span class="p">,</span> <span class="n">lstm</span><span class="p">.</span><span class="n">final_state</span><span class="p">],</span>
<span class="n">feed</span><span class="p">)</span>
<span class="n">next_x</span><span class="p">,</span> <span class="n">next_y</span> <span class="o">=</span> \
<span class="n">distributions</span><span class="p">.</span><span class="n">sample_2d_normal</span><span class="p">(</span><span class="n">o_mux</span><span class="p">,</span> <span class="n">o_muy</span><span class="p">,</span> <span class="n">o_sx</span><span class="p">,</span> <span class="n">o_sy</span><span class="p">,</span> <span class="n">o_corr</span><span class="p">)</span>
<span class="n">returned_traj</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vstack</span><span class="p">((</span><span class="n">returned_traj</span><span class="p">,</span> <span class="p">[</span><span class="n">next_x</span><span class="p">,</span> <span class="n">next_y</span><span class="p">]))</span>
<span class="n">prev_data</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">next_x</span>
<span class="n">prev_data</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">next_y</span>
<span class="n">complete_traj</span> <span class="o">=</span> <span class="n">returned_traj</span>
<span class="n">total_error</span> <span class="o">+=</span> \
<span class="n">distributions</span><span class="p">.</span><span class="n">get_mean_error</span><span class="p">(</span><span class="n">complete_traj</span><span class="p">,</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="p">:,</span> <span class="mi">1</span><span class="p">:],</span> <span class="n">OBSERVED_LENGTH</span><span class="p">)</span>
<span class="k">if</span> <span class="p">(</span><span class="n">b</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="mi">50</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Processed trajectory number : "</span><span class="p">,</span>
<span class="n">b</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="s">"out of "</span><span class="p">,</span> <span class="n">num_batches</span><span class="p">,</span> <span class="s">" trajectories"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Total mean error of the model is "</span><span class="p">,</span> <span class="n">total_error</span><span class="o">/</span><span class="n">num_batches</span><span class="p">)</span></code></pre></figure>
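<p>The helper <code>distributions.get_mean_error</code> is used above but not shown. A minimal sketch of the average displacement error, assuming trajectories are <code>(T, 2)</code> arrays and that only the predicted steps (after the first <code>OBSERVED_LENGTH</code> positions) contribute to the error:</p>

```python
import numpy as np

def get_mean_error(predicted_traj, true_traj, observed_length):
    # Average displacement error: mean Euclidean distance between predicted
    # and ground-truth positions over the predicted steps only (the first
    # `observed_length` positions were given to the model, not predicted).
    pred = np.asarray(predicted_traj)[observed_length:]
    true = np.asarray(true_traj)[observed_length:]
    return np.mean(np.linalg.norm(pred - true, axis=1))

pred = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 1.0]])
true = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
print(get_mean_error(pred, true, observed_length=2))  # (0 + 1) / 2 = 0.5
```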
<h1 id="experimental-results">Experimental Results</h1>
<p>We estimate the quality of the proposed solution in two ways: first, using the mean squared error between the predicted points and the target distribution; second, by visualising the behaviour using OpenCV and Matplotlib.</p>
<p>Predicting 4 steps ahead ($T_{pred}=4$) after observing 4 steps ($T_{obs}=4$), we obtain 563 trajectories in total with an average mean squared error of 0.088. We then visualise the result in the following plot.<img src="https://dev.bg/wp-content/uploads/2018/11/inference.gif" alt="alt text" /></p>
<p>As you can see, the result (in purple) matches the real trajectory (in green) for the first 4 steps, since those were observed. The subsequent predictions (except the first, direct one), however, do not do as well. The reason is that the distribution each step is sampled from is conditioned on the previously predicted position, so errors compound from step to step and lead to this poor behaviour. A few ways to mitigate this are iteratively computing each trajectory while ignoring the empty steps, providing more training data, or providing more meaningful inputs. Iteratively computing each trajectory together with the rest of the trajectories from the same frame sequence is a relatively simple workaround that yields much better solutions: motion is conditioned on the surrounding trajectories, so optimising each trajectory along with its neighbours makes a lot of sense. <a href="https://github.com/yadrimz/Stochastic-Futures-Prediction/blob/master/notebooks/Tutorial%202.ipynb">Tutorial 2</a> from this post’s <a href="https://github.com/yadrimz/Stochastic-Futures-Prediction">GitHub repository</a> shows one way of substantially reducing the error by following this simple trick.</p>
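<p>The compounding effect can be illustrated with a toy simulation (not taken from the post): a predictor that adds zero-mean Gaussian noise at every step drifts like a random walk when its own outputs are fed back in, while a predictor that always restarts from ground truth keeps a constant error level:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1          # per-step prediction noise
steps, trials = 8, 5000

noise = rng.normal(0.0, sigma, size=(trials, steps))

# Autoregressive rollout: each prediction is fed back as the next input,
# so the per-step noise accumulates along the trajectory.
autoregressive_error = np.abs(np.cumsum(noise, axis=1))

# One-step prediction from ground truth: each step starts fresh,
# so the error stays at a single step's noise level.
one_step_error = np.abs(noise)

print(autoregressive_error.mean(axis=0))  # grows roughly like sigma * sqrt(t)
print(one_step_error.mean(axis=0))        # stays roughly flat
```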
<h1 id="conclusion">Conclusion</h1>
<p>The problem of trajectory modelling is relatively old. Principled approaches that predate neural solutions use Kalman filters, social utility functions [10] and factorised, interactive Gaussian processes [11]. Despite the accuracy of some of these methods, they all require considerably more prior knowledge before the trajectories can be modelled. LSTMs, by contrast, extract the behaviour entirely from data, which can be extremely useful when such inductive information is not available; the trade-off is that they require larger amounts of data.</p>
<p>One potential way to improve these results is to utilise rough inductive biases, ideally extracted in an unsupervised manner. This way, we can build a more informed representation that is conditioned on information important to each dataset. One example is utilising both spatial and global dynamics to improve the representations discussed previously in this post. Spatial representations extract static information inherent to the given dataset, such as existing trees or surrounding snow, while global dynamics models the unspoken rules of motion, such as the general direction of travel, the way individual agents overtake one another, or places where agents tend to remain stationary, such as bus stops while waiting for the next bus. [6] models the global social dynamics between agents, while [12] utilises both spatial and global dynamics through features extracted in an unsupervised way. <a href="https://sites.google.com/view/rdb-agents/home">The webpage</a> associated with [12] summarises one way of incorporating inductive biases in a modular fashion that preserves the small data requirements.</p>
<p>As a quick step towards improving the results, we can optimise batches of random sequences of frames instead of batches of random agents. Intuitively, we want to optimise all agents who are moving alongside one another together, which amounts to a more accurate approximation of the considered data. Further, we optimise each agent individually, discard all empty slots and apply a few other small tricks. More information can be found in the associated notebook, <a href="https://github.com/yadrimz/Stochastic-Futures-Prediction/blob/master/notebooks/Tutorial%202.ipynb">Tutorial 2</a>. A sample from the resulting solution can be seen below.</p>
<p><img src="https://drive.google.com/uc?export=view&id=168tMhQOgaecxUeM7WT_0AXw_cmtANIp_" alt="Tutorial 2 result" /></p>
<h1 id="references">References</h1>
<p>[1] Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural computation, 9(8), pp.1735–1780.</p>
<p>[2] Graves, A., Mohamed, A.R. and Hinton, G., 2013, May. Speech recognition with deep recurrent neural networks. In Acoustics, speech and signal processing (ICASSP), 2013 IEEE international conference on (pp. 6645–6649). IEEE.</p>
<p>[3] Bowman, S.R., Gauthier, J., Rastogi, A., Gupta, R., Manning, C.D. and Potts, C., 2016. A fast unified model for parsing and sentence understanding. arXiv preprint arXiv:1603.06021.</p>
<p>[4] Sutskever, I., Vinyals, O. and Le, Q.V., 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104–3112).</p>
<p>[5] Yücel, Z., Zanlungo, F., Ikeda, T., Miyashita, T. and Hagita, N., 2013. Deciphering the crowd: Modeling and identification of pedestrian group motion. Sensors, 13(1), pp.875–897.</p>
<p>[6] Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L. and Savarese, S., 2016. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 961–971).</p>
<p>[7] Lerner, A., Chrysanthou, Y. and Lischinski, D., 2007, September. Crowds by example. In Computer Graphics Forum (Vol. 26, No. 3, pp. 655–664). Oxford, UK: Blackwell Publishing Ltd.</p>
<p>[8] Graves, A., 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.</p>
<p>[9] Kingma, D.P. and Ba, J.L., 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (pp. 1–13).</p>
<p>[10] Yamaguchi, K., Berg, A.C., Ortiz, L.E. and Berg, T.L., 2011, June. Who are you with and where are you going?. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on(pp. 1345–1352). IEEE.</p>
<p>[11] Trautman, P. and Krause, A., 2010, October. Unfreezing the robot: Navigation in dense, interacting crowds. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on (pp. 797–803).</p>
<p>[12] Davchev, T., Burke, M. and Ramamoorthy, S., 2019. Learning Modular Representations for Long-Term Multi-Agent Motion Predictions. arXiv preprint arXiv:1911.13044.</p>
<h1 id="most-basic-stochastic-lstm-for-trajectory-prediction">Most Basic Stochastic LSTM for Trajectory Prediction</h1>
<p>Todor Davchev, 2019-12-15, <a href="https://tdavchev.github.io/posts/2019/12/blog-post-3">original post</a>.</p>
<p>In this blog’s experiments we will utilise the (x, y) coordinate representations mentioned in previous posts as input to the network. Since each of these coordinate representations is associated with a specific agent, and the agents interact with one another, it is important to separate the associated sequences and acknowledge that each prediction depends on the previous sequences observed for a given agent.</p>
<h1 id="methodology---stochastic-lstms">Methodology - Stochastic LSTMs</h1>
<h2 id="implementation-details">Implementation Details</h2>
<p>As we already mentioned, we assume a good understanding of LSTMs. As input to the network we use a sequence of positions of a given agent, and each step is converted to a 128-dimensional feature vector. This conversion happens through a linear operation followed by a nonlinear <a href="https://cs231n.github.io/neural-networks-1/">ReLU (Rectified Linear Unit)</a> activation.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">embed_inputs</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inputs</span><span class="p">,</span> <span class="n">embedding_w</span><span class="p">,</span> <span class="n">embedding_b</span><span class="p">):</span>
<span class="c1"># embed the inputs
</span> <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">name_scope</span><span class="p">(</span><span class="s">"Embed_inputs"</span><span class="p">):</span>
<span class="n">embedded_inputs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">inputs</span><span class="p">:</span>
<span class="c1"># Each x is a 2D tensor of size numPoints x 2
</span> <span class="n">embedded_x</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">embedding_w</span><span class="p">),</span> <span class="n">embedding_b</span><span class="p">))</span>
<span class="n">embedded_inputs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">embedded_x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">embedded_inputs</span></code></pre></figure>
<p>Now that we have a relatively good representation of the input, we can feed it through the LSTM model. To do this, we will use a 128-dimensional hidden cell state. We follow the hyperparameterisation proposed in [6], [12], where the authors chose the values using cross-validation on a synthetic dataset. We use a learning rate of 0.003 with an annealing term of 0.95, optimise with RMS-prop [5], apply L2 regularisation with $\lambda=0.05$ and clip our gradients between -10 and 10.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">NUM_EPOCHS</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">DECAY_RATE</span> <span class="o">=</span> <span class="mf">0.95</span>
<span class="n">GRAD_CLIP</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">LR</span> <span class="o">=</span> <span class="mf">0.003</span>
<span class="n">NUM_UNITS</span> <span class="o">=</span> <span class="mi">128</span>
<span class="n">EMBEDDING</span> <span class="o">=</span> <span class="mi">128</span>
<span class="n">MODE</span> <span class="o">=</span> <span class="s">'train'</span>
<span class="n">SAVE_PATH</span><span class="o">=</span><span class="s">'save'</span>
<span class="n">avg_time</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># used for printing
</span><span class="n">avg_loss</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># used for printing</span></code></pre></figure>
<p>We achieve this by updating the cell state associated with the LSTM at each step as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">lstm_advance</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">embedded_inputs</span><span class="p">,</span> <span class="n">cell</span><span class="p">,</span> <span class="n">scope_name</span><span class="o">=</span><span class="s">"LSTM"</span><span class="p">):</span>
<span class="c1"># advance the lstm cell state with one for each entry
</span> <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="n">scope_name</span><span class="p">)</span> <span class="k">as</span> <span class="n">scope</span><span class="p">:</span>
<span class="n">state</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">initial_state</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">inp</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">embedded_inputs</span><span class="p">):</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
<span class="n">scope</span><span class="p">.</span><span class="n">reuse_variables</span><span class="p">()</span>
<span class="n">output</span><span class="p">,</span> <span class="n">last_state</span> <span class="o">=</span> <span class="n">cell</span><span class="p">(</span><span class="n">inp</span><span class="p">,</span> <span class="n">state</span><span class="p">)</span>
<span class="n">outputs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">output</span><span class="p">)</span>
<span class="k">return</span> <span class="n">outputs</span><span class="p">,</span> <span class="n">last_state</span></code></pre></figure>
<p>Having done this, we can now convert the output from the LSTM into a 5-dimensional output.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">final_layer</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">outputs</span><span class="p">,</span> <span class="n">output_w</span><span class="p">,</span> <span class="n">output_b</span><span class="p">):</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">name_scope</span><span class="p">(</span><span class="s">"Final_layer"</span><span class="p">):</span>
<span class="c1"># Apply the linear layer. Output would be a
</span> <span class="c1"># tensor of shape 1 x output_size
</span> <span class="n">output</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">outputs</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_units</span><span class="p">])</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">xw_plus_b</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="n">output_w</span><span class="p">,</span> <span class="n">output_b</span><span class="p">)</span>
<span class="k">return</span> <span class="n">output</span></code></pre></figure>
<p>We will use this output to define the distribution from which we will sample a predicted position $y_t$, namely a 2D Gaussian distribution with mean $\mu = [\mu_x, \mu_y]$, standard deviation $\sigma = [\sigma_x, \sigma_y]$ and correlation $\rho$, similar to the approach described in [8].</p>
<p><img src="https://latex.codecogs.com/svg.latex?\{x_t,%20y_t\}%20\in%20R%20\times%20R%20\times%20\{0,%201\}" alt="equation" /></p>
<p>Please note that these $x_t$ and $y_t$ are different from the $(x,y)$ coordinates. Here, $x_t$ and $y_t$ denote the input and target of the proposed model at time $t$. We denote the output of the proposed model as $\hat{y_t}$, namely:</p>
<p><img src="https://latex.codecogs.com/svg.latex?\hat{y_t}%20=%20\big{(}\{\mu_t,%20\sigma_t,%20\rho_t\}\big{)}%20=%20b_y%20+%20\sum^{N}_{n=1}W_{h^ny}h^n_t" alt="equation" /></p>
<p>In this case, $b_y$ is the associated bias, $W$ are the parameters of the last feedforward layer and $h$ are the hidden outputs from the LSTM. We ensure that the outputs representing the standard deviations are always positive by passing them through an exponential function, and that the correlation term is scaled between -1 and 1 using $\tanh$.</p>
<p><img src="https://latex.codecogs.com/svg.latex?\mu_t%20=%20\hat{\mu}_t%20\implies%20\mu_t%20\in%20R%20\\%20%20%20%20%20\sigma_t%20=%20exp\big{(}\hat{\sigma}_t%20\big{)}%20\implies%20\sigma_t%20%3E%200%20\\%20%20%20%20%20\rho_t%20=%20tanh(\hat{\rho}_t)%20\implies%20\rho_t%20\in%20(-1,%201)" alt="equation" /></p>
<p>We can then define the probability $p(x_{t+1}\vert y_t)$, conditioned on the previous target $y_t$, as:</p>
<p>$p(x_{t+1} \vert y_t) = N(x_{t+1} \vert \mu_t, \sigma_t, \rho_t)$, for $N(x \vert \mu, \sigma, \rho) = {1\over{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}}exp\big{[}{-Z\over{2(1-\rho^2)}}\big{]}$, with</p>
<p><img src="https://latex.codecogs.com/svg.latex?Z%20=%20{(x_1%20-%20\mu_1)^2\over{\sigma_1^2}}%20+%20{(x_2%20-%20\mu_2)^2\over{\sigma_2^2}}%20-%20{2\rho(x_1-\mu_1)(x_2-\mu_2)\over{\sigma_1\sigma_2}}" alt="equation" />.</p>
<p>With this we obtain a loss function which is exact up to a constant that depends only on the quantisation of the data and does not affect the training of the network.</p>
<p>$\mathcal{L} = \sum^T_{t=1}-log\big{(}N(x_{t+1} \vert \mu_t, \sigma_t, \rho_t)\big{)}$</p>
<p>Further, we can extract the partial derivatives for the five components and obtain:</p>
<p><img src="https://tdavchev.github.io/files/derivation.png" alt="Derivation" /></p>
<p>In code, we can implement the associated loss function with the following two functions:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">get_lossfunc</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">z_mux</span><span class="p">,</span> <span class="n">z_muy</span><span class="p">,</span> <span class="n">z_sx</span><span class="p">,</span> <span class="n">z_sy</span><span class="p">,</span> <span class="n">z_corr</span><span class="p">,</span> <span class="n">x_data</span><span class="p">,</span> <span class="n">y_data</span><span class="p">):</span>
<span class="c1"># Calculate the PDF of the data w.r.t to the distribution
</span> <span class="n">result0</span> <span class="o">=</span> \
<span class="n">distributions</span><span class="p">.</span><span class="n">tf_2d_normal</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">x_data</span><span class="p">,</span> <span class="n">y_data</span><span class="p">,</span> <span class="n">z_mux</span><span class="p">,</span> <span class="n">z_muy</span><span class="p">,</span> <span class="n">z_sx</span><span class="p">,</span> <span class="n">z_sy</span><span class="p">,</span> <span class="n">z_corr</span><span class="p">)</span>
<span class="c1"># For numerical stability purposes as in Vemula (2018)
</span> <span class="n">epsilon</span> <span class="o">=</span> <span class="mf">1e-20</span>
<span class="c1"># Numerical stability
</span> <span class="n">result1</span> <span class="o">=</span> <span class="o">-</span><span class="n">tf</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">result0</span><span class="p">,</span> <span class="n">epsilon</span><span class="p">))</span>
<span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">reduce_sum</span><span class="p">(</span><span class="n">result1</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">tf_2d_normal</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">mux</span><span class="p">,</span> <span class="n">muy</span><span class="p">,</span> <span class="n">sx</span><span class="p">,</span> <span class="n">sy</span><span class="p">,</span> <span class="n">rho</span><span class="p">):</span>
<span class="s">'''
Function that computes a multivariate Gaussian
Equation taken from 24 & 25 in Graves (2013)
'''</span>
<span class="k">with</span> <span class="n">g</span><span class="p">.</span><span class="n">as_default</span><span class="p">():</span>
<span class="c1"># Calculate (x-mux) and (y-muy)
</span> <span class="n">normx</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">subtract</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mux</span><span class="p">)</span>
<span class="n">normy</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">subtract</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">muy</span><span class="p">)</span>
<span class="c1"># Calculate sx*sy
</span> <span class="n">sxsy</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">sx</span><span class="p">,</span> <span class="n">sy</span><span class="p">)</span>
<span class="c1"># Calculate the exponential factor
</span> <span class="n">z</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">divide</span><span class="p">(</span><span class="n">normx</span><span class="p">,</span> <span class="n">sx</span><span class="p">))</span> <span class="o">+</span> \
<span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">divide</span><span class="p">(</span><span class="n">normy</span><span class="p">,</span> <span class="n">sy</span><span class="p">))</span> <span class="o">-</span> \
<span class="mi">2</span><span class="o">*</span><span class="n">tf</span><span class="p">.</span><span class="n">divide</span><span class="p">(</span>
<span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">rho</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">normx</span><span class="p">,</span> <span class="n">normy</span><span class="p">)),</span>
<span class="n">sxsy</span><span class="p">)</span>
<span class="n">negatedRho</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">rho</span><span class="p">)</span>
<span class="c1"># Numerator
</span> <span class="n">result</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">divide</span><span class="p">(</span><span class="o">-</span><span class="n">z</span><span class="p">,</span> <span class="mi">2</span><span class="o">*</span><span class="n">negatedRho</span><span class="p">))</span>
<span class="c1"># Normalization constant
</span> <span class="n">denominator</span> <span class="o">=</span> \
<span class="mi">2</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">pi</span> <span class="o">*</span> <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">sxsy</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">negatedRho</span><span class="p">))</span>
<span class="c1"># Final PDF calculation
</span> <span class="n">result</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">divide</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">denominator</span><span class="p">)</span>
<span class="k">return</span> <span class="n">result</span></code></pre></figure>
<p>Conveniently, Tensorflow computes the derivatives automatically, so we do not need to implement this part ourselves. All that is left is to choose the optimisation routine.</p>
<p>As previously mentioned, we will use L2 regularisation since we want to encourage a single optimal solution (namely the targeted positions $y_t$). This constrains the potential to overfit during training and helps generalisation at little computational cost. In addition, we clip our gradients (by global norm, to 10) to avoid problems associated with exploding gradients. Finally, we will use <a href="https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf">RMS-Prop</a>, an unpublished optimisation algorithm known for normalising by a second moment: a running average of the squared gradients. As an alternative, we could use Adam [9], which normalises the gradients using first and second moments and also corrects for their bias during training. Empirically, however, we found RMS-Prop to work better for this task. We hypothesise that this could be because we are interested in exploiting some of the biases of human motion, as we do not aim to generalise to agents other than human beings. You can find out more about different optimisation algorithms in <a href="https://ruder.io/optimizing-gradient-descent/">S. Ruder’s blog post</a>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">mode</span> <span class="o">!=</span> <span class="n">tf</span><span class="p">.</span><span class="n">contrib</span><span class="p">.</span><span class="n">learn</span><span class="p">.</span><span class="n">ModeKeys</span><span class="p">.</span><span class="n">INFER</span><span class="p">:</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">name_scope</span><span class="p">(</span><span class="s">"Optimization"</span><span class="p">):</span>
<span class="n">lossfunc</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_lossfunc</span><span class="p">(</span><span class="n">o_mux</span><span class="p">,</span> <span class="n">o_muy</span><span class="p">,</span> <span class="n">o_sx</span><span class="p">,</span> <span class="n">o_sy</span><span class="p">,</span> <span class="n">o_corr</span><span class="p">,</span> <span class="n">x_data</span><span class="p">,</span> <span class="n">y_data</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">cost</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">div</span><span class="p">(</span><span class="n">lossfunc</span><span class="p">,</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">batch_size</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">sequence_length</span><span class="p">))</span>
<span class="n">trainable_params</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">trainable_variables</span><span class="p">()</span>
<span class="c1"># apply L2 regularisation
</span> <span class="n">l2</span> <span class="o">=</span> <span class="mf">0.05</span> <span class="o">*</span> <span class="nb">sum</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">l2_loss</span><span class="p">(</span><span class="n">t_param</span><span class="p">)</span> <span class="k">for</span> <span class="n">t_param</span> <span class="ow">in</span> <span class="n">trainable_params</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">cost</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">cost</span> <span class="o">+</span> <span class="n">l2</span>
<span class="n">tf</span><span class="p">.</span><span class="n">summary</span><span class="p">.</span><span class="n">scalar</span><span class="p">(</span><span class="s">'cost'</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">cost</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">gradients</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">gradients</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cost</span><span class="p">,</span> <span class="n">trainable_params</span><span class="p">)</span>
<span class="n">grads</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">clip_by_global_norm</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">gradients</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">grad_clip</span>
<span class="c1"># Adam might also do a good job as in Graves (2013)
</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">train</span><span class="p">.</span><span class="n">RMSPropOptimizer</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">lr</span><span class="p">)</span>
<span class="c1"># Train operator
</span> <span class="bp">self</span><span class="p">.</span><span class="n">train_op</span> <span class="o">=</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">apply_gradients</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">grads</span><span class="p">,</span> <span class="n">trainable_params</span><span class="p">))</span>
<span class="bp">self</span><span class="p">.</span><span class="n">init</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">global_variables_initializer</span><span class="p">()</span></code></pre></figure>
<p>Finally, we can define the entire model as shown in the section below.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">reset_graph</span><span class="p">()</span>
<span class="n">lstm</span> <span class="o">=</span> <span class="n">BasicLSTM</span><span class="p">(</span><span class="n">batch_size</span><span class="o">=</span><span class="n">BATCH_SIZE</span><span class="p">,</span>
<span class="n">sequence_length</span><span class="o">=</span><span class="n">SEQUENCE_LENGTH</span><span class="p">,</span>
<span class="n">num_units</span><span class="o">=</span><span class="n">NUM_UNITS</span><span class="p">,</span>
<span class="n">embedding_size</span><span class="o">=</span><span class="n">EMBEDDING</span><span class="p">,</span>
<span class="n">learning_rate</span><span class="o">=</span><span class="n">LR</span><span class="p">,</span>
<span class="n">grad_clip</span><span class="o">=</span><span class="n">GRAD_CLIP</span><span class="p">,</span>
<span class="n">mode</span><span class="o">=</span><span class="n">MODE</span><span class="p">)</span></code></pre></figure> Todor Davchev t.davchev@gmail.com In this blog’s experiments we will utilise the mentioned in previous posts (x,y) coordinate representations as input to the network. Since each of these coordinate representations is associated with a specific agent who will interact with each other, it is important to separate the associated sequences and acknowledge that each prediction will be dependent on the previous sequences observed for a given agent. Processing Trajectory Data for Sequence Generation 2019-12-08T00:00:00+00:00 2019-12-08T00:00:00+00:00 https://tdavchev.github.io/posts/2019/12/blog-post-2 <p>Before considering the details around modelling such tasks, we should spend some time to consider the datasets we will use as well as the preprocessing routines we will consider.</p>
<h1 id="datasets-used">Datasets Used</h1>
<p>In this post, we consider four different datasets, namely ETH University and ETH Hotel [5], and Zara1 and Zara2 [7]. Photos of the first two can be seen below.</p>
<h2 id="data-processing">Data Processing</h2>
<p>In these examples we are interested in figuring out the exact pixel location of each individual pedestrian (agent), as well as the associated frames we consider. All four datasets give us annotated positions but differ slightly in representation. Thus, as a first step we ensure they are aligned.</p>
<p>This is common when datasets have been built by different groups and projects and have slight misalignments. All four datasets were recorded at 25 Hz and consist on average of 3000 frames. The ETH datasets are comprised of 750 agents each, while Zara has two scenes, each with 786 agents. All videos include people walking on their own, as well as pedestrians moving in groups in a nonlinear manner. However, some of the videos annotate trajectories in millimetres in a world reference frame, while others record them in pixel coordinates with (0, 0) at the centre of each video frame. We will process all of them, ensuring all positions are represented as pixel positions with (0, 0) placed in the bottom left corner. We will further normalise them between 0 and 1 so that the size of the image or the walkway considered does not bias our solution.</p>
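<p>The min-max normalisation described in the previous paragraph can be sketched as follows (a hypothetical helper for illustration; <code>stats</code> holds the recorded minima and maxima per axis):</p>

```python
import numpy as np

def normalise(points, stats):
    """Min-max normalise (x, y) pixel positions to [0, 1].

    points: array of shape (N, 2); stats: [[x_min, x_max], [y_min, y_max]].
    """
    (x_min, x_max), (y_min, y_max) = stats
    out = np.empty_like(points, dtype=float)
    out[:, 0] = (points[:, 0] - x_min) / (x_max - x_min)
    out[:, 1] = (points[:, 1] - y_min) / (y_max - y_min)
    return out

traj = np.array([[10.0, 20.0], [110.0, 220.0]])
norm = normalise(traj, [[10.0, 110.0], [20.0, 220.0]])
# norm is [[0., 0.], [1., 1.]]: extremes map to the ends of the unit interval
```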
<h2 id="processing-examples">Processing examples</h2>
<p>To simplify this post we will only walk through transforming one of the four datasets. The aim is to clarify how such processing is achieved. The processing of the rest is similar and can be found in the <a href="https://github.com/yadrimz/Stochastic-Futures-Prediction">GitHub repository of this post</a>. This part, however, is not necessary to understand the details around the actual model.</p>
<p><b>ETH Hotel</b> is comprised of positions in a world reference frame, which we want to convert to the local, pixel reference frame. To do this, we are given the required homography.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">world_to_image_frame</span><span class="p">(</span><span class="n">loc</span><span class="p">,</span> <span class="n">Hinv</span><span class="p">):</span>
<span class="s">"""
Given H^-1 and (x, y, z) in world coordinates, returns (u, v, 1) in image
frame coordinates.
"""</span>
<span class="n">loc</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">Hinv</span><span class="p">,</span> <span class="n">loc</span><span class="p">)</span> <span class="c1"># to camera frame
</span> <span class="k">return</span> <span class="n">loc</span><span class="o">/</span><span class="n">loc</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="c1"># to pixels (from millimeters)</span></code></pre></figure>
<p>Those interested in the mathematics behind this conversion can read more about it in <a href="https://www.inf.ed.ac.uk/teaching/courses/cg/lectures/cg3_2013.pdf">Taku Komura’s lecture slides</a>. Further, we normalise the data using the minimum and maximum recorded values, which results in the following method.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">mil_to_pixels</span><span class="p">(</span><span class="n">directory</span><span class="o">=</span><span class="p">[</span><span class="s">"./data/ewap_dataset/seq_hotel"</span><span class="p">]):</span>
<span class="s">'''
Preprocess the frames from the datasets.
Convert values to pixel locations from millimeters
    obtain and store all frame data: the actually used frames (as some are skipped),
    the ids of all pedestrians present at each of those frames, and the sufficient statistics.
'''</span>
<span class="k">def</span> <span class="nf">collect_stats</span><span class="p">(</span><span class="n">agents</span><span class="p">):</span>
<span class="n">x_pos</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">y_pos</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">agent_id</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">agents</span><span class="p">)):</span>
<span class="n">trajectory</span> <span class="o">=</span> <span class="p">[[]</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)]</span>
<span class="n">traj</span> <span class="o">=</span> <span class="n">agents</span><span class="p">[</span><span class="n">agent_id</span><span class="p">]</span>
<span class="k">for</span> <span class="n">step</span> <span class="ow">in</span> <span class="n">traj</span><span class="p">:</span>
<span class="n">x_pos</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">step</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">y_pos</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">step</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="n">x_pos</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">x_pos</span><span class="p">)</span>
<span class="n">y_pos</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">y_pos</span><span class="p">)</span>
<span class="c1"># takes the average over all points through all agents
</span> <span class="k">return</span> <span class="p">[[</span><span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">x_pos</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">x_pos</span><span class="p">)],</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">y_pos</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">y_pos</span><span class="p">)]]</span>
<span class="n">Hfile</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s">"H.txt"</span><span class="p">)</span>
<span class="n">obsfile</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s">"obsmat.txt"</span><span class="p">)</span>
<span class="c1"># Parse homography matrix.
</span> <span class="n">H</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="n">Hfile</span><span class="p">)</span>
<span class="n">Hinv</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">H</span><span class="p">)</span>
<span class="c1"># Parse pedestrian annotations.
</span> <span class="n">frames</span><span class="p">,</span> <span class="n">pedsInFrame</span><span class="p">,</span> <span class="n">agents</span> <span class="o">=</span> <span class="n">parse_annotations</span><span class="p">(</span><span class="n">Hinv</span><span class="p">,</span> <span class="n">obsfile</span><span class="p">)</span>
<span class="c1"># collect mean and std
</span> <span class="n">statistics</span> <span class="o">=</span> <span class="n">collect_stats</span><span class="p">(</span><span class="n">agents</span><span class="p">)</span>
<span class="n">norm_agents</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># collect the id, normalised x and normalised y of each agent's position
</span> <span class="n">pedsWithPos</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">agent</span> <span class="ow">in</span> <span class="n">agents</span><span class="p">:</span>
<span class="n">norm_traj</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">step</span> <span class="ow">in</span> <span class="n">agent</span><span class="p">:</span>
<span class="n">_x</span> <span class="o">=</span> <span class="p">(</span><span class="n">step</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">statistics</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span> <span class="o">/</span> <span class="p">(</span><span class="n">statistics</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">statistics</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span>
<span class="n">_y</span> <span class="o">=</span> <span class="p">(</span><span class="n">step</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">-</span> <span class="n">statistics</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span> <span class="o">/</span> <span class="p">(</span><span class="n">statistics</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">statistics</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span>
<span class="n">norm_traj</span><span class="p">.</span><span class="n">append</span><span class="p">([</span><span class="nb">int</span><span class="p">(</span><span class="n">frames</span><span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">step</span><span class="p">[</span><span class="mi">0</span><span class="p">])]),</span> <span class="n">_x</span><span class="p">,</span> <span class="n">_y</span><span class="p">])</span>
<span class="n">norm_agents</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">norm_traj</span><span class="p">))</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">norm_agents</span><span class="p">),</span> <span class="n">statistics</span><span class="p">,</span> <span class="n">pedsInFrame</span></code></pre></figure>
<p>Lines 8 to 20 find the minimum and maximum values of the agents’ x and y positions. The “obsmat.txt” file contains the annotated data: each row comprises the frame number, the pedestrian id, the position along the x axis in the world frame, the positions along the y and z axes, and the three associated velocities. More information can be found in the README.txt file within the dataset directory. In this post we only use the frame number, the pedestrian’s id and their x and y positions.</p>
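<p>To make the column layout concrete, here is a minimal sketch of extracting just the columns we use. It assumes the common “obsmat” ordering (frame, id, pos_x, pos_z, pos_y, v_x, v_z, v_y), which is also why the parsing code later reads columns 2 and 4, but do check the README.txt of your copy of the dataset; the helper name is illustrative:</p>

```python
import numpy as np

def keep_used_columns(mat):
    # mat rows: [frame, ped_id, pos_x, pos_z, pos_y, v_x, v_z, v_y]
    # keep only the frame, pedestrian id and ground-plane x/y positions
    mat = np.asarray(mat, dtype=float)
    return mat[:, [0, 1, 2, 4]]

# two toy annotation rows in the assumed layout
toy_obsmat = [[0.0, 1.0, 2.0, 0.0, 3.0, 0.1, 0.0, 0.2],
              [0.0, 2.0, 4.0, 0.0, 5.0, 0.0, 0.0, 0.1]]
used = keep_used_columns(toy_obsmat)
```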
<p>Lines 34-41 group the per-frame annotations by pedestrian id, preserving the ordering of each pedestrian’s positions across frames.</p>
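<p>The normalisation above maps each coordinate into [0, 1] using the collected min and max statistics. At inference and visualisation time, predictions must be mapped back to image coordinates, so the inverse transform is also needed. A minimal sketch of both directions (the helper names are illustrative, not from the repository):</p>

```python
import numpy as np

def normalise(xy, stats):
    # stats = [[x_min, x_max], [y_min, y_max]] as returned by collect_stats
    (x0, x1), (y0, y1) = stats
    return np.array([(xy[0] - x0) / (x1 - x0), (xy[1] - y0) / (y1 - y0)])

def denormalise(xy, stats):
    # inverse map, needed to plot predictions in image coordinates
    (x0, x1), (y0, y1) = stats
    return np.array([xy[0] * (x1 - x0) + x0, xy[1] * (y1 - y0) + y0])

stats = [[0.0, 640.0], [0.0, 480.0]]
p = normalise([320.0, 120.0], stats)
q = denormalise(p, stats)
```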
<p>Line 29 calls parse_annotations(), which parses the collected data and converts it to the image reference frame.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">parse_annotations</span><span class="p">(</span><span class="n">Hinv</span><span class="p">,</span> <span class="n">obsmat_txt</span><span class="p">):</span>
<span class="s">'''
Parse the dataset and convert to image frames data.
'''</span>
<span class="n">mat</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="n">obsmat_txt</span><span class="p">)</span>
<span class="n">num_peds</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">mat</span><span class="p">[:,</span><span class="mi">1</span><span class="p">]))</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">peds</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([]).</span><span class="n">reshape</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">4</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_peds</span><span class="p">)]</span> <span class="c1"># maps ped ID -> (t,x,y,z) path
</span>
<span class="n">num_frames</span> <span class="o">=</span> <span class="p">(</span><span class="n">mat</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="s">"int"</span><span class="p">)</span>
<span class="n">num_unique_frames</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="n">mat</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]).</span><span class="n">size</span>
<span class="n">recorded_frames</span> <span class="o">=</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">num_unique_frames</span> <span class="c1"># maps timestep -> (first) frame
</span> <span class="n">peds_in_frame</span> <span class="o">=</span> <span class="p">[[]</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_unique_frames</span><span class="p">)]</span> <span class="c1"># maps timestep -> ped IDs
</span>
<span class="n">frame</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">time</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">mat</span><span class="p">:</span>
<span class="k">if</span> <span class="n">row</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">!=</span> <span class="n">frame</span><span class="p">:</span>
<span class="n">frame</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">time</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="n">recorded_frames</span><span class="p">[</span><span class="n">time</span><span class="p">]</span> <span class="o">=</span> <span class="n">frame</span>
<span class="n">ped</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">peds_in_frame</span><span class="p">[</span><span class="n">time</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">ped</span><span class="p">)</span>
<span class="n">loc</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">row</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="mi">4</span><span class="p">],</span> <span class="mi">1</span><span class="p">])</span>
            <span class="n">loc</span> <span class="o">=</span> <span class="n">to_image_frame</span><span class="p">(</span><span class="n">Hinv</span><span class="p">,</span> <span class="n">loc</span><span class="p">)</span>
<span class="n">loc</span> <span class="o">=</span> <span class="p">[</span><span class="n">time</span><span class="p">,</span> <span class="n">loc</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">loc</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">loc</span><span class="p">[</span><span class="mi">2</span><span class="p">]]</span>
<span class="n">peds</span><span class="p">[</span><span class="n">ped</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vstack</span><span class="p">((</span><span class="n">peds</span><span class="p">[</span><span class="n">ped</span><span class="p">],</span> <span class="n">loc</span><span class="p">))</span>
<span class="k">return</span> <span class="n">recorded_frames</span><span class="p">,</span> <span class="n">peds_in_frame</span><span class="p">,</span> <span class="n">peds</span></code></pre></figure>
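<p>The helper to_image_frame() is only called above. A plausible sketch of what it is assumed to do, applying the inverse homography to a homogeneous world point and renormalising the scale coordinate to land in pixel space, is shown below; the exact signature in the repository may differ:</p>

```python
import numpy as np

def to_image_frame(Hinv, loc):
    # loc is a homogeneous world-frame point [x, y, 1];
    # Hinv maps world coordinates back to pixel coordinates
    loc = np.dot(Hinv, loc)
    return loc / loc[2]  # normalise the homogeneous scale coordinate

# with an identity homography the world and image frames coincide
Hinv = np.eye(3)
pixel = to_image_frame(Hinv, np.array([12.0, 8.0, 1.0]))
```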
<p>We can combine the preprocessing of all datasets in a single function. Ideally, we save the preprocessed data once and load it directly each time we need it. We can then call a function that loads the preprocessed data in the required format. It is useful to split the trajectories into sequences of a pre-chosen length; given a batch size, we can then easily compute the number of batches.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">BATCH_SIZE</span> <span class="o">=</span> <span class="mi">50</span>
<span class="n">SEQUENCE_LENGTH</span> <span class="o">=</span> <span class="mi">8</span>
<span class="n">agents_data</span><span class="p">,</span> <span class="n">dicto</span><span class="p">,</span> <span class="n">dataset_indices</span> <span class="o">=</span> \
<span class="n">data_tools</span><span class="p">.</span><span class="n">preprocess</span><span class="p">(</span><span class="n">training_directories</span><span class="p">)</span>
<span class="n">loaded_data</span><span class="p">,</span> <span class="n">num_batches</span> <span class="o">=</span> \
<span class="n">data_tools</span><span class="p">.</span><span class="n">load_preprocessed</span><span class="p">(</span><span class="n">agents_data</span><span class="p">,</span> <span class="n">BATCH_SIZE</span><span class="p">,</span> <span class="n">SEQUENCE_LENGTH</span><span class="p">)</span></code></pre></figure>
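<p>load_preprocessed() is not shown here, but the batch count it returns can be sketched as follows (count_batches is an illustrative name): a trajectory of length L yields int(L / (sequence_length + 1)) training sequences, where the extra step provides the one-step-shifted targets, and the total is divided by the batch size:</p>

```python
def count_batches(trajectories, batch_size, sequence_length):
    # a trajectory of length L yields L // (sequence_length + 1)
    # usable (source, target) sequences; targets are shifted by one step
    counter = sum(len(traj) // (sequence_length + 1) for traj in trajectories)
    return counter // batch_size

# three toy trajectories of lengths 18, 9 and 27 with sequence length 8
toy_trajectories = [[0] * 18, [0] * 9, [0] * 27]
n = count_batches(toy_trajectories, batch_size=2, sequence_length=8)
```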
<h2 id="batching">Batching</h2>
<p>After obtaining and pre-processing all the information, we need to implement a routine for sampling random batches that ensures each sample is comprised of unbroken sequences. One thing to keep in mind is that some of the sampled trajectories might be shorter than the given sequence length and others might be longer. In the former case we want to avoid such trajectories, while in the latter we want to split them into multiple samples. We do this by defining a function called “next_batch()” that takes as input the associated data, a pointer indicating the trajectory currently being considered, the required batch size, the desired sequence length and a flag specifying whether we are sampling at inference or training time.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">next_batch</span><span class="p">(</span><span class="n">_data</span><span class="p">,</span> <span class="n">pointer</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">sequence_length</span><span class="p">,</span> <span class="n">infer</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
<span class="s">'''
Function to get the next batch of points
'''</span>
<span class="c1"># List of source and target data for the current batch
</span> <span class="n">x_batch</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">y_batch</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># For each sequence in the batch
</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">batch_size</span><span class="p">):</span>
<span class="c1"># Extract the trajectory of the pedestrian pointed out by pointer
</span> <span class="n">traj</span> <span class="o">=</span> <span class="n">_data</span><span class="p">[</span><span class="n">pointer</span><span class="p">]</span>
<span class="c1"># Number of sequences corresponding to his trajectory
</span> <span class="n">n_batch</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">traj</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="p">(</span><span class="n">sequence_length</span><span class="o">+</span><span class="mi">1</span><span class="p">))</span>
<span class="c1"># Randomly sample an index from which his trajectory is to be considered
</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">infer</span><span class="p">:</span>
<span class="n">idx</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">traj</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">sequence_length</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span>
<span class="c1"># Append the trajectory from idx until sequence_length into source and target data
</span> <span class="n">x_batch</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">traj</span><span class="p">[</span><span class="n">idx</span><span class="p">:</span><span class="n">idx</span><span class="o">+</span><span class="n">sequence_length</span><span class="p">,</span> <span class="p">:]))</span>
<span class="n">y_batch</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">traj</span><span class="p">[</span><span class="n">idx</span><span class="o">+</span><span class="mi">1</span><span class="p">:</span><span class="n">idx</span><span class="o">+</span><span class="n">sequence_length</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]))</span>
<span class="k">if</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o"><</span> <span class="p">(</span><span class="mf">1.0</span><span class="o">/</span><span class="nb">float</span><span class="p">(</span><span class="n">n_batch</span><span class="p">)):</span>
<span class="c1"># Adjust sampling probability
</span> <span class="c1"># if this is a long datapoint, sample this data more with
</span> <span class="c1"># higher probability
</span> <span class="n">pointer</span> <span class="o">=</span> <span class="n">tick_batch_pointer</span><span class="p">(</span><span class="n">pointer</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">_data</span><span class="p">))</span>
<span class="k">return</span> <span class="n">x_batch</span><span class="p">,</span> <span class="n">y_batch</span><span class="p">,</span> <span class="n">pointer</span></code></pre></figure>
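<p>next_batch() relies on tick_batch_pointer(), which is not shown. A minimal sketch, assuming it simply advances to the next trajectory and wraps around once every trajectory has been visited:</p>

```python
def tick_batch_pointer(pointer, num_trajectories):
    # advance to the next trajectory, wrapping back to the start
    return (pointer + 1) % num_trajectories

p = tick_batch_pointer(0, 3)  # -> 1
q = tick_batch_pointer(2, 3)  # wraps around -> 0
```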
<p>Now that we have the required data processed and an associated batching function in place, we can focus on building the actual model.</p> Tutorial on Stochastic Trajectory Prediction 2019-12-01T00:00:00+00:00 https://tdavchev.github.io/posts/2019/12/blog-post-1 <p>This tutorial is meant as an introduction to the problem of trajectory generation. It introduces several ways of modelling the motion of agents in pixel space and several ways of preprocessing data. It follows the structure of its associated <a href="https://github.com/yadrimz/Stochastic-Futures-Prediction">GitHub repository</a>. Feel free to skip to the end to see the performance of two basic models.</p>
<h1 id="introduction">Introduction</h1>
<p>Neural solutions such as Long Short-Term Memory networks (LSTMs) [1] have become very common for modelling time-series signals in the past few years. Popular, well-established examples span many applications of AI, namely speech processing [2], language modelling [3] and translation [4], among many others. In particular, the idea behind LSTMs is that they are good at mimicking the hidden dynamics behind a given short sequence.</p>
<p>The overall structure of such networks (see the figure below, taken from <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Chris Olah’s</a> blog post, which is a great introduction to LSTMs along with <a href="https://www.deeplearningbook.org/contents/rnn.html">Chapter 10 of the book “Deep Learning”</a>) allows us to model each step of a given signal while taking into account all previously considered ones. In the context of text modelling, those signals can be all previously observed words in a given sentence, where each step consists of predicting one word by conditioning on all previously seen words. In this setting, LSTMs are good at encapsulating the context behind a given sentence, which makes them a good candidate for predicting its continuation.
<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png" alt="LSTM-Chris Olah" /></p>
<p>Similarly, we can model the dynamics associated with the motion of pedestrians in crowded scenes, the behaviour of agents in different games, moving cars on a busy road etc. In such cases, the task would be to predict where each considered agent will be after some time. This post is an introduction to the use of LSTMs in such problems.</p>
<p>We will focus on predicting the next few 2D positions (x, y) of pedestrians from annotated videos. We assume a good understanding of what LSTMs are and of the difficulties behind extracting meaningful feature representations, as well as basic computer science knowledge, algebra and calculus, and some prior experience with TensorFlow. We conclude with a brief discussion of the pros and cons of the uses of LSTMs introduced here.</p>
<p>Anticipating the position of pedestrians in a given video sequence is often a challenging task, even for humans. Motion is often dictated by unspoken rules that differ across cultures and are, in addition, interpreted differently by different people. For example, when walking in crowded scenes we might aim to avoid collisions with others, but we might also aim to reach someone specific and stop in front of them. In such cases it is relatively difficult for an observer to anticipate the goal of someone walking in the scene over the next 5-10 minutes just by watching, yet we can relatively easily tell where a pedestrian aims to be in the next 5-10 seconds. Regardless, we often maintain a few hypotheses of where a person will be, so modelling such motions using purely deterministic approaches (such as simply using standard LSTMs) can be very hard. An alternative is to train a neural network to model the sufficient statistics of the distribution the next step is sampled from.</p>
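<p>Concretely, instead of regressing the next (x, y) directly, the network can output the five sufficient statistics of a bivariate Gaussian, namely two means, two standard deviations and a correlation, and the next position is then sampled from that distribution. A minimal sketch of such a sampling step (the parameter names are illustrative):</p>

```python
import numpy as np

def sample_next_position(mux, muy, sx, sy, rho, rng=np.random):
    # build the 2x2 covariance matrix from the predicted statistics
    mean = [mux, muy]
    cov = [[sx * sx, rho * sx * sy],
           [rho * sx * sy, sy * sy]]
    x, y = rng.multivariate_normal(mean, cov)
    return x, y

# with near-zero spread the sample collapses onto the predicted mean
x, y = sample_next_position(0.5, 0.25, 1e-6, 1e-6, 0.0)
```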
<h2 id="initialization">Initialization</h2>
<p>We begin by installing TensorFlow and cloning the GitHub code described in this blog post. We then import everything necessary, including all datasets we will use for training, and assign the constant variables needed for training the network.</p>
<p><code class="language-plaintext highlighter-rouge">!pip install tensorflow==1.15.0</code></p>
<p><code class="language-plaintext highlighter-rouge">!git clone https://github.com/yadrimz/Stochastic-Futures-Prediction.git</code></p>
<h1 id="problem-formulation">Problem Formulation</h1>
<p>This work’s focus is on predicting the future motion of an arbitrary number of observed agents (i.e. their behaviour) whose action spaces and objectives are unknown. More specifically, we focus on predicting the two-dimensional motion of agents in video sequences.</p>
<p>We assume we are given a history of two-dimensional position annotations and video frames as a sequence of RGB images.
Each agent $a$ ($a \in [1 \ldots A]$, where $A$ is the maximum number of agents in the video) is represented by a state ($s_t^a$) which comprises xy-coordinates at time t, $s_t^a$ = $(x_t,y_t)_a$.</p>
<p>Given a sequence of $obs$ observed states $S = s_{t-obs}, s_{t-obs+1}, \ldots, s_{t-1}, s_{t}$, we will formulate the prediction as an optimisation process, where the objective is to learn a posterior distribution $P(Y \vert S)$, of multiple agents future trajectories $Y$. Here an individual agent’s future trajectory is defined as ($s_t^a = {s_{t+1}^a, s_{t+2}^a, \ldots, s_{t+pred}^a}$) for $pred$ steps ahead for every agent $a$ found in a given frame at time $t$.</p>
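<p>To make the formulation concrete, the sketch below splits a single agent’s trajectory into the $obs$ observed states $S$ and the $pred$ future states $Y$ (a single-agent, array-shape illustration; the actual pipeline operates on batches of agents):</p>

```python
import numpy as np

def split_trajectory(states, obs, pred):
    # states: (T, 2) array of xy-coordinates for one agent
    assert len(states) >= obs + pred
    S = states[:obs]             # observed states up to time t
    Y = states[obs:obs + pred]   # future trajectory to be predicted
    return S, Y

traj = np.arange(24, dtype=float).reshape(12, 2)  # 12 toy timesteps
S, Y = split_trajectory(traj, obs=8, pred=4)
```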
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="o">%</span><span class="n">tensorflow_version</span> <span class="mf">1.</span><span class="n">x</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">import</span> <span class="nn">utils.data_tools</span> <span class="k">as</span> <span class="n">data_tools</span>
<span class="kn">import</span> <span class="nn">utils.visualisation</span> <span class="k">as</span> <span class="n">visualisation</span>
<span class="kn">import</span> <span class="nn">utils.distributions</span> <span class="k">as</span> <span class="n">distributions</span>
<span class="kn">from</span> <span class="nn">models.lstm</span> <span class="kn">import</span> <span class="n">BasicLSTM</span>
<span class="kn">from</span> <span class="nn">models.lstm</span> <span class="kn">import</span> <span class="n">reset_graph</span></code></pre></figure>