Create a simulation on the basis of a real recording

Something that came up at the Edinburgh workshop which caught my interest – the possibility of creating a ground-truth simulation on the basis of a real recording. That is, rather than trying to set simulation parameters so that they closely match a real experiment, perhaps we can just use the real experiment as input.

One possibility for how this could be done:

Step 1. Start with a real recording as input. Subtract off the large spikes in the input recording, leaving only the background signal (noise and signal from distant neurons). One way to do this de-spiking is to detect spikes and, for each one, subtract off something like the average of its K nearest-neighbor spike waveforms.
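Here is a minimal sketch of what step 1 could look like, assuming the recording is already a (channels x samples) numpy array and spike peak times have been detected with some simple threshold detector (the function name, window sizes and similarity measure below are just placeholders):

import numpy as np

def despike(traces, spike_times, n_before=30, n_after=60, k=10):
    # traces: (n_channels, n_samples) array; spike_times: detected peak sample indices
    # k: number of most-similar spike waveforms to average before subtracting
    traces = traces.astype(float).copy()
    spike_times = [t for t in spike_times
                   if n_before <= t <= traces.shape[1] - n_after]
    # collect a snippet around each detected spike
    snippets = np.stack([traces[:, t - n_before:t + n_after] for t in spike_times])
    for i, t in enumerate(spike_times):
        # find the k snippets most similar to this one (naive Euclidean distance)
        dists = np.linalg.norm(snippets - snippets[i], axis=(1, 2))
        template = snippets[np.argsort(dists)[:k]].mean(axis=0)
        # subtract the averaged waveform, leaving background + distant-neuron signal
        traces[:, t - n_before:t + n_after] -= template
    return traces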

Step 2. Insert spikes at random times (a hybrid approach). One could use the average waveforms obtained in step 1… but we'd need to be very careful not to use a particular spike sorting approach here, only something more primitive, because it's important not to bias the simulation toward a particular sorting method.

The second step would require specifying things like the number of true units and their firing rates.
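A rough sketch of step 2, just to make the idea concrete (Poisson spike times, with templates and firing rates supplied by the user; the naming here is hypothetical, not an existing API):

import numpy as np

def insert_spikes(background, templates, firing_rates_hz, fs, seed=0):
    # background: (n_channels, n_samples) de-spiked traces from step 1
    # templates: list of (n_channels, n_waveform_samples) primitive average waveforms
    # firing_rates_hz: one rate per "true" unit
    rng = np.random.default_rng(seed)
    rec = background.copy()
    duration = background.shape[1] / fs
    sorting_true = {}
    for unit, (template, rate) in enumerate(zip(templates, firing_rates_hz)):
        n_spikes = rng.poisson(rate * duration)
        times = np.sort(rng.integers(0, background.shape[1] - template.shape[1], n_spikes))
        for t in times:
            rec[:, t:t + template.shape[1]] += template
        sorting_true[unit] = times  # ground-truth spike times (in samples)
    return rec, sorting_true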

In the end I’d love to just do something like

recording_sim, sorting_true_sim = create_synth_recording_based_on(real_recording, ... some_params)

(utilizing SpikeExtractors of course)
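For the return values, a sketch of how the hybrid traces and ground-truth spike trains from the two steps above could be wrapped with spikeextractors' in-memory extractors (assuming rec, fs and sorting_true come from the sketches above):

import numpy as np
import spikeextractors as se

# wrap the hybrid traces in an in-memory recording extractor
recording_sim = se.NumpyRecordingExtractor(timeseries=rec, sampling_frequency=fs)

# build the ground-truth sorting from the per-unit spike times
sorting_true_sim = se.NumpySortingExtractor()
sorting_true_sim.set_sampling_frequency(fs)
all_times = np.concatenate([times for times in sorting_true.values()])
all_labels = np.concatenate([[unit] * len(times) for unit, times in sorting_true.items()])
sorting_true_sim.set_times_labels(times=all_times, labels=all_labels)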

@magland a very similar idea came out discussing with Hernan Rey at the workshop.

I think another way to go about it would be to simulate clean spiking activity with a simulator, let's say we (randomly) choose MEArec, and add noise from real experiments.
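Something along these lines, maybe. I'm assuming here that 'noise_level' sits under the 'recordings' section of the MEArec params, that templates have already been generated with mr.gen_templates, and that real_background is the de-spiked real trace from the step above, trimmed/oriented to match the simulated array:

import numpy as np
import MEArec as mr

# generate a noiseless MEArec recording (noise_level = 0), then add real background
params = mr.get_default_recordings_params()
params['recordings']['noise_level'] = 0
recgen = mr.gen_recordings(params=params, templates='templates.h5')

clean = np.asarray(recgen.recordings)   # simulated, noise-free traces
# real_background: de-spiked real traces, reshaped/trimmed to clean.shape beforehand
hybrid = clean + real_background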

We think there are 2 problems to be solved:

  • how to properly cut out the peaks without affecting the spectrum? This is a non-trivial one…

  • generalize the noise model: instead of concatenating the same noise over and over again, maybe it would be smarter to build an ARMA model or similar to capture its dynamics

What do you think?

We wouldn't need to use the same noise over and over (that would be problematic). The duration of the simulated recording could simply match the duration of the input recording.

Yes, cutting peaks properly is tricky, but I think doable.

I have been thinking about this, especially about how to repair the signal if the segments containing spikes are set to NaN. I haven't finished going through the literature, but maybe an algorithm for restoring missing audio clips (a typical problem in that field) could work.
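In the meantime, a crude placeholder for the idea: linearly interpolate across the NaN stretches on each channel (a proper audio-inpainting or AR-extrapolation method would preserve the spectrum much better):

import numpy as np

def fill_nan_segments(trace):
    # trace: one channel with the spike segments set to NaN
    trace = trace.copy()
    nans = np.isnan(trace)
    idx = np.arange(len(trace))
    # linear interpolation across the gaps; only a stand-in for a real inpainting method
    trace[nans] = np.interp(idx[nans], idx[~nans], trace[~nans])
    return trace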


@ferchaure I think those algorithms might use some kind of ARMA process.

@magland if you fit one of these models to probe-specific noise, then you could save a very compressed representation of the data and generate new realizations of the noise of any duration! I think that would be promising. Of course, one can check how well the distribution, frequency spectrum, etc. of these statistical models match the original recordings.
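As a minimal sketch of the idea (a per-channel AR(p) model, i.e. the moving-average-free special case of ARMA, fitted with Yule-Walker; a full ARMA or multivariate model would be needed to also capture cross-channel structure):

import numpy as np
from statsmodels.regression.linear_model import yule_walker

def fit_and_resample_noise(noise_trace, order=8, n_samples=None, seed=0):
    # fit an AR(p) model to one channel of de-spiked noise; the "compressed
    # representation" is just the p coefficients plus the innovation std
    rho, sigma = yule_walker(noise_trace, order=order)
    n_samples = n_samples or len(noise_trace)
    rng = np.random.default_rng(seed)
    out = np.zeros(n_samples + order)
    for t in range(order, n_samples + order):
        # rho[0] multiplies lag 1, rho[1] lag 2, ...
        out[t] = rho @ out[t - order:t][::-1] + rng.normal(scale=sigma)
    return out[order:]   # a new noise realization of any desired duration

One can then compare the spectrum and amplitude distribution of the resampled noise with the original recording, as suggested above.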


Do you think ARMA processes can capture the phenomenon of superimposed signals from distant neurons?

Yes, I believe so. I think they can capture the spatial correlations in the different frequency bands 🙂

I’ve been thinking about this, as MEArec can sort of simulate these “extra” neurons, right?

The issue is that I think every single neuron should be included in the ground truth, even the ones that make up the "noise" we're talking about here. It is quite instructive to see the results for HYBRID_JANELIA on SpikeForest, where many neurons are simulated, including ones with SNR close to zero.

So I think the best model is a full ground-truth simulation with many, many neurons. It would be interesting to compare the noise profile generated by such a simulation with real data, to see what is missing (probe effects, other current sources not part of the model, etc.).

Edit: focusing on getting the power spectrum right may not be sufficient, as there will be significant higher-order statistics in the data caused by neurons firing together, effects of morphology and so on. We have a good model, ViSAPy, but it cannot handle many neurons. Can we make a generative model that captures higher-order properties better?

Regarding a model with 'many, many neurons': we have simulations of Traub's thalamocortical model, where we modified the scripts to track the trans-membrane currents, added morphologies, and placed the cells at the density of a barrel cortical column. Previously, we have used these datasets as ground truth for CSD estimation, among other things. Anticipating the utility of these simulations in other cases, we curated them into HDF5 files.

These datasets, and their utility as we saw it, are presented in a paper here. There are some associated scripts, provided here, which can easily be extended to include fancier electrode configurations, conductivity profiles, noise filters, different cell densities, etc.

I would be very happy to contribute if you think this could be useful as ground truth in your test bench.

@mhhennig Yes, in MEArec you can also simulate thousands of neurons distributed around a probe. I think that, additionally, a noise floor should be added to model the electrode noise. I think a multivariate model, maybe non-linear, would capture the higher-order stats as well. The problem with the ViSAPy model is that it's extremely computationally heavy.

Hi @ccluri!

I think it would be very interesting to use your model as a benchmark as well. I was looking at the datasets the other day. How are the morphologies stored? Do you have the location of each neuronal segment for all cells?

@magland I think it would be great to include this data in SpikeForest.

@ccluri could you run a simulation on a Neuropixels probe? You can use the MEAutility package to get the channel locations. This snippet of code loads a Neuropixels-like probe with 128 channels (centered at 0, 0, 0):

import MEAutility as mu

# load the probe definition shipped with MEAutility and read the electrode locations
neuropixels = mu.return_mea('Neuropixels-128')
channel_positions = neuropixels.positions  # (n_channels, 3) array of x, y, z positions

@ccluri, @alejoe91, I’d be happy to include such a simulation in SpikeForest. Shall we follow up by email on the logistics of it?

Hello, sorry for my delayed reply.

@alejoe91 Indeed, the morphologies of individual neurons are stored in the datasets as well. These can be played with: moved around, rotated, etc.

Please find the scripts for extracting extracellular potentials from these simulations in the github repo, on the spikeforest branch.

At the moment this script places the Neuropixels probe with an offset relative to the cylindrical axis of the cortical column (the thalamus is ignored). The simulations correspond to pulsedstimulus.h5 from the above-mentioned datasets, which can be swapped for any of the others there. The conductivity is assumed to be 0.3 S/m in a homogeneous medium, with ideal electrodes and sources at the mid-points of the individual compartments. The probe is transparent: there is no shadowing effect, and it can see compartments behind its facing direction. All of these can be tweaked. At the moment the extracellular potentials for this particular simulation are saved as an .npy array (num_ele x num_time_pts). Indeed, you can include high-pass filters or save these in neo-compatible formats, like here.
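For reference, the forward model described here (point sources at the compartment mid-points, homogeneous conductivity of 0.3 S/m, ideal transparent electrodes) boils down to something like the following sketch (not the actual script; mind that current and distance units must be consistent):

import numpy as np

def extracellular_potentials(i_mem, src_pos, ele_pos, sigma=0.3):
    # i_mem:   (num_sources, num_time_pts) trans-membrane currents
    # src_pos: (num_sources, 3) compartment mid-points
    # ele_pos: (num_ele, 3) electrode positions (ideal, transparent probe)
    # returns: (num_ele, num_time_pts) potentials, V = sum_n I_n / (4 * pi * sigma * r_n)
    r = np.linalg.norm(ele_pos[:, None, :] - src_pos[None, :, :], axis=-1)
    weights = 1.0 / (4 * np.pi * sigma * r)
    return weights @ i_mem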

There is a fix for this, and I will yet again shamelessly plug the simulation data format which we developed and which can ease this immensely. You run the simulations once, saving the morphologies and the trans-membrane currents in a standardized way. Then you can use the scripts mentioned in my previous post to get the potentials. You can then 'move around' the cell positions and electrodes as per your requirements, tweak the electrode properties, etc.

I am also aware that there is an equivalent simulation data format being developed in Gaute's group, specifically catering to LFPy. I do not know the latest status of this. Perhaps they would also be interested in mitigating your issues with ViSAPy.

Sure. My work email is chaitanya.chintaluri(at)cncb.ox.ac.uk

@ccluri we have a similar approach for simulation using MEArec: first we simulate templates, and then we assemble them into recordings.

I was referring to the noise model from ViSAPy being quite heavy, as it tries to estimate the coherence between all electrodes across frequencies.