This dataset/code forms part of Andrew Landels' thesis: "Improving proteomic methods and investigating H2 production in Synechocystis sp. PCC6803" http://etheses.whiterose.ac.uk/id/eprint/19034
The code presented here was written in R and uses the ggplot2 package to generate graphs. The data was generated on a Q Exactive mass spectrometer and analysed in MaxQuant using the standard data processing pipeline. Included with the code for this section is the output from the MaxQuant pipeline, namely the 'Evidence' files, which contain peptide information and raw/processed quantifications.
The code in this section is fairly detailed because of the experimental design, which is described in chapter 5 of the aforementioned thesis. The file names contain a series of switch identifiers (X\_X\_X, where X is either 1 or 0 and \_ is the delimiter used to split the file-name string with a regular expression). These trigger switches in the code and determine whether the dataset being analysed used iTRAQ or TMT tags, whether it used a flat or extended concentration range of the protein mix, and whether the extended concentration mix was run in the forward or reverse direction (the test proteins were flipped during the experiment to avoid protein-specific skew).
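The switch mechanism described above can be sketched as follows. This is a minimal illustration, not the thesis code: the helper name `parse_switches`, the example file name, and the mapping of each position and value to a setting (first switch for tag type, second for concentration range, third for direction, with 1 meaning TMT, extended and reverse) are all assumptions made for the example.

```r
# Hypothetical helper: extract three 0/1 switches from a file name.
parse_switches <- function(filename) {
  # Find the first run of three 0/1 digits separated by underscores.
  m <- regmatches(filename, regexpr("[01]_[01]_[01]", filename))
  if (length(m) == 0) stop("no switch pattern found in: ", filename)

  # Split on the underscore delimiter and convert to integers.
  flags <- as.integer(strsplit(m, "_", fixed = TRUE)[[1]])

  # ASSUMED order and meaning of the switches (not confirmed by the source).
  list(
    tmt      = flags[1] == 1,  # TRUE = TMT, FALSE = iTRAQ
    extended = flags[2] == 1,  # TRUE = extended range, FALSE = flat
    reversed = flags[3] == 1   # TRUE = reverse direction, FALSE = forward
  )
}

# Hypothetical file name, used only for illustration.
sw <- parse_switches("evidence_1_0_1.txt")
```

Returning a named list keeps the downstream branching readable (for example `if (sw$tmt) ...`) instead of indexing into a bare vector of flags.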
At the beginning of the code, a number of functions for applying different transformations to the data are defined:
Tags() generates a matrix of expected tag values for the spike-in proteins.
Corrected.new() converts all values into ratios between 0 and 1, each expressed as a fraction of the sum of its row.
Trim() removes the file extensions from the filenames, to enable correctly labelled graphs.
FlattenData() collects all peptides relating to the spike-in proteins from the data, and correctly arranges formatting to collapse them into a single matrix where they are aligned with the expected values based on the experimental design.
RemoveZeros() removes any rows containing 1 or more zeros.
PeptidesQuants() generates a table of spectral counts for all peptides in a sample and for just the spike-in proteins, each with and without 0-values removed. It also counts the number of proteins identified and the total number of fractions the sample was measured across (i.e. relating to the degree of LC separation).
ScalingMatrix() calculates, from the expected and observed quantifications in the data, the scalar that must be applied to the spike-in data to make it equal to the expected values. This enables iterative investigation of the spiked-in proteins, so that systematic operator error can be measured by comparing the same mix within the final experiment after it has been through successive steps (initial mixing, dilution, etc.).
ScaleData() applies the scaling matrix to the data in the manner described above.
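Three of the transformations listed above (zero removal, row-wise ratios, and scaling towards expected values) can be sketched on toy data as follows. The function names `remove_zero_rows` and `row_fractions`, the toy intensities, and the choice of a per-tag scalar computed as expected value divided by the observed column mean are assumptions for illustration; the actual ScalingMatrix() may compute its scalars differently.

```r
# Toy reporter-ion intensities: 3 peptides (rows) x 4 tags (columns).
obs <- matrix(c(10, 20, 30, 40,
                 5,  0, 15, 20,
                 2,  4,  6,  8),
              nrow = 3, byrow = TRUE)

# Drop any row containing one or more zeros (cf. RemoveZeros()).
remove_zero_rows <- function(m) m[rowSums(m == 0) == 0, , drop = FALSE]

# Express each value as a fraction of its row sum (cf. Corrected.new()).
# The row-sum vector is recycled down each column, so row i is divided
# by the sum of row i.
row_fractions <- function(m) m / rowSums(m)

clean <- remove_zero_rows(obs)
frac  <- row_fractions(clean)

# ASSUMED expected fractions for a flat mix: equal signal in every tag.
expected <- rep(0.25, 4)

# ASSUMED scaling rule: one scalar per tag, expected / mean observed.
scalars <- expected / colMeans(frac)

# Apply the per-tag scalars column-wise (cf. ScaleData()).
scaled <- sweep(frac, 2, scalars, `*`)
```

After scaling, the column means of `scaled` match `expected`, which is the property the scaling step is meant to enforce in this sketch.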
In the main body of the code, the names of the spike-in proteins are given first, then the summary table of counts is produced. After this, a series of plots is generated, showing the data after each successive manipulation. Finally, the individual plots that make up the final plot from the previous section (where all transformations have been applied to the data) are produced, in both linear space and log space.
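As a rough illustration of plotting the same expected-versus-observed data in linear and log space with ggplot2, the following sketch may help. The data frame, its column names and the protein labels are invented for the example and do not come from the dataset.

```r
library(ggplot2)

# Toy expected vs. observed values for two hypothetical spike-in proteins.
df <- data.frame(
  expected = rep(c(1, 2, 5, 10), times = 2),
  observed = c(1.1, 1.9, 4.6, 8.7, 0.9, 2.2, 5.3, 9.4),
  protein  = rep(c("protein_A", "protein_B"), each = 4)
)

# Linear-space plot; the dashed line marks observed == expected.
p_lin <- ggplot(df, aes(x = expected, y = observed, colour = protein)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")

# Same plot in log space: only the axis scales change.
p_log <- p_lin + scale_x_log10() + scale_y_log10()
```

Building the log-space plot by adding scales to the linear one keeps the two views in step, so any change to the data or aesthetics appears in both.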
In the post-section of the code, a number of variations on the plots are produced. These highlight other features that were explored during the data processing in order to produce simpler graphics, but were ultimately not included in the thesis.
**Please note that due to an upload error, dataset 7 is currently not available in this repository. If you need a copy of this file before I'm able to upload it, please contact me on andrewlandels at gmail dot com.**
Funding
EU FP7 308518
Ethics
There is no personal data, nor any data that requires ethical approval.
Policy
The data complies with the institution and funders' policies on access and sharing