diff --git a/README.md b/README.md
index 21f9ee0..9bb89ba 100644
--- a/README.md
+++ b/README.md
@@ -3,13 +3,13 @@
Finds complexes from Blue-Native and SEC Fractionation analyzed by Liquid Chromatogrpahy coupled to Mass Spectrometry. In
- principal it works with any separation technique resulting in co-elution signal profiles. To avoid licence issues and accumulation of old database files, please first download the database of choce (see below *Download Protein-Protein Interaction Data*).
+ principle it works with any separation technique resulting in co-elution signal profiles. To avoid licence issues and accumulation of old database files, please first download the database of choice (see below *Download Protein-Protein Interaction Data*).
## Next Feature (Testing) Implementations (05.2021)
We list here several features that we will implement in the next versions. If checked, they are already available but might be still experimental.
-- [ ] Extend plotting capabibilties to extract profiles of features and complex
+- [ ] Extend plotting capabilities to extract profiles of features and complexes
```python
#plotting selected feature's profile
@@ -33,34 +33,34 @@ ComplexFinder(analysisName="").plotComplexProf
## Workflow
-For thousdands of features (peptides/protein) a signal was measured over different fractions. The applied technique separates protein clusters from each other. This package aims for different things:
+For thousands of features (peptides/protein) a signal was measured over different fractions. The applied technique separates protein clusters from each other. This package aims for different things:
* signal processing including filtering, smoothing.
* if more than one replicate is analysed, the profiles over fractions will be aligned.
* identification of protein-protein interactions.
-* identification of protein cluster using diminesional reduction and density based clustering.
+* identification of protein clusters using dimensional reduction and density based clustering.

-Importantly, ComplexFinder can also be utized to analyse the data without prior knowledge of protein connectivitiy (e.g. positive database). In this case, there are two options:
+Importantly, ComplexFinder can also be utilized to analyse the data without prior knowledge of protein connectivity (e.g. positive database). In this case, there are two options:
* using raw profile signal intensities
* distance between profile pairs
-which are then subjected for dimensional reduction and HDBSCAN clusering. Importantly, when using the raw profile intensities, the derived UMAP representation is aligned using the top N correlated features between samples (e.g. same protein across all samples).
+which are then subjected to dimensional reduction and HDBSCAN clustering. Importantly, when using the raw profile intensities, the derived UMAP representation is aligned using the top N correlated features between samples (e.g. same protein across all samples).
-As a next step, we want to identify clusters of proteins with predicted interaction. To this end, we are using the interaction probabiliy matrix obtained by the
+As a next step, we want to identify clusters of proteins with predicted interaction. To this end, we are using the interaction probability matrix obtained by the
random forest classifier. We then apply the UMAP embedding calculton and apply HDBSCAN clustering. Again, we
-are using the CORUM database to quantify the the clustering result using the v-measure. Both techniques, UMAP and HDBSCAN are performed
+are using the CORUM database to quantify the clustering result using the v-measure. Both techniques, UMAP and HDBSCAN are performed
using a paramter grid to cycle through different options and find the best clustering.

- In cases of uisng the raw signal intensity or the distance metrics, those data are subjected to dimensional reduction (UMAP) and clustering (HDBSCAN). Noteworthy, other clusering algorithmns are available and can be utilized. HDBSCAN is however the default.
+ In cases of using the raw signal intensity or the distance metrics, those data are subjected to dimensional reduction (UMAP) and clustering (HDBSCAN). Notably, other clustering algorithms are available and can be utilized; HDBSCAN is, however, the default.
## Depositing Data analyzed using ComplexFinder
-If you analyzed your data using ComplexFinder, we highly recommend to upload the data along the raw fiiles deposition at mass spectrometry repisatories such as PRIDE / ProteomeXChange or similiar. Especially, the params.json file which is written to the results folder is of particular interest in order to reproduce the data analysis. Of note, if you upload the complete result folder, other users will be able to analyse these data using the plotting utilities of ComplexFinder.
+If you analyzed your data using ComplexFinder, we highly recommend uploading the results along with the raw files deposited at mass spectrometry repositories such as PRIDE / ProteomeXchange or similar. In particular, the params.json file, which is written to the results folder, is of particular interest for reproducing the data analysis. Of note, if you upload the complete result folder, other users will be able to analyse these data using the plotting utilities of ComplexFinder.
## Installation
@@ -87,14 +87,14 @@ pip3 install -r requirements.txt
## Usage Example
-Upon downlaod and extraction of the package. You can find example data in the example-data folder.
-To run the anaylsis, you can enter:
+Upon download and extraction of the package, you can find example data in the example-data folder.
+To run the analysis, you can enter:
```python
from .src.main import ComplexFinder
X = pd.read_table("./example-data/SILAC_01.txt", sep = "\t") #loading tab delimited txt file.
ComplexFinder(analysisName = "ExampleRun_01").run(X)
```
-You can also pass a folder path to run. This will yield in the anaylsis of each txt file in the folder.
+You can also pass a folder path to run. This will result in the analysis of each txt file in the folder.
```python
import os
@@ -130,7 +130,7 @@ ComplexFinder(
### Complex Portal
-Go the [Complex Portal Website](https://www.ebi.ac.uk/complexportal/home) and download the database (save it as HUMAN_COMPLEX_PORTAL.txt) for the utilized organismn.
+Go to the [Complex Portal Website](https://www.ebi.ac.uk/complexportal/home) and download the database (save it as HUMAN_COMPLEX_PORTAL.txt) for the utilized organism.
```python
@@ -146,7 +146,7 @@ ComplexFinder(
### hu.Map 2.0
-The hu.MAP 2.0 has recently beend published and is available at this [link](http://humap2.proteincomplexes.org).
+The hu.MAP 2.0 has recently been published and is available at this [link](http://humap2.proteincomplexes.org).
```python
ComplexFinder(
@@ -162,7 +162,7 @@ ComplexFinder(
The grouping parameter in ComplexFinder is used to group files, which is used to group replicates together.
Assume, that we have 4 files, 2 KO and 2 WT files which we put together in the folder "./data".
-The grouping will be used to calculate pariwise statistics between fitted peaks. Moreover, complex prediction and protein-protein prediction summary.
+The grouping will be used to calculate pairwise statistics between fitted peaks. Moreover, it is used for the complex prediction and protein-protein prediction summary.
```python
pathToFiles = os.path.join(".","data")
ComplexFinder(
@@ -192,7 +192,7 @@ Find below an overview about the extensive output of ComplexFinder.
-Please respect the respective liscence for the different databases.
+Please respect the respective license for the different databases.
## Parameters
@@ -200,7 +200,7 @@ Please respect the respective liscence for the different databases.
Find below parameters to set. The default is given in brackets after the parameter name.
* alignMethod = "RadiusNeighborsRegressor",
-* alignRuns = False, Alignment of runs is based on signal profiles that were found to have a single modelled peak. A refrence run is assign by correlation anaylsis and choosen based on a maximum R2 value. Then fraction-shifts per signal profile is calculated (must be in the window given by *alignWindow*). The fraction residuals are then modelled using the method provided in *alignMethod*. Model peak centers are then adjusted based on the regression results. Of note, the alignment is performed after peak-modelling and before distance calculations.
+* alignRuns = False, Alignment of runs is based on signal profiles that were found to have a single modelled peak. A reference run is assigned by correlation analysis and chosen based on a maximum R2 value. Then fraction-shifts per signal profile are calculated (must be in the window given by *alignWindow*). The fraction residuals are then modelled using the method provided in *alignMethod*. Model peak centers are then adjusted based on the regression results. Of note, the alignment is performed after peak-modelling and before distance calculations.
* alignWindow = 3, Number of fraction +/- single-peal profile are accepted for the run alignment.
* analysisMode = "label-free", #[label-free,SILAC,SILAC-TMT]
* analysisName = None,
@@ -212,40 +212,40 @@ Find below parameters to set. The default is given in brackets after the paramet
* databaseFilter = {'Organism': ["Human"]}, Filter dict used to find relevant complexes from database. By default, the corum database is filtered based on the column 'Organism' using 'Mouse' as a search string. If no filtering is required, pass an empty dict {}.
* databaseIDColumn = "subunits(UniProt IDs)",
* databaseFileName = "20190823_CORUM.txt",
-* databaseHasComplexAnnotations = True, Indicates if the provided database does contain complex annotations. If you have a database with only pairwise interactions, this setting should be *False*. Clusters are identified by dimensional reduction and density based clustering (HDBSCAN). In order to alter UMAP and HDBSCAN settings use the kewywords *hdbscanDefaultKwargs* and *umapDefaultKwargs*.
+* databaseHasComplexAnnotations = True, Indicates if the provided database does contain complex annotations. If you have a database with only pairwise interactions, this setting should be *False*. Clusters are identified by dimensional reduction and density based clustering (HDBSCAN). In order to alter UMAP and HDBSCAN settings use the keywords *hdbscanDefaultKwargs* and *umapDefaultKwargs*.
* decoySizeFactor = 1.2,
* grouping = {"WT": ["D3_WT_04.txt","D3_WT_02.txt"],"KO":["D3_KO_01.txt","D3_KO_02.txt"]}, None or dict. Indicates which samples (file) belong to one group. Let's assume 4 files with the name 'KO_01.txt', 'KO_02.txt', 'WT_01.txt' and 'WT_02.txt' are being analysed. The grouping dict should like this : {"KO":[KO_01.txt','KO_02.txt'],"WT":['WT_01.txt','WT_02.txt']} in order to combine them for statistical testing (e.g. t-test of log2 transformed peak-AUCs). Note that when analysis multiple runs (e.g. grouping present) then calling ComplexFinder().run(X) - X must be a path to a folder containing the files.
* hdbscanDefaultKwargs = {"min_cluster_size":4,"min_samples":1},
* indexIsID = False,
* idColumn = "Uniprot ID",
* interactionProbabCutoff = 0.7, Cutoff for estimator probability. Interactions with probabilities below threshold will be removed.
-* kFold = 3, Cross validation of classifier optimiation.
+* kFold = 3, Cross validation of classifier optimization.
* maxPeaksPerSignal = 15, Number of peaks allowed for on signal profile.
* maxPeakCenterDifference = 1.8,
* metrices = ["apex","pearson","euclidean","p_pearson","max_location","umap-dist"], Metrices to access distance between two profiles. Can be either a list of strings and/or dict. In case of a list of dicts, each dict must contain the keywords: 'fn' and 'name' providing a callable function with 'fn' that returns a single floating number and takes two arrays as an input.
* metricesForPrediction = None,#["pearson","euclidean","apex"],
* metricQuantileCutoff = 0.90,
* minDistanceBetweenTwoPeaks = 3, distance in fractions (int) between two peaks. Setting this to a smaller number results in more peaks.
-* n_jobs = 12, number of workers to model peaks, to calculate distance pairs and to train and use the classifer.
+* n_jobs = 12, number of workers to model peaks, to calculate distance pairs and to train and use the classifier.
* noDatabaseForPredictions = False, If you want to use ComplexFinder without any database. Set this to *True*.
* normValueDict = {},
-* peakModel = "GaussianModel", which model should be used to model signal profiles. In principle all models from lmfit can be used. However, the initial parameters are only optimized for GaussianModel and LaurentzianModel. This might effect runtimes dramatically.
-* plotSignalProfiles = False, if True, each profile is plotted against the fractio along with the fitted models. If you are concerned about time, you might set this to False at the cost of losing visible asessment of the fit quality.
+* peakModel = "GaussianModel", which model should be used to model signal profiles. In principle all models from lmfit can be used. However, the initial parameters are only optimized for GaussianModel and LorentzianModel. This might affect runtimes dramatically.
+* plotSignalProfiles = False, if True, each profile is plotted against the fraction along with the fitted models. If you are concerned about time, you might set this to False at the cost of losing visual assessment of the fit quality.
* plotComplexProfiles = False,
* precision = 0.5, Precision to use to filter protein-protein interactions. If None, the filtering will be performed based on the parameter *interactionProbabCutoff*.
* r2Thresh = 0.85, R2 threshold to accept a model fit. Models below the threshold will be ignored.
* removeSingleDataPointPeaks = True,
-* restartAnalysis = False, bool. Set True if you want to restart the anaylsis from scratch. If the tmp folder exsists, items and dirs will be deleted first.
+* restartAnalysis = False, bool. Set True if you want to restart the analysis from scratch. If the tmp folder exists, items and dirs will be deleted first.
* retrainClassifier = False, if the trainedClassifier.sav file is found, the classifier is loaded and the training is skipped. If you change the classifierGridSearch, you should set this to True. This will ensure that the classifier training is never skipped.
* recalculateDistance = False,
* runName = None,
* rollingWinType = "triang", the win type used for calculating the rolling metric. If None, all points are evenly weighted. Can be any string of scipy.signal window function.
(https://docs.scipy.org/doc/scipy/reference/signal.windows.html#module-scipy.signal.windows)
-* savePeakModels = True *depracted. always True and will be removed in the next version*.
+* savePeakModels = True *deprecated. always True and will be removed in the next version*.
* scaleRawDataBeforeDimensionalReduction = True, If raw data should be used (*useRawDataForDimensionalReduction*) enable this if you want to scale them. Scaling will be performed that values of each row are scaled between zero and one.
-* smoothSignal = True, Enable/disable smoothing. Defaults to True. A moving average of at least 3 adjacent datapoints is calculated using pandas rolling function. Effects the analysis time as well as the nmaximal number of peaks detected.
+* smoothSignal = True, Enable/disable smoothing. Defaults to True. A moving average of at least 3 adjacent datapoints is calculated using pandas rolling function. Affects the analysis time as well as the maximal number of peaks detected.
* smoothWindow = 2,
-* topNCorrFeaturesForUMAPAlignment = 200, Using top N features to to align UMAP Embeddings. The features are ranked by using Pearson correlation coefficient,
+* topNCorrFeaturesForUMAPAlignment = 200, Using top N features to align UMAP Embeddings. The features are ranked by using Pearson correlation coefficient,
* useRawDataForDimensionalReduction = False, Setting this to true, will force the pipeline to use the raw values for dimensional reduction. Distance calculations are not automatically turned off and the output is generated but they are not used.
* umapDefaultKwargs = {"min_dist":0.0000001,"n_neighbors":3,"n_components":2},
* quantFiles = [] list of str.
@@ -260,16 +260,16 @@ RF_GRID_SEARCH = {
}
```
-Sklearn library is used for predictions. Please check the comprehensive [documention](https://scikit-learn.org/stable/user_guide.html) for more details and for construction of a grid search dict.
+Sklearn library is used for predictions. Please check the comprehensive [documentation](https://scikit-learn.org/stable/user_guide.html) for more details and for construction of a grid search dict.
# Database Quality
For the prediction of protein-protein interactions the quality and size of the database is of importance.
-As a quick test, we performed predictions using 2000 randomly selected features of dataset D1 and siwtched the class labels (interactor vs non-interactor) of the database to train the classifier. We observed that the number of predicted protein-protein interaction was strongly reduced in after label switch of more than 5% of the features. We have used the CORUM human database for interactions. This highlights that the complexes in the database need to describe the complexome in the measured dataset accurately. The gold-standard is therefore the usage of a complex database that were experimentally validated, which is sadly often not possible due to the workload.
+As a quick test, we performed predictions using 2000 randomly selected features of dataset D1 and switched the class labels (interactor vs non-interactor) of the database to train the classifier. We observed that the number of predicted protein-protein interactions was strongly reduced after a label switch of more than 5% of the features. We have used the CORUM human database for interactions. This highlights that the complexes in the database need to describe the complexome in the measured dataset accurately. The gold-standard is therefore the usage of a complex database that was experimentally validated, which is sadly often not possible due to the workload.
-# Usin SILAC - TMT peak centric quantifiaction
+# Using SILAC - TMT peak centric quantification
*in preparation*
@@ -280,14 +280,14 @@ ComplexFinder allows peak centric quantification using different quantification
TMT allows for multiplexing in complexome experiments by labeling peptides with different tags that can be distinguished by different reporter ions using LC-MS/MS. Therefore the result (for example from a MaxQuant analysis) that is required are:
* ProteinGroups.txt -> Feature IDs (protein IDs) versus iBAQ intensity in columns. This file is the base file to extract the signal profiles and on which the peak modelling will be performed. Alternatively, you can also sum all the TMT intensities.
-* ProteinGroups.txt -> Feature IDs (protein IDs) versus the TMT Intensities per channel. If you performed a 10-plex TMT analysis, this would result in Protein ID + (fraction x 10 (TMT channels)) columns. The TMT intensties should be next to each other for each fraction, please see the figure below. TMT01_fraction_01, TMT02_fraction_01 ... TMT10_fraction_01, TMT01_fraction_02. It is advisable to put a leading zero in the MaxQuant experiment name to get the correct order straight away (otherwise you may run into such an order: 1,10,11,12,...,2,21)
+* ProteinGroups.txt -> Feature IDs (protein IDs) versus the TMT Intensities per channel. If you performed a 10-plex TMT analysis, this would result in Protein ID + (fraction x 10 (TMT channels)) columns. The TMT intensities should be next to each other for each fraction, please see the figure below. TMT01_fraction_01, TMT02_fraction_01 ... TMT10_fraction_01, TMT01_fraction_02. It is advisable to put a leading zero in the MaxQuant experiment name to get the correct order straight away (otherwise you may run into such an order: 1,10,11,12,...,2,21)
-For each peak in the samples, ComplexFinder will extract the TMT intensities and will aggreagte the fraction covered by the FWHM using a given function. By default the sum is used but can be changed to the mean as well (*TMTPoolMethod = "sum"). The data in the quantification files (feature IDs x TMT Intensities for each fraction) are not transformed at all. Therfore, if you use the mean, performing log2 transformation before averageing might be advisable. You can do this by setting the paramter *transformQuantDataBy = "log2"*. The available options are ["log2","ln",None]. None being the default which will use the provided values.
+For each peak in the samples, ComplexFinder will extract the TMT intensities and will aggregate the fraction covered by the FWHM using a given function. By default the sum is used but can be changed to the mean as well (*TMTPoolMethod = "sum"*). The data in the quantification files (feature IDs x TMT Intensities for each fraction) are not transformed at all. Therefore, if you use the mean, performing log2 transformation before averaging might be advisable. You can do this by setting the parameter *transformQuantDataBy = "log2"*. The available options are ["log2","ln",None]. None being the default which will use the provided values.
## SILAC-TMT
-A combination of SILAC and TMT allows either for extended mulitplexing (2 x SILAC Channel + 10plex TMT = 20 samples) or to follow an incoporation kinetic. To this end, cells are grown on SILAC media (for example heavy) for several passages leading to fully labelled cells. Then, the media is exchanged to light media and the cell start incoporating light amino acids into newly synthesized proteins. This enabled the determination of incorporation rates / turnover rates. When combining TMT and SILAC together, the light channel peptides + TMT represent the SILAC incoporation and heavy shows the break-down of proteins. In proliferating cells, the increase in biomass (cell growth) has to be considered.
+A combination of SILAC and TMT allows either for extended multiplexing (2 x SILAC Channel + 10plex TMT = 20 samples) or to follow an incorporation kinetic. To this end, cells are grown on SILAC media (for example heavy) for several passages leading to fully labelled cells. Then, the media is exchanged to light media and the cells start incorporating light amino acids into newly synthesized proteins. This enables the determination of incorporation rates / turnover rates. When combining TMT and SILAC together, the light channel peptides + TMT represent the SILAC incorporation and heavy shows the break-down of proteins. In proliferating cells, the increase in biomass (cell growth) has to be considered.
*Please note that at the moment only two SILAC channels are supported*.
@@ -300,11 +300,11 @@ ComplexFinder(allowSingleFractionQuant = True).run(...)
```

-*Figure. Quantification Stategy using TMT or SILAC-TMT experimental designs. In SILAC-TMT experimental designs, two quantification resutl files are required index by HEAVY and LIGHT.*
+*Figure. Quantification Strategy using TMT or SILAC-TMT experimental designs. In SILAC-TMT experimental designs, two quantification result files are required, indexed by HEAVY and LIGHT.*
We recommend to put the signal profiles in a folder (in the figure: myCoolAnalysis) and add the files. Create a new folder within myCoolAnalysis called 'q' in which you add the quantification data. If you put the quantification txt files in the same folder as the once for analysis, ComplexFinder will treat them as signal profiles and will try to fit model peaks to them etc.
-To calculated the fit parameter for a single order kinetic, we have to provide more information otherwise, the output will contain the TMT intensities for each peak indicated by *heavy* or *light*. ComplexFinder expects a raw TMT intensities (not log2) for
+To calculate the fit parameter for a single order kinetic, we have to provide more information; otherwise, the output will contain the TMT intensities for each peak indicated by *heavy* or *light*. ComplexFinder expects raw TMT intensities (not log2) for
@@ -389,7 +389,7 @@ def _addParams(self,modelParams,prefix,peakIdx,i):
## for the other parameters as well to get a nice fit.
```
-Please not that you also have to alter the functions *_getHeight* and *_getFWHM* for your peak models.
+Please note that you also have to alter the functions *_getHeight* and *_getFWHM* for your peak models.
You can check the equations [here](http://openafox.com/science/peak-function-derivations.html).
@@ -397,7 +397,7 @@ You can check the equations [here](http://openafox.com/science/peak-function-der
In the future, we would like to implement the following features:
-* Web application with an easy uster interface to proide easy access to the pipeline
+* Web application with an easy user interface to provide easy access to the pipeline
* Implement more classifiers.
* Test various peak models for better performance.
diff --git a/reference-data/Readme.md b/reference-data/Readme.md
index 0f5d685..4cf1223 100644
--- a/reference-data/Readme.md
+++ b/reference-data/Readme.md
@@ -24,7 +24,7 @@ ComplexFinder(
### Complex Portal
-Go the [Complex Portal Website](https://www.ebi.ac.uk/complexportal/home) and download the database (save it as HUMAN_COMPLEX_PORTAL.txt) for the utilized organismn.
+Go to the [Complex Portal Website](https://www.ebi.ac.uk/complexportal/home) and download the database (save it as HUMAN_COMPLEX_PORTAL.txt) for the utilized organism.
```python
@@ -40,7 +40,7 @@ ComplexFinder(
### hu.Map 2.0
-The hu.MAP 2.0 has recently beend published and is available at this [link](http://humap2.proteincomplexes.org).
+The hu.MAP 2.0 has recently been published and is available at this [link](http://humap2.proteincomplexes.org).
```python
ComplexFinder(
diff --git a/src/main.py b/src/main.py
index 435ad93..f3f86f1 100644
--- a/src/main.py
+++ b/src/main.py
@@ -117,70 +117,70 @@
class ComplexFinder(object):
def __init__(self,
- addImpurity = 0.0,
- alignMethod = "RadiusNeighborsRegressor",#"RadiusNeighborsRegressor",#"KNeighborsRegressor",#"LinearRegression", # RadiusNeighborsRegressor
- alignRuns = False,
- alignWindow = 3,
- allowSingleFractionQuant = False,
- analysisMode = "label-free", #[label-free,SILAC,SILAC-TMT]
- analysisName = None,
- binaryDatabase = False,
- classifierClass = "random_forest",
- classifierTestSize = 0.25,
- classiferGridSearch = RF_GRID_SEARCH,#STACKING_CLASSIFIER_GRID,#
- compTabFormat = False,
- considerOnlyInteractionsPresentInAllRuns = 2,
- correlationWindowSize = 5,
- databaseFilter = {'Organism': ["Human"]},#{'Organism': ["Human"]},#{"Confidence" : [1,2,3,4]} - for hu.map2.0,# {} for HUMAN_COMPLEX_PORTAL
- databaseIDColumn = "subunits(UniProt IDs)",
- databaseFileName = "20190823_CORUM.txt",#"humap2.txt
- databaseHasComplexAnnotations = True,
- databaseEntrySplitString = ";",
- decoySizeFactor = 1.2,
- grouping = {"WT": ["D3_WT_03.txt"]},
- hdbscanDefaultKwargs = {"min_cluster_size":4,"min_samples":1},
- indexIsID = False,
- idColumn = "Uniprot ID",
- interactionProbabCutoff = 0.7,
- justFitAndMatchPeaks = False,
- keepOnlySignalsValidInAllConditions = False,
- kFold = 3,
- maxPeaksPerSignal = 15,
- maxPeakCenterDifference = 1.8,
- metrices = ["apex","pearson","euclidean","cosine","max_location","rollingCorrelation"], #"umap-dist"
- metricesForPrediction = None,#["pearson","euclidean","apex"],
- metricQuantileCutoff = 0.001,
- minDistanceBetweenTwoPeaks = 3,
- minimumPPsPerFeature = 6,
- minPeakHeightOfMax = 0.05,
- n_jobs = 12,
- noDatabaseForPredictions = False,
- normValueDict = {},
- noDistanceCalculationAndPrediction = False,
- peakModel = "LorentzianModel",#"GaussianModel",#"SkewedGaussianModel",#"LorentzianModel",
- plotSignalProfiles = False,
- plotComplexProfiles = False,
- precision = 0.5,
- r2Thresh = 0.85,
- removeSingleDataPointPeaks = True,
- restartAnalysis = False,
- retrainClassifier = False,
- recalculateDistance = False,
- rollingWinType = None,
- runName = None,
- scaleRawDataBeforeDimensionalReduction = True,
- smoothSignal = True,
- smoothWindow = 2,
- takeRondomSampleFromData =False,
- topNCorrFeaturesForUMAPAlignment = 200,
- TMTPoolMethod = "sum",
- transformQuantDataBy = None,
- useRawDataForDimensionalReduction = False,
- useFWHMForQuant = True,
- umapDefaultKwargs = {"min_dist":0.001,"n_neighbors":5,"n_components":2,"random_state":120},
- quantFiles = [],
- usePeakCentricFeatures = False
- ):
+ addImpurity = 0.0,
+ alignMethod = "RadiusNeighborsRegressor", #"RadiusNeighborsRegressor",#"KNeighborsRegressor",#"LinearRegression", # RadiusNeighborsRegressor
+ alignRuns = False,
+ alignWindow = 3,
+ allowSingleFractionQuant = False,
+ analysisMode = "label-free", #[label-free,SILAC,SILAC-TMT]
+ analysisName = None,
+ binaryDatabase = False,
+ classifierClass = "random_forest",
+ classifierTestSize = 0.25,
+ classiferGridSearch = RF_GRID_SEARCH, #STACKING_CLASSIFIER_GRID,#
+ compTabFormat = False,
+ considerOnlyInteractionsPresentInAllRuns = 2,
+ correlationWindowSize = 5,
+ databaseFilter = {'Organism': ["Human"]}, #{'Organism': ["Human"]},#{"Confidence" : [1,2,3,4]} - for hu.map2.0,# {} for HUMAN_COMPLEX_PORTAL
+ databaseIDColumn = "subunits(UniProt IDs)",
+ databaseFileName = "20190823_CORUM.txt", #"humap2.txt
+ databaseHasComplexAnnotations = True,
+ databaseEntrySplitString = ";",
+ decoySizeFactor = 1.2,
+ grouping = {"WT": ["D3_WT_03.txt"]},
+ hdbscanDefaultKwargs = {"min_cluster_size":4,"min_samples":1},
+ indexIsID = False,
+ idColumn = "Uniprot ID",
+ interactionProbabCutoff = 0.7,
+ justFitAndMatchPeaks = False,
+ keepOnlySignalsValidInAllConditions = False,
+ kFold = 3,
+ maxPeaksPerSignal = 15,
+ maxPeakCenterDifference = 1.8,
+ metrices = ["apex","pearson","euclidean","cosine","max_location","rollingCorrelation"], #"umap-dist"
+ metricesForPrediction = None, #["pearson","euclidean","apex"],
+ metricQuantileCutoff = 0.001,
+ minDistanceBetweenTwoPeaks = 3,
+ minimumPPsPerFeature = 6,
+ minPeakHeightOfMax = 0.05,
+ n_jobs = 12,
+ noDatabaseForPredictions = False,
+ normValueDict = {},
+ noDistanceCalculationAndPrediction = False,
+ peakModel = "LorentzianModel", #"GaussianModel",#"SkewedGaussianModel",#"LorentzianModel",
+ plotSignalProfiles = False,
+ plotComplexProfiles = False,
+ precision = 0.5,
+ r2Thresh = 0.85,
+ removeSingleDataPointPeaks = True,
+ restartAnalysis = False,
+ retrainClassifier = False,
+ recalculateDistance = False,
+ rollingWinType = None,
+ runName = None,
+ scaleRawDataBeforeDimensionalReduction = True,
+ smoothSignal = True,
+ smoothWindow = 2,
+ takeRandomSampleFromData = False,
+ topNCorrFeaturesForUMAPAlignment = 200,
+ TMTPoolMethod = "sum",
+ transformQuantDataBy = None,
+ useRawDataForDimensionalReduction = False,
+ useFWHMForQuant = True,
+ umapDefaultKwargs = {"min_dist":0.001,"n_neighbors":5,"n_components":2,"random_state":120},
+ quantFiles = [],
+ usePeakCentricFeatures = False
+ ):
"""
Init ComplexFinder Class
@@ -192,8 +192,8 @@ def __init__(self,
* alignRuns = False,
Alignment of runs is based on signal profiles that were found to have
- a single modelled peak. A refrence run is assign by correlation anaylsis
- and choosen based on a maximum R2 value. Then fraction-shifts per signal
+ a single modelled peak. A reference run is assigned by correlation analysis
+ and chosen based on a maximum R2 value. Then fraction-shifts per signal
profile is calculated (must be in the window given by *alignWindow*).
The fraction residuals are then modelled using the method provided in
*alignMethod*. Model peak centers are then adjusted based on the regression results.
@@ -223,7 +223,7 @@ def __init__(self,
True indicates that the data are in the CompBat data format which was recently introduced.
In contrast to standard txt files generated by for example MaxQuant. It contains multiple
headers. More information can be found here https://www3.cmbi.umcn.nl/cedar/browse/comptab
- ComplexFinder will try to identifiy the samples and fractions and create separeted txt files.
+ ComplexFinder will try to identify the samples and fractions and create separate txt files.
* considerOnlyInteractionsPresentInAllRuns = 2,
@@ -311,7 +311,7 @@ def __init__(self,
* peakModel = "GaussianModel",
Indicates which model should be used to model signal profiles. In principle all models from lmfit can be used.
However, the initial parameters are only optimized for GaussianModel and LaurentzianModel.
- This might effect runtimes dramatically.
+ This might affect runtimes dramatically.
* plotSignalProfiles = False,
If True, each profile is plotted against the fractio along with the fitted models.
@@ -407,7 +407,7 @@ def __init__(self,
"maxPeakCenterDifference" : maxPeakCenterDifference,
"classiferGridSearch" : classiferGridSearch,
"plotSignalProfiles" : plotSignalProfiles,
- "savePeakModels" : True, #must be true to process pipeline, depracted, remove from class arguments.
+ "savePeakModels" : True, #must be true to process pipeline, deprecated, remove from class arguments.
"removeSingleDataPointPeaks" : removeSingleDataPointPeaks,
"grouping" : grouping,
"analysisMode" : analysisMode,
@@ -439,7 +439,7 @@ def __init__(self,
"quantFiles" : quantFiles,
"compTabFormat" : compTabFormat,
"correlationWindowSize" : correlationWindowSize,
- "takeRondomSampleFromData" : takeRondomSampleFromData,
+ "takeRandomSampleFromData" : takeRandomSampleFromData,
"minPeakHeightOfMax" : minPeakHeightOfMax,
"justFitAndMatchPeaks" : justFitAndMatchPeaks,
"keepOnlySignalsValidInAllConditions" : keepOnlySignalsValidInAllConditions,
@@ -484,7 +484,7 @@ def _addMetricesToDB(self,analysisName):
def _addMetricToStats(self,metricName, value):
"""
Adds a metric to the stats data frame.
- Does not check if metric is represent, if present,
+ Does not check if the metric is present; if present,
it will just overwrite.
Parameters
@@ -534,8 +534,8 @@ def _attachQuantificationDetails(self, combinedPeakModels = None):
"""
if self.params["analysisMode"] == "label-free":
if len(self.params["quantFiles"]) != 0:
- print("Warning :: Quant files have been specified but anaylsis mode is label-free. Please define SILAC or TMT or SILAC-TMT")
- print("Info :: Label-free mode selected. No additation quantification performed..")
+ print("Warning :: Quant files have been specified but analysis mode is label-free. Please define SILAC or TMT or SILAC-TMT")
+ print("Info :: Label-free mode selected. No additional quantification performed..")
return
if len(self.params["quantFiles"]) > 0:
@@ -556,7 +556,7 @@ def _attachQuantificationDetails(self, combinedPeakModels = None):
print(k.split("HEAVY_",maxsplit=1))
initFilesFound = [k for k in self.params["quantFiles"].keys() if k.split("HEAVY_",maxsplit=1)[-1] in files or k.split("LIGHT_",maxsplit=1)[-1] in files]
- print("Info :: For the following files and correpsonding co-elution profile data was detected")
+ print("Info :: For the following files and corresponding co-elution profile data was detected")
print(initFilesFound)
print("Warning :: other files will be ignored.")
@@ -567,12 +567,12 @@ def _attachQuantificationDetails(self, combinedPeakModels = None):
print("combining Peaks!!")
if combinedPeakModels is None:
- ## load combined peak reuslts
+ ## load combined peak results
txtOutput = os.path.join(self.params["pathToComb"],"CombinedPeakModelResults.txt")
if os.path.exists(txtOutput):
combinedPeakModels = pd.read_csv(txtOutput,sep="\t")
else:
- print("Warning :: Combined peak model reuslts not found. Deleted? Skipping peak centric quantification.")
+ print("Warning :: Combined peak model results not found. Deleted? Skipping peak centric quantification.")
return
@@ -639,12 +639,12 @@ def _attachQuantificationDetails(self, combinedPeakModels = None):
elif self.params["analysisMode"] == "TMT":
print("Info :: Peak centric quantification using TMT :: extracting sum from TMT reporters using file {}".format(self.params["quantFiles"][k]))
- print("Info :: Detecting reporter channles..")
+ print("Info :: Detecting reporter channels..")
nFractions = self.Xs[k].shape[1]
nTMTs = quantData.shape[1] / nFractions
print("Info :: {} reporter channels detected and {} fractions.".format(nTMTs,nFractions))
if nTMTs != int(nTMTs):
- print("Warning :: Could not detect the number of TMT reporter channles. Please check columns in quantFiles to have nTMTx x fractions columns")
+ print("Warning :: Could not detect the number of TMT reporter channels. Please check columns in quantFiles to have nTMTx x fractions columns")
continue
nTMTs = int(nTMTs)
@@ -662,12 +662,12 @@ def _attachQuantificationDetails(self, combinedPeakModels = None):
elif self.params["analysisMode"] == "SILAC-TMT":
print("Info :: Extracting quantification details from SILAC-TMT data.")
- print("Info :: Detecting reporter channles..")
+ print("Info :: Detecting reporter channels..")
nFractions = self.Xs[k].shape[1]
nTMTs = quantData.shape[1] / nFractions
print("Info :: {} reporter channels detected and {} fractions.".format(nTMTs,nFractions))
if nTMTs != int(nTMTs):
- print("Warning :: Could not detect the number of TMT reporter channles. Please check columns in quantFiles to have nTMTx x fractions columns")
+ print("Warning :: Could not detect the number of TMT reporter channels. Please check columns in quantFiles to have nTMTx x fractions columns")
continue
nTMTs = int(nTMTs)
@@ -710,14 +710,14 @@ def _checkParameterInput(self):
Raises
-------
- ValueErrors if datatype if given parameters do not match.
+ ValueError if the datatypes of the given parameters do not match.
"""
- #check anaylsis mode
+ #check analysis mode
validModes = ["label-free","SILAC","SILAC-TMT","TMT"]
if self.params["analysisMode"] not in validModes:
- raise ValueError("Parmaeter analysis mode is not valid. Must be one of: {}".format(validModes))
+ raise ValueError("Parameter analysis mode is not valid. Must be one of: {}".format(validModes))
elif self.params["analysisMode"] != "label-free" and len(self.params["quantFiles"]) == 0:
raise ValueError("Length 'quantFiles must be at least 1 if the analysis mode is not set to 'label-free'.")
@@ -770,7 +770,7 @@ def _checkParameterInput(self):
def _chunkPrediction(self,pathToChunk,classifier,nMetrices,probCutoff):
"""
- Predicts for each chunk the proability for positive interactions.
+ Predicts for each chunk the probability for positive interactions.
Parameters
----------
@@ -838,10 +838,10 @@ def _load(self, X):
self.X = self.X.set_index(self.params["idColumn"])
self.X = self.X.astype(np.float32)
else:
- self.X = self.X.loc[self.X.index.drop_duplicates()] #remove duplicaates
+ self.X = self.X.loc[self.X.index.drop_duplicates()] #remove duplicates
self.X = self.X.astype(np.float32) #set dtype to 32 to save memory
- if self.params["takeRondomSampleFromData"] != False and self.params["takeRondomSampleFromData"] > 50:
- self.X = self.X.sample(self.params["takeRondomSampleFromData"])
+ if self.params["takeRandomSampleFromData"] != False and self.params["takeRandomSampleFromData"] > 50:
+ self.X = self.X.sample(self.params["takeRandomSampleFromData"])
print("Random samples taken from data. New data size {}".format(self.X.index.size))
self.params["rawData"][self.currentAnalysisName] = self.X.copy()
else:
@@ -863,11 +863,11 @@ def _loadReferenceDB(self):
"""
if self.params["noDistanceCalculationAndPrediction"]:
- print("noDistanceCalculationAndPrediction was enabled. No database laoded.")
+ print("noDistanceCalculationAndPrediction was enabled. No database loaded.")
return
if self.params["noDatabaseForPredictions"]:
- print("Info :: Parameter noDatabaseForPredictions was set to True. No database laoded.")
+ print("Info :: Parameter noDatabaseForPredictions was set to True. No database loaded.")
return
print("Info :: Load positive set from data base")
@@ -882,7 +882,7 @@ def _loadReferenceDB(self):
# self._addMetricToStats("nPositiveInteractions",dbSize)
else:
- self.DB.pariwiseProteinInteractions(
+ self.DB.pairwiseProteinInteractions(
self.params["databaseIDColumn"],
dbID = self.params["databaseFileName"],
filterDb=self.params["databaseFilter"])
@@ -896,9 +896,9 @@ def _loadReferenceDB(self):
#add decoy to db
if dbSize == 0:
- raise ValueError("Warning :: No hits found in database. Check dabaseFilter keyword.")
+ raise ValueError("Warning :: No hits found in database. Check databaseFilter keyword.")
elif dbSize < 150:
- raise ValueError("Warining :: Less than 150 pairwise interactions found.")
+ raise ValueError("Warning :: Less than 150 pairwise interactions found.")
elif dbSize < 200:
#raise ValueError("Filtered positive database contains less than 200 interactions..")
print("Warning :: Filtered positive database contains less than 200 interactions.. {}".format(dbSize))
@@ -909,7 +909,7 @@ def _loadReferenceDB(self):
def _checkGroups(self):
- "Checks grouping. For comparision of multiple co-elution data sets."
+ "Checks grouping. For comparison of multiple co-elution data sets."
if isinstance(self.params["grouping"],dict):
if len(self.params["grouping"]) == 0:
@@ -1091,7 +1091,7 @@ def _createSignalChunks(self,chunkSize = 30):
Parameter
---------
- chunkSize - int. default 30. Nuber of signals in a single chunk.
+ chunkSize - int. default 30. Number of signals in a single chunk.
Returns
-------
@@ -1154,7 +1154,7 @@ def _createSignalChunks(self,chunkSize = 30):
self.signalChunks[analysisName] = [p for p in c if os.path.exists(p)] #
- #saves signal chunls.
+ #saves signal chunks.
dump(self.signalChunks,pathToSignalChunk)
@@ -1165,7 +1165,7 @@ def _collectRSquaredAndFitDetails(self):
"""
if not self.params["savePeakModels"]:
- print("!! Warning !! This parameter is depracted and from now on always true.")
+ print("!! Warning !! This parameter is deprecated and from now on always true.")
self.params["savePeakModels"] = True
pathToPlotFolder = os.path.join(self.params["pathToTmp"][self.currentAnalysisName],"result","modelPlots")
@@ -1174,7 +1174,7 @@ def _collectRSquaredAndFitDetails(self):
fittedPeaksPath = os.path.join(resultFolder,"fittedPeaks_{}.txt".format(self.currentAnalysisName))
nPeaksPath = os.path.join(resultFolder,"nPeaks.txt")
if os.path.exists(fittedPeaksPath) and os.path.exists(nPeaksPath):
- print("Warning :: FittedPeaks detected. If you changed the data, you have to set the paramter 'restartAnalysis' True to include changes..")
+ print("Warning :: FittedPeaks detected. If you changed the data, you have to set the parameter 'restartAnalysis' True to include changes..")
return
if not os.path.exists(resultFolder):
os.mkdir(resultFolder)
@@ -1286,9 +1286,9 @@ def _trainPredictor(self, addImpurity = 0.3, apexTraining = False):
gridSearch = self.params["classiferGridSearch"],
testSize = self.params["classifierTestSize"])
- probabilites, meanAuc, stdAuc, oobScore, optParams, Y_test, Y_pred = self.classifier.fit(X,Y,kFold=self.params["kFold"],pathToResults=self.params["pathToComb"], metricColumns = metricColumnsForPrediction)
+ probabilities, meanAuc, stdAuc, oobScore, optParams, Y_test, Y_pred = self.classifier.fit(X,Y,kFold=self.params["kFold"],pathToResults=self.params["pathToComb"], metricColumns = metricColumnsForPrediction)
- dataForTraining["PredictionClass"] = probabilites
+ dataForTraining["PredictionClass"] = probabilities
#save prediction summary
pathToFImport = os.path.join(self.params["pathToComb"],"PredictorSummary{}_{}.txt".format(self.params["metrices"],self.params["addImpurity"]))
@@ -1351,7 +1351,7 @@ def _predictInteractions(self):
predInteractions = None
- metricIdx = [n + 4 if "apex" in self.params["metrices"] else n + 3 for n in range(len(self.params["metrices"]))] #in order to extract from dinstances, apex creates an extra column (apex_dist)
+ metricIdx = [n + 4 if "apex" in self.params["metrices"] else n + 3 for n in range(len(self.params["metrices"]))] #in order to extract from distances, apex creates an extra column (apex_dist)
for n,(X,nChunks) in enumerate(self._loadPairsForPrediction()):
@@ -1546,7 +1546,7 @@ def _scoreComplexes(self, complexDf, complexMemberIds = "subunits(UniProt IDs)",
matchingResults = pd.DataFrame(columns = ["Entry","Cluster Labels","Complex ID", "NumberOfInteractionsInDB"])
clearedEntries = pd.Series([x.split("_")[0] for x in complexDf.index], index=complexDf.index)
- for c,d in self.DB.indentifiedComplexes.items():
+ for c,d in self.DB.identifiedComplexes.items():
boolMatch = clearedEntries.isin(d["members"])
clusters = complexDf.loc[boolMatch,"Cluster Labels"].values.flatten()
@@ -1936,7 +1936,7 @@ def _createTxtFile(self,pathToFile,headers):
def _makeTmpFolder(self, n = 0):
"""
- Creates temporary fodler.
+ Creates temporary folder.
Parameters
@@ -1946,7 +1946,7 @@ def _makeTmpFolder(self, n = 0):
Returns
-------
pathToTmp : str
- ansolute path to tmp/anlysis name folder.
+ absolute path to tmp/analysis name folder.
"""
@@ -1964,13 +1964,13 @@ def _makeTmpFolder(self, n = 0):
self.currentAnalysisName = analysisName
date = datetime.today().strftime('%Y-%m-%d')
- self.params["Date of anaylsis"] = date
+ self.params["Date of analysis"] = date
runName = self.params["runName"] if self.params["runName"] is not None else self._randomStr(3)
self.params["pathToComb"] = self._makeFolder(pathToTmp,"{}_n({})runs".format(runName,len(self.params["analysisName"])))
print("Info :: Folder created in which combined results will be saved: " + self.params["pathToComb"])
pathToTmpFolder = os.path.join(self.params["pathToComb"],analysisName)
if os.path.exists(pathToTmpFolder):
- print("Info :: Path to results folder exsists")
+ print("Info :: Path to results folder exists")
if self.params["restartAnalysis"]:
print("Warning :: Argument restartAnalysis was set to True .. cleaning folder.")
#to do - shift to extra fn
@@ -2019,7 +2019,7 @@ def _handleComptabFormat(self,X,filesToLoad):
Returns
-------
detectedDataFrames : list of pd.DataFrame
- list of identified data farmes from compbat file
+ list of identified data frames from comptab file
fileNames : list of str
Internal names :
@@ -2064,7 +2064,7 @@ def run(self,X, maxValueToOne = False):
Returns
-------
pathToTmp : str
- ansolute path to tmp/anlysis name folder.
+ absolute path to tmp/analysis name folder.
"""
self.allSamplesFound = False
@@ -2075,10 +2075,10 @@ def run(self,X, maxValueToOne = False):
if isinstance(X,list) and all(isinstance(x,pd.DataFrame) for x in X):
if self.params["compTabFormat"]:
raise TypeError("If 'compTabFormat' is True. X must be a path to a folder. Either set compTabFormat to False or provide a path.")
- print("Multiple dataset detected - each one will be analysed separetely")
+ print("Multiple dataset detected - each one will be analysed separately")
if self.params["analysisName"] is None or not isinstance(self.params["analysisName"],list) or len(self.params["analysisName"]) != len(X):
self.params["analysisName"] = [self._randomStr(10) for n in range(len(X))] #create random analysisNames
- print("Info :: 'anylsisName' did not match X shape. Created random strings per dataframe.")
+ print("Info :: 'analysisName' did not match X shape. Created random strings per dataframe.")
elif isinstance(X,str):
if os.path.exists(X):
@@ -2153,7 +2153,7 @@ def run(self,X, maxValueToOne = False):
endSignalTime = time.time()
- self.params["runTimes"]["SignalFitting&Comparision"] = time.time() - self.params["runTimes"]["StartTime"]
+ self.params["runTimes"]["SignalFitting&Comparison"] = time.time() - self.params["runTimes"]["StartTime"]
if not self.params["justFitAndMatchPeaks"]:
@@ -2162,7 +2162,7 @@ def run(self,X, maxValueToOne = False):
self._createSignalChunks()
for n,X in enumerate(X):
- if n < len(self.params["analysisName"]): #happnes if others than txt file are present
+ if n < len(self.params["analysisName"]): #happens if files other than txt files are present
self.currentAnalysisName = self.params["analysisName"][n]
print(self.currentAnalysisName," :: Starting distance calculations.")
@@ -2283,7 +2283,7 @@ def _combinePredictedInteractions(self, pathToComb):
boolIdx = combResults[groupItems] == "+"
if isinstance(boolIdx,pd.Series):
- #grouping equals 1 (groupItems, nonsenese (always ture), but repoted due to conisitency)
+ #grouping equals 1 (groupItems, nonsense (always true), but reported for consistency)
combResults["Complete in {}".format(groupName)] = boolIdx
else:
combResults["Complete in {}".format(groupName)] = np.sum(boolIdx,axis=1) == len(groupItems)
@@ -2453,7 +2453,7 @@ def _combinePeakResults(self):
"""
Combine Peak results. For each run, each signal profile per feature
is represented by an ensemble of peaks. This function matches
- the peaks using a maximimal distance of 1.8 by default defined
+ the peaks using a maximal distance of 1.8 by default defined
by the parameter 'maxPeakCenterDifference'.
Peak height or area under curve are compared using a t-test and or an ANOVA.
@@ -2606,7 +2606,7 @@ def _combinePeakResults(self):
recalculateDistance = False,
retrainClassifier = True,
minPeakHeightOfMax= 0.01,
- takeRondomSampleFromData = False,
+ takeRandomSampleFromData = False,
justFitAndMatchPeaks = False,
noDistanceCalculationAndPrediction = False,
runName = "D1_exampleTest", #change analysis name
@@ -2630,8 +2630,8 @@ def _combinePeakResults(self):
correlationWindowSize = 5,
interactionProbabCutoff = 0.7,
minimumPPsPerFeature = 2,
- #usePeakCentricFeatures = True, ## careful, eperimental!
+ #usePeakCentricFeatures = True, ## careful, experimental!
removeSingleDataPointPeaks=True,
keepOnlySignalsValidInAllConditions = False,
quantFiles = {},
- useRawDataForDimensionalReduction = False).run("../example-data/D1") #adjust the folder where the files are sstored
+ useRawDataForDimensionalReduction = False).run("../example-data/D1") #adjust the folder where the files are stored
diff --git a/src/modules/Database.py b/src/modules/Database.py
index a92bcd6..6cde7de 100644
--- a/src/modules/Database.py
+++ b/src/modules/Database.py
@@ -52,7 +52,7 @@ class Database(object):
def __init__(self, nJobs = 4, splitString = ";"):
"""Database Module.
- The pipeline requires a database containing positve feature interactions.
+ The pipeline requires a database containing positive feature interactions.
This module find interactions present in the dataset to be analysed,
creates decoy interactions and matches metrices to databases.
@@ -115,7 +115,7 @@ def _filterDb(self,
raise ValueError("complexNameColumn not in database")
- def pariwiseProteinInteractions(self,
+ def pairwiseProteinInteractions(self,
complexIDsColumn,
dbID = "20190823_CORUM.txt",
filterDb = {'Organism': ["Human"]},
@@ -163,7 +163,7 @@ def addDecoy(self, sizeFraction = 1.2):
Adds a decoy database to the module.
Random entries from positive data are taken and Fake
- complexes are build. Self-ineractions (x1 == x2) are
+ complexes are built. Self-interactions (x1 == x2) are
not allowed and ignored. Duplicated interactions are
also ignored as well as positive Interactions that is
reported in a different positive complex.
@@ -277,7 +277,7 @@ def getInteractionClassByE1E2(self,E1E2s,E1s,E2s):
else:
E1E2Type.append("decoy")
else:
- #if we get here, those itneractions cannot be positive or decoy
+ #if we get here, those interactions cannot be positive or decoy
e1 = E1s[n]
e2 = E2s[n]
@@ -353,7 +353,7 @@ def _saveFilteredDf(self,fileName):
def collectPairwiseInt(self,i,interactors,complexName,predictClass,splitString = ";"):
collectedResult = []
- for interaction in self._getPariwiseInteractions(interactors.split(splitString)):
+ for interaction in self._getPairwiseInteractions(interactors.split(splitString)):
interaction = [e[:6] for e in interaction]
collectedResult.append({"ComplexID":i,"E1":interaction[0],"E2":interaction[1],"E1E2":''.join(sorted(interaction)),"complexName":complexName,"Class":predictClass})
return collectedResult
@@ -368,7 +368,7 @@ def _findPositiveInteractions(self,filteredDB, df, dbID, complexNameColumn):
return df
- def _getPariwiseInteractions(self,entryList):
+ def _getPairwiseInteractions(self, entryList):
""
return itertools.combinations(entryList, 2)
@@ -426,23 +426,23 @@ def findMatch(self,x,metricDf, mCols):
return metricDf.loc[metricDf["E2E1"] == search,mCols]
@property
- def indentifiedComplexes(self):
+ def identifiedComplexes(self):
if hasattr(self,'uniqueComplexesIdentified'):
return self.uniqueComplexesIdentified
def identifiableComplexes(self,complexMemberIds, ID = "20190823_CORUM.txt"):
""
- identifiableMebmers = OrderedDict()
+ identifiableMembers = OrderedDict()
if hasattr(self,'uniqueComplexesIdentified'):
for k in self.uniqueComplexesIdentified.keys():
- identifiableMebmers[k] = {}
+ identifiableMembers[k] = {}
boolIdx = self.dbs[ID].index == k
complexData = self.dbs[ID][boolIdx]
cMembers = complexData[complexMemberIds].tolist()[0].split(";")
- identifiableMebmers[k]["n"] = len(cMembers)
- identifiableMebmers[k]["members"] = cMembers
+ identifiableMembers[k]["n"] = len(cMembers)
+ identifiableMembers[k]["members"] = cMembers
- return identifiableMebmers
+ return identifiableMembers
def assignComplexToProtein(self, e, complexMemberIds, complexIDColumn, ID = "20190823_CORUM.txt", filterDict = {'Organism': ["Human"]}):
@@ -553,12 +553,12 @@ def matchMetrices(self,pathToTmp,entriesInChunks,metricColumns,analysisName,forc
def _createChunks(self,pathToTmp,entriesInChunks,metricColumns):
"""
- Craetes chunks
+ Creates chunks
To do:
- Parellelerize.
+ Parallelize.
Parameters
----------
@@ -728,10 +728,10 @@ def matchInteractions(self,columnLabel, distanceMatrix):
def fillComplexMatrixFromData(self, X):
""
if not isinstance(X, pd.DataFrame):
- raise ValueError("X must be a pandas data frame with index and columns containg ID")
+ raise ValueError("X must be a pandas data frame with index and columns containing ID")
return X.merge(self.df,how="left",left_index=True,right_on="E1;E2")
if __name__ == "__main__":
- Database().pariwiseProteinInteractions("subunits(UniProt IDs)")
+ Database().pairwiseProteinInteractions("subunits(UniProt IDs)")
diff --git a/src/modules/Distance.py b/src/modules/Distance.py
index d103c09..f3d1bb3 100644
--- a/src/modules/Distance.py
+++ b/src/modules/Distance.py
@@ -15,7 +15,7 @@
def minMaxNorm(X,axis=0):
- "Normalize array betweem 0 and 1"
+ "Normalize array between 0 and 1"
Xmin = np.nanmin(X,axis=axis, keepdims=True)
Xmax = np.nanmax(X,axis=axis,keepdims=True)
X_transformed = (X - Xmin) / (Xmax-Xmin)
@@ -116,7 +116,7 @@ def _pearson(u,v):
@jit()
def pearson(nY,Ys):
- "Calcualtes pearson correlation."
+ """Calculates pearson correlation."""
return [_pearson(nY,Y) for Y in Ys]
@@ -217,16 +217,16 @@ def __init__(self,
Identifier of E1
E2 : obj:`list`of obj `np.array`
- Signal intensity of E2s. Disntances
- betwenn ID and E2 are calculated.
- The intensitiy profiles of E2s are uploaded from source.npy.
+ Signal intensity of E2s. Distances
+ between ID and E2 are calculated.
+ The intensity profiles of E2s are uploaded from source.npy.
ownPeaks : obj:`list`of obj `dict`
List of modelled peaks for Y. Required to calculate apex distance,
- which is equal to the euclidean dinstance of the closest peaks.
+ which is equal to the euclidean distance of the closest peaks.
metrices : obj:`list` of obj:`str` or obj`list` of obj`dict`
- List of strings or dictionories of metrices used to calculate distance.
+ List of strings or dictionaries of metrices used to calculate distance.
If dict is provided, two keys namely `fn`and `name`must be provided.
The name must be unique (if more than one dict is provided.)
diff --git a/src/modules/Distance_archive.py b/src/modules/Distance_archive.py
index 0b7ccf1..88a8791 100644
--- a/src/modules/Distance_archive.py
+++ b/src/modules/Distance_archive.py
@@ -43,21 +43,21 @@ def __init__(self,
Identifier of E1
E2 : obj:`list`of obj `np.array`
- Signal intensity of E2s. Disntances
- betwenn ID and E2 are calculated.
- The intensitiy profiles of E2s are uploaded from source.npy.
+ Signal intensity of E2s. Distances
+ between ID and E2 are calculated.
+ The intensity profiles of E2s are uploaded from source.npy.
ownPeaks : obj:`list`of obj `dict`
List of modelled peaks for Y. Required to calculate apex distance,
- which is equal to the euclidean dinstance of the closest peaks.
+ which is equal to the euclidean distance of the closest peaks.
metrices : obj:`list` of obj:`str` or obj`list` of obj`dict`
- List of strings or dictionories of metrices used to calculate distance.
- If dict is provided, two keys namely `fn`and `name`must be provided.
+ List of strings or dictionaries of metrices used to calculate distance.
+ If dict is provided, two keys namely `fn` and `name` must be provided.
The name must be unique (if more than one dict is provided.)
pathToTmp : string
- Path to the temporary folder for the current anaylsis. Required to load
+ Path to the temporary folder for the current analysis. Required to load
Signals (called Ys)
chunkName : string
diff --git a/src/modules/Predictor.py b/src/modules/Predictor.py
index a521fbc..f22bac1 100644
--- a/src/modules/Predictor.py
+++ b/src/modules/Predictor.py
@@ -77,7 +77,7 @@ def __init__(self, classifierClass = "random forest", n_jobs = 4, gridSearch = N
def _initClassifier(self):
"""
- Initiate Classifer
+ Initiate Classifier
Parameters
----------
@@ -127,11 +127,11 @@ def _scaleFeatures(self,X):
Feature scaling. Data are scaled by StandardScaler (0-1)
Importantly, the scaler is not retrained once it was initiated
- to ensure that the scaling remains similiar for predictors.
+ to ensure that the scaling remains similar for predictors.
Parameters
----------
- X : two dimensional numpy array (feature paris in rows)
+ X : two dimensional numpy array (feature pairs in rows)
Distance matrix for feature pairs
@@ -153,7 +153,7 @@ def _gridOptimization(self,X,Y):
Parameters
----------
- X : two dimensional numpy array (feature paris in rows)
+ X : two dimensional numpy array (feature pairs in rows)
Distance matrix for feature pairs
Y : numpy array
Array containing class labels of X (0,1)
@@ -181,7 +181,7 @@ def _gridOptimization(self,X,Y):
def getFeatureImportance(self):
"""
- Returns estimatore feature imporantance, if estimator allows for this.
+ Returns estimator feature importance, if estimator allows for this.
Parameters
----------
@@ -215,7 +215,7 @@ def predict(self,X,scale=True):
Returns
-------
Two dimensional array (n feature pairs x predictors)
- containing the class proability
+ containing the class probability
if predictors (default: 3 - see fit function)
"""
@@ -247,7 +247,7 @@ def fit(self, X, Y, kFold = 3, optimizedParams=None, pathToResults = '', plotROC
X : two dimensional numpy array
Distance matrix for feature pairs
Y : np.array
- Class labels (1 - 0) for postive
+ Class labels (1 - 0) for positive
and negative interaction
kFold : int
Number of cross validations. Equals the number of predictors.
@@ -275,7 +275,7 @@ def fit(self, X, Y, kFold = 3, optimizedParams=None, pathToResults = '', plotROC
if self.gridSerach is not None and optimizedParams is None:
optimizedClassifier, optimizedParams = self._gridOptimization(X_train,y_train)
else:
- print("Info :: Grid serach skipped. Automatically skipped when using Guassian NB or parameter 'classiferGridSearch' is None.")
+ print("Info :: Grid search skipped. Automatically skipped when using Gaussian NB or parameter 'classiferGridSearch' is None.")
optimizedClassifier = self.classifier
#cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2)
if optimizedParams is not None:
@@ -284,7 +284,7 @@ def fit(self, X, Y, kFold = 3, optimizedParams=None, pathToResults = '', plotROC
self.predictors = [optimizedClassifier]
probasOut = optimizedClassifier.predict_proba(X)
- #predict probabiliteis for complete data set to create a classfier report.
+ #predict probabilities for complete data set to create a classifier report.
tprs = []
aucs = []
oobScore = np.nan
diff --git a/src/modules/Signal.py b/src/modules/Signal.py
index 76138ca..7632bf5 100644
--- a/src/modules/Signal.py
+++ b/src/modules/Signal.py
@@ -40,12 +40,12 @@ def __init__(self,
"""Signal module for pre-processing and modeling
- The Signal module allows to do severl pre-processing/modelling
+ The Signal module allows to do several pre-processing/modelling
steps such as
a) smoothing (rolling average)
b) filtering by number of nonNaN values
c) removal of single data points (surrounded by zeros or nans)
+ d) Peak detection (finds peaks) - required for further analysis
+ b) Peak detection (finds peaks) - required for further analysis
The peak modelling allows for usage of `LorentzianModel` or `GaussianModel`
@@ -131,34 +131,34 @@ def _removeSingleDataPointPeaks(self):
"""
peaksFiltered = 0
- flilteredY = []
+ filteredY = []
for i,x in enumerate(self.Y):
if i == 0: #first item is different
if self.Y[i+1] == 0:
- flilteredY.append(0)
+ filteredY.append(0)
if self.Y[i] > 0:
peaksFiltered += 1
else:
- flilteredY.append(x)
+ filteredY.append(x)
elif i == self.Y.size - 1: #last item also
if self.Y[-1] != 0 and self.Y[-1]:
- flilteredY.append(0)
+ filteredY.append(0)
if self.Y[i] > 0:
peaksFiltered += 1
else:
- flilteredY.append(x)
+ filteredY.append(x)
else:
if self.Y[i-1] == 0 and self.Y[i+1] == 0:
- flilteredY.append(0)
+ filteredY.append(0)
if self.Y[i] > 0:
peaksFiltered += 1
else:
- flilteredY.append(x)
+ filteredY.append(x)
- return np.array(flilteredY), peaksFiltered
+ return np.array(filteredY), peaksFiltered
def isValid(self, nonZero = 4):
"""Returns true if signal contains more than
@@ -173,7 +173,7 @@ def isValid(self, nonZero = 4):
Returns
-------
- boolean, True if vald
+ boolean, True if valid
"""
valid = np.sum(self.Y > 0) > nonZero
@@ -241,7 +241,7 @@ def _addParams(self,modelParams,prefix,peakIdx,i):
Parameters
----------
- mdeolParams :
+ modelParams :
modelParam object. Returned by model.make_params() (lmfit package)
Documentation: https://lmfit.github.io/lmfit-py/model.html
@@ -249,7 +249,7 @@ def _addParams(self,modelParams,prefix,peakIdx,i):
Prefix for the model (e.g. peak), defaults to f'm{i}_'.format(i)
peakIdx : int
- Arary index at which the peak was detected in the Signal arary self.Y
+ Array index at which the peak was detected in the Signal array self.Y
i : int
index of detected models
@@ -263,7 +263,7 @@ def _addParams(self,modelParams,prefix,peakIdx,i):
if self.avoidWideSmallPeaks and self.Y[peakIdx[i]] < np.max(self.Y) * 0.2:
- #small peaks should not be to wide!
+ #small peaks should not be too wide!
self._addParam(modelParams,
name=prefix+'amplitude',
max = self.Y[peakIdx[i]] * 1.2 * np.pi,
@@ -328,7 +328,7 @@ def _findParametersForModels(self,spec,peakIdx):
def _checkPeakIdx(self,peakIdx, maxPeaks = 15):
"""
Checks if number of peaks exceed the max number of
- allwed peaks. (paramater: maxPeaks)
+ allowed peaks. (parameter: maxPeaks)
If the number exceeds maxPeaks, the peaks with the
highest value are taken. Others are removed
@@ -362,7 +362,7 @@ def fitModel(self):
"""
Fits the model (ensemble of several peaks).
The number of models equals the number of
- detected peaks. Please not that that the maximum
+ detected peaks. Please note that the maximum
number of peaks is limited by the parameter:
maxPeaks (defaults to 12)
@@ -371,7 +371,7 @@ def fitModel(self):
- peak models + signal profile are plotted and saved as pdf (folder modelPlots)
- - if squaredR for the model fit is below threshold (r2Tresh - deufault 0.85), the
+ - if squaredR for the model fit is below threshold (r2Thresh - default 0.85), the
signal profile is ignored. A message is printed if this happens.
Parameters
diff --git a/src/modules/utils.py b/src/modules/utils.py
index f6eb7bd..ab65d95 100644
--- a/src/modules/utils.py
+++ b/src/modules/utils.py
@@ -91,7 +91,7 @@ def calculateDistanceP(pathToFile):
"""
with open(pathToFile,"rb") as f:
chunkItems = pickle.load(f)
- exampleItem = chunkItems[0] #used to get specfici chunk name to save under same name
+ exampleItem = chunkItems[0] #used to get specific chunk name to save under same name
if "chunkName" in exampleItem:
XX = [DistanceCalculator(**c).calculateMetrices() for c in chunkItems]
data = np.concatenate([X[0] for X in XX],axis=0)
diff --git a/tests/test_misc.py b/tests/test_misc.py
index 8490b55..f9c5cbd 100644
--- a/tests/test_misc.py
+++ b/tests/test_misc.py
@@ -40,7 +40,7 @@ def test_workflow_completes():
recalculateDistance=False,
retrainClassifier=True,
minPeakHeightOfMax=0.01,
- takeRondomSampleFromData=False,
+ takeRandomSampleFromData=False,
justFitAndMatchPeaks=False,
noDistanceCalculationAndPrediction=False,
runName="D1_exampleTest",