Fully-Connected Neural Networks with Reduced Parameterization for Predicting Histological Types of Lung Cancer from Somatic Mutations

Biomolecules ◽  
2020 ◽  
Vol 10 (9) ◽  
pp. 1249
Author(s):  
Kazuma Kobayashi ◽  
Amina Bolatkan ◽  
Shuichiro Shiina ◽  
Ryuji Hamamoto

Several challenges arise in the application of deep learning to genomic data. First, the dimensionality of the input can be orders of magnitude greater than the number of samples, making the model prone to overfitting the training dataset. Second, each input variable’s contribution to the prediction is usually difficult to interpret, owing to multiple nonlinear operations. Third, genetic data features sometimes have no innate structure. To alleviate these problems, we propose a modification to Diet Networks by adding element-wise input scaling. The original Diet Networks concept can considerably reduce the number of parameters of the fully-connected layers by taking the transposed data matrix as an input to its auxiliary network. The efficacy of the proposed architecture was evaluated on a binary classification task for lung cancer histology, that is, adenocarcinoma or squamous cell carcinoma, from a somatic mutation profile. The dataset consisted of 950 cases, and 5-fold cross-validation was performed to evaluate model performance. The model achieved a prediction accuracy of around 80%, and the results showed that our modification markedly stabilized the learning process. Also, latent representations acquired inside the model allowed us to interpret the relationship between somatic mutation sites for the prediction.
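To make the parameter reduction concrete, the following is a minimal sketch that only counts weights; it is not the authors' implementation. It assumes the auxiliary network is a two-layer mapping from one row of the transposed data matrix (one value per training sample) to that feature's first-layer weights, and that element-wise input scaling adds one learned scalar per feature. The feature count and layer widths are hypothetical; the 760 training samples correspond to four of five folds of the 950 cases.

```python
def direct_layer_params(n_features: int, n_hidden: int) -> int:
    """Weights of an ordinary fully-connected first layer (biases omitted)."""
    return n_features * n_hidden


def diet_layer_params(n_samples: int, n_embed: int, n_hidden: int,
                      n_features: int, input_scaling: bool = True) -> int:
    """Weights when an auxiliary network predicts the first-layer weights.

    Assumed layout: the auxiliary network reads one row of the transposed
    data matrix (length n_samples) per feature and maps it, through a hidden
    layer of width n_embed, to that feature's n_hidden weights.
    """
    aux = n_samples * n_embed + n_embed * n_hidden
    # Assumed form of element-wise input scaling: one scalar per feature.
    scale = n_features if input_scaling else 0
    return aux + scale


# Hypothetical sizes, except n_samples (4/5 of the 950 cases).
n_samples, n_features, n_hidden, n_embed = 760, 20_000, 100, 100

direct = direct_layer_params(n_features, n_hidden)
diet = diet_layer_params(n_samples, n_embed, n_hidden, n_features)
print(f"direct first layer: {direct:,} weights")
print(f"auxiliary-network version: {diet:,} weights")
```

Under these assumptions, the auxiliary network's size depends on the number of samples and the layer widths, and the feature count enters only through the scaling vector, which is why the approach suits inputs with far more features than samples.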

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Tri-Cong Pham ◽  
Chi-Mai Luong ◽  
Van-Dung Hoang ◽  
Antoine Doucet

Melanoma, one of the most dangerous types of skin cancer, results in a very high mortality rate. Early detection and resection are two key points for a successful cure. Recent studies have used artificial intelligence to classify melanoma and nevus and to compare the assessment of these algorithms to that of dermatologists. However, training neural networks on an imbalanced dataset leads to imbalanced performance: the specificity is very high but the sensitivity is very low. This study proposes a method for improving melanoma prediction on an imbalanced dataset by reconstructing an appropriate CNN architecture and optimizing the algorithms. The contributions involve three key features: a custom loss function, custom mini-batch logic, and reformed fully connected layers. In the experiment, the training dataset is kept up to date and includes 17,302 images of melanoma and nevus, which is the largest dataset by far. The model performance is compared to that of 157 dermatologists from 12 university hospitals in Germany based on the same dataset. The experimental results prove that our proposed approach outperforms all 157 dermatologists and achieves higher performance than the state-of-the-art approach, with an area under the curve of 94.4%, sensitivity of 85.0%, and specificity of 95.0%. Moreover, using the best threshold gives the most balanced measure compared to other studies, with sensitivity of 90.0% and specificity of 93.8%, and is promising for application to medical diagnosis. To foster further research and allow for replicability, we made the source code and data splits of all our experiments publicly available.
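The trade-off between sensitivity and specificity at a chosen threshold can be illustrated with a short sketch. The abstract does not say how the "best threshold" was selected, so the criterion used here (maximising Youden's J, that is, sensitivity + specificity - 1) is an assumption, and the scores and labels are toy values rather than study data.

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold means 'melanoma'."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def best_threshold(scores, labels):
    """Threshold maximising Youden's J (an assumed selection criterion)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: sum(sens_spec(scores, labels, t)) - 1)


# Toy, imbalanced example: 5 nevus (label 0) and 3 melanoma (label 1) scores.
scores = [0.1, 0.2, 0.3, 0.35, 0.6, 0.4, 0.7, 0.9]
labels = [0, 0, 0, 0, 0, 1, 1, 1]

print("default 0.5:", sens_spec(scores, labels, 0.5))
chosen = best_threshold(scores, labels)
print("chosen threshold:", chosen, sens_spec(scores, labels, chosen))
```

On the toy data, lowering the threshold from 0.5 raises sensitivity without reducing specificity, which is the kind of rebalancing the abstract reports.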


Author(s):  
Irzam Sarfraz ◽  
Muhammad Asif ◽  
Joshua D Campbell

Motivation: R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to perform subsetting of the data matrix before applying down-stream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of assay from the original object. However, this approach is inefficient as it requires the creation of an additional object containing a copy of the original assay and leads to challenges with data provenance.

Results: To overcome these challenges, we developed an R package called ExperimentSubset, which is a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes are able to inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, the ExperimentSubset is a flexible container for the efficient management of subsets.

Availability and implementation: ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and GitHub: https://github.com/campbio/ExperimentSubset.

Supplementary information: Supplementary data are available at Bioinformatics online.
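The idea of storing a subset as indices into a parent assay, rather than as a copy, can be sketched in a few lines. This is a Python illustration of the concept only; it is not the ExperimentSubset R API, and every class and method name below is hypothetical.

```python
class SubsetView:
    """Hypothetical stand-in: a subset stored as indices into a parent."""

    def __init__(self, parent_name, rows, cols):
        self.parent_name = parent_name  # provenance: where this subset came from
        self.rows = list(rows)
        self.cols = list(cols)


class Experiment:
    """Hypothetical container holding full assays plus index-only subsets."""

    def __init__(self, assays):
        self.assays = dict(assays)  # name -> matrix as a list of rows
        self.subsets = {}           # name -> SubsetView (no data copied)

    def create_subset(self, name, parent, rows, cols):
        # The parent may be a stored assay or another subset.
        self.subsets[name] = SubsetView(parent, rows, cols)

    def get(self, name):
        """Materialise an assay or subset on demand by following parents."""
        if name in self.assays:
            return self.assays[name]
        view = self.subsets[name]
        parent = self.get(view.parent_name)
        return [[parent[i][j] for j in view.cols] for i in view.rows]


exp = Experiment({"counts": [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})
exp.create_subset("filtered", "counts", rows=[0, 2], cols=[1, 2])
exp.create_subset("top", "filtered", rows=[1], cols=[0])
print(exp.get("filtered"))
print(exp.get("top"), "derived from", exp.subsets["top"].parent_name)
```

Because each subset records only its parent's name and the selected indices, the chain from a subset back to the original assay stays recoverable, and the matrix values are stored once.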


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Young-Gon Kim ◽  
Sungchul Kim ◽  
Cristina Eunbee Cho ◽  
In Hye Song ◽  
Hee Jin Lee ◽  
...  

Fast and accurate confirmation of metastasis on the frozen tissue section of intraoperative sentinel lymph node biopsy is an essential tool for critical surgical decisions. However, accurate diagnosis by pathologists is difficult within the time limitations. Training a robust and accurate deep learning model is also difficult owing to the limited number of frozen datasets with high quality labels. To overcome these issues, we validated the effectiveness of transfer learning from CAMELYON16 to improve performance of the convolutional neural network (CNN)-based classification model on our frozen dataset (N = 297) from Asan Medical Center (AMC). Among the 297 whole slide images (WSIs), 157 and 40 WSIs were used to train deep learning models with different dataset ratios at 2, 4, 8, 20, 40, and 100%. The remaining, i.e., 100 WSIs, were used to validate model performance in terms of patch- and slide-level classification. An additional 228 WSIs from Seoul National University Bundang Hospital (SNUBH) were used as an external validation. Three initial weights, i.e., scratch-based (random initialization), ImageNet-based, and CAMELYON16-based models were used to validate their effectiveness in external validation. In the patch-level classification results on the AMC dataset, CAMELYON16-based models trained with a small dataset (up to 40%, i.e., 62 WSIs) showed a significantly higher area under the curve (AUC) of 0.929 than those of the scratch- and ImageNet-based models at 0.897 and 0.919, respectively, while CAMELYON16-based and ImageNet-based models trained with 100% of the training dataset showed comparable AUCs at 0.944 and 0.943, respectively. For the external validation, CAMELYON16-based models showed higher AUCs than those of the scratch- and ImageNet-based models. The feasibility of transfer learning to enhance model performance was validated for frozen section datasets with limited numbers.


Water ◽  
2020 ◽  
Vol 13 (1) ◽  
pp. 37
Author(s):  
Tomás de Figueiredo ◽  
Ana Caroline Royer ◽  
Felícia Fonseca ◽  
Fabiana Costa de Araújo Schütz ◽  
Zulimar Hernández

The European Space Agency Climate Change Initiative Soil Moisture (ESA CCI SM) product provides soil moisture estimates from radar satellite data with a daily temporal resolution. Despite validation exercises with ground data that have been performed since the product’s launch, SM has not yet been consistently related to soil water storage, which is a key step for its application for prediction purposes. This study aimed to analyse the relationship between soil water storage (S), which was obtained from soil water balance computations with ground meteorological data, and soil moisture, which was obtained from radar data, as affected by soil water storage capacity (Smax). As a case study, a 14-year monthly series of soil water storage, produced via soil water balance computations using ground meteorological data from northeast Portugal and Smax from 25 mm to 150 mm, was matched with the corresponding monthly averaged SM product. Linear (I) and logistic (II) regression models relating S with SM were compared. Model performance (r² in the 0.8–0.9 range) varied non-monotonically with Smax and was highest at an Smax of 50 mm. The logistic model (II) performed better than the linear model (I) in the lower range of Smax. Improvements in model performance obtained with segregation of the data series in two subsets, representing soil water recharge and depletion phases throughout the year, outlined the hysteresis in the relationship between S and SM.
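The comparison between a linear and a logistic model of S against SM can be sketched as follows. The abstract does not give the model equations, so the logistic form used here (S rising towards Smax as an upper asymptote, fitted by ordinary least squares on the logit of S/Smax) is an assumption, and the data are synthetic values generated for illustration, not measurements from the study.

```python
import math


def ols(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat against y."""
    my = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def fit_linear(sm, s):
    a, b = ols(sm, s)
    return lambda v: a + b * v


def fit_logistic(sm, s, s_max):
    """Assumed form S = Smax / (1 + exp(-(a + b*SM))); needs 0 < S < Smax."""
    z = [math.log(si / (s_max - si)) for si in s]  # logit of S / Smax
    a, b = ols(sm, z)
    return lambda v: s_max / (1 + math.exp(-(a + b * v)))


# Synthetic series: Smax = 50 mm, SM as volumetric fraction (illustrative only).
s_max = 50.0
sm = [0.10 + 0.05 * i for i in range(7)]
s = [s_max / (1 + math.exp(-(-6 + 24 * v))) for v in sm]

linear, logistic = fit_linear(sm, s), fit_logistic(sm, s, s_max)
r2_linear = r_squared(s, [linear(v) for v in sm])
r2_logistic = r_squared(s, [logistic(v) for v in sm])
print(f"linear r2 = {r2_linear:.3f}, logistic r2 = {r2_logistic:.3f}")
```

Since the synthetic series is itself logistic, the logistic fit is near-perfect by construction; the sketch shows only how the two fits would be compared, not which model the field data favour.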


2021 ◽  
pp. 016555152199804
Author(s):  
Qian Geng ◽  
Ziang Chuai ◽  
Jian Jin

To provide junior researchers with domain-specific concepts efficiently, an automatic approach for academic profiling is needed. First, to obtain personal records of a given scholar, typical supervised approaches often utilise structured data such as Wikipedia infoboxes as a training dataset, but this may lead to a severe mislabelling problem when the data are used to train a model directly. To address this problem, a new relation embedding method is proposed for fine-grained entity typing, in which the initial vector of entities and a new penalty scheme are considered, based on the semantic distance of entities and relations. Also, to highlight critical concepts relevant to renowned scholars, scholars’ selective bibliographies, which contain massive academic terms, are analysed by a newly proposed extraction method based on logistic regression, the AdaBoost algorithm and learning-to-rank techniques. This addresses the limitation that conventional supervised methods return only binary classification results and fail to help researchers understand the relative importance of selected concepts. Experiments on academic profiling and corresponding benchmark datasets demonstrate that the proposed approaches notably outperform existing methods. The proposed techniques provide an automatic way for junior researchers to obtain organised knowledge in a specific domain, including scholars’ background information and domain-specific concepts.


2020 ◽  
Vol 15 (1) ◽  
pp. 326-330
Author(s):  
Xing Liu ◽  
Bin Shi

Lung cancer is one of the most prevalent malignancies worldwide. Local recurrence and distant metastasis remain the major causes of treatment failure. It has been recognized that the process of tumor growth and metastasis involves multiple interactions between tumor and host. Various biomarkers have been used for predicting tumor recurrence, metastasis, and prognosis in patients with lung cancer. However, these biomarkers are still controversial and require further validation. The relationship between malignancy and coagulation system disorders has been explored for more than a century. Fibrinogen is the most abundant plasma coagulation factor synthesized mainly by hepatic cells. Increased plasma fibrinogen levels were observed in various carcinomas such as gastric cancer, colon cancer, and pancreatic cancer. Recent studies have also investigated the role of fibrinogen in patients with lung cancer. This review aimed to address the role of fibrinogen in lung cancer.

