Large Datasets
Recently Published Documents





2021 ◽  
Vol 23 (09) ◽  
pp. 981-993
T. Balamurugan ◽  
E. Gnanamanoharan ◽  

Brain tumor segmentation is a challenging task in medical diagnosis. Its primary aim is to produce precise characterizations of brain tumor areas using adequately placed masks. Deep learning techniques have shown great promise in recent years for solving various computer vision problems such as object detection, image classification, and semantic segmentation, and numerous deep learning-based approaches have achieved excellent performance in brain tumor segmentation. This article comprehensively reviews recently developed deep learning-based brain tumor segmentation technology in light of the state of the art and its performance. A genetic algorithm based on fuzzy C-means (FCM-GA) was used in this study to segment tumor regions from brain images. The input image is scaled to 256×256 during the preprocessing stage, and FCM-GA is then applied to the preprocessed MRI image; it is a versatile machine learning (ML) technique for locating objects in large datasets. The segmented image is then subjected to hybrid feature extraction (HFE) to improve the feature subset. To obtain the best feature values, Kernel Nearest Neighbor with a genetic algorithm (KNN-GA) is used for feature selection. The selected features are fed into a ResNet classifier, which divides the MRI image into meningioma, glioma, and pituitary gland regions. Real-time datasets are used to validate the performance of the proposed hybrid method, which improves average classification accuracy by 7.99% over existing Convolutional Neural Network (CNN) and Support Vector Machine (SVM) classifiers.
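The fuzzy C-means step at the core of the FCM-GA pipeline can be sketched in a few lines. This is a minimal illustration on 1-D pixel intensities, not the authors' implementation: the genetic-algorithm refinement, HFE, KNN-GA, and ResNet stages are omitted, and the toy image and parameters are assumptions.

```python
import numpy as np

def fuzzy_c_means(x, n_clusters=2, m=2.0, n_iter=50, seed=0):
    """Minimal fuzzy C-means on a 1-D array of pixel intensities."""
    rng = np.random.default_rng(seed)
    u = rng.random((n_clusters, x.size))
    u /= u.sum(axis=0)                        # memberships sum to 1 per pixel
    for _ in range(n_iter):
        um = u ** m
        centers = (um @ x) / um.sum(axis=1)   # fuzzily weighted cluster means
        d = np.abs(x[None, :] - centers[:, None]) + 1e-12  # pixel-center distances
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=0)             # standard FCM membership update
    return centers, u

# Toy "image": two intensity populations standing in for tumor vs. background
img = np.concatenate([np.full(100, 0.2), np.full(100, 0.8)])
img = img + np.random.default_rng(1).normal(0, 0.02, img.size)
centers, u = fuzzy_c_means(img, n_clusters=2)
labels = u.argmax(axis=0)                     # hard segmentation mask
```

In FCM-GA-style methods, a genetic algorithm typically replaces or refines the random initialization of the memberships/centers to avoid poor local minima; the alternating update loop itself stays the same.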

2021 ◽  
Pierre Parutto ◽  
Jennifer Heck ◽  
Meng Lu ◽  
Clemens F Kaminski ◽  
Martin Heine ◽  

Super-resolution imaging can generate thousands of single-particle trajectories. These data can potentially reconstruct subcellular organization and dynamics, as well as measure disease-linked changes. However, computational methods that can derive quantitative information from such massive datasets are currently lacking. Here we present data analysis and algorithms that are broadly applicable for revealing local binding and trafficking interactions and the organization of dynamic subcellular sites. We applied this analysis to the endoplasmic reticulum (ER) and the neuronal membrane. The method is based on spatio-temporal time-window segmentation that explores data at multiple levels and detects the architecture and boundaries of high-density regions in areas hundreds of nanometers across. By statistical analysis of a large number of data points, the method allows measurement of nano-region stability. By connecting high-density regions, we reconstructed the network topology of the ER, molecular flow redistribution, and the local space explored by trajectories. Segmenting trajectories at appropriate scales extracts confined trajectories, allowing quantification of dynamic interactions between lysosomes and the ER. A final step of the method reveals the motion of trajectories relative to the ensemble, allowing reconstruction of dynamics in the normal ER and in the atlastin-null mutant. Our approach lets users track previously inaccessible large-scale dynamics at high resolution from massive datasets. The algorithm is available as an ImageJ plugin that users can apply to large datasets of overlapping trajectories.
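The detection of high-density regions from massive localization datasets can be illustrated with a simple grid-binning sketch. This is not the authors' algorithm or their ImageJ plugin, just a minimal analogue on synthetic 2-D localizations; the bin count, threshold, and coordinates are arbitrary assumptions.

```python
import numpy as np

def high_density_mask(points, bins=20, min_count=30):
    """Bin 2-D localizations into a grid and flag bins whose count exceeds
    min_count. Connected flagged bins approximate high-density nano-regions."""
    xy = np.asarray(points, dtype=float)
    h, x_edges, y_edges = np.histogram2d(xy[:, 0], xy[:, 1], bins=bins)
    return h >= min_count, x_edges, y_edges

rng = np.random.default_rng(0)
dense = rng.normal(loc=[500.0, 500.0], scale=20.0, size=(400, 2))  # a nano-domain
sparse = rng.uniform(0.0, 1000.0, size=(200, 2))                   # diffuse background
mask, xe, ye = high_density_mask(np.vstack([dense, sparse]))
```

A real pipeline would additionally segment in time (sliding windows over the acquisition) and trace region boundaries rather than keeping a rectangular grid, but the thresholded density map is the common first step.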

Biology ◽  
2021 ◽  
Vol 10 (9) ◽  
pp. 932
Nuno M. Rodrigues ◽  
João E. Batista ◽  
Pedro Mariano ◽  
Vanessa Fonseca ◽  
Bernardo Duarte ◽  

Over recent decades, the world has experienced the adverse consequences of the uncontrolled development of multiple human activities. In recent years, a growing share of total chemical production has consisted of environmentally harmful compounds with significant environmental impacts. These emerging contaminants (ECs) include a wide range of man-made chemicals in worldwide use, such as pesticides, cosmetics, personal and household care products, and pharmaceuticals. Several of these ECs have raised concerns regarding their ecotoxicological effects and how to assess them efficiently. This is of particular interest if marine diatoms are considered as potential target species, due to their widespread distribution, their status as the most abundant phytoplankton group in the oceans, and their key ecological roles. Bio-optical ecotoxicity methods appear to be reliable, fast, high-throughput screening (HTS) techniques, providing large datasets with biological relevance on the mode of action of these ECs in phototrophic organisms such as diatoms. However, from the large datasets produced, only a small fraction of the data is normally extracted for physiological evaluation, leaving out a large amount of information on EC exposure. In the present paper, we use all the available information and evaluate the application of several machine learning and deep learning algorithms to predict the exposure of model organisms to different ECs at different doses, using a model marine diatom (Phaeodactylum tricornutum) as a test organism. The results show that 2D convolutional neural networks are the best method for predicting the type of EC to which the cultures were exposed, achieving a median accuracy of 97.65%, while Rocket is the best at predicting the concentration to which the cultures were subjected, achieving a median accuracy of 100%.
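The Rocket method mentioned above transforms time series with random convolutional kernels before fitting a linear classifier. The sketch below is a minimal ROCKET-style transform (random kernels, with max and proportion-of-positive-values features per kernel) on synthetic series; it is not the actual implementation or the diatom data, and all sizes and parameters are assumptions.

```python
import numpy as np

def rocket_features(X, n_kernels=100, seed=0):
    """Minimal ROCKET-style transform: convolve each 1-D series with random
    kernels, keeping two features per kernel (max response and PPV)."""
    rng = np.random.default_rng(seed)
    feats = []
    for _ in range(n_kernels):
        length = rng.choice([7, 9, 11])          # random kernel length
        w = rng.normal(size=length)              # random kernel weights
        b = rng.uniform(-1.0, 1.0)               # random bias
        conv = np.stack([np.convolve(x, w, mode="valid") + b for x in X])
        feats.append(conv.max(axis=1))           # max response per series
        feats.append((conv > 0).mean(axis=1))    # proportion of positive values
    return np.column_stack(feats)

X = np.random.default_rng(1).normal(size=(8, 120))  # 8 toy series, length 120
F = rocket_features(X)                               # feature matrix (8, 200)
```

In the full method these features feed a ridge classifier; the random kernels need no training, which is what makes Rocket fast on large datasets.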

2021 ◽  
Vol 5 (3) ◽  
pp. 45
Sotiris Leventis ◽  
Fotios Fitsilis ◽  
Vasileios Anastasiou

The accessibility and reuse of legal data are paramount for promoting transparency, accountability and, ultimately, trust in governance institutions. The aggregation of structured and semi-structured legal data inevitably leads to the big data realm and a series of challenges for the generation, handling, and analysis of large datasets. When it comes to data generation, LEOS represents a legal informatics tool that is maturing quickly. Now in its third release, it effectively supports the drafting of legal documents using Akoma Ntoso-compatible schemes. However, the tool, originally developed for cooperative legislative drafting, can be repurposed to draft parliamentary control documents. This is achieved through the use of actor-oriented software components, referred to as software agents, which enable system interoperability by interlinking the text editing system with parliamentary control datasets. A validated corpus of written questions from the Hellenic Parliament is used to evaluate the feasibility of the endeavour: using LEOS as an authoring tool for written parliamentary questions and for the generation of standardised, open legislative data. The systemic integration not only proves the tool's versatility, but also opens up new ground in interoperability between formerly unrelated legal systems and data sources.

2021 ◽  
Cori Pegliasco ◽  
Antoine Delepoulle ◽  
Rosemary Morrow ◽  
Yannice Faugère ◽  
Gérald Dibarboure

Abstract. This paper presents the new global Mesoscale Eddy Trajectory Atlases (META3.1exp DT all-satellites, Pegliasco et al., 2021a, and META3.1exp DT two-satellites, Pegliasco et al., 2021b), composed of eddy identifications and trajectories produced from altimetric maps. The detection method is a heritage of the py-eddy-tracker algorithm developed by Mason et al. (2014), optimized to handle large datasets, and thus long time series, efficiently. These products improve on the META2.0 product, produced by SSALTO/DUACS and distributed by AVISO+ with support from CNES, in collaboration with Oregon State University with support from NASA, and based on Chelton et al. (2011). META3.1exp provides supplementary information such as the mesoscale eddy shapes, with the eddy edges and their maximum-speed contours, and the eddy speed profiles from center to edge. The tracking algorithm is based on overlapping contours, includes virtual observations, and filters out the shortest trajectories. The absolute dynamic topography field is now used for eddy detection, instead of the sea level anomaly maps, to better represent ocean dynamics in the more energetic areas and close to coasts and islands. To evaluate the impact of the changes from META2.0 to META3.1exp, a comparison methodology has been applied. The similarity coefficient, based on the ratio between the eddies' overlap and their cumulative area, allows an extensive comparison of the different datasets in terms of geographic distribution, statistics of the main physical characteristics, changes in the lifetime of the trajectories, etc. After evaluating the impact of each change separately, we conclude that the major differences between META3.1exp and META2.0 are due to the change in the detection algorithm. META3.1exp contains smaller eddies and trajectories lasting at least 10 days that were not available in the distributed META2.0 product.
Nevertheless, 55% of the structures in META2.0 have a similar counterpart in META3.1exp, ensuring continuity between the two products, and the physical characteristics of the common eddies are close. Geographically, the eddy distribution differs mainly in the strong-current regions, where the mean dynamic topography gradients are sharp. The additional information on the eddy contours allows more accurate collocation of mesoscale structures with data from other sources, so META3.1exp is recommended for multi-disciplinary applications.
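The similarity coefficient described above can be sketched for eddies rasterized as boolean masks on a common grid. We assume here that "cumulative area" means the area of the union of the two eddies, which makes the coefficient the familiar intersection-over-union; the synthetic masks are for illustration only.

```python
import numpy as np

def similarity_coefficient(mask_a, mask_b):
    """Ratio of the eddies' overlap area to their cumulative (union) area."""
    overlap = np.logical_and(mask_a, mask_b).sum()
    cumulative = np.logical_or(mask_a, mask_b).sum()
    return overlap / cumulative if cumulative else 0.0

# Two synthetic eddies on a 10x10 grid, offset so they partially overlap
a = np.zeros((10, 10), dtype=bool); a[:, 0:6] = True   # 60 cells
b = np.zeros((10, 10), dtype=bool); b[:, 3:9] = True   # 60 cells
s = similarity_coefficient(a, b)   # overlap = 30 cells, union = 90 cells
```

A coefficient of 1 means identical eddies, 0 means no overlap; thresholding this value is what allows the matching of structures between META2.0 and META3.1exp.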

2021 ◽  
Nicolai Ree ◽  
Andreas H. Göller ◽  
Jan H. Jensen

We present RegioML, an atom-based machine learning model for predicting the regioselectivities of electrophilic aromatic substitution reactions. The model relies on CM5 atomic charges computed using semiempirical tight binding (GFN1-xTB) combined with the ensemble decision tree variant light gradient boosting machine (LightGBM). The model is trained and tested on 21,201 bromination reactions with 101K reaction centers, split into training, test, and out-of-sample datasets with 58K, 15K, and 27K reaction centers, respectively. The accuracy is 93% for the test set and 90% for the out-of-sample set, while the precision (the percentage of positive predictions that are correct) is 88% and 80%, respectively. The test-set performance is very similar to that of the graph-based WLN method developed by Struble et al. (React. Chem. Eng. 2020, 5, 896), though the comparison is complicated by the possibility that some of the test and out-of-sample molecules were used to train WLN. RegioML outperforms our physics-based RegioSQM20 method (J. Cheminform. 2021, 13:10), whose precision is only 75%, and even for the out-of-sample dataset RegioML slightly outperforms RegioSQM20. The good performance of RegioML and WLN is in large part due to the large datasets available for this type of reaction. However, for reactions with little experimental data, physics-based approaches like RegioSQM20 can be used to generate synthetic data for model training. We demonstrate this by showing that the performance of RegioSQM20 can be reproduced by an ML model trained on RegioSQM20-generated data.
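The accuracy and precision figures quoted above follow the standard confusion-matrix definitions. As a small illustration, the confusion counts below are made up (chosen so that the results reproduce the quoted 93% accuracy and 88% precision for a hypothetical 1000 reaction centers); they are not the paper's data.

```python
def accuracy_precision(tp, fp, tn, fn):
    """Accuracy = share of all predictions that are correct;
    precision = share of positive predictions that are correct."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    return accuracy, precision

# Hypothetical counts for 1000 reaction centers (illustrative only)
acc, prec = accuracy_precision(tp=220, fp=30, tn=710, fn=40)
```

The gap between the two metrics is the point of the abstract's comparison: a model can be accurate overall while still mislabeling a meaningful fraction of the sites it predicts to be reactive.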

2021 ◽  
Merit P. Ekeregbe ◽  
Mina S. Khalaf ◽  
Robello Samuel

Abstract Although visual data analytics using image processing is one of the fastest-growing research areas today and is widely applied in many fields, it is not fully utilized in the petroleum industry. This study is inspired by medical image segmentation for detecting tumor cells. The paper uses a supervised machine learning technique based on video analytics to identify bit dullness, which can be used in the drilling industry in place of the subjective screening approach. The evaluation of bit performance can be affected by subjective judgments of the degree of dullness; the present video-analytics approach grades bit dullness without user subjectivity. The approach relies on datasets of sufficient quantity and quality, separated into training, testing, and validation datasets. Because of the large datasets, Google Colaboratory was used, as it provides online access to its Graphics Processing Units (GPUs) for processing the bit datasets; processing time and resource consumption are thereby minimized, and the procedure is automated without any installation. After the bit is pulled out and cleaned, a 360° video is taken around the bit, moving up and down, and the footage is compared against the green (unused) bit. With this approach, multiple video datasets are not required. The algorithm was validated with new sets of bit videos and the results were satisfactory. Each screened bit is identified as dull or otherwise with the aid of a bounding box stamped with the confidence level (range 0.5–1) that the algorithm assigns to its decision on the identified object. This method can also screen multiple bits stored in a single place; where several drill bits are to be screened, manual grading would be a huge task requiring considerable resources.
This model and algorithm take only a few minutes to screen and grade several bits as the videos are passed through the algorithm. Grading with video also proved much better than with a single image, because the contextual information extracted is much richer at the level of the entire video, per segment, per shot, and per frame. The methodology is made robust so that the video model test starts without error, and the processing time for a single video screening is short. The work developed here is probably the first to handle dull-bit grading using video analytics. As more such datasets become available, IADC bit characterization will evolve into an automated process.
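The confidence-stamped bounding boxes described above imply a simple post-processing step: keeping only detections whose confidence falls in the reported 0.5–1 range. The detection tuples and threshold below are illustrative assumptions, not the authors' model output.

```python
def filter_detections(detections, threshold=0.5):
    """Keep (label, box, confidence) detections whose confidence meets the
    0.5-1 range used for dull-bit grading; box is (x1, y1, x2, y2)."""
    return [d for d in detections if d[2] >= threshold]

# Hypothetical detector output for one video frame
detections = [
    ("dull",  (10, 20, 80, 90),   0.91),
    ("dull",  (15, 25, 70, 85),   0.42),   # below threshold, discarded
    ("green", (100, 20, 160, 90), 0.77),
]
kept = filter_detections(detections)
```

Real detection pipelines typically add non-maximum suppression after this filter to merge overlapping boxes of the same object; only the thresholding step is shown here.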

2021 ◽  
Heinrich Peters ◽  
Zachariah Marrero ◽  
Samuel D. Gosling

As human interactions have shifted to virtual spaces and as sensing systems have become more affordable, an increasing share of people's everyday lives can be captured in real time. The availability of such fine-grained behavioral data from billions of people has the potential to enable great leaps in our understanding of human behavior. However, such data also pose challenges to engineers and behavioral scientists alike, requiring a specialized set of tools and methodologies to generate psychologically relevant insights. In particular, researchers may need to utilize machine learning techniques to extract information from unstructured or semi-structured data, reduce high-dimensional data to a smaller number of variables, and efficiently deal with extremely large sample sizes. Such procedures can be computationally expensive, requiring researchers to balance computation time with processing power and memory capacity. Whereas modelling procedures on small datasets will usually take mere moments to execute, applying them to big data can take much longer, with typical execution times spanning hours, days, or even weeks depending on the complexity of the problem and the resources available. Seemingly subtle decisions regarding preprocessing and analytic strategy can end up having a huge impact on the viability of executing analyses within a reasonable timeframe. Consequently, researchers must anticipate potential pitfalls regarding the interplay of their analytic strategy with memory and computational constraints. Many researchers who are interested in using "big data" report having problems learning about new analytic methods or software, finding collaborators with the right skills and knowledge, and getting access to commercial or proprietary data for their research (Metzler et al. 2016).
This chapter aims to serve as a practical introduction for psychologists who want to use large datasets and datasets from non-traditional data sources in their research (i.e., data not generated in the lab or through conventional surveys). First, we discuss the concept of big data and review some of the theoretical challenges and opportunities that arise with the availability of ever larger amounts of data. Second, we discuss practical implications and best practices with respect to data collection, data storage, data processing, and data modelling for psychological research in the age of big data.

2021 ◽  
Matthew Seto ◽  
Kristin Medlin

What does it mean to be in a strong partnership? Using Collaboratory's national dataset of community engagement data, we explored partnerships between higher education institutions and the community organizations with which they are partnered. Our goals were to (1) understand which quantitative characteristics from Collaboratory denote 'strong' community-university partnerships, (2) use those characteristics to create an algorithmic assessment model to identify the strongest partnerships in the Collaboratory dataset, and (3) reveal common themes that practitioners can leverage to cultivate stronger and more resilient partnerships. With input from Collaboratory administrators, community engagement professionals, and institutional research team members, we identified four quantitative data points in Collaboratory data that we combined into a partnership strength model. The model identified 99 out of 2,083 community-university partnerships that might be classified as high-strength. The model's results represent an initial jumping-off point for future research, including qualitative assessment of the 99 strongest partnerships to validate the model. Additionally, we argue that quantitative assessment of qualitative partnerships is by no means a silver bullet, but instead represents a pragmatic method for high-level assessment and quick filtering of large datasets of qualitative partnership data that would otherwise be prohibitively time-consuming.

2021 ◽  
Vol 40 (2) ◽  
Kathleen M. Hemeon ◽  
Eric N. Powell ◽  
Eric Robillard ◽  
Sara M. Pace ◽  
Theresa E. Redmond ◽  
