similar cluster
Recently Published Documents


TOTAL DOCUMENTS

33
(FIVE YEARS 7)

H-INDEX

8
(FIVE YEARS 1)

Author(s):  
Randa Mohamed Abd El-ghafar ◽  
Ali H. El-Bastawissy ◽  
Eman S. Nasr ◽  
Mervat H. Gheith ◽  
...  

Entity Resolution (ER) is defined as the process of identifying records/objects that correspond to real-world objects/entities. To define a good ER approach, the schema of the data should be well known. In addition, schema alignment of multiple datasets is not an easy task and may require either a domain expert or an ML algorithm to select which attributes to match. Schema-agnostic meta-blocking tries to solve this problem by considering each token as a blocking key regardless of the attribute it appears in. It may also be coupled with meta-blocking to reduce the number of false negatives. However, it requires an exact match of tokens, which rarely occurs in actual datasets, and it results in very low precision. To overcome these issues, we propose a novel and efficient ER approach for big data implemented in Apache Spark. The proposed approach avoids schema alignment, as it treats the attributes as a bag of words and generates a set of n-grams, which is transformed to vectors. The generated vectors are compared using a chosen similarity measure. The proposed approach is generic, as it can accept all types of datasets. It consists of five consecutive sub-modules: 1) Dataset acquisition, 2) Dataset pre-processing, 3) Setting selection criteria, where all settings of the proposed approach are selected, such as the blocking key, the significant attributes, the NLP techniques, the ER threshold, and the ER scenario, 4) ER pipeline construction, and 5) Clustering, where similar records are grouped into the similar cluster. The ER pipeline can accept two types of attributes: the Weighted Attributes (WA) or the Compound Attributes (CA). In addition, it accepts all the settings selected in the third module. The pipeline consists of five phases. Phase 1) Generating the tokens composing the attributes. Phase 2) Generating n-grams of length n. 
Phase 3) Applying the hashing Term Frequency (TF) to convert each set of n-grams to a fixed-length feature vector. Phase 4) Applying Locality Sensitive Hashing (LSH), which maps similar input items to the same buckets with a higher probability than dissimilar input items. Phase 5) Classifying the objects as duplicates or not according to the calculated similarity between them. We introduced seven different scenarios as input to the ER pipeline. To minimize the number of comparisons, we proposed the length filter, which greatly contributes to improving the effectiveness of the proposed approach, as it achieves the highest F-measure with the existing computational resources and scales well with the available working nodes. Three results have been revealed: 1) Using the CA in the different scenarios achieves better results than the single WA in terms of efficiency and effectiveness. 2) Scenarios 3 and 4 achieve the best performance time because using Soundex and stemming contributes to reducing the performance time of the proposed approach. 3) Scenario 7 achieves the highest F-measure because, by utilizing the length filter, we only compare records whose string lengths are within a pre-determined percentage of increase or decrease of each other. LSH is used to map similar input items to the same buckets with a higher probability than dissimilar ones. It takes numHashTables as a parameter. Increasing the number of candidate pairs with the same numHashTables will reduce the accuracy of the model. Utilizing the length filter helps to minimize the number of candidates, which in turn increases the accuracy of the approach.
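The five pipeline phases and the length filter described in this abstract can be illustrated with a minimal, single-machine sketch. It is not the authors' Apache Spark implementation: character trigrams, an MD5-based hashing trick, MinHash as the LSH family, Jaccard similarity, and all function names, thresholds, and sample records are assumptions chosen only for illustration. The `num_hash_tables` argument plays the role that the abstract assigns to numHashTables.

```python
import hashlib
from itertools import combinations


def char_ngrams(text, n=3):
    """Phases 1-2: normalise the text and return its set of character n-grams."""
    s = " ".join(text.lower().split())
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def stable_hash(token, seed=0):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per run)."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hashed_features(ngrams, num_features=1 << 12):
    """Phase 3: hashing trick; returns the non-zero indices of a binary vector."""
    return {stable_hash(g) % num_features for g in ngrams}


def minhash_signature(features, num_hash_tables=5):
    """Phase 4: one MinHash value per hash table (illustrative LSH family)."""
    return tuple(min(stable_hash(f, seed) for f in features)
                 for seed in range(1, num_hash_tables + 1))


def jaccard(a, b):
    """Jaccard similarity of two index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def length_filter(a, b, tolerance=0.3):
    """Keep a pair only if the string lengths differ by at most `tolerance`."""
    longer = max(len(a), len(b))
    return longer == 0 or abs(len(a) - len(b)) <= tolerance * longer


def resolve(records, threshold=0.5, num_hash_tables=5, tolerance=0.3):
    """Phase 5: return the record-id pairs classified as duplicates."""
    feats = {rid: hashed_features(char_ngrams(t)) for rid, t in records.items()}
    sigs = {rid: minhash_signature(f, num_hash_tables)
            for rid, f in feats.items() if f}
    buckets = {}
    for rid, sig in sigs.items():
        for table, value in enumerate(sig):
            buckets.setdefault((table, value), []).append(rid)
    candidates = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            # The length filter prunes candidates before any similarity is computed.
            if length_filter(records[a], records[b], tolerance):
                candidates.add((a, b))
    return sorted(p for p in candidates
                  if jaccard(feats[p[0]], feats[p[1]]) >= threshold)


if __name__ == "__main__":
    sample = {  # hypothetical records, attributes concatenated as a bag of words
        "r1": "Jon Smith, 12 Main Street",
        "r2": "John Smith, 12 Main Street",
        "r3": "Acme Hydraulic Pumps Ltd",
    }
    print(resolve(sample, num_hash_tables=20))
```

In this sketch two records become candidates when they share a bucket in at least one hash table, so raising `num_hash_tables` admits more candidate pairs; the length filter then discards pairs before the similarity computation, which is the effect the abstract attributes to it.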


2021 ◽  
Vol 6 (3) ◽  
pp. 65612
Author(s):  
Amelia Nugrahaningrum ◽  
R.C. Hidayat Soesilohadi

Drepanosticta spatulifera is a Javan endemic damselfly. The population is spread unevenly in the Petungkriyono Forest and is threatened due to environmental pressure. The aim of this research was to determine the variation in movement, dispersal, and morphometrics among subpopulations of D. spatulifera. Movement and dispersal data were collected using Mark Release Recapture (MRR) for six weeks from early August until mid-September 2020. Morphometric samples were collected during the last week of the MRR survey, and 46 individuals were measured for 12 continuous characters. During the MRR survey, 596 males of D. spatulifera were marked and 302 were recaptured. D. spatulifera showed short movement and dispersal distances, so no individuals were found moving between subpopulations. The distance moved between successive captures and the net lifetime movement were predominantly less than or equal to five meters. The duration of the MRR survey had a low correlation with the dispersal distance of D. spatulifera. In the morphometric variations, closer subpopulations tended to have a similar cluster of morphometric characters. The distance moved between successive captures and the wing size at Mangli Stream were significantly different from those at other sites. The subpopulation of Mangli, the farthest and higher-altitude site, had the greatest distance moved, the widest dispersal, and the largest wing size. Our study showed that D. spatulifera is an extremely sedentary damselfly, which will enhance inbreeding and vulnerability to extinction. Therefore, the interaction between the subpopulations of D. spatulifera in the Petungkriyono Forest needs more attention.


2021 ◽  
Vol 12 ◽  
Author(s):  
Xuekai Wang ◽  
Xinxin Cao ◽  
Han Liu ◽  
Linna Guo ◽  
Yanli Lin ◽  
...  

Lactic acid bacteria occupy an important position in silage microorganisms, and the effects of exogenous lactic acid bacteria on silage quality have been widely studied. Microbial metabolism has been proved as an indicator of substrate utilization by microorganisms. Paper mulberry is rich in free carbohydrate, amino acids, and other components, with the potential to be decomposed and utilized. In this study, changes in the microbial metabolism characteristics of paper mulberry silage with Lactiplantibacillus plantarum (LP) and Lentilactobacillus buchneri (LB) were studied along with a control (CK) using BIOLOG ECO microplates. The results showed that average well-color development (AWCD), Shannon diversity, Shannon evenness, and Simpson diversity exhibited significant temporal trends. LB and LP responded differently in the early ensiling phase, and the AWCD of LB was higher than LP at 7 days. Principal component analysis revealed that CK, LB, and LP samples initially clustered at 3 days and then moved into another similar cluster after 15 days. Overall, the microplates methodology applied in this study offers important advantages, not least in terms of accuracy.
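The indices named in this abstract can be computed from plate readings as in the following minimal sketch. The definitions used here (AWCD as the mean control-corrected optical density, Shannon diversity over the proportional well responses, evenness as Shannon diversity divided by the log of the number of utilised substrates, and Simpson diversity as one minus the sum of squared proportions) are commonly used forms and are an assumption, since the abstract does not state which variants the authors applied. The function name and input values are hypothetical.

```python
import math


def ecoplate_indices(well_od, control_od):
    """Compute AWCD and diversity indices from substrate-well optical densities.

    well_od    -- optical densities of the carbon-source wells (31 on an ECO plate)
    control_od -- optical density of the water control well
    """
    # Subtract the control and clip negative values to zero.
    corrected = [max(od - control_od, 0.0) for od in well_od]
    total = sum(corrected)
    if total == 0:
        return {"awcd": 0.0, "shannon": 0.0, "evenness": 0.0, "simpson": 0.0}

    awcd = total / len(corrected)
    # Proportional response of each utilised substrate.
    props = [c / total for c in corrected if c > 0]
    shannon = -sum(p * math.log(p) for p in props)
    richness = len(props)
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    simpson = 1.0 - sum(p * p for p in props)
    return {"awcd": awcd, "shannon": shannon,
            "evenness": evenness, "simpson": simpson}


if __name__ == "__main__":
    # Hypothetical readings for a handful of wells, for illustration only.
    print(ecoplate_indices([0.82, 0.40, 0.15, 1.10, 0.09], control_od=0.10))
```

Computing these values per sampling day for each treatment would give the kind of temporal series the abstract describes.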


2020 ◽  
Vol 21 (3) ◽  
Author(s):  
Nur Indah Julisaniah ◽  
Suharjono ◽  
Retno Mastuti ◽  
Estri Laras Arumingtyas

Abstract. Julisaniah NI, Suharjono, Mastuti R, Arumingtyas EL. 2020. Coat protein gene of a PStV-Bm isolate from West Nusa Tenggara, Indonesia. Biodiversitas 21: 903-909. Peanut stripe virus (PStV) is a single-stranded positive-sense RNA virus capable of infecting peanut plants. An isolate of PStV (PStV-Bm) was collected from a peanut field in the Bima District, West Nusa Tenggara Province, Indonesia, and the coat protein (CP) gene of this virus (CP-PStV) was extracted from the viral RNA and analyzed using reverse transcription-polymerase chain reaction methods. The CP-PStV gene of PStV-Bm was aligned with several PStV genes deposited in GenBank (http://www.ncbi.nml.nih.gov). Based on the nucleotide sequence of the CP gene, PStV-Bm was grouped into a similar cluster with other PStVs that originated from Indonesia, with a similarity index ranging from 96.8% to 98.9%. Genetic similarity (about 96.1%) was also observed between PStV-Bm and PStV from the USA. This genetic similarity indicated that viruses from adjacent regions have high genetic relationships. Some amino acid differences were observed in PStV-Bm that may be typical of this strain.


2020 ◽  
Vol 105 (4) ◽  
pp. 1186-1195 ◽  
Author(s):  
Marina A Skiba ◽  
Robin J Bell ◽  
Rakibul M Islam ◽  
Md Nazmul Karim ◽  
Susan R Davis

Abstract. Context: An important element of the diagnosis of polycystic ovary syndrome is hyperandrogenism. Objective: To determine the distribution of modified Ferriman-Gallwey (mF-G) scores, as a measure of facial and body hair growth, and associations between the mF-G scores and serum androgen concentrations, including 11-oxygenated androgens. Design: Cross-sectional study of non-health-care-seeking women, aged 18 to 39 years, recruited from the eastern states of Australia from November 2016 to July 2017. Participants and measurements: Participants provided an mF-G self-assessment that corresponded to their appearance when not using treatment for excess hair. Androgens were measured in 710 women by liquid chromatography and tandem mass spectrometry. Results: The distribution of the mF-G scores was right-skewed. The median (range) mF-G score of all participants (73.1% Caucasian) was 5 (0–36). The mF-G scores were negatively associated with age (rs = 0.124; P < 0.0001) and positively associated with body mass index (BMI) (rs = 0.073; P < 0.0001). Only androstenedione remained significantly associated with mF-G scores when controlling for age and BMI. Cluster analysis identified 2 groups with mF-G score of < 10 and ≥ 10. Repeating the cluster analysis using the combined vector of mF-G score and androstenedione returned a similar cluster structure, and again separated the 2 groups at a mF-G score < 10 versus ≥ 10. Conclusions: A self-assessed mF-G score ≥ 10 is indicative of excess body hair. Androstenedione, as well as testosterone, should be measured when hyperandrogenism is being evaluated. The lack of association between mF-G scores and the 11-oxygenated androgens highlights the need for a better understanding of these steroids.


2020 ◽  
Author(s):  
E. Ribeiro

We live in an environment full of sounds. Studying and analyzing the impacts of musical practice, and showing mathematically that this practice has significant cognitive effects on the human brain, are the main motivations of this thesis. In more detail, the aim of this thesis was to develop a methodology capable of characterizing the cortical activation patterns generated during the recording of Electroencephalogram (EEG) signals through statistical pattern recognition techniques, in addition to analyzing the acoustic features commonly employed in this context, in order to reveal whether they are statistically relevant. A computational framework was initially developed to address a two-group classification problem based on data from EEG signals extracted from volunteer musicians and non-musicians during an auditory task, to predict whether a particular person is a musician or not. The results showed that it is possible to classify the sampled groups with accuracy ranging from 69.2% to 93.8%, allowing not only a better description of the neural activation patterns that characterize the musician and non-musician volunteers, but also highlighting how these patterns change in the transition regions and decision boundaries that separate the sampled groups, indicating a plausible linear separation between these groups. Additionally, as another original contribution of this thesis, the audio signals from a public and internationally referenced database containing 1000 musical excerpts from 10 different genres were analyzed to investigate numerical similarities between the short-term acoustic features extracted from the audio and commonly explored in related literature. The results obtained show a similar cluster behavior among these features for all analyzed music, regardless of the musical genre. 
It was then possible to discuss in an unprecedented way the relationship between the way the acoustic features of songs are described in the literature and how they are grouped statistically, revealing that the information we use to cognitively process these sound features is implicitly statistical. Although all the methods described and implemented in this thesis are based on EEG signals, it is believed that they can be extended to other types of multivariate cognitive signals, such as functional Magnetic Resonance Imaging (fMRI), allowing a greater cortical and sub-cortical understanding of the functioning of our brain during listening.


2018 ◽  
Author(s):  
Christopher A Sanders ◽  
Stephen M Schueller ◽  
Acacia C Parks ◽  
Ryan T Howell

BACKGROUND A critical issue in understanding the benefits of Web-based interventions is the lack of information on the sustainability of those benefits. Sustainability in studies is often determined using group-level analyses that might obscure our understanding of who actually sustains change. Person-centric methods might provide a deeper knowledge of whether benefits are sustained and who tends to sustain those benefits. OBJECTIVE The aim of this study was to conduct a person-centric analysis of longitudinal outcomes, examining well-being in participants over the first 3 months following a Web-based happiness intervention. We predicted we would find distinct trajectories in people’s pattern of response over time. We also sought to identify what aspects of the intervention and the individual predicted an individual’s well-being trajectory. METHODS Data were gathered from 2 large studies of Web-based happiness interventions: one in which participants were randomly assigned to 1 of 14 possible 1-week activities (N=912) and another wherein participants were randomly assigned to complete 0, 2, 4, or 6 weeks of activities (N=1318). We performed a variation of K-means cluster analysis on trajectories of life satisfaction (LS) and affect balance (AB). After clusters were identified, we used exploratory analyses of variance and logistic regression models to analyze groups and compare predictors of group membership. RESULTS Cluster analysis produced similar cluster solutions for each sample. In both cases, participant trajectories in LS and AB fell into 1 of 4 distinct groups. 
These groups were as follows: those with high and static levels of happiness (n=118, or 42.8%, in Sample 1; n=306, or 52.8%, in Sample 2), those who experienced a lasting improvement (n=74, or 26.8% in Sample 1; n=104, or 18.0%, in Sample 2), those who experienced a temporary improvement but returned to baseline (n=37, or 13.4%, in Sample 1; n=82, or 14.2%, in Sample 2), and those with other trajectories (n=47, or 17.0%, in Sample 1; n=87, or 15.0% in Sample 2). The prevalence of depression symptoms predicted membership in 1 of the latter 3 groups. Higher usage and greater adherence predicted sustained rather than temporary benefits. CONCLUSIONS We revealed a few common patterns of change among those completing Web-based happiness interventions. A noteworthy finding was that many individuals began quite happy and maintained those levels. We failed to identify evidence that the benefit of any particular activity or group of activities was more sustainable than any others. We did find, however, that the distressed portion of participants was more likely to achieve a lasting benefit if they continued to practice, and adhere to, their assigned Web-based happiness intervention.
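Clustering of well-being trajectories, as described in the METHODS above, can be sketched with plain K-means (Lloyd's algorithm). The abstract reports "a variation of K-means" without giving details, so this is a stand-in rather than the authors' method. The function names, the three-time-point layout, and the sample scores are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, init=None):
    """Plain Lloyd's K-means; the first k points seed the centroids by default."""
    centroids = [list(p) for p in (init if init is not None else points[:k])]
    labels = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each trajectory joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each centroid becomes the mean trajectory of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical life-satisfaction scores at baseline, post-test, follow-up.
    trajectories = [
        [8.0, 8.0, 8.0],  # high and static
        [4.0, 6.0, 6.0],  # lasting improvement
        [4.0, 6.0, 4.0],  # temporary improvement
        [8.2, 8.1, 8.0],
        [4.2, 6.1, 6.3],
        [3.9, 6.2, 4.1],
    ]
    labels, centroids = kmeans(trajectories, k=3)
    print(labels)
```

Treating each participant's repeated measurements as one vector is what lets a person-centric analysis separate lasting from temporary improvement, which group-level means would average away.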


2018 ◽  
Vol 55 (4) ◽  
pp. 217-229
Author(s):  
Irina Filina ◽  
Kris Guthrie ◽  
Mindi Searls ◽  
Caroline Burberry

A sudden spike in earthquake events has been observed in central Nebraska. Since April 2018, 26 earthquakes with equivalent moment magnitudes from 2.7 to 4.1 have occurred, clustered tightly in Custer County. A similar cluster of 24 earthquakes with equivalent moment magnitudes from 2.6 to 3.7 occurred in Jewell County in northern Kansas in 2017. We have compiled an earthquake database for Nebraska and parts of adjacent states from different sources to determine whether these recent earthquake spikes are consistent with historic seismicity. We identified two historic earthquake clusters occurring in our study area. The first contained 32 events and was active in Red Willow County in southwestern Nebraska from 1977 to 1982. As it coincides spatially with the Sleepy Hollow oil field, it may be related to enhanced oil recovery from that field, although it is also located at the edge of the Chadron-Cambridge Arch. The second historical earthquake cluster is located in Pawnee and Richardson counties in southeastern Nebraska and includes eight earthquakes with equivalent moment magnitudes of 2.3 to 2.8 that occurred from 1982 to 1989 over the Nemaha uplift and appear to be related to the Humboldt fault. We note an increase with time in both the maximum magnitude and the cumulative seismic moment per cluster. We have also used gravity and magnetic fields to map potential basement faults in the study area. Our analysis shows that the two recent earthquake spikes are aligned with the proposed basement faults. Despite this correlation, the cause of this sudden spike in seismicity is not well understood, as the stresses that might reactivate these basement faults are unknown. In addition, both recent clusters are distant from oil and gas operations. More seismic stations are necessary in central Nebraska in order to better detect focal depths and faulting style in the ongoing cluster of earthquakes and investigate possible causes.
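The cumulative seismic moment per cluster mentioned in this abstract can be obtained from moment magnitudes with the standard relation M0 = 10^(1.5 Mw + 9.1), with M0 in newton-metres. The sketch below assumes that relation. The function names and the magnitude list are hypothetical, because the abstract gives only the magnitude ranges and not the catalog itself.

```python
import math


def seismic_moment(mw):
    """Seismic moment in N*m from moment magnitude (standard Mw definition)."""
    return 10 ** (1.5 * mw + 9.1)


def cumulative_moment(magnitudes):
    """Sum of the seismic moments of all events in a cluster."""
    return sum(seismic_moment(m) for m in magnitudes)


def equivalent_magnitude(m0):
    """Moment magnitude of a single event releasing the moment m0 (N*m)."""
    return (math.log10(m0) - 9.1) / 1.5


if __name__ == "__main__":
    # Hypothetical cluster spanning the 2.7 to 4.1 range quoted in the abstract.
    cluster = [2.7, 2.9, 3.1, 3.4, 4.1]
    total = cumulative_moment(cluster)
    print(f"cumulative moment: {total:.3e} N*m")
    print(f"equivalent single-event Mw: {equivalent_magnitude(total):.2f}")
```

Because the moment grows by a factor of 10^1.5 (about 32) per magnitude unit, the largest event in a cluster usually dominates its cumulative moment.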


Archaea ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-7 ◽  
Author(s):  
Yanfen Zhang ◽  
Anzhou Ma ◽  
Wenzong Liu ◽  
Zhihui Bai ◽  
Xuliang Zhuang ◽  
...  

Recently, a new oxygenic pathway has been proposed based on the disproportionation of NO with putative NO dismutase (Nod). In addition to a new process in nitrogen cycling, this process provides ecological advantages for the degradation of substrates in anaerobic conditions, which is of great significance for wastewater treatment. However, the Nod distribution in aquatic environments is rarely investigated. In this study, we obtained the nod genes with an abundance of 2.38 ± 0.96 × 10^5 copies per gram of dry soil from the Zoige wetland and aligned the molecular characteristics in the corresponding Nod sequences. These Nod sequences were not only found existing in NC10 bacteria, but were also found forming some other clusters with Nod sequences from a WWTP reactor or contaminated aquifers. Moreover, a new subcluster in the aquifer-similar cluster was even dominant in the Zoige wetland and was named the Z-aquifer subcluster. Additionally, soils from the Zoige wetland showed a high potential rate (10.97 ± 1.42 nmol of CO2 per gram of dry soil per day) for nitrite-dependent anaerobic methane oxidation (N-DAMO) with low abundance of NC10 bacteria, which may suggest a potential activity of Nod in other clusters when considering the dominance of the Z-aquifer subcluster Nod. In conclusion, we verified the occurrence of Nod in an alpine wetland for the first time and found a new subcluster to be dominant in the Zoige wetland. Moreover, this new subcluster of Nod may even be active in the N-DAMO process in this alpine wetland, which needs further study to confirm.

