A Data Extraction Algorithm from Open Source Software Project Repositories for Building Duration Estimation Models: Case Study of Github

2020 ◽  
Vol 11 (6) ◽  
pp. 31-46
Author(s):  
Donatien K. Moulla ◽  
Alain Abran ◽  
Kolyang

Software project estimation is important for allocating resources and planning a reasonable work schedule. Estimation models are typically built using data from completed projects. While organizations have their own historical data repositories, it is difficult to obtain their collaboration due to privacy and competitive concerns. To overcome the issue of public access to private data repositories, this study proposes an algorithm to extract sufficient data from the GitHub repository for building duration estimation models. More specifically, this study extracts and analyses historical data on WordPress projects to estimate OSS project duration using commits as an independent variable, as well as an improved classification of contributors based on the number of active days for each contributor within a release period. The results indicate that duration estimation models using data from OSS repositories perform well and partially solve the problem of lack of data encountered in empirical research in software engineering.
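The contributor-classification step described in this abstract can be sketched as follows. The active-day thresholds and the sample commit log are illustrative assumptions, not the paper's actual cut-offs:

```python
# Sketch of classifying contributors by number of active days within a
# release period. The 0.8 / 0.2 thresholds are assumed for illustration.
from collections import defaultdict
from datetime import date

def classify_contributors(commits, release_days):
    """commits: list of (author, date) pairs within one release period."""
    active_days = defaultdict(set)
    for author, day in commits:
        active_days[author].add(day)
    classes = {}
    for author, days in active_days.items():
        ratio = len(days) / release_days
        if ratio >= 0.8:            # assumed full-time threshold
            classes[author] = "full-time"
        elif ratio >= 0.2:          # assumed part-time threshold
            classes[author] = "part-time"
        else:
            classes[author] = "occasional"
    return classes

# invented commit log for a 30-day release period
commits = [("alice", date(2020, 1, d)) for d in range(1, 28)] + \
          [("bob", date(2020, 1, d)) for d in (1, 5, 9, 13, 17, 21, 25, 29)] + \
          [("carol", date(2020, 1, 5))]
print(classify_contributors(commits, release_days=30))
```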

Author(s):  
Donatien Koulla Moulla ◽  
Alain Abran ◽  
Kolyang

For software organizations that rely on Open Source Software (OSS) to develop customer solutions and products, it is essential to accurately estimate how long it will take to deliver the expected functionalities. While OSS is supported by government policies around the world, most of the research on software project estimation has focused on conventional projects with commercial licenses. OSS effort estimation is challenging since OSS participants do not record effort data in OSS repositories. However, OSS data repositories contain the dates of the participants’ contributions, and these can be used for duration estimation. This study analyses historical data on WordPress and Swift projects to estimate OSS project duration using either commits or lines of code (LOC) as the independent variable. This study first proposes an improved classification of contributors based on the number of active days for each contributor in the development period of a release. For the WordPress and Swift OSS project environments, the results indicate that duration estimation models using the number of commits as the independent variable perform better than those using LOC. The estimation model for full-time contributors gives an estimate of the total duration, while the models with part-time and occasional contributors lead to better estimates of project duration, for both the commits data and the LOC data.
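A duration estimation model of the kind described, with commits as the independent variable, can be sketched as a simple linear regression; the data points below are invented for illustration and are not the WordPress or Swift figures:

```python
# Minimal sketch: least-squares fit of release duration (days) on the
# number of commits, as one instance of a duration estimation model.
def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

commits =   [120, 340, 560, 800, 1010]   # commits per release (invented)
durations = [ 35,  70, 110, 150,  190]   # release duration in days (invented)

a, b = fit_linear(commits, durations)
print(f"duration ≈ {a:.3f} * commits + {b:.1f}")
print("predicted duration for 600 commits:", round(a * 600 + b))
```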


Electronics ◽  
2021 ◽  
Vol 10 (10) ◽  
pp. 1195
Author(s):  
Priya Varshini A G ◽  
Anitha Kumari K ◽  
Vijayakumar Varadarajan

Software Project Estimation is a challenging and important activity in developing software projects. Software Project Estimation includes Software Time Estimation, Software Resource Estimation, Software Cost Estimation, and Software Effort Estimation. Software Effort Estimation focuses on predicting the number of hours of work (effort in terms of person-hours or person-months) required to develop or maintain a software application. It is difficult to forecast effort during the initial stages of software development. Various machine learning and deep learning models have been developed to predict effort. In this paper, single-model approaches and ensemble approaches were considered for estimation. Ensemble techniques are combinations of several single models. The ensemble techniques considered for estimation were averaging, weighted averaging, bagging, boosting, and stacking. The stacking models considered and evaluated were stacking using a generalized linear model, stacking using a decision tree, stacking using a support vector machine, and stacking using random forest. The datasets considered for estimation were Albrecht, China, Desharnais, Kemerer, Kitchenham, Maxwell, and Cocomo81. The evaluation measures used were mean absolute error, root mean squared error, and R-squared. The results showed that the proposed stacking using random forest provides the best results compared with single-model approaches using machine or deep learning algorithms and other ensemble techniques.
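The best-performing setup described here, stacking with a random forest meta-learner, can be sketched with scikit-learn. The base learners and the tiny synthetic dataset are illustrative stand-ins, not the paper's exact configuration or the Albrecht/China/Desharnais datasets:

```python
# Sketch of stacking with a random forest as the meta-learner.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 3))                        # synthetic project features
y = 4 * X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(0, 1, 120)   # synthetic effort

stack = StackingRegressor(
    estimators=[("lin", LinearRegression()),
                ("svm", SVR()),
                ("tree", DecisionTreeRegressor(random_state=0))],
    final_estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    cv=5,  # out-of-fold base predictions feed the meta-learner
)
stack.fit(X[:100], y[:100])
print("MAE on held-out projects:",
      round(mean_absolute_error(y[100:], stack.predict(X[100:])), 2))
```

The `cv` parameter matters here: the meta-learner is trained on out-of-fold predictions of the base models, which limits leakage from base-model overfitting.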


BMJ Open ◽  
2017 ◽  
Vol 7 (9) ◽  
pp. e017567
Author(s):  
Shimels Hussien Mohammed ◽  
Mulugeta Molla Birhanu ◽  
Tesfamichael Awoke Sissay ◽  
Tesfa Dejenie Habtewold ◽  
Balewgizie Sileshi Tegegn ◽  
...  

Introduction: Individuals living in poor neighbourhoods are at a higher risk of overweight/obesity. There is no systematic review and meta-analysis study on the association of neighbourhood socioeconomic status (NSES) with overweight/obesity. We aimed to systematically review and meta-analyse the existing evidence on the association of NSES with overweight/obesity.
Methods and analysis: Cross-sectional, case–control and cohort studies published in English from inception to 15 May 2017 will be systematically searched using the following databases: PubMed, EMBASE, Web of Science and Google Scholar. Selection, screening, reviewing and data extraction will be done by two reviewers, independently and in duplicate. The Newcastle–Ottawa Scale (NOS) will be used to assess the quality of evidence. Publication bias will be checked by visual inspection of funnel plots and Egger’s regression test. Heterogeneity will be checked by Higgins’s method (I² statistics). Meta-analysis will be done to estimate the pooled OR. Narrative synthesis will be performed if meta-analysis is not feasible due to high heterogeneity of studies.
Ethics and dissemination: Ethical clearance is not required as we will be using data from published articles. Findings will be communicated through a publication in a peer-reviewed journal and presentations at professional conferences.
PROSPERO registration number: CRD42017063889.
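As an illustration of the kind of synthesis this protocol plans, fixed-effect inverse-variance pooling of odds ratios with Higgins's I² can be sketched as follows. The study-level numbers are invented, and the review may ultimately use a different (e.g. random-effects) model:

```python
# Sketch: pool log odds ratios by inverse-variance weighting and compute
# Higgins's I² from Cochran's Q. Study data below are invented.
import math

# (odds ratio, 95% CI lower, 95% CI upper) per study — invented
studies = [(1.40, 1.10, 1.78), (1.25, 0.95, 1.64), (1.60, 1.20, 2.13)]

log_ors, weights = [], []
for or_, lo, hi in studies:
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # SE recovered from CI width
    log_ors.append(math.log(or_))
    weights.append(1 / se ** 2)                      # inverse-variance weight

pooled = sum(w * lor for w, lor in zip(weights, log_ors)) / sum(weights)
q = sum(w * (lor - pooled) ** 2 for w, lor in zip(weights, log_ors))
i2 = max(0.0, (q - (len(studies) - 1)) / q) * 100 if q > 0 else 0.0

print(f"pooled OR = {math.exp(pooled):.2f}, I² = {i2:.0f}%")
```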


2018 ◽  
Vol 7 (2.28) ◽  
pp. 312
Author(s):  
Manu Kohli

Asset-intensive organizations have long searched for a framework model that would predict equipment failure in a timely manner. Timely prediction of equipment failure substantially reduces direct and indirect costs, unexpected equipment shut-downs, accidents, and unwarranted emission risk. In this paper, the author proposes a model that can predict equipment failure using data from the SAP Plant Maintenance module. To achieve that, the author applied a data extraction algorithm and numerous data manipulations to prepare a classification data model consisting of maintenance record parameters such as spare-parts usage, time elapsed since the last completed maintenance, the period to the next scheduled maintenance, and so on. Using the unsupervised learning technique of clustering, the author observed a class-to-cluster evaluation of 80% accuracy. After that, a classifier model was trained using various machine learning (ML) algorithms and subsequently tested on mutually exclusive data sets with the objective of predicting equipment breakdown. The classifier model using ML algorithms such as Support Vector Machine (SVM) and Decision Tree (DT) returned an accuracy and true positive rate (TPR) greater than 95% in predicting equipment failure. The proposed model acts as an advanced intelligent control system contributing to Cyber-Physical Systems for asset-intensive organizations.
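The class-to-cluster evaluation mentioned here can be sketched as follows: cluster the maintenance records without labels, then map each cluster to its majority class and measure agreement. The features and data are synthetic stand-ins for the SAP Plant Maintenance fields the paper uses:

```python
# Sketch of class-to-cluster evaluation with k-means on synthetic
# maintenance records (healthy vs. failure-prone equipment).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# columns: spare-parts usage, days since last maintenance (synthetic)
healthy = rng.normal([2, 30], [1, 5], size=(100, 2))
failing = rng.normal([9, 95], [1, 5], size=(100, 2))
X = np.vstack([healthy, failing])
y = np.array([0] * 100 + [1] * 100)   # true classes, hidden from clustering

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# majority-vote mapping from cluster id to class, then overall agreement
acc = sum(
    max(np.sum(y[clusters == c] == 0), np.sum(y[clusters == c] == 1))
    for c in (0, 1)
) / len(y)
print(f"class-to-cluster accuracy: {acc:.0%}")
```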


2017 ◽  
Vol 21 (2) ◽  
pp. 317-340 ◽  
Author(s):  
HENDRIK DE SMET ◽  
FREEK VAN DE VELDE

While it is undoubtedly true that historical data do not lend themselves well to the reproduction of experimental findings, the availability of increasingly extensive data sets has brought some experimenting within practical reach. This means that certain predictions based on a combination of synchronic observations and uniformitarian thinking are now testable. Synchronic evidence shows a negative correlation between analysability in morphologically complex words and various measures of frequency. It is therefore expected that when the frequency of morphologically complex items changes, their analysability will change along with this. If analysability decreases, this should in turn be reflected in decreasing sensitivity to priming by items with analogous composition. The latter prediction is in principle testable on diachronic data, offering a way of verifying the diachronic effect of frequency change on analysability. In this spirit, the present article examines the relation between changing frequency and priming sensitivity, as a proxy to analysability. This is done for a sample of 250 English ly-adverbs, such as roughly, blindly, publicly, etc. over the period 1950–2005, using data from the Hansard Corpus. Some of the expected relations between frequency and analysability can be shown to hold, albeit with great variation across lexical items. At the same time, much of the variation in our measure of analysability cannot be accounted for by frequency or frequency change alone.


2020 ◽  
Vol 13 (1) ◽  
pp. 40-68
Author(s):  
Heni Listiana

Discussions about children and female migrant workers (TKW) are always an interesting issue, especially in relation to child care. Using data collection techniques such as observation, interviews, and documentation, it was found that the parenting of migrant workers' children in Madura has formed a new structure with the emergence of a second mother. There are three types of second mothers, namely the grandmother, the bu de (the mother's brother or sister), and an older sibling of the TKW's child. They carry out the mother's roles, among them being a model of children's behavior that is easily observed and imitated, and acting as educator, consultant, and source of information. Nearly 77% of grandmothers become maternal substitutes for migrant workers' children. The grandmother is considered the right person to carry out childcare tasks. This structure is called the inner parenting structure, while the outer parenting structure takes the form of community participation in child care, namely good neighbors, the attention of the village head (Klebun), and the environment of friends and schools.


Web Services ◽  
2019 ◽  
pp. 803-821
Author(s):  
Thiago Poleto ◽  
Victor Diogho Heuer de Carvalho ◽  
Ana Paula Cabral Seixas Costa

Big Data is a radical shift or an incremental change for existing digital infrastructures, which include the toolset used to aid the decision-making process, such as information systems, data repositories, formal modeling, and analysis of decisions. This work aims to provide a theoretical approach to the elements necessary to apply the big data concept in the decision-making process. It identifies key components of big data to define an integrated model of decision making using data mining, business intelligence, decision support systems, and organizational learning, all working together to provide decision support with a reliable visualization of the decision-related opportunities. The concepts of data integration and semantics were also explored in order to demonstrate that, once mined, data must be integrated, ensuring conceptual connections and conveying meaning, so that they can be used appropriately for problem solving in decision making.


Author(s):  
Abdelrahman E. E. Eltoukhy ◽  
Ibrahim Abdelfadeel Shaban ◽  
Felix T. S. Chan ◽  
Mohammad A. M. Abdel-Aal

The outbreak of the 2019 novel coronavirus disease (COVID-19) has adversely affected many countries in the world. The unexpectedly large number of COVID-19 cases has disrupted the healthcare system in many countries and resulted in a shortage of bed spaces in hospitals. Consequently, predicting the number of COVID-19 cases is imperative for governments to take appropriate actions. The number of COVID-19 cases can be accurately predicted by considering historical data of reported cases alongside some external factors that affect the spread of the virus. In the literature, most of the existing prediction methods focus only on the historical data and overlook most of the external factors. Hence, the number of COVID-19 cases is inaccurately predicted. Therefore, the main objective of this study is to simultaneously consider historical data and the external factors. This can be accomplished by adopting data analytics, which includes developing a nonlinear autoregressive exogenous input (NARX) neural network-based algorithm. The viability and superiority of the developed algorithm are demonstrated by conducting experiments using data collected for the top five affected countries in each continent. The results show an improved accuracy when compared with existing methods. Moreover, the experiments are extended to make future predictions of the number of patients afflicted with COVID-19 during the period from August 2020 until September 2020. Using such predictions, both the governments and people of the affected countries can take appropriate measures to resume pre-epidemic activities.
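The NARX idea described here, predicting a series from its own lags plus lags of an exogenous input, can be sketched as follows. A small multilayer perceptron stands in for the paper's NARX network, and the series is synthetic rather than real COVID-19 data:

```python
# Sketch of NARX-style prediction: the target depends on its own recent
# lags and lags of an exogenous driver (e.g. a mobility index).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n, lags = 200, 3
x = rng.uniform(0, 1, n)                 # exogenous input (synthetic)
y = np.zeros(n)
for t in range(1, n):                    # cases depend on past cases + driver
    y[t] = 0.8 * y[t - 1] + 2.0 * x[t - 1] + rng.normal(0, 0.05)

# lagged design matrix: [y(t-1..t-3), x(t-1..t-3)] -> y(t)
rows = [np.r_[y[t - lags:t], x[t - lags:t]] for t in range(lags, n)]
X, target = np.array(rows), y[lags:]

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
model.fit(X[:150], target[:150])
pred = model.predict(X[150:])
print("mean absolute error on held-out steps:",
      round(float(np.mean(np.abs(pred - target[150:]))), 3))
```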

