A Data Extraction Algorithm from Open Source Software Project Repositories for Building Duration Estimation Models: Case Study of Github

2020 ◽  
Vol 11 (6) ◽  
pp. 31-46
Author(s):  
Donatien K. Moulla ◽  
Alain Abran ◽  
Kolyang

Software project estimation is important for allocating resources and planning a reasonable work schedule. Estimation models are typically built using data from completed projects. While organizations have their own historical data repositories, it is difficult to obtain their collaboration due to privacy and competitive concerns. To overcome the issue of public access to private data repositories, this study proposes an algorithm to extract sufficient data from the GitHub repository for building duration estimation models. More specifically, this study extracts and analyses historical data on WordPress projects to estimate OSS project duration using commits as an independent variable, as well as an improved classification of contributors based on the number of active days for each contributor within a release period. The results indicate that duration estimation models using data from OSS repositories perform well and partially solve the problem of lack of data encountered in empirical research in software engineering.
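The contributor-classification step described in this abstract can be sketched as follows. The active-day thresholds and the sample commit log are illustrative assumptions, not the paper's actual cut-offs:

```python
# Sketch of classifying contributors by number of active days within a
# release period. The 0.8 / 0.2 thresholds are assumed for illustration.
from collections import defaultdict
from datetime import date

def classify_contributors(commits, release_days):
    """commits: list of (author, date) pairs within one release period."""
    active_days = defaultdict(set)
    for author, day in commits:
        active_days[author].add(day)
    classes = {}
    for author, days in active_days.items():
        ratio = len(days) / release_days
        if ratio >= 0.8:            # assumed full-time threshold
            classes[author] = "full-time"
        elif ratio >= 0.2:          # assumed part-time threshold
            classes[author] = "part-time"
        else:
            classes[author] = "occasional"
    return classes

# invented commit log for a 30-day release period
commits = [("alice", date(2020, 1, d)) for d in range(1, 28)] + \
          [("bob", date(2020, 1, d)) for d in (1, 5, 9, 13, 17, 21, 25, 29)] + \
          [("carol", date(2020, 1, 5))]
print(classify_contributors(commits, release_days=30))
```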

Author(s):  
Donatien Koulla Moulla ◽  
Alain Abran ◽  
Kolyang

For software organizations that rely on Open Source Software (OSS) to develop customer solutions and products, it is essential to accurately estimate how long it will take to deliver the expected functionalities. While OSS is supported by government policies around the world, most of the research on software project estimation has focused on conventional projects with commercial licenses. OSS effort estimation is challenging since OSS participants do not record effort data in OSS repositories. However, OSS data repositories contain the dates of the participants’ contributions, and these can be used for duration estimation. This study analyses historical data on WordPress and Swift projects to estimate OSS project duration using either commits or lines of code (LOC) as the independent variable. This study first proposes an improved classification of contributors based on the number of active days for each contributor in the development period of a release. For the WordPress and Swift OSS project environments, the results indicate that duration estimation models using the number of commits as the independent variable perform better than those using LOC. The estimation model for full-time contributors gives an estimate of the total duration, while the models with part-time and occasional contributors lead to better estimates of project duration, for both the commits data and the LOC data.
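A duration estimation model of the kind described, with commits as the independent variable, can be sketched as a simple linear regression; the data points below are invented for illustration and are not the WordPress or Swift figures:

```python
# Minimal sketch: least-squares fit of release duration (days) on the
# number of commits, as one instance of a duration estimation model.
def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

commits =   [120, 340, 560, 800, 1010]   # commits per release (invented)
durations = [ 35,  70, 110, 150,  190]   # release duration in days (invented)

a, b = fit_linear(commits, durations)
print(f"duration ≈ {a:.3f} * commits + {b:.1f}")
print("predicted duration for 600 commits:", round(a * 600 + b))
```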


Electronics ◽  
2021 ◽  
Vol 10 (10) ◽  
pp. 1195
Author(s):  
Priya Varshini A G ◽  
Anitha Kumari K ◽  
Vijayakumar Varadarajan

Software Project Estimation is a challenging and important activity in developing software projects. Software Project Estimation includes Software Time Estimation, Software Resource Estimation, Software Cost Estimation, and Software Effort Estimation. Software Effort Estimation focuses on predicting the number of hours of work (effort in terms of person-hours or person-months) required to develop or maintain a software application. It is difficult to forecast effort during the initial stages of software development. Various machine learning and deep learning models have been developed to predict effort. In this paper, single-model approaches and ensemble approaches were considered for estimation. Ensemble techniques are combinations of several single models. The ensemble techniques considered for estimation were averaging, weighted averaging, bagging, boosting, and stacking. The stacking models considered and evaluated were stacking using a generalized linear model, stacking using a decision tree, stacking using a support vector machine, and stacking using random forest. The datasets considered for estimation were Albrecht, China, Desharnais, Kemerer, Kitchenham, Maxwell, and Cocomo81. The evaluation measures used were mean absolute error, root mean squared error, and R-squared. The results showed that the proposed stacking using random forest provides the best results compared with single-model approaches using machine or deep learning algorithms and other ensemble techniques.
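The best-performing setup described here, stacking with a random forest meta-learner, can be sketched with scikit-learn. The base learners and the tiny synthetic dataset are illustrative stand-ins, not the paper's exact configuration or the Albrecht/China/Desharnais datasets:

```python
# Sketch of stacking with a random forest as the meta-learner.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 3))                        # synthetic project features
y = 4 * X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(0, 1, 120)   # synthetic effort

stack = StackingRegressor(
    estimators=[("lin", LinearRegression()),
                ("svm", SVR()),
                ("tree", DecisionTreeRegressor(random_state=0))],
    final_estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    cv=5,  # out-of-fold base predictions feed the meta-learner
)
stack.fit(X[:100], y[:100])
print("MAE on held-out projects:",
      round(mean_absolute_error(y[100:], stack.predict(X[100:])), 2))
```

The `cv` parameter matters here: the meta-learner is trained on out-of-fold predictions of the base models, which limits leakage from base-model overfitting.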


BMJ Open ◽  
2017 ◽  
Vol 7 (9) ◽  
pp. e017567
Author(s):  
Shimels Hussien Mohammed ◽  
Mulugeta Molla Birhanu ◽  
Tesfamichael Awoke Sissay ◽  
Tesfa Dejenie Habtewold ◽  
Balewgizie Sileshi Tegegn ◽  
...  

Introduction: Individuals living in poor neighbourhoods are at a higher risk of overweight/obesity. There is no systematic review and meta-analysis study on the association of neighbourhood socioeconomic status (NSES) with overweight/obesity. We aimed to systematically review and meta-analyse the existing evidence on the association of NSES with overweight/obesity.
Methods and analysis: Cross-sectional, case–control and cohort studies published in English from inception to 15 May 2017 will be systematically searched using the following databases: PubMed, EMBASE, Web of Science and Google Scholar. Selection, screening, reviewing and data extraction will be done by two reviewers, independently and in duplicate. The Newcastle–Ottawa Scale (NOS) will be used to assess the quality of evidence. Publication bias will be checked by visual inspection of funnel plots and Egger’s regression test. Heterogeneity will be checked by Higgins’s method (I² statistics). Meta-analysis will be done to estimate the pooled OR. Narrative synthesis will be performed if meta-analysis is not feasible due to high heterogeneity of studies.
Ethics and dissemination: Ethical clearance is not required as we will be using data from published articles. Findings will be communicated through a publication in a peer-reviewed journal and presentations at professional conferences.
PROSPERO registration number: CRD42017063889.
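As an illustration of the kind of synthesis this protocol plans, fixed-effect inverse-variance pooling of odds ratios with Higgins's I² can be sketched as follows. The study-level numbers are invented, and the review may ultimately use a different (e.g. random-effects) model:

```python
# Sketch: pool log odds ratios by inverse-variance weighting and compute
# Higgins's I² from Cochran's Q. Study data below are invented.
import math

# (odds ratio, 95% CI lower, 95% CI upper) per study — invented
studies = [(1.40, 1.10, 1.78), (1.25, 0.95, 1.64), (1.60, 1.20, 2.13)]

log_ors, weights = [], []
for or_, lo, hi in studies:
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # SE recovered from CI width
    log_ors.append(math.log(or_))
    weights.append(1 / se ** 2)                      # inverse-variance weight

pooled = sum(w * lor for w, lor in zip(weights, log_ors)) / sum(weights)
q = sum(w * (lor - pooled) ** 2 for w, lor in zip(weights, log_ors))
i2 = max(0.0, (q - (len(studies) - 1)) / q) * 100 if q > 0 else 0.0

print(f"pooled OR = {math.exp(pooled):.2f}, I² = {i2:.0f}%")
```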


2018 ◽  
Vol 7 (2.28) ◽  
pp. 312
Author(s):  
Manu Kohli

Asset-intensive organizations have long searched for a framework model that would predict equipment failure in a timely manner. Timely prediction of equipment failure substantially reduces direct and indirect costs, unexpected equipment shut-downs, accidents, and unwarranted emission risk. In this paper, the author proposes a model that can predict equipment failure using data from the SAP Plant Maintenance module. To achieve that, the author applied a data extraction algorithm and numerous data manipulations to prepare a classification data model consisting of maintenance record parameters such as spare-parts usage, time elapsed since the last completed maintenance, the period to the next scheduled maintenance, and so on. Using the unsupervised learning technique of clustering, the author observed a class-to-cluster evaluation of 80% accuracy. After that, a classifier model was trained using various machine learning (ML) algorithms and subsequently tested on mutually exclusive data sets with the objective of predicting equipment breakdown. The classifier model using ML algorithms such as Support Vector Machine (SVM) and Decision Tree (DT) returned an accuracy and true positive rate (TPR) greater than 95% in predicting equipment failure. The proposed model acts as an advanced intelligent control system contributing to Cyber-Physical Systems for asset-intensive organizations.
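The class-to-cluster evaluation mentioned here can be sketched as follows: cluster the maintenance records without labels, then map each cluster to its majority class and measure agreement. The features and data are synthetic stand-ins for the SAP Plant Maintenance fields the paper uses:

```python
# Sketch of class-to-cluster evaluation with k-means on synthetic
# maintenance records (healthy vs. failure-prone equipment).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# columns: spare-parts usage, days since last maintenance (synthetic)
healthy = rng.normal([2, 30], [1, 5], size=(100, 2))
failing = rng.normal([9, 95], [1, 5], size=(100, 2))
X = np.vstack([healthy, failing])
y = np.array([0] * 100 + [1] * 100)   # true classes, hidden from clustering

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# majority-vote mapping from cluster id to class, then overall agreement
acc = sum(
    max(np.sum(y[clusters == c] == 0), np.sum(y[clusters == c] == 1))
    for c in (0, 1)
) / len(y)
print(f"class-to-cluster accuracy: {acc:.0%}")
```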


2017 ◽  
Vol 21 (2) ◽  
pp. 317-340 ◽  
Author(s):  
HENDRIK DE SMET ◽  
FREEK VAN DE VELDE

While it is undoubtedly true that historical data do not lend themselves well to the reproduction of experimental findings, the availability of increasingly extensive data sets has brought some experimenting within practical reach. This means that certain predictions based on a combination of synchronic observations and uniformitarian thinking are now testable. Synchronic evidence shows a negative correlation between analysability in morphologically complex words and various measures of frequency. It is therefore expected that when the frequency of morphologically complex items changes, their analysability will change along with this. If analysability decreases, this should in turn be reflected in decreasing sensitivity to priming by items with analogous composition. The latter prediction is in principle testable on diachronic data, offering a way of verifying the diachronic effect of frequency change on analysability. In this spirit, the present article examines the relation between changing frequency and priming sensitivity, as a proxy to analysability. This is done for a sample of 250 English ly-adverbs, such as roughly, blindly, publicly, etc. over the period 1950–2005, using data from the Hansard Corpus. Some of the expected relations between frequency and analysability can be shown to hold, albeit with great variation across lexical items. At the same time, much of the variation in our measure of analysability cannot be accounted for by frequency or frequency change alone.


2020 ◽  
Vol 13 (1) ◽  
pp. 40-68
Author(s):  
Heni Listiana

Discussions about children and female migrant workers (TKW) are always an interesting issue, especially in relation to child care. Using data collection techniques such as observation, interviews, and documentation, it was found that the parenting of migrant workers' children in Madura has formed a new structure with the emergence of a second mother. There are three types of second mothers, namely the grandmother, the bu de (the mother's brother or sister), and an older sibling of the TKW's child. They carry out the mother's roles, among them being a model of children's behavior that is easily observed and imitated, and acting as educator, consultant, and source of information. Nearly 77% of grandmothers become maternal substitutes for migrant workers' children. The grandmother is considered the right person to carry out childcare tasks. This structure is called the inner parenting structure, while the outer parenting structure takes the form of community participation in child care, namely good neighbors, the attention of the village head (Klebun), and the environment of friends and schools.


Web Services ◽  
2019 ◽  
pp. 803-821
Author(s):  
Thiago Poleto ◽  
Victor Diogho Heuer de Carvalho ◽  
Ana Paula Cabral Seixas Costa

Big Data is a radical shift or an incremental change for existing digital infrastructures, which include the toolset used to aid the decision-making process, such as information systems, data repositories, formal modeling, and analysis of decisions. This work aims to provide a theoretical approach to the elements necessary to apply the big data concept in the decision-making process. It identifies key components of big data to define an integrated model of decision making using data mining, business intelligence, decision support systems, and organizational learning, all working together to provide decision support with a reliable visualization of the decision-related opportunities. The concepts of data integration and semantics were also explored in order to demonstrate that, once mined, data must be integrated, ensuring conceptual connections and conveying meaning, so that they can be used appropriately for problem solving in decision making.


Author(s):  
Abdelrahman E. E. Eltoukhy ◽  
Ibrahim Abdelfadeel Shaban ◽  
Felix T. S. Chan ◽  
Mohammad A. M. Abdel-Aal

The outbreak of the 2019 novel coronavirus disease (COVID-19) has adversely affected many countries in the world. The unexpectedly large number of COVID-19 cases has disrupted the healthcare system in many countries and resulted in a shortage of bed spaces in hospitals. Consequently, predicting the number of COVID-19 cases is imperative for governments to take appropriate actions. The number of COVID-19 cases can be accurately predicted by considering historical data of reported cases alongside some external factors that affect the spread of the virus. In the literature, most of the existing prediction methods focus only on the historical data and overlook most of the external factors. Hence, the number of COVID-19 cases is inaccurately predicted. Therefore, the main objective of this study is to simultaneously consider historical data and the external factors. This can be accomplished by adopting data analytics, which includes developing a nonlinear autoregressive exogenous input (NARX) neural network-based algorithm. The viability and superiority of the developed algorithm are demonstrated by conducting experiments using data collected for the top five affected countries in each continent. The results show an improved accuracy when compared with existing methods. Moreover, the experiments are extended to make future predictions of the number of patients afflicted with COVID-19 during the period from August 2020 until September 2020. Using such predictions, both the governments and people of the affected countries can take appropriate measures to resume pre-epidemic activities.
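The NARX idea described here, predicting a series from its own lags plus lags of an exogenous input, can be sketched as follows. A small multilayer perceptron stands in for the paper's NARX network, and the series is synthetic rather than real COVID-19 data:

```python
# Sketch of NARX-style prediction: the target depends on its own recent
# lags and lags of an exogenous driver (e.g. a mobility index).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n, lags = 200, 3
x = rng.uniform(0, 1, n)                 # exogenous input (synthetic)
y = np.zeros(n)
for t in range(1, n):                    # cases depend on past cases + driver
    y[t] = 0.8 * y[t - 1] + 2.0 * x[t - 1] + rng.normal(0, 0.05)

# lagged design matrix: [y(t-1..t-3), x(t-1..t-3)] -> y(t)
rows = [np.r_[y[t - lags:t], x[t - lags:t]] for t in range(lags, n)]
X, target = np.array(rows), y[lags:]

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
model.fit(X[:150], target[:150])
pred = model.predict(X[150:])
print("mean absolute error on held-out steps:",
      round(float(np.mean(np.abs(pred - target[150:]))), 3))
```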

