Comprehension and production of Kinyarwanda verbs in the Discriminative Lexicon

2021 ◽  
Author(s):  
Ruben van de Vijver ◽  
Emmanuel Uwambayinema ◽  
Yu-Ying Chuang

How do speakers comprehend and produce complex words? In the theory of the Discriminative Lexicon this is hypothesized to be the result of mapping the phonology of whole word forms onto their semantics and vice versa, without recourse to morphemes. This raises the question whether the hypothesis also holds true in highly agglutinative languages, which are often seen to exemplify the compositional nature of morphology. On the one hand, one could expect the hypothesis to be correct for agglutinative languages as well, since it remains unclear whether speakers are able to isolate the morphemes they would need for a compositional route. On the other hand, agglutinative languages have so many different words that it is not obvious how speakers can use their knowledge of words to comprehend and produce them.

In this paper, we investigate comprehension and production of verbs in Kinyarwanda, an agglutinative Bantu language, by means of computational modeling within the Discriminative Lexicon, a theory of the mental lexicon that is grounded in word and paradigm morphology, distributional semantics, and error-driven learning, draws on insights from psycholinguistic theories, and is implemented mathematically and computationally as a shallow, two-layered network.

To this end, we compiled a data set of 11,528 verb forms and annotated each verb form for its meaning and grammatical functions. In addition, we extracted from the full data set 573 verbs whose meanings are based on word embeddings. To assess comprehension and production of Kinyarwanda verbs, we fed both data sets into the Linear Discriminative Learning algorithm, a two-layered, fully connected network: one layer represents the phonological form and the other layer represents meaning. Comprehension is modeled as a mapping from phonology to meaning, and production as a mapping from meaning to phonology. Both comprehension and production are learned with high accuracy, on all data and on held-out data, both for the full data set with manually annotated semantic features and for the data set with meanings derived from word embeddings.

Our findings provide support for the various hypotheses of the Discriminative Lexicon: words are stored as wholes, meanings result from the distribution of words in utterances, comprehension and production can be successfully modeled as mappings from form to meaning and vice versa in a shallow two-layered network, and these mappings are learned by minimizing errors.
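The mappings described above are linear and can be sketched compactly. Below is a minimal illustration, assuming randomly generated stand-ins for the phonological cue matrix and the semantic matrix (the Kinyarwanda data and the paper's actual cue and embedding representations are not reproduced): comprehension and production are each a single least-squares mapping between the two matrices.

```python
# Minimal sketch of the linear mappings at the core of Linear Discriminative
# Learning (LDL), using random stand-in matrices rather than the paper's data.
import numpy as np

rng = np.random.default_rng(0)

n_words, n_cues, n_sem = 100, 300, 200
C = rng.integers(0, 2, size=(n_words, n_cues)).astype(float)  # phonological cues (e.g. triphone indicators)
S = rng.normal(size=(n_words, n_sem))                          # semantic vectors (e.g. embeddings)

# Comprehension: map form to meaning with a single linear transformation F,
# estimated by least squares. With more cue/semantic dimensions than words,
# both mappings can be learned (near-)exactly on the training data.
F = np.linalg.lstsq(C, S, rcond=None)[0]
S_hat = C @ F

# Production: map meaning to form with the reverse mapping G.
G = np.linalg.lstsq(S, C, rcond=None)[0]
C_hat = S @ G

def rowwise_accuracy(predicted, target):
    """Is each predicted vector closest (by dot product) to its own target row?"""
    sims = predicted @ target.T
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(target))))

print("comprehension accuracy:", rowwise_accuracy(S_hat, S))
print("production accuracy:   ", rowwise_accuracy(C_hat, C))
```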

Author(s):  
Tu Renwei ◽  
Zhu Zhongjie ◽  
Bai Yongqiang ◽  
Gao Ming ◽  
Ge Zhifeng

Unmanned Aerial Vehicle (UAV) inspection has become one of the main methods for transmission line inspection, but it still has shortcomings such as slow detection speed, low efficiency, and an inability to cope with low-light environments. To address these issues, this paper proposes a deep learning detection model based on You Only Look Once (YOLO) v3. On the one hand, the neural network structure is simplified: the three feature maps of YOLO v3 are pruned to two to meet the specific detection requirements. Meanwhile, the K-means++ clustering method is used to calculate the anchor values for the data set to improve detection accuracy. On the other hand, 1,000 sets of power tower and insulator images are collected, which are inverted and scaled to expand the data set, and further augmented with different illuminations and viewing angles. The experimental results show that the improved YOLO v3 model effectively improves detection accuracy by 6.0%, reduces FLOPs by 8.4%, and increases detection speed by about 6.0%.
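As a rough illustration of the anchor-computation step, the sketch below clusters hypothetical bounding-box sizes with scikit-learn's k-means++ initialization. Note that it uses Euclidean distance on (width, height) pairs rather than the IoU-based distance often used for YOLO anchors, and the paper's tower/insulator annotations are not shown; the box values and cluster count are made up for illustration.

```python
# Hedged sketch: compute anchor box sizes by clustering label widths/heights.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical box sizes in pixels; replace with (width, height) pairs from your labels.
boxes = np.array([
    [35, 120], [40, 130], [38, 110],    # tall, thin insulator strings (hypothetical)
    [200, 90], [220, 100], [210, 95],   # wide tower segments (hypothetical)
    [60, 60], [55, 65],                 # small fittings (hypothetical)
])

k = 4  # number of anchors; YOLO typically uses 3 per remaining detection scale
km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(boxes)

# Sort anchors by area, smallest to largest, as is conventional for YOLO configs.
anchors = km.cluster_centers_[np.argsort(km.cluster_centers_.prod(axis=1))]
print(np.round(anchors))
```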


Author(s):  
James B. Elsner ◽  
Thomas H. Jagger

Hurricane data originate from careful analysis of past storms by operational meteorologists. The data include estimates of the hurricane position and intensity at 6-hourly intervals. Information related to landfall time, local wind speeds, damages, and deaths, as well as cyclone size, is included. The data are archived by season. Some effort is needed to make the data useful for hurricane climate studies. In this chapter, we describe the data sets used throughout this book. We show you a workflow that includes importing, interpolating, smoothing, and adding attributes. We also show you how to create subsets of the data. The code in this chapter is more complicated and can take longer to run. You can skip this material on first reading and continue with model building in Chapter 7. You can return here when you have an updated version of the data that includes the most recent years. Most statistical models in this book use the best-track data. Here we describe these data and provide original source material. We also explain how to smooth and interpolate them. Interpolations are needed for regional hurricane analyses. The best-track data set contains the 6-hourly center locations and intensities of all known tropical cyclones across the North Atlantic basin, including the Gulf of Mexico and Caribbean Sea. The data set is called HURDAT, for HURricane DATa. It is maintained by the U.S. National Oceanic and Atmospheric Administration (NOAA) at the National Hurricane Center (NHC). Center locations are given in geographic coordinates (in tenths of degrees); the intensities, representing one-minute near-surface (∼10 m) wind speeds, are given in knots (1 kt = 0.5144 m s⁻¹); and the minimum central pressures are given in millibars (1 mb = 1 hPa). The data are provided at 6-hourly intervals starting at 00 UTC (Coordinated Universal Time). The version of the HURDAT file used here contains cyclones over the period 1851 through 2010 inclusive. Information on the history and origin of these data is found in Jarvinen et al. (1984). The file has a logical structure that makes it easy to read with a FORTRAN program. Each cyclone contains a header record, a series of data records, and a trailer record.
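As a small illustration of the interpolation step described above, the sketch below resamples a hypothetical 6-hourly track to hourly values with pandas. It is not the book's own code, does not parse the HURDAT record format, and the track values are invented.

```python
# Hedged sketch: interpolate 6-hourly best-track records to hourly values.
import pandas as pd

track = pd.DataFrame(
    {
        "lat": [25.0, 25.6, 26.3, 27.1],      # hypothetical center latitudes (deg N)
        "lon": [-80.0, -81.0, -82.1, -83.0],  # hypothetical center longitudes (deg E)
        "wind_kt": [65, 70, 80, 75],          # hypothetical 1-minute winds (knots)
    },
    index=pd.date_range("2010-08-01 00:00", periods=4, freq="6h"),
)

# Insert hourly timestamps, then interpolate linearly in time.
hourly = track.resample("1h").asfreq().interpolate(method="time")
hourly["wind_ms"] = hourly["wind_kt"] * 0.5144  # convert knots to m/s
print(hourly.head(8))
```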


Endocrinology ◽  
2019 ◽  
Vol 160 (10) ◽  
pp. 2395-2400 ◽  
Author(s):  
David J Handelsman ◽  
Lam P Ly

Abstract Hormone assay results below the assay detection limit (DL) can introduce bias into quantitative analysis. Although complex maximum likelihood estimation methods exist, they are not widely used, whereas simple substitution methods are often used ad hoc to replace the undetectable (UD) results with numeric values to facilitate data analysis with the full data set. However, the bias of substitution methods for steroid measurements has not been reported. Using a large data set (n = 2896) of serum testosterone (T), DHT, and estradiol (E2) concentrations from healthy men, we created modified data sets with increasing proportions of UD samples (≤40%) to which we applied five different substitution methods (deleting UD samples as missing, or substituting UD samples with DL, DL/√2, DL/2, or 0) to calculate univariate descriptive statistics (mean, SD) or bivariate correlations. For all three steroids and for univariate as well as bivariate statistics, bias increased progressively with increasing proportion of UD samples. Bias was worst when UD samples were deleted or substituted with 0 and least when UD samples were substituted with DL/√2, whereas the other methods (DL or DL/2) displayed intermediate bias. Similar findings were replicated in randomly drawn small subsets of 25, 50, and 100 samples. Hence, we propose that in steroid hormone data with ≤40% UD samples, substituting UD with DL/√2 is a simple, versatile, and reasonably accurate method to minimize left-censoring bias, allowing for data analysis with the full data set.
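The substitution methods compared above are straightforward to express in code. The sketch below applies them to hypothetical log-normal concentrations with an assumed detection limit (the study's serum measurements are not reproduced) and reports how the mean and SD shift under each method.

```python
# Hedged sketch of simple substitution methods for values below the detection limit.
import numpy as np

rng = np.random.default_rng(1)
conc = rng.lognormal(mean=1.0, sigma=0.6, size=1000)   # hypothetical concentrations
DL = np.quantile(conc, 0.2)                            # detection limit leaving ~20% undetectable
below = conc < DL

def substitute(values, undetectable, method):
    """Return the data set after handling undetectable (UD) samples by one method."""
    out = values.astype(float).copy()
    if method == "delete":
        return out[~undetectable]                      # drop UD samples as missing
    replacement = {"DL": DL, "DL/sqrt2": DL / np.sqrt(2), "DL/2": DL / 2, "zero": 0.0}[method]
    out[undetectable] = replacement
    return out

for m in ["delete", "DL", "DL/sqrt2", "DL/2", "zero"]:
    x = substitute(conc, below, m)
    print(f"{m:>8}: mean={x.mean():.3f} sd={x.std(ddof=1):.3f}")
print(f"    true: mean={conc.mean():.3f} sd={conc.std(ddof=1):.3f}")
```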


2015 ◽  
Vol 5 (3) ◽  
pp. 350-380 ◽  
Author(s):  
Abdifatah Ahmed Haji ◽  
Sanni Mubaraq

Purpose – The purpose of this paper is to examine the impact of corporate governance and ownership structure attributes on firm performance following the revised code on corporate governance in Malaysia. The study presents a longitudinal assessment of the compliance with, and the implications of, the revised code for firm performance. Design/methodology/approach – Two data sets, from before (2006) and after (2008-2010) the revised code, are examined. Drawing from the largest companies listed on Bursa Malaysia (BM), the first data set contains 92 observations in the year 2006, while the second data set comprises 282 observations drawn from the largest companies listed on BM over a three-year period, from 2008-2010. Both accounting (return on assets and return on equity) and market performance (Tobin’s Q) measures were used to measure firm performance. Multiple and panel data regression analyses were adopted to analyze the data. Findings – The study shows that there were still cases of non-compliance with basic requirements of the code, such as the one-third independent non-executive director (INDs) requirement, even after the revised code. While the regression models indicate marginal significance of board size and independent directors before the revised code, the results indicate that all corporate governance variables have a significant negative relationship with at least one of the measures of corporate performance. The independent chairperson, however, showed a consistent positive impact on firm performance both before and after the revised code. In addition, ownership structure elements were found to have a negative relationship with either accounting or market performance measures, with institutional ownership showing a consistent negative impact on firm performance. Firm size and leverage, as control variables, were significant in determining corporate performance. Research limitations/implications – One limitation is the use of separate measures of corporate governance attributes, as opposed to a corporate governance index (CGI). As a result, the study constructs a CGI based on the recommendations of the revised code and proposes it for use in future research. Practical implications – Some of the largest companies did not comply even with basic requirements such as the “one-third INDs” mandatory requirement. Hence, the regulators may want to reinforce the requirements of the code and also provide detailed examples of good governance practices. The results, which show a consistent positive relationship between the presence of an independent chairperson and firm performance in both data sets, suggest that listed companies consider appointing an independent chairperson in the corporate leadership. The regulatory authorities may also wish to note this phenomenon when drafting any future corporate governance codes. Originality/value – This study offers new insights into the implications of regulatory changes for the relationship between corporate governance attributes and firm performance from the perspective of a developing country. The development of a CGI for future research is a novel approach of this study.


2016 ◽  
Author(s):  
Brecht Martens ◽  
Diego G. Miralles ◽  
Hans Lievens ◽  
Robin van der Schalie ◽  
Richard A. M. de Jeu ◽  
...  

Abstract. The Global Land Evaporation Amsterdam Model (GLEAM) is a set of algorithms dedicated to the estimation of terrestrial evaporation and root-zone soil moisture from satellite data. Ever since its development in 2011, the model has been regularly revised, aiming at the optimal incorporation of new satellite-observed geophysical variables and an improved representation of physical processes. In this study, the next version of this model (v3) is presented. Key changes relative to the previous version include: (1) a revised formulation of the evaporative stress, (2) an optimized drainage algorithm, and (3) a new soil moisture data assimilation system. GLEAM v3 is used to produce three new data sets of terrestrial evaporation and root-zone soil moisture, including a 35-year data set spanning the period 1980–2014 (v3.0a, based on satellite-observed soil moisture, vegetation optical depth and snow water equivalent, reanalysis air temperature and radiation, and a multi-source precipitation product), and two fully satellite-based data sets. The latter two share most of their forcing, except for the vegetation optical depth and soil moisture products, which are based on observations from different passive and active C- and L-band microwave sensors (European Space Agency Climate Change Initiative data sets) for the first data set (v3.0b, spanning the period 2003–2015) and on observations from the Soil Moisture and Ocean Salinity satellite for the second data set (v3.0c, spanning the period 2011–2015). These three data sets are described in detail, compared against analogous data sets generated using the previous version of GLEAM (v2), and validated against measurements from 64 eddy-covariance towers and 2338 soil moisture sensors across a broad range of ecosystems. Results indicate that the quality of the v3 soil moisture is consistently better than that of v2: average correlations against in situ surface soil moisture measurements increase from 0.61 to 0.64 in the case of the v3.0a data set, and the representation of soil moisture in the second layer improves as well, with correlations increasing from 0.47 to 0.53. Similar improvements are observed for the two fully satellite-based data sets. Despite regional differences, the quality of the evaporation fluxes remains overall similar to that obtained using the previous version of GLEAM, with average correlations against eddy-covariance measurements between 0.78 and 0.80 for the three different data sets. These global data sets of terrestrial evaporation and root-zone soil moisture are now openly available at http://GLEAM.eu and may be used for large-scale hydrological applications, climate studies and research on land-atmosphere feedbacks.
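The validation statistic quoted above is a plain Pearson correlation between modelled and in situ values; a minimal sketch is shown below, assuming two hypothetical aligned daily series rather than actual GLEAM output or sensor data.

```python
# Hedged sketch: correlate modelled soil moisture against in situ measurements.
import numpy as np

rng = np.random.default_rng(2)
in_situ = rng.uniform(0.1, 0.4, size=365)                 # hypothetical daily soil moisture (m3/m3)
modelled = in_situ + rng.normal(scale=0.05, size=365)     # hypothetical model estimate with noise

r = np.corrcoef(modelled, in_situ)[0, 1]   # Pearson correlation, as in the reported 0.61-0.64
print(f"correlation: {r:.2f}")
```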


2020 ◽  
Author(s):  
Marika Kaden ◽  
Katrin Sophie Bohnsack ◽  
Mirko Weber ◽  
Mateusz Kudła ◽  
Kaja Gutowska ◽  
...  

Abstract We present an approach to investigate SARS-CoV-2 virus sequences based on alignment-free methods for RNA sequence comparison. In particular, we verify a given clustering result for the GISAID data set, which was obtained by analyzing the molecular differences in coronavirus populations with phylogenetic trees. For this purpose, we use alignment-free dissimilarity measures for sequences and combine them with learning vector quantization classifiers for virus type discriminant analysis and classification. Those vector quantizers belong to the class of interpretable machine learning methods, which, on the one hand, provide additional knowledge about the classification decisions, such as discriminant feature correlations, and, on the other hand, can be equipped with a reject option. This option gives the model the property of self-controlled evidence when applied to new data, i.e. the model refuses to make a classification decision if the model evidence for the presented data is not given. After training such a classifier for the GISAID data set, we apply the obtained classifier model to another, unlabeled SARS-CoV-2 virus data set. On the one hand, this allows us to assign new sequences to already known virus types and, on the other hand, the rejected sequences allow speculations about new virus types with respect to nucleotide base mutations in the viral sequences.

Author summary: The currently emerging global disease COVID-19, caused by novel SARS-CoV-2 viruses, requires all scientific effort to investigate the development of the viral epidemic, the properties of the virus and its types. Investigations of the virus sequence are of special interest. Frequently, these are based on mathematical/statistical analysis. However, machine learning methods represent a promising alternative if one focuses on interpretable models, i.e. those that do not act as black boxes. To this end, we apply variants of Learning Vector Quantizers to analyze the SARS-CoV-2 sequences. We encoded the sequences and compared them in their numerical representations to avoid the computationally costly comparison based on sequence alignments. Our resulting model is interpretable, robust, efficient, and has a self-controlling mechanism regarding its applicability to data. This framework was applied to two data sets concerning SARS-CoV-2. We were able to verify previously published virus type findings for one of the data sets by training our model to accurately identify the virus type of sequences. For sequences without virus type information (the second data set), our trained model can predict their types. Thereby, we observe a new scattered spreading of the sequences in the data space, which is probably caused by mutations in the viral sequences.
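The sketch below illustrates the general idea in a much reduced form: sequences are encoded as k-mer frequency vectors (one simple alignment-free representation), and a nearest-prototype rule with a distance-based reject option assigns a virus type or refuses to decide. The GISAID sequences, the paper's specific dissimilarity measures, and the trained LVQ prototypes are all replaced by hypothetical stand-ins.

```python
# Hedged sketch: alignment-free k-mer encoding plus a prototype classifier with reject option.
import itertools
import numpy as np

K = 3
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def kmer_profile(seq):
    """Normalized k-mer frequency vector for one RNA/DNA sequence."""
    counts = np.zeros(len(KMERS))
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total else counts

def classify(profile, prototypes, labels, reject_distance):
    """Assign the label of the closest prototype, or reject if all prototypes are too far."""
    dists = np.linalg.norm(prototypes - profile, axis=1)
    best = int(dists.argmin())
    return labels[best] if dists[best] <= reject_distance else "rejected"

# Hypothetical prototypes for two virus types (in practice these would be learned by LVQ training).
train = {"type A": ["ACGTACGTGG", "ACGTACGAGG"], "type B": ["TTGGCCAATT", "TTGGCCAGTT"]}
labels = list(train)
prototypes = np.array([np.mean([kmer_profile(s) for s in seqs], axis=0)
                       for seqs in train.values()])

print(classify(kmer_profile("ACGTACGTGC"), prototypes, labels, reject_distance=0.2))  # close to type A
print(classify(kmer_profile("GGGGGGGGGG"), prototypes, labels, reject_distance=0.2))  # rejected
```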


Author(s):  
Zhiguo Bao ◽  
Shuyu Wang

For hedge funds, return prediction has always been a fundamental and important problem. Usually, a good return prediction model directly determines the performance of a quantitative investment strategy. However, the performance of the model is influenced by the market-style: even models trained on the same data set perform differently in different market-styles. Traditional methods hope to train a universal linear or nonlinear model on the data set to cope with different market-styles. However, a linear model has limited fitting ability and is insufficient to deal with the hundreds of features in the hedge fund feature pool, while a nonlinear model risks over-fitting. Simultaneously, changes in market-style will make certain features valid or invalid, and a traditional linear or nonlinear model is not sufficient to deal with this situation. This thesis proposes a method based on Reinforcement Learning that automatically discriminates market-styles and automatically selects, from sub-models pre-trained with different categories of features, the model that best fits the current market-style to predict the return of stocks. Compared with the traditional method of training a return prediction model directly on the full data set, experiments show that the proposed method performs better, with a higher Sharpe ratio and annualized return.
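One simple way to picture the selection mechanism is as a bandit problem over pre-trained sub-models: the sketch below uses an epsilon-greedy selector with hypothetical sub-models and synthetic returns. This is only a stand-in for the paper's actual reinforcement-learning formulation, feature categories, and hedge-fund data.

```python
# Hedged sketch: epsilon-greedy selection among pre-trained sub-models.
import numpy as np

rng = np.random.default_rng(3)

class EpsilonGreedySelector:
    def __init__(self, n_models, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = np.zeros(n_models)
        self.value = np.zeros(n_models)      # running mean reward per sub-model

    def select(self):
        if rng.random() < self.epsilon:
            return int(rng.integers(len(self.value)))   # explore
        return int(self.value.argmax())                  # exploit the best sub-model so far

    def update(self, model_idx, reward):
        self.counts[model_idx] += 1
        self.value[model_idx] += (reward - self.value[model_idx]) / self.counts[model_idx]

# Hypothetical sub-models: each maps the same input feature to a return prediction.
sub_models = [lambda x: 0.01 * x, lambda x: -0.02 * x, lambda x: 0.005 * x + 0.001]

selector = EpsilonGreedySelector(len(sub_models))
for t in range(1000):
    x = rng.normal()
    realized = 0.01 * x + rng.normal(scale=0.005)     # synthetic market where sub-model 0 fits best
    idx = selector.select()
    reward = -abs(sub_models[idx](x) - realized)      # smaller prediction error -> larger reward
    selector.update(idx, reward)

print("selection counts per sub-model:", selector.counts.astype(int))
```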


2016 ◽  
Vol 9 (1) ◽  
pp. 60-69
Author(s):  
Robert M. Zink

It is sometimes said that scientists are entitled to their own opinions but not their own set of facts. This suggests that application of the scientific method ought to lead to a single conclusion from a given set of data. However, sometimes scientists have conflicting opinions about which analytical methods are most appropriate or which subsets of existing data are most relevant, resulting in different conclusions. Thus, scientists might actually lay claim to different sets of facts. However, if a contrary conclusion is reached by selecting a subset of data, this conclusion should be carefully scrutinized to determine whether consideration of the full data set leads to different conclusions. This is important because conservation agencies are required to consider all of the best available data and make a decision based on them. Therefore, exploring reasons why different conclusions are reached from the same body of data has relevance for the management of species. The purpose of this paper was to explore how two groups of researchers can examine the same data and reach opposite conclusions in the case of the taxonomy of the endangered subspecies Southwestern Willow Flycatcher (Empidonax traillii extimus). It was shown that the use of subsets of data and characters, rather than reliance on entire data sets, can explain the conflicting conclusions. It was recommended that agencies tasked with making conservation decisions rely on analyses that include all relevant molecular, ecological, behavioral, and morphological data, which in this case show that the subspecies is not valid, and hence its listing is likely not warranted.


2017 ◽  
Vol 5 (4) ◽  
pp. 1
Author(s):  
I. E. Okorie ◽  
A. C. Akpanta ◽  
J. Ohakwe ◽  
D. C. Chikezie ◽  
C. U. Onyemachi ◽  
...  

This paper introduces a new generator of probability distributions, the adjusted log-logistic generalized (ALLoG) distribution, and a new extension of the standard one-parameter exponential distribution called the adjusted log-logistic generalized exponential (ALLoGExp) distribution. The ALLoGExp distribution is a special case of the ALLoG distribution, and we provide some of its statistical and reliability properties. Notably, the failure rate can be monotonically decreasing, increasing, or upside-down bathtub shaped depending on the values of the parameters $\delta$ and $\theta$. The method of maximum likelihood estimation was proposed to estimate the model parameters. The importance and flexibility of the ALLoGExp distribution was demonstrated with a real and uncensored lifetime data set, and its fit was compared with five other exponential-related distributions. The results obtained from the model fittings show that the ALLoGExp distribution provides a reasonably better fit than the other fitted distributions. The ALLoGExp distribution is therefore recommended for effective modelling of lifetime data sets.
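As an illustration of maximum likelihood fitting for a lifetime distribution, the sketch below fits the standard log-logistic (SciPy's fisk distribution) to synthetic uncensored data; the ALLoGExp distribution itself and the paper's lifetime data set are not reproduced, and the parameter values are arbitrary.

```python
# Hedged sketch: maximum likelihood fit of a log-logistic lifetime distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
lifetimes = stats.fisk.rvs(c=2.5, scale=3.0, size=500, random_state=rng)  # hypothetical uncensored data

# Maximum likelihood estimates of shape and scale (location fixed at 0 for lifetime data).
c_hat, loc_hat, scale_hat = stats.fisk.fit(lifetimes, floc=0)
print(f"shape={c_hat:.2f}, scale={scale_hat:.2f}")

# An information criterion of the fit, analogous to the model comparison in the paper
# (2 free parameters since the location is fixed).
loglik = np.sum(stats.fisk.logpdf(lifetimes, c_hat, loc_hat, scale_hat))
print("AIC (log-logistic):", 2 * 2 - 2 * loglik)
```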


Blood ◽  
2006 ◽  
Vol 108 (11) ◽  
pp. 5415-5415 ◽  
Author(s):  
Alexander H. Schmidt ◽  
Andrea Stahr ◽  
Daniel Baier ◽  
Gerhard Ehninger ◽  
Claudia Rutt

Abstract In strategic stem cell donor registry planning, it is of special importance to decide how to type newly registered donors. This question refers to both the selection of HLA loci and the resolution (low, intermediate, or high) of HLA typings. In principle, high-resolution typings of all transplant-relevant loci are preferable. However, cost considerations generally lead to incomplete typings (only selected HLA loci with low or intermediate typing resolution) in practice. Here, we present results of a project in which newly recruited donors are typed for the HLA-A, -B, -C, and -DRB1 loci at high resolution by sequencing. The efficiency of these typings is measured by subsequent requests for confirmatory typings (CTs) and stem cell donations. Results for donors who were included in the project (Donor Group A) are compared to requests for donors with other, less complete typing levels: HLA-A and HLA-B at intermediate resolution, HLA-DRB1 at high resolution (Group B); HLA-A, -B, -C, and -DRB1 at intermediate resolution (Group C); HLA-A, -B, and -DRB1 at intermediate resolution (Group D). All data are taken from the donor file of the DKMS German Bone Marrow Donor Center. Since the four groups differ considerably regarding their age and sex distributions, calculations are also carried out for restricted data sets that include only male donors up to age 25. Results are shown in Table 1. Donors of Groups A and B have similar CT request frequencies of 5.90 and 5.92 requests per 100 donors per year in the restricted data sets, respectively. These frequencies significantly exceed the corresponding frequencies of the other groups with less complete typing levels. For donation requests, the frequency is significantly higher for Group A than for Group B (restricted data sets): 1.45 vs 1.02 requests per 100 donors per year (p<0.05). Obviously, the additional HLA information for Group A donors leads to a higher ratio between donations and CT requests. Again, figures are much lower for Groups C and D. These results are based on a high number of requests even for the restricted data sets, namely between 44 and 90 donation requests and between 227 and 619 CT requests per group. Our results show that full (HLA-A, -B, -C, and -DRB1) high-resolution typings at donor recruitment lead to significantly higher probabilities of donation requests. Donor centers and registries should carefully take these higher probabilities into account when they consider full high-resolution typings for newly recruited donors. However, the final decision regarding the typing strategy at recruitment must also depend on the individual cost structure of a donor center or registry. The presented results are based on a donor file that consists mainly (≈99%) of Caucasian donors. It should be subject to further analyses whether these results also apply to other, more heterogeneous donor pools.

Table 1: Requests per 100 donors per year by donor group

                        CT requests                            Donation requests
Donor Group   Full data set   Only male donors ≤ 25   Full data set   Only male donors ≤ 25
A                  5.14               5.90                 1.45               1.45
B                  4.60               5.92                 0.84               1.02
C                  2.50               3.03                 0.58               0.67
D                  2.36               2.80                 0.38               0.48

