Extracting bibliographical data for PDF documents with HMM and external resources

2014 ◽  
Vol 48 (3) ◽  
pp. 293-313 ◽  
Author(s):  
Wen-Feng Hsiao ◽  
Te-Min Chang ◽  
Erwin Thomas

Purpose – The purpose of this paper is to propose an automatic metadata extraction and retrieval system that extracts bibliographical information from digital academic documents in Portable Document Format (PDF). Design/methodology/approach – The authors use PDFBox to extract text and font size information, a rule-based method to identify titles, and a Hidden Markov Model (HMM) to extract the titles and authors. Finally, the extracted titles and authors (possibly incorrect or incomplete) are sent as query strings to digital libraries (e.g. ACM, IEEE, CiteSeerX, SDOS, and Google Scholar) to retrieve the rest of the metadata. Findings – Four experiments are conducted to examine the feasibility of the proposed system. The first experiment compares two HMM models: a multi-state model and a one-state model (the proposed model). The result shows that the one-state model performs comparably to the multi-state model but is better suited to dealing with real-world unknown states. The second experiment shows that the proposed model (without the aid of online queries) performs as well as other researchers' models on the Cora paper header dataset. The third experiment examines the performance of the system on a small dataset of 43 real PDF research papers. The result shows that the proposed system (with online queries) performs well on bibliographical data extraction and even outperforms the free citation management tool Zotero 3.0. Finally, the fourth experiment uses a larger dataset of 103 papers to compare the system with Zotero 4.0. The result shows that the system significantly outperforms Zotero 4.0. The feasibility of the proposed model is thus justified. Research limitations/implications – In academic terms, the system is unique in two respects. First, it uses only the Cora header set for HMM training, without other tagged datasets or gazetteer resources, which means the system is light and scalable. Second, the system is workable and can be applied to extracting metadata from real-world PDF files. The extracted bibliographical data can then be imported into citation software such as EndNote or RefWorks to increase researchers’ productivity. Practical implications – In practical terms, the system can outperform the existing tool Zotero 4.0. This gives practitioners a good opportunity to develop similar products for real applications, although doing so might require some knowledge of HMM implementation. Originality/value – The HMM implementation is not novel. What is innovative is that it combines two HMM models: the main model is adapted from Freitag and McCallum (1999), and the authors add the word features of the Nymble HMM (Bikel et al., 1997) to it. The system is workable even without manually tagging datasets before training the model (the authors use only the Cora dataset for training and test on real-world PDF papers), which differs significantly from what other works have done so far. The experimental results provide sufficient evidence of the feasibility of the proposed method in this respect.
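
As a rough illustration of how an HMM can label header tokens as title or author, the sketch below runs Viterbi decoding over coarse word features. It is a minimal sketch under stated assumptions, not the authors' model: the two states, every probability, the three-value feature set and the helper names `word_feature` and `viterbi` are all hypothetical, and nothing here is trained on the Cora header set.

```python
import math

# Hypothetical toy parameters, chosen by hand for illustration only.
STATES = ["title", "author"]
START = {"title": 0.9, "author": 0.1}
TRANS = {
    "title": {"title": 0.7, "author": 0.3},
    "author": {"title": 0.1, "author": 0.9},
}
EMIT = {
    "title": {"lowercase": 0.5, "capitalized": 0.4, "initial": 0.1},
    "author": {"lowercase": 0.05, "capitalized": 0.6, "initial": 0.35},
}


def word_feature(token):
    """Map a non-empty token to a coarse word feature (an assumed feature set)."""
    if len(token) == 2 and token[0].isupper() and token[1] == ".":
        return "initial"
    if token[0].isupper():
        return "capitalized"
    return "lowercase"


def viterbi(tokens):
    """Return the most likely state sequence for a non-empty token list."""
    feats = [word_feature(t) for t in tokens]
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][feats[0]]) for s in STATES}
    backptrs = []
    for f in feats[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][f])
            ptr[s] = prev
        score = new_score
        backptrs.append(ptr)
    # Follow the back-pointers from the best final state to recover the path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))
```

With these toy parameters, `viterbi(["Extracting", "bibliographical", "data", "Wen-Feng", "Hsiao"])` labels the first three tokens as title and the last two as author. In the system described above, such labels would only be a first pass; the extracted strings are then sent to digital libraries to retrieve the remaining metadata.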

2017 ◽  
Vol 117 (9) ◽  
pp. 1866-1889 ◽  
Author(s):  
Vahid Shokri Kahi ◽  
Saeed Yousefi ◽  
Hadi Shabanpour ◽  
Reza Farzipoor Saen

Purpose The purpose of this paper is to develop a novel network and dynamic data envelopment analysis (DEA) model for evaluating the sustainability of supply chains. In the proposed model, all links can be considered in the calculation of the efficiency score. Design/methodology/approach A dynamic DEA model is proposed to evaluate sustainable supply chains whose networks have a series structure. The nature of free links is defined and subsequently applied in calculating the relative efficiency of supply chains. An additive network DEA model is developed to evaluate the sustainability of supply chains over several periods. A case study demonstrates the applicability of the proposed approach. Findings This paper helps managers identify inefficient supply chains and take proper remedial actions for performance optimization. In addition, the overall efficiency scores of supply chains fluctuate less. By utilizing the proposed model and determining dual-role factors, managers can plan their supply chains properly and more accurately. Research limitations/implications In the real world, managers face big data. Therefore, an approach for dealing with big data needs to be developed. Practical implications The proposed model offers useful managerial implications along with means for managers to monitor and measure the efficiency of their production processes. The proposed model can be applied to real-world problems in which decision makers face multi-stage processes such as supply chains, production systems, etc. Originality/value For the first time, the authors present an additive model of network-dynamic DEA. For the first time, the authors outline the links in such a way that the carry-overs of networks are connected across different periods rather than across different stages.


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Zoraya Roldán Rockow ◽  
Brandon E. Ross

Purpose This paper aims to describe and demonstrate a quantitative areal openness model (AOM) for measuring the openness of floor plans. Creation of the model was motivated by the widely reported but rarely quantified link between openness and adaptability. Design/methodology/approach The model calculates values for three indicators: openness score (OS), weighted OS (WOS) and openness potential (OP). OS measures the absence of obstructions (walls, chases, columns) that separate areas in a floor plan. WOS measures the number of obstructions while also accounting for the difficulty of removing them. OP measures the potential of a floor plan to become more open. Indicators were calculated for three demolished case study buildings and for three adapted buildings. The case study buildings were selected because openness – or lack thereof – contributed to the owners' decisions to demolish or adapt. Findings Openness indicators were consistent with the real-world outcomes (adaptation or demolition) of the case study buildings. This encouraging result suggests that the proposed model is a reasonable approach for comparing the openness of floor plans and evaluating them for possible adaptation or demolition. Originality/value The AOM is presented as a tool for facility managers to evaluate inventories of existing buildings, designers to compare alternative plan layouts and researchers to measure openness of case studies. It is intended to be sufficiently complex as to produce meaningful results, relatively simple to apply and readily modifiable to suit different situations. The model is the first to calculate floor plan openness within the context of adaptability.


2016 ◽  
Vol 11 (1) ◽  
pp. 134-153 ◽  
Author(s):  
Gholamreza Malekzadeh ◽  
Mostaffa Kazemi ◽  
Mohammad Lagzian ◽  
Saeed Mortazavi

Purpose – This paper aims to investigate a novel model for organizational intelligence (OI) in Iranian public universities. OI is an effective concept in organizational behavior for reshaping organizational rules. The multidimensional nature of OI makes it a very useful management tool. Design/methodology/approach – The model is investigated using expert panel opinion and the decision-making trial and evaluation laboratory technique, based on the opinions of Iranian university professors. Findings – The proposed model consists of eight dimensions: structural, cultural, strategic, communicational, informational, functional, behavioral and environmental. Each of these dimensions consists of several components. The results showed that the “Structural”, “Cultural”, “Strategic”, “Informational” and “Environmental” dimensions are the cause dimensions, while the “Behavioral” and “Communicational” dimensions are the effect dimensions. The hierarchical levels of these dimensions are also determined. Practical implications – Comprehending this model offers a number of beneficial insights for university managers. These points are stated concisely in the form of managerial implications. Originality/value – Using a real case study, the paper provides a cause-and-effect model for OI management in Iranian public universities that can be extended to other organizations.
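
The split into cause and effect dimensions reported above is the usual output of the decision-making trial and evaluation laboratory (DEMATEL) technique. The sketch below shows the standard calculation on a made-up 3×3 direct-influence matrix: the matrix values and the function name `dematel` are hypothetical and do not come from the expert panel's data. It also sums the power series N + N² + … for a fixed number of iterations instead of inverting (I − N), which assumes the series converges for the given matrix.

```python
def dematel(direct, iterations=200):
    """Return (prominence, relation) lists from a direct-influence matrix."""
    n = len(direct)
    # Normalise by the largest row or column sum.
    scale = max(
        max(sum(row) for row in direct),
        max(sum(direct[i][j] for i in range(n)) for j in range(n)),
    )
    norm = [[v / scale for v in row] for row in direct]
    # Total-relation matrix T = N + N^2 + N^3 + ... (equals N (I - N)^-1
    # when the series converges).
    total = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in norm]
    for _ in range(iterations):
        for i in range(n):
            for j in range(n):
                total[i][j] += power[i][j]
        power = [
            [sum(power[i][k] * norm[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    given = [sum(total[i]) for i in range(n)]  # influence each factor exerts
    received = [sum(total[i][j] for i in range(n)) for j in range(n)]  # influence it receives
    prominence = [g + r for g, r in zip(given, received)]
    relation = [g - r for g, r in zip(given, received)]
    return prominence, relation


# Hypothetical influence scores among three dimensions (row influences column).
direct = [
    [0, 3, 3],
    [1, 0, 2],
    [0, 1, 0],
]
prominence, relation = dematel(direct)
```

A factor with a positive `relation` value exerts more influence than it receives and falls in the cause group; a negative value places it in the effect group.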


2017 ◽  
Vol 12 (1) ◽  
pp. 106-123
Author(s):  
Choo Jun Tan ◽  
Ting Yee Lim ◽  
Chin Wei Bong ◽  
Teik Kooi Liew

Purpose The purpose of this paper is to propose a soft computing model based on a multi-objective evolutionary algorithm (MOEA), namely, a modified micro genetic algorithm (MmGA) coupled with a decision tree (DT)-based classifier, for classifying and optimising students’ online interaction activities as a classifier of student achievement. Subsequently, the results are transformed into useful information that may help educators design better learning instructions geared towards higher student achievement. Design/methodology/approach A soft computing model based on MOEA is proposed. It is tested on benchmark data pertaining to student activities and achievement obtained from the University of California at Irvine machine learning repository. Additionally, a real-world case study has been conducted in a distance learning institution, namely, Wawasan Open University in Malaysia. The case study involves a total of 46 courses collected over 24 consecutive weeks, with students from all regions of Malaysia and worldwide. Findings The proposed model obtains high classification accuracy rates with a reduced number of features. These results are transformed into useful information for the educational institution in the case study in an effort to improve student achievement. On both the benchmark data and the real-world case study, the proposed model reduced the number of features used by at least 48 per cent while achieving higher classification accuracy. Originality/value A soft computing model based on MOEA, namely, MmGA coupled with a DT-based classifier, is proposed for handling educational data.


2015 ◽  
Vol 19 (1) ◽  
pp. 45-59 ◽  
Author(s):  
Subhasis Dasgupta ◽  
Pinakpani Pal ◽  
Chandan Mazumdar ◽  
Aditya Bagchi

Purpose – This paper provides a new Digital Library architecture that supports a polyhierarchic ontology structure in which a child concept representing an interdisciplinary subject area can have multiple parent concepts. The paper further proposes an access control mechanism for controlled access to different concepts by different users, depending on the authorizations available to each user. The proposed model thus provides better knowledge representation and faster document searching for modern Digital Libraries with controlled access to the system. Design/methodology/approach – Since the proposed Digital Library architecture considers polyhierarchy, the underlying hierarchical structure becomes a Directed Acyclic Graph instead of a tree. A new access control model has been developed for such a polyhierarchic ontology structure. It has been shown that such a model may give rise to an undecidability problem. A client-specific view generation mechanism has been developed to solve the problem. Findings – The paper has three major contributions. First, it provides better knowledge representation for present-day digital libraries, as new interdisciplinary subject areas are being introduced. Concepts representing interdisciplinary subject areas will have multiple parents, and consequently, the library ontology introduces a new set of nodes representing document classes. This concept also provides a faster search mechanism. Second, a new access control model has been introduced for the ontology structure, in which a user is authorized to access a concept node only if the user's credentials support it. Lastly, a client-based view generation algorithm has been developed so that a client’s access remains limited to its view, avoiding any possibility of undecidability in authorization specification. Research limitations/implications – The proposed model, in its present form, supports only read and browse facilities. It will later be extended to support the addition and update of documents. Moreover, the paper explains the model in a single-user environment. It will be augmented later to consider simultaneous access by multiple users. Practical implications – The paper emphasizes the need to change the present digital library ontology to a polyhierarchic structure in order to properly represent knowledge related to concepts covering interdisciplinary subject areas. Possible implementation strategies are also mentioned. This design method can also be extended to other semantic web applications. Originality/value – This paper offers a new knowledge management strategy to cover the gradual proliferation of interdisciplinary subject areas, along with a suitable access control model for a digital library ontology. This methodology can also be extended to other semantic web applications.
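
To make the polyhierarchy idea concrete, the sketch below stores a small ontology as a child-to-parents mapping, so that an interdisciplinary concept can have two parents, and derives a per-client view restricted to the concepts that client is authorized for. This is only an illustrative sketch: the concept names, the flat `granted` set and the functions `children_of` and `client_view` are assumptions, and the paper's actual authorization rules and view generation algorithm are not reproduced here.

```python
# Hypothetical polyhierarchic ontology: each concept lists ALL of its parents,
# so the structure is a directed acyclic graph rather than a tree.
PARENTS = {
    "Science": [],
    "Biology": ["Science"],
    "Computer Science": ["Science"],
    "Bioinformatics": ["Biology", "Computer Science"],  # two parents
}


def children_of(parents):
    """Invert the child -> parents mapping into parent -> children."""
    children = {concept: [] for concept in parents}
    for child, concept_parents in parents.items():
        for parent in concept_parents:
            children[parent].append(child)
    return children


def client_view(parents, granted):
    """Keep only granted concepts, and only the edges between granted concepts.

    Assumption: authorization is an explicit set of concept names per client.
    """
    return {
        concept: [p for p in concept_parents if p in granted]
        for concept, concept_parents in parents.items()
        if concept in granted
    }


view = client_view(PARENTS, granted={"Computer Science", "Bioinformatics"})
```

Here the client sees "Bioinformatics" under "Computer Science" only; the "Biology" parent and the "Science" root are absent from the view, so searches issued by this client never traverse them.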


2018 ◽  
Vol 26 (5) ◽  
pp. 613-636 ◽  
Author(s):  
Gunikhan Sonowal ◽  
KS Kuppusamy

Purpose This paper aims to propose a model entitled MMSPhiD (multidimensional similarity metrics model for screen reader user to phishing detection) that amalgamates multiple approaches to detect phishing URLs. Design/methodology/approach The model consists of three major components: a machine learning-based approach, a typosquatting-based approach and a phoneme-based approach. The major objectives of the proposed model are to detect phishing URLs and typosquatting and phoneme-based domains, and to suggest the legitimate domain targeted by attackers. Findings The results of the experiment show that the MMSPhiD model can detect phishing with 99.03 per cent accuracy. In addition, this paper analyzed 20 leading domains from Alexa and identified 1,861 registered typosquatting domains and 543 phoneme-based domains. Research limitations/implications The proposed model uses machine learning with a list-based approach. Building and maintaining the list is a limitation. Practical implication The results of the experiments demonstrate that the model achieved higher performance due to the incorporation of multi-dimensional filters. Social implications This paper incorporates the accessibility needs of persons with visual impairments and provides an accessible anti-phishing approach. Originality/value This paper assists persons with visual impairments in detecting phoneme-based phishing domains.
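
One simple way to flag a typosquatting domain and suggest the legitimate domain it imitates is to compare it against a list of known domains by edit distance. The sketch below uses Levenshtein distance for that purpose. It is an assumed stand-in for illustration, not the similarity metrics of MMSPhiD: the list `LEGITIMATE`, the threshold `max_distance` and the function names are hypothetical, and the phoneme-based component is not covered.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Hypothetical list of legitimate domains; a real list would need maintenance.
LEGITIMATE = ["google.com", "paypal.com", "amazon.com"]


def suggest_legitimate(domain, max_distance=2):
    """Return the listed domain that `domain` most likely imitates, else None."""
    if domain in LEGITIMATE:
        return None  # an exact match is the legitimate domain itself
    best = min(LEGITIMATE, key=lambda d: edit_distance(domain, d))
    return best if edit_distance(domain, best) <= max_distance else None
```

For example, `suggest_legitimate("paypa1.com")` returns `"paypal.com"`, while an unrelated domain such as `"example.org"` returns `None`. The dependence on a maintained list mirrors the limitation the abstract itself notes.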


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Xiwang Xiang ◽  
Xin Ma ◽  
Minda Ma ◽  
Wenqing Wu ◽  
Lang Yu

Purpose PM10 is one of the most dangerous air pollutants and is harmful to the ecological system and human health. Accurate forecasting of PM10 concentration makes it easier for the government to make efficient decisions and policies. However, the PM10 concentration, particularly the short-term concentration, is highly uncertain because it is affected by many factors and varies over time. A new methodology that can overcome these difficulties is therefore needed. Design/methodology/approach The grey system theory is used to build the short-term PM10 forecasting model. The Euler polynomial is used as a driving term of the proposed grey model, and then the convolutional solution is applied to make the new model computationally feasible. The grey wolf optimizer is used to select the optimal nonlinear parameters of the proposed model. Findings The introduction of the Euler polynomial makes the new model more flexible and more general, as it can yield several other conventional grey models under certain conditions. The new model performs significantly better, and is more accurate and more stable, than six existing grey models in three real-world cases and in the case of short-term PM10 forecasting in Tianjin, China. Practical implications With high performance in the real-world case in Tianjin, China, the proposed model appears to have high potential to accurately forecast the PM10 concentration in big cities in China. Therefore, it can be considered as a decision-making support tool in the near future. Originality/value This is the first work to introduce the Euler polynomial to grey system models, and a more general formulation of existing grey models is also obtained. The modelling pattern used in this paper can serve as an example for building other similar nonlinear grey models. The practical example of short-term PM10 forecasting in Tianjin, China is also presented for the first time.
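
For readers unfamiliar with grey models, the sketch below implements the conventional GM(1,1) model, the kind of baseline that the abstract says the new model generalizes. It is explicitly not the Euler polynomial model proposed in the paper: there is no polynomial driving term, no convolutional solution and no grey wolf optimizer. The function name `gm11_forecast` and the sample series are hypothetical, and the code assumes the fitted development coefficient `a` is non-zero.

```python
import math


def gm11_forecast(x0, steps=1):
    """Fit a conventional GM(1,1) model to x0 and forecast `steps` more values."""
    # Accumulated generating operation (1-AGO): running sum of the raw series.
    x1 = []
    running = 0.0
    for value in x0:
        running += value
        x1.append(running)
    # Background values: means of consecutive accumulated points.
    z = [(x1[k] + x1[k - 1]) / 2 for k in range(1, len(x1))]
    y = x0[1:]
    # Ordinary least squares for the grey equation x0(k) = -a * z(k) + b.
    n = len(z)
    sz, sy = sum(z), sum(y)
    szz = sum(v * v for v in z)
    szy = sum(zv * yv for zv, yv in zip(z, y))
    slope = (n * szy - sz * sy) / (n * szz - sz * sz)
    a = -slope
    b = (sy - slope * sz) / n

    def x1_hat(k):
        # Time response of the accumulated series; k = 0 is the first observation.
        return (x0[0] - b / a) * math.exp(-a * k) + b / a

    last = len(x0) - 1
    # Difference the accumulated predictions to get back to the raw scale.
    return [x1_hat(last + s) - x1_hat(last + s - 1) for s in range(1, steps + 1)]


# Hypothetical series growing by 10 per cent per step.
series = [100.0, 110.0, 121.0, 133.1, 146.41]
forecast = gm11_forecast(series, steps=1)
```

On this made-up series the one-step forecast lands close to the true continuation of about 161.05, which is why GM(1,1) is a common starting point for short series with roughly exponential behaviour.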


Kybernetes ◽  
2019 ◽  
Vol 49 (10) ◽  
pp. 2569-2587 ◽  
Author(s):  
Saeed Mirzamohammadi ◽  
Saeed Karimi ◽  
Mir Saman Pishvaee

Purpose The purpose of this paper is to develop a new systematic method, an extension of the reciprocal method, that allows a multi-unit organization to cope with the cost allocation problem. As uncertainty is an inherent characteristic of business environments, allowing for changes in the parameters involved is almost necessary. The outputs of the model determine the total value of each unit, business line or product. Design/methodology/approach In the proposed method, contrary to existing models, business units are able to transfer their costs to other units, and the total costs of support units need not be transferred completely. The DEMATEL approach, which finds all relationships between different parts of a system, is also applied to compute the effects of the expenses that units pay to each other. Moreover, a fuzzification approach is used to capture experts’ linguistic judgments about the related data. Findings Being closer to the real-world problem than the previous approach, the proposed systematic approach encompasses the other cost allocation models. Practical implications By applying the proposed model to a system such as a multi-unit organization, the total price of each unit or business line can be obtained. Moreover, this cost allocation process guides the related decision-makers to better manage the expenses that each unit pays to the others. Originality/value In existing studies, business units cannot pay expenses to support units. In the proposed method, however, business units are able to pay expenses to other units, and the total expenses of a support unit need not be paid completely. Moreover, treating the parameters involved as fuzzy numbers brings the proposed model closer to real-world problems.
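
The classical reciprocal method that this paper extends treats the total cost of each unit as its direct cost plus the shares charged to it by other units, which gives a system of simultaneous equations. The sketch below solves that system by fixed-point iteration for two support units. It is a crisp, simplified illustration only: the unit names, cost figures, share fractions and the function `reciprocal_totals` are hypothetical, and the paper's DEMATEL and fuzzy extensions are not modelled.

```python
def reciprocal_totals(direct_costs, share, iterations=100):
    """Solve total[u] = direct[u] + sum over v of share[v][u] * total[v].

    share[v][u] is the fraction of unit v's total cost charged to unit u.
    Fixed-point iteration is used; it assumes the shares are small enough
    for the iteration to converge.
    """
    totals = dict(direct_costs)
    for _ in range(iterations):
        totals = {
            u: direct_costs[u]
            + sum(share.get(v, {}).get(u, 0.0) * totals[v] for v in direct_costs)
            for u in direct_costs
        }
    return totals


# Hypothetical figures: two support units that serve each other.
direct_costs = {"maintenance": 10000.0, "it": 20000.0}
share = {
    "maintenance": {"it": 0.2},  # 20% of maintenance cost is charged to IT
    "it": {"maintenance": 0.1},  # 10% of IT cost is charged to maintenance
}
totals = reciprocal_totals(direct_costs, share)
```

Solving the two equations by hand gives maintenance = 12000 / 0.98 and IT = 20000 + 0.2 × maintenance, and the iteration converges to the same values. Because `share` may contain entries for any pair of units, the same function also accepts business units that charge costs to other units, which is the direction the abstract describes.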


2019 ◽  
Vol 43 (5/6) ◽  
pp. 592-618 ◽  
Author(s):  
Afaf Jghamou ◽  
Aziz Maziri ◽  
El Hassan Mallil ◽  
Jamal Echaabi

Purpose In this paper, the authors focus on training as a frequently used knowledge management tool. This paper aims to help the training function achieve excellence at the first attempt by evaluating and deciding on the most suitable method for each training action before committing the investment. Design/methodology/approach The authors apply instructional theories to evaluate the relevance of training methods and explore multiple criteria decision analysis methods, a mathematical approach, to make the evaluation rational. Experimental research based on case studies is also presented to test the applicability and effectiveness of the proposed model. Findings The result is a decision model that allows the most appropriate training method to be chosen rationally for each case. It is based on the Elicitation and Choice Translating Reality (ELECTRE I) method, which is a multi-criteria decision analysis method, and uses criteria from the First Principles developed by Merrill in 2002. Practical implications The proposed model may have several implications for the improvement of training performance, particularly in the context of quality management systems that require product compliance based on continuous improvement and risk-based approaches. It can, therefore, be used as a tool to control the quality of the training process or the risk related to the execution of a training action or, more generally, as a tool to “check” that the training methods chosen are the most appropriate to attain the training objectives before the training action is “acted” on. Originality/value The originality lies in combining a decision analysis system with the theory of instruction and in its applicability to training process management.


2018 ◽  
Vol 11 (2) ◽  
pp. 331-358
Author(s):  
Mahmoud Moradi ◽  
Nima Esfandiari ◽  
Majid Keshavarz Moghaddam

Purpose This paper aims to propose a model to assess leagile strategy effectively as one of the most discussed production and supply chain strategies. Design/methodology/approach This research is based on an integrated FLinPreRa-FQFD approach, which ranks and determines the importance of criteria and indices of the proposed model in an effective way. Findings “Cost” has been taken as the most important competitive advantage in the four selected industries. “Customer and market sensitiveness” has been considered as the most important enabler in three industries (“machinery and equipment,” “textile” and “food and beverage”). “Collaborative relationship” has been also considered as the most important enabler in the “material and chemical products” industry to gain leagility. Practical implication On the one hand, the proposed research model can be used as a reference guide for firms to reach leagility. This model presents indices of leagility at different levels. On the other hand, with respect to the main activities of the Iran Chambers of Commerce, Industries, Mines and Agriculture, including Qazvin’s, committed to provide requirements for economic development, this work provides an opportunity for the implementation of leagility model in 151 active companies. It also gives new insights into the leagility application in the four industries and Qazvin’s Chamber of Commerce, Industries, Mines and Agriculture. Originality/value This research has two main contributions. First, it presents a model of enablers and attributes of leagility, combined with four competitive advantages. Second, the research study is equipped with an integrated and consistent methodology that, first, helps decision makers prioritize their goals and agenda by means of fewer paired comparisons without the need of consistency rate and, second, allows for direct evaluation of impact of enablers on attributes of leagility.

