CHIC: a short read aligner for pan-genomic references

Mapping Intimacies ◽

10.1101/178129 ◽

2017 ◽

Cited By ~ 5

Author(s):

Daniel Valenzuela ◽

Veli Mäkinen

Keyword(s):

Computer Science ◽

Open Source ◽

Bioinformatic Analysis ◽

Hybrid Technique ◽

Short Read ◽

Pan Genome ◽

Link Type ◽

Short Read Aligner ◽

Science Community

Abstract Recently the topic of computational pan-genomics has gained increasing attention, particularly the problem of moving from a single-reference paradigm to a pan-genomic one. Perhaps the simplest way to represent a pan-genome is as a set of sequences. While indexing highly repetitive collections has been intensively studied in the computer science community, the research has focused on efficient indexing and exact pattern matching, making most solutions not yet suitable for use in bioinformatic analysis pipelines. Results: We present CHIC, a short-read aligner that indexes very large and repetitive references using a hybrid technique that combines Lempel-Ziv compression with Burrows-Wheeler read aligners. Availability: Our tool is open source and available online at https://gitlab.com/dvalenzu/CHIC
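
The abstract does not describe CHIC's internals, so the following is only a minimal sketch of the Lempel-Ziv idea it builds on, not the tool's implementation. It shows a greedy LZ77-style parse in which a repetitive collection collapses into a few "copy" phrases. The function names, the phrase encoding and the quadratic-time search are all assumptions chosen for readability.

```python
def lz77_factorize(text):
    """Greedy LZ77-style parse (illustrative, quadratic time).

    Each phrase is either ("literal", char) for a character with no earlier
    occurrence, or ("copy", source_pos, length) pointing at an earlier
    occurrence of the same substring.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier start position and keep the longest match.
        for j in range(i):
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, j
        if best_len == 0:
            phrases.append(("literal", text[i]))
            i += 1
        else:
            phrases.append(("copy", best_src, best_len))
            i += best_len
    return phrases


def lz77_decode(phrases):
    """Rebuild the original text from the phrase list."""
    out = []
    for phrase in phrases:
        if phrase[0] == "literal":
            out.append(phrase[1])
        else:
            _, src, length = phrase
            # Copy one character at a time so self-overlapping copies work.
            for k in range(length):
                out.append(out[src + k])
    return "".join(out)
```

On a toy "pan-genome" such as three near-identical copies of a short sequence, most of the text ends up inside copy phrases. That is the property that makes compressed indexes attractive for highly repetitive references.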

RaGOO: fast and accurate reference-guided scaffolding of draft genomes

Genome Biology ◽

10.1186/s13059-019-1829-6 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 56

Author(s):

Michael Alonge ◽

Sebastian Soyk ◽

Srividya Ramakrishnan ◽

Xingang Wang ◽

Sara Goodwin ◽

...

Keyword(s):

Arabidopsis Thaliana ◽

Open Source ◽

Genome Analysis ◽

De Novo ◽

Structural Variants ◽

Tomato Genome ◽

Pan Genome ◽

Link Type ◽

Genome Assemblies

Abstract We present RaGOO, a reference-guided contig ordering and orienting tool that leverages the speed and sensitivity of Minimap2 to accurately achieve chromosome-scale assemblies in minutes. After the pseudomolecules are constructed, RaGOO identifies structural variants, including those spanning sequencing gaps. We show that RaGOO accurately orders and orients 3 de novo tomato genome assemblies, including the widely used M82 reference cultivar. We then demonstrate the scalability and utility of RaGOO with a pan-genome analysis of 103 Arabidopsis thaliana accessions by examining the structural variants detected in the newly assembled pseudomolecules. RaGOO is available open source at https://github.com/malonge/RaGOO.
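
As a rough illustration of what reference-guided ordering and orienting involves, the sketch below places contigs from a list of contig-to-reference alignments. It is a simplified stand-in, not RaGOO's actual algorithm. The tuple layout, the function name and the three heuristics are assumptions: chromosome by largest aligned span, orientation by majority strand, and order by the midpoint of the longest alignment.

```python
from collections import defaultdict


def order_and_orient(alignments):
    """Place contigs on reference chromosomes (illustrative heuristics only).

    alignments: list of (contig, ref_chrom, ref_start, ref_end, strand)
    tuples, with strand "+" or "-".
    Returns {ref_chrom: [(contig, orientation), ...]} in reference order.
    """
    # Total aligned reference span per contig and chromosome.
    coverage = defaultdict(lambda: defaultdict(int))
    for contig, chrom, start, end, _strand in alignments:
        coverage[contig][chrom] += end - start

    layout = defaultdict(list)
    for contig, per_chrom in coverage.items():
        # Assign the contig to the chromosome it covers most.
        chrom = max(per_chrom, key=per_chrom.get)
        hits = [a for a in alignments if a[0] == contig and a[1] == chrom]
        # Orientation: the strand carrying the larger aligned span wins.
        plus = sum(e - s for _, _, s, e, strand in hits if strand == "+")
        minus = sum(e - s for _, _, s, e, strand in hits if strand == "-")
        orientation = "+" if plus >= minus else "-"
        # Position: midpoint of the longest alignment on that chromosome.
        longest = max(hits, key=lambda a: a[3] - a[2])
        midpoint = (longest[2] + longest[3]) / 2
        layout[chrom].append((midpoint, contig, orientation))

    return {
        chrom: [(contig, orientation) for _, contig, orientation in sorted(items)]
        for chrom, items in layout.items()
    }
```

In practice the alignment tuples would be parsed from an aligner's output, such as the Minimap2 alignments the abstract mentions. Parsing is left out here to keep the sketch self-contained.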

Open Source Research — the Power of Us

Australian Journal of Chemistry ◽

10.1071/ch06095 ◽

2006 ◽

Vol 59 (5) ◽

pp. 291 ◽

Cited By ~ 17

Author(s):

Thomas B. Kepler ◽

Marc A. Marti-Renom ◽

Stephen M. Maurer ◽

Arti K. Rai ◽

Ginger Taylor ◽

...

Keyword(s):

Organic Chemistry ◽

Computer Science ◽

Open Source ◽

Peer Review ◽

Biomedical Research ◽

Specific Problem ◽

Scientific Research ◽

Great Promise ◽

Competitive Funding ◽

Science Community

Academic and industrial scientific research operate on powerful and complementary models, consisting of some mix of competitive funding, peer review, and limited inter-laboratory collaboration. Enormous successes have arisen from both models. Yet there are clear failures to deliver results in certain areas, such as the provision of drugs for some of the most prevalent of human diseases. Is there a mechanism of research that is not wholly dependent on funding for its operation nor on traditional peer-reviewed articles for its propagation? Open source methods have delivered tangible benefits in the computer science community. We describe here efforts to extend these principles to science generally, and in particular biomedical research. Open source research holds great promise for solving complex problems in areas where profit-driven research is seen to have failed. We illustrate this with a specific problem in organic chemistry that we think will be solved substantially faster with an open source approach.

Leveraging Final Degree Projects for Open Source Software Contributions

Electronics ◽

10.3390/electronics10101181 ◽

2021 ◽

Vol 10 (10) ◽

pp. 1181

Author(s):

Juanan Pereira

Keyword(s):

Graduate Students ◽

Software Engineering ◽

Computer Science ◽

Open Source ◽

Communication Skills ◽

Open Source Software ◽

Real World ◽

Practical Experience ◽

Multiple Challenges ◽

Final Degree

(1) Background: final year students of computer science engineering degrees must carry out a final degree project (FDP) in order to graduate. Students’ contributions to improve open source software (OSS) through FDPs can offer multiple benefits and challenges for the students, the instructors and the project itself. This work reports on a practical experience developed by four students contributing to mature OSS projects during their FDPs, detailing how they addressed the multiple challenges involved, from both the students’ and the teachers’ perspectives. (2) Methods: we followed the work of four students contributing to two established OSS projects for two academic years and analyzed their work on GitHub and their responses to a survey. (3) Results: we obtained a set of specific recommendations for future practitioners and detailed a list of benefits achieved by steering FDPs towards OSS contributions, for students, teachers and the OSS projects. (4) Conclusion: we find that FDPs oriented towards enhancing OSS projects can introduce students to real-world, practical examples of software engineering principles, boost their confidence in their technical and communication skills and help them build a portfolio of contributions to open source applications used daily worldwide.

Open source tools for geographic analysis in transport planning

Journal of Geographical Systems ◽

10.1007/s10109-020-00342-2 ◽

2021 ◽

Author(s):

Robin Lovelace

Keyword(s):

User Interface ◽

Open Source ◽

Citizen Participation ◽

Simulation Software ◽

First Century ◽

Transport Planning ◽

Geographic Analysis ◽

Link Type ◽

Twenty First Century ◽

Interactive Map

Abstract Geographic analysis has long supported transport plans that are appropriate to local contexts. Many incumbent ‘tools of the trade’ are proprietary and were developed to support growth in motor traffic, limiting their utility for transport planners who have been tasked with twenty-first century objectives such as enabling citizen participation, reducing pollution, and increasing levels of physical activity by getting more people walking and cycling. Geographic techniques, such as route analysis, network editing, localised impact assessment and interactive map visualisation, have great potential to support modern transport planning priorities. The aim of this paper is to explore emerging open source tools for geographic analysis in transport planning, with reference to the literature and a review of open source tools that are already being used. A key finding is that a growing number of options exist, challenging the current landscape of proprietary tools. These can be classified as command-line interface, graphical user interface or web-based user interface tools and by the framework in which they were implemented, with numerous tools released as R, Python and JavaScript packages, and QGIS plugins. The review found a diverse and rapidly evolving ‘ecosystem’ of tools, with 25 tools that were designed for geographic analysis to support transport planning outlined in terms of their popularity and functionality based on online documentation. They ranged in size from single-purpose tools such as the QGIS plugin AwaP to sophisticated stand-alone multi-modal traffic simulation software such as MATSim, SUMO and Veins. Building on their ability to re-use the most effective components from other open source projects, developers of open source transport planning tools can avoid ‘reinventing the wheel’ and focus on innovation; the ‘gamified’ A/B Street simulation software (https://github.com/dabreegster/abstreet/#abstreet), based on OpenStreetMap, is a case in point.
The paper, the source code of which can be found at https://github.com/robinlovelace/open-gat, concludes that, although many of the tools reviewed are still evolving and further research is needed to understand their relative strengths and barriers to uptake, open source tools for geographic analysis in transport planning already hold great potential to help generate the strategic visions of change and evidence that is needed by transport planners in the twenty-first century.

Creating a national K-12 computer science community

Communications of the ACM ◽

10.1145/1039539.1039563 ◽

2005 ◽

Vol 48 (1) ◽

pp. 29-31 ◽

Cited By ~ 1

Author(s):

Chris Stephenson

Keyword(s):

Computer Science ◽

Science Community ◽

K 12

Faster short-read mapping with strobemer seeds in syncmer space

10.1101/2021.06.18.449070 ◽

2021 ◽

Author(s):

Kristoffer Sahlin

Keyword(s):

High Speed ◽

Mapping Accuracy ◽

Original Sequence ◽

Short Read ◽

Short Read Mapping ◽

Alignment Algorithms ◽

Reverse Complement ◽

Short Read Aligner ◽

Candidate Regions ◽

Burrows Wheeler Transform

Short-read genome alignment is a fundamental computational step used in many bioinformatic analyses. It is therefore desirable to align such data as fast as possible. Most alignment algorithms consider a seed-and-extend approach. Several popular programs perform the seeding step based on the Burrows-Wheeler Transform with a low memory footprint, but they are relatively slow compared to more recent approaches that use a minimizer-based seeding-and-chaining strategy. Recently, syncmers and strobemers were proposed for sequence comparison. Both protocols were designed for improved conservation of matches between sequences under mutations. Syncmers is a thinning protocol proposed as an alternative to minimizers, while strobemers is a linking protocol for gapped sequences and was proposed as an alternative to k-mers. The main contribution in this work is a new seeding approach that combines syncmers and strobemers. We use a strobemer protocol (randstrobes) to link together syncmers (i.e., in syncmer-space) instead of over the original sequence. Our protocol allows us to create longer seeds while preserving mapping accuracy. A longer seed length reduces the number of candidate regions which allows faster mapping and alignment. We also contribute the insight that speed-wise, this protocol is particularly effective when syncmers are canonical. Canonical syncmers can be created for specific parameter combinations and reduce the computational burden of computing the non-canonical randstrobes in reverse complement. We implement our idea in a proof-of-concept short-read aligner strobealign that aligns short reads 3-4x faster than minimap2 and 15-23x faster than BWA and Bowtie2. Many implementation versions of, e.g., BWA, achieve high speed on specific hardware. Our contribution is algorithmic and requires no hardware architecture or system-specific instructions. Strobealign is available at https://github.com/ksahlin/StrobeAlign.
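
The seeding idea described above, thinning with syncmers and then linking them with a strobemer protocol in syncmer space, can be sketched as follows. This is a toy illustration, not strobealign's code. It makes several assumptions: the `zlib.crc32` hash, the open-syncmer offset `t`, the window parameters `w_min` and `w_max` (counted in syncmers, not nucleotides), and a linking rule that minimises the hash of the concatenated pair. It also ignores canonical (strand-independent) syncmers.

```python
import zlib


def _hash(s):
    """Deterministic stand-in hash (an assumption; any good hash would do)."""
    return zlib.crc32(s.encode())


def open_syncmers(seq, k, s, t=0):
    """Return (position, kmer) for every k-mer whose smallest s-mer,
    by hash value, starts at offset t within the k-mer."""
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smer_hashes = [_hash(kmer[j:j + s]) for j in range(k - s + 1)]
        if smer_hashes.index(min(smer_hashes)) == t:
            selected.append((i, kmer))
    return selected


def randstrobes_in_syncmer_space(syncmers, w_min, w_max):
    """Link each syncmer to one downstream syncmer chosen pseudo-randomly.

    The second strobe is picked among the syncmers that lie w_min..w_max
    positions further along the syncmer list, so the window is measured in
    syncmers rather than in bases of the original sequence.
    Returns (pos_first, pos_second, seed) tuples.
    """
    seeds = []
    for idx, (pos1, strobe1) in enumerate(syncmers):
        window = syncmers[idx + w_min: idx + w_max + 1]
        if not window:
            break  # no downstream syncmer left to link to
        # Assumed linking rule: minimise the hash of the concatenated pair.
        pos2, strobe2 = min(window, key=lambda item: _hash(strobe1 + item[1]))
        seeds.append((pos1, pos2, strobe1 + strobe2))
    return seeds
```

Because each seed joins two syncmers, it is twice as long as a single syncmer without requiring the intervening bases to match. This is the property the abstract credits with reducing the number of candidate regions.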

Engineering and computer science community college transfers and native freshmen students: Relationships among participation in extra-curricular and co-curricular activities, connecting to the university campus, and academic success

2012 Frontiers in Education Conference Proceedings ◽

10.1109/fie.2012.6462276 ◽

2012 ◽

Cited By ~ 7

Author(s):

Lisa Massi ◽

Patrice Lancey ◽

Uday Nair ◽

Rachel Straney ◽

Michael Georgiopoulos ◽

...

Keyword(s):

Community College ◽

Academic Success ◽

Computer Science ◽

University Campus ◽

Freshmen Students ◽

Science Community ◽

Community College Transfers ◽

The University

Automatic Identification of Harmful, Aggressive, Abusive, and Offensive Language on the Web: A Survey of Technical Biases Informed by Psychology Literature

ACM Transactions on Social Computing ◽

10.1145/3479158 ◽

2021 ◽

Vol 4 (3) ◽

pp. 1-56

Author(s):

Agathe Balayn ◽

Jie Yang ◽

Zoltan Szlavik ◽

Alessandro Bozzon

Keyword(s):

Computer Science ◽

State Of The Art ◽

Automatic Detection ◽

Automatic Identification ◽

Machine Learning Classification ◽

Detection Systems ◽

Science Community ◽

Psychology Literature ◽

Offensive Language ◽

The Web

The automatic detection of conflictual languages (harmful, aggressive, abusive, and offensive languages) is essential to provide a healthy conversation environment on the Web. To design and develop detection systems that are capable of achieving satisfactory performance, a thorough understanding of the nature and properties of the targeted type of conflictual language is of great importance. The scientific communities investigating human psychology and social behavior have studied these languages in detail, but their insights have only partially reached the computer science community. In this survey, we aim both at systematically characterizing the conceptual properties of online conflictual languages, and at investigating the extent to which they are reflected in state-of-the-art automatic detection systems. Through an analysis of psychology literature, we provide a reconciled taxonomy that denotes the ensemble of conflictual languages typically studied in computer science. We then characterize the conceptual mismatches that can be observed in the main semantic and contextual properties of these languages and their treatment in computer science works, and systematically uncover resulting technical biases in the design of machine learning classification models and the datasets created for their training. Finally, we discuss diverse research opportunities for the computer science community and reflect on broader technical and structural issues.

chewBBACA: A complete suite for gene-by-gene schema creation and strain identification

10.1101/173146 ◽

2017 ◽

Cited By ~ 5

Author(s):

Mickael Silva ◽

Miguel Machado ◽

Diogo N. Silva ◽

Mirko Rossi ◽

Jacob Moran-Gilad ◽

...

Keyword(s):

Open Source ◽

Core Genome ◽

Bacterial Species ◽

Outbreak Detection ◽

Strain Identification ◽

List Type ◽

Whole Genome ◽

Link Type ◽

The Creation ◽

Allele Calling

ABSTRACT Gene-by-gene approaches are becoming increasingly popular in bacterial genomic epidemiology and outbreak detection. However, there is a lack of open-source scalable software for schema definition and allele calling for these methodologies. The chewBBACA suite was designed to assist users in the creation and evaluation of novel whole-genome or core-genome gene-by-gene typing schemas and subsequent allele calling in bacterial strains of interest. The software can run on a laptop or on high-performance clusters, making it useful for both small laboratories and large reference centers. ChewBBACA is available at https://github.com/B-UMMI/chewBBACA or as a Docker image at https://hub.docker.com/r/ummidock/chewbbaca/.
DATA SUMMARY Assembled genomes used for the tutorial were downloaded from NCBI in August 2016 by selecting those submitted as Streptococcus agalactiae taxon or sub-taxa. All the assemblies have been deposited as a zip file in FigShare (https://figshare.com/s/9cbe1d422805db54cd52), where a file with the original ftp link for each NCBI directory is also available. Code for the chewBBACA suite is available at https://github.com/B-UMMI/chewBBACA while the tutorial example is found at https://github.com/B-UMMI/chewBBACA_tutorial.
IMPACT STATEMENT The chewBBACA software offers a computational solution for the creation, evaluation and use of whole genome (wg) and core genome (cg) multilocus sequence typing (MLST) schemas. It allows researchers to develop wg/cgMLST schemes for any bacterial species from a set of genomes of interest. The alleles identified by chewBBACA correspond to potential coding sequences, possibly offering insights into the correspondence between the genetic variability identified and phenotypic variability. The software performs allele calling in a matter of seconds to minutes per strain on a laptop but is easily scalable for the analysis of large datasets of hundreds of thousands of strains using multiprocessing options. The chewBBACA software thus provides an efficient and freely available open source solution for gene-by-gene methods. Moreover, the ability to perform these tasks locally is desirable when the submission of raw data to a central repository or web services is hindered by data protection policies or ethical or legal concerns.

The Popgen Pipeline Platform: A Software Platform for Facilitating Population Genomic Analyses

10.1101/785774 ◽

2019 ◽

Author(s):

Andrew Webb ◽

Jared Knoblauch ◽

Nitesh Sabankar ◽

Apeksha Sukesh Kallur ◽

Jody Hey ◽

...

Keyword(s):

Open Source ◽

Development Time ◽

End Users ◽

File Format ◽

Software Platform ◽

Format Conversion ◽

Link Type ◽

Population Genomic ◽

Genomic Analyses ◽

File Format Conversion

Abstract Here we present the Pop-Gen Pipeline Platform (PPP), a software platform with the goal of reducing the computational expertise required for conducting population genomic analyses. The PPP was designed as a collection of scripts that facilitate common population genomic workflows in a consistent and standardized Python environment. Functions were developed to encompass entire workflows, including input preparation, file format conversion, various population genomic analyses, output generation, and visualization. By facilitating entire workflows, the PPP offers several benefits to prospective end users: it reduces the need for redundant in-house software and scripts that would require development time and may be error-prone or incorrect. The platform has also been developed with reproducibility and extensibility of analyses in mind. The PPP is an open-source package that is available for download and use at https://ppp.readthedocs.io/en/latest/PPP_pages/install.html