JCDL '18: Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries
SESSION: Keynote Talks
We Have Interesting Problems: Some Applied Grand Challenges from Digital Libraries, Archives and Museums
Libraries, Archives and Museums now have massive digital holdings. There is tremendous potential for library and information science, computer science and computer engineering researchers to partner with cultural heritage institutions and make our digital cultural record more useful and usable. In particular, there is a significant need to bridge basic research in areas such as computer vision, crowdsourcing, natural language processing, multilingual OCR, and machine learning to make this work directly usable in the practices of cultural heritage institutions. In this talk, I discuss a series of exemplar projects, largely funded through the Institute of Museum and Library Services National Digital Platform initiative, that illustrate some key principles for building applied research partnerships with cultural heritage institutions. Building on Ben Shneiderman's The New ABCs of Research: Achieving Breakthrough Collaborations, I focus specifically on why the public purpose and missions of cultural heritage institutions are particularly valuable in establishing new kinds of collaborations that can simultaneously advance basic research and the ability for people of the world to engage with their cultural record.
With the advancements in data-driven research ushered in by the "Big Data" era, many research fields from astrophysics to genomics have achieved previously impossible outcomes, including exoplanet detection and precision medicine. As with many other rapid changes, challenges are now emerging in extending the reach of these technologies to less computationally savvy fields that often have even more challenging computational problems to address. In this talk, I will explore some of the ways the Texas Advanced Computing Center (TACC) and others are creating environments to simplify the onboarding of new communities through simple yet powerful user interfaces that can preserve and publish data, computational workflows, and outcomes as linked yet separate entities. By linking, preserving, and making these entities discoverable, researchers new to these computational capabilities can more easily understand them, often in a research context with which they are already familiar. By then seeing in detail how a result came from the data, they can more easily build upon the work of others or even begin new workflows tuned to their own research needs. I will talk about our collaboration with publishers and libraries to not only preserve and expose the links between these digital entities but also allow for their recall into computational environments, either to reproduce previous results or to extend the research through additional data and/or new computational methods.
We can all agree that current publishing and dissemination modes for scholarly communication are not optimized for speed or utility, and are often impediments to advancing ideas and knowledge. I will discuss the current landscape of publishing tech, and what the Collaborative Knowledge Foundation (Coko) and its partners are doing to shake things up.
SESSION: Session 1A: Use
Understanding the Position of Information Professionals with regards to Linked Data: A Survey of Libraries, Archives and Museums
The aim of this study was to explore the benefits of and challenges to using Linked Data (LD) in Libraries, Archives and Museums (LAMs), as perceived by Information Professionals (IPs). The study also aimed to gain insight into potential solutions for overcoming these challenges. Data were collected via a questionnaire completed by 185 IPs from a range of LAM institutions. Results indicated that IPs find the process of integrating and interlinking LD datasets particularly challenging, and that current LD tooling does not meet their needs. The study showed that LD tools designed with the workflows and expertise of IPs in mind could help overcome these challenges.
Many digital information environments enable sharing of readers' highlights and other annotations, despite the lack of clear evidence of the effects on interaction behaviours and outcomes. We report on an experimental user study (n=15) of the impact of pre-existing highlights of varying quality on the digital reading process and outcomes of participants with different cognitive styles. We found that highlight quality affects surface level comprehension, but not deeper understanding. Participants were able to assess highlight quality and expressed different approaches to highlighting that influenced their interpretation of pre-existing highlights. Results regarding the impact of cognitive style were inconclusive.
This paper presents a foundation for an extensive framework expanding the use of eye movements as a source for user modeling. This work constructs a model of human oculomotor plant features (OPF) during users' interactions, with the goal of better interpreting user gaze data related to resource content. It also explores the anatomical reasoning behind incorporating additional gaze features, the integration of these features into an existing interest modeling architecture, and a plan for assessing the impact of their addition. The paper concludes with a few observations regarding the promise of using OPF in a user modeling framework to study search behavior.
Interaction on an Academic Social Networking Site: A Study of ResearchGate Q&A on Library and Information Science
Online information interaction on academic social networking sites (ASNSs) has become increasingly popular in recent years. It is unclear whether these sites have satisfied scholars' needs for interaction. This study investigates what scholars ask about, what features the questions and answers convey, and what socio-emotional reactions arise during interactions on ResearchGate Q&A. Bales' Interaction Process Analysis (IPA) was adopted to analyze 371 questions and 7530 answers on Library and Information Science from ResearchGate Q&A. Implications of the results are discussed and suggestions for future study are made.
SESSION: Session 1B: Collection Building
Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. The increase contributed to a less biased retrieval, even when taking the potential lower ranking of uncorrected documents into account.
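The retrievability computation this abstract builds on can be sketched as follows. This is an illustrative simplification using the common binary top-k utility and a Gini coefficient as the bias measure; the function names and cutoff are our assumptions, not the authors' exact configuration:

```python
from collections import defaultdict

def retrievability(ranked_results, cutoff=10):
    """Compute retrievability scores: a document earns one point for each
    query that ranks it within the top `cutoff` positions (binary utility).

    ranked_results: dict mapping query -> ordered list of document ids.
    Returns a dict mapping document id -> retrievability score r(d).
    """
    r = defaultdict(int)
    for query, docs in ranked_results.items():
        for doc in docs[:cutoff]:
            r[doc] += 1
    return dict(r)

def gini(scores):
    """Gini coefficient over retrievability scores: 0 means perfectly
    equal access; values near 1 mean a few documents dominate retrieval."""
    xs = sorted(scores)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n
```

Comparing the Gini coefficient of the corrected and uncorrected runs is one way to quantify the "retrievability bias" the abstract refers to.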
Building the collection of an institutional repository requires a complex understanding of both digital library infrastructure and staff resources, as well as the institution's faculty awareness and attitudes toward self-archiving. For collection development decisions, institutional repository (IR) managers weigh the influence of these factors when pursuing strategies to increase content and faculty participation. To evaluate strategies for collection development, the authors will apply the Analytic Hierarchy Process (AHP) to create a model through which collection development strategies can be evaluated based on the unique context of the institution.
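The Analytic Hierarchy Process mentioned above derives priority weights for competing strategies from a pairwise-comparison matrix. A minimal sketch using the geometric-mean approximation (this shortcut, rather than Saaty's principal-eigenvector method, and the example values are illustrative assumptions):

```python
import math

def ahp_weights(pairwise):
    """Derive AHP priority weights from a pairwise-comparison matrix using
    the geometric-mean (logarithmic least squares) approximation.

    pairwise[i][j] states how strongly criterion i is preferred over
    criterion j on Saaty's 1-9 scale, with pairwise[j][i] = 1 / pairwise[i][j].
    Returns weights that sum to 1.
    """
    geo = [math.prod(row) ** (1 / len(row)) for row in pairwise]
    total = sum(geo)
    return [g / total for g in geo]
```

For example, if "faculty outreach" is judged three times as important as "mediated deposit", `ahp_weights([[1, 3], [1/3, 1]])` yields weights of 0.75 and 0.25.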
Computer-Assisted Crowd Transcription of the U.S. Census with Personalized Assignments for Better Accuracy and Participation
Our Open Genealogy Data census transcription project is intended to make valuable census data more readily available to researchers, digital libraries, and others. We use automatic handwriting recognition to bootstrap our census database, facilitating searches even in the early stages of the project while manual transcription is underway. We provide a web-based interface for crowd-sourced transcription of the census to complete and correct the database. In an effort to improve both volunteer participation and transcription accuracy, we provide default transcription assignments based on transcribers' own genealogical family lines or geographical locations to improve the likelihood that the transcribers will be familiar with the names and places in their assignments.
Maintaining literature databases and online bibliographies is a core responsibility of metadata aggregators such as digital libraries. In the process of monitoring all the available data sources, the question arises of which data source should be prioritized. Based on a broad definition of information quality, we look for different ways to find the best-fitting and most promising conference candidates to harvest next. We evaluate different conference ranking features using a pseudo-relevance assessment and a component-based evaluation of our approach.
SESSION: Session 1C: Semantics and Linking
The availability of entity linking technologies provides a novel way to organize, categorize, and analyze large textual collections in digital libraries. However, in many situations a link to an entity offers only relatively coarse-grained semantic information. This is problematic especially when the entity is related to several different events, topics, roles, and -- more generally -- when it has different aspects. In this work, we introduce and address the task of entity-aspect linking: given a mention of an entity in a contextual passage, we refine the entity link with respect to the aspect of the entity it refers to. We show that a combination of different features and aspect representations in a learning-to-rank setting correctly predicts the entity-aspect in 70% of the cases. Additionally, we demonstrate significant and consistent improvements using entity-aspect linking on three entity prediction and categorization tasks relevant for the digital library community.
In this paper we investigate the accuracy and overall suitability of a variety of Entity Linking systems for the task of disambiguating entities in 17th century depositions obtained during the 1641 Irish Rebellion. The depositions are extremely difficult for modern NLP tools to work with due to inconsistent spelling, language use and archaic references. In order to assess the severity of the difficulty faced by Entity Linking systems when working with the depositions, we use them to create an evaluation corpus. This corpus is used as an input to the General Entity Annotator Benchmarking Framework, a standard benchmarking platform for entity annotation systems. Based on this corpus and the results obtained from the General Entity Annotator Benchmarking Framework, we observe that the accuracy of existing Entity Linking systems is lacking when applied to content like these depositions. This is due to a number of issues ranging from problems with existing state-of-the-art systems to poor representation of historic entities in modern knowledge bases. We discuss some interesting questions raised by this evaluation and put forward a plan for future work in order to learn more.
Physical and digital documents often do not exist in isolation but are implicitly or explicitly linked. Previous research in Human-Computer Interaction and Personal Information Management has revealed certain user behaviours in associating information across physical and digital documents. Nevertheless, there is a lack of empirical studies on user needs and behaviour when defining these associations. In this paper, we address this lack and provide insights into the strategies that users apply when associating information across physical and digital documents. In addition, our study reveals the limitations of current practices, and we suggest improvements for associating information across documents. Finally, we identify a set of design implications for the development of future cross-document linking solutions.
SESSION: Session 2A: Collection Access and Indexing
Putting Dates on the Map: Harvesting and Analyzing Street Names with Date Mentions and their Explanations
Street names are not only used across the world as part of addresses, but also reveal a lot about a country's identity. Thus, they are subject to analysis in the fields of geography and social science, where typically a manual analysis limited to a small region is performed, e.g., focusing on the renaming of streets in a city after a political change in a country. Surprisingly, there have hardly been any automatic, large-scale studies of street names so far, although these might yield interesting insights into the distribution of particular street name phenomena. In this paper, we present an automated, world-wide analysis of street names with date references. Such temporal street names are frequently used to commemorate important events and are thus particularly interesting to study. After applying a multilingual temporal tagger to discover such street names, we analyze their temporal and geographic distributions at different levels of granularity. Furthermore, we present an approach to automatically harvest potential explanations for why streets in specific regions refer to particular dates. Despite the challenges of these tasks, our evaluation demonstrates the feasibility of both the street name extraction and the explanation harvesting.
Contextualisation has proven to be effective in tailoring search results towards the user's information need. While this is true for basic query search, the use of contextual session information during exploratory search, especially at the level of browsing, has so far been underexposed in research. In this paper, we present two approaches that contextualise browsing at the level of structured metadata in a Digital Library (DL): (1) one variant based on document similarity, and (2) one variant utilising implicit session information, such as queries and the document metadata encountered during a user's session. We evaluate our approaches in a living lab environment using a DL in the social sciences and compare them against a non-contextualised approach. Over a period of more than three months, we analysed 47,444 unique retrieval sessions that contain search activities at the level of browsing. Our results show that contextualising browsing significantly outperforms our baseline in terms of the position of the first clicked item in the result set. The mean rank of the first clicked document (measured as mean first relevant - MFR) was 4.52 using a non-contextualised ranking, compared to 3.04 when re-ranking the result lists based on similarity to the previously viewed document. Furthermore, we observed that both contextual approaches show a noticeably higher click-through rate. A contextualisation based on document similarity leads to almost twice as many document views compared to the non-contextualised ranking.
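The similarity-based variant can be illustrated by re-ranking a result list against the previously viewed document. A minimal sketch using cosine similarity over hypothetical document vectors (the vector representation and field names are our assumptions, not the authors' implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank_by_context(results, context_vec):
    """Re-rank browsing results by similarity to the previously viewed
    document's vector (the session context), most similar first."""
    return sorted(results, key=lambda d: cosine(d["vec"], context_vec),
                  reverse=True)
```

Pushing contextually similar documents up the list is what lowers the mean rank of the first clicked document in the reported experiment.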
SESSION: Session 3A: Citation Analysis
Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers
Bibliographic reference parsing refers to extracting machine-readable metadata, such as the names of the authors, the title, or the journal name, from bibliographic reference strings. Many approaches to this problem have been proposed so far, including regular expressions, knowledge bases and supervised machine learning. Many open source reference parsers based on various algorithms are also available. In this paper, we apply, evaluate and compare ten reference parsing tools in a specific business use case. The tools are Anystyle-Parser, Biblio, CERMINE, Citation, Citation-Parser, GROBID, ParsCit, PDFSSA4MET, Reference Tagger and Science Parse, and we compare them in both their out-of-the-box versions and versions tuned to the project-specific data. According to our evaluation, the best performing out-of-the-box tool is GROBID (F1 0.89), followed by CERMINE (F1 0.83) and ParsCit (F1 0.75). We also found that even though machine learning-based tools and tools based on rules or regular expressions achieve on average similar precision (0.77 for ML-based tools vs. 0.76 for non-ML-based tools), applying machine learning-based tools results in a recall three times higher than in the case of non-ML-based tools (0.66 vs. 0.22). Our study also confirms that tuning the models to task-specific data increases quality. The retrained versions of the reference parsers are in all cases better than their out-of-the-box counterparts; for GROBID F1 increased by 3% (0.92 vs. 0.89), for CERMINE by 11% (0.92 vs. 0.83), and for ParsCit by 16% (0.87 vs. 0.75).
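The precision, recall and F1 figures above come from comparing extracted fields against a gold standard. A minimal sketch of such an evaluation over (field, token) pairs (this representation is an illustrative assumption, not the paper's exact protocol):

```python
def prf1(predicted, gold):
    """Token-level precision, recall and F1 for parsed reference fields.

    predicted, gold: sets of (field, token) pairs, e.g. ("title", "Deep").
    Returns (precision, recall, f1), each 0.0 when undefined.
    """
    tp = len(predicted & gold)               # pairs correct in both field and token
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

The reported ML-vs-rules gap (recall 0.66 vs. 0.22 at similar precision) is exactly the kind of asymmetry this metric surfaces.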
Linked Open Citation Database: Enabling Libraries to Contribute to an Open and Interconnected Citation Graph
Citations play a crucial role in the scientific discourse, in information retrieval, and in bibliometrics. Many initiatives are currently promoting the idea of having free and open citation data. Creation of citation data, however, is not part of the cataloging workflow in libraries nowadays.
In this paper, we present our project Linked Open Citation Database, in which we design distributed processes and a system infrastructure based on linked data technology. The goal is to show that efficiently cataloging citations in libraries using a semi-automatic approach is possible. We specifically describe the current state of the workflow and its implementation. We show that we could significantly improve the automatic reference extraction that is crucial for the subsequent data curation. We further give insights into the curation and linking process and provide evaluation results that not only direct the further development of the project, but also allow us to discuss its overall feasibility.
SESSION: Session 3B: Scientific Collections and Libraries I
Building a Theoretical Framework for the Development of Digital Scholarship Services in China's Universities
The provision of digital scholarship services (DSS) in China's universities is very unsystematic and fragmented. This paper reports on a literature review that aims to develop a comprehensive theoretical framework, which can serve as a practical guide for the development of DSS in China's university libraries. The framework was developed through systematically searching, screening, assessing, coding and aggregating DSS as reported in the existing body of literature. Academic literature, both in Chinese and English, as well as relevant professional reports, was carefully searched, selected and analysed. The analysis of the literature pointed to 25 DSS in six categories: supporting services, formulating research ideas, locating research partners, proposal writing, research process, and publication. This paper focuses on the development of DSS in China's university libraries, but the research findings and the framework developed can provide useful insights and indications that can be shared across international borders.
With the ever-increasing volume of formulae on the Web, formula retrieval has drawn much attention from researchers. However, most existing research on formula retrieval treats each formula within an article equally, although different formulae in the same article are of different importance to it. In this paper, we address the issue of ranking formulae within an article based on their importance. To evaluate the importance of each formula within an article, a formula citation graph is first built over a large-scale corpus, and the inter-article features of formulae are extracted through link topology analysis based on this graph. Then a word embedding model is used to extract the inner-article features by mining the semantic relevance between a formula and the corresponding article. Finally, we leverage a learning-to-rank technique to rank formulae within an article based on those features. The experimental results demonstrate that the proposed features are helpful for formula ranking and that our approach yields better performance than other state-of-the-art methods.
Evaluating papers and venues objectively and fairly is a vital and challenging task for scientists, research organizations, and research funding bodies alike. Recently, heterogeneous networks have been used to evaluate papers, authors and venues separately or simultaneously. However, most of the approaches treat all the papers in the citation network equally and ignore the prestige of citing venues and citation time intervals. In this paper, we propose a new framework, MR-Rank, which ranks papers and venues iteratively in a mutually reinforcing way. Several factors including citation time interval, recent performance of a venue, and contribution of a paper to its venue are considered at the same time. Based on the ACL dataset, our experiments show that MR-Rank outperforms other models in terms of ranking effectiveness and efficiency.
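The mutually reinforcing idea behind MR-Rank can be sketched as a simple fixed-point iteration: papers lend their scores to their venues, and venues lend prestige back to their papers. This sketch omits the paper's citation time intervals and venue-recency factors; the damping parameter and update rules are illustrative assumptions:

```python
def mutual_rank(paper_citations, paper_venue, iters=20, d=0.85):
    """Sketch of a mutually reinforcing paper/venue ranking.

    paper_citations: dict paper -> list of papers citing it.
    paper_venue: dict paper -> venue it was published in.
    Returns (paper_scores, venue_scores).
    """
    papers = list(paper_venue)
    venues = set(paper_venue.values())
    p_score = {p: 1.0 / len(papers) for p in papers}
    v_score = {}
    for _ in range(iters):
        # a venue's score is the mean score of its papers
        v_score = {
            v: (sum(p_score[p] for p in papers if paper_venue[p] == v)
                / sum(1 for p in papers if paper_venue[p] == v))
            for v in venues
        }
        # a paper's score mixes citation inflow with its venue's prestige
        p_new = {
            p: (1 - d) * v_score[paper_venue[p]]
               + d * sum(p_score[c] for c in paper_citations.get(p, []))
            for p in papers
        }
        norm = sum(p_new.values()) or 1.0
        p_score = {p: s / norm for p, s in p_new.items()}
    return p_score, v_score
```

A heavily cited paper pulls up its venue's score, which in turn lifts that venue's other papers, which is the mutual-reinforcement loop the abstract describes.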
SESSION: Session 3C: Multimedia Retrieval
Identifying plagiarized content is a crucial task for educational and research institutions, funding agencies, and academic publishers. Plagiarism detection systems available for productive use reliably identify copied text, or near-copies of text, but often fail to detect disguised forms of academic plagiarism, such as paraphrases, translations, and idea plagiarism. To improve the detection capabilities for disguised forms of academic plagiarism, we analyze the images in academic documents as text-independent features. We propose an adaptive, scalable, and extensible image-based plagiarism detection approach suitable for analyzing a wide range of image similarities that we observed in academic documents. The proposed detection approach integrates established image analysis methods, such as perceptual hashing, with newly developed similarity assessments for images, such as ratio hashing and position-aware OCR text matching. We evaluate our approach using 15 image pairs that are representative of the spectrum of image similarity we observed in alleged and confirmed cases of academic plagiarism. We embed the test cases in a collection of 4,500 related images from academic texts. Our detection approach achieved a recall of 0.73 and a precision of 1. These results indicate that our image-based approach can complement other content-based feature analysis approaches to retrieve potential source documents for suspiciously similar content from large collections. We provide our code as open source to facilitate future research on image-based plagiarism detection.
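Perceptual hashing, one of the established image analysis methods the approach integrates, can be illustrated with a toy average hash. Real systems first resize and grayscale the image with an imaging library; here we assume a pre-downscaled intensity grid, and the function names are ours:

```python
def average_hash(grid):
    """Toy average hash: each cell of a pre-downscaled grayscale grid is
    thresholded against the grid's mean intensity, yielding a bit string.
    Near-duplicate images produce hashes that differ in few bits."""
    flat = [p for row in grid for p in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if p > mean else "0" for p in flat)

def hamming(h1, h2):
    """Number of differing bits between two equal-length hashes."""
    return sum(a != b for a, b in zip(h1, h2))
```

Flagging image pairs whose Hamming distance falls below a threshold is the basic retrieval step that the paper's ratio hashing and position-aware OCR text matching then refine.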
Multiple diagram navigation (MDN) uses multiple diagrams, maps, or charts to navigate a document collection. To support exploratory search scenarios, MDN provides unconventional overviews of the content and introduces novel navigational queries. In a Diagram-to-Content query, a user clicks on diagram elements to retrieve related collection documents. Diagram-to-Diagram queries allow users to select diagram element(s) to see related elements highlighted in other diagrams. Content-to-Diagram queries highlight diagram elements related to document(s) selected by the user. MDN depends on manually created connections between diagram elements and collection documents. Therefore, only a small portion of the targeted collection might be accessible from the diagrams. In this paper, we extend the domain of MDN diagram-to-content queries to reach related collection documents not directly connected to the diagrams. We focus on Wikipedia as a case study and a class of diagrams with specific characteristics. We exploit the Wikipedia hyperlink graph and internal diagram structures to provide a diagram-influenced ranking of Wikipedia pages. We tested different settings for our ranking algorithm using 12 diagrams from six domains. The results showed the strong influence of diagrams on ranking and reasonably high similarity between diagram elements selected by the user and the top-10-ranked pages.
SESSION: Session 4A: Quality and Preservation
We develop an evaluation framework for the validation of conformance checkers for long-term preservation. The framework assesses the correctness, usability, and usefulness of the tools for three media types: PDF/A (text), TIFF (image), and Matroska (audio/video). Finally, we report the results of validating these conformance checkers using the proposed framework. In general, the presented framework is a high-level tool that can quite easily be employed in other preservation-related tasks.
Archived collections of documents (like newspaper and web archives) serve as important information sources in a variety of disciplines, including Digital Humanities, Historical Science, and Journalism. However, the absence of efficient and meaningful exploration methods still remains a major hurdle in the way of turning them into usable sources of information. A semantic layer is an RDF graph that describes metadata and semantic information about a collection of archived documents, which in turn can be queried through a semantic query language (SPARQL). This allows running advanced queries by combining metadata of the documents (like publication date) and content-based semantic information (like entities mentioned in the documents). However, the results returned by such structured queries can be numerous; moreover, they all match the query equally. In this paper, we deal with this problem and formalize the task of ranking archived documents for structured queries on semantic layers. Then, we propose two ranking models for the problem at hand which jointly consider: i) the relatedness of documents to entities, ii) the timeliness of documents, and iii) the temporal relations among the entities. The experimental results on a new evaluation dataset show the effectiveness of the proposed models and allow us to understand their limitations.
Authorship contribution is often taken for granted. Internally, the contribution rate is usually known among all the authors of a given paper. However, this rate is hard for external parties to verify, as measuring authors' contributions is still not common and the way to measure them is unclear. In this paper, we propose a new blockchain-based framework to assess the contribution of all authors of any scientific paper. Our framework can be implemented by anyone who is directly or indirectly involved in the publication of the paper, such as a principal researcher, grant funder, research assistant or anyone from relevant external bodies.
SESSION: Session 4B: Text Collections
Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Competitive Performance to Full-Text
For (semi-)automated subject indexing systems in digital libraries, it is often more practical to use metadata such as the title of a publication instead of the full-text or the abstract. Therefore, it is desirable to have good text mining and text classification algorithms that operate well already on the title of a publication. So far, the classification performance on titles is not competitive with the performance on full-texts if the same number of training samples is used for training. However, it is much easier to obtain title data in large quantities and to use it for training than full-text data. In this paper, we investigate how models trained on increasing amounts of title data compare to models trained on a constant number of full-texts. We evaluate this question on a large-scale dataset from the medical domain (PubMed) and from economics (EconBiz). In these datasets, the titles and annotations of millions of publications are available, and they outnumber the available full-texts by factors of 20 and 15, respectively. To exploit these large amounts of data to their full potential, we develop three strong deep learning classifiers and evaluate their performance on the two datasets. The results are promising. On the EconBiz dataset, all three classifiers outperform their full-text counterparts by a large margin. The best title-based classifier outperforms the best full-text method by 9.4%. On the PubMed dataset, the best title-based method almost reaches the performance of the best full-text classifier, with a difference of only 2.9%.
To develop a richer understanding of how folksonomies and social tagging differ from and are similar to professional indexing languages, this paper presents a preliminary analysis of over 2 million keyword tags on the community blog MetaFilter and its companion question-and-answer site Ask MetaFilter. Most of the tags in these narrow folksonomies were created by users when they published a post, but some tags were retroactively created for old posts by a small group of volunteers. Both organic and retroactive tags on MetaFilter and Ask MetaFilter followed a power law distribution, which is expected for folksonomies. Based on tag distribution, use of organization tags, and avoidance of synonyms, however, retroactive taggers did not tag like professional indexers. Instead, they tagged using similar practices to organic taggers, even actively accommodating the broader community's use of synonyms. These findings suggest that folksonomies remain a distinctly different approach to knowledge organization.
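The power-law check mentioned above is commonly done by fitting a line to the rank-frequency distribution on log-log axes; an approximately linear fit with a negative slope is the expected signature. A minimal least-squares sketch (not a rigorous power-law fit such as maximum-likelihood estimation):

```python
import math

def loglog_slope(tag_counts):
    """Least-squares slope of log(frequency) against log(rank) for a
    tag-frequency table. A power law f(rank) ~ rank**(-k) yields a
    roughly linear log-log plot with slope close to -k."""
    freqs = sorted(tag_counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var
```

Running this separately on the organic and retroactive tag sets is one way to compare their distributions, as the study does qualitatively.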
Researching the evolution of the concepts represented by words like "peace" or "freedom", known as conceptual history, is an important discipline in the humanities, but still a laborious task. It normally consists of reading and interpreting a large number of carefully selected texts, without, however, always having comprehensive knowledge of all the potentially relevant material. Thus, our objective is to design a query algebra for accessing temporal text corpora. It should comprehensively allow domain experts to formalize hypotheses on how concepts manifest in large-scale digital text corpora, targeting the complete works of Reinhart Koselleck, a highly prominent researcher in conceptual history. In cooperation with domain experts, we first determine the primary information types used in conceptual history, such as word usage frequency or sentiment. Based on this, we define database operators formalizing these types, which can be combined to formulate arbitrarily complex queries representing hypotheses. The result is a novel query algebra that enables researchers in conceptual history to access large text corpora and extensively analyze word behaviors over time in a comprehensive way. In a proof of concept, we demonstrate how to use our algebra, resulting in first new insights and demonstrating its suitability.
SESSION: Session 5A: Exploring and Analyzing Collections
Social networks like Twitter and Facebook are the largest sources of public opinion and real-time information on the Internet. If an event is of general interest, news articles follow and eventually a Wikipedia page. We propose the problem of automatic event story generation and archiving by combining social and news data to construct a new type of document in the form of a Wiki-like page structure. We introduce a technique that shows the evolution of a story as perceived by the crowd in social media, along with editorially authored articles annotated with examples of social media as supporting evidence. At the core of our research is the temporally sensitive extraction of data that serves as context for retrieval purposes. Our approach includes a fine-grained vote-counting strategy used for weighting purposes, pseudo-relevance feedback and query expansion with social data and web query logs, along with a timeline algorithm as the base for a story. We demonstrate the effectiveness of our approach by processing a dataset comprising millions of English-language tweets generated over a one-year period, and present a full implementation of our system.
This work addresses the problem of author name homonymy in the Web of Science. Aiming for an efficient, simple and straightforward solution, we introduce a novel probabilistic similarity measure for author name disambiguation based on feature overlap. Using the ResearcherID available for a subset of the Web of Science, we evaluate the application of this measure in the context of agglomeratively clustering author mentions. We focus on a concise evaluation that shows clearly for which problem setups and at which time during the clustering process our approach works best. In contrast to most other works in this field, we are skeptical towards the performance of author name disambiguation methods in general and compare our approach to the trivial single-cluster baseline. Our results are presented separately for each correct clustering size, as we show that, when treating all cases together, the trivial baseline and more sophisticated approaches are hardly distinguishable in terms of evaluation results. Our model shows state-of-the-art performance for all correct clustering sizes without any discriminative training and with tuning only one convergence parameter.
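A minimal sketch of feature-overlap similarity driving agglomerative clustering might look like the following. The Jaccard-style measure and the merge threshold are assumptions; the paper's probabilistic measure is more involved:

```python
def overlap_similarity(a, b):
    """Jaccard-style feature overlap between two author mentions,
    each represented as a set of features (coauthors, venues, terms)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def agglomerative_cluster(mentions, threshold=0.3):
    """Greedy single-link agglomerative clustering of author mentions."""
    clusters = [{i} for i in range(len(mentions))]

    def cluster_sim(c1, c2):
        return max(overlap_similarity(mentions[i], mentions[j])
                   for i in c1 for j in c2)

    while len(clusters) > 1:
        i, j = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        if cluster_sim(clusters[i], clusters[j]) < threshold:
            break  # stop merging once best pair is too dissimilar
        clusters[i] |= clusters.pop(j)
    return clusters
```

Mentions sharing many features merge into one author cluster; the single-cluster baseline corresponds to setting the threshold to zero.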
Having good knowledge and comprehension of history is believed to be important for a variety of reasons. Microblogging platforms could offer good opportunities to study how and when people explicitly refer to the past, in which contexts such references appear and what purposes they serve. However, this area remains unexplored. In this paper we report the results of a large-scale exploratory analysis of history-focused references in microblogs based on an 11-month snapshot of Twitter data. We are the first to analyze general historical references in Twitter based on large-scale data analysis. The results of this study can be used for designing content recommendation systems and could help improve time-aware search applications.
SESSION: Session 5B: Scientific Collections and Libraries II
Non-textual components such as charts, diagrams and tables provide key information in many scientific documents, but the lack of large labeled datasets has impeded the development of data-driven methods for scientific figure extraction. In this paper, we induce high-quality training labels for the task of figure extraction in a large number of scientific documents, with no human intervention. To accomplish this we leverage the auxiliary data provided in two large web collections of scientific documents (arXiv and PubMed) to locate figures and their associated captions in the rasterized PDF. We share the resulting dataset of over 5.5 million induced labels, 4,000 times larger than the previous largest figure extraction dataset, with an average precision of 96.8%, to enable the development of modern data-driven methods for this task. We use this dataset to train a deep neural network for end-to-end figure detection, yielding a model that can be more easily extended to new domains compared to previous work. The model was successfully deployed in Semantic Scholar (https://www.semanticscholar.org/), a large-scale academic search engine, and used to extract figures in 13 million scientific documents. A demo of our system is available at http://labs.semanticscholar.org/deepfigures/, our dataset of induced labels can be downloaded at https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/deepfigures/jcdl-deepfigures-labels.tar.gz, and code to run our system locally can be found at https://github.com/allenai/deepfigures-open.
Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context
Mathematical formulae represent complex semantic information in a concise form. Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial to communicate information, e.g., in scientific papers, and to perform computations using computer algebra systems. Enabling computers to access the information encoded in mathematical formulae requires machine-readable formats that can represent both the presentation and content, i.e., the semantics, of formulae. Exchanging such information between systems additionally requires conversion methods for mathematical representation formats. We analyze how the semantic enrichment of formulae improves the format conversion process and show that considering the textual context of formulae reduces the error rate of such conversions. Our main contributions are: (1) providing an openly available benchmark dataset for the mathematical format conversion task consisting of a newly created test collection, an extensive, manually curated gold standard and task-specific evaluation metrics; (2) performing a quantitative evaluation of state-of-the-art tools for mathematical format conversions; (3) presenting a new approach that considers the textual context of formulae to reduce the error rate for mathematical format conversions. Our benchmark dataset facilitates future research on mathematical format conversions as well as research on many problems in mathematical information retrieval. Because we annotated and linked all components of formulae, e.g., identifiers, operators and other entities, to Wikidata entries, the gold standard can, for instance, be used to train methods for formula concept discovery and recognition. Such methods can then be applied to improve mathematical information retrieval systems, e.g., for semantic formula search, recommendation of mathematical content, or detection of mathematical plagiarism.
Scientific articles usually follow a common pattern of discourse, and their contents can be divided into several facets, such as objective, method, and result. We examine the efficacy of using these discourse facets for citation recommendation. A method for learning multi-vector representations of scientific articles is proposed, in which each vector encodes a discourse facet present in an article. With each facet represented as a separate vector, the similarity of articles can be measured not in their entirety, but facet by facet. The proposed representation method is tested on a new citation recommendation task called context-based co-citation recommendation. This task calls for the evaluation of article similarity in terms of citation contexts, wherein facets help to abstract and generalize the diversity of contexts. The experimental results show that the facet-based representation outperforms the standard monolithic representation of articles.
SESSION: Session 5C: Archiving
Biography is a genre that intertwines history and identity. Modern biographies have relied on interviews to supplement contemporaneous sources of primary material such as letters, literary drafts, and notebooks, usually held in physical special collections acquired and curated by major research libraries. Will the addition of new digital sources such as records repositories, digital libraries, social media, and collections of ephemera change biographical research practices? I use the construction of a subject-driven collection of over 11,750 discrete digital items as a case study to demonstrate how new digital resources can extend the breadth and depth of biographical description, facilitate the rediscovery of a subject's social network, and enable formerly invisible literary influences to be foregrounded. I also explore the implications of the use of ephemera in tandem with other digital resources to ask what we might want to save in the future (including elements of today's social media platforms) and to discuss the trade-offs in making material that was once ephemeral (and difficult to access) so readily available online.
Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7. In this paper, we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight about the retrievability of URIs of news stories found on Google, and to answer two main questions: first, can one "refind" the same URI of a news story (for the same query) from Google after a given time? Second, what is the probability of finding a story on Google over a given period of time? To answer these questions, we issued seven queries to Google every day for over seven months (2017-05-25 to 2018-01-12) and collected links from the first five SERPs to generate seven collections for each query. The queries represent public interest stories: "healthcare bill," "manchester bombing," "london terrorism," "trump russia," "travel ban," "hurricane harvey," and "hurricane irma." We tracked each URI in all collections over time to estimate the discoverability of URIs from the first five SERPs. Our results showed that the daily average rate at which stories were replaced on the default Google SERP ranged from 0.21 - 0.54, and a weekly rate of 0.39 - 0.79, suggesting the fast replacement of older stories by newer stories. The probability of finding the same URI of a news story after one day from the initial appearance on the SERP ranged from 0.34 - 0.44. After a week, the probability of finding the same news stories diminishes rapidly to 0.01 - 0.11. In addition to the reporting of these probabilities, we also provide two predictive models for estimating the probability of finding the URI of an arbitrary news story on SERPs as a function of time. The web archiving community considers link rot and content drift important reasons for collection building. 
Similarly, our findings suggest that, due to the difficulty of retrieving the URIs of news stories from Google, collection building that originates from search engines should begin as soon as possible in order to capture the first stages of events, and should persist in order to capture their evolution, because it becomes more difficult to find the same news stories with the same queries on Google as time progresses.
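The reported rates can be reproduced in miniature; the exponential-decay form below is one plausible shape for a predictive model of findability over time, not necessarily the one the authors fit:

```python
import math

def replacement_rate(serp_day_a, serp_day_b):
    """Fraction of URIs on day A's SERP that no longer appear on day B's SERP."""
    a, b = set(serp_day_a), set(serp_day_b)
    return len(a - b) / len(a) if a else 0.0

def fit_exponential_decay(p1, t1, p2, t2):
    """Fit P(t) = p1 * exp(-lam * (t - t1)) through two observed
    (time, probability) points and return the fitted function."""
    lam = math.log(p1 / p2) / (t2 - t1)
    return lambda t: p1 * math.exp(-lam * (t - t1))
```

For example, anchoring the curve at a day-one probability of 0.40 and a day-seven probability of 0.05 (values within the ranges reported above) yields a model that interpolates findability for intermediate days.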
Personal and private Web archives are proliferating due to the increase in the tools to create them and the realization that Internet Archive and other public Web archives are unable to capture personalized (e.g., Facebook) and private (e.g., banking) Web pages. We introduce a framework to mitigate issues of aggregation in private, personal, and public Web archives without compromising potential sensitive information contained in private captures. We amend Memento syntax and semantics to allow TimeMap enrichment to account for additional attributes to be expressed inclusive of the requirements for dereferencing private Web archive captures. We provide a method to involve the user further in the negotiation of archival captures in dimensions beyond time. We introduce a model for archival querying precedence and short-circuiting, as needed when aggregating private and personal Web archive captures with those from public Web archives through Memento. Negotiation of this sort is novel to Web archiving and allows for the more seamless aggregation of various types of Web archives to convey a more accurate picture of the past Web.
SESSION: Session 6A: Topic Modeling and Detection
Comparative text mining extends from genre analysis and political bias detection to the revelation of cultural and geographic differences, through to the search for prior art across patents and scientific papers. These applications use cross-collection topic modeling for the exploration, clustering, and comparison of large sets of documents, such as digital libraries. However, topic modeling on documents from different collections is challenging because of domain-specific vocabulary.
We present a cross-collection topic model combined with automatic domain term extraction and phrase segmentation. This model distinguishes collection-specific and collection-independent words based on information entropy and reveals commonalities and differences of multiple text collections. We evaluate our model on patents, scientific papers, newspaper articles, forum posts, and Wikipedia articles. In comparison to state-of-the-art cross-collection topic modeling, our model achieves up to 13% higher topic coherence, up to 4% lower perplexity, and up to 31% higher document classification accuracy. More importantly, our approach is the first topic model that ensures disjunct general and specific word distributions, resulting in clear-cut topic representations.
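The entropy-based separation of collection-specific and collection-independent words can be illustrated as follows; this is a sketch, and the model's actual integration of entropy into topic inference is more elaborate:

```python
import math

def cross_collection_entropy(counts_per_collection):
    """Shannon entropy of a word's frequency distribution across collections.
    High entropy suggests a collection-independent (general) word;
    low entropy suggests a collection-specific word."""
    total = sum(counts_per_collection)
    probs = [c / total for c in counts_per_collection if c > 0]
    return -sum(p * math.log2(p) for p in probs)
```

A word occurring equally often in two collections reaches the maximum entropy of 1 bit, while a word confined to one collection has entropy 0.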
The distributional hypothesis states that similar words tend to have similar contexts in which they occur. Word embedding models exploit this hypothesis by learning word vectors based on the local context of words. Probabilistic topic models on the other hand utilize word co-occurrences across documents to identify topically related words. Due to their complementary nature, these models define different notions of word similarity, which, when combined, can produce better topical representations. In this paper we propose WELDA, a new type of topic model, which combines word embeddings (WE) with latent Dirichlet allocation (LDA) to improve topic quality. We achieve this by estimating topic distributions in the word embedding space and exchanging selected topic words via Gibbs sampling from this space. We present an extensive evaluation showing that WELDA cuts runtime by at least 30% while outperforming other combined approaches with respect to topic coherence and for solving word intrusion tasks.
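A toy version of WELDA's word-exchange step, assuming pretrained embeddings, might look like this. The vocabulary, vectors, and Gaussian topic estimate are illustrative; the real model performs the exchange within Gibbs sampling iterations of LDA:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary with 2-d "embeddings"; in WELDA these would come from word2vec.
vocab = ["cat", "dog", "pet", "stock", "bond", "market"]
emb = np.array([[1.0, 0.1], [0.9, 0.2], [1.1, 0.0],
                [0.0, 1.0], [0.1, 1.1], [0.2, 0.9]])

def topic_gaussian(word_ids):
    """Estimate a Gaussian topic distribution in embedding space
    from the word vectors currently assigned to the topic."""
    vecs = emb[word_ids]
    return vecs.mean(axis=0), vecs.std(axis=0) + 1e-6

def exchange_word(word_ids, lam=1.0):
    """Sample a point from the topic's Gaussian and swap in the
    nearest vocabulary word as the replacement topic word."""
    mu, sigma = topic_gaussian(word_ids)
    point = rng.normal(mu, lam * sigma)
    dists = np.linalg.norm(emb - point, axis=1)
    return vocab[int(np.argmin(dists))]
```

Because the sampled point lies near the topic's mean in embedding space, the swapped-in word stays topically coherent, which is the mechanism by which WELDA sharpens topic quality.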
Being able to rapidly recognise new research trends is strategic for many stakeholders, including universities, institutional funding bodies, academic publishers and companies. The literature presents several approaches to identifying the emergence of new research topics, which rely on the assumption that the topic is already exhibiting a certain degree of popularity and consistently referred to by a community of researchers. However, detecting the emergence of a new research area at an embryonic stage, i.e., before the topic has been consistently labelled by a community of researchers and associated with a number of publications, is still an open challenge. We address this issue by introducing Augur, a novel approach to the early detection of research topics. Augur analyses the diachronic relationships between research areas and is able to detect clusters of topics that exhibit dynamics correlated with the emergence of new research topics. Here we also present the Advanced Clique Percolation Method (ACPM), a new community detection algorithm developed specifically for supporting this task. Augur was evaluated on a gold standard of 1,408 debutant topics in the 2000-2011 interval and outperformed four alternative approaches in terms of both precision and recall.
Within the context of mass-scale digital libraries, this panel will explore methodologies and uses for, as well as the results of, conceiving of "data as collections" and "collections as data." The panel will explore the implications of these concepts through use cases involving data mining of the HathiTrust Digital Library, particularly major projects developed at the HathiTrust Research Center. Featured will be the Workset Creation for Scholarly Analysis + Data Capsules (WCSA+DC) project, the Solr Extracted Features project, and the Image Analysis for Archival Discovery (Aida) project. Each of these projects focuses on various aspects of text, image and data mining and analysis of mass-scale digital library collections.
This panel addresses the opportunities and challenges of using multi-institutional collaborations and digital approaches to drive engaged-learning and archive-focused projects. It focuses in particular on the opportunities presented by the archives related to the negotiation of constitutions and international treaties.
Can Research Librarians make Contributions to Decision-making as Intelligence Analysts?: The Prospects and Challenges
This panel discusses the prospects and challenges of providing intelligence analysis services in U.S. research libraries to support decision-making at various levels.
We compare and contrast three different ways to implement an archival replay banner. We propose an implementation that utilizes Custom Elements and adds some unique behaviors, not common in existing archival replay systems, to enhance the user experience. Our approach has a minimal user interface footprint and resource overhead while still providing rich interactivity and extended on-demand provenance information about the archived resources.
ArchiveNow is a Python module for preserving web pages in on-demand web archives. This module allows a user to submit a URI of a web page for archiving at several configured web archives. Once the web page is captured, ArchiveNow provides the user with links to the archived copies of the web page. ArchiveNow is initially configured to use four archives but is easily configurable to add or remove other archives. In addition to pushing web pages to public archives, ArchiveNow, through the use of Wget and Squidwarc, allows users to generate local WARC files, enabling them to create their own personal and private archives.
Influence of Service Quality on Users' Satisfaction with Mobile Library Service: A Comparison of Public and Academic Libraries
Mobile service in libraries has become very popular in the last few years. To identify the meaningful factors for satisfaction with mobile library services, this study focuses on service quality and divides the factor into four sub-dimensions: resource quality, environment quality, interaction quality, and outcome quality. The empirical findings revealed most sub-dimensions of service quality to be significant in explaining users' satisfaction. Moreover, it was found that there were different perceptions between public and academic library user groups regarding the sub-dimensions of mobile service quality and their influences on users' satisfaction.
Due to the recent data explosion, scientists invest most of their efforts in the collection of data needed for research. In this paper, we present a community-driven data curation system that is essential to enhancing data understandability and reusability, thereby reducing the effort required for data collection. The curation system focuses on the interlinking between data and their related literature to capture and organize the associations among research outputs. The system also focuses on domain-specific contextual information to help users understand data. A global research group in protein study has adopted the system to build a community-driven curated database and established a guideline for scientific discovery.
Exploratory Investigation of Word Embedding in Song Lyric Topic Classification: Promising Preliminary Results
In this work we investigate a data-driven vector representation, word embedding, for the task of classifying song lyrics into their semantic topics. Previous research on topic classification of song lyrics has used traditional frequency-based text representations. On the other hand, empirically driven word embedding has shown considerable performance improvements in text classification tasks because of its ability to capture semantic relationships between words from big data. As averaging the word vectors of a short text is known to work reasonably well compared to more comprehensive models that utilize word order, we adopt the averaged word vectors of the lyrics and of users' interpretations of them, which are generally short, as the feature for this classification task. This simple approach achieved a promising classification accuracy of 57%. From this, we envision the potential of data-driven approaches to creating features, such as sequences of word vectors and doc2vec models, to improve the performance of the system.
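The averaged-word-vector feature can be sketched as below. The embeddings and centroids are toy values, and the study trains a proper classifier rather than the nearest-centroid matching used here for brevity:

```python
import numpy as np

def doc_vector(tokens, embeddings, dim=2):
    """Average the word vectors of the tokens found in the embedding table."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy 2-d embeddings; real lyric work would use pretrained word2vec or GloVe.
emb = {"love": np.array([1.0, 0.0]), "heart": np.array([0.9, 0.1]),
       "war": np.array([0.0, 1.0]), "fight": np.array([0.1, 0.9])}

topic_centroids = {"romance": np.array([0.95, 0.05]),
                   "conflict": np.array([0.05, 0.95])}

def classify(tokens):
    """Assign the topic whose centroid is nearest to the averaged lyric vector."""
    v = doc_vector(tokens, emb)
    return min(topic_centroids,
               key=lambda t: np.linalg.norm(v - topic_centroids[t]))
```

Averaging discards word order, which is exactly the trade-off the abstract notes is acceptable for short texts such as lyrics and their interpretations.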
Co-citation clustering is often used for mapping science in the field of bibliometrics. It may be useful to utilize information from parsing the full text of citing documents to obtain a more precise clustering. Recently, "rough co-citation," a weak co-citation relationship that can be used to indicate new related documents, has been proposed for scientific paper searches. Applying rough co-citation to the clustering task may therefore be beneficial. This study aims to explore whether rough co-citation can improve the performance of co-citation clustering. A clustering experiment is conducted to evaluate the effects of using rough co-citation. The experimental results indicate that the proposed technique, which uses both the original co-citation and the rough co-citation, tends to outperform the baseline technique, which uses only the original co-citation.
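Co-citation counting, and one plausible way to blend in rough co-citation, can be sketched as follows. The blending weight alpha is a hypothetical choice, not the paper's formulation:

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(reference_lists):
    """Count how often pairs of documents are cited together,
    given the reference list of each citing document."""
    counts = Counter()
    for refs in reference_lists:
        for a, b in combinations(sorted(set(refs)), 2):
            counts[(a, b)] += 1
    return counts

def combined_strength(cocit, rough_cocit, alpha=0.5):
    """Blend original co-citation counts with (weaker) rough co-citation
    counts into a single pairwise similarity for clustering."""
    pairs = set(cocit) | set(rough_cocit)
    return {p: cocit.get(p, 0) + alpha * rough_cocit.get(p, 0) for p in pairs}
```

Down-weighting rough co-citation reflects its weaker evidential status while still letting it connect document pairs the original co-citation misses.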
As video-based learning is increasingly used in all sectors of education, there is a need for video players that support active viewing practices. We introduce a video player that allows students to mark up video with highlights, tags, and notes in order to personalize their video-based learning experience.
Editorial pre-screening is the first step in academic peer review. The deluge of research papers and the huge number of submissions being made to journals these days makes editorial decision-making a very challenging task. The current work attempts to investigate certain factors that may play a role in the editorial decision-making process. The proposed work exhibits potential for the development of an AI-assisted peer review system which could aid editors as well as authors in making appropriate decisions in reasonable time and thus accelerate the overall process of scholarly publishing.
We present an approach to explore news archives by automatically generating semantic aspects for their navigation. Given a keyword query as an input, we utilize semantic annotations present in the pseudo-relevant set of documents for generating the aspects. Our approach to generate the aspects considers the salience of the annotations by modeling their semantics as well as considering their co-occurrence in the pseudo-relevant set of documents. The generated aspects are also beneficial for representing documents in a structured manner. We show preliminary results on two news archives demonstrating the quality of the generated aspects over a testbed of more than 5,000 aspects derived from Wikipedia.
This poster reports on the evaluation of the topic space recommendation model, proposed here as an alternative to the personalization algorithms based on large datasets that often result in content and subject matter filter bubbles. The content filter bubbles that dominate contemporary Internet media platforms have been shown to provide users more of what they already consume and exclude relevant content at the expense of user exploration and discovery. Modern algorithms have also exhibited the problematic nature of reinforcing systematic bias.
Extraction of Main Event Descriptors from News Articles by Answering the Journalistic Five W and One H Questions
The identification and extraction of the events that news articles report on is a commonly performed task in the analysis workflow of various projects that analyze news articles. However, due to the lack of universally usable and publicly available methods for news articles, many researchers must redundantly implement methods for event extraction to be used within their projects. Answers to the journalistic five W and one H questions (5W1H) describe the main event of a news story, i.e., who did what, when, where, why, and how. We propose Giveme5W1H, an open-source system that uses syntactic and domain-specific rules to extract phrases answering the 5W1H. In our evaluation, we find that the extraction precision of 5W1H phrases is p=0.64, and p=0.79 for the first four W questions, which discretely describe an event.
Keyphrases are an important way to quickly grasp the topic of a document, as they provide highly summative information. Previous approaches to keyphrase extraction simply rank keyphrases according to statistics-based or graph-based models, which ignore the influence of external knowledge. In this paper, we take prior knowledge, comprising a controlled vocabulary of keyphrases and their prior probabilities, into consideration to enhance previous methods. First, we build a controlled vocabulary of keyphrases from existing collections and use it to filter a keyphrase candidate set from a given document. Then, we use prior probability to represent the importance of keyphrase candidates, together with TF-IDF and TextRank. Finally, a supervised learning algorithm is used to learn optimal weights for these three features. Experiments on four benchmark datasets show the clear advantages of prior knowledge for keyphrase extraction. Furthermore, we achieve competitive performance compared with state-of-the-art methods.
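The three-feature scoring can be sketched as a weighted linear combination; the weights here are placeholders, since the paper learns them with a supervised algorithm:

```python
def rank_keyphrases(features, weights=(0.4, 0.4, 0.2)):
    """Rank keyphrase candidates by a weighted sum of their
    (tfidf, textrank, prior_probability) feature triples."""
    w_tfidf, w_textrank, w_prior = weights

    def score(candidate):
        tfidf, textrank, prior = features[candidate]
        return w_tfidf * tfidf + w_textrank * textrank + w_prior * prior

    return sorted(features, key=score, reverse=True)
```

A candidate absent from the controlled vocabulary would receive a prior of zero, so the prior term acts as external evidence on top of the two document-internal signals.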
and phrases in a text, for which we use an automatically generated Concept-in-Context (CiC) network. Words and phrases rarely belong to a single concept; disambiguation in Capisco relies on the interplay between words that are in close vicinity in the text. The disambiguation starts with a seeding process that identifies the first concepts, which then form the context for further disambiguation steps. This paper introduces the seeding algorithm and explores seeding strategies for identifying these initial concepts in text volumes, such as books, that are stored in a digital library.
Our goal is to propose an alternative retrieval system for academic documents based on researchers' behavior in practice. In this study, a questionnaire survey was conducted. Question items were developed from the findings of a previous observational study of researchers' behavior. From the results of 46 respondents, the top three elements checked in search results were the title, the abstract, and the full-text version. Respondents also checked the "Introduction" section of the full text rather than other sections when they looked for previous research in an unfamiliar field. These results indicate that researchers select documents in different ways depending on the type of documents they look for.
While citation recommendation can be important for scholars, unfortunately, because of language barriers, some scientists cannot efficiently retrieve and consume publications hosted in foreign-language repositories. In this study, we propose a novel solution: cross-language citation recommendation via Publication Content and Citation Representation Fusion (PCCRF). PCCRF learns a representation function by mapping publications from various languages and repositories to a low-dimensional joint embedding space from both content-semantic and citation-relation viewpoints. The proposed method optimizes the publication representations by maximizing the likelihood of observing the network neighborhoods of publications, which are generated by a semi-supervised random walk algorithm. Experimental results show that the proposed method is promising for cross-language citation recommendation.
Wikidata is one of the largest knowledge curation projects on the web. Its data are used by other Wikimedia projects such as Wikipedia, as well as by major search engines. This qualitative study used content analysis of discussions involving data curation and negotiation in Wikidata. Activity Theory was used as a conceptual framework for data collection and for analysis of the activities, members and tools. Some of the findings map Wikidata activities to curation frameworks. An understanding of the activities in Wikidata will help inform communities wishing to contribute data to or reuse data from Wikidata, as well as inform the design of other similar online peer-curation communities, scientific research institutional repositories, digital archives, and libraries.
The first step of the Developing a Framework for Measuring Reuse of Digital Objects project involved a survey identifying how cultural heritage organizations currently assess digital library reuse, barriers to assessing reuse, and community priorities for potential solutions and next steps. This poster offers initial analysis of the survey results.
With the exponential growth in digital research data, libraries are beginning to find opportunities to assist researchers with planning, maintaining, sharing, and accessing data through research data services. Using a content analysis with the lens of information architecture, this study sought to better understand how these services are organized in North American academic library websites and to what extent the research data lifecycle is supported within research data services. 50 academic library websites were studied and results yielded three provisions that make up research data services: Information Access, Technical Support, and Personalized Consultation. The data lifecycle was found to be strongly supported in research data services for planning, data curation, and data access stages.
Digital repositories can often easily be navigated by humans but not by machines. We introduce Signposting, a mechanism to show machines how to maneuver among repositories' objects and how to interpret their relationships. Signposting is based on standard and widely adopted web technologies: typed links and HTTP Link headers.
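A minimal consumer of Signposting's typed links only needs to parse the HTTP Link header, e.g.:

```python
import re

def parse_link_header(header):
    """Extract (target URI, rel type) pairs from an HTTP Link header
    of the form '<uri>; rel="type", <uri>; rel="type", ...'."""
    return [(m.group(1), m.group(2))
            for m in re.finditer(r'<([^>]+)>\s*;\s*rel="([^"]+)"', header)]
```

This sketch handles the common single-rel case; a full RFC 8288 parser also supports multiple space-separated rel values and other link parameters.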
Web resources change over time and many ultimately disappear. While this has become an inconvenient reality in day-to-day use of the web, it is problematic when these resources are referenced in scholarship where it is expected that referenced materials can reliably be revisited. We introduce Robust Links, an approach aimed at maintaining the integrity of the scholarly record in a dynamic web environment. The approach consists of archiving web resources when referencing them and decorating links to convey information that supports accessing referenced resources both on the live web and in web archives.
Web content acquisition forms the foundation of value extraction from web data. The two main categories of acquisition methods are crawler-based methods and transactional web archiving, or server-side acquisition, methods. In this poster, we propose a new method to acquire web content from web caches. Our method reduces the penalty on HTTP transactions, provides flexibility to accommodate peak web server loads, and requires minimal involvement of system administrators to set up.
Twitter has identified 2,752 accounts that it believes are linked to the Internet Research Agency (IRA), a Russian company that creates online propaganda. These accounts are known to have tweeted about the US 2016 Elections and the list was submitted as evidence by Twitter to the United States Senate Judiciary Subcommittee on Crime and Terrorism. There is no equivalent officially published list of accounts from the IRA known to be active in the UK-EU Referendum debate (Brexit), but we found that the troll accounts active on the 2016 US Election also produced content related to Brexit. We found 3,485 tweets from 419 of the accounts listed as IRA accounts which specifically discussed Brexit and related topics such as the EU and migration. We have been collating an archive of tweets related to Brexit since August 2015 and currently have over 70 million tweets. The Brexit referendum took place on the 23rd June 2016 and the UK voted to leave the European Union. We gathered the data using the Twitter API and a selection of hashtags chosen by a panel of academic experts. Currently we have in excess of fifty different hashtags and we add to the set periodically to accurately represent the evolving conversation. Twitter has closed the accounts that were documented in the Senate list, meaning that these tweets are no longer available through the webpage or API. Due to Twitter's terms of service we are unable to share specific tweet text or user profile information, but our findings, utilising text and metadata from derived and aggregated data, allow us to provide important insights into the behaviour of these trolls.
Patent documents are an ample source of technical knowledge, and their number has increased dramatically in recent years. This paper aims at identifying and analyzing converging technologies through patent analysis. Converging technologies are identified by cluster analysis of a USPC co-occurrence matrix computed from the cross-USPC-class patents of five parties during 2005-2015. In total, 161 converging technologies are identified. Converging technologies are mainly distributed in next-generation information technology and the new materials industry, with high-end equipment manufacturing being the most active industry in technological convergence.
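The co-occurrence matrix underlying such a cluster analysis can be sketched simply: each patent carries a set of USPC classes, and two classes co-occur whenever they appear on the same patent. The class labels below are invented for illustration; the paper's actual preprocessing is not specified in the abstract.

```python
from collections import Counter
from itertools import combinations

def class_cooccurrence(patents):
    """patents: iterable of sets of class labels -> Counter of class pairs.

    Each unordered pair of classes assigned to the same patent
    contributes one co-occurrence count.
    """
    counts = Counter()
    for classes in patents:
        for a, b in combinations(sorted(classes), 2):
            counts[(a, b)] += 1
    return counts
```

The resulting pair counts can then be arranged into a symmetric matrix and fed to any standard clustering algorithm.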
Recommending Co-authorship via Network Embeddings and Feature Engineering: The case of National Research University Higher School of Economics
Co-authorship networks contain hidden structural patterns of research collaboration. While some may argue that the process of writing joint papers depends on mutual friendship, research interests, and university policy, we show that, given a temporal co-authorship network, one can predict the quality and quantity of future research publications. We compare existing graph embedding and feature engineering methods, presenting a combined approach for constructing a co-author recommender system formulated as a link prediction problem. We also present a new link embedding operator that improves the quality of link prediction based on the embedding feature space. We evaluate our approach on a single university's publication dataset, providing a meaningful interpretation of the obtained results.
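For context, the standard link-embedding operators against which a new operator would typically be compared combine two node embeddings into one edge feature vector. The abstract does not specify the authors' new operator, so only the common baselines (Hadamard product, average, L1 distance) are sketched here.

```python
def hadamard(u, v):
    """Element-wise product of two node embeddings."""
    return [a * b for a, b in zip(u, v)]

def average(u, v):
    """Element-wise mean of two node embeddings."""
    return [(a + b) / 2 for a, b in zip(u, v)]

def l1(u, v):
    """Element-wise absolute difference of two node embeddings."""
    return [abs(a - b) for a, b in zip(u, v)]
```

The edge vectors produced by any of these operators become the feature space on which a binary link-prediction classifier is trained.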
We demonstrate the utility of word embedding-based semantic similarity methods for Author Name Disambiguation.
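A minimal sketch of the assumed pipeline (the poster's exact method is not detailed in the abstract): represent each ambiguous author mention by an embedding of its context, e.g. averaged word vectors of title and venue terms, and treat two mentions as the same author when their cosine similarity clears a threshold. The threshold value here is an arbitrary illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def same_author(mention_vec_a, mention_vec_b, threshold=0.8):
    """Heuristic: high context similarity suggests the same author."""
    return cosine(mention_vec_a, mention_vec_b) >= threshold
```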
This paper presents a research paper recommender system for university students. The recommender system is embedded in an e-book system that displays learning materials (e.g., slides) and is used in lectures. The recommender system suggests papers related to a learning material. Our experiment revealed that students do not access recommended papers during the lecture; instead, they access research papers when reviewing the lecture and/or working on an assignment.
Scholars using digital libraries and archives routinely create worksets (aggregations of digital objects) as a way to segregate resources of interest for in-depth scrutiny. To illustrate how worksets can enhance the scholarly utility of digital library content, we distill from prior user studies three key objectives for worksets (extra-digital library manipulation, intra-item properties, and robust representations), and discuss how they motivated the workset model being developed at the HathiTrust Research Center (HTRC). We describe how HTRC's implementation of its RDF-compliant workset model helps to satisfy these objectives.
In this paper, we present preliminary results on a novel task: extracting comparison points for a pair of entities from the text articles describing them. The task is challenging, as comparison points in a typical pair of articles tend to be sparse. We present a multi-level document analysis (viz. document, paragraph, and sentence level) for extracting the comparisons. For extracting sentence-level comparisons, the hardest of the three tasks, we have used a Convolutional Neural Network (CNN) with features extracted around
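The windowed feature idea behind a sentence-level CNN can be sketched in miniature: slide a fixed-width filter over a sequence of per-token scores and keep the maximum response (max pooling). The filter weights below are arbitrary illustrations, not the paper's trained parameters.

```python
def conv1d_max(scores, kernel):
    """1-D convolution over token scores followed by max pooling."""
    k = len(kernel)
    responses = [
        sum(w * x for w, x in zip(kernel, scores[i:i + k]))
        for i in range(len(scores) - k + 1)
    ]
    return max(responses)
```

In a real model there are many such filters over word-embedding windows rather than scalar scores, and their pooled responses form the sentence representation fed to a classifier.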
Challenges to Deploying Library Services in the Cloud: Data Issues Influencing IT, People, Costs, and Policy Challenges
This poster analyzes challenges to planning, deploying, and maintaining different types of library services in the cloud. We apply grounded theory principles to analyze 75 articles authored by library administrators, librarians with IT expertise, IT professionals, cybersecurity experts, and business consultants engaged in planning, deploying, and maintaining library services in the cloud. Data analysis reveals that a majority of the past literature reports challenges to implementing Software as a Service (SaaS) in libraries. The seven key areas critical to the successful implementation of SaaS in libraries are related to: (1) data, (2) authentication and privacy of patrons, (3) skills and knowledge of library staff and organizational culture, (4) IT infrastructure, (5) features of services, (6) fixed and operational costs associated with data and technology, and (7) policies and contracts. Data issues like access, storage, ownership, curation, security, confidentiality, loss, migration, and redundancy seem to have the most influence on SaaS deployment in libraries.
In many research areas, such as the material sciences, researchers plan and conduct numerous experiments. The corresponding findings are published, while the underlying data remains described haphazardly or not at all within the respective institutes and, therefore, is neither shared nor reused by the scientific community.
Large digital libraries often index articles without curating their digital copies in their own repositories. Examples include the National Digital Library of India (NDLI) and ACM Digital Library. Full text view generally requires subscription to libraries that host the contents. The problem is particularly severe for researchers, given high journal subscription charges. However, authors often keep a free copy in preprint servers. Sometimes a conference paper behind a paywall has a closely resembling journal version freely available on the Web. These open access surrogates are immensely valuable to researchers who cannot afford to access the original publications. We present a lightweight tool called Surrogator to automatically identify open access surrogates of access-restricted scholarly papers present in a digital library. Its focus on approximate matches makes it different from many existing applications. In this poster, we describe the design and interface of the tool and our initial experiences of using it on articles indexed in NDLI.
Digital Library Systems in Intelligent Infrastructure for Human-Centered Communities: A Qualitative Research
This poster presents socio-technical qualitative research on the strategic development of Intelligent Infrastructure for Human-Centered Communities at Virginia Tech. Within this development, the study explored future visions and projective scenarios of data infrastructure and digital libraries for smart community development. The results augment design thinking and visioning practice for digital library systems beyond traditional boundaries.
Methodological Considerations in Developing Cultural Heritage Digital Libraries: A Community-driven Framework
We present a multi-disciplinary methodological framework that was developed to create the Digital Library North for Inuit communities in Canada's north. The framework adopts a holistic approach, taking into account existing physical and digital collections, information search behaviour of community members, culturally appropriate metadata, usability and sustainability. The methodological framework provides an empirically-supported model for developing community-focused digital libraries.
People often struggle to understand scientific texts, which leads to miscommunication and often to inaccurate and even sensationalistic reports of research. Identifying and better understanding the factors that affect comprehension would help in analyzing what improves public understanding of science. In this study, we generate features from scientific text that represent common text structures and use them to predict the semantic similarity between the scientific text and the content posted online by the general public about the same text. We built regression models for this purpose and evaluated them by their R-squared values and mean squared errors. R-squared values as high as 0.73 were observed, indicating a likely relationship between certain textual features and the public's understanding of science.
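The evaluation metric used above is standard and can be stated in plain Python: R-squared compares the model's squared errors against those of a mean-only baseline, so a perfect fit scores 1 and the baseline itself scores 0.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```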
Creating scientific publications is a complex process. It is composed of a number of different activities, such as designing the experiments, analyzing the data, and writing the manuscript. Information about the contributions of individual authors of a paper is important for assessing authors' scientific achievements. Some biomedical publications contain a short section written in natural language, which describes the roles each author played in the process of preparing the article. In this paper, we present a study of authors' roles commonly appearing in these sections, and propose an algorithm for automatic extraction of authors' roles from them. In our study, we used co-clustering techniques, as well as Open Information Extraction, to semi-automatically discover the most popular roles within a corpus of contributions sections. In total, 13 roles were discovered, three of which (paper revision, literature review, and interpretation) are not described by existing author role taxonomies. Discovered roles are then used to automatically build a training set for a supervised Naïve Bayes role extractor. The proposed role extractor is able to extract roles from text with a micro-averaged precision of 0.68, recall of 0.48, and F1 of 0.57.
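The reported figures are micro-averaged, which means true positives, false positives, and false negatives are pooled across all role labels before precision, recall, and F1 are computed. A minimal sketch:

```python
def micro_prf(counts):
    """counts: iterable of (tp, fp, fn) tuples, one per role label.

    Returns micro-averaged (precision, recall, F1).
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Micro-averaging weights frequent roles more heavily than rare ones, unlike macro-averaging, which averages per-label scores.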
Public bibliographic databases hold invaluable data about the academic environment. However, researcher affiliation information is frequently missing or outdated. We propose a statistical data extraction method to acquire affiliation information directly from university websites and solve the name extraction task in general. Previous approaches to web data extraction either lack flexibility, because wrappers do not generalize well across websites, or lack precision, because domain-agnostic methods neglect useful properties of this particular application domain. Our statistical approach solves the name extraction task with a good tradeoff between generality and precision. We conducted experiments over a collection of 152 faculty web pages in multiple languages from universities in 49 countries and obtained 94.37% precision, 97.61% recall, and an F-measure of 0.9596.
Vibration data from buildings can reflect human activities such as movement. A lack of labeled datasets has been a major challenge for conducting such analyses. We aim to explore possibilities for producing footstep metadata automatically through machine learning techniques. In this paper, we present an analysis of identifying human footsteps using a deep neural network as a classifier.
The widespread use of portable devices is reshaping reading behaviors, and a shift towards hyper-attention is taking place. How to use fine-grained reading behavior data to characterize the current style of mobile reading emerges as a novel research topic. Using large-scale page-flipping data, this paper finds that mobile reading is a leisure-oriented and fragmented activity with obvious tidal characteristics and long-tail effects, and that book genres and user types influence reading behavior.
Entity mixture refers to the phenomenon in which information on one entity is mistaken for attributes of another entity during information extraction for knowledge base (KB) construction and population. To improve the quality of knowledge-based services, data accuracy and validity in KBs should be enhanced. This paper presents a clustering analysis-based approach for detecting potentially mixed entities in a KB. Our approach detects inconsistency among the attribute values of a KB instance as an indication of entity mixture. This paper also presents an experiment conducted on a dataset of industrial applications to demonstrate the process of entity mixture detection. Experimental results show that the proposed method performs well in detecting mixed entities.
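The intuition can be sketched without the paper's actual clustering algorithm: if the attribute values attached to one KB entity split into well-separated groups, the entity may be a mixture of two real-world entities. Here a simple largest-gap test on sorted numeric values stands in for the clustering step; the threshold is an illustrative assumption.

```python
def looks_mixed(values, gap_ratio=3.0):
    """Flag a 1-D attribute set whose largest gap dwarfs the typical gap.

    A large ratio suggests the values fall into two separated clusters,
    which may indicate a mixed entity.
    """
    vs = sorted(values)
    gaps = [b - a for a, b in zip(vs, vs[1:])]
    if len(gaps) < 2:
        return False
    largest = max(gaps)
    typical = (sum(gaps) - largest) / (len(gaps) - 1)
    return typical > 0 and largest / typical > gap_ratio
```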
Citation contexts of an article are the sentences or paragraphs that cite it. Citation contexts are especially useful for recommendation and summarization tasks. However, few studies have recognized the diversity of these citation contexts, leading to redundant recommendation lists and summaries. To address this gap, we compared several strategies that recommend a set of diverse citation contexts by re-ranking extracted citation contexts. Diversification was achieved by combining one of two semantic distance algorithms with one of two re-ranking algorithms. Experimenting with the CiteSeerX dataset, our system produced a diverse list of 10 citation contexts that could be recommended to users. We evaluated the results in a user case study of 15 articles. The case study revealed that the diversity strategy combining ESA and MMR led to a better reading experience for participants than the other strategies. Our study provides insights for developing better automatic academic recommendation and summarization systems.
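One of the re-ranking components named above, MMR (Maximal Marginal Relevance), is a standard greedy procedure and can be sketched compactly: at each step, pick the candidate that best balances relevance against similarity to items already selected. The toy scores below are illustrative.

```python
def mmr(candidates, relevance, similarity, k, lam=0.7):
    """Greedy MMR re-ranking.

    candidates: list of item ids; relevance: id -> score;
    similarity: callable (id, id) -> score in [0, 1]; k: list length;
    lam: trade-off between relevance (1.0) and diversity (0.0).
    """
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        best = max(
            pool,
            key=lambda c: lam * relevance[c]
            - (1 - lam) * max((similarity(c, s) for s in selected), default=0.0),
        )
        selected.append(best)
        pool.remove(best)
    return selected
```

With two near-duplicate top candidates, MMR demotes the duplicate in favor of a less similar, moderately relevant item.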
In this paper, we study query suggestion diversification for time-aware queries, where query suggestions are diversified along multiple dimensions (e.g., topic and time). More precisely, we introduce Time-aware Diversified Query Suggestion (TDQS), a method that ensures the generated suggestions indicate possible time points in which a searcher who issued the original query may be interested. Preliminary experiments on the AOL query log demonstrate that our proposed method can significantly improve diversity and relevance effectiveness for time-aware queries in comparison with two state-of-the-art methods.
In the era of big data, huge amounts of data are generated. In mining the relationships among data entities, we find similarities across these entity relationships. On this basis, this paper proposes an entity relationship model and presents a set of visualization schemes that can display the relationships of multiple entities. The scheme is applied to the visualization of a Chinese herbal medicine prescription dataset and a paper dataset, both of which conform to the model. The results are satisfactory and demonstrate the good applicability of the scheme.
The term "smart library" was coined by Aittola, Ryhanen, and Ojala in 2003, and librarians have been striving to implement smart libraries in different ways ever since. However, in the 15 years that have passed, no definitive explanation of a smart library has emerged, and it remains unclear what technologies or services truly make a library "smart." In a world of smartphones, smartwatches, and even smart homes, innovative librarians want to move toward creating next-generation smart libraries, but how? Because the majority of studies relevant to smart libraries take a qualitative approach, we conducted and analyzed a meta-synthesis of existing qualitative research on smart libraries. Three time periods were identified to demonstrate the transition of technology changing to meet users' needs.
After being published, a document, whether it is a research paper or an online post, can make an impact when readers cite, share, or endorse it. A document may not make its greatest impact right after its publication, and some documents' impact can last a long period of time. This study develops a graphical model to capture the temporal dynamics in the impact of latent topics from a corpus of documents. Specifically, we modeled citation counts using Poisson distributions with Gamma priors. We conducted experiments on papers published in (i) D-Lib Magazine and (ii) The Library Quarterly from 2007 to 2017. Compared with ToT, we found that our model produced more robust results on topical trends over time. The results also showed that the prevalence and impact of the same topic are not correlated. By enabling better understanding and modeling of topical impact over time, this model can be used for the design of digital libraries and social media platforms, as well as the evaluation of scientific contributions and policies.
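The Poisson-Gamma pairing is conjugate, which is what makes such models tractable: with a Gamma(alpha, beta) prior on a topic's citation rate and observed yearly counts x_1..x_n, the posterior is Gamma(alpha + sum(x), beta + n). This sketch shows only that textbook update, not the paper's full graphical model.

```python
def gamma_poisson_posterior(alpha, beta, counts):
    """Conjugate update: Gamma prior + Poisson counts -> Gamma posterior."""
    return alpha + sum(counts), beta + len(counts)

def posterior_mean_rate(alpha, beta, counts):
    """Posterior mean of the citation rate, a/b for Gamma(a, b)."""
    a, b = gamma_poisson_posterior(alpha, beta, counts)
    return a / b
```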
We report on the work undertaken developing a web environment that allows users to search over 1 trillion tokens of text -- down to the page-level -- of the HathiTrust Part-of-Speech Extracted Features Dataset to help produce worksets for scholarly analysis. We present an extended example of the web environment in use, along with details about its implementation.
MEDDLEing with Digital Library Searches: Surmounting User Model and System Misalignments through Lightweight Bespoke Proxying
We document how easily user misconceptions can arise when using digital library search interfaces, and the significant unseen impact this can have on the user's interpretation of search results. Further, we detail a bespoke proxying technique we have devised called Meddle -- for ModifiED Digital Library Environment -- a lightweight agile technique that helps address identified pitfalls in a DL search interface while operating independently of the originating digital library.
Open Access (OA) repositories provide users with barrier-free access to scientific resources and play a significant role in disseminating scientific results and increasing author visibility. Although these repositories provide free resources, they are not well connected at the level of their metadata. One approach to tackling this issue is to link up the scattered pieces of bibliographic and biographic information residing in external sources. In this demonstration, we focus on the contributors to a repository by means of authority data that can be linked to several authority systems (WikiData, DBpedia, VIAF, ORCID, RePEc) to make repository data more diverse, interlinked, and visible.
In this paper, we demonstrate an online system for historical event retrieval. Our system outputs events ranked by relevance to an input text query, time range, and category. It is useful for users searching not just for important past events related to input entities but for events that belong to a specified subset of general categories. It can also be helpful for creating datasets of events falling into specific categories or for generating specialized timelines.
This tutorial is a thorough and deep introduction to the Digital Libraries (DL) field, providing a firm foundation: covering key concepts and terminology, as well as services, systems, technologies, methods, standards, projects, issues, and practices. It introduces and builds upon a firm theoretical foundation (starting with the '5S' set of intuitive aspects: Streams, Structures, Spaces, Scenarios, Societies), giving careful definitions and explanations of all the key parts of a 'minimal digital library', and expanding from that basis to cover key DL issues. Illustrations come from a set of case studies, including multiple current projects involving webpages, tweets, and social networks. Attendees will be exposed to four Morgan & Claypool books that elaborate on 5S, published 2012-2014. Complementing the coverage of '5S' will be an overview of key aspects of the DELOS Reference Model and DL.org activities. Further, use of a Hadoop cluster supporting big data DLs will be described.
Digital Humanities is an area of inquiry and scholarship that combines the procedural methodologies from the Sciences with the reflection that is carried out in the Humanities. Although the scope of Digital Humanities is particularly difficult to define as the field is actively evolving, some of its defining characteristics have remained constant over time. These include its new forms of scholarship involving collaborative, transdisciplinary, and computationally-engaged research, teaching, and publishing.
In this tutorial, we will focus on recent developments in the keyphrase extraction task using research papers as a case study. In particular, we will discuss a wide range of keyphrase extraction models, ranging from representative supervised approaches such as KEA and GenEx to more recent ones that make use of advances in artificial intelligence. Beyond introducing the outstanding approaches in this domain, we will discuss how keyphrases can significantly improve the search and retrieval of information in digital libraries and hence lead to improved organization, search, retrieval, and recommendation of scientific documents. Participants will learn about existing approaches, challenges, and future trends in the keyphrase extraction task, and how they can be applied to digital library applications.
This tutorial begins with an overview of the major branches of machine learning (ML) and then provides more thorough coverage of deep neural networks. It covers key concepts, tools, experimental methods, applications, evaluation measures and associated issues for supervised learning (regression and classification), unsupervised learning (clustering and dimensionality reduction), semi-supervised and active learning (which combine the former approaches), and reinforcement learning. The deep neural network discussion covers convolutional neural networks (CNNs), recurrent neural networks (RNNs), word embeddings and related techniques. The discussion will be grounded on digital library (DL) - related applications and will highlight issues, techniques and tools associated with processing big data.
Cyberinfrastructure for Digital Libraries and Archives: Integrating Data Management, Analysis, and Publication
Increasingly, digital libraries and archives need to use, and are using, cyberinfrastructure and machine learning to meet curation, data management, and researcher needs. This workshop focuses on facilitating adoption and integration across these spaces. It brings together researchers and practitioners to share visions, questions, the latest advances in methodology, application experiences, and best practices.
The 2018 edition of the Workshop on Web Archiving and Digital Libraries (WADL) will explore the integration of Web archiving and digital libraries. The workshop aims at addressing aspects covering the entire life cycle of digital resources and will also explore areas such as community building and ethical questions around web archiving.
We propose a full-day workshop at JCDL 2018. The workshop will provide an opportunity for participants to exchange research ideas on image collections, including the creation, organization, access, and use (COAU) of various image datasets. We expect to discuss theories, methods, techniques, challenges, and new research directions related to the COAU of images. In particular, we would like to explore innovative ideas on image annotation, retrieval, use behavior & personas, processing of different types of images, and visual image metrics. The workshop will allow researchers to communicate with their peers about projects and develop new ideas through presentation and discussion. We hope to establish a community of researchers from related disciplines and explore questions critical to the future development of the COAU of images. Participants in this workshop will be invited to submit a full paper to a special issue of The Electronic Library (http://www.emeraldinsight.com/journal/el) on Image Collections.
The workshop is co-located with the ACM/IEEE Joint Conference on Digital Libraries 2018 (JCDL 2018), which will be held in Fort Worth, Texas, USA on June 3-7, 2018. JCDL is a major international forum focusing on digital libraries and associated technical, practical, and social issues.
In an era when massive amounts of medical data have become available, researchers working in biological, biomedical, and clinical domains increasingly require the help of language engineers to process large quantities of biomedical and molecular biology literature, patient data, and health records. With such a huge volume of reports, evaluating their impact has long ceased to be a trivial task. Linking the contents of these documents to each other, as well as to specialized ontologies, could enable access to and discovery of structured clinical information and foster a major leap in natural language processing and health research.
SESSION: Doctoral Consortium
This is an overview of the Doctoral Consortium workshop organized and held as part of the Joint Conference on Digital Libraries 2018 conference in Fort Worth, Texas.