Planet Cataloging

October 30, 2014

Catalogue & Index Blog

cilipcig

Join us for our next e-forum, to be held on Tuesday 25th & Wednesday 26th November 2014. Hosted by the Cataloguing & Indexing Group (CIG) of CILIP, and co-moderated by Esther Arens from the University of Leicester, Helen Doyle from the law firm Norton Rose Fulbright, and Bernadette O’Reilly from the Bodleian Libraries, this e-forum is free and open to everyone!

CIG is often asked for advice and guidance on ‘how to get into cataloguing’. How do I get the experience to apply for jobs, when I need the jobs to give me the experience? What skills do I need? How do I keep my knowledge up to date? Just what are managers looking for? This CIG-hosted e-forum will provide a platform to ask questions, share experiences, raise concerns, and give advice.

Topics to be covered include:
  • Recruiting and applying for jobs
  • Obtaining core skills and keeping up to date
  • Obtaining specialist skills
  • Alternative/unusual career paths and other options

The discussion will allow those new to cataloguing to ask questions and hear the experiences of more-established professionals. We would especially encourage anyone who is already working in the sector to participate – managers, team leaders, cataloguers who train, metadata workers with unusual career paths, etc.

*What is a CIG e-forum?*

A CIG e-forum provides an opportunity for librarians to discuss matters of interest, led by a moderator, through the e-forum discussion list. The e-forum discussion list works like an email listserv: register your email address with the list, and then you will receive messages and communicate with other participants through an email discussion. E-forums are scheduled to last two days. Registration is necessary to participate, but it’s free and open to both members and non-members of CIG.

*Details*

The e-forum takes place on Tuesday 25th and Wednesday 26th November 2014.

On both days, sessions will begin at 10am and end at 4pm (UK time).
Instructions for registration are available at: http://www.cilip.org.uk/cataloguing-and-indexing-group/e-forums

Once you have registered for one e-forum, you do not need to register again, unless you choose to leave the email list. Participation is free and open to anyone.


by cilipcig at October 30, 2014 02:28 PM

A portal to my Cataloguing Aids website


Bombay (India) is now Mumbai (India)

Mumbai
मुंबई
Bombay

LC still authorizes use of Bombay (India) as a corporate name heading. However, Bombay (India) IS NOT VALID FOR USE AS A SUBJECT! When cataloguing a work ‘about’ the city, use Mumbai (India). The city’s name was officially changed in 1996.

Gee…. I sure miss those old Cataloging Service Bulletins!  I would have amended our database way before now…


by Fictionophile at October 30, 2014 01:49 PM

Mod Librarian

5 Things Thursday: Keywords, Dublin Core, DAM

Here are five more things, plus a bonus, for this week:

  1. ASMP’s guide to keywording for photographers. Very thorough!
  2. Tons of presentations and information from this year’s Dublin Core Metadata Initiative conference.
  3. Digital asset management as medication.
  4. Build a business case for DAM.
  5. Is the future of special collections community?

BONUS: If you can sit through the ad pop-up, how to transfer your files fro…


October 30, 2014 12:51 PM

First Thus

Consistency (Was: Conflicting instructions in Bib Formats about ETDs being state government publications)

Posting to Autocat

On 10/29/2014 4:18 PM, MULLEN Allen wrote:

To catalog is to have faith (a word chosen pointedly) that both common sense and helping people are served through consistency and the integrity of information that has been created via the rules. Most any aspect of cataloging rules in isolation, if closely and critically examined, could yield different interpretations as to how they could be modified (or ignored) to better serve goals of common sense and helping people. Overall, the integrity and usefulness of a catalog, whether local or larger scale, is served by consistent attention to these rules so that those that do transcend the mundane, ignored, and senseless will work as faith enshrines – helping our users find the resources.

I agree with this, and I will take a backseat to no one concerning the importance of consistency. It is one of the reasons why I have been against a lot of changes with RDA.

BUT (there is always a “but”), the fact is: we are living in transitional times. At one time–and not that long ago, just 10 or 15 years ago–the library catalog was a world apart. It was a closed system. A “closed system” meant that nothing went into it without cataloger controls, and when the catalog records went out into the wider world, they went into a similar, controlled union catalog, such as OCLC, RLIN, etc.

The unavoidable fact is, that world has almost disappeared already and the cataloging community must accept it. The cataloging goal of making our records into “linked data” means that our records can literally be sliced and diced and will wind up *anywhere*–not only in union catalogs that follow the same rules, not only in other library catalogs that may follow other rules, but quite literally anywhere. That is what linked data is all about and it has many, many consequences, not least of all for our “consistency”.

Plus there is a push for libraries to create a “single search box” so that users who search the library’s catalogs, databases, full-text silos and who knows what else, can search them all at once. Again, the world takes on a new shape because these other resources have non-cataloger, non-RDA, non-ISBD, non-any-rules-at-all created metadata, or no metadata at all: just full-text searched by algorithms. Those resources are some of the most popular materials libraries have, or have ever had. They are expanding at such an incredible rate that they would sink entire flotillas of catalogers working 24 hours a day. The very idea of “consistency” in this environment begins to lose its meaning.

For example, if a normal catalog department can add, let’s say, 70,000 AACR2/RDA records to their catalog per year, but the IT department is adding hundreds of thousands or even millions of records that follow no perceptible rules at all from the databases the library is paying lots of money for (this is happening in lots of libraries right now), then in just a few years, the records from non-library sources will clearly overwhelm the records from catalogers. That is a mathematical certainty.
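To make the arithmetic concrete, here is a minimal sketch with assumed figures (an existing base of two million cataloged records and a vendor load of one million records per year, numbers the post only gestures at):

```python
# Illustration of the proportions described above, using assumed figures.
existing_cataloged = 2_000_000      # assumed starting size of the cataloged collection
cataloged_per_year = 70_000         # figure used in the post
vendor_loaded_per_year = 1_000_000  # assumed: "hundreds of thousands or even millions"

cataloged_total = existing_cataloged
vendor_total = 0
for year in range(1, 6):
    cataloged_total += cataloged_per_year
    vendor_total += vendor_loaded_per_year
    share = cataloged_total / (cataloged_total + vendor_total)
    print(f"Year {year}: cataloger-created share = {share:.0%}")
# Year 1: ~67%; Year 5: ~32% -- the records from non-library sources soon dominate.
```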

Even without data dumps into the catalog itself, instituting the single search box will result in exactly the same thing from the searcher’s perspective: records of all types will be mashed together, where there will be far more non-library-created records than library-created records.

So the logical question about consistency is: Consistency over what, exactly? It is hard to escape the conclusion that it is consistency over an ever diminishing percentage of what is available to our users.

Nevertheless, I still believe very strongly in consistency, but it must be reconsidered in the world of 21st-century linked data and the abolition of separate “data silos”. It is all coming together, and both the cataloging community and the library community seem to want this. I want it too.

The idea and purpose of consistency will change. It must change, or it will disappear. Is it at all realistic to think that these other non-library databases will implement RDA? Hahaha! But if a huge percentage of a catalog follows no rules at all, how can we say that consistency is so important? If consistency is to mean something today and in the future, it will have to be reconsidered. What are its costs and benefits?

I consider these to be existential questions for the future of cataloging. I don’t see these issues being discussed in the cataloging community, but I have no doubt whatsoever they are being discussed in the offices of library administrators, whether they use words such as “consistency” or not.


by James Weinheimer at October 30, 2014 12:37 PM

Work and Expression


Disaster Response and Salvage Training

Recently I attended a British Library training day on ‘Disaster Response and Salvage Training’ (for libraries, that is). It was a fascinating day which kicked off with some rather compelling case studies of damage experienced across the UK. This not only emphasised the importance of the subject matter, but was also an interesting exercise in learning from other people’s mistakes as well as from their preparedness.

The course was led by Emma Dadson from Harwell Document Restoration Services, whose breadth of knowledge on the topic was extremely impressive. Emma has been in the industry for 12 years and has been involved in some very high profile salvage operations, including those at the National Library of Wales and after the recent fire at the Glasgow School of Art. I cannot speak highly enough of Emma’s professionalism, public speaking skills and industry knowledge – you could not ask for a better trainer on this subject.

As well as working through a number of case studies and general information about salvage and disaster response for both paper and non-paper items, the day also included an adult version of what children’s librarians refer to as wet play. We were divided into groups and given a box of books and other resources that had been water damaged (i.e. Emma had tipped water into the boxes). We then went through the items and set them out for salvage as best we could, following what we had learnt throughout the day. It was a great way to put our learning into practice and also see how working in teams could affect a salvage operation. Part of the training room floor was covered in plastic for this section of the course!

Also on that note, the British Library has lovely training facilities and equally lovely training packages featuring none other than Sherlock Holmes. The packages themselves are extensive and also contain a Template Emergency Plan provided by Harwell. Although a number of the attendees were from large libraries, some were from consultancies or smaller organisations, for whom such a template could be incredibly valuable.


As well as learning a lot, the day was also a great opportunity to meet other librarians and conservators and – as always – to network. I even met one librarian I had already ‘spoken’ to on Twitter. The day was highly interesting and informative and I would strongly recommend it to anyone wanting to learn more about what is actually a fascinating – and crucial – topic.

Contributed by: Anne-Marie Nankivell


by Information Services Section for City of London Libraries at October 30, 2014 10:26 AM

Resource Description & Access (RDA)

RDA Blog Reaches 200,000 Pageviews

Hi all, I am pleased to announce that RDA Blog has crossed the 200,000-pageview mark. It is interesting to note that the first hundred thousand pageviews came in 3 years, but it took just 8 months to reach the next hundred thousand.


Thanks all for your love, support and suggestions. Please post your feedback and comments on RDA Blog Guest Book. Select remarks will be posted on RDA Blog Testimonials page.

INTRODUCTION TO RDA BLOG:


RDA Blog is a blog on Resource Description and Access (RDA), a new library cataloging standard that provides instructions and guidelines on formulating data for resource description and discovery. Organized around the Functional Requirements for Bibliographic Records (FRBR) and intended for use by libraries and other cultural organizations, RDA replaces the Anglo-American Cataloguing Rules (AACR2). This blog lists descriptions of, and links to, resources on Resource Description & Access (RDA). It is an attempt to bring together in one place all the useful and important information, rules, references, news, and links on Resource Description and Access, FRBR, FRAD, FRSAD, MARC standards, AACR2, BIBFRAME, and other items related to current developments and trends in library cataloging practice.

[Video: RDA Blog highlights in a 1-minute presentation]

by Salman Haider (noreply@blogger.com) at October 30, 2014 05:14 AM

October 28, 2014

Bibliographic Wilderness

“Is the semantic web still a thing?”

A post on Hacker News asks:

A few years ago, it seemed as if everyone was talking about the semantic web as the next big thing. What happened? Are there still startups working in that space? Are people still interested?

Note that “linked data” refers to basically the same technologies as “semantic web”; it is essentially the new branding for “semantic web”, with some minor changes in focus.

The top-rated comment in the discussion says, in part:

A bit of background, I’ve been working in environments next to, and sometimes with, large scale Semantic Graph projects for much of my career — I usually try to avoid working near a semantic graph program due to my long histories of poor outcomes with them.

I’ve seen uncountably large chunks of money put into KM projects that go absolutely nowhere and I’ve come to understand and appreciate many of the foundational problems the field continues to suffer from. Despite a long period of time, progress in solving these fundamental problems seem hopelessly delayed.

The semantic web as originally proposed (Berners-Lee, Hendler, Lassila) is as dead as last year’s roadkill, though there are plenty out there that pretend that’s not the case. There’s still plenty of groups trying to revive the original idea, or like most things in the KM field, they’ve simply changed the definition to encompass something else that looks like it might work instead.

The reasons are complex but it basically boils down to: going through all the effort of putting semantic markup with no guarantee of a payoff for yourself was a stupid idea.

The entire comment, and, really the entire thread, are worth a read. There seems to be a lot of energy in libraryland behind trying to produce “linked data”, and I think it’s important to pay attention to what’s going on in the larger world here.

Especially because much of the stated motivation for library “linked data” seems to have been: “Because that’s where non-library information management technology is headed, and for once let’s do what everyone else is doing and not create our own library-specific standards.” It turns out that may or may not be the case. If your motivation for library linked data was “so we can be like everyone else,” that motivation simply may not be accurate: everyone else doesn’t seem to be heading there in the way people hoped a few years ago.

On the other hand, some of the reasons that semantic web/linked data have not caught on are commercial and have to do with business models.

One of the reasons that whole thing died was that existing business models simply couldn’t be reworked to make it make sense. If I’m running an ad driven site about Cat Breeds, simply giving you all my information in an easy to parse machine readable form so your site on General Pet Breeds can exist and make money is not something I’m particularly inclined to do. You’ll notice now that even some of the most permissive sites are rate limited through their API and almost all require some kind of API key authentication scheme to even get access to the data.

It may be that libraries and other civic organizations, without business models predicated on competition, may be a better fit for implementation of semantic web technologies.  And the sorts of data that libraries deal with (bibliographic and scholarly) may be better suited for semantic data as well compared to general commercial business data.  It may be that at the moment libraries, cultural heritage, and civic organizations are the majority of entities exploring linked data.

Still, the coarsely stated conclusion of that top-rated HN comment is worth repeating:

going through all the effort of putting semantic markup with no guarantee of a payoff for yourself was a stupid idea.

Putting data into linked data form simply because we’ve been told that “everyone is doing it” without carefully understanding the use cases such reformatting is supposed to benefit and making sure that it does — risks undergoing great expense for no payoff. Especially when everyone is not in fact doing it.

GIGO

Taking the same data you already have and reformatting it as “linked data” does not necessarily add much value. If it was poorly controlled, poorly modelled, or incomplete data before, it still is in RDF. You can potentially add a lot more value, and more additional uses of your data, by improving the data quality than by working to reformat it as linked data/RDF. The idea that simply reformatting it as RDF would add significant value was predicated on the idea of an ecology of software and services built to use linked data, software and services exciting enough that making your data available to them would result in added value. That ecology has not really materialized, and it’s hardly clear that it will (and to the extent it does, it may only be if libraries and cultural heritage organizations create it; we are unlikely to get a free ride on more general tools from a wider community).
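As a minimal sketch of this point, assuming Python with rdflib and a made-up record (neither of which the post specifies): serializing a sloppy, incomplete record as RDF reproduces exactly the same sloppiness and gaps.

```python
# Sketch: a poorly controlled, incomplete record stays that way in RDF.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

record = {
    "title": "anual report [sic]",  # uncorrected, uncontrolled transcription
    "creator": None,                # missing in the source data
    "date": "19??",                 # vague date stays vague
}

g = Graph()
work = URIRef("http://example.org/work/1")  # hypothetical identifier
for field_name, prop in [("title", DCTERMS.title),
                         ("creator", DCTERMS.creator),
                         ("date", DCTERMS.date)]:
    value = record[field_name]
    if value is not None:           # missing fields simply produce no triple
        g.add((work, prop, Literal(value)))

print(g.serialize(format="turtle"))
# The Turtle output has the same gaps and errors as the input; only the syntax changed.
```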

But please do share your data

To be clear, I still highly advocate taking the data you do have and making it freely available under open (or public domain) license terms. In whatever formats you’ve already got it in.  If your data is valuable, developers will find a way to use it, and simply making the data you’ve already got available is much less expensive than trying to reformat it as linked data.  And you can find out if anyone is interested in it. If nobody’s interested in your data as it is — I think it’s unlikely the amount of interest will be significantly greater after you model it as ‘linked data’. The ecology simply hasn’t arisen to make using linked data any easier or more valuable than using anything else (in many contexts and cases, it’s more troublesome and challenging than less abstract formats, in fact).

Following the bandwagon vs doing the work

Part of the problem is that modelling data is inherently a context-specific act. There is no universally applicable model — and I’m talking here about the ontological level of entities and relationships, what objects you represent in your data as distinct entities and how they are related. Whether you model it as RDF or just as custom XML, the way you model the world may or may not be useful or even usable by those in different contexts, domains and businesses.  See “Schemas aren’t neutral” in the short essay by Cory Doctorow linked to from that HN comment.  But some of the linked data promise is premised on the idea that your data will be both useful and integrate-able nearly universally with data from other contexts and domains.

These are not insoluble problems, they are interesting problems, and they are problems that libraries as professional information organizations rightly should be interested in working on. Semantic web/linked data technologies may very well play a role in the solutions (although it’s hardly clear that they are THE answer).

It’s great for libraries to be interested in working on these problems. But working on these problems means working on these problems, it means spending resources on investigation and R&D and staff with the right expertise and portfolio. It does not mean blindly following the linked data bandwagon because you (erroneously) believe it’s already been judged as the right way to go by people outside of (and with the implication ‘smarter than’) libraries. It has not been.

For individual linked data projects, it means being clear about what specific benefits they are supposed to bring to use cases you care about — short and long term — and what other outside dependencies may be necessary to make those benefits happen, and focusing on those too.  It means understanding all your technical options and considering them in a cost/benefit/risk analysis, rather than automatically assuming RDF/semantic web/linked data and as much of it as possible.

It means being aware of the costs and the hoped-for benefits, and making wise decisions about how best to allocate resources to maximize the chances of achieving those benefits. Blindly throwing resources into taking your same old data and sharing it as “linked data”, because you’ve heard it’s the thing to do, does not in fact help.


Filed under: General

by jrochkind at October 28, 2014 11:29 PM

First Thus

ACAT FW: Conflicting instructions in Bib Formats about ETDs being state government publications (OCLC’s answer: yes, they are)

Posting to Autocat

On 10/28/2014 3:51 PM, MULLEN Allen wrote:

I have no horse in this race (i.e. I really don’t care either way), but it would seem from the definition provided by James that, if it is published “at Government expense,” it is a government publication.

BTW, this definition does not mention anything requiring that these be “written under the auspices of any government agency.”

Well, I thought that the part about “written under the auspices of a government agency” sort of goes without saying. For a government document to exist, I would think that a government agency must be involved in it somewhere along the way. But maybe I’m wrong.

If we are to determine an item’s status as a government document not by those who created it, but by the management of the archive it happens to be placed in (i.e. is it a state agency?), I find that very troubling. Is it true that

“… if it is published “at Government expense,” it is a government publication…”

means that, just because something has been placed in a digital archive that has some part of its budget paid for by a state agency, it is “published at government expense” and therefore a government document?

I find it difficult to believe that an item can be determined to be a government document based not on who wrote it, but on the archive where it has been placed. If that is the case, then it would become a “government document” and there are genuine legal consequences as to copyright.

This is why I pointed out the page at http://en.wikipedia.org/wiki/Copyright_status_of_work_by_U.S._subnational_governments that discusses the legal aspects of government documents based on each state.

Therefore, based on this Wikipedia page (which may be wrong, of course), it would seem that an item in a digital archive in California that is under at least substantial control of a state university would be in the public domain, because “… [the] government may not claim copyright on public records”. But the situation would seem to be different in Minnesota.

I can’t believe that the fact that a document is in a certain archive has anything to do with what rights are connected with it. Would someone who has a dissertation that they have worked on for years willingly put it into an electronic archive at, e.g., Berkeley (a state university) and agree that it is a government document and that they “… may not claim copyright”?

I don’t think it is just a theoretical argument. It is already a contentious topic, e.g. in this article from the Chronicle it says clearly “Don’t make your dissertation available online.” http://chronicle.com/article/From-Dissertation-to-Book/127677/

An “official” statement from the library community that a dissertation in a state repository is a government document, and as such, its rights are determined by the respective state government, gives me pause. And I don’t think it’s right.

This is something that could blow up in everybody’s faces. I think this should be discussed, with legal help if possible.


by James Weinheimer at October 28, 2014 08:38 PM

TSLL TechScans

Website Archivability

With the recent Symposium: 404/File Not Found: Link Rot, Legal Citation and Projects to Preserve Precedent at Georgetown Law School, it’s important to take into consideration the future archivability of the webpages you and your institution create. We all take for granted the fluidity of the web and frequently forget that content on websites changes, and is lost, constantly. This is not just restricted to news sites, but impacts everything from our institutional sites to government and court sites. Many organizations are working to preserve the content on the internet, from individual websites to the documents, videos, and images that they include. And they seek to do this in as authentic a way as possible, as well as to give future users the ability to access and interact with the sites in the way they were originally intended.


To assist in the creation of websites that promote archiving, Stanford University Libraries recently published a set of Recommendations for Web Builders to Improve the Archivability of Their Content, with archivability referring to “the ease with which the content, structure, and front-end presentation(s) of a website can be preserved and later re-presented, using contemporary web archiving tools.” This documentation builds on other resources relating to web archiving and seeks to improve collective web preservation efforts. 

by noreply@blogger.com (Lauren Seney) at October 28, 2014 02:12 PM

First Thus

ACAT FW: Conflicting instructions in Bib Formats about ETDs being state government publications (OCLC’s answer: yes, they are)

Posting to Autocat

Myung-Ja said:

We think that the OCLC’s recommendation, i.e., remotely-accessed electronic theses and dissertations from state colleges and universities in the United States are considered to be state government publications, requires further discussion.

It seems to me that a government document should come from a government agency. From the U.S. Code (1994 US Code, Title 44 – PUBLIC PRINTING AND DOCUMENTS, CHAPTER 19 – DEPOSITORY LIBRARY PROGRAM, Sec. 1901 – Definition of Government publication), http://law.justia.com/codes/us/1994/title44/chap19/sec1901, it says:
“Government publication” as used in this chapter, means informational matter which is published as an individual document at Government expense, or as required by law.

This seems like a fairly good definition to me. According to this, a dissertation written at a state college is not a government document because such an item is not written under the auspices of any government agency.

To claim that a dissertation is a government document seems to me to be making a legal declaration. When something is a government document, in many states it automatically goes into the public domain, or in others it is the respective state that holds the copyright. http://en.wikipedia.org/wiki/Copyright_status_of_work_by_U.S._subnational_governments

I don’t know how many authors of the dissertations would agree to that.


by James Weinheimer at October 28, 2014 01:06 PM

October 27, 2014

Metadata Matters (Diane Hillmann)

What’s this Jane-athon thing?

Everyone is getting tired of the sage-on-the-stage style of preconferences, so when Deborah Fritz suggested a hackathon (thank you Deborah!) to the RDA Dev Team, we all climbed aboard and started thinking about what that kind of event might look like, particularly in the ALA Midwinter context. We all agreed: there had to be a significant hands-on aspect to really engage those folks who were eager to learn more about how the RDA data model could work in a linked data environment, and, of course, in their own home environment.

We’re calling it a Jane-athon, which should give you a clue about the model for the event: a hackathon, of course! The Jane Austen corpus is perfect for demonstrating the value of FRBR, and there’s no lack of interesting material to look at (media materials, series, spin-offs of every description) in addition to the well-known novels. So the Jane-athon will be partially about creating data, and partially about how that data fits into a larger environment. And did you know there is a Jane Austen bobblehead?

We think there will be a significant number of people who might be interested in attending, and we figured that getting the word out early would help prospective participants make their travel arrangements with attendance in mind. Sponsored by ALA Publishing, the Jane-athon will be on the Friday before the Midwinter conference (the traditional pre-conference day), and though we don’t yet have registration set up, we’ll make sure everyone knows when that’s available. If you think, as we do, that this event will be the hit of Midwinter, be sure to watch for that announcement, and register early! If the event is successful, you’ll be seeing others at subsequent ALA conferences.

So, what’s the plan and what will participants get out of it?

The first thing to know is that there will be tables and laptops to enable small groups to work together for the ‘making data’ portion of the event. We’ll be asking folks who have laptops they can bring to Chicago to plan on bringing theirs. We’ll be using the latest version of a new bibliographic metadata editor called RIMMF (“RDA In Many Metadata Formats”), which is not yet publicly available but will be soon; watch for it on the TMQ website. We encourage interested folks to download the current beta version and play with it–it’s a cool tool and really is a good one to learn about.

In the morning, we’ll form small cataloging groups and use RIMMF to do some FRBRish cataloging, starting from MARC21 and ending up with RDA records exported as RDF Linked Data. In the afternoon we’ll all take a look at what we’ve produced, share our successes and discoveries, and discuss the challenges we faced. In true hackathon tradition we’ll share our conclusions and recommendations with the rest of the library community on a special Jane-athon website set up to support this and subsequent Jane-athons.

Who should attend?

We believe that there will be a variety of people who could contribute important skills and ideas to this event. Catalogers, of course, but also every flavor of metadata people, vendors, and IT folks in libraries would be warmly welcomed. But wouldn’t tech services managers find it useful? Oh yes, they’d be welcomed enthusiastically, and I’m sure their participation in the discussion portion of the event in the afternoon will bring out issues of interest to all.

Keep in mind, this is not cataloging training, nor Toolkit training, by any stretch of the imagination. Neither will it be RIMMF training or have a focus on the RDA Registry, although all those tools are relevant to the discussion. For RIMMF, particularly, we will be looking at ways to ensure that there will be a cadre of folks who’ve had enough experience with it to make the hands-on portion of the day run smoothly. For that reason, we encourage as many as possible to play with it beforehand!

Our belief is that the small group work and the discussion will be best with a variety of experience informing the effort. We know that we can’t provide the answers to all the questions that will come up, but the issues that we know about (and that come up during the small group work) will be aired and discussed.

by Diane Hillmann at October 27, 2014 06:57 PM

025.431: The Dewey blog

WebDewey Number Building Tool: Literature and Table 3C. Notation to Be Added Where Instructed in Table 3B, 700.4, 791.4, 808-809

Note: The general approach to building numbers described here can be applied in any discipline, not just literature. See also posts 1 and 2 on using the number building tool in music.

Are you having problems using the WebDewey number building tool with Table 3C. Notation to Be Added Where Instructed in Table 3B, 700.4, 791.4, 808-809 in the 800 Literature (Belles-lettres) schedule?

If so, let’s try an example of a work about literature with a theme: Prague Palimpsest: Writing, Memory, and the City, to which the LCSH "Prague (Czech Republic)--In literature" has been assigned.

Here is a summary of the instructions for using the WebDewey number building tool to build the DDC number 809.9335843712 Prague (Czech Republic)—literature—history and criticism.  (The format of the summary is modeled on the tables used in the WebDewey training modules for the WebDewey number building tool.)

Navigate to this number / span | Click | Number built so far | Caption of last number / notation added
------------------------------ | ----- | ------------------- | ---------------------------------------
809.933             | Start | 809.933        | Literature dealing with specific themes and subjects
T3C--3583-T3C--3589 | Add   | 809.93358      | Historical themes of ancient world; of specific continents, countries, localities; of extraterrestrial worlds
930-990             | Add   | 809.93358      | History of specific continents, countries, localities; extraterrestrial worlds
T2--43712           | Add   | 809.9335843712 | Prague (Praha)

Does that answer all your questions about how to build the number?  If not, keep reading for details.

First, here is the summary from the catalog record:

A city of immense literary mystique, Prague has inspired writers across the centuries with its beauty, cosmopolitanism, and tragic history. This interdisciplinary study helps to explain why Prague - more than any other major European city - has haunted the cultural and political imagination of the West.

Here is the table of contents from the catalog record:

Women on the verge of history: Libuše and the foundational legend of Prague --
Deviant monsters and wayward women: the Prague ghetto and the legend of the golem --
The castle hill was hidden: Franz Kafka and Czech literature --
A stranger in Prague: writing and the politics of identity in Apollinaire, Nezval, and Camus --
Sailing to Bohemia: utopia, memory, and the Holocaust in postwar Austrian and German literature --
Epilogue: Postmodern Prague?

If you browse the Relative Index for "literature," you find:

Literature--history and criticism 809

The scope note at 809 History, description, critical appraisal of more than two literatures reads: "History, description, critical appraisal of works by more than one author in more than two languages not from the same language family." Since the work about Prague in literature treats more than two literatures, you can begin by considering subdivisions of 809. (The scope note has hierarchical force and thus applies to the subdivisions of 809.) Here is the Hierarchy box for 809:

[screenshot]

You might drill down in the Hierarchy box from 809 to 809.933 Literature dealing with specific themes and subjects.  Or you might browse the Relative Index for "themes" and find:   

Themes--literature--history and criticism 809.933

You could then click 809.933 to see the full record.  Either way, you have now found the record with the base number that you will use. The same record also has the add note that you need.

Here is the Hierarchy box for 809.933 Literature dealing with specific themes and subjects:

[screenshot]

Here is the Notes box with the add note that you need:

[screenshot]

At this point, the Create built number box has no number in the title bar.  Inside the box appear only the number and caption 809.933 Literature dealing with specific themes and subjects plus a Start button:

[screenshot]

If you click Start in the Create Built Number box, you can get the add note inside the Create Built Number box:

[screenshot]

Since the add note calls for adding from Table 3C—32-39, the Hierarchy box now focuses on T3C—3 Arts and literature dealing with specific themes and subjects:

[screenshot]

What notation from T3C—3 Arts and literature dealing with specific themes and subjects should you use? In the work being classified, Prague is treated more as an historical theme than as a travel theme. Also, at T3C—32 Travel and geography is the relocation note: "Civilization of places, comprehensive works on places relocated to T3C—358."  At T3C—358 Historical, political, military themes is the class-here note: "Class here civilization of places, comprehensive works on places [both formerly T3C--32], historical events." T3C—3 Arts and literature dealing with specific themes and subjects parallels the DDC schedule in general.  In DDC interdisciplinary works about a specific place including both travel and history are classed with history.  At 913-919 Geography of and travel in specific continents, countries, localities; extraterrestrial worlds is the class-elsewhere note: "Class interdisciplinary works on geography and history of ancient world, of specific continents, countries, localities in 930-990." There is a corresponding class-here note at 930-990 History of specific continents, countries, localities; extraterrestrial worlds: "Class here interdisciplinary works on geography and history of ancient world, of specific continents, countries, localities."  For more information about T3C—3, see these previous blog posts (here and here) and Table of Mappings: DDC 000-990 to Table 3C—3.  

Browsing the Relative Index for "historical themes" yields:

Historical themes—arts   T3C—358

Here is the Hierarchy box for T3C—358 Historical, political, military themes:

[screenshot]


T3C—358 Historical, political, military themes is too broad for history of Prague as a theme, and in the Notes box for T3C—358, there is no add note that would allow you to make the number more specific.  Looking down in the Hierarchy box, you click T3C—3583-T3C—3589 because you think that record might have the add note you need.  Here is the Hierarchy box for T3C—3583-T3C—3589 Historical themes of ancient world; of specific continents, countries, localities; of extraterrestrial worlds:

[screenshot]

Here is the Notes box for T3C—3583-T3C—3589; it includes the add note you need ("Add to base number T3C--358 the numbers following 9 in 930-990. . . . "):

[screenshot]

Click Add to get the add note inside the Create built number box:

[screenshot]

Now, how to specify Prague? You need to add notation from Table 2. Geographic Areas, Historical Periods, Biography, but no add note now inside the Create built number box says anything about adding notation from Table 2.  The number-building tool has, however, gone to 930-990, which is mentioned in the add note you just put inside the Create built number box.  Here is the Hierarchy box for 930-990 History of specific continents, countries, localities; extraterrestrial worlds:

[screenshot]

The long Notes box for 930-990 contains a large add table.  You won’t need any of the notation in the add table—but you do need the add instruction that introduces the add table, because it says to add notation from Table 2 ("Add to base number 9 notation T2--3-T2--9 from Table 2, e.g., general history of Europe 940, of England 942, of Norfolk, England 942.61; then add further as follows"):

[screenshot]

When you click Add, the instruction to add notation from Table 2 is put inside the Create built number box: 

[screenshot]

If you browse the Relative Index for "prague," you find:

Prague (Czech Republic) T2--43712

Here is the Hierarchy box for T2—43712 Prague (Praha):

[screenshot]

After you click Add, the title bar of the Create built number box has 809.9335843712, and
T2—43712 Prague (Praha) appears inside:

[screenshot]

If you click Save, the new built number appears in the Hierarchy box:

[screenshot]

You now have an opportunity to modify or add to the user terms associated with that new number, as explained in the "User Terms with Number Building" part of the WebDewey training modules.  Enough for now! You have successfully built the number.   
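For anyone who wants to double-check the arithmetic behind the finished number, here is a minimal sketch (my own illustration, not part of WebDewey) that assembles 809.9335843712 from the pieces used above:

```python
# Assemble 809.9335843712 from the components used in this example.
base = "809.933"              # Literature dealing with specific themes and subjects
t3c_theme = "358"             # T3C--358 Historical, political, military themes
history_of_prague = "943712"  # 9 + T2--43712 (Prague), per the 930-990 add instruction

# T3C--3583-T3C--3589: add to T3C--358 the numbers following 9 in 930-990
t3c_notation = t3c_theme + history_of_prague[1:]   # "35843712"

# 809.933: add the T3C--3 notation, dropping its leading 3 (as the built number 809.93358 shows)
built = base + t3c_notation[1:]                    # "809.9335843712"

print(built)
assert built == "809.9335843712"
```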

Keys to success:
•    Find the record with the base number that you will use and the record with the add note that you will use (often the same record).
•    At each step, find the record with the relevant add note, display the full record so that the add note appears in the Notes box, and click Start or Add to get that add note to appear inside the Create built number box.
•    If a complete add table is displayed in the Notes box, but you don’t need a specific entry in the add table—rather, what you need is the add note at the beginning that introduces the add table—then click Start with the full Notes box displayed, to get that add note to appear inside the Create built number box.

 

by Juli at October 27, 2014 06:49 PM

Resource Description & Access (RDA)

Transcription in Resource Description & Access (RDA) Cataloging

“Take What You See and Accept What You Get”

This is the overriding principle of RDA concerning the transcription of data. It is consistent with the ICP “Principle of Representation” to represent the resource the way it represents itself. This is a fairly significant change from AACR2, which includes extensive rules for abbreviations, capitalization, punctuation, numerals, symbols, etc., and in some cases directs the cataloger to ‘correct’ data which is known to be wrong (e.g., typos). With RDA we generally do not alter what is on the resource when transcribing information for certain elements. This is not only to follow the principle of representation, but also for a more practical reason: to encourage re-use of found data you can copy and paste or scan or download into your description of the resource.

Let’s see what this principle means for you as an LC cataloger, regarding capitalization, punctuation, and spacing.  It is critical that you understand LCPS 1.7.1; the overriding principles codified there are generally not discussed elsewhere in the specific instructions.

In the RDA Toolkit, display RDA 1.7.1

Note that the alternatives at RDA 1.7.1 allow for the use of in-house guidelines for capitalization, punctuation, numerals, symbols, abbreviations, etc. -- in lieu of RDA instructions or appendices.

Capitalization

Regarding capitalization, RDA 1.7.2 directs the cataloger to “Apply the instructions on capitalization found in Appendix A.” But LC policy says that you can follow the capitalization that you find, without adjusting it.

In the RDA Toolkit, click on the first LCPS link in the Alternative to RDA 1.7.1

“For capitalization of transcribed elements, either “take what you see” on the resource or follow [Appendix] A.”

Punctuation, Numerals, Symbols, Abbreviations, etc.

LCPS 1.7.1, First Alternative says “follow the guidelines in 1.7.3 – 1.7.9 and in the appendices.”


Transcribed Elements vs. Recorded Elements

RDA distinguishes between transcribed elements and recorded elements.
  • For transcribed elements, generally accept the data as found on the resource.
  • For recorded elements, the found information is often adjusted (for example, the hyphens in an ISBN are omitted).
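A minimal sketch of the distinction, using made-up data and hypothetical helper names (this is not RDA Toolkit or MARC code): transcribed elements are kept exactly as found, while recorded elements may be adjusted, such as omitting the hyphens from an ISBN.

```python
# Transcribed elements: take what you see. Recorded elements: adjust as instructed.

def transcribe(value: str) -> str:
    """Keep the data exactly as it appears on the resource."""
    return value

def record_isbn(value: str) -> str:
    """Recorded element example: the hyphens in an ISBN are omitted."""
    return value.replace("-", "")

# Hypothetical data as found on a title page and its verso
title_as_found = "An INTRODUCTION to cataloguing practise"  # odd capitalization and typo kept as found
isbn_as_found = "978-1-23456-789-7"                         # made-up ISBN

print(transcribe(title_as_found))  # unchanged: "An INTRODUCTION to cataloguing practise"
print(record_isbn(isbn_as_found))  # "9781234567897"
```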

Language and Script

The basic instruction for most of the elements for describing a manifestation is to transcribe the data in the language and script found in the resource (“take what you see”).  RDA 1.4 contains a list of elements to be transcribed from the resource in the found language and script.

For non-transcribed elements:
  • When recording all other elements (e.g., extent, notes), record them in the language and script preferred by the agency creating the data (at LC, this is English)
  • When adding information within an element, record it in the language and script of the element to which it is being added
  • When supplying an entire element, generally supply it in English


Regarding non-Latin scripts, LCPS 1.4, First Alternative states the LC policy: record a transliteration instead, or give both (using the MARC 880 fields).

[Source: Library of Congress]


<<<<<---------->>>>>

Also check out following RDA rules in RDA Toolkit for further details:

1.7 Transcription
  • 1.7.1 General Guidelines on Transcription
  • 1.7.2 Capitalization
  • 1.7.3 Punctuation
  • 1.7.4 Diacritical Marks
  • 1.7.5 Symbols
  • 1.7.6 Spacing of Initials and Acronyms
  • 1.7.7 Letters or Words Intended to Be Read More Than Once
  • 1.7.8 Abbreviations
  • 1.7.9 Inaccuracies


<<<<<---------->>>>>


[Updated 2014-10-28]

by Salman Haider (noreply@blogger.com) at October 27, 2014 01:32 PM

Lorcan Dempsey's weblog

Research information management systems - a new service category?

It has been interesting watching Research Information Management or RIM emerge as a new service category in the last couple of years. RIM is supported by a particular system category, the Research Information Management System (RIMs), sometimes referred to by an earlier name, the CRIS (Current Research Information System).

For reasons discussed below, this area has been more prominent outside the US, but interest is also now growing in the US. See for example, the mention of RIMs in the Library FY15 Strategic Goals at Dartmouth College.

Research information management

The name is unfortunately confusing - a reserved sense living alongside more general senses. What is the reserved sense? Broadly, RIM is used to refer to the integrated management of information about the research life-cycle, and about the entities which are party to it (e.g. researchers, research outputs, organizations, grants, facilities, ..). The aim is to synchronize data across parts of the university, reducing the burden to all involved of collecting and managing data about the research process. An outcome is to provide greater visibility onto institutional research activity. Motivations include better internal reporting and analytics, support for compliance and assessment, and improved reputation management through more organized disclosure of research expertise and outputs.

A major driver has been the need to streamline the provision of data to various national university research assessment exercises (for example, in the UK, Denmark and Australia). Without integrated support, responding to these is costly, with activities fragmented across the Office of Research, individual schools or departments, and other support units, including, sometimes, the library. (See this report on national assessment regimes and the roles of libraries.)

Some of the functional areas covered by a RIM system may be:

  • Award management and identification of award opportunities. Matching of interests to potential funding sources. Supporting management of and communication around grant and contracts activity.
  • Publications management. Collecting data about researcher publications. Often this will be done by searching in external sources (Scopus and Web of Science, for example) to help populate profiles, and to provide alerts to keep them up to date.
  • Coordination and publishing of expertise profiles. Centralized upkeep of expertise profiles. Pulling of data from various systems. This may be for internal reporting or assessment purposes, to support individual researchers in providing personal data in a variety of required forms (e.g. for different granting agencies), and for publishing to the web through an institutional research portal or other venue.
  • Research analytics/reporting. Providing management information about research activity and interests, across departments, groups and individuals.
  • Compliance with internal/external mandates.
  • Support of open access. Synchronization with institutional repository. Managing deposit requirements. Integration with sources of information about Open Access policies.

To meet these goals, a RIM system will integrate data from a variety of internal and external systems. Typically, a university will currently manage information about these processes across a variety of administrative and academic departments. Required data also has to be pulled from external systems, notably data about funding opportunities and publications.
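As a rough sketch of the entities involved (illustrative only, not any particular product's data model), the kinds of records a RIM system synchronizes might look like this:

```python
# Illustrative model of RIM entities: researchers, outputs, organizations, grants.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Organization:
    name: str

@dataclass
class Researcher:
    name: str
    orcid: str                  # identifier useful for synchronizing profiles
    affiliation: Organization

@dataclass
class Grant:
    funder: str
    title: str
    investigators: List[Researcher] = field(default_factory=list)

@dataclass
class Publication:
    title: str
    doi: str
    authors: List[Researcher] = field(default_factory=list)
    grants: List[Grant] = field(default_factory=list)  # links outputs back to awards

# A RIM system pulls records like these from HR, the grants office, external
# publication indexes, and the repository, and keeps them consistent.
```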

Products

Several products have emerged specifically to support RIM in recent years. This is an important reason for suggesting that it is emerging as a recognized service category.

  • Pure (Elsevier). "Pure aggregates your organization's research information from numerous internal and external sources, and ensures the data that drives your strategic decisions is trusted, comprehensive and accessible in real time. A highly versatile system, Pure enables your organization to build reports, carry out performance assessments, manage researcher profiles, enable expertise identification and more, all while reducing administrative burden for researchers, faculty and staff." [Pure]
  • Converis (Thomson Reuters). "Converis is the only fully configurable research information management system that can manage the complete research lifecycle, from the earliest due diligence in the grant process through the final publication and application of research results. With Converis, understand the full scope of your organization's contributions by building scholarly profiles based on our publishing and citations data--then layer in your institutional data to more specifically track success within your organization." [Converis]
  • Symplectic Elements. "A driving force of our approach is to minimise the administrative burden placed on academic staff during their research. We work with our clients to provide industry leading software services and integrations that automate the capture, reduce the manual input, improve the quality and expedite the transfer of rich data at their institution."[Symplectic]

Pure and Converis are parts of broader sets of research management and analytics services from, respectively, Elsevier (Elsevier research intelligence) and Thomson Reuters (Research management and evaluation). Each is a recent acquisition, providing an institutional approach alongside the aggregate, network level approach of each company's broader research analytics and management services.

Symplectic is a member of the very interesting Digital Science portfolio. Digital Science is a company set up by Macmillan Publishers to incubate start-ups focused on scientific workflow and research productivity. These include, for example, Figshare and Altmetric.

Other products are also relevant here. As RIM is an emerging area, it is natural to expect some overlap with other functions. For example, there is definitely overlap with backoffice research administration systems - Ideate from Consilience or solutions from infoEd Global, for example. And also with more publicly oriented profiling and expertise systems on the front office side.

With respect to the latter, Pure and Symplectic both note that they can interface to VIVO. Furthermore, Symplectic can provide "VIVO services that cover installation, support, hosting and integration for institutions looking to join the VIVO network". It also provides implementation support for the Profiles Research Networking Software.

As I discuss further below, one interesting question for libraries is the relationship between the RIMs or CRIS and the institutional repository. Extensions have been written for both Dspace and Eprints to provide some RIMs-like support. For example, Dspace-Cris extends the Dspace model to cater for the Cerif entities. This is based on work done for the Scholar's Hub at Hong Kong University.

It is also interesting to note that none of the three open source educational community organizations - Kuali, The Duraspace Foundation, or The Apereo Foundation - has a directly comparable offering, although there are some adjacent activities. In particular, Kuali Coeus for Research Administration is "a comprehensive system to manage the complexities of research administration needs from the faculty researcher through grants administration to federal funding agencies", based on work at MIT. Duraspace is now the organizational home for VIVO.

Finally, there are some national approaches to providing RIMs or CRIS functionality, associated with a national view of research outputs. This is the case in South Africa, Norway and The Netherlands, for example.

Standards

Another signal that this is an emerging service category is the existence of active standards activities. Two are especially relevant here: CERIF (Common European Research Information Format) from EuroCRIS, which provides a format for exchange of data between RIM systems, and the CASRAI dictionary. CASRAI is the Consortia Advancing Standards in Research Administration Information.

Libraries

So, what about research information management (in this reserved sense) and libraries? One of the interesting things to happen in recent years is that a variety of other campus players are developing service agendas around digital information management that may overlap with library interests. This has happened with IT, learning and teaching support, and with the University press, for example. This coincides with another trend, the growing interest in tracking, managing and disclosing the research and learning outputs of the institution: research data, learning materials, expertise profiles, research reports and papers, and so on. The convergence of these two trends means that the library now has shared interests with the Office of Research, as well as with other campus partners. As both the local institutional and public science policy interest in university outputs grows, this will become a more important area, and the library will increasingly be a partner. Research Information Management is a part of a slowly emerging view of how institutional digital materials will be managed more holistically, with a clear connection to researcher identity.

As noted above, this interest has been more pronounced outside the US to date, but will I think become a more general interest in coming years. It will also become of more general interest to libraries. Here are some contact points.

  • The institutional repository boundary. It is acknowledged that Institutional Repositories (IRs) have been a mixed success. One reason for this is that they are to one side of researcher workflows, and not necessarily aligned with researcher incentives. Although also an additional administrative overhead, Research Information Management is better aligned with organizational and external incentives. See for example this presentation (from Royal Holloway, U of London) which notes that faculty are more interested in the CRIS than they had been in the IR, 'because it does more for them'. It also notes that the library no longer talks about the 'repository' but about updating profiles and loading full-text. There is a clear intersection between RIMs and the institutional repository and the boundary may be managed in different ways. Hong Kong University, for example, has evolved its institutional repository to include RIMs or CRIS features. Look at the publications or presentations of David Palmer, who has led this development, for more detail. There is a strong focus here on improved reputation management on the web through effective disclosure of researcher profiles and outputs. Movement in the other direction has also occurred, where a RIMs or CRIS is used to support IR-like services. Quite often, however, the RIMs and IR are working as part of an integrated workflow, as described here.
  • Management and disclosure of research outputs and expertise. There is a growing interest in researcher and research profiles, and the RIMs may support the creation and management of a 'research portal' on campus. An important part of this is assisting researchers to more easily manager their profiles, including prompting with new publications from searches of external sources. See the research portal at Queen's University Belfast for an example of a site supported by Pure. Related to this is general awareness about promotion, effective publishing, bibliometrics, and management of online research identity. Some libraries are supporting the assignment of ORCIDs. The presentations of Wouter Gerritsma, of Wageningen University in The Netherlands, provide useful pointers and experiences.
  • Compliance with mandates/reporting. The role of RIMs in supporting research assessment regimes in various countries was mentioned earlier: without such workflow support, participation was expensive and inefficient. Similar issues are arising as compliance to institutional or national mandates needs to be managed. Earlier this year, the California Digital Library announced that it had contracted with Symplectic "to implement a publication harvesting system in support of the UC Open Access Policy". US Universities are now considering the impact of the OSTP memo "Increasing Access to the Results of Federally Funded Scientific Research," [PDF] which directs funding agencies with an annual R&D budget over $100 million to develop a public access plan for disseminating the results of their research. ICPSR summarises the memo and its implications here. It is not yet clear how this will be implemented, but it is an example of the growing science and research policy interest in the organized disclosure of information about, and access to, the outputs of publicly funded research. This drives a University wide interest in research information management. In this context, SHARE may provide some focus for greater RIM awareness.
  • Management of institutional digital materials. I suggest above that RIM is one strand of the growing campus interest in managing institutional materials - research data, video, expertise profiles, and so on. Clearly, the relationship between research information management, whatever becomes of the institutional repository, and the management of research data is close. This is especially the case in the US, given the inclusion of research data within the scope of the OSTP memo. The library is a natural institutional partner and potential home for some of this activity, and also provides expertise in what Arlitsch and colleagues call 'new knowledge work', thinking about the identifiers and markup that the web expects.

Whether or not Research Information Management becomes a new service category in the US in quite the way I have discussed it here, it is clear that the issues raised will provide important opportunities for libraries to become further involved in supporting the research life of the university.


by dempseyl@oclc.org (Lorcan Dempsey) at October 27, 2014 03:02 AM

Resource Description & Access (RDA)

RDA QUIZ : question on International Cataloging Principles (ICP) by IFLA

Post your vote by this Friday. You can also add your input in the comments box...
The correct answer will be announced over the weekend, and a follow-up post with further explanation and interesting comments from users will appear on RDA Blog.

by Salman Haider (noreply@blogger.com) at October 27, 2014 02:21 AM

October 25, 2014

Coyle's InFormation

Citations get HOT

The Public Library of Science's research section, PLOS Labs (ploslabs.org), has announced some very interesting news about the work that they are doing on citations, which they are calling "Rich Citations".

Citations are the ultimate "linked data" of academia, linking new work with related works. The problem is that the link is human-readable only and has to be interpreted by a person to understand what the link means. PLOS Labs have been working to make those citations machine-expressive, even though they don't natively provide the information needed for a full computational analysis.

Given what one does have in a normal machine-readable document with citations, they are able to pull out an impressive amount of information:
  • What section the citation is found in. There is some difference in meaning depending on whether a citation is found in the "Background" section of an article or in the "Methodology" section. This gives only a hint of the meaning of the citation, but it's more than no information at all.
  • How often a resource is cited in the article. This could give some weight to its importance to the topic of the article.
  • What resources are cited together. Whenever a sentence ends with "[3][7][9]", you at least know that those three resources equally support what is being affirmed. That creates a bond between those resources.
  • ... and more
As an open access publisher, they also want to be able to take users as directly as possible to the cited resources. For PLOS publications, they can create a direct link. For other resources, they make use of the DOI to provide links. Where possible, they reveal the license of cited resources, so that readers can know which resources are open access and which are pay-walled.
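To make this concrete, a single rich citation record might carry something like the following. This is a purely hypothetical shape sketched from the properties described above, not PLOS's actual data format, and the DOIs are made-up placeholders.

# A hypothetical rich-citation record; field names are illustrative only,
# not PLOS's actual API format, and the DOIs are placeholders.
rich_citation = {
  citing_article: "10.1371/journal.pxxx.0000000",     # placeholder DOI
  cited_work:     "10.1371/journal.pxxx.1111111",     # placeholder DOI
  section:        "Methodology",     # where in the paper the citation appears
  mention_count:  3,                 # how often the work is cited in the article
  cited_with:     ["10.1371/journal.pxxx.2222222"],   # co-cited in the same bracket
  license:        "CC-BY"            # open access vs. pay-walled, where known
}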

This is just a beginning, and their demo site, appropriately named "alpha," uses their rich citations on a segment of the PLOS papers. They also have an API that developers can experiment with.

I was fortunate to be able to spend a day recently at their Citation Hackathon where groups hacked on ongoing aspects of this work. Lots of ideas floated around, including adding abstracts to the citations so a reader could learn more about a resource before retrieving it. Abstracts also would add search terms for those resources not held in the PLOS database. I participated in a discussion about coordinating Wikidata citations and bibliographies with the PLOS data.

Being able to datamine the relationships inherent in the act of citation is a way to help make visible and actionable what has long been the rule in academic research, which is to clearly indicate upon whose shoulders you are standing. This research is very exciting, and although the PLOS resources will primarily be journal articles, there are also books in their collection of citations. The idea of connecting those to libraries, and eventually connecting books to each other through citations and bibliographies, opens up some interesting research possibilities.

by Karen Coyle (noreply@blogger.com) at October 25, 2014 11:07 AM

October 24, 2014

TSLL TechScans

Library of Congress BIBFRAME update

On September 4th a presentation entitled Bibliographic Framework Initiative (BIBFRAME): Update & Practical Applications was given to Library of Congress staff. Beacher Wiggins, Kevin Ford and Paul Frank delivered an explanation of the current state of BIBFRAME and its implications for library metadata. The target audience for the presentation is experienced catalogers; BIBFRAME structure and concepts are explicated in an understandable way. Paul Frank attempts to assess the impact of BIBFRAME implementation on the work of a typical cataloger.

The presentation is available for viewing via the Library of Congress' BIBFRAME media portal at http://www.loc.gov/bibframe/media/updateforum-sep04-2014.html.

by noreply@blogger.com (Jackie Magagnosc) at October 24, 2014 09:06 PM

Resource Description & Access (RDA)

What is FRBR?

What is FRBR? -- RDA Quiz on Google+ Community RDA Cataloging.

Join the RDA Cataloging online community / group / forum to share ideas on RDA and discuss issues related to Resource Description and Access cataloging.



The following comments were received on this RDA Blog post:

<<<<<---------->>>>>


Roger Hawcroft
Library Consultant
Salman, FRBR is an acronym for Functional Requirements for Bibliographic Records. It stems from recommendations made by IFLA in 1988. FRBR represents the departure of bibliographic description from the long-standing linear model as used in AACR... to a multi-tiered concept contemporaneous with current technology and the increasing development of digital formats and storage. These principles underpin RDA - Resource Description & Access.

You may find the following outline useful:
http://www.loc.gov/cds/downloads/FRBR.PDF

I have also placed a list of readings (not intended to be comprehensive or entirely up-to-date) in DropBox for you:
https://www.dropbox.com/s/quf7nhmcm43r530/Selected%20Readings%20on%20FRBR%20%C2%A0%28April%202014%29.pdf?dl=0

An online search should relatively easily find you the latest papers / articles / opinion on this concept of cataloguing, and I am sure that you will find many librarians on LI who have plenty to say for and against the approach!

<<<<<---------->>>>>


Sris Ponniahpillai
Library Officer at University of Technology, Sydney
Salman, I hope the article at the following link will help you to understand what FRBR stands for in library terms. Thanks & Best Regards, Sris

http://www.loc.gov/cds/downloads/FRBR.PDF



<<<<<---------->>>>>


Alan Danskin
Metadata Standards Manager at The British Library
FRBR (Functional Requirements for Bibliographic Records) is a model published by IFLA. RDA is an implementation of the FRBR and FRAD (Functional Requirements for Authority Data) models. The FRBR Review Group is currently working on consolidation of these models and the Functional Requirements for Subject Authority Data (FRSAD) model. See http://www.ifla.org/frbr-rg and http://www.ifla.org/node/2016


<<<<<---------->>>>>






Harshadkumar Patel
Deputy Librarian, C.U. Shah Medical College
Functional Requirements for Bibliographic Records is a conceptual entity-relationship model developed by the International Federation of Library Associations and Institutions that relates user tasks of retrieval and access in online library catalogues and bibliographic databases from a user's perspective.



<<<<<---------->>>>>


Erik Dessureault
Library Systems Technician at Concordia University
When I was first introduced to FRBR and RDA in library school, I was immediately struck by how the structure of FRBR lines up nicely with the structure of XML. I am sure that is not a coincidence. Our teacher made us draw out FRBR schemas as part of our assignment, and the parallels with database entity-relation diagrams and programming flowcharts were immediately apparent to me. Coming from an information technology background, with some programming and database creation/management experience, FRBR came naturally to me, and struck me as a very rational way to organize information. I can see the potential for automation and standardization and I am eager to see FRBR and RDA become accepted standards in our field.

by Salman Haider (noreply@blogger.com) at October 24, 2014 02:11 AM

October 23, 2014

OCLC Cataloging and Metadata News

October 2014 data update now available for the WorldCat knowledge base

The WorldCat knowledge base continues to grow with new providers and collections added monthly. The details for the October updates are now available in the full release notes.

October 23, 2014 02:15 PM

Mod Librarian

5 Things Thursday: DAM, PLUS, the Internet and the Brain

5 things in brief, as I started a new job this week and it's going fine.

  1. Does the web shatter focus and rewire brains? I forgot the question…
  2. DAM ROI
  3. PLUS Registry for always current embedded metadata and David Riecks
  4. The real Internet of things
  5. DAM in 30 seconds

View On WordPress

October 23, 2014 12:50 PM

October 21, 2014

First Thus

ACAT Best way to catalog geographic information?

Posting to Autocat

On 20/10/2014 22.45, Julie Huddle wrote:

I will be starting an internship which will involve cataloging. I have been asked to help develop the best way to record the geographic coordinates of research items so that patrons can find resources about a geographic area of interest. After reading Bidney’s 2010 article, I now have the following questions:
1. How difficult and effective would the official form of geographic terms be for this?
2. If I record the geographic coordinates of a resource, should I use the center or corner of the area covered?
3. Would using a geographic search interface such as MapHappy or Yahoo!Map be worth the trouble?

This is the sort of problem where linked data should ride to the rescue.

Instead of adding coordinates to each and every bib record (a terrifying notion!), those records should contain links to–something else–where the coordinates exist. This would normally mean links from the bib records to authority records, but unfortunately, this information does not exist in many, many, many of our geographic records, e.g. there is nothing in the record for Herculaneum (Extinct city) http://lccn.loc.gov/sh85060358 – one of the greatest archaeological sites in Italy – nor in the record for the little town in New Mexico where I grew up http://lccn.loc.gov/n80085226.

But all of this is in dbpedia, e.g. for the little town in New Mexico: http://dbpedia.org/page/Socorro,_New_Mexico. The ultimate way it can work can be seen in Wikipedia (where the dbpedia information comes from) http://en.wikipedia.org/wiki/Socorro,_New_Mexico.

Close to the top are the coordinates that you can click on http://tools.wmflabs.org/geohack/geohack.php?pagename=Socorro%2C_New_Mexico&params=34_3_42_N_106_53_58_W_region:US_type:city

and from here, there are maps of all kinds: weather, traffic, historic, terrain, etc. etc. I personally like Night Lights.

So, I think the solution to your problem is to add links from authority files to something(?!) and then see what can be built, using any of the tools Wikipedia uses, or something new. As we see, none of this needs MARC format and it may be more efficient to add links to dbpedia instead of any library tools. Otherwise, it is a huge amount of work.
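As a rough illustration of how little would be needed once such links exist, here is a sketch in Ruby that pulls coordinates straight from DBpedia. It assumes DBpedia's JSON output at /data/<Resource>.json and its use of the W3C WGS84 geo vocabulary; the resource name and the handling of the result are illustrative only.

require "net/http"
require "uri"
require "json"

# Fetch the DBpedia JSON description of a place and dig out its coordinates.
# Assumes DBpedia exposes /data/<Resource>.json and uses the WGS84 geo vocabulary.
def dbpedia_coordinates(resource)
  url  = URI("http://dbpedia.org/data/#{resource}.json")
  data = JSON.parse(Net::HTTP.get(url))
  node = data["http://dbpedia.org/resource/#{resource}"] || {}
  lat  = node["http://www.w3.org/2003/01/geo/wgs84_pos#lat"]
  long = node["http://www.w3.org/2003/01/geo/wgs84_pos#long"]
  return nil unless lat && long
  [lat.first["value"], long.first["value"]]
end

# e.g. dbpedia_coordinates("Socorro,_New_Mexico")
#   => a latitude/longitude pair, ready to hand to whatever mapping tool you like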

There is a lot of information available on the web that we can use to help us.


by James Weinheimer at October 21, 2014 12:20 PM

October 20, 2014

Bibliographic Wilderness

ActiveRecord Concurrency in Rails4: Avoid leaked connections!

My past long posts about multi-threaded concurrency in Rails ActiveRecord are some of the most visited posts on this blog, so I guess I’ll add another one here; if you’re a “tl;dr” type, you should probably bail now, but past long posts have proven useful to people over the long-term, so here it is.

I’m in the middle of updating to Rails4 an app that uses multi-threaded concurrency in unusual ways.  The good news is that the significant bugs I ran into in Rails 3.1 etc., reported in the earlier post, have been fixed.

However, the ActiveRecord concurrency model has always made it too easy to accidentally leak orphaned connections, and in Rails4 there’s no good way to recover these leaked connections. Later in this post, I’ll give you a monkey patch to ActiveRecord that will make it much harder to accidentally leak connections.

Background: The ActiveRecord Concurrency Model

Is pretty much described in the header docs for ConnectionPool, and the fundamental architecture and contract hasn’t changed since Rails 2.2.

Rails keeps a ConnectionPool of individual connections (usually network connections) to the database. Each connection can only be used by one thread at a time, and needs to be checked out and then checked back in when done.

You can check out a connection explicitly using the `checkout` and `checkin` methods. Or, better yet, use the `with_connection` method to wrap database use.  So far so good.
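For example (a minimal sketch; the `SELECT 1` is just a stand-in for whatever database work you actually do):

# Explicit checkout: you are responsible for checking the connection back in.
conn = ActiveRecord::Base.connection_pool.checkout
begin
  conn.execute("SELECT 1")
ensure
  ActiveRecord::Base.connection_pool.checkin(conn)
end

# Better: with_connection checks the connection back in for you,
# even if the block raises.
ActiveRecord::Base.connection_pool.with_connection do |conn|
  conn.execute("SELECT 1")
end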

But ActiveRecord also supports an automatic/implicit checkout. If a thread performs an ActiveRecord operation, and that thread doesn’t already have a connection checked out to it (ActiveRecord keeps track of whether a thread has a checked out connection in Thread.current), then a connection will be silently, automatically, implicitly checked out to it. It still needs to be checked back in.

And you can call `ActiveRecord::Base.clear_active_connections!`, and all connections checked out to the calling thread will be checked back in. (Why might there be more than one connection checked out to the calling thread? Mostly only if you have more than one database in use, with some models in one database and others in others.)

And that’s what ordinary Rails use does, which is why you haven’t had to worry about connection checkouts before.  A Rails action method begins with no connections checked out to it; if and only if the action actually tries to do some ActiveRecord stuff, does a connection get lazily checked out to the thread.

And after the request has been processed and the response delivered, Rails itself will call `ActiveRecord::Base.clear_active_connections!` inside the thread that handled the request, checking back in any connections that were checked out.

The danger of leaked connections

So, if you are doing “normal” Rails things, you don’t need to worry about connection checkout/checkin. (modulo any bugs in AR).

But if you create your own threads to use ActiveRecord (inside or outside a Rails app, doesn’t matter), you absolutely do.  If you proceed blithely to use AR like you are used to in Rails, but have created Threads yourself — then connections will be automatically checked out to you when needed… and never checked back in.

The best thing to do in your own threads is to wrap all AR use in a `with_connection`. But if some code somewhere accidentally does an AR operation outside of a `with_connection`, a connection will get checked out and never checked back in.

And if the thread then dies, the connection will become orphaned or leaked, and in fact there is no way in Rails4 to recover it.  If you leak one connection like this, that’s one less connection available in the ConnectionPool.  If you leak all the connections in the ConnectionPool, then there’s no more connections available, and next time anyone tries to use ActiveRecord, it’ll wait as long as the checkout_timeout (default 5 seconds; you can set it in your database.yml to something else) trying to get a connection, and then it’ll give up and throw a ConnectionTimeout. No more database access for you.

In Rails 3.x, there was a method `clear_stale_cached_connections!`, that would  go through the list of all checked out connections, cross-reference it against the list of all active threads, and if there were any checked out connections that were associated with a Thread that didn’t exist anymore, they’d be reclaimed.   You could call this method from time to time yourself to try and clean up after yourself.

And in fact, if you tried to check out a connection, and no connections were available — Rails 3.2 would call clear_stale_cached_connections! itself to see if there were any leaked connections that could be reclaimed, before raising a ConnectionTimeout. So if you were leaking connections all over the place, you still might not notice, the ConnectionPool would clean em up for you.

But this was a pretty expensive operation, and in Rails4, not only does the ConnectionPool not do this for you, but the method isn’t even available to you to call manually.  As far as I can tell, there is no way using public ActiveRecord API to clean up a leaked connection; once it’s leaked it’s gone.

So this makes it pretty important to avoid leaking connections.

(Note: There is still a method `clear_stale_cached_connections` in Rails4, but it’s been redefined in a way that doesn’t do the same thing at all, and does not do anything useful for leaked connection cleanup.  That it uses the same method name is, I think, based on a misunderstanding by Rails devs of what it’s doing. See Fear the Reaper below.)

Monkey-patch AR to avoid leaked connections

I understand where Rails is coming from with the ‘implicit checkout’ thing.  For standard Rails use, they want to avoid checking out a connection for a request action if the action isn’t going to use AR at all. But they don’t want the developer to have to explicitly check out a connection, they want it to happen automatically. (In no previous version of Rails, back from when AR didn’t do concurrency right at all in Rails 1.0 and Rails 2.0-2.1, has the developer had to manually check out a connection in a standard Rails action method).

So, okay, it lazily checks out a connection only when code tries to do an ActiveRecord operation, and then Rails checks it back in for you when the request processing is done.

The problem is, for any more general-purpose usage where you are managing your own threads, this is just a mess waiting to happen. It’s way too easy for code to ‘accidentally’ check out a connection that never gets checked back in and gets leaked, with no API available anymore to even recover the leaked connections. It’s way too error prone.

That API contract of “implicitly checkout a connection when needed without you realizing it, but you’re still responsible for checking it back in” is actually kind of insane. If we’re doing our own `Thread.new` and using ActiveRecord in it, we really want to disable that entirely, and so code is forced to do an explicit `with_connection` (or `checkout`, but `with_connection` is a really good idea).

So, here, in a gist, is a couple-dozen-line monkey patch to ActiveRecord that lets you, on a thread-by-thread basis, disable the “implicit checkout”.  Apply this monkey patch (just throw it in a config/initializer, that works), and if you’re ever manually creating a thread that might (even accidentally) use ActiveRecord, the first thing you should do is:

Thread.new do 
   ActiveRecord::Base.forbid_implicit_checkout_for_thread!

   # stuff
end

Once you’ve called `forbid_implicit_checkout_for_thread!` in a thread, that thread will be forbidden from doing an ‘implicit’ checkout.

If any code in that thread tries to do an ActiveRecord operation outside a `with_connection` without a checked out connection, instead of implicitly checking out a connection, you’ll get an ActiveRecord::ImplicitConnectionForbiddenError raised — immediately, fail fast, at the point the code wrongly ended up trying an implicit checkout.

This way you can enforce your code to only use `with_connection` like it should.

Note: This code is not battle-tested yet, but it seems to be working for me with `with_connection`. I have not tried it with explicitly checking out a connection with ‘checkout’, because I don’t entirely understand how that works.
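To give a sense of the shape of the idea, here is a rough, untested sketch of one way such a guard could be written. It is not the actual gist; the real patch may differ, and it assumes the Rails 4.x ConnectionPool methods `connection`, `with_connection` and `active_connection?`, whose internals can change between releases.

# Sketch only: a thread-local flag plus a guard around ConnectionPool#connection.
module ActiveRecord
  class ImplicitConnectionForbiddenError < ActiveRecordError; end

  class Base
    # Call this at the top of any thread you create yourself.
    def self.forbid_implicit_checkout_for_thread!
      Thread.current[:forbid_implicit_checkout] = true
    end
  end

  module ConnectionAdapters
    class ConnectionPool
      alias_method :connection_without_guard, :connection
      alias_method :with_connection_without_guard, :with_connection

      # Raise instead of silently checking out a connection when this thread
      # has opted in to the guard and has no connection already checked out.
      def connection
        if Thread.current[:forbid_implicit_checkout] && !active_connection?
          raise ImplicitConnectionForbiddenError,
                "Implicit checkout forbidden in this thread; wrap AR use in with_connection"
        end
        connection_without_guard
      end

      # Explicit use via with_connection stays allowed: lift the guard for
      # the duration of the block, then restore it.
      def with_connection
        saved = Thread.current[:forbid_implicit_checkout]
        Thread.current[:forbid_implicit_checkout] = false
        with_connection_without_guard { |conn| yield conn }
      ensure
        Thread.current[:forbid_implicit_checkout] = saved
      end
    end
  end
end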

DO fear the Reaper

In Rails4, the ConnectionPool has an under-documented thing called the “Reaper”, which might appear to be related to reclaiming leaked connections.  In fact, what public documentation there is says: “the Reaper, which attempts to find and close dead connections, which can occur if a programmer forgets to close a connection at the end of a thread or a thread dies unexpectedly. (Default nil, which means don’t run the Reaper).”

The problem is, as far as I can tell by reading the code, it simply does not do this.

What does the reaper do?  As far as I can tell trying to follow the code, it mostly looks for connections which have actually dropped their network connection to the database.

A leaked connection hasn’t necessarily dropped its network connection. That really depends on the database and its settings — most databases will drop unused connections after a certain idle timeout, by default often hours long.  A leaked connection probably hasn’t yet had its network connection closed, and a properly checked out, not-leaked connection can have its network connection closed (say, there’s been a network interruption or error; or a very short idle timeout on the database).

The Reaper actually, if I’m reading the code right, has nothing to do with leaked connections at all. It’s targeting a completely different problem (dropped network connections, not connections that were checked out but never checked back in). A dropped network connection is a legit problem you want to be handled gracefully; I have no idea how well the Reaper handles it (the Reaper is off by default, I don’t know how much use it’s gotten, and I have not put it through its paces myself). But it’s got nothing to do with leaked connections.

Someone thought it did, they wrote documentation suggesting that, and they redefined `clear_stale_cached_connections!` to use it. But I think they were mistaken. (Did not succeed at convincing @tenderlove of this when I tried a couple years ago when the code was just in unreleased master; but I also didn’t have a PR to offer, and I’m not sure what the PR should be; if anyone else wants to try, feel free!)

So, yeah, Rails4 has redefined the existing `clear_stale_cached_connections!` method to do something entirely different than it did in Rails3, and it’s triggered in entirely different circumstances. Yeah, kind of confusing.

Oh, maybe fear ruby 1.9.3 too

When I was working on upgrading the app, I was occasionally getting a mysterious deadlock exception:

ThreadError: deadlock; recursive locking:

In retrospect, I think I had some bugs in my code and wouldn’t have run into that if my code had been behaving well. However, the fact that my errors resulted in that exception rather than a more meaningful one may possibly have been due to a bug in ruby 1.9.3 that’s fixed in ruby 2.0.

If you’re doing concurrency stuff, it seems wise to use ruby 2.0 or 2.1.

Can you use an already loaded AR model without a connection?

Let’s say you’ve already fetched an AR model in. Can a thread then use it, read-only, without ever trying to `save`, without needing a connection checkout?

Well, sort of. You might think, oh yeah, what if I follow a not yet loaded association, that’ll require a trip to the db, and thus a checked out connection, right? Yep, right.

Okay, what if you pre-load all the associations, then are you good? In Rails 3.2, I did this, and it seemed to be good.

But in Rails4, it seems that even though an association has been pre-loaded, the first time you access it, some under-the-hood things need an ActiveRecord Connection object. I don’t think it’ll end up taking a trip to the db (it has been pre-loaded after all), but it needs the connection object. Only the first time you access it. Which means it’ll check one out implicitly if you’re not careful. (Debugging this is actually what led me to the forbid_implicit_checkout stuff again).

Didn’t bother trying to report that as a bug, because AR doesn’t really make any guarantees that you can do anything at all with an AR model without a checked out connection; it doesn’t really consider that one way or another.

Safest thing to do is simply don’t touch an ActiveRecord model without a checked out connection. You never know what AR is going to do under the hood, and it may change from version to version.

Concurrency Patterns to Avoid in ActiveRecord?

Rails has officially supported multi-threaded request handling for years, but in Rails4 that support is turned on by default — although there still won’t actually be multi-threaded request handling going on unless you have an app server that does that (Puma, Passenger Enterprise, maybe something else).

So I’m not sure how many people are actually using multi-threaded request dispatch, and so how thoroughly its edge case bugs have been shaken out; still, it’s fairly high profile these days, and I think it’s probably fairly reliable.

If you are actually creating your own ActiveRecord-using threads manually though (whether in a Rails app or not; say in a background task system), from prior conversations @tenderlove’s preferred use case seemed to be creating a fixed number of threads in a thread pool, making sure the ConnectionPool has enough connections for all the threads, and letting each thread permanently check out and keep a connection.

I think you’re probably fairly safe doing that too, and is the way background task pools are often set up.
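That pattern looks roughly like this (a sketch only, inside an app where ActiveRecord is already configured; make sure the pool size in database.yml is at least as large as the number of threads):

require "thread"

JOBS = Queue.new   # hypothetical queue of pending work items

# Fixed-size pool; wrapping the whole loop in with_connection means each
# worker effectively holds one connection for its entire lifetime.
POOL_SIZE = 4
workers = POOL_SIZE.times.map do
  Thread.new do
    ActiveRecord::Base.connection_pool.with_connection do
      while (job = JOBS.pop) != :shutdown
        # ... handle `job` here; any ActiveRecord calls reuse this
        # thread's already checked out connection
      end
    end
  end
end

# enqueue work with JOBS << something, then shut down:
POOL_SIZE.times { JOBS << :shutdown }
workers.each(&:join)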

That’s not what my app does.  I wouldn’t necessarily design my app the same way today if I was starting from scratch (the app was originally written for Rails 1.0, which gives you a sense of how old some of its design choices are; although the concurrency-related stuff really only dates from the relatively recent Rails 2.1 (!)).

My app creates a variable number of threads, each of which is doing something different (using a plugin system). The things it’s doing generally involve HTTP interactions with remote APIs, which is why I wanted to do them in concurrent threads (huge wall time speedup even with the GIL, yep). The threads do need to occasionally do ActiveRecord operations to look at input or store their output (I tried to avoid concurrency headaches by making all inter-thread communication go through the database; this is not a low-latency-requirement situation; I’m not sure how much headache I’ve avoided though!)

So I’ve got an indeterminate number of threads coming into and going out of existence, each of which needs only occasional ActiveRecord access. Theoretically, AR’s concurrency contract can handle this fine: just wrap all the AR access in a `with_connection`.  But this is definitely not the sort of concurrency use case AR is designed for and happy about. I’ve definitely spent a lot of time dealing with AR bugs (hopefully no longer!), and with parts of AR’s concurrency design that are less than optimal for my (theoretically supported) use case.

I’ve made it work. And it probably works better in Rails4 than any time previously (although I haven’t load tested my app yet under real conditions, upgrade still in progress). But, at this point,  I’d recommend avoiding using ActiveRecord concurrency this way.

What to do?

What would I do if I had it to do over again? Well, I don’t think I’d change my basic concurrency setup — lots of short-lived threads still makes a lot of sense to me for a workload like I’ve got, of highly diverse jobs that all do a lot of HTTP I/O.

At first, I was thinking “I wouldn’t use ActiveRecord, I’d use something else with a better concurrency story for me.”  DataMapper and Sequel have entirely different concurrency architectures; while they use similar connection pools, they try to spare you from having to know about it (at the cost of lots of expensive under-the-hood synchronization).

Except if I had actually acted on that when I thought about it a couple years ago, when DataMapper was the new hotness, I probably would have switched to or used DataMapper, and now I’d be stuck with a large unmaintained dependency. And be really regretting it. (And yeah, at one point I was this close to switching to Mongo instead of an rdbms, also happy I never got around to doing it).

I don’t think there is or is likely to be a ruby ORM as powerful, maintained, and likely to continue to be maintained throughout the life of your project, as ActiveRecord. (although I do hear good things about Sequel).  I think ActiveRecord is the safe bet — at least if your app is actually a Rails app.

So what would I do different? I’d try to have my worker threads not actually use AR at all. Instead of passing in an AR model as input, I’d fetch the AR model in some other, safer main thread, convert it to a pure business object without any AR, and pass that into my worker threads.  Instead of having my worker threads write their output out directly using AR, I’d have a dedicated thread pool of ‘writers’ (each of which held onto an AR connection for its entire lifetime), and have the indeterminate number of worker threads pass their output through a threadsafe queue to the dedicated threadpool of writers.

That would have seemed like huge over-engineering to me at some point in the past, but at the moment it’s sounding like just the right amount of engineering if it lets me avoid using ActiveRecord in the concurrency patterns I am currently using, which, while officially supported, AR isn’t very happy about.
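Sketched out, that design might look something like the following. The names are illustrative only: `do_http_work` and `OutputRecord` are hypothetical stand-ins, not code from the actual app.

require "thread"

results = Queue.new   # threadsafe hand-off between workers and writers
jobs    = []          # stand-in: whatever describes the work to be done

# An indeterminate number of short-lived worker threads. They never touch
# ActiveRecord; they do the HTTP-heavy work and push plain Ruby output.
workers = jobs.map do |job|
  Thread.new do
    results << do_http_work(job)   # do_http_work is a hypothetical stand-in
  end
end

# A small dedicated writer pool; each writer holds one AR connection for its
# whole lifetime and is the only thing that writes to the database.
WRITER_COUNT = 2
writers = WRITER_COUNT.times.map do
  Thread.new do
    ActiveRecord::Base.connection_pool.with_connection do
      while (output = results.pop) != :done
        OutputRecord.create!(output)   # OutputRecord is a hypothetical AR model
      end
    end
  end
end

workers.each(&:join)
WRITER_COUNT.times { results << :done }
writers.each(&:join)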


Filed under: General

by jrochkind at October 20, 2014 03:35 AM

October 19, 2014

Coyle's InFormation

This is what sexism looks like

[Note to readers: sick and tired of it all, I am going to report these "incidents" publicly because I just can't hack it anymore.]

I was in a meeting yesterday about RDF and application profiles, in which I made some comments, and was told by the co-chair: "we don't have time for that now", and the meeting went on.

Today, a man who was not in the meeting but who listened to the audio sent an email that said:
"I agree with Karen, if I correctly understood her point, that this is "dangerous territory".  On the call, that discussion was postponed for a later date, but I look forward to having that discussion as soon as possible because I think it is fundamental."
And he went on to talk about the issue, how important it is, and at one point referred to it as "The requirement is that a constraint language not replace (or "hijack") the original semantics of properties used in the data."

The co-chair (I am the other co-chair, although reconsidering, as you may imagine) replied:
"The requirement of not hijacking existing formal specification languages for expressing constraints that rely on different semantics has not been raised yet."
"Has not been raised?!" The email quoting me stated that I had raised it the very day before. But an important issue is "not raised" until a man brings it up. This in spite of the fact that the email quoting me made it clear that my statement during the meeting had indeed raised this issue.

Later, this co-chair posted a link to a W3C document in an email to me (on list) and stated:
"I'm going on holidays so won't have time to explain you, but I could, in theory (I've been trained to understand that formal stuff, a while ago)"
That is so f*cking condescending. This happened after I quoted from W3C documents to support my argument, and I believe I had a good point.

So, in case you haven't experienced it, or haven't recognized it happening around you, this is what sexism looks like. It looks like dismissing what women say, but taking the same argument seriously if a man says it, and it looks like purposely demeaning a woman by suggesting that she can't understand things without the help of a man.

I can't tell you how many times I have been subjected to this kind of behavior, and I'm sure that some of you know how weary I am of not being treated as an equal no matter how equal I really am.

Quiet no more, friends. Quiet no more.

(I want to thank everyone who has given me support and acknowledgment, either publicly or privately. It makes a huge difference.) 

Some links about "'Splaining"
http://scienceblogs.com/thusspakezuska/2010/01/25/you-may-be-a-mansplainer-if/
http://geekfeminism.wikia.com/wiki/Splaining

by Karen Coyle (noreply@blogger.com) at October 19, 2014 01:39 PM

schema.org - where it works

In the many talks about schema.org, it seems that one topic that isn't covered, or isn't covered sufficiently, is "where do you do it?" That is, where does it fit into your data flow? I'm going to give a simple, typical example. Your actual situation may vary, but I think this will help you figure out your own case.

The typical situation is that you have a database with your data. Searches go against that database, the results are extracted, a program formats these results into a web page, and the page is sent to the screen. Let's say that your database has data about authors, titles and dates. These are stored in your database in a way that you know which is which. A search is done, and let's say that the results of the search are:
author:  Williams, R
title: History of the industrial sewing machine
date: 1996
This is where you are in your data flow:

The next thing that happens (and remember, I'm speaking very generally) is that the results are then fed into a program that formats them into HTML, probably within a template that has all your headers, footers, sidebars and branding, and sends the data to the browser. The flow now looks like this:

Let's say that you will display this as a citation that looks like:
Williams, R. History of the industrial sewing machine. 1996.
Without any fancy formatting, the HTML for this is:
<p>Williams, R. History of the industrial sewing machine. 1996.</p>
Now we can see the problem that schema.org is designed to fix. You started with an author, a title and a date, but what you are showing to the world is an undifferentiated string of characters. You have lost all the information about what these represent. To a machine, this is just another of many bazillions of paragraphs on the web. Even if you format your data like this:
<p>Author: Williams, R.</p>
<p>Title: History of the industrial sewing machine</p>
<p>Date: 1996</p>
What a machine sees is:
<p>blah: blah</p>
<p>blah: blah</p>
<p>blah: blah</p>  
What we want is for the program that is formatting the HTML to also include some metadata from schema.org that retains the meaning of the data you are putting on the screen. So rather than just putting in HTML formatting, it will add formatting from schema.org. Schema.org has metadata elements for many different types of data. Using our example, let's say that this is a book, and here's how you could mark that up in schema.org:
<div vocab="http://schema.org/">
  <div typeof="Book">
    <p>
      <span property="author">Williams, R.</span> <span property="name">History of the industrial sewing machine</span>. <span property="datePublished">1996</span>.
    </p>
  </div>
</div>
Again, this is a very simple example, but when we test this code in the Google Rich Snippet tool, we can see that even this simple markup has added rich information that a search engine can make use of:
To see a more complex example, this is what Dan Scott and I have done to enrich the files of the Bryn Mawr Classical Reviews.

The review as seen in a browser (includes schema.org markup)

The review as seen by a tool that reads the structured schema.org data.

From these you can see a couple of things. The first is that the schema.org markup does not change how your pages look to a user viewing your data in a browser. The second is that hidden behind that simple page is a wealth of rich information that was not visible before.

Now you are probably wondering: well, what's that going to do for me? Who will use it? At the moment, the users of this data are the search engines, and they use the data to display all of that additional information that you see under a link:


In this snippet, the information about stars, ratings, type of film and audience comes from schema.org mark-up on the page.

Because the data is there, many of us think that other users and uses will evolve. The reverse of that is that, of course, if the information isn't there then those as yet undeveloped possibilities cannot happen.



by Karen Coyle (noreply@blogger.com) at October 19, 2014 10:10 AM

October 18, 2014

Bibliographic Wilderness

Google Scholar is 10 years old

An article by Steven Levy about the guy who founded the service, and its history:

Making the world’s problem solvers 10% more efficient: Ten years after a Google engineer empowered researchers with Scholar, he can’t bear to leave it

“Information had very strong geographical boundaries,” he says. “I come from a place where those boundaries are very, very apparent. They are in your face. To be able to make a dent in that is a very attractive proposition.”

Acharya’s continued leadership of a single, small team (now consisting of nine) is unusual at Google, and not necessarily seen as a smart thing by his peers. By concentrating on Scholar, Acharya in effect removed himself from the fast track at Google….  But he can’t bear to leave his creation, even as he realizes that at Google’s current scale, Scholar is a niche.

…But like it or not, the niche reality was reinforced after Larry Page took over as CEO in 2011, and adopted an approach of “more wood behind fewer arrows.” Scholar was not discarded — it still commands huge respect at Google which, after all, is largely populated by former academics—but clearly shunted to the back end of the quiver.

…Asked who informed him of what many referred to as Scholar’s “demotion,” Acharya says, “I don’t think they told me.” But he says that the lower profile isn’t a problem, because those who do use Scholar have no problem finding it. “If I had seen a drop in usage, I would worry tremendously,” he says. “There was no drop in usage. I also would have felt bad if I had been asked to give up resources, but we have always grown in both machine and people resources. I don’t feel demoted at all.”


Filed under: General

by jrochkind at October 18, 2014 03:47 PM

October 17, 2014

TSLL TechScans

New report offers recommendations to improve usage, discovery and access of e-content in libraries


A group of professionals from libraries, content providers and OCLC have published Success Strategies for Electronic Content Discovery and Access, a white paper that identifies data quality issues in the content supply chain and offers practical recommendations for improved usage, discovery and access of e-content in libraries.


Success Strategies for Electronic Content Discovery and Access offers solutions for the efficient exchange of high-quality data among libraries, data suppliers and service providers, such as:
  • Improve bibliographic metadata and holdings data
  • Synchronize bibliographic metadata and holdings data
  • Use consistent data formats.

See the article at http://www.librarytechnology.org/ltg-displaytext.pl?RC=19772

by noreply@blogger.com (Marlene Bubrick) at October 17, 2014 05:39 PM

Terry's Worklog

MarcEdit LibHub Plug-in

As libraries begin to join and participate in systems to test Bibframe principles, my hope is that, when possible, I can provide support through MarcEdit to give these communities a conduit that simplifies the publishing of information into those systems.  The first of these test systems is the Libhub Initiative, and working with Eric Miller and the really smart folks at Zepheira (http://zepheira.com/), I have created a plug-in specifically for libraries and partners working with the LibHub initiative.  The plug-in provides a mechanism to publish a variety of metadata formats into the system – MARC, MARCXML, EAD, and MODS data – and the process will hopefully help users contribute content and help spur discussion around the data model Zepheira is employing with this initiative.

For the time being, the plug-in is private, and available to any library currently participating in the LibHub project.  However, my understanding is that as they continue to ramp up the system, the plug-in will be made available to the community at large.

For now, I’ve published a video talking about the plug-in and demonstrating how it works.  If you are interested, you can view the video on YouTube.

 

–tr

by reeset at October 17, 2014 03:19 AM

Automated Language Translation using Microsoft’s Translation Services

We hear the refrain over and over – we live in a global community.  Socially, politically, economically – the ubiquity of the internet and free/cheap communications has definitely changed the world that we live in.  For software developers, this shift has definitely been felt as well.  My primary domain tends to focus on software built for the library community, but I’ve participated in a number of open source efforts in other domains as well, and while it is easier than ever to make one’s project/source available to the masses, efforts to localize said projects are still largely overlooked.  And why?  Well, doing internationalization work is hard and oftentimes requires large numbers of volunteers proficient in multiple languages to provide quality translations of content in a wide range of languages.  It also tends to slow down the development process and requires developers to create interfaces and inputs that support language sets that they themselves may not be able to test or validate.

Options

If your project team doesn’t have the language expertise to provide quality internationalization support, you have a variety of options available to you (with the best ones reserved for those with significant funding).  These range from tools available to open source projects, like TranslateWiki (https://translatewiki.net/wiki/Translating:New_project), which provides a platform for volunteers to participate in crowd-sourced translation services, to subscription services.  There are some very good ones, like Transifex (https://www.transifex.com/), a service that again works as both a platform and a match-making service between projects and translators.  Additionally, Amazon’s Mechanical Turk can be utilized to provide one-off translation services at a fairly low cost.  The main point, though, is that services do exist that cover a wide spectrum in terms of cost and quality.  The challenge, of course, is that many of the services above require a significant amount of match-making, either on the part of the service or of the individuals involved with the project, and oftentimes money.  All of this ultimately takes time, sometimes a significant amount of time, making it a difficult cost/benefit analysis to determine which languages one should invest the time and resources to support.

Automated Translation

This is a problem that I’ve been running into a lot lately.  I work on a number of projects where the primary user community hails largely from North America; or, well, the community that I interact with most often is fairly English-language centric.  But that’s changing — I’ve seen a rapidly growing international community and increasing calls for localized versions of software or utilities that have traditionally had very niche audiences.

I’ll use MarcEdit (http://marcedit.reeset.net) as an example.  Over the past 5 years, I’ve seen the number of users working with the program steadily increase, with much of that increase coming from a growing international user community.  Today, 1/3-1/2 of each month’s total application usage comes from outside of North America, a number that I would never have expected when I first started working on the program in 1999.  But things have changed, and finding ways to support these changing demographics is challenging.

In thinking about ways to provide better support for localization, one area that I found particularly interesting was the idea of marrying automated language translation with human intervention.  The idea is that a localized interface could be automatically generated using an automated translation tool to provide a “good enough” translation, which could also serve as the template for human volunteers to correct and improve.  This would enable support for a wide range of languages where English really is a barrier but no human volunteer has been secured to provide localized translation, and would give established communities a “good enough” template to use as a jump-off point to improve and speed up the process of human-enhanced translation.  Additionally, as interfaces change and are updated, or new services are added, automated processes could generate the initial localization until a local expert was available to provide a high-quality translation of the new content, to avoid slowing down the development and release process.

This is an idea that I’ve been pursuing for a number of months now, and over the past week, have been putting into practice.  Utilizing Microsoft’s Translation Services, I’ve been working on a process to extract all text strings from a C# application and generate localized language files for the content.  Once the files have been generated, I’ve been having them evaluated by native speakers to comment on quality and usability…and for the most part, the results have been surprising.  While I had no expectation that the translations generated through any automated service would be comparable to human-mediated translation, I was pleasantly surprised to hear that the automated data is very often good enough.  That isn’t to say that it’s without its problems; there are definitely problems.  The bigger question has been, do these problems impede the use of the application or utility?  In most cases, the most glaring issue with the automated translation services has been context.  For example, take the word Score.  Within the context of MarcEdit and library bibliographic description, we know score applies to musical scores, not points scored in a game…context.  The problem is that many languages do make these distinctions with distinct words, and if the translation service cannot determine the context, it tends to default to the most common usage of a term – and in the case of library bibliographic description, that would oftentimes be incorrect.  It’s made for some interesting conversations with volunteers evaluating the automated translations – which can range from very good to downright comical.  But by a large margin, evaluators have said that while the translations were at times very awkward, they would be “good enough” until someone could provide a better translation of the content.  And what is more, the service gets enough of the content right that it could be used as a template to speed the translation process.  And for me, this is kind of what I wanted to hear.

Microsoft’s Translation Services

There really aren’t a lot of options available for good free automated translation services, and I guess that’s for good reason.  It’s hard, and requires both resources and adequate content to learn how to read and output natural language.  I looked hard at the two services that folks would be most familiar with: Google’s Translation API (https://cloud.google.com/translate/) and Microsoft’s translation services (https://datamarket.azure.com/dataset/bing/microsofttranslator).  When I started this project, my intention was to work with Google’s Translation API – I’d used it in the past with some success, but at some point in the past few years, Google seems to have shut down its free API translation services and replaced them with a more traditional subscription service model.  Now, the costs for that subscription (which tend to be based on the number of characters processed) are certainly quite reasonable, but my usage will always be fairly low and a little scattershot, making the monthly subscription costs hard to justify.  Microsoft’s translation service is also a subscription-based service, but it provides a free tier that supports 2 million characters of throughput a month.  Since that more than meets my needs, I decided to start here.

The service provides access to a wide range of languages, including Klingon (Qo’noS marcedit qaStaHvIS tlhIngan! nuq laH ‘oH Dunmo’?), which made working with the service kind of fun.  Likewise, the APIs are well-documented, though can be slightly confusing due to shifts in authentication practice to an OAuth Token-based process sometime in the past year or two.  While documentation on the new process can be found, most code samples found online still reference the now defunct key/secret key process.

So how does it work?  Performance-wise, not bad.  In generating 15 language files, it took around 5-8 minutes per file, with each file requiring close to 1,600 calls against the server.  As noted above, accuracy varies, especially when doing translations of one-word commands that could have multiple meanings depending on context.  It was actually suggested that some of these context problems might be overcome by using a language other than English as the source, which is a really interesting idea and one that might be worth investigating in the future.
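To give a flavor of the two calls involved (get a token, then translate), here is a rough sketch in Ruby rather than C#. The endpoints and parameter names are as I recall them from the DataMarket-era service and should be checked against the current documentation; the client id and secret come from your own Azure DataMarket registration, and the XML handling is deliberately crude.

require "net/http"
require "uri"
require "json"
require "cgi"

TOKEN_URL     = URI("https://datamarket.accesscontrol.windows.net/v2/OAuth2-13")
TRANSLATE_URL = "http://api.microsofttranslator.com/v2/Http.svc/Translate"

# Exchange the DataMarket client id/secret for a short-lived OAuth token.
def fetch_token(client_id, client_secret)
  response = Net::HTTP.post_form(TOKEN_URL,
    "grant_type"    => "client_credentials",
    "client_id"     => client_id,
    "client_secret" => client_secret,
    "scope"         => "http://api.microsofttranslator.com")
  JSON.parse(response.body)["access_token"]
end

# Translate a single string; the service returns a small XML document
# whose text content is the translation.
def translate(text, from, to, token)
  uri = URI("#{TRANSLATE_URL}?text=#{CGI.escape(text)}&from=#{from}&to=#{to}")
  request = Net::HTTP::Get.new(uri.request_uri)
  request["Authorization"] = "Bearer #{token}"
  response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
  response.body[/>(.*)</m, 1]   # crude extraction of the XML element's text
end

# token = fetch_token(ENV["MS_CLIENT_ID"], ENV["MS_CLIENT_SECRET"])
# puts translate("Edit the selected records", "en", "fr", token)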

Seeing how it works

If you are interested in seeing how this works, you can download a sample program, which pulls together code copied or cribbed from the Microsoft documentation (and then cleaned for brevity) as well as code on how to use the service, from: https://github.com/reeset/C–Language-Translator.  I’m kicking around the idea of converting the C# code into a ruby gem (which is actually pretty straightforward), so if there is any interest, let me know.

–tr

by reeset at October 17, 2014 01:13 AM