Technical Overview

An Informal Introduction to MarkLogic Server, XQuery, and Developer Resources

This document gives you an informal introduction to MarkLogic Server, the XQuery language, and the developer resource site dedicated to both. When I say informal, I mean it. I like to think of myself here as the college student tour guide showing you, the pre-frosh, around campus. While officially sanctioned, I don't have to be official. I get to tell you the straight scoop. So come with me, I'll show you the place, and I hope you stick around for the next four years.

Getting Your Bearings

The first stop on our tour today is MarkLogic Server. You're going to be tempted to call it by its initials, MLS, but it prefers to go by its full name, so that's what we're going to call it. MarkLogic Server is an XML Content Server - some might say a special-purpose database designed specifically for content. One might, and we sometimes do, call it a server for managing contentbases! If we want to get more formal, I'll just borrow the explanation of what an XML Content Server is from Mark Logic's corporate web site:

An XML content server is a platform that provides a set of services used to build applications and support business processes based on content. Each word in the name is there for a specific reason; let's examine each word more closely.

  • It's called an XML content server because its native data format is XML. XML content is accepted in as-is form. Content in other formats is converted to an XML representation when loaded into the server.
  • It's called an XML content server because it's built specifically for content. While XML can be used either to mark up content or to wrap data, an XML content server is built specifically for content. As such, an XML content server has features that you won't find in an XML server built for data, such as search, format conversion, and pipeline processing.
  • It's called an XML content server because it's designed to be a specialized server in your enterprise architecture. An XML content server manages its own content repository and is accessed using the W3C-standard XQuery language. (By analogy, a relational database is a specialized server that manages its own repository and is accessed through SQL.)

It's probably easiest to understand the concept of an "XML content server" with a demonstration. You can find an interesting demo at http://paycheck.demo.marklogic.com. What you'll see is a web-based application that allows you to explore information about public companies as extracted from the "DEF 14A" documents they file with the Securities and Exchange Commission (SEC) each year. While this might sound boring, it's in fact quite interesting, because each "DEF 14A" filing must include a table listing the salary, bonus, and stock options of the company's top executives. By loading these documents, along with others from Hoover's public web site at hoovers.com, it's possible to get meaningful data out of what used to be a cluttered mess and to perform analytics and comparisons. How much do your company's top executives earn? How does that compare with executives in the same role at competitor companies? Are they above or below average among all executives? Using an XML content server to sift through the constantly updated document set, you can "find the needles in the haystacks". When going through the demo, I recommend you read the little explanation boxes to learn how things go together.

There are several other interesting demos, but none except "paycheck" are up for public viewing. If you're in touch with Mark Logic's sales team, ask to see some of the others. Some of my favorites let you dig into medical textbooks and Arabic newspapers.

Getting MarkLogic Server

Really the best way to understand what you can do with an XML content server is to use one. You can download MarkLogic Server from Mark Logic's developer resource site. Among the things you'll find there:

  • Binary downloads of MarkLogic Server
  • Binary downloads of numerous supporting libraries
  • Open source projects of community-developed libraries
  • Documentation on Mark Logic products and on XQuery
  • Regularly updated tutorials and articles
  • A mailing list for public discussion

To download MarkLogic Server you'll find a big button on the front page. The exact link is http://developer.marklogic.com/download/. Before downloading, make sure you have access to one of the supported platforms:

  • Microsoft Windows 2003 Server
  • Sun Solaris 8, 9 or 10
  • Red Hat Enterprise Linux 3 or 4

And also make sure that your machine satisfies the minimum requirements:

  • 512 MB of system memory
  • Three times the disk space of the source content to be loaded

More memory is always better when it comes to processing large data sets, so throw in that extra SIMM chip if you've got it. With the disk space requirement, if you plan on loading 10 GB of content into the database, you should reserve at least 30 GB of disk space. You won't usually see the full reserve used, but periodically the database needs to do housekeeping work that requires disk space.

The Community License only supports 50 MB or so, so if you're a Community Licensee and are low on disk space just delete a few MP3s and you're good.

As you see on the download page, there's one binary download for each platform. The license key you enter after install determines the functionality exposed. You can upgrade simply by changing license keys. To learn about the product editions and capabilities, I'll refer you to the What Is MarkLogic Server page.

Notice that right under each platform's download link there's a platform-specific install guide. This guide walks you through the steps necessary to install the database and enter a license key for the appropriate edition. We'll pause the tour now while you follow the guide's instructions and install the database. It takes just a couple minutes.

Administering MarkLogic Server

The install guide should have walked you through the process of browsing to the admin interface at http://localhost:8001 to enter the license key. Now you can go to the same web address to administer the server. The admin interface lets you control the creation, management, and configuration of databases, forests, servers, and hosts. The left navigation bar contains the "nouns". Use it to select the item you want to act upon. The top right tabs contain the "verbs". Select the verb after selecting the noun. Under the tab is a data entry area for making changes.

The main thing you need to understand when using the admin pages is the database topology. Documents are stored in forests. One or more forests are gathered together to form a database. Databases are logical units against which you can assign HTTP, WebDAV and XDBC (for XCC Java and .NET connectivity) servers and set various runtime configuration options. The name forests comes from the fact that XML documents are tree structures, and a collection of trees is a forest. Databases exist as a logical abstraction because in a distributed environment it can be useful to have the same logical database spread across different hosts, perhaps one host with two forests and another with three.
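Once you've learned how to run a query (that's a couple of stops ahead), you can also look at this topology from XQuery instead of the admin screens. Here's a rough sketch, not an official recipe. It assumes the xdmp:database(), xdmp:database-name(), xdmp:database-forests(), and xdmp:forest-name() builtins are available in your server version, so check the builtin function documentation before relying on it:

```xquery
(: Sketch: list the forests that make up the current database.
   Assumption: these four xdmp builtins exist in your server version. :)
let $db := xdmp:database()
return
  <database name="{ xdmp:database-name($db) }">
  {
    (: one <forest> element per forest attached to this database :)
    for $forest in xdmp:database-forests($db)
    return <forest>{ xdmp:forest-name($forest) }</forest>
  }
  </database>
```

Run against a freshly installed server, a query like this would typically report a single forest, since a database needs at least one forest to hold its documents.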

There's a full Administrator's Guide document available from http://developer.marklogic.com/pubs/. You can use it as your guide through the port 8001 administration pages. One thing it doesn't tell you is that the admin screens are written entirely in XQuery and their source is available in C:\Program Files\MarkLogic\Admin on Windows and /opt/MarkLogic/Admin on the Unixes. It's a good source for example code, and it also shows how you can script the server's control functions.

Writing a Query

Now that you have the database installed and have had a chance to poke around the admin screens, I'll bet you're itching to write a query. Let's get to it.

The easiest way is to use the pre-existing Docs HTTP server, which by default is set up and running on port 8000 against the Documents database. To write your first query, go (via the command line or a file explorer) to the Docs directory under your server root. That's C:\Program Files\MarkLogic\Docs on Windows and /opt/MarkLogic/Docs on Unix. In there, write the following to a new file and save it as welcome.xqy:

(: This is welcome.xqy :)
<big>
Welcome to { xdmp:product-name() }
           { xdmp:version() }
           { xdmp:product-edition() } Edition!
</big>

Now with a web browser, go to http://localhost:8000/welcome.xqy. (Or replace localhost with your remote server's name if you're not running on the local machine.) You will see your query executed. Pretty cool, eh? When a client requests a query file using the special extension .xqy the server executes the query file content and returns the result. It's basically CGI for XQuery. And because XQuery so easily constructs dynamic XHTML output, it's an amazingly quick development and deployment model. There's no need to use Java classes in processing the result (although you can, as we'll see later).

Now rename your file to default.xqy and access http://localhost:8000/. You'll see the same dynamic content. That's because default.xqy is a standard welcome file and is executed by default if no other file is specified. It's that easy to have a dynamic contentbase-backed home page for your site.

Under the Docs directory you may have noticed a subdirectory named use-cases. Using your browser again, hit http://localhost:8000/use-cases/. You will see a simple demo built from the XQuery Use Cases specification. To get started:

  1. Click the "Load source XML into database" link.
  2. Click on an example on the left and it populates the query textarea on the top right.
  3. Click Submit and you see the results (in XML or XHTML format) in the bottom right.

You can use these examples to get a taste for what XQuery code looks like. You can enter your own custom query into the textarea. Here's an example:

(: Try this in the textarea :)
for $i in input() return document-uri($i)

Running this gives you an unsorted listing of all the documents held within the database. The input() function returns a sequence of document nodes while document-uri($i) returns the URI (the identifier) for document $i.
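If the unsorted order bothers you, here's a small variation you can paste into the same textarea. It's only a sketch: it assumes your server accepts the order by clause in FLWOR expressions, and it wraps each URI in string() so the sort key is a plain string.

```xquery
(: Same listing as above, but sorted alphabetically by URI. :)
for $i in input()
let $uri := string(document-uri($i))   (: plain string as the sort key :)
order by $uri
return $uri
```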

For some explanation on the XQuery Use Cases, I'll point you toward the Getting Started with MarkLogic Server document available at http://developer.marklogic.com/pubs/.

Understanding XQuery

I'm glad you've stuck with me this far on the tour. This big pointy-cornered building we're looking at now is XQuery. (There's a rumor every joint in the building is held together with an XML angle bracket.) It's a large building, one we could get lost in if we took the tour inside, so let me describe it from the outside and point you elsewhere for internal details.

XQuery is a new language under development by the World Wide Web Consortium (W3C) that's designed to query collections of XML data. XQuery provides a mechanism to efficiently and easily extract data from XML documents or from any data source that can be viewed as XML such as relational databases or office documents. XQuery uses XML as a native data type, but thankfully the language itself isn't written in XML.

Usage of XQuery tends to fall into one of three camps. First, there's the streaming transformation model. In this fairly simple application, XQuery defines the mapping from one file format to another. It's similar to XSLT but with extra features and a more readable syntax.

Second, XQuery can be used as a meta query language executing against one or more data stores -- for example, an RDBMS and an XML content server side-by-side. In this scenario, XQuery accesses the relational stores as if their tables were XML documents, pushing query predicates to the database as SQL for optimized execution and then merging and manipulating the results with the XML content server using XQuery. This works because it's trivial to view relational data as XML, while it's extremely complicated and often impossible to "shred" true XML data into relational tables. Theoretically any data store, not just relational data, can be made accessible to the XQuery environment. In this usage, XQuery becomes a lingua franca query language, the X standing for "plug your data format in here" more than for XML.

Third, there's the "pure play" XQuery implementation, where instead of mapping XQuery to another query language, it's used directly against a database designed from the ground up for XQuery. This approach, used by MarkLogic Server, works well for managing data that until now hasn't been put into a database -- data that doesn't fit neatly into the rectangular boxes imposed by relational databases. Examples are medical records, textbook content, office documents, or web pages. In this model, you store the documents directly into the XQuery database -- possibly going through a conversion to XML but without any complicated "shredding" to a relational format needed. Then you query the documents to extract the bits and pieces deemed important.

MarkLogic Server version 3.1 implements the May 2003 draft of the XQuery specifications. It also adds, via module libraries placed in the xdmp, cts, and sec namespaces (among others), functions to support features lacking in "vanilla XQuery": text search, node-level update, and administration. You may remember seeing a few informational xdmp function calls in the welcome.xqy example.

Mark Logic also changed the core language syntax by adding try/catch. XQuery has the ability to throw errors, but without this addition you couldn't catch them.

We'll talk about some of Mark Logic's XQuery functions in the next section. To prepare for that, I recommend you check out some of the following to get familiar with the XQuery language.

Loading Documents

As you noticed while reading these articles (you did read them, right?) vanilla XQuery leaves certain areas underspecified. There's no standard mechanism to load a document or view a list of available documents. There's also no built-in support for efficient token-based text search. Mark Logic addresses these deficiencies with numerous builtin functions. There's a listing of these functions in the "Builtin Function documentation" link above. This section explains a few of the methods you need to understand in order to make the most of MarkLogic Server.

Watch out! If you're new to XQuery and skipped over the Getting Started links above, you're going to find the XQuery code in this section a little heavy. That's OK. I'll just assume you're having such a great time here that you can't wait to continue. Learn what you can. You can always come back.

Perhaps the most important MarkLogic Server builtin function is xdmp:load(), used to load documents from disk into the contentbase. You commonly pass it two parameters: a file to load and a URI name under which it should be loaded. If the URI name isn't specified, the file name is used. Here's a simple example:

xdmp:load("/tmp/bib.xml", "bib.xml")

This loads the file /tmp/bib.xml to the contentbase under the name bib.xml. Any legal URI string will do for a name. The name could even be http://developer.marklogic.com/bib.xml. The URI might look like a web address but queries are always done entirely inside MarkLogic Server; there's no external fetching at query time. It's convenient using web URIs as names when loading content from the web; you just store documents using the exact URI from which they came.
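To make the web-address naming idea concrete, here's a minimal sketch that reuses the same /tmp/bib.xml file and the example URI from the paragraph above. Only the second argument changes:

```xquery
(: Store the local file under a web-style name.
   The URI is just an identifier inside the contentbase;
   nothing is fetched from the network, now or at query time. :)
xdmp:load("/tmp/bib.xml", "http://developer.marklogic.com/bib.xml")
```

From then on, that full string is the document's name, so later queries must refer to it exactly as written here.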

The xdmp:load() call returns the empty sequence on success and throws an error in case of problems. To print "Loaded" after a load, use the following trick:

xdmp:load("/tmp/bib.xml", "bib.xml"),
    "Loaded"

When you start writing code like this you'll know you're an XQuery master. This bit of code evaluates as a sequence of two items, the empty sequence (the output from the xdmp:load() function) followed by a string ("Loaded"). Put together the result is the simple string "Loaded". In case of error, the xdmp:load() call errors out and the trailing "Loaded" gets ignored. To handle errors, you can use try/catch:

try {
  xdmp:load("/tmp/bib.xml", "bib.xml"),
      "Loaded"
}
catch ($e) {
  (: wrap the text in an element so the catch body is a valid expression :)
  <p>Problem loading: { $e/*:message/text() }</p>
}

The caught error is an XML node with elements like <message> that explain the reason for the error.
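If you're curious what else is inside that node, one simple approach is to return the whole thing and read it. This is only a sketch, and the file path below is a made-up name chosen so that the load fails:

```xquery
(: Force a load error and return the complete error node for inspection.
   "/tmp/no-such-file.xml" is a hypothetical path that should not exist. :)
try {
  xdmp:load("/tmp/no-such-file.xml", "missing.xml")
}
catch ($e) {
  $e   (: the full error node, including its message element :)
}
```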

To view the content of a loaded document, use the standard doc() function:

doc("bib.xml")

This returns the document node associated with the given URI. To view a list of all loaded documents:

for $i in input() return document-uri($i)

You saw this query earlier when you typed it into the use-cases textarea. Bringing the two queries together lets you produce a "list and view" script:

let $uri := xdmp:get-request-field("uri")
return

if (empty($uri) or $uri = "") then
  (
    xdmp:set-response-content-type("text/html"),
    <ul>
    {
      for $i in input()
      let $doc := document-uri($i)
      return
        <li><a href=
          "view.xqy?uri={xdmp:url-encode($doc)}"
            >{$doc}</a></li>
    }
    </ul>
  )
else
  (
    xdmp:set-response-content-type("text/xml"),
    if (empty(doc($uri)))
    then <error>No content</error>
    else doc($uri)
  )

To give this query a spin, save it as a file named view.xqy under your Docs subdirectory. Then browse to http://localhost:8000/view.xqy. You'll see a <ul> listing of all the documents in the database. Each is clickable, and when you click on the document you see its raw content. (Because the script doesn't have any throttle support, be careful not to use it with long listings or large documents. Browsers don't always like showing <ul> lists of more than a thousand items or XML files of more than a megabyte.)

The script first fetches the uri query string parameter. If it's empty, then it treats it as a request for a listing. If it's not empty, then it's a request for the given URI to be displayed. To handle listings, we set the content type to text/html and print every document-uri() linking to itself. To handle a document view, we set the content type to text/xml and print the doc($uri) result or give a polite error note if the document couldn't be found for any reason.

Guru Tip: The parentheses (notice they're not curly braces) are required because the expression within a then or else clause has to be a single expression, and parentheses make the multiple items into a single, comma-separated sequence.
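Here's a tiny standalone sketch of that sequence behavior, using nothing but standard XQuery. It also shows why the "Loaded" trick from earlier works: an empty sequence placed inside another sequence simply disappears.

```xquery
(: ((), "Loaded") flattens to a one-item sequence,
   just like the result of a successful xdmp:load() followed by "Loaded". :)
let $items := ((), "Loaded")
return
  if (count($items) = 1)
  then ("The empty sequence vanished;", "one item remains:", $items)
  else "unexpected"
```

The then branch is a single parenthesized expression that happens to produce three items. Drop the parentheses and the commas no longer belong to the then clause.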

Searching Documents

Text search forms the core of a contentbase. Search is the process of selecting, from a collection of elements, those "relevant" to some search condition. Results returned from a search are ordered according to relevance, the most relevant first.

Relevance is determined statistically: How frequently (as a percentage) does the word appear in the text versus how frequently the word appears across the full document set? A higher appearance percentage than average produces a higher score. Terms with more appearances (and the appearance of terms that are more rare) generate higher relevance.

The Mark Logic search processing model goes like this:

1. Specify a searchable expression

It must be a simple XPath. It can't use positions or reverse axes, and it can't be a user-defined function call or start with a variable. For example, input()//section/title is OK but $random/preceding-sibling::sister[2] isn't even close.

What is XPath? It's a matching language for XML that lets you specify a set of criteria for selecting and extracting nodes from a set of documents. Think of XPath as regular expressions for XML. XQuery uses XPath frequently to select nodes. For the details, see the W3C XPath 2.0 specification.

2. Define a search constraint

Search constraints are given as query objects assembled with cts:query constructors, which look a lot like function calls.

3. Get a sequence of search result nodes

Call cts:search() and pass in the searchable expression along with the search constraint. For example, cts:search(input(), cts:word-query("foo")).

4. Optionally extract the relevance score for each item

The returned nodes are naturally sorted by relevance. If you want to quantify the difference between their relevance (to know if the first and second were close, or if the first had no competition), you can ask for the numeric score using the xdmp:score() function. Pass in the $item whose score you want. Scores are non-negative numbers. They grow logarithmically, kind of like earthquake magnitudes.

Searching is performed extremely quickly, even against massive data sets, because the content is indexed during load. If your search would be slow, cts:search() will complain that your "collection is unsearchable" which means you violated Rule #1 above. Don't feel bad, it happens to all of us.

Different options you pass when making the query objects let you toggle case sensitivity and punctuation sensitivity and let you assign weights to different query components. You can also, apart from any one query, assign a numerical quality to documents indicating the document's importance (i.e. ranking) in the database.

A simple search can be quite straightforward:

cts:search(//title, "xquery")[1 to 10]

This returns the top ten most relevant mentions of xquery among <title> elements across all documents. The database automatically promotes a string value like xquery to a cts:word-query(). You can get a bit fancier and include the score value in the result:

for $t in cts:search(//title, "xquery")[1 to 10]
return <result score="{ xdmp:score($t) }"
  >{ $t }</result>

With different search arguments you can get as complicated as you need. For an advanced example, the following returns the addresses for properties matching certain criteria. They must have a <property-status> of "for sale" or "vacant". Among those properties, we search based on if there's a <location> with the value "san diego" and a <price> containing the word "cheap".

let $status := ("for sale", "vacant")
let $q1 :=
      cts:element-value-query(
      expanded-QName("", "location"), "san diego",
        "case-insensitive", 3.0)
let $q2 :=
      cts:element-word-query(
      expanded-QName("", "price"), "cheap",
        "case-sensitive", 1.0)
let $query := cts:and-query(($q1, $q2))

for $match in cts:search(
  //prop[property-status = $status], $query)[1 to 10]
return $match//address

The code makes the <property-status> a hard requirement because it's part of the first parameter to cts:search(), which lists what gets searched. The search then requires both the $q1 and $q2 constraints to match (an "and"). The extra arguments to each query constructor set options, such as case sensitivity and punctuation sensitivity, and a weight. For example, because case sensitivity is off in $q1 we'll match "San Diego" and "SAN DIEGO" as well as "san diego". See how the location test is weighted three times more important than the price test? That's because in property, it's all location, location, location.

Related to search is the function cts:contains(). You use this function in XPath predicates for efficient token-based matching but no relevance ranking. For example:

//title[cts:contains(., "xquery")]

Unlike cts:search(//title, "xquery"), this query finds every <title> element that contains the token "xquery" (in any capitalization) and returns them without regard to relevance ordering. The query executes quickly because, like cts:search(), it uses the pre-built indexes.

I'm just scratching the surface here. Make sure to read the Builtin Function Reference documentation for more information. One trick the reference materials don't explain is that simple strings are automatically promoted to cts:word-query() instances. That's how the "xquery" search above executed. Another trick to remember is that by default searches are case insensitive for all-lowercase tokens but case sensitive for tokens containing any uppercase characters. The logic is, if you bothered enough to capitalize, you probably meant it.
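To see the capitalization rule in action, here's a rough sketch that runs the same title search three ways. The wrapper element names are made up for this example. The third search passes the "case-insensitive" option explicitly to cts:word-query(), the same option string used in the property example above:

```xquery
(: Three variations on one search, wrapped so they can run as one query.
   The element names (results, any-case, ...) are illustrative only. :)
<results>
  <any-case>{
    cts:search(//title, "xquery")[1 to 10]
  }</any-case>
  <exact-case>{
    cts:search(//title, "XQuery")[1 to 10]
  }</exact-case>
  <forced-insensitive>{
    cts:search(//title,
      cts:word-query("XQuery", "case-insensitive"))[1 to 10]
  }</forced-insensitive>
</results>
```

Going by the default rule, the first search should match any capitalization, the second only the exact "XQuery" spelling, and the third should behave like the first because the option overrides the default.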

Java-Based and .NET-Based Queries with XCC

While web-based HTTP connections to MarkLogic Server enable fast and easy coding, there are times when you want to connect directly to the database from a separate application. For this, the database exposes an interface to Java and .NET clients called XDBC, along with a client library called XCC that's available for both Java and .NET. To get XCC working you need just a few things:

  • The appropriate XCC client-side package files, downloadable from http://developer.marklogic.com/download/.
  • The server configured to listen for XDBC connections. Use the admin pages to set this up.
  • Java or .NET code written against XCC that connects to the server, executes your query, and (optionally) iterates the result.

It's that easy. The full details are explained in the XCC Developer's Guide [pdf]. You'll find Javadocs and the .NET documentation for the XCC classes included in the distribution and also online on the developer network.

Web-Based Queries with CQ

Now let's move off campus and see what fun things can be found nearby (in the open source software fraternity house). If you're going to be doing much XQuery experimentation, one useful tool to have around is CQ (the name stands for client query). It's a web-based query executing form useful for writing quick queries without touching .xqy files, similar to what you saw with the use-cases demo but expanded with JavaScript hooks that make it practically an IDE.

You can download a copy of CQ from http://developer.marklogic.com/download/. The source code is included in the download, licensed under the open source Apache 2.0 license. You'll also find the source code checked into the developer network Subversion repository. If you make improvements you're welcome to contribute them back. That way they'll appear in subsequent releases.

To install CQ, just copy the files from the downloaded zip into a directory served by MarkLogic Server and make a request to the directory. For example, by placing the files under the Docs/cq directory, you can run CQ as http://localhost:8000/cq/. Please be very careful about exposing CQ on a production site, because it lets remote clients write and run their own queries against your server.

The MLJAM Library for Connecting Back to Java

MLJAM is another useful open source library from the developer network. MLJAM enables the evaluation of Java code from the MarkLogic Server environment. It's a Java Access Module (kind of the reverse of XCC). MLJAM gives your XQuery programs access to the libraries and capabilities of Java, without any difficult glue coding. Example uses for MLJAM:

  • Extracting image metadata
  • Resizing and reformatting an image
  • Running an XSLT transformation
  • Generating a PDF from XSL-FO
  • Calculating an MD5 hash
  • Interfacing with a user authentication system
  • Accessing a credit card purchasing system
  • Connecting to a secure HTTPS web site
  • Re-encoding content as UTF-8

To demonstrate what MLJAM code looks like, here's an XQuery function that returns the MD5 hash of a passed-in string, built using MLJAM to access Java's MessageDigest class:

(: Assume start() and end() are called externally :)
define function md5($x as xs:string) as xs:string
{
  jam:set("md5src", $x),
  jam:eval('
      import java.security.MessageDigest;

      digest = MessageDigest.getInstance("MD5");
      md5hash = digest.digest(md5src.getBytes("UTF-8"));
  '),
  xs:string(jam:get("md5hash"))
}

The first line creates a Java variable "md5src" and assigns it a value based on the $x variable in XQuery. The second line executes a bit of Java code to calculate the MD5 hash of the string. The last line retrieves the hashed value and returns it as a string. This example and many others are included in the MLJAM distribution.
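The comment at the top of that function mentions start() and end() calls made elsewhere. As a rough sketch of what a complete round trip might look like, here's a small example in the same style. Treat the specifics as assumptions: the module namespace and location, the servlet URL, and the credentials passed to jam:start() are placeholders based on a typical MLJAM setup, so check the MLJAM tutorial for the values that apply to your install.

```xquery
(: Hypothetical setup: module location, servlet URL, and credentials
   are placeholders; adjust them to match your own MLJAM install. :)
import module namespace jam = "http://xqdev.com/jam" at "jam.xqy"

(: open a context on the Java side :)
jam:start("http://localhost:8080/mljam", "mljam", "secret"),
(: copy an XQuery value into a Java variable :)
jam:set("greeting", "hello"),
(: run a line of Java against that variable :)
jam:eval('shout = greeting.toUpperCase();'),
(: pull the Java result back into XQuery :)
xs:string(jam:get("shout")),
(: release the context :)
jam:end()
```

The pattern is the same as in the md5 function: set inputs, evaluate Java, get outputs, with the start and end calls bracketing the whole conversation.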

The Java code does not run within the MarkLogic Server process. It runs in a separate Java servlet engine process, probably on the same machine as MarkLogic but not necessarily. The XQuery and Java processes communicate via a REST-based web service protocol.

For a complete tutorial on MLJAM, see http://developer.marklogic.com/howto/tutorials/2006-05-mljam.xqy. You might also want to check out its partner project MLSQL at http://developer.marklogic.com/howto/tutorials/2006-04-mlsql.xqy.

The JSP Tag Library

As another example of the great projects available on the developer network, let's look at the JSP Tag Library. It's a set of JavaServer Pages tags that can greatly simplify your Java EE (Enterprise Edition) and XQuery integration. The JSP Tag Library tags make it easy to execute XQuery from within JSP pages, sending the results directly to the client as XHTML or storing the results in a variable for tag-based manipulation. Modeled after the JSTL (JSP Standard Tag Library) tags for SQL, the tags make it much easier to put XQuery results on a web page.

You can use Subversion to check out the JSP Tag Library from http://developer.marklogic.com/svn/jsp.

Continuing On

Well, our tour's coming to an end. Let me leave you with one piece of advice: Join the developer network mailing list.

When new content is posted and new releases come out, the list is where the releases are announced. If you have questions, it's where you ask them. And if you have answers, it's where you share them. Here's the link:

http://developer.marklogic.com/discuss/

Hope to see you around!