[MarkLogic Dev General] RE: force a "buffer write"?

Kelly Stirman Kelly.Stirman at marklogic.com
Thu Jun 25 06:41:33 PDT 2009


Here's one way I would do it in MarkLogic:

1) Create a range index on your link element or attribute.
2) Iterate over all the unique links using cts:element-values() (or cts:element-attribute-values() if the link is an attribute).
3) Either spawn or invoke a module that a) checks whether the link is valid, then b) records the result if it is invalid.

You might put the invalid entries back as documents in the database, or on their parent documents as properties or an attribute on the link - there are several options here.

This allows you to have lots of simultaneous "threads" working. If you spawn a task for each link, the number of threads is configurable on the task server configuration screen, and you can check the status page to see how your process is coming along. Note that spawned tasks have the disadvantage of not surviving server restarts, which is why you may decide to use CPF to process each document as it is inserted or updated.

Kelly

-----Original Message-----
From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of general-request at developer.marklogic.com
Sent: Thursday, June 25, 2009 8:59 AM
To: general at developer.marklogic.com
Subject: General Digest, Vol 60, Issue 33

Send General mailing list submissions to
        general at developer.marklogic.com

To subscribe or unsubscribe via the World Wide Web, visit
        http://xqzone.com/mailman/listinfo/general
or, via email, send a message with subject or body 'help' to
        general-request at developer.marklogic.com

You can reach the person managing the list at
        general-owner at developer.marklogic.com

When replying, please edit your Subject line so it is more specific
than "Re: Contents of General digest..."


Today's Topics:

   1. RE: force a "buffer write"? (Lee, David)
   2. RE: force a "buffer write"? (Geert Josten)
   3. Re: force a "buffer write"? (Jakob Fix)
   4. RE: force a "buffer write"? (Lee, David)


----------------------------------------------------------------------

Message: 1
Date: Thu, 25 Jun 2009 04:49:50 -0700
From: "Lee, David" <dlee at epocrates.com>
Subject: RE: [MarkLogic Dev General] force a "buffer write"?
To: "General Mark Logic Developer Discussion"
        <general at developer.marklogic.com>
Message-ID: <DD37F70D78609D4E9587D473FC61E0A710558246 at postoffice>
Content-Type: text/plain;       charset="iso-8859-1"

I've done link-checking programs before, and I suggest this may be best done *outside* of ML.
What I would do if I were to do this:

1) Use ML to generate an XML document with the info you need (XML file with list nodes)
2) Use a scripting or programming language that supports multiple threads or processes
3) In batches of N threads/processes, test the links and write the results to non-conflicting output files
4) Wait for each batch to complete and aggregate the results (maybe send this bit back to ML?)
5) Go to 3 until done
6) Send the aggregated results back to ML


For the scripting or programming language for #2, there are many options.
My personal bias, of course, is xmlsh, which runs background tasks as threads,
but sh, Perl, Java, C++ or any language that lets you do background processing of URL fetches will work.
One that uses threads instead of processes is desirable, but I've done this kind of thing with sh before
and it's acceptable. You may want a URL fetch command that has a settable timeout, something like wget;
many URLs that are inaccessible may take 30 seconds or more to time out using default timeouts.
OTOH, if you use a language with efficient threading you can do large batches (say 100 or more in parallel)
and the individual timeouts won't matter as much.
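
The batch procedure above could be sketched outside MarkLogic roughly as follows. This is only a minimal Python sketch, not the xmlsh approach mentioned above: the names `check_link` and `check_links`, the batch size of 100 and the 10-second timeout are illustrative assumptions, and only the standard library is used. The checker is passed in as a parameter so that the batching logic can be tried without network access.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def check_link(url, timeout=10):
    """Hypothetical helper: True if the URL answers with a status below 400."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP 4xx/5xx, DNS failures, refused connections, timeouts
        # and malformed URLs are all reported as "not available".
        return False


def check_links(urls, checker=check_link, batch_size=100):
    """Check a list of URLs in batches of `batch_size` threads.

    Returns a dict mapping each URL to True or False.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for start in range(0, len(urls), batch_size):
            batch = urls[start:start + batch_size]
            # pool.map yields results in input order; this inner loop
            # ends only once every check in the batch has finished.
            for url, ok in zip(batch, pool.map(checker, batch)):
                results[url] = ok
    return results
```

The returned dictionary plays the role of the aggregated results in step 4. Serializing it to XML and loading it back into ML (step 6) is left out of the sketch.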

-David Lee





-----Original Message-----
From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Jakob Fix
Sent: Thursday, June 25, 2009 7:39 AM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] force a "buffer write"?

Thanks once more, Geert!

From an architectural point of view, does it make sense to have a loop
over many thousand URLs run in an xquery which may take 5 seconds
each? How does Mark Logic handle long-running queries? The goal is
to assemble the information about the accessibility of these URLs in
an XML document that will be stored in Mark Logic and used for
analytical output. Should the list be fragmented into smaller
sub-lists and be processed separately?

Would it be possible to have several threads run simultaneously?
Somehow I doubt it as there would probably be issues with the final
aggregation of the different thread sub-documents into a big one.


What about timeouts, especially if the function is called from inside
a web page, how does Mark Logic handle this issue (I saw that one can
tweak the timeout), or would this be a browser timeout problem?

I hope these questions are not too off-topic for the list.

Jakob.


On Thu, Jun 25, 2009 at 08:11, Geert Josten <Geert.Josten at daidalos.nl> wrote:
>
> Hi Jakob,
>
> You are looking for unbuffered response streams, but sending of the response is handled fully by the HTTP server. I don't believe you can influence that.
>
> Giving it some more thought, I am afraid that allowing unbuffered responses would break the idea of transactions. You don't want to send back a response unless you can guarantee that no exceptions will be thrown, and I don't think that can be guaranteed.
>
> Perhaps a MarkLogic expert would like to comment?
>
> Hasn't this been discussed before? It vaguely rings a bell..
>
> Kind regards,
> Geert
>
> >
>
>
> Drs. G.P.H. Josten
> Consultant
>
>
> http://www.daidalos.nl/
> Daidalos BV
> Source of Innovation
> Hoekeindsehof 1-4
> 2665 JZ Bleiswijk
> Tel.: +31 (0) 10 850 1200
> Fax: +31 (0) 10 850 1199
> http://www.daidalos.nl/
> KvK 27164984
> The information sent in or with this email message originates from Daidalos BV and is intended solely for the addressee. If you have received this message unintentionally, we ask you to delete it. No rights can be derived from this message.
>
>
> > From: general-bounces at developer.marklogic.com
> > [mailto:general-bounces at developer.marklogic.com] On Behalf Of
> > Jakob Fix
> > Sent: Thursday, June 25, 2009 1:39
> > To: General Mark Logic Developer Discussion
> > Subject: [MarkLogic Dev General] force a "buffer write"?
> >
> > So, I've written a function that looks at one URL at a time
> > and returns true or false depending on its accessibility.
> > Now, my problem is that the result is returned only when all
> > (and that means potentially many) URLs have been checked.
> > Isn't there a way to "force a write"? I'm not sure I'm
> > expressing myself correctly, but hopefully you'll understand
> > what I mean. Thanks.
> >
> > (: consider this: if the timeout is 10 seconds, and I have
> > three "bad" URLs, I may have to wait up to 30 seconds before
> > seeing the results, isn't there a way to see a result every
> > ten seconds instead? :)
> >
> > for $url in $urls
> > return <xh:li>{$url}: {utils:http-resource-available($url, $timeout)}</xh:li>
> >
> >
> > (: function that checks accessibility of a URL, written for
> > DOIs which are "placeholder" URLs which forward to the real URL :)
> >
> > declare function utils:http-resource-available
> >     ($doi as xs:string, $oldtimeout as xs:integer?) as xs:boolean {
> >     try {
> >       let $timeout := if ($oldtimeout) then $oldtimeout else 10
> >       let $head :=
> >           xdmp:http-get(fn:concat($utils:doi-resolver, $doi),
> >               <options xmlns="xdmp:http"><timeout>{$timeout}</timeout></options>)
> >       let $code := $head//xdh:code cast as xs:integer
> >       let $location := $head//xdh:location
> >       return ((fn:contains($location, $utils:location-part-to-match)) and
> >           ($code < 400)) (: we want 3XX or 2XX? :)
> >     } catch ($ex) {
> >         if ($ex/error:code eq 'SVC-SOCRECV')
> >         then
> >             fn:false()
> >         else
> >             xdmp:rethrow()
> >     }
> > };
> >
> >
> >
>
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://xqzone.com/mailman/listinfo/general


------------------------------

Message: 2
Date: Thu, 25 Jun 2009 14:54:09 +0200
From: Geert Josten <Geert.Josten at daidalos.nl>
Subject: RE: [MarkLogic Dev General] force a "buffer write"?
To: General Mark Logic Developer Discussion
        <general at developer.marklogic.com>
Message-ID:
        <0260356C6DFE754BA6FA48E659A14338269A8FDBE0 at helios.olympus.borgus.nl>
Content-Type: text/plain; charset="iso-8859-1"

Hi,

I agree that it does not necessarily make sense to do this within MarkLogic. On the other hand, you might have a good reason, particularly when there is a lot of information to analyse and you have to store it somewhere.

Have you considered taking an asynchronous approach? You can have one query gather all the URIs that need processing and store them somewhere, perhaps in batches. Then use Triggers or CPF to process those batches. MarkLogic is capable of handling those triggers in multiple threads, though if all URIs point to the same website, you may not want to overload it that way. A small sleep between requests would probably be appreciated by the website host.

You could also use scripts and programming languages for the checking, but then it might be better to do that part outside MarkLogic altogether, and only insert the log reports for analysis purposes.

Kind regards,
Geert


------------------------------

Message: 3
Date: Thu, 25 Jun 2009 15:03:48 +0200
From: Jakob Fix <jakob.fix at gmail.com>
Subject: Re: [MarkLogic Dev General] force a "buffer write"?
To: General Mark Logic Developer Discussion
        <general at developer.marklogic.com>
Message-ID:
        <d966f0ff0906250603g2bb50ab7kc9f629ca65ff93e2 at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Thanks David and Geert,

I'm currently looking into xmlsh. David, are there sample scripts
available somewhere that I could use? Geert, no, I haven't
looked at either triggers or CPF, but I will explore this route too.

thanks for your input.
Jakob.





------------------------------

Message: 4
Date: Thu, 25 Jun 2009 06:17:56 -0700
From: "Lee, David" <dlee at epocrates.com>
Subject: RE: [MarkLogic Dev General] force a "buffer write"?
To: "General Mark Logic Developer Discussion"
        <general at developer.marklogic.com>
Message-ID: <DD37F70D78609D4E9587D473FC61E0A7105582FB at postoffice>
Content-Type: text/plain;       charset="iso-8859-1"

xmlsh is somewhat documented at www.xmlsh.org.
There is a Mark Logic connector, available as an extension module that you have to download separately: http://www.xmlsh.org/ModuleMarkLogic

There are sample scriptlets in the documentation as well as in the test/* directory when you install the runtime.

If you have questions feel free to contact me off-list and I'd be happy to help you out.

-David Lee
dlee at epocrates.com
dlee at calldei.com






------------------------------



End of General Digest, Vol 60, Issue 33
***************************************

