retriever/templates/help.tpl

{{*
 *      AUTOMATICALLY GENERATED TEMPLATE
 *      DO NOT EDIT THIS FILE, CHANGES WILL BE OVERWRITTEN
 *
 *}}
<h2>Retriever Plugin Help</h2>
<p>
This plugin replaces the short excerpts you normally get in RSS feeds
with the full content of the article from the source website.  You
specify which part of the page you're interested in with a set of
rules.  When each item arrives, the plugin downloads the full page
from the website, extracts content using the rules, and replaces the
original article.
</p>
<p>
There are a few reasons you may want to do this.  The source website
might be slow or overloaded.  The source website might be
untrustworthy, in which case using Friendica to scrub the HTML is a
good idea.  You might be on a LAN that blacklists certain websites.
It also works neatly with the mailstream plugin, allowing you to read
a news stream comfortably without needing continuous Internet
connectivity.
</p>
<p>
However, setting up retriever can be quite tricky, since it depends on
the internal design of the website.  That design was chosen to make
life easy for the website's developers, not for you.  You'll need to
have some familiarity with HTML, and be willing to adapt when the
website suddenly changes everything without notice.
</p>
<h3>Configuring Retriever for a feed</h3>
<p>
To set up retriever for an RSS feed, go to the "Contacts" page and
find your feed.  Then click on the drop-down menu on the contact.
Select "Retriever" to get to the retriever configuration.
</p>
<p>
The "Include" configuration section specifies parts of the page to
include in the article.  Each row has three components:
</p>
<ul>
<li>An HTML tag (e.g. "div", "span", "p")</li>
<li>An attribute (usually "class" or "id")</li>
<li>A value for the attribute</li>
</ul>
<p>
A simple case is when the article is wrapped in a "div" element:
</p>
<pre>
    ...
    &lt;div class="ArticleWrapper"&gt;
      &lt;h2&gt;Man Bites Dog&lt;/h2&gt;
      &lt;img src="mbd.jpg"&gt;
      &lt;p&gt;
        Residents of the sleepy community of Nowheresville were
        shocked yesterday by the sight of creepy local weirdo Jim
        McOddman assaulting innocent local dog Snufflekins with his
        false teeth.
      &lt;/p&gt;
      ...
    &lt;/div&gt;
    ...
</pre>
<p>
You then specify the tag "div", attribute "class", and value
"ArticleWrapper".  Everything else in the page, such as navigation
panels, menus, and footers, will be discarded.  If there is more than
one section of the page you want to include, specify each one on a
separate row.  If a matching section contains parts you want to
remove, specify those in the "Exclude" section in the same way.
</p>
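<p>
The way include and exclude rows work together can be pictured as a
filter over the page's element tree.  The Python sketch below is only
an illustration under stated assumptions, not the plugin's own code:
the page is modelled as nested (tag, attributes, children) tuples
rather than parsed HTML, only text is collected, and the names
"matches" and "extract_text" and the classes "NavBar", "ShareButtons"
and "Footer" are hypothetical.
</p>

```python
# Hypothetical model: an element is (tag, attributes, children);
# a child is either another element or a plain text string.
PAGE = ("body", {}, [
    ("div", {"class": "NavBar"}, ["Home | Sport | Weather"]),
    ("div", {"class": "ArticleWrapper"}, [
        ("h2", {}, ["Man Bites Dog"]),
        ("p", {}, ["Residents of Nowheresville were shocked yesterday."]),
        ("div", {"class": "ShareButtons"}, ["Share this story"]),
    ]),
    ("div", {"class": "Footer"}, ["Copyright"]),
])


def matches(rule, tag, attrs):
    """A rule is one configuration row: (tag, attribute, value)."""
    rule_tag, rule_attr, rule_value = rule
    return rule_tag == tag and attrs.get(rule_attr) == rule_value


def extract_text(node, include, exclude, inside=False):
    """Collect text found inside included elements, skipping excluded ones."""
    if isinstance(node, str):
        return [node] if inside else []
    tag, attrs, children = node
    if any(matches(rule, tag, attrs) for rule in exclude):
        return []  # drop this element and everything beneath it
    if any(matches(rule, tag, attrs) for rule in include):
        inside = True
    found = []
    for child in children:
        found.extend(extract_text(child, include, exclude, inside))
    return found


include = [("div", "class", "ArticleWrapper")]
exclude = [("div", "class", "ShareButtons")]
article = extract_text(PAGE, include, exclude)
print(article)
```

<p>
With these rows, only the heading and the paragraph survive; the
navigation bar and footer are never included, and the share buttons
are removed even though they sit inside the included section.
</p>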
<p>
Once you've got a configuration that you think will work, you can try
it out on some existing articles.  Type a number into the
"Retrospectively Apply" box and click "Submit".  After a while
(exactly how long depends on your system's cron configuration) the new
articles should be available.
</p>
<h3>Techniques</h3>
<p>
You can leave the attribute and value blank to include all elements
with the specified tag name.  You can also use a single asterisk ("*")
as the tag name, which matches any element with the specified
attribute, regardless of its tag.
</p>
<p>
Note that the "class" attribute is a special case.  Many web page
templates will put multiple different classes in the same element,
separated by spaces.  If you specify an attribute of "class", it will
match an element if any of its classes matches the specified value.
For example:
</p>
<pre>
    &lt;div class="article breaking-news"&gt;
</pre>
<p>
In this case you can specify a value of "article" or "breaking-news".
You can also specify "article breaking-news", but that won't match if
the website suddenly changes to "breaking-news article", so that's not
recommended.
</p>
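<p>
The matching behaviour described in this section (blank attribute,
asterisk tag, and the special handling of "class") can be sketched in
Python as follows.  This is a guess at the semantics based only on the
description above, not the plugin's own code; the function name
"rule_matches" and the space-padding approach are assumptions.
</p>

```python
def rule_matches(rule, tag, attrs):
    """Check one (tag, attribute, value) row against an element.

    Assumed semantics, taken from the description above:
    - a tag of "*" matches any element type;
    - a blank attribute matches every element with that tag;
    - "class" is compared against the space-separated class list.
    """
    rule_tag, rule_attr, rule_value = rule
    if rule_tag != "*" and rule_tag != tag:
        return False
    if not rule_attr:
        return True
    if rule_attr not in attrs:
        return False
    actual = attrs[rule_attr]
    if rule_attr == "class":
        # Pad both sides with spaces so the value only matches whole
        # class names, never a fragment of one.
        padded = " " + " ".join(actual.split()) + " "
        needle = " " + rule_value + " "
        return needle in padded
    return actual == rule_value


# The element from the example above: a div with class "article breaking-news"
attrs = {"class": "article breaking-news"}
for value in ["article", "breaking-news", "article breaking-news", "news"]:
    print(repr(value), rule_matches(("div", "class", value), "div", attrs))
```

<p>
Under these assumptions the first three values match and "news" does
not, because it is only a fragment of a class name.  The value
"article breaking-news" stops matching as soon as the classes appear
in a different order, which is why a single class name is the safer
choice.
</p>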
<p>
One useful trick you can try is using the website's "print" pages.
Many news sites have print versions of all their articles.  These are
usually drastically simplified compared to the live website page.
Sometimes this is a good way to get the whole article when it's
normally split across multiple pages.
</p>
<p>
Hopefully the URL for the print page is a predictable variant of the
normal article URL.  For example, an article URL like:
</p>
<pre>
    http://www.newssite.com/article-8636.html
</pre>
<p>
...might have a print version at:
</p>
<pre>
    http://www.newssite.com/print/article-8636.html
</pre>
<p>
To change the URL used to retrieve the page, use the "URL Pattern" and
"URL Replace" fields.  The pattern is a regular expression matching
the part of the URL to replace.  In this case, you might use a pattern
of "/article" and a replace string of "/print/article".  A common
pattern is simply a dollar sign ("$"), which matches the end of the
URL, so the replace string is appended to it.
</p>
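<p>
To see what a pattern and replace string do before saving them, you
can try them in any regular expression tool.  The Python sketch below
is one way to do that.  It assumes the plugin applies the pattern as
an ordinary regular expression substitution; Python's dialect may
differ in small details from the one the plugin uses, and the helper
name "rewrite_url" and the "?print=1" suffix are hypothetical.
</p>

```python
import re


def rewrite_url(url, pattern, replace):
    """Replace the part of the URL matched by the regular expression."""
    return re.sub(pattern, replace, url)


article = "http://www.newssite.com/article-8636.html"

# Swap in the print path, as in the example above.
print(rewrite_url(article, "/article", "/print/article"))
# prints http://www.newssite.com/print/article-8636.html

# "$" matches the end of the URL, so the replace string is appended.
print(rewrite_url(article, "$", "?print=1"))
# prints http://www.newssite.com/article-8636.html?print=1
```

<p>
Remember that characters such as "." and "?" have a special meaning
in a regular expression, so escape them with a backslash if you want
to match them literally in the pattern.
</p>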
<h3>Background Processing</h3>
<p>
Note that retrieving and processing the articles can take some time,
so it's done in the background.  Incoming articles will be marked as
invisible while they're in the process of being downloaded.  If a URL
fails, the plugin will keep trying at progressively longer intervals
for up to a month, in case the website is temporarily overloaded or
the network is down.
</p>
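<p>
The exact retry schedule is not documented here.  Purely as an
illustration of "progressively longer intervals", the Python sketch
below doubles the wait after each failure and gives up once the next
attempt would fall more than 30 days after the first.  The 15-minute
starting wait, the doubling, and the name "retry_times" are all
assumptions, not the plugin's real values.
</p>

```python
from datetime import timedelta


def retry_times(first_wait=timedelta(minutes=15),
                give_up_after=timedelta(days=30)):
    """Return the elapsed time at each retry, doubling the wait each time."""
    elapsed = timedelta(0)
    wait = first_wait
    times = []
    while True:
        if elapsed + wait > give_up_after:
            break  # the next attempt would be past the cut-off
        elapsed += wait
        times.append(elapsed)
        wait *= 2
    return times


schedule = retry_times()
print(len(schedule), "retries, the last one after", schedule[-1])
```

<p>
A schedule like this retries quickly while a failure is likely to be
a brief glitch, and backs off to avoid hammering a website that stays
down for days.
</p>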
<h3>Retrieving Images</h3>
<p>
Retriever can optionally download images and store them in the
local Friendica instance.  Just check the "Download Images" box.  You
can also download images in every item from your network, whether it's
an RSS feed or not.  Go to the "Settings" page and
click <a href="$config">"Plugin settings"</a>.  Then check the "All
Photos" box in the "Retriever Settings" section and click "Submit".
</p>
<h2>Configure Feeds:</h2>
<div>
{{ for $feeds as $feed }}
{{ inc contact_template.tpl with $contact=$feed }}{{ endinc }}
{{ endfor }}
</div>