From: Matthew Exon
Date: Mon, 6 May 2013 01:50:03 +0000 (+0800)
Subject: Merge remote-tracking branch 'upstream/master' into retriever
X-Git-Url: https://git.mxchange.org/?a=commitdiff_plain;h=4fcd882f02505212d5916d00fecc918d7658e266;p=friendica-addons.git

Merge remote-tracking branch 'upstream/master' into retriever

Conflicts:
    retriever/view/help.tpl
---

4fcd882f02505212d5916d00fecc918d7658e266
diff --cc retriever/templates/help.tpl
index 00000000,9e481188..23ae297b
mode 000000,100644..100644
--- a/retriever/templates/help.tpl
+++ b/retriever/templates/help.tpl
@@@ -1,0 -1,153 +1,153 @@@
+ {{*
+ * AUTOMATICALLY GENERATED TEMPLATE
+ * DO NOT EDIT THIS FILE, CHANGES WILL BE OVERWRITTEN
+ *
+ *}}
+

Retriever Plugin Help

+

+ This plugin replaces the short excerpts you normally get in RSS feeds
+ with the full content of the article from the source website. You
+ specify which part of the page you're interested in with a set of
+ rules. When each item arrives, the plugin downloads the full page
+ from the website, extracts content using the rules, and replaces the
+ original article.
+

+

+ There are a few reasons you may want to do this. The source website
+ might be slow or overloaded. The source website might be
+ untrustworthy, in which case using Friendica to scrub the HTML is a
+ good idea. You might be on a LAN that blacklists certain websites.
+ It also works neatly with the mailstream plugin, allowing you to read
+ a news stream comfortably without needing continuous Internet
+ connectivity.
+

+

+ However, setting up retriever can be quite tricky since it depends on
 -the internal design of the website. This was designed to make life
++the internal design of the website. That was designed to make life
+ easy for the website's developers, not for you. You'll need to have
+ some familiarity with HTML, and be willing to adapt when the website
+ suddenly changes everything without notice.
+

+

Configuring Retriever for a feed

+

+ To set up retriever for an RSS feed, go to the "Contacts" page and
+ find your feed. Then click on the drop-down menu on the contact.
+ Select "Retriever" to get to the retriever configuration.
+

+

+ The "Include" configuration section specifies parts of the page to
+ include in the article. Each row has three components: the tag, the
+ attribute, and the value.
+

+ +

+ A simple case is when the article is wrapped in a "div" element: +

+
+     ...
 -    <div class="main-content">
++    <div class="ArticleWrapper">
+       <h2>Man Bites Dog</h2>
+       <img src="mbd.jpg">
+       <p>
+         Residents of the sleepy community of Nowheresville were
+         shocked yesterday by the sight of creepy local weirdo Jim
+         McOddman assaulting innocent local dog Snufflekins with his
+         false teeth.
+       </p>
+       ...
+     </div>
+     ...
+ 
+

+ You then specify the tag "div", attribute "class", and value
 -"main-content". Everything else in the page, such as navigation
++"ArticleWrapper". Everything else in the page, such as navigation
+ panels and menus and footers and so on, will be discarded. If there
+ is more than one section of the page you want to include, specify each
+ one on a separate row. If the matching section contains some sections
+ you want to remove, specify those in the "Exclude" section in the same
+ way.
+
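To make the include and exclude rules concrete, here is a minimal Python sketch of the idea. It is not the plugin's own code. It assumes the page is well-formed enough for the standard library's XML parser, and the function name `extract`, the sample page, and the "ShareButtons" class are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real pages are rarely this tidy).
PAGE = """<html><body>
<div class="NavPanel"><a href="/">Home</a></div>
<div class="ArticleWrapper">
<h2>Man Bites Dog</h2>
<div class="ShareButtons">Share this!</div>
<p>Residents were shocked.</p>
</div>
<div class="Footer">Copyright</div>
</body></html>"""


def extract(page, include, exclude=()):
    """Return elements matching any include rule, minus excluded descendants.

    Each rule is an illustrative (tag, attribute, value) triple.
    """
    root = ET.fromstring(page)
    kept = []
    for tag, attr, value in include:
        kept.extend(root.findall(f".//{tag}[@{attr}='{value}']"))
    for element in kept:
        for tag, attr, value in exclude:
            # ElementTree has no parent pointers, so try every possible parent.
            for parent in list(element.iter()):
                for child in parent.findall(f"./{tag}[@{attr}='{value}']"):
                    parent.remove(child)
    return kept


articles = extract(
    PAGE,
    include=[("div", "class", "ArticleWrapper")],
    exclude=[("div", "class", "ShareButtons")],
)
result = "".join(ET.tostring(a, encoding="unicode") for a in articles)
print(result)
```

Only the article wrapper survives; the navigation panel, the footer, and the excluded block inside the wrapper are all dropped.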

+

+ Once you've got a configuration that you think will work, you can try
+ it out on some existing articles. Type a number into the
+ "Retrospectively Apply" box and click "Submit". After a while
+ (exactly how long depends on your system's cron configuration) the new
+ articles should be available.
+

+

Techniques

+

+ You can leave the attribute and value blank to include all the
+ corresponding elements with the specified tag name. You can also use
 -a tag name of "*", which will match any element type with the
++a tag name of just an asterisk ("*"), which will match any element type with the
+ specified attribute regardless of the tag.
+

+

+ Note that the "class" attribute is a special case. Many web page
+ templates will put multiple different classes in the same element,
+ separated by spaces. If you specify an attribute of "class" it will
+ match an element if any of its classes matches the specified value.
+ For example:
+

+
+     <div class="article breaking-news">
+ 
+

+ In this case you can specify a value of "article", or "breaking-news".
+ You can also specify "article breaking-news", but that won't match if
+ the website suddenly changes to "breaking-news article", so that's not
+ recommended.
+
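The matching behaviour described in this section can be summarised in a few lines of Python. This is only a sketch of the rules as the help text states them, not the plugin's implementation. The function name `rule_matches` and its parameters are hypothetical, and how a rule with an attribute but a blank value behaves is a guess.

```python
def rule_matches(tag, attrs, rule_tag, rule_attr="", rule_value=""):
    """Decide whether one element satisfies one (tag, attribute, value) rule.

    `tag` is the element's name and `attrs` a dict of its attributes.
    """
    # A tag name of "*" matches any element type.
    if rule_tag != "*" and rule_tag != tag:
        return False
    if not rule_attr:
        # Blank attribute and value: every element with this tag matches.
        return True
    if rule_attr not in attrs:
        return False
    actual = attrs[rule_attr]
    if rule_attr == "class":
        # Special case: match the whole string or any single class in it.
        return rule_value == actual or rule_value in actual.split()
    # Assumption: other attributes must equal the value exactly.
    return rule_value == actual


element = {"class": "article breaking-news"}
print(rule_matches("div", element, "div", "class", "article"))
print(rule_matches("div", {"class": "breaking-news article"},
                   "div", "class", "article breaking-news"))
```

The first call matches on a single class; the second shows why the full two-class string is fragile once the website reorders its classes.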

+

+ One useful trick you can try is using the website's "print" pages.
+ Many news sites have print versions of all their articles. These are
+ usually drastically simplified compared to the live website page.
+ Sometimes this is a good way to get the whole article when it's
+ normally split across multiple pages.
+

+

+ Hopefully the URL for the print page is a predictable variant of the
+ normal article URL. For example, an article URL like:
+

+
+     http://www.newssite.com/article-8636.html
+ 
+

+ ...might have a print version at: +

+
+     http://www.newssite.com/print/article-8636.html
+ 
+

+ To change the URL used to retrieve the page, use the "URL Pattern" and
+ "URL Replace" fields. The pattern is a regular expression matching
+ part of the URL to replace. In this case, you might use a pattern of
+ "/article" and a replace string of "/print/article". A common pattern
 -is simply "$", used to add the replace string to the end of the URL.
++is simply a dollar sign ("$"), used to add the replace string to the end of the URL.
+
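If you want to check a pattern before saving it, you can try the substitution yourself. The Python sketch below mimics the "URL Pattern" and "URL Replace" fields with a plain regular-expression replace. The helper name `rewrite_url` and the "?print=1" suffix are hypothetical, replacing only the first match is an assumption, and the plugin's own regular-expression dialect may differ in details.

```python
import re


def rewrite_url(url, pattern, replace):
    """Replace the first part of `url` matching `pattern` with `replace`."""
    # Assumption: only the first match is replaced.
    return re.sub(pattern, replace, url, count=1)


article = "http://www.newssite.com/article-8636.html"

# Insert "/print" in front of the article path.
print_url = rewrite_url(article, "/article", "/print/article")
print(print_url)

# A pattern of "$" matches the end of the URL, so the replace string is appended.
appended_url = rewrite_url(article, "$", "?print=1")
print(appended_url)
```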

+

Background Processing

+

+ Note that retrieving and processing the articles can take some time,
+ so it's done in the background. Incoming articles will be marked as
+ invisible while they're in the process of being downloaded. If a URL
+ fails, the plugin will keep trying at progressively longer intervals
+ for up to a month, in case the website is temporarily overloaded or
+ the network is down.
+
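"Progressively longer intervals" can be pictured as a simple backoff schedule. The sketch below is purely illustrative: the help text does not give the actual intervals, so the 15-minute first delay, the doubling factor, and the 30-day cutoff are all assumptions, as is the function name `retry_schedule`.

```python
from datetime import timedelta


def retry_schedule(first_delay=timedelta(minutes=15),
                   factor=2,
                   give_up_after=timedelta(days=30)):
    """Yield the wait before each retry until the total would pass the cutoff.

    All default values are assumptions, not the plugin's real settings.
    """
    elapsed = timedelta(0)
    delay = first_delay
    while elapsed + delay <= give_up_after:
        yield delay
        elapsed += delay
        delay *= factor


delays = list(retry_schedule())
for number, delay in enumerate(delays, start=1):
    print(f"retry {number}: wait {delay}")
```

With these made-up numbers each wait is twice the previous one, so a briefly overloaded website is retried quickly while a long outage costs only a handful of requests.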

+

Retrieving Images

+

+ Retriever can also optionally download images and store them in the
+ local Friendica instance. Just check the "Download Images" box. You
+ can also download images in every item from your network, whether it's
+ an RSS feed or not. Go to the "Settings" page and
 -click "Plugin settings". Then check the "All
++click "Plugin settings". Then check the "All
+ Photos" box in the "Retriever Settings" section and click "Submit".
+

+

Configure Feeds:

+
 -{{foreach $feeds as $feed}}
 -{{include file="contact_template.tpl" contact=$feed}}
 -{{/foreach}}
++{{ for $feeds as $feed }}
++{{ inc contact_template.tpl with $contact=$feed }}{{ endinc }}
++{{ endfor }}
+