tips/downloading podcasts (git-annex; generated by ikiwiki; updated 2022-09-01T19:26:20Z) http://git-annex.branchable.com/tips/downloading_podcasts/
Filename too long (ckeen; published 2013-07-30T14:39:44Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_1_f04bc32a34baeeffcd691e9f7cce0230/
<p>It seems that some of my feeds get stored under keys that generate filenames that are too long:</p>
<pre><code>podcasts/.git/annex/tmp/b1f_325_URL-s143660317--http&c%%feedproxy.google.com%~r%mixotic%~5%urTIRWQK2OQ%Mixotic__258__-__Michael__Miller__-__Galactic__Technolgies.mp3.log.web:
openBinaryFile: invalid argument (File name too long)
</code></pre>
<p>Is there a way to work around this?</p>
comment 2 (joeyh.name; published 2013-07-30T17:16:07Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_2_a9a98cad7358d16792853a2ee413fe6c/
@ckeen You seem to be using a filesystem that does not support filenames 150 characters long. This is unusual -- even Windows and Android can support a filename up to 255 characters in length. <code>git-annex addurl</code> already deals with this sort of problem by limiting the filename to 255 characters. If you'd like to file a bug report with details about your system, I can try to make git-annex support its limitations, I suppose.
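As a rough illustration of the kind of 255-character limiting described above, here is a minimal Python sketch. It is not git-annex's actual code; the function name <code>truncate_filename</code> and the choice to count UTF-8 bytes and preserve the extension are assumptions made for the example.

```python
import os

def truncate_filename(name, limit=255):
    """Shorten `name` to at most `limit` UTF-8 bytes, keeping its extension.

    Hypothetical helper for illustration only; git-annex's real logic may differ.
    """
    if len(name.encode("utf-8")) <= limit:
        return name
    stem, ext = os.path.splitext(name)
    # Bytes left for the stem once the extension is accounted for.
    room = max(limit - len(ext.encode("utf-8")), 0)
    # Cut at a byte boundary, then drop any partial trailing character.
    short_stem = stem.encode("utf-8")[:room].decode("utf-8", errors="ignore")
    return short_stem + ext

long_name = "x" * 300 + ".mp3"
print(len(truncate_filename(long_name).encode("utf-8")))
```

Counting bytes rather than characters matters because many filesystems express their limit in bytes, so a name with non-ASCII characters can hit the limit sooner than its character count suggests.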
Great stuff! (nomeata; published 2013-07-30T21:21:57Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_3_5a8068a5cb0fd864581157a3aa5d1113/
<p>Looking forward to seeing it in Debian unstable; where it will definitely replace my hpodder setup.</p>
<p>I guess there is no easy way to re-use the files already downloaded with hpodder? At first I thought that <code>git annex importfeed --relaxed</code> followed by adding the files to the git annex would work, but <code>importfeed</code> stores URLs, not content-based hashes, so it wouldn’t match up.</p>
comment 4 (joeyh.name; published 2013-07-30T21:29:50Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_4_e7072a9da30b4c4b4c526013144238d4/
<p>@nomeata, well, you can, but it has to download the files again.</p>
<p>When run without --fast, <code>importfeed</code> does use content-based hashes, so if you run it in a temporary directory, it will download the content redundantly, hash it and see it's the same, and add the url to that hash. You can then delete the temporary directory, and the files hpodder had downloaded will have the url attached to them now. I don't know if this really buys you anything over deleting the hpodder files and starting over though.</p>
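<p>The idea of "hash it, see it's the same, and add the url to that hash" can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the <code>content_key</code> and <code>record_url</code> names and the in-memory dictionary are hypothetical, and the key string merely imitates the look of a size-plus-SHA256 key rather than reproducing git-annex's storage.</p>

```python
import hashlib

def content_key(data):
    """Build a content-based key from the bytes alone (illustrative format)."""
    return "SHA256-s%d--%s" % (len(data), hashlib.sha256(data).hexdigest())

def record_url(store, data, url):
    """Attach `url` to the key derived from `data`; return that key."""
    key = content_key(data)
    store.setdefault(key, set()).add(url)
    return key

store = {}
# The same bytes, once already on disk and once re-downloaded from the feed
# (both locations are made-up examples).
key_a = record_url(store, b"episode audio", "file:///old/hpodder/ep1.mp3")
key_b = record_url(store, b"episode audio", "http://example.com/feed/ep1.mp3")
print(key_a == key_b, len(store[key_a]))
```

Because the key depends only on the content, the second download lands on the existing entry and just contributes its URL.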
Force a reload of a feed? (ckeen; published 2013-07-31T10:35:50Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_5_79b3f8d678ac9f67df4c0cd649657283/
Currently I have my podcasts imported with --fast. For some reason there are podcast episodes missing. This probably happened while I was toying with the feature. If I retry on a clean annex I see all episodes. My suspicion is that git-annex was interrupted while downloading a feed but now somehow thinks the episodes are already there. How can I debug this situation and/or force git annex to retry all the links in a feed?
use the force (joeyh.name; published 2013-07-31T16:20:39Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_6_35106fee5458bdd5c21868fbc49d3616/
<p>The only way it can skip downloading a file is if its url has already been seen before. Perhaps you deleted them?</p>
<p>I've made <code>importfeed --force</code> re-download files it's seen before.</p>
--force reload all URLs (ckeen; published 2013-08-01T09:47:34Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_7_ceb16498b7aadbf04a27acd5d6561d46/
Is it intentionally saving URLs with a 2_ prefix? I have sorted out all the missing URLs and renamed them, so no harm done, but it has been a bit of a hassle to get there.
comment 8 (joeyh.name; published 2013-08-01T16:05:10Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_8_147397603f0b3fdb42ca387d1da7c5ef/
I've now made importfeed --force a bit smarter about reusing existing files.
How do I switch on the 'feeds' feature? (a-or-b [myopenid.com]; published 2013-08-05T04:52:41Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_9_6a26a6cc7683d38fae0f23c5a52d1e23/
<p>Joey - your initial post said:</p>
<pre><code>git-annex must be built with the Feeds feature (run git annex version to check).
</code></pre>
<p>...but how do I actually switch on the feeds feature?</p>
<p>I install git-annex from cabal, so I do</p>
<pre><code>cabal update
cabal install git-annex
</code></pre>
<p>which I did this morning and now <code>git annex version</code> gives me:</p>
<pre><code>git-annex version: 4.20130802
build flags: Assistant Webapp Pairing Testsuite S3 WebDAV FsEvents XMPP DNS
</code></pre>
<p>So it is the latest version, but without Feeds. <img src="http://git-annex.branchable.com/smileys/sad.png" alt=":-(" /></p>
comment 10 (joeyh.name; published 2013-08-05T16:47:30Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_10_4d4f6c22070b58918ee8d34c5e7290ad/
<code>cabal install feed</code> should get the necessary library installed so that git-annex will build with feeds support.
comment 11 (a-or-b [myopenid.com]; published 2013-08-06T04:20:16Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_11_d8d77048c7e2524968c188e1ad517873/
<pre><code>$ cabal install feed
Resolving dependencies...
All the requested packages are already installed:
feed-0.3.9.1
Use --reinstall if you want to reinstall anyway.
</code></pre>
<p>Then I reinstalled <code>git-annex</code> but it still doesn't find the feeds flag.</p>
<pre><code>$ git annex version
git-annex version: 4.20130802
build flags: Assistant Webapp Pairing Testsuite S3 WebDAV FsEvents XMPP DNS
</code></pre>
<p>Do I need to do something like:</p>
<pre><code>cabal install git-annex --bindir=$HOME/bin -f"-assistant -webapp -webdav -pairing -xmpp -dns -feed"
</code></pre>
<p>...but what are the default flags to include in addition to <code>-feed</code>?</p>
comment 12 (joeyh.name; published 2013-08-06T04:24:10Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_12_0859317471b43c88744dd3df95c879f7/
<p>-f-Feed will disable the feature. -fFeed will try to force it on.</p>
<p>You can probably work out what's going wrong using cabal install -v3</p>
comment 13 (a-or-b [myopenid.com]; published 2013-08-06T05:42:45Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_13_e8c3c97282d17e2a1d47fb9d5e2b2f7b/
<p>So I ran <code>cabal install -v3</code> and looked at the output,</p>
<pre><code>Flags chosen: feed=True, tdfa=True, testsuite=True, android=False,
production=True, dns=True, xmpp=True, pairing=True, webapp=True,
assistant=True, dbus=True, inotify=True, webdav=True, s3=True
</code></pre>
<p>This looks like feed should be on.</p>
<p>There don't appear to be any errors in the compile either.</p>
<p>Is it as simple as a bug where this flag just doesn't show in the <code>git annex version</code> command?</p>
comment 14 (joeyh.name; published 2013-08-07T16:03:12Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_14_05a3694052de36848fbbad6eeeada895/
Yes, it did turn out to be as simple as my having forgotten that I have to manually add features to the version list.
No file extension? (23.gs; published 2013-08-12T13:21:50Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_15_21028bed8858c2dae1ac9c2d014fd2a1/
It seems git-annex is a bit overzealous when sanitizing the file extension. Currently I get "Nerdkunde/Let_s_go_to_the_D_M_C_A_m4a" from http://www.nerdkunde.de/episodes.m4a.rss with the default template, and only "Nerdkunde/Let_s_go_to_the_D_M_C_A._m4a" if I add the "." in the template myself...
comment 16 (arand; published 2013-08-12T13:32:46Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_16_4869fb5c9f896acc477c44de06c36ca7/
The filename extension is a known issue and already fixed in the development version, see <a href="http://git-annex.branchable.com/bugs/importfeed_uses___34____95__foo__34___as_extension/">http://git-annex.branchable.com/bugs/importfeed_uses___34____95__foo__34___as_extension/</a>
rss authentication (Stephen; published 2013-08-13T13:32:52Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_17_2e278ff200c1c15efd27c46a3e0aed40/
If a podcast requires authentication, is there a way to pass credentials through? I tried <code>http://user:pass@site.com/rss.xml</code> but it didn't work.
--fast and --relaxed (nomeata; published 2013-08-16T07:27:59Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_18_382f2b970738d9b1af577955c3083e90/
<p>Hi,</p>
<p>the explanations of --fast and --relaxed on this page could be extended a bit. I looked them up in the man page, but it is not yet clear to me when I would use one or the other with feeds. Also, does “Next time you run git annex addurl it will only fetch any new items.” really only apply to --relaxed, and not --fast?</p>
<p>Furthermore, it would be good if there were a template variable <code>itemnum</code> that I can use to ensure that <code>ls</code> prints the casts in the right order, even when the titles of the items are not helpful.</p>
<p>Greetings,
Joachim</p>
comment 19 (joeyh.name; published 2013-08-22T15:25:02Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_19_f76fc6835e5787b0156380bf09fd81ca/
I would expect user:pass@site.com to work if the site is using http basic auth. <code>importfeed</code> just runs <code>wget</code> (or <code>curl</code>) to do all downloads, and wget's documentation says that works. It also says you can use ~/.netrc to store the password for a site.
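For reference, HTTP basic auth with credentials embedded in the URL amounts to sending an <code>Authorization</code> header derived from the user and password. The Python sketch below shows that derivation; the helper name <code>basic_auth_header</code> is hypothetical, and this is an illustration of the protocol, not of what wget, curl, or git-annex do internally.

```python
import base64
from urllib.parse import urlsplit

def basic_auth_header(url):
    """Return the Basic auth header value implied by user:pass@ in `url`,
    or None when the URL carries no credentials."""
    parts = urlsplit(url)
    if parts.username is None:
        return None
    credentials = "%s:%s" % (parts.username, parts.password or "")
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return "Basic " + token

print(basic_auth_header("http://user:pass@site.com/rss.xml"))
```

If a site uses some other scheme (a login form, cookies, or a token in a query parameter), credentials in the URL will not help, which may explain a case where this "didn't work".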
comment 20 (joeyh.name; published 2013-08-22T15:29:11Z; updated 2013-11-27T22:47:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_20_65ebf3a3bbf0a2aebd2b69640b757e16/
<p>The git-annex man page has a bit more to say about --relaxed and --fast. Their behavior when used with <code>importfeed</code> is the same as with <code>addurl</code>.</p>
<p>If the podcast feed provides an <code>itemid</code>, you can use that in the filename template. I don't know how common that is. Due to the way <code>importfeed</code> works, it cannot keep track of eg, an incrementing item number itself.</p>
comment 21 (Sazius; published 2014-07-01T20:52:06Z; updated 2014-07-01T20:52:06Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_21_98a1dacc8d264ff31801e6c5c5f2612d/
For some podcast feeds I typically wish to view the description of the show before I decide to download it or not. Is there some way to perform that use case using git annex? I know <code>itemdescription</code> is something I can include in the template for the filename, but the descriptions can be really long... doesn't seem very elegant to have that in the file name. Could the description for example be included as metadata of the item?
metadata (joeyh.name; published 2014-07-03T18:25:32Z; updated 2014-07-03T18:25:32Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_22_00cc7a2fb936d7ea3d5d3764a1637663/
<p>Good idea, Sazius!</p>
<p>I've made importfeed store the metadata, as long as annex.genmetadata is set in .git/config.</p>
itempubdate (ewen; published 2015-01-03T22:01:37Z; updated 2015-01-03T22:01:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_23_62603cda8e581a2eb2cc799dffe8a740/
<p>Using a <code>--template='${feedtitle}/${itempubdate}-${itemtitle}${extension}'</code> with a libsyn RSS feed (eg, <a href="http://polyweekly.libsyn.com/rss">Poly Weekly</a>), I found that <code>itempubdate</code> was expanding to "none", even though there is a date with each entry in the RSS, eg,</p>
<pre><code><pubDate>Fri, 26 Dec 2014 15:25:38 +0000</pubDate>
</code></pre>
<p>Maybe the date string cannot be parsed? But it does look like a fairly typical datestamp to me. If the cause is the mixed-case in the tag, could <code>pubDate</code> be supported in addition to <code>pubdate</code>? (AFAICT <a href="http://validator.w3.org/feed/docs/rss2.html"><code>pubDate</code> is the standardised mix of lower/upper case</a>, but maybe not the most common, in which case supporting both <code>pubDate</code> and <code>pubdate</code> might help?) As seen with <code>git-annex version: 5.20141024~bpo70+1</code>, installed from Debian Backports; AFAICT it's still the latest release to make it to backports.</p>
<p>For now I'm just omitting "itempubdate" from my template.</p>
<p>Ewen</p>
pubDate (joey; published 2015-01-05T22:55:06Z; updated 2015-01-05T22:59:37Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_24_e75af243654d15bc7b917fcd888bcf2f/
<p>@ewen, I tested that feed and it is able to get the pubDate from it and
parses it ok.</p>
<p>Most likely, your version of git-annex is not built with a new enough
version of the haskell feed library. Version 0.3.9 or newer is needed to be
able to extract pubdates. For example, Debian stable doesn't have a new
enough version.</p>
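<p>To see that the quoted <code>pubDate</code> string is a well-formed RFC 822 date, and how a template of this shape expands, here is a small Python sketch. It only models the behaviour discussed in this thread: the <code>pubdate_field</code> helper, the <code>YYYY-MM-DD</code> output format, and the episode title are assumptions for illustration, and the fallback to "none" mirrors what was reported above rather than git-annex's code.</p>

```python
from email.utils import parsedate_to_datetime
from string import Template

def pubdate_field(raw):
    """Return YYYY-MM-DD for an RFC 822 date string, or "none" if unparsable."""
    try:
        return parsedate_to_datetime(raw).strftime("%Y-%m-%d")
    except (TypeError, ValueError):
        return "none"

# Same template syntax as in the comment above; string.Template happens to
# use the same ${name} placeholders.
template = Template("${feedtitle}/${itempubdate}-${itemtitle}${extension}")
name = template.substitute(
    feedtitle="Poly Weekly",
    itempubdate=pubdate_field("Fri, 26 Dec 2014 15:25:38 +0000"),
    itemtitle="Example Episode",  # made-up title
    extension=".mp3",
)
print(name)
```

When the date cannot be extracted at all, every file name gets the same "none" component, which is why dropping <code>itempubdate</code> from the template is a reasonable stopgap.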
Track GUIDs to avoid duplicate downloads (ewen; published 2017-03-21T08:59:59Z; updated 2017-03-21T08:59:59Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_25_2ee88c3375eca23fe34cab65df1e7aeb/
<p>While tracking podcast media URLs usually works to avoid duplicate downloads, when it fails it usually fails spectacularly. In particular if a podcast feed decides to update all the URLs (for old and new podcasts) to use a different URL scheme, then suddenly that looks like a huge volume of new URLs, and all of them get downloaded again -- even if the content has actually already been retrieved from a different URL (ie, older URL scheme). For instance the <code>acast.com</code> service has changed their URL scheme a couple of times in the last 1-2 years, rewriting all the historical URLs, so I have three copies of many of the episodes on podcasts on their service <img src="http://git-annex.branchable.com/smileys/sad.png" alt=":-(" /> (Many downloaded; some skipped once I caught the bulk download and stopped it/reran with <code>--fast</code> or <code>--relaxed</code> to make placeholders instead. <code>acast.com</code> seem to have managed to cause even more confusion by rewriting many of the older <code>mp3</code> files with new <code>id3</code> tags, thus changing the file size/hashes -- it definitely made cleaning up more complicated.)</p>
<p>Some (all?) podcast feeds also have a <code>guid</code> field, which specifies what should be a unique, unchanging per-episode identifier that other podcatchers use to track "seen this" content. In theory that <code>guid</code> value should be stable even across media URL changes -- at least if it isn't, then a podcaster changing the <code>guid</code> <em>and</em> media URL will almost certainly induce re-downloads in most podcatchers, and will thus hopefully notice the problem early on (eg, during testing) rather than in production.</p>
<p>Can <code>git-annex</code> be extended to track the <code>guid</code> values as well as the filenames, so <code>git annex importfeed</code> can avoid downloading episodes where it has already processed that <code>guid</code>, and instead just add the newly listed url as an alternate web URL for that specific episode (which has been my manual workaround)? Perhaps the episode <code>guid</code> could be stored as additional metadata, along with some sort of feed unique ID (link?), and then an index built/consulted when <code>importfeed</code> runs (although that "feed unique ID" would probably also have to be updatable by the user, to cope with "the feed URL has now changed from <code>http://</code> to <code>https://</code>", which also seems to be happening a bunch at present).</p>
<p>Ewen</p>
<p>PS: Apologies for duplicate partial comment; I think my browser decided some key combination meant "do default form action", which is post -- and I wasn't finished writing. I couldn't see a way to edit the comment, hence deleting/readding.</p>
comment 26 (joey; published 2017-03-21T17:28:27Z; updated 2017-03-21T17:46:28Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_26_a69b4c033d85406675bb70e6996590ce/
<p>@ewen importfeed already tracks guids, since 2015. Relevant commit is
<a href="http://source.git-annex.branchable.com/?p=source.git;a=commitdiff;h=f95a8c867223b2e17d036d0d3377bf0fc9d3adff">f95a8c867223b2e17d036d0d3377bf0fc9d3adff</a></p>
<p>You may well have an
older version of git-annex that didn't do that. But there are probably also
feeds that lack a useful guid, or that even make a change that changes the
guid of an existing item.</p>
<p>With <code>git annex metadata</code>, you can see the <code>itemid</code> which is where the guid
is stored.</p>
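<p>The guid-based behaviour discussed here (skip an episode whose guid has been seen, but remember its new URL) can be sketched as follows. This is a simplified Python model, not the code from the commit above; the <code>plan_downloads</code> name and the plain dictionary standing in for stored <code>itemid</code> metadata are assumptions.</p>

```python
def plan_downloads(items, seen):
    """Decide which feed items need downloading.

    items: iterable of (guid, url) pairs from the feed.
    seen:  dict mapping guid -> list of known URLs; updated in place.
    Returns the URLs whose guid was not known before.
    """
    to_download = []
    for guid, url in items:
        if guid in seen:
            # Known episode: do not download again, just note the new URL.
            if url not in seen[guid]:
                seen[guid].append(url)
        else:
            seen[guid] = [url]
            to_download.append(url)
    return to_download

seen = {"ep-1": ["http://old.example/ep1.mp3"]}
feed = [("ep-1", "https://new.example/ep1.mp3"),
        ("ep-2", "https://new.example/ep2.mp3")]
print(plan_downloads(feed, seen))
```

As noted above, this only helps when the feed supplies a stable guid; an item with a missing or rewritten guid still looks new.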
<p>PS, please post in <a href="http://git-annex.branchable.com/todo/">todo</a> when you have a request.</p>
Tracking GUIDs (ewen; published 2017-03-21T21:46:27Z; updated 2017-03-21T21:46:27Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_27_e343aeda7c16c834599fb3caab2a51a2/
<p>@joey - thanks, that's prompt feature request fulfilment <img src="http://git-annex.branchable.com/smileys/smile.png" alt=":-)" /></p>
<p>Looking more closely at the duplicates, it turns out that not <em>everything</em> got duplicated, just the "older" episodes. It turns out the newer episodes do have <code>guid</code> values saved (as <code>itemid</code> in the metadata) and the older episodes do not. I think this is most likely because I <em>was</em> running a fairly old git-annex until about October 2016, on a fairly old OS install, but then upgraded to a more recent one (now about 6 months old) which does track them. My assumption (without checking every file) is the episodes downloaded before October 2016 are ones that got duplicated.</p>
<p>I've edited the main page and added a note that GUIDs are tracked in versions since 2015, since I didn't obviously find that listed anywhere before.</p>
<p>Ewen</p>
howto importfeed youtube playlists (not entire channels) (yarikoptic; published 2022-09-01T19:26:20Z; updated 2022-09-01T19:26:20Z) http://git-annex.branchable.com/tips/downloading_podcasts/comment_28_f57a89a32a55dfae0dfa237a8981a667/
For example, the channel <a href="https://www.youtube.com/channel/UCzLPuKXYJxfwK6Vg7zuqjZQ">https://www.youtube.com/channel/UCzLPuKXYJxfwK6Vg7zuqjZQ</a> has <a href="https://www.youtube.com/channel/UCzLPuKXYJxfwK6Vg7zuqjZQ/playlists">two playlists</a>, one of them at <a href="https://www.youtube.com/playlist?list=PLIa3r7AIaTikMGfiZlIuArLujETxVfOKP">https://www.youtube.com/playlist?list=PLIa3r7AIaTikMGfiZlIuArLujETxVfOKP</a>. My question was "What would be the importfeed url for that one?". Googling led me to <a href="https://www.youtube.com/watch?v=WmbPhkW8PHQ">this youtube video</a>, and the fun part is that it is the same approach, but with <code>playlist_id</code> instead of <code>channel_id</code> in the <code>https://www.youtube.com/playlist?<query></code> URL. My immediate problem was that it fetched only the most recent videos... I will need to resort to the <a href="http://git-annex.branchable.com/forum/importfeed_does_not_work_with_youtube_anymore/#comment-e8e0862d6c52de0e9ce61403cdfb1189">manual treatment</a> or the like to feed in the rest.
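A small Python sketch of turning a playlist page URL into a feed URL keyed by <code>playlist_id</code>. Note the assumption: the feed endpoint used here, <code>https://www.youtube.com/feeds/videos.xml</code>, is the form commonly used for YouTube feeds and is not taken from this comment, so check it against the linked video before relying on it. The function name <code>playlist_feed_url</code> is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed feed endpoint (not stated in the comment above).
FEED_ENDPOINT = "https://www.youtube.com/feeds/videos.xml"

def playlist_feed_url(playlist_page_url):
    """Build a feed URL from a playlist page URL that has a list= parameter."""
    query = parse_qs(urlsplit(playlist_page_url).query)
    playlist_id = query["list"][0]  # raises KeyError if there is no list= parameter
    return FEED_ENDPOINT + "?" + urlencode({"playlist_id": playlist_id})

print(playlist_feed_url(
    "https://www.youtube.com/playlist?list=PLIa3r7AIaTikMGfiZlIuArLujETxVfOKP"
))
```

The resulting URL is what would be handed to <code>git annex importfeed</code>; as observed above, such a feed lists only the most recent videos, so older ones still need to be added by other means.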