"Malformed response from arXiv API - no data in feed" woes...

I have been having a hard time getting my queries to complete lately. They get stuck in near-infinite loops of messages like:

> Malformed response from arXiv API - no data in feed
> Malformed response from arXiv API - no data in feed
> Malformed response from arXiv API - no data in feed
> ...

The queries actually return far fewer than 50000 results, the supposed limit of arXiv's API: they return anywhere between 3000 and 12000 results. Here is an example:

`category='math.AG'; start_date='20170101'; end_date='20190827'; getpapers --api 'arxiv' --query "cat:$category AND lastUpdatedDate:[${start_date}* TO ${end_date}*] " --outdir "$category" -p -l debug`

In this issue (not strictly a 'bug') I document my attempts to get past those showstoppers. Here's what I did:

### Set page size to 1000
I experimented with page sizes from 200 to 2000: 

- At 200, it takes ages to get all 10000+ results, and you run a higher risk of entering the above-mentioned infinite loop of death because so many more queries are required to fetch them all.
- At 2000, you get many responses that contain far fewer than 2000 results - yet the feed is not completely empty, so this is currently not detected. See https://github.com/ContentMine/getpapers/issues/177 for a description of this bug and a solution.
- At 500, it still takes too long to get them all.
- At 1000, you get more results at once, you finish faster, you send fewer queries - and the risk of entering the infinite loop of death is no higher than with 500. Plus: you don't automatically get just 200 results back, as seems to be the case with 2000...

I thus settled for a page size of 1000 in _getpapers/lib/arxiv.js_:

`arxiv.pagesize = 1000`
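
To make the trade-off concrete, here is a minimal sketch of the arithmetic behind it. `requestsNeeded` is a hypothetical helper written for this illustration, not something that exists in getpapers, and it assumes every page comes back full:

```javascript
// Hypothetical helper, not part of getpapers: how many paged requests
// are needed to fetch `totalResults` entries at a given page size,
// assuming every response is a full page.
function requestsNeeded(totalResults, pageSize) {
  if (!Number.isInteger(pageSize) || pageSize <= 0) {
    throw new RangeError('pageSize must be a positive integer');
  }
  return Math.ceil(totalResults / pageSize);
}

// Compare the page sizes tried above for a 12000-result query.
for (const pageSize of [200, 500, 1000, 2000]) {
  console.log(`${pageSize} per page: ${requestsNeeded(12000, pageSize)} requests`);
}
```

For a 12000-result query this gives 60 requests at 200 per page but only 12 at 1000, and every request is another chance to hit an empty feed.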

### Set a higher delay between retries
I experimented with various delays too: the default 3 seconds hammers the server far too quickly, while 30 seconds means too much sleeping. 15 or 20 seconds seems to be O.K., so I have set

`arxiv.page_delay  = 20000`

in _getpapers/lib/arxiv.js_
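
A longer delay alone does not stop the loop from running forever, so it may also help to cap the number of retries. The following is only a sketch of that idea under my own assumptions, not the getpapers implementation: `fetchWithRetry`, `fetchPage`, `maxRetries` and `sleep` are all hypothetical names, and `fetchPage` is assumed to resolve to an array of entries that is empty when the feed has no data.

```javascript
// Hypothetical sketch, not the getpapers implementation.
// fetchPage: async function resolving to an array of entries
//            (assumed to be empty when the feed contains no data).
// sleep:     injectable so the delay can be stubbed out in tests.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(
  fetchPage,
  { delayMs = 20000, maxRetries = 5, sleep = defaultSleep } = {}
) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const entries = await fetchPage();
    if (entries.length > 0) {
      return entries; // got data, stop retrying
    }
    if (attempt < maxRetries) {
      await sleep(delayMs); // wait before asking the API again
    }
  }
  // Give up instead of looping forever on an empty feed.
  throw new Error(`No data in feed after ${maxRetries} retries`);
}
```

With `delayMs = 20000` and `maxRetries = 5`, a page that never returns data fails after roughly 100 seconds of waiting instead of looping indefinitely.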

### Do not urlencode the whole query URL, only the parts that need it
See https://github.com/ContentMine/getpapers/issues/178 for this.
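
As a rough illustration of the idea (this is not the patch from #178), the sketch below encodes only the query value and leaves the URL separators alone. The endpoint and the `search_query`, `start` and `max_results` parameters are those of the public arXiv API; `buildArxivUrl` is a hypothetical helper name of my own.

```javascript
// Sketch only: buildArxivUrl is a hypothetical helper, not a getpapers function.
const ARXIV_ENDPOINT = 'http://export.arxiv.org/api/query';

function buildArxivUrl(query, start, maxResults) {
  // Encode only the user-supplied query value, not the whole URL,
  // so that '?', '&' and '=' keep their meaning as separators.
  return `${ARXIV_ENDPOINT}?search_query=${encodeURIComponent(query)}` +
    `&start=${start}&max_results=${maxResults}`;
}

console.log(
  buildArxivUrl('cat:math.AG AND lastUpdatedDate:[20170101* TO 20190827*]', 0, 1000)
);
```

`encodeURIComponent` escapes the colon, spaces and square brackets inside the query but leaves `*` untouched, while the `?`, `&` and `=` that structure the URL are never passed through the encoder at all.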

### Correct bug where the results feed is not empty - but not full either...
See https://github.com/ContentMine/getpapers/issues/177 for details.
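
Without repeating that issue, the general idea of spotting a short page can be sketched as follows. Both helpers are hypothetical and not taken from getpapers or from the fix in #177; they assume the total number of results is known from the first response.

```javascript
// Hypothetical helpers, not taken from getpapers or from the fix in #177.

// How many entries a page starting at offset `start` should contain.
function expectedEntries(totalResults, start, pageSize) {
  return Math.max(0, Math.min(pageSize, totalResults - start));
}

// A page counts as complete only if it holds every entry it should;
// a non-empty but short page is treated as a failure to be retried.
function isPageComplete(entriesReceived, totalResults, start, pageSize) {
  return entriesReceived >= expectedEntries(totalResults, start, pageSize);
}

console.log(isPageComplete(200, 12000, 0, 2000)); // short page
console.log(isPageComplete(500, 3500, 3000, 1000)); // legitimately short last page
```

The first call reports an incomplete page (200 entries where 2000 were expected), while the second accepts a last page that is short only because the result set ends there.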

Last but not least...(I will [repeat myself](https://github.com/ContentMine/getpapers/issues/167#issuecomment-352275909) on this): do yourself a favour and **spoof your User Agent** in _getpapers/lib/config.js_:

`config.userAgent = 'Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0'`

With the above changes in place, things have been getting better for me - and I hope the same for you too! :-)
