Request: Query range splitter for GCA caches

Discussion about the Geocaching Australia web site
MavEtJu
Posts: 486
Joined: 07 January 15 9:15 pm
Twitter: @mavetju
Location: Caringbah

Request: Query range splitter for GCA caches

Post by MavEtJu » 12 March 17 9:05 pm

At http://geocaching.com.au/stats/range/ there is a Pocket Query splitter for the caches on the Groundspeak listing service (although it doesn't explicitly mention this).

For the GCA caches there isn't such a splitter, because historically it was possible to download everything as one giant GPX file. However, the new GCA API only allows queries of up to 500 caches, so such a splitter will become necessary.

Could this please be added to the toolset?

Edwin

Chwiliwr
10000 or more caches found
Posts: 900
Joined: 10 April 05 10:39 pm
Location: Leeming Western Australia

Re: Request: Query range splitter for GCA caches

Post by Chwiliwr » 12 March 17 9:58 pm

MavEtJu wrote:At http://geocaching.com.au/stats/range/ there is a Pocket Query splitter for the caches on the Groundspeak listing service (although it doesn't explicitly mention this).

For the GCA caches there isn't such a splitter, because historically it was possible to download everything as one giant GPX file. However, the new GCA API only allows queries of up to 500 caches, so such a splitter will become necessary.

Could this please be added to the toolset?

Edwin
It says "GC Caches by Hidden Date Range" at the top of the page so you are not correct when you imply it does not mention it explicity.

I do not understand exactly why somebody using the API would want to get more than 500 GCA caches at a time, requiring a page to tell them how to split them up. Wouldn't it be better to get them via the existing downloads?

MavEtJu
Posts: 486
Joined: 07 January 15 9:15 pm
Twitter: @mavetju
Location: Caringbah

Re: Request: Query range splitter for GCA caches

Post by MavEtJu » 13 March 17 2:48 pm

Chwiliwr wrote:It says "GC Caches by Hidden Date Range" at the top of the page, so you are not correct when you imply it does not mention this explicitly.

Aha, I didn't spot the "GC" at the beginning; I most likely read "GC Caches" as "Geocaches". The page I came from was http://geocaching.com.au/stats/, which only mentions "Statistics - Hidden Ranges" (without the " - "), and as such I was under the impression that it was also usable for the queries on the GCA website.

> I do not understand exactly why somebody using the API would want to get more than 500 GCA caches at a time, requiring a page to tell them how to split them up. Wouldn't it be better to get them via the existing downloads?
I'm not sure what you mean by the "existing downloads". If you refer to the download of all caches in a single state as a giant GPX file, then yes, it would be great if this could be done. If you refer to the download of an arbitrarily large number of caches collected via a query using whatever criteria you are interested in, then yes, it would be great if this could be done.

However, the GCA API doesn't allow this arbitrarily large number; it only allows up to 500 caches per query.

So you can either download an arbitrary number of caches in GPX format (which means that you miss interesting GCA-specific data) or you can download up to 500 caches in the GCA JSON format.

Edwin

Chwiliwr
10000 or more caches found
Posts: 900
Joined: 10 April 05 10:39 pm
Location: Leeming Western Australia

Re: Request: Query range splitter for GCA caches

Post by Chwiliwr » 13 March 17 8:11 pm

Perhaps you should explain better why you are trying to get large chunks of data through the API. I thought no external program was supposed to be trying to recreate the GCA database through the API, which on the surface seems to be the only reason to try for more than 500 records.

MavEtJu
Posts: 486
Joined: 07 January 15 9:15 pm
Twitter: @mavetju
Location: Caringbah

Re: Request: Query range splitter for GCA caches

Post by MavEtJu » 13 March 17 8:58 pm

Chwiliwr wrote:Perhaps you should explain better why you are trying to get large chunks of data through the API. I thought no external program was supposed to be trying to recreate the GCA database through the API, which on the surface seems to be the only reason to try for more than 500 records.
The GPX file you can download, which is sanctioned via examples in the wiki and posts in the forum, already does this: it gives you all the caches in the database. All caches, all logs. People are already using applications (GSAK, Geosphere) which do this.

The GCA API doesn't allow you to download it all in one go, only in chunks of 500 caches. So if you want these applications and newer ones to use the GCA API for a higher level of interaction with the GCA listing service, you have to make sure that these applications can have at least the same level of functionality with the GCA API as they currently have with the GPX file. You can do that either by allowing the GCA API to download all the caches for a state in one go, like you can with the GPX file, or by making it possible to download them in 500-cache chunks. If you don't do this, the authors of these applications will not use the GCA API, because it only makes their lives more frustrating.

So there are two things:

1. It is impossible to download a query for 1 September 2009 because that returns 896 records.
2. It is very hard to find out which queries to make to download all caches in a state.
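
To illustrate what such a splitter would do, here is a minimal sketch in Python. The count_caches(start, end) helper is hypothetical and stands in for whatever GCA query would report how many caches were hidden in a given date range; it is not an existing GCA function.

from datetime import date, timedelta

LIMIT = 500  # the per-query cap discussed in this thread

def split_ranges(start, end, count_caches):
    """Recursively split [start, end] into date ranges of at most LIMIT caches."""
    total = count_caches(start, end)
    if total <= LIMIT or start == end:
        # A single day can still exceed LIMIT (e.g. 1 September 2009 with its
        # 896 records); the caller has to handle that case separately.
        return [(start, end, total)]
    mid = start + timedelta(days=(end - start).days // 2)
    return (split_ranges(start, mid, count_caches) +
            split_ranges(mid + timedelta(days=1), end, count_caches))

# Example: ranges = split_ranges(date(2004, 1, 1), date.today(), count_caches)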

Edwin

caughtatwork
Posts: 17013
Joined: 17 May 04 12:11 pm
Location: Melbourne

Re: Request: Query range splitter for GCA caches

Post by caughtatwork » 14 March 17 10:24 am

MavEtJu wrote:For the GCA caches there isn't such a splitter, because historically it was possible to download everything as one giant GPX file. However, the new GCA API only allows queries of up to 500 caches, so such a splitter will become necessary.
Could this please be added to the toolset?
The answer is yes, it can. I can add it to the list. I cannot yet say when it would be available, as there are quite a few things on my list which will take precedence, especially as there are other ways to achieve your goal.
http://wiki.geocaching.com.au/wiki/Geoc ... pment_List

This conversation has devolved into questioning the reasoning behind the API as well as the initial query, so for those who are not interested in the whys and wherefores of the functioning of the API, you can turn off now. For anyone else interested, here is some history and information as to why we have some restrictions on what can be offered. Restrictions are rarely put in place for fun. They are generally carefully considered, taking into account the cost of the server, the bandwidth costs, the size of the server, how many people are accessing the data at the same time, how long people will wait, and what is the least pain point people will accept, versus seeking some tens of thousands of dollars a year to fund an entirely different server architecture. These points are all considered when we do anything at Geocaching Australia, and while they will not always make everyone happy, they have to be put in place to keep the technology working comfortably.

Firstly, let's avoid the emotion.

We use 500 caches as the limit for a query and a listed set of geocaches due to bandwidth limitations and to mimic the OKAPI and, to some extent, the GC LIVE API methods. We do not have unlimited bandwidth, but if someone would like to provide the funding we can move towards that model. It will be some thousands of dollars a year.

The MyQuery function, where you can get the data in compressed form (via the GPX in a ZIP file), is different to the API, which returns uncompressed JSON. The full list of geocaches plus a reasonable number of logs is around 100MB in uncompressed JSON form, due to the enriched data that the API provides over the basic GPX file. The same request in GPX is 36MB. The same in ZIP exhausts the memory allocation. The capability of one function (the GPX via compressed ZIP) is not the same as that of the API (uncompressed JSON), and as per the service description, the API is not meant to provide a full service to download every geocache in the database. The MyQuery function is about creating a query of the database that fits a certain purpose and use. By definition this would return fewer geocaches than the entire database, so using MyQuery as a de facto download engine was not its purpose, but we don't explicitly block that unless you run the machine out of memory. Please accept the statement that they are different, as they are never likely to be the same while we have developer and infrastructure (including bandwidth) limitations.

Using an API to gain access to data is not transparent. It's very much hidden from the user, as they are unlikely to equate the function of what they are doing with the technical limitations that are enforced. The user could select "Get All" and get the 100MB download without realising they are downloading 16,000 geocaches, of which 15,950 are too far away from them to be of any use in a mobile app. "Get next closest 500" would be much more meaningful and would also indicate that there is a large download chunk on its way, which they can then choose not to proceed with if they decide it would be too expensive on their data plan. "Get all" at 100MB will, I suspect, cause a non-trivial level of grief for some mobile users. If a few dozen users "Get all" each day, then mid-month we run out of bandwidth and the site stops. Downloading a GPX file via ZIP (at the moment anyway) requires a little more of a conscious decision (in my interpretation of how you might download that file). If you try to get every geocache that way, you get a file around 1/3rd of the size, but it needs to be parsed via a GPX parser, and if you try the ZIP approach you can't, as it runs the function out of memory. We have limits as to what can actually be accomplished.

If getting all geocaches in the database is required then the API can still assist you; it's just a case of the way that you put the data together and how you go about making the request. I would recommend you have a look at all the methods of the API, as there is a way this can be done, but not by query.

How can you actually achieve a full download of every geocache in the database?

http://geocaching.com.au/api/services/s ... arest.html
Provides a list of geocaches, sorted by distance, limited to 500 at a time. That function also has an offset parameter which allows you to paginate results. e.g.
http://geocaching.com.au/api/services/s ... fset=16500
That should return {"results":[]} as there are no geocaches above 16,500. We don't have that many.

You can run a nearest search for 500, then another 500 offset by 500, another 500 offset by 1000, and so on. This only returns a list of geocaches, so you would then need to string that list together and call http://geocaching.com.au/api/services/c ... aches.html. This will return the data for those 500 geocaches each time.
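
As a rough illustration of that loop, here is a sketch in Python. The endpoint paths, parameter names and lack of authentication shown are assumptions pieced together from the truncated links above, not confirmed details of the GCA API.

import requests

BASE = "http://geocaching.com.au/api/services"  # assumed base path
PAGE_SIZE = 500  # the per-request limit described above

def fetch_all_cache_codes():
    """Page through the nearest search, 500 results at a time, until empty."""
    codes = []
    offset = 0
    while True:
        # "search/nearest" and the parameter names here are assumptions.
        resp = requests.get(f"{BASE}/search/nearest",
                            params={"limit": PAGE_SIZE, "offset": offset})
        resp.raise_for_status()
        results = resp.json().get("results", [])
        if not results:  # e.g. {"results":[]} once the offset passes the last geocache
            break
        codes.extend(results)
        offset += PAGE_SIZE
    return codes

def fetch_cache_details(codes):
    """Fetch the full data for the collected geocaches in 500-cache chunks."""
    for i in range(0, len(codes), PAGE_SIZE):
        chunk = codes[i:i + PAGE_SIZE]
        # The "caches" endpoint path and its parameter name are assumptions.
        resp = requests.get(f"{BASE}/caches",
                            params={"cache_codes": "|".join(chunk)})
        resp.raise_for_status()
        yield resp.json()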

This is similar to the API design of the GC LIVE API and is the same as the OKAPI; i.e. neither will allow you to fire a query that says "give me every geocache and log you have". They all have a speed bump where they require you to make small, frequent requests, which allows the server to breathe between requests and service other client requests. I recall that GC has a limit on the number of geocaches you can pull down in a day. I think (not guaranteed to be correct) it is <10K for full data and a few tens of thousands for "loc" style data (not full data). At Geocaching Australia, even though we have the same sort of speed bumps, and therefore any developer using the GC LIVE API or OKAPI or GCA API would need to be familiar with pagination, we don't restrict how far you can travel. This makes our API more open in terms of the overall volume of data you can retrieve. As the GC LIVE API and the OKAPI have the same sort of speed bumps, and all app developers have to live with those restrictions, I cannot immediately see how the GCA API having the same restriction would be any more frustrating. It's a simple fact of almost every API I can immediately think of that you are never going to get everything from the database in one hit; you will need to paginate or make multiple queries.

So the question has been answered above, the information about why the speed bumps are in place has also been given, and details have been provided as to how you can use the API to achieve your goals, albeit with some more work on your end to save the infrastructure and bandwidth at our end.

Remember we are free and open to the point where the server is impacted and then we are somewhat less open as we need to balance the needs of the many against the needs of the few or the one.
