Free GeoIP Database Format Change

Bug #1774502 reported by eMTee
28
This bug affects 3 people
Affects Status Importance Assigned to Milestone
DC++
Fix Released
High
Unassigned

Bug Description

"Updated versions of the GeoLite Legacy databases are now only available to redistribution license customers, although anyone can continue to download the March 2018 GeoLite Legacy builds. Starting January 2, 2019, the last build will be removed from our website. GeoLite Legacy database users will need to switch to the GeoLite2 or commercial GeoIP databases and update their integrations by January 2, 2019."
https://dev.maxmind.com/geoip/legacy/geolite/

Revision history for this message
poy (poy) wrote :

GeoLite 2 still has the info we use from GeoLite 1 databases, especially city-level geoloc, IPv6 support, region names...

unfortunately <https://github.com/maxmind/libmaxminddb> is licensed under Apache License 2 which is compatible with GPL 3 but not GPL 2 because of a missing patent clause in GPL 2. see:
- <https://www.apache.org/licenses/GPL-compatibility.html>
- <https://www.gnu.org/licenses/license-list.html#apache2>
- <https://opensource.stackexchange.com/a/1364>
- <https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=10658> (same in wireshark, which got around it by writing a separate tool...)

I wonder whether some exception phrasing could alleviate that as has been done for openssl & wtl; can't find any existing legal advice on that.
other solutions include writing our own mmdb parser as the format is documented <http://maxmind.github.io/MaxMind-DB/> (hairy), a separate tool (no way), moving to GPL 3 (hmm?).

once that is sorted out, I would like to measure the cache size before & after migration as we had some optimizations in the version 1 reader to reduce mem use (see geoip/patches/patch.patch <<https://sourceforge.net/p/dcplusplus/code/ci/default/tree/geoip/patches/patch.patch>).

Changed in dcplusplus:
importance: Medium → High
status: New → Confirmed
Revision history for this message
eMTee (realprogger) wrote :

Probably a side effect of this change that DC++ is always showing Australia as the country for every IP in the TransferView for some time.

Revision history for this message
Derek (transxcorp) wrote :

r now it doesnt show anything at all. Time for GeoLite 2?

Revision history for this message
eMTee (realprogger) wrote :

In theory there's even a patch available for this problem (not sure why it isn't included here) but so far as I know currently a license incompatibility issue prevents the addition of GeoLite 2 libraries to DC++.

Revision history for this message
Roy Klein (royklein) wrote :

how do I find out more about the patch?

Revision history for this message
Francisco Blas Izquierdo Riera (klondike) (klondike) wrote :

Hi eMtee,

I coded this little library over the weekend, hopefully it will help :D
https://github.com/klondi/mmdb

Revision history for this message
Sopor (sopor) wrote :

It seems that GeoIP is not free any more..

"Due to upcoming data privacy regulations, we are making significant changes to how you access free GeoLite2 databases starting December 30, 2019"

Revision history for this message
eMTee (realprogger) wrote :

https://www.db-ip.com/db/lite.php has direct downloads which apparently are in the same format as GeoIP. AirDC++ uses this parsed with the free GeoLite 2 MaxMind lib.
The data content of the free version is pretty sparse but it looks as though any other richer dataset from any provider is not free anymore.
The lib's licence is still incompatible with GPLv2

Revision history for this message
eMTee (realprogger) wrote :

<eMTee> I've checked klondike's geoip lib and as I precieve it does the lookups staight from the data file. That doesn't sound too efficient. Also I'm not sure the whole thing is thread-safe at all. So maybe it's Maxmind's lib is the only option...
<cologic> I agree with you (or, well, your conclusion -- whatever your reasoning was) that klondike's design, while it is as he says, a "This is a purposefully slow implementation for memory constrained environments", achieves that fseek()ing and similar around the file in a thread-unsafe way, to avoid mmap() and other access methods. That the mmdb_t data structure itself can only meaningfully be used by one thread at a time, so the choice ends up being multiply loading in the files -- which, actually, sort of works with his design, since it is so low-memory-usage -- or keep fewer mmdb_t objects around than threads, and mux access to them somehow (say, mutexes: "TODO: mutex handling").
<cologic> It's a reasonable tradeoff for a different environment than DC++ lives in these days.
<cologic> He jumps through endless hoops just not to keep anything he doesn't need in memory, IMO to the code/design's detriment in this context.
<cologic> The whole dbip-country-lite-2020-05.mmdb is about 5MB, which seems completely acceptable to mmap() or just load completely into memory and not deal with all those seeks, mutex questions, etc.
<cologic> Just to take the most obvious option (the same overall approach, just, all in memory). Other options exist too.
<cologic> The completely unoptimized representation (text-based, no fancy reused DAG-like substructures, etc) from their CSV file is still only 17MB. So not saying this is a great option, but just a dumb CSV parser on that could work too.
<cologic> line by line of 2.17.115.0,2.17.115.255,GB -- trivial.
<cologic> The other concern I have with the style of klondike's code is that it's full of basically untrusted-data-driven pointer-chasing (via seeks at the moment, but still) which as klondike acknowledges can result in exponential blowup.
<cologic> It's not his fault -- he did what one could with the format -- but from working on other code dealing with a conceptually similar file format, I'm not a fan of it.
<cologic> There's something to be said for the relative simplicit/verifiability/fewer-weird-failure-modes of a constrained CSV parser which builds an in-memory representation with some reasonable std::foo data structure.
<cologic> But I suspect that if one fuzzed most of these geoip format parsers, the results would be dire

Revision history for this message
Francisco Blas Izquierdo Riera (klondike) (klondike) wrote :

I do agree with cologic's rationale on this :)

I did avoid mmap for the following reasons:
* My objective was a C implementation that was easy to use everywhere. mmap is UNIX specific, Windows for example uses a different API https://docs.microsoft.com/en-us/windows/win32/memory/file-mapping libraries like https://github.com/Shinmera/mmap exist but I was trying to keep things simple.
* Databases may be up to 4GB, fully memory mapping such a DB may be a problem on 32-bit systems.
* Using fseek things are a bit less likely to go awry.
* With OS caches in place the performance for a few ks of lookups should remain reasonable.

Sadly there isn't an atomic seek+read call that is portable to avoid the mutex issue :(

Also even if I read the whole DB in memory (via mmap or fread) the problem with database files being updated would remain. That isn't easy to solve in any case though.

And yes, the flexibility of the format sucks because I can't just create a fixed structure and be done with it...

I don't mind adapting the library to your needs if you don't mind waiting a bit. Just tell me what you need it to do. Otherwise you are welcome to fork and reuse any code in there to suit your (specific) needs :)

If you need me to drop by dcbase to discuss it also say so and I'll try to join :)

Revision history for this message
eMTee (realprogger) wrote :

Using https://mailfud.org/geoip-legacy/ databases with the old geoip-c-api library for the moment. To be reopened wheb this source stops providing db's with the old format.

Changed in dcplusplus:
status: Confirmed → Fix Committed
Revision history for this message
eMTee (realprogger) wrote :

Fixed in DC++ 0.880

Changed in dcplusplus:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.